[jira] [Commented] (SPARK-16913) [SQL] Better codegen where querying nested struct

2016-08-06 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410840#comment-15410840
 ] 

Kazuaki Ishizaki commented on SPARK-16913:
--

It seems to copy each element in the struct. Since {{InternalRow}} does not 
embed a nested structure, the generated code keeps two scalar variables for 
each field, {{isNull}} and {{value}} in this case. If we could provide a better 
schema property (i.e. {{nullable = false}} for {{a}} and {{b}}), lines 44-62 
would be simpler.
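
A minimal sketch (Scala, spark-shell) of that idea, assuming the struct fields are 
known to be non-null: read the same parquet with an explicit schema that declares 
{{a}} and {{b}} as {{nullable = false}}, so the analyzer can propagate that property 
to the extracted columns. Whether codegen can then drop the per-field null checks 
still depends on how the source reports nullability.
{code}
import org.apache.spark.sql.types._

// Same layout as the file written in the issue description, but non-nullable.
val schema = StructType(Seq(
  StructField("c", StructType(Seq(
    StructField("a", LongType, nullable = false),
    StructField("b", LongType, nullable = false)
  )), nullable = false)
))

spark.read.schema(schema).parquet("/mnt/mfs/codegen_test").selectExpr("c.*")
{code}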

> [SQL] Better codegen where querying nested struct
> -
>
> Key: SPARK-16913
> URL: https://issues.apache.org/jira/browse/SPARK-16913
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej BryƄski
>
> I have a parquet file created as a result of:
> {code}
> spark.range(100).selectExpr("id as a", "id as b").selectExpr("struct(a, b) as 
> c").write.parquet("/mnt/mfs/codegen_test")
> {code}
> Then I'm querying the whole nested structure with:
> {code}
> spark.read.parquet("/mnt/mfs/codegen_test").selectExpr("c.*")
> {code}
> As a result of Spark whole-stage codegen I'm getting the following code.
> Is it possible to remove the part starting at line 044 and just return the whole 
> result of getStruct? (maybe just copied)
> {code}
> Generated code:
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ final class GeneratedIterator extends 
> org.apache.spark.sql.execution.BufferedRowIterator {
> /* 006 */   private Object[] references;
> /* 007 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> scan_numOutputRows;
> /* 008 */   private scala.collection.Iterator scan_input;
> /* 009 */   private UnsafeRow scan_result;
> /* 010 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder scan_holder;
> /* 011 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> scan_rowWriter;
> /* 012 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> scan_rowWriter1;
> /* 013 */   private UnsafeRow project_result;
> /* 014 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder;
> /* 015 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> project_rowWriter;
> /* 016 */
> /* 017 */   public GeneratedIterator(Object[] references) {
> /* 018 */ this.references = references;
> /* 019 */   }
> /* 020 */
> /* 021 */   public void init(int index, scala.collection.Iterator inputs[]) {
> /* 022 */ partitionIndex = index;
> /* 023 */ this.scan_numOutputRows = 
> (org.apache.spark.sql.execution.metric.SQLMetric) references[0];
> /* 024 */ scan_input = inputs[0];
> /* 025 */ scan_result = new UnsafeRow(1);
> /* 026 */ this.scan_holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(scan_result, 
> 32);
> /* 027 */ this.scan_rowWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(scan_holder,
>  1);
> /* 028 */ this.scan_rowWriter1 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(scan_holder,
>  2);
> /* 029 */ project_result = new UnsafeRow(2);
> /* 030 */ this.project_holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result,
>  0);
> /* 031 */ this.project_rowWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder,
>  2);
> /* 032 */   }
> /* 033 */
> /* 034 */   protected void processNext() throws java.io.IOException {
> /* 035 */ while (scan_input.hasNext()) {
> /* 036 */   InternalRow scan_row = (InternalRow) scan_input.next();
> /* 037 */   scan_numOutputRows.add(1);
> /* 038 */   boolean scan_isNull = scan_row.isNullAt(0);
> /* 039 */   InternalRow scan_value = scan_isNull ? null : 
> (scan_row.getStruct(0, 2));
> /* 040 */
> /* 041 */   boolean project_isNull = scan_isNull;
> /* 042 */   long project_value = -1L;
> /* 043 */
> /* 044 */   if (!scan_isNull) {
> /* 045 */ if (scan_value.isNullAt(0)) {
> /* 046 */   project_isNull = true;
> /* 047 */ } else {
> /* 048 */   project_value = scan_value.getLong(0);
> /* 049 */ }
> /* 050 */
> /* 051 */   }
> /* 052 */   boolean project_isNull2 = scan_isNull;
> /* 053 */   long project_value2 = -1L;
> /* 054 */
> /* 055 */   if (!scan_isNull) {
> /* 056 */ if (scan_value.isNullAt(1)) {
> /* 057 */   project_isNull2 = true;
> /* 058 */ } else {
> /* 059 */   project_value2 = scan_value.getLong(1);
> /* 060 */ }
> /* 061 */
> /* 062 */   }
> /* 063 */   project_rowWriter.zeroOutN

[jira] [Commented] (SPARK-8904) When using LDA DAGScheduler throws exception

2016-08-06 Thread Nabarun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410835#comment-15410835
 ] 

Nabarun commented on SPARK-8904:


This seems to be related to something I am seeing at my end too. I 
converted my countVectors into a DF:

val ldaDF = countVectors.map { case Row(id: Long, countVector: Vector) => (id, 
countVector) } 

When I try to display it, it throws the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 3148.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3148.0 
(TID 11632, 10.209.235.85): scala.MatchError: 
[0,(1671,[1,2,3,5,8,10,11,12,14,15,17,18,20,21,23,27,28,29,30,31,32,36,37,38,39,41,42,43,45,46,51,52,54,66,69,71,74,75,78,80,82,83,85,88,89,90,92,96,97,98,99,102,104,106,107,108,109,111,112,115,118,121,123,124,126,134,138,139,143,144,145,148,150,151,152,153,155,161,166,171,172,173,174,176,178,179,180,181,189,190,197,199,200,201,207,209,212,216,217,218,220,222,223,224,226,227,228,232,234,238,240,244,246,250,252,254,255,260,261,262,264,268,269,270,277,280,281,282,286,292,294,295,296,297,301,310,312,314,316,318,323,324,325,337,341,343,346,347,351,355,359,366,367,379,380,381,388,390,391,398,403,405,411,417,442,444,448,456,460,464,466,468,470,477,480,484,487,490,491,495,496,501,502,507,509,512,522,523,527,529,531,533,534,535,552,554,556,557,565,566,567,569,574,575,585,624,630,632,633,638,644,646,652,653,658,668,669,670,680,683,686,690,693,696,698,704,705,712,723,726,736,746,747,750,757,758,761,765,773,774,775,783,786,796,797,801,807,811,815,825,830,833,843,844,845,847,849,859,861,862,864,867,871,872,876,879,882,892,895,896,897,912,923,924,935,937,941,944,945,948,949,952,968,982,989,1000,1003,1015,1018,1021,1025,1029,1034,1036,1038,1041,1048,1072,1082,1086,1092,1106,,1114,1117,1123,1128,1133,1135,1145,1149,1154,1168,1169,1171,1178,1180,1181,1183,1184,1201,1224,1234,1240,1250,1260,1261,1267,1269,1270,1280,1305,1309,1317,1333,1354,1355,1358,1378,1379,1386,1389,1393,1411,1413,1426,1428,1475,1480,1504,1506,1521,1525,1530,1532,1545,1555,1601,1614,1635,1643,1649,1653,1668],[1.0,5.0,4.0,3.0,2.0,14.0,30.0,2.0,72.0,9.0,6.0,6.0,1.0,13.0,1.0,4.0,1.0,3.0,2.0,10.0,2.0,4.0,74.0,3.0,11.0,1.0,35.0,1.0,16.0,1.0,2.0,15.0,3.0,4.0,17.0,2.0,8.0,60.0,35.0,3.0,1.0,33.0,2.0,2.0,3.0,11.0,16.0,2.0,8.0,2.0,3.0,48.0,1.0,1.0,4.0,8.0,4.0,3.0,4.0,4.0,1.0,3.0,1.0,11.0,1.0,2.0,3.0,1.0,35.0,6.0,2.0,1.0,2.0,3.0,3.0,4.0,2.0,2.0,1.0,1.0,20.0,9.0,6.0,17.0,10.0,8.0,1.0,12.0,1.0,3.0,3.0,2.0,9.0,1.0,2.0,19.0,1.0,2.0,1.0,1.0,2.0,9.0,1.0,1.0,1.0,5.0,1.0,2.0,5.0,1.0,1.0,1.0,1.0,1.0,7.0,1.0,14.0,2.0,2.0,1.0,5.0,2.0,5.0,5.0,20.0,2.0,27.0,3.0,4.0,11.0,1.0,3.0,3.0,1.0,2.0,2.0,7.0,5.0,2.0,2.0,1.0,3.0,1.0,2.0,1.0,2.0,8.0,5.0,1.0,5.0,3.0,1.0,4.0,3.0,3.0,4.0,1.0,3.0,4.0,1.0,2.0,3.0,5.0,7.0,1.0,8.0,1.0,2.0,4.0,2.0,1.0,12.0,5.0,1.0,6.0,4.0,2.0,2.0,1.0,1.0,3.0,4.0,1.0,1.0,2.0,4.0,3.0,1.0,2.0,6.0,1.0,1.0,1.0,4.0,2.0,1.0,7.0,12.0,1.0,12.0,1.0,1.0,9.0,2.0,1.0,2.0,1.0,1.0,6.0,6.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,3.0,1.0,1.0,2.0,1.0,3.0,1.0,4.0,1.0,5.0,2.0,1.0,2.0,2.0,3.0,1.0,2.0,1.0,1.0,2.0,3.0,1.0,4.0,3.0,1.0,3.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,2.0,2.0,1.0,1.0,2.0,3.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,5.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])]
 (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at 
line1907dd16af5d4fbfa217a9d52f096b36316.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:142)
at 
line1907dd16af5d4fbfa217a9d52f096b36316.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:142)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:790)
at 
org.apache.spark.rdd.RDD$$anon

[jira] [Commented] (SPARK-16936) Case Sensitivity Support for Refresh Temp Table

2016-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410833#comment-15410833
 ] 

Apache Spark commented on SPARK-16936:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14523

> Case Sensitivity Support for Refresh Temp Table
> ---
>
> Key: SPARK-16936
> URL: https://issues.apache.org/jira/browse/SPARK-16936
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, the `refreshTable` API is always case sensitive.
> When users use the view name without an exact case match, the API silently 
> ignores the call. Users might expect that the command has completed 
> successfully. However, when they run subsequent SQL commands, they might 
> still get an exception, like:
> {noformat}
> Job aborted due to stage failure: 
> Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in 
> stage 4.0 (TID 7, localhost): 
> java.io.FileNotFoundException: 
> File 
> file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-bd4b9ea6-9aec-49c5-8f05-01cff426211e/part-r-0-0c84b915-c032-4f2e-abf5-1d48fdbddf38.snappy.parquet
>  does not exist
> {noformat}






[jira] [Assigned] (SPARK-16936) Case Sensitivity Support for Refresh Temp Table

2016-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16936:


Assignee: (was: Apache Spark)

> Case Sensitivity Support for Refresh Temp Table
> ---
>
> Key: SPARK-16936
> URL: https://issues.apache.org/jira/browse/SPARK-16936
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, the `refreshTable` API is always case sensitive.
> When users use the view name without an exact case match, the API silently 
> ignores the call. Users might expect that the command has completed 
> successfully. However, when they run subsequent SQL commands, they might 
> still get an exception, like:
> {noformat}
> Job aborted due to stage failure: 
> Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in 
> stage 4.0 (TID 7, localhost): 
> java.io.FileNotFoundException: 
> File 
> file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-bd4b9ea6-9aec-49c5-8f05-01cff426211e/part-r-0-0c84b915-c032-4f2e-abf5-1d48fdbddf38.snappy.parquet
>  does not exist
> {noformat}






[jira] [Assigned] (SPARK-16936) Case Sensitivity Support for Refresh Temp Table

2016-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16936:


Assignee: Apache Spark

> Case Sensitivity Support for Refresh Temp Table
> ---
>
> Key: SPARK-16936
> URL: https://issues.apache.org/jira/browse/SPARK-16936
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Currently, the `refreshTable` API is always case sensitive.
> When users use the view name without an exact case match, the API silently 
> ignores the call. Users might expect that the command has completed 
> successfully. However, when they run subsequent SQL commands, they might 
> still get an exception, like:
> {noformat}
> Job aborted due to stage failure: 
> Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in 
> stage 4.0 (TID 7, localhost): 
> java.io.FileNotFoundException: 
> File 
> file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-bd4b9ea6-9aec-49c5-8f05-01cff426211e/part-r-0-0c84b915-c032-4f2e-abf5-1d48fdbddf38.snappy.parquet
>  does not exist
> {noformat}






[jira] [Created] (SPARK-16936) Case Sensitivity Support for Refresh Temp Table

2016-08-06 Thread Xiao Li (JIRA)
Xiao Li created SPARK-16936:
---

 Summary: Case Sensitivity Support for Refresh Temp Table
 Key: SPARK-16936
 URL: https://issues.apache.org/jira/browse/SPARK-16936
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


Currently, the `refreshTable` API is always case sensitive.

When users use the view name without an exact case match, the API silently 
ignores the call. Users might expect that the command has completed 
successfully. However, when they run subsequent SQL commands, they might 
still get an exception, like:
{noformat}
Job aborted due to stage failure: 
Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 
4.0 (TID 7, localhost): 
java.io.FileNotFoundException: 
File 
file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-bd4b9ea6-9aec-49c5-8f05-01cff426211e/part-r-0-0c84b915-c032-4f2e-abf5-1d48fdbddf38.snappy.parquet
 does not exist
{noformat}
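
A minimal sketch (Scala, spark-shell) of the reported behaviour; the view name and 
path below are hypothetical. Because the lookup is case sensitive, the second call 
matches nothing and is silently ignored, so stale cached file metadata can later 
surface as a FileNotFoundException like the one above.
{code}
spark.range(10).write.mode("overwrite").parquet("/tmp/refresh_case_test")
spark.read.parquet("/tmp/refresh_case_test").createOrReplaceTempView("testView")

spark.catalog.refreshTable("testView")   // exact case: refreshes cached metadata
spark.catalog.refreshTable("testview")   // different case: silently ignored today
{code}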







[jira] [Resolved] (SPARK-16925) Spark tasks which cause JVM to exit with a zero exit code may cause app to hang in Standalone mode

2016-08-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-16925.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1
   1.6.3

> Spark tasks which cause JVM to exit with a zero exit code may cause app to 
> hang in Standalone mode
> --
>
> Key: SPARK-16925
> URL: https://issues.apache.org/jira/browse/SPARK-16925
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> If you have a Spark standalone cluster which runs a single application and 
> you have a Spark task which repeatedly fails by causing the executor JVM to 
> exit with a _zero_ exit code then this may temporarily freeze / hang the 
> Spark application.
> For example, running
> {code}
> sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
> {code}
> on a cluster will cause all executors to die but those executors won't be 
> replaced unless another Spark application or worker joins or leaves the 
> cluster. This is caused by a bug in the standalone Master where 
> {{schedule()}} is only called on executor exit when the exit code is 
> non-zero, whereas I think that we should always call {{schedule()}} even on a 
> "clean" executor shutdown since {{schedule()}} should always be safe to call.






[jira] [Commented] (SPARK-16925) Spark tasks which cause JVM to exit with a zero exit code may cause app to hang in Standalone mode

2016-08-06 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410803#comment-15410803
 ] 

Josh Rosen commented on SPARK-16925:


Fixed by my patch.

> Spark tasks which cause JVM to exit with a zero exit code may cause app to 
> hang in Standalone mode
> --
>
> Key: SPARK-16925
> URL: https://issues.apache.org/jira/browse/SPARK-16925
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> If you have a Spark standalone cluster which runs a single application and 
> you have a Spark task which repeatedly fails by causing the executor JVM to 
> exit with a _zero_ exit code then this may temporarily freeze / hang the 
> Spark application.
> For example, running
> {code}
> sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
> {code}
> on a cluster will cause all executors to die but those executors won't be 
> replaced unless another Spark application or worker joins or leaves the 
> cluster. This is caused by a bug in the standalone Master where 
> {{schedule()}} is only called on executor exit when the exit code is 
> non-zero, whereas I think that we should always call {{schedule()}} even on a 
> "clean" executor shutdown since {{schedule()}} should always be safe to call.






[jira] [Commented] (SPARK-16508) Fix documentation warnings found by R CMD check

2016-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410791#comment-15410791
 ] 

Apache Spark commented on SPARK-16508:
--

User 'junyangq' has created a pull request for this issue:
https://github.com/apache/spark/pull/14522

> Fix documentation warnings found by R CMD check
> ---
>
> Key: SPARK-16508
> URL: https://issues.apache.org/jira/browse/SPARK-16508
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> A full list of warnings after the fixes in SPARK-16507 is at 
> https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb 






[jira] [Assigned] (SPARK-16508) Fix documentation warnings found by R CMD check

2016-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16508:


Assignee: (was: Apache Spark)

> Fix documentation warnings found by R CMD check
> ---
>
> Key: SPARK-16508
> URL: https://issues.apache.org/jira/browse/SPARK-16508
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> A full list of warnings after the fixes in SPARK-16507 is at 
> https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb 






[jira] [Assigned] (SPARK-16508) Fix documentation warnings found by R CMD check

2016-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16508:


Assignee: Apache Spark

> Fix documentation warnings found by R CMD check
> ---
>
> Key: SPARK-16508
> URL: https://issues.apache.org/jira/browse/SPARK-16508
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Apache Spark
>
> A full list of warnings after the fixes in SPARK-16507 is at 
> https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb 






[jira] [Assigned] (SPARK-16935) Verification of Function-related ExternalCatalog APIs

2016-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16935:


Assignee: (was: Apache Spark)

> Verification of Function-related ExternalCatalog APIs
> -
>
> Key: SPARK-16935
> URL: https://issues.apache.org/jira/browse/SPARK-16935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Function-related `HiveExternalCatalog` APIs do not have enough verification 
> logic. After this PR, `HiveExternalCatalog` and `InMemoryCatalog` become 
> consistent in their error handling. 
> For example, below is the exception we got when calling `renameFunction`: 
> {noformat}
> 15:13:40.369 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get 
> database db1, returning NoSuchObjectException
> 15:13:40.377 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get 
> database db2, returning NoSuchObjectException
> 15:13:40.739 ERROR DataNucleus.Datastore.Persist: Update of object 
> "org.apache.hadoop.hive.metastore.model.MFunction@205629e9" using statement 
> "UPDATE FUNCS SET FUNC_NAME=? WHERE FUNC_ID=?" failed : 
> org.apache.derby.shared.common.error.DerbySQLIntegrityConstraintViolationException:
>  The statement was aborted because it would have caused a duplicate key value 
> in a unique or primary key constraint or unique index identified by 
> 'UNIQUEFUNCTION' defined on 'FUNCS'.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown 
> Source)
> {noformat}






[jira] [Commented] (SPARK-16935) Verification of Function-related ExternalCatalog APIs

2016-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410762#comment-15410762
 ] 

Apache Spark commented on SPARK-16935:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14521

> Verification of Function-related ExternalCatalog APIs
> -
>
> Key: SPARK-16935
> URL: https://issues.apache.org/jira/browse/SPARK-16935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Function-related `HiveExternalCatalog` APIs do not have enough verification 
> logic. After this PR, `HiveExternalCatalog` and `InMemoryCatalog` become 
> consistent in their error handling. 
> For example, below is the exception we got when calling `renameFunction`: 
> {noformat}
> 15:13:40.369 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get 
> database db1, returning NoSuchObjectException
> 15:13:40.377 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get 
> database db2, returning NoSuchObjectException
> 15:13:40.739 ERROR DataNucleus.Datastore.Persist: Update of object 
> "org.apache.hadoop.hive.metastore.model.MFunction@205629e9" using statement 
> "UPDATE FUNCS SET FUNC_NAME=? WHERE FUNC_ID=?" failed : 
> org.apache.derby.shared.common.error.DerbySQLIntegrityConstraintViolationException:
>  The statement was aborted because it would have caused a duplicate key value 
> in a unique or primary key constraint or unique index identified by 
> 'UNIQUEFUNCTION' defined on 'FUNCS'.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown 
> Source)
> {noformat}






[jira] [Assigned] (SPARK-16935) Verification of Function-related ExternalCatalog APIs

2016-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16935:


Assignee: Apache Spark

> Verification of Function-related ExternalCatalog APIs
> -
>
> Key: SPARK-16935
> URL: https://issues.apache.org/jira/browse/SPARK-16935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Function-related `HiveExternalCatalog` APIs do not have enough verification 
> logic. After this PR, `HiveExternalCatalog` and `InMemoryCatalog` become 
> consistent in their error handling. 
> For example, below is the exception we got when calling `renameFunction`: 
> {noformat}
> 15:13:40.369 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get 
> database db1, returning NoSuchObjectException
> 15:13:40.377 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get 
> database db2, returning NoSuchObjectException
> 15:13:40.739 ERROR DataNucleus.Datastore.Persist: Update of object 
> "org.apache.hadoop.hive.metastore.model.MFunction@205629e9" using statement 
> "UPDATE FUNCS SET FUNC_NAME=? WHERE FUNC_ID=?" failed : 
> org.apache.derby.shared.common.error.DerbySQLIntegrityConstraintViolationException:
>  The statement was aborted because it would have caused a duplicate key value 
> in a unique or primary key constraint or unique index identified by 
> 'UNIQUEFUNCTION' defined on 'FUNCS'.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown 
> Source)
> {noformat}






[jira] [Created] (SPARK-16935) Verification of Function-related ExternalCatalog APIs

2016-08-06 Thread Xiao Li (JIRA)
Xiao Li created SPARK-16935:
---

 Summary: Verification of Function-related ExternalCatalog APIs
 Key: SPARK-16935
 URL: https://issues.apache.org/jira/browse/SPARK-16935
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


Function-related `HiveExternalCatalog` APIs do not have enough verification 
logic. After this PR, `HiveExternalCatalog` and `InMemoryCatalog` become 
consistent in their error handling. 

For example, below is the exception we got when calling `renameFunction`: 
{noformat}
15:13:40.369 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get 
database db1, returning NoSuchObjectException
15:13:40.377 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get 
database db2, returning NoSuchObjectException
15:13:40.739 ERROR DataNucleus.Datastore.Persist: Update of object 
"org.apache.hadoop.hive.metastore.model.MFunction@205629e9" using statement 
"UPDATE FUNCS SET FUNC_NAME=? WHERE FUNC_ID=?" failed : 
org.apache.derby.shared.common.error.DerbySQLIntegrityConstraintViolationException:
 The statement was aborted because it would have caused a duplicate key value 
in a unique or primary key constraint or unique index identified by 
'UNIQUEFUNCTION' defined on 'FUNCS'.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
Source)
at 
org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown 
Source)
at 
org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown 
Source)
{noformat}
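
A hypothetical sketch (not the actual Spark code) of the kind of verification this 
issue asks for: validate the database and both function names before touching the 
metastore, so the caller gets a catalog-level error instead of the Derby constraint 
violation above. The helper names below are illustrative assumptions.
{code}
def renameFunction(db: String, oldName: String, newName: String): Unit = {
  require(databaseExists(db), s"Database '$db' does not exist")
  require(functionExists(db, oldName), s"Function '$oldName' does not exist in '$db'")
  require(!functionExists(db, newName), s"Function '$newName' already exists in '$db'")
  doRenameFunction(db, oldName, newName)   // delegate to the underlying metastore call
}
{code}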







[jira] [Commented] (SPARK-16922) Query failure due to executor OOM in Spark 2.0

2016-08-06 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410703#comment-15410703
 ] 

Sital Kedia commented on SPARK-16922:
-

Update - The query works fine when broadcast hash join is turned off, so the 
issue might be in broadcast hash join. I added some debug prints in the 
UnsafeRowWriter class 
(https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java#L214)
and found that it receives a row of around 800MB and OOMs while trying to grow 
the buffer holder. This might suggest that there is some data corruption going 
on, probably in the broadcast hash join. 

cc- [~davies] - Any pointer on how to debug this issue further? 
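
For reference, a minimal way to turn broadcast hash join off (what the update above 
refers to), assuming a Spark 2.0 session named {{spark}}:
{code}
// Disable automatic broadcast joins so the plan falls back to a sort-merge join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
{code}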

> Query failure due to executor OOM in Spark 2.0
> --
>
> Key: SPARK-16922
> URL: https://issues.apache.org/jira/browse/SPARK-16922
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
> Stack trace - 
> {code}
>   at 
> org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Query plan in Spark 1.6
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
> +- TungstenExchange hashpartitioning(field1#101,200), None
>+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
>   +- Project [field1#101,field2#74]
>  +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as 
> decimal(20,0)) as bigint)], BuildRight
> :- ConvertToUnsafe
> :  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation 
> foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
> +- ConvertToUnsafe
>+- HiveTableScan [field1#101,field4#97], MetastoreRelation 
> foo, table2, Some(b)
> {code}
> Query plan in 2.0
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))])
> +- Exchange hashpartitioning(field1#160, 200)
>+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 
> 100.0))])
>   +- *Project [field2#133, field1#160]
>  +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as 
> decimal(20,0)) as bigint)], Inner, BuildRight
> :- *Filter isnotnull(field5#122L)
> :  +- HiveTableScan [field5#122L, field2#133], MetastoreRelation 
> foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 
> 2013-12-31)]
> +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as 
> decimal(20,0)) as bigint)))
>+- *Filter isnotnull(field4#156)
>   +- HiveTableScan [field4#156, field1#160], 
> MetastoreRelation foo, table2, b
> {code}






[jira] [Commented] (SPARK-16508) Fix documentation warnings found by R CMD check

2016-08-06 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410681#comment-15410681
 ] 

Shivaram Venkataraman commented on SPARK-16508:
---

Yeah we should deal with those. This JIRA was opened to track PRs for that as I 
mentioned in the comment list above.

> Fix documentation warnings found by R CMD check
> ---
>
> Key: SPARK-16508
> URL: https://issues.apache.org/jira/browse/SPARK-16508
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> A full list of warnings after the fixes in SPARK-16507 is at 
> https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb 






[jira] [Commented] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results

2016-08-06 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410651#comment-15410651
 ] 

Nattavut Sutyanyong commented on SPARK-16804:
-

The PR also extends the fix to block the {{TABLESAMPLE}} operation in any 
correlated subquery. 
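
For illustration (not a query from the PR), the kind of statement that would now be 
blocked, reusing the {{t1}}/{{t2}} temp views from the issue description:
{code}
// TABLESAMPLE is non-deterministic, so allowing it inside a correlated EXISTS
// could change results depending on where the correlated predicate is pushed.
sql("""select c1 from t1
       where exists (select 1 from t2 TABLESAMPLE (50 PERCENT) where t1.c1 = t2.c2)""").show()
{code}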

> Correlated subqueries containing non-deterministic operators return incorrect 
> results
> -
>
> Key: SPARK-16804
> URL: https://issues.apache.org/jira/browse/SPARK-16804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule 
> ResolveSubquery in the Analysis phase moves correlated predicates into join 
> predicates and neglects the semantics of the LIMIT.
> Example:
> {noformat}
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
> 1)").show
> +---+ 
>   
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result contains both rows from T1.






[jira] [Comment Edited] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results

2016-08-06 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400150#comment-15400150
 ] 

Nattavut Sutyanyong edited comment on SPARK-16804 at 8/6/16 4:21 PM:
-

To demonstrate that this fix does not unnecessarily block the "good" cases 
(where LIMIT is present but NOT on the correlated path), here is an example, 
which produces the same result set both with and without this proposed fix.

{noformat}
scala> sql("select c1 from t1 where exists (select 1 from (select 1 from t2 
limit 1) where
t1.c1=t2.c2)").show
+---+
| c1|
+---+
|  1|
+---+
{noformat}


was (Author: nsyca):
To demonstrate that this fix does not unnecessarily block the "good" cases 
(where LIMIT is present but NOT on the correlated path), here is an example, 
which produce the same result set in both with and without this proposed fix.

{{scala> sql("select c1 from t1 where exists (select 1 from (select 1 from t2 
limit 1) where
t1.c1=t2.c2)").show }}
{{+---+}}
{{| c1|}}
{{+---+}}
{{|  1|}}
{{+---+}}


> Correlated subqueries containing non-deterministic operators return incorrect 
> results
> -
>
> Key: SPARK-16804
> URL: https://issues.apache.org/jira/browse/SPARK-16804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule 
> ResolveSubquery in the Analysis phase moves correlated predicates into join 
> predicates and neglects the semantics of the LIMIT.
> Example:
> {noformat}
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
> 1)").show
> +---+ 
>   
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result contains both rows from T1.






[jira] [Comment Edited] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results

2016-08-06 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400090#comment-15400090
 ] 

Nattavut Sutyanyong edited comment on SPARK-16804 at 8/6/16 4:20 PM:
-

{noformat}
scala> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 
LIMIT 1)").explain(true)
== Parsed Logical Plan ==
'Project ['c1]
+- 'Filter exists#21
   :  +- 'SubqueryAlias exists#21
   : +- 'GlobalLimit 1
   :+- 'LocalLimit 1
   :   +- 'Project [unresolvedalias(1, None)]
   :  +- 'Filter ('t1.c1 = 't2.c2)
   : +- 'UnresolvedRelation `t2`
   +- 'UnresolvedRelation `t1`

== Analyzed Logical Plan ==
c1: int
Project [c1#17]
+- Filter predicate-subquery#21 [(c1#17 = c2#10)]
   :  +- SubqueryAlias predicate-subquery#21 [(c1#17 = c2#10)]   <== This 
correlated predicate is incorrectly moved above the LIMIT
   : +- GlobalLimit 1
   :+- LocalLimit 1
   :   +- Project [1 AS 1#26, c2#10]
   :  +- SubqueryAlias t2
   : +- Project [value#8 AS c2#10]
   :+- LocalRelation [value#8]
   +- SubqueryAlias t1
  +- Project [value#15 AS c1#17]
 +- LocalRelation [value#15]
{noformat}
Rewriting the correlated predicate in the subquery during the Analysis phase, 
moving it from below the LIMIT 1 operation to above it, causes the scan of the 
subquery table to return only 1 row. The correct semantics is that LIMIT 1 must 
be applied to the subquery for each input value from the parent table.


was (Author: nsyca):
{{noformat}}
scala> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 
LIMIT 1)").explain(true)
== Parsed Logical Plan ==
'Project ['c1]
+- 'Filter exists#21
   :  +- 'SubqueryAlias exists#21
   : +- 'GlobalLimit 1
   :+- 'LocalLimit 1
   :   +- 'Project [unresolvedalias(1, None)]
   :  +- 'Filter ('t1.c1 = 't2.c2)
   : +- 'UnresolvedRelation `t2`
   +- 'UnresolvedRelation `t1`

== Analyzed Logical Plan ==
c1: int
Project [c1#17]
+- Filter predicate-subquery#21 [(c1#17 = c2#10)]
   :  +- SubqueryAlias predicate-subquery#21 [(c1#17 = c2#10)]   <== This 
correlated predicate is incorrectly moved above the LIMIT
   : +- GlobalLimit 1
   :+- LocalLimit 1
   :   +- Project [1 AS 1#26, c2#10]
   :  +- SubqueryAlias t2
   : +- Project [value#8 AS c2#10]
   :+- LocalRelation [value#8]
   +- SubqueryAlias t1
  +- Project [value#15 AS c1#17]
 +- LocalRelation [value#15]
{{noformat}}
By rewriting the correlated predicate in the subquery in Analysis phase from 
below the LIMIT 1 operation to above it causing the scan of the subquery table 
to return only 1 row. The correct semantic is the LIMIT 1 must be applied on 
the subquery for each input value from the parent table.

> Correlated subqueries containing non-deterministic operators return incorrect 
> results
> -
>
> Key: SPARK-16804
> URL: https://issues.apache.org/jira/browse/SPARK-16804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule 
> ResolveSubquery in the Analysis phase moves correlated predicates into join 
> predicates and neglects the semantics of the LIMIT.
> Example:
> {noformat}
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
> 1)").show
> +---+ 
>   
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result contains both rows from T1.






[jira] [Comment Edited] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results

2016-08-06 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400090#comment-15400090
 ] 

Nattavut Sutyanyong edited comment on SPARK-16804 at 8/6/16 4:20 PM:
-

{{noformat}}
scala> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 
LIMIT 1)").explain(true)
== Parsed Logical Plan ==
'Project ['c1]
+- 'Filter exists#21
   :  +- 'SubqueryAlias exists#21
   : +- 'GlobalLimit 1
   :+- 'LocalLimit 1
   :   +- 'Project [unresolvedalias(1, None)]
   :  +- 'Filter ('t1.c1 = 't2.c2)
   : +- 'UnresolvedRelation `t2`
   +- 'UnresolvedRelation `t1`

== Analyzed Logical Plan ==
c1: int
Project [c1#17]
+- Filter predicate-subquery#21 [(c1#17 = c2#10)]
   :  +- SubqueryAlias predicate-subquery#21 [(c1#17 = c2#10)]   <== This 
correlated predicate is incorrectly moved above the LIMIT
   : +- GlobalLimit 1
   :+- LocalLimit 1
   :   +- Project [1 AS 1#26, c2#10]
   :  +- SubqueryAlias t2
   : +- Project [value#8 AS c2#10]
   :+- LocalRelation [value#8]
   +- SubqueryAlias t1
  +- Project [value#15 AS c1#17]
 +- LocalRelation [value#15]
{{noformat}}
Rewriting the correlated predicate in the subquery during the Analysis phase, 
moving it from below the LIMIT 1 operation to above it, causes the scan of the 
subquery table to return only 1 row. The correct semantics is that LIMIT 1 must 
be applied to the subquery for each input value from the parent table.


was (Author: nsyca):
scala> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 
LIMIT 1)").explain(true)
== Parsed Logical Plan ==
'Project ['c1]
+- 'Filter exists#21
   :  +- 'SubqueryAlias exists#21
   : +- 'GlobalLimit 1
   :+- 'LocalLimit 1
   :   +- 'Project [unresolvedalias(1, None)]
   :  +- 'Filter ('t1.c1 = 't2.c2)
   : +- 'UnresolvedRelation `t2`
   +- 'UnresolvedRelation `t1`

== Analyzed Logical Plan ==
c1: int
Project [c1#17]
+- Filter predicate-subquery#21 [(c1#17 = c2#10)]
   :  +- SubqueryAlias predicate-subquery#21 [(c1#17 = c2#10)]   <== This 
correlated predicate is incorrectly moved above the LIMIT
   : +- GlobalLimit 1
   :+- LocalLimit 1
   :   +- Project [1 AS 1#26, c2#10]
   :  +- SubqueryAlias t2
   : +- Project [value#8 AS c2#10]
   :+- LocalRelation [value#8]
   +- SubqueryAlias t1
  +- Project [value#15 AS c1#17]
 +- LocalRelation [value#15]

By rewriting the correlated predicate in the subquery in Analysis phase from 
below the LIMIT 1 operation to above it causing the scan of the subquery table 
to return only 1 row. The correct semantic is the LIMIT 1 must be applied on 
the subquery for each input value from the parent table.

> Correlated subqueries containing non-deterministic operators return incorrect 
> results
> -
>
> Key: SPARK-16804
> URL: https://issues.apache.org/jira/browse/SPARK-16804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule 
> ResolveSubquery in the Analysis phase moves correlated predicates into join 
> predicates and neglects the semantics of the LIMIT.
> Example:
> {noformat}
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
> 1)").show
> +---+ 
>   
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result contains both rows from T1.






[jira] [Comment Edited] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results

2016-08-06 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400150#comment-15400150
 ] 

Nattavut Sutyanyong edited comment on SPARK-16804 at 8/6/16 4:18 PM:
-

To demonstrate that this fix does not unnecessarily block the "good" cases 
(where LIMIT is present but NOT on the correlated path), here is an example, 
which produces the same result set both with and without this proposed fix.

{{scala> sql("select c1 from t1 where exists (select 1 from (select 1 from t2 
limit 1) where
t1.c1=t2.c2)").show }}
{{+---+}}
{{| c1|}}
{{+---+}}
{{|  1|}}
{{+---+}}



was (Author: nsyca):
To demonstrate that this fix does not unnecessarily block the "good" cases 
(where LIMIT is present but NOT on the correlated path), here is an example, 
which produce the same result set in both with and without this proposed fix.

scala> sql("select c1 from t1 where exists (select 1 from (select 1 from t2 
limit 1) where t1.c1=t2.c2)").show 
+---+   
| c1|
+---+
|  1|
+---+


> Correlated subqueries containing non-deterministic operators return incorrect 
> results
> -
>
> Key: SPARK-16804
> URL: https://issues.apache.org/jira/browse/SPARK-16804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule 
> ResolveSubquery in the Analysis phase moves correlated predicates into join 
> predicates and neglects the semantics of the LIMIT.
> Example:
> {noformat}
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
> 1)").show
> +---+ 
>   
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result contains both rows from T1.






[jira] [Comment Edited] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results

2016-08-06 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400150#comment-15400150
 ] 

Nattavut Sutyanyong edited comment on SPARK-16804 at 8/6/16 4:14 PM:
-

To demonstrate that this fix does not unnecessarily block the "good" cases 
(where LIMIT is present but NOT on the correlated path), here is an example, 
which produces the same result set both with and without this proposed fix.

scala> sql("select c1 from t1 where exists (select 1 from (select 1 from t2 
limit 1) where t1.c1=t2.c2)").show 
+---+   
| c1|
+---+
|  1|
+---+



was (Author: nsyca):
To demonstrate that this fix does not unnecessarily block the "good" cases 
(where LIMIT is present but NOT on the correlated path), here is an example, 
which produce the same result set in both with and without this proposed fix.

scala> sql("select c1 from t1 where exists (select 1 from (select 1 from t2 
limit 1) where t1.c1=t2.c2)").show 
+---+   
| c1|
+---+
|  1|
+---+


> Correlated subqueries containing non-deterministic operators return incorrect 
> results
> -
>
> Key: SPARK-16804
> URL: https://issues.apache.org/jira/browse/SPARK-16804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule 
> ResolveSubquery in the Analysis phase moves correlated predicates into join 
> predicates and neglects the semantics of the LIMIT.
> Example:
> {noformat}
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
> 1)").show
> +---+ 
>   
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result contains both rows from T1.






[jira] [Commented] (SPARK-16933) AFTAggregator in AFTSurvivalRegression serializes unnecessary data

2016-08-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410642#comment-15410642
 ] 

Sean Owen commented on SPARK-16933:
---

OK, I also see https://issues.apache.org/jira/browse/SPARK-16934 opened for 
another instance. Let's not keep opening JIRAs for the same issue. One more to 
fix the rest? CC [~WeichenXu123]

> AFTAggregator in AFTSurvivalRegression serializes unnecessary data
> --
>
> Key: SPARK-16933
> URL: https://issues.apache.org/jira/browse/SPARK-16933
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> This is basically the same issue as SPARK-16008, but for aft survival 
> regression, where {{parameters}} and {{featuresStd}} are unnecessarily 
> serialized between stages.






[jira] [Resolved] (SPARK-16917) Spark streaming kafka version compatibility.

2016-08-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16917.
---
Resolution: Duplicate

> Spark streaming kafka version compatibility. 
> -
>
> Key: SPARK-16917
> URL: https://issues.apache.org/jira/browse/SPARK-16917
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Sudev
>Priority: Trivial
>  Labels: documentation
>
> It would be nice to have Kafka version compatibility information in the 
> official documentation. 
> It's very confusing now. 
> * If you look at this JIRA[1], it seems like Kafka is supported in Spark 
> 2.0.0.
> * The documentation lists the artifact for Kafka 0.8: 
> spark-streaming-kafka-0-8_2.11
> Is Kafka 0.9 supported by Spark 2.0.0?
> Since I'm confused here even after an hour's effort googling on the same, I 
> think someone should help add the compatibility matrix.
> [1] https://issues.apache.org/jira/browse/SPARK-12177






[jira] [Assigned] (SPARK-16934) Improve LogisticCostFun to avoid redundant serialization

2016-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16934:


Assignee: Apache Spark

> Improve LogisticCostFun to avoid redundant serialization
> 
>
> Key: SPARK-16934
> URL: https://issues.apache.org/jira/browse/SPARK-16934
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Weichen Xu
>Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> LogisticCostFun currently serializes the closure variable `featureStd` every 
> time it is called; we can improve this by using a broadcast variable.






[jira] [Assigned] (SPARK-16934) Improve LogisticCostFun to avoid redundant serialization

2016-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16934:


Assignee: (was: Apache Spark)

> Improve LogisticCostFun to avoid redundant serialization
> 
>
> Key: SPARK-16934
> URL: https://issues.apache.org/jira/browse/SPARK-16934
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> LogisticCostFun currently serializes the closure variable `featureStd` every 
> time it is called; we can improve this by using a broadcast variable.






[jira] [Commented] (SPARK-16934) Improve LogisticCostFun to avoid redundant serialization

2016-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410640#comment-15410640
 ] 

Apache Spark commented on SPARK-16934:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/14520

> Improve LogisticCostFun to avoid redundant serialization
> 
>
> Key: SPARK-16934
> URL: https://issues.apache.org/jira/browse/SPARK-16934
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Whenever LogisticCostFun.calculate is invoked, it serializes the closure 
> variable `featureStd`; we can avoid this by using a broadcast variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16934) Improve LogisticCostFun to avoid redundant serialization

2016-08-06 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-16934:
--

 Summary: Improve LogisticCostFun to avoid redundant serialization
 Key: SPARK-16934
 URL: https://issues.apache.org/jira/browse/SPARK-16934
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Weichen Xu


Whenever LogisticCostFun.calculate is invoked, it serializes the closure variable 
`featureStd`; we can avoid this by using a broadcast variable.
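
As a rough illustration of the suggestion, here is a minimal Scala sketch, assuming 
a heavily simplified stand-in for the cost function (the class {{BroadcastCostFun}} 
and its fields are hypothetical, not Spark's actual internals): the standardization 
array is broadcast once per executor instead of being captured and re-serialized in 
the task closure on every call.

{code}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Hypothetical, simplified cost function. The standardization array travels to
// each executor once via a Broadcast instead of being re-serialized with the
// task closure on every calculate() call.
class BroadcastCostFun(data: RDD[Array[Double]], bcFeaturesStd: Broadcast[Array[Double]]) {
  def calculate(coefficients: Array[Double]): Double = {
    val bcStd = bcFeaturesStd          // local val so the closure does not capture `this`
    data.map { features =>
      val std = bcStd.value            // executor-local copy of the array
      features.indices.map { i =>
        val scaled = if (std(i) != 0.0) features(i) / std(i) else 0.0
        scaled * coefficients(i)
      }.sum
    }.reduce(_ + _)
  }
}

// Usage sketch: broadcast the array once and reuse it across optimizer iterations.
// val bcStd = sc.broadcast(featuresStd)
// val costFun = new BroadcastCostFun(dataRdd, bcStd)
{code}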



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16917) Spark streaming kafka version compatibility.

2016-08-06 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410623#comment-15410623
 ] 

Cody Koeninger commented on SPARK-16917:


I think the doc changes I submitted make it pretty clear that 
spark-streaming-kafka-0-8 works with brokers 0.8 or higher, and 
spark-streaming-kafka-0-10 works with brokers 0.10 or higher.
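
For reference, a minimal sbt sketch of the two artifact choices as published for 
Spark 2.0.0 (my reading of the docs; the broker comments follow the statement above):

{code}
// build.sbt -- pick the ONE integration matching your broker version
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8"  % "2.0.0"  // brokers 0.8.2.1 or higher
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"  // brokers 0.10.0 or higher
{code}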

> Spark streaming kafka version compatibility. 
> -
>
> Key: SPARK-16917
> URL: https://issues.apache.org/jira/browse/SPARK-16917
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Sudev
>Priority: Trivial
>  Labels: documentation
>
> It would be nice to have Kafka version compatibility information in the 
> official documentation. 
> It's very confusing now. 
> * If you look at this JIRA [1], it seems that Kafka is supported in Spark 
> 2.0.0.
> * The documentation lists the artifact spark-streaming-kafka-0-8_2.11 
> (Kafka 0.8).
> Is Kafka 0.9 supported by Spark 2.0.0?
> Since I'm still confused after an hour's effort of googling, I think 
> someone should add the compatibility matrix.
> [1] https://issues.apache.org/jira/browse/SPARK-12177



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16933) AFTAggregator in AFTSurvivalRegression serializes unnecessary data

2016-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16933:


Assignee: Apache Spark

> AFTAggregator in AFTSurvivalRegression serializes unnecessary data
> --
>
> Key: SPARK-16933
> URL: https://issues.apache.org/jira/browse/SPARK-16933
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> This is basically the same issue as SPARK-16008, but for AFT survival 
> regression, where {{parameters}} and {{featuresStd}} are unnecessarily 
> serialized between stages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16933) AFTAggregator in AFTSurvivalRegression serializes unnecessary data

2016-08-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410616#comment-15410616
 ] 

Apache Spark commented on SPARK-16933:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/14519

> AFTAggregator in AFTSurvivalRegression serializes unnecessary data
> --
>
> Key: SPARK-16933
> URL: https://issues.apache.org/jira/browse/SPARK-16933
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> This is basically the same issue as SPARK-16008, but for AFT survival 
> regression, where {{parameters}} and {{featuresStd}} are unnecessarily 
> serialized between stages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16933) AFTAggregator in AFTSurvivalRegression serializes unnecessary data

2016-08-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16933:


Assignee: (was: Apache Spark)

> AFTAggregator in AFTSurvivalRegression serializes unnecessary data
> --
>
> Key: SPARK-16933
> URL: https://issues.apache.org/jira/browse/SPARK-16933
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> This is basically the same issue as SPARK-16008, but for AFT survival 
> regression, where {{parameters}} and {{featuresStd}} are unnecessarily 
> serialized between stages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16933) AFTAggregator in AFTSurvivalRegression serializes unnecessary data

2016-08-06 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-16933:
---

 Summary: AFTAggregator in AFTSurvivalRegression serializes 
unnecessary data
 Key: SPARK-16933
 URL: https://issues.apache.org/jira/browse/SPARK-16933
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Yanbo Liang


This is basically the same issue as SPARK-16008, but for AFT survival 
regression, where {{parameters}} and {{featuresStd}} are unnecessarily 
serialized between stages.
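
A rough Scala sketch of the kind of change implied, assuming a heavily simplified, 
hypothetical aggregator (not the real {{AFTAggregator}}): holding only {{Broadcast}} 
handles means the {{parameters}} and {{featuresStd}} arrays travel to each executor 
once, instead of with every serialized aggregator copy during {{treeAggregate}}.

{code}
import org.apache.spark.broadcast.Broadcast

// Hypothetical, simplified aggregator. Broadcast handles are cheap to serialize,
// so the large arrays are no longer shipped with each aggregator instance.
class AFTAggregatorSketch(
    bcParameters: Broadcast[Array[Double]],
    bcFeaturesStd: Broadcast[Array[Double]]) extends Serializable {

  // Materialized lazily on the executor from the broadcast values;
  // @transient keeps them out of the serialized form.
  @transient private lazy val parameters = bcParameters.value
  @transient private lazy val featuresStd = bcFeaturesStd.value

  private var lossSum = 0.0
  private var count = 0L

  def add(features: Array[Double], label: Double): this.type = {
    // Toy loss: dot product of standardized features with the parameters, minus the label.
    var dot = 0.0
    var i = 0
    while (i < features.length) {
      if (featuresStd(i) != 0.0) dot += (features(i) / featuresStd(i)) * parameters(i)
      i += 1
    }
    lossSum += dot - label
    count += 1
    this
  }

  def merge(other: AFTAggregatorSketch): this.type = {
    lossSum += other.lossSum
    count += other.count
    this
  }

  def loss: Double = if (count > 0L) lossSum / count else 0.0
}
{code}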



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16864) Comprehensive version info

2016-08-06 Thread Jan Gorecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410596#comment-15410596
 ] 

Jan Gorecki edited comment on SPARK-16864 at 8/6/16 11:50 AM:
--

Record the exact Spark source code reference while processing an ETL workflow, so 
that performance implications can be measured precisely against a point in time in 
the source code. I doubt that a version number or date/time is a natural key for 
the Spark source code, is it? If you don't have a natural key, you can't build a 
reliable workflow. How would you automatically git clone, reset, build, deploy and 
re-run your workflow, based on data collected by Spark, if you don't even have the 
git commit there? Looking up the git commit hash by version and date... sure, it 
works, but why can't users just access that info directly? I don't see ANY reason 
not to have that feature. If you have one, I would be glad to read it. And no, even 
for developers that info is not available at runtime.


was (Author: jangorecki):
Record exact spark source code reference while processing ETL workflow so 
performance implication can be measures precisely referencing point in time of 
source code. I doubt if version number or date/time is a natural key for spark 
source code, is it? If you don't have a natural key you can't build reliable 
workflow. How would you automatically git clone, reset, build, deploy and 
re-run your workflow - based on data collected by spark - if you don't even 
have git commit there? Lookup git commit hash by version and date... sure it 
works, but why users can't just access that info directly? I don't see ANY 
reason to not have that feature? If you have any I would be glad to read. And 
no, even for developers that info is not available on runtime.

> Comprehensive version info 
> ---
>
> Key: SPARK-16864
> URL: https://issues.apache.org/jira/browse/SPARK-16864
> Project: Spark
>  Issue Type: Improvement
>Reporter: jay vyas
>
> Spark versions can be grepped out of the Spark banner that comes up on 
> startup, but otherwise, there is no programmatic/reliable way to get version 
> information.
> Also there is no git commit id, etc., so precise version checking isn't 
> possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16864) Comprehensive version info

2016-08-06 Thread Jan Gorecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410596#comment-15410596
 ] 

Jan Gorecki commented on SPARK-16864:
-

Record the exact Spark source code reference while processing an ETL workflow, so 
that performance implications can be measured precisely against a point in time in 
the source code. I doubt that a version number or date/time is a natural key for 
the Spark source code, is it? If you don't have a natural key, you can't build a 
reliable workflow. How would you automatically git clone, reset, build, deploy and 
re-run your workflow, based on data collected by Spark, if you don't even have the 
git commit there? Looking up the git commit hash by version and date... sure, it 
works, but why can't users just access that info directly? I don't see ANY reason 
not to have that feature. If you have one, I would be glad to read it. And no, even 
for developers that info is not available at runtime.
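
For context, a small Scala sketch of what can be queried programmatically in Spark 
2.0.0 today versus the git commit information being asked for (to my knowledge there 
is no public API exposing the commit hash, which is exactly the gap described above):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("version-check").getOrCreate()

// The release string is available at runtime...
println(spark.version)                   // e.g. "2.0.0"
println(org.apache.spark.SPARK_VERSION)  // the same release string, as a constant

// ...but there is no comparable accessor for the git commit hash or build
// metadata, so an exact source reference cannot be recorded from a running job.
{code}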

> Comprehensive version info 
> ---
>
> Key: SPARK-16864
> URL: https://issues.apache.org/jira/browse/SPARK-16864
> Project: Spark
>  Issue Type: Improvement
>Reporter: jay vyas
>
> Spark versions can be grepped out of the Spark banner that comes up on 
> startup, but otherwise, there is no programmatic/reliable way to get version 
> information.
> Also there is no git commit id, etc., so precise version checking isn't 
> possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16326) Evaluate sparklyr package from RStudio

2016-08-06 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410526#comment-15410526
 ] 

Shivaram Venkataraman commented on SPARK-16326:
---

Yeah, I don't think there is anything actionable here per se. We can continue 
this discussion on the dev mailing list and open new issues when required?

> Evaluate sparklyr package from RStudio
> --
>
> Key: SPARK-16326
> URL: https://issues.apache.org/jira/browse/SPARK-16326
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SparkR
>Reporter: Sun Rui
>
> RStudio has developed sparklyr (https://github.com/rstudio/sparklyr), 
> connecting the R community to Spark. A rough review shows that sparklyr 
> provides a dplyr backend and a new API for MLlib and for calling Spark from 
> R. Of course, sparklyr internally uses the low-level mechanisms in SparkR.
> We can discuss how to position SparkR with sparklyr.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org