[jira] [Commented] (SPARK-16913) [SQL] Better codegen where querying nested struct
[ https://issues.apache.org/jira/browse/SPARK-16913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410840#comment-15410840 ]

Kazuaki Ishizaki commented on SPARK-16913:
------------------------------------------

It seems to copy each element in the struct. Since {{InternalRow}} does not nest a structure, the {{InternalRow}} keeps two scalar values, each consisting of {{isNull}} and {{value}} in this case. If we could provide a better schema property (i.e. {{nullable = false}} for {{a}} and {{b}}), lines 044-062 would be simpler.

> [SQL] Better codegen where querying nested struct
> -------------------------------------------------
>
>                 Key: SPARK-16913
>                 URL: https://issues.apache.org/jira/browse/SPARK-16913
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Maciej Bryński
>
> I have a parquet file created as the result of:
> {code}
> spark.range(100).selectExpr("id as a", "id as b").selectExpr("struct(a, b) as c").write.parquet("/mnt/mfs/codegen_test")
> {code}
> Then I query the whole nested structure with:
> {code}
> spark.read.parquet("/mnt/mfs/codegen_test").selectExpr("c.*")
> {code}
> Whole-stage codegen produces the following code. Is it possible to remove the part starting at line 044 and just return the whole result of getStruct (maybe just copied)?
> {code}
> Generated code:
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
> /* 006 */   private Object[] references;
> /* 007 */   private org.apache.spark.sql.execution.metric.SQLMetric scan_numOutputRows;
> /* 008 */   private scala.collection.Iterator scan_input;
> /* 009 */   private UnsafeRow scan_result;
> /* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder scan_holder;
> /* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter scan_rowWriter;
> /* 012 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter scan_rowWriter1;
> /* 013 */   private UnsafeRow project_result;
> /* 014 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder;
> /* 015 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter project_rowWriter;
> /* 016 */
> /* 017 */   public GeneratedIterator(Object[] references) {
> /* 018 */     this.references = references;
> /* 019 */   }
> /* 020 */
> /* 021 */   public void init(int index, scala.collection.Iterator inputs[]) {
> /* 022 */     partitionIndex = index;
> /* 023 */     this.scan_numOutputRows = (org.apache.spark.sql.execution.metric.SQLMetric) references[0];
> /* 024 */     scan_input = inputs[0];
> /* 025 */     scan_result = new UnsafeRow(1);
> /* 026 */     this.scan_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(scan_result, 32);
> /* 027 */     this.scan_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(scan_holder, 1);
> /* 028 */     this.scan_rowWriter1 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(scan_holder, 2);
> /* 029 */     project_result = new UnsafeRow(2);
> /* 030 */     this.project_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 0);
> /* 031 */     this.project_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder, 2);
> /* 032 */   }
> /* 033 */
> /* 034 */   protected void processNext() throws java.io.IOException {
> /* 035 */     while (scan_input.hasNext()) {
> /* 036 */       InternalRow scan_row = (InternalRow) scan_input.next();
> /* 037 */       scan_numOutputRows.add(1);
> /* 038 */       boolean scan_isNull = scan_row.isNullAt(0);
> /* 039 */       InternalRow scan_value = scan_isNull ? null : (scan_row.getStruct(0, 2));
> /* 040 */
> /* 041 */       boolean project_isNull = scan_isNull;
> /* 042 */       long project_value = -1L;
> /* 043 */
> /* 044 */       if (!scan_isNull) {
> /* 045 */         if (scan_value.isNullAt(0)) {
> /* 046 */           project_isNull = true;
> /* 047 */         } else {
> /* 048 */           project_value = scan_value.getLong(0);
> /* 049 */         }
> /* 050 */
> /* 051 */       }
> /* 052 */       boolean project_isNull2 = scan_isNull;
> /* 053 */       long project_value2 = -1L;
> /* 054 */
> /* 055 */       if (!scan_isNull) {
> /* 056 */         if (scan_value.isNullAt(1)) {
> /* 057 */           project_isNull2 = true;
> /* 058 */         } else {
> /* 059 */           project_value2 = scan_value.getLong(1);
> /* 060 */         }
> /* 061 */
> /* 062 */       }
> /* 063 */       project_rowWriter.zeroOutN
> {code}
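Kazuaki's nullability point can be illustrated without Spark. The sketch below is a plain-Scala model, not the real codegen (which operates on UnsafeRow buffers); {{Field}}, {{copyNullable}} and {{copyNonNullable}} are hypothetical names. When a field may be null, each copy needs the per-field isNull branch seen at lines 044-062; when the schema guarantees {{nullable = false}}, the branches disappear and the struct could be copied wholesale.

```scala
// Hypothetical model of the two codegen shapes for copying a 2-field struct.
case class Field(value: Long, isNull: Boolean)

// With nullable fields the writer must branch per field, as in lines 044-062.
def copyNullable(struct: Seq[Field]): Seq[Option[Long]] =
  struct.map(f => if (f.isNull) None else Some(f.value))

// With nullable = false the branches vanish and a direct copy suffices.
def copyNonNullable(struct: Seq[Field]): Seq[Long] =
  struct.map(_.value)

val row = Seq(Field(1L, isNull = false), Field(2L, isNull = false))
assert(copyNullable(row).flatten == copyNonNullable(row))
```

For non-null data both shapes produce the same values; the second simply does less work per row, which is why better nullability information in the schema would shrink the generated code.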
[jira] [Commented] (SPARK-8904) When using LDA DAGScheduler throws exception
[ https://issues.apache.org/jira/browse/SPARK-8904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410835#comment-15410835 ]

Nabarun commented on SPARK-8904:
--------------------------------

This seems to be related to something I am seeing at my end too. I converted my countVectors into a DF:

{code}
val ldaDF = countVectors.map { case Row(id: Long, countVector: Vector) => (id, countVector) }
{code}

When I try to display it, it throws the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3148.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3148.0 (TID 11632, 10.209.235.85): scala.MatchError: [0,(1671,[1,2,3,5,8,10,11,12,14,15,17,18,20,21,23,27,28,29,30,31,32,36,37,38,39,41,42,43,45,46,51,52,54,66,69,71,74,75,78,80,82,83,85,88,89,90,92,96,97,98,99,102,104,106,107,108,109,111,112,115,118,121,123,124,126,134,138,139,143,144,145,148,150,151,152,153,155,161,166,171,172,173,174,176,178,179,180,181,189,190,197,199,200,201,207,209,212,216,217,218,220,222,223,224,226,227,228,232,234,238,240,244,246,250,252,254,255,260,261,262,264,268,269,270,277,280,281,282,286,292,294,295,296,297,301,310,312,314,316,318,323,324,325,337,341,343,346,347,351,355,359,366,367,379,380,381,388,390,391,398,403,405,411,417,442,444,448,456,460,464,466,468,470,477,480,484,487,490,491,495,496,501,502,507,509,512,522,523,527,529,531,533,534,535,552,554,556,557,565,566,567,569,574,575,585,624,630,632,633,638,644,646,652,653,658,668,669,670,680,683,686,690,693,696,698,704,705,712,723,726,736,746,747,750,757,758,761,765,773,774,775,783,786,796,797,801,807,811,815,825,830,833,843,844,845,847,849,859,861,862,864,867,871,872,876,879,882,892,895,896,897,912,923,924,935,937,941,944,945,948,949,952,968,982,989,1000,1003,1015,1018,1021,1025,1029,1034,1036,1038,1041,1048,1072,1082,1086,1092,1106,,1114,1117,1123,1128,1133,1135,1145,1149,1154,1168,1169,1171,1178,1180,1181,1183,1184,1201,1224,1234,1240,1250,1260,1261,1267,1269,1270,1280,1305,1309,1317,
1333,1354,1355,1358,1378,1379,1386,1389,1393,1411,1413,1426,1428,1475,1480,1504,1506,1521,1525,1530,1532,1545,1555,1601,1614,1635,1643,1649,1653,1668],[1.0,5.0,4.0,3.0,2.0,14.0,30.0,2.0,72.0,9.0,6.0,6.0,1.0,13.0,1.0,4.0,1.0,3.0,2.0,10.0,2.0,4.0,74.0,3.0,11.0,1.0,35.0,1.0,16.0,1.0,2.0,15.0,3.0,4.0,17.0,2.0,8.0,60.0,35.0,3.0,1.0,33.0,2.0,2.0,3.0,11.0,16.0,2.0,8.0,2.0,3.0,48.0,1.0,1.0,4.0,8.0,4.0,3.0,4.0,4.0,1.0,3.0,1.0,11.0,1.0,2.0,3.0,1.0,35.0,6.0,2.0,1.0,2.0,3.0,3.0,4.0,2.0,2.0,1.0,1.0,20.0,9.0,6.0,17.0,10.0,8.0,1.0,12.0,1.0,3.0,3.0,2.0,9.0,1.0,2.0,19.0,1.0,2.0,1.0,1.0,2.0,9.0,1.0,1.0,1.0,5.0,1.0,2.0,5.0,1.0,1.0,1.0,1.0,1.0,7.0,1.0,14.0,2.0,2.0,1.0,5.0,2.0,5.0,5.0,20.0,2.0,27.0,3.0,4.0,11.0,1.0,3.0,3.0,1.0,2.0,2.0,7.0,5.0,2.0,2.0,1.0,3.0,1.0,2.0,1.0,2.0,8.0,5.0,1.0,5.0,3.0,1.0,4.0,3.0,3.0,4.0,1.0,3.0,4.0,1.0,2.0,3.0,5.0,7.0,1.0,8.0,1.0,2.0,4.0,2.0,1.0,12.0,5.0,1.0,6.0,4.0,2.0,2.0,1.0,1.0,3.0,4.0,1.0,1.0,2.0,4.0,3.0,1.0,2.0,6.0,1.0,1.0,1.0,4.0,2.0,1.0,7.0,12.0,1.0,12.0,1.0,1.0,9.0,2.0,1.0,2.0,1.0,1.0,6.0,6.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,3.0,1.0,1.0,2.0,1.0,3.0,1.0,4.0,1.0,5.0,2.0,1.0,2.0,2.0,3.0,1.0,2.0,1.0,1.0,2.0,3.0,1.0,4.0,3.0,1.0,3.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,2.0,2.0,1.0,1.0,2.0,3.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,5.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema) at line1907dd16af5d4fbfa217a9d52f096b36316.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:142) at 
line1907dd16af5d4fbfa217a9d52f096b36316.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:142) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:231) at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:790) at org.apache.spark.rdd.RDD$$anon
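The thread above does not confirm the root cause, but one plausible explanation for the MatchError is that the value in the row is not an instance of the {{Vector}} class being matched (e.g. {{org.apache.spark.ml.linalg.Vector}} vs {{org.apache.spark.mllib.linalg.Vector}} in Spark 2.0). The mechanics can be modeled without Spark; {{FakeRow}}, {{MlVec}} and {{MllibVec}} below are hypothetical stand-ins, not Spark classes:

```scala
// Minimal model of the failure mode: a Row-like container holds values as Any,
// and a pattern match on the wrong concrete type throws scala.MatchError.
final case class FakeRow(values: Seq[Any]) {
  def getAs[T](i: Int): T = values(i).asInstanceOf[T]
}

sealed trait Vec
final case class MlVec(xs: Seq[Double]) extends Vec    // stands in for ml.linalg.Vector
final case class MllibVec(xs: Seq[Double]) extends Vec // stands in for mllib.linalg.Vector

val row = FakeRow(Seq(0L, MlVec(Seq(1.0, 5.0))))

// Matching against the wrong Vector type fails at runtime:
val failed =
  try { row.values match { case Seq(id: Long, v: MllibVec) => false } }
  catch { case _: MatchError => true }
assert(failed)

// Extracting by position with the correct type succeeds:
assert(row.getAs[MlVec](1).xs.head == 1.0)
```

In real Spark code the analogous fixes are to import the Vector type that actually matches the column's runtime class, or to extract fields with `row.getAs` instead of a destructuring pattern match.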
[jira] [Commented] (SPARK-16936) Case Sensitivity Support for Refresh Temp Table
[ https://issues.apache.org/jira/browse/SPARK-16936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410833#comment-15410833 ]

Apache Spark commented on SPARK-16936:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14523

> Case Sensitivity Support for Refresh Temp Table
> -----------------------------------------------
>
>                 Key: SPARK-16936
>                 URL: https://issues.apache.org/jira/browse/SPARK-16936
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiao Li
>
> Currently, the `refreshTable` API is always case sensitive.
> When users pass the view name without an exact case match, the API silently ignores the call, so users might assume the command completed successfully. However, when they run subsequent SQL commands, they may still get an exception like:
> {noformat}
> Job aborted due to stage failure:
> Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 7, localhost):
> java.io.FileNotFoundException:
> File file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-bd4b9ea6-9aec-49c5-8f05-01cff426211e/part-r-0-0c84b915-c032-4f2e-abf5-1d48fdbddf38.snappy.parquet does not exist
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16936) Case Sensitivity Support for Refresh Temp Table
[ https://issues.apache.org/jira/browse/SPARK-16936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16936:
------------------------------------

    Assignee:     (was: Apache Spark)

> Case Sensitivity Support for Refresh Temp Table
> -----------------------------------------------
>
>                 Key: SPARK-16936
>                 URL: https://issues.apache.org/jira/browse/SPARK-16936
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiao Li
>
> Currently, the `refreshTable` API is always case sensitive.
> When users pass the view name without an exact case match, the API silently ignores the call, so users might assume the command completed successfully. However, when they run subsequent SQL commands, they may still get an exception like:
> {noformat}
> Job aborted due to stage failure:
> Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 7, localhost):
> java.io.FileNotFoundException:
> File file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-bd4b9ea6-9aec-49c5-8f05-01cff426211e/part-r-0-0c84b915-c032-4f2e-abf5-1d48fdbddf38.snappy.parquet does not exist
> {noformat}
[jira] [Assigned] (SPARK-16936) Case Sensitivity Support for Refresh Temp Table
[ https://issues.apache.org/jira/browse/SPARK-16936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16936:
------------------------------------

    Assignee: Apache Spark

> Case Sensitivity Support for Refresh Temp Table
> -----------------------------------------------
>
>                 Key: SPARK-16936
>                 URL: https://issues.apache.org/jira/browse/SPARK-16936
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiao Li
>            Assignee: Apache Spark
>
> Currently, the `refreshTable` API is always case sensitive.
> When users pass the view name without an exact case match, the API silently ignores the call, so users might assume the command completed successfully. However, when they run subsequent SQL commands, they may still get an exception like:
> {noformat}
> Job aborted due to stage failure:
> Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 7, localhost):
> java.io.FileNotFoundException:
> File file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-bd4b9ea6-9aec-49c5-8f05-01cff426211e/part-r-0-0c84b915-c032-4f2e-abf5-1d48fdbddf38.snappy.parquet does not exist
> {noformat}
[jira] [Created] (SPARK-16936) Case Sensitivity Support for Refresh Temp Table
Xiao Li created SPARK-16936:
-------------------------------

             Summary: Case Sensitivity Support for Refresh Temp Table
                 Key: SPARK-16936
                 URL: https://issues.apache.org/jira/browse/SPARK-16936
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Xiao Li

Currently, the `refreshTable` API is always case sensitive.

When users pass the view name without an exact case match, the API silently ignores the call, so users might assume the command completed successfully. However, when they run subsequent SQL commands, they may still get an exception like:

{noformat}
Job aborted due to stage failure:
Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 7, localhost):
java.io.FileNotFoundException:
File file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-bd4b9ea6-9aec-49c5-8f05-01cff426211e/part-r-0-0c84b915-c032-4f2e-abf5-1d48fdbddf38.snappy.parquet does not exist
{noformat}
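The direction of a fix for the silent miss described above can be sketched in plain Scala. This is a hypothetical model ({{resolve}} and the registry map are invented names; the actual change lives in the session catalog's temp-table lookup): resolve the name according to the session's case-sensitivity setting instead of requiring an exact match.

```scala
// Hypothetical model of temp-table name resolution: with caseSensitive = false
// the lookup must match names case-insensitively instead of silently missing.
def resolve(registry: Map[String, String], name: String, caseSensitive: Boolean): Option[String] =
  if (caseSensitive) registry.get(name)
  else registry.collectFirst { case (k, v) if k.equalsIgnoreCase(name) => v }

val tables = Map("logData" -> "cached plan for logData")

assert(resolve(tables, "LOGDATA", caseSensitive = false).isDefined) // found after the fix
assert(resolve(tables, "LOGDATA", caseSensitive = true).isEmpty)    // the silent miss in the bug
```

The key point is that an API which silently returns on a failed lookup hides the mismatch; either the lookup should honor the case-sensitivity setting or the miss should raise an error.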
[jira] [Resolved] (SPARK-16925) Spark tasks which cause JVM to exit with a zero exit code may cause app to hang in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen resolved SPARK-16925.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0
                   2.0.1
                   1.6.3

> Spark tasks which cause JVM to exit with a zero exit code may cause app to hang in Standalone mode
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16925
>                 URL: https://issues.apache.org/jira/browse/SPARK-16925
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 1.6.0, 2.0.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Critical
>             Fix For: 1.6.3, 2.0.1, 2.1.0
>
> If you have a Spark standalone cluster which runs a single application, and a Spark task repeatedly fails by causing the executor JVM to exit with a _zero_ exit code, this may temporarily freeze / hang the Spark application.
> For example, running
> {code}
> sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
> {code}
> on a cluster will cause all executors to die, but those executors won't be replaced unless another Spark application or worker joins or leaves the cluster. This is caused by a bug in the standalone Master where {{schedule()}} is only called on executor exit when the exit code is non-zero, whereas I think we should always call {{schedule()}}, even on a "clean" executor shutdown, since {{schedule()}} should always be safe to call.
[jira] [Commented] (SPARK-16925) Spark tasks which cause JVM to exit with a zero exit code may cause app to hang in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410803#comment-15410803 ]

Josh Rosen commented on SPARK-16925:
------------------------------------

Fixed by my patch.

> Spark tasks which cause JVM to exit with a zero exit code may cause app to hang in Standalone mode
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16925
>                 URL: https://issues.apache.org/jira/browse/SPARK-16925
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 1.6.0, 2.0.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Critical
>             Fix For: 1.6.3, 2.0.1, 2.1.0
>
> If you have a Spark standalone cluster which runs a single application, and a Spark task repeatedly fails by causing the executor JVM to exit with a _zero_ exit code, this may temporarily freeze / hang the Spark application.
> For example, running
> {code}
> sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
> {code}
> on a cluster will cause all executors to die, but those executors won't be replaced unless another Spark application or worker joins or leaves the cluster. This is caused by a bug in the standalone Master where {{schedule()}} is only called on executor exit when the exit code is non-zero, whereas I think we should always call {{schedule()}}, even on a "clean" executor shutdown, since {{schedule()}} should always be safe to call.
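The bug and fix described in SPARK-16925 can be sketched in plain Scala. The {{Master}} class below is a hypothetical stand-in for the real {{org.apache.spark.deploy.master.Master}}; the essence is to call {{schedule()}} on every executor exit rather than only on non-zero exit codes.

```scala
// Minimal model of the Master's executor-exit handling (hypothetical names).
final class Master {
  var scheduleCalls = 0
  def schedule(): Unit = scheduleCalls += 1 // idempotent/safe in the real Master

  // Buggy behavior: replacement executors are only scheduled for abnormal
  // exits, so a clean System.exit(0) leaves the app with no executors.
  def onExecutorExitBuggy(exitCode: Int): Unit =
    if (exitCode != 0) schedule()

  // Fixed behavior: always reschedule, since schedule() is safe to call.
  def onExecutorExitFixed(exitCode: Int): Unit =
    schedule()
}

val m = new Master
m.onExecutorExitBuggy(0)
assert(m.scheduleCalls == 0) // the hang: no executors re-launched
m.onExecutorExitFixed(0)
assert(m.scheduleCalls == 1)
```

The design point is that when an action is idempotent and cheap, guarding it behind an unnecessary condition only creates states in which it is wrongly skipped.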
[jira] [Commented] (SPARK-16508) Fix documentation warnings found by R CMD check
[ https://issues.apache.org/jira/browse/SPARK-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410791#comment-15410791 ]

Apache Spark commented on SPARK-16508:
--------------------------------------

User 'junyangq' has created a pull request for this issue:
https://github.com/apache/spark/pull/14522

> Fix documentation warnings found by R CMD check
> -----------------------------------------------
>
>                 Key: SPARK-16508
>                 URL: https://issues.apache.org/jira/browse/SPARK-16508
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>            Reporter: Shivaram Venkataraman
>
> A full list of warnings after the fixes in SPARK-16507 is at
> https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb
[jira] [Assigned] (SPARK-16508) Fix documentation warnings found by R CMD check
[ https://issues.apache.org/jira/browse/SPARK-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16508:
------------------------------------

    Assignee:     (was: Apache Spark)

> Fix documentation warnings found by R CMD check
> -----------------------------------------------
>
>                 Key: SPARK-16508
>                 URL: https://issues.apache.org/jira/browse/SPARK-16508
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>            Reporter: Shivaram Venkataraman
>
> A full list of warnings after the fixes in SPARK-16507 is at
> https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb
[jira] [Assigned] (SPARK-16508) Fix documentation warnings found by R CMD check
[ https://issues.apache.org/jira/browse/SPARK-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16508:
------------------------------------

    Assignee: Apache Spark

> Fix documentation warnings found by R CMD check
> -----------------------------------------------
>
>                 Key: SPARK-16508
>                 URL: https://issues.apache.org/jira/browse/SPARK-16508
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>            Reporter: Shivaram Venkataraman
>            Assignee: Apache Spark
>
> A full list of warnings after the fixes in SPARK-16507 is at
> https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb
[jira] [Assigned] (SPARK-16935) Verification of Function-related ExternalCatalog APIs
[ https://issues.apache.org/jira/browse/SPARK-16935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16935:
------------------------------------

    Assignee:     (was: Apache Spark)

> Verification of Function-related ExternalCatalog APIs
> -----------------------------------------------------
>
>                 Key: SPARK-16935
>                 URL: https://issues.apache.org/jira/browse/SPARK-16935
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiao Li
>
> Function-related `HiveExternalCatalog` APIs do not have enough verification logic. After the PR, `HiveExternalCatalog` and `InMemoryCatalog` become consistent in their error handling.
> For example, below is the exception we got when calling `renameFunction`.
> {noformat}
> 15:13:40.369 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db1, returning NoSuchObjectException
> 15:13:40.377 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db2, returning NoSuchObjectException
> 15:13:40.739 ERROR DataNucleus.Datastore.Persist: Update of object "org.apache.hadoop.hive.metastore.model.MFunction@205629e9" using statement "UPDATE FUNCS SET FUNC_NAME=? WHERE FUNC_ID=?" failed : org.apache.derby.shared.common.error.DerbySQLIntegrityConstraintViolationException: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUEFUNCTION' defined on 'FUNCS'.
>         at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>         at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
>         at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
>         at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
> {noformat}
[jira] [Commented] (SPARK-16935) Verification of Function-related ExternalCatalog APIs
[ https://issues.apache.org/jira/browse/SPARK-16935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410762#comment-15410762 ]

Apache Spark commented on SPARK-16935:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14521

> Verification of Function-related ExternalCatalog APIs
> -----------------------------------------------------
>
>                 Key: SPARK-16935
>                 URL: https://issues.apache.org/jira/browse/SPARK-16935
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiao Li
>
> Function-related `HiveExternalCatalog` APIs do not have enough verification logic. After the PR, `HiveExternalCatalog` and `InMemoryCatalog` become consistent in their error handling.
> For example, below is the exception we got when calling `renameFunction`.
> {noformat}
> 15:13:40.369 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db1, returning NoSuchObjectException
> 15:13:40.377 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db2, returning NoSuchObjectException
> 15:13:40.739 ERROR DataNucleus.Datastore.Persist: Update of object "org.apache.hadoop.hive.metastore.model.MFunction@205629e9" using statement "UPDATE FUNCS SET FUNC_NAME=? WHERE FUNC_ID=?" failed : org.apache.derby.shared.common.error.DerbySQLIntegrityConstraintViolationException: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUEFUNCTION' defined on 'FUNCS'.
>         at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>         at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
>         at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
>         at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
> {noformat}
[jira] [Assigned] (SPARK-16935) Verification of Function-related ExternalCatalog APIs
[ https://issues.apache.org/jira/browse/SPARK-16935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16935:
------------------------------------

    Assignee: Apache Spark

> Verification of Function-related ExternalCatalog APIs
> -----------------------------------------------------
>
>                 Key: SPARK-16935
>                 URL: https://issues.apache.org/jira/browse/SPARK-16935
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiao Li
>            Assignee: Apache Spark
>
> Function-related `HiveExternalCatalog` APIs do not have enough verification logic. After the PR, `HiveExternalCatalog` and `InMemoryCatalog` become consistent in their error handling.
> For example, below is the exception we got when calling `renameFunction`.
> {noformat}
> 15:13:40.369 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db1, returning NoSuchObjectException
> 15:13:40.377 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db2, returning NoSuchObjectException
> 15:13:40.739 ERROR DataNucleus.Datastore.Persist: Update of object "org.apache.hadoop.hive.metastore.model.MFunction@205629e9" using statement "UPDATE FUNCS SET FUNC_NAME=? WHERE FUNC_ID=?" failed : org.apache.derby.shared.common.error.DerbySQLIntegrityConstraintViolationException: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUEFUNCTION' defined on 'FUNCS'.
>         at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>         at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
>         at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
>         at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
> {noformat}
[jira] [Created] (SPARK-16935) Verification of Function-related ExternalCatalog APIs
Xiao Li created SPARK-16935:
-------------------------------

             Summary: Verification of Function-related ExternalCatalog APIs
                 Key: SPARK-16935
                 URL: https://issues.apache.org/jira/browse/SPARK-16935
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Xiao Li

Function-related `HiveExternalCatalog` APIs do not have enough verification logic. After the PR, `HiveExternalCatalog` and `InMemoryCatalog` become consistent in their error handling.

For example, below is the exception we got when calling `renameFunction`.

{noformat}
15:13:40.369 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db1, returning NoSuchObjectException
15:13:40.377 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db2, returning NoSuchObjectException
15:13:40.739 ERROR DataNucleus.Datastore.Persist: Update of object "org.apache.hadoop.hive.metastore.model.MFunction@205629e9" using statement "UPDATE FUNCS SET FUNC_NAME=? WHERE FUNC_ID=?" failed : org.apache.derby.shared.common.error.DerbySQLIntegrityConstraintViolationException: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUEFUNCTION' defined on 'FUNCS'.
        at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
{noformat}
[jira] [Commented] (SPARK-16922) Query failure due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410703#comment-15410703 ] Sital Kedia commented on SPARK-16922: - Update - The query works fine when Broadcast hash join in turned off, so the issue might be in broadcast hash join. I put some debug print in UnsafeRowWriter class (https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java#L214) and I found that it is receiving a row of size around 800MB and OOMing while trying to grow the buffer holder. It might suggest that there is some data corruption going on probably in the Broadcast hash join. cc- [~davies] - Any pointer on how to debug this issue further? > Query failure due to executor OOM in Spark 2.0 > -- > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. 
> Stack trace - > {code} > at > org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Query plan in Spark 1.6 > {code} > == Physical Plan == > TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) > +- TungstenExchange hashpartitioning(field1#101,200), None >+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) > +- Project [field1#101,field2#74] > +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as > 
decimal(20,0)) as bigint)], BuildRight > :- ConvertToUnsafe > : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation > foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] > +- ConvertToUnsafe >+- HiveTableScan [field1#101,field4#97], MetastoreRelation > foo, table2, Some(b) > {code} > Query plan in 2.0 > {code} > == Physical Plan == > *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))]) > +- Exchange hashpartitioning(field1#160, 200) >+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / > 100.0))]) > +- *Project [field2#133, field1#160] > +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as > decimal(20,0)) as bigint)], Inner, BuildRight > :- *Filter isnotnull(field5#122L) > : +- HiveTableScan [field5#122L, field2#133], MetastoreRelation > foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= > 2013-12-31)] > +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as > decimal(20,0)) as bigint))) >+- *Filter isnotnull(field4#156) > +- HiveTableScan [field4#156, field1#160], > MetastoreRelation foo, table2, b > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
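A workaround sketch following the observation above (my assumption, not a confirmed fix — it only sidesteps the suspect code path): on Spark 2.0 you can force the planner to fall back from broadcast hash join to sort-merge join by disabling auto-broadcast.

{code}
// spark-shell, Spark 2.0: disable automatic broadcasting so the join above
// is planned as a sort-merge join instead of BroadcastHashJoin
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
{code}

This matches the reporter's note that the query succeeds with broadcast hash join turned off; the underlying corruption still needs a root cause.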
[jira] [Commented] (SPARK-16508) Fix documentation warnings found by R CMD check
[ https://issues.apache.org/jira/browse/SPARK-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410681#comment-15410681 ] Shivaram Venkataraman commented on SPARK-16508: --- Yeah we should deal with those. This JIRA was opened to track PRs for that as I mentioned in the comment list above. > Fix documentation warnings found by R CMD check > --- > > Key: SPARK-16508 > URL: https://issues.apache.org/jira/browse/SPARK-16508 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > A full list of warnings after the fixes in SPARK-16507 is at > https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410651#comment-15410651 ] Nattavut Sutyanyong commented on SPARK-16804: - The PR also extends the fix to block the {{TABLESAMPLE}} operation in any correlated subquery. > Correlated subqueries containing non-deterministic operators return incorrect > results > - > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to join > predicates and neglects the semantics of the LIMIT. > Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1.
[jira] [Comment Edited] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400150#comment-15400150 ] Nattavut Sutyanyong edited comment on SPARK-16804 at 8/6/16 4:21 PM: - To demonstrate that this fix does not unnecessarily block the "good" cases (where LIMIT is present but NOT on the correlated path), here is an example, which produces the same result set both with and without this proposed fix. {noformat} scala> sql("select c1 from t1 where exists (select 1 from (select 1 from t2 limit 1) where t1.c1=t2.c2)").show +---+ | c1| +---+ | 1| +---+ {noformat} was (Author: nsyca): To demonstrate that this fix does not unnecessarily block the "good" cases (where LIMIT is present but NOT on the correlated path), here is an example, which produce the same result set in both with and without this proposed fix. {{scala> sql("select c1 from t1 where exists (select 1 from (select 1 from t2 limit 1) where t1.c1=t2.c2)").show }} {{+---+}} {{| c1|}} {{+---+}} {{| 1|}} {{+---+}} > Correlated subqueries containing non-deterministic operators return incorrect > results > - > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1.
[jira] [Comment Edited] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400090#comment-15400090 ] Nattavut Sutyanyong edited comment on SPARK-16804 at 8/6/16 4:20 PM: - {noformat} scala> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 1)").explain(true) == Parsed Logical Plan == 'Project ['c1] +- 'Filter exists#21 : +- 'SubqueryAlias exists#21 : +- 'GlobalLimit 1 :+- 'LocalLimit 1 : +- 'Project [unresolvedalias(1, None)] : +- 'Filter ('t1.c1 = 't2.c2) : +- 'UnresolvedRelation `t2` +- 'UnresolvedRelation `t1` == Analyzed Logical Plan == c1: int Project [c1#17] +- Filter predicate-subquery#21 [(c1#17 = c2#10)] : +- SubqueryAlias predicate-subquery#21 [(c1#17 = c2#10)] <== This correlated predicate is incorrectly moved above the LIMIT : +- GlobalLimit 1 :+- LocalLimit 1 : +- Project [1 AS 1#26, c2#10] : +- SubqueryAlias t2 : +- Project [value#8 AS c2#10] :+- LocalRelation [value#8] +- SubqueryAlias t1 +- Project [value#15 AS c1#17] +- LocalRelation [value#15] {noformat} Rewriting the correlated predicate in the subquery during the Analysis phase, moving it from below the LIMIT 1 operation to above it, causes the scan of the subquery table to return only 1 row. The correct semantics are that the LIMIT 1 must be applied to the subquery for each input value from the parent table.
was (Author: nsyca): {{noformat}} scala> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 1)").explain(true) == Parsed Logical Plan == 'Project ['c1] +- 'Filter exists#21 : +- 'SubqueryAlias exists#21 : +- 'GlobalLimit 1 :+- 'LocalLimit 1 : +- 'Project [unresolvedalias(1, None)] : +- 'Filter ('t1.c1 = 't2.c2) : +- 'UnresolvedRelation `t2` +- 'UnresolvedRelation `t1` == Analyzed Logical Plan == c1: int Project [c1#17] +- Filter predicate-subquery#21 [(c1#17 = c2#10)] : +- SubqueryAlias predicate-subquery#21 [(c1#17 = c2#10)] <== This correlated predicate is incorrectly moved above the LIMIT : +- GlobalLimit 1 :+- LocalLimit 1 : +- Project [1 AS 1#26, c2#10] : +- SubqueryAlias t2 : +- Project [value#8 AS c2#10] :+- LocalRelation [value#8] +- SubqueryAlias t1 +- Project [value#15 AS c1#17] +- LocalRelation [value#15] {{noformat}} By rewriting the correlated predicate in the subquery in Analysis phase from below the LIMIT 1 operation to above it causing the scan of the subquery table to return only 1 row. The correct semantic is the LIMIT 1 must be applied on the subquery for each input value from the parent table. > Correlated subqueries containing non-deterministic operators return incorrect > results > - > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. 
> Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400090#comment-15400090 ] Nattavut Sutyanyong edited comment on SPARK-16804 at 8/6/16 4:20 PM: - {{noformat}} scala> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 1)").explain(true) == Parsed Logical Plan == 'Project ['c1] +- 'Filter exists#21 : +- 'SubqueryAlias exists#21 : +- 'GlobalLimit 1 :+- 'LocalLimit 1 : +- 'Project [unresolvedalias(1, None)] : +- 'Filter ('t1.c1 = 't2.c2) : +- 'UnresolvedRelation `t2` +- 'UnresolvedRelation `t1` == Analyzed Logical Plan == c1: int Project [c1#17] +- Filter predicate-subquery#21 [(c1#17 = c2#10)] : +- SubqueryAlias predicate-subquery#21 [(c1#17 = c2#10)] <== This correlated predicate is incorrectly moved above the LIMIT : +- GlobalLimit 1 :+- LocalLimit 1 : +- Project [1 AS 1#26, c2#10] : +- SubqueryAlias t2 : +- Project [value#8 AS c2#10] :+- LocalRelation [value#8] +- SubqueryAlias t1 +- Project [value#15 AS c1#17] +- LocalRelation [value#15] {{noformat}} By rewriting the correlated predicate in the subquery in Analysis phase from below the LIMIT 1 operation to above it causing the scan of the subquery table to return only 1 row. The correct semantic is the LIMIT 1 must be applied on the subquery for each input value from the parent table. 
was (Author: nsyca): scala> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 1)").explain(true) == Parsed Logical Plan == 'Project ['c1] +- 'Filter exists#21 : +- 'SubqueryAlias exists#21 : +- 'GlobalLimit 1 :+- 'LocalLimit 1 : +- 'Project [unresolvedalias(1, None)] : +- 'Filter ('t1.c1 = 't2.c2) : +- 'UnresolvedRelation `t2` +- 'UnresolvedRelation `t1` == Analyzed Logical Plan == c1: int Project [c1#17] +- Filter predicate-subquery#21 [(c1#17 = c2#10)] : +- SubqueryAlias predicate-subquery#21 [(c1#17 = c2#10)] <== This correlated predicate is incorrectly moved above the LIMIT : +- GlobalLimit 1 :+- LocalLimit 1 : +- Project [1 AS 1#26, c2#10] : +- SubqueryAlias t2 : +- Project [value#8 AS c2#10] :+- LocalRelation [value#8] +- SubqueryAlias t1 +- Project [value#15 AS c1#17] +- LocalRelation [value#15] By rewriting the correlated predicate in the subquery in Analysis phase from below the LIMIT 1 operation to above it causing the scan of the subquery table to return only 1 row. The correct semantic is the LIMIT 1 must be applied on the subquery for each input value from the parent table. > Correlated subqueries containing non-deterministic operators return incorrect > results > - > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1. 
[jira] [Comment Edited] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400150#comment-15400150 ] Nattavut Sutyanyong edited comment on SPARK-16804 at 8/6/16 4:18 PM: - To demonstrate that this fix does not unnecessarily block the "good" cases (where LIMIT is present but NOT on the correlated path), here is an example, which produce the same result set in both with and without this proposed fix. {{scala> sql("select c1 from t1 where exists (select 1 from (select 1 from t2 limit 1) where t1.c1=t2.c2)").show }} {{+---+}} {{| c1|}} {{+---+}} {{| 1|}} {{+---+}} was (Author: nsyca): To demonstrate that this fix does not unnecessarily block the "good" cases (where LIMIT is present but NOT on the correlated path), here is an example, which produce the same result set in both with and without this proposed fix. scala> sql("select c1 from t1 where exists (select 1 from (select 1 from t2 limit 1) where t1.c1=t2.c2)").show +---+ | c1| +---+ | 1| +---+ > Correlated subqueries containing non-deterministic operators return incorrect > results > - > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1. 
[jira] [Comment Edited] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400150#comment-15400150 ] Nattavut Sutyanyong edited comment on SPARK-16804 at 8/6/16 4:14 PM: - To demonstrate that this fix does not unnecessarily block the "good" cases (where LIMIT is present but NOT on the correlated path), here is an example, which produce the same result set in both with and without this proposed fix. scala> sql("select c1 from t1 where exists (select 1 from (select 1 from t2 limit 1) where t1.c1=t2.c2)").show +---+ | c1| +---+ | 1| +---+ was (Author: nsyca): To demonstrate that this fix does not unnecessarily block the "good" cases (where LIMIT is present but NOT on the correlated path), here is an example, which produce the same result set in both with and without this proposed fix. scala> sql("select c1 from t1 where exists (select 1 from (select 1 from t2 limit 1) where t1.c1=t2.c2)").show +---+ | c1| +---+ | 1| +---+ > Correlated subqueries containing non-deterministic operators return incorrect > results > - > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1. 
[jira] [Commented] (SPARK-16933) AFTAggregator in AFTSurvivalRegression serializes unnecessary data
[ https://issues.apache.org/jira/browse/SPARK-16933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410642#comment-15410642 ] Sean Owen commented on SPARK-16933: --- OK, I also see https://issues.apache.org/jira/browse/SPARK-16934 opened for another instance. Let's not keep opening JIRAs for the same issue. One more to fix the rest? CC [~WeichenXu123] > AFTAggregator in AFTSurvivalRegression serializes unnecessary data > -- > > Key: SPARK-16933 > URL: https://issues.apache.org/jira/browse/SPARK-16933 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang > > This is basically the same issue as SPARK-16008, but for aft survival > regression, where {{parameters}} and {{featuresStd}} are unnecessarily > serialized between stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16917) Spark streaming kafka version compatibility.
[ https://issues.apache.org/jira/browse/SPARK-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16917. --- Resolution: Duplicate > Spark streaming kafka version compatibility. > - > > Key: SPARK-16917 > URL: https://issues.apache.org/jira/browse/SPARK-16917 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.0 >Reporter: Sudev >Priority: Trivial > Labels: documentation > > It would be nice to have Kafka version compatibility information in the > official documentation. > It's very confusing now. > * If you look at this JIRA[1], it seems like Kafka is supported in Spark > 2.0.0. > * The documentation lists artifact for (Kafka 0.8) > spark-streaming-kafka-0-8_2.11 > Is Kafka 0.9 supported by Spark 2.0.0? > Since I'm confused here even after an hour's effort googling on the same, I > think someone should help add the compatibility matrix. > [1] https://issues.apache.org/jira/browse/SPARK-12177
[jira] [Assigned] (SPARK-16934) Improve LogisticCostFun to avoid redundant serialization
[ https://issues.apache.org/jira/browse/SPARK-16934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16934: Assignee: Apache Spark > Improve LogisticCostFun to avoid redundant serialization > > > Key: SPARK-16934 > URL: https://issues.apache.org/jira/browse/SPARK-16934 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weichen Xu >Assignee: Apache Spark > Original Estimate: 24h > Remaining Estimate: 24h > > When LogisticCostFun calculates the cost, it serializes the closure var > `featureStd` on every call; we can improve this by using a broadcast variable.
[jira] [Assigned] (SPARK-16934) Improve LogisticCostFun to avoid redundant serialization
[ https://issues.apache.org/jira/browse/SPARK-16934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16934: Assignee: (was: Apache Spark) > Improve LogisticCostFun to avoid redundant serialization > > > Key: SPARK-16934 > URL: https://issues.apache.org/jira/browse/SPARK-16934 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > When LogisticCostFun calculates the cost, it serializes the closure var > `featureStd` on every call; we can improve this by using a broadcast variable.
[jira] [Commented] (SPARK-16934) Improve LogisticCostFun to avoid redundant serialization
[ https://issues.apache.org/jira/browse/SPARK-16934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410640#comment-15410640 ] Apache Spark commented on SPARK-16934: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/14520 > Improve LogisticCostFun to avoid redundant serialization > > > Key: SPARK-16934 > URL: https://issues.apache.org/jira/browse/SPARK-16934 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > When LogisticCostFun calculates the cost, it serializes the closure var > `featureStd` on every call; we can improve this by using a broadcast variable.
[jira] [Created] (SPARK-16934) Improve LogisticCostFun to avoid redundant serialization
Weichen Xu created SPARK-16934: -- Summary: Improve LogisticCostFun to avoid redundant serialization Key: SPARK-16934 URL: https://issues.apache.org/jira/browse/SPARK-16934 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Weichen Xu When LogisticCostFun calculates the cost, it serializes the closure var `featureStd` on every call; we can improve this by using a broadcast variable.
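The change being proposed follows the usual Spark pattern: ship the large array to each executor once as a broadcast variable, instead of letting every serialized task closure carry its own copy. A rough sketch — the aggregation shape and the {{localCost}} helper are illustrative assumptions, not the actual LogisticCostFun internals:

{code}
// Before: `featuresStd` is captured in the task closure and re-serialized
// on every call to calculate().
// After: broadcast it once and dereference it inside the task.
val bcFeaturesStd = instances.sparkContext.broadcast(featuresStd)
val costSum = instances.treeAggregate(0.0)(
  seqOp = (sum, instance) => sum + localCost(instance, bcFeaturesStd.value),
  combOp = _ + _)
{code}

The broadcast value is deserialized once per executor and shared by all tasks on it, which is the saving the JIRA is after.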
[jira] [Commented] (SPARK-16917) Spark streaming kafka version compatibility.
[ https://issues.apache.org/jira/browse/SPARK-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410623#comment-15410623 ] Cody Koeninger commented on SPARK-16917: I think the doc changes I submitted make it pretty clear that spark-streaming-kafka-0-8 works with brokers 0.8 or higher, and 0-10 works with brokers 0.10 or higher > Spark streaming kafka version compatibility. > - > > Key: SPARK-16917 > URL: https://issues.apache.org/jira/browse/SPARK-16917 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.0 >Reporter: Sudev >Priority: Trivial > Labels: documentation > > It would be nice to have Kafka version compatibility information in the > official documentation. > It's very confusing now. > * If you look at this JIRA[1], it seems like Kafka is supported in Spark > 2.0.0. > * The documentation lists artifact for (Kafka 0.8) > spark-streaming-kafka-0-8_2.11 > Is Kafka 0.9 supported by Spark 2.0.0? > Since I'm confused here even after an hour's effort googling on the same, I > think someone should help add the compatibility matrix. > [1] https://issues.apache.org/jira/browse/SPARK-12177
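For readers hitting the same confusion: the rule Cody describes means you match the artifact to your broker *line*, not an exact broker version. In sbt that comes down to coordinates like these (Spark 2.0.0, Scala 2.11, as published):

{code}
// works with Kafka brokers 0.8 or higher (so 0.9 brokers are covered)
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"
// works with Kafka brokers 0.10 or higher
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"
{code}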
[jira] [Assigned] (SPARK-16933) AFTAggregator in AFTSurvivalRegression serializes unnecessary data
[ https://issues.apache.org/jira/browse/SPARK-16933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16933: Assignee: Apache Spark > AFTAggregator in AFTSurvivalRegression serializes unnecessary data > -- > > Key: SPARK-16933 > URL: https://issues.apache.org/jira/browse/SPARK-16933 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Apache Spark > > This is basically the same issue as SPARK-16008, but for aft survival > regression, where {{parameters}} and {{featuresStd}} are unnecessarily > serialized between stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16933) AFTAggregator in AFTSurvivalRegression serializes unnecessary data
[ https://issues.apache.org/jira/browse/SPARK-16933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410616#comment-15410616 ] Apache Spark commented on SPARK-16933: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/14519 > AFTAggregator in AFTSurvivalRegression serializes unnecessary data > -- > > Key: SPARK-16933 > URL: https://issues.apache.org/jira/browse/SPARK-16933 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang > > This is basically the same issue as SPARK-16008, but for aft survival > regression, where {{parameters}} and {{featuresStd}} are unnecessarily > serialized between stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16933) AFTAggregator in AFTSurvivalRegression serializes unnecessary data
[ https://issues.apache.org/jira/browse/SPARK-16933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16933: Assignee: (was: Apache Spark) > AFTAggregator in AFTSurvivalRegression serializes unnecessary data > -- > > Key: SPARK-16933 > URL: https://issues.apache.org/jira/browse/SPARK-16933 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang > > This is basically the same issue as SPARK-16008, but for aft survival > regression, where {{parameters}} and {{featuresStd}} are unnecessarily > serialized between stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16933) AFTAggregator in AFTSurvivalRegression serializes unnecessary data
Yanbo Liang created SPARK-16933: --- Summary: AFTAggregator in AFTSurvivalRegression serializes unnecessary data Key: SPARK-16933 URL: https://issues.apache.org/jira/browse/SPARK-16933 Project: Spark Issue Type: Improvement Components: ML Reporter: Yanbo Liang This is basically the same issue as SPARK-16008, but for aft survival regression, where {{parameters}} and {{featuresStd}} are unnecessarily serialized between stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16864) Comprehensive version info
[ https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410596#comment-15410596 ] Jan Gorecki edited comment on SPARK-16864 at 8/6/16 11:50 AM: -- Record the exact spark source code reference while processing an ETL workflow, so performance implications can be measured precisely by referencing a point in time in the source code. I doubt that version number or date/time is a natural key for spark source code, is it? If you don't have a natural key you can't build a reliable workflow. How would you automatically git clone, reset, build, deploy and re-run your workflow - based on data collected by spark - if you don't even have the git commit there? Looking up the git commit hash by version and date... sure it works, but why can't users just access that info directly? I don't see ANY reason to not have that feature. If you have any I would be glad to read them. And no, even for developers that info is not available at runtime. was (Author: jangorecki): Record exact spark source code reference while processing ETL workflow so performance implication can be measures precisely referencing point in time of source code. I doubt if version number or date/time is a natural key for spark source code, is it? If you don't have a natural key you can't build reliable workflow. How would you automatically git clone, reset, build, deploy and re-run your workflow - based on data collected by spark - if you don't even have git commit there? Lookup git commit hash by version and date... sure it works, but why users can't just access that info directly? I don't see ANY reason to not have that feature? If you have any I would be glad to read. And no, even for developers that info is not available on runtime.
> Comprehensive version info > --- > > Key: SPARK-16864 > URL: https://issues.apache.org/jira/browse/SPARK-16864 > Project: Spark > Issue Type: Improvement >Reporter: jay vyas > > Spark versions can be grepped out of the Spark banner that comes up on > startup, but otherwise, there is no programmatic/reliable way to get version > information. > Also there is no git commit id, etc. So precise version checking isnt > possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16864) Comprehensive version info
[ https://issues.apache.org/jira/browse/SPARK-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410596#comment-15410596 ] Jan Gorecki commented on SPARK-16864: - Record the exact spark source code reference while processing an ETL workflow, so performance implications can be measured precisely by referencing a point in time in the source code. I doubt that version number or date/time is a natural key for spark source code, is it? If you don't have a natural key you can't build a reliable workflow. How would you automatically git clone, reset, build, deploy and re-run your workflow - based on data collected by spark - if you don't even have the git commit there? Looking up the git commit hash by version and date... sure it works, but why can't users just access that info directly? I don't see ANY reason to not have that feature. If you have any I would be glad to read them. And no, even for developers that info is not available at runtime. > Comprehensive version info > --- > > Key: SPARK-16864 > URL: https://issues.apache.org/jira/browse/SPARK-16864 > Project: Spark > Issue Type: Improvement >Reporter: jay vyas > > Spark versions can be grepped out of the Spark banner that comes up on > startup, but otherwise, there is no programmatic/reliable way to get version > information. > Also there is no git commit id, etc. So precise version checking isn't > possible.
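For what it is worth, the release version string (though not the git commit the commenter is asking for) is already reachable programmatically; a quick spark-shell check on a 2.0.0 build:

{noformat}
scala> spark.version
res0: String = 2.0.0

scala> org.apache.spark.SPARK_VERSION
res1: String = 2.0.0
{noformat}

Exposing the git revision alongside this would require baking it in at build time, e.g. into a generated properties file on the classpath — which is essentially what this JIRA is requesting.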
[jira] [Commented] (SPARK-16326) Evaluate sparklyr package from RStudio
[ https://issues.apache.org/jira/browse/SPARK-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410526#comment-15410526 ] Shivaram Venkataraman commented on SPARK-16326: --- Yeah, I don't think there is anything actionable here per se. We can continue this discussion on the dev mailing list and open new issues when required? > Evaluate sparklyr package from RStudio > -- > > Key: SPARK-16326 > URL: https://issues.apache.org/jira/browse/SPARK-16326 > Project: Spark > Issue Type: Brainstorming > Components: SparkR >Reporter: Sun Rui > > RStudio has developed sparklyr (https://github.com/rstudio/sparklyr) > connecting the R community to Spark. A rough review shows that sparklyr provides > a dplyr backend and new APIs for MLlib and for calling Spark from R. Of > course, sparklyr internally uses the low-level mechanism in SparkR. > We can discuss how to position SparkR relative to sparklyr.