[jira] [Commented] (SPARK-12059) Standalone Master assertion error
[ https://issues.apache.org/jira/browse/SPARK-12059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033268#comment-15033268 ] Saisai Shao commented on SPARK-12059: - Hi [~andrewor14], when would this happen? I suppose a state transition from {{RUNNING}} to {{RUNNING}} should not normally happen. > Standalone Master assertion error > - > > Key: SPARK-12059 > URL: https://issues.apache.org/jira/browse/SPARK-12059 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Saisai Shao >Priority: Critical > > {code} > 15/11/30 09:55:04 ERROR Inbox: Ignoring error > java.lang.AssertionError: assertion failed: executor 4 state transfer from > RUNNING to RUNNING is illegal > at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
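One possible direction for the duplicate transition, sketched below, is to treat a same-state update (for example a re-delivered {{ExecutorStateChanged}} message) as a logged no-op rather than an assertion failure. This is an illustration only, with invented names (applyStateChange, oldState, newState); it is not Spark's actual Master code nor the agreed fix.

{code}
// Hypothetical sketch -- not Spark's Master implementation.
def applyStateChange(executorId: Int, oldState: String, newState: String): Unit = {
  if (oldState == newState) {
    // A repeated RUNNING -> RUNNING update is logged and ignored instead of
    // failing an assertion inside the RPC message loop.
    println(s"Ignoring duplicate state update for executor $executorId: $oldState -> $newState")
  } else {
    println(s"Executor $executorId transitions $oldState -> $newState")
  }
}

applyStateChange(4, "RUNNING", "RUNNING")   // no-op with a warning
applyStateChange(4, "RUNNING", "EXITED")    // normal transition
{code}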
[jira] [Commented] (SPARK-12046) Visibility and format issues in ScalaDoc/JavaDoc for branch-1.6
[ https://issues.apache.org/jira/browse/SPARK-12046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033260#comment-15033260 ] Apache Spark commented on SPARK-12046: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/10063 > Visibility and format issues in ScalaDoc/JavaDoc for branch-1.6 > --- > > Key: SPARK-12046 > URL: https://issues.apache.org/jira/browse/SPARK-12046 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11605) ML 1.6 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-11605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033250#comment-15033250 ] Joseph K. Bradley commented on SPARK-11605: --- Those are private APIs. > ML 1.6 QA: API: Java compatibility, docs > > > Key: SPARK-11605 > URL: https://issues.apache.org/jira/browse/SPARK-11605 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: yuhao yang > > Check Java compatibility for MLlib for this release. > Checking compatibility means: > * comparing with the Scala doc > * verifying that Java docs are not messed up by Scala type incompatibilities. > Some items to look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. > * If needed for complex issues, create small Java unit tests which execute > each method. (The correctness can be checked in Scala.) > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here. > Note that we should not break APIs from previous releases. So if you find a > problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11605) ML 1.6 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-11605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033242#comment-15033242 ] Joseph K. Bradley commented on SPARK-11605: --- There is already a Java-friendly version. > ML 1.6 QA: API: Java compatibility, docs > > > Key: SPARK-11605 > URL: https://issues.apache.org/jira/browse/SPARK-11605 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: yuhao yang > > Check Java compatibility for MLlib for this release. > Checking compatibility means: > * comparing with the Scala doc > * verifying that Java docs are not messed up by Scala type incompatibilities. > Some items to look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. > * If needed for complex issues, create small Java unit tests which execute > each method. (The correctness can be checked in Scala.) > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here. > Note that we should not break APIs from previous releases. So if you find a > problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11605) ML 1.6 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-11605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033241#comment-15033241 ] Joseph K. Bradley commented on SPARK-11605: --- There are already Java-friendly versions. > ML 1.6 QA: API: Java compatibility, docs > > > Key: SPARK-11605 > URL: https://issues.apache.org/jira/browse/SPARK-11605 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: yuhao yang > > Check Java compatibility for MLlib for this release. > Checking compatibility means: > * comparing with the Scala doc > * verifying that Java docs are not messed up by Scala type incompatibilities. > Some items to look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. > * If needed for complex issues, create small Java unit tests which execute > each method. (The correctness can be checked in Scala.) > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here. > Note that we should not break APIs from previous releases. So if you find a > problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11605) ML 1.6 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-11605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033239#comment-15033239 ] Joseph K. Bradley commented on SPARK-11605: --- We could add Java-friendly versions. > ML 1.6 QA: API: Java compatibility, docs > > > Key: SPARK-11605 > URL: https://issues.apache.org/jira/browse/SPARK-11605 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: yuhao yang > > Check Java compatibility for MLlib for this release. > Checking compatibility means: > * comparing with the Scala doc > * verifying that Java docs are not messed up by Scala type incompatibilities. > Some items to look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. > * If needed for complex issues, create small Java unit tests which execute > each method. (The correctness can be checked in Scala.) > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here. > Note that we should not break APIs from previous releases. So if you find a > problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12070) PySpark implementation of Slicing operator incorrect
[ https://issues.apache.org/jira/browse/SPARK-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12070: Assignee: (was: Apache Spark) > PySpark implementation of Slicing operator incorrect > > > Key: SPARK-12070 > URL: https://issues.apache.org/jira/browse/SPARK-12070 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Jeff Zhang > > {code} > aa=('Ofer', 1), ('Wei', 2) > a = sqlContext.createDataFrame(aa) > a.select(a._1[2:]).show() > {code} > Traceback (most recent call last): > File "", line 1, in > File "/Users/jzhang/github/spark/python/pyspark/sql/column.py", line 286, > in substr > jc = self._jc.substr(startPos, length) > File > "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", > line 813, in __call__ > File "/Users/jzhang/github/spark/python/pyspark/sql/utils.py", line 45, in > deco > return f(*a, **kw) > File > "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JError: An error occurred while calling o37.substr. Trace: > py4j.Py4JException: Method substr([class java.lang.Integer, class > java.lang.Long]) does not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344) > at py4j.Gateway.invoke(Gateway.java:252) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:209) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12070) PySpark implementation of Slicing operator incorrect
[ https://issues.apache.org/jira/browse/SPARK-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12070: Assignee: Apache Spark > PySpark implementation of Slicing operator incorrect > > > Key: SPARK-12070 > URL: https://issues.apache.org/jira/browse/SPARK-12070 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Jeff Zhang >Assignee: Apache Spark > > {code} > aa=('Ofer', 1), ('Wei', 2) > a = sqlContext.createDataFrame(aa) > a.select(a._1[2:]).show() > {code} > Traceback (most recent call last): > File "", line 1, in > File "/Users/jzhang/github/spark/python/pyspark/sql/column.py", line 286, > in substr > jc = self._jc.substr(startPos, length) > File > "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", > line 813, in __call__ > File "/Users/jzhang/github/spark/python/pyspark/sql/utils.py", line 45, in > deco > return f(*a, **kw) > File > "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JError: An error occurred while calling o37.substr. Trace: > py4j.Py4JException: Method substr([class java.lang.Integer, class > java.lang.Long]) does not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344) > at py4j.Gateway.invoke(Gateway.java:252) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:209) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12070) PySpark implementation of Slicing operator incorrect
[ https://issues.apache.org/jira/browse/SPARK-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033233#comment-15033233 ] Apache Spark commented on SPARK-12070: -- User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/10062 > PySpark implementation of Slicing operator incorrect > > > Key: SPARK-12070 > URL: https://issues.apache.org/jira/browse/SPARK-12070 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Jeff Zhang > > {code} > aa=('Ofer', 1), ('Wei', 2) > a = sqlContext.createDataFrame(aa) > a.select(a._1[2:]).show() > {code} > Traceback (most recent call last): > File "", line 1, in > File "/Users/jzhang/github/spark/python/pyspark/sql/column.py", line 286, > in substr > jc = self._jc.substr(startPos, length) > File > "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", > line 813, in __call__ > File "/Users/jzhang/github/spark/python/pyspark/sql/utils.py", line 45, in > deco > return f(*a, **kw) > File > "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JError: An error occurred while calling o37.substr. Trace: > py4j.Py4JException: Method substr([class java.lang.Integer, class > java.lang.Long]) does not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344) > at py4j.Gateway.invoke(Gateway.java:252) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:209) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11605) ML 1.6 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-11605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033230#comment-15033230 ] Joseph K. Bradley commented on SPARK-11605: --- We don't need to worry about Attribute; it's an old API and is a DeveloperApi we expect to change. If you see Option issues with other public APIs though, they would be good to investigate. > ML 1.6 QA: API: Java compatibility, docs > > > Key: SPARK-11605 > URL: https://issues.apache.org/jira/browse/SPARK-11605 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: yuhao yang > > Check Java compatibility for MLlib for this release. > Checking compatibility means: > * comparing with the Scala doc > * verifying that Java docs are not messed up by Scala type incompatibilities. > Some items to look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. > * If needed for complex issues, create small Java unit tests which execute > each method. (The correctness can be checked in Scala.) > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here. > Note that we should not break APIs from previous releases. So if you find a > problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11605) ML 1.6 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-11605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033228#comment-15033228 ] Joseph K. Bradley commented on SPARK-11605: --- This is a problem we should fix for this release since it's a new API. > ML 1.6 QA: API: Java compatibility, docs > > > Key: SPARK-11605 > URL: https://issues.apache.org/jira/browse/SPARK-11605 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: yuhao yang > > Check Java compatibility for MLlib for this release. > Checking compatibility means: > * comparing with the Scala doc > * verifying that Java docs are not messed up by Scala type incompatibilities. > Some items to look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. > * If needed for complex issues, create small Java unit tests which execute > each method. (The correctness can be checked in Scala.) > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here. > Note that we should not break APIs from previous releases. So if you find a > problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10798) JsonMappingException with Spark Context Parallelize
[ https://issues.apache.org/jira/browse/SPARK-10798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033229#comment-15033229 ] Miao Wang commented on SPARK-10798: --- These two lines: byte[] data= Kryo.serialize(List) List fromKryoRows=Kryo.unserialize(data) can't be compiled in my Java application. I googled the usage of Kryo serializer and there is no matched usage as shown in the above two lines. Miao > JsonMappingException with Spark Context Parallelize > --- > > Key: SPARK-10798 > URL: https://issues.apache.org/jira/browse/SPARK-10798 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 > Environment: Linux, Java 1.8.45 >Reporter: Dev Lakhani > > When trying to create an RDD of Rows using a Java Spark Context and if I > serialize the rows with Kryo first, the sparkContext fails. > byte[] data= Kryo.serialize(List) > List fromKryoRows=Kryo.unserialize(data) > List rows= new Vector(); //using a new set of data. > rows.add(RowFactory.create("test")); > javaSparkContext.parallelize(rows); > OR > javaSparkContext.parallelize(fromKryoRows); //using deserialized rows > I get : > com.fasterxml.jackson.databind.JsonMappingException: (None,None) (of class > scala.Tuple2) (through reference chain: > org.apache.spark.rdd.RDDOperationScope["parent"]) >at > com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:210) >at > com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:177) >at > com.fasterxml.jackson.databind.ser.std.StdSerializer.wrapAndThrow(StdSerializer.java:187) >at > com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:647) >at > com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:152) >at > com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128) >at > com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881) >at > com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338) >at > org.apache.spark.rdd.RDDOperationScope.toJson(RDDOperationScope.scala:50) >at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:141) >at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) >at > org.apache.spark.SparkContext.withScope(SparkContext.scala:700) >at > org.apache.spark.SparkContext.parallelize(SparkContext.scala:714) >at > org.apache.spark.api.java.JavaSparkContext.parallelize(JavaSparkContext.scala:145) >at > org.apache.spark.api.java.JavaSparkContext.parallelize(JavaSparkContext.scala:157) >... 
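For reference, Kryo itself has no static {{serialize}}/{{unserialize}} methods, which is why those two lines do not compile. Below is a minimal, untested Scala sketch of one way to round-trip a list of Rows through Spark's Kryo-backed serializer; it only illustrates the API shape ({{KryoSerializer.newInstance().serialize/deserialize}}) and is not the original reporter's code.

{code}
import java.nio.ByteBuffer

import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.{Row, RowFactory}

// Build a serializer instance backed by Kryo.
val serializerInstance = new KryoSerializer(new SparkConf()).newInstance()

val rows: java.util.List[Row] = java.util.Arrays.asList(RowFactory.create("test"))

// Serialize the list to bytes and read it back.
val data: ByteBuffer = serializerInstance.serialize(rows)
val fromKryoRows = serializerInstance.deserialize[java.util.List[Row]](data)
{code}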
> Caused by: scala.MatchError: (None,None) (of class scala.Tuple2) >at > com.fasterxml.jackson.module.scala.ser.OptionSerializer$$anonfun$serialize$1.apply$mcV$sp(OptionSerializerModule.scala:32) >at > com.fasterxml.jackson.module.scala.ser.OptionSerializer$$anonfun$serialize$1.apply(OptionSerializerModule.scala:32) >at > com.fasterxml.jackson.module.scala.ser.OptionSerializer$$anonfun$serialize$1.apply(OptionSerializerModule.scala:32) >at scala.Option.getOrElse(Option.scala:120) >at > com.fasterxml.jackson.module.scala.ser.OptionSerializer.serialize(OptionSerializerModule.scala:31) >at > com.fasterxml.jackson.module.scala.ser.OptionSerializer.serialize(OptionSerializerModule.scala:22) >at > com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:505) >at > com.fasterxml.jackson.module.scala.ser.OptionPropertyWriter.serializeAsField(OptionSerializerModule.scala:128) >at > com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:639) >... 19 more > I've tried updating jackson module scala to 2.6.1 but same issue. This > happens in local mode with java 1.8_45. I searched the web and this Jira for > similar issues but found nothing of interest. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12071) Programming guide should explain NULL in JVM translate to NA in R
[ https://issues.apache.org/jira/browse/SPARK-12071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033227#comment-15033227 ] Felix Cheung edited comment on SPARK-12071 at 12/1/15 7:07 AM: --- See commit https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c was (Author: felixcheung): See PR https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c > Programming guide should explain NULL in JVM translate to NA in R > - > > Key: SPARK-12071 > URL: https://issues.apache.org/jira/browse/SPARK-12071 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Felix Cheung >Priority: Minor > > This behavior seems to be new for Spark 1.6.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12071) Programming guide should explain NULL in JVM translate to NA in R
[ https://issues.apache.org/jira/browse/SPARK-12071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033227#comment-15033227 ] Felix Cheung commented on SPARK-12071: -- See PR https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c > Programming guide should explain NULL in JVM translate to NA in R > - > > Key: SPARK-12071 > URL: https://issues.apache.org/jira/browse/SPARK-12071 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Felix Cheung >Priority: Minor > > This behavior seems to be new for Spark 1.6.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12071) Programming guide should explain NULL in JVM translate to NA in R
Felix Cheung created SPARK-12071: Summary: Programming guide should explain NULL in JVM translate to NA in R Key: SPARK-12071 URL: https://issues.apache.org/jira/browse/SPARK-12071 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.6.0 Reporter: Felix Cheung Priority: Minor This behavior seems to be new for Spark 1.6.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12070) PySpark implementation of Slicing operator incorrect
[ https://issues.apache.org/jira/browse/SPARK-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033218#comment-15033218 ] Jeff Zhang commented on SPARK-12070: The root cause is that when using syntax like this str[1:] for slice, the length will be set as the max int of python which is long for java. Because the range of python int is larger than that of java int. > PySpark implementation of Slicing operator incorrect > > > Key: SPARK-12070 > URL: https://issues.apache.org/jira/browse/SPARK-12070 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Jeff Zhang > > {code} > aa=('Ofer', 1), ('Wei', 2) > a = sqlContext.createDataFrame(aa) > a.select(a._1[2:]).show() > {code} > Traceback (most recent call last): > File "", line 1, in > File "/Users/jzhang/github/spark/python/pyspark/sql/column.py", line 286, > in substr > jc = self._jc.substr(startPos, length) > File > "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", > line 813, in __call__ > File "/Users/jzhang/github/spark/python/pyspark/sql/utils.py", line 45, in > deco > return f(*a, **kw) > File > "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JError: An error occurred while calling o37.substr. Trace: > py4j.Py4JException: Method substr([class java.lang.Integer, class > java.lang.Long]) does not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344) > at py4j.Gateway.invoke(Gateway.java:252) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:209) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
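To make the mechanism concrete, the Scala sketch below shows the JVM side of the failure: {{Column.substr}} only has {{(Int, Int)}} and {{(Column, Column)}} overloads, so the Python-side length of {{sys.maxsize}} arrives over py4j as a {{java.lang.Long}} and matches neither. The clamp shown is only an illustration of the fix direction, not the actual patch, and {{sc}} is assumed to be an existing SparkContext.

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // assumes an existing SparkContext `sc`
import sqlContext.implicits._

val df = sc.parallelize(Seq(("Ofer", 1), ("Wei", 2))).toDF("_1", "_2")

// An explicit Int length keeps the (Int, Int) overload reachable and works today:
df.select(df("_1").substr(2, 100)).show()

// Illustrative clamp of an "unbounded" Python length into the Int range:
val requestedLen: Long = Long.MaxValue                          // stand-in for sys.maxsize
val clampedLen: Int = math.min(requestedLen, Int.MaxValue.toLong).toInt
val suffixCol = df("_1").substr(2, clampedLen)                  // overload now resolves
{code}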
[jira] [Comment Edited] (SPARK-12070) PySpark implementation of Slicing operator incorrect
[ https://issues.apache.org/jira/browse/SPARK-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033218#comment-15033218 ] Jeff Zhang edited comment on SPARK-12070 at 12/1/15 6:59 AM: - The root cause is that when using syntax like this str[1:] for slice, the length will be set as the max int of python which is long for java. Because the range of python int is larger than that of java int. Will create a PR. was (Author: zjffdu): The root cause is that when using syntax like this str[1:] for slice, the length will be set as the max int of python which is long for java. Because the range of python int is larger than that of java int. > PySpark implementation of Slicing operator incorrect > > > Key: SPARK-12070 > URL: https://issues.apache.org/jira/browse/SPARK-12070 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Jeff Zhang > > {code} > aa=('Ofer', 1), ('Wei', 2) > a = sqlContext.createDataFrame(aa) > a.select(a._1[2:]).show() > {code} > Traceback (most recent call last): > File "", line 1, in > File "/Users/jzhang/github/spark/python/pyspark/sql/column.py", line 286, > in substr > jc = self._jc.substr(startPos, length) > File > "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", > line 813, in __call__ > File "/Users/jzhang/github/spark/python/pyspark/sql/utils.py", line 45, in > deco > return f(*a, **kw) > File > "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JError: An error occurred while calling o37.substr. Trace: > py4j.Py4JException: Method substr([class java.lang.Integer, class > java.lang.Long]) does not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344) > at py4j.Gateway.invoke(Gateway.java:252) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:209) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12070) PySpark implementation of Slicing operator incorrect
Jeff Zhang created SPARK-12070: -- Summary: PySpark implementation of Slicing operator incorrect Key: SPARK-12070 URL: https://issues.apache.org/jira/browse/SPARK-12070 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.5.2 Reporter: Jeff Zhang {code} aa=('Ofer', 1), ('Wei', 2) a = sqlContext.createDataFrame(aa) a.select(a._1[2:]).show() {code} Traceback (most recent call last): File "", line 1, in File "/Users/jzhang/github/spark/python/pyspark/sql/column.py", line 286, in substr jc = self._jc.substr(startPos, length) File "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__ File "/Users/jzhang/github/spark/python/pyspark/sql/utils.py", line 45, in deco return f(*a, **kw) File "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o37.substr. Trace: py4j.Py4JException: Method substr([class java.lang.Integer, class java.lang.Long]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:209) at java.lang.Thread.run(Thread.java:745) {code} {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11605) ML 1.6 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-11605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033213#comment-15033213 ] Joseph K. Bradley commented on SPARK-11605: --- private[ml] functions can be ignored. It's a shame that they are public in Java, but at least they do not show up in the Java doc. > ML 1.6 QA: API: Java compatibility, docs > > > Key: SPARK-11605 > URL: https://issues.apache.org/jira/browse/SPARK-11605 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: yuhao yang > > Check Java compatibility for MLlib for this release. > Checking compatibility means: > * comparing with the Scala doc > * verifying that Java docs are not messed up by Scala type incompatibilities. > Some items to look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. > * If needed for complex issues, create small Java unit tests which execute > each method. (The correctness can be checked in Scala.) > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here. > Note that we should not break APIs from previous releases. So if you find a > problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
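For background, the reason {{private[ml]}} members are reachable from Java is that Scala's package-qualified access is enforced by the Scala compiler only; in bytecode the member is public. A tiny hypothetical illustration (invented names):

{code}
package org.apache.spark.ml

object VisibilityExample {
  // Visible only within org.apache.spark.ml from Scala code...
  private[ml] def internalHelper(x: Int): Int = x + 1
}
// ...but `javap -p VisibilityExample$` shows `public int internalHelper(int)`,
// so Java callers can reach it even though it is omitted from the Java doc.
{code}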
[jira] [Commented] (SPARK-11206) Support SQL UI on the history server
[ https://issues.apache.org/jira/browse/SPARK-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033209#comment-15033209 ] Apache Spark commented on SPARK-11206: -- User 'carsonwang' has created a pull request for this issue: https://github.com/apache/spark/pull/10061 > Support SQL UI on the history server > > > Key: SPARK-11206 > URL: https://issues.apache.org/jira/browse/SPARK-11206 > Project: Spark > Issue Type: New Feature > Components: SQL, Web UI >Reporter: Carson Wang >Assignee: Carson Wang > > On the live web UI, there is a SQL tab which provides valuable information > for the SQL query. But once the workload is finished, we won't see the SQL > tab on the history server. It will be helpful if we support SQL UI on the > history server so we can analyze it even after its execution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join
[ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033187#comment-15033187 ] Reynold Xin commented on SPARK-12032: - [~marmbrus] do you mean the selinger algo? > Filter can't be pushed down to correct Join because of bad order of Join > > > Key: SPARK-12032 > URL: https://issues.apache.org/jira/browse/SPARK-12032 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > > For this query: > {code} > select d.d_year, count(*) cnt >FROM store_sales, date_dim d, customer c >WHERE ss_customer_sk = c.c_customer_sk AND c.c_first_shipto_date_sk = > d.d_date_sk >group by d.d_year > {code} > Current optimized plan is > {code} > == Optimized Logical Plan == > Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) > AS cnt#425L] > Project [d_year#147] > Join Inner, Some(((ss_customer_sk#283 = c_customer_sk#101) && > (c_first_shipto_date_sk#106 = d_date_sk#141))) >Project [d_date_sk#141,d_year#147,ss_customer_sk#283] > Join Inner, None > Project [ss_customer_sk#283] > Relation[] ParquetRelation[store_sales] > Project [d_date_sk#141,d_year#147] > Relation[] ParquetRelation[date_dim] >Project [c_customer_sk#101,c_first_shipto_date_sk#106] > Relation[] ParquetRelation[customer] > {code} > It will join store_sales and date_dim together without any condition, the > condition c.c_first_shipto_date_sk = d.d_date_sk is not pushed to it because > the bad order of joins. > The optimizer should re-order the joins, join date_dim after customer, then > it can pushed down the condition correctly. > The plan should be > {code} > Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) > AS cnt#425L] > Project [d_year#147] > Join Inner, Some((c_first_shipto_date_sk#106 = d_date_sk#141)) >Project [c_first_shipto_date_sk#106] > Join Inner, Some((ss_customer_sk#283 = c_customer_sk#101)) > Project [ss_customer_sk#283] > Relation[store_sales] > Project [c_first_shipto_date_sk#106,c_customer_sk#101] > Relation[customer] >Project [d_year#147,d_date_sk#141] > Relation[date_dim] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12031) Integer overflow when do sampling.
[ https://issues.apache.org/jira/browse/SPARK-12031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-12031: - Priority: Critical (was: Major) > Integer overflow when do sampling. > -- > > Key: SPARK-12031 > URL: https://issues.apache.org/jira/browse/SPARK-12031 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1, 1.5.2 >Reporter: uncleGen >Priority: Critical > > In my case, some partitions contain too much items. When do range partition, > exception thrown as: > {code} > java.lang.IllegalArgumentException: n must be positive > at java.util.Random.nextInt(Random.java:300) > at > org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:58) > at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259) > at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) > at org.apache.spark.scheduler.Task.run(Task.scala:70) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
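The failure mode can be illustrated in a few lines of Scala (a sketch, not the actual {{SamplingUtils}} code): if a per-partition item counter is kept as an {{Int}}, a partition holding more than {{Int.MaxValue}} items wraps the counter negative, and {{Random.nextInt}} then rejects it with "n must be positive". Keeping the counter as a {{Long}}, or capping it, avoids the wrap.

{code}
import java.util.Random

val itemsSeen: Long = Int.MaxValue.toLong + 1L   // hypothetical per-partition count
val counterAsInt: Int = itemsSeen.toInt          // overflows to -2147483648
println(counterAsInt)

// Reproduces the reported exception: java.lang.IllegalArgumentException: n must be positive
new Random().nextInt(counterAsInt)
{code}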
[jira] [Updated] (SPARK-6280) Remove Akka systemName from Spark
[ https://issues.apache.org/jira/browse/SPARK-6280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6280: Target Version/s: (was: 1.6.0) > Remove Akka systemName from Spark > - > > Key: SPARK-6280 > URL: https://issues.apache.org/jira/browse/SPARK-6280 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Shixiong Zhu > > `systemName` is an Akka concept. An RPC implementation does not need to support > it. > We can hard code the system name in Spark and hide it in the internal Akka > RPC implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12069) Documentation update for Datasets
Michael Armbrust created SPARK-12069: Summary: Documentation update for Datasets Key: SPARK-12069 URL: https://issues.apache.org/jira/browse/SPARK-12069 Project: Spark Issue Type: Bug Components: Documentation, SQL Reporter: Michael Armbrust Assignee: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12069) Documentation update for Datasets
[ https://issues.apache.org/jira/browse/SPARK-12069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12069: Assignee: Apache Spark (was: Michael Armbrust) > Documentation update for Datasets > - > > Key: SPARK-12069 > URL: https://issues.apache.org/jira/browse/SPARK-12069 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Reporter: Michael Armbrust >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12069) Documentation update for Datasets
[ https://issues.apache.org/jira/browse/SPARK-12069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12069: Assignee: Michael Armbrust (was: Apache Spark) > Documentation update for Datasets > - > > Key: SPARK-12069 > URL: https://issues.apache.org/jira/browse/SPARK-12069 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12069) Documentation update for Datasets
[ https://issues.apache.org/jira/browse/SPARK-12069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033183#comment-15033183 ] Apache Spark commented on SPARK-12069: -- User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/10060 > Documentation update for Datasets > - > > Key: SPARK-12069 > URL: https://issues.apache.org/jira/browse/SPARK-12069 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11954) Encoder for JavaBeans / POJOs
[ https://issues.apache.org/jira/browse/SPARK-11954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11954: - Assignee: Wenchen Fan > Encoder for JavaBeans / POJOs > - > > Key: SPARK-11954 > URL: https://issues.apache.org/jira/browse/SPARK-11954 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12068) use a single column in Dataset.groupBy and count will fail
[ https://issues.apache.org/jira/browse/SPARK-12068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12068: Assignee: (was: Apache Spark) > use a single column in Dataset.groupBy and count will fail > -- > > Key: SPARK-12068 > URL: https://issues.apache.org/jira/browse/SPARK-12068 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > > {code} > val ds = Seq("a" -> 1, "b" -> 1, "a" -> 2).toDS() > val count = ds.groupBy($"_1").count() > count.collect() // will fail > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12068) use a single column in Dataset.groupBy and count will fail
[ https://issues.apache.org/jira/browse/SPARK-12068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033175#comment-15033175 ] Apache Spark commented on SPARK-12068: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/10059 > use a single column in Dataset.groupBy and count will fail > -- > > Key: SPARK-12068 > URL: https://issues.apache.org/jira/browse/SPARK-12068 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > > {code} > val ds = Seq("a" -> 1, "b" -> 1, "a" -> 2).toDS() > val count = ds.groupBy($"_1").count() > count.collect() // will fail > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12068) use a single column in Dataset.groupBy and count will fail
[ https://issues.apache.org/jira/browse/SPARK-12068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12068: Assignee: Apache Spark > use a single column in Dataset.groupBy and count will fail > -- > > Key: SPARK-12068 > URL: https://issues.apache.org/jira/browse/SPARK-12068 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > > {code} > val ds = Seq("a" -> 1, "b" -> 1, "a" -> 2).toDS() > val count = ds.groupBy($"_1").count() > count.collect() // will fail > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver
[ https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033172#comment-15033172 ] Jean-Baptiste Onofré commented on SPARK-11193: -- Hi Phil, it's on my bucket. I should submit the PR today. > Spark 1.5+ Kinesis Streaming - ClassCastException when starting > KinesisReceiver > --- > > Key: SPARK-11193 > URL: https://issues.apache.org/jira/browse/SPARK-11193 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.0, 1.5.1 >Reporter: Phil Kallos > Attachments: screen.png > > > After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis > Spark Streaming application, and am being consistently greeted with this > exception: > java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast > to scala.collection.mutable.SynchronizedMap > at > org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532) > at > org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982) > at > org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Worth noting that I am able to reproduce this issue locally, and also on > Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0). > Also, I am not able to run the included kinesis-asl example. > Built locally using: > git checkout v1.5.1 > mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package > Example run command: > bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector > https://kinesis.us-east-1.amazonaws.com -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
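As a sketch of one likely fix direction (an assumption, not the merged patch): the {{HashMap with SynchronizedMap}} mix-in is deprecated in Scala 2.11 and fragile across builds, and a {{java.util.concurrent.ConcurrentHashMap}} gives the same thread safety without the problematic cast. The field name below is illustrative only.

{code}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Thread-safe map without relying on scala.collection.mutable.SynchronizedMap.
val blockIdToSeqNumRanges = new ConcurrentHashMap[String, String]()
blockIdToSeqNumRanges.put("block-0", "seq-0001")

// A Scala-friendly view when iteration is needed:
blockIdToSeqNumRanges.asScala.foreach { case (blockId, range) => println(s"$blockId -> $range") }
{code}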
[jira] [Created] (SPARK-12068) use a single column in Dataset.groupBy and count will fail
Wenchen Fan created SPARK-12068: --- Summary: use a single column in Dataset.groupBy and count will fail Key: SPARK-12068 URL: https://issues.apache.org/jira/browse/SPARK-12068 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan {code} val ds = Seq("a" -> 1, "b" -> 1, "a" -> 2).toDS() val count = ds.groupBy($"_1").count() count.collect() // will fail {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12010) Spark JDBC requires support for column-name-free INSERT syntax
[ https://issues.apache.org/jira/browse/SPARK-12010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033124#comment-15033124 ] Michael Armbrust commented on SPARK-12010: -- Thanks for working on this, but we've already hit code freeze for 1.6.0 so I'm going to retarget. Typically [let project committers|https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-JIRA] set the "target version". > Spark JDBC requires support for column-name-free INSERT syntax > -- > > Key: SPARK-12010 > URL: https://issues.apache.org/jira/browse/SPARK-12010 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Christian Kurz > Original Estimate: 24h > Remaining Estimate: 24h > > Spark JDBC write only works with technologies which support the following > INSERT statement syntax (JdbcUtils.scala: insertStatement()): > INSERT INTO $table VALUES ( ?, ?, ..., ? ) > Some technologies require a list of column names: > INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? ) > Therefore technologies like Progress JDBC Driver for Cassandra do not work > with Spark JDBC write. > Idea for fix: > Move JdbcUtils.scala:insertStatement() into SqlDialect and add a SqlDialect > for Progress JDBC Driver for Cassandra -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
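A minimal Scala sketch of the proposed direction (the method name and placement are illustrative, not the merged change): generate the INSERT with an explicit column list derived from the schema so drivers that require it can work.

{code}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Builds: INSERT INTO <table> (<col1>, <col2>, ...) VALUES (?, ?, ...)
def insertStatement(table: String, schema: StructType): String = {
  val columns = schema.fields.map(_.name).mkString(", ")
  val placeholders = schema.fields.map(_ => "?").mkString(", ")
  s"INSERT INTO $table ($columns) VALUES ($placeholders)"
}

val schema = StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))
println(insertStatement("people", schema))   // INSERT INTO people (id, name) VALUES (?, ?)
{code}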
[jira] [Updated] (SPARK-12010) Spark JDBC requires support for column-name-free INSERT syntax
[ https://issues.apache.org/jira/browse/SPARK-12010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-12010: - Target Version/s: (was: 1.6.0) > Spark JDBC requires support for column-name-free INSERT syntax > -- > > Key: SPARK-12010 > URL: https://issues.apache.org/jira/browse/SPARK-12010 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Christian Kurz > Original Estimate: 24h > Remaining Estimate: 24h > > Spark JDBC write only works with technologies which support the following > INSERT statement syntax (JdbcUtils.scala: insertStatement()): > INSERT INTO $table VALUES ( ?, ?, ..., ? ) > Some technologies require a list of column names: > INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? ) > Therefore technologies like Progress JDBC Driver for Cassandra do not work > with Spark JDBC write. > Idea for fix: > Move JdbcUtils.scala:insertStatement() into SqlDialect and add a SqlDialect > for Progress JDBC Driver for Cassandra -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12017) Java Doc Publishing Broken
[ https://issues.apache.org/jira/browse/SPARK-12017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12017. -- Resolution: Fixed Assignee: Josh Rosen Fix Version/s: 1.6.0 > Java Doc Publishing Broken > -- > > Key: SPARK-12017 > URL: https://issues.apache.org/jira/browse/SPARK-12017 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Michael Armbrust >Assignee: Josh Rosen >Priority: Blocker > Fix For: 1.6.0 > > > The java docs are missing from the 1.6 preview. I think that > [this|https://github.com/apache/spark/commit/529a1d3380c4c23fed068ad05a6376162c4b76d6#commitcomment-14392230] > is the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12017) Java Doc Publishing Broken
[ https://issues.apache.org/jira/browse/SPARK-12017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033119#comment-15033119 ] Michael Armbrust commented on SPARK-12017: -- Fixed in https://github.com/apache/spark/pull/10049 > Java Doc Publishing Broken > -- > > Key: SPARK-12017 > URL: https://issues.apache.org/jira/browse/SPARK-12017 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Michael Armbrust >Priority: Blocker > > The java docs are missing from the 1.6 preview. I think that > [this|https://github.com/apache/spark/commit/529a1d3380c4c23fed068ad05a6376162c4b76d6#commitcomment-14392230] > is the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11796) Docker JDBC integration tests fail in Maven build due to dependency issue
[ https://issues.apache.org/jira/browse/SPARK-11796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11796: - Component/s: Tests > Docker JDBC integration tests fail in Maven build due to dependency issue > - > > Key: SPARK-11796 > URL: https://issues.apache.org/jira/browse/SPARK-11796 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 1.6.0 >Reporter: Josh Rosen > > Our new Docker integration tests for JDBC dialects are failing in the Maven > builds. For now, I've disabled this for Maven by adding the > {{-Dtest.exclude.tags=org.apache.spark.tags.DockerTest}} flag to our Jenkins > builds, but we should fix this soon. The test failures seem to be related to > dependency or classpath issues: > {code} > *** RUN ABORTED *** > java.lang.NoSuchMethodError: > org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder; > at > org.glassfish.jersey.apache.connector.ApacheConnector.(ApacheConnector.java:240) > at > org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115) > at > org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418) > at > org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88) > at > org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120) > at > org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117) > at > org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340) > at > org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726) > at > org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285) > at > org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126) > ... > {code} > To reproduce locally: {{build/mvn -pl docker-integration-tests package}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11601) ML 1.6 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-11601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11601: - Component/s: Documentation > ML 1.6 QA: API: Binary incompatible changes > --- > > Key: SPARK-11601 > URL: https://issues.apache.org/jira/browse/SPARK-11601 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Timothy Hunter > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, ping [~mengxr] for advice since he did it for > 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11607) Update MLlib website for 1.6
[ https://issues.apache.org/jira/browse/SPARK-11607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11607: - Component/s: Documentation > Update MLlib website for 1.6 > > > Key: SPARK-11607 > URL: https://issues.apache.org/jira/browse/SPARK-11607 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng > > Update MLlib's website to include features in 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11603) ML 1.6 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-11603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11603: - Component/s: Documentation > ML 1.6 QA: API: Experimental, DeveloperApi, final, sealed audit > --- > > Key: SPARK-11603 > URL: https://issues.apache.org/jira/browse/SPARK-11603 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: DB Tsai > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. This will > probably not include the Pipeline APIs yet since some parts (e.g., feature > attributes) are still under flux. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11315) Add YARN extension service to publish Spark events to YARN timeline service
[ https://issues.apache.org/jira/browse/SPARK-11315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11315: - Target Version/s: (was: 1.6.0) > Add YARN extension service to publish Spark events to YARN timeline service > --- > > Key: SPARK-11315 > URL: https://issues.apache.org/jira/browse/SPARK-11315 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 1.5.1 > Environment: Hadoop 2.6+ >Reporter: Steve Loughran > > Add an extension service (using SPARK-11314) to subscribe to Spark lifecycle > events, batch them and forward them to the YARN Application Timeline Service. > This data can then be retrieved by a new back end for the Spark History > Service, and by other analytics tools. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11600) Spark MLlib 1.6 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-11600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11600: - Component/s: Documentation > Spark MLlib 1.6 QA umbrella > --- > > Key: SPARK-11600 > URL: https://issues.apache.org/jira/browse/SPARK-11600 > Project: Spark > Issue Type: Umbrella > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This JIRA lists tasks for the next MLlib release's QA period. > h2. API > * Check binary API compatibility (SPARK-11601) > * Audit new public APIs (from the generated html doc) > ** Scala (SPARK-11602) > ** Java compatibility (SPARK-11605) > ** Python coverage (SPARK-11604) > * Check Experimental, DeveloperApi tags (SPARK-11603) > h2. Algorithms and performance > *Performance* > * _List any other missing performance tests from spark-perf here_ > * ALS.recommendAll (SPARK-7457) > * perf-tests in Python (SPARK-7539) > * perf-tests for transformers (SPARK-2838) > * MultilayerPerceptron (SPARK-11911) > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide (SPARK-11606) > * For major components, create JIRAs for example code (SPARK-9670) > * Update Programming Guide for 1.6 (towards end of QA) (SPARK-11608) > * Update website (SPARK-11607) > * Merge duplicate content under examples/ (SPARK-11685) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8414) Ensure ContextCleaner actually triggers clean ups
[ https://issues.apache.org/jira/browse/SPARK-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033103#comment-15033103 ] Michael Armbrust commented on SPARK-8414: - Still planning to do this for 1.6? > Ensure ContextCleaner actually triggers clean ups > - > > Key: SPARK-8414 > URL: https://issues.apache.org/jira/browse/SPARK-8414 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > > Right now it cleans up old references only through natural GCs, which may not > occur if the driver has infinite RAM. We should do a periodic GC to make sure > that we actually do clean things up. Something like once per 30 minutes seems > relatively inexpensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
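The periodic-GC proposal above is easy to picture. Below is a minimal, Spark-independent sketch of such a trigger, assuming a plain daemon scheduler thread; the object name and the 30-minute default are illustrative and not the actual ContextCleaner change.

{code}
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}

object PeriodicGc {
  // The interval is illustrative; the issue suggests something like once per 30 minutes.
  def start(intervalMinutes: Long = 30L): Unit = {
    val factory = new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, "periodic-gc")
        t.setDaemon(true) // never keep the JVM alive just for this thread
        t
      }
    }
    val scheduler = Executors.newSingleThreadScheduledExecutor(factory)
    val task = new Runnable {
      // System.gc() is only a hint, but it is normally enough to make
      // weak-reference based cleanup (as in ContextCleaner) actually run.
      override def run(): Unit = System.gc()
    }
    scheduler.scheduleWithFixedDelay(task, intervalMinutes, intervalMinutes, TimeUnit.MINUTES)
  }
}
{code}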
[jira] [Updated] (SPARK-7348) DAG visualization: add links to RDD page
[ https://issues.apache.org/jira/browse/SPARK-7348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-7348: Target Version/s: (was: 1.6.0) > DAG visualization: add links to RDD page > > > Key: SPARK-7348 > URL: https://issues.apache.org/jira/browse/SPARK-7348 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > It currently has links from the job page to the stage page. It would be nice > if it has links to the corresponding RDD page as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11985) Update Spark Streaming - Kinesis Library Documentation regarding data de-aggregation and message handler
[ https://issues.apache.org/jira/browse/SPARK-11985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11985: - Component/s: Documentation > Update Spark Streaming - Kinesis Library Documentation regarding data > de-aggregation and message handler > > > Key: SPARK-11985 > URL: https://issues.apache.org/jira/browse/SPARK-11985 > Project: Spark > Issue Type: Documentation > Components: Documentation, Streaming >Reporter: Burak Yavuz > > Update documentation and provide how-to example in guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6518) Add example code and user guide for bisecting k-means
[ https://issues.apache.org/jira/browse/SPARK-6518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6518: Component/s: Documentation > Add example code and user guide for bisecting k-means > - > > Key: SPARK-6518 > URL: https://issues.apache.org/jira/browse/SPARK-6518 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib >Reporter: Yu Ishikawa >Assignee: Yu Ishikawa > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12060) Avoid memory copy in JavaSerializerInstance.serialize
[ https://issues.apache.org/jira/browse/SPARK-12060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-12060: - Priority: Critical (was: Major) > Avoid memory copy in JavaSerializerInstance.serialize > - > > Key: SPARK-12060 > URL: https://issues.apache.org/jira/browse/SPARK-12060 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Critical > > JavaSerializerInstance.serialize uses ByteArrayOutputStream.toByteArray to > get the serialized data. ByteArrayOutputStream.toByteArray needs to copy the > content in the internal array to a new array. However, since the array will > be converted to ByteBuffer at once, we can avoid the memory copy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
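The description hinges on the fact that ByteArrayOutputStream.toByteArray copies the internal array, while a ByteBuffer can simply wrap that array. A sketch of the idea is below; the class and method names are illustrative, not Spark's actual implementation.

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.nio.ByteBuffer

// Exposes the internal byte array without the defensive copy that
// ByteArrayOutputStream.toByteArray performs. This is safe only because the
// buffer is wrapped once, after all writes have finished.
class ByteBufferOutputStream extends ByteArrayOutputStream {
  def toByteBuffer: ByteBuffer = ByteBuffer.wrap(buf, 0, count)
}

object SerializeSketch {
  // Illustrative stand-in for the serialize path described in the issue.
  def serialize(obj: AnyRef): ByteBuffer = {
    val bos = new ByteBufferOutputStream
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(obj)
    oos.close()
    bos.toByteBuffer // no extra copy of the payload
  }
}
{code}

The trade-off is that the returned ByteBuffer aliases the stream's internal array, so the stream must not be written to again afterwards.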
[jira] [Updated] (SPARK-8966) Design a mechanism to ensure that temporary files created in tasks are cleaned up after failures
[ https://issues.apache.org/jira/browse/SPARK-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-8966: Target Version/s: (was: 1.6.0) > Design a mechanism to ensure that temporary files created in tasks are > cleaned up after failures > > > Key: SPARK-8966 > URL: https://issues.apache.org/jira/browse/SPARK-8966 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Josh Rosen > > It's important to avoid leaking temporary files, such as spill files created > by the external sorter. Individual operators should still make an effort to > clean up their own files / perform their own error handling, but I think that > we should add a safety-net mechanism to track file creation on a per-task > basis and automatically clean up leaked files. > During tests, this mechanism should throw an exception when a leak is > detected. In production deployments, it should log a warning and clean up the > leak itself. This is similar to the TaskMemoryManager's leak detection and > cleanup code. > We may be able to implement this via a convenience method that registers task > completion handlers with TaskContext. > We might also explore techniques that will cause files to be cleaned up > automatically when their file descriptors are closed (e.g. by calling unlink > on an open file). These techniques should not be our last line of defense > against file resource leaks, though, since they might be platform-specific > and may clean up resources later than we'd like. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
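The "convenience method that registers task completion handlers with TaskContext" could look roughly like the sketch below. It assumes Spark's public TaskContext.get and addTaskCompletionListener hooks; the helper object and the stderr fallback are made up for illustration (the real mechanism would log a warning in production and throw in tests, as described above).

{code}
import java.io.File
import java.nio.file.Files

import org.apache.spark.TaskContext
import org.apache.spark.util.TaskCompletionListener

object TempFileTracking {
  // Hypothetical helper: creates a temp file and makes sure it is deleted
  // when the current task finishes, whether the task succeeded or failed.
  def createTrackedTempFile(prefix: String): File = {
    val file = Files.createTempFile(prefix, ".tmp").toFile
    val ctx = TaskContext.get()
    if (ctx != null) {
      ctx.addTaskCompletionListener(new TaskCompletionListener {
        override def onTaskCompletion(context: TaskContext): Unit = {
          if (file.exists() && !file.delete()) {
            System.err.println(s"Possible file leak: ${file.getAbsolutePath}")
          }
        }
      })
    }
    file
  }
}
{code}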
[jira] [Updated] (SPARK-12031) Integer overflow when do sampling.
[ https://issues.apache.org/jira/browse/SPARK-12031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-12031: - Description: In my case, some partitions contain too much items. When do range partition, exception thrown as: {code} java.lang.IllegalArgumentException: n must be positive at java.util.Random.nextInt(Random.java:300) at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:58) at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259) at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} was: In my case, some partitions contain too much items. When do range partition, exception thrown as: java.lang.IllegalArgumentException: n must be positive at java.util.Random.nextInt(Random.java:300) at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:58) at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259) at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) > Integer overflow when do sampling. > -- > > Key: SPARK-12031 > URL: https://issues.apache.org/jira/browse/SPARK-12031 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1, 1.5.2 >Reporter: uncleGen > > In my case, some partitions contain too much items. 
When do range partition, > exception thrown as: > {code} > java.lang.IllegalArgumentException: n must be positive > at java.util.Random.nextInt(Random.java:300) > at > org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:58) > at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259) > at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) > at org.apache.spark.scheduler.Task.run(Task.scala:70) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
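The "n must be positive" failure above is the classic symptom of an Int computation wrapping around to a negative value before being handed to java.util.Random.nextInt. Below is a self-contained illustration of the failure mode and of the usual fix (doing the intermediate arithmetic in Long); the numbers are made up and this is not the actual SamplingUtils code.

{code}
import java.util.Random

object OverflowSketch {
  def main(args: Array[String]): Unit = {
    val rand = new Random(42L)

    // A partition "size" large enough that an Int multiplication wraps around.
    val numItems = 1500000000          // 1.5 billion items
    val overflowed: Int = numItems * 2 // wraps to a negative Int

    // Random.nextInt rejects a non-positive bound with an
    // IllegalArgumentException ("n must be positive" on the JDK in the report).
    try rand.nextInt(overflowed)
    catch { case e: IllegalArgumentException => println(e.getMessage) }

    // Doing the intermediate arithmetic in Long avoids the wrap-around.
    val safe: Long = numItems.toLong * 2
    println(safe) // 3000000000
  }
}
{code}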
[jira] [Commented] (SPARK-7729) Executor which has been killed should also be displayed on Executors Tab.
[ https://issues.apache.org/jira/browse/SPARK-7729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033095#comment-15033095 ] Apache Spark commented on SPARK-7729: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/10058 > Executor which has been killed should also be displayed on Executors Tab. > - > > Key: SPARK-7729 > URL: https://issues.apache.org/jira/browse/SPARK-7729 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.3.1 >Reporter: Archit Thakur >Priority: Minor > Attachments: WebUI.png > > > On the ExecutorsTab there is no information about the executors which have > been killed. It only shows the running executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11966) Spark API for UDTFs
[ https://issues.apache.org/jira/browse/SPARK-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11966: - Target Version/s: 1.7.0 > Spark API for UDTFs > --- > > Key: SPARK-11966 > URL: https://issues.apache.org/jira/browse/SPARK-11966 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Jaka Jancar >Priority: Minor > > Defining UDFs is easy using sqlContext.udf.register, but not table-generating > functions. For those you still have to use these horrendous Hive interfaces: > https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12018) Refactor common subexpression elimination code
[ https://issues.apache.org/jira/browse/SPARK-12018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12018. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 10009 [https://github.com/apache/spark/pull/10009] > Refactor common subexpression elimination code > -- > > Key: SPARK-12018 > URL: https://issues.apache.org/jira/browse/SPARK-12018 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > Fix For: 1.6.0 > > > The code of common subexpression elimination can be factored and simplified. > Some unnecessary variables can be removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join
[ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-12032: -- Assignee: Davies Liu > Filter can't be pushed down to correct Join because of bad order of Join > > > Key: SPARK-12032 > URL: https://issues.apache.org/jira/browse/SPARK-12032 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > > For this query: > {code} > select d.d_year, count(*) cnt >FROM store_sales, date_dim d, customer c >WHERE ss_customer_sk = c.c_customer_sk AND c.c_first_shipto_date_sk = > d.d_date_sk >group by d.d_year > {code} > Current optimized plan is > {code} > == Optimized Logical Plan == > Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) > AS cnt#425L] > Project [d_year#147] > Join Inner, Some(((ss_customer_sk#283 = c_customer_sk#101) && > (c_first_shipto_date_sk#106 = d_date_sk#141))) >Project [d_date_sk#141,d_year#147,ss_customer_sk#283] > Join Inner, None > Project [ss_customer_sk#283] > Relation[] ParquetRelation[store_sales] > Project [d_date_sk#141,d_year#147] > Relation[] ParquetRelation[date_dim] >Project [c_customer_sk#101,c_first_shipto_date_sk#106] > Relation[] ParquetRelation[customer] > {code} > It will join store_sales and date_dim together without any condition, the > condition c.c_first_shipto_date_sk = d.d_date_sk is not pushed to it because > the bad order of joins. > The optimizer should re-order the joins, join date_dim after customer, then > it can pushed down the condition correctly. > The plan should be > {code} > Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) > AS cnt#425L] > Project [d_year#147] > Join Inner, Some((c_first_shipto_date_sk#106 = d_date_sk#141)) >Project [c_first_shipto_date_sk#106] > Join Inner, Some((ss_customer_sk#283 = c_customer_sk#101)) > Project [ss_customer_sk#283] > Relation[store_sales] > Project [c_first_shipto_date_sk#106,c_customer_sk#101] > Relation[customer] >Project [d_year#147,d_date_sk#141] > Relation[date_dim] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
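Until the optimizer reorders inner joins itself, the analysis above suggests a workaround: write the joins in an order that keeps each condition next to the tables it references. A sketch of the manually reordered query, assuming a 1.x spark-shell where sqlContext is predefined and the three TPC-DS-style tables from the issue are registered:

{code}
// Spell out the join order explicitly so each join carries a usable condition,
// matching the "plan should be" shape in the issue description.
val reordered = sqlContext.sql("""
  SELECT d.d_year, count(*) AS cnt
  FROM store_sales
  JOIN customer c ON ss_customer_sk = c.c_customer_sk
  JOIN date_dim d ON c.c_first_shipto_date_sk = d.d_date_sk
  GROUP BY d.d_year
""")
reordered.explain(true) // both inner joins now have a join condition
{code}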
[jira] [Assigned] (SPARK-10647) Mesos HA mode misuses spark.deploy.zookeeper.dir property; configs should be documented
[ https://issues.apache.org/jira/browse/SPARK-10647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10647: Assignee: Apache Spark (was: Timothy Chen) > Mesos HA mode misuses spark.deploy.zookeeper.dir property; configs should be > documented > --- > > Key: SPARK-10647 > URL: https://issues.apache.org/jira/browse/SPARK-10647 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.4.1, 1.5.0 >Reporter: Alan Braithwaite >Assignee: Apache Spark >Priority: Minor > > The property `spark.deploy.zookeeper.dir` doesn't match up with the other > properties surrounding it, namely: > spark.mesos.deploy.zookeeper.url > and > spark.mesos.deploy.recoveryMode > Since it's also a property specific to mesos, it makes sense to be under that > hierarchy as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10647) Mesos HA mode misuses spark.deploy.zookeeper.dir property; configs should be documented
[ https://issues.apache.org/jira/browse/SPARK-10647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033076#comment-15033076 ] Apache Spark commented on SPARK-10647: -- User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/10057 > Mesos HA mode misuses spark.deploy.zookeeper.dir property; configs should be > documented > --- > > Key: SPARK-10647 > URL: https://issues.apache.org/jira/browse/SPARK-10647 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.4.1, 1.5.0 >Reporter: Alan Braithwaite >Assignee: Timothy Chen >Priority: Minor > > The property `spark.deploy.zookeeper.dir` doesn't match up with the other > properties surrounding it, namely: > spark.mesos.deploy.zookeeper.url > and > spark.mesos.deploy.recoveryMode > Since it's also a property specific to mesos, it makes sense to be under that > hierarchy as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10647) Mesos HA mode misuses spark.deploy.zookeeper.dir property; configs should be documented
[ https://issues.apache.org/jira/browse/SPARK-10647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10647: Assignee: Timothy Chen (was: Apache Spark) > Mesos HA mode misuses spark.deploy.zookeeper.dir property; configs should be > documented > --- > > Key: SPARK-10647 > URL: https://issues.apache.org/jira/browse/SPARK-10647 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.4.1, 1.5.0 >Reporter: Alan Braithwaite >Assignee: Timothy Chen >Priority: Minor > > The property `spark.deploy.zookeeper.dir` doesn't match up with the other > properties surrounding it, namely: > spark.mesos.deploy.zookeeper.url > and > spark.mesos.deploy.recoveryMode > Since it's also a property specific to mesos, it makes sense to be under that > hierarchy as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12064) Make SqlParser a trait for better integration with extensions
[ https://issues.apache.org/jira/browse/SPARK-12064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao resolved SPARK-12064. --- Resolution: Won't Fix DBX plans to remove the SqlParser in 2.0. > Make SqlParser a trait for better integration with extensions > - > > Key: SPARK-12064 > URL: https://issues.apache.org/jira/browse/SPARK-12064 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Cheng Hao > > `SqlParser` is currently an object, which is hard to reuse in extensions. A better implementation would make `SqlParser` a trait, keep all of its implementation unchanged, and then add an object called `SqlParser` that inherits from the trait. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
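The trait-plus-object refactoring described in the issue is a common Scala pattern. A generic sketch of the shape it would take, using made-up names and a placeholder body rather than the real SqlParser:

{code}
// Keep the implementation in a trait so extensions can mix it in or override
// pieces, and preserve the existing object as the default instance.
trait SqlParserLike {
  def parse(sqlText: String): String = s"parsed: $sqlText" // placeholder body
}

object SqlParser extends SqlParserLike

// An extension can now reuse everything and override selectively.
object MyDialectParser extends SqlParserLike {
  override def parse(sqlText: String): String = super.parse(sqlText.trim)
}
{code}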
[jira] [Commented] (SPARK-6521) Bypass network shuffle read if both endpoints are local
[ https://issues.apache.org/jira/browse/SPARK-6521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033016#comment-15033016 ] Takeshi Yamamuro commented on SPARK-6521: - Performance of current Spark depends heavily on CPU, so this shuffle optimization has little effect (benchmark results can be found in pull request #9478). For now, this ticket does not need to be considered. > Bypass network shuffle read if both endpoints are local > --- > > Key: SPARK-6521 > URL: https://issues.apache.org/jira/browse/SPARK-6521 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 1.2.0 >Reporter: xukun > > In the past, an executor read another executor's shuffle files on the same node over the network. This PR makes executors on the same node read shuffle files locally in sort-based shuffle, which reduces network transfer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame
[ https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12067: Description: * SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to Column.isnotnull. * Add Column.notnull as alias of Column.isnotnull following the pandas naming convention. * Add DataFrame.isnotnull and DataFrame.notnull. was:Fix usage of isnan, isnull, isnotnull of Column and DataFrame. > Fix usage of isnan, isnull, isnotnull of Column and DataFrame > - > > Key: SPARK-12067 > URL: https://issues.apache.org/jira/browse/SPARK-12067 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yanbo Liang > > * SPARK-11947 has deprecated DataFrame.isNaN, DataFrame.isNull and replaced > by DataFrame.isnan, DataFrame.isnull, this PR changed Column.isNaN to > Column.isnan, Column.isNull to Column.isnull, Column.isNotNull to > Column.isnotnull. > * Add Column.notnull as alias of Column.isnotnull following the pandas naming > convention. > * Add DataFrame.isnotnull and DataFrame.notnull. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
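For readers keeping track of the names involved: Column.isNull, Column.isNotNull and the functions.isnan helper are the spellings that already exist, while the lower-case Column variants above are only proposed. A small usage sketch, in which df and its "measurement" column are made-up names:

{code}
import org.apache.spark.sql.functions.{col, isnan}

// Keep rows whose value is present and is a real number.
val cleaned = df.filter(col("measurement").isNotNull && !isnan(col("measurement")))
{code}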
[jira] [Assigned] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame
[ https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12067: Assignee: (was: Apache Spark) > Fix usage of isnan, isnull, isnotnull of Column and DataFrame > - > > Key: SPARK-12067 > URL: https://issues.apache.org/jira/browse/SPARK-12067 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yanbo Liang > > Fix usage of isnan, isnull, isnotnull of Column and DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame
[ https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12067: Assignee: Apache Spark > Fix usage of isnan, isnull, isnotnull of Column and DataFrame > - > > Key: SPARK-12067 > URL: https://issues.apache.org/jira/browse/SPARK-12067 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yanbo Liang >Assignee: Apache Spark > > Fix usage of isnan, isnull, isnotnull of Column and DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame
[ https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033013#comment-15033013 ] Apache Spark commented on SPARK-12067: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/10056 > Fix usage of isnan, isnull, isnotnull of Column and DataFrame > - > > Key: SPARK-12067 > URL: https://issues.apache.org/jira/browse/SPARK-12067 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yanbo Liang > > Fix usage of isnan, isnull, isnotnull of Column and DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame
Yanbo Liang created SPARK-12067: --- Summary: Fix usage of isnan, isnull, isnotnull of Column and DataFrame Key: SPARK-12067 URL: https://issues.apache.org/jira/browse/SPARK-12067 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yanbo Liang Fix usage of isnan, isnull, isnotnull of Column and DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12066) spark sql throw java.lang.ArrayIndexOutOfBoundsException when use table.* with join
Ricky Yang created SPARK-12066: -- Summary: spark sql throw java.lang.ArrayIndexOutOfBoundsException when use table.* with join Key: SPARK-12066 URL: https://issues.apache.org/jira/browse/SPARK-12066 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.2, 1.4.0 Environment: linux Reporter: Ricky Yang Priority: Blocker throw java.lang.ArrayIndexOutOfBoundsException when I use following spark sql on spark standlone or yarn. the sql: select ta.* from bi_td.dm_price_seg_td tb join bi_sor.sor_ord_detail_tf ta on 1 = 1 where ta.sale_dt = '20140514' and ta.sale_price >= tb.pri_from and ta.sale_price < tb.pri_to limit 10 ; But ,the result is correct when using no * as following: select ta.sale_dt from bi_td.dm_price_seg_td tb join bi_sor.sor_ord_detail_tf ta on 1 = 1 where ta.sale_dt = '20140514' and ta.sale_price >= tb.pri_from and ta.sale_price < tb.pri_to limit 10 ; standlone version is 1.4.0 and version spark on yarn is 1.5.2 error log : 15/11/30 14:19:59 ERROR SparkSQLDriver: Failed in [select ta.* from bi_td.dm_price_seg_td tb join bi_sor.sor_ord_detail_tf ta on 1 = 1 where ta.sale_dt = '20140514' and ta.sale_price >= tb.pri_from and ta.sale_price < tb.pri_to limit 10 ] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, namenode2-sit.cnsuning.com): java.lang.ArrayIndexOutOfBoundsException Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:215) at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207) at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:587) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:308) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:311) at 
org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:409) at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:425) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:166) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ArrayIndexOutOfBoundsException org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure:
[jira] [Assigned] (SPARK-12065) Upgrade Tachyon dependency to 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-12065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12065: Assignee: Josh Rosen (was: Apache Spark) > Upgrade Tachyon dependency to 0.8.2 > --- > > Key: SPARK-12065 > URL: https://issues.apache.org/jira/browse/SPARK-12065 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > I think that we should upgrade from Tachyon 0.8.1 to 0.8.2 in order to get > the fix for https://tachyon.atlassian.net/browse/TACHYON-1254. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12065) Upgrade Tachyon dependency to 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-12065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032993#comment-15032993 ] Apache Spark commented on SPARK-12065: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/10054 > Upgrade Tachyon dependency to 0.8.2 > --- > > Key: SPARK-12065 > URL: https://issues.apache.org/jira/browse/SPARK-12065 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > I think that we should upgrade from Tachyon 0.8.1 to 0.8.2 in order to get > the fix for https://tachyon.atlassian.net/browse/TACHYON-1254. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12065) Upgrade Tachyon dependency to 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-12065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12065: Assignee: Apache Spark (was: Josh Rosen) > Upgrade Tachyon dependency to 0.8.2 > --- > > Key: SPARK-12065 > URL: https://issues.apache.org/jira/browse/SPARK-12065 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Josh Rosen >Assignee: Apache Spark > > I think that we should upgrade from Tachyon 0.8.1 to 0.8.2 in order to get > the fix for https://tachyon.atlassian.net/browse/TACHYON-1254. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12065) Upgrade Tachyon dependency to 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-12065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-12065: --- Issue Type: Improvement (was: Bug) > Upgrade Tachyon dependency to 0.8.2 > --- > > Key: SPARK-12065 > URL: https://issues.apache.org/jira/browse/SPARK-12065 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > I think that we should upgrade from Tachyon 0.8.1 to 0.8.2 in order to get > the fix for https://tachyon.atlassian.net/browse/TACHYON-1254. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12065) Upgrade Tachyon dependency to 0.8.2
Josh Rosen created SPARK-12065: -- Summary: Upgrade Tachyon dependency to 0.8.2 Key: SPARK-12065 URL: https://issues.apache.org/jira/browse/SPARK-12065 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen I think that we should upgrade from Tachyon 0.8.1 to 0.8.2 in order to get the fix for https://tachyon.atlassian.net/browse/TACHYON-1254. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11941) JSON representation of nested StructTypes could be more uniform
[ https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032960#comment-15032960 ] Henri DF commented on SPARK-11941: -- I wasn't trying to serialize to/from using the Spark APIs - I was just getting the json representation out in order to build a programmatic representation of the structtype in another (non-Spark) environment. Recursing down the tree would be trivial if it was regular, but is painful with its current layout. Anyway, with your question I think I better understand the intended use for this, and it does indeed appear to work fine for ser/deser within Spark. So I get the rationale for making it an "Improvement". Thanks! > JSON representation of nested StructTypes could be more uniform > --- > > Key: SPARK-11941 > URL: https://issues.apache.org/jira/browse/SPARK-11941 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Henri DF > > I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", > "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly > inferred: > {code} > scala> df.printSchema > root > |-- a: long (nullable = true) > |-- b: double (nullable = true) > |-- c: string (nullable = true) > |-- d: array (nullable = true) > ||-- element: long (containsNull = true) > {code} > However, the json representation has a strange nesting under "type" for > column "d": > {code} > scala> df.collect()(0).schema.prettyJson > res60: String = > { > "type" : "struct", > "fields" : [ { > "name" : "a", > "type" : "long", > "nullable" : true, > "metadata" : { } > }, { > "name" : "b", > "type" : "double", > "nullable" : true, > "metadata" : { } > }, { > "name" : "c", > "type" : "string", > "nullable" : true, > "metadata" : { } > }, { > "name" : "d", > "type" : { > "type" : "array", > "elementType" : "long", > "containsNull" : true > }, > "nullable" : true, > "metadata" : { } > }] > } > {code} > Specifically, in the last element, "type" is an object instead of being a > string. I would expect the last element to be: > {code} > { > "name":"d", > "type":"array", > "elementType":"long", > "containsNull":true, > "nullable":true, > "metadata":{} > } > {code} > There's a similar issue for nested structs. > (I ran into this while writing node.js bindings, wanted to recurse down this > representation, which would be nicer if it was uniform...). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
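The round trip referred to above can be checked directly with the public JSON helpers on StructType and DataType; a small sketch with a made-up schema:

{code}
import org.apache.spark.sql.types._

// Round-trip a schema through its JSON representation. DataType.fromJson
// understands the nested "type" objects shown in the issue, so nothing is
// lost even though the layout is not uniform.
val schema = StructType(Seq(
  StructField("a", LongType),
  StructField("d", ArrayType(LongType, containsNull = true))
))

val restored = DataType.fromJson(schema.json).asInstanceOf[StructType]
assert(restored == schema)
{code}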
[jira] [Commented] (SPARK-11940) Python API for ml.clustering.LDA
[ https://issues.apache.org/jira/browse/SPARK-11940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032952#comment-15032952 ] Jeff Zhang commented on SPARK-11940: Thanks [~yanboliang] I will work on it. > Python API for ml.clustering.LDA > > > Key: SPARK-11940 > URL: https://issues.apache.org/jira/browse/SPARK-11940 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang > > Add Python API for ml.clustering.LDA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11940) Python API for ml.clustering.LDA
[ https://issues.apache.org/jira/browse/SPARK-11940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032948#comment-15032948 ] Yanbo Liang commented on SPARK-11940: - [~zjffdu] I'm not working on this, you can take it. > Python API for ml.clustering.LDA > > > Key: SPARK-11940 > URL: https://issues.apache.org/jira/browse/SPARK-11940 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang > > Add Python API for ml.clustering.LDA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11941) JSON representation of nested StructTypes could be more uniform
[ https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032946#comment-15032946 ] Michael Armbrust commented on SPARK-11941: -- Sorry, maybe I'm misunderstanding. Can you construct a case where we serialize the case class representation to and from json and we lose information? If you can, then I agree this is a bug and we should fix it. Otherwise, it seems like an inconvenience. > JSON representation of nested StructTypes could be more uniform > --- > > Key: SPARK-11941 > URL: https://issues.apache.org/jira/browse/SPARK-11941 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Henri DF > > I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", > "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly > inferred: > {code} > scala> df.printSchema > root > |-- a: long (nullable = true) > |-- b: double (nullable = true) > |-- c: string (nullable = true) > |-- d: array (nullable = true) > ||-- element: long (containsNull = true) > {code} > However, the json representation has a strange nesting under "type" for > column "d": > {code} > scala> df.collect()(0).schema.prettyJson > res60: String = > { > "type" : "struct", > "fields" : [ { > "name" : "a", > "type" : "long", > "nullable" : true, > "metadata" : { } > }, { > "name" : "b", > "type" : "double", > "nullable" : true, > "metadata" : { } > }, { > "name" : "c", > "type" : "string", > "nullable" : true, > "metadata" : { } > }, { > "name" : "d", > "type" : { > "type" : "array", > "elementType" : "long", > "containsNull" : true > }, > "nullable" : true, > "metadata" : { } > }] > } > {code} > Specifically, in the last element, "type" is an object instead of being a > string. I would expect the last element to be: > {code} > { > "name":"d", > "type":"array", > "elementType":"long", > "containsNull":true, > "nullable":true, > "metadata":{} > } > {code} > There's a similar issue for nested structs. > (I ran into this while writing node.js bindings, wanted to recurse down this > representation, which would be nicer if it was uniform...). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032939#comment-15032939 ] Xiao Li edited comment on SPARK-12030 at 12/1/15 2:16 AM: -- Let me post a simple case that can trigger the data corruption. The data set t1 is downloaded from this JIRA. {code} test("sort result") { withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1", SQLConf.SHUFFLE_PARTITIONS.key -> "1") { val t1test = sqlContext.read.parquet("/Users/xiaoli/Downloads/t1").dropDuplicates().where("fk1=39 or (fk1=525 and id1 < 664618 and id1 >= 470050)").repartition(1).cache() //t1test.orderBy("fk1").explain(true) val t1 = t1test.orderBy("fk1").cache() checkAnswer( t1test, t1.collect() ) } {code} I am not sure if you can see the un-match. I am unable to reproduce it in a Thinkpad, but I can easily reproduce it in my macbook. My case did not hit any exception, but I saw a data corruption. After sorting, one row [664615,525] is replaced by another row [664611,525]. Thus one row disappeared after sorting, but you can see a duplicate of another row. The number of total rows was not changed after the sort. was (Author: smilegator): Let me post a simple case that can trigger the data corruption. The data set t1 is downloaded from this JIRA. {code} test("sort result") { withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1", SQLConf.SHUFFLE_PARTITIONS.key -> "1") { val t1test = sqlContext.read.parquet("/Users/xiaoli/Downloads/t1").dropDuplicates().where("fk1=39 or (fk1=525 and id1 < 664618 and id1 >= 470050)").repartition(1).cache() //t1test.orderBy("fk1").explain(true) val t1 = t1test.orderBy("fk1").cache() checkAnswer( t1test, t1.collect() ) } {code} I am not sure if you can see the un-match. I am unable to reproduce it in the Thinkpad, but I can easily reproduce it in my macbook. My case did not hit any exception, but I saw a data corruption. After sorting, one row [664615,525] is replaced by another row [664611,525]. Thus one row disappears after sorting, but you can see a duplicate in another row. The number of total rows is not changed after the sort. > Incorrect results when aggregate joined data > > > Key: SPARK-12030 > URL: https://issues.apache.org/jira/browse/SPARK-12030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński >Priority: Blocker > Attachments: spark.jpg, t1.tar.gz, t2.tar.gz > > > I have following issue. > I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2) > {code} > t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache() > t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache() > joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer") > {code} > Important: both table are cached, so results should be the same on every > query. > Then I did come counts: > {code} > t1.count() -> 5900729 > t1.registerTempTable("t1") > sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729 > t2.count() -> 54298 > joined.count() -> 5900729 > {code} > And here magic begins - I counted distinct id1 from joined table > {code} > joined.registerTempTable("joined") > sqlCtx.sql("select distinct(id1) from joined").count() > {code} > Results varies *(are different on every run)* between 5899000 and > 590 but never are equal to 5900729. > In addition. 
I did more queries: > {code} > sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > > 1").collect() > {code} > This gives some results but this query return *1* > {code} > len(sqlCtx.sql("select * from joined where id1 = result").collect()) > {code} > What's wrong ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032939#comment-15032939 ] Xiao Li commented on SPARK-12030: - Let me post a simple case that can trigger the data corruption. The data set t1 is downloaded from this JIRA. {code} test("sort result") { withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1", SQLConf.SHUFFLE_PARTITIONS.key -> "1") { val t1test = sqlContext.read.parquet("/Users/xiaoli/Downloads/t1").dropDuplicates().where("fk1=39 or (fk1=525 and id1 < 664618 and id1 >= 470050)").repartition(1).cache() //t1test.orderBy("fk1").explain(true) val t1 = t1test.orderBy("fk1").cache() checkAnswer( t1test, t1.collect() ) } {code} I am not sure if you can see the un-match. I am unable to reproduce it in the Thinkpad, but I can easily reproduce it in my macbook. My case did not hit any exception, but I saw a data corruption. After sorting, one row [664615,525] is replaced by another row [664611,525]. Thus one row disappears after sorting, but you can see a duplicate in another row. The number of total rows is not changed after the sort. > Incorrect results when aggregate joined data > > > Key: SPARK-12030 > URL: https://issues.apache.org/jira/browse/SPARK-12030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński >Priority: Blocker > Attachments: spark.jpg, t1.tar.gz, t2.tar.gz > > > I have following issue. > I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2) > {code} > t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache() > t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache() > joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer") > {code} > Important: both table are cached, so results should be the same on every > query. > Then I did come counts: > {code} > t1.count() -> 5900729 > t1.registerTempTable("t1") > sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729 > t2.count() -> 54298 > joined.count() -> 5900729 > {code} > And here magic begins - I counted distinct id1 from joined table > {code} > joined.registerTempTable("joined") > sqlCtx.sql("select distinct(id1) from joined").count() > {code} > Results varies *(are different on every run)* between 5899000 and > 590 but never are equal to 5900729. > In addition. I did more queries: > {code} > sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > > 1").collect() > {code} > This gives some results but this query return *1* > {code} > len(sqlCtx.sql("select * from joined where id1 = result").collect()) > {code} > What's wrong ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11941) JSON representation of nested StructTypes could be more uniform
[ https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032931#comment-15032931 ] Henri DF edited comment on SPARK-11941 at 12/1/15 2:14 AM: --- I think "might be nicer if it was flat' is a bit of an understatement The current representation isn't of much use with nested structs. If it's hard to fix, wouldn't it be better to make this private rather than leave exposed it in its current state? was (Author: henridf): I think "might be nicer if it was flat' is a bit of an understatement The current representation isn't of much use with nested structs. If it's hard to fix, would it be better to remove this than leave it in its current state? > JSON representation of nested StructTypes could be more uniform > --- > > Key: SPARK-11941 > URL: https://issues.apache.org/jira/browse/SPARK-11941 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Henri DF > > I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", > "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly > inferred: > {code} > scala> df.printSchema > root > |-- a: long (nullable = true) > |-- b: double (nullable = true) > |-- c: string (nullable = true) > |-- d: array (nullable = true) > ||-- element: long (containsNull = true) > {code} > However, the json representation has a strange nesting under "type" for > column "d": > {code} > scala> df.collect()(0).schema.prettyJson > res60: String = > { > "type" : "struct", > "fields" : [ { > "name" : "a", > "type" : "long", > "nullable" : true, > "metadata" : { } > }, { > "name" : "b", > "type" : "double", > "nullable" : true, > "metadata" : { } > }, { > "name" : "c", > "type" : "string", > "nullable" : true, > "metadata" : { } > }, { > "name" : "d", > "type" : { > "type" : "array", > "elementType" : "long", > "containsNull" : true > }, > "nullable" : true, > "metadata" : { } > }] > } > {code} > Specifically, in the last element, "type" is an object instead of being a > string. I would expect the last element to be: > {code} > { > "name":"d", > "type":"array", > "elementType":"long", > "containsNull":true, > "nullable":true, > "metadata":{} > } > {code} > There's a similar issue for nested structs. > (I ran into this while writing node.js bindings, wanted to recurse down this > representation, which would be nicer if it was uniform...). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11941) JSON representation of nested StructTypes could be more uniform
[ https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032931#comment-15032931 ] Henri DF commented on SPARK-11941: -- I think "might be nicer if it was flat' is a bit of an understatement The current representation isn't of much use with nested structs. If it's hard to fix, would it be better to remove this than leave it in its current state? > JSON representation of nested StructTypes could be more uniform > --- > > Key: SPARK-11941 > URL: https://issues.apache.org/jira/browse/SPARK-11941 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Henri DF > > I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", > "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly > inferred: > {code} > scala> df.printSchema > root > |-- a: long (nullable = true) > |-- b: double (nullable = true) > |-- c: string (nullable = true) > |-- d: array (nullable = true) > ||-- element: long (containsNull = true) > {code} > However, the json representation has a strange nesting under "type" for > column "d": > {code} > scala> df.collect()(0).schema.prettyJson > res60: String = > { > "type" : "struct", > "fields" : [ { > "name" : "a", > "type" : "long", > "nullable" : true, > "metadata" : { } > }, { > "name" : "b", > "type" : "double", > "nullable" : true, > "metadata" : { } > }, { > "name" : "c", > "type" : "string", > "nullable" : true, > "metadata" : { } > }, { > "name" : "d", > "type" : { > "type" : "array", > "elementType" : "long", > "containsNull" : true > }, > "nullable" : true, > "metadata" : { } > }] > } > {code} > Specifically, in the last element, "type" is an object instead of being a > string. I would expect the last element to be: > {code} > { > "name":"d", > "type":"array", > "elementType":"long", > "containsNull":true, > "nullable":true, > "metadata":{} > } > {code} > There's a similar issue for nested structs. > (I ran into this while writing node.js bindings, wanted to recurse down this > representation, which would be nicer if it was uniform...). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032927#comment-15032927 ] Yin Huai commented on SPARK-12030: -- [~smilegator] Can you post the case that triggers the problem? Also, is https://issues.apache.org/jira/browse/SPARK-12055 a related issue? > Incorrect results when aggregate joined data > > > Key: SPARK-12030 > URL: https://issues.apache.org/jira/browse/SPARK-12030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński >Priority: Blocker > Attachments: spark.jpg, t1.tar.gz, t2.tar.gz > > > I have following issue. > I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2) > {code} > t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache() > t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache() > joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer") > {code} > Important: both table are cached, so results should be the same on every > query. > Then I did come counts: > {code} > t1.count() -> 5900729 > t1.registerTempTable("t1") > sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729 > t2.count() -> 54298 > joined.count() -> 5900729 > {code} > And here magic begins - I counted distinct id1 from joined table > {code} > joined.registerTempTable("joined") > sqlCtx.sql("select distinct(id1) from joined").count() > {code} > Results varies *(are different on every run)* between 5899000 and > 590 but never are equal to 5900729. > In addition. I did more queries: > {code} > sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > > 1").collect() > {code} > This gives some results but this query return *1* > {code} > len(sqlCtx.sql("select * from joined where id1 = result").collect()) > {code} > What's wrong ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11966) Spark API for UDTFs
[ https://issues.apache.org/jira/browse/SPARK-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032925#comment-15032925 ] Michael Armbrust commented on SPARK-11966: -- Ah, I was proposing the DataFrame function explode as it gives you something very close to UDTFs. However, if you want to be able to use the functions in pure SQL, then that's not going to be sufficient. > Spark API for UDTFs > --- > > Key: SPARK-11966 > URL: https://issues.apache.org/jira/browse/SPARK-11966 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Jaka Jancar >Priority: Minor > > Defining UDFs is easy using sqlContext.udf.register, but not table-generating > functions. For those you still have to use these horrendous Hive interfaces: > https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
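For reference, a minimal sketch of the {{explode}} approach mentioned above, assuming a DataFrame with a string column named "words" (the column and output names are made up). It maps one input row to zero or more output rows, which is the UDTF-like part, but it is only reachable from the DataFrame API, not from pure SQL.
{code}
import org.apache.spark.sql.DataFrame

// One input row -> zero or more output rows, driven by a plain Scala function.
def splitWords(df: DataFrame): DataFrame =
  df.explode("words", "word") { words: String => words.split(" ").toSeq }
{code}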
[jira] [Updated] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join
[ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-12032: - Issue Type: Improvement (was: Bug) > Filter can't be pushed down to correct Join because of bad order of Join > > > Key: SPARK-12032 > URL: https://issues.apache.org/jira/browse/SPARK-12032 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Priority: Critical > > For this query: > {code} > select d.d_year, count(*) cnt >FROM store_sales, date_dim d, customer c >WHERE ss_customer_sk = c.c_customer_sk AND c.c_first_shipto_date_sk = > d.d_date_sk >group by d.d_year > {code} > Current optimized plan is > {code} > == Optimized Logical Plan == > Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) > AS cnt#425L] > Project [d_year#147] > Join Inner, Some(((ss_customer_sk#283 = c_customer_sk#101) && > (c_first_shipto_date_sk#106 = d_date_sk#141))) >Project [d_date_sk#141,d_year#147,ss_customer_sk#283] > Join Inner, None > Project [ss_customer_sk#283] > Relation[] ParquetRelation[store_sales] > Project [d_date_sk#141,d_year#147] > Relation[] ParquetRelation[date_dim] >Project [c_customer_sk#101,c_first_shipto_date_sk#106] > Relation[] ParquetRelation[customer] > {code} > It will join store_sales and date_dim together without any condition, the > condition c.c_first_shipto_date_sk = d.d_date_sk is not pushed to it because > the bad order of joins. > The optimizer should re-order the joins, join date_dim after customer, then > it can pushed down the condition correctly. > The plan should be > {code} > Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) > AS cnt#425L] > Project [d_year#147] > Join Inner, Some((c_first_shipto_date_sk#106 = d_date_sk#141)) >Project [c_first_shipto_date_sk#106] > Join Inner, Some((ss_customer_sk#283 = c_customer_sk#101)) > Project [ss_customer_sk#283] > Relation[store_sales] > Project [c_first_shipto_date_sk#106,c_customer_sk#101] > Relation[customer] >Project [d_year#147,d_date_sk#141] > Relation[date_dim] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
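Not a fix for the optimizer, just a hedged sketch of the manual workaround implied by the plans above: writing the joins with customer before date_dim keeps each equi-join condition attached to the join it belongs to, until Spark can reorder joins itself. {{storeSales}}, {{customer}} and {{dateDim}} are assumed DataFrames; column names are taken from the plans.
{code}
import org.apache.spark.sql.DataFrame

// Join customer before date_dim so c_first_shipto_date_sk is in scope when
// date_dim is joined, avoiding the condition-less (cross) join in the plan above.
def query(storeSales: DataFrame, customer: DataFrame, dateDim: DataFrame): DataFrame =
  storeSales
    .join(customer, storeSales("ss_customer_sk") === customer("c_customer_sk"))
    .join(dateDim, customer("c_first_shipto_date_sk") === dateDim("d_date_sk"))
    .groupBy(dateDim("d_year"))
    .count()
{code}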
[jira] [Commented] (SPARK-11873) Regression for TPC-DS query 63 when used with decimal datatype and windows function
[ https://issues.apache.org/jira/browse/SPARK-11873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032921#comment-15032921 ] Michael Armbrust commented on SPARK-11873: -- What about with Spark 1.6? > Regression for TPC-DS query 63 when used with decimal datatype and windows > function > --- > > Key: SPARK-11873 > URL: https://issues.apache.org/jira/browse/SPARK-11873 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Dileep Kumar > Labels: perfomance > Attachments: 63.1.1, 63.1.5, 63.decimal_schema, > 63.decimal_schema_windows_function, 63.double_schema, 98.1.1, 98.1.5, > decimal_schema.sql, double_schema.sql > > > When running the TPC-DS based queries for benchmarking Spark, I found that query > 63 (after making it similar to the original query) shows different behavior > compared to other queries, e.g. q98, which has a similar function. > Here are the performance numbers (execution time in seconds): > 1.1 Baseline 1.5 1.5 + Decimal > q63 27 26 38 > q98 18 26 24 > As you can see, q63 shows a regression compared to the similar query. I am > attaching both versions of the queries and the affected schemas. When adding the > window function back, this is the only query that seems to be slower than 1.1 in > 1.5. > I have attached both versions of the schema and queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11966) Spark API for UDTFs
[ https://issues.apache.org/jira/browse/SPARK-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032906#comment-15032906 ] Jaka Jancar edited comment on SPARK-11966 at 12/1/15 1:59 AM: -- Not sure I understand. I would like to do {{SELECT * FROM my_create_table(...)}}. Right now, all I can do is {{SELECT * FROM explode(my_create_array(...))}}. //edit: In reality, this would be a part of JOIN or lateral view. I would like it to be doable with only SQL. was (Author: jakajancar): Not sure I understand. I would like to do {{SELECT * FROM my_create_table(...)}}. Right now, all I can do is {{SELECT * FROM explode(my_create_array(...))}}. > Spark API for UDTFs > --- > > Key: SPARK-11966 > URL: https://issues.apache.org/jira/browse/SPARK-11966 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Jaka Jancar >Priority: Minor > > Defining UDFs is easy using sqlContext.udf.register, but not table-generating > functions. For those you still have to use these horrendous Hive interfaces: > https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11966) Spark API for UDTFs
[ https://issues.apache.org/jira/browse/SPARK-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032906#comment-15032906 ] Jaka Jancar commented on SPARK-11966: - Not sure I understand. I would like to do {{SELECT * FROM my_create_table(...)}}. Right now, all I can do is {{SELECT * FROM explode(my_create_array(...))}}. > Spark API for UDTFs > --- > > Key: SPARK-11966 > URL: https://issues.apache.org/jira/browse/SPARK-11966 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Jaka Jancar >Priority: Minor > > Defining UDFs is easy using sqlContext.udf.register, but not table-generating > functions. For those you still have to use these horrendous Hive interfaces: > https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11941) JSON representation of nested StructTypes could be more uniform
[ https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11941: - Issue Type: Improvement (was: Bug) > JSON representation of nested StructTypes could be more uniform > --- > > Key: SPARK-11941 > URL: https://issues.apache.org/jira/browse/SPARK-11941 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Henri DF > > I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", > "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly > inferred: > {code} > scala> df.printSchema > root > |-- a: long (nullable = true) > |-- b: double (nullable = true) > |-- c: string (nullable = true) > |-- d: array (nullable = true) > ||-- element: long (containsNull = true) > {code} > However, the json representation has a strange nesting under "type" for > column "d": > {code} > scala> df.collect()(0).schema.prettyJson > res60: String = > { > "type" : "struct", > "fields" : [ { > "name" : "a", > "type" : "long", > "nullable" : true, > "metadata" : { } > }, { > "name" : "b", > "type" : "double", > "nullable" : true, > "metadata" : { } > }, { > "name" : "c", > "type" : "string", > "nullable" : true, > "metadata" : { } > }, { > "name" : "d", > "type" : { > "type" : "array", > "elementType" : "long", > "containsNull" : true > }, > "nullable" : true, > "metadata" : { } > }] > } > {code} > Specifically, in the last element, "type" is an object instead of being a > string. I would expect the last element to be: > {code} > { > "name":"d", > "type":"array", > "elementType":"long", > "containsNull":true, > "nullable":true, > "metadata":{} > } > {code} > There's a similar issue for nested structs. > (I ran into this while writing node.js bindings, wanted to recurse down this > representation, which would be nicer if it was uniform...). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11941) JSON representation of nested StructTypes could be more uniform
[ https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032899#comment-15032899 ] Michael Armbrust commented on SPARK-11941: -- /cc [~lian cheng] > JSON representation of nested StructTypes could be more uniform > --- > > Key: SPARK-11941 > URL: https://issues.apache.org/jira/browse/SPARK-11941 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Henri DF > > I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", > "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly > inferred: > {code} > scala> df.printSchema > root > |-- a: long (nullable = true) > |-- b: double (nullable = true) > |-- c: string (nullable = true) > |-- d: array (nullable = true) > ||-- element: long (containsNull = true) > {code} > However, the json representation has a strange nesting under "type" for > column "d": > {code} > scala> df.collect()(0).schema.prettyJson > res60: String = > { > "type" : "struct", > "fields" : [ { > "name" : "a", > "type" : "long", > "nullable" : true, > "metadata" : { } > }, { > "name" : "b", > "type" : "double", > "nullable" : true, > "metadata" : { } > }, { > "name" : "c", > "type" : "string", > "nullable" : true, > "metadata" : { } > }, { > "name" : "d", > "type" : { > "type" : "array", > "elementType" : "long", > "containsNull" : true > }, > "nullable" : true, > "metadata" : { } > }] > } > {code} > Specifically, in the last element, "type" is an object instead of being a > string. I would expect the last element to be: > {code} > { > "name":"d", > "type":"array", > "elementType":"long", > "containsNull":true, > "nullable":true, > "metadata":{} > } > {code} > There's a similar issue for nested structs. > (I ran into this while writing node.js bindings, wanted to recurse down this > representation, which would be nicer if it was uniform...). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032902#comment-15032902 ] Xiao Li commented on SPARK-12030: - I already excluded Exchange and Partitioning. It should be caused by Sort. Will continue the investigation tonight. Will keep you posted if I can locate the exact changes. Thanks! > Incorrect results when aggregate joined data > > > Key: SPARK-12030 > URL: https://issues.apache.org/jira/browse/SPARK-12030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński >Priority: Blocker > Attachments: spark.jpg, t1.tar.gz, t2.tar.gz > > > I have following issue. > I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2) > {code} > t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache() > t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache() > joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer") > {code} > Important: both table are cached, so results should be the same on every > query. > Then I did come counts: > {code} > t1.count() -> 5900729 > t1.registerTempTable("t1") > sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729 > t2.count() -> 54298 > joined.count() -> 5900729 > {code} > And here magic begins - I counted distinct id1 from joined table > {code} > joined.registerTempTable("joined") > sqlCtx.sql("select distinct(id1) from joined").count() > {code} > Results varies *(are different on every run)* between 5899000 and > 590 but never are equal to 5900729. > In addition. I did more queries: > {code} > sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > > 1").collect() > {code} > This gives some results but this query return *1* > {code} > len(sqlCtx.sql("select * from joined where id1 = result").collect()) > {code} > What's wrong ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11941) JSON representation of nested StructTypes could be more uniform
[ https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11941: - Summary: JSON representation of nested StructTypes could be more uniform (was: JSON representation of nested StructTypes is incorrect) > JSON representation of nested StructTypes could be more uniform > --- > > Key: SPARK-11941 > URL: https://issues.apache.org/jira/browse/SPARK-11941 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Henri DF > > I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", > "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly > inferred: > {code} > scala> df.printSchema > root > |-- a: long (nullable = true) > |-- b: double (nullable = true) > |-- c: string (nullable = true) > |-- d: array (nullable = true) > ||-- element: long (containsNull = true) > {code} > However, the json representation has a strange nesting under "type" for > column "d": > {code} > scala> df.collect()(0).schema.prettyJson > res60: String = > { > "type" : "struct", > "fields" : [ { > "name" : "a", > "type" : "long", > "nullable" : true, > "metadata" : { } > }, { > "name" : "b", > "type" : "double", > "nullable" : true, > "metadata" : { } > }, { > "name" : "c", > "type" : "string", > "nullable" : true, > "metadata" : { } > }, { > "name" : "d", > "type" : { > "type" : "array", > "elementType" : "long", > "containsNull" : true > }, > "nullable" : true, > "metadata" : { } > }] > } > {code} > Specifically, in the last element, "type" is an object instead of being a > string. I would expect the last element to be: > {code} > { > "name":"d", > "type":"array", > "elementType":"long", > "containsNull":true, > "nullable":true, > "metadata":{} > } > {code} > There's a similar issue for nested structs. > (I ran into this while writing node.js bindings, wanted to recurse down this > representation, which would be nicer if it was uniform...). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11941) JSON representation of nested StructTypes is incorrect
[ https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032889#comment-15032889 ] Michael Armbrust commented on SPARK-11941: -- While I can appreciate that this might be nicer if it was flat, I don't think that changing it at this point is worth the cost. This is a stable representation that we persist with data. As such, if we change it we are going to have to support parsing both representations forever. > JSON representation of nested StructTypes is incorrect > -- > > Key: SPARK-11941 > URL: https://issues.apache.org/jira/browse/SPARK-11941 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Henri DF > > I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", > "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly > inferred: > {code} > scala> df.printSchema > root > |-- a: long (nullable = true) > |-- b: double (nullable = true) > |-- c: string (nullable = true) > |-- d: array (nullable = true) > ||-- element: long (containsNull = true) > {code} > However, the json representation has a strange nesting under "type" for > column "d": > {code} > scala> df.collect()(0).schema.prettyJson > res60: String = > { > "type" : "struct", > "fields" : [ { > "name" : "a", > "type" : "long", > "nullable" : true, > "metadata" : { } > }, { > "name" : "b", > "type" : "double", > "nullable" : true, > "metadata" : { } > }, { > "name" : "c", > "type" : "string", > "nullable" : true, > "metadata" : { } > }, { > "name" : "d", > "type" : { > "type" : "array", > "elementType" : "long", > "containsNull" : true > }, > "nullable" : true, > "metadata" : { } > }] > } > {code} > Specifically, in the last element, "type" is an object instead of being a > string. I would expect the last element to be: > {code} > { > "name":"d", > "type":"array", > "elementType":"long", > "containsNull":true, > "nullable":true, > "metadata":{} > } > {code} > There's a similar issue for nested structs. > (I ran into this while writing node.js bindings, wanted to recurse down this > representation, which would be nicer if it was uniform...). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032887#comment-15032887 ] Xiao Li commented on SPARK-12030: - [SPARK-7542][SQL] Support off-heap index/sort buffer https://github.com/apache/spark/pull/9477 and [SPARK-11389][CORE] Add support for off-heap memory to MemoryManage https://github.com/apache/spark/pull/9344 The problem does not exist once I take out the code changes from these two JIRAs. The code changes of these two JIRAs are intertwined; thus, I assume it is caused by #9477. > Incorrect results when aggregate joined data > > > Key: SPARK-12030 > URL: https://issues.apache.org/jira/browse/SPARK-12030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński >Priority: Blocker > Attachments: spark.jpg, t1.tar.gz, t2.tar.gz > > > I have following issue. > I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2) > {code} > t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache() > t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache() > joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer") > {code} > Important: both table are cached, so results should be the same on every > query. > Then I did come counts: > {code} > t1.count() -> 5900729 > t1.registerTempTable("t1") > sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729 > t2.count() -> 54298 > joined.count() -> 5900729 > {code} > And here magic begins - I counted distinct id1 from joined table > {code} > joined.registerTempTable("joined") > sqlCtx.sql("select distinct(id1) from joined").count() > {code} > Results varies *(are different on every run)* between 5899000 and > 590 but never are equal to 5900729. > In addition. I did more queries: > {code} > sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > > 1").collect() > {code} > This gives some results but this query return *1* > {code} > len(sqlCtx.sql("select * from joined where id1 = result").collect()) > {code} > What's wrong ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11966) Spark API for UDTFs
[ https://issues.apache.org/jira/browse/SPARK-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032877#comment-15032877 ] Michael Armbrust commented on SPARK-11966: -- Have you seen [explode|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L1146]. Does this do what you want, or is something missing? > Spark API for UDTFs > --- > > Key: SPARK-11966 > URL: https://issues.apache.org/jira/browse/SPARK-11966 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Jaka Jancar >Priority: Minor > > Defining UDFs is easy using sqlContext.udf.register, but not table-generating > functions. For those you still have to use these horrendous Hive interfaces: > https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12049) User JVM shutdown hook can cause deadlock at shutdown
[ https://issues.apache.org/jira/browse/SPARK-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-12049. Resolution: Fixed Fix Version/s: 1.6.0 1.5.3 > User JVM shutdown hook can cause deadlock at shutdown > - > > Key: SPARK-12049 > URL: https://issues.apache.org/jira/browse/SPARK-12049 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2, 1.6.0 >Reporter: Sean Owen >Assignee: Sean Owen > Fix For: 1.5.3, 1.6.0 > > > Here's a simplification of a deadlock that can occur at shutdown if the user > app has also installed a shutdown hook to clean up: > - Spark Shutdown Hook thread runs > - {{SparkShutdownHookManager.runAll()}} is invoked, locking > {{SparkShutdownHookManager}} as it is {{synchronized}} > - A user shutdown hook thread runs > - User hook tries to call, for example {{StreamingContext.stop()}}, which is > {{synchronized}} and locks it > - User hook blocks when the {{StreamingContext}} tries to {{remove()}} the > Spark Streaming shutdown task, since it's {{synchronized}} per above > - Spark Shutdown Hook tries to execute the Spark Streaming shutdown task, but > blocks on {{StreamingContext.stop()}} > I think this is actually not that critical, since it requires a pretty > specific setup, and I think it can be worked around in many cases by > integrating with Hadoop's shutdown hook mechanism like Spark does so that > these happen serially. > I also think it's solvable in the code by not locking > {{SparkShutdownHookManager}} in the 3 methods that are {{synchronized}} since > these are really only protecting {{hooks}}. {{runAll()}} shouldn't hold the > lock while executing hooks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
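A rough sketch of the workaround mentioned in the description above, i.e. registering the user cleanup through Hadoop's {{ShutdownHookManager}} (which Spark also uses) instead of a raw JVM shutdown hook, so the hooks run serially rather than concurrently. The priority value below is an arbitrary assumption.
{code}
import org.apache.hadoop.util.ShutdownHookManager
import org.apache.spark.streaming.StreamingContext

// Register the user cleanup in the same hook manager Spark uses, so it runs
// serially with Spark's own shutdown tasks instead of as a concurrent JVM hook.
def registerCleanup(ssc: StreamingContext): Unit = {
  ShutdownHookManager.get().addShutdownHook(new Runnable {
    override def run(): Unit = ssc.stop()
  }, 50) // priority chosen arbitrarily here; higher priorities run earlier
}
{code}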
[jira] [Commented] (SPARK-12000) `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation
[ https://issues.apache.org/jira/browse/SPARK-12000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032863#comment-15032863 ] Josh Rosen commented on SPARK-12000: Here's the full stacktrace of the compiler crash: {code} last tree to typer: Literal(Constant(1.5.0)) symbol: null symbol definition: null tpe: String("1.5.0") symbol owners: context owners: value -> package clustering == Enclosing template or block == Apply( new Since."" "1.5.0" ) == Expanded type of tree == ConstantType(value = Constant(1.5.0)) no-symbol does not have an owner at scala.reflect.internal.SymbolTable.abort(SymbolTable.scala:49) at scala.tools.nsc.Global.abort(Global.scala:254) at scala.reflect.internal.Symbols$NoSymbol.owner(Symbols.scala:3257) at scala.tools.nsc.symtab.classfile.ClassfileParser.addEnclosingTParams(ClassfileParser.scala:585) at scala.tools.nsc.symtab.classfile.ClassfileParser.parseClass(ClassfileParser.scala:530) at scala.tools.nsc.symtab.classfile.ClassfileParser.parse(ClassfileParser.scala:88) at scala.tools.nsc.symtab.SymbolLoaders$ClassfileLoader.doComplete(SymbolLoaders.scala:261) at scala.tools.nsc.symtab.SymbolLoaders$SymbolLoader.complete(SymbolLoaders.scala:194) at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1231) at scala.tools.nsc.doc.base.MemberLookupBase$$anonfun$cleanupBogusClasses$1$1.apply(MemberLookupBase.scala:153) at scala.tools.nsc.doc.base.MemberLookupBase$$anonfun$cleanupBogusClasses$1$1.apply(MemberLookupBase.scala:153) at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263) at scala.collection.AbstractTraversable.filter(Traversable.scala:105) at scala.tools.nsc.doc.base.MemberLookupBase$class.cleanupBogusClasses$1(MemberLookupBase.scala:153) at scala.tools.nsc.doc.base.MemberLookupBase$class.lookupInTemplate(MemberLookupBase.scala:164) at scala.tools.nsc.doc.base.MemberLookupBase$class.scala$tools$nsc$doc$base$MemberLookupBase$$lookupInTemplate(MemberLookupBase.scala:128) at scala.tools.nsc.doc.base.MemberLookupBase$class.lookupInRootPackage(MemberLookupBase.scala:115) at scala.tools.nsc.doc.base.MemberLookupBase$class.memberLookup(MemberLookupBase.scala:52) at scala.tools.nsc.doc.DocFactory$$anon$1.memberLookup(DocFactory.scala:78) at scala.tools.nsc.doc.base.MemberLookupBase$$anon$1.link$lzycompute(MemberLookupBase.scala:27) at scala.tools.nsc.doc.base.MemberLookupBase$$anon$1.link(MemberLookupBase.scala:27) at scala.tools.nsc.doc.base.comment.EntityLink$.unapply(Body.scala:75) at scala.tools.nsc.doc.html.HtmlPage.inlineToHtml(HtmlPage.scala:126) at scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115) at scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at scala.tools.nsc.doc.html.HtmlPage.inlineToHtml(HtmlPage.scala:115) at scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115) at scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115) at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at scala.tools.nsc.doc.html.HtmlPage.inlineToHtml(HtmlPage.scala:115) at scala.tools.nsc.doc.html.HtmlPage.inlineToHtml(HtmlPage.scala:124) at scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115) at scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.immutable.List.foreach(List.scala:318) at scala.colle
[jira] [Commented] (SPARK-12030) Incorrect results when aggregate joined data
[ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032857#comment-15032857 ] Davies Liu commented on SPARK-12030: [~smilegator] Could you post the related PRs here, so we can also look into them? Thanks! > Incorrect results when aggregate joined data > > > Key: SPARK-12030 > URL: https://issues.apache.org/jira/browse/SPARK-12030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński >Priority: Blocker > Attachments: spark.jpg, t1.tar.gz, t2.tar.gz > > > I have following issue. > I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2) > {code} > t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache() > t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache() > joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer") > {code} > Important: both table are cached, so results should be the same on every > query. > Then I did come counts: > {code} > t1.count() -> 5900729 > t1.registerTempTable("t1") > sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729 > t2.count() -> 54298 > joined.count() -> 5900729 > {code} > And here magic begins - I counted distinct id1 from joined table > {code} > joined.registerTempTable("joined") > sqlCtx.sql("select distinct(id1) from joined").count() > {code} > Results varies *(are different on every run)* between 5899000 and > 590 but never are equal to 5900729. > In addition. I did more queries: > {code} > sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > > 1").collect() > {code} > This gives some results but this query return *1* > {code} > len(sqlCtx.sql("select * from joined where id1 = result").collect()) > {code} > What's wrong ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12007) Network library's RPC layer requires a lot of copying
[ https://issues.apache.org/jira/browse/SPARK-12007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-12007. --- Resolution: Fixed Fix Version/s: 1.6.0 > Network library's RPC layer requires a lot of copying > - > > Key: SPARK-12007 > URL: https://issues.apache.org/jira/browse/SPARK-12007 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 1.6.0 > > > The network library's RPC layer has an external API based on byte arrays, > instead of ByteBuffer; that requires a lot of copying since the internals of > the library use ByteBuffers (or rather Netty's ByteBuf), and lots of external > clients also use ByteBuffer. > The extra copies could be avoided if the API used ByteBuffer instead. > To show an extreme case, look at an RPC send via NettyRpcEnv: > - message is encoded using JavaSerializer, resulting in a ByteBuffer > - the ByteBuffer is copied into a byte array of the right size, since its > internal array may be larger than the actual data it holds > - the network library's encoder copies the byte array into a ByteBuf > - finally the data is written to the socket > The intermediate 2 copies could be avoided if the API allowed the original > ByteBuffer to be sent instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
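Purely to illustrate the extra copy described above (this is not Spark's actual code): a serializer hands back a {{ByteBuffer}} whose backing array may be larger than its contents, so a byte-array API forces a right-sized copy, and the network encoder then copies that array again into a Netty {{ByteBuf}}.
{code}
import java.nio.ByteBuffer

// Copy #1: trim the serialized message into a right-sized byte[] for a byte-array API.
// Copy #2 happens later when the network encoder copies that array into a ByteBuf.
def toRightSizedArray(serialized: ByteBuffer): Array[Byte] = {
  val out = new Array[Byte](serialized.remaining())
  serialized.duplicate().get(out) // duplicate() leaves the caller's read position untouched
  out
}
{code}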
[jira] [Resolved] (SPARK-12037) Executors use heartbeatReceiverRef to report heartbeats and task metrics that might not be initialized and leads to NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-12037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-12037. --- Resolution: Fixed Assignee: Nan Zhu Fix Version/s: 1.6.0 Target Version/s: 1.6.0 > Executors use heartbeatReceiverRef to report heartbeats and task metrics that > might not be initialized and leads to NullPointerException > > > Key: SPARK-12037 > URL: https://issues.apache.org/jira/browse/SPARK-12037 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 > Environment: The latest sources at revision {{c793d2d}} >Reporter: Jacek Laskowski >Assignee: Nan Zhu > Fix For: 1.6.0 > > > When {{Executor}} starts it starts driver heartbeater (using > {{startDriverHeartbeater()}}) that uses {{heartbeatReceiverRef}} that is > initialized later and there is a possibility of NullPointerException (after > {{spark.executor.heartbeatInterval}} or {{10s}}). > {code} > WARN Executor: Issue communicating with driver in heartbeater > java.lang.NullPointerException > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:447) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:467) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:467) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:467) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1717) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:467) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
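A minimal sketch of the race the ticket describes, with made-up names and types rather than Spark's actual fields: the scheduled heartbeat task can fire before the receiver ref is assigned, so either the ref must be initialized before the heartbeater starts or the report must guard against the uninitialized state.
{code}
// Illustrative stand-in for the RPC endpoint reference; not Spark's actual type.
trait ReceiverRef { def send(msg: Any): Unit }

class HeartbeaterSketch {
  // Assigned during executor setup, possibly after the heartbeat task is already scheduled.
  @volatile private var heartbeatReceiverRef: ReceiverRef = _

  def reportHeartBeat(): Unit = {
    val ref = heartbeatReceiverRef
    if (ref == null) return // setup not finished yet: skip this interval instead of throwing an NPE
    ref.send("heartbeat")   // placeholder for building and sending the real Heartbeat message
  }
}
{code}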
[jira] [Resolved] (SPARK-12035) Add more debug information in include_example tag of Jekyll
[ https://issues.apache.org/jira/browse/SPARK-12035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-12035. --- Resolution: Fixed Assignee: Xusen Yin Fix Version/s: 1.6.0 Target Version/s: 1.6.0 > Add more debug information in include_example tag of Jekyll > --- > > Key: SPARK-12035 > URL: https://issues.apache.org/jira/browse/SPARK-12035 > Project: Spark > Issue Type: Improvement > Components: Build, Documentation >Reporter: Xusen Yin >Assignee: Xusen Yin >Priority: Minor > Labels: documentation > Fix For: 1.6.0 > > > Add more debug information in the include_example tag of Jekyll, so that we > can know more when facing with errors of `jekyll build`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12064) Make the SqlParser as trait for better integrated with extensions
[ https://issues.apache.org/jira/browse/SPARK-12064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12064: Assignee: Apache Spark > Make the SqlParser as trait for better integrated with extensions > - > > Key: SPARK-12064 > URL: https://issues.apache.org/jira/browse/SPARK-12064 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Cheng Hao >Assignee: Apache Spark > > `SqlParser` is now an object, which makes it hard to reuse in extensions. A better > implementation would make `SqlParser` a trait, keep all of its > implementation unchanged, and then add another object called `SqlParser` that > inherits from the trait. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12064) Make the SqlParser as trait for better integrated with extensions
[ https://issues.apache.org/jira/browse/SPARK-12064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12064: Assignee: (was: Apache Spark) > Make the SqlParser as trait for better integrated with extensions > - > > Key: SPARK-12064 > URL: https://issues.apache.org/jira/browse/SPARK-12064 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Cheng Hao > > `SqlParser` is now an object, which makes it hard to reuse in extensions. A better > implementation would make `SqlParser` a trait, keep all of its > implementation unchanged, and then add another object called `SqlParser` that > inherits from the trait. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12064) Make the SqlParser as trait for better integrated with extensions
[ https://issues.apache.org/jira/browse/SPARK-12064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032842#comment-15032842 ] Apache Spark commented on SPARK-12064: -- User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/10053 > Make the SqlParser as trait for better integrated with extensions > - > > Key: SPARK-12064 > URL: https://issues.apache.org/jira/browse/SPARK-12064 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Cheng Hao > > `SqlParser` is now an object, which makes it hard to reuse in extensions. A better > implementation would make `SqlParser` a trait, keep all of its > implementation unchanged, and then add another object called `SqlParser` that > inherits from the trait. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
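A rough sketch of the shape being proposed above (names and signatures are simplified assumptions; the real parser has more entry points): the implementation moves into a trait that external dialects can extend, while an object keeps the old {{SqlParser}} name for existing callers.
{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// The existing parser implementation would live here, unchanged, so external
// dialects can mix it in and override only what they need.
trait SqlParserBase {
  def parsePlan(sqlText: String): LogicalPlan = ??? // placeholder for the current implementation
}

// Keeps the old name and behaviour for existing callers.
object SqlParser extends SqlParserBase
{code}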