[jira] [Resolved] (SPARK-13408) Exception in resultHandler will shutdown SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13408. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11280 [https://github.com/apache/spark/pull/11280] > Exception in resultHandler will shutdown SparkContext > - > > Key: SPARK-13408 > URL: https://issues.apache.org/jira/browse/SPARK-13408 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > {code} > davies@localhost:~/work/spark$ bin/spark-submit > python/pyspark/sql/dataframe.py > NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes > ahead of assembly. > 16/02/19 12:46:24 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/02/19 12:46:24 WARN Utils: Your hostname, localhost resolves to a loopback > address: 127.0.0.1; using 192.168.0.143 instead (on interface en0) > 16/02/19 12:46:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > ** > File > "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", > line 554, in pyspark.sql.dataframe.DataFrame.alias > Failed example: > joined_df.select(col("df_as1.name"), col("df_as2.name"), > col("df_as2.age")).collect() > Differences (ndiff with -expected +actual): > - [Row(name=u'Bob', name=u'Bob', age=5), Row(name=u'Alice', > name=u'Alice', age=2)] > + [Row(name=u'Alice', name=u'Alice', age=2), Row(name=u'Bob', > name=u'Bob', age=5)] > org.apache.spark.SparkDriverExecutionException: Execution error > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) > at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) > at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649) > at java.util.PriorityQueue.siftUp(PriorityQueue.java:627) > at java.util.PriorityQueue.offer(PriorityQueue.java:329) > at > org.apache.spark.util.BoundedPriorityQueue.$plus$eq(BoundedPriorityQueue.scala:47) > at > org.apache.spark.util.BoundedPriorityQueue$$anonfun$$plus$plus$eq$1.apply(BoundedPriorityQueue.scala:41) > at > org.apache.spark.util.BoundedPriorityQueue$$anonfun$$plus$plus$eq$1.apply(BoundedPriorityQueue.scala:41) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.util.BoundedPriorityQueue.foreach(BoundedPriorityQueue.scala:31) > at > org.apache.spark.util.BoundedPriorityQueue.$plus$plus$eq(BoundedPriorityQueue.scala:41) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$apply$46.apply(RDD.scala:1319) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$apply$46.apply(RDD.scala:1318) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:932) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:929) > at > org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:57) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1185) > ... 
4 more > org.apache.spark.SparkDriverExecutionException: Execution error > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > Caused by: java.lang.NullPointerException > at >
[jira] [Comment Edited] (SPARK-13409) Log the stacktrace when stopping a SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155441#comment-15155441 ] Reynold Xin edited comment on SPARK-13409 at 2/20/16 6:58 AM: -- OK that makes sense. It wasn't clear to me what you meant by "remembering". You are proposing adding a field to SparkContext and a function that can be used to retrieve the stacktrace when SparkContext stops? was (Author: rxin): OK that makes sense. It wasn't clear to me what you meant by "remembering". You are proposing adding a field to SparkContext that can be used to retrieve the stacktrace when SparkContext stops? > Log the stacktrace when stopping a SparkContext > --- > > Key: SPARK-13409 > URL: https://issues.apache.org/jira/browse/SPARK-13409 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu > > Sometimes we see a stopped SparkContext and have no idea what stopped it; we > should log that for troubleshooting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13304) Broadcast join with two ints could be very slow
[ https://issues.apache.org/jira/browse/SPARK-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13304. Resolution: Fixed Assignee: Davies Liu Fix Version/s: 2.0.0 Fixed by https://github.com/apache/spark/pull/11130 > Broadcast join with two ints could be very slow > --- > > Key: SPARK-13304 > URL: https://issues.apache.org/jira/browse/SPARK-13304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > If the two join columns have the same value, the hash code of them will be (a > ^ b), which is 0, then the HashMap will be very very slow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
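The slowdown described above can be sketched in a few lines of Python (hypothetical helper names; this is not Spark's actual hashing code, just an illustration of why XOR-combining two equal join columns is pathological):

```python
def xor_key_hash(a: int, b: int) -> int:
    # Simplified stand-in for hashing a two-int join key as (a ^ b).
    return a ^ b

def bucket_sizes(keys, num_buckets=16):
    # Count how many keys land in each bucket of a hash table.
    sizes = [0] * num_buckets
    for a, b in keys:
        sizes[xor_key_hash(a, b) % num_buckets] += 1
    return sizes

# When both join columns hold the same value, a ^ b is always 0, so every
# key lands in bucket 0 and hash lookups degrade to a linear scan.
degenerate = [(i, i) for i in range(1000)]
assert bucket_sizes(degenerate)[0] == 1000
```

The usual fix for this class of problem is to mix the two fields asymmetrically (e.g. `31 * a + b` or a proper hash combiner) so equal columns no longer cancel out.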
[jira] [Commented] (SPARK-13409) Log the stacktrace when stopping a SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155441#comment-15155441 ] Reynold Xin commented on SPARK-13409: - OK that makes sense. It wasn't clear to me what you meant by "remembering". You are proposing adding a field to SparkContext that can be used to retrieve the stacktrace when SparkContext stops? > Log the stacktrace when stopping a SparkContext > --- > > Key: SPARK-13409 > URL: https://issues.apache.org/jira/browse/SPARK-13409 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu > > Sometimes we see a stopped SparkContext and have no idea what stopped it; we > should log that for troubleshooting.
[jira] [Commented] (SPARK-13409) Log the stacktrace when stopping a SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155439#comment-15155439 ] Davies Liu commented on SPARK-13409: [~rxin] I think we should remember the stacktrace and include it in the error message raised when someone tries to access the stopped SparkContext; that would be much more useful than just logging it (the log is hard to find). We already do something similar when creating a SparkContext. > Log the stacktrace when stopping a SparkContext > --- > > Key: SPARK-13409 > URL: https://issues.apache.org/jira/browse/SPARK-13409 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu > > Sometimes we see a stopped SparkContext and have no idea what stopped it; we > should log that for troubleshooting.
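The proposal in this thread can be sketched roughly as follows (a Python illustration with hypothetical class and method names; the real change would live in SparkContext's Scala code): capture the stacktrace at stop() time, then surface it in the error raised on later access.

```python
import traceback

class StoppableContext:
    """Hypothetical stand-in for SparkContext, to illustrate the idea."""

    def __init__(self):
        self._stop_site = None  # stacktrace captured when stop() is called

    def stop(self):
        # Remember where stop() was called from, not just that it happened.
        self._stop_site = "".join(traceback.format_stack())

    def assert_not_stopped(self):
        if self._stop_site is not None:
            # Put the remembered stacktrace in the error itself, so the user
            # does not have to dig through logs to find who stopped it.
            raise RuntimeError(
                "Context was already stopped; stop() was called at:\n"
                + self._stop_site)

ctx = StoppableContext()
ctx.stop()
try:
    ctx.assert_not_stopped()
except RuntimeError as e:
    assert "stop() was called at" in str(e)
```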
[jira] [Commented] (SPARK-13213) BroadcastNestedLoopJoin is very slow
[ https://issues.apache.org/jira/browse/SPARK-13213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155438#comment-15155438 ] Davies Liu commented on SPARK-13213: It depends; I'm open to any reasonable solution. > BroadcastNestedLoopJoin is very slow > > > Key: SPARK-13213 > URL: https://issues.apache.org/jira/browse/SPARK-13213 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > Since we have improved the performance of CartesianProduct, which should be > faster and more robust than BroadcastNestedLoopJoin, we should use > CartesianProduct instead of BroadcastNestedLoopJoin, especially when the > broadcasted table is not that small. > Today we hit a query that ran for a very long time without finishing; once we > decreased the threshold for broadcast (disabling BroadcastNestedLoopJoin), it > finished in seconds.
[jira] [Assigned] (SPARK-12567) Add aes_encrypt and aes_decrypt UDFs
[ https://issues.apache.org/jira/browse/SPARK-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12567: Assignee: Kai Jiang (was: Apache Spark) > Add aes_encrypt and aes_decrypt UDFs > > > Key: SPARK-12567 > URL: https://issues.apache.org/jira/browse/SPARK-12567 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Kai Jiang >Assignee: Kai Jiang > Fix For: 2.0.0 > > > AES (Advanced Encryption Standard) algorithm. > Add aes_encrypt and aes_decrypt UDFs. > Ref: > [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions] > [MySQL|https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_aes-decrypt] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-12567) Add aes_encrypt and aes_decrypt UDFs
[ https://issues.apache.org/jira/browse/SPARK-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-12567: - > Add aes_encrypt and aes_decrypt UDFs > > > Key: SPARK-12567 > URL: https://issues.apache.org/jira/browse/SPARK-12567 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Kai Jiang >Assignee: Kai Jiang > Fix For: 2.0.0 > > > AES (Advanced Encryption Standard) algorithm. > Add aes_encrypt and aes_decrypt UDFs. > Ref: > [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions] > [MySQL|https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_aes-decrypt] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12567) Add aes_encrypt and aes_decrypt UDFs
[ https://issues.apache.org/jira/browse/SPARK-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12567: Assignee: Apache Spark (was: Kai Jiang) > Add aes_encrypt and aes_decrypt UDFs > > > Key: SPARK-12567 > URL: https://issues.apache.org/jira/browse/SPARK-12567 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Kai Jiang >Assignee: Apache Spark > Fix For: 2.0.0 > > > AES (Advanced Encryption Standard) algorithm. > Add aes_encrypt and aes_decrypt UDFs. > Ref: > [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions] > [MySQL|https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_aes-decrypt] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12567) Add aes_encrypt and aes_decrypt UDFs
[ https://issues.apache.org/jira/browse/SPARK-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12567: Summary: Add aes_encrypt and aes_decrypt UDFs (was: Add aes_{encrypt,decrypt} UDFs) > Add aes_encrypt and aes_decrypt UDFs > > > Key: SPARK-12567 > URL: https://issues.apache.org/jira/browse/SPARK-12567 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Kai Jiang >Assignee: Kai Jiang > Fix For: 2.0.0 > > > AES (Advanced Encryption Standard) algorithm. > Add aes_encrypt and aes_decrypt UDFs. > Ref: > [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions] > [MySQL|https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_aes-decrypt] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12567) Add aes_{encrypt,decrypt} UDFs
[ https://issues.apache.org/jira/browse/SPARK-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12567. Resolution: Fixed Assignee: Kai Jiang Fix Version/s: 2.0.0 > Add aes_{encrypt,decrypt} UDFs > -- > > Key: SPARK-12567 > URL: https://issues.apache.org/jira/browse/SPARK-12567 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Kai Jiang >Assignee: Kai Jiang > Fix For: 2.0.0 > > > AES (Advanced Encryption Standard) algorithm. > Add aes_encrypt and aes_decrypt UDFs. > Ref: > [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions] > [MySQL|https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_aes-decrypt] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12594) Outer Join Elimination by Filter Condition
[ https://issues.apache.org/jira/browse/SPARK-12594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12594. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10567 [https://github.com/apache/spark/pull/10567] > Outer Join Elimination by Filter Condition > -- > > Key: SPARK-12594 > URL: https://issues.apache.org/jira/browse/SPARK-12594 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Priority: Critical > Fix For: 2.0.0 > > > Eliminate outer joins when the predicates in the filter condition restrict > the result set so that all null-supplying rows are eliminated: > - full outer -> inner if both sides have such predicates > - left outer -> inner if the right side has such predicates > - right outer -> inner if the left side has such predicates > - full outer -> left outer if only the left side has such predicates > - full outer -> right outer if only the right side has such predicates > Where applicable, this can greatly improve performance.
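The rewrite table in the issue description can be sketched as a small decision function (a hedged Python illustration of the rule, not the optimizer's actual code; predicate and type names are made up for clarity):

```python
def eliminate_outer_join(join_type, left_rejects_nulls, right_rejects_nulls):
    # join_type: "full_outer", "left_outer", or "right_outer".
    # *_rejects_nulls: whether the filter condition discards rows whose
    # columns from that side were null-supplied by the outer join.
    if join_type == "full_outer":
        if left_rejects_nulls and right_rejects_nulls:
            return "inner"
        if left_rejects_nulls:
            return "left_outer"
        if right_rejects_nulls:
            return "right_outer"
    if join_type == "left_outer" and right_rejects_nulls:
        return "inner"
    if join_type == "right_outer" and left_rejects_nulls:
        return "inner"
    return join_type  # no helpful predicate: keep the original join

# e.g. a filter like "right.col > 0" rejects null-extended right rows,
# so a left outer join can safely become an inner join.
assert eliminate_outer_join("left_outer", False, True) == "inner"
assert eliminate_outer_join("full_outer", True, False) == "left_outer"
```

The intuition: an outer join only differs from an inner join in the null-extended rows it adds, so a later filter that is guaranteed to drop those rows makes the cheaper join type equivalent.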
[jira] [Commented] (SPARK-13213) BroadcastNestedLoopJoin is very slow
[ https://issues.apache.org/jira/browse/SPARK-13213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155418#comment-15155418 ] Reynold Xin commented on SPARK-13213: - [~davies] what is this ticket about? Is it about making BroadcastNestedLoopJoin faster, or using CartesianProduct as much as possible? > BroadcastNestedLoopJoin is very slow > > > Key: SPARK-13213 > URL: https://issues.apache.org/jira/browse/SPARK-13213 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > Since we have improved the performance of CartesianProduct, which should be > faster and more robust than BroadcastNestedLoopJoin, we should use > CartesianProduct instead of BroadcastNestedLoopJoin, especially when the > broadcasted table is not that small. > Today we hit a query that ran for a very long time without finishing; once we > decreased the threshold for broadcast (disabling BroadcastNestedLoopJoin), it > finished in seconds.
[jira] [Commented] (SPARK-13382) Update PySpark testing notes
[ https://issues.apache.org/jira/browse/SPARK-13382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155415#comment-15155415 ] holdenk commented on SPARK-13382: - Note: while I've got a PR we should also update the wiki and I don't have permission to edit the wiki. > Update PySpark testing notes > > > Key: SPARK-13382 > URL: https://issues.apache.org/jira/browse/SPARK-13382 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Reporter: holdenk >Priority: Trivial > > As discussed on the mailing list, running the full python tests requires that > Spark is built with the hive assembly. We should update both the wiki and the > build instructions for Python to mention this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12720) SQL generation support for cube, rollup, and grouping set
[ https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155390#comment-15155390 ] Xiao Li commented on SPARK-12720: - Expression {{grouping_id()}} is needed for resolving this JIRA. Thus, it is blocked by SPARK-12799, which implements that expression. > SQL generation support for cube, rollup, and grouping set > - > > Key: SPARK-12720 > URL: https://issues.apache.org/jira/browse/SPARK-12720 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Xiao Li > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details.
[jira] [Commented] (SPARK-13391) Use Apache Arrow as In-memory columnar store implementation
[ https://issues.apache.org/jira/browse/SPARK-13391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155268#comment-15155268 ] Wes McKinney commented on SPARK-13391: -- Indeed, one of the major motivations of Arrow (for Python and R) is higher data throughput between native pandas / R data-frame memory representation and Spark. I will be looking to add C-level data marshaling algorithms between Arrow and pandas (via NumPy arrays) to the Arrow codebase within the next couple of months. Will cross-post JIRAs as they develop > Use Apache Arrow as In-memory columnar store implementation > --- > > Key: SPARK-13391 > URL: https://issues.apache.org/jira/browse/SPARK-13391 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej BryĆski > > Idea. > Apache Arrow (http://arrow.apache.org/) is Open Source implementation of > inmemory columnar store. It has APIs in many programming languages. > We can think about using it in Apache Spark to avoid data (de-)serialization > when running PySpark (and R) UDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155263#comment-15155263 ] Varadharajan commented on SPARK-13393: -- Hi, I just tried the same scenario with prebuilt 1.6.0 and it still has this issue. I will give master branch a try today evening. > Column mismatch issue in left_outer join using Spark DataFrame > -- > > Key: SPARK-13393 > URL: https://issues.apache.org/jira/browse/SPARK-13393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Varadharajan > > Consider the below snippet: > {code:title=test.scala|borderStyle=solid} > case class Person(id: Int, name: String) > val df = sc.parallelize(List( > Person(1, "varadha"), > Person(2, "nagaraj") > )).toDF > val varadha = df.filter("id = 1") > val errorDF = df.join(varadha, df("id") === varadha("id"), > "left_outer").select(df("id"), varadha("id") as "varadha_id") > val nagaraj = df.filter("id = 2").select(df("id") as "n_id") > val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), > "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id") > {code} > The `errorDF` dataframe, after the left join is messed up and shows as below: > | id|varadha_id| > | 1| 1| > | 2| 2 (*This should've been null*)| > whereas correctDF has the correct output after the left join: > | id|nagaraj_id| > | 1| null| > | 2| 2| -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varadharajan updated SPARK-13393: - Description: Consider the below snippet: {code:title=test.scala|borderStyle=solid} case class Person(id: Int, name: String) val df = sc.parallelize(List( Person(1, "varadha"), Person(2, "nagaraj") )).toDF val varadha = df.filter("id = 1") val errorDF = df.join(varadha, df("id") === varadha("id"), "left_outer").select(df("id"), varadha("id") as "varadha_id") val nagaraj = df.filter("id = 2").select(df("id") as "n_id") val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id") {code} The `errorDF` dataframe, after the left join is messed up and shows as below: | id|varadha_id| | 1| 1| | 2| 2 (*This should've been null*)| whereas correctDF has the correct output after the left join: | id|nagaraj_id| | 1| null| | 2| 2| was: Consider the below snippet: {code:title=test.scala|borderStyle=solid} class Person(id: Int, name: String) val df = sc.parallelize(List( Person(1, "varadha"), Person(2, "nagaraj") )).toDF val varadha = df.filter("id = 1") val errorDF = df.join(varadha, df("id") === varadha("id"), "left_outer").select(df("id"), varadha("id") as "varadha_id") val nagaraj = df.filter("id = 2").select(df("id") as "n_id") val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id") {code} The `errorDF` dataframe, after the left join is messed up and shows as below: | id|varadha_id| | 1| 1| | 2| 2 (*This should've been null*)| whereas correctDF has the correct output after the left join: | id|nagaraj_id| | 1| null| | 2| 2| > Column mismatch issue in left_outer join using Spark DataFrame > -- > > Key: SPARK-13393 > URL: https://issues.apache.org/jira/browse/SPARK-13393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Varadharajan > > Consider the 
below snippet: > {code:title=test.scala|borderStyle=solid} > case class Person(id: Int, name: String) > val df = sc.parallelize(List( > Person(1, "varadha"), > Person(2, "nagaraj") > )).toDF > val varadha = df.filter("id = 1") > val errorDF = df.join(varadha, df("id") === varadha("id"), > "left_outer").select(df("id"), varadha("id") as "varadha_id") > val nagaraj = df.filter("id = 2").select(df("id") as "n_id") > val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), > "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id") > {code} > The `errorDF` dataframe, after the left join is messed up and shows as below: > | id|varadha_id| > | 1| 1| > | 2| 2 (*This should've been null*)| > whereas correctDF has the correct output after the left join: > | id|nagaraj_id| > | 1| null| > | 2| 2| -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10001) Allow Ctrl-C in spark-shell to kill running job
[ https://issues.apache.org/jira/browse/SPARK-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155232#comment-15155232 ] Jon Maurer commented on SPARK-10001: Thank you for your feedback and consideration. I opened a new enhancement for a confirmation prompt prior to exiting the shell. https://issues.apache.org/jira/browse/SPARK-13412 > Allow Ctrl-C in spark-shell to kill running job > --- > > Key: SPARK-10001 > URL: https://issues.apache.org/jira/browse/SPARK-10001 > Project: Spark > Issue Type: Improvement > Components: Spark Shell >Affects Versions: 1.4.1 >Reporter: Cheolsoo Park >Priority: Minor > > Hitting Ctrl-C in spark-sql (and other tools like presto) cancels any running > job and starts a new input line on the prompt. It would be nice if > spark-shell also can do that. Otherwise, in case a user submits a job, say he > made a mistake, and wants to cancel it, he needs to exit the shell and > re-login to continue his work. Re-login can be a pain especially in Spark on > yarn, since it takes a while to allocate AM container and initial executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13412) Spark Shell Ctrl-C behaviour suggestion
Jon Maurer created SPARK-13412: -- Summary: Spark Shell Ctrl-C behaviour suggestion Key: SPARK-13412 URL: https://issues.apache.org/jira/browse/SPARK-13412 Project: Spark Issue Type: Improvement Components: Spark Shell Affects Versions: 1.6.0 Reporter: Jon Maurer Priority: Minor It would be useful to catch the interrupt from a ctrl-c and prompt for confirmation prior to closing spark shell. This is currently an issue when sitting at an idle prompt: if a user accidentally enters ctrl-c, all previous progress is lost and must be run again. Instead, the desired behavior would be to prompt the user to enter 'yes' or another ctrl-c to exit the shell, thus preventing rework. There is related discussion about this sort of feature on the Scala issue tracker: https://issues.scala-lang.org/browse/SI-6302
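The suggested behavior can be sketched in Python (a hypothetical illustration only; spark-shell is a Scala/JLine REPL, so the real fix would hook the REPL's interrupt handling rather than use these names):

```python
import signal

def should_exit(answer: str) -> bool:
    # Only an explicit "yes" confirms the exit.
    return answer.strip().lower() == "yes"

def on_interrupt(signum, frame):
    # Intercept Ctrl-C at an idle prompt instead of killing the shell.
    try:
        answer = input("\nReally exit the shell? (yes/no) ")
    except EOFError:
        answer = "yes"  # input stream closed: give up and exit
    if should_exit(answer):
        raise SystemExit(0)
    print("Resuming; press Ctrl-C again and answer 'yes' to exit.")

def install_handler():
    # Call once at shell startup; a stray Ctrl-C then asks for confirmation
    # before discarding the session's accumulated state.
    signal.signal(signal.SIGINT, on_interrupt)
```

Note that Python's `signal.signal` may only be called from the main thread, which is why the installation is wrapped in a function the shell would invoke during startup.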
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155229#comment-15155229 ] Xiao Li commented on SPARK-1: - Thanks! > DataFrame filter + randn + unionAll has bad interaction > --- > > Key: SPARK-1 > URL: https://issues.apache.org/jira/browse/SPARK-1 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.2, 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley > > Buggy workflow > * Create a DataFrame df0 > * Filter df0 > * Add a randn column > * Create a copy of the DataFrame > * unionAll the two DataFrames > This fails, where randn produces the same results on the original DataFrame > and the copy before unionAll but fails to do so after unionAll. Removing the > filter fixes the problem. > The bug can be reproduced on master: > {code} > import org.apache.spark.sql.functions.randn > val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id") > // Removing the following filter() call makes this give the expected result. > val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13048) EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel
[ https://issues.apache.org/jira/browse/SPARK-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155212#comment-15155212 ] Jeff Stein commented on SPARK-13048: As an aside, the code in the clustering namespace violates the [open/closed principle](https://en.wikipedia.org/wiki/Open/closed_principle). - LDAOptimizer is unnecessarily a sealed trait (I understand it's a developer api, but I'm a developer...) - EMLDAOptimizer is final - Lots of private[clustering] All of this meant that writing a decent workaround for the bug took a lot more code than I would have hoped. > EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel > -- > > Key: SPARK-13048 > URL: https://issues.apache.org/jira/browse/SPARK-13048 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 > Environment: Standalone Spark cluster >Reporter: Jeff Stein > > In EMLDAOptimizer, all checkpoints are deleted before returning the > DistributedLDAModel. > The most recent checkpoint is still necessary for operations on the > DistributedLDAModel under a couple scenarios: > - The graph doesn't fit in memory on the worker nodes (e.g. very large data > set). > - Late worker failures that require reading the now-dependent checkpoint. > I ran into this problem running a 10M record LDA model in a memory starved > environment. The model consistently failed in either the {{collect at > LDAModel.scala:528}} stage (when converting to a LocalLDAModel) or in the > {{reduce at LDAModel.scala:563}} stage (when calling "describeTopics" on the > model). In both cases, a FileNotFoundException is thrown attempting to access > a checkpoint file. > I'm not sure what the correct fix is here; it might involve a class signature > change. An alternative simple fix is to leave the last checkpoint around and > expect the user to clean the checkpoint directory themselves. 
> {noformat} > java.io.FileNotFoundException: File does not exist: > /hdfs/path/to/checkpoints/c8bd2b4e-27dd-47b3-84ec-3ff0bac04587/rdd-635/part-00071 > {noformat} > Relevant code is included below. > LDAOptimizer.scala: > {noformat} > override private[clustering] def getLDAModel(iterationTimes: > Array[Double]): LDAModel = { > require(graph != null, "graph is null, EMLDAOptimizer not initialized.") > this.graphCheckpointer.deleteAllCheckpoints() > // The constructor's default arguments assume gammaShape = 100 to ensure > equivalence in > // LDAModel.toLocal conversion > new DistributedLDAModel(this.graph, this.globalTopicTotals, this.k, > this.vocabSize, > Vectors.dense(Array.fill(this.k)(this.docConcentration)), > this.topicConcentration, > iterationTimes) > } > {noformat} > PeriodicCheckpointer.scala > {noformat} > /** >* Call this at the end to delete any remaining checkpoint files. >*/ > def deleteAllCheckpoints(): Unit = { > while (checkpointQueue.nonEmpty) { > removeCheckpointFile() > } > } > /** >* Dequeue the oldest checkpointed Dataset, and remove its checkpoint files. >* This prints a warning but does not fail if the files cannot be removed. >*/ > private def removeCheckpointFile(): Unit = { > val old = checkpointQueue.dequeue() > // Since the old checkpoint is not deleted by Spark, we manually delete > it. > val fs = FileSystem.get(sc.hadoopConfiguration) > getCheckpointFiles(old).foreach { checkpointFile => > try { > fs.delete(new Path(checkpointFile), true) > } catch { > case e: Exception => > logWarning("PeriodicCheckpointer could not remove old checkpoint > file: " + > checkpointFile) > } > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13407) TaskMetrics.fromAccumulatorUpdates can crash when trying to access garbage-collected accumulators
[ https://issues.apache.org/jira/browse/SPARK-13407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-13407. Resolution: Fixed Fix Version/s: 2.0.0 Fixed by my patch. > TaskMetrics.fromAccumulatorUpdates can crash when trying to access > garbage-collected accumulators > - > > Key: SPARK-13407 > URL: https://issues.apache.org/jira/browse/SPARK-13407 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > TaskMetrics.fromAccumulatorUpdates can fail if accumulators have been > garbage-collected: > {code} > java.lang.IllegalAccessError: Attempted to access garbage collected > accumulator 481596 > at > org.apache.spark.Accumulators$$anonfun$get$1$$anonfun$apply$1.apply(Accumulator.scala:133) > at > org.apache.spark.Accumulators$$anonfun$get$1$$anonfun$apply$1.apply(Accumulator.scala:133) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.Accumulators$$anonfun$get$1.apply(Accumulator.scala:132) > at > org.apache.spark.Accumulators$$anonfun$get$1.apply(Accumulator.scala:130) > at scala.Option.map(Option.scala:145) > at org.apache.spark.Accumulators$.get(Accumulator.scala:130) > at > org.apache.spark.executor.TaskMetrics$$anonfun$9.apply(TaskMetrics.scala:414) > at > org.apache.spark.executor.TaskMetrics$$anonfun$9.apply(TaskMetrics.scala:412) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > 
org.apache.spark.executor.TaskMetrics$.fromAccumulatorUpdates(TaskMetrics.scala:412) > at > org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onExecutorMetricsUpdate$2.apply(JobProgressListener.scala:499) > at > org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onExecutorMetricsUpdate$2.apply(JobProgressListener.scala:493) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at > org.apache.spark.ui.jobs.JobProgressListener.onExecutorMetricsUpdate(JobProgressListener.scala:493) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:56) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:81) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1178) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64) > {code} > In order to guard 
against this, we can eliminate the need to access > driver-side accumulators when constructing TaskMetrics.
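The failure mode above — a weakly-referenced accumulator collected before a listener reads it — and the guard direction can be modeled without Spark. A hedged Python sketch (registry and function names are invented for illustration; Spark's registry lives in `org.apache.spark.Accumulators`):

```python
import gc
import weakref

# Minimal model of a driver-side registry holding weak references.
registry = {}

class Accum:
    def __init__(self, acc_id, value=0):
        self.id = acc_id
        self.value = value
        registry[acc_id] = weakref.ref(self)

def get_strict(acc_id):
    # Pre-fix behavior: assume the accumulator is still alive. If it was
    # garbage-collected the weak ref yields None and we raise, mirroring
    # the IllegalAccessError in the stack trace above.
    acc = registry[acc_id]()
    if acc is None:
        raise RuntimeError(
            f"Attempted to access garbage collected accumulator {acc_id}")
    return acc.value

def get_guarded(acc_id, default=None):
    # Post-fix idea: never depend on the driver-side object being alive;
    # fall back instead of crashing the listener thread.
    ref = registry.get(acc_id)
    acc = ref() if ref is not None else None
    return acc.value if acc is not None else default

a = Accum(481596, value=7)
del a
gc.collect()

try:
    get_strict(481596)
except RuntimeError as e:
    print(e)

print(get_guarded(481596, default=0))  # falls back instead of crashing
```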
[jira] [Updated] (SPARK-13329) Considering output for statistics of logical plan
[ https://issues.apache.org/jira/browse/SPARK-13329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-13329: -- Summary: Considering output for statistics of logical plan (was: Considering output for statistics of logicol plan) > Considering output for statistics of logical plan > - > > Key: SPARK-13329 > URL: https://issues.apache.org/jira/browse/SPARK-13329 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > The current implementation of statistics of UnaryNode does not consider > output (for example, Project); we should consider it to make a better > guess.
[jira] [Updated] (SPARK-13409) Log the stacktrace when stopping a SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-13409: Description: Sometimes we see a stopped SparkContext but have no idea what stopped it; we should log that for troubleshooting. (was: Somethings we saw a stopped SparkContext, then have no idea it's stopped by what, we should remember that for troubleshooting.) > Log the stacktrace when stopping a SparkContext > --- > > Key: SPARK-13409 > URL: https://issues.apache.org/jira/browse/SPARK-13409 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu > > Sometimes we see a stopped SparkContext but have no idea what stopped it; > we should log that for troubleshooting.
[jira] [Updated] (SPARK-13409) Log the stacktrace when stopping a SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-13409: Summary: Log the stacktrace when stopping a SparkContext (was: Remember the stacktrace when stop a SparkContext) > Log the stacktrace when stopping a SparkContext > --- > > Key: SPARK-13409 > URL: https://issues.apache.org/jira/browse/SPARK-13409 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu > > Sometimes we see a stopped SparkContext but have no idea what stopped it; > we should remember that for troubleshooting.
[jira] [Resolved] (SPARK-13091) Rewrite/Propagate constraints for Aliases
[ https://issues.apache.org/jira/browse/SPARK-13091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13091. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11144 [https://github.com/apache/spark/pull/11144] > Rewrite/Propagate constraints for Aliases > - > > Key: SPARK-13091 > URL: https://issues.apache.org/jira/browse/SPARK-13091 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > Fix For: 2.0.0 > > > We'd want to duplicate constraints when there is an alias (i.e. for "SELECT > a, a AS b", any constraints on a now apply to b) > This is a follow up task based on [~marmbrus]'s suggestion in > https://docs.google.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit#heading=h.6hjcndo36qze
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155047#comment-15155047 ] Reynold Xin commented on SPARK-13333: - OK I think this is going to be really difficult to fix right now. However, once we refactor the sql internals and introduce the concept of a local plan tree, then we might be able to just hash the local plan tree and use that as the seed. > DataFrame filter + randn + unionAll has bad interaction > --- > > Key: SPARK-13333 > URL: https://issues.apache.org/jira/browse/SPARK-13333 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.2, 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley > > Buggy workflow > * Create a DataFrame df0 > * Filter df0 > * Add a randn column > * Create a copy of the DataFrame > * unionAll the two DataFrames > This fails: randn produces the same results on the original DataFrame > and the copy before unionAll, but produces different results after unionAll. Removing the > filter fixes the problem. > The bug can be reproduced on master: > {code} > import org.apache.spark.sql.functions.randn > val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id") > // Removing the following filter() call makes this give the expected result. > val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code}
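The plan-hash idea from the comment can be illustrated outside Spark. A hedged Python sketch — the plan string and function names are invented stand-ins for a canonicalized local plan tree, not Spark internals — showing how deriving the RNG seed from the plan makes identical plan subtrees evaluate randn identically regardless of where they land after a unionAll:

```python
import hashlib
import random

def plan_seed(user_seed, plan):
    # Mix the user-provided seed with a stable hash of the (hypothetical)
    # canonicalized plan description, so every occurrence of the same
    # subtree gets the same effective seed.
    digest = hashlib.sha256(plan.encode("utf-8")).digest()
    return user_seed ^ int.from_bytes(digest[:8], "big")

def randn_column(seed, n):
    # A seeded Gaussian column, standing in for randn(seed).
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

plan = "Project [id, randn(12345) AS b] <- Filter (id = 0) <- LocalRelation"

# Two copies of the same plan subtree (df1 and df2 in the reproduction)
# derive the same seed, hence identical columns before and after a union:
left = randn_column(plan_seed(12345, plan), 2)
right = randn_column(plan_seed(12345, plan), 2)
print(left == right)  # True
```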
[jira] [Resolved] (SPARK-13261) Expose maxCharactersPerColumn as a user configurable option
[ https://issues.apache.org/jira/browse/SPARK-13261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13261. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11147 [https://github.com/apache/spark/pull/11147] > Expose maxCharactersPerColumn as a user configurable option > --- > > Key: SPARK-13261 > URL: https://issues.apache.org/jira/browse/SPARK-13261 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Hossein Falaki > Fix For: 2.0.0 > > > We are using the Univocity parser in the CSV data source in Spark. The parser has > a fairly small limit for the maximum number of characters per column. Spark's CSV > data source updates it but it is not exposed to the user. There are still use > cases where the limit is too small. I think we should just expose it as an > option. I suggest "maxCharsPerColumn" for the option.
[jira] [Updated] (SPARK-13411) change in null aggregation behavior from spark 1.5.2 and 1.6.0
[ https://issues.apache.org/jira/browse/SPARK-13411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-13411: - Description: I don't know if the behavior in 1.5.3 or 1.6.0 is correct, but it's definitely different. Suppose I have a dataframe with a double column, "foo", that is all null valued. If I do val ext: DataFrame = df.agg(min("foo"), max("foo"), count(col("foo")).alias("nonNullCount")) In 1.5.2 I could do ext.getDouble(0) and get Double.NaN. In 1.6.0, when I try this I get "value in null at index 0". Maybe the new behavior is correct, but I think there is a typo in the message. It should say "value is null at index 0". Which behavior is correct? If 1.6.0 is correct, then it looks like I will need to add isNull checks everywhere when retrieving values. was: I don't know if the behavior in 1.5.3 or 1.6.0 is correct, but its definitely different. Suppose I have a dataframe with a double column, "foo", that is all null valued. If I do val ext: DataFrame = df.agg(min("foo"), max("foo"), count(col("foo")).alias("nonNullCount")) In 1.5.2 I could do ext.getDouble(0) and get Double.NaN. In 1.6.0, when I try this I get Which is correct. I think the 1.5.2 behavior is better otherwise I need to add special case handling for when a column is all null. > change in null aggregation behavior from spark 1.5.2 and 1.6.0 > --- > > Key: SPARK-13411 > URL: https://issues.apache.org/jira/browse/SPARK-13411 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Barry Becker > > I don't know if the behavior in 1.5.3 or 1.6.0 is correct, but it's definitely > different. > Suppose I have a dataframe with a double column, "foo", that is all null > valued. > If I do > val ext: DataFrame = df.agg(min("foo"), max("foo"), > count(col("foo")).alias("nonNullCount")) > In 1.5.2 I could do ext.getDouble(0) and get Double.NaN. > In 1.6.0, when I try this I get "value in null at index 0". 
Maybe the new > behavior is correct, but I think there is a typo in the message. It should > say "value is null at index 0". > Which behavior is correct? If 1.6.0 is correct, then it looks like I will > need to add isNull checks everywhere when retrieving values.
[jira] [Resolved] (SPARK-12966) Postgres JDBC ArrayType(DecimalType) 'Unable to find server array type'
[ https://issues.apache.org/jira/browse/SPARK-12966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12966. -- Resolution: Fixed Fix Version/s: 2.0.0 > Postgres JDBC ArrayType(DecimalType) 'Unable to find server array type' > --- > > Key: SPARK-12966 > URL: https://issues.apache.org/jira/browse/SPARK-12966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Brandon Bradley > Fix For: 2.0.0 > > > Similar to SPARK-12747 but for DecimalType. > Do we need to handle precision and scale? > I've already started trying to work on this. I cannot see if the Postgres JDBC > driver handles precision and scale or just converts to the default BigDecimal > precision and scale.
[jira] [Updated] (SPARK-13411) change in null aggregation behavior from spark 1.5.2 and 1.6.0
[ https://issues.apache.org/jira/browse/SPARK-13411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-13411: - Description: I don't know if the behavior in 1.5.3 or 1.6.0 is correct, but it's definitely different. Suppose I have a dataframe with a double column, "foo", that is all null valued. If I do val ext: DataFrame = df.agg(min("foo"), max("foo"), count(col("foo")).alias("nonNullCount")) In 1.5.2 I could do ext.getDouble(0) and get Double.NaN. In 1.6.0, when I try this I get Which is correct. I think the 1.5.2 behavior is better otherwise I need to add special case handling for when a column is all null. was: I don't know if the behavior in 1.5.3 or 1.6.0 is correct, but its definitely different. I suspect 1.6.0 is wrong. Suppose I have a dataframe with a double column, "foo", that is all null valued. If I do val ext: DataFrame = df.agg(min("foo"), max("foo"), count(col("foo")).alias("nonNullCount")) then in 1.6.0 I get a completely empty dataframe as the result. In 1.5.2, I got a single row with the aggregate min and max values being Double.NaN. Which is correct. I think the 1.5.2 behavior is better otherwise I need to add special case handling for when a column is all null. > change in null aggregation behavior from spark 1.5.2 and 1.6.0 > --- > > Key: SPARK-13411 > URL: https://issues.apache.org/jira/browse/SPARK-13411 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Barry Becker > > I don't know if the behavior in 1.5.3 or 1.6.0 is correct, but it's definitely > different. > Suppose I have a dataframe with a double column, "foo", that is all null > valued. > If I do > val ext: DataFrame = df.agg(min("foo"), max("foo"), > count(col("foo")).alias("nonNullCount")) > In 1.5.2 I could do ext.getDouble(0) and get Double.NaN. > In 1.6.0, when I try this I get > Which is correct. 
> I think the 1.5.2 behavior is better otherwise I need to add special case > handling for when a column is all null.
[jira] [Created] (SPARK-13411) change in null aggregation behavior from spark 1.5.2 and 1.6.0
Barry Becker created SPARK-13411: Summary: change in null aggregation behavior from spark 1.5.2 and 1.6.0 Key: SPARK-13411 URL: https://issues.apache.org/jira/browse/SPARK-13411 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.0 Reporter: Barry Becker I don't know if the behavior in 1.5.3 or 1.6.0 is correct, but it's definitely different. I suspect 1.6.0 is wrong. Suppose I have a dataframe with a double column, "foo", that is all null valued. If I do val ext: DataFrame = df.agg(min("foo"), max("foo"), count(col("foo")).alias("nonNullCount")) then in 1.6.0 I get a completely empty dataframe as the result. In 1.5.2, I got a single row with the aggregate min and max values being Double.NaN. Which is correct? I think the 1.5.2 behavior is better; otherwise I need to add special case handling for when a column is all null.
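If the 1.6.0 behavior stands, the isNull checks the reporter mentions can be centralized in a small helper. A hedged Python sketch — the helper name is invented, and a plain tuple stands in for a pyspark `Row` (which also supports positional indexing):

```python
import math

def get_double_or_nan(row, index):
    # Workaround sketch for the behavior change described above: wrap
    # every retrieval in an explicit null check instead of relying on
    # getDouble returning NaN for an all-null column.
    value = row[index]
    return float(value) if value is not None else float("nan")

# The aggregation result the reporter describes: min/max over an all-null
# column are null, and the non-null count is 0.
agg_row = (None, None, 0)  # min(foo), max(foo), nonNullCount

print(math.isnan(get_double_or_nan(agg_row, 0)))  # True
print(get_double_or_nan(agg_row, 2))              # 0.0
```

This restores the 1.5.2-style NaN result at the call site without special-casing each aggregate column.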
[jira] [Updated] (SPARK-13410) unionAll throws error with DataFrames containing UDT columns.
[ https://issues.apache.org/jira/browse/SPARK-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franklyn Dsouza updated SPARK-13410: Description: Unioning two DataFrames that contain UDTs fails with {quote} AnalysisException: u"unresolved operator 'Union;" {quote} I tracked this down to this line https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 Which compares datatypes between the output attributes of both logical plans. However for UDTs this will be a new instance of the UserDefinedType or PythonUserDefinedType https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 So this equality check will check if the two instances are the same and since they aren't references to a singleton this check fails. **Note: this will work fine if you are unioning the dataframe with itself.** I have a proposed patch for this which overrides the equality operator on the two classes here: https://github.com/apache/spark/pull/11279 Reproduction steps {code} from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT from pyspark.sql import types schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) #note they need to be two separate dataframes a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) c = a.unionAll(b) {code} was: Unioning two DataFrames that contain UDTs fails with {quote} AnalysisException: u"unresolved operator 'Union;" {quote} I tracked this down to this line https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 Which compares datatypes between the output attributes of both logical plans. 
However for UDTs this will be a new instance of the UserDefinedType or PythonUserDefinedType https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 So this equality check will check if the two instances are the same and since they aren't references to a singleton this check fails. Note: this will work fine if you are unioning the dataframe with itself. I have a proposed patch for this which overrides the equality operator on the two classes here: https://github.com/apache/spark/pull/11279 Reproduction steps {code} from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT from pyspark.sql import types schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) #note they need to be two separate dataframes a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) c = a.unionAll(b) {code} > unionAll throws error with DataFrames containing UDT columns. > - > > Key: SPARK-13410 > URL: https://issues.apache.org/jira/browse/SPARK-13410 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0 >Reporter: Franklyn Dsouza > Labels: patch > Original Estimate: 3h > Remaining Estimate: 3h > > Unioning two DataFrames that contain UDTs fails with > {quote} > AnalysisException: u"unresolved operator 'Union;" > {quote} > I tracked this down to this line > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 > Which compares datatypes between the output attributes of both logical plans. 
> However for UDTs this will be a new instance of the UserDefinedType or > PythonUserDefinedType > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 > > So this equality check will check if the two instances are the same and since > they aren't references to a singleton this check fails. **Note: this will > work fine if you are unioning the dataframe with itself.** > I have a proposed patch for this which overrides the equality operator on the > two classes here: https://github.com/apache/spark/pull/11279 > Reproduction steps > {code} > from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT > from pyspark.sql import types > schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) > #note they need to be two separate dataframes > a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) > b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) > c = a.unionAll(b) > {code}
[jira] [Updated] (SPARK-13410) unionAll throws error with DataFrames containing UDT columns.
[ https://issues.apache.org/jira/browse/SPARK-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franklyn Dsouza updated SPARK-13410: Description: Unioning two DataFrames that contain UDTs fails with {quote} AnalysisException: u"unresolved operator 'Union;" {quote} I tracked this down to this line https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 Which compares datatypes between the output attributes of both logical plans. However for UDTs this will be a new instance of the UserDefinedType or PythonUserDefinedType https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 So this equality check will check if the two instances are the same and since they aren't references to a singleton this check fails. *Note: this will work fine if you are unioning the dataframe with itself.* I have a proposed patch for this which overrides the equality operator on the two classes here: https://github.com/apache/spark/pull/11279 Reproduction steps {code} from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT from pyspark.sql import types schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) #note they need to be two separate dataframes a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) c = a.unionAll(b) {code} was: Unioning two DataFrames that contain UDTs fails with {quote} AnalysisException: u"unresolved operator 'Union;" {quote} I tracked this down to this line https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 Which compares datatypes between the output attributes of both logical plans. 
However for UDTs this will be a new instance of the UserDefinedType or PythonUserDefinedType https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 So this equality check will check if the two instances are the same and since they aren't references to a singleton this check fails. **Note: this will work fine if you are unioning the dataframe with itself.** I have a proposed patch for this which overrides the equality operator on the two classes here: https://github.com/apache/spark/pull/11279 Reproduction steps {code} from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT from pyspark.sql import types schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) #note they need to be two separate dataframes a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) c = a.unionAll(b) {code} > unionAll throws error with DataFrames containing UDT columns. > - > > Key: SPARK-13410 > URL: https://issues.apache.org/jira/browse/SPARK-13410 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0 >Reporter: Franklyn Dsouza > Labels: patch > Original Estimate: 3h > Remaining Estimate: 3h > > Unioning two DataFrames that contain UDTs fails with > {quote} > AnalysisException: u"unresolved operator 'Union;" > {quote} > I tracked this down to this line > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 > Which compares datatypes between the output attributes of both logical plans. 
> However for UDTs this will be a new instance of the UserDefinedType or > PythonUserDefinedType > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 > > So this equality check will check if the two instances are the same and since > they aren't references to a singleton this check fails. > *Note: this will work fine if you are unioning the dataframe with itself.* > I have a proposed patch for this which overrides the equality operator on the > two classes here: https://github.com/apache/spark/pull/11279 > Reproduction steps > {code} > from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT > from pyspark.sql import types > schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) > #note they need to be two separate dataframes > a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) > b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) > c = a.unionAll(b) > {code}
[jira] [Updated] (SPARK-13410) unionAll AnalysisException with DataFrames containing UDT columns.
[ https://issues.apache.org/jira/browse/SPARK-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franklyn Dsouza updated SPARK-13410: Summary: unionAll AnalysisException with DataFrames containing UDT columns. (was: unionAll throws error with DataFrames containing UDT columns.) > unionAll AnalysisException with DataFrames containing UDT columns. > -- > > Key: SPARK-13410 > URL: https://issues.apache.org/jira/browse/SPARK-13410 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0 >Reporter: Franklyn Dsouza > Labels: patch > Original Estimate: 3h > Remaining Estimate: 3h > > Unioning two DataFrames that contain UDTs fails with > {quote} > AnalysisException: u"unresolved operator 'Union;" > {quote} > I tracked this down to this line > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 > Which compares datatypes between the output attributes of both logical plans. > However for UDTs this will be a new instance of the UserDefinedType or > PythonUserDefinedType > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 > > So this equality check will check if the two instances are the same and since > they aren't references to a singleton this check fails. 
> *Note: this will work fine if you are unioning the dataframe with itself.* > I have a proposed patch for this which overrides the equality operator on the > two classes here: https://github.com/apache/spark/pull/11279 > Reproduction steps > {code} > from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT > from pyspark.sql import types > schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) > #note they need to be two separate dataframes > a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) > b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) > c = a.unionAll(b) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
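The identity-vs-structural equality problem described above is easy to reproduce in miniature. A hedged Python sketch — the class name is invented, and the real patch in PR 11279 is in Scala — of why default identity-based equality makes the Union resolver think two structurally identical UDT schemas differ, and how a class-based `__eq__` fixes the comparison:

```python
class PythonOnlyUDTSketch:
    """Toy stand-in for a UserDefinedType instance (hypothetical name).

    Each DataFrame carries a fresh UDT instance, so the default
    identity-based equality used by the schema comparison fails even
    though the two types are structurally identical.
    """

    def __init__(self, sql_type="struct<x:double,y:double>"):
        self.sql_type = sql_type

    def __eq__(self, other):
        # The proposed fix: compare by class and underlying SQL type,
        # not by object identity.
        return type(other) is type(self) and other.sql_type == self.sql_type

    def __hash__(self):
        return hash((type(self), self.sql_type))

a_type = PythonOnlyUDTSketch()  # schema attached to DataFrame a
b_type = PythonOnlyUDTSketch()  # schema attached to DataFrame b

print(a_type is b_type)  # False: two separate instances, not a singleton
print(a_type == b_type)  # True with __eq__; identity equality would say False
```

Unioning a DataFrame with itself works today precisely because both sides then share the *same* instance, so even identity equality passes.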
[jira] [Commented] (SPARK-13408) Exception in resultHandler will shutdown SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154962#comment-15154962 ] Apache Spark commented on SPARK-13408: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/11280 > Exception in resultHandler will shutdown SparkContext > - > > Key: SPARK-13408 > URL: https://issues.apache.org/jira/browse/SPARK-13408 > Project: Spark > Issue Type: Bug > Reporter: Davies Liu > Assignee: Shixiong Zhu
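One way to address the failure mode in SPARK-13408 is to catch exceptions thrown by the driver-side result handler so they fail only the affected job instead of escaping and shutting down the scheduler's event loop. The following is a minimal Python sketch of that idea only; `JobWaiter` and `task_succeeded` here are hypothetical stand-ins that loosely mirror Spark's Scala classes, not the actual implementation.

```python
# Hypothetical sketch (not Spark's real code): wrap the user-supplied result
# handler so an exception it raises is recorded as a job failure rather than
# propagating out of the event loop and killing the whole context.

class JobWaiter:
    def __init__(self, result_handler):
        self.result_handler = result_handler
        self.failure = None  # set if the handler throws

    def task_succeeded(self, index, result):
        try:
            # The handler may raise, e.g. the NullPointerException from a
            # generated ordering seen in the log above.
            self.result_handler(index, result)
        except Exception as e:
            # Record the failure for this job only; the scheduler loop survives.
            self.failure = e

def broken_handler(index, result):
    raise ValueError("bad comparator")

waiter = JobWaiter(broken_handler)
waiter.task_succeeded(0, "partition result")
print(waiter.failure)  # the job sees the error; the "SparkContext" stays alive
```

With this shape, a doctest failure like the one in the report would fail the single collect() job instead of tearing down the context.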
[jira] [Assigned] (SPARK-13408) Exception in resultHandler will shutdown SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13408: Assignee: Apache Spark (was: Shixiong Zhu) > Exception in resultHandler will shutdown SparkContext > - > > Key: SPARK-13408 > URL: https://issues.apache.org/jira/browse/SPARK-13408 > Project: Spark > Issue Type: Bug > Reporter: Davies Liu > Assignee: Apache Spark
[jira] [Commented] (SPARK-13410) unionAll throws error with DataFrames containing UDT columns.
[ https://issues.apache.org/jira/browse/SPARK-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154935#comment-15154935 ] Apache Spark commented on SPARK-13410: -- User 'damnMeddlingKid' has created a pull request for this issue: https://github.com/apache/spark/pull/11279 > unionAll throws error with DataFrames containing UDT columns. > - > > Key: SPARK-13410 > URL: https://issues.apache.org/jira/browse/SPARK-13410 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0 >Reporter: Franklyn Dsouza > Labels: patch > Original Estimate: 3h > Remaining Estimate: 3h > > Unioning two DataFrames that contain UDTs fails with > {quote} > AnalysisException: u"unresolved operator 'Union;" > {quote} > I tracked this down to this line > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 > Which compares datatypes between the output attributes of both logical plans. > However for UDTs this will be a new instance of the UserDefinedType or > PythonUserDefinedType > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 > > So this equality check will check if the two instances are the same and since > they aren't references to a singleton this check fails. Note: this will work > fine if you are unioning the dataframe with itself. 
> I have a patch for this which overrides the equality operator on the two > classes here: https://github.com/damnMeddlingKid/spark/pull/2 > Reproduction steps > {code} > from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT > from pyspark.sql import types > schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) > #note they need to be two separate dataframes > a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) > b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) > c = a.unionAll(b) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
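The identity-versus-structural equality problem the report describes can be demonstrated outside Spark. Below is a hedged sketch using plain Python classes; `UDTWithoutEq` and `UDTWithEq` are hypothetical stand-ins for `UserDefinedType`/`PythonUserDefinedType`, not the real classes, and only illustrate why overriding the equality operator (as the proposed patch does) makes the Union schema check pass.

```python
# Two separately constructed type objects compare unequal under Python's
# default identity-based equality, so a structural check like Union's schema
# comparison fails even though the types are logically identical.

class UDTWithoutEq:
    def __init__(self, sql_type):
        self.sql_type = sql_type
    # default __eq__: object identity

class UDTWithEq:
    def __init__(self, sql_type):
        self.sql_type = sql_type

    def __eq__(self, other):
        # Structural equality: same class and same underlying SQL type.
        return type(other) is type(self) and self.sql_type == other.sql_type

# Two fresh instances, as produced when two DataFrames each build their schema:
print(UDTWithoutEq("struct<x:double>") == UDTWithoutEq("struct<x:double>"))  # False
print(UDTWithEq("struct<x:double>") == UDTWithEq("struct<x:double>"))        # True
```

This also explains why unioning a DataFrame with itself works: both plans then reference the very same type instance, so even identity equality succeeds.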
[jira] [Assigned] (SPARK-13410) unionAll throws error with DataFrames containing UDT columns.
[ https://issues.apache.org/jira/browse/SPARK-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13410: Assignee: Apache Spark > unionAll throws error with DataFrames containing UDT columns. > - > > Key: SPARK-13410 > URL: https://issues.apache.org/jira/browse/SPARK-13410 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.5.0, 1.6.0 > Reporter: Franklyn Dsouza > Assignee: Apache Spark > Labels: patch
[jira] [Updated] (SPARK-13410) unionAll throws error with DataFrames containing UDT columns.
[ https://issues.apache.org/jira/browse/SPARK-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franklyn Dsouza updated SPARK-13410: Description: Unioning two DataFrames that contain UDTs fails with {quote} AnalysisException: u"unresolved operator 'Union;" {quote} I tracked this down to this line https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 which compares the datatypes of the output attributes of both logical plans. However, for UDTs each plan holds a new instance of UserDefinedType or PythonUserDefinedType https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 so the equality check tests whether the two instances are the same object, and since they aren't references to a singleton the check fails. Note: this works fine if you union the dataframe with itself. I have a proposed patch for this which overrides the equality operator on the two classes here: https://github.com/apache/spark/pull/11279 Reproduction steps {code} from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT from pyspark.sql import types schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) #note they need to be two separate dataframes a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) c = a.unionAll(b) {code}
[jira] [Assigned] (SPARK-13410) unionAll throws error with DataFrames containing UDT columns.
[ https://issues.apache.org/jira/browse/SPARK-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13410: Assignee: (was: Apache Spark) > unionAll throws error with DataFrames containing UDT columns. > - > > Key: SPARK-13410 > URL: https://issues.apache.org/jira/browse/SPARK-13410 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.5.0, 1.6.0 > Reporter: Franklyn Dsouza > Labels: patch
[jira] [Updated] (SPARK-12864) Fetch failure from AM restart
[ https://issues.apache.org/jira/browse/SPARK-12864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12864: -- Summary: Fetch failure from AM restart (was: initialize executorIdCounter after ApplicationMaster killed for max number of executor failures reached)
> Fetch failure from AM restart
> -
>
> Key: SPARK-12864
> URL: https://issues.apache.org/jira/browse/SPARK-12864
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.3.1, 1.4.1, 1.5.2
> Reporter: iward
>
> Currently, when the number of executor failures reaches *maxNumExecutorFailures*, the *ApplicationMaster* is killed and a new one re-registers. A new *YarnAllocator* instance is then created, but its *executorIdCounter* property resets to *0*, so the *Id* of each new executor starts from 1 again. These IDs collide with executors created before, which causes FetchFailedException.
> For example, the following is the task log:
> {noformat}
> 2015-12-22 02:33:15 INFO 15/12/22 02:33:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 172.22.92.14:45125
> 2015-12-22 02:33:26 INFO 15/12/22 02:33:26 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka.tcp://sparkYarnAM@172.22.168.72:54040/user/YarnAM#-1290854604])
> {noformat}
> {noformat}
> 2015-12-22 02:35:02 INFO 15/12/22 02:35:02 INFO YarnClientSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkexecu...@bjhc-hera-16217.hadoop.jd.local:46538/user/Executor#-790726793]) with ID 1
> {noformat}
> {noformat}
> Lost task 3.0 in stage 102.0 (TID 1963, BJHC-HERA-16217.hadoop.jd.local): FetchFailed(BlockManagerId(1, BJHC-HERA-17030.hadoop.jd.local, 7337), shuffleId=5, mapId=2, reduceId=3, message=
> 2015-12-22 02:43:20 INFO org.apache.spark.shuffle.FetchFailedException: /data3/yarn1/local/usercache/dd_edw/appcache/application_1450438154359_206399/blockmgr-b1fd0363-6d53-4d09-8086-adc4a13f4dc4/0f/shuffle_5_2_0.index (No such file or directory)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> 2015-12-22 02:43:20 INFO at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> 2015-12-22 02:43:20 INFO at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)
> 2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> {noformat}
> As the task log shows, the executor id of *BJHC-HERA-16217.hadoop.jd.local* is the same as that of *BJHC-HERA-17030.hadoop.jd.local*, so the IDs are confused and cause FetchFailedException.
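The counter-reset collision described in SPARK-12864 can be sketched in a few lines of Python. This is an illustrative toy only; `Allocator` is a hypothetical stand-in for `YarnAllocator`'s `executorIdCounter` logic, and initializing the restarted allocator from the highest existing ID is one plausible fix direction, not necessarily the one Spark adopted.

```python
# A restarted allocator whose counter resets to 0 re-issues executor IDs
# that are already taken; one initialized from the highest ID seen so far
# does not collide.

class Allocator:
    def __init__(self, start=0):
        self.counter = start

    def next_id(self):
        self.counter += 1
        return self.counter

first_am = Allocator()
existing = [first_am.next_id() for _ in range(3)]    # IDs 1, 2, 3 in use

restarted_bad = Allocator()                          # bug: counter reset to 0
print(restarted_bad.next_id() in existing)           # True: ID 1 collides

restarted_good = Allocator(start=max(existing))      # resume past existing IDs
print(restarted_good.next_id() in existing)          # False: ID 4 is fresh
```

A colliding ID means shuffle blocks registered under executor 1 of the old AM are looked up on the new, unrelated executor 1, which is exactly the missing-index-file FetchFailedException in the log.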
[jira] [Created] (SPARK-13410) unionAll throws error with DataFrames containing UDT columns.
Franklyn Dsouza created SPARK-13410: --- Summary: unionAll throws error with DataFrames containing UDT columns. Key: SPARK-13410 URL: https://issues.apache.org/jira/browse/SPARK-13410 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0, 1.5.0 Reporter: Franklyn Dsouza Unioning two DataFrames that contain UDTs fails with {quote} AnalysisException: u"unresolved operator 'Union;" {quote} I tracked this down to this line https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 Which compares datatypes between the output attributes of both logical plans. However for UDTs this will be a new instance of the UserDefinedType or PythonUserDefinedType https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 So this equality check will check if the two instances are the same and since they aren't references to a singleton this check fails. Note: this will work fine if you are unioning the dataframe with itself. I have a patch for this which overrides the equality operator on the two classes here: https://github.com/damnMeddlingKid/spark/pull/2 Reproduction steps {code} from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT from pyspark.sql import types schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) #note they need to be two separate dataframes a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) c = a.unionAll(b) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13382) Update PySpark testing notes
[ https://issues.apache.org/jira/browse/SPARK-13382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154926#comment-15154926 ] Apache Spark commented on SPARK-13382: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/11278 > Update PySpark testing notes > > > Key: SPARK-13382 > URL: https://issues.apache.org/jira/browse/SPARK-13382 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Reporter: holdenk >Priority: Trivial > > As discussed on the mailing list, running the full python tests requires that > Spark is built with the hive assembly. We should update both the wiki and the > build instructions for Python to mention this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13382) Update PySpark testing notes
[ https://issues.apache.org/jira/browse/SPARK-13382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13382: Assignee: Apache Spark > Update PySpark testing notes > > > Key: SPARK-13382 > URL: https://issues.apache.org/jira/browse/SPARK-13382 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark > Reporter: holdenk > Assignee: Apache Spark > Priority: Trivial
[jira] [Assigned] (SPARK-13382) Update PySpark testing notes
[ https://issues.apache.org/jira/browse/SPARK-13382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13382: Assignee: (was: Apache Spark) > Update PySpark testing notes > > > Key: SPARK-13382 > URL: https://issues.apache.org/jira/browse/SPARK-13382 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark > Reporter: holdenk > Priority: Trivial
[jira] [Commented] (SPARK-10001) Allow Ctrl-C in spark-shell to kill running job
[ https://issues.apache.org/jira/browse/SPARK-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154885#comment-15154885 ] Ryan Blue commented on SPARK-10001: --- bq. I . . . am uneasy about adopting unusual semantics for a standard signal, which should rightly kill the shell by default. I just checked nodejs, python, and irb (ruby) and the behavior in all three is to return execution to the shell's prompt, not to exit. I think that makes sense since it basically mimics a normal bash shell. I agree with Jon that it would be a nice feature, though I would rather get the current patch in and address the new suggestion in a follow up issue. [~tri...@gmail.com], could you open a follow-up issue that outlines the behavior you suggest? > Allow Ctrl-C in spark-shell to kill running job > --- > > Key: SPARK-10001 > URL: https://issues.apache.org/jira/browse/SPARK-10001 > Project: Spark > Issue Type: Improvement > Components: Spark Shell >Affects Versions: 1.4.1 >Reporter: Cheolsoo Park >Priority: Minor > > Hitting Ctrl-C in spark-sql (and other tools like presto) cancels any running > job and starts a new input line on the prompt. It would be nice if > spark-shell also can do that. Otherwise, in case a user submits a job, say he > made a mistake, and wants to cancel it, he needs to exit the shell and > re-login to continue his work. Re-login can be a pain especially in Spark on > yarn, since it takes a while to allocate AM container and initial executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
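The behavior discussed in this thread (Ctrl-C cancels the running job and returns to the prompt, like in python or irb, instead of killing the shell) can be sketched with Python's standard `signal` module. This is a hedged illustration of the signal-handling pattern only; `active_jobs` and `cancel` are hypothetical stand-ins for the shell's real job tracking, not spark-shell's actual (Scala/JVM) implementation.

```python
import signal

# On SIGINT: if a job is running, cancel it and keep the shell alive;
# otherwise fall back to the default behavior of interrupting the shell.

active_jobs = []  # hypothetical registry of running job IDs

def cancel(job):
    active_jobs.remove(job)

def on_sigint(signum, frame):
    if active_jobs:
        # A job is running: cancel it, return control to the prompt.
        for job in list(active_jobs):
            cancel(job)
    else:
        # No job running: behave like a normal Ctrl-C.
        raise KeyboardInterrupt

signal.signal(signal.SIGINT, on_sigint)

# Simulate Ctrl-C while a job is running:
active_jobs.append("job-0")
on_sigint(signal.SIGINT, None)
print(active_jobs)  # []  -- job cancelled, shell still alive
```

A second Ctrl-C with no job running would then raise `KeyboardInterrupt`, preserving the standard way to exit, which addresses the concern about adopting unusual semantics for a standard signal.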
[jira] [Updated] (SPARK-13409) Remember the stacktrace when stop a SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-13409: --- Component/s: Spark Core > Remember the stacktrace when stop a SparkContext > > > Key: SPARK-13409 > URL: https://issues.apache.org/jira/browse/SPARK-13409 > Project: Spark > Issue Type: Bug > Components: Spark Core > Reporter: Davies Liu > > Sometimes we see a stopped SparkContext and have no idea what stopped it; we should remember the stack trace of the stop() call for troubleshooting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
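The proposal in SPARK-13409 can be sketched in Python: record the call stack at the moment `stop()` is called, and surface it when the stopped context is used later. `Context` here is a hypothetical toy, not PySpark's real `SparkContext`; it only illustrates the capture-at-stop idea.

```python
import traceback

class Context:
    def __init__(self):
        self.stopped = False
        self.stop_site = None  # stack trace of whoever called stop()

    def stop(self):
        if not self.stopped:
            self.stopped = True
            # Remember where the stop came from, for later troubleshooting.
            self.stop_site = "".join(traceback.format_stack())

    def run_job(self):
        if self.stopped:
            raise RuntimeError(
                "Context was stopped; stop() was called from:\n" + self.stop_site)

ctx = Context()
ctx.stop()
try:
    ctx.run_job()
except RuntimeError as e:
    print("stop() was called from" in str(e))  # True: the error names the stopper
```

With this, "who stopped the context?" is answered by the error message itself instead of requiring a debugger session.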
[jira] [Created] (SPARK-13409) Remember the stacktrace when stop a SparkContext
Davies Liu created SPARK-13409: -- Summary: Remember the stacktrace when stop a SparkContext Key: SPARK-13409 URL: https://issues.apache.org/jira/browse/SPARK-13409 Project: Spark Issue Type: Bug Reporter: Davies Liu Sometimes we see a stopped SparkContext and have no idea what stopped it; we should remember the stack trace of the stop() call for troubleshooting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13408) Exception in resultHandler will shutdown SparkContext
Davies Liu created SPARK-13408: -- Summary: Exception in resultHandler will shutdown SparkContext Key: SPARK-13408 URL: https://issues.apache.org/jira/browse/SPARK-13408 Project: Spark Issue Type: Bug Reporter: Davies Liu Assignee: Shixiong Zhu {code} davies@localhost:~/work/spark$ bin/spark-submit python/pyspark/sql/dataframe.py NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 16/02/19 12:46:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/02/19 12:46:24 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 192.168.0.143 instead (on interface en0) 16/02/19 12:46:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address ** File "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 554, in pyspark.sql.dataframe.DataFrame.alias Failed example: joined_df.select(col("df_as1.name"), col("df_as2.name"), col("df_as2.age")).collect() Differences (ndiff with -expected +actual): - [Row(name=u'Bob', name=u'Bob', age=5), Row(name=u'Alice', name=u'Alice', age=2)] + [Row(name=u'Alice', name=u'Alice', age=2), Row(name=u'Bob', name=u'Bob', age=5)] org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649) at java.util.PriorityQueue.siftUp(PriorityQueue.java:627) at java.util.PriorityQueue.offer(PriorityQueue.java:329) at org.apache.spark.util.BoundedPriorityQueue.$plus$eq(BoundedPriorityQueue.scala:47) at org.apache.spark.util.BoundedPriorityQueue$$anonfun$$plus$plus$eq$1.apply(BoundedPriorityQueue.scala:41) at org.apache.spark.util.BoundedPriorityQueue$$anonfun$$plus$plus$eq$1.apply(BoundedPriorityQueue.scala:41) at scala.collection.Iterator$class.foreach(Iterator.scala:742) at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.util.BoundedPriorityQueue.foreach(BoundedPriorityQueue.scala:31) at org.apache.spark.util.BoundedPriorityQueue.$plus$plus$eq(BoundedPriorityQueue.scala:41) at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$apply$46.apply(RDD.scala:1319) at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$apply$46.apply(RDD.scala:1318) at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:932) at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:929) at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:57) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1185) ... 
4 more org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649) at java.util.PriorityQueue.siftUp(PriorityQueue.java:627) at
[jira] [Assigned] (SPARK-13387) Add support for SPARK_DAEMON_JAVA_OPTS with MesosClusterDispatcher.
[ https://issues.apache.org/jira/browse/SPARK-13387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13387: Assignee: Apache Spark > Add support for SPARK_DAEMON_JAVA_OPTS with MesosClusterDispatcher. > --- > > Key: SPARK-13387 > URL: https://issues.apache.org/jira/browse/SPARK-13387 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen >Assignee: Apache Spark >Priority: Minor > > As SPARK_JAVA_OPTS is being deprecated, MesosClusterDispatcher should also support SPARK_DAEMON_JAVA_OPTS so that Java properties can still be set for it.
[jira] [Assigned] (SPARK-13387) Add support for SPARK_DAEMON_JAVA_OPTS with MesosClusterDispatcher.
[ https://issues.apache.org/jira/browse/SPARK-13387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13387: Assignee: (was: Apache Spark) > Add support for SPARK_DAEMON_JAVA_OPTS with MesosClusterDispatcher. > --- > > Key: SPARK-13387 > URL: https://issues.apache.org/jira/browse/SPARK-13387 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen >Priority: Minor > > As SPARK_JAVA_OPTS is being deprecated, MesosClusterDispatcher should also support SPARK_DAEMON_JAVA_OPTS so that Java properties can still be set for it.
[jira] [Commented] (SPARK-13387) Add support for SPARK_DAEMON_JAVA_OPTS with MesosClusterDispatcher.
[ https://issues.apache.org/jira/browse/SPARK-13387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154842#comment-15154842 ] Apache Spark commented on SPARK-13387: -- User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/11277 > Add support for SPARK_DAEMON_JAVA_OPTS with MesosClusterDispatcher. > --- > > Key: SPARK-13387 > URL: https://issues.apache.org/jira/browse/SPARK-13387 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen >Priority: Minor > > As SPARK_JAVA_OPTS is being deprecated, MesosClusterDispatcher should also support SPARK_DAEMON_JAVA_OPTS so that Java properties can still be set for it.
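The behavior requested in SPARK-13387 boils down to a precedence rule. A hedged sketch of that rule (the `daemon_java_opts` helper is hypothetical; the real change lands in the dispatcher's startup scripts): prefer SPARK_DAEMON_JAVA_OPTS, fall back to the deprecated SPARK_JAVA_OPTS.

```python
def daemon_java_opts(env):
    # Hypothetical precedence: the daemon-specific variable wins,
    # the deprecated SPARK_JAVA_OPTS is the fallback, else empty.
    return env.get("SPARK_DAEMON_JAVA_OPTS") or env.get("SPARK_JAVA_OPTS", "")

print(daemon_java_opts({"SPARK_DAEMON_JAVA_OPTS": "-Dspark.deploy.zookeeper.url=zk:2181"}))
```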
[jira] [Commented] (SPARK-13407) TaskMetrics.fromAccumulatorUpdates can crash when trying to access garbage-collected accumulators
[ https://issues.apache.org/jira/browse/SPARK-13407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154826#comment-15154826 ] Apache Spark commented on SPARK-13407: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/11276 > TaskMetrics.fromAccumulatorUpdates can crash when trying to access > garbage-collected accumulators > - > > Key: SPARK-13407 > URL: https://issues.apache.org/jira/browse/SPARK-13407 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > TaskMetrics.fromAccumulatorUpdates can fail if accumulators have been > garbage-collected: > {code} > java.lang.IllegalAccessError: Attempted to access garbage collected > accumulator 481596 > at > org.apache.spark.Accumulators$$anonfun$get$1$$anonfun$apply$1.apply(Accumulator.scala:133) > at > org.apache.spark.Accumulators$$anonfun$get$1$$anonfun$apply$1.apply(Accumulator.scala:133) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.Accumulators$$anonfun$get$1.apply(Accumulator.scala:132) > at > org.apache.spark.Accumulators$$anonfun$get$1.apply(Accumulator.scala:130) > at scala.Option.map(Option.scala:145) > at org.apache.spark.Accumulators$.get(Accumulator.scala:130) > at > org.apache.spark.executor.TaskMetrics$$anonfun$9.apply(TaskMetrics.scala:414) > at > org.apache.spark.executor.TaskMetrics$$anonfun$9.apply(TaskMetrics.scala:412) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > 
org.apache.spark.executor.TaskMetrics$.fromAccumulatorUpdates(TaskMetrics.scala:412) > at > org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onExecutorMetricsUpdate$2.apply(JobProgressListener.scala:499) > at > org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onExecutorMetricsUpdate$2.apply(JobProgressListener.scala:493) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at > org.apache.spark.ui.jobs.JobProgressListener.onExecutorMetricsUpdate(JobProgressListener.scala:493) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:56) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:81) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1178) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64) > {code} > In order to guard 
against this, we can eliminate the need to access > driver-side accumulators when constructing TaskMetrics.
[jira] [Created] (SPARK-13407) TaskMetrics.fromAccumulatorUpdates can crash when trying to access garbage-collected accumulators
Josh Rosen created SPARK-13407: -- Summary: TaskMetrics.fromAccumulatorUpdates can crash when trying to access garbage-collected accumulators Key: SPARK-13407 URL: https://issues.apache.org/jira/browse/SPARK-13407 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Josh Rosen Assignee: Josh Rosen TaskMetrics.fromAccumulatorUpdates can fail if accumulators have been garbage-collected: {code} java.lang.IllegalAccessError: Attempted to access garbage collected accumulator 481596 at org.apache.spark.Accumulators$$anonfun$get$1$$anonfun$apply$1.apply(Accumulator.scala:133) at org.apache.spark.Accumulators$$anonfun$get$1$$anonfun$apply$1.apply(Accumulator.scala:133) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.Accumulators$$anonfun$get$1.apply(Accumulator.scala:132) at org.apache.spark.Accumulators$$anonfun$get$1.apply(Accumulator.scala:130) at scala.Option.map(Option.scala:145) at org.apache.spark.Accumulators$.get(Accumulator.scala:130) at org.apache.spark.executor.TaskMetrics$$anonfun$9.apply(TaskMetrics.scala:414) at org.apache.spark.executor.TaskMetrics$$anonfun$9.apply(TaskMetrics.scala:412) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.executor.TaskMetrics$.fromAccumulatorUpdates(TaskMetrics.scala:412) at org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onExecutorMetricsUpdate$2.apply(JobProgressListener.scala:499) at org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onExecutorMetricsUpdate$2.apply(JobProgressListener.scala:493) at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.ui.jobs.JobProgressListener.onExecutorMetricsUpdate(JobProgressListener.scala:493) at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:35) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:81) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1178) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64) {code} In order to guard against this, we can eliminate the need to access driver-side accumulators when constructing TaskMetrics.
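The failure mode in SPARK-13407 is easy to reproduce in miniature: the driver holds accumulators through weak references, so an entry can vanish between registration and lookup. A minimal Python sketch (names invented here; Spark's registry is not structured exactly like this) showing the guarded lookup that avoids the crash:

```python
import gc
import weakref

class Acc:
    """Toy accumulator holding a value."""
    def __init__(self, value):
        self.value = value

registry = {}  # id -> weak reference, mirroring a driver-side weak map

def register(acc_id, acc):
    registry[acc_id] = weakref.ref(acc)

def lookup(acc_id):
    # Guarded lookup: a collected accumulator yields None instead of an error.
    ref = registry.get(acc_id)
    return ref() if ref is not None else None

a = Acc(42)
register(1, a)
assert lookup(1).value == 42
del a          # simulate the accumulator being garbage-collected
gc.collect()
print(lookup(1))  # prints None once the object is gone
```

The unguarded version is what raises "Attempted to access garbage collected accumulator"; returning None (or avoiding the driver-side lookup entirely, as the ticket proposes) sidesteps it.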
[jira] [Created] (SPARK-13406) NPE in LazilyGeneratedOrdering
Davies Liu created SPARK-13406: -- Summary: NPE in LazilyGeneratedOrdering Key: SPARK-13406 URL: https://issues.apache.org/jira/browse/SPARK-13406 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Josh Rosen {code} File "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line ?, in pyspark.sql.dataframe.DataFrameStatFunctions.sampleBy Failed example: sampled.groupBy("key").count().orderBy("key").show() Exception raised: Traceback (most recent call last): File "//anaconda/lib/python2.7/doctest.py", line 1315, in __run compileflags, 1) in test.globs File "", line 1, in sampled.groupBy("key").count().orderBy("key").show() File "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 217, in show print(self._jdf.showString(n, truncate)) File "/Users/davies/work/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py", line 835, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco return f(*a, **kw) File "/Users/davies/work/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py", line 310, in get_return_value format(target_id, ".", name), value) Py4JJavaError: An error occurred while calling o681.showString. 
: org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1782) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:937) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) at org.apache.spark.rdd.RDD.withScope(RDD.scala:323) at org.apache.spark.rdd.RDD.reduce(RDD.scala:919) at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1318) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) at org.apache.spark.rdd.RDD.withScope(RDD.scala:323) at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1305) at org.apache.spark.sql.execution.TakeOrderedAndProject.executeCollect(limit.scala:94) at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:157) at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1520) at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1520) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53) at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1769) at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1519) at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1526) at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1396) at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1395) at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:1782) at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1395) at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1477) at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:167) at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) at py4j.Gateway.invoke(Gateway.java:290) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at
[jira] [Updated] (SPARK-13384) Keep attribute qualifiers after dedup in Analyzer
[ https://issues.apache.org/jira/browse/SPARK-13384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13384: - Assignee: Liang-Chi Hsieh > Keep attribute qualifiers after dedup in Analyzer > - > > Key: SPARK-13384 > URL: https://issues.apache.org/jira/browse/SPARK-13384 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.0.0 > > > When we de-duplicate attributes in the Analyzer, we create new attributes but do not keep the original qualifiers, so some plans fail to be analyzed.
[jira] [Resolved] (SPARK-13384) Keep attribute qualifiers after dedup in Analyzer
[ https://issues.apache.org/jira/browse/SPARK-13384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13384. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11261 [https://github.com/apache/spark/pull/11261] > Keep attribute qualifiers after dedup in Analyzer > - > > Key: SPARK-13384 > URL: https://issues.apache.org/jira/browse/SPARK-13384 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > Fix For: 2.0.0 > > > When we de-duplicate attributes in the Analyzer, we create new attributes but do not keep the original qualifiers, so some plans fail to be analyzed.
[jira] [Comment Edited] (SPARK-13046) Partitioning looks broken in 1.6
[ https://issues.apache.org/jira/browse/SPARK-13046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152718#comment-15152718 ] Julien Baley edited comment on SPARK-13046 at 2/19/16 7:56 PM: --- Sorry it took me so long to come back to you. We're using Hive (and Java), and I'm calling `hiveContext.createExternalTable("table_name", "s3://bucket/some_path/", "parquet");`, i.e. I believe I'm passing the correct path and then Spark perhaps infers something wrongly in the middle? I've changed my call to: `hiveContext.createExternalTable("table_name", "parquet", ImmutableMap.of("path", "s3://bucket/some_path/", "basePath", "s3://bucket/some_path/"));` and it seems to work. I still feel there's a bug (somewhere between createExternalTable and the partitioning code), since I really shouldn't have to pass the same value twice to createExternalTable, should I? was (Author: julien.baley): Sorry it took me so long to come back to you. We're using Hive (and Java), and I'm calling `hiveContext.createExternalTable("table_name", "s3://bucket/some_path/", "parquet");`, i.e. I believe I'm passing the correct path and then Spark perhaps infers something wrongly in the middle? I've changed my call to: `hiveContext.createExternalTable("table_name", "parquet", ImmutableMap.of("path", "s3://bucket/some_path/", "basePath", "s3://bucket/some_path/"));` Is that what you meant [~yhuai]? It gets me: org.apache.spark.SparkException: Failed to merge incompatible data types StringType and StructType(StructField(name,StringType,true), StructField(version,StringType,true)) when I try to query it afterwards, so I assume things still go wrong underneath.
> Partitioning looks broken in 1.6 > > > Key: SPARK-13046 > URL: https://issues.apache.org/jira/browse/SPARK-13046 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Julien Baley > > Hello, > I have a list of files in s3: > {code} > s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some > parquet files} > s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some > parquet files} > s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some > parquet files} > {code} > Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same > for the three lines) would correctly identify 2 pairs of key/value, one > `date_received` and one `fingerprint`. > From 1.6.0, I get the following exception: > {code} > assertion failed: Conflicting directory structures detected. Suspicious paths > s3://bucket/some_path/date_received=2016-01-13 > s3://bucket/some_path/date_received=2016-01-14 > s3://bucket/some_path/date_received=2016-01-15 > {code} > That is to say, the partitioning code now fails to identify > date_received=2016-01-13 as a key/value pair. > I can see that there has been some activity on > spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala > recently, so that seems related (especially the commits > https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b > and > https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14 > ). > If I read correctly the tests added in those commits: > -they don't seem to actually test the return value, only that it doesn't crash > -they only test cases where the s3 path contain 1 key/value pair (which > otherwise would catch the bug) > This is problematic for us as we're trying to migrate all of our spark > services to 1.6.0 and this bug is a real blocker. 
I know it's possible to > force a 'union', but I'd rather not do that if the bug can be fixed. > Any questions, please shoot.
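The discovery logic at issue in SPARK-13046 can be illustrated with a toy parser (a hypothetical helper, not Spark's actual `PartitioningUtils`): key=value partition pairs are only recoverable relative to a single shared base path, which is why every input path must agree on one base.

```python
def parse_partitions(path, base_path):
    """Hypothetical illustration: extract key=value partition pairs from
    the portion of `path` below `base_path`."""
    if not path.startswith(base_path):
        raise ValueError("path is outside the base path")
    rel = path[len(base_path):].strip("/")
    pairs = {}
    for segment in rel.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            pairs[key] = value
    return pairs

print(parse_partitions(
    "s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d",
    "s3://bucket/some_path/"))
```

Passing `basePath` explicitly, as in the comment above, pins the anchor that discovery would otherwise have to infer from the set of leaf directories.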
[jira] [Commented] (SPARK-13341) Casting Unix timestamp to SQL timestamp fails
[ https://issues.apache.org/jira/browse/SPARK-13341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154694#comment-15154694 ] Srinivasa Reddy Vundela commented on SPARK-13341: - I guess the following commit is the reason for the change: https://github.com/apache/spark/commit/9ed4ad4265cf9d3135307eb62dae6de0b220fc21 It seems HIVE-3454 was fixed in Hive 1.2.0, so customers using earlier versions of Hive will see this problem. > Casting Unix timestamp to SQL timestamp fails > - > > Key: SPARK-13341 > URL: https://issues.apache.org/jira/browse/SPARK-13341 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: William Dee > > The way that unix timestamp casting is handled has been broken between Spark > 1.5.2 and Spark 1.6.0. This can be easily demonstrated via the spark-shell: > {code:title=1.5.2} > scala> sqlContext.sql("SELECT CAST(145558084 AS TIMESTAMP) as ts, > CAST(CAST(145558084 AS TIMESTAMP) AS DATE) as d").show > ++--+ > | ts| d| > ++--+ > |2016-02-16 00:00:...|2016-02-16| > ++--+ > {code} > {code:title=1.6.0} > scala> sqlContext.sql("SELECT CAST(145558084 AS TIMESTAMP) as ts, > CAST(CAST(145558084 AS TIMESTAMP) AS DATE) as d").show > ++--+ > | ts| d| > ++--+ > |48095-07-09 12:06...|095-07-09| > ++--+ > {code} > I'm not sure what exactly is causing this, but this defect has definitely been > introduced in Spark 1.6.0, as jobs that relied on this functionality ran on > 1.5.2 and now don't run on 1.6.0.
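One plausible reading of the symptom above (an assumption on our part, not confirmed in the thread): a millisecond epoch value interpreted as seconds lands roughly 46,000 years in the future, which is exactly the ballpark of the garbage `48095-07-09` timestamp. A quick check with an illustrative value (the literal in the report is truncated, so this number is our own):

```python
from datetime import datetime, timezone

SECONDS_PER_YEAR = 365.25 * 24 * 3600
ms_epoch = 1455580800000  # illustrative millisecond epoch (2016-02-16 UTC)

# Correct interpretation: scale milliseconds down to seconds first.
correct = datetime.fromtimestamp(ms_epoch / 1000, tz=timezone.utc)
print(correct.date())  # 2016-02-16

# Buggy interpretation: the same number read as seconds. datetime cannot
# even represent the result, so estimate the resulting year arithmetically.
wrong_year = 1970 + ms_epoch / SECONDS_PER_YEAR
print(round(wrong_year))  # ~48095, the same ballpark as the report
```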
[jira] [Updated] (SPARK-13375) PySpark API Utils missing item: kFold
[ https://issues.apache.org/jira/browse/SPARK-13375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Wu updated SPARK-13375: - Affects Version/s: (was: 1.6.0) 1.5.0 > PySpark API Utils missing item: kFold > - > > Key: SPARK-13375 > URL: https://issues.apache.org/jira/browse/SPARK-13375 > Project: Spark > Issue Type: Task > Components: MLlib, PySpark >Affects Versions: 1.5.0 >Reporter: Bruno Wu >Priority: Minor > > The kFold function has not been implemented in MLUtils in the Python API for MLlib > (pyspark.mllib.util as of 1.6.0). > This JIRA ticket is opened to add this function to pyspark.mllib.util.
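For reference, the requested function mirrors Scala's MLUtils.kFold. A plain-Python sketch of the semantics (the `k_fold` helper is hypothetical and operates on a list rather than an RDD): shuffle once, slice into k folds, and pair each fold (validation) with the union of the others (training).

```python
import random

def k_fold(data, k, seed=0):
    """Hypothetical sketch of kFold semantics: return k (training,
    validation) splits of `data`."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    return [
        ([x for j, fold in enumerate(folds) if j != i for x in fold], folds[i])
        for i in range(k)
    ]

splits = k_fold(range(10), k=5)
print(len(splits))  # 5 (training, validation) pairs
```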
[jira] [Updated] (SPARK-13405) Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering
[ https://issues.apache.org/jira/browse/SPARK-13405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-13405: - Labels: flaky-test (was: ) > Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering > -- > > Key: SPARK-13405 > URL: https://issues.apache.org/jira/browse/SPARK-13405 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Labels: flaky-test > > {code} > org.scalatest.exceptions.TestFailedException: > Assert failed: : null equaled null onQueryTerminated called before > onQueryStarted > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) > > org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector$$anonfun$onQueryTerminated$1.apply$mcV$sp(ContinuousQueryListenerSuite.scala:204) > > org.scalatest.concurrent.AsyncAssertions$Waiter.apply(AsyncAssertions.scala:349) > > org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector.onQueryTerminated(ContinuousQueryListenerSuite.scala:203) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:67) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:32) > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.postToAll(ContinuousQueryListenerBus.scala:32) > {code}
[jira] [Assigned] (SPARK-13405) Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering
[ https://issues.apache.org/jira/browse/SPARK-13405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13405: Assignee: Apache Spark (was: Shixiong Zhu) > Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering > -- > > Key: SPARK-13405 > URL: https://issues.apache.org/jira/browse/SPARK-13405 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Apache Spark > > {code} > org.scalatest.exceptions.TestFailedException: > Assert failed: : null equaled null onQueryTerminated called before > onQueryStarted > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) > > org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector$$anonfun$onQueryTerminated$1.apply$mcV$sp(ContinuousQueryListenerSuite.scala:204) > > org.scalatest.concurrent.AsyncAssertions$Waiter.apply(AsyncAssertions.scala:349) > > org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector.onQueryTerminated(ContinuousQueryListenerSuite.scala:203) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:67) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:32) > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.postToAll(ContinuousQueryListenerBus.scala:32) > {code}
[jira] [Commented] (SPARK-13405) Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering
[ https://issues.apache.org/jira/browse/SPARK-13405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154580#comment-15154580 ] Apache Spark commented on SPARK-13405: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/11275 > Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering > -- > > Key: SPARK-13405 > URL: https://issues.apache.org/jira/browse/SPARK-13405 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > {code} > org.scalatest.exceptions.TestFailedException: > Assert failed: : null equaled null onQueryTerminated called before > onQueryStarted > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) > > org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector$$anonfun$onQueryTerminated$1.apply$mcV$sp(ContinuousQueryListenerSuite.scala:204) > > org.scalatest.concurrent.AsyncAssertions$Waiter.apply(AsyncAssertions.scala:349) > > org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector.onQueryTerminated(ContinuousQueryListenerSuite.scala:203) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:67) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:32) > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.postToAll(ContinuousQueryListenerBus.scala:32) > {code}
[jira] [Assigned] (SPARK-13405) Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering
[ https://issues.apache.org/jira/browse/SPARK-13405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13405: Assignee: Shixiong Zhu (was: Apache Spark) > Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering > -- > > Key: SPARK-13405 > URL: https://issues.apache.org/jira/browse/SPARK-13405 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > {code} > org.scalatest.exceptions.TestFailedException: > Assert failed: : null equaled null onQueryTerminated called before > onQueryStarted > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) > > org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector$$anonfun$onQueryTerminated$1.apply$mcV$sp(ContinuousQueryListenerSuite.scala:204) > > org.scalatest.concurrent.AsyncAssertions$Waiter.apply(AsyncAssertions.scala:349) > > org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector.onQueryTerminated(ContinuousQueryListenerSuite.scala:203) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:67) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:32) > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.postToAll(ContinuousQueryListenerBus.scala:32) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1405: - Description: Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike current machine learning algorithms in MLlib, which use optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. Algorithm survey from Pedro: https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing API design doc from Joseph: https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing was: Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike current machine learning algorithms in MLlib, which use optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. > parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib > - > > Key: SPARK-1405 > URL: https://issues.apache.org/jira/browse/SPARK-1405 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xusen Yin >Assignee: Joseph K. Bradley >Priority: Critical > Labels: features > Fix For: 1.3.0 > > Attachments: performance_comparison.png > > Original Estimate: 336h > Remaining Estimate: 336h > > Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts > topics from a text corpus. 
Unlike current machine learning algorithms > in MLlib, which use optimization algorithms such as gradient descent, > LDA uses expectation algorithms such as Gibbs sampling. > In this PR, I prepare an LDA implementation based on Gibbs sampling, with a > wholeTextFiles API (solved yet), a word segmentation (import from Lucene), > and a Gibbs sampling core. > Algorithm survey from Pedro: > https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing > API design doc from Joseph: > https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
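The description above names collapsed Gibbs sampling as the inference scheme for LDA. As a hedged illustration only (a toy sampler in plain Python, not the MLlib or PR code; the tiny corpus and the values of K, alpha, and beta are invented for the example), the per-token resampling step looks like this:

```python
# Toy collapsed Gibbs sampler for LDA (illustrative sketch, not Spark/MLlib code).
import random

random.seed(0)

docs = [[0, 1, 0, 2], [2, 3, 2, 3], [0, 1, 3, 2]]  # word ids per document (made up)
V, K, alpha, beta = 4, 2, 0.5, 0.5                 # vocab size, topics, priors (made up)

ndk = [[0] * K for _ in docs]        # topic counts per document
nkw = [[0] * V for _ in range(K)]    # word counts per topic
nk = [0] * K                         # total tokens per topic
z = []                               # current topic assignment per token

# Random initialization of topic assignments.
for d, doc in enumerate(docs):
    zs = []
    for w in doc:
        t = random.randrange(K)
        zs.append(t)
        ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    z.append(zs)

for _ in range(50):                  # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # Remove the token's current assignment from the counts.
            ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
            # Full conditional P(z = k | everything else), up to a constant.
            weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                       for k in range(K)]
            r = random.random() * sum(weights)
            t, acc = 0, weights[0]
            while acc < r:
                t += 1
                acc += weights[t]
            # Record the newly sampled topic.
            z[d][i] = t
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1

print(sum(nk))  # total token count is invariant across sweeps: 12
```

The counts stay consistent across sweeps because each resample removes and re-adds exactly one token; that invariant is the usual sanity check for a collapsed sampler.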
[jira] [Created] (SPARK-13405) Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering
Shixiong Zhu created SPARK-13405: Summary: Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering Key: SPARK-13405 URL: https://issues.apache.org/jira/browse/SPARK-13405 Project: Spark Issue Type: Bug Components: Streaming Reporter: Shixiong Zhu Assignee: Shixiong Zhu {code} org.scalatest.exceptions.TestFailedException: Assert failed: : null equaled null onQueryTerminated called before onQueryStarted org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector$$anonfun$onQueryTerminated$1.apply$mcV$sp(ContinuousQueryListenerSuite.scala:204) org.scalatest.concurrent.AsyncAssertions$Waiter.apply(AsyncAssertions.scala:349) org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector.onQueryTerminated(ContinuousQueryListenerSuite.scala:203) org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:67) org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:32) org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.postToAll(ContinuousQueryListenerBus.scala:32) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13304) Broadcast join with two ints could be very slow
[ https://issues.apache.org/jira/browse/SPARK-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13304: Assignee: Apache Spark > Broadcast join with two ints could be very slow > --- > > Key: SPARK-13304 > URL: https://issues.apache.org/jira/browse/SPARK-13304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > If the two join columns have the same value, their combined hash code (a > ^ b) is always 0, so every key lands in a single HashMap bucket and lookups degrade to linear scans, making the HashMap very slow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13304) Broadcast join with two ints could be very slow
[ https://issues.apache.org/jira/browse/SPARK-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154570#comment-15154570 ] Apache Spark commented on SPARK-13304: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11188 > Broadcast join with two ints could be very slow > --- > > Key: SPARK-13304 > URL: https://issues.apache.org/jira/browse/SPARK-13304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu > > If the two join columns have the same value, their combined hash code (a > ^ b) is always 0, so every key lands in a single HashMap bucket and lookups degrade to linear scans, making the HashMap very slow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13304) Broadcast join with two ints could be very slow
[ https://issues.apache.org/jira/browse/SPARK-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13304: Assignee: (was: Apache Spark) > Broadcast join with two ints could be very slow > --- > > Key: SPARK-13304 > URL: https://issues.apache.org/jira/browse/SPARK-13304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu > > If the two join columns have the same value, their combined hash code (a > ^ b) is always 0, so every key lands in a single HashMap bucket and lookups degrade to linear scans, making the HashMap very slow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
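The degenerate hash described in SPARK-13304 is easy to reproduce outside Spark. Below is a minimal plain-Python sketch (not Spark's actual hashing code; the function names are invented for the example) showing why `a ^ b` collapses equal-valued two-int keys into one bucket, while a simple multiplicative mix keeps them apart:

```python
# Why hashing a two-int join key as (a ^ b) is pathological when both
# columns carry the same value: a ^ a == 0 for every a.

def xor_hash(a, b):
    # The problematic scheme: equal pairs all hash to 0.
    return a ^ b

def mixed_hash(a, b):
    # A Java-style multiplicative mix (31 * h + x) spreads equal pairs
    # across distinct values instead of collapsing them.
    return (31 * a + b) & 0x7FFFFFFF

keys = [(i, i) for i in range(1000)]          # join keys where both ints match
xor_buckets = {xor_hash(a, b) for a, b in keys}
mixed_buckets = {mixed_hash(a, b) for a, b in keys}

print(len(xor_buckets))    # 1    -- all 1000 keys collide into one bucket
print(len(mixed_buckets))  # 1000 -- no collisions for this input
```

With every key in one bucket, a hash map's expected O(1) lookup becomes an O(n) chain scan, which matches the "very very slow" behavior the issue reports.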
[jira] [Assigned] (SPARK-13404) Create the variables for input when it's used
[ https://issues.apache.org/jira/browse/SPARK-13404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13404: Assignee: Apache Spark (was: Davies Liu) > Create the variables for input when it's used > - > > Key: SPARK-13404 > URL: https://issues.apache.org/jira/browse/SPARK-13404 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > Right now, we create the variables in the first operator (usually > InputAdapter); that work is wasted if most rows are filtered out immediately. > We should defer creating them until they are used by the following operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13404) Create the variables for input when it's used
[ https://issues.apache.org/jira/browse/SPARK-13404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154559#comment-15154559 ] Apache Spark commented on SPARK-13404: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11274 > Create the variables for input when it's used > - > > Key: SPARK-13404 > URL: https://issues.apache.org/jira/browse/SPARK-13404 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Right now, we create the variables in the first operator (usually > InputAdapter); that work is wasted if most rows are filtered out immediately. > We should defer creating them until they are used by the following operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13404) Create the variables for input when it's used
[ https://issues.apache.org/jira/browse/SPARK-13404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13404: Assignee: Davies Liu (was: Apache Spark) > Create the variables for input when it's used > - > > Key: SPARK-13404 > URL: https://issues.apache.org/jira/browse/SPARK-13404 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Right now, we create the variables in the first operator (usually > InputAdapter); that work is wasted if most rows are filtered out immediately. > We should defer creating them until they are used by the following operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
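SPARK-13404 is about deferring per-column variable creation in generated code until an operator actually consumes the column. As a hedged sketch of the idea only (plain Python, not Spark's whole-stage codegen; the row shape and decode cost are invented stand-ins), the eager and deferred variants differ in how much work survives the filter:

```python
# Model of eager vs. deferred column materialization: decoding a column
# is the "variable creation" work; the filter discards half the rows.

rows = [(i, f"payload-{i}") for i in range(10)]  # (key, expensive column)

def eager(rows):
    # Decode every column up front, even for rows the filter rejects.
    decoded, out = 0, []
    for a, b in rows:
        big = b.upper()          # wasted when the row is filtered out
        decoded += 1
        if a % 2 == 0:
            out.append(big)
    return out, decoded

def deferred(rows):
    # Decode the expensive column only after the filter passes.
    decoded, out = 0, []
    for a, b in rows:
        if a % 2 == 0:
            big = b.upper()
            decoded += 1
            out.append(big)
    return out, decoded

print(eager(rows)[1])     # 10 decodes
print(deferred(rows)[1])  # 5 decodes, identical output rows
```

Both variants return the same rows; the deferred one simply skips the decode for rows the predicate rejects, which is the saving the issue describes.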
[jira] [Comment Edited] (SPARK-13349) adding a split and union to a streaming application cause big performance hit
[ https://issues.apache.org/jira/browse/SPARK-13349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154558#comment-15154558 ] krishna ramachandran edited comment on SPARK-13349 at 2/19/16 6:01 PM: --- enabling "cache" for a DStream causes the app to run out of memory. I believe this is a bug was (Author: ramach1776): enabling "cache" for a DStream causes the app to run out of memory > adding a split and union to a streaming application cause big performance hit > - > > Key: SPARK-13349 > URL: https://issues.apache.org/jira/browse/SPARK-13349 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.4.1 >Reporter: krishna ramachandran >Priority: Critical > > We have a streaming application containing approximately 12 jobs every batch, > running in streaming mode (4 sec batches). Each job writes output to cassandra > each job can contain several stages. > job 1 > ---> receive Stream A --> map --> filter -> (union with another stream B) --> > map --> groupbykey --> transform --> reducebykey --> map > we go thro' few more jobs of transforms and save to database. > Around stage 5, we union the output of Dstream from job 1 (in red) with > another stream (generated by split during job 2) and save that state > It appears the whole execution thus far is repeated which is redundant (I can > see this in execution graph & also performance -> processing time). > Processing time per batch nearly doubles or triples. > This additional & redundant processing cause each batch to run as much as 2.5 > times slower compared to runs without the union - union for most batches does > not alter the original DStream (union with an empty set). If I cache the > DStream from job 1(red block output), performance improves substantially but > hit out of memory errors within few hours. > What is the recommended way to cache/unpersist in such a scenario? 
there is > no dstream level "unpersist" > setting "spark.streaming.unpersist" to true and > streamingContext.remember("duration") did not help. Still seeing out of > memory errors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-13349) adding a split and union to a streaming application cause big performance hit
[ https://issues.apache.org/jira/browse/SPARK-13349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] krishna ramachandran reopened SPARK-13349: -- enabling "cache" for a DStream causes the app to run out of memory > adding a split and union to a streaming application cause big performance hit > - > > Key: SPARK-13349 > URL: https://issues.apache.org/jira/browse/SPARK-13349 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.4.1 >Reporter: krishna ramachandran >Priority: Critical > > We have a streaming application containing approximately 12 jobs every batch, > running in streaming mode (4 sec batches). Each job writes output to cassandra > each job can contain several stages. > job 1 > ---> receive Stream A --> map --> filter -> (union with another stream B) --> > map --> groupbykey --> transform --> reducebykey --> map > we go thro' few more jobs of transforms and save to database. > Around stage 5, we union the output of Dstream from job 1 (in red) with > another stream (generated by split during job 2) and save that state > It appears the whole execution thus far is repeated which is redundant (I can > see this in execution graph & also performance -> processing time). > Processing time per batch nearly doubles or triples. > This additional & redundant processing cause each batch to run as much as 2.5 > times slower compared to runs without the union - union for most batches does > not alter the original DStream (union with an empty set). If I cache the > DStream from job 1(red block output), performance improves substantially but > hit out of memory errors within few hours. > What is the recommended way to cache/unpersist in such a scenario? there is > no dstream level "unpersist" > setting "spark.streaming.unpersist" to true and > streamingContext.remember("duration") did not help. 
Still seeing out of > memory errors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13349) adding a split and union to a streaming application cause big performance hit
[ https://issues.apache.org/jira/browse/SPARK-13349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154555#comment-15154555 ] krishna ramachandran commented on SPARK-13349: -- Hi Sean, I posted to user@ and hit two problems: 1) not much traction, and 2) though I registered multiple times, I keep getting this message at Nabble: "This post has NOT been accepted by the mailing list yet". The message I posted is pasted below. This is not just a question; it is a bug. We have a streaming application containing approximately 12 jobs every batch, running in streaming mode (4 sec batches). Each job has several transformations and one action (output to Cassandra) which triggers execution of the job (DAG). For example, the first job, job 1 ---> receive Stream A --> map --> filter -> (union with another stream B) --> map --> groupbykey --> transform --> reducebykey --> map. Likewise we go through a few more transforms and save to the database (job2, job3...). Recently we added a new transformation further downstream wherein we union the output of the DStream from job 1 (in italics) with the output of a new transformation (job 5). It appears the whole execution thus far is repeated, which is redundant (I can see this in the execution graph and also in performance -> processing time). That is, with this additional transformation (union with a stream processed upstream), each batch runs as much as 2.5 times slower compared to runs without the union. If I cache the DStream from job 1 (italics), performance improves substantially, but we hit out-of-memory errors within a few hours. What is the recommended way to cache/unpersist in such a scenario? There is no DStream-level "unpersist"; setting "spark.streaming.unpersist" to true and streamingContext.remember("duration") did not help. 
> adding a split and union to a streaming application cause big performance hit > - > > Key: SPARK-13349 > URL: https://issues.apache.org/jira/browse/SPARK-13349 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.4.1 >Reporter: krishna ramachandran >Priority: Critical > > We have a streaming application containing approximately 12 jobs every batch, > running in streaming mode (4 sec batches). Each job writes output to cassandra > each job can contain several stages. > job 1 > ---> receive Stream A --> map --> filter -> (union with another stream B) --> > map --> groupbykey --> transform --> reducebykey --> map > we go thro' few more jobs of transforms and save to database. > Around stage 5, we union the output of Dstream from job 1 (in red) with > another stream (generated by split during job 2) and save that state > It appears the whole execution thus far is repeated which is redundant (I can > see this in execution graph & also performance -> processing time). > Processing time per batch nearly doubles or triples. > This additional & redundant processing cause each batch to run as much as 2.5 > times slower compared to runs without the union - union for most batches does > not alter the original DStream (union with an empty set). If I cache the > DStream from job 1(red block output), performance improves substantially but > hit out of memory errors within few hours. > What is the recommended way to cache/unpersist in such a scenario? there is > no dstream level "unpersist" > setting "spark.streaming.unpersist" to true and > streamingContext.remember("duration") did not help. Still seeing out of > memory errors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13404) Create the variables for input when it's used
Davies Liu created SPARK-13404: -- Summary: Create the variables for input when it's used Key: SPARK-13404 URL: https://issues.apache.org/jira/browse/SPARK-13404 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu Right now, we create the variables in the first operator (usually InputAdapter); that work is wasted if most rows are filtered out immediately. We should defer creating them until they are used by the following operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13373) Generate code for sort merge join
[ https://issues.apache.org/jira/browse/SPARK-13373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13373: Assignee: Apache Spark (was: Davies Liu) > Generate code for sort merge join > - > > Key: SPARK-13373 > URL: https://issues.apache.org/jira/browse/SPARK-13373 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13373) Generate code for sort merge join
[ https://issues.apache.org/jira/browse/SPARK-13373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154549#comment-15154549 ] Apache Spark commented on SPARK-13373: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11248 > Generate code for sort merge join > - > > Key: SPARK-13373 > URL: https://issues.apache.org/jira/browse/SPARK-13373 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13373) Generate code for sort merge join
[ https://issues.apache.org/jira/browse/SPARK-13373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13373: Assignee: Davies Liu (was: Apache Spark) > Generate code for sort merge join > - > > Key: SPARK-13373 > URL: https://issues.apache.org/jira/browse/SPARK-13373 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13403) HiveConf used for SparkSQL is not based on the Hadoop configuration
[ https://issues.apache.org/jira/browse/SPARK-13403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13403: Assignee: Apache Spark > HiveConf used for SparkSQL is not based on the Hadoop configuration > --- > > Key: SPARK-13403 > URL: https://issues.apache.org/jira/browse/SPARK-13403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Ryan Blue >Assignee: Apache Spark > > The HiveConf instances used by HiveContext are not instantiated by passing in > the SparkContext's Hadoop conf and are instead based only on the config files > in the environment. Hadoop best practice is to instantiate just one > Configuration from the environment and then pass that conf when instantiating > others so that modifications aren't lost. > Spark will set configuration variables that start with "spark.hadoop." from > spark-defaults.conf when creating {{sc.hadoopConfiguration}}, which are not > correctly passed to the HiveConf because of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13403) HiveConf used for SparkSQL is not based on the Hadoop configuration
[ https://issues.apache.org/jira/browse/SPARK-13403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13403: Assignee: (was: Apache Spark) > HiveConf used for SparkSQL is not based on the Hadoop configuration > --- > > Key: SPARK-13403 > URL: https://issues.apache.org/jira/browse/SPARK-13403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Ryan Blue > > The HiveConf instances used by HiveContext are not instantiated by passing in > the SparkContext's Hadoop conf and are instead based only on the config files > in the environment. Hadoop best practice is to instantiate just one > Configuration from the environment and then pass that conf when instantiating > others so that modifications aren't lost. > Spark will set configuration variables that start with "spark.hadoop." from > spark-defaults.conf when creating {{sc.hadoopConfiguration}}, which are not > correctly passed to the HiveConf because of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13403) HiveConf used for SparkSQL is not based on the Hadoop configuration
[ https://issues.apache.org/jira/browse/SPARK-13403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154533#comment-15154533 ] Apache Spark commented on SPARK-13403: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/11273 > HiveConf used for SparkSQL is not based on the Hadoop configuration > --- > > Key: SPARK-13403 > URL: https://issues.apache.org/jira/browse/SPARK-13403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Ryan Blue > > The HiveConf instances used by HiveContext are not instantiated by passing in > the SparkContext's Hadoop conf and are instead based only on the config files > in the environment. Hadoop best practice is to instantiate just one > Configuration from the environment and then pass that conf when instantiating > others so that modifications aren't lost. > Spark will set configuration variables that start with "spark.hadoop." from > spark-defaults.conf when creating {{sc.hadoopConfiguration}}, which are not > correctly passed to the HiveConf because of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13403) HiveConf used for SparkSQL is not based on the Hadoop configuration
Ryan Blue created SPARK-13403: - Summary: HiveConf used for SparkSQL is not based on the Hadoop configuration Key: SPARK-13403 URL: https://issues.apache.org/jira/browse/SPARK-13403 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.0 Reporter: Ryan Blue The HiveConf instances used by HiveContext are not instantiated by passing in the SparkContext's Hadoop conf and are instead based only on the config files in the environment. Hadoop best practice is to instantiate just one Configuration from the environment and then pass that conf when instantiating others so that modifications aren't lost. Spark will set configuration variables that start with "spark.hadoop." from spark-defaults.conf when creating {{sc.hadoopConfiguration}}, which are not correctly passed to the HiveConf because of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
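SPARK-13403 above describes the Hadoop convention of building one base Configuration and deriving others from it so programmatic overrides survive. Below is a plain-Python model of that pattern (not Spark or Hive code; the dict-based functions and config keys are hypothetical stand-ins for `new Configuration(base)`-style copy construction):

```python
# Model of config propagation: overrides applied to the live base conf
# are lost if a derived conf re-reads the environment instead of copying.

def load_from_env_files():
    # Stand-in for parsing core-site.xml and friends off disk.
    return {"fs.defaultFS": "hdfs://nn:8020"}

def derive(base, overrides=None):
    # Copy-construct from the base, then layer overrides on top.
    conf = dict(base)
    conf.update(overrides or {})
    return conf

# Base conf plus a "spark.hadoop.*"-style programmatic override.
hadoop_conf = derive(load_from_env_files(),
                     {"io.compression.codecs": "custom"})

# Buggy pattern (what the issue reports): re-reading the environment
# silently drops the override.
hive_conf_bad = load_from_env_files()

# Correct pattern: derive from the live base conf.
hive_conf_good = derive(hadoop_conf)

print("io.compression.codecs" in hive_conf_bad)   # False -- override lost
print("io.compression.codecs" in hive_conf_good)  # True  -- override kept
```

The fix direction the issue implies is the second pattern: construct the Hive-side configuration from `sc.hadoopConfiguration` rather than from the environment files alone.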
[jira] [Closed] (SPARK-13402) List Spark R dependencies
[ https://issues.apache.org/jira/browse/SPARK-13402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk closed SPARK-13402. --- Resolution: Not A Problem > List Spark R dependencies > - > > Key: SPARK-13402 > URL: https://issues.apache.org/jira/browse/SPARK-13402 > Project: Spark > Issue Type: Improvement > Components: Documentation, SparkR >Reporter: holdenk >Priority: Trivial > > Especially for developers in other languages who want to build the docs and > similar it would be good to have a list of packages that SparkR depends on so > they can quickly build the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13402) List Spark R dependencies
[ https://issues.apache.org/jira/browse/SPARK-13402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154507#comment-15154507 ] holdenk commented on SPARK-13402: - Oh let me close this, I was looking in the wrong place (inside of the R documentation and the R setup scripts). > List Spark R dependencies > - > > Key: SPARK-13402 > URL: https://issues.apache.org/jira/browse/SPARK-13402 > Project: Spark > Issue Type: Improvement > Components: Documentation, SparkR >Reporter: holdenk >Priority: Trivial > > Especially for developers in other languages who want to build the docs and > similar it would be good to have a list of packages that SparkR depends on so > they can quickly build the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13402) List Spark R dependencies
[ https://issues.apache.org/jira/browse/SPARK-13402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154505#comment-15154505 ] Sean Owen commented on SPARK-13402: --- https://github.com/apache/spark/blob/master/docs/README.md says knitr and devtools; are there more or was that what you're looking for? > List Spark R dependencies > - > > Key: SPARK-13402 > URL: https://issues.apache.org/jira/browse/SPARK-13402 > Project: Spark > Issue Type: Improvement > Components: Documentation, SparkR >Reporter: holdenk >Priority: Trivial > > Especially for developers in other languages who want to build the docs and > similar it would be good to have a list of packages that SparkR depends on so > they can quickly build the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13402) List Spark R dependencies
holdenk created SPARK-13402: --- Summary: List Spark R dependencies Key: SPARK-13402 URL: https://issues.apache.org/jira/browse/SPARK-13402 Project: Spark Issue Type: Improvement Components: Documentation, SparkR Reporter: holdenk Priority: Trivial Especially for developers in other languages who want to build the docs and similar it would be good to have a list of packages that SparkR depends on so they can quickly build the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13400) Stop using deprecated Octal escape literals
[ https://issues.apache.org/jira/browse/SPARK-13400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154483#comment-15154483 ] holdenk commented on SPARK-13400: - oh yah for sure, I was going to try running some test generation tools and other things. > Stop using deprecated Octal escape literals > --- > > Key: SPARK-13400 > URL: https://issues.apache.org/jira/browse/SPARK-13400 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk >Priority: Trivial > > We use some deprecated octal literals. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13400) Stop using deprecated Octal escape literals
[ https://issues.apache.org/jira/browse/SPARK-13400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154479#comment-15154479 ] Sean Owen commented on SPARK-13400: --- Yeah all these are good. We need to zap virtually all the warnings before 2.0. Keep going and I'll review. I also want to file some changes to clean up things that aren't compiler warnings but that are bugs identifiable from static analysis, or needlessly slow code that can be simplified. > Stop using deprecated Octal escape literals > --- > > Key: SPARK-13400 > URL: https://issues.apache.org/jira/browse/SPARK-13400 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk >Priority: Trivial > > We use some deprecated octal literals.
[jira] [Created] (SPARK-13401) Fix SQL test warnings
holdenk created SPARK-13401: --- Summary: Fix SQL test warnings Key: SPARK-13401 URL: https://issues.apache.org/jira/browse/SPARK-13401 Project: Spark Issue Type: Improvement Components: SQL, Tests Reporter: holdenk Priority: Trivial SQL tests have a number of warnings about unreachable code, non-exhaustive matches, and unchecked type casts.
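To illustrate one of the warning classes mentioned above, here is a minimal, self-contained sketch of a non-exhaustive match on a sealed hierarchy and its fix; the `Shape` types are hypothetical stand-ins, not Spark's actual test code:

```scala
sealed trait Shape
case class Circle(r: Double) extends Shape
case class Square(side: Double) extends Shape

object MatchDemo {
  // Matching only Circle would draw a "match may not be exhaustive"
  // warning; covering every case of the sealed trait silences it.
  def area(s: Shape): Double = s match {
    case Circle(r)    => math.Pi * r * r
    case Square(side) => side * side
  }

  def main(args: Array[String]): Unit = {
    println(area(Square(2.0))) // 4.0
  }
}
```

The same principle applies to the Spark SQL test suites: making each match exhaustive (or adding an explicit catch-all case) removes the warning without changing behavior.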
[jira] [Created] (SPARK-13400) Stop using deprecated Octal escape literals
holdenk created SPARK-13400: --- Summary: Stop using deprecated Octal escape literals Key: SPARK-13400 URL: https://issues.apache.org/jira/browse/SPARK-13400 Project: Spark Issue Type: Sub-task Components: SQL Reporter: holdenk Priority: Trivial We use some deprecated octal literals.
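The migration for deprecated octal escapes is mechanical. A minimal sketch, assuming the intent is a NUL character; the literals here are illustrative, not taken from Spark's sources:

```scala
object OctalEscapeDemo {
  // Deprecated style (warns in Scala 2.11+): val nul = "\0"
  // Preferred replacements:
  val nulUnicode: String = "\u0000"           // unicode escape
  val nulFromCode: String = 0.toChar.toString // explicit code point

  def main(args: Array[String]): Unit = {
    assert(nulUnicode == nulFromCode)
    println(nulUnicode.length) // 1
  }
}
```

Either form produces the same single-character string, so the swap is behavior-preserving.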
[jira] [Created] (SPARK-13399) Investigate type erasure warnings in CheckpointSuite
holdenk created SPARK-13399: --- Summary: Investigate type erasure warnings in CheckpointSuite Key: SPARK-13399 URL: https://issues.apache.org/jira/browse/SPARK-13399 Project: Spark Issue Type: Improvement Reporter: holdenk Priority: Trivial
[warn] /home/holden/repos/spark/streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala:154: abstract type V in type org.apache.spark.streaming.TestOutputStreamWithPartitions[V] is unchecked since it is eliminated by erasure
[warn] dstream.isInstanceOf[TestOutputStreamWithPartitions[V]]
[warn] ^
[warn] /home/holden/repos/spark/streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala:911: abstract type V in type org.apache.spark.streaming.TestOutputStreamWithPartitions[V] is unchecked since it is eliminated by erasure
[warn] dstream.isInstanceOf[TestOutputStreamWithPartitions[V]]
[warn] ^
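The warning above exists because the JVM erases type arguments, so an `isInstanceOf` check against a parameterized type can only inspect the raw class. A minimal self-contained sketch, with a hypothetical `Box` class standing in for `TestOutputStreamWithPartitions`:

```scala
class Box[V](val value: V)

object ErasureDemo {
  def main(args: Array[String]): Unit = {
    val boxed: Any = new Box[String]("hello")
    // Only the raw class Box is checked at runtime; the type argument
    // is erased, so this reports true even though V is String, not Int.
    // The compiler emits an "unchecked" warning here, just as it does
    // in CheckpointSuite.
    println(boxed.isInstanceOf[Box[Int]]) // true
  }
}
```

This is why such checks cannot simply be "fixed" by tightening the type test; they need either a raw-class check or a redesign that carries the type evidence explicitly.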
[jira] [Created] (SPARK-13398) Move away from deprecated ThreadPoolTaskSupport
holdenk created SPARK-13398: --- Summary: Move away from deprecated ThreadPoolTaskSupport Key: SPARK-13398 URL: https://issues.apache.org/jira/browse/SPARK-13398 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: holdenk Priority: Trivial ThreadPoolTaskSupport has been replaced by ForkJoinTaskSupport.
[jira] [Created] (SPARK-13397) Cleanup transient annotations which aren't being applied
holdenk created SPARK-13397: --- Summary: Cleanup transient annotations which aren't being applied Key: SPARK-13397 URL: https://issues.apache.org/jira/browse/SPARK-13397 Project: Spark Issue Type: Sub-task Reporter: holdenk Priority: Trivial In a number of places we have transient markers which are discarded as unused.
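For context, `@transient` only has an effect when it lands on a field that actually gets serialized; annotating something that never reaches serialization (a local value, for instance) is what the compiler flags as discarded. A minimal sketch of the annotation doing real work under plain Java serialization; the `Holder` class is hypothetical:

```scala
import java.io._

// @transient on a constructor val marks the backing field transient,
// so Java serialization skips it.
class Holder(@transient val cache: Array[Int], val id: Int) extends Serializable

object TransientDemo {
  def roundTrip(h: Holder): Holder = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(h)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    in.readObject().asInstanceOf[Holder]
  }

  def main(args: Array[String]): Unit = {
    val copy = roundTrip(new Holder(Array(1, 2, 3), 7))
    println(copy.cache == null) // true: the transient field was skipped
    println(copy.id)            // 7
  }
}
```

Markers that the compiler reports as discarded do nothing at all, which is why removing them is safe cleanup.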
[jira] [Created] (SPARK-13396) Stop using our internal deprecated .metrics on ExceptionFailure instead use accumUpdates
holdenk created SPARK-13396: --- Summary: Stop using our internal deprecated .metrics on ExceptionFailure instead use accumUpdates Key: SPARK-13396 URL: https://issues.apache.org/jira/browse/SPARK-13396 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: holdenk Priority: Minor src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala:385: value metrics in class ExceptionFailure is deprecated: use accumUpdates instead
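This is the usual deprecated-accessor migration: the old member stays for compatibility and forwards to its replacement, and internal callers move to the new name. A hypothetical sketch; these class and field shapes are illustrative, not Spark's actual `ExceptionFailure`:

```scala
// Illustrative stand-in for a class carrying a deprecated accessor.
case class TaskFailure(accumUpdates: Seq[Long]) {
  @deprecated("use accumUpdates instead", "2.0.0")
  def metrics: Seq[Long] = accumUpdates
}

object DeprecationDemo {
  def main(args: Array[String]): Unit = {
    val f = TaskFailure(Seq(1L, 2L))
    // Reading the replacement field directly, as the JIRA suggests,
    // avoids the deprecation warning the old accessor would emit.
    println(f.accumUpdates.sum) // 3
  }
}
```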
[jira] [Created] (SPARK-13395) Silence or skip unsafe deprecation warnings
holdenk created SPARK-13395: --- Summary: Silence or skip unsafe deprecation warnings Key: SPARK-13395 URL: https://issues.apache.org/jira/browse/SPARK-13395 Project: Spark Issue Type: Sub-task Reporter: holdenk Priority: Trivial In a number of places inside Spark we use the unsafe API, which produces warnings we probably want to silence if it's possible.
[jira] [Updated] (SPARK-13395) Silence or skip unsafe deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-13395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-13395: Component/s: (was: Streaming) (was: Examples) > Silence or skip unsafe deprecation warnings > --- > > Key: SPARK-13395 > URL: https://issues.apache.org/jira/browse/SPARK-13395 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Reporter: holdenk >Priority: Trivial > > In a number of places inside Spark we use the unsafe API, which produces > warnings we probably want to silence if it's possible.
[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154393#comment-15154393 ] Herman van Hovell commented on SPARK-13393: --- I think you are encountering the following bug: https://issues.apache.org/jira/browse/SPARK-10737 Could you try to run this on 1.5.2/1.6.0/master? > Column mismatch issue in left_outer join using Spark DataFrame > -- > > Key: SPARK-13393 > URL: https://issues.apache.org/jira/browse/SPARK-13393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Varadharajan > > Consider the below snippet: > {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join, is messed up and shows as below:
> | id|varadha_id|
> | 1| 1|
> | 2| 2 (*This should've been null*)|
> whereas `correctDF` has the correct output after the left join:
> | id|nagaraj_id|
> | 1| null|
> | 2| 2|