[jira] [Updated] (SPARK-20135) spark thriftserver2: no job running but cores not release on yarn
[ https://issues.apache.org/jira/browse/SPARK-20135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

bruce xu updated SPARK-20135:
-----------------------------
    Attachment: 0329-3.png
                0329-2.png
                0329-1.png

Cores and memory are not released for a long time when no job is running.

> spark thriftserver2: no job running but cores not release on yarn
> -----------------------------------------------------------------
>
>                 Key: SPARK-20135
>                 URL: https://issues.apache.org/jira/browse/SPARK-20135
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1
>         Environment: spark 2.0.1 with hadoop 2.6.0
>            Reporter: bruce xu
>         Attachments: 0329-1.png, 0329-2.png, 0329-3.png
>
> I enabled the executor dynamic allocation feature, but it does not always work.
> I set the initial executor count to 50; after the job finished, the cores and memory were not released.
> The Spark web UI shows 0 active jobs, running tasks, and stages, but the executors page shows 1276 cores and 7288 active tasks.
> The YARN web UI shows 639 running containers for the Thrift Server application.
> This may be a bug.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20135) spark thriftserver2: no job running but cores not release on yarn
bruce xu created SPARK-20135:
-----------------------------

             Summary: spark thriftserver2: no job running but cores not release on yarn
                 Key: SPARK-20135
                 URL: https://issues.apache.org/jira/browse/SPARK-20135
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.1
         Environment: spark 2.0.1 with hadoop 2.6.0
            Reporter: bruce xu

I enabled the executor dynamic allocation feature, but it does not always work.
I set the initial executor count to 50; after the job finished, the cores and memory were not released.
The Spark web UI shows 0 active jobs, running tasks, and stages, but the executors page shows 1276 cores and 7288 active tasks.
The YARN web UI shows 639 running containers for the Thrift Server application.
This may be a bug.
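For context, the behavior the reporter describes is governed by Spark's dynamic allocation settings. A minimal sketch of launching the Thrift Server with dynamic allocation enabled might look like the following; the values are illustrative assumptions, not the reporter's actual configuration:

```shell
# Illustrative sketch only: enabling executor dynamic allocation for the
# Spark Thrift Server on YARN. The values below are assumptions, not the
# reporter's actual settings.
./sbin/start-thriftserver.sh \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=50 \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  --conf spark.shuffle.service.enabled=true
```

With settings like these, idle executors should be released back to YARN after the idle timeout; the report above describes exactly that release failing to happen. Note that the external shuffle service (`spark.shuffle.service.enabled=true`) is required for dynamic allocation on YARN.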
[jira] [Updated] (SPARK-20107) Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to configuration.md
[ https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-20107:
--------------------------------
    Description:
Setting {{spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2}} can speed up [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121] for many output files. It saved {{11 minutes}} for 216869 output files:

{code:sql}
CREATE TABLE tmp.spark_20107 AS SELECT
  category_id,
  product_id,
  track_id,
  concat(
    substr(ds, 3, 2),
    substr(ds, 6, 2),
    substr(ds, 9, 2)
  ) shortDate,
  CASE WHEN actiontype = '0' THEN 'browse'
       WHEN actiontype = '1' THEN 'fav'
       WHEN actiontype = '2' THEN 'cart'
       WHEN actiontype = '3' THEN 'order'
       ELSE 'invalid actio' END AS type
FROM
  tmp.user_action
WHERE
  ds > date_sub('2017-01-23', 730)
  AND actiontype IN ('0', '1', '2', '3');
{code}

{code}
$ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
216870
{code}

We should add this option to [configuration.md|http://spark.apache.org/docs/latest/configuration.html]. Cloudera's Hadoop 2.6.0-cdh5.4.0 and higher (see: [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433] and [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0]) and Apache Hadoop 2.7.0 and higher support this improvement.

  was:
Setting {{mapreduce.fileoutputcommitter.algorithm.version=2}} speeds up [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121] for many output files. It saved {{11 minutes}} for 216869 output files (same query and file listing as above). This improvement affects Cloudera's Hadoop 2.6.0-cdh5.4.0 and higher (see the commits above) and Apache Hadoop 2.7.0 and higher.

> Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to configuration.md
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20107
>                 URL: https://issues.apache.org/jira/browse/SPARK-20107
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 2.1.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Trivial
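As a reference for readers, the option discussed above can be passed at submit time via the `spark.hadoop.` prefix, which forwards the property into the job's Hadoop configuration. A sketch, where `your-app.jar` is a placeholder application:

```shell
# Sketch: switching the file output committer to algorithm version 2.
# The spark.hadoop. prefix forwards the property into the Hadoop
# Configuration used by the job. your-app.jar is a placeholder.
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  your-app.jar
```

The same property can be set once for all jobs in `conf/spark-defaults.conf`. Algorithm version 2 moves each task's output files to the final destination during task commit, which is what makes the final {{commitJob}} step cheap when there are many output files.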
[jira] [Assigned] (SPARK-20107) Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to configuration.md
[ https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20107: - Assignee: Yuming Wang Priority: Trivial (was: Major) Summary: Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to configuration.md (was: Speed up HadoopMapReduceCommitProtocol#commitJob for many output files) > Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to > configuration.md > --- > > Key: SPARK-20107 > URL: https://issues.apache.org/jira/browse/SPARK-20107 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Trivial > > Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up > [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121] > for many output files. 
> It can speed up {{11 minutes}} for 216869 output files: > {code:sql} > CREATE TABLE tmp.spark_20107 AS SELECT > category_id, > product_id, > track_id, > concat( > substr(ds, 3, 2), > substr(ds, 6, 2), > substr(ds, 9, 2) > ) shortDate, > CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' > WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE > 'invalid actio' END AS type > FROM > tmp.user_action > WHERE > ds > date_sub('2017-01-23', 730) > AND actiontype IN ('0','1','2','3'); > {code} > {code} > $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l > 216870 > {code} > This improvement can effect all cloudera's hadoop cdh5-2.6.0_5.4.0 higher > versions(see: > [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433] > and > [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0]) > and apache's hadoop 2.7.0 higher versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
[ https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-20107: Component/s: (was: SQL) > Speed up HadoopMapReduceCommitProtocol#commitJob for many output files > -- > > Key: SPARK-20107 > URL: https://issues.apache.org/jira/browse/SPARK-20107 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Yuming Wang > > Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up > [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121] > for many output files. > It can speed up {{11 minutes}} for 216869 output files: > {code:sql} > CREATE TABLE tmp.spark_20107 AS SELECT > category_id, > product_id, > track_id, > concat( > substr(ds, 3, 2), > substr(ds, 6, 2), > substr(ds, 9, 2) > ) shortDate, > CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' > WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE > 'invalid actio' END AS type > FROM > tmp.user_action > WHERE > ds > date_sub('2017-01-23', 730) > AND actiontype IN ('0','1','2','3'); > {code} > {code} > $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l > 216870 > {code} > This improvement can effect all cloudera's hadoop cdh5-2.6.0_5.4.0 higher > versions(see: > [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433] > and > [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0]) > and apache's hadoop 2.7.0 higher versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
[ https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-20107: Component/s: Documentation > Speed up HadoopMapReduceCommitProtocol#commitJob for many output files > -- > > Key: SPARK-20107 > URL: https://issues.apache.org/jira/browse/SPARK-20107 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.1.0 >Reporter: Yuming Wang > > Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up > [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121] > for many output files. > It can speed up {{11 minutes}} for 216869 output files: > {code:sql} > CREATE TABLE tmp.spark_20107 AS SELECT > category_id, > product_id, > track_id, > concat( > substr(ds, 3, 2), > substr(ds, 6, 2), > substr(ds, 9, 2) > ) shortDate, > CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' > WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE > 'invalid actio' END AS type > FROM > tmp.user_action > WHERE > ds > date_sub('2017-01-23', 730) > AND actiontype IN ('0','1','2','3'); > {code} > {code} > $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l > 216870 > {code} > This improvement can effect all cloudera's hadoop cdh5-2.6.0_5.4.0 higher > versions(see: > [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433] > and > [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0]) > and apache's hadoop 2.7.0 higher versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20134) SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates
[ https://issues.apache.org/jira/browse/SPARK-20134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20134: Assignee: Reynold Xin (was: Apache Spark) > SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates > - > > Key: SPARK-20134 > URL: https://issues.apache.org/jira/browse/SPARK-20134 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > It is not super intuitive how to update SQLMetric on the driver side. This > patch introduces a new SQLMetrics.postDriverMetricUpdates function to do > that, and adds documentation to make it more obvious. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20134) SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates
[ https://issues.apache.org/jira/browse/SPARK-20134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20134: Assignee: Apache Spark (was: Reynold Xin) > SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates > - > > Key: SPARK-20134 > URL: https://issues.apache.org/jira/browse/SPARK-20134 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Apache Spark > > It is not super intuitive how to update SQLMetric on the driver side. This > patch introduces a new SQLMetrics.postDriverMetricUpdates function to do > that, and adds documentation to make it more obvious. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20134) SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates
[ https://issues.apache.org/jira/browse/SPARK-20134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946525#comment-15946525 ] Apache Spark commented on SPARK-20134: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/17464 > SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates > - > > Key: SPARK-20134 > URL: https://issues.apache.org/jira/browse/SPARK-20134 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > It is not super intuitive how to update SQLMetric on the driver side. This > patch introduces a new SQLMetrics.postDriverMetricUpdates function to do > that, and adds documentation to make it more obvious. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20134) SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates
Reynold Xin created SPARK-20134: --- Summary: SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates Key: SPARK-20134 URL: https://issues.apache.org/jira/browse/SPARK-20134 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin Assignee: Reynold Xin It is not super intuitive how to update SQLMetric on the driver side. This patch introduces a new SQLMetrics.postDriverMetricUpdates function to do that, and adds documentation to make it more obvious. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20131) Flaky Test: org.apache.spark.streaming.StreamingContextSuite
[ https://issues.apache.org/jira/browse/SPARK-20131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20131: Assignee: (was: Apache Spark) > Flaky Test: org.apache.spark.streaming.StreamingContextSuite > > > Key: SPARK-20131 > URL: https://issues.apache.org/jira/browse/SPARK-20131 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Takuya Ueshin >Priority: Minor > Labels: flaky-test > > This test failed recently here: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/2861/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/SPARK_18560_Receiver_data_should_be_deserialized_properly_/ > Dashboard > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.StreamingContextSuite_name=SPARK-18560+Receiver+data+should+be+deserialized+properly. > Error Message > {code} > latch.await(60L, SECONDS) was false > {code} > {code} > org.scalatest.exceptions.TestFailedException: latch.await(60L, SECONDS) was > false > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply$mcV$sp(StreamingContextSuite.scala:837) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at 
org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at > org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:44) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) > at > org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:44) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) 
> at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) > at > org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$run(StreamingContextSuite.scala:44) > at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) > at >
[jira] [Commented] (SPARK-20131) Flaky Test: org.apache.spark.streaming.StreamingContextSuite
[ https://issues.apache.org/jira/browse/SPARK-20131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946522#comment-15946522 ] Apache Spark commented on SPARK-20131: -- User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/17463 > Flaky Test: org.apache.spark.streaming.StreamingContextSuite > > > Key: SPARK-20131 > URL: https://issues.apache.org/jira/browse/SPARK-20131 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Takuya Ueshin >Priority: Minor > Labels: flaky-test > > This test failed recently here: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/2861/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/SPARK_18560_Receiver_data_should_be_deserialized_properly_/ > Dashboard > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.StreamingContextSuite_name=SPARK-18560+Receiver+data+should+be+deserialized+properly. 
> Error Message > {code} > latch.await(60L, SECONDS) was false > {code} > {code} > org.scalatest.exceptions.TestFailedException: latch.await(60L, SECONDS) was > false > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply$mcV$sp(StreamingContextSuite.scala:837) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at > org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:44) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) > at > org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:44) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) > at > org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$run(StreamingContextSuite.scala:44) > at
[jira] [Resolved] (SPARK-20093) Exception when Joining dataframe with another dataframe generated by applying groupBy transformation on original one
[ https://issues.apache.org/jira/browse/SPARK-20093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20093. -- Resolution: Duplicate ^ It seems a duplicate of that to me as well. I am resolving this. Please reopen if anyone feels this is different. > Exception when Joining dataframe with another dataframe generated by applying > groupBy transformation on original one > > > Key: SPARK-20093 > URL: https://issues.apache.org/jira/browse/SPARK-20093 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.2.0 >Reporter: Hosur Narahari > > When we generate a dataframe by doing grouping, and perform join on original > dataframe with aggregate column, we get AnalysisException. Below I've > attached a piece of code and resulting exception to reproduce. > Code: > import org.apache.spark.sql.SparkSession > object App { > lazy val spark = > SparkSession.builder.appName("Test").master("local").getOrCreate > def main(args: Array[String]): Unit = { > test1 > } > private def test1 { > import org.apache.spark.sql.functions._ > val df = spark.createDataFrame(Seq(("M",172,60), ("M", 170, 60), ("F", > 155, 56), ("M", 160, 55), ("F", 150, 53))).toDF("gender", "height", "weight") > val groupDF = df.groupBy("gender").agg(min("height").as("height")) > groupDF.show() > val out = groupDF.join(df, groupDF("height") <=> > df("height")).select(df("gender"), df("height"), df("weight")) > out.show > } > } > When I ran above code, I got below exception: > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) height#8 missing from > height#19,height#30,gender#29,weight#31,gender#7 in operator !Join Inner, > (height#19 <=> height#8);; > !Join Inner, (height#19 <=> height#8) > :- Aggregate [gender#7], [gender#7, min(height#8) AS height#19] > : +- Project [_1#0 AS gender#7, _2#1 AS height#8, _3#2 AS weight#9] > : +- LocalRelation [_1#0, _2#1, _3#2] > +- Project [_1#0 AS 
gender#29, _2#1 AS height#30, _3#2 AS weight#31] >+- LocalRelation [_1#0, _2#1, _3#2] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:90) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:342) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:90) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:53) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2831) > at org.apache.spark.sql.Dataset.join(Dataset.scala:843) > at org.apache.spark.sql.Dataset.join(Dataset.scala:807) > at App$.test1(App.scala:17) > at App$.main(App.scala:9) > at App.main(App.scala) > Please someone look into it. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20093) Exception when Joining dataframe with another dataframe generated by applying groupBy transformation on original one
[ https://issues.apache.org/jira/browse/SPARK-20093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946501#comment-15946501 ] Takeshi Yamamuro commented on SPARK-20093: -- It seems this issue is the same with SPARK-10925. > Exception when Joining dataframe with another dataframe generated by applying > groupBy transformation on original one > > > Key: SPARK-20093 > URL: https://issues.apache.org/jira/browse/SPARK-20093 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.2.0 >Reporter: Hosur Narahari > > When we generate a dataframe by doing grouping, and perform join on original > dataframe with aggregate column, we get AnalysisException. Below I've > attached a piece of code and resulting exception to reproduce. > Code: > import org.apache.spark.sql.SparkSession > object App { > lazy val spark = > SparkSession.builder.appName("Test").master("local").getOrCreate > def main(args: Array[String]): Unit = { > test1 > } > private def test1 { > import org.apache.spark.sql.functions._ > val df = spark.createDataFrame(Seq(("M",172,60), ("M", 170, 60), ("F", > 155, 56), ("M", 160, 55), ("F", 150, 53))).toDF("gender", "height", "weight") > val groupDF = df.groupBy("gender").agg(min("height").as("height")) > groupDF.show() > val out = groupDF.join(df, groupDF("height") <=> > df("height")).select(df("gender"), df("height"), df("weight")) > out.show > } > } > When I ran above code, I got below exception: > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) height#8 missing from > height#19,height#30,gender#29,weight#31,gender#7 in operator !Join Inner, > (height#19 <=> height#8);; > !Join Inner, (height#19 <=> height#8) > :- Aggregate [gender#7], [gender#7, min(height#8) AS height#19] > : +- Project [_1#0 AS gender#7, _2#1 AS height#8, _3#2 AS weight#9] > : +- LocalRelation [_1#0, _2#1, _3#2] > +- Project [_1#0 AS gender#29, _2#1 AS height#30, _3#2 AS weight#31] >+- 
LocalRelation [_1#0, _2#1, _3#2] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:90) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:342) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:90) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:53) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2831) > at org.apache.spark.sql.Dataset.join(Dataset.scala:843) > at org.apache.spark.sql.Dataset.join(Dataset.scala:807) > at App$.test1(App.scala:17) > at App$.main(App.scala:9) > at App.main(App.scala) > Please someone look into it. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20128) MetricsSystem not always killed in SparkContext.stop()
[ https://issues.apache.org/jira/browse/SPARK-20128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946493#comment-15946493 ] Saisai Shao commented on SPARK-20128: - Sorry, I cannot access the logs. What I can see from the link provided above is: {noformat} [info] - internal accumulators in multiple stages (185 milliseconds) 3/24/17 2:02:19 PM = -- Gauges -- master.aliveWorkers 3/24/17 2:22:19 PM = -- Gauges -- master.aliveWorkers 3/24/17 2:42:19 PM = -- Gauges -- master.aliveWorkers 3/24/17 3:02:19 PM = -- Gauges -- master.aliveWorkers 3/24/17 3:22:19 PM = -- Gauges -- master.aliveWorkers 3/24/17 3:42:19 PM = -- Gauges -- master.aliveWorkers 3/24/17 4:02:19 PM = -- Gauges -- master.aliveWorkers 3/24/17 4:22:19 PM = -- Gauges -- master.aliveWorkers 3/24/17 4:42:19 PM = -- Gauges -- master.aliveWorkers 3/24/17 5:02:19 PM = -- Gauges -- master.aliveWorkers 3/24/17 5:22:19 PM = -- Gauges -- {noformat} From the console output, what I can see is that after the {{internal accumulators in multiple stages}} unit test finishes, the whole test run hangs and just prints some metrics information.
> MetricsSystem not always killed in SparkContext.stop() > -- > > Key: SPARK-20128 > URL: https://issues.apache.org/jira/browse/SPARK-20128 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Imran Rashid > Labels: flaky-test > > One Jenkins run failed due to the MetricsSystem never getting killed after a > failed test, which led that test to hang and the tests to timeout: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75176 > {noformat} > 17/03/24 13:44:19.537 dag-scheduler-event-loop ERROR > DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting > down SparkContext > java.lang.ArrayIndexOutOfBoundsException: -1 > at > org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:431) > at > org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:430) > at scala.Option.flatMap(Option.scala:171) > at > org.apache.spark.MapOutputTrackerMaster.getEpochForMapOutput(MapOutputTracker.scala:430) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1298) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1731) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1689) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1678) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > 17/03/24 13:44:19.540 dispatcher-event-loop-11 INFO > MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 
> 17/03/24 13:44:19.546 stop-spark-context INFO MemoryStore: MemoryStore cleared > 17/03/24 13:44:19.546 stop-spark-context INFO BlockManager: BlockManager > stopped > 17/03/24 13:44:19.546 stop-spark-context INFO BlockManagerMaster: > BlockManagerMaster stopped > 17/03/24 13:44:19.546 dispatcher-event-loop-16 INFO > OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! > 17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully >
[jira] [Commented] (SPARK-20128) MetricsSystem not always killed in SparkContext.stop()
[ https://issues.apache.org/jira/browse/SPARK-20128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946486#comment-15946486 ] Imran Rashid commented on SPARK-20128: -- Thanks [~jerryshao], that is helpful, definitely good to know this seems to be limited to unit tests. But I still don't see how the Master isn't getting stopped. We never see {{logInfo("Shutting down local Spark cluster.")}}, which means that we're never calling {{LocalSparkCluster.stop()}} in this case. > MetricsSystem not always killed in SparkContext.stop() > --
[jira] [Comment Edited] (SPARK-20128) MetricsSystem not always killed in SparkContext.stop()
[ https://issues.apache.org/jira/browse/SPARK-20128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946457#comment-15946457 ] Saisai Shao edited comment on SPARK-20128 at 3/29/17 3:20 AM: -- Here the exception is from MasterSource, which only exists in the Standalone Master, so I think it should not be related to SparkContext; maybe the Master is not cleanly stopped. --Also, as I remember, by default we do not enable the ConsoleReporter, so I'm not sure how this could happen.-- It looks like we have a metrics properties file in the test resources; that's why the console sink is enabled in unit tests. was (Author: jerryshao): Here the exception is from MasterSource, which only exists in Standalone Master, I think it should not be related to SparkContext, may be the Master is not cleanly stopped. Also as I remembered by default we will not enable ConsoleReporter, not sure how this could be happened. > MetricsSystem not always killed in SparkContext.stop() > --
[jira] [Commented] (SPARK-20128) MetricsSystem not always killed in SparkContext.stop()
[ https://issues.apache.org/jira/browse/SPARK-20128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946457#comment-15946457 ] Saisai Shao commented on SPARK-20128: - Here the exception is from MasterSource, which only exists in Standalone Master, I think it should not be related to SparkContext, may be the Master is not cleanly stopped. Also as I remembered by default we will not enable ConsoleReporter, not sure how this could be happened. > MetricsSystem not always killed in SparkContext.stop() > -- > > Key: SPARK-20128 > URL: https://issues.apache.org/jira/browse/SPARK-20128 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Imran Rashid > Labels: flaky-test > > One Jenkins run failed due to the MetricsSystem never getting killed after a > failed test, which led that test to hang and the tests to timeout: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75176 > {noformat} > 17/03/24 13:44:19.537 dag-scheduler-event-loop ERROR > DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting > down SparkContext > java.lang.ArrayIndexOutOfBoundsException: -1 > at > org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:431) > at > org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:430) > at scala.Option.flatMap(Option.scala:171) > at > org.apache.spark.MapOutputTrackerMaster.getEpochForMapOutput(MapOutputTracker.scala:430) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1298) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1731) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1689) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1678) > at 
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > 17/03/24 13:44:19.540 dispatcher-event-loop-11 INFO > MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! > 17/03/24 13:44:19.546 stop-spark-context INFO MemoryStore: MemoryStore cleared > 17/03/24 13:44:19.546 stop-spark-context INFO BlockManager: BlockManager > stopped > 17/03/24 13:44:19.546 stop-spark-context INFO BlockManagerMaster: > BlockManagerMaster stopped > 17/03/24 13:44:19.546 dispatcher-event-loop-16 INFO > OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! > 17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully > stopped SparkContext > 17/03/24 14:02:19.934 metrics-console-reporter-1-thread-1 ERROR > ScheduledReporter: RuntimeException thrown from ConsoleReporter#report. > Exception was suppressed. > java.lang.NullPointerException > at > org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:35) > at > org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:34) > at > com.codahale.metrics.ConsoleReporter.printGauge(ConsoleReporter.java:239) > ... > {noformat} > unfortunately I didn't save the entire test logs, but what happens is the > initial IndexOutOfBoundsException is a real bug, which causes the > SparkContext to stop, and the test to fail. However, the MetricsSystem > somehow stays alive, and since its not a daemon thread, it just hangs, and > every 20 mins we get that NPE from within the metrics system as it tries to > report. > I am totally perplexed at how this can happen, it looks like the metric > system should always get stopped by the time we see > {noformat} > 17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully > stopped SparkContext > {noformat} > I don't think I've ever seen this in a real spark use, but it doesn't look > like something which is limited to tests, whatever the cause. 
[jira] [Updated] (SPARK-20133) User guide for spark.ml.stat.ChiSquareTest
[ https://issues.apache.org/jira/browse/SPARK-20133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20133: -- Description: Add new user guide section for spark.ml.stat, and document ChiSquareTest. This may involve adding new example scripts. > User guide for spark.ml.stat.ChiSquareTest > -- > > Key: SPARK-20133 > URL: https://issues.apache.org/jira/browse/SPARK-20133 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Add new user guide section for spark.ml.stat, and document ChiSquareTest. > This may involve adding new example scripts.
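For the proposed user-guide section, a minimal usage sketch of {{org.apache.spark.ml.stat.ChiSquareTest}} might look like the following. The toy data and column names here are illustrative, not from the ticket, and a running SparkSession named {{spark}} is assumed:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.ChiSquareTest

// Hypothetical toy data: (label, features) pairs.
val data = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (1.0, Vectors.dense(3.5, 40.0))
)
val df = spark.createDataFrame(data).toDF("label", "features")

// Returns a single-row DataFrame with columns
// pValues, degreesOfFreedom, and statistics.
val result = ChiSquareTest.test(df, "features", "label")
result.show(truncate = false)
```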
[jira] [Created] (SPARK-20133) User guide for spark.ml.stat.ChiSquareTest
Joseph K. Bradley created SPARK-20133: - Summary: User guide for spark.ml.stat.ChiSquareTest Key: SPARK-20133 URL: https://issues.apache.org/jira/browse/SPARK-20133 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 2.2.0 Reporter: Joseph K. Bradley Priority: Minor
[jira] [Resolved] (SPARK-20040) Python API for ml.stat.ChiSquareTest
[ https://issues.apache.org/jira/browse/SPARK-20040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-20040. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17421 [https://github.com/apache/spark/pull/17421] > Python API for ml.stat.ChiSquareTest > > > Key: SPARK-20040 > URL: https://issues.apache.org/jira/browse/SPARK-20040 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Bago Amirbekian > Fix For: 2.2.0 > > > Add PySpark wrapper for ChiSquareTest. Note that it's currently called > ChiSquare, but I'm about to rename it to ChiSquareTest in [SPARK-20039]
[jira] [Assigned] (SPARK-20040) Python API for ml.stat.ChiSquareTest
[ https://issues.apache.org/jira/browse/SPARK-20040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20040: - Assignee: Joseph K. Bradley > Python API for ml.stat.ChiSquareTest > > > Key: SPARK-20040 > URL: https://issues.apache.org/jira/browse/SPARK-20040 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > Add PySpark wrapper for ChiSquareTest. Note that it's currently called > ChiSquare, but I'm about to rename it to ChiSquareTest in [SPARK-20039]
[jira] [Assigned] (SPARK-20040) Python API for ml.stat.ChiSquareTest
[ https://issues.apache.org/jira/browse/SPARK-20040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20040: - Assignee: Bago Amirbekian (was: Joseph K. Bradley) > Python API for ml.stat.ChiSquareTest > > > Key: SPARK-20040 > URL: https://issues.apache.org/jira/browse/SPARK-20040 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Bago Amirbekian > > Add PySpark wrapper for ChiSquareTest. Note that it's currently called > ChiSquare, but I'm about to rename it to ChiSquareTest in [SPARK-20039]
[jira] [Comment Edited] (SPARK-20093) Exception when Joining dataframe with another dataframe generated by applying groupBy transformation on original one
[ https://issues.apache.org/jira/browse/SPARK-20093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946375#comment-15946375 ] Yong Zhang edited comment on SPARK-20093 at 3/29/17 1:34 AM: - This problem exists. It looks like if switching the order of join, then it works fine. scala> spark.version res16: String = 2.1.0 scala> groupDF.join(df, groupDF("height") === df("height")).show org.apache.spark.sql.AnalysisException: resolved attribute(s) height#8 missing from height#181,gender#7,height#338,weight#339,gender#337 in operator !Join Inner, (height#181 = height#8);; !Join Inner, (height#181 = height#8) :- Aggregate [gender#7], [gender#7, min(height#8) AS height#181] : +- Project [_1#3 AS gender#7, _2#4 AS height#8, _3#5 AS weight#9] : +- LocalRelation [_1#3, _2#4, _3#5] +- Project [_1#3 AS gender#337, _2#4 AS height#338, _3#5 AS weight#339] +- LocalRelation [_1#3, _2#4, _3#5] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:337) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822) at org.apache.spark.sql.Dataset.join(Dataset.scala:830) at org.apache.spark.sql.Dataset.join(Dataset.scala:796) ... 
48 elided scala> df.join(groupDF, groupDF("height") === df("height")).show +--+--+--+--+--+ |gender|height|weight|gender|height| +--+--+--+--+--+ | M| 160|55| M| 160| | F| 150|53| F| 150| +--+--+--+--+--+ was (Author: java8964): This problem exists. It looks like if switch the order of join, then it works fine. scala> spark.version res16: String = 2.1.0 scala> groupDF.join(df, groupDF("height") === df("height")).show org.apache.spark.sql.AnalysisException: resolved attribute(s) height#8 missing from height#181,gender#7,height#338,weight#339,gender#337 in operator !Join Inner, (height#181 = height#8);; !Join Inner, (height#181 = height#8) :- Aggregate [gender#7], [gender#7, min(height#8) AS height#181] : +- Project [_1#3 AS gender#7, _2#4 AS height#8, _3#5 AS weight#9] : +- LocalRelation [_1#3, _2#4, _3#5] +- Project [_1#3 AS gender#337, _2#4 AS height#338, _3#5 AS weight#339] +- LocalRelation [_1#3, _2#4, _3#5] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:337) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822) at org.apache.spark.sql.Dataset.join(Dataset.scala:830) at org.apache.spark.sql.Dataset.join(Dataset.scala:796) ... 
48 elided scala> df.join(groupDF, groupDF("height") === df("height")).show +--+--+--+--+--+ |gender|height|weight|gender|height| +--+--+--+--+--+ | M| 160|55| M| 160| | F| 150|53| F| 150| +--+--+--+--+--+
[jira] [Commented] (SPARK-20093) Exception when Joining dataframe with another dataframe generated by applying groupBy transformation on original one
[ https://issues.apache.org/jira/browse/SPARK-20093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946375#comment-15946375 ] Yong Zhang commented on SPARK-20093: This problem exists. It looks like if switch the order of join, then it works fine. scala> spark.version res16: String = 2.1.0 scala> groupDF.join(df, groupDF("height") === df("height")).show org.apache.spark.sql.AnalysisException: resolved attribute(s) height#8 missing from height#181,gender#7,height#338,weight#339,gender#337 in operator !Join Inner, (height#181 = height#8);; !Join Inner, (height#181 = height#8) :- Aggregate [gender#7], [gender#7, min(height#8) AS height#181] : +- Project [_1#3 AS gender#7, _2#4 AS height#8, _3#5 AS weight#9] : +- LocalRelation [_1#3, _2#4, _3#5] +- Project [_1#3 AS gender#337, _2#4 AS height#338, _3#5 AS weight#339] +- LocalRelation [_1#3, _2#4, _3#5] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:337) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822) at org.apache.spark.sql.Dataset.join(Dataset.scala:830) at org.apache.spark.sql.Dataset.join(Dataset.scala:796) ... 
48 elided scala> df.join(groupDF, groupDF("height") === df("height")).show +--+--+--+--+--+ |gender|height|weight|gender|height| +--+--+--+--+--+ | M| 160|55| M| 160| | F| 150|53| F| 150| +--+--+--+--+--+
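Besides swapping the join order as in the comment above, a commonly suggested workaround for this class of self-join ambiguity is to re-create one side of the join so its plan gets fresh attribute IDs. A hedged sketch (not a fix from this ticket), reusing the {{df}}/{{groupDF}} definitions from the issue description and assuming a running SparkSession named {{spark}}:

```scala
import org.apache.spark.sql.functions.min

val df = spark.createDataFrame(Seq(
  ("M", 172, 60), ("M", 170, 60), ("F", 155, 56),
  ("M", 160, 55), ("F", 150, 53))).toDF("gender", "height", "weight")

// Re-projecting with the same column names yields a logically identical
// DataFrame whose attributes carry fresh IDs, so joining the aggregate
// against it should avoid the "resolved attribute(s) missing" error.
val df2 = df.toDF(df.columns: _*)
val groupDF = df.groupBy("gender").agg(min("height").as("height"))

val out = groupDF.join(df2, groupDF("height") <=> df2("height"))
  .select(df2("gender"), df2("height"), df2("weight"))
out.show()
```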
[jira] [Commented] (SPARK-16288) Implement inline table generating function
[ https://issues.apache.org/jira/browse/SPARK-16288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946176#comment-15946176 ] Guilherme Braccialli commented on SPARK-16288: -- Is it possible to call this function directly from df.select(inline(field))? I could only make it work using df.selectExpr("inline(field)"). Thanks. > Implement inline table generating function > -- > > Key: SPARK-16288 > URL: https://issues.apache.org/jira/browse/SPARK-16288 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Dongjoon Hyun > Fix For: 2.0.0 > >
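Regarding the question above: in the Spark versions current at the time, {{inline}} appears to be available only as a SQL generator expression, not as a typed function in {{org.apache.spark.sql.functions}}, so besides {{selectExpr}} it can be reached from {{select}} via {{expr}}. A small sketch; the {{field}} array-of-struct column is illustrative, and a running SparkSession named {{spark}} is assumed:

```scala
import org.apache.spark.sql.functions.{array, expr, struct}

// Illustrative column of type array<struct<a:int, b:string>>.
val df = spark.range(1).select(
  array(struct(expr("1 as a"), expr("'x' as b"))).as("field"))

// Equivalent to df.selectExpr("inline(field)"): explodes the array
// of structs into one row per struct, with one column per struct field.
df.select(expr("inline(field)")).show()
```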
[jira] [Resolved] (SPARK-20043) Decision Tree loader does not handle uppercase impurity param values
[ https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-20043. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 17407 [https://github.com/apache/spark/pull/17407] > Decision Tree loader does not handle uppercase impurity param values > > > Key: SPARK-20043 > URL: https://issues.apache.org/jira/browse/SPARK-20043 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0, 2.2.0 >Reporter: Zied Sellami >Assignee: Yan Facai (颜发才) > Labels: starter > Fix For: 2.1.1, 2.2.0 > > > I saved a CrossValidatorModel with a decision tree and a random forest, using > a ParamGrid to test "gini" and "entropy" impurity. CrossValidatorModel is not > able to load the saved model when the impurity is not written in lowercase: I > obtain an error from Spark, "impurity Gini (Entropy) not recognized". -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
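The general idea a loader needs here is easy to sketch outside Spark: match impurity names case-insensitively and canonicalize to lowercase. This is a hedged illustration only — `supportedImpurities` and `normalizeImpurity` are made-up names, not the code merged in the PR above.

```scala
// Illustrative, Spark-free sketch of a case-insensitive impurity lookup.
// The real fix lives in Spark's tree-model reader; this only models the idea.
val supportedImpurities = Seq("gini", "entropy", "variance")

def normalizeImpurity(raw: String): String =
  supportedImpurities.find(_.equalsIgnoreCase(raw.trim)).getOrElse(
    throw new IllegalArgumentException(s"impurity $raw not recognized"))
```

With a check like this, metadata saved as "Gini" or "Entropy" would load as the canonical lowercase value instead of failing.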
[jira] [Assigned] (SPARK-20043) Decision Tree loader does not handle uppercase impurity param values
[ https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20043: - Assignee: Yan Facai (颜发才) > Decision Tree loader does not handle uppercase impurity param values > > > Key: SPARK-20043 > URL: https://issues.apache.org/jira/browse/SPARK-20043 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0, 2.2.0 >Reporter: Zied Sellami >Assignee: Yan Facai (颜发才) > Labels: starter > > I saved a CrossValidatorModel with a decision tree and a random forest, using > a ParamGrid to test "gini" and "entropy" impurity. CrossValidatorModel is not > able to load the saved model when the impurity is not written in lowercase: I > obtain an error from Spark, "impurity Gini (Entropy) not recognized". -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20132) Add documentation for column string functions
[ https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946130#comment-15946130 ] Michael Patterson commented on SPARK-20132: --- I have a commit with the documentation: https://github.com/map222/spark/commit/ac91b654555f9a07021222f2f1a162634d81be5b I will make a more formal PR tonight. > Add documentation for column string functions > - > > Key: SPARK-20132 > URL: https://issues.apache.org/jira/browse/SPARK-20132 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Michael Patterson >Priority: Minor > Labels: documentation, newbie > > Four Column string functions do not have documentation for PySpark: > rlike, like, startswith, endswith. > These functions are called through the _bin_op interface, which allows the > passing of a docstring. I have added docstrings with examples to each of the > four functions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20132) Add documentation for column string functions
Michael Patterson created SPARK-20132: - Summary: Add documentation for column string functions Key: SPARK-20132 URL: https://issues.apache.org/jira/browse/SPARK-20132 Project: Spark Issue Type: Documentation Components: PySpark, SQL Affects Versions: 2.1.0 Reporter: Michael Patterson Priority: Minor Four Column string functions do not have documentation for PySpark: rlike, like, startswith, endswith. These functions are called through the _bin_op interface, which allows the passing of a docstring. I have added docstrings with examples to each of the four functions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20050: Assignee: (was: Apache Spark) > Kafka 0.10 DirectStream doesn't commit last processed batch's offset when > graceful shutdown > --- > > Key: SPARK-20050 > URL: https://issues.apache.org/jira/browse/SPARK-20050 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Sasaki Toru > > I use the Kafka 0.10 DirectStream with property 'enable.auto.commit=false' and > call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as > below > {code} > val kafkaStream = KafkaUtils.createDirectStream[String, String](...) > kafkaStream.map { input => > "key: " + input.key.toString + " value: " + input.value.toString + " > offset: " + input.offset.toString > }.foreachRDD { rdd => > rdd.foreach { input => > println(input) > } > } > kafkaStream.foreachRDD { rdd => > val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges > kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) > } > {code} > Some records processed in the last batch before a graceful shutdown of > Streaming are reprocessed in the first batch after Spark Streaming restarts, > as below > * output of the first run of this application > {code} > key: null value: 1 offset: 101452472 > key: null value: 2 offset: 101452473 > key: null value: 3 offset: 101452474 > key: null value: 4 offset: 101452475 > key: null value: 5 offset: 101452476 > key: null value: 6 offset: 101452477 > key: null value: 7 offset: 101452478 > key: null value: 8 offset: 101452479 > key: null value: 9 offset: 101452480 // this is the last record before > shutting down Spark Streaming gracefully > {code} > * output of the re-run of this application > {code} > key: null value: 7 offset: 101452478 // duplication > key: null value: 8 offset: 101452479 // duplication > key: null value: 9 offset: 101452480 // duplication > key: null value: 10 offset: 101452481 > {code} > This happens because offsets passed to commitAsync are only committed at the > start of the next batch. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
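The mechanism the reporter describes — commitAsync only queues offset ranges, and the queued offsets are actually committed when the next batch is generated — can be modeled without Spark or Kafka. `CommitQueueModel` below is purely illustrative, not Spark's actual class:

```scala
import scala.collection.mutable

// Illustrative model of the reported behavior: commitAsync enqueues, and
// the commit is only issued when the *next* batch is generated.  If the
// stream shuts down gracefully before another batch runs, the offsets of
// the final batch are never committed, so they replay on restart.
final class CommitQueueModel {
  private val pending = mutable.Queue.empty[Long]
  var committedOffset: Long = -1L // -1 = nothing committed yet

  def commitAsync(offset: Long): Unit = pending.enqueue(offset) // no commit here

  def generateNextBatch(): Unit =
    while (pending.nonEmpty) committedOffset = pending.dequeue()
}
```

In this model, calling `commitAsync(101452480L)` and then stopping leaves `committedOffset` at -1, which matches the duplicated records the reporter sees after a restart.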
[jira] [Assigned] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20050: Assignee: Apache Spark > Kafka 0.10 DirectStream doesn't commit last processed batch's offset when > graceful shutdown > --- > > Key: SPARK-20050 > URL: https://issues.apache.org/jira/browse/SPARK-20050 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Sasaki Toru >Assignee: Apache Spark > > I use Kafka 0.10 DirectStream with properties 'enable.auto.commit=false' and > call 'DirectKafkaInputDStream#commitAsync' finally in each batches, such > below > {code} > val kafkaStream = KafkaUtils.createDirectStream[String, String](...) > kafkaStream.map { input => > "key: " + input.key.toString + " value: " + input.value.toString + " > offset: " + input.offset.toString > }.foreachRDD { rdd => > rdd.foreach { input => > println(input) > } > } > kafkaStream.foreachRDD { rdd => > val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges > kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) > } > {\code} > Some records which processed in the last batch before Streaming graceful > shutdown reprocess in the first batch after Spark Streaming restart, such > below > * output first run of this application > {code} > key: null value: 1 offset: 101452472 > key: null value: 2 offset: 101452473 > key: null value: 3 offset: 101452474 > key: null value: 4 offset: 101452475 > key: null value: 5 offset: 101452476 > key: null value: 6 offset: 101452477 > key: null value: 7 offset: 101452478 > key: null value: 8 offset: 101452479 > key: null value: 9 offset: 101452480 // this is a last record before > shutdown Spark Streaming gracefully > {\code} > * output re-run of this application > {code} > key: null value: 7 offset: 101452478 // duplication > key: null value: 8 offset: 101452479 // duplication > key: null value: 9 offset: 101452480 // duplication > key: null value: 
10 offset: 101452481 > {\code} > This happens because offsets passed to commitAsync are only committed at the > start of the next batch. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946098#comment-15946098 ] Apache Spark commented on SPARK-20050: -- User 'sasakitoa' has created a pull request for this issue: https://github.com/apache/spark/pull/17462 > Kafka 0.10 DirectStream doesn't commit last processed batch's offset when > graceful shutdown > --- > > Key: SPARK-20050 > URL: https://issues.apache.org/jira/browse/SPARK-20050 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Sasaki Toru > > I use Kafka 0.10 DirectStream with properties 'enable.auto.commit=false' and > call 'DirectKafkaInputDStream#commitAsync' finally in each batches, such > below > {code} > val kafkaStream = KafkaUtils.createDirectStream[String, String](...) > kafkaStream.map { input => > "key: " + input.key.toString + " value: " + input.value.toString + " > offset: " + input.offset.toString > }.foreachRDD { rdd => > rdd.foreach { input => > println(input) > } > } > kafkaStream.foreachRDD { rdd => > val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges > kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) > } > {\code} > Some records which processed in the last batch before Streaming graceful > shutdown reprocess in the first batch after Spark Streaming restart, such > below > * output first run of this application > {code} > key: null value: 1 offset: 101452472 > key: null value: 2 offset: 101452473 > key: null value: 3 offset: 101452474 > key: null value: 4 offset: 101452475 > key: null value: 5 offset: 101452476 > key: null value: 6 offset: 101452477 > key: null value: 7 offset: 101452478 > key: null value: 8 offset: 101452479 > key: null value: 9 offset: 101452480 // this is a last record before > shutdown Spark Streaming gracefully > {\code} > * output re-run of this application > {code} > key: null value: 7 offset: 101452478 // duplication > key: null value: 8 offset: 
101452479 // duplication > key: null value: 9 offset: 101452480 // duplication > key: null value: 10 offset: 101452481 > {\code} > This happens because offsets passed to commitAsync are only committed at the > start of the next batch. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20125) Dataset of type option of map does not work
[ https://issues.apache.org/jira/browse/SPARK-20125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20125. - Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 > Dataset of type option of map does not work > --- > > Key: SPARK-20125 > URL: https://issues.apache.org/jira/browse/SPARK-20125 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.1, 2.2.0 > > > A simple reproduce: > {code} > case class ABC(m: Option[scala.collection.Map[Int, Int]]) > val ds = Seq(ABC(Some(Map(1 -> 1)))).toDS() > ds.collect() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)
[ https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-14536: Fix Version/s: 2.1.1 > NPE in JDBCRDD when array column contains nulls (postgresql) > > > Key: SPARK-14536 > URL: https://issues.apache.org/jira/browse/SPARK-14536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Jeremy Smith >Assignee: Suresh Thalamati > Labels: NullPointerException > Fix For: 2.1.1, 2.2.0 > > > At > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453 > it is assumed that the JDBC driver will definitely return a non-null `Array` > object from the call to `getArray`, and that in the event of a null array it > will return a non-null `Array` object with a null underlying array. But as > you can see here > https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387 > that isn't the case, at least for PostgreSQL. This causes a > `NullPointerException` whenever an array column contains null values. It > seems like the PostgreSQL JDBC driver is probably doing the wrong thing, but > even so there should be a null check in JDBCRDD. I'm happy to submit a PR if > that would be helpful. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
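The null check the reporter proposes is easy to sketch outside Spark: both a null `java.sql.Array` object and a non-null one whose underlying array is null should map to SQL NULL instead of being dereferenced. `nullSafeSqlArray` is a hypothetical helper for illustration, not the actual JDBCRDD code:

```scala
// Hypothetical null-safe handling of ResultSet.getArray's result:
// PostgreSQL may hand back null (or, per the assumption in JDBCRDD, an
// Array whose getArray returns null), so both cases become None
// instead of a NullPointerException.
def nullSafeSqlArray(sqlArray: java.sql.Array): Option[Array[AnyRef]] =
  for {
    a   <- Option(sqlArray)                               // null Array object
    raw <- Option(a.getArray.asInstanceOf[Array[AnyRef]]) // null underlying array
  } yield raw
```

A caller would then translate `None` to a SQL NULL value for the row rather than letting the NPE propagate.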
[jira] [Comment Edited] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945892#comment-15945892 ] Mathieu D edited comment on SPARK-20082 at 3/28/17 8:39 PM: [~yuhaoyan] would you mind having a look at this PR? Right now, I added an initialModel, supported only by the Online optimizer. The implementation is inspired by the KMeans one. The initialModel is used in place of the initial randomized matrix. Regarding the EM optimizer, in the same way, we could use an existing model instead of a randomly weighted graph, by adding new doc vertices and new doc->term edges to the existing graph. But it's unclear to me how the new doc vertices should be weighted when added. Right now, for a new model, doc and term vertices are weighted randomly, with the same total weight on docs and terms. If I add new docs to an existing graph, how should the weights on that side be initialized? was (Author: mathieude): [~yuhaoyan] would you mind having a look at this PR? Right now, I added an initialModel, supported only by the Online optimizer. Regarding the EM optimizer, I could add new doc vertices and new doc->term edges to the existing graph. But it's unclear to me how the new doc vertices should be weighted when added. Right now, for a new model, doc and term vertices are weighted randomly, with the same total weight on docs and terms. If I add new docs to an existing graph, how should the weights on that side be initialized? > Incremental update of LDA model, by adding initialModel as start point > -- > > Key: SPARK-20082 > URL: https://issues.apache.org/jira/browse/SPARK-20082 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.0 >Reporter: Mathieu D > > Some mllib models support an initialModel to start from and update it > incrementally with new data. > From what I understand of OnlineLDAOptimizer, it is possible to incrementally > update an existing model with batches of new documents. > I suggest to add an initialModel as a start point for LDA. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945892#comment-15945892 ] Mathieu D commented on SPARK-20082: --- [~yuhaoyan] would you mind having a look to this PR. Right now, I added an initialModel only for the Online optimizer. Regarding the EM optimizer, I could add new doc vertices and new doc->term edges to the existing graph. But it's unclear for me how the new doc vertices should be weighted when added. Right now for a new model, docs and terms vertices are weighted randomly, with the same total weight on docs and terms. If I add new docs to an existing graph, how to initialize the weights on this side ? > Incremental update of LDA model, by adding initialModel as start point > -- > > Key: SPARK-20082 > URL: https://issues.apache.org/jira/browse/SPARK-20082 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.0 >Reporter: Mathieu D > > Some mllib models support an initialModel to start from and update it > incrementally with new data. > From what I understand of OnlineLDAOptimizer, it is possible to incrementally > update an existing model with batches of new documents. > I suggest to add an initialModel as a start point for LDA. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945892#comment-15945892 ] Mathieu D edited comment on SPARK-20082 at 3/28/17 8:26 PM: [~yuhaoyan] would you mind having a look to this PR ? Right now, I added an initialModel only for the Online optimizer. Regarding the EM optimizer, I could add new doc vertices and new doc->term edges to the existing graph. But it's unclear for me how the new doc vertices should be weighted when added. Right now for a new model, docs and terms vertices are weighted randomly, with the same total weight on docs and terms. If I add new docs to an existing graph, how to initialize the weights on this side ? was (Author: mathieude): [~yuhaoyan] would you mind having a look to this PR. Right now, I added an initialModel only for the Online optimizer. Regarding the EM optimizer, I could add new doc vertices and new doc->term edges to the existing graph. But it's unclear for me how the new doc vertices should be weighted when added. Right now for a new model, docs and terms vertices are weighted randomly, with the same total weight on docs and terms. If I add new docs to an existing graph, how to initialize the weights on this side ? > Incremental update of LDA model, by adding initialModel as start point > -- > > Key: SPARK-20082 > URL: https://issues.apache.org/jira/browse/SPARK-20082 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.0 >Reporter: Mathieu D > > Some mllib models support an initialModel to start from and update it > incrementally with new data. > From what I understand of OnlineLDAOptimizer, it is possible to incrementally > update an existing model with batches of new documents. > I suggest to add an initialModel as a start point for LDA. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945892#comment-15945892 ] Mathieu D edited comment on SPARK-20082 at 3/28/17 8:27 PM: [~yuhaoyan] would you mind having a look to this PR ? Right now, I added an initialModel, suported only by the Online optimizer. Regarding the EM optimizer, I could add new doc vertices and new doc->term edges to the existing graph. But it's unclear for me how the new doc vertices should be weighted when added. Right now for a new model, docs and terms vertices are weighted randomly, with the same total weight on docs and terms. If I add new docs to an existing graph, how to initialize the weights on this side ? was (Author: mathieude): [~yuhaoyan] would you mind having a look to this PR ? Right now, I added an initialModel only for the Online optimizer. Regarding the EM optimizer, I could add new doc vertices and new doc->term edges to the existing graph. But it's unclear for me how the new doc vertices should be weighted when added. Right now for a new model, docs and terms vertices are weighted randomly, with the same total weight on docs and terms. If I add new docs to an existing graph, how to initialize the weights on this side ? > Incremental update of LDA model, by adding initialModel as start point > -- > > Key: SPARK-20082 > URL: https://issues.apache.org/jira/browse/SPARK-20082 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.0 >Reporter: Mathieu D > > Some mllib models support an initialModel to start from and update it > incrementally with new data. > From what I understand of OnlineLDAOptimizer, it is possible to incrementally > update an existing model with batches of new documents. > I suggest to add an initialModel as a start point for LDA. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20082: Assignee: (was: Apache Spark) > Incremental update of LDA model, by adding initialModel as start point > -- > > Key: SPARK-20082 > URL: https://issues.apache.org/jira/browse/SPARK-20082 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.0 >Reporter: Mathieu D > > Some mllib models support an initialModel to start from and update it > incrementally with new data. > From what I understand of OnlineLDAOptimizer, it is possible to incrementally > update an existing model with batches of new documents. > I suggest to add an initialModel as a start point for LDA. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20082: Assignee: Apache Spark > Incremental update of LDA model, by adding initialModel as start point > -- > > Key: SPARK-20082 > URL: https://issues.apache.org/jira/browse/SPARK-20082 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.0 >Reporter: Mathieu D >Assignee: Apache Spark > > Some mllib models support an initialModel to start from and update it > incrementally with new data. > From what I understand of OnlineLDAOptimizer, it is possible to incrementally > update an existing model with batches of new documents. > I suggest to add an initialModel as a start point for LDA. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point
[ https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945866#comment-15945866 ] Apache Spark commented on SPARK-20082: -- User 'mdespriee' has created a pull request for this issue: https://github.com/apache/spark/pull/17461 > Incremental update of LDA model, by adding initialModel as start point > -- > > Key: SPARK-20082 > URL: https://issues.apache.org/jira/browse/SPARK-20082 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.0 >Reporter: Mathieu D > > Some mllib models support an initialModel to start from and update it > incrementally with new data. > From what I understand of OnlineLDAOptimizer, it is possible to incrementally > update an existing model with batches of new documents. > I suggest to add an initialModel as a start point for LDA. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16929) Speculation-related synchronization bottleneck in checkSpeculatableTasks
[ https://issues.apache.org/jira/browse/SPARK-16929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-16929: - Issue Type: Improvement (was: Bug) > Speculation-related synchronization bottleneck in checkSpeculatableTasks > > > Key: SPARK-16929 > URL: https://issues.apache.org/jira/browse/SPARK-16929 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Nicholas Brown >Assignee: jin xing > Fix For: 2.2.0 > > > Our cluster has been running slowly since I got speculation working. I looked > into it and noticed that stderr said some tasks were taking almost an hour to > run, even though the application logs on the nodes showed those tasks only took > a minute or so. Digging into the thread dump for the master node, I noticed a > number of threads were blocked, apparently by the speculation thread. > At line 476 of TaskSchedulerImpl it grabs a lock on the TaskScheduler while > it looks through the tasks to see what needs to be rerun. Unfortunately that > code loops through each of the tasks, so when you have even just a couple > hundred thousand tasks to run, that can be prohibitively slow inside a > synchronized block. Once I disabled speculation, the job went back to having > acceptable performance. > There are no comments around that lock indicating why it was added, and the > git history has had a couple of refactorings, so it's hard to find where it was > added. I'm tempted to believe it is the result of someone assuming that an > extra synchronized block never hurt anyone (in reality I've seen probably as > many bugs caused by over-synchronization as by too little), as it looks too > broad to actually be guarding any concurrency issue. But, since concurrency > issues can be tricky to reproduce (and yes, I understand that's an extreme > understatement), I'm not sure blindly removing it without being familiar with > the history is necessarily safe. > Can someone look into this? 
Or at least make a note in the documentation > that speculation should not be used with large clusters? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
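A common remedy for this kind of bottleneck — not necessarily the change Spark eventually merged — is to shrink the critical section: hold the lock only long enough to snapshot the shared task-set references, then run the expensive per-task scan outside it. A hedged sketch with illustrative names:

```scala
// Illustrative only: reduce lock scope by snapshotting under the lock
// and scanning outside it.  `SpeculationChecker` is not Spark's class.
final class SpeculationChecker[T](lock: AnyRef, taskSets: Seq[T]) {
  def checkSpeculatableTasks(isSpeculatable: T => Boolean): Seq[T] = {
    // Short critical section: just copy the references.
    val snapshot = lock.synchronized(taskSets.toVector)
    // Long O(n) scan runs lock-free, so other scheduler calls aren't blocked.
    snapshot.filter(isSpeculatable)
  }
}
```

The trade-off is that the scan may observe a slightly stale view of the task sets, which is usually acceptable for speculation heuristics.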
[jira] [Assigned] (SPARK-19868) conflict TasksetManager lead to spark stopped
[ https://issues.apache.org/jira/browse/SPARK-19868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout reassigned SPARK-19868: -- Assignee: liujianhui > conflict TasksetManager lead to spark stopped > - > > Key: SPARK-19868 > URL: https://issues.apache.org/jira/browse/SPARK-19868 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: liujianhui >Assignee: liujianhui > Fix For: 2.2.0 > > > ##scenario > A conflicting TaskSetManager throws an exception, which leads to the > SparkContext being stopped. Log: > {code} > java.lang.IllegalStateException: more than one active taskSet for stage > 4571114: 4571114.2,4571114.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {code} > The reason is that the resubmitted stage conflicts with the still-running > stage: the missing tasks of the stage were resubmitted before the zombie flag > of the existing TaskSetManager had been set to true > {code} > [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Resubmitting > ShuffleMapStage 4571114 (map at MainApp.scala:73) because some of its tasks > had failed: 0 > [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Submitting > ShuffleMapStage 4571114 (MapPartitionsRDD[3719544] at map at > MainApp.scala:73), which has no missing parents > {code} > The executor which the shuffle task ran on was lost > {code} > [INFO][dag-scheduler-event-loop][2017-03-03+22:16:27.427][org.apache.spark.scheduler.DAGScheduler]Ignoring > possibly bogus ShuffleMapTask(4571114, 0) completion from executor 4 > {code} > The timing of the task set finishing versus the stage resubmit: > {code} > handleSuccessfulTask > [INFO][task-result-getter-2][2017-03-03+22:16:29.999][org.apache.spark.scheduler.TaskSchedulerImpl]Removed > TaskSet 4571114.1, whose tasks have all completed, from pool > resubmit stage > [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.549][org.apache.spark.scheduler.TaskSchedulerImpl]Adding > task set 4571114.2 with 1 tasks > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
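The quoted IllegalStateException comes from a sanity check in TaskSchedulerImpl.submitTasks: a stage may accumulate several TaskSetManager attempts, but at most one may be active (non-zombie) at a time. A hedged, standalone model of that invariant — `StageAttempts`, `submit`, and `markZombie` are illustrative names, not Spark's API:

```scala
import scala.collection.mutable

// Illustrative model of the invariant behind the quoted error: submitting
// a new attempt while an earlier attempt is still non-zombie must fail,
// which is exactly the race the reporter hit during stage resubmission.
final class StageAttempts {
  private val zombie = mutable.Map.empty[String, Boolean] // attemptId -> isZombie

  def submit(attemptId: String): Unit = {
    val active = zombie.collect { case (id, false) => id }.toSeq
    if (active.nonEmpty)
      throw new IllegalStateException(
        s"more than one active taskSet for stage: ${(attemptId +: active).mkString(",")}")
    zombie(attemptId) = false
  }

  def markZombie(attemptId: String): Unit = zombie(attemptId) = true
}
```

In the reported scenario, attempt 4571114.2 was submitted before 4571114.1 had been marked zombie, tripping this check and stopping the SparkContext.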
[jira] [Resolved] (SPARK-19868) conflict TasksetManager lead to spark stopped
[ https://issues.apache.org/jira/browse/SPARK-19868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19868. Resolution: Fixed Fix Version/s: 2.2.0 > conflict TasksetManager lead to spark stopped > - > > Key: SPARK-19868 > URL: https://issues.apache.org/jira/browse/SPARK-19868 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: liujianhui > Fix For: 2.2.0 > > > ##scenario > conflict taskSetManager throw an exception which lead to sparkcontext > stopped. log as > {code} > java.lang.IllegalStateException: more than one active taskSet for stage > 4571114: 4571114.2,4571114.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {code} > the reason for that is the resubmitting of stage conflict with the running > stage,the missing task of stage should be resubmit since the zoombie of the > tasksetManager assigned by true > {code} > [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Resubmitting > ShuffleMapStage 4571114 (map at MainApp.scala:73) because some of its tasks > had failed: 0 > [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Submitting > ShuffleMapStage 4571114 
(MapPartitionsRDD[3719544] at map at > MainApp.scala:73), which has no missing parents > {code} > The executor that the shuffle task ran on was lost: > {code} > [INFO][dag-scheduler-event-loop][2017-03-03+22:16:27.427][org.apache.spark.scheduler.DAGScheduler]Ignoring > possibly bogus ShuffleMapTask(4571114, 0) completion from executor 4 > {code} > Timing of the task-set finish versus the stage resubmission: > {code} > handleSuccessfulTask > [INFO][task-result-getter-2][2017-03-03+22:16:29.999][org.apache.spark.scheduler.TaskSchedulerImpl]Removed > TaskSet 4571114.1, whose tasks have all completed, from pool > resubmit stage > [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.549][org.apache.spark.scheduler.TaskSchedulerImpl]Adding > task set 4571114.2 with 1 tasks > {code}
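The IllegalStateException quoted in this report comes from an invariant in TaskSchedulerImpl.submitTasks: a stage may have at most one active (non-zombie) TaskSetManager at a time. A minimal toy model of that check, with hypothetical simplified names (not Spark's real classes):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: models the single-active-TaskSet invariant.
class TaskSetRegistry {
    static final class TaskSet {
        final int stageId;
        final int attempt;
        boolean isZombie = false; // set when this attempt is superseded
        TaskSet(int stageId, int attempt) { this.stageId = stageId; this.attempt = attempt; }
    }

    private final Map<Integer, List<TaskSet>> byStage = new HashMap<>();

    void submitTasks(TaskSet ts) {
        List<TaskSet> attempts = byStage.computeIfAbsent(ts.stageId, k -> new ArrayList<>());
        boolean conflict = attempts.stream().anyMatch(t -> !t.isZombie);
        if (conflict) {
            // The failure mode in this report: a resubmitted stage raced
            // with a still-active attempt of the same stage.
            throw new IllegalStateException("more than one active taskSet for stage " + ts.stageId);
        }
        attempts.add(ts);
    }
}
```

In this model, had the first attempt been marked as a zombie before the resubmission, the second submitTasks call would have been accepted; the logs above suggest that ordering was violated.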
[jira] [Created] (SPARK-20131) Flaky Test: org.apache.spark.streaming.StreamingContextSuite
Takuya Ueshin created SPARK-20131: - Summary: Flaky Test: org.apache.spark.streaming.StreamingContextSuite Key: SPARK-20131 URL: https://issues.apache.org/jira/browse/SPARK-20131 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.2.0 Reporter: Takuya Ueshin Priority: Minor This test failed recently here: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/2861/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/SPARK_18560_Receiver_data_should_be_deserialized_properly_/ Dashboard https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.StreamingContextSuite_name=SPARK-18560+Receiver+data+should+be+deserialized+properly. Error Message {code} latch.await(60L, SECONDS) was false {code} {code} org.scalatest.exceptions.TestFailedException: latch.await(60L, SECONDS) was false at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply$mcV$sp(StreamingContextSuite.scala:837) at org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810) at org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:44) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:44) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:381) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) at 
org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$run(StreamingContextSuite.scala:44) at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) at org.apache.spark.streaming.StreamingContextSuite.run(StreamingContextSuite.scala:44) at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492) at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528) at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1526) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at
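The failing assertion `latch.await(60L, SECONDS) was false` means a CountDownLatch was never counted down within the 60-second budget, i.e. the receiver data was not processed in time. A minimal JDK-only illustration of the pattern the test relies on (unrelated to Spark's internals):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class LatchDemo {
    // Returns true only if the worker signals completion before the timeout,
    // which is exactly what the flaky test asserts with its 60s budget.
    static boolean runWithin(long timeout, TimeUnit unit) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        Thread worker = new Thread(latch::countDown); // stand-in for the receiver callback
        worker.start();
        return latch.await(timeout, unit);
    }
}
```

If the worker never calls countDown (or is too slow), await returns false and the test fails, which is consistent with this being a timing-sensitive flaky test rather than an outright crash.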
[jira] [Commented] (SPARK-19551) Theme for PySpark documentation could do with improving
[ https://issues.apache.org/jira/browse/SPARK-19551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945685#comment-15945685 ] Arthur Tacca commented on SPARK-19551: -- Thanks, I needed the reminder! In fact the person who generated their own build of the docs got back to me; I hope they don't mind me pasting what they said here: I've compiled the documentation using Sphinx (ver. 1.3.5). I have a foggy memory on this as it's been a while, but I recall I had to roll back to an older version of Sphinx to get copybutton.js to work properly - this is what allows us to toggle the ``>>>`` mark in the python code - (example) http://takwatanabe.me/pyspark/generated/generated/mllib.classification.DenseVector.html#mllib.classification.DenseVector - (js file) https://github.com/scipy/scipy-sphinx-theme/blob/master/_theme/scipy/static/js/copybutton.js Otherwise, I simply used the ``autosummary`` directive offered by Sphinx (http://www.sphinx-doc.org/en/stable/ext/autosummary.html). You can see how I used these in the *.rst files in https://github.com/wtak23/pyspark/tree/master/source. For instance, in order to create the **entire** html subtrees from the link http://takwatanabe.me/pyspark/pyspark.ml.html , all I had to have was this rst file: https://raw.githubusercontent.com/wtak23/pyspark/master/source/pyspark.ml.rst Once you have the PySpark directory included in your $PYTHONPATH envvar, you should be able to simply run ``make html`` using the Makefile from the GitHub branch below. 
https://github.com/wtak23/pyspark/tree/master > Theme for PySpark documentation could do with improving > -- > > Key: SPARK-19551 > URL: https://issues.apache.org/jira/browse/SPARK-19551 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 2.1.0 >Reporter: Arthur Tacca >Priority: Minor > > I have found the Python Spark documentation hard to navigate for two reasons: > * Each page in the documentation is huge, because the whole of the > documentation is split up into only a few chunks. > * The methods for each class are not listed in a short form, so the only way > to look through them is to browse past the full documentation for all methods > (including parameter lists, examples, etc.). > This has irritated someone enough that they have done [their own build of the > pyspark documentation|http://takwatanabe.me/pyspark/index.html]. In > comparison to the official docs they are a delight to use. But of course it > is not clear whether they'll be kept up to date, which is why I'm asking here > that the official docs are improved. Perhaps that site could be used as > inspiration? I don't know much about these things, but it appears that the > main change they have made is to switch to the "read the docs" theme.
[jira] [Commented] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)
[ https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945673#comment-15945673 ] Apache Spark commented on SPARK-14536: -- User 'sureshthalamati' has created a pull request for this issue: https://github.com/apache/spark/pull/17460 > NPE in JDBCRDD when array column contains nulls (postgresql) > > > Key: SPARK-14536 > URL: https://issues.apache.org/jira/browse/SPARK-14536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Jeremy Smith >Assignee: Suresh Thalamati > Labels: NullPointerException > Fix For: 2.2.0 > > > At > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453 > it is assumed that the JDBC driver will definitely return a non-null `Array` > object from the call to `getArray`, and that in the event of a null array it > will return a non-null `Array` object with a null underlying array. But as > you can see here > https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387 > that isn't the case, at least for PostgreSQL. This causes a > `NullPointerException` whenever an array column contains null values. It > seems like the PostgreSQL JDBC driver is probably doing the wrong thing, but > even so there should be a null check in JDBCRDD. I'm happy to submit a PR if > that would be helpful.
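The null check the report asks for is straightforward: guard against getArray returning null, and against a non-null java.sql.Array whose backing array is null, before dereferencing. A hedged sketch of that handling; `toCatalystArray` is a hypothetical stand-in for Spark's real conversion code, not its actual API:

```java
import java.sql.Array;
import java.sql.SQLException;

// Sketch of the null handling SPARK-14536 requests. JDBCRDD assumed
// resultSet.getArray(i) always returns a non-null java.sql.Array; the
// PostgreSQL driver returns null for a NULL array column, so both levels
// must be checked before use.
class ArrayColumnReader {
    static Object[] toCatalystArray(Array sqlArray) throws SQLException {
        if (sqlArray == null) {
            return null; // NULL array column: propagate null instead of NPE
        }
        Object underlying = sqlArray.getArray();
        return underlying == null ? null : (Object[]) underlying;
    }
}
```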
[jira] [Created] (SPARK-20130) Flaky test: BlockManagerProactiveReplicationSuite
Marcelo Vanzin created SPARK-20130: -- Summary: Flaky test: BlockManagerProactiveReplicationSuite Key: SPARK-20130 URL: https://issues.apache.org/jira/browse/SPARK-20130 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.2.0 Reporter: Marcelo Vanzin See the following page: https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.storage.BlockManagerProactiveReplicationSuite_name=proactive+block+replication+-+5+replicas+-+4+block+manager+deletions I have also seen it fail intermittently during local unit test runs.
[jira] [Resolved] (SPARK-19995) Using real user to connect HiveMetastore in HiveClientImpl
[ https://issues.apache.org/jira/browse/SPARK-19995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-19995. Resolution: Fixed Assignee: Saisai Shao Fix Version/s: 2.2.0 2.1.1 > Using real user to connect HiveMetastore in HiveClientImpl > -- > > Key: SPARK-19995 > URL: https://issues.apache.org/jira/browse/SPARK-19995 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Assignee: Saisai Shao > Fix For: 2.1.1, 2.2.0 > > > If a user specifies "--proxy-user" in a kerberized environment with the Hive catalog > implementation, HiveClientImpl will try to connect to the Hive metastore as the > current user. Since we use the real user to do kinit, this makes the connection > fail. We should change this, as we did before in the YARN code, to use the real > user. > {noformat} > ERROR TSaslTransport: SASL negotiation failure > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94) > at > org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271) > at > org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37) > at > org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52) > at > org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866) > at > org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49) > at > 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:236) > at > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104) > at > org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005) > at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024) > at > org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234) > at > org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174) > at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166) > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) > at > org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:188) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:366) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:270) > at > org.apache.spark.sql.hive.HiveExternalCatalog.(HiveExternalCatalog.scala:65) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at >
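The fix described in SPARK-19995 (connect as the real, kinit'ed user, as Spark's YARN code already does) follows Hadoop's `UserGroupInformation.doAs` pattern. Hadoop is not available here, so the following is a self-contained toy model of that pattern only; `UgiModel`, the user names, and `openMetastoreConnection` are all illustrative inventions, not Spark or Hadoop APIs:

```java
import java.util.concurrent.Callable;

// Toy model of the doAs pattern: run an action with a given identity
// installed, because only the real (kinit'ed) user holds a Kerberos TGT.
class UgiModel {
    static final ThreadLocal<String> CURRENT_USER = ThreadLocal.withInitial(() -> "proxy-user");

    // Analogue of UserGroupInformation.doAs: install the identity for the
    // duration of the action, restoring the previous one afterwards.
    static <T> T doAs(String user, Callable<T> action) throws Exception {
        String previous = CURRENT_USER.get();
        CURRENT_USER.set(user);
        try {
            return action.call();
        } finally {
            CURRENT_USER.set(previous);
        }
    }

    // Stand-in for opening the Thrift transport: without the real user's
    // credentials the SASL handshake fails, mirroring the log above.
    static String openMetastoreConnection() {
        if (!CURRENT_USER.get().equals("real-user")) {
            throw new IllegalStateException("GSS initiate failed: no valid Kerberos tgt");
        }
        return "connected";
    }
}
```

The bug is then the toy's first branch: connecting as the current (proxy) user fails the handshake, while wrapping the same call in doAs of the real user succeeds.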
[jira] [Commented] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter
[ https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1594#comment-1594 ] Kazuaki Ishizaki commented on SPARK-20112: -- [~MasterDDT] Thank you for preparing additional information. The size of the hashed relation does not seem to be very large. In these two cases, I cannot correlate the load instructions that caused the SIGSEGV to Java statements. > SIGSEGV in GeneratedIterator.sort_addToSorter > - > > Key: SPARK-20112 > URL: https://issues.apache.org/jira/browse/SPARK-20112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops) >Reporter: Mitesh > Attachments: codegen_sorter_crash.log, hs_err_pid19271.log, > hs_err_pid22870.log > > > I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The > hs_err_pid and codegen file are attached (with query plans). It's not a > deterministic repro, but running a big query load, I eventually see it come > up within a few minutes. > Here is some interesting repro information: > - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't > repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I > think that means it's not an issue with the code-gen, but I can't figure out > what the difference in behavior is. > - The broadcast joins in the plan are all small tables. I have > autoJoinBroadcast=-1 because I always hint which tables should be broadcast. > - As you can see from the plan, all the sources are cached memory tables. And > we partition/sort them all beforehand so it's always sort-merge-joins or > broadcast joins (with small tables). 
> {noformat} > # A fatal error has been detected by the Java Runtime Environment: > # > # [thread 139872345896704 also had an error] > SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688 > # > # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build > 1.8.0_60-b27) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode > linux-amd64 compressed oops) > [thread 139872348002048 also had an error]# Problematic frame: > # > J 28454 C1 > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V > (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3] > {noformat} > This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, > but that is marked as fixed in 2.0.0
[jira] [Created] (SPARK-20129) JavaSparkContext should use SparkContext.getOrCreate
Xiangrui Meng created SPARK-20129: - Summary: JavaSparkContext should use SparkContext.getOrCreate Key: SPARK-20129 URL: https://issues.apache.org/jira/browse/SPARK-20129 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 2.1.0 Reporter: Xiangrui Meng It should re-use an existing SparkContext if there is a live one.
[jira] [Assigned] (SPARK-20129) JavaSparkContext should use SparkContext.getOrCreate
[ https://issues.apache.org/jira/browse/SPARK-20129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-20129: - Assignee: Xiangrui Meng > JavaSparkContext should use SparkContext.getOrCreate > > > Key: SPARK-20129 > URL: https://issues.apache.org/jira/browse/SPARK-20129 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 2.1.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > It should re-use an existing SparkContext if there is a live one.
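The semantics SPARK-20129 asks JavaSparkContext to adopt are the usual getOrCreate singleton pattern: reuse the live context if one exists, otherwise create and register a new one. A minimal sketch of that pattern; `DemoContext` is an illustrative stand-in, not Spark's SparkContext:

```java
// Sketch of getOrCreate semantics: at most one live context at a time.
class DemoContext {
    private static volatile DemoContext activeContext;

    private DemoContext() { }

    // Analogue of SparkContext.getOrCreate(): return the live context if
    // there is one, otherwise construct and register a fresh instance.
    static synchronized DemoContext getOrCreate() {
        if (activeContext == null) {
            activeContext = new DemoContext();
        }
        return activeContext;
    }

    // Analogue of stop(): deregister so a later getOrCreate starts fresh.
    void stop() {
        synchronized (DemoContext.class) {
            if (activeContext == this) {
                activeContext = null;
            }
        }
    }
}
```

With this shape, constructing a JavaSparkContext while a context is already live would hand back the existing one instead of failing or double-allocating.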
[jira] [Comment Edited] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter
[ https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945399#comment-15945399 ] Mitesh edited comment on SPARK-20112 at 3/28/17 3:46 PM: - [~kiszk] I can try out spark 2.0.3+ or 2.1. Actually I disabled wholestage codegen and I do see a failure still on 2.0.2, but in a different place now in {{HashJoin.advanceNext}}. Also uploaded the new hs_err_pid22870. The hashed relations are around 1-10MB, but a few are 200MB. The non-hashed cached tables are 1-100G {noformat} 17/03/27 22:15:59 DEBUG [Executor task launch worker-17] TaskMemoryManager: Task 152119 acquired 64.0 KB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395 SIGSEGV17/03/27 22:15:59 DEBUG [Executor task launch worker-17] TaskMemoryManager: Task 152119 acquired 64.0 MB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395 [thread 140369911781120 also had an error] (0xb) at pc=0x7fad1f7afc11, pid=22870, tid=140369909675776 # # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27) # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops) # Problematic frame: # J 25558 C2 org.apache.spark.sql.execution.joins.HashJoin$$anonfun$outerJoin$1$$anon$1.advanceNext()Z (110 bytes) @ 0x7fad1f7afc11 [0x7fad1f7afb20+0xf1] # # Failed to write core dump. Core dumps have been disabled. 
To enable core dumping, try "ulimit -c unlimited" before starting Java again # # An error report file with more information is saved as: # /mnt/xvdb/spark/worker_dir/app-20170327213416-0005/14/hs_err_pid22870.log 17/03/27 22:15:59 DEBUG [Executor task launch worker-19] TaskMemoryManager: Task 152090 acquired 64.0 MB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@51de5289 2502.591: [G1Ergonomics (Concurrent Cycles) request concurrent cycle initiation, reason: occupancy higher than threshold, occupancy: 7667187712 bytes, allocation request: 160640 bytes, threshold: 8214124950 bytes (45.00 %), source: concurrent humongous allocation] [thread 140376087648000 also had an error] [thread 140369903376128 also had an error] # {noformat} was (Author: masterddt): [~kiszk] I can try out spark 2.0.3+ or 2.1. Actually I disabled wholestage codegen and I do see a failure still on 2.0.2, but in a different place now in {{HashJoin.advanceNext}}. Also uploaded the new hs_err_pid22870. The hashed relations are around 1-10MB, but a few are 200MB. {noformat} 17/03/27 22:15:59 DEBUG [Executor task launch worker-17] TaskMemoryManager: Task 152119 acquired 64.0 KB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395 SIGSEGV17/03/27 22:15:59 DEBUG [Executor task launch worker-17] TaskMemoryManager: Task 152119 acquired 64.0 MB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395 [thread 140369911781120 also had an error] (0xb) at pc=0x7fad1f7afc11, pid=22870, tid=140369909675776 # # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27) # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops) # Problematic frame: # J 25558 C2 org.apache.spark.sql.execution.joins.HashJoin$$anonfun$outerJoin$1$$anon$1.advanceNext()Z (110 bytes) @ 0x7fad1f7afc11 [0x7fad1f7afb20+0xf1] # # Failed to write core dump. Core dumps have been disabled. 
To enable core dumping, try "ulimit -c unlimited" before starting Java again # # An error report file with more information is saved as: # /mnt/xvdb/spark/worker_dir/app-20170327213416-0005/14/hs_err_pid22870.log 17/03/27 22:15:59 DEBUG [Executor task launch worker-19] TaskMemoryManager: Task 152090 acquired 64.0 MB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@51de5289 2502.591: [G1Ergonomics (Concurrent Cycles) request concurrent cycle initiation, reason: occupancy higher than threshold, occupancy: 7667187712 bytes, allocation request: 160640 bytes, threshold: 8214124950 bytes (45.00 %), source: concurrent humongous allocation] [thread 140376087648000 also had an error] [thread 140369903376128 also had an error] # {noformat} > SIGSEGV in GeneratedIterator.sort_addToSorter > - > > Key: SPARK-20112 > URL: https://issues.apache.org/jira/browse/SPARK-20112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops) >Reporter: Mitesh > Attachments: codegen_sorter_crash.log, hs_err_pid19271.log, > hs_err_pid22870.log > > > I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The > hs_err_pid and codegen file are attached (with query plans). Its not a >
[jira] [Commented] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter
[ https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945399#comment-15945399 ] Mitesh commented on SPARK-20112: [~kiszk] I can try out spark 2.0.3+ or 2.1. Actually I disabled wholestage codegen and I do see a failure still on 2.0.2, but in a different place now in {{HashJoin.advanceNext}}. Also uploaded the new hs_err_pid22870. The hashed relations are around 1-10M, but a few are 200M. {noformat} 17/03/27 22:15:59 DEBUG [Executor task launch worker-17] TaskMemoryManager: Task 152119 acquired 64.0 KB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395 SIGSEGV17/03/27 22:15:59 DEBUG [Executor task launch worker-17] TaskMemoryManager: Task 152119 acquired 64.0 MB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395 [thread 140369911781120 also had an error] (0xb) at pc=0x7fad1f7afc11, pid=22870, tid=140369909675776 # # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27) # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops) # Problematic frame: # J 25558 C2 org.apache.spark.sql.execution.joins.HashJoin$$anonfun$outerJoin$1$$anon$1.advanceNext()Z (110 bytes) @ 0x7fad1f7afc11 [0x7fad1f7afb20+0xf1] # # Failed to write core dump. Core dumps have been disabled. 
To enable core dumping, try "ulimit -c unlimited" before starting Java again # # An error report file with more information is saved as: # /mnt/xvdb/spark/worker_dir/app-20170327213416-0005/14/hs_err_pid22870.log 17/03/27 22:15:59 DEBUG [Executor task launch worker-19] TaskMemoryManager: Task 152090 acquired 64.0 MB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@51de5289 2502.591: [G1Ergonomics (Concurrent Cycles) request concurrent cycle initiation, reason: occupancy higher than threshold, occupancy: 7667187712 bytes, allocation request: 160640 bytes, threshold: 8214124950 bytes (45.00 %), source: concurrent humongous allocation] [thread 140376087648000 also had an error] [thread 140369903376128 also had an error] # {noformat} > SIGSEGV in GeneratedIterator.sort_addToSorter > - > > Key: SPARK-20112 > URL: https://issues.apache.org/jira/browse/SPARK-20112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops) >Reporter: Mitesh > Attachments: codegen_sorter_crash.log, hs_err_pid19271.log, > hs_err_pid22870.log > > > I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The > hs_err_pid and codegen file are attached (with query plans). It's not a > deterministic repro, but running a big query load, I eventually see it come > up within a few minutes. > Here is some interesting repro information: > - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't > repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I > think that means it's not an issue with the code-gen, but I can't figure out > what the difference in behavior is. > - The broadcast joins in the plan are all small tables. I have > autoJoinBroadcast=-1 because I always hint which tables should be broadcast. > - As you can see from the plan, all the sources are cached memory tables. 
And > we partition/sort them all beforehand so it's always sort-merge-joins or > broadcast joins (with small tables). > {noformat} > # A fatal error has been detected by the Java Runtime Environment: > # > # [thread 139872345896704 also had an error] > SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688 > # > # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build > 1.8.0_60-b27) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode > linux-amd64 compressed oops) > [thread 139872348002048 also had an error]# Problematic frame: > # > J 28454 C1 > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V > (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3] > {noformat} > This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, > but that is marked fixed in 2.0.0 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
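The hs_err report above notes that core dumps were disabled, and the comment describes turning off whole-stage codegen to narrow down where the SIGSEGV comes from. For anyone trying to reproduce this, the two knobs look roughly like the following sketch (the exact launch command is an assumption; `spark.sql.codegen.wholeStage` is the standard Spark 2.x flag for whole-stage codegen):

```shell
# Allow the JVM to write a core dump on SIGSEGV; run this in the shell
# that launches the Spark worker, before starting it.
ulimit -c unlimited

# Re-run the query load with whole-stage codegen disabled, to check
# whether the crash moves out of the generated sort_addToSorter code.
spark-shell --conf spark.sql.codegen.wholeStage=false
```

With codegen disabled, a crash that persists (as reported above, now in {{HashJoin.advanceNext}}) suggests the problem is not in the generated code itself.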
[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter
[ https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mitesh updated SPARK-20112: --- Attachment: hs_err_pid22870.log > SIGSEGV in GeneratedIterator.sort_addToSorter > - > > Key: SPARK-20112 > URL: https://issues.apache.org/jira/browse/SPARK-20112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops) >Reporter: Mitesh > Attachments: codegen_sorter_crash.log, hs_err_pid19271.log, > hs_err_pid22870.log > > > I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The > hs_err_pid and codegen file are attached (with query plans). It's not a > deterministic repro, but running a big query load, I eventually see it come > up within a few minutes. > Here is some interesting repro information: > - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't > repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I > think that means it's not an issue with the code-gen, but I can't figure out > what the difference in behavior is. > - The broadcast joins in the plan are all small tables. I have > autoJoinBroadcast=-1 because I always hint which tables should be broadcast. > - As you can see from the plan, all the sources are cached memory tables. And > we partition/sort them all beforehand so it's always sort-merge-joins or > broadcast joins (with small tables). 
> {noformat} > # A fatal error has been detected by the Java Runtime Environment: > # > # [thread 139872345896704 also had an error] > SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688 > # > # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build > 1.8.0_60-b27) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode > linux-amd64 compressed oops) > [thread 139872348002048 also had an error]# Problematic frame: > # > J 28454 C1 > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V > (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3] > {noformat} > This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, > but that is marked fixed in 2.0.0
[jira] [Comment Edited] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
[ https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945379#comment-15945379 ] Yuming Wang edited comment on SPARK-20107 at 3/28/17 3:37 PM: -- OK, I will add {{spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version}} to the documentation. was (Author: q79969786): OK, I will add {{spark.mapreduce.fileoutputcommitter.algorithm.version}} to the documentation. > Speed up HadoopMapReduceCommitProtocol#commitJob for many output files > -- > > Key: SPARK-20107 > URL: https://issues.apache.org/jira/browse/SPARK-20107 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yuming Wang > > Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up > [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121] > for many output files. > It can save {{11 minutes}} for 216869 output files: > {code:sql} > CREATE TABLE tmp.spark_20107 AS SELECT > category_id, > product_id, > track_id, > concat( > substr(ds, 3, 2), > substr(ds, 6, 2), > substr(ds, 9, 2) > ) shortDate, > CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' > WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE > 'invalid actio' END AS type > FROM > tmp.user_action > WHERE > ds > date_sub('2017-01-23', 730) > AND actiontype IN ('0','1','2','3'); > {code} > {code} > $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l > 216870 > {code} > This improvement can affect all of Cloudera's Hadoop cdh5-2.6.0_5.4.0 and higher > versions (see: > [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433] > and > [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0]) > and Apache Hadoop 2.7.0 and higher versions. 
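For reference, the per-job way to opt into the v2 commit algorithm (rather than changing any default, per the discussion on the PR) is Spark's {{spark.hadoop.}} passthrough, matching the key Yuming mentions above. A sketch; `your-app.jar` is a placeholder:

```shell
# Pass the Hadoop committer setting through Spark's conf. Trade-off noted
# in the PR discussion: v2 commits output faster with many files, but has
# weaker correctness guarantees on failure than v1.
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  your-app.jar
```

The same key can also be set cluster-wide in spark-defaults.conf, with the same trade-off.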
[jira] [Assigned] (SPARK-20109) Need a way to convert from IndexedRowMatrix to Dense Block Matrices
[ https://issues.apache.org/jira/browse/SPARK-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20109: Assignee: (was: Apache Spark) > Need a way to convert from IndexedRowMatrix to Dense Block Matrices > --- > > Key: SPARK-20109 > URL: https://issues.apache.org/jira/browse/SPARK-20109 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.1.0 >Reporter: John Compitello > > The current implementation of toBlockMatrix on IndexedRowMatrix is > insufficient. It is implemented by first converting the IndexedRowMatrix to a > CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not > only is this slower than it needs to be, it also means that the created > BlockMatrix ends up being backed by instances of SparseMatrix, which a user > may not want.
[jira] [Updated] (SPARK-20109) Need a way to convert from IndexedRowMatrix to Dense Block Matrices
[ https://issues.apache.org/jira/browse/SPARK-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Compitello updated SPARK-20109: Description: The current implementation of toBlockMatrix on IndexedRowMatrix is insufficient. It is implemented by first converting the IndexedRowMatrix to a CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not only is this slower than it needs to be, it also means that the created BlockMatrix ends up being backed by instances of SparseMatrix, which a user may not want. Users need an option to convert from IndexedRowMatrix to BlockMatrix that backs the BlockMatrix with local instances of DenseMatrix. (was: The current implementation of toBlockMatrix on IndexedRowMatrix is insufficient. It is implemented by first converting the IndexedRowMatrix to a CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not only is this slower than it needs to be, it also means that the created BlockMatrix ends up being backed by instances of SparseMatrix, which a user may not want. ) > Need a way to convert from IndexedRowMatrix to Dense Block Matrices > --- > > Key: SPARK-20109 > URL: https://issues.apache.org/jira/browse/SPARK-20109 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.1.0 >Reporter: John Compitello > > The current implementation of toBlockMatrix on IndexedRowMatrix is > insufficient. It is implemented by first converting the IndexedRowMatrix to a > CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not > only is this slower than it needs to be, it also means that the created > BlockMatrix ends up being backed by instances of SparseMatrix, which a user > may not want. Users need an option to convert from IndexedRowMatrix to > BlockMatrix that backs the BlockMatrix with local instances of DenseMatrix. 
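One possible shape for the requested conversion, sketched here purely as an illustration (this is not the patch in the linked PR; `toDenseBlockMatrix` is a hypothetical helper name): slice each indexed row into per-block segments and assemble column-major {{DenseMatrix}} blocks directly, skipping the CoordinateMatrix hop. Requires a running SparkContext.

```scala
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, IndexedRowMatrix}

// Hypothetical helper: IndexedRowMatrix -> BlockMatrix backed by dense blocks.
def toDenseBlockMatrix(m: IndexedRowMatrix,
                       rowsPerBlock: Int = 1024,
                       colsPerBlock: Int = 1024): BlockMatrix = {
  val nRows = m.numRows()        // total row count (Long)
  val nCols = m.numCols().toInt
  val blocks = m.rows.flatMap { row =>
    val rb = (row.index / rowsPerBlock).toInt  // row-block coordinate
    val ri = (row.index % rowsPerBlock).toInt  // row offset inside its block
    // Densify the row once, then slice it into per-column-block segments.
    row.vector.toArray.grouped(colsPerBlock).zipWithIndex.map {
      case (vals, cb) => ((rb, cb), (ri, vals))
    }
  }.groupByKey().map { case ((rb, cb), rowsInBlock) =>
    val nr = math.min(rowsPerBlock.toLong, nRows - rb.toLong * rowsPerBlock).toInt
    val nc = math.min(colsPerBlock, nCols - cb * colsPerBlock)
    val data = new Array[Double](nr * nc)      // column-major, zero-filled
    rowsInBlock.foreach { case (ri, vals) =>
      var j = 0
      while (j < nc) { data(j * nr + ri) = vals(j); j += 1 }
    }
    ((rb, cb), new DenseMatrix(nr, nc, data): Matrix)
  }
  new BlockMatrix(blocks, rowsPerBlock, colsPerBlock)
}
```

Blocks with no surviving rows are simply absent, which {{BlockMatrix}} treats as zero; the memory cost is densifying every row, which is the point of the request.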
[jira] [Commented] (SPARK-20109) Need a way to convert from IndexedRowMatrix to Dense Block Matrices
[ https://issues.apache.org/jira/browse/SPARK-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945388#comment-15945388 ] Apache Spark commented on SPARK-20109: -- User 'johnc1231' has created a pull request for this issue: https://github.com/apache/spark/pull/17459 > Need a way to convert from IndexedRowMatrix to Dense Block Matrices > --- > > Key: SPARK-20109 > URL: https://issues.apache.org/jira/browse/SPARK-20109 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.1.0 >Reporter: John Compitello > > The current implementation of toBlockMatrix on IndexedRowMatrix is > insufficient. It is implemented by first converting the IndexedRowMatrix to a > CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not > only is this slower than it needs to be, it also means that the created > BlockMatrix ends up being backed by instances of SparseMatrix, which a user > may not want.
[jira] [Commented] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter
[ https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945389#comment-15945389 ] Kazuaki Ishizaki commented on SPARK-20112: -- SPARK-18745 fixed integer overflow issues in {{HashedRelation.scala}} due to large data, which was merged into post-2.0.2. If the data is very large, would it be possible to have a chance to try it with the latest branch-2.0? > SIGSEGV in GeneratedIterator.sort_addToSorter > - > > Key: SPARK-20112 > URL: https://issues.apache.org/jira/browse/SPARK-20112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops) >Reporter: Mitesh > Attachments: codegen_sorter_crash.log, hs_err_pid19271.log > > > I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The > hs_err_pid and codegen file are attached (with query plans). It's not a > deterministic repro, but running a big query load, I eventually see it come > up within a few minutes. > Here is some interesting repro information: > - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't > repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I > think that means it's not an issue with the code-gen, but I can't figure out > what the difference in behavior is. > - The broadcast joins in the plan are all small tables. I have > autoJoinBroadcast=-1 because I always hint which tables should be broadcast. > - As you can see from the plan, all the sources are cached memory tables. And > we partition/sort them all beforehand so it's always sort-merge-joins or > broadcast joins (with small tables). 
> {noformat} > # A fatal error has been detected by the Java Runtime Environment: > # > # [thread 139872345896704 also had an error] > SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688 > # > # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build > 1.8.0_60-b27) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode > linux-amd64 compressed oops) > [thread 139872348002048 also had an error]# Problematic frame: > # > J 28454 C1 > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V > (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3] > {noformat} > This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, > but that is marked fixed in 2.0.0
[jira] [Assigned] (SPARK-20109) Need a way to convert from IndexedRowMatrix to Dense Block Matrices
[ https://issues.apache.org/jira/browse/SPARK-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20109: Assignee: Apache Spark > Need a way to convert from IndexedRowMatrix to Dense Block Matrices > --- > > Key: SPARK-20109 > URL: https://issues.apache.org/jira/browse/SPARK-20109 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.1.0 >Reporter: John Compitello >Assignee: Apache Spark > > The current implementation of toBlockMatrix on IndexedRowMatrix is > insufficient. It is implemented by first converting the IndexedRowMatrix to a > CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not > only is this slower than it needs to be, it also means that the created > BlockMatrix ends up being backed by instances of SparseMatrix, which a user > may not want.
[jira] [Updated] (SPARK-20109) Need a way to convert from IndexedRowMatrix to Dense Block Matrices
[ https://issues.apache.org/jira/browse/SPARK-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Compitello updated SPARK-20109: Summary: Need a way to convert from IndexedRowMatrix to Dense Block Matrices (was: Need a way to convert from IndexedRowMatrix to Block) > Need a way to convert from IndexedRowMatrix to Dense Block Matrices > --- > > Key: SPARK-20109 > URL: https://issues.apache.org/jira/browse/SPARK-20109 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.1.0 >Reporter: John Compitello > > The current implementation of toBlockMatrix on IndexedRowMatrix is > insufficient. It is implemented by first converting the IndexedRowMatrix to a > CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not > only is this slower than it needs to be, it also means that the created > BlockMatrix ends up being backed by instances of SparseMatrix, which a user > may not want.
[jira] [Commented] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
[ https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945379#comment-15945379 ] Yuming Wang commented on SPARK-20107: - OK, I will add {{spark.mapreduce.fileoutputcommitter.algorithm.version}} to the documentation. > Speed up HadoopMapReduceCommitProtocol#commitJob for many output files > -- > > Key: SPARK-20107 > URL: https://issues.apache.org/jira/browse/SPARK-20107 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yuming Wang > > Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up > [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121] > for many output files. > It can save {{11 minutes}} for 216869 output files: > {code:sql} > CREATE TABLE tmp.spark_20107 AS SELECT > category_id, > product_id, > track_id, > concat( > substr(ds, 3, 2), > substr(ds, 6, 2), > substr(ds, 9, 2) > ) shortDate, > CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' > WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE > 'invalid actio' END AS type > FROM > tmp.user_action > WHERE > ds > date_sub('2017-01-23', 730) > AND actiontype IN ('0','1','2','3'); > {code} > {code} > $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l > 216870 > {code} > This improvement can affect all of Cloudera's Hadoop cdh5-2.6.0_5.4.0 and higher > versions (see: > [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433] > and > [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0]) > and Apache Hadoop 2.7.0 and higher versions. 
[jira] [Comment Edited] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns
[ https://issues.apache.org/jira/browse/SPARK-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945166#comment-15945166 ] Mitesh edited comment on SPARK-19981 at 3/28/17 3:21 PM: - As I mentioned on the PR, maybe its better to fix it here? https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41 Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as {{a#1 as newA#2}}. Otherwise you have a similar problem with sorting. Here is a sort example with 1 partition. I believe the extra sort on {{newA}} is unnecessary. {code} scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", "b").coalesce(1).sortWithinPartitions("a") df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int] scala> val df2 = df1.selectExpr("a as newA", "b") df2: org.apache.spark.sql.DataFrame = [newA: int, b: int] scala> println(df1.join(df2, df1("a") === df2("newA")).queryExecution.executedPlan) *SortMergeJoin [args=[a#37225], [newA#37232], Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, newA#37232:int%NONNULL, b#37243:int%NONNULL)] :- *Sort [args=[a#37225 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- LocalTableScan [args=[a#37225, b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] +- *Sort [args=[newA#37232 ASC], false, 0][outPart=SinglePartition][outOrder=List(newA#37232 ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Project [args=[a#37242 AS newA#37232, b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Sort [args=[a#37242 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- LocalTableScan [args=[a#37242, b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] {code} was (Author: masterddt): As I mentioned on the PR, maybe a better to fix it here: https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41 Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as {{a#1 as newA#2}}. Otherwise you have a similar problem with sorting. Here is a sort example with 1 partition. I believe the extra sort on {{newA}} is unnecessary. 
[jira] [Resolved] (SPARK-20126) Remove HiveSessionState
[ https://issues.apache.org/jira/browse/SPARK-20126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20126. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17457 [https://github.com/apache/spark/pull/17457] > Remove HiveSessionState > --- > > Key: SPARK-20126 > URL: https://issues.apache.org/jira/browse/SPARK-20126 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > Fix For: 2.2.0 > > > After SPARK-20100 the added value of the HiveSessionState has become quite > limited. We should remove it.
[jira] [Resolved] (SPARK-20124) Join reorder should keep the same order of final project attributes
[ https://issues.apache.org/jira/browse/SPARK-20124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20124. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17453 [https://github.com/apache/spark/pull/17453] > Join reorder should keep the same order of final project attributes > --- > > Key: SPARK-20124 > URL: https://issues.apache.org/jira/browse/SPARK-20124 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Zhenhua Wang > Fix For: 2.2.0 > > > Join reorder algorithm should keep exactly the same order of output > attributes in the top project. > For example, if a user wants to select a, b, c, after reordering, we should > output a, b, c in the same order as specified by the user, instead of b, a, c or > other orders.
[jira] [Assigned] (SPARK-20124) Join reorder should keep the same order of final project attributes
[ https://issues.apache.org/jira/browse/SPARK-20124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-20124: --- Assignee: Zhenhua Wang > Join reorder should keep the same order of final project attributes > --- > > Key: SPARK-20124 > URL: https://issues.apache.org/jira/browse/SPARK-20124 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang > Fix For: 2.2.0 > > > Join reorder algorithm should keep exactly the same order of output > attributes in the top project. > For example, if a user wants to select a, b, c, after reordering, we should > output a, b, c in the same order as specified by the user, instead of b, a, c or > other orders.
[jira] [Commented] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
[ https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945177#comment-15945177 ] Imran Rashid commented on SPARK-20107: -- From Marcelo's comment on the PR: bq. We shouldn't set the default to this value. Algo v2 has different correctness characteristics from v1 - yeah, it's faster especially in FSes like S3, but it is also more likely to lead to bad data when things fail. bq. If a user is comfortable with the trade-offs they can set this in their own configuration. It would be nice to turn this jira into documenting the option; otherwise this should be closed as "won't fix" > Speed up HadoopMapReduceCommitProtocol#commitJob for many output files > -- > > Key: SPARK-20107 > URL: https://issues.apache.org/jira/browse/SPARK-20107 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yuming Wang > > Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up > [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121] > for many output files. 
> It can save {{11 minutes}} for 216869 output files: > {code:sql} > CREATE TABLE tmp.spark_20107 AS SELECT > category_id, > product_id, > track_id, > concat( > substr(ds, 3, 2), > substr(ds, 6, 2), > substr(ds, 9, 2) > ) shortDate, > CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' > WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE > 'invalid actio' END AS type > FROM > tmp.user_action > WHERE > ds > date_sub('2017-01-23', 730) > AND actiontype IN ('0','1','2','3'); > {code} > {code} > $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l > 216870 > {code} > This improvement can affect all of Cloudera's Hadoop cdh5-2.6.0_5.4.0 and higher > versions (see: > [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433] > and > [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0]) > and Apache Hadoop 2.7.0 and higher versions.
[jira] [Comment Edited] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns
[ https://issues.apache.org/jira/browse/SPARK-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945166#comment-15945166 ] Mitesh edited comment on SPARK-19981 at 3/28/17 1:35 PM: - As I mentioned on the PR, maybe a better way is to fix it here: https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41 Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as {{a#1 as newA#2}}. Otherwise you have a similar problem with sorting. Here is a sort example with 1 partition. I believe the extra sort on {{newA}} is unnecessary. {code} scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", "b").coalesce(1).sortWithinPartitions("a") df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int] scala> val df2 = df1.selectExpr("a as newA", "b") df2: org.apache.spark.sql.DataFrame = [newA: int, b: int] scala> println(df1.join(df2, df1("a") === df2("newA")).queryExecution.executedPlan) *SortMergeJoin [args=[a#37225], [newA#37232], Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, newA#37232:int%NONNULL, b#37243:int%NONNULL)] :- *Sort [args=[a#37225 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- LocalTableScan [args=[a#37225, b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] +- *Sort [args=[newA#37232 ASC], false, 0][outPart=SinglePartition][outOrder=List(newA#37232 ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Project [args=[a#37242 AS newA#37232, b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Sort [args=[a#37242 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- LocalTableScan [args=[a#37242, b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] {code} was (Author: masterddt): As I mentioned on the PR, maybe a better to fix it here: https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41 Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as {{a#1 as newA#2}}. Otherwise you have a similar problem with sorting. Here is a sort example with 1 partition. I believe the extra sort on {{newA}} is unnecessary. 
{code} scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", "b").coalesce(1).sortWithinPartitions("a") df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int] scala> val df2 = df1.selectExpr("a as newA", "b") df2: org.apache.spark.sql.DataFrame = [newA: int, b: int] scala> println(df1.join(df2, df1("a") === df2("newA")).queryExecution.executedPlan) *SortMergeJoin [args=[a#37225], [newA#37232], Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, newA#37232:int%NONNULL, b#37243:int%NONNULL)] :- *Sort [args=[a#37225 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- LocalTableScan [args=[a#37225, b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] +- *Sort [args=[newA#37232 ASC], false, 0][outPart=SinglePartition][outOrder=List(newA#37232 ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Project [args=[a#37242 AS newA#37232, b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Sort [args=[a#37242 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- LocalTableScan [args=[a#37242, b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)]
[jira] [Comment Edited] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns
[ https://issues.apache.org/jira/browse/SPARK-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945166#comment-15945166 ] Mitesh edited comment on SPARK-19981 at 3/28/17 1:34 PM: - As I mentioned on the PR, maybe a better way is to handle it here: https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41 Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as {{a#1 as newA#2}}. Otherwise you have a similar problem with sorting. Here is a sort example with 1 partition. I believe the extra sort on {{newA}} is unnecessary. {code} scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", "b").coalesce(1).sortWithinPartitions("a") df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int] scala> val df2 = df1.selectExpr("a as newA", "b") df2: org.apache.spark.sql.DataFrame = [newA: int, b: int] scala> println(df1.join(df2, df1("a") === df2("newA")).queryExecution.executedPlan) *SortMergeJoin [args=[a#37225], [newA#37232], Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, newA#37232:int%NONNULL, b#37243:int%NONNULL)] :- *Sort [args=[a#37225 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- LocalTableScan [args=[a#37225, b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] +- *Sort [args=[newA#37232 ASC], false, 0][outPart=SinglePartition][outOrder=List(newA#37232 ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Project [args=[a#37242 AS newA#37232, b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Sort [args=[a#37242 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- LocalTableScan [args=[a#37242, b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] {code} was (Author: masterddt): As I mentioned on the PR, this seems like it should be handled here: https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41 Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as {{a#1 as newA#2}}. Otherwise you have a similar problem with sorting. Here is a sort example with 1 partition. I believe the extra sort on {{newA}} is unnecessary. 
{code} scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", "b").coalesce(1).sortWithinPartitions("a") df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int] scala> val df2 = df1.selectExpr("a as newA", "b") df2: org.apache.spark.sql.DataFrame = [newA: int, b: int] scala> println(df1.join(df2, df1("a") === df2("newA")).queryExecution.executedPlan) *SortMergeJoin [args=[a#37225], [newA#37232], Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, newA#37232:int%NONNULL, b#37243:int%NONNULL)] :- *Sort [args=[a#37225 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- LocalTableScan [args=[a#37225, b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] +- *Sort [args=[newA#37232 ASC], false, 0][outPart=SinglePartition][outOrder=List(newA#37232 ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Project [args=[a#37242 AS newA#37232, b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Sort [args=[a#37242 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- LocalTableScan [args=[a#37242, b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL,
[jira] [Comment Edited] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns
[ https://issues.apache.org/jira/browse/SPARK-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945166#comment-15945166 ] Mitesh edited comment on SPARK-19981 at 3/28/17 1:34 PM: - As I mentioned on the PR, this seems like it should be handled here: https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41 Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as {{a#1 as newA#2}}. Otherwise you have a similar problem with sorting. Here is a sort example with 1 partition. I believe the extra sort on {{newA}} is unnecessary. {code} scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", "b").coalesce(1).sortWithinPartitions("a") df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int] scala> val df2 = df1.selectExpr("a as newA", "b") df2: org.apache.spark.sql.DataFrame = [newA: int, b: int] scala> println(df1.join(df2, df1("a") === df2("newA")).queryExecution.executedPlan) *SortMergeJoin [args=[a#37225], [newA#37232], Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, newA#37232:int%NONNULL, b#37243:int%NONNULL)] :- *Sort [args=[a#37225 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- LocalTableScan [args=[a#37225, b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] +- *Sort [args=[newA#37232 ASC], false, 0][outPart=SinglePartition][outOrder=List(newA#37232 ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Project [args=[a#37242 AS newA#37232, b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Sort [args=[a#37242 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- LocalTableScan [args=[a#37242, b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] {code} was (Author: masterddt): As I mentioned on the PR, this seems like it should be handled here: https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41 Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as {{a#1 as newA#2}}. Otherwise you have a similar problem with sorting. Here is a sort example with 1 partition. I believe the extra sort on {[newA}] is unnecessary. 
{code} scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", "b").coalesce(1).sortWithinPartitions("a") df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int] scala> val df2 = df1.selectExpr("a as newA", "b") df2: org.apache.spark.sql.DataFrame = [newA: int, b: int] scala> println(df1.join(df2, df1("a") === df2("newA")).queryExecution.executedPlan) *SortMergeJoin [args=[a#37225], [newA#37232], Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, newA#37232:int%NONNULL, b#37243:int%NONNULL)] :- *Sort [args=[a#37225 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- LocalTableScan [args=[a#37225, b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] +- *Sort [args=[newA#37232 ASC], false, 0][outPart=SinglePartition][outOrder=List(newA#37232 ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Project [args=[a#37242 AS newA#37232, b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Sort [args=[a#37242 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- LocalTableScan [args=[a#37242, b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL,
[jira] [Comment Edited] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns
[ https://issues.apache.org/jira/browse/SPARK-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945166#comment-15945166 ] Mitesh edited comment on SPARK-19981 at 3/28/17 1:34 PM: - As I mentioned on the PR, maybe a better to fix it here: https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41 Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as {{a#1 as newA#2}}. Otherwise you have a similar problem with sorting. Here is a sort example with 1 partition. I believe the extra sort on {{newA}} is unnecessary. {code} scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", "b").coalesce(1).sortWithinPartitions("a") df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int] scala> val df2 = df1.selectExpr("a as newA", "b") df2: org.apache.spark.sql.DataFrame = [newA: int, b: int] scala> println(df1.join(df2, df1("a") === df2("newA")).queryExecution.executedPlan) *SortMergeJoin [args=[a#37225], [newA#37232], Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, newA#37232:int%NONNULL, b#37243:int%NONNULL)] :- *Sort [args=[a#37225 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- LocalTableScan [args=[a#37225, b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] +- *Sort [args=[newA#37232 ASC], false, 0][outPart=SinglePartition][outOrder=List(newA#37232 ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Project [args=[a#37242 AS newA#37232, b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Sort [args=[a#37242 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- LocalTableScan [args=[a#37242, b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] {code} was (Author: masterddt): As I mentioned on the PR, maybe a better way is to handle it here: https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41 Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as {{a#1 as newA#2}}. Otherwise you have a similar problem with sorting. Here is a sort example with 1 partition. I believe the extra sort on {{newA}} is unnecessary. 
{code} scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", "b").coalesce(1).sortWithinPartitions("a") df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int] scala> val df2 = df1.selectExpr("a as newA", "b") df2: org.apache.spark.sql.DataFrame = [newA: int, b: int] scala> println(df1.join(df2, df1("a") === df2("newA")).queryExecution.executedPlan) *SortMergeJoin [args=[a#37225], [newA#37232], Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, newA#37232:int%NONNULL, b#37243:int%NONNULL)] :- *Sort [args=[a#37225 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- LocalTableScan [args=[a#37225, b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] +- *Sort [args=[newA#37232 ASC], false, 0][outPart=SinglePartition][outOrder=List(newA#37232 ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Project [args=[a#37242 AS newA#37232, b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Sort [args=[a#37242 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- LocalTableScan [args=[a#37242, b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL,
[jira] [Comment Edited] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns
[ https://issues.apache.org/jira/browse/SPARK-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945166#comment-15945166 ] Mitesh edited comment on SPARK-19981 at 3/28/17 1:33 PM: - As I mentioned on the PR, this seems like it should be handled here: https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41 Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as {{a#1 as newA#2}}. Otherwise you have a similar problem with sorting. Here is a sort example with 1 partition. I believe the extra sort on {[newA}] is unnecessary. {code} scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", "b").coalesce(1).sortWithinPartitions("a") df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int] scala> val df2 = df1.selectExpr("a as newA", "b") df2: org.apache.spark.sql.DataFrame = [newA: int, b: int] scala> println(df1.join(df2, df1("a") === df2("newA")).queryExecution.executedPlan) *SortMergeJoin [args=[a#37225], [newA#37232], Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, newA#37232:int%NONNULL, b#37243:int%NONNULL)] :- *Sort [args=[a#37225 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- LocalTableScan [args=[a#37225, b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] +- *Sort [args=[newA#37232 ASC], false, 0][outPart=SinglePartition][outOrder=List(newA#37232 ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Project [args=[a#37242 AS newA#37232, b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Sort [args=[a#37242 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- LocalTableScan [args=[a#37242, b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] {code} was (Author: masterddt): As I mentioned on the PR, this seems like it should be handled here: https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41 Perhaps it should handle canonicalizing an alias so `a#1` is the same as `a#1 as newA#2`. Otherwise you have a similar problem with sorting. Here is a sort example with 1 partition. I believe the extra sort on `newA` is unnecessary. 
``` scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", "b").coalesce(1).sortWithinPartitions("a") df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int] scala> val df2 = df1.selectExpr("a as newA", "b") df2: org.apache.spark.sql.DataFrame = [newA: int, b: int] scala> println(df1.join(df2, df1("a") === df2("newA")).queryExecution.executedPlan) *SortMergeJoin [args=[a#37225], [newA#37232], Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, newA#37232:int%NONNULL, b#37243:int%NONNULL)] :- *Sort [args=[a#37225 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- LocalTableScan [args=[a#37225, b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] +- *Sort [args=[newA#37232 ASC], false, 0][outPart=SinglePartition][outOrder=List(newA#37232 ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Project [args=[a#37242 AS newA#37232, b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Sort [args=[a#37242 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- LocalTableScan [args=[a#37242, b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL,
[jira] [Commented] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns
[ https://issues.apache.org/jira/browse/SPARK-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945166#comment-15945166 ] Mitesh commented on SPARK-19981: As I mentioned on the PR, this seems like it should be handled here: https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41 Perhaps it should handle canonicalizing an alias so `a#1` is the same as `a#1 as newA#2`. Otherwise you have a similar problem with sorting. Here is a sort example with 1 partition. I believe the extra sort on `newA` is unnecessary. ``` scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", "b").coalesce(1).sortWithinPartitions("a") df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int] scala> val df2 = df1.selectExpr("a as newA", "b") df2: org.apache.spark.sql.DataFrame = [newA: int, b: int] scala> println(df1.join(df2, df1("a") === df2("newA")).queryExecution.executedPlan) *SortMergeJoin [args=[a#37225], [newA#37232], Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, newA#37232:int%NONNULL, b#37243:int%NONNULL)] :- *Sort [args=[a#37225 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] : +- LocalTableScan [args=[a#37225, b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)] +- *Sort [args=[newA#37232 ASC], false, 0][outPart=SinglePartition][outOrder=List(newA#37232 ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Project [args=[a#37242 AS newA#37232, b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)] +- *Sort [args=[a#37242 ASC], false, 0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- Coalesce [args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)] +- LocalTableScan [args=[a#37242, b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)]``` > Sort-Merge join inserts shuffles when joining dataframes with aliased columns > - > > Key: SPARK-19981 > URL: https://issues.apache.org/jira/browse/SPARK-19981 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Allen George > > Performing a sort-merge join with two dataframes - each of which has the join > column aliased - causes Spark to insert an unnecessary shuffle. > Consider the scala test code below, which should be equivalent to the > following SQL. 
> {code:SQL} > SELECT * FROM > (SELECT number AS aliased from df1) t1 > LEFT JOIN > (SELECT number AS aliased from df2) t2 > ON t1.aliased = t2.aliased > {code} > {code:scala} > private case class OneItem(number: Long) > private case class TwoItem(number: Long, value: String) > test("join with aliases should not trigger shuffle") { > val df1 = sqlContext.createDataFrame( > Seq( > OneItem(0), > OneItem(2), > OneItem(4) > ) > ) > val partitionedDf1 = df1.repartition(10, col("number")) > partitionedDf1.createOrReplaceTempView("df1") > partitionedDf1.cache() partitionedDf1.count() > > val df2 = sqlContext.createDataFrame( > Seq( > TwoItem(0, "zero"), > TwoItem(2, "two"), > TwoItem(4, "four") > ) > ) > val partitionedDf2 = df2.repartition(10, col("number")) > partitionedDf2.createOrReplaceTempView("df2") > partitionedDf2.cache() partitionedDf2.count() > > val fromDf1 = sqlContext.sql("SELECT number from df1") > val fromDf2 = sqlContext.sql("SELECT number from df2") > val aliasedDf1 = fromDf1.select(col(fromDf1.columns.head) as "aliased") > val aliasedDf2 = fromDf2.select(col(fromDf2.columns.head) as "aliased") > aliasedDf1.join(aliasedDf2, Seq("aliased"), "left_outer") } > {code} > Both the SQL and the Scala code generate a query-plan where an extra exchange > is inserted before performing the sort-merge join. This exchange changes the > partitioning from {{HashPartitioning("number", 10)}} for each frame being > joined into {{HashPartitioning("aliased", 5)}}. I would have expected that, > since it's simple column aliasing and both frames have exactly the same > partitioning as the initial frames, no extra exchange would be inserted. > {noformat} > *Project
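A quick way to confirm whether the extra exchange described in this report is present is to scan the physical plan for shuffle nodes. A hedged sketch (Spark 2.x, where the shuffle operator is `org.apache.spark.sql.execution.exchange.ShuffleExchange`; `aliasedDf1`/`aliasedDf2` are the frames built in the test above):

```scala
// Sketch: count shuffle exchanges in the executed plan of the aliased join.
// If alias canonicalization preserved the child partitioning, the two
// pre-partitioned inputs should not need to be shuffled again.
import org.apache.spark.sql.execution.exchange.ShuffleExchange

val joined = aliasedDf1.join(aliasedDf2, Seq("aliased"), "left_outer")

// TreeNode.collect walks the whole physical plan tree.
val shuffles = joined.queryExecution.executedPlan.collect {
  case e: ShuffleExchange => e
}

// With the behavior reported in this issue, `shuffles` is non-empty even
// though both inputs were already hash-partitioned on the renamed column.
println(s"shuffle exchanges in plan: ${shuffles.length}")
```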
[jira] [Updated] (SPARK-20128) MetricsSystem not always killed in SparkContext.stop()
[ https://issues.apache.org/jira/browse/SPARK-20128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-20128: - Description: One Jenkins run failed due to the MetricsSystem never getting killed after a failed test, which led that test to hang and the tests to timeout: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75176 {noformat} 17/03/24 13:44:19.537 dag-scheduler-event-loop ERROR DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting down SparkContext java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:431) at org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:430) at scala.Option.flatMap(Option.scala:171) at org.apache.spark.MapOutputTrackerMaster.getEpochForMapOutput(MapOutputTracker.scala:430) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1298) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1731) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1689) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1678) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 17/03/24 13:44:19.540 dispatcher-event-loop-11 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 17/03/24 13:44:19.546 stop-spark-context INFO MemoryStore: MemoryStore cleared 17/03/24 13:44:19.546 stop-spark-context INFO BlockManager: BlockManager stopped 17/03/24 13:44:19.546 stop-spark-context INFO BlockManagerMaster: BlockManagerMaster stopped 17/03/24 13:44:19.546 dispatcher-event-loop-16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 
17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully stopped SparkContext 17/03/24 14:02:19.934 metrics-console-reporter-1-thread-1 ERROR ScheduledReporter: RuntimeException thrown from ConsoleReporter#report. Exception was suppressed. java.lang.NullPointerException at org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:35) at org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:34) at com.codahale.metrics.ConsoleReporter.printGauge(ConsoleReporter.java:239) ... {noformat} Unfortunately I didn't save the entire test logs, but what happens is: the initial ArrayIndexOutOfBoundsException is a real bug, which causes the SparkContext to stop and the test to fail. However, the MetricsSystem somehow stays alive, and since it's not a daemon thread, it just hangs, and every 20 minutes we get that NPE from within the metrics system as it tries to report. I am totally perplexed at how this can happen; it looks like the metrics system should always be stopped by the time we see {noformat} 17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully stopped SparkContext {noformat} I don't think I've ever seen this in real Spark use, but it doesn't look like something limited to tests, whatever the cause. 
[jira] [Created] (SPARK-20128) MetricsSystem not always killed in SparkContext.stop()
Imran Rashid created SPARK-20128: Summary: MetricsSystem not always killed in SparkContext.stop() Key: SPARK-20128 URL: https://issues.apache.org/jira/browse/SPARK-20128 Project: Spark Issue Type: Test Components: Spark Core, Tests Affects Versions: 2.2.0 Reporter: Imran Rashid One Jenkins run failed due to the MetricsSystem never getting killed after a failed test, which led that test to hang and the tests to timeout: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75176 {noformat} 17/03/24 13:44:19.537 dag-scheduler-event-loop ERROR DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting down SparkContext java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:431) at org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:430) at scala.Option.flatMap(Option.scala:171) at org.apache.spark.MapOutputTrackerMaster.getEpochForMapOutput(MapOutputTracker.scala:430) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1298) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1731) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1689) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1678) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 17/03/24 13:44:19.540 dispatcher-event-loop-11 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 
17/03/24 13:44:19.546 stop-spark-context INFO MemoryStore: MemoryStore cleared 17/03/24 13:44:19.546 stop-spark-context INFO BlockManager: BlockManager stopped 17/03/24 13:44:19.546 stop-spark-context INFO BlockManagerMaster: BlockManagerMaster stopped 17/03/24 13:44:19.546 dispatcher-event-loop-16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully stopped SparkContext 17/03/24 14:02:19.934 metrics-console-reporter-1-thread-1 ERROR ScheduledReporter: RuntimeException thrown from ConsoleReporter#report. Exception was suppressed. java.lang.NullPointerException at org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:35) at org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:34) at com.codahale.metrics.ConsoleReporter.printGauge(ConsoleReporter.java:239) ... {noformat} Unfortunately I didn't save the entire test logs, but what happens is that the initial ArrayIndexOutOfBoundsException is a real bug, which causes the SparkContext to stop and the test to fail. However, the MetricsSystem somehow stays alive, and since it's not a daemon thread, it just hangs, and every 20 minutes we get that NPE from within the metrics system as it tries to report. I am totally perplexed at how this can happen; it looks like the metrics system should always be stopped by the time we see {noformat} 17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully stopped SparkContext {noformat} I don't think I've ever seen this in real Spark use, but whatever the cause, it doesn't look like something limited to tests. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
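The hang pattern is easy to reproduce outside Spark: any periodic reporter running on a non-daemon thread keeps the process alive after the rest of the application has shut down. A minimal Python sketch of the same lifecycle (illustrative only — Spark's MetricsSystem is JVM code, and the names below are made up):

```python
import threading

class PeriodicReporter:
    """Toy stand-in for a metrics reporter thread (not Spark's actual code)."""

    def __init__(self, interval_s, daemon=False):
        self._stop = threading.Event()
        # A non-daemon thread (like the JVM reporter in the log above) keeps
        # the process alive until it is explicitly stopped -- the bug is a
        # shutdown path that forgets to call stop().
        self._thread = threading.Thread(
            target=self._loop, args=(interval_s,), daemon=daemon)

    def _loop(self, interval_s):
        while not self._stop.wait(interval_s):
            pass  # report() would run here on every interval

    def start(self):
        self._thread.start()

    def stop(self):
        # Idempotent, explicit stop: this is the guarantee SparkContext.stop()
        # must provide for the MetricsSystem on every exit path.
        self._stop.set()
        self._thread.join(timeout=5)

reporter = PeriodicReporter(0.05)
reporter.start()
reporter.stop()  # without this call, the non-daemon thread would hang the process
print(reporter._thread.is_alive())  # → False
```

Marking the reporter thread as a daemon would mask the hang, but an explicit stop on every shutdown path is the cleaner fix, since it also silences the periodic NPE.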
[jira] [Assigned] (SPARK-20127) Minor code cleanup
[ https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20127: Assignee: Apache Spark > Minor code cleanup > -- > > Key: SPARK-20127 > URL: https://issues.apache.org/jira/browse/SPARK-20127 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Denis Bolshakov >Assignee: Apache Spark >Priority: Minor > > Intellij IDEA shows a bunch of messages while inspecting source code. > So fixing the most explicit ones gives the following: > - improving source code quality > - involving to contributing process
[jira] [Assigned] (SPARK-20127) Minor code cleanup
[ https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20127: Assignee: (was: Apache Spark) > Minor code cleanup > -- > > Key: SPARK-20127 > URL: https://issues.apache.org/jira/browse/SPARK-20127 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Denis Bolshakov >Priority: Minor > > Intellij IDEA shows a bunch of messages while inspecting source code. > So fixing the most explicit ones gives the following: > - improving source code quality > - involving to contributing process
[jira] [Commented] (SPARK-20127) Minor code cleanup
[ https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945107#comment-15945107 ] Apache Spark commented on SPARK-20127: -- User 'dbolshak' has created a pull request for this issue: https://github.com/apache/spark/pull/17458 > Minor code cleanup > -- > > Key: SPARK-20127 > URL: https://issues.apache.org/jira/browse/SPARK-20127 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Denis Bolshakov >Priority: Minor > > Intellij IDEA shows a bunch of messages while inspecting source code. > So fixing the most explicit ones gives the following: > - improving source code quality > - involving to contributing process
[jira] [Assigned] (SPARK-20126) Remove HiveSessionState
[ https://issues.apache.org/jira/browse/SPARK-20126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20126: Assignee: Herman van Hovell (was: Apache Spark) > Remove HiveSessionState > --- > > Key: SPARK-20126 > URL: https://issues.apache.org/jira/browse/SPARK-20126 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > > After SPARK-20100 the added value of the HiveSessionState has become quite > limited. We should remove it.
[jira] [Assigned] (SPARK-20126) Remove HiveSessionState
[ https://issues.apache.org/jira/browse/SPARK-20126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20126: Assignee: Apache Spark (was: Herman van Hovell) > Remove HiveSessionState > --- > > Key: SPARK-20126 > URL: https://issues.apache.org/jira/browse/SPARK-20126 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Apache Spark > > After SPARK-20100 the added value of the HiveSessionState has become quite > limited. We should remove it.
[jira] [Commented] (SPARK-20126) Remove HiveSessionState
[ https://issues.apache.org/jira/browse/SPARK-20126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945072#comment-15945072 ] Apache Spark commented on SPARK-20126: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/17457 > Remove HiveSessionState > --- > > Key: SPARK-20126 > URL: https://issues.apache.org/jira/browse/SPARK-20126 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > > After SPARK-20100 the added value of the HiveSessionState has become quite > limited. We should remove it.
[jira] [Commented] (SPARK-20127) Minor code cleanup
[ https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945058#comment-15945058 ] Sean Owen commented on SPARK-20127: --- We use pull requests to suggest changes, but before you do, I think most of those changes have not been made so far in the code on purpose. > Minor code cleanup > -- > > Key: SPARK-20127 > URL: https://issues.apache.org/jira/browse/SPARK-20127 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Denis Bolshakov >Priority: Minor > > Intellij IDEA shows a bunch of messages while inspecting source code. > So fixing the most explicit ones gives the following: > - improving source code quality > - involving to contributing process
[jira] [Comment Edited] (SPARK-20127) Minor code cleanup
[ https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945051#comment-15945051 ] Denis Bolshakov edited comment on SPARK-20127 at 3/28/17 12:32 PM: --- I applied yours first comments (just by reverting changes). Full patch could be found here https://github.com/dbolshak/spark/tree/SPARK-20127 was (Author: bolshakov.de...@gmail.com): I applied you first comments (just by reverting changes). Full patch could be found here https://github.com/dbolshak/spark/tree/SPARK-20127 > Minor code cleanup > -- > > Key: SPARK-20127 > URL: https://issues.apache.org/jira/browse/SPARK-20127 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Denis Bolshakov >Priority: Minor > > Intellij IDEA shows a bunch of messages while inspecting source code. > So fixing the most explicit ones gives the following: > - improving source code quality > - involving to contributing process
[jira] [Commented] (SPARK-20127) Minor code cleanup
[ https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945051#comment-15945051 ] Denis Bolshakov commented on SPARK-20127: - I applied you first comments (just by reverting changes). Full patch could be found here https://github.com/dbolshak/spark/tree/SPARK-20127 > Minor code cleanup > -- > > Key: SPARK-20127 > URL: https://issues.apache.org/jira/browse/SPARK-20127 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Denis Bolshakov >Priority: Minor > > Intellij IDEA shows a bunch of messages while inspecting source code. > So fixing the most explicit ones gives the following: > - improving source code quality > - involving to contributing process
[jira] [Commented] (SPARK-20127) Minor code cleanup
[ https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945014#comment-15945014 ] Denis Bolshakov commented on SPARK-20127: - You can review changes shortly here https://github.com/dbolshak/spark/commit/8a31173b436eb7d39f2423eccda7932239f8d5b6 Nothing else right now. > Minor code cleanup > -- > > Key: SPARK-20127 > URL: https://issues.apache.org/jira/browse/SPARK-20127 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Denis Bolshakov >Priority: Minor > > Intellij IDEA shows a bunch of messages while inspecting source code. > So fixing the most explicit ones gives the following: > - improving source code quality > - involving to contributing process
[jira] [Updated] (SPARK-20127) Minor code cleanup
[ https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20127: -- Affects Version/s: (was: 2.3.0) 2.1.0 Labels: (was: newbiee) Priority: Minor (was: Major) We don't assign issues at this stage in general. I would avoid changes that are not helping perf or correctness. May be better to discuss the kinds of things you would change here first. > Minor code cleanup > -- > > Key: SPARK-20127 > URL: https://issues.apache.org/jira/browse/SPARK-20127 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Denis Bolshakov >Priority: Minor > > Intellij IDEA shows a bunch of messages while inspecting source code. > So fixing the most explicit ones gives the following: > - improving source code quality > - involving to contributing process
[jira] [Comment Edited] (SPARK-20127) Minor code cleanup
[ https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945002#comment-15945002 ] Denis Bolshakov edited comment on SPARK-20127 at 3/28/17 11:46 AM: --- Hello [~srowen], thanks for quick feedback. Could you please assign issue to me? )) I understand your point, and I've tried to exclude changes related to code style or dispute changes. I believe there are no changes that could have performance impact. Kind regards, Denis was (Author: bolshakov.de...@gmail.com): Hello [~srowen], thanks for quick feedback. Could you please assign issue to me? )) I understand your point I've tried to exclude changes related to code style or dispute changes. I believe there are no changes that could have performance impact. Kind regards, Denis > Minor code cleanup > -- > > Key: SPARK-20127 > URL: https://issues.apache.org/jira/browse/SPARK-20127 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Denis Bolshakov > Labels: newbiee > > Intellij IDEA shows a bunch of messages while inspecting source code. > So fixing the most explicit ones gives the following: > - improving source code quality > - involving to contributing process
[jira] [Commented] (SPARK-20127) Minor code cleanup
[ https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945002#comment-15945002 ] Denis Bolshakov commented on SPARK-20127: - Hello [~srowen], thanks for quick feedback. Could you please assign issue to me? )) I understand your point I've tried to exclude changes related to code style or dispute changes. I believe there are no changes that could have performance impact. Kind regards, Denis > Minor code cleanup > -- > > Key: SPARK-20127 > URL: https://issues.apache.org/jira/browse/SPARK-20127 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Denis Bolshakov > Labels: newbiee > > Intellij IDEA shows a bunch of messages while inspecting source code. > So fixing the most explicit ones gives the following: > - improving source code quality > - involving to contributing process
[jira] [Resolved] (SPARK-20094) Should Prevent push down of IN subquery to Join operator
[ https://issues.apache.org/jira/browse/SPARK-20094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-20094. --- Resolution: Fixed Assignee: Zhenhua Wang Fix Version/s: 2.2.0 > Should Prevent push down of IN subquery to Join operator > > > Key: SPARK-20094 > URL: https://issues.apache.org/jira/browse/SPARK-20094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang > Fix For: 2.2.0 > > > ReorderJoin collects all predicates and try to put them into join condition > when creating ordered join. If a predicate with an IN subquery is in a join > condition instead of a filter condition, > `RewritePredicateSubquery.rewriteExistentialExpr` would fail to convert the > subquery to an ExistenceJoin, and thus result in error. > For example, tpcds q45 fails due to the above reason: > {noformat} > spark-sql> explain codegen > > SELECT > > ca_zip, > > ca_city, > > sum(ws_sales_price) > > FROM web_sales, customer, customer_address, date_dim, item > > WHERE ws_bill_customer_sk = c_customer_sk > > AND c_current_addr_sk = ca_address_sk > > AND ws_item_sk = i_item_sk > > AND (substr(ca_zip, 1, 5) IN > > ('85669', '86197', '88274', '83405', '86475', '85392', '85460', > '80348', '81792') > > OR > > i_item_id IN (SELECT i_item_id > > FROM item > > WHERE i_item_sk IN (2, 3, 5, 7, 11, 13, 17, 19, 23, 29) > > ) > > ) > > AND ws_sold_date_sk = d_date_sk > > AND d_qoy = 2 AND d_year = 2001 > > GROUP BY ca_zip, ca_city > > ORDER BY ca_zip, ca_city > > LIMIT 100; > 17/03/25 15:27:02 ERROR SparkSQLDriver: Failed in [explain codegen > > SELECT > ca_zip, > ca_city, > sum(ws_sales_price) > FROM web_sales, customer, customer_address, date_dim, item > WHERE ws_bill_customer_sk = c_customer_sk > AND c_current_addr_sk = ca_address_sk > AND ws_item_sk = i_item_sk > AND (substr(ca_zip, 1, 5) IN > ('85669', '86197', '88274', '83405', '86475', '85392', '85460', '80348', > '81792') > OR 
> i_item_id IN (SELECT i_item_id > FROM item > WHERE i_item_sk IN (2, 3, 5, 7, 11, 13, 17, 19, 23, 29) > ) > ) > AND ws_sold_date_sk = d_date_sk > AND d_qoy = 2 AND d_year = 2001 > GROUP BY ca_zip, ca_city > ORDER BY ca_zip, ca_city > LIMIT 100] > java.lang.UnsupportedOperationException: Cannot evaluate expression: list#1 [] > at > org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:224) > at > org.apache.spark.sql.catalyst.expressions.ListQuery.doGenCode(subquery.scala:262) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101) > at > org.apache.spark.sql.catalyst.expressions.In$$anonfun$3.apply(predicates.scala:199) > at > org.apache.spark.sql.catalyst.expressions.In$$anonfun$3.apply(predicates.scala:199) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.expressions.In.doGenCode(predicates.scala:199) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101) > at > org.apache.spark.sql.catalyst.expressions.Or.doGenCode(predicates.scala:379) > at > 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101) > at >
[jira] [Commented] (SPARK-20127) Minor code cleanup
[ https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944984#comment-15944984 ] Sean Owen commented on SPARK-20127: --- I am a fan of improving code style and static inspection. However we have generally declined to do big-bang modifications to the code base where the change is purely a question of style. Otherwise I would have done this a long time ago :) Before starting, have a look and see if any are correctness or significant performance issues, or small changes that would resolve a consistency problem. Mention them here first. > Minor code cleanup > -- > > Key: SPARK-20127 > URL: https://issues.apache.org/jira/browse/SPARK-20127 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Denis Bolshakov > Labels: newbiee > > Intellij IDEA shows a bunch of messages while inspecting source code. > So fixing the most explicit ones gives the following: > - improving source code quality > - involving to contributing process
[jira] [Created] (SPARK-20127) Minor code cleanup
Denis Bolshakov created SPARK-20127: --- Summary: Minor code cleanup Key: SPARK-20127 URL: https://issues.apache.org/jira/browse/SPARK-20127 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0 Reporter: Denis Bolshakov Intellij IDEA shows a bunch of messages while inspecting source code. So fixing the most explicit ones gives the following: - improving source code quality - involving to contributing process
[jira] [Created] (SPARK-20126) Remove HiveSessionState
Herman van Hovell created SPARK-20126: - Summary: Remove HiveSessionState Key: SPARK-20126 URL: https://issues.apache.org/jira/browse/SPARK-20126 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Herman van Hovell Assignee: Herman van Hovell After SPARK-20100 the added value of the HiveSessionState has become quite limited. We should remove it.
[jira] [Updated] (SPARK-20123) $SPARK_HOME variable might have spaces in it(e.g. $SPARK_HOME=/home/spark build/spark), then build spark failed.
[ https://issues.apache.org/jira/browse/SPARK-20123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-20123: Description: If $SPARK_HOME or $FWDIR variable contains spaces, then use "./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn" build spark will failed. (was: If $SPARK_HOME or $FWDIR variable contains spaces, then use "./dev/make-distribution.sh --r --name custom-spark --tgz -Psparkr -Phadoop-2.4 -Phive -Phive-thriftserver -Pmesos -Pyarn" build spark will failed.) > $SPARK_HOME variable might have spaces in it(e.g. $SPARK_HOME=/home/spark > build/spark), then build spark failed. > > > Key: SPARK-20123 > URL: https://issues.apache.org/jira/browse/SPARK-20123 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0 >Reporter: zuotingbing >Priority: Minor > > If $SPARK_HOME or $FWDIR variable contains spaces, then use > "./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.7 > -Phive -Phive-thriftserver -Pmesos -Pyarn" build spark will failed.
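The failure mode here is the classic unquoted-variable problem: if the build scripts expand `$SPARK_HOME` without double quotes, the shell word-splits the path on the space, so `/home/spark build/spark` becomes two arguments. The splitting can be simulated with Python's `shlex` (which follows POSIX shell tokenization rules); the command shown is a made-up stand-in for what a build script might run, not an actual line from make-distribution.sh:

```python
import shlex

spark_home = "/home/spark build/spark"  # a SPARK_HOME containing a space

# Unquoted expansion: the shell splits the path on the space.
unquoted = shlex.split(f"cp -r {spark_home}/jars dist/")
# Quoted expansion, i.e. "$SPARK_HOME": the path survives as one word.
quoted = shlex.split(f'cp -r "{spark_home}/jars" dist/')

print(unquoted)  # → ['cp', '-r', '/home/spark', 'build/spark/jars', 'dist/']
print(quoted)    # → ['cp', '-r', '/home/spark build/spark/jars', 'dist/']
```

The fix in the shell scripts is simply to double-quote every expansion of `$SPARK_HOME` and `$FWDIR`.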
[jira] [Commented] (SPARK-14228) Lost executor of RPC disassociated, and occurs exception: Could not find CoarseGrainedScheduler or it has been stopped
[ https://issues.apache.org/jira/browse/SPARK-14228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944912#comment-15944912 ] Amitabh commented on SPARK-14228: - Hi, can you specify the version you were working with? I have received the same error in 1.6.0 > Lost executor of RPC disassociated, and occurs exception: Could not find > CoarseGrainedScheduler or it has been stopped > -- > > Key: SPARK-14228 > URL: https://issues.apache.org/jira/browse/SPARK-14228 > Project: Spark > Issue Type: Bug >Reporter: meiyoula > > When I start 1000 executors, and then stop the process. It will call > SparkContext.stop to stop all executors. But during this process, the > executors has been killed will lost of rpc with driver, and try to > reviveOffers, but can't find CoarseGrainedScheduler or it has been stopped. > {quote} > 16/03/29 01:45:45 ERROR YarnScheduler: Lost executor 610 on 51-196-152-8: > remote Rpc client disassociated > 16/03/29 01:45:45 ERROR Inbox: Ignoring error > org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it > has been stopped. 
> at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:161) > at > org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131) > at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:173) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:398) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:314) > at > org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:482) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.removeExecutor(CoarseGrainedSchedulerBackend.scala:261) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$onDisconnected$1.apply(CoarseGrainedSchedulerBackend.scala:207) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$onDisconnected$1.apply(CoarseGrainedSchedulerBackend.scala:207) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.onDisconnected(CoarseGrainedSchedulerBackend.scala:207) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:144) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {quote}
[jira] [Commented] (SPARK-10294) When Parquet writer's close method throws an exception, we will call close again and trigger a NPE
[ https://issues.apache.org/jira/browse/SPARK-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944887#comment-15944887 ] Steve Loughran commented on SPARK-10294: consider it a failure in the exception logic; it tries to close streams, and, if one is fully/partially closed, an NPE is raised. the Parquet update may fix it in parquet, but the problem still lurks. A robust exception cleanup would be resilient to it ever recurring. Looking at the original cause, S3 file too big. That problem has gone away now; in tests s3a will happily stream 80+GB to a single blob > When Parquet writer's close method throws an exception, we will call close > again and trigger a NPE > -- > > Key: SPARK-10294 > URL: https://issues.apache.org/jira/browse/SPARK-10294 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai > Attachments: screenshot-1.png > > > When a task saves a large parquet file (larger than the S3 file size limit) > to S3, looks like we still call parquet writer's close twice and triggers NPE > reported in SPARK-7837. Eventually, job failed and I got NPE as the > exception. Actually, the real problem was that the file was too large for S3. 
> {code}
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1818)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1831)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1908)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
> at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
> at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:927)
> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:927)
> at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
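The "robust exception cleanup" Loughran describes boils down to making close paths tolerate a second invocation. Below is a minimal illustrative sketch of that idea, not Spark's or Parquet's actual code; the `IdempotentCloser` class is hypothetical.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the fix described in the comment: make close()
// idempotent so a second call (e.g. from exception-cleanup code after the
// first close already failed) is a no-op instead of touching already-torn-down
// internal state and raising an NPE.
class IdempotentCloser implements Closeable {
    private final Closeable delegate;
    private final AtomicBoolean closed = new AtomicBoolean(false);

    IdempotentCloser(Closeable delegate) {
        this.delegate = delegate;
    }

    @Override
    public void close() throws IOException {
        // Only the first caller actually closes; later calls return quietly.
        if (closed.compareAndSet(false, true)) {
            delegate.close();
        }
    }

    boolean isClosed() {
        return closed.get();
    }
}
```

Wrapping the writer's underlying stream this way would make the double-close in the failure path harmless, whether or not the Parquet-side fix lands.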
[jira] [Updated] (SPARK-20094) Should Prevent push down of IN subquery to Join operator
[ https://issues.apache.org/jira/browse/SPARK-20094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-20094:
- Summary: Should Prevent push down of IN subquery to Join operator (was: Putting predicate with IN subquery into join condition in ReorderJoin fails RewritePredicateSubquery.rewriteExistentialExpr)
> Should Prevent push down of IN subquery to Join operator
>
> Key: SPARK-20094
> URL: https://issues.apache.org/jira/browse/SPARK-20094
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Zhenhua Wang
>
> ReorderJoin collects all predicates and tries to put them into the join
> condition when creating an ordered join. If a predicate with an IN subquery
> ends up in a join condition instead of a filter condition,
> `RewritePredicateSubquery.rewriteExistentialExpr` fails to convert the
> subquery to an ExistenceJoin, and the query errors out.
> For example, TPC-DS q45 fails for this reason:
> {noformat}
> spark-sql> explain codegen
> > SELECT
> > ca_zip,
> > ca_city,
> > sum(ws_sales_price)
> > FROM web_sales, customer, customer_address, date_dim, item
> > WHERE ws_bill_customer_sk = c_customer_sk
> > AND c_current_addr_sk = ca_address_sk
> > AND ws_item_sk = i_item_sk
> > AND (substr(ca_zip, 1, 5) IN
> > ('85669', '86197', '88274', '83405', '86475', '85392', '85460', '80348', '81792')
> > OR
> > i_item_id IN (SELECT i_item_id
> > FROM item
> > WHERE i_item_sk IN (2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
> > )
> > )
> > AND ws_sold_date_sk = d_date_sk
> > AND d_qoy = 2 AND d_year = 2001
> > GROUP BY ca_zip, ca_city
> > ORDER BY ca_zip, ca_city
> > LIMIT 100;
> 17/03/25 15:27:02 ERROR SparkSQLDriver: Failed in [explain codegen
> SELECT
> ca_zip,
> ca_city,
> sum(ws_sales_price)
> FROM web_sales, customer, customer_address, date_dim, item
> WHERE ws_bill_customer_sk = c_customer_sk
> AND c_current_addr_sk = ca_address_sk
> AND ws_item_sk = i_item_sk
> AND (substr(ca_zip, 1, 5) IN
> ('85669', '86197', '88274', '83405', '86475', '85392', '85460', '80348', '81792')
> OR
> i_item_id IN (SELECT i_item_id
> FROM item
> WHERE i_item_sk IN (2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
> )
> )
> AND ws_sold_date_sk = d_date_sk
> AND d_qoy = 2 AND d_year = 2001
> GROUP BY ca_zip, ca_city
> ORDER BY ca_zip, ca_city
> LIMIT 100]
> java.lang.UnsupportedOperationException: Cannot evaluate expression: list#1 []
> at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:224)
> at org.apache.spark.sql.catalyst.expressions.ListQuery.doGenCode(subquery.scala:262)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
> at org.apache.spark.sql.catalyst.expressions.In$$anonfun$3.apply(predicates.scala:199)
> at org.apache.spark.sql.catalyst.expressions.In$$anonfun$3.apply(predicates.scala:199)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at org.apache.spark.sql.catalyst.expressions.In.doGenCode(predicates.scala:199)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
> at org.apache.spark.sql.catalyst.expressions.Or.doGenCode(predicates.scala:379)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
> at
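Until the optimizer stops pushing IN subqueries into join conditions, one possible workaround (my own suggestion, not something proposed in this ticket) is to keep the subquery out of the disjunction entirely by materializing it first. A sketch of q45 rewritten that way, assuming the standard TPC-DS schema:

```sql
-- Hypothetical rewrite: pre-compute the IN subquery as a CTE and turn the
-- disjunct into a LEFT JOIN + IS NOT NULL check, so no ListQuery expression
-- remains in the predicates that ReorderJoin collects.
WITH matching_items AS (
  SELECT DISTINCT i_item_id           -- DISTINCT preserves semi-join semantics
  FROM item
  WHERE i_item_sk IN (2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
)
SELECT ca_zip, ca_city, sum(ws_sales_price)
FROM web_sales
JOIN customer         ON ws_bill_customer_sk = c_customer_sk
JOIN customer_address ON c_current_addr_sk = ca_address_sk
JOIN date_dim         ON ws_sold_date_sk = d_date_sk
JOIN item             ON ws_item_sk = i_item_sk
LEFT JOIN matching_items m ON i_item_id = m.i_item_id
WHERE (substr(ca_zip, 1, 5) IN
         ('85669', '86197', '88274', '83405', '86475', '85392', '85460',
          '80348', '81792')
       OR m.i_item_id IS NOT NULL)
  AND d_qoy = 2 AND d_year = 2001
GROUP BY ca_zip, ca_city
ORDER BY ca_zip, ca_city
LIMIT 100;
```

This trades the ExistenceJoin the rewriter would have produced for an explicit left join, which the failing code path never has to handle.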