[jira] [Assigned] (SPARK-20670) Simplify FPGrowth transform
[ https://issues.apache.org/jira/browse/SPARK-20670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20670: Assignee: (was: Apache Spark) > Simplify FPGrowth transform > --- > > Key: SPARK-20670 > URL: https://issues.apache.org/jira/browse/SPARK-20670 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Minor > > As suggested by [~srowen] in https://github.com/apache/spark/pull/17130, the > transform code in FPGrowthModel can be simplified. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20670) Simplify FPGrowth transform
[ https://issues.apache.org/jira/browse/SPARK-20670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20670: Assignee: Apache Spark > Simplify FPGrowth transform > --- > > Key: SPARK-20670 > URL: https://issues.apache.org/jira/browse/SPARK-20670 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Assignee: Apache Spark >Priority: Minor > > As suggested by [~srowen] in https://github.com/apache/spark/pull/17130, the > transform code in FPGrowthModel can be simplified.
[jira] [Commented] (SPARK-20670) Simplify FPGrowth transform
[ https://issues.apache.org/jira/browse/SPARK-20670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002098#comment-16002098 ] Apache Spark commented on SPARK-20670: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/17912 > Simplify FPGrowth transform > --- > > Key: SPARK-20670 > URL: https://issues.apache.org/jira/browse/SPARK-20670 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Minor > > As suggested by [~srowen] in https://github.com/apache/spark/pull/17130, the > transform code in FPGrowthModel can be simplified.
[jira] [Created] (SPARK-20671) Processing multiple kafka topics with single spark streaming context hangs on batchSubmitted.
amit kumar created SPARK-20671:
--
Summary: Processing multiple kafka topics with single spark streaming context hangs on batchSubmitted.
Key: SPARK-20671
URL: https://issues.apache.org/jira/browse/SPARK-20671
Project: Spark
Issue Type: Bug
Components: DStreams
Affects Versions: 2.0.0
Environment: Ubuntu
Reporter: amit kumar

{code}
object SparkMain extends App {
  System.setProperty("spark.cassandra.connection.host", "127.0.0.1")
  val conf = new SparkConf().setMaster("local[2]").setAppName("kafkaspark")
    .set("spark.streaming.concurrentJobs", "4")
  val sc = new SparkContext(conf)
  val ssc = new StreamingContext(sc, Seconds(5))
  val sqlContext = new SQLContext(sc)
  val host = "localhost:2181"
  val topicList = List("test", "fb")

  topicList.foreach { topic =>
    val lines = KafkaUtils.createStream(ssc, host, topic, Map(topic -> 1)).map(_._2)
    //configureStream(topic, lines)
    lines.foreachRDD(rdd =>
      rdd.map(test(_)).saveToCassandra("test", "rawdata", SomeColumns("key")))
  }

  ssc.addStreamingListener(new StreamingListener {
    override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
      System.out.println("Batch completed, Total delay :" + batchCompleted.batchInfo.totalDelay.get.toString + " ms")
    }
    override def onReceiverStarted(receiverStarted: StreamingListenerReceiverStarted): Unit = {
      println("inside onReceiverStarted")
    }
    override def onReceiverError(receiverError: StreamingListenerReceiverError): Unit = {
      println("inside onReceiverError")
    }
    override def onReceiverStopped(receiverStopped: StreamingListenerReceiverStopped): Unit = {
      println("inside onReceiverStopped")
    }
    override def onBatchSubmitted(batchSubmitted: StreamingListenerBatchSubmitted): Unit = {
      println("inside onBatchSubmitted")
    }
    override def onBatchStarted(batchStarted: StreamingListenerBatchStarted): Unit = {
      println("inside onBatchStarted")
    }
  })

  ssc.start()
  println("===")
  ssc.awaitTermination()
}

case class test(key: String)
{code}

If I put any one of the topics at a time then each topic works. But when the topic list has more than one topic, after getting the DStream from the kafka topic, it keeps printing "inside onBatchSubmitted". Thanks in advance.
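A likely explanation, not confirmed in the issue itself: the code above creates one receiver per topic, and under `setMaster("local[2]")` two receivers can occupy both local cores, leaving none for batch processing, so batches are submitted but never started. A minimal sketch of the usual workaround is to pass all topics to a single receiver in one map (the `createStream` call and the `"kafkaspark-group"` group id shown in the comment are assumptions for illustration):

```scala
// Sketch: build one topic -> partition-count map for a single receiver,
// instead of calling KafkaUtils.createStream once per topic. With Spark's
// receiver-based Kafka API this map would then be passed as:
//   KafkaUtils.createStream(ssc, host, "kafkaspark-group", topicMap)
val topicList = List("test", "fb")
val topicMap: Map[String, Int] = topicList.map(topic => topic -> 1).toMap
```

Alternatively, increasing the core count (e.g. `local[4]`) leaves cores free for processing alongside multiple receivers.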
[jira] [Assigned] (SPARK-20668) Modify ScalaUDF to handle nullability.
[ https://issues.apache.org/jira/browse/SPARK-20668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20668: Assignee: Apache Spark > Modify ScalaUDF to handle nullability. > -- > > Key: SPARK-20668 > URL: https://issues.apache.org/jira/browse/SPARK-20668 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark > > When registering a Scala UDF, we can know whether the UDF will return a nullable > value or not. {{ScalaUDF}} and related classes should handle the nullability.
[jira] [Assigned] (SPARK-20668) Modify ScalaUDF to handle nullability.
[ https://issues.apache.org/jira/browse/SPARK-20668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20668: Assignee: (was: Apache Spark) > Modify ScalaUDF to handle nullability. > -- > > Key: SPARK-20668 > URL: https://issues.apache.org/jira/browse/SPARK-20668 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin > > When registering a Scala UDF, we can know whether the UDF will return a nullable > value or not. {{ScalaUDF}} and related classes should handle the nullability.
[jira] [Commented] (SPARK-20668) Modify ScalaUDF to handle nullability.
[ https://issues.apache.org/jira/browse/SPARK-20668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002090#comment-16002090 ] Apache Spark commented on SPARK-20668: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/17911 > Modify ScalaUDF to handle nullability. > -- > > Key: SPARK-20668 > URL: https://issues.apache.org/jira/browse/SPARK-20668 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin > > When registering a Scala UDF, we can know whether the UDF will return a nullable > value or not. {{ScalaUDF}} and related classes should handle the nullability.
[jira] [Created] (SPARK-20670) Simplify FPGrowth transform
yuhao yang created SPARK-20670: -- Summary: Simplify FPGrowth transform Key: SPARK-20670 URL: https://issues.apache.org/jira/browse/SPARK-20670 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Priority: Minor As suggested by [~srowen] in https://github.com/apache/spark/pull/17130, the transform code in FPGrowthModel can be simplified.
[jira] [Assigned] (SPARK-20669) LogisticRegression family should be case insensitive
[ https://issues.apache.org/jira/browse/SPARK-20669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20669: Assignee: Apache Spark > LogisticRegression family should be case insensitive > - > > Key: SPARK-20669 > URL: https://issues.apache.org/jira/browse/SPARK-20669 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Trivial > > {{LogisticRegression}} family should be case insensitive
[jira] [Assigned] (SPARK-20669) LogisticRegression family should be case insensitive
[ https://issues.apache.org/jira/browse/SPARK-20669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20669: Assignee: (was: Apache Spark) > LogisticRegression family should be case insensitive > - > > Key: SPARK-20669 > URL: https://issues.apache.org/jira/browse/SPARK-20669 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Priority: Trivial > > {{LogisticRegression}} family should be case insensitive
[jira] [Commented] (SPARK-20669) LogisticRegression family should be case insensitive
[ https://issues.apache.org/jira/browse/SPARK-20669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002086#comment-16002086 ] Apache Spark commented on SPARK-20669: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/17910 > LogisticRegression family should be case insensitive > - > > Key: SPARK-20669 > URL: https://issues.apache.org/jira/browse/SPARK-20669 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Priority: Trivial > > {{LogisticRegression}} family should be case insensitive
[jira] [Created] (SPARK-20669) LogisticRegression family should be case insensitive
zhengruifeng created SPARK-20669: Summary: LogisticRegression family should be case insensitive Key: SPARK-20669 URL: https://issues.apache.org/jira/browse/SPARK-20669 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: zhengruifeng Priority: Trivial {{LogisticRegression}} family should be case insensitive
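The shape of the requested change can be sketched in plain Scala: normalize the `family` value before validating it, so that casing no longer matters. This is an illustrative sketch, not the actual patch; the supported values mirror LogisticRegression's documented families ("auto", "binomial", "multinomial").

```scala
// Sketch: accept "binomial", "Binomial", "BINOMIAL", etc. by lower-casing
// the user-supplied family before checking it against the supported set.
val supportedFamilies = Set("auto", "binomial", "multinomial")

def validateFamily(family: String): String = {
  val normalized = family.toLowerCase
  require(supportedFamilies.contains(normalized),
    s"Unsupported family: $family")
  normalized
}
```

In ML params this kind of check typically lives in the param's validator, so an invalid value fails fast at `set` time rather than at `fit` time.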
[jira] [Commented] (SPARK-20661) SparkR tableNames() test fails
[ https://issues.apache.org/jira/browse/SPARK-20661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002082#comment-16002082 ] Apache Spark commented on SPARK-20661: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/17909 > SparkR tableNames() test fails > -- > > Key: SPARK-20661 > URL: https://issues.apache.org/jira/browse/SPARK-20661 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Assignee: Hossein Falaki > Labels: test > Fix For: 2.2.0 > > > Due to prior state created by other test cases, testing {{tableNames()}} is > failing in master. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console
[jira] [Created] (SPARK-20668) Modify ScalaUDF to handle nullability.
Takuya Ueshin created SPARK-20668: - Summary: Modify ScalaUDF to handle nullability. Key: SPARK-20668 URL: https://issues.apache.org/jira/browse/SPARK-20668 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Takuya Ueshin When registering a Scala UDF, we can know whether the UDF will return a nullable value or not. {{ScalaUDF}} and related classes should handle the nullability.
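The core observation can be illustrated in plain Scala, independent of Spark: the UDF's Scala return type already tells us at registration time whether the result can ever be null. This is a sketch of the idea only; `resultNullable` is a hypothetical helper, not part of the actual patch.

```scala
import scala.reflect.runtime.universe._

// Sketch: a value type (Int, Double, Boolean, ...) can never be null,
// while any reference type (String, Option[_], case classes, ...) can be.
// A ScalaUDF-like wrapper could carry this flag instead of always
// assuming nullable = true.
def resultNullable[T: TypeTag]: Boolean = !(typeOf[T] <:< typeOf[AnyVal])
```

With such a flag, a UDF declared as `(Int, Int) => Int` could be marked non-nullable in the plan, which lets the optimizer skip null checks it would otherwise insert.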
[jira] [Assigned] (SPARK-20667) Cleanup the cataloged metadata after completing the package of sql/core and sql/hive
[ https://issues.apache.org/jira/browse/SPARK-20667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20667: Assignee: Apache Spark (was: Xiao Li) > Cleanup the cataloged metadata after completing the package of sql/core and > sql/hive > > > Key: SPARK-20667 > URL: https://issues.apache.org/jira/browse/SPARK-20667 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Apache Spark > > So far, we do not drop all the cataloged tables after each package. > Sometimes, we might hit strange test case errors because the previous test > suite did not drop the tables/functions/database. At least, we can first > clean up the environment when completing the package of sql/core and sql/hive.
[jira] [Assigned] (SPARK-20667) Cleanup the cataloged metadata after completing the package of sql/core and sql/hive
[ https://issues.apache.org/jira/browse/SPARK-20667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20667: Assignee: Xiao Li (was: Apache Spark) > Cleanup the cataloged metadata after completing the package of sql/core and > sql/hive > > > Key: SPARK-20667 > URL: https://issues.apache.org/jira/browse/SPARK-20667 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > So far, we do not drop all the cataloged tables after each package. > Sometimes, we might hit strange test case errors because the previous test > suite did not drop the tables/functions/database. At least, we can first > clean up the environment when completing the package of sql/core and sql/hive.
[jira] [Commented] (SPARK-20667) Cleanup the cataloged metadata after completing the package of sql/core and sql/hive
[ https://issues.apache.org/jira/browse/SPARK-20667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002035#comment-16002035 ] Apache Spark commented on SPARK-20667: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/17908 > Cleanup the cataloged metadata after completing the package of sql/core and > sql/hive > > > Key: SPARK-20667 > URL: https://issues.apache.org/jira/browse/SPARK-20667 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > So far, we do not drop all the cataloged tables after each package. > Sometimes, we might hit strange test case errors because the previous test > suite did not drop the tables/functions/database. At least, we can first > clean up the environment when completing the package of sql/core and sql/hive.
[jira] [Updated] (SPARK-20667) Cleanup the cataloged metadata after completing the package of sql/core and sql/hive
[ https://issues.apache.org/jira/browse/SPARK-20667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20667: Description: So far, we do not drop all the cataloged tables after each package. Sometimes, we might hit strange test case errors because the previous test suite did not drop the tables/functions/database. At least, we can first clean up the environment when completing the package of sql/core and sql/hive. (was: So far, we do not drop all the cataloged tables after each package. Sometimes, we might hit strange test case errors due to the previous test suite. At least, we can first clean up the environment when completing the package of sql/core and sql/hive) > Cleanup the cataloged metadata after completing the package of sql/core and > sql/hive > > > Key: SPARK-20667 > URL: https://issues.apache.org/jira/browse/SPARK-20667 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > So far, we do not drop all the cataloged tables after each package. > Sometimes, we might hit strange test case errors because the previous test > suite did not drop the tables/functions/database. At least, we can first > clean up the environment when completing the package of sql/core and sql/hive.
[jira] [Updated] (SPARK-20667) Cleanup the cataloged metadata after completing the package of sql/core and sql/hive
[ https://issues.apache.org/jira/browse/SPARK-20667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20667: Description: So far, we do not drop all the cataloged tables after each package. Sometimes, we might hit strange test case errors due to the previous test suite. At least, we can first clean up the environment when completing the package of sql/core and sql/hive (was: So far, we did not drop all the cataloged tables after each package. Sometimes, we might hit strange test case errors due to the previous test suite. At least, we can first clean up the environment when completing the package of sql/core and sql/hive) > Cleanup the cataloged metadata after completing the package of sql/core and > sql/hive > > > Key: SPARK-20667 > URL: https://issues.apache.org/jira/browse/SPARK-20667 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > So far, we do not drop all the cataloged tables after each package. > Sometimes, we might hit strange test case errors due to the previous test > suite. At least, we can first clean up the environment when completing the > package of sql/core and sql/hive
[jira] [Created] (SPARK-20667) Cleanup the cataloged metadata after completing the package of sql/core and sql/hive
Xiao Li created SPARK-20667: --- Summary: Cleanup the cataloged metadata after completing the package of sql/core and sql/hive Key: SPARK-20667 URL: https://issues.apache.org/jira/browse/SPARK-20667 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.2.0 Reporter: Xiao Li Assignee: Xiao Li So far, we did not drop all the cataloged tables after each package. Sometimes, we might hit strange test case errors due to the previous test suite. At least, we can first clean up the environment when completing the package of sql/core and sql/hive
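The cleanup idea can be sketched in plain Scala, independent of Spark's catalog API: a shared-session test harness records every table a suite creates and drops them all once the package finishes, so state never leaks into the next suite. `CatalogCleanup` is a hypothetical helper for illustration, not the actual patch.

```scala
// Sketch: track catalog objects created by test suites and clear them in
// one pass at the end of the package. In Spark this cleanup step would
// issue `DROP TABLE IF EXISTS ...` (and the equivalents for functions and
// databases) per recorded entry; here we just return what would be dropped.
object CatalogCleanup {
  private val createdTables = scala.collection.mutable.Set.empty[String]

  def register(table: String): Unit = createdTables += table

  def cleanupAll(): Seq[String] = {
    val dropped = createdTables.toSeq.sorted
    createdTables.clear()
    dropped
  }
}
```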
[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20666: - Summary: Flaky test - SparkListenerBus randomly failing on Windows (was: Flaky test - random ml test failure on Windows) > Flaky test - SparkListenerBus randomly failing on Windows > - > > Key: SPARK-20666 > URL: https://issues.apache.org/jira/browse/SPARK-20666 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core, SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung > > seeing quite a bit of this on AppVeyor, aka Windows only > {code} > Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 159454 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) > at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88) > at > org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216) > at > 
org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) > 1 > MLlib recommendation algorithms: Spark package found in SPARK_HOME: > C:\projects\spark\bin\.. 
> {code} > {code} > java.lang.IllegalStateException: SparkContext has been shutdown > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2906) >
[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20666: - Summary: Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError on Windows (was: Flaky test - SparkListenerBus randomly failing on Windows) > Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError > on Windows > -- > > Key: SPARK-20666 > URL: https://issues.apache.org/jira/browse/SPARK-20666 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core, SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung > > seeing quite a bit of this on AppVeyor, aka Windows only, always only when > running ML tests, it seems > {code} > Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 159454 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) > at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88) > at > org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at 
scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) > 1 > MLlib recommendation algorithms: Spark package found in SPARK_HOME: > C:\projects\spark\bin\.. 
> {code} > {code} > java.lang.IllegalStateException: SparkContext has been shutdown > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907) > at >
[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20666: - Description: seeing quite a bit of this on AppVeyor, aka Windows only, always only when running ML tests, it seems {code} Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: Attempted to access garbage collected accumulator 159454 at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) at scala.Option.map(Option.scala:146) at org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88) at org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67) at org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) at org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216) at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36) 
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) 1 MLlib recommendation algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. {code} {code} java.lang.IllegalStateException: SparkContext has been shutdown at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.collect(RDD.scala:935) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923) at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) at 
org.apache.spark.sql.Dataset.withAction(Dataset.scala:2906) at org.apache.spark.sql.Dataset.collect(Dataset.scala:2474) at org.apache.spark.sql.api.r.SQLUtils$.dfToCols(SQLUtils.scala:173) at org.apache.spark.sql.api.r.SQLUtils.dfToCols(SQLUtils.scala) at sun.reflect.GeneratedMethodAccessor104.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at
[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20666: - Component/s: Spark Core > Flaky test - SparkListenerBus randomly failing on Windows > - > > Key: SPARK-20666 > URL: https://issues.apache.org/jira/browse/SPARK-20666 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core, SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung > > seeing quite a bit of this on AppVeyor, aka Windows only > {code} > Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 159454 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) > at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88) > at > org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > 
org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) > 1 > MLlib recommendation algorithms: Spark package found in SPARK_HOME: > C:\projects\spark\bin\.. 
> {code} > {code} > java.lang.IllegalStateException: SparkContext has been shutdown > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2906) > at org.apache.spark.sql.Dataset.collect(Dataset.scala:2474) > at
[jira] [Updated] (SPARK-20666) Flaky test - random ml test failure on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20666: - Description: seeing quite a bit of this on AppVeyor, aka Windows only {code} Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: Attempted to access garbage collected accumulator 159454 at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) at scala.Option.map(Option.scala:146) at org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88) at org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67) at org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) at org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216) at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36) at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) 1 MLlib recommendation algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. {code} {code} java.lang.IllegalStateException: SparkContext has been shutdown at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.collect(RDD.scala:935) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923) at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) at 
org.apache.spark.sql.Dataset.withAction(Dataset.scala:2906) at org.apache.spark.sql.Dataset.collect(Dataset.scala:2474) at org.apache.spark.sql.api.r.SQLUtils$.dfToCols(SQLUtils.scala:173) at org.apache.spark.sql.api.r.SQLUtils.dfToCols(SQLUtils.scala) at sun.reflect.GeneratedMethodAccessor104.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:167) at
[jira] [Created] (SPARK-20666) Flaky test - random ml test failure on Windows
Felix Cheung created SPARK-20666: Summary: Flaky test - random ml test failure on Windows Key: SPARK-20666 URL: https://issues.apache.org/jira/browse/SPARK-20666 Project: Spark Issue Type: Bug Components: ML, SparkR Affects Versions: 2.3.0 Reporter: Felix Cheung -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20666) Flaky test - random ml test failure on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20666: - Description: seeing quite a bit of this on AppVeyor, aka Windows only {code} java.lang.IllegalStateException: SparkContext has been shutdown at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.collect(RDD.scala:935) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923) at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2906) at org.apache.spark.sql.Dataset.collect(Dataset.scala:2474) at org.apache.spark.sql.api.r.SQLUtils$.dfToCols(SQLUtils.scala:173) at org.apache.spark.sql.api.r.SQLUtils.dfToCols(SQLUtils.scala) at sun.reflect.GeneratedMethodAccessor104.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:167) at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108) at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
[jira] [Commented] (SPARK-19581) running NaiveBayes model with 0 features can crash the executor with D rorreGEMV
[ https://issues.apache.org/jira/browse/SPARK-19581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001987#comment-16001987 ] Yan Facai (颜发才) commented on SPARK-19581: - [~barrybecker4] Could you give a code sample? > running NaiveBayes model with 0 features can crash the executor with D > rorreGEMV > > > Key: SPARK-19581 > URL: https://issues.apache.org/jira/browse/SPARK-19581 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 > Environment: Spark development or standalone mode on Windows or Linux. >Reporter: Barry Becker >Priority: Minor > > The severity of this bug is high (because nothing should cause Spark to crash > like this) but the priority may be low (because there is an easy workaround). > In our application, a user can select features and a target to run the > NaiveBayes inducer. If columns have too many values or all one value, they > will be removed before we call the inducer to create the model. As a result, > there are some cases where all the features may get removed. When this > happens, executors will crash and get restarted (if on a cluster) or Spark > will crash and need to be manually restarted (if in development mode). > It looks like NaiveBayes uses BLAS, and BLAS does not handle this case well > when it is encountered. It emits this vague error: > ** On entry to DGEMV parameter number 6 had an illegal value > and terminates. > My code looks like this: > {code} >val predictions = model.transform(testData) // Make predictions > // figure out how many were correctly predicted > val numCorrect = predictions.filter(new Column(actualTarget) === new > Column(PREDICTION_LABEL_COLUMN)).count() > val numIncorrect = testRowCount - numCorrect > {code} > The failure is at the line that does the count, but it is not the count that > causes the problem, it is the model.transform step (where the model contains > the NaiveBayes classifier). 
> Here is the stack trace (in development mode): > {code} > [2017-02-13 06:28:39,946] TRACE evidence.EvidenceVizModel$ [] > [akka://JobServer/user/context-supervisor/sql-context] - done making > predictions in 232 > ** On entry to DGEMV parameter number 6 had an illegal value > ** On entry to DGEMV parameter number 6 had an illegal value > ** On entry to DGEMV parameter number 6 had an illegal value > [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! Dropping event SparkListenerSQLExecutionEnd(9,1486996120505) > [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! Dropping event > SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@1f6c4a29) > [2017-02-13 06:28:40,508] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! 
Dropping event > SparkListenerJobEnd(12,1486996120507,JobFailed(org.apache.spark.SparkException: > Job 12 cancelled because SparkContext was shut down)) > [2017-02-13 06:28:40,509] ERROR .jobserver.JobManagerActor [] > [akka://JobServer/user/context-supervisor/sql-context] - Got Throwable > org.apache.spark.SparkException: Job 12 cancelled because SparkContext was > shut down > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:808) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:806) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) > at > org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:806) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1668) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83) > at > org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1587) > at > org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1826) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1825) > at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at >
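The easy workaround the reporter mentions is to check that at least one feature column survives filtering before calling transform, so that a zero-dimension GEMV never reaches native BLAS. A minimal sketch of that guard, using NumPy as a stand-in (the function name and arguments are hypothetical, not Spark's API):

```python
import numpy as np

def safe_transform(theta, features):
    # Hypothetical guard: refuse to score rows when every feature column
    # has been filtered out, instead of letting a zero-dimension GEMV call
    # reach native BLAS, which aborts the process with the DGEMV error above.
    if features.ndim != 2 or features.shape[1] == 0:
        raise ValueError("no feature columns left; cannot apply NaiveBayes model")
    # NaiveBayes scoring is essentially a matrix product of the feature rows
    # with the per-class log-probability matrix theta.
    return features @ theta.T
```

Raising a normal exception here surfaces the configuration problem to the caller instead of crashing the executor.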
[jira] [Assigned] (SPARK-7856) Scalable PCA implementation for tall and fat matrices
[ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7856: --- Assignee: Apache Spark > Scalable PCA implementation for tall and fat matrices > - > > Key: SPARK-7856 > URL: https://issues.apache.org/jira/browse/SPARK-7856 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tarek Elgamal >Assignee: Apache Spark > > Currently the PCA implementation has a limitation of fitting d^2 > covariance/Gramian matrix entries in memory (d is the number of > columns/dimensions of the matrix). We often need only the largest k principal > components. To make PCA really scalable, I suggest an implementation where > the memory usage is proportional to the number of principal components k rather than > the full dimensionality d. > I suggest adopting the solution described in this paper published in > SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). > The paper offers an implementation for Probabilistic PCA (PPCA) which has > lower memory and time complexity and could potentially scale to tall and fat > matrices rather than only the tall and skinny matrices supported by the > current PCA implementation. > Probabilistic PCA could potentially be added to the set of algorithms > supported by MLlib; it does not necessarily replace the old PCA > implementation. > A PPCA implementation is included in Matlab's Statistics and Machine Learning > Toolbox (http://www.mathworks.com/help/stats/ppca.html) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7856) Scalable PCA implementation for tall and fat matrices
[ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7856: --- Assignee: (was: Apache Spark) > Scalable PCA implementation for tall and fat matrices > - > > Key: SPARK-7856 > URL: https://issues.apache.org/jira/browse/SPARK-7856 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tarek Elgamal > > Currently the PCA implementation has a limitation of fitting d^2 > covariance/Gramian matrix entries in memory (d is the number of > columns/dimensions of the matrix). We often need only the largest k principal > components. To make PCA really scalable, I suggest an implementation where > the memory usage is proportional to the number of principal components k rather than > the full dimensionality d. > I suggest adopting the solution described in this paper published in > SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). > The paper offers an implementation for Probabilistic PCA (PPCA) which has > lower memory and time complexity and could potentially scale to tall and fat > matrices rather than only the tall and skinny matrices supported by the > current PCA implementation. > Probabilistic PCA could potentially be added to the set of algorithms > supported by MLlib; it does not necessarily replace the old PCA > implementation. > A PPCA implementation is included in Matlab's Statistics and Machine Learning > Toolbox (http://www.mathworks.com/help/stats/ppca.html) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7856) Scalable PCA implementation for tall and fat matrices
[ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001976#comment-16001976 ] Apache Spark commented on SPARK-7856: - User 'ghoto' has created a pull request for this issue: https://github.com/apache/spark/pull/17907 > Scalable PCA implementation for tall and fat matrices > - > > Key: SPARK-7856 > URL: https://issues.apache.org/jira/browse/SPARK-7856 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tarek Elgamal > > Currently the PCA implementation has a limitation of fitting d^2 > covariance/Gramian matrix entries in memory (d is the number of > columns/dimensions of the matrix). We often need only the largest k principal > components. To make PCA really scalable, I suggest an implementation where > the memory usage is proportional to the number of principal components k rather than > the full dimensionality d. > I suggest adopting the solution described in this paper published in > SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). > The paper offers an implementation for Probabilistic PCA (PPCA) which has > lower memory and time complexity and could potentially scale to tall and fat > matrices rather than only the tall and skinny matrices supported by the > current PCA implementation. > Probabilistic PCA could potentially be added to the set of algorithms > supported by MLlib; it does not necessarily replace the old PCA > implementation. > A PPCA implementation is included in Matlab's Statistics and Machine Learning > Toolbox (http://www.mathworks.com/help/stats/ppca.html) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
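The memory bottleneck described in the issue comes from materializing the d x d covariance matrix. The covariance-free alternative (take the thin SVD of the centered data directly) can be sketched outside Spark with NumPy; this illustrates the idea only, not the proposed MLlib implementation:

```python
import numpy as np

def pca_via_svd(X, k):
    # Center the data, then take the thin SVD of (X - mu) directly. This
    # never forms the d x d covariance/Gramian matrix, which is the memory
    # bottleneck for matrices with many columns ("fat" matrices).
    mu = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    components = Vt[:k]                          # top-k principal directions
    explained_variance = (s[:k] ** 2) / (X.shape[0] - 1)
    return components, explained_variance
```

The singular values of the centered matrix satisfy sigma_i^2 / (n - 1) = lambda_i, the eigenvalues of the sample covariance, so the result matches a covariance-based PCA without the d^2 memory cost.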
[jira] [Assigned] (SPARK-20665) Spark-sql, "Bround" function return NULL
[ https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20665: Assignee: (was: Apache Spark) > Spark-sql, "Bround" function return NULL > > > Key: SPARK-20665 > URL: https://issues.apache.org/jira/browse/SPARK-20665 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: liuxian > > >select bround(12.3, 2); > >NULL > For this case, the expected result is 12.3, but it is null -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20665) Spark-sql, "Bround" function return NULL
[ https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20665: Assignee: Apache Spark > Spark-sql, "Bround" function return NULL > > > Key: SPARK-20665 > URL: https://issues.apache.org/jira/browse/SPARK-20665 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: liuxian >Assignee: Apache Spark > > >select bround(12.3, 2); > >NULL > For this case, the expected result is 12.3, but it is null -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20665) Spark-sql, "Bround" function return NULL
[ https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001970#comment-16001970 ] Apache Spark commented on SPARK-20665: -- User '10110346' has created a pull request for this issue: https://github.com/apache/spark/pull/17906 > Spark-sql, "Bround" function return NULL > > > Key: SPARK-20665 > URL: https://issues.apache.org/jira/browse/SPARK-20665 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: liuxian > > >select bround(12.3, 2); > >NULL > For this case, the expected result is 12.3, but it is null -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
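For reference, bround is documented as HALF_EVEN ("banker's") rounding, so bround(12.3, 2) should indeed return 12.3 rather than NULL. A minimal sketch of the expected semantics in plain Python (an illustration of the rounding mode, not Spark's implementation):

```python
from decimal import Decimal, ROUND_HALF_EVEN

def bround_ref(value, scale=0):
    # HALF_EVEN rounds ties to the nearest even digit: 2.5 -> 2, 3.5 -> 4.
    # Non-tie values, like 12.3 at scale 2, are unchanged.
    quantum = Decimal(1).scaleb(-scale)
    return float(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_EVEN))
```

Going through Decimal avoids binary floating-point artifacts when forming the tie-breaking digit.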
[jira] [Commented] (SPARK-20661) SparkR tableNames() test fails
[ https://issues.apache.org/jira/browse/SPARK-20661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001966#comment-16001966 ] Apache Spark commented on SPARK-20661: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/17905 > SparkR tableNames() test fails > -- > > Key: SPARK-20661 > URL: https://issues.apache.org/jira/browse/SPARK-20661 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Assignee: Hossein Falaki > Labels: test > Fix For: 2.2.0 > > > Due to prior state created by other test cases, testing {{tableNames()}} is > failing in master. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20665) Spark-sql, "Bround" function return NULL
[ https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-20665: Description: >select bround(12.3, 2); >NULL For this case, the expected result is 12.3, but it is null was: >select bround(12.3, 2); >NULL For this case, we expected the result is 12.3, but it is null > Spark-sql, "Bround" function return NULL > > > Key: SPARK-20665 > URL: https://issues.apache.org/jira/browse/SPARK-20665 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: liuxian > > >select bround(12.3, 2); > >NULL > For this case, the expected result is 12.3, but it is null -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20665) Spark-sql, "Bround" function return NULL
[ https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-20665: Description: >select bround(12.3, 2); >NULL For this case, we expected the result is 12.3, but it is null was: >select bround(12.3, 2); >NULL For this case, we expected de result is 12.3, but it is null > Spark-sql, "Bround" function return NULL > > > Key: SPARK-20665 > URL: https://issues.apache.org/jira/browse/SPARK-20665 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: liuxian > > >select bround(12.3, 2); > >NULL > For this case, we expected the result is 12.3, but it is null -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20665) Spark-sql, "Bround" function return NULL
liuxian created SPARK-20665: --- Summary: Spark-sql, "Bround" function return NULL Key: SPARK-20665 URL: https://issues.apache.org/jira/browse/SPARK-20665 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: liuxian >select bround(12.3, 2); >NULL For this case, we expected de result is 12.3, but it is null -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001924#comment-16001924 ] sandflee commented on SPARK-18278: -- As a Spark user, what benefits would we get, besides being able to co-run Docker apps and Spark apps? > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Reporter: Erik Erlandson > Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executor lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001917#comment-16001917 ] Saisai Shao commented on SPARK-20658: - It mainly depends on YARN to measure the failure validity interval and to decide what counts as an AM failure; Spark just proxies this parameter to YARN. So if there's any unexpected behavior, I think we should investigate the YARN side to see the actual behavior. > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-default.conf and in my spark-submit command. (This flag shows up in the > environment tab of the Spark history server, so it seems that it's specified > correctly). > However, I just had a job die with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. do jobs that are > killed due to memory issues always count? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
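As the comment notes, Spark only forwards this setting; YARN's ResourceManager does the actual bookkeeping of AM failures within the sliding window. A hedged sketch of how the two related settings are typically passed (the interval value, attempt count, and job file name are examples only):

```shell
# Both are Spark-on-YARN configs that Spark hands through to YARN;
# the failure counting itself happens in the ResourceManager.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.maxAppAttempts=4 \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
  my_job.py
```

One thing worth checking on the YARN side: as far as I know, the attempt-failure validity interval relies on YARN support introduced in Hadoop 2.6, so an older cluster would silently ignore it.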
[jira] [Comment Edited] (SPARK-7856) Scalable PCA implementation for tall and fat matrices
[ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001551#comment-16001551 ] Ignacio Bermudez Corrales edited comment on SPARK-7856 at 5/9/17 1:59 AM: -- Apart from implementing Probabilistic PCA (which in my view is a different algorithm worth implementing as a separate algorithm), there are some issues with the current (2.3) implementation of RowMatrix.computePrincipalComponentsAndExplainedVariance that affect PCA training. In my opinion the big problem with the current implementation is line 387 of RowMatrix.scala, which causes OutOfMemory exceptions for these kinds of matrices, as it computes the covariance as a local Breeze dense matrix. val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]] The implementation materializes a dense local Breeze covariance matrix, which is not needed to compute either the principal components or the explained variance. In particular, RowMatrix provides a more optimized SVD decomposition, so the principal components and variance can be derived from that decomposition by computing (X - µ).computeSVD(k, false, 0). This leads to a more scalable implementation of PCA for tall and fat matrices. If this ticket is for the implementation of PPCA, that should be specified in the title. was (Author: elghoto): Apart from implementing Probabilistic PCA (which in my view is a different algorithm worth implementing as a separate algorithm), there are some issues with the current (2.3) implementation of RowMatrix.computePrincipalComponentsAndExplainedVariance that affect PCA training. In my opinion the big problem with the current implementation is line 387 of RowMatrix.scala, which causes OutOfMemory exceptions for these kinds of matrices, as it computes the covariance as a local Breeze dense matrix. 
val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]] The implementation materializes a dense local Breeze covariance matrix, which is not needed to compute either the principal components or the explained variance. In particular, RowMatrix provides a more optimized SVD decomposition, so the principal components and variance can be derived from that decomposition by computing (X - µ).computeSVD(k, false, 0). > Scalable PCA implementation for tall and fat matrices > - > > Key: SPARK-7856 > URL: https://issues.apache.org/jira/browse/SPARK-7856 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tarek Elgamal > > Currently the PCA implementation has a limitation of fitting d^2 > covariance/Gramian matrix entries in memory (d is the number of > columns/dimensions of the matrix). We often need only the largest k principal > components. To make PCA really scalable, I suggest an implementation where > the memory usage is proportional to the number of principal components k rather than > the full dimensionality d. > I suggest adopting the solution described in this paper published in > SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). > The paper offers an implementation of Probabilistic PCA (PPCA), which has > lower memory and time complexity and could potentially scale to tall and fat > matrices rather than only the tall and skinny matrices supported by the > current PCA implementation. > Probabilistic PCA could potentially be added to the set of algorithms > supported by MLlib, and it does not necessarily replace the old PCA > implementation. > A PPCA implementation is included in Matlab's Statistics and Machine Learning > Toolbox (http://www.mathworks.com/help/stats/ppca.html) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7856) Scalable PCA implementation for tall and fat matrices
[ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001551#comment-16001551 ] Ignacio Bermudez Corrales edited comment on SPARK-7856 at 5/9/17 2:00 AM: -- Apart from implementing Probabilistic PCA (which in my view is a different algorithm worth implementing separately from PCA), there are some issues with the current (2.3) implementation of RowMatrix.computePrincipalComponentsAndExplainedVariance that affect PCA training. In my opinion the big problem with the current implementation is line 387 of RowMatrix.scala, which causes OutOfMemory exceptions for these kinds of matrices, as it computes the covariance as a local Breeze dense matrix. val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]] The implementation materializes a dense local Breeze covariance matrix, which is not needed to compute either the principal components or the explained variance. In particular, RowMatrix provides a more optimized SVD decomposition, so the principal components and variance can be derived from that decomposition by computing (X - µ).computeSVD(k, false, 0). This leads to a more scalable implementation of PCA for tall and fat matrices. If this ticket is for the implementation of PPCA, that should be specified in the title. was (Author: elghoto): Apart from implementing Probabilistic PCA (which in my view is a different algorithm worth implementing as a separate algorithm), there are some issues with the current (2.3) implementation of RowMatrix.computePrincipalComponentsAndExplainedVariance that affect PCA training. In my opinion the big problem with the current implementation is line 387 of RowMatrix.scala, which causes OutOfMemory exceptions for these kinds of matrices, as it computes the covariance as a local Breeze dense matrix. 
val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]] The implementation materializes a dense local Breeze covariance matrix, which is not needed to compute either the principal components or the explained variance. In particular, RowMatrix provides a more optimized SVD decomposition, so the principal components and variance can be derived from that decomposition by computing (X - µ).computeSVD(k, false, 0). This leads to a more scalable implementation of PCA for tall and fat matrices. If this ticket is for the implementation of PPCA, that should be specified in the title. > Scalable PCA implementation for tall and fat matrices > - > > Key: SPARK-7856 > URL: https://issues.apache.org/jira/browse/SPARK-7856 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tarek Elgamal > > Currently the PCA implementation has a limitation of fitting d^2 > covariance/Gramian matrix entries in memory (d is the number of > columns/dimensions of the matrix). We often need only the largest k principal > components. To make PCA really scalable, I suggest an implementation where > the memory usage is proportional to the number of principal components k rather than > the full dimensionality d. > I suggest adopting the solution described in this paper published in > SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). > The paper offers an implementation of Probabilistic PCA (PPCA), which has > lower memory and time complexity and could potentially scale to tall and fat > matrices rather than only the tall and skinny matrices supported by the > current PCA implementation. > Probabilistic PCA could potentially be added to the set of algorithms > supported by MLlib, and it does not necessarily replace the old PCA > implementation. 
> A PPCA implementation is included in Matlab's Statistics and Machine Learning > Toolbox (http://www.mathworks.com/help/stats/ppca.html) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
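The comment's claim can be checked numerically: the top-k principal components and explained-variance ratios obtained from the covariance eigendecomposition (the d x d matrix RowMatrix materializes today) match those obtained from an SVD of the mean-centered data, which never forms the covariance matrix at all. A NumPy sketch with toy data (dimensions and variable names are mine, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))  # toy tall matrix; d is kept tiny only for the demo
k = 2

# Covariance route (what line 387 of RowMatrix.scala effectively does):
# materialize a dense d x d covariance matrix, then eigendecompose it.
C = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
pcs_cov = eigvecs[:, order[:k]]
ratio_cov = eigvals[order[:k]] / eigvals.sum()

# SVD route (the comment's suggestion): SVD of the mean-centered data;
# no covariance matrix is ever formed.
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs_svd = Vt[:k].T
ratio_svd = s[:k] ** 2 / (s ** 2).sum()

# Same components (up to sign) and same explained-variance ratios.
assert np.allclose(np.abs(pcs_cov.T @ pcs_svd), np.eye(k), atol=1e-8)
assert np.allclose(ratio_cov, ratio_svd)
```

The (n-1) normalization cancels in the variance ratios, which is why the two routes agree exactly; the SVD route's memory footprint is driven by k and the row count rather than d^2.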
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Description: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng kofeng 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive insert overwrite that specific location successfully. 
hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/20170507 -rwxr-xr-x 3 kofeng kofeng 338 2017-05-08 17:26 /user/kofeng/partitioned_table/20170507/00_0 {code} was: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng kofeng 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location 
'/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/20170507 -rwxr-xr-x 3 kofeng kofeng 338 2017-05-08 17:26 /user/kofeng/partitioned_table/20170507/00_0 {code} > Data missing after insert overwrite table partition which is created on > specific location > - > > Key: SPARK-20663 > URL: https://issues.apache.org/jira/browse/SPARK-20663 > Project:
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Description: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng kofeng 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table 
partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/20170507 -rwxr-xr-x 3 kofeng kofeng 338 2017-05-08 17:26 /user/kofeng/partitioned_table/20170507/00_0 {code} was: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng kofeng 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the 
specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; Loading data to table kofeng.partitioned_table partition (dt=20170507) Moved: '/user/kofeng/partitioned_table/dt=20170507/00_0' to trash at: /user/kofeng/.Trash/Current Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, numRows=1, totalSize=338, rawDataSize=2] MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 HDFS Write: 577 SUCCESS Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 HDFS Write:
[jira] [Created] (SPARK-20664) Remove stale applications from SHS listing
Marcelo Vanzin created SPARK-20664: -- Summary: Remove stale applications from SHS listing Key: SPARK-20664 URL: https://issues.apache.org/jira/browse/SPARK-20664 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 2.3.0 Reporter: Marcelo Vanzin See spec in parent issue (SPARK-18085) for more details. This task is actually not explicit in the spec, and it's also an issue with the current SHS. But having the SHS persist listing data makes it worse. Basically, the SHS currently does not detect when files are deleted from the event log directory manually; so those applications are still listed, and trying to see their UI will either show the UI (if it's loaded) or an error (if it's not). With the new SHS, that also means that data is leaked in the disk stores used to persist listing and UI data, making the problem worse. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
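The detection step described above amounts to reconciling the persisted listing against what is actually in the event log directory. A small Python sketch of that reconciliation (the dict-shaped listing and the helper name are illustrative assumptions, not the actual SHS data model):

```python
import os

def prune_stale_listing(listing, event_log_dir):
    """Drop applications whose event log was deleted from the log directory.

    listing: dict mapping app_id -> event-log file name (illustrative shape).
    Returns the pruned ids; in the real SHS this is also the point where the
    persisted listing/UI disk stores for those apps would need to be evicted.
    """
    present = set(os.listdir(event_log_dir)) if os.path.isdir(event_log_dir) else set()
    stale = [app_id for app_id, log in listing.items() if log not in present]
    for app_id in stale:
        del listing[app_id]
    return stale
```

Run on each listing refresh, this makes manually deleted logs disappear from the listing instead of lingering until someone clicks through to a broken UI.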
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Description: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng kofeng 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table 
partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; Loading data to table kofeng.partitioned_table partition (dt=20170507) Moved: '/user/kofeng/partitioned_table/dt=20170507/00_0' to trash at: /user/kofeng/.Trash/Current Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, numRows=1, totalSize=338, rawDataSize=2] MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 HDFS Write: 577 SUCCESS Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 HDFS Write: 338 SUCCESS hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 -rwxr-xr-x 3 kofeng kofeng 338 2017-05-08 17:26 /user/kofeng/partitioned_table/dt=20170507/00_0 {code} was: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. 
{code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id,
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Priority: Major (was: Minor) > Data missing after insert overwrite table partition which is created on > specific location > - > > Key: SPARK-20663 > URL: https://issues.apache.org/jira/browse/SPARK-20663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: kobefeng > Labels: easyfix > > Use spark sql to create partition table first, and alter table by adding > partition on specific location, then insert overwrite into this partition by > selection, which will cause data missing compared with HIVE. > {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} > -- create partition table first > $ hadoop fs -mkdir /user/kofeng/partitioned_table > $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql > spark-sql> create table kofeng.partitioned_table( > > id bigint, > > name string, > > dt string > > ) using parquet options ('compression'='snappy', > 'path'='/user/kofeng/partitioned_table') > > partitioned by (dt); > -- add partition with specific location > spark-sql> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > $ hadoop fs -ls /user/kofeng/partitioned_table > drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 > /user/kofeng/partitioned_table/20170507 > -- insert overwrite this partition, and the specific location folder gone, > data is missing, job is success by attaching _SUCCESS > spark-sql> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name; > $ hadoop fs -ls /user/kofeng/partitioned_table > -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 > /user/kofeng/partitioned_table/_SUCCESS > > > -- Then drop this partition and use hive to add partition and insert > overwrite this partition data, then verify: > spark-sql> alter table 
kofeng.partitioned_table drop if exists > partition(dt='20170507'); > hive> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > OK > -- could see hive also drop the specific location but data is preserved on > auto-created partition folder > hive> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; > Loading data to table kofeng.partitioned_table partition (dt=20170507) > Moved: '/user/kofeng/partitioned_table/dt=20170507/00_0' to trash > at: /user/kofeng/.Trash/Current > Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, > numRows=1, totalSize=338, rawDataSize=2] > MapReduce Jobs Launched: > Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 > HDFS Write: 577 SUCCESS > Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 > HDFS Write: 338 SUCCESS > hive> select * from kofeng.partitioned_table; > OK > 123 kofeng 20170507 > $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 > -rwxr-xr-x 3 kofeng hdmi-technology338 2017-05-08 17:26 > /user/kofeng/partitioned_table/dt=20170507/00_0 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Affects Version/s: (was: 2.1.1) > Data missing after insert overwrite table partition which is created on > specific location > - > > Key: SPARK-20663 > URL: https://issues.apache.org/jira/browse/SPARK-20663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: kobefeng >Priority: Minor > Labels: easyfix > > Use spark sql to create partition table first, and alter table by adding > partition on specific location, then insert overwrite into this partition by > selection, which will cause data missing compared with HIVE. > {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} > -- create partition table first > $ hadoop fs -mkdir /user/kofeng/partitioned_table > $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql > spark-sql> create table kofeng.partitioned_table( > > id bigint, > > name string, > > dt string > > ) using parquet options ('compression'='snappy', > 'path'='/user/kofeng/partitioned_table') > > partitioned by (dt); > -- add partition with specific location > spark-sql> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > $ hadoop fs -ls /user/kofeng/partitioned_table > drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 > /user/kofeng/partitioned_table/20170507 > -- insert overwrite this partition, and the specific location folder gone, > data is missing, job is success by attaching _SUCCESS > spark-sql> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name; > $ hadoop fs -ls /user/kofeng/partitioned_table > -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 > /user/kofeng/partitioned_table/_SUCCESS > > > -- Then drop this partition and use hive to add partition and insert > overwrite this partition data, then verify: > 
spark-sql> alter table kofeng.partitioned_table drop if exists > partition(dt='20170507'); > hive> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > OK > -- could see hive also drop the specific location but data is preserved on > auto-created partition folder > hive> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; > Loading data to table kofeng.partitioned_table partition (dt=20170507) > Moved: '/user/kofeng/partitioned_table/dt=20170507/00_0' to trash > at: /user/kofeng/.Trash/Current > Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, > numRows=1, totalSize=338, rawDataSize=2] > MapReduce Jobs Launched: > Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 > HDFS Write: 577 SUCCESS > Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 > HDFS Write: 338 SUCCESS > hive> select * from kofeng.partitioned_table; > OK > 123 kofeng 20170507 > $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 > -rwxr-xr-x 3 kofeng hdmi-technology338 2017-05-08 17:26 > /user/kofeng/partitioned_table/dt=20170507/00_0 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Labels: easyfix (was: ) > Data missing after insert overwrite table partition which is created on > specific location > - > > Key: SPARK-20663 > URL: https://issues.apache.org/jira/browse/SPARK-20663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1 >Reporter: kobefeng >Priority: Minor > Labels: easyfix > > Use spark sql to create partition table first, and alter table by adding > partition on specific location, then insert overwrite into this partition by > selection, which will cause data missing compared with HIVE. > {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} > -- create partition table first > $ hadoop fs -mkdir /user/kofeng/partitioned_table > $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql > spark-sql> create table kofeng.partitioned_table( > > id bigint, > > name string, > > dt string > > ) using parquet options ('compression'='snappy', > 'path'='/user/kofeng/partitioned_table') > > partitioned by (dt); > -- add partition with specific location > spark-sql> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > $ hadoop fs -ls /user/kofeng/partitioned_table > drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 > /user/kofeng/partitioned_table/20170507 > -- insert overwrite this partition, and the specific location folder gone, > data is missing, job is success by attaching _SUCCESS > spark-sql> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name; > $ hadoop fs -ls /user/kofeng/partitioned_table > -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 > /user/kofeng/partitioned_table/_SUCCESS > > > -- Then drop this partition and use hive to add partition and insert > overwrite this partition data, then verify: > 
spark-sql> alter table kofeng.partitioned_table drop if exists > partition(dt='20170507'); > hive> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > OK > -- could see hive also drop the specific location but data is preserved on > auto-created partition folder > hive> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; > Loading data to table kofeng.partitioned_table partition (dt=20170507) > Moved: '/user/kofeng/partitioned_table/dt=20170507/00_0' to trash > at: /user/kofeng/.Trash/Current > Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, > numRows=1, totalSize=338, rawDataSize=2] > MapReduce Jobs Launched: > Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 > HDFS Write: 577 SUCCESS > Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 > HDFS Write: 338 SUCCESS > hive> select * from kofeng.partitioned_table; > OK > 123 kofeng 20170507 > $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 > -rwxr-xr-x 3 kofeng hdmi-technology338 2017-05-08 17:26 > /user/kofeng/partitioned_table/dt=20170507/00_0 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Description: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table 
kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; Loading data to table kofeng.partitioned_table partition (dt=20170507) Moved: 'hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table/dt=20170507/00_0' to trash at: hdfs://ares-lvs-nn-ha/user/kofeng/.Trash/Current Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, numRows=1, totalSize=338, rawDataSize=2] MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 HDFS Write: 577 SUCCESS Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 HDFS Write: 338 SUCCESS hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 -rwxr-xr-x 3 kofeng hdmi-technology338 2017-05-08 17:26 /user/kofeng/partitioned_table/dt=20170507/00_0 {code} was: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. 
{code:title=Bar.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Description: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table 
partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; Loading data to table kofeng.partitioned_table partition (dt=20170507) Moved: '/user/kofeng/partitioned_table/dt=20170507/00_0' to trash at: /user/kofeng/.Trash/Current Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, numRows=1, totalSize=338, rawDataSize=2] MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 HDFS Write: 577 SUCCESS Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 HDFS Write: 338 SUCCESS hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 -rwxr-xr-x 3 kofeng hdmi-technology338 2017-05-08 17:26 /user/kofeng/partitioned_table/dt=20170507/00_0 {code} was: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. 
{code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Description: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=Bar.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table 
partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; Loading data to table kofeng.partitioned_table partition (dt=20170507) Moved: 'hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table/dt=20170507/00_0' to trash at: hdfs://ares-lvs-nn-ha/user/kofeng/.Trash/Current Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, numRows=1, totalSize=338, rawDataSize=2] MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 HDFS Write: 577 SUCCESS Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 HDFS Write: 338 SUCCESS hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 -rwxr-xr-x 3 kofeng hdmi-technology338 2017-05-08 17:26 /user/kofeng/partitioned_table/dt=20170507/00_0 {code} was: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. bq. 
-- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Description: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. bq. -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" 
as name from kofeng.test; Loading data to table kofeng.partitioned_table partition (dt=20170507) Moved: 'hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table/dt=20170507/00_0' to trash at: hdfs://ares-lvs-nn-ha/user/kofeng/.Trash/Current Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, numRows=1, totalSize=338, rawDataSize=2] MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 HDFS Write: 577 SUCCESS Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 HDFS Write: 338 SUCCESS hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 -rwxr-xr-x 3 kofeng hdmi-technology338 2017-05-08 17:26 /user/kofeng/partitioned_table/dt=20170507/00_0 was:Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. > Data missing after insert overwrite table partition which is created on > specific location > - > > Key: SPARK-20663 > URL: https://issues.apache.org/jira/browse/SPARK-20663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1 >Reporter: kobefeng >Priority: Minor > > Use spark sql to create partition table first, and alter table by adding > partition on specific location, then insert overwrite into this partition by > selection, which will cause data missing compared with HIVE. > bq. 
> -- create partition table first > $ hadoop fs -mkdir /user/kofeng/partitioned_table > $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql > spark-sql> create table kofeng.partitioned_table( > > id bigint, > > name string, > > dt string > > ) using parquet options ('compression'='snappy', > 'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table') > > partitioned by (dt); > -- add partition with specific location > spark-sql> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > $ hadoop fs -ls /user/kofeng/partitioned_table > drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 > /user/kofeng/partitioned_table/20170507 > -- insert overwrite this partition, and the specific location folder gone, > data is missing, job is success by attaching _SUCCESS > spark-sql> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name; > $ hadoop fs -ls
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Docs Text: (was: -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; Loading data to table kofeng.partitioned_table partition (dt=20170507) Moved: 'hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table/dt=20170507/00_0' to trash at: 
hdfs://ares-lvs-nn-ha/user/kofeng/.Trash/Current Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, numRows=1, totalSize=338, rawDataSize=2] MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 HDFS Write: 577 SUCCESS Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 HDFS Write: 338 SUCCESS hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 -rwxr-xr-x 3 kofeng hdmi-technology338 2017-05-08 17:26 /user/kofeng/partitioned_table/dt=20170507/00_0) > Data missing after insert overwrite table partition which is created on > specific location > - > > Key: SPARK-20663 > URL: https://issues.apache.org/jira/browse/SPARK-20663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1 >Reporter: kobefeng >Priority: Minor > > Use spark sql to create partition table first, and alter table by adding > partition on specific location, then insert overwrite into this partition by > selection, which will cause data missing compared with HIVE. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
kobefeng created SPARK-20663: Summary: Data missing after insert overwrite table partition which is created on specific location Key: SPARK-20663 URL: https://issues.apache.org/jira/browse/SPARK-20663 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1, 2.1.0 Reporter: kobefeng Priority: Minor Using Spark SQL, create a partitioned table first, add a partition at a specific location via ALTER TABLE, then INSERT OVERWRITE into this partition from a SELECT; this causes data to go missing, unlike in HIVE. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20638) Optimize the CartesianRDD to reduce repeatedly data fetching
[ https://issues.apache.org/jira/browse/SPARK-20638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20638: Assignee: (was: Apache Spark) > Optimize the CartesianRDD to reduce repeatedly data fetching > > > Key: SPARK-20638 > URL: https://issues.apache.org/jira/browse/SPARK-20638 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Teng Jiang > > In CartesianRDD, group each iterator into multiple groups. Thus in the second > iteration, the data will be fetched (num of data)/groupSize times, rather > than (num of data) times. > The test results are: > Test Environment : 3 workers, each has 10 cores, 30G memory, 1 executor > Test data : users : 480,189, each is a 10-dim vector, and items : 17770, each > is a 10-dim vector. > With default CartesianRDD, cartesian time is 2420.7s. > With this proposal, cartesian time is 45.3s > 50x faster than the original method. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20638) Optimize the CartesianRDD to reduce repeatedly data fetching
[ https://issues.apache.org/jira/browse/SPARK-20638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20638: Assignee: Apache Spark > Optimize the CartesianRDD to reduce repeatedly data fetching > > > Key: SPARK-20638 > URL: https://issues.apache.org/jira/browse/SPARK-20638 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Teng Jiang >Assignee: Apache Spark > > In CartesianRDD, group each iterator into multiple groups. Thus in the second > iteration, the data will be fetched (num of data)/groupSize times, rather > than (num of data) times. > The test results are: > Test Environment : 3 workers, each has 10 cores, 30G memory, 1 executor > Test data : users : 480,189, each is a 10-dim vector, and items : 17770, each > is a 10-dim vector. > With default CartesianRDD, cartesian time is 2420.7s. > With this proposal, cartesian time is 45.3s > 50x faster than the original method. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
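The grouping idea in the issue description can be illustrated outside Spark. The sketch below is NOT Spark's actual CartesianRDD code; it simulates the "fetch" of the second RDD's partition (each full iteration of `RightSide` models one expensive remote fetch) and shows that buffering the outer iterator into groups cuts the number of fetches from N to ceil(N / groupSize). All names here are illustrative, not Spark APIs.

```python
class RightSide:
    """Stand-in for the second RDD's partition data; each full
    iteration models one (expensive) remote fetch."""
    def __init__(self, data):
        self.data = list(data)
        self.fetches = 0

    def __iter__(self):
        self.fetches += 1
        return iter(self.data)


def cartesian_naive(left, right):
    # One fetch of `right` per element of `left`: N fetches total.
    for a in left:
        for b in right:
            yield (a, b)


def cartesian_grouped(left, right, group_size):
    # Buffer `group_size` left elements, then fetch `right` once per
    # group: ceil(N / group_size) fetches total.
    left = list(left)
    for i in range(0, len(left), group_size):
        group = left[i:i + group_size]
        for b in right:
            for a in group:
                yield (a, b)
```

With the reported workload (480,189 users crossed with 17,770 items), a group size of, say, 100 would fetch the items roughly 4,802 times instead of 480,189 times, which is consistent with the order-of-magnitude speedup quoted above.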
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001811#comment-16001811 ] Marcelo Vanzin commented on SPARK-20658: Ok, so it's not an issue with old YARN jars being used. Will need to take a closer look, probably later this week... > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-default.conf and in my spark-submit command. (This flag shows up in the > environment tab of spark history server, so it seems that it's specified > correctly). > However, I just had a job die with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001803#comment-16001803 ] Paul Jones commented on SPARK-20658: The jars are versioned 2.7.3. Finally finished grepping through the logs. I didn't find that error message. The closest I found was: {noformat} applications/hadoop-yarn/yarn-yarn-resourcemanager-ip-10-0-15-75.log.2017-04-28-03.gz:2017-04-28 03:37:33,051 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl (IPC Server handler 34 on 8032): The attemptFailuresValidityInterval for the application: application_1493122281436_0016 is 360. {noformat} > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-default.conf and in my spark-submit command. (This flag shows up in the > environment tab of spark history server, so it seems that it's specified > correctly). > However, I just had a job die with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
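For anyone reproducing this, the setting under discussion is normally supplied either in spark-defaults.conf or on the spark-submit command line. The command below is a sketch (master, resources, and the application jar `app.jar` are placeholders, not values from this issue):

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
  app.jar
```

One caveat worth keeping in mind: this interval is forwarded to YARN and only takes effect on YARN versions that support attempt-failure validity intervals (Hadoop 2.6+), which is presumably why the version of the YARN jars was checked in the comments above.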
[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001778#comment-16001778 ] Mingliang Liu commented on SPARK-20608: --- Quick question [~charliechen]: I think [~vanzin] is suggesting that we simply use the logical HDFS namespace instead of specific NNs. Say {{dfs.nameservices=mycluster}}; then {{hdfs://mycluster}} is what you need as the NN endpoint, instead of the specific namenodes (e.g. with {{dfs.ha.namenodes.mycluster=nn1,nn2}}, rather than using {{hdfs://nn1}} and {{hdfs://nn2}} directly). > Standby namenodes should be allowed to included in > yarn.spark.access.namenodes to support HDFS HA > - > > Key: SPARK-20608 > URL: https://issues.apache.org/jira/browse/SPARK-20608 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, YARN >Affects Versions: 2.0.1, 2.1.0 >Reporter: Yuechen Chen >Priority: Minor > Original Estimate: 672h > Remaining Estimate: 672h > > If one Spark Application needs to access remote namenodes, > yarn.spark.access.namenodes should only be configured in spark-submit > scripts, and the Spark Client (on YARN) would fetch HDFS credentials periodically. > If one Hadoop cluster is configured for HA, there would be one active namenode > and at least one standby namenode. > However, if yarn.spark.access.namenodes includes both active and standby > namenodes, the Spark Application will fail because the standby > namenode cannot be accessed by Spark, due to org.apache.hadoop.ipc.StandbyException. > I think it won't cause any bad effect to configure standby namenodes in > yarn.spark.access.namenodes, and my Spark Application can then sustain > the failover of the Hadoop namenode. > HA Examples: > Spark-submit script: > yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 > Spark Application Codes: > dataframe.write.parquet(getActiveNameNode(...) 
+ hdfsPath)
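The switch from concrete namenode URIs to the logical nameservice suggested above can be sketched as a config fragment. The nameservice name {{mycluster}}, the RPC port 8020, and the property spelling {{yarn.spark.access.namenodes}} follow the thread; treat the exact keys as assumptions to verify against your Hadoop and Spark versions:

```properties
# hdfs-site.xml equivalents, shown as key=value for brevity (HDFS HA setup)
dfs.nameservices=mycluster
dfs.ha.namenodes.mycluster=nn1,nn2
dfs.namenode.rpc-address.mycluster.nn1=namenode01:8020
dfs.namenode.rpc-address.mycluster.nn2=namenode02:8020

# Spark side: reference the logical namespace once, not each namenode,
# so failover is handled by the HDFS client rather than application code
yarn.spark.access.namenodes=hdfs://mycluster
```

With this, {{dataframe.write.parquet("hdfs://mycluster" + hdfsPath)}} would need no {{getActiveNameNode}} helper, since the HA client resolves the active NN itself.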
[jira] [Reopened] (SPARK-1449) Please delete old releases from mirroring system
[ https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebb reopened SPARK-1449: - The mirror system includes the following: spark-1.6.2 spark-1.6.3 spark-2.0.1 spark-2.0.2 spark-2.1.0 spark-2.1.1 At least half of these are clearly superseded versions which should please be deleted. > Please delete old releases from mirroring system > > > Key: SPARK-1449 > URL: https://issues.apache.org/jira/browse/SPARK-1449 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 0.8.1, 0.9.1, 0.9.2, 1.0.0, 1.0.1 >Reporter: Sebb >Assignee: Patrick Wendell > Fix For: 0.8.1, 0.9.1, 0.9.2, 1.0.0, 1.0.1 > > > To reduce the load on the ASF mirrors, projects are required to delete old > releases [1] > Please can you remove all non-current releases? > Thanks! > [Note that older releases are always available from the ASF archive server] > Any links to older releases on download pages should first be adjusted to > point to the archive server. > [1] http://www.apache.org/dev/release.html#when-to-archive -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20600) KafkaRelation should be pretty printed in web UI (Details for Query)
[ https://issues.apache.org/jira/browse/SPARK-20600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001680#comment-16001680 ] Shixiong Zhu edited comment on SPARK-20600 at 5/8/17 10:38 PM: --- [~jlaskowski] Hope you can do it soon. Then we can put it into 2.2.0 if RC2 fails. was (Author: zsxwing): [~jlaskowski] Hope you want do it soon. Then we can put it into 2.2.0 if RC2 fails. > KafkaRelation should be pretty printed in web UI (Details for Query) > > > Key: SPARK-20600 > URL: https://issues.apache.org/jira/browse/SPARK-20600 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > Attachments: kafka-source-scan-webui.png > > > Executing the following batch query gives the default stringified/internal > name of {{KafkaRelation}} in web UI (under Details for Query), i.e. > http://localhost:4040/SQL/execution/?id=3 (<-- change the {{id}}). See the > attachment. > {code} > spark. > read. > format("kafka"). > option("subscribe", "topic1"). > option("kafka.bootstrap.servers", "localhost:9092"). > load. > select('value cast "string"). > write. > csv("fromkafka.csv") > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20600) KafkaRelation should be pretty printed in web UI (Details for Query)
[ https://issues.apache.org/jira/browse/SPARK-20600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001680#comment-16001680 ] Shixiong Zhu commented on SPARK-20600: -- [~jlaskowski] Hope you want do it soon. Then we can put it into 2.2.0 if RC2 fails. > KafkaRelation should be pretty printed in web UI (Details for Query) > > > Key: SPARK-20600 > URL: https://issues.apache.org/jira/browse/SPARK-20600 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > Attachments: kafka-source-scan-webui.png > > > Executing the following batch query gives the default stringified/internal > name of {{KafkaRelation}} in web UI (under Details for Query), i.e. > http://localhost:4040/SQL/execution/?id=3 (<-- change the {{id}}). See the > attachment. > {code} > spark. > read. > format("kafka"). > option("subscribe", "topic1"). > option("kafka.bootstrap.servers", "localhost:9092"). > load. > select('value cast "string"). > write. > csv("fromkafka.csv") > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001638#comment-16001638 ] Marcelo Vanzin commented on SPARK-20658: That's different... what version of Hadoop libraries is part of the Spark build? Generally there will be Hadoop jars in {{$SPARK_HOME/jars}}. Those are the ones that matter. (Alternatively, if you found - or did not find - the log message I mentioned in your logs, that would have answered these questions already.) > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-default.conf and in my spark-submit command. (This flag shows up in the > environment tab of spark history server, so it seems that it's specified > correctly). > However, I just had a job die with with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001638#comment-16001638 ] Marcelo Vanzin edited comment on SPARK-20658 at 5/8/17 10:02 PM: - That's different... what version of Hadoop libraries is part of the Spark build? Generally these will be Hadoop jars in {{$SPARK_HOME/jars}}. Those are the ones that matter. (Alternatively, if you found - or did not find - the log message I mentioned in your logs, that would have answered these questions already.) was (Author: vanzin): That's different... what version of Hadoop libraries is part of the Spark build? Generally there will be Hadoop jars in {{$SPARK_HOME/jars}}. Those are the ones that matter. (Alternatively, if you found - or did not find - the log message I mentioned in your logs, that would have answered these questions already.) > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-default.conf and in my spark-submit command. (This flag shows up in the > environment tab of spark history server, so it seems that it's specified > correctly). > However, I just had a job die with with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19268) File does not exist: /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta
[ https://issues.apache.org/jira/browse/SPARK-19268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001631#comment-16001631 ] Shixiong Zhu commented on SPARK-19268: -- [~skrishna] could you provide your codes, or the output of "dataset.explain(true)", please? Perhaps there is another bug in aggregation. > File does not exist: > /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta > -- > > Key: SPARK-19268 > URL: https://issues.apache.org/jira/browse/SPARK-19268 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 > Environment: - hadoop2.7 > - Java 7 >Reporter: liyan >Assignee: Shixiong Zhu >Priority: Critical > Fix For: 2.1.1, 2.2.0 > > > bq. ./run-example sql.streaming.JavaStructuredKafkaWordCount > 192.168.3.110:9092 subscribe topic03 > when i run the spark example raises the following error: > {quote} > Exception in thread "main" 17/01/17 14:13:41 DEBUG ContextCleaner: Got > cleaning task CleanBroadcast(4) > org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to > stage failure: Task 2 in stage 9.0 failed 1 times, most recent failure: Lost > task 2.0 in stage 9.0 (TID 46, localhost, executor driver): > java.lang.IllegalStateException: Error reading delta file > /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta of > HDFSStateStoreProvider[id = (op=0, part=2), dir = > /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2]: > /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta does > not exist > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:354) > at > 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:306) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:303) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:303) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:302) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:302) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:220) > at > org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:151) > at > org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.FileNotFoundException: File does not exist: > /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71) > at >
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001628#comment-16001628 ] Paul Jones commented on SPARK-20658: Ah... This is using Amazon's version of Hadoop 2.7.3 {noformat} $ hadoop version Hadoop 2.7.3-amzn-1 Subversion g...@aws157git.com:/pkg/Aws157BigTop -r 30eccced8ce8c483445f0aa3175ce725831ff06b Compiled by ec2-user on 2017-02-17T17:59Z Compiled with protoc 2.5.0 From source with checksum 1833aada17b94cfb94ad40ccd02d3df8 This command was run using /usr/lib/hadoop/hadoop-common-2.7.3-amzn-1.jar {noformat} > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-default.conf and in my spark-submit command. (This flag shows up in the > environment tab of the Spark history server, so it seems that it's specified > correctly). > However, I just had a job die with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count?
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001623#comment-16001623 ] Marcelo Vanzin commented on SPARK-20658: That does not say which package you used (i.e. which version of Hadoop is packaged with your Spark build). > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-default.conf and in my spark-submit command. (This flag shows up in the > environment tab of spark history server, so it seems that it's specified > correctly). > However, I just had a job die with with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20661) SparkR tableNames() test fails
[ https://issues.apache.org/jira/browse/SPARK-20661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-20661: Assignee: Hossein Falaki > SparkR tableNames() test fails > -- > > Key: SPARK-20661 > URL: https://issues.apache.org/jira/browse/SPARK-20661 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Assignee: Hossein Falaki > Labels: test > Fix For: 2.2.0 > > > Due to prior state created by other test cases, testing {{tableNames()}} is > failing in master. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001612#comment-16001612 ] Paul Jones commented on SPARK-20658: {noformat} $ spark-submit --version version 2.1.0 Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_121 Branch HEAD Compiled by user ec2-user on 2017-02-17T19:03:33Z Revision 30eccced8ce8c483445f0aa3175ce725831ff06b Url g...@aws157git.com:/pkg/Aws157BigTop {noformat} > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-default.conf and in my spark-submit command. (This flag shows up in the > environment tab of spark history server, so it seems that it's specified > correctly). > However, I just had a job die with with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20661) SparkR tableNames() test fails
[ https://issues.apache.org/jira/browse/SPARK-20661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-20661. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17903 [https://github.com/apache/spark/pull/17903] > SparkR tableNames() test fails > -- > > Key: SPARK-20661 > URL: https://issues.apache.org/jira/browse/SPARK-20661 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > Labels: test > Fix For: 2.2.0 > > > Due to prior state created by other test cases, testing {{tableNames()}} is > failing in master. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20630) Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled
[ https://issues.apache.org/jira/browse/SPARK-20630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20630: Assignee: (was: Apache Spark) > Thread Dump link available in Executors tab irrespective of > spark.ui.threadDumpsEnabled > --- > > Key: SPARK-20630 > URL: https://issues.apache.org/jira/browse/SPARK-20630 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-webui-executors-threadDump.png > > > Irrespective of {{spark.ui.threadDumpsEnabled}} property web UI's Executors > page displays *Thread Dump* column with an active link (that does nothing > though). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20630) Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled
[ https://issues.apache.org/jira/browse/SPARK-20630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20630: Assignee: Apache Spark > Thread Dump link available in Executors tab irrespective of > spark.ui.threadDumpsEnabled > --- > > Key: SPARK-20630 > URL: https://issues.apache.org/jira/browse/SPARK-20630 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Assignee: Apache Spark >Priority: Minor > Attachments: spark-webui-executors-threadDump.png > > > Irrespective of {{spark.ui.threadDumpsEnabled}} property web UI's Executors > page displays *Thread Dump* column with an active link (that does nothing > though). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20630) Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled
[ https://issues.apache.org/jira/browse/SPARK-20630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001609#comment-16001609 ] Apache Spark commented on SPARK-20630: -- User 'ajbozarth' has created a pull request for this issue: https://github.com/apache/spark/pull/17904 > Thread Dump link available in Executors tab irrespective of > spark.ui.threadDumpsEnabled > --- > > Key: SPARK-20630 > URL: https://issues.apache.org/jira/browse/SPARK-20630 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-webui-executors-threadDump.png > > > Irrespective of {{spark.ui.threadDumpsEnabled}} property web UI's Executors > page displays *Thread Dump* column with an active link (that does nothing > though). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20580) Allow RDD cache with unserializable objects
[ https://issues.apache.org/jira/browse/SPARK-20580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001607#comment-16001607 ] Fernando Pereira commented on SPARK-20580: -- I understand that at some point it will be better to fully implement serialization of our objects. To be more precise in our use case, the objects are instances of Python extension types (implemented in Cython). Apparently by default they will serialize and deserialize with their basic structures, except not non-python data, like buffers, and therefore the "deserialized" objects are not valid. My discussion here started since I found counter-intuitive that in some situations cache() may lead to the program to beak, I was looking for confirmation whether any operation following a map() will induce data deserialization (instead of trying to use the previous RDD data). Any chance this behavior changes? Thanks > Allow RDD cache with unserializable objects > --- > > Key: SPARK-20580 > URL: https://issues.apache.org/jira/browse/SPARK-20580 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Fernando Pereira >Priority: Minor > > In my current scenario we load complex Python objects in the worker nodes > that are not completely serializable. We then apply map certain operations to > the RDD which at some point we collect. In this basic usage all works well. > However, if we cache() the RDD (which defaults to memory) suddenly it fails > to execute the transformations after the caching step. Apparently caching > serializes the RDD data and deserializes it whenever more transformations are > required. > It would be nice to avoid serialization of the objects if they are to be > cached to memory, and keep the original object -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
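The confirmation above is that cache() does serialize and deserialize RDD data, so extension-type objects holding non-Python state (buffers) come back invalid. A hedged, Spark-free sketch of the fix direction — teaching such objects how to rebuild themselves via the pickle protocol; the class and its fields are hypothetical, not from the reporter's code:

```python
import pickle

class Handle:
    """Wraps a resource that pickle cannot copy directly.

    Without __reduce__, pickle would either fail or silently drop the
    non-Python buffer, which matches the "deserialized objects are not
    valid" symptom described in this thread.
    """
    def __init__(self, path):
        self.path = path
        # Stand-in for external (e.g. Cython-allocated) state.
        self._buf = bytearray(b"live-buffer")

    def __reduce__(self):
        # Rebuild from constructor args instead of copying raw state,
        # so the restored object re-acquires its buffer.
        return (Handle, (self.path,))

h = Handle("/tmp/data")
restored = pickle.loads(pickle.dumps(h))
assert restored.path == h.path
assert restored._buf == h._buf  # rebuilt by __init__, not copied
```

For Cython extension types the same hook (`__reduce__` or `__getstate__`/`__setstate__`) makes the objects survive the serialize/deserialize round trip that a memory cache() implies.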
[jira] [Resolved] (SPARK-20605) Deprecate not used AM and executor port configuration
[ https://issues.apache.org/jira/browse/SPARK-20605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-20605. Resolution: Fixed Assignee: Saisai Shao Fix Version/s: 2.3.0 > Deprecate not used AM and executor port configuration > - > > Key: SPARK-20605 > URL: https://issues.apache.org/jira/browse/SPARK-20605 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core, YARN >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.3.0 > > > After SPARK-10997, client mode Netty RpcEnv doesn't require to bind a port to > start server, so port configurations are not used any more, here propose to > remove these two configurations: "spark.executor.port" and "spark.am.port". -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20662) Block jobs that have greater than a configured number of tasks
Xuefu Zhang created SPARK-20662: --- Summary: Block jobs that have greater than a configured number of tasks Key: SPARK-20662 URL: https://issues.apache.org/jira/browse/SPARK-20662 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.0.0, 1.6.0 Reporter: Xuefu Zhang In a shared cluster, it's desirable for an admin to block large Spark jobs. While there might not be a single metric defining the size of a job, the number of tasks is usually a good indicator. Thus, it would be useful for the Spark scheduler to block a job whose number of tasks reaches a configured limit. By default, the limit could be infinite, to retain the existing behavior. MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce, which block an MR job at job submission time. The proposed configuration is spark.job.max.tasks with a default value of -1 (infinite).
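The proposed semantics can be sketched as a tiny submission-time guard. This is an illustration of the behavior being requested, not Spark scheduler code; the function name and error message are made up, and only the -1-means-unlimited convention comes from the proposal:

```python
# Hypothetical guard mirroring the proposed spark.job.max.tasks semantics:
# -1 (the proposed default) means unlimited, matching existing behavior.
def check_job_size(num_tasks: int, max_tasks: int = -1) -> None:
    if max_tasks >= 0 and num_tasks > max_tasks:
        raise RuntimeError(
            f"Job rejected: {num_tasks} tasks exceeds "
            f"spark.job.max.tasks={max_tasks}")

check_job_size(10_000)               # no limit configured: accepted
check_job_size(500, max_tasks=1000)  # under the limit: accepted
try:
    check_job_size(5000, max_tasks=1000)
except RuntimeError as e:
    print(e)  # the oversized job is blocked at submission time
```

Like mapreduce.job.max.map, the key design point is failing fast at submission rather than killing the job after resources have been consumed.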
[jira] [Commented] (SPARK-20500) ML, Graph 2.2 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-20500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001564#comment-16001564 ] Joseph K. Bradley commented on SPARK-20500: --- I'll take this one. > ML, Graph 2.2 QA: API: Binary incompatible changes > -- > > Key: SPARK-20500 > URL: https://issues.apache.org/jira/browse/SPARK-20500 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20500) ML, Graph 2.2 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-20500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20500: - Assignee: Joseph K. Bradley > ML, Graph 2.2 QA: API: Binary incompatible changes > -- > > Key: SPARK-20500 > URL: https://issues.apache.org/jira/browse/SPARK-20500 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7856) Scalable PCA implementation for tall and fat matrices
[ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001551#comment-16001551 ] Ignacio Bermudez Corrales edited comment on SPARK-7856 at 5/8/17 9:13 PM: -- Apart from implementing Probabilistic PCA (which in my view is a different algorithm worth implementing as a separate algorithm), there are some issues with the current (2.3) implementation of RowMatrix.computePrincipalComponentsAndExplainedVariance that affect the PCA training. In my opinion the Big problem with the current implementation is the line 387 of RowMatrix.scala, which causes OutOfMemory exceptions for this kind of matrices, as it computes the covariance as a local breeze dense matrix. val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]] The implementation computes a dense covariance local breeze matrix, which is not needed for the computation of the principal components nor explained variance. In particular, RowMatrix provides a more optimized SVD decomposition. Therefore, principal components and variance can be derived from this implementation of the decomposition, by computing the (X - µ).computeSVD( k, false, 0). was (Author: elghoto): Apart from implementing Probabilistic PCA (which in my view is a different algorithm worth implementing as a separate algorithm), there are some issues with the current (2.11) implementation of RowMatrix.computePrincipalComponentsAndExplainedVariance that affect the PCA training. In my opinion the Big problem with the current implementation is the line 387 of RowMatrix.scala, which causes OutOfMemory exceptions for this kind of matrices, as it computes the covariance as a local breeze dense matrix. val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]] The implementation computes a dense covariance local breeze matrix, which is not needed for the computation of the principal components nor explained variance. 
In particular, RowMatrix provides a more optimized SVD decomposition. Therefore, principal components and variance can be derived from this implementation of the decomposition, by computing the (X - µ).computeSVD(k, false, 0). > Scalable PCA implementation for tall and fat matrices > - > > Key: SPARK-7856 > URL: https://issues.apache.org/jira/browse/SPARK-7856 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tarek Elgamal > > Currently the PCA implementation has a limitation of fitting d^2 > covariance/Gramian matrix entries in memory (d is the number of > columns/dimensions of the matrix). We often need only the largest k principal > components. To make PCA really scalable, I suggest an implementation where > the memory usage is proportional to the principal components k rather than > the full dimensionality d. > I suggest adopting the solution described in this paper that is published in > SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). > The paper offers an implementation for Probabilistic PCA (PPCA) which has > less memory and time complexity and could potentially scale to tall and fat > matrices rather than tall and skinny matrices that are supported by the > current PCA implementation. > Probabilistic PCA could be potentially added to the set of algorithms > supported by MLlib and it does not necessarily replace the old PCA > implementation. > PPCA implementation is adopted in Matlab's Statistics and Machine Learning > Toolbox (http://www.mathworks.com/help/stats/ppca.html) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
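The comment's claim, that the principal components and explained variance can be recovered from an SVD of the centered matrix without ever materializing the d x d covariance, can be sketched outside Spark with NumPy. This is an illustration of the linear algebra only, not the RowMatrix or Breeze API:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))          # tall matrix: n rows, d columns
Xc = X - X.mean(axis=0)                # the (X - mu) centering step

# Route 1: eigendecomposition of the d x d covariance (what the current
# implementation materializes locally; infeasible when d is large).
cov = Xc.T @ Xc / (X.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered matrix -- the covariance is never formed.
# Singular values relate to covariance eigenvalues by s^2 / (n - 1).
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = s ** 2 / (X.shape[0] - 1)

assert np.allclose(eigvals, svd_vals)
# Principal directions agree up to sign.
assert np.allclose(np.abs(eigvecs), np.abs(Vt.T), atol=1e-8)
```

Keeping only the top k singular triplets (as `computeSVD(k, ...)` does) bounds memory by k rather than d^2, which is the point of the comment.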
[jira] [Commented] (SPARK-7856) Scalable PCA implementation for tall and fat matrices
[ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001551#comment-16001551 ] Ignacio Bermudez Corrales commented on SPARK-7856: -- Apart from implementing Probabilistic PCA (which in my view is a different algorithm worth implementing as a separate algorithm), there are some issues with the current (2.11) implementation of RowMatrix.computePrincipalComponentsAndExplainedVariance that affect the PCA training. In my opinion the Big problem with the current implementation is the line 387 of RowMatrix.scala, which causes OutOfMemory exceptions for this kind of matrices, as it computes the covariance as a local breeze dense matrix. val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]] The implementation computes a dense covariance local breeze matrix, which is not needed for the computation of the principal components nor explained variance. In particular, RowMatrix provides a more optimized SVD decomposition. Therefore, principal components and variance can be derived from this implementation of the decomposition, by computing the (X - µ).computeSVD( k, false, 0). > Scalable PCA implementation for tall and fat matrices > - > > Key: SPARK-7856 > URL: https://issues.apache.org/jira/browse/SPARK-7856 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tarek Elgamal > > Currently the PCA implementation has a limitation of fitting d^2 > covariance/grammian matrix entries in memory (d is the number of > columns/dimensions of the matrix). We often need only the largest k principal > components. To make pca really scalable, I suggest an implementation where > the memory usage is proportional to the principal components k rather than > the full dimensionality d. > I suggest adopting the solution described in this paper that is published in > SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). 
> The paper offers an implementation for Probabilistic PCA (PPCA) which has > less memory and time complexity and could potentially scale to tall and fat > matrices rather than tall and skinny matrices that are supported by the > current PCA implementation. > Probabilistic PCA could be potentially added to the set of algorithms > supported by MLlib and it does not necessarily replace the old PCA > implementation. > PPCA implementation is adopted in Matlab's Statistics and Machine Learning > Toolbox (http://www.mathworks.com/help/stats/ppca.html)
[jira] [Assigned] (SPARK-20661) SparkR tableNames() test fails
[ https://issues.apache.org/jira/browse/SPARK-20661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20661: Assignee: Apache Spark > SparkR tableNames() test fails > -- > > Key: SPARK-20661 > URL: https://issues.apache.org/jira/browse/SPARK-20661 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Assignee: Apache Spark > Labels: test > > Due to prior state created by other test cases, testing {{tableNames()}} is > failing in master. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20661) SparkR tableNames() test fails
[ https://issues.apache.org/jira/browse/SPARK-20661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20661: Assignee: (was: Apache Spark) > SparkR tableNames() test fails > -- > > Key: SPARK-20661 > URL: https://issues.apache.org/jira/browse/SPARK-20661 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > Labels: test > > Due to prior state created by other test cases, testing {{tableNames()}} is > failing in master. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20661) SparkR tableNames() test fails
[ https://issues.apache.org/jira/browse/SPARK-20661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001521#comment-16001521 ] Apache Spark commented on SPARK-20661: -- User 'falaki' has created a pull request for this issue: https://github.com/apache/spark/pull/17903 > SparkR tableNames() test fails > -- > > Key: SPARK-20661 > URL: https://issues.apache.org/jira/browse/SPARK-20661 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > Labels: test > > Due to prior state created by other test cases, testing {{tableNames()}} is > failing in master. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20661) SparkR tableNames() test fails
Hossein Falaki created SPARK-20661: -- Summary: SparkR tableNames() test fails Key: SPARK-20661 URL: https://issues.apache.org/jira/browse/SPARK-20661 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.0 Reporter: Hossein Falaki Due to prior state created by other test cases, testing {{tableNames()}} is failing in master. https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001474#comment-16001474 ] Shixiong Zhu commented on SPARK-18057: -- [~helena_e] I didn't mean for Spark. Even in Spark, the required code changes are in tests. I meant, as a Spark user, why can't you add the Kafka client as a dependency and update it? Is it because you have some test code similar to Spark's, or are you using the Kafka API directly in your code? > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
[jira] [Updated] (SPARK-20660) Not able to merge Dataframes with different column orders
[ https://issues.apache.org/jira/browse/SPARK-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michel Lemay updated SPARK-20660: - Description: Union on two dataframes with different column orders is not supported and lead to hard to find issues. Here is an example showing the issue. {code} import org.apache.spark.sql.types._ import org.apache.spark.sql.Row var inputSchema = StructType(StructField("key", StringType, nullable=true) :: StructField("value", IntegerType, nullable=true) :: Nil) var a = spark.createDataFrame(sc.parallelize((1 to 10)).map(x => Row(x.toString, 555)), inputSchema) var b = a.select($"value" * 2 alias "value", $"key") // any transformation changing column order will show the problem. a.union(b).show // in order to make it work, we need to reorder columns val bCols = a.columns.map(aCol => b(aCol)) a.union(b.select(bCols:_*)).show {code} was: Union on two dataframes with different column orders is not supported and lead to hard to find issues. Here is an example showing the issue. {code} import org.apache.spark.sql.types._ import org.apache.spark.sql.Row var inputSchema = StructType(StructField("key", StringType, nullable=true) :: StructField("value", IntegerType, nullable=true) :: Nil) var a = spark.createDataFrame(sc.parallelize((1 to 10)).map(x => Row(x.toString, 555)), inputSchema) var b = a.select($"value", $"key") // any transformation changing column order will show the problem. a.union(b).show // in order to make it work, we need to reorder columns val bCols = a.columns.map(aCol => b(aCol)) a.union(b.select(bCols:_*)).show {code} > Not able to merge Dataframes with different column orders > - > > Key: SPARK-20660 > URL: https://issues.apache.org/jira/browse/SPARK-20660 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Michel Lemay >Priority: Minor > > Union on two dataframes with different column orders is not supported and > lead to hard to find issues. 
> Here is an example showing the issue. > {code} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > var inputSchema = StructType(StructField("key", StringType, nullable=true) :: > StructField("value", IntegerType, nullable=true) :: Nil) > var a = spark.createDataFrame(sc.parallelize((1 to 10)).map(x => > Row(x.toString, 555)), inputSchema) > var b = a.select($"value" * 2 alias "value", $"key") // any transformation > changing column order will show the problem. > a.union(b).show > // in order to make it work, we need to reorder columns > val bCols = a.columns.map(aCol => b(aCol)) > a.union(b.select(bCols:_*)).show > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
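The positional-union pitfall and the reorder-by-name workaround quoted above can be illustrated without Spark. The helpers below are hypothetical Python stand-ins for union and select on lists of tuples, not DataFrame API:

```python
# Positional union (what DataFrame.union does) vs. a name-based reorder.
# Column names and helper functions here are illustrative only.

def union_by_position(rows_a, rows_b):
    # Concatenate rows with no regard for column names, like union().
    return rows_a + rows_b

def reorder_to(cols_target, cols_src, rows):
    # Permute each row of `rows` from cols_src order into cols_target order,
    # like b.select(bCols:_*) in the report.
    idx = [cols_src.index(c) for c in cols_target]
    return [tuple(row[i] for i in idx) for row in rows]

cols_a = ["key", "value"]
a = [("1", 555), ("2", 555)]
cols_b = ["value", "key"]        # same data, columns swapped
b = [(1110, "1"), (1110, "2")]

# Positional union silently puts keys into the "value" column.
bad = union_by_position(a, b)
assert bad[2] == (1110, "1")

# Reordering b's columns to a's order first gives the intended result.
good = union_by_position(a, reorder_to(cols_a, cols_b, b))
assert good[2] == ("1", 1110)
```

The failure is silent whenever the swapped columns happen to have compatible types, which is what makes these issues hard to find.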
[jira] [Comment Edited] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000130#comment-16000130 ] Helena Edelson edited comment on SPARK-18057 at 5/8/17 8:23 PM: It's not that simple, the PR I have queued for this required some code changes in the upgrade. It's not just a dependency addition/exclusion. was (Author: helena_e): Did that a while ago, my only point is not modifying artifacts ideally, by adding and excluding in builds. > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20660) Not able to merge Dataframes with different column orders
[ https://issues.apache.org/jira/browse/SPARK-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michel Lemay updated SPARK-20660: - Description: Union on two dataframes with different column orders is not supported and lead to hard to find issues. Here is an example showing the issue. {code} import org.apache.spark.sql.types._ import org.apache.spark.sql.Row var inputSchema = StructType(StructField("key", StringType, nullable=true) :: StructField("value", IntegerType, nullable=true) :: Nil) var a = spark.createDataFrame(sc.parallelize((1 to 10)).map(x => Row(x.toString, 555)), inputSchema) var b = a.select($"value", $"key") // any transformation changing column order will show the problem. a.union(b).show // in order to make it work, we need to reorder columns val bCols = a.columns.map(aCol => b(aCol)) a.union(b.select(bCols:_*)).show {code} was: Union on two dataframes with different column orders is not supported and lead to hard to find issues. Here is an example showing the issue. {code} import org.apache.spark.sql.types._ import org.apache.spark.sql.Row var inputSchema = StructType(StructField("key", StringType, nullable=true) :: StructField("value", IntegerType, nullable=true) :: Nil) var a = spark.createDataFrame(sc.parallelize((1 to 10)).map(x => Row(x.toString, 555)), inputSchema) var b = a.select($"value", $"key") // any transformation changing column order will show the problem. a.union(c).show // in order to make it work, we need to reorder columns val bCols = a.columns.map(aCol => b(aCol)) a.union(b.select(bCols:_*)).show {code} > Not able to merge Dataframes with different column orders > - > > Key: SPARK-20660 > URL: https://issues.apache.org/jira/browse/SPARK-20660 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Michel Lemay >Priority: Minor > > Union on two dataframes with different column orders is not supported and > lead to hard to find issues. 
> Here is an example showing the issue. > {code} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > var inputSchema = StructType(StructField("key", StringType, nullable=true) :: > StructField("value", IntegerType, nullable=true) :: Nil) > var a = spark.createDataFrame(sc.parallelize((1 to 10)).map(x => > Row(x.toString, 555)), inputSchema) > var b = a.select($"value", $"key") // any transformation changing column > order will show the problem. > a.union(b).show > // in order to make it work, we need to reorder columns > val bCols = a.columns.map(aCol => b(aCol)) > a.union(b.select(bCols:_*)).show > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20660) Not able to merge Dataframes with different column orders
[ https://issues.apache.org/jira/browse/SPARK-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michel Lemay updated SPARK-20660: - Description: Union on two dataframes with different column orders is not supported and lead to hard to find issues. Here is an example showing the issue. {code} import org.apache.spark.sql.types._ import org.apache.spark.sql.Row var inputSchema = StructType(StructField("key", StringType, nullable=true) :: StructField("value", IntegerType, nullable=true) :: Nil) var a = spark.createDataFrame(sc.parallelize((1 to 10)).map(x => Row(x.toString, 555)), inputSchema) var b = a.select($"value", $"key") // any transformation changing column order will show the problem. a.union(c).show // in order to make it work, we need to reorder columns val bCols = a.columns.map(aCol => b(aCol)) a.union(b.select(bCols:_*)).show {code} was: Union on two dataframes with different column orders is not supported and lead to hard to find issues. Here is an example showing the issue. {code} import org.apache.spark.sql.types._ import org.apache.spark.sql.Row var inputSchema = StructType(StructField("key", StringType, nullable=true) :: StructField("value", IntegerType, nullable=true) :: Nil) var a = spark.createDataFrame(sc.parallelize((1 to 10 by 2)).map(x => Row(x.toString, 555)), inputSchema) var b = a.select($"value", $"key") // any transformation changing column order will show the problem. a.union(c).show // in order to make it work, we need to reorder columns val bCols = a.columns.map(aCol => b(aCol)) a.union(b.select(bCols:_*)).show {code} > Not able to merge Dataframes with different column orders > - > > Key: SPARK-20660 > URL: https://issues.apache.org/jira/browse/SPARK-20660 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Michel Lemay >Priority: Minor > > Union on two dataframes with different column orders is not supported and > lead to hard to find issues. 
> Here is an example showing the issue. > {code} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > var inputSchema = StructType(StructField("key", StringType, nullable=true) :: > StructField("value", IntegerType, nullable=true) :: Nil) > var a = spark.createDataFrame(sc.parallelize((1 to 10)).map(x => > Row(x.toString, 555)), inputSchema) > var b = a.select($"value", $"key") // any transformation changing column > order will show the problem. > a.union(c).show > // in order to make it work, we need to reorder columns > val bCols = a.columns.map(aCol => b(aCol)) > a.union(b.select(bCols:_*)).show > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20660) Not able to merge Dataframes with different column orders
Michel Lemay created SPARK-20660: Summary: Not able to merge Dataframes with different column orders Key: SPARK-20660 URL: https://issues.apache.org/jira/browse/SPARK-20660 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Michel Lemay Priority: Minor Union on two dataframes with different column orders is not supported and lead to hard to find issues. Here is an example showing the issue. {code} import org.apache.spark.sql.types._ import org.apache.spark.sql.Row var inputSchema = StructType(StructField("key", StringType, nullable=true) :: StructField("value", IntegerType, nullable=true) :: Nil) var a = spark.createDataFrame(sc.parallelize((1 to 10 by 2)).map(x => Row(x.toString, 555)), inputSchema) var b = a.select($"value", $"key") // any transformation changing column order will show the problem. a.union(c).show // in order to make it work, we need to reorder columns val bCols = a.columns.map(aCol => b(aCol)) a.union(b.select(bCols:_*)).show {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20659) Remove StorageStatus, or make it private.
Marcelo Vanzin created SPARK-20659: -- Summary: Remove StorageStatus, or make it private. Key: SPARK-20659 URL: https://issues.apache.org/jira/browse/SPARK-20659 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0 Reporter: Marcelo Vanzin With the work being done in SPARK-18085, StorageStatus is not used anymore by the UI. It's still used in a couple of other places, though: - {{SparkContext.getExecutorStorageStatus}} - {{BlockManagerSource}} (a metrics source) Both could be changed to use the REST API types; the first one could be replaced with a new method in {{SparkStatusTracker}}, which I also think is a better place for it anyway.
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001440#comment-16001440 ] Marcelo Vanzin commented on SPARK-20658: The exact build of Spark you're using would help. You can also check the logs for something like this: {noformat} Ignoring spark.yarn.am.attemptFailuresValidityInterval because the version of YARN does not support it {noformat} > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-defaults.conf and in my spark-submit command. (This flag shows up in the > environment tab of spark history server, so it seems that it's specified > correctly). > However, I just had a job die with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count?
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001439#comment-16001439 ] Paul Jones commented on SPARK-20658: I know this likely isn't enough information to debug this issue. I'm happy to provide additional information. > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-defaults.conf and in my spark-submit command. (This flag shows up in the > environment tab of spark history server, so it seems that it's specified > correctly). > However, I just had a job die with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count?
[jira] [Created] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
Paul Jones created SPARK-20658: -- Summary: spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect Key: SPARK-20658 URL: https://issues.apache.org/jira/browse/SPARK-20658 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.1.0 Reporter: Paul Jones Priority: Minor I'm running a job in YARN cluster mode using `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both spark-defaults.conf and in my spark-submit command. (This flag shows up in the environment tab of spark history server, so it seems that it's specified correctly). However, I just had a job die with four AM failures (three of the four failures were over an hour apart). So, I'm confused as to what could be going on. I haven't figured out the cause of the individual failures, so is it possible that we always count certain types of failures? E.g. jobs that are killed due to memory issues always count?
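The semantics the flag is meant to enable can be sketched as a sliding window: only AM failures within the last interval count toward the maximum, so failures more than an hour apart should never accumulate under a 1h interval. This is an illustrative Python model of that rule, not YARN's implementation:

```python
# Count only the failures that fall inside the trailing validity window.
# Timestamps are in seconds and purely illustrative.

def failures_in_window(failure_times, now, interval):
    return sum(1 for t in failure_times if now - t <= interval)

ONE_HOUR = 3600
# Failures at t=0, 4000, and 8500 are each more than an hour apart;
# the one at t=8600 comes only 100s after the previous.
failures = [0, 4000, 8500, 8600]

# With a 1h validity interval, only the last two failures count at t=8600.
assert failures_in_window(failures, 8600, ONE_HOUR) == 2
# With no validity interval, every failure ever seen counts.
assert failures_in_window(failures, 8600, float("inf")) == 4
```

If the reported job with failures over an hour apart still exhausted its attempts, that behaves like the "no interval" branch, consistent with the flag being ignored (e.g. unsupported by the YARN version, as the earlier comment suggests).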
[jira] [Commented] (SPARK-20641) Key-value store abstraction and implementation for storing application data
[ https://issues.apache.org/jira/browse/SPARK-20641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001432#comment-16001432 ] Apache Spark commented on SPARK-20641: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/17902 > Key-value store abstraction and implementation for storing application data > --- > > Key: SPARK-20641 > URL: https://issues.apache.org/jira/browse/SPARK-20641 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks adding a key-value store abstraction and initial LevelDB > implementation to be used to store application data for building the UI and > REST API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20641) Key-value store abstraction and implementation for storing application data
[ https://issues.apache.org/jira/browse/SPARK-20641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20641: Assignee: (was: Apache Spark) > Key-value store abstraction and implementation for storing application data > --- > > Key: SPARK-20641 > URL: https://issues.apache.org/jira/browse/SPARK-20641 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks adding a key-value store abstraction and initial LevelDB > implementation to be used to store application data for building the UI and > REST API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20641) Key-value store abstraction and implementation for storing application data
[ https://issues.apache.org/jira/browse/SPARK-20641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20641: Assignee: Apache Spark > Key-value store abstraction and implementation for storing application data > --- > > Key: SPARK-20641 > URL: https://issues.apache.org/jira/browse/SPARK-20641 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > See spec in parent issue (SPARK-18085) for more details. > This task tracks adding a key-value store abstraction and initial LevelDB > implementation to be used to store application data for building the UI and > REST API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001429#comment-16001429 ] Ryan Blue commented on SPARK-12297: --- The Impala team has been working with the Parquet community recently to update the Parquet spec so that we can distinguish between timestamp with/without time zone. I think once that's committed, we should just move off of the INT96 timestamp and use the proper spec. > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue >Assignee: Imran Rashid > Fix For: 2.3.0 > > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. 
Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same 
times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, it's a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
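The round trip described above can be mimicked with the Python standard library: interpreting the same wall-clock value with "instant" semantics (as the int96/Parquet path effectively does here) shifts it when re-rendered in another zone, while "floating" semantics (the textfile/JSON path) leaves it alone. This is an analogy to the report's output, not Spark code; it requires Python 3.9+ for zoneinfo:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+, needs a system tz database

la = ZoneInfo("America/Los_Angeles")
ny = ZoneInfo("America/New_York")

wall = "2015-12-31 23:50:59"
naive = datetime.strptime(wall, "%Y-%m-%d %H:%M:%S")

# Instant semantics: fix the moment in LA, then re-render it in NY.
# The wall clock moves forward 3 hours (PST is UTC-8, EST is UTC-5),
# matching the first row of the la_parquet output above.
as_instant = naive.replace(tzinfo=la).astimezone(ny)
assert as_instant.strftime("%Y-%m-%d %H:%M:%S") == "2016-01-01 02:50:59"

# Floating semantics: the wall clock *is* the value; nothing shifts,
# matching the la_textfile and la_json output.
assert naive.strftime("%Y-%m-%d %H:%M:%S") == wall
```

Joining the two representations fails for the same reason the report's last join returns no rows: the instants and the floating values no longer denote the same wall-clock strings.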
[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001426#comment-16001426 ] Shixiong Zhu commented on SPARK-13747: -- [~revolucion09] The default dispatcher uses ForkJoinPool. See http://doc.akka.io/docs/akka/current/scala/dispatchers.html#Default_dispatcher > Concurrent execution in SQL doesn't work with Scala ForkJoinPool > > > Key: SPARK-13747 > URL: https://issues.apache.org/jira/browse/SPARK-13747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Running the following code may fail > {code} > (1 to 100).par.foreach { _ => > println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()) > } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > This is because SparkContext.runJob can be suspended when using a > ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it > calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). > So when SparkContext.runJob is suspended, ForkJoinPool will run another task > in the same thread, however, the local properties have been polluted.
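The thread-reuse hazard described above, where a pool schedules another task on a thread whose local properties were never cleared, can be reproduced in miniature with any thread pool. This Python sketch uses illustrative names, not Spark internals:

```python
# A task sets a thread-local "execution id" and (like the suspended runJob)
# never clears it; the next task the pool runs on that thread inherits the
# stale value and fails exactly like the quoted IllegalArgumentException.
import threading
from concurrent.futures import ThreadPoolExecutor

local = threading.local()

def with_new_execution_id(job):
    if getattr(local, "execution_id", None) is not None:
        raise RuntimeError("spark.sql.execution.id is already set")
    local.execution_id = job
    # A well-behaved task would clear the property in a finally block;
    # this one "suspends" and leaves the thread polluted.

# max_workers=1 forces the second task onto the same (polluted) thread.
with ThreadPoolExecutor(max_workers=1) as pool:
    pool.submit(with_new_execution_id, "job-1").result()
    err = pool.submit(with_new_execution_id, "job-2").exception()

assert isinstance(err, RuntimeError)
assert "already set" in str(err)
```

The fix directions discussed for Spark amount to either clearing such properties before yielding the thread, or not carrying them via thread locals at all.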