[jira] [Commented] (SPARK-2636) No way to get a job identifier when submitting a Spark job through the Spark API
[ https://issues.apache.org/jira/browse/SPARK-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113400#comment-14113400 ] Apache Spark commented on SPARK-2636: - User 'lirui-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/2176 No way to get a job identifier when submitting a Spark job through the Spark API --- Key: SPARK-2636 URL: https://issues.apache.org/jira/browse/SPARK-2636 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Labels: hive In Hive on Spark, we want to track Spark job status through the Spark API. The basic idea is as follows: # create a Hive-specific Spark listener and register it with the Spark listener bus. # the Hive-specific Spark listener derives job status from Spark listener events. # the Hive driver tracks job status through the Hive-specific Spark listener. The current problem is that the Hive driver needs a job identifier to track the status of a specific job through the Spark listener, but there is no Spark API that returns a job identifier (like a job id) when a Spark job is submitted. I think any other project that tries to track job status through the Spark API would suffer from this as well. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
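As an illustration of the listener-based tracking described above, the sketch below registers a custom SparkListener and records job IDs as they arrive in listener events. The JobTrackingListener name and the use of sc.setJobGroup as the correlation key are assumptions for the example, not the API this issue is asking for.

{code}
import scala.collection.mutable
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Hypothetical listener: remembers which job IDs were started under which job group,
// so an external driver (e.g. Hive) can poll the status of the jobs it submitted.
class JobTrackingListener extends SparkListener {
  val jobsByGroup = new mutable.HashMap[String, mutable.Set[Int]]()

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = synchronized {
    // "spark.jobGroup.id" is populated when the driver calls sc.setJobGroup(...)
    val group = Option(jobStart.properties)
      .map(_.getProperty("spark.jobGroup.id", "default")).getOrElse("default")
    jobsByGroup.getOrElseUpdate(group, mutable.Set.empty[Int]) += jobStart.jobId
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = synchronized {
    // jobEnd.jobId can be matched against the IDs recorded in onJobStart
  }
}

// Usage sketch (names are placeholders):
// val listener = new JobTrackingListener
// sc.addSparkListener(listener)
// sc.setJobGroup("hive-query-42", "example query")  // assumed correlation key
// rdd.count()  // afterwards jobsByGroup("hive-query-42") holds the job id(s)
{code}

The sketch still relies on setting a job group before calling the action; the gap this issue describes is that nothing is returned directly from the submit call itself.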
[jira] [Created] (SPARK-3277) LZ4 compression causes an ExternalSort exception
hzw created SPARK-3277: -- Summary: LZ4 compression causes an ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: hzw Fix For: 1.1.0 I tested LZ4 compression and it ran into this problem (with wordcount). I also tested snappy and LZF, and they were OK. In the end I set spark.shuffle.spill to false to avoid the exception, but as soon as that switch is turned on again this error comes back. Exception info as follows: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.<init>(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
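For reference, a minimal way to put a job into the configuration the reporter describes (LZ4 shuffle compression with spilling enabled) might look like the sketch below. The fully-qualified codec class name is used to avoid relying on short codec aliases, and the tiny memoryFraction is only an assumption to force ExternalAppendOnlyMap to spill quickly; this is a reproduction sketch, not a confirmed minimal case.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Assumed reproduction setup: LZ4 compression with shuffle spilling left on (its default).
val conf = new SparkConf()
  .setAppName("lz4-spill-repro")
  .set("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
  .set("spark.shuffle.spill", "true")
  .set("spark.shuffle.memoryFraction", "0.0001") // force early spills (assumption)
val sc = new SparkContext(conf)

sc.textFile("README.md")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _) // map-side combine goes through ExternalAppendOnlyMap
  .count()
{code}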
[jira] [Resolved] (SPARK-3230) UDFs that return structs result in ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-3230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3230. - Resolution: Fixed UDFs that return structs result in ClassCastException - Key: SPARK-3230 URL: https://issues.apache.org/jira/browse/SPARK-3230 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3026) Provide a good error message if JDBC server is used but Spark is not compiled with -Pthriftserver
[ https://issues.apache.org/jira/browse/SPARK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3026. - Resolution: Fixed Fix Version/s: 1.1.0 Provide a good error message if JDBC server is used but Spark is not compiled with -Pthriftserver - Key: SPARK-3026 URL: https://issues.apache.org/jira/browse/SPARK-3026 Project: Spark Issue Type: Bug Components: SQL Reporter: Patrick Wendell Assignee: Cheng Lian Priority: Critical Fix For: 1.1.0 Instead of giving a ClassNotFoundException we should detect this case and just tell the user to build with -Phiveserver. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3269) SparkSQLOperationManager.getNextRowSet OOMs when a large maxRows is set
[ https://issues.apache.org/jira/browse/SPARK-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3269: Assignee: Cheng Lian SparkSQLOperationManager.getNextRowSet OOMs when a large maxRows is set --- Key: SPARK-3269 URL: https://issues.apache.org/jira/browse/SPARK-3269 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Cheng Lian Assignee: Cheng Lian {{SparkSQLOperationManager.getNextRowSet}} allocates an {{ArrayBuffer[Row]}} as large as {{maxRows}}, which can lead to OOM if {{maxRows}} is large, even if the actual size of the row set is much smaller. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
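The description above already points at the general shape of a fix; as a hedged illustration (not the actual patch), the result buffer can be bounded by maxRows while growing only with the rows actually fetched:

{code}
import scala.collection.mutable.ArrayBuffer

// Illustrative only: grow the buffer with the rows that exist instead of
// pre-allocating an ArrayBuffer of size maxRows.
def fetchRows[T](rows: Iterator[T], maxRows: Long): Seq[T] = {
  val buf = new ArrayBuffer[T]() // starts small, grows on demand
  var fetched = 0L
  while (rows.hasNext && fetched < maxRows) {
    buf += rows.next()
    fetched += 1
  }
  buf
}
{code}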
[jira] [Updated] (SPARK-3044) Create RSS feed for Spark News
[ https://issues.apache.org/jira/browse/SPARK-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3044: --- Component/s: Project Infra Create RSS feed for Spark News -- Key: SPARK-3044 URL: https://issues.apache.org/jira/browse/SPARK-3044 Project: Spark Issue Type: Documentation Components: Project Infra Reporter: Nicholas Chammas Priority: Minor Project updates are often posted here: http://spark.apache.org/news/ Currently, there is no way to subscribe to a feed of these updates. It would be nice there was a way people could be notified of new posts there without having to check manually. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3278) Isotonic regression
Xiangrui Meng created SPARK-3278: Summary: Isotonic regression Key: SPARK-3278 URL: https://issues.apache.org/jira/browse/SPARK-3278 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Add isotonic regression for score calibration. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3279) Remove useless field variable in ApplicationMaster
Kousuke Saruta created SPARK-3279: - Summary: Remove useless field variable in ApplicationMaster Key: SPARK-3279 URL: https://issues.apache.org/jira/browse/SPARK-3279 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta ApplicationMaster no longer uses ALLOCATE_HEARTBEAT_INTERVAL. Let's remove it. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3267) Deadlock between ScalaReflectionLock and Data type initialization
[ https://issues.apache.org/jira/browse/SPARK-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3267: Priority: Critical (was: Major) Target Version/s: 1.2.0 Assignee: Michael Armbrust Deadlock between ScalaReflectionLock and Data type initialization - Key: SPARK-3267 URL: https://issues.apache.org/jira/browse/SPARK-3267 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Aaron Davidson Assignee: Michael Armbrust Priority: Critical Deadlock here: {code} Executor task launch worker-0 daemon prio=10 tid=0x7fab50036000 nid=0x27a in Object.wait() [0x7fab60c2e000] java.lang.Thread.State: RUNNABLE at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.defaultPrimitive(CodeGenerator.scala:565) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:202) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$Evaluate2$2.evaluateAs(CodeGenerator.scala:175) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:304) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:314) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:313) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) ... {code} and {code} Executor task launch worker-2 daemon prio=10 tid=0x7fab100f0800 nid=0x27e in Object.wait() [0x7fab0eeec000] java.lang.Thread.State: RUNNABLE at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:250) - locked 0x00064e5d9a48 (a org.apache.spark.sql.catalyst.expressions.Cast) at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247) at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:139) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:126) at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:197) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
[jira] [Commented] (SPARK-3279) Remove useless field variable in ApplicationMaster
[ https://issues.apache.org/jira/browse/SPARK-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113505#comment-14113505 ] Apache Spark commented on SPARK-3279: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2177 Remove useless field variable in ApplicationMaster -- Key: SPARK-3279 URL: https://issues.apache.org/jira/browse/SPARK-3279 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta ApplicationMaster no longer uses ALLOCATE_HEARTBEAT_INTERVAL. Let's remove it. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3280) Made sort-based shuffle the default implementation
Reynold Xin created SPARK-3280: -- Summary: Made sort-based shuffle the default implementation Key: SPARK-3280 URL: https://issues.apache.org/jira/browse/SPARK-3280 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Reynold Xin sort-based shuffle has lower memory usage and seems to outperform hash-based in almost all of our testing. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
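Until the default actually changes, sort-based shuffle can be opted into per application; the sketch below assumes the "sort" short name resolves to org.apache.spark.shuffle.sort.SortShuffleManager.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Opt-in configuration while hash-based shuffle is still the default.
val conf = new SparkConf()
  .setAppName("sort-shuffle-test")
  .set("spark.shuffle.manager", "sort") // assumed alias for SortShuffleManager
val sc = new SparkContext(conf)
{code}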
[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation
[ https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113519#comment-14113519 ] Apache Spark commented on SPARK-3280: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2178 Made sort-based shuffle the default implementation -- Key: SPARK-3280 URL: https://issues.apache.org/jira/browse/SPARK-3280 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Reynold Xin sort-based shuffle has lower memory usage and seems to outperform hash-based in almost all of our testing. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1912) Compression memory issue during reduce
[ https://issues.apache.org/jira/browse/SPARK-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113524#comment-14113524 ] Apache Spark commented on SPARK-1912: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2179 Compression memory issue during reduce -- Key: SPARK-1912 URL: https://issues.apache.org/jira/browse/SPARK-1912 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Wenchen Fan Assignee: Wenchen Fan Fix For: 0.9.2, 1.0.1, 1.1.0 When we need to read a compressed block, we first create a compression stream instance (LZF or Snappy) and use it to wrap that block. Say a reducer task needs to read 1000 local shuffle blocks: it first prepares to read those 1000 blocks, which means creating 1000 compression stream instances to wrap them. But initializing a compression instance allocates some memory, and having many compression instances alive at the same time is a problem. In practice the reducer reads the shuffle blocks one by one, so why create all the compression instances up front? Can we do it lazily, creating the compression instance for a block only when that block is first read? -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
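A hedged sketch of the lazy idea described above (not the actual change): keep only a thunk per shuffle block and wrap it in a decompression stream the first time the block is read, so a reducer over 1000 blocks holds at most one live decompression buffer instead of 1000. The wrapCompressed function stands in for whatever codec wrapper is in use.

{code}
import java.io.InputStream

// Illustrative structure only; wrapCompressed is a placeholder for the codec wrapper.
class LazyCompressedBlock(open: () => InputStream,
                          wrapCompressed: InputStream => InputStream) {
  // Nothing is allocated until `stream` is first touched.
  lazy val stream: InputStream = wrapCompressed(open())
}
{code}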
[jira] [Created] (SPARK-3281) Remove Netty specific code in BlockManager
Reynold Xin created SPARK-3281: -- Summary: Remove Netty specific code in BlockManager Key: SPARK-3281 URL: https://issues.apache.org/jira/browse/SPARK-3281 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Everything should go through the BlockTransferService interface rather than having conditional branches for Netty. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3281) Remove Netty specific code in BlockManager
[ https://issues.apache.org/jira/browse/SPARK-3281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113550#comment-14113550 ] Apache Spark commented on SPARK-3281: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2181 Remove Netty specific code in BlockManager -- Key: SPARK-3281 URL: https://issues.apache.org/jira/browse/SPARK-3281 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Everything should go through the BlockTransferService interface rather than having conditional branches for Netty. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3282) It should support multiple receivers at one socketInputDStream
shenhong created SPARK-3282: --- Summary: It should support multiple receivers at one socketInputDStream Key: SPARK-3282 URL: https://issues.apache.org/jira/browse/SPARK-3282 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2 Reporter: shenhong At present, a socketInputDStream supports at most one receiver, which becomes a bottleneck when a large input stream appears. A single socketInputDStream should support multiple receivers. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
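As a workaround that exists today, several socket streams can be created and unioned, giving one receiver per stream; host and port below are placeholders, and this is a sketch of the workaround rather than the API change the issue asks for.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("multi-receiver-socket")
val ssc = new StreamingContext(conf, Seconds(2))

// One receiver per socketTextStream; union them into a single logical DStream.
val numReceivers = 4
val streams = (0 until numReceivers).map(_ => ssc.socketTextStream("localhost", 9999))
val unioned = ssc.union(streams)

unioned.count().print()
ssc.start()
ssc.awaitTermination()
{code}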
[jira] [Created] (SPARK-3283) Receivers sometimes do not get spread out to multiple nodes
Tathagata Das created SPARK-3283: Summary: Receivers sometimes do not get spread out to multiple nodes Key: SPARK-3283 URL: https://issues.apache.org/jira/browse/SPARK-3283 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das The probable reason this happens is because the JobGenerator and JobScheduler start generating jobs with tasks. When the ReceiverTracker submits the task containing receivers, the tasks get assigned according to empty slots, which may be instantaneously available on one node, instead of all the nodes. The original behavior was that the jobs started only after the receivers are started, thus ensuring that all the slots are free and the receivers are spread evenly across all the nodes. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2633) enhance spark listener API to gather more spark job information
[ https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113611#comment-14113611 ] Chengxiang Li commented on SPARK-2633: -- It's quite subjective, I think: Hive on MR displays job progress as the percentage of finished tasks, while Hive on Tez displays job progress with exact running/failed/finished task counts. I think it's better to collect more detailed job status info as long as it does not introduce much extra effort. enhance spark listener API to gather more spark job information --- Key: SPARK-2633 URL: https://issues.apache.org/jira/browse/SPARK-2633 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Priority: Critical Labels: hive Attachments: Spark listener enhancement for Hive on Spark job monitor and statistic.docx Based on the Hive on Spark job status monitoring and statistics collection requirements, try to enhance the Spark listener API to gather more Spark job information. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3284) saveAsParquetFile not working on windows
Pravesh Jain created SPARK-3284: --- Summary: saveAsParquetFile not working on windows Key: SPARK-3284 URL: https://issues.apache.org/jira/browse/SPARK-3284 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Environment: Windows Reporter: Pravesh Jain Priority: Minor

object parquet {
  case class Person(name: String, age: Int)
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
    import sqlContext.createSchemaRDD
    val people = sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
    people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
    val parquetFile = sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
  }
}

gives the error Exception in thread "main" java.lang.NullPointerException at org.apache.spark.parquet$.main(parquet.scala:16), which is the saveAsParquetFile line. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3284) saveAsParquetFile not working on windows
[ https://issues.apache.org/jira/browse/SPARK-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pravesh Jain updated SPARK-3284: Description: object parquet { case class Person(name: String, age: Int) def main(args: Array[String]) { val sparkConf = new SparkConf().setMaster(local).setAppName(HdfsWordCount) val sc = new SparkContext(sparkConf) val sqlContext = new org.apache.spark.sql.SQLContext(sc) // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD. import sqlContext.createSchemaRDD val people = sc.textFile(C:/Users/pravesh.jain/Desktop/people/people.txt).map(_.split(,)).map(p = Person(p(0), p(1).trim.toInt)) people.saveAsParquetFile(C:/Users/pravesh.jain/Desktop/people/people.parquet) val parquetFile = sqlContext.parquetFile(C:/Users/pravesh.jain/Desktop/people/people.parquet) } } gives the error Exception in thread main java.lang.NullPointerException at org.apache.spark.parquet$.main(parquet.scala:16) which is the line saveAsParquetFile. This works fine in linux but using in eclipse in windows gives the error. was: object parquet { case class Person(name: String, age: Int) def main(args: Array[String]) { val sparkConf = new SparkConf().setMaster(local).setAppName(HdfsWordCount) val sc = new SparkContext(sparkConf) val sqlContext = new org.apache.spark.sql.SQLContext(sc) // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD. import sqlContext.createSchemaRDD val people = sc.textFile(C:/Users/pravesh.jain/Desktop/people/people.txt).map(_.split(,)).map(p = Person(p(0), p(1).trim.toInt)) people.saveAsParquetFile(C:/Users/pravesh.jain/Desktop/people/people.parquet) val parquetFile = sqlContext.parquetFile(C:/Users/pravesh.jain/Desktop/people/people.parquet) } } gives the error Exception in thread main java.lang.NullPointerException at org.apache.spark.parquet$.main(parquet.scala:16) which is the line saveAsParquetFile. saveAsParquetFile not working on windows Key: SPARK-3284 URL: https://issues.apache.org/jira/browse/SPARK-3284 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Environment: Windows Reporter: Pravesh Jain Priority: Minor object parquet { case class Person(name: String, age: Int) def main(args: Array[String]) { val sparkConf = new SparkConf().setMaster(local).setAppName(HdfsWordCount) val sc = new SparkContext(sparkConf) val sqlContext = new org.apache.spark.sql.SQLContext(sc) // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD. import sqlContext.createSchemaRDD val people = sc.textFile(C:/Users/pravesh.jain/Desktop/people/people.txt).map(_.split(,)).map(p = Person(p(0), p(1).trim.toInt)) people.saveAsParquetFile(C:/Users/pravesh.jain/Desktop/people/people.parquet) val parquetFile = sqlContext.parquetFile(C:/Users/pravesh.jain/Desktop/people/people.parquet) } } gives the error Exception in thread main java.lang.NullPointerException at org.apache.spark.parquet$.main(parquet.scala:16) which is the line saveAsParquetFile. This works fine in linux but using in eclipse in windows gives the error. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-1647) Prevent data loss when Streaming driver goes down
[ https://issues.apache.org/jira/browse/SPARK-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giulio De Vecchi updated SPARK-1647: Comment: was deleted (was: Not sure if this make sense, but maybe would be nice to have a kind of flag available within the code that tells me if I'm running in a normal situation or during a recovery. To better explain this, let's consider the following scenario: I am processing data, let's say from a Kafka streaming, and I am updating a database based on the computations. During the recovery I don't want to update again the database (for many reasons, let's just assume that) but I want my system to be in the same status as before, thus I would like to know if my code is running for the first time or during a recovery so I can avoid to update the database again. More generally I want to know this in case I'm interacting with external entities. ) Prevent data loss when Streaming driver goes down - Key: SPARK-1647 URL: https://issues.apache.org/jira/browse/SPARK-1647 Project: Spark Issue Type: Bug Components: Streaming Reporter: Hari Shreedharan Assignee: Hari Shreedharan Currently when the driver goes down, any uncheckpointed data is lost from within spark. If the system from which messages are pulled can replay messages, the data may be available - but for some systems, like Flume this is not the case. Also, all windowing information is lost for windowing functions. We must persist raw data somehow, and be able to replay this data if required. We also must persist windowing information with the data itself. This will likely require quite a bit of work to complete and probably will have to be split into several sub-jiras. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3276) Provide an API to specify whether old files need to be ignored in the file input text DStream
[ https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113704#comment-14113704 ] Sean Owen commented on SPARK-3276: -- Given the nature of a stream-processing framework, when would you want to keep reprocessing all old data? That is something you can do, but it doesn't require Spark Streaming. Provide an API to specify whether old files need to be ignored in the file input text DStream Key: SPARK-3276 URL: https://issues.apache.org/jira/browse/SPARK-3276 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.2 Reporter: Jack Hu Currently there is only one API, textFileStream in StreamingContext, to create a text file DStream, and it always ignores old files. Sometimes the old files are still useful. We need an API that lets the user choose whether old files should be ignored or not. The API currently in StreamingContext: def textFileStream(directory: String): DStream[String] = { fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString) } -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
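For what it is worth, the lower-level fileStream overload already takes a newFilesOnly flag, so a textFileStream variant could be a thin wrapper over it; whether that flag covers every notion of "old" files requested here is an assumption, and the helper name below is made up for the sketch.

{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.StreamingContext

// Hypothetical helper: like textFileStream, but lets the caller decide whether
// files already present in the directory should be processed as well.
def textFileStreamWithOld(ssc: StreamingContext, directory: String,
                          newFilesOnly: Boolean) = {
  ssc.fileStream[LongWritable, Text, TextInputFormat](
    directory, (_: Path) => true, newFilesOnly).map(_._2.toString)
}
{code}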
[jira] [Commented] (SPARK-3274) Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream
[ https://issues.apache.org/jira/browse/SPARK-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113702#comment-14113702 ] Sean Owen commented on SPARK-3274: -- Same as the problem and solution in https://issues.apache.org/jira/browse/SPARK-1040 Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream -- Key: SPARK-3274 URL: https://issues.apache.org/jira/browse/SPARK-3274 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2 Reporter: Jack Hu Reproduce code:

scontext
  .socketTextStream("localhost", 1)
  .mapToPair(new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String arg0) throws Exception {
      return new Tuple2<String, String>("1", arg0);
    }
  })
  .foreachRDD(new Function2<JavaPairRDD<String, String>, Time, Void>() {
    public Void call(JavaPairRDD<String, String> v1, Time v2) throws Exception {
      System.out.println(v2.toString() + ": " + v1.collectAsMap().toString());
      return null;
    }
  });

Exception: java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lscala.Tuple2; at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:447) at org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:464) at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:90) at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:88) at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282) at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobS -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3285) Using values.sum is easier to understand than using values.foldLeft(0)(_ + _)
Yadong Qi created SPARK-3285: Summary: Using values.sum is easier to understand than using values.foldLeft(0)(_ + _) Key: SPARK-3285 URL: https://issues.apache.org/jira/browse/SPARK-3285 Project: Spark Issue Type: Test Components: Examples Affects Versions: 1.0.2 Reporter: Yadong Qi def sum[B >: A](implicit num: Numeric[B]): B = foldLeft(num.zero)(num.plus) Using values.sum is easier to understand than using values.foldLeft(0)(_ + _), so we'd better use values.sum instead of values.foldLeft(0)(_ + _). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113710#comment-14113710 ] Sean Owen commented on SPARK-3266: -- The method is declared in the superclass, JavaRDDLike: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala#L538 You are running a different version of Spark than you are compiling with, and the runtime version is perhaps too old to contain this method. This is not a Spark issue. JavaDoubleRDD doesn't contain max() --- Key: SPARK-3266 URL: https://issues.apache.org/jira/browse/SPARK-3266 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.1 Reporter: Amey Chaugule While I can compile my code, I see: Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I don't notice max() although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3285) Using values.sum is easier to understand than using values.foldLeft(0)(_ + _)
[ https://issues.apache.org/jira/browse/SPARK-3285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113713#comment-14113713 ] Apache Spark commented on SPARK-3285: - User 'watermen' has created a pull request for this issue: https://github.com/apache/spark/pull/2182 Using values.sum is easier to understand than using values.foldLeft(0)(_ + _) - Key: SPARK-3285 URL: https://issues.apache.org/jira/browse/SPARK-3285 Project: Spark Issue Type: Test Components: Examples Affects Versions: 1.0.2 Reporter: Yadong Qi def sum[B >: A](implicit num: Numeric[B]): B = foldLeft(num.zero)(num.plus) Using values.sum is easier to understand than using values.foldLeft(0)(_ + _), so we'd better use values.sum instead of values.foldLeft(0)(_ + _). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3277) LZ4 compression causes an ExternalSort exception
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113741#comment-14113741 ] Mridul Muralidharan commented on SPARK-3277: This looks like unrelated changes pushed to BlockObjectWriter as part of the introduction of ShuffleWriteMetrics. I had introduced checks and also documented that we must not infer the size from the position of the stream after a flush - since close can still write data to the streams (and one flush can result in more data getting generated which need not be flushed to the streams). Apparently this logic was modified subsequently, causing this bug. The solution would be to revert the change that updates shuffleBytesWritten before the stream is closed. It must be done after close and based on file.length. LZ4 compression causes an ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: hzw Fix For: 1.1.0 I tested LZ4 compression and it ran into this problem (with wordcount). I also tested snappy and LZF, and they were OK. In the end I set spark.shuffle.spill to false to avoid the exception, but as soon as that switch is turned on again this error comes back. Exception info as follows: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.<init>(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
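To make the suggestion in the comment above concrete, a minimal sketch of measuring shuffle bytes only after the streams are closed (with file.length as the source of truth) could look like this; it is illustrative and not the actual BlockObjectWriter code.

{code}
import java.io.File

// A compression codec can emit trailing bytes on close(), so the written size must be
// read from the file after closing, not from the stream position observed after a flush().
def commitAndMeasure(file: File, closeStreams: () => Unit): Long = {
  closeStreams()  // close() may still append compressed data
  file.length()   // only now does the length reflect everything written
}
{code}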
[jira] [Commented] (SPARK-2855) pyspark test cases crashed for no reason
[ https://issues.apache.org/jira/browse/SPARK-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113769#comment-14113769 ] Matthew Farrellee commented on SPARK-2855: -- [~zhunansjtu] the link you supplied no longer works, please include the test failure in a comment on this jira pyspark test cases crashed for no reason Key: SPARK-2855 URL: https://issues.apache.org/jira/browse/SPARK-2855 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Nan Zhu I met this for several times, all scala/java test cases passed, but pyspark test cases just crashed https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17875/consoleFull -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2855) pyspark test cases crashed for no reason
[ https://issues.apache.org/jira/browse/SPARK-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113778#comment-14113778 ] Nan Zhu commented on SPARK-2855: [~joshrosen]? pyspark test cases crashed for no reason Key: SPARK-2855 URL: https://issues.apache.org/jira/browse/SPARK-2855 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Nan Zhu I met this for several times, all scala/java test cases passed, but pyspark test cases just crashed https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17875/consoleFull -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2855) pyspark test cases crashed for no reason
[ https://issues.apache.org/jira/browse/SPARK-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113776#comment-14113776 ] Nan Zhu commented on SPARK-2855: I guess they have fixed this. A Jenkins-side mistake? pyspark test cases crashed for no reason Key: SPARK-2855 URL: https://issues.apache.org/jira/browse/SPARK-2855 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Nan Zhu I met this for several times, all scala/java test cases passed, but pyspark test cases just crashed https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17875/consoleFull -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3277) LZ4 compression causes an ExternalSort exception
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hzw updated SPARK-3277: --- Description: I tested the LZ4 compression,and it come up with such problem.(with wordcount) Also I tested the snappy and LZF,and they were OK. At last I set the spark.shuffle.spill as false to avoid such exeception, but once open this switch, this error would come. It seems that if num of the words is few, wordcount will go through,but if it is a complex text ,this problem will show Exeception Info as follow: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) was: I tested the LZ4 compression,and it come up with such problem.(with wordcount) Also I tested the snappy and LZF,and they were OK. At last I set the spark.shuffle.spill as false to avoid such exeception, but once open this switch, this error would come. Exeception Info as follow: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) LZ4 compression cause the the ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: hzw Fix For: 1.1.0 I tested the LZ4 compression,and it come up with such problem.(with wordcount) Also I tested the snappy and LZF,and they were OK. At last I set the spark.shuffle.spill as false to avoid such exeception, but once open this switch, this error would come. 
It seems that if num of the words is few, wordcount will go through,but if it is a complex text ,this problem will show Exeception Info as follow: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at
[jira] [Commented] (SPARK-3277) LZ4 compression causes an ExternalSort exception
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113789#comment-14113789 ] hzw commented on SPARK-3277: Sorry,I can not understand it clearly since I'm not familiar with the code of this class. Can you point the line number of the code where it goes wrong or make a pr to fix this problem LZ4 compression cause the the ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: hzw Fix For: 1.1.0 I tested the LZ4 compression,and it come up with such problem.(with wordcount) Also I tested the snappy and LZF,and they were OK. At last I set the spark.shuffle.spill as false to avoid such exeception, but once open this switch, this error would come. It seems that if num of the words is few, wordcount will go through,but if it is a complex text ,this problem will show Exeception Info as follow: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2435) Add shutdown hook to bin/pyspark
[ https://issues.apache.org/jira/browse/SPARK-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113816#comment-14113816 ] Matthew Farrellee commented on SPARK-2435: -- i couldn't find a PR for this, and it has been a problem for me, so i've created https://github.com/apache/spark/pull/2183 Add shutdown hook to bin/pyspark Key: SPARK-2435 URL: https://issues.apache.org/jira/browse/SPARK-2435 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.1 Reporter: Andrew Or Assignee: Josh Rosen Fix For: 1.1.0 We currently never stop the SparkContext cleanly in bin/pyspark unless the user explicitly runs sc.stop(). This behavior is not consistent with bin/spark-shell, in which case Ctrl+D stops the SparkContext before quitting the shell. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2435) Add shutdown hook to bin/pyspark
[ https://issues.apache.org/jira/browse/SPARK-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113911#comment-14113911 ] Apache Spark commented on SPARK-2435: - User 'mattf' has created a pull request for this issue: https://github.com/apache/spark/pull/2183 Add shutdown hook to bin/pyspark Key: SPARK-2435 URL: https://issues.apache.org/jira/browse/SPARK-2435 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.1 Reporter: Andrew Or Assignee: Josh Rosen Fix For: 1.1.0 We currently never stop the SparkContext cleanly in bin/pyspark unless the user explicitly runs sc.stop(). This behavior is not consistent with bin/spark-shell, in which case Ctrl+D stops the SparkContext before quitting the shell. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2855) pyspark test cases crashed for no reason
[ https://issues.apache.org/jira/browse/SPARK-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113957#comment-14113957 ] Josh Rosen commented on SPARK-2855: --- Do you recall the actual exception? Was it a Py4J error (something like connection to GatewayServer failed?). It seems like we've been experiencing some flakiness in these tests and I wonder whether it's due to some system resource being exhausted, such as ephemeral ports. pyspark test cases crashed for no reason Key: SPARK-2855 URL: https://issues.apache.org/jira/browse/SPARK-2855 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Nan Zhu I met this for several times, all scala/java test cases passed, but pyspark test cases just crashed https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17875/consoleFull -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2855) pyspark test cases crashed for no reason
[ https://issues.apache.org/jira/browse/SPARK-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113965#comment-14113965 ] Nan Zhu commented on SPARK-2855: no https://github.com/apache/spark/pull/1313 search This particular failure was my fault, pyspark test cases crashed for no reason Key: SPARK-2855 URL: https://issues.apache.org/jira/browse/SPARK-2855 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Nan Zhu I met this for several times, all scala/java test cases passed, but pyspark test cases just crashed https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17875/consoleFull -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2855) pyspark test cases crashed for no reason
[ https://issues.apache.org/jira/browse/SPARK-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2855. --- Resolution: Fixed That issue should be fixed now, so I'm going to mark this JIRA as resolved. Feel free to re-open (or open a new issue) if you notice flaky PySpark tests. pyspark test cases crashed for no reason Key: SPARK-2855 URL: https://issues.apache.org/jira/browse/SPARK-2855 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Nan Zhu I met this for several times, all scala/java test cases passed, but pyspark test cases just crashed https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17875/consoleFull -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-1297: -- Attachment: spark-1297-v5.txt Upgrade HBase dependency to 0.98.0 -- Key: SPARK-1297 URL: https://issues.apache.org/jira/browse/SPARK-1297 Project: Spark Issue Type: Task Reporter: Ted Yu Assignee: Ted Yu Priority: Minor Attachments: pom.xml, spark-1297-v2.txt, spark-1297-v4.txt, spark-1297-v5.txt HBase 0.94.6 was released 11 months ago. Upgrade HBase dependency to 0.98.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113977#comment-14113977 ] Ted Yu commented on SPARK-1297: --- Patch v5 is the aggregate of the 4 commits in the pull request. Upgrade HBase dependency to 0.98.0 -- Key: SPARK-1297 URL: https://issues.apache.org/jira/browse/SPARK-1297 Project: Spark Issue Type: Task Reporter: Ted Yu Assignee: Ted Yu Priority: Minor Attachments: pom.xml, spark-1297-v2.txt, spark-1297-v4.txt, spark-1297-v5.txt HBase 0.94.6 was released 11 months ago. Upgrade HBase dependency to 0.98.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3277) LZ4 compression causes an ExternalSort exception
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-3277: --- Priority: Blocker (was: Major) LZ4 compression cause the the ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: hzw Priority: Blocker Fix For: 1.1.0 I tested the LZ4 compression,and it come up with such problem.(with wordcount) Also I tested the snappy and LZF,and they were OK. At last I set the spark.shuffle.spill as false to avoid such exeception, but once open this switch, this error would come. It seems that if num of the words is few, wordcount will go through,but if it is a complex text ,this problem will show Exeception Info as follow: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3277) LZ4 compression causes an ExternalSort exception
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-3277: --- Affects Version/s: 1.2.0 1.1.0 LZ4 compression cause the the ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: hzw Priority: Blocker Fix For: 1.1.0 I tested the LZ4 compression,and it come up with such problem.(with wordcount) Also I tested the snappy and LZF,and they were OK. At last I set the spark.shuffle.spill as false to avoid such exeception, but once open this switch, this error would come. It seems that if num of the words is few, wordcount will go through,but if it is a complex text ,this problem will show Exeception Info as follow: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114014#comment-14114014 ] Mridul Muralidharan commented on SPARK-3277: [~matei] Attaching a patch which reproduces the bug consistently. I suspect the issue is more serious than what I detailed above - spill to disk seems completely broken if I understood the assertion message correctly. Unfortunately, this is based on a few minutes of free time I could grab - so a more principled debugging session is definitely warranted ! LZ4 compression cause the the ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: hzw Priority: Blocker Fix For: 1.1.0 I tested the LZ4 compression,and it come up with such problem.(with wordcount) Also I tested the snappy and LZF,and they were OK. At last I set the spark.shuffle.spill as false to avoid such exeception, but once open this switch, this error would come. It seems that if num of the words is few, wordcount will go through,but if it is a complex text ,this problem will show Exeception Info as follow: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3277) LZ4 compression cause the the ExternalSort exception
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114022#comment-14114022 ] Mridul Muralidharan edited comment on SPARK-3277 at 8/28/14 5:37 PM: - Attached patch is against master, though I noticed similar changes in 1.1 also : but not yet verified. was (Author: mridulm80): Against master, though I noticed similar changes in 1.1 also : but not yet verified. LZ4 compression cause the the ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: hzw Priority: Blocker Fix For: 1.1.0 Attachments: test_lz4_bug.patch I tested the LZ4 compression,and it come up with such problem.(with wordcount) Also I tested the snappy and LZF,and they were OK. At last I set the spark.shuffle.spill as false to avoid such exeception, but once open this switch, this error would come. It seems that if num of the words is few, wordcount will go through,but if it is a complex text ,this problem will show Exeception Info as follow: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-3277: --- Attachment: test_lz4_bug.patch Against master, though I noticed similar changes in 1.1 also : but not yet verified. LZ4 compression cause the the ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: hzw Priority: Blocker Fix For: 1.1.0 Attachments: test_lz4_bug.patch I tested the LZ4 compression,and it come up with such problem.(with wordcount) Also I tested the snappy and LZF,and they were OK. At last I set the spark.shuffle.spill as false to avoid such exeception, but once open this switch, this error would come. It seems that if num of the words is few, wordcount will go through,but if it is a complex text ,this problem will show Exeception Info as follow: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114026#comment-14114026 ] Mridul Muralidharan commented on SPARK-3277: [~hzw] did you notice this against 1.0.2 ? I did not think the changes for consolidated shuffle were backported to that branch, [~mateiz] can comment more though. LZ4 compression cause the the ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: hzw Priority: Blocker Fix For: 1.1.0 Attachments: test_lz4_bug.patch I tested the LZ4 compression,and it come up with such problem.(with wordcount) Also I tested the snappy and LZF,and they were OK. At last I set the spark.shuffle.spill as false to avoid such exeception, but once open this switch, this error would come. It seems that if num of the words is few, wordcount will go through,but if it is a complex text ,this problem will show Exeception Info as follow: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3150) NullPointerException in Spark recovery after simultaneous fall of master and driver
[ https://issues.apache.org/jira/browse/SPARK-3150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3150. --- Resolution: Fixed Fix Version/s: 1.0.3 1.1.1 NullPointerException in Spark recovery after simultaneous fall of master and driver --- Key: SPARK-3150 URL: https://issues.apache.org/jira/browse/SPARK-3150 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Environment: Linux 3.2.0-23-generic x86_64 Reporter: Tatiana Borisova Fix For: 1.1.1, 1.0.3 The issue happens when Spark is run standalone on a cluster. When master and driver fall simultaneously on one node in a cluster, master tries to recover its state and restart spark driver. While restarting driver, it falls with NPE exception (stacktrace is below). After falling, it restarts and tries to recover its state and restart Spark driver again. It happens over and over in an infinite cycle. Namely, Spark tries to read DriverInfo state from zookeeper, but after reading it happens to be null in DriverInfo.worker. Stacktrace (on version 1.0.0, but reproduceable on version 1.0.2, too) 2014-08-14 21:44:59,519] ERROR (akka.actor.OneForOneStrategy) java.lang.NullPointerException at org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448) at org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448) at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263) at scala.collection.AbstractTraversable.filter(Traversable.scala:105) at org.apache.spark.deploy.master.Master.completeRecovery(Master.scala:448) at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:376) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) How to reproduce: kill all Spark processes when running Spark standalone on a cluster on some cluster node, where driver runs (kill driver, master and worker simultaneously). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3277: -- Fix Version/s: (was: 1.1.0) LZ4 compression cause the the ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: hzw Priority: Blocker Attachments: test_lz4_bug.patch I tested the LZ4 compression,and it come up with such problem.(with wordcount) Also I tested the snappy and LZF,and they were OK. At last I set the spark.shuffle.spill as false to avoid such exeception, but once open this switch, this error would come. It seems that if num of the words is few, wordcount will go through,but if it is a complex text ,this problem will show Exeception Info as follow: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation
[ https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114055#comment-14114055 ] Reynold Xin commented on SPARK-3280: [~joshrosen] [~brkyvz] can you guys post the performance comparisons between sort vs hash shuffle in this ticket? Made sort-based shuffle the default implementation -- Key: SPARK-3280 URL: https://issues.apache.org/jira/browse/SPARK-3280 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Reynold Xin sort-based shuffle has lower memory usage and seems to outperform hash-based in almost all of our testing. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3264) Allow users to set executor Spark home in Mesos
[ https://issues.apache.org/jira/browse/SPARK-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-3264. -- Resolution: Fixed Allow users to set executor Spark home in Mesos --- Key: SPARK-3264 URL: https://issues.apache.org/jira/browse/SPARK-3264 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.2 Reporter: Andrew Or Assignee: Andrew Or There is an existing way to do this, through spark.home. However, this is neither documented nor intuitive. I propose that we add a more specific config spark.mesos.executor.home for this purpose, and fallback to the existing settings if this is not set. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
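The proposed lookup order amounts to something like the sketch below; the helper name is illustrative and this is not the actual patch, only the fallback chain described above (new spark.mesos.executor.home first, then the existing spark.home setting).
{code}
// Sketch only: resolve the executor's Spark home with a fallback to the older setting.
import org.apache.spark.SparkConf

def executorSparkHome(conf: SparkConf): Option[String] =
  conf.getOption("spark.mesos.executor.home")
    .orElse(conf.getOption("spark.home"))
{code}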
[jira] [Updated] (SPARK-2608) Mesos doesn't handle spark.executor.extraJavaOptions correctly (among other things)
[ https://issues.apache.org/jira/browse/SPARK-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2608: - Fix Version/s: 1.1.0 Mesos doesn't handle spark.executor.extraJavaOptions correctly (among other things) --- Key: SPARK-2608 URL: https://issues.apache.org/jira/browse/SPARK-2608 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: wangfei Priority: Blocker Fix For: 1.1.0 mesos scheduler backend use spark-class/spark-executor to launch executor backend, this will lead to problems: 1 when set spark.executor.extraJavaOptions CoarseMesosSchedulerBackend will throw error 2 spark.executor.extraJavaOptions and spark.executor.extraLibraryPath set in sparkconf will not be valid -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3264) Allow users to set executor Spark home in Mesos
[ https://issues.apache.org/jira/browse/SPARK-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3264: - Fix Version/s: 1.1.0 Allow users to set executor Spark home in Mesos --- Key: SPARK-3264 URL: https://issues.apache.org/jira/browse/SPARK-3264 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.2 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.1.0 There is an existing way to do this, through spark.home. However, this is neither documented nor intuitive. I propose that we add a more specific config spark.mesos.executor.home for this purpose, and fallback to the existing settings if this is not set. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation
[ https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114092#comment-14114092 ] Josh Rosen commented on SPARK-3280: --- Here are some numbers from August 10. If I recall, this was running on 8 m3.8xlarge nodes. This test linearly scales a bunch of parameters (data set size, numbers of mappers and reducers, etc). You can see that hash-based shuffle's performance degrades severely in cases where we have many mappers and reducers, while sort scales much more gracefully: !http://i.imgur.com/rODzaG1.png! !http://i.imgur.com/72kCkH5.png! This was run with spark-perf; here's a sample config for one of the bars: {code} Java options: -Dspark.storage.memoryFraction=0.66 -Dspark.serializer=org.apache.spark.serializer.JavaSerializer -Dspark.locality.wait=6000 -Dspark.shuffle.manager=org.apache.spark.shuffle.hash.HashShuffleManager Options: aggregate-by-key-naive --num-trials=10 --inter-trial-wait=3 --num-partitions=400 --reduce-tasks=400 --random-seed=5 --persistent-type=memory --num-records=2 --unique-keys=2 --key-length=10 --unique-values=100 --value-length=10 --storage-location=hdfs://:9000/spark-perf-kv-data {code} I'll try to run a better set of tests today. I plan to look at a few cases that these tests didn't address, including the performance impact when running on spinning disks, as well as jobs where we have a large dataset with few mappers and reducers (I think this is the case that we'd expect to be most favorable to hash-based shuffle). Made sort-based shuffle the default implementation -- Key: SPARK-3280 URL: https://issues.apache.org/jira/browse/SPARK-3280 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Reynold Xin sort-based shuffle has lower memory usage and seems to outperform hash-based in almost all of our testing. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
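For readers reproducing a comparison like this, the shuffle implementation is selected with the spark.shuffle.manager setting shown in the config above; a minimal sketch using the fully qualified class names:
{code}
// Sketch: two SparkConf variants for a hash-vs-sort shuffle comparison.
import org.apache.spark.SparkConf

val hashConf = new SparkConf()
  .set("spark.shuffle.manager", "org.apache.spark.shuffle.hash.HashShuffleManager")
val sortConf = new SparkConf()
  .set("spark.shuffle.manager", "org.apache.spark.shuffle.sort.SortShuffleManager")
{code}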
[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114113#comment-14114113 ] Ted Yu commented on SPARK-1297: --- Here is sample command for building against 0.98 hbase: mvn -Dhbase.profile=hadoop2 -Phadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests clean package Upgrade HBase dependency to 0.98.0 -- Key: SPARK-1297 URL: https://issues.apache.org/jira/browse/SPARK-1297 Project: Spark Issue Type: Task Reporter: Ted Yu Assignee: Ted Yu Priority: Minor Attachments: pom.xml, spark-1297-v2.txt, spark-1297-v4.txt, spark-1297-v5.txt HBase 0.94.6 was released 11 months ago. Upgrade HBase dependency to 0.98.0 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3272) Calculate prediction for nodes separately from calculating information gain for splits in decision tree
[ https://issues.apache.org/jira/browse/SPARK-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114166#comment-14114166 ] Joseph K. Bradley commented on SPARK-3272: -- With respect to [SPARK-2207], I think this JIRA may or may not be necessary for implementing [SPARK-2207], depending on how the code is set up. For [SPARK-2207], I imagined checking the number of instances and the information gain when the Node is constructed in the main loop (in the train() method). If there are too few instances or too little information gain, then the Node will be set as a leaf. We could potentially avoid the aggregation for those leafs, but I would consider that a separate issue ([SPARK-3158]). Calculate prediction for nodes separately from calculating information gain for splits in decision tree --- Key: SPARK-3272 URL: https://issues.apache.org/jira/browse/SPARK-3272 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.2 Reporter: Qiping Li Fix For: 1.1.0 In current implementation, prediction for a node is calculated along with calculation of information gain stats for each possible splits. The value to predict for a specific node is determined, no matter what the splits are. To save computation, we can first calculate prediction first and then calculate information gain stats for each split. This is also necessary if we want to support minimum instances per node parameters([SPARK-2207|https://issues.apache.org/jira/browse/SPARK-2207]) because when all splits don't satisfy minimum instances requirement , we don't use information gain of any splits. There should be a way to get the prediction value. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
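The check described in the comment can be sketched as follows; the names here are hypothetical and are not MLlib's actual classes:
{code}
// Illustrative sketch: decide leaf-ness from instance count and information gain
// at node-construction time, independently of which split turned out best.
case class NodeStats(numInstances: Long, bestInfoGain: Double, prediction: Double)

def shouldBeLeaf(stats: NodeStats, minInstancesPerNode: Long, minInfoGain: Double): Boolean =
  stats.numInstances < minInstancesPerNode || stats.bestInfoGain < minInfoGain
{code}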
[jira] [Commented] (SPARK-2475) Check whether #cores > #receivers in local mode
[ https://issues.apache.org/jira/browse/SPARK-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114204#comment-14114204 ] Chris Fregly commented on SPARK-2475: - another option for the examples, specifically, is to default the number of local threads similar to how the Kinesis example does it: https://github.com/apache/spark/blob/ae58aea2d1435b5bb011e68127e1bcddc2edf5b2/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L104 i get the number of shards in the given Kinesis stream and add 1. the goal was to make this example work out of the box with little friction - even an error message can be discouraging. for the other examples, we could just default to 2. the advanced user can override if they want. though i don't think i support an override in my kinesis example. whoops! :) Check whether #cores > #receivers in local mode --- Key: SPARK-2475 URL: https://issues.apache.org/jira/browse/SPARK-2475 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das When the number of slots in local mode is not more than the number of receivers, then the system should throw an error. Otherwise the system just keeps waiting for resources to process the received data. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
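The defaulting idea amounts to something like the sketch below (hypothetical helper; the real Kinesis example reads the shard count from the stream itself):
{code}
// Sketch: one local thread per receiver (shard), plus one to process the received data.
def defaultLocalMaster(numShards: Int): String = {
  val numThreads = numShards + 1
  s"local[$numThreads]"
}
// e.g. new SparkConf().setMaster(defaultLocalMaster(numShards))
{code}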
[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114247#comment-14114247 ] Matei Zaharia commented on SPARK-3277: -- Thanks Mridul -- I think Andrew and Patrick have figured this out. LZ4 compression cause the the ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: hzw Priority: Blocker Attachments: test_lz4_bug.patch I tested the LZ4 compression,and it come up with such problem.(with wordcount) Also I tested the snappy and LZF,and they were OK. At last I set the spark.shuffle.spill as false to avoid such exeception, but once open this switch, this error would come. It seems that if num of the words is few, wordcount will go through,but if it is a complex text ,this problem will show Exeception Info as follow: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3287) When ResourceManager High Availability is enabled, ApplicationMaster webUI is not displayed.
[ https://issues.apache.org/jira/browse/SPARK-3287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony updated SPARK-3287: Description: When ResourceManager High Availability is enabled, there will be multiple resource managers and each of them could act as a proxy. AmIpFilter is modified to accept multiple proxy hosts. But Spark ApplicationMaster fails to read the ResourceManager IPs properly from the configuration. So AmIpFilter is initialized with an empty set of proxy hosts. So any access to the ApplicationMaster WebUI will be redirected to port RM port on the local host. was: When ResourceManager High Availability is enabled, there will be multiple resource managers and each of them could act as a proxy. AmIpFilter is modified to accept multiple proxy hosts. But Spark ApplicationMaster fails read the ResourceManager IPs properly from the configuration. So AmIpFilter is initialized with an empty set of proxy hosts. So any access to the ApplicationMaster WebUI will be redirected to port RM port on the local host. When ResourceManager High Availability is enabled, ApplicationMaster webUI is not displayed. Key: SPARK-3287 URL: https://issues.apache.org/jira/browse/SPARK-3287 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.0.2 Reporter: Benoy Antony Attachments: SPARK-3287.patch When ResourceManager High Availability is enabled, there will be multiple resource managers and each of them could act as a proxy. AmIpFilter is modified to accept multiple proxy hosts. But Spark ApplicationMaster fails to read the ResourceManager IPs properly from the configuration. So AmIpFilter is initialized with an empty set of proxy hosts. So any access to the ApplicationMaster WebUI will be redirected to port RM port on the local host. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
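A sketch of the kind of lookup the description implies, reading every ResourceManager's web address when HA is enabled, using standard YARN configuration keys; this is not the attached Spark patch:
{code}
// Illustrative only: collect the proxy hosts AmIpFilter needs under RM HA.
import org.apache.hadoop.conf.Configuration

def rmWebAddresses(conf: Configuration): Seq[String] = {
  val rmIds = Option(conf.get("yarn.resourcemanager.ha.rm-ids"))
    .map(_.split(",").map(_.trim).toSeq)
    .getOrElse(Seq.empty)
  if (rmIds.isEmpty) Option(conf.get("yarn.resourcemanager.webapp.address")).toSeq
  else rmIds.flatMap(id => Option(conf.get(s"yarn.resourcemanager.webapp.address.$id")))
}
{code}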
[jira] [Created] (SPARK-3288) All fields in TaskMetrics should be private and use getters/setters
Patrick Wendell created SPARK-3288: -- Summary: All fields in TaskMetrics should be private and use getters/setters Key: SPARK-3288 URL: https://issues.apache.org/jira/browse/SPARK-3288 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Andrew Or This is particularly bad because we expose this as a developer API. Technically a library could create a TaskMetrics object and then change the values inside of it and pass it onto someone else. It can be written pretty compactly like below: {code} /** * Number of bytes written for the shuffle by this task */ @volatile private var _shuffleBytesWritten: Long = _ def incrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten += value def decrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten -= value def shuffleBytesWritten = _shuffleBytesWritten {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
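To illustrate why the ticket wants this (a sketch, not Spark's actual TaskMetrics): with a public var any holder of the object can overwrite a metric, whereas the accessor pattern above confines mutation to named methods.
{code}
// Risky: any caller can silently change the value.
class OpenMetrics { var shuffleBytesWritten: Long = 0L }

// Pattern from the description: private field, explicit mutator, read-only getter.
class GuardedMetrics {
  @volatile private var _shuffleBytesWritten: Long = 0L
  def incrementShuffleBytesWritten(value: Long): Unit = _shuffleBytesWritten += value
  def shuffleBytesWritten: Long = _shuffleBytesWritten
}
{code}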
[jira] [Updated] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin B. updated SPARK-3266: Attachment: spark-repro-3266.tar.gz I have attached a simple java project which reproduces the issue. [^spark-repro-3266.tar.gz] {code} tar xvzf spark-repro-3266.tar.gz ... cd spark-repro-3266 mvn clean package /path/to/spark-1.0.2-bin-hadoop2/bin/spark-submit --class SimpleApp target/testcase-4-1.0.jar ... Exception in thread main java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; at SimpleApp.main(SimpleApp.java:17) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} JavaDoubleRDD doesn't contain max() --- Key: SPARK-3266 URL: https://issues.apache.org/jira/browse/SPARK-3266 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.1 Reporter: Amey Chaugule Attachments: spark-repro-3266.tar.gz While I can compile my code, I see: Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I don't notice max() although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-3266: - Assignee: Josh Rosen JavaDoubleRDD doesn't contain max() --- Key: SPARK-3266 URL: https://issues.apache.org/jira/browse/SPARK-3266 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.1 Reporter: Amey Chaugule Assignee: Josh Rosen Attachments: spark-repro-3266.tar.gz While I can compile my code, I see: Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I don't notice max() although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3281) Remove Netty specific code in BlockManager
[ https://issues.apache.org/jira/browse/SPARK-3281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3281. Resolution: Fixed Fix Version/s: 1.2.0 Remove Netty specific code in BlockManager -- Key: SPARK-3281 URL: https://issues.apache.org/jira/browse/SPARK-3281 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.2.0 Everything should go through the BlockTransferService interface rather than having conditional branches for Netty. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3285) Using values.sum is easier to understand than using values.foldLeft(0)(_ + _)
[ https://issues.apache.org/jira/browse/SPARK-3285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3285. Resolution: Fixed Fix Version/s: 1.2.0 Using values.sum is easier to understand than using values.foldLeft(0)(_ + _) - Key: SPARK-3285 URL: https://issues.apache.org/jira/browse/SPARK-3285 Project: Spark Issue Type: Test Components: Examples Affects Versions: 1.0.2 Reporter: Yadong Qi Fix For: 1.2.0 def sum[B >: A](implicit num: Numeric[B]): B = foldLeft(num.zero)(num.plus) Using values.sum is easier to understand than using values.foldLeft(0)(_ + _), so we'd better use values.sum instead of values.foldLeft(0)(_ + _) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
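The equivalence is easy to check in a Scala REPL:
{code}
val values = Seq(1, 2, 3, 4)
assert(values.foldLeft(0)(_ + _) == values.sum)  // both are 10
{code}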
[jira] [Updated] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3266: -- Affects Version/s: 1.0.2 JavaDoubleRDD doesn't contain max() --- Key: SPARK-3266 URL: https://issues.apache.org/jira/browse/SPARK-3266 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.1, 1.0.2 Reporter: Amey Chaugule Assignee: Josh Rosen Attachments: spark-repro-3266.tar.gz While I can compile my code, I see: Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I don't notice max() although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114362#comment-14114362 ] Josh Rosen commented on SPARK-3266: --- Thanks for the reproduction! I tried it myself and see the same issue. If I replace {code} JavaDoubleRDD javaDoubleRDD = sc.parallelizeDoubles(numbers); {code} with {code} JavaRDDLike<Double, ?> javaDoubleRDD = sc.parallelizeDoubles(numbers); {code} then it seems to work. I'll take a closer look using {{javap}} to see if I can figure out why this is happening. JavaDoubleRDD doesn't contain max() --- Key: SPARK-3266 URL: https://issues.apache.org/jira/browse/SPARK-3266 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.1, 1.0.2 Reporter: Amey Chaugule Assignee: Josh Rosen Attachments: spark-repro-3266.tar.gz While I can compile my code, I see: Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I don't notice max() although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3183) Add option for requesting full YARN cluster
[ https://issues.apache.org/jira/browse/SPARK-3183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114368#comment-14114368 ] Shay Rojansky commented on SPARK-3183: -- +1. As a current workaround for cores, we specify a number well beyond the YARN cluster capacity. This gets handled well by Spark/YARN, and we get the entire cluster. Add option for requesting full YARN cluster --- Key: SPARK-3183 URL: https://issues.apache.org/jira/browse/SPARK-3183 Project: Spark Issue Type: Improvement Components: YARN Reporter: Sandy Ryza This could possibly be in the form of --executor-cores ALL --executor-memory ALL --num-executors ALL. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114387#comment-14114387 ] Colin B. commented on SPARK-3266: - So there is no method: {code} org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; {code} but there is a method: {code} org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Object; {code} I've heard that the return type is part of the type signature in java bytecode, so the two are different. (one returns a Double, the other an Object) This looks a bit like a scala type erasure related issue. The spark/scala code generated for JavaRDDLike includes a max method that returns an object. In JavaDoubleRDD the type is bounded to Double, so java code which calls max on JavaDoubleRDD expects a method returning Double. Since the code for max is implemented in the JavaRDDLike trait, the java code doesn't seem to inherit it correctly when types are involved. I tested making JavaRDDLike an abstract class instead of a trait. It was able to compile and run correctly. However it is not compatible with 1.0.2. JavaDoubleRDD doesn't contain max() --- Key: SPARK-3266 URL: https://issues.apache.org/jira/browse/SPARK-3266 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.1, 1.0.2 Reporter: Amey Chaugule Assignee: Josh Rosen Attachments: spark-repro-3266.tar.gz While I can compile my code, I see: Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I don't notice max() although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
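The shape of the problem can be sketched with hypothetical classes (not Spark's actual ones); compiling this and inspecting the output with javap shows which max signatures each class exposes:
{code}
// Illustrative sketch of a trait method whose return type is the type parameter,
// mixed into classes that fix the parameter to java.lang.Double.
import java.util.Comparator

trait RDDLike[T] {
  def max(comp: Comparator[T]): T = throw new UnsupportedOperationException("stub")
}

// Trait parent: the shape Colin reports as broken for Java callers.
class DoubleRDDFromTrait extends RDDLike[java.lang.Double]

// Abstract-class parent: the shape Colin reports as compiling and running correctly.
abstract class AbstractRDDLike[T] extends RDDLike[T]
class DoubleRDDFromAbstractClass extends AbstractRDDLike[java.lang.Double]
{code}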
[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114391#comment-14114391 ] Sean Owen commented on SPARK-3266: -- (Mea culpa! The example shows this is a legitimate question. I'll be quiet now.) JavaDoubleRDD doesn't contain max() --- Key: SPARK-3266 URL: https://issues.apache.org/jira/browse/SPARK-3266 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.1, 1.0.2 Reporter: Amey Chaugule Assignee: Josh Rosen Attachments: spark-repro-3266.tar.gz While I can compile my code, I see: Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I don't notice max() although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114413#comment-14114413 ] Amey Chaugule commented on SPARK-3266: -- No worries, I initially assumed my runtime env was old too until i rechecked. JavaDoubleRDD doesn't contain max() --- Key: SPARK-3266 URL: https://issues.apache.org/jira/browse/SPARK-3266 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.1, 1.0.2, 1.1.0 Reporter: Amey Chaugule Assignee: Josh Rosen Attachments: spark-repro-3266.tar.gz While I can compile my code, I see: Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I don't notice max() although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3260) Yarn - pass acls along with executor launch
[ https://issues.apache.org/jira/browse/SPARK-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114408#comment-14114408 ] Apache Spark commented on SPARK-3260: - User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/2185 Yarn - pass acls along with executor launch --- Key: SPARK-3260 URL: https://issues.apache.org/jira/browse/SPARK-3260 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Thomas Graves Assignee: Thomas Graves In https://github.com/apache/spark/pull/1196 I added passing the spark view and modify acls into yarn. Unfortunately we are only passing them into the application master and I missed passing them in when we launch individual containers (executors). We need to modify the ExecutorRunnable.startContainer to set the acls in the ContainerLaunchContext. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3266: -- Affects Version/s: 1.1.0 JavaRDDLike probably should be an abstract class. I think the current trait implementation was a holdover from an earlier prototype that attempted to achieve higher code reuse for operations like map() and filter(). I added a test case to JavaAPISuite that reproduces this issue on master, too. The simplest solution is probably to make JavaRDDLike into an abstract class. I think we can do this while maintaining source compatibility. A less invasive but messier solution would be to just copy the implementation of max() and min() into each Java*RDD class and remove it from the trait. JavaDoubleRDD doesn't contain max() --- Key: SPARK-3266 URL: https://issues.apache.org/jira/browse/SPARK-3266 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.1, 1.0.2, 1.1.0 Reporter: Amey Chaugule Assignee: Josh Rosen Attachments: spark-repro-3266.tar.gz While I can compile my code, I see: Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I don't notice max() although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3277: --- Assignee: Andrew Or LZ4 compression cause the the ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: hzw Assignee: Andrew Or Priority: Blocker Attachments: test_lz4_bug.patch I tested the LZ4 compression,and it come up with such problem.(with wordcount) Also I tested the snappy and LZF,and they were OK. At last I set the spark.shuffle.spill as false to avoid such exeception, but once open this switch, this error would come. It seems that if num of the[ words is few, wordcount will go through,but if it is a complex text ,this problem will show Exeception Info as follow: {code} java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3277: --- Description: I tested the LZ4 compression,and it come up with such problem.(with wordcount) Also I tested the snappy and LZF,and they were OK. At last I set the spark.shuffle.spill as false to avoid such exeception, but once open this switch, this error would come. It seems that if num of the[ words is few, wordcount will go through,but if it is a complex text ,this problem will show Exeception Info as follow: {code} java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {code} was: I tested the LZ4 compression,and it come up with such problem.(with wordcount) Also I tested the snappy and LZF,and they were OK. At last I set the spark.shuffle.spill as false to avoid such exeception, but once open this switch, this error would come. It seems that if num of the words is few, wordcount will go through,but if it is a complex text ,this problem will show Exeception Info as follow: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) LZ4 compression cause the the ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: hzw Priority: Blocker Attachments: test_lz4_bug.patch I tested the LZ4 compression,and it come up with such problem.(with wordcount) Also I tested the snappy and LZF,and they were OK. 
At last I set the spark.shuffle.spill as false to avoid such exeception, but once open this switch, this error would come. It seems that if num of the words is few, wordcount will go through,but if it is a complex text ,this problem will show Exeception Info as follow: {code} java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https
[ https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony updated SPARK-3286: Attachment: SPARK-3286.patch Attaching the patch for the master Cannot view ApplicationMaster UI when Yarn’s url scheme is https Key: SPARK-3286 URL: https://issues.apache.org/jira/browse/SPARK-3286 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Benoy Antony Attachments: SPARK-3286.patch, SPARK-3286.patch The spark Application Master starts its web UI at http://host-name:port. When Spark ApplicationMaster registers its URL with Resource Manager , the URL does not contain URI scheme. If the URL scheme is absent, Resource Manager’s web app proxy will use the HTTP Policy of the Resource Manager.(YARN-1553) If the HTTP Policy of the Resource Manager is https, then web app proxy will try to access https://host-name:port. This will result in error. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
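The implied fix is to register a scheme-qualified tracking URL rather than a bare host:port, so the RM web proxy does not have to guess from its own HTTP policy; roughly (illustrative, not the attached patch):
{code}
// Sketch: include the AM UI's own scheme when registering the tracking URL.
def amTrackingUrl(host: String, port: Int): String = s"http://$host:$port"
// previously the registered value was effectively s"$host:$port"
{code}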
[jira] [Updated] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https
[ https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony updated SPARK-3286: Attachment: SPARK-3286-branch-1-0.patch Cannot view ApplicationMaster UI when Yarn’s url scheme is https Key: SPARK-3286 URL: https://issues.apache.org/jira/browse/SPARK-3286 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Benoy Antony Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch The spark Application Master starts its web UI at http://host-name:port. When Spark ApplicationMaster registers its URL with Resource Manager , the URL does not contain URI scheme. If the URL scheme is absent, Resource Manager’s web app proxy will use the HTTP Policy of the Resource Manager.(YARN-1553) If the HTTP Policy of the Resource Manager is https, then web app proxy will try to access https://host-name:port. This will result in error. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https
[ https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony updated SPARK-3286: Attachment: (was: SPARK-3286.patch) Cannot view ApplicationMaster UI when Yarn’s url scheme is https Key: SPARK-3286 URL: https://issues.apache.org/jira/browse/SPARK-3286 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Benoy Antony Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch The spark Application Master starts its web UI at http://host-name:port. When Spark ApplicationMaster registers its URL with Resource Manager , the URL does not contain URI scheme. If the URL scheme is absent, Resource Manager’s web app proxy will use the HTTP Policy of the Resource Manager.(YARN-1553) If the HTTP Policy of the Resource Manager is https, then web app proxy will try to access https://host-name:port. This will result in error. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114453#comment-14114453 ] Apache Spark commented on SPARK-3266: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/2186 JavaDoubleRDD doesn't contain max() --- Key: SPARK-3266 URL: https://issues.apache.org/jira/browse/SPARK-3266 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.1, 1.0.2, 1.1.0 Reporter: Amey Chaugule Assignee: Josh Rosen Attachments: spark-repro-3266.tar.gz While I can compile my code, I see: Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I don't notice max() although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3190) Creation of large graph( 2.15 B nodes) seems to be broken:possible overflow somewhere
[ https://issues.apache.org/jira/browse/SPARK-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3190. --- Resolution: Fixed Fix Version/s: 1.0.3 1.1.1 1.2.0 Issue resolved by pull request 2106 [https://github.com/apache/spark/pull/2106] Creation of large graph( 2.15 B nodes) seems to be broken:possible overflow somewhere --- Key: SPARK-3190 URL: https://issues.apache.org/jira/browse/SPARK-3190 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.3 Environment: Standalone mode running on EC2. Using latest code from master branch up to commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6. Reporter: npanj Assignee: Ankur Dave Priority: Critical Fix For: 1.2.0, 1.1.1, 1.0.3 While creating a graph with 6B nodes and 12B edges, I noticed that the 'numVertices' API returns an incorrect result; 'numEdges' reports the correct number. A few times (with a different dataset of 2.5B nodes) I have also noticed that numVertices comes back as a negative number, so I suspect there is some overflow (maybe we are using Int for some field?). Here are some details of the experiments I have done so far: 1. Input: numNodes=6101995593 ; noEdges=12163784626 Graph returns: numVertices=1807028297 ; numEdges=12163784626 2. Input : numNodes=2157586441 ; noEdges=2747322705 Graph Returns: numVertices=-2137380855 ; numEdges=2747322705 3. Input: numNodes=1725060105 ; noEdges=204176821 Graph: numVertices=1725060105 ; numEdges=2041768213 You can find the code to generate this bug here: https://gist.github.com/npanj/92e949d86d08715bf4bf Note: Nodes are labeled 1...6B. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
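The reported values are consistent with a 64-bit count being narrowed to a 32-bit Int somewhere, which is the overflow the reporter suspects. A self-contained Scala check (not GraphX code) reproduces both odd numbers above, since 2^32 = 4294967296:
{code}
// Truncating a Long count to Int reproduces the reported results exactly.
def truncateToInt(n: Long): Int = n.toInt

assert(truncateToInt(6101995593L) == 1807028297)   // experiment 1: 6101995593 - 4294967296
assert(truncateToInt(2157586441L) == -2137380855)  // experiment 2: wraps past Int.MaxValue
// The remedy is to carry the vertex count as a Long end to end.
{code}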
[jira] [Created] (SPARK-3289) Prevent complete job failures due to rescheduling of failing tasks on buggy machines
Josh Rosen created SPARK-3289: - Summary: Prevent complete job failures due to rescheduling of failing tasks on buggy machines Key: SPARK-3289 URL: https://issues.apache.org/jira/browse/SPARK-3289 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Some users have reported issues where a task fails due to an environment / configuration issue on some machine, then the task is reattempted _on that same buggy machine_ until the entire job fails because that single task has failed too many times. To guard against this, maybe we should add some randomization in how we reschedule failed tasks. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
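For illustration, a minimal sketch of the randomization idea (the names are hypothetical, not Spark's scheduler API): when re-queuing a failed task, prefer executors it has not already failed on and pick among them at random rather than deterministically landing back on the same host.
{code}
import scala.util.Random

// Hypothetical helper: choose where to retry a task, avoiding executors it
// already failed on when possible, otherwise picking at random.
def pickExecutor(executors: Seq[String], failedOn: Set[String]): Option[String] = {
  val fresh = executors.filterNot(failedOn)
  val pool = if (fresh.nonEmpty) fresh else executors
  if (pool.isEmpty) None else Some(pool(Random.nextInt(pool.size)))
}
{code}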
[jira] [Updated] (SPARK-3289) Avoid job failures due to rescheduling of failing tasks on buggy machines
[ https://issues.apache.org/jira/browse/SPARK-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3289: -- Summary: Avoid job failures due to rescheduling of failing tasks on buggy machines (was: Prevent complete job failures due to rescheduling of failing tasks on buggy machines) Avoid job failures due to rescheduling of failing tasks on buggy machines - Key: SPARK-3289 URL: https://issues.apache.org/jira/browse/SPARK-3289 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Some users have reported issues where a task fails due to an environment / configuration issue on some machine, then the task is reattempted _on that same buggy machine_ until the entire job fails because that single task has failed too many times. To guard against this, maybe we should add some randomization in how we reschedule failed tasks. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114484#comment-14114484 ] Mridul Muralidharan commented on SPARK-3277: Sounds great, thanks! I suspect it is because for lzo we configure it to write out the block on flush (a partial block if there is insufficient data to fill it), but for lz4 either such a configuration does not exist or we don't use it. As a result, flush becomes a no-op when the data in the current block is insufficient to produce a compressed block, while close forces the partial block to be written out. That is why the assertion lists all sizes as 0. LZ4 compression cause the the ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: hzw Assignee: Andrew Or Priority: Blocker Attachments: test_lz4_bug.patch I tested LZ4 compression and it ran into this problem (with wordcount). I also tested Snappy and LZF, and they were OK. In the end I set spark.shuffle.spill to false to avoid the exception, but once that switch is turned back on, the error returns. It seems that if the number of words is small, wordcount goes through, but with a more complex text the problem shows up. Exception info as follows: {code} java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
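A small self-contained probe of that hypothesis, assuming lz4-java's LZ4BlockOutputStream (the stream Spark's LZ4 codec wraps) is on the classpath; whether flush() emits the partial block depends on the lz4-java version, which is exactly the behaviour being discussed:
{code}
import java.io.ByteArrayOutputStream
import net.jpountz.lz4.LZ4BlockOutputStream

// Write a little data, flush, and see whether any compressed bytes have
// reached the underlying stream yet; close always writes out the partial block.
val sink = new ByteArrayOutputStream()
val lz4 = new LZ4BlockOutputStream(sink) // default 64 KB block size
lz4.write("a few words".getBytes("UTF-8"))
lz4.flush()
println(s"bytes after flush: ${sink.size()}") // 0 if flush does not emit partial blocks
lz4.close()
println(s"bytes after close: ${sink.size()}") // > 0
{code}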
[jira] [Commented] (SPARK-3272) Calculate prediction for nodes separately from calculating information gain for splits in decision tree
[ https://issues.apache.org/jira/browse/SPARK-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114591#comment-14114591 ] Qiping Li commented on SPARK-3272: -- Hi Joseph, thanks for your comment. I think checking the number of instances can't be done in the train() method, because we don't know the number of instances for the left or right split; for each split we can only get information from InformationGainStats, which doesn't contain the number of instances. In my implementation of SPARK-2207 the check is done in calculateGainForSplit: when the check fails, it returns an invalid information gain, and the calculation of the predict value may be skipped in that case. Alternatively, we could include the number of instances for the left and right splits in the information gain stats and calculate the predict value regardless of whether the check passes. Either approach is fine with me. Calculate prediction for nodes separately from calculating information gain for splits in decision tree --- Key: SPARK-3272 URL: https://issues.apache.org/jira/browse/SPARK-3272 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.2 Reporter: Qiping Li Fix For: 1.1.0 In the current implementation, the prediction for a node is calculated along with the information gain stats for each possible split. The value to predict for a specific node is determined no matter what the splits are. To save computation, we can calculate the prediction first and then calculate the information gain stats for each split. This is also necessary if we want to support a minimum-instances-per-node parameter ([SPARK-2207|https://issues.apache.org/jira/browse/SPARK-2207]), because when no split satisfies the minimum-instances requirement we don't use the information gain of any split, yet there should still be a way to get the prediction value. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
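As a sketch of the option being discussed (the class and field names below are illustrative, not MLlib's actual InformationGainStats), carrying instance counts in the gain stats lets the minimum-instances check be applied independently of the gain, while the node's prediction stays available even when every candidate split is rejected:
{code}
// Illustrative only; not MLlib's real classes.
case class GainStats(gain: Double, leftCount: Long, rightCount: Long, predict: Double)

def splitIsValid(stats: GainStats, minInstancesPerNode: Long): Boolean =
  stats.leftCount >= minInstancesPerNode && stats.rightCount >= minInstancesPerNode

// If no split is valid, the node is kept as a leaf and stats.predict is still usable.
{code}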
[jira] [Commented] (SPARK-3289) Avoid job failures due to rescheduling of failing tasks on buggy machines
[ https://issues.apache.org/jira/browse/SPARK-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114611#comment-14114611 ] Mark Hamstra commented on SPARK-3289: - https://github.com/apache/spark/pull/1360 Avoid job failures due to rescheduling of failing tasks on buggy machines - Key: SPARK-3289 URL: https://issues.apache.org/jira/browse/SPARK-3289 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Some users have reported issues where a task fails due to an environment / configuration issue on some machine, then the task is reattempted _on that same buggy machine_ until the entire job fails because that single task has failed too many times. To guard against this, maybe we should add some randomization in how we reschedule failed tasks. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3287) When ResourceManager High Availability is enabled, ApplicationMaster webUI is not displayed.
[ https://issues.apache.org/jira/browse/SPARK-3287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114612#comment-14114612 ] Benoy Antony commented on SPARK-3287: - I'll submit a git pull request. When ResourceManager High Availability is enabled, ApplicationMaster webUI is not displayed. Key: SPARK-3287 URL: https://issues.apache.org/jira/browse/SPARK-3287 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.0.2 Reporter: Benoy Antony Attachments: SPARK-3287.patch When ResourceManager High Availability is enabled, there are multiple ResourceManagers, and each of them can act as a proxy. AmIpFilter has been modified to accept multiple proxy hosts, but the Spark ApplicationMaster fails to read the ResourceManager IPs properly from the configuration, so AmIpFilter is initialized with an empty set of proxy hosts. As a result, any access to the ApplicationMaster web UI is redirected to the RM port on the local host. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
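For context, a hypothetical helper (not the attached patch) showing the direction of the fix: under RM HA, every ResourceManager's web app address should be collected from the standard YARN HA keys so the AmIpFilter is initialised with the full set of proxy hosts rather than an empty one.
{code}
import org.apache.hadoop.conf.Configuration

// Hypothetical helper: gather one web app address per RM id under HA,
// falling back to the single non-HA address otherwise.
def proxyHosts(conf: Configuration): Seq[String] = {
  val rmIds = conf.getStrings("yarn.resourcemanager.ha.rm-ids")
  if (rmIds == null || rmIds.isEmpty) {
    Seq(conf.get("yarn.resourcemanager.webapp.address"))
  } else {
    rmIds.toSeq.map(id => conf.get(s"yarn.resourcemanager.webapp.address.$id"))
  }
}
{code}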
[jira] [Commented] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https
[ https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114614#comment-14114614 ] Benoy Antony commented on SPARK-3286: - I'll submit a git pull request. Cannot view ApplicationMaster UI when Yarn’s url scheme is https Key: SPARK-3286 URL: https://issues.apache.org/jira/browse/SPARK-3286 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Benoy Antony Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch The Spark ApplicationMaster starts its web UI at http://host-name:port. When the Spark ApplicationMaster registers its URL with the ResourceManager, the URL does not contain a URI scheme. If the URL scheme is absent, the ResourceManager’s web app proxy uses the HTTP policy of the ResourceManager (YARN-1553). If the HTTP policy of the ResourceManager is https, the web app proxy will try to access https://host-name:port, which results in an error. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3277) LZ4 compression cause the the ExternalSort exception
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3277. Resolution: Fixed Fixed by https://github.com/apache/spark/pull/2187 Thanks to everyone who helped isolate and debug this. LZ4 compression cause the the ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: hzw Assignee: Andrew Or Priority: Blocker Attachments: test_lz4_bug.patch I tested LZ4 compression and it ran into this problem (with wordcount). I also tested Snappy and LZF, and they were OK. In the end I set spark.shuffle.spill to false to avoid the exception, but once that switch is turned back on, the error returns. It seems that if the number of words is small, wordcount goes through, but with a more complex text the problem shows up. Exception info as follows: {code} java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled
[ https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-2970. --- Resolution: Fixed spark-sql script ends with IOException when EventLogging is enabled --- Key: SPARK-2970 URL: https://issues.apache.org/jira/browse/SPARK-2970 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Environment: CDH5.1.0 (Hadoop 2.3.0) Reporter: Kousuke Saruta Priority: Critical Fix For: 1.1.0 When the spark-sql script is run with spark.eventLog.enabled set to true, it ends with an IOException because FileLogger cannot create the APPLICATION_COMPLETE file in HDFS. This is because the shutdown hook of SparkSQLCLIDriver is executed after the shutdown hook of org.apache.hadoop.fs.FileSystem. When spark.eventLog.enabled is true, the SparkSQLCLIDriver hook finally tries to create a file to mark the application as finished, but the FileSystem hook has already closed the FileSystem. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
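For illustration, one way to avoid this ordering problem, sketched under the assumption that Hadoop 2.x's ShutdownHookManager is available (this is not necessarily the fix that was merged): register the marker-file hook with a higher priority than the FileSystem's own hook, so it runs while HDFS is still open.
{code}
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.util.ShutdownHookManager

// Sketch: run our hook before FileSystem's shutdown hook closes HDFS.
ShutdownHookManager.get().addShutdownHook(new Runnable {
  override def run(): Unit = {
    // write the APPLICATION_COMPLETE marker here, while the FileSystem is still open
  }
}, FileSystem.SHUTDOWN_HOOK_PRIORITY + 1)
{code}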
[jira] [Commented] (SPARK-2961) Use statistics to skip partitions when reading from in-memory columnar data
[ https://issues.apache.org/jira/browse/SPARK-2961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114668#comment-14114668 ] Apache Spark commented on SPARK-2961: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/2188 Use statistics to skip partitions when reading from in-memory columnar data --- Key: SPARK-2961 URL: https://issues.apache.org/jira/browse/SPARK-2961 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
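For context, a toy illustration of the idea (the types below are made up, not Spark SQL's columnar classes): keep per-batch min/max statistics for each column and skip any cached batch whose value range cannot satisfy the predicate.
{code}
// Made-up types for illustration only.
case class BatchStats(min: Int, max: Int)

// For a predicate like `col BETWEEN lower AND upper`, a batch can be skipped
// when its [min, max] range does not overlap the query range.
def overlaps(stats: BatchStats, lower: Int, upper: Int): Boolean =
  stats.max >= lower && stats.min <= upper

val batches = Seq(BatchStats(0, 99), BatchStats(100, 199), BatchStats(200, 299))
val scanned = batches.filter(overlaps(_, 120, 130)) // only BatchStats(100, 199) is read
{code}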
[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114680#comment-14114680 ] Patrick Wendell commented on SPARK-3266: [~joshrosen] is there a solution here that preserves binary compatibility? That's been our goal at this point and we've maintained it by and large except for a few very minor mandatory Scala 2.11 upgrades. JavaDoubleRDD doesn't contain max() --- Key: SPARK-3266 URL: https://issues.apache.org/jira/browse/SPARK-3266 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.1, 1.0.2, 1.1.0 Reporter: Amey Chaugule Assignee: Josh Rosen Attachments: spark-repro-3266.tar.gz My code compiles, but when I execute it I see: Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; Stepping into the JavaDoubleRDD class, I don't see max(), although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2636) Expose job ID in JobWaiter API
[ https://issues.apache.org/jira/browse/SPARK-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2636: --- Summary: Expose job ID in JobWaiter API (was: no where to get job identifier while submit spark job through spark API) Expose job ID in JobWaiter API -- Key: SPARK-2636 URL: https://issues.apache.org/jira/browse/SPARK-2636 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Labels: hive In Hive on Spark, we want to track spark job status through Spark API, the basic idea is as following: # create an hive-specified spark listener and register it to spark listener bus. # hive-specified spark listener generate job status by spark listener events. # hive driver track job status through hive-specified spark listener. the current problem is that hive driver need job identifier to track specified job status through spark listener, but there is no spark API to get job identifier(like job id) while submit spark job. I think other project whoever try to track job status with spark API would suffer from this as well. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
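For illustration, a minimal sketch of the consumer side once a job ID is exposed by the submission API, using the public SparkListener events (the listener class itself is hypothetical, not Hive's):
{code}
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Track per-job status keyed by job ID, as a Hive-style driver would.
class JobStatusListener extends SparkListener {
  val status = mutable.Map[Int, String]()
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    status(jobStart.jobId) = "RUNNING"
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    status(jobEnd.jobId) = jobEnd.jobResult.toString
}
// sc.addSparkListener(new JobStatusListener()); then look up status(jobId)
// once the submission call returns that job's ID.
{code}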
[jira] [Updated] (SPARK-3200) Class defined with reference to external variables crashes in REPL.
[ https://issues.apache.org/jira/browse/SPARK-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3200: --- Component/s: Spark Shell Class defined with reference to external variables crashes in REPL. --- Key: SPARK-3200 URL: https://issues.apache.org/jira/browse/SPARK-3200 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.1.0 Reporter: Prashant Sharma Assignee: Prashant Sharma Reproducer: {noformat} val a = sc.textFile("README.md").count case class A(i: Int) { val j = a } sc.parallelize(1 to 10).map(A(_)).collect() {noformat} This happens when the class refers to something that itself refers to sc, and not otherwise. There are many ways to work around this, such as directly assigning a constant value instead of referring to the variable. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3245) spark insert into hbase class not serialize
[ https://issues.apache.org/jira/browse/SPARK-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3245. Resolution: Invalid I'm closing this for now because we typically only track well-isolated issues on JIRA. Feel free to ping the spark user list for help narrowing down the issue. spark insert into hbase class not serialize Key: SPARK-3245 URL: https://issues.apache.org/jira/browse/SPARK-3245 Project: Spark Issue Type: Bug Environment: spark-1.0.1 + hbase-0.96.2 + hadoop-2.2.0 Reporter: 刘勇 {code} val result: org.apache.spark.rdd.RDD[(String, Int)] result.foreach(res => { var put = new Put(java.util.UUID.randomUUID().toString.reverse.getBytes()) .add("lv6".getBytes(), res._1.toString.getBytes(), res._2.toString.getBytes) table.put(put) }) {code} {code} Exception in thread Thread-3 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.hbase.client.HTablePool$PooledHTable at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:771) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:901) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply(DAGScheduler.scala:898) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply(DAGScheduler.scala:898) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16.apply(DAGScheduler.scala:898) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16.apply(DAGScheduler.scala:897) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:897) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1226) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
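The usual remedy for this kind of NotSerializableException (a usage pattern rather than a Spark fix, which is why the ticket was closed): create the non-serializable HBase client inside the task instead of closing over it. Sketched below assuming the HBase 0.96 client API; table and column-family names are placeholders.
{code}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.spark.rdd.RDD

// Build the HTable inside foreachPartition so nothing HBase-related is
// captured by the closure that Spark has to serialize.
def save(result: RDD[(String, Int)]): Unit = {
  result.foreachPartition { iter =>
    val conf = HBaseConfiguration.create()
    val table = new HTable(conf, "my_table")          // placeholder table name
    iter.foreach { case (k, v) =>
      val put = new Put(java.util.UUID.randomUUID().toString.reverse.getBytes())
      put.add("lv6".getBytes(), k.getBytes(), v.toString.getBytes())
      table.put(put)
    }
    table.close()
  }
}
{code}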
[jira] [Updated] (SPARK-3234) SPARK_HADOOP_VERSION doesn't have a valid value by default in make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3234: --- Target Version/s: 1.2.0 SPARK_HADOOP_VERSION doesn't have a valid value by default in make-distribution.sh --- Key: SPARK-3234 URL: https://issues.apache.org/jira/browse/SPARK-3234 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Cheng Lian Priority: Minor {{SPARK_HADOOP_VERSION}} has already been deprecated, but {{make-distribution.sh}} uses it as part of the distribution tarball name. As a result, we end up with something like {{spark-1.1.0-SNAPSHOT-bin-.tgz}} because {{SPARK_HADOOP_VERSION}} is empty. A possible fix is to add the antrun plugin into the Maven build and run Maven to print {{$hadoop.version}}. Instructions can be found in [this post|http://www.avajava.com/tutorials/lessons/how-do-i-display-the-value-of-a-property.html]. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3277: - Affects Version/s: (was: 1.2.0) LZ4 compression cause the the ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: hzw Assignee: Andrew Or Priority: Blocker Fix For: 1.1.0 Attachments: test_lz4_bug.patch I tested LZ4 compression and it ran into this problem (with wordcount). I also tested Snappy and LZF, and they were OK. In the end I set spark.shuffle.spill to false to avoid the exception, but once that switch is turned back on, the error returns. It seems that if the number of words is small, wordcount goes through, but with a more complex text the problem shows up. Exception info as follows: {code} java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3277: - Fix Version/s: 1.1.0 LZ4 compression cause the the ExternalSort exception Key: SPARK-3277 URL: https://issues.apache.org/jira/browse/SPARK-3277 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: hzw Assignee: Andrew Or Priority: Blocker Fix For: 1.1.0 Attachments: test_lz4_bug.patch I tested LZ4 compression and it ran into this problem (with wordcount). I also tested Snappy and LZF, and they were OK. In the end I set spark.shuffle.spill to false to avoid the exception, but once that switch is turned back on, the error returns. It seems that if the number of words is small, wordcount goes through, but with a more complex text the problem shows up. Exception info as follows: {code} java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.init(ExternalAppendOnlyMap.scala:416) at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3288) All fields in TaskMetrics should be private and use getters/setters
[ https://issues.apache.org/jira/browse/SPARK-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3288: - Affects Version/s: 1.1.0 All fields in TaskMetrics should be private and use getters/setters --- Key: SPARK-3288 URL: https://issues.apache.org/jira/browse/SPARK-3288 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Patrick Wendell Assignee: Andrew Or This is particularly bad because we expose this as a developer API. Technically a library could create a TaskMetrics object and then change the values inside of it and pass it onto someone else. It can be written pretty compactly like below: {code} /** * Number of bytes written for the shuffle by this task */ @volatile private var _shuffleBytesWritten: Long = _ def incrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten += value def decrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten -= value def shuffleBytesWritten = _shuffleBytesWritten {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2594) Add CACHE TABLE name AS SELECT ...
[ https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114729#comment-14114729 ] Michael Armbrust commented on SPARK-2594: - It's a lot of overhead to assign issues to people, but feel free to work on this now that you have posted here. Please post a design here before you begin coding. Add CACHE TABLE name AS SELECT ... Key: SPARK-2594 URL: https://issues.apache.org/jira/browse/SPARK-2594 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
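For clarity, this is roughly what the requested statement would look like from a user's perspective; the syntax below is only the proposed form from the issue title and is not supported at the time of this comment.
{code}
// Proposed usage only (not yet implemented); assumes a SQLContext named sqlContext.
sqlContext.sql("CACHE TABLE recent_events AS SELECT * FROM events WHERE ts > '2014-08-01'")
{code}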
[jira] [Commented] (SPARK-3272) Calculate prediction for nodes separately from calculating information gain for splits in decision tree
[ https://issues.apache.org/jira/browse/SPARK-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114742#comment-14114742 ] Joseph K. Bradley commented on SPARK-3272: -- Hi Qiping, you are right; I missed that! I like your idea of storing the number of instances in the InformationGainStats. (That seems easier to understand than a special invalid gain value.) For now, I would recommend storing the number for the node, not for the left and right child nodes. That would allow you to decide whether the node being considered is a leaf (not its children). I agree that, eventually, we should identify whether the children are leaves at the same time. That should be part of [SPARK-3158], which could modify findBestSplits to return ImpurityCalculators (a new class from my PR [https://github.com/apache/spark/pull/2125]) for the left and right child nodes. Does that sound reasonable? Calculate prediction for nodes separately from calculating information gain for splits in decision tree --- Key: SPARK-3272 URL: https://issues.apache.org/jira/browse/SPARK-3272 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.2 Reporter: Qiping Li Fix For: 1.1.0 In the current implementation, the prediction for a node is calculated along with the information gain stats for each possible split. The value to predict for a specific node is determined no matter what the splits are. To save computation, we can calculate the prediction first and then calculate the information gain stats for each split. This is also necessary if we want to support a minimum-instances-per-node parameter ([SPARK-2207|https://issues.apache.org/jira/browse/SPARK-2207]), because when no split satisfies the minimum-instances requirement we don't use the information gain of any split, yet there should still be a way to get the prediction value. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3291) TestcaseName in createQueryTest should not contain :
Qiping Li created SPARK-3291: Summary: TestcaseName in createQueryTest should not contain : Key: SPARK-3291 URL: https://issues.apache.org/jira/browse/SPARK-3291 Project: Spark Issue Type: Bug Components: SQL Reporter: Qiping Li The character ':' is not allowed in file names on Windows. If a file name contains ':', the file can't be checked out on a Windows system, and developers using Windows must be careful not to commit the deletion of such files, which is very inconvenient. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
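One straightforward remedy, sketched below (the replacement character and the exact character set are arbitrary choices, not the committed fix): sanitize the test case name before it is used as a golden-answer file name.
{code}
// Strip characters that are illegal in Windows file names; '_' is arbitrary.
def sanitizeTestCaseName(name: String): String =
  name.replaceAll("""[:\\/*?|<>"]""", "_")

sanitizeTestCaseName("insert: into table") // "insert_ into table"
{code}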
[jira] [Commented] (SPARK-3250) More Efficient Sampling
[ https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114789#comment-14114789 ] Erik Erlandson commented on SPARK-3250: --- I did some experiments with sampling that models the gaps between samples (so one can use iterator.drop between samples). The results are here: https://gist.github.com/erikerlandson/66b42d96500589f25553 There appears to be a crossover point in efficiency, around sampling probability p=0.3, where densities below 0.3 are best done using the new logic, and higher sampling densities are better done using traditional filter-based logic. I need to run more tests, but the first results are promising. At low sampling densities the improvement is large. More Efficient Sampling --- Key: SPARK-3250 URL: https://issues.apache.org/jira/browse/SPARK-3250 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: RJ Nowling Sampling, as currently implemented in Spark, is an O\(n\) operation. A number of stochastic algorithms achieve speed ups by exploiting O\(k\) sampling, where k is the number of data points to sample. Examples of such algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient Descent with mini batching. More efficient sampling may be achievable by packing partitions with an ArrayBuffer or other data structure supporting random access. Since many of these stochastic algorithms perform repeated rounds of sampling, it may be feasible to perform a transformation to change the backing data structure followed by multiple rounds of sampling. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
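For reference, a compact sketch of the gap-sampling idea described in the comment above (the skip loop plays the role of iterator.drop); it is valid for 0 < p < 1, and the p ≈ 0.3 crossover against plain filtering is the empirical observation from the gist, not something this snippet derives.
{code}
import scala.util.Random

// Instead of one Bernoulli draw per element (O(n) draws), draw the size of the
// gap to the next sampled element from a geometric distribution and skip it.
def gapSample[T](it: Iterator[T], p: Double, rng: Random = new Random()): Iterator[T] =
  new Iterator[T] {
    // P(gap = k) = (1 - p)^k * p  =>  k = floor(ln(u) / ln(1 - p)), u in (0, 1]
    private def nextGap(): Int =
      (math.log(math.max(rng.nextDouble(), 1e-12)) / math.log(1.0 - p)).toInt
    private def advance(): Option[T] = {
      var k = nextGap()
      while (k > 0 && it.hasNext) { it.next(); k -= 1 }
      if (it.hasNext) Some(it.next()) else None
    }
    private var cur: Option[T] = advance()
    def hasNext: Boolean = cur.isDefined
    def next(): T = { val v = cur.get; cur = advance(); v }
  }

// gapSample((1 to 1000000).iterator, 0.01).size  // roughly 10000 elements
{code}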
[jira] [Created] (SPARK-3292) Shuffle Tasks run indefinitely even though there's no inputs
guowei created SPARK-3292: - Summary: Shuffle Tasks run indefinitely even though there's no inputs Key: SPARK-3292 URL: https://issues.apache.org/jira/browse/SPARK-3292 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2 Reporter: guowei This affects shuffle operations such as repartition, groupBy, join, and cogroup: they are too expensive to run when there is no input. For example, if I want to save the outputs as a Hadoop file, many empty files are generated. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org