[jira] [Created] (SPARK-4783) Remove all System.exit calls from sparkcontext
David Semeria created SPARK-4783: Summary: Remove all System.exit calls from sparkcontext Key: SPARK-4783 URL: https://issues.apache.org/jira/browse/SPARK-4783 Project: Spark Issue Type: Bug Reporter: David Semeria A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running sparkcontexts. A typical server is created with the following pseudocode:
```
var continue = true
while (continue) {
  try {
    server.run()
  } catch (e) {
    continue = log_and_examine_error(e)
  }
}
```
The problem is that sparkcontext frequently calls System.exit when it encounters a problem, meaning the server can only be re-spawned at the process level, which is much messier than the simple code above. Therefore, I believe it makes sense to replace all System.exit calls in sparkcontext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
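A minimal sketch of the gateway pattern described above, written as it would look if the proposal were adopted and SparkContext surfaced fatal problems by throwing (for example a SparkException) instead of calling System.exit. The GatewayServer object and the serveJobs and logAndExamineError helpers are hypothetical names used only for illustration.

{code}
import org.apache.spark.{SparkConf, SparkContext, SparkException}

// Hypothetical gateway wrapping a long-running SparkContext. This loop only
// works if SparkContext reports fatal conditions by throwing (the change
// proposed in this ticket) rather than by calling System.exit.
object GatewayServer {
  def main(args: Array[String]): Unit = {
    var keepRunning = true
    while (keepRunning) {
      val sc = new SparkContext(new SparkConf().setAppName("gateway"))
      try {
        serveJobs(sc)                                   // block, handling incoming job requests
      } catch {
        case e: SparkException => keepRunning = logAndExamineError(e)
        case e: Exception      => keepRunning = logAndExamineError(e)
      } finally {
        sc.stop()                                       // release the old context before respawning
      }
    }
  }

  def serveJobs(sc: SparkContext): Unit = ???           // application-specific job loop
  def logAndExamineError(e: Throwable): Boolean = ???   // decide whether to respawn
}
{code}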
[jira] [Updated] (SPARK-4783) Remove all System.exit calls from sparkcontext
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Semeria updated SPARK-4783: - Description: A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running sparkcontexts. A typical server is created with the following pseudo code: var continue = true while (continue){ try { server.run() } catch (e) { continue = log_and_examine_error(e) } The problem is that sparkcontext frequently calls System.exit when it encounters a problem which means the server can only be re-spawned at the process level, which is much more messy than the simple code above. Therefore, I believe it makes sense to replace all System.exit calls in sparkcontext with the throwing of a fatal error. was: A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running sparkcontexts. A typical server is created with the following pseudo code: ``` var continue = true while (continue){ try { server.run() } catch (e) { continue = log_and_examine_error(e) } ``` The problem is that sparkcontext frequently calls System.exit when it encounters a problem which means the server can only be re-spawned at the process level, which is much more messy than the simple code above. Therefore, I believe it makes sense to replace all System.exit calls in sparkcontext with the throwing of a fatal error. Remove all System.exit calls from sparkcontext -- Key: SPARK-4783 URL: https://issues.apache.org/jira/browse/SPARK-4783 Project: Spark Issue Type: Bug Reporter: David Semeria A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running sparkcontexts. A typical server is created with the following pseudo code: var continue = true while (continue){ try { server.run() } catch (e) { continue = log_and_examine_error(e) } The problem is that sparkcontext frequently calls System.exit when it encounters a problem which means the server can only be re-spawned at the process level, which is much more messy than the simple code above. Therefore, I believe it makes sense to replace all System.exit calls in sparkcontext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2808) update kafka to version 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237150#comment-14237150 ] Helena Edelson commented on SPARK-2808: --- I've done the migration and am testing the changes against Scala 2.10.4, since the parent Spark pom uses <scala.version>2.10.4</scala.version> and <scala.binary.version>2.10</scala.binary.version>. This will allow usage of the new producer, among other nice additions in 0.8.2, though I do not know when it will be GA. update kafka to version 0.8.2 - Key: SPARK-2808 URL: https://issues.apache.org/jira/browse/SPARK-2808 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati First kafka_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
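The "new producer" mentioned above is the producer client that ships with Kafka 0.8.2. As a rough, self-contained illustration (not part of the Spark patch itself), it is used roughly as below; the broker address, topic, and String serializers are placeholder assumptions.

{code}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Minimal sketch of the 0.8.2 producer API; broker and topic names are placeholders.
val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("test-topic", "key", "value"))
producer.close()
{code}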
[jira] [Commented] (SPARK-4181) Create separate options to control the client-mode AM resource allocation request
[ https://issues.apache.org/jira/browse/SPARK-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237163#comment-14237163 ] WangTaoTheTonic commented on SPARK-4181: I couldn't think of a use case for extraLibraryPath either. But we should honor extraClassPath, since some functionality, for instance the logging system, loads resources from a user-defined classpath. Create separate options to control the client-mode AM resource allocation request - Key: SPARK-4181 URL: https://issues.apache.org/jira/browse/SPARK-4181 Project: Spark Issue Type: Improvement Components: YARN Reporter: WangTaoTheTonic Priority: Minor I found related discussion in https://github.com/apache/spark/pull/2115, SPARK-1953 and SPARK-1507. Recently I also found some inconvenience in configuring properties like logging while using yarn-client mode. So if no one else does the work, I will try it, maybe starting in a few days and completing it in the next one or two weeks. As I am not very familiar with Spark on YARN, any discussion and feedback is welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
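For context, the separation being proposed is between driver-side properties (which in yarn-client mode apply to the local driver process) and distinct ApplicationMaster-side properties. A sketch of what that split might look like; the spark.yarn.am.* keys are illustrative examples of the proposal, not settled names, and in practice such settings would normally live in spark-defaults.conf or be passed via --conf rather than set in code.

{code}
import org.apache.spark.SparkConf

// Illustrative only: separate driver-side and AM-side settings for yarn-client mode.
val conf = new SparkConf()
  .set("spark.driver.extraClassPath", "/opt/myapp/conf")                                   // driver JVM (local process)
  .set("spark.driver.extraJavaOptions", "-Dlog4j.configuration=driver-log4j.properties")
  .set("spark.yarn.am.memory", "1g")                                                       // client-mode AM container (proposed key)
  .set("spark.yarn.am.extraJavaOptions", "-Dlog4j.configuration=am-log4j.properties")      // proposed key
{code}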
[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237178#comment-14237178 ] Zhang, Liye commented on SPARK-4740: [~pwendell], I had the same concern at first, so I verified it: I replaced the assembly jar files on all nodes and also restarted the cluster, but the problem still exists. One more solid proof is that the better performance is not tied to one machine; one node may perform better for this test and another node for another test. Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey Key: SPARK-4740 URL: https://issues.apache.org/jira/browse/SPARK-4740 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Zhang, Liye Assignee: Reynold Xin Priority: Blocker Attachments: (rxin patch better executor)TestRunner sort-by-key - Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, TestRunner sort-by-key - Thread dump for executor 1_files (Nio-48 cores per node).zip When testing current spark master (1.3.0-snapshot) with spark-perf (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService takes much longer time than NIO based shuffle transferService. The network throughput of Netty is only about half of that of NIO. We tested with standalone mode, and the data set we used for test is 20 billion records, and the total size is about 400GB. Spark-perf test is Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each executor memory is 64GB. The reduce tasks number is set to 1000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
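For anyone reproducing the comparison, the shuffle transfer service under discussion is selected with a single configuration key. A sketch, assuming a Spark build (1.2+ / 1.3.0-SNAPSHOT) where both implementations are available; the workload here is just a stand-in for the spark-perf sort-by-key test.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Run the same sort-by-key style workload twice, once per transfer service.
// "netty" is the default in 1.2.0; "nio" selects the older NIO-based ConnectionManager.
val conf = new SparkConf()
  .setAppName("transfer-service-comparison")
  .set("spark.shuffle.blockTransferService", "nio")   // or "netty"
val sc = new SparkContext(conf)
sc.parallelize(1 to 1000000).map(i => (i % 1000, i)).sortByKey().count()
sc.stop()
{code}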
[jira] [Comment Edited] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237178#comment-14237178 ] Zhang, Liye edited comment on SPARK-4740 at 12/7/14 3:32 PM: - [~pwendell], I have the same concern at first. I verified for several times, and replaced the assembly jar files on all nodes, and also restart the cluster, but the problem still exists. One more solid proof is that the better performance node is not stick to one machine, maybe one node performs better for this test and may switch to another node for next test. was (Author: liyezhang556520): [~pwendell], I have the same concern first, and I verified that, and replaced the assembly jar files on all nodes, and also restart the cluster, but the problem still exists. One more solid proof is that the better performance is not stick to one machine, maybe one node performs better for this test and another node for other test. Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey Key: SPARK-4740 URL: https://issues.apache.org/jira/browse/SPARK-4740 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Zhang, Liye Assignee: Reynold Xin Priority: Blocker Attachments: (rxin patch better executor)TestRunner sort-by-key - Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, TestRunner sort-by-key - Thread dump for executor 1_files (Nio-48 cores per node).zip When testing current spark master (1.3.0-snapshot) with spark-perf (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService takes much longer time than NIO based shuffle transferService. The network throughput of Netty is only about half of that of NIO. We tested with standalone mode, and the data set we used for test is 20 billion records, and the total size is about 400GB. Spark-perf test is Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each executor memory is 64GB. The reduce tasks number is set to 1000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237179#comment-14237179 ] Zhang, Liye commented on SPARK-4740: Hi [~adav], I kept speculation as default (false). Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey Key: SPARK-4740 URL: https://issues.apache.org/jira/browse/SPARK-4740 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Zhang, Liye Assignee: Reynold Xin Priority: Blocker Attachments: (rxin patch better executor)TestRunner sort-by-key - Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, TestRunner sort-by-key - Thread dump for executor 1_files (Nio-48 cores per node).zip When testing current spark master (1.3.0-snapshot) with spark-perf (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService takes much longer time than NIO based shuffle transferService. The network throughput of Netty is only about half of that of NIO. We tested with standalone mode, and the data set we used for test is 20 billion records, and the total size is about 400GB. Spark-perf test is Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each executor memory is 64GB. The reduce tasks number is set to 1000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2808) update kafka to version 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237150#comment-14237150 ] Helena Edelson edited comment on SPARK-2808 at 12/7/14 3:35 PM: I've done the migration, am testing the changes against Scala 2.10.4 since the parent spark pom uses scala.version2.10.4/scala.version scala.binary.version2.10/scala.binary.version I do not know when it will be GA. was (Author: helena_e): I've done the migration, am testing the changes against Scala 2.10.4 since the parent spark pom uses scala.version2.10.4/scala.version scala.binary.version2.10/scala.binary.version This will allow usage of the new producer, among other nice additions in 0.8.2. Though I do not know when it will be GA. update kafka to version 0.8.2 - Key: SPARK-2808 URL: https://issues.apache.org/jira/browse/SPARK-2808 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati First kafka_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2808) update kafka to version 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237150#comment-14237150 ] Helena Edelson edited comment on SPARK-2808 at 12/7/14 3:52 PM: I've done the migration, am testing the changes against Scala 2.10.4 since the parent spark pom uses scala.version2.10.4/scala.version scala.binary.version2.10/scala.binary.version I do not know when it will be GA. https://github.com/apache/spark/pull/3631 was (Author: helena_e): I've done the migration, am testing the changes against Scala 2.10.4 since the parent spark pom uses scala.version2.10.4/scala.version scala.binary.version2.10/scala.binary.version I do not know when it will be GA. update kafka to version 0.8.2 - Key: SPARK-2808 URL: https://issues.apache.org/jira/browse/SPARK-2808 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati First kafka_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237254#comment-14237254 ] koert kuipers commented on SPARK-3655: -- i also don't like the signature
def groupByKeyAndSortValues(valueOrdering: Ordering[V], partitioner: Partitioner): RDD[(K, Iterable[V])]
since i doubt it can be implemented efficiently. i would much prefer
def groupByKeyAndSortValues(valueOrdering: Ordering[V], partitioner: Partitioner): RDD[(K, TraversableOnce[V])]
but that is inconsistent with groupByKey (which i guess has Iterable in its return type for historical reasons; it used to be Seq). Support sorting of values in addition to keys (i.e. secondary sort) --- Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: koert kuipers Assignee: Koert Kuipers Priority: Minor Now that spark has a sort based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode with multiple cores
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237264#comment-14237264 ] Davis Shepherd commented on SPARK-4759: --- It is possible to reproduce the issue with single core local mode. Simply change the partitions parameter to 1 (in the attached version it uses the defaultParallelism of the spark context, which in single core mode is 1) in either of the coalesce calls. Deadlock in complex spark job in local mode with multiple cores --- Key: SPARK-4759 URL: https://issues.apache.org/jira/browse/SPARK-4759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0, 1.3.0 Environment: Java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mac OSX 10.10.1 Using local spark context Reporter: Davis Shepherd Assignee: Andrew Or Priority: Critical Attachments: SparkBugReplicator.scala The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves # taking new data merging it with the previous result # caching and checkpointing the new result # rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. Ive been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237298#comment-14237298 ] Apache Spark commented on SPARK-3655: - User 'koertkuipers' has created a pull request for this issue: https://github.com/apache/spark/pull/3632 Support sorting of values in addition to keys (i.e. secondary sort) --- Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: koert kuipers Assignee: Koert Kuipers Priority: Minor Now that spark has a sort based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237299#comment-14237299 ] koert kuipers commented on SPARK-3655: -- i have a new pullreq that implements just groupByKeyAndSortValues in scala and java. i will need some help with python. the pullreq is here: https://github.com/apache/spark/pull/3632 i changed the methods to return RDD[(K, TraversableOnce[V])] instead of RDD[(K, Iterable[V])], since i don't see a reasonable way to implement it so that it returns Iterables without resorting to keeping the data in memory. The assumption made is that once you move on to the next key within a partition, the previous value (the TraversableOnce[V]) will no longer be used. I personally find this API too generic and too easy to abuse or make mistakes with, so i prefer a more constrained API like foldLeft. Support sorting of values in addition to keys (i.e. secondary sort) --- Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: koert kuipers Assignee: Koert Kuipers Priority: Minor Now that spark has a sort based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
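A sketch of how the proposed method might be called, assuming the signature quoted above. groupByKeyAndSortValues is the API under review in the pull request, not something in released Spark, and the event/timestamp names are purely illustrative. Per the comment, each key's TraversableOnce[V] must be consumed before moving on to the next key.

{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Hypothetical usage of the proposed API: group events by user and sort each
// user's events by timestamp, streaming through the values for each key.
val events: RDD[(String, (Long, String))] = ???   // (userId, (timestamp, payload))

val byTimestamp: Ordering[(Long, String)] = Ordering.by(_._1)

// Proposed method from the pull request; returns values as TraversableOnce,
// which must be consumed immediately and not retained across keys.
val sorted: RDD[(String, TraversableOnce[(Long, String)])] =
  events.groupByKeyAndSortValues(byTimestamp, new HashPartitioner(100))

// Example consumption: count each user's events while streaming through them.
sorted.mapValues(_.foldLeft(0)((n, _) => n + 1)).collect()
{code}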
[jira] [Updated] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-3655: - Affects Version/s: 1.2.0 Support sorting of values in addition to keys (i.e. secondary sort) --- Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: koert kuipers Assignee: Koert Kuipers Priority: Minor Now that spark has a sort based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode with multiple cores
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237321#comment-14237321 ] Andrew Or commented on SPARK-4759: -- Hey, I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[8] 2. Copy and paste the following into your REPL
{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def makeMyRdd(sc: SparkContext): RDD[Int] = {
  sc.parallelize(1 to 100).repartition(sc.defaultParallelism).cache()
}

def runMyJob(sc: SparkContext): Unit = {
  sc.setCheckpointDir("/tmp/spark-test")
  val rdd = makeMyRdd(sc)
  rdd.checkpoint()
  rdd.count()
  val rdd2 = makeMyRdd(sc)
  val newRdd = rdd.union(rdd2).coalesce(sc.defaultParallelism).cache()
  newRdd.checkpoint()
  newRdd.count()
}
{code}
3. runMyJob(sc) It should be stuck at task 7/8. Deadlock in complex spark job in local mode with multiple cores --- Key: SPARK-4759 URL: https://issues.apache.org/jira/browse/SPARK-4759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0, 1.3.0 Environment: Java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mac OSX 10.10.1 Using local spark context Reporter: Davis Shepherd Assignee: Andrew Or Priority: Critical Attachments: SparkBugReplicator.scala The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves # taking new data merging it with the previous result # caching and checkpointing the new result # rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. Ive been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode with multiple cores
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/7/14 11:53 PM: Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[8] 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def makeMyRdd(sc: SparkContext): RDD[Int] = { sc.parallelize(1 to 100).repartition(4).cache() } def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = makeMyRdd(sc) rdd.checkpoint() rdd.count() val rdd2 = makeMyRdd(sc) val newRdd = rdd.union(rdd2).coalesce(4).cache() newRdd.checkpoint() newRdd.count() } {code} 3. runMyJob(sc) It should be stuck at task 7/8. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[8] 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def makeMyRdd(sc: SparkContext): RDD[Int] = { sc.parallelize(1 to 100).repartition(sc.defaultParallelism).cache() } def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = makeMyRdd(sc) rdd.checkpoint() rdd.count() val rdd2 = makeMyRdd(sc) val newRdd = rdd.union(rdd2).coalesce(sc.defaultParallelism).cache() newRdd.checkpoint() newRdd.count() } {code} 3. runMyJob(sc) It should be stuck at task 7/8. Deadlock in complex spark job in local mode with multiple cores --- Key: SPARK-4759 URL: https://issues.apache.org/jira/browse/SPARK-4759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0, 1.3.0 Environment: Java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mac OSX 10.10.1 Using local spark context Reporter: Davis Shepherd Assignee: Andrew Or Priority: Critical Attachments: SparkBugReplicator.scala The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves # taking new data merging it with the previous result # caching and checkpointing the new result # rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. Ive been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4759: - Summary: Deadlock in complex spark job in local mode (was: Deadlock in complex spark job in local mode with multiple cores) Deadlock in complex spark job in local mode --- Key: SPARK-4759 URL: https://issues.apache.org/jira/browse/SPARK-4759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0, 1.3.0 Environment: Java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mac OSX 10.10.1 Using local spark context Reporter: Davis Shepherd Assignee: Andrew Or Priority: Critical Attachments: SparkBugReplicator.scala The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves # taking new data merging it with the previous result # caching and checkpointing the new result # rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. Ive been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode with multiple cores
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/7/14 11:53 PM: Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[8] (or simply local) 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def makeMyRdd(sc: SparkContext): RDD[Int] = { sc.parallelize(1 to 100).repartition(4).cache() } def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = makeMyRdd(sc) rdd.checkpoint() rdd.count() val rdd2 = makeMyRdd(sc) val newRdd = rdd.union(rdd2).coalesce(4).cache() newRdd.checkpoint() newRdd.count() } {code} 3. runMyJob(sc) It should be stuck at task 7/8. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[8] 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def makeMyRdd(sc: SparkContext): RDD[Int] = { sc.parallelize(1 to 100).repartition(4).cache() } def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = makeMyRdd(sc) rdd.checkpoint() rdd.count() val rdd2 = makeMyRdd(sc) val newRdd = rdd.union(rdd2).coalesce(4).cache() newRdd.checkpoint() newRdd.count() } {code} 3. runMyJob(sc) It should be stuck at task 7/8. Deadlock in complex spark job in local mode with multiple cores --- Key: SPARK-4759 URL: https://issues.apache.org/jira/browse/SPARK-4759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0, 1.3.0 Environment: Java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mac OSX 10.10.1 Using local spark context Reporter: Davis Shepherd Assignee: Andrew Or Priority: Critical Attachments: SparkBugReplicator.scala The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves # taking new data merging it with the previous result # caching and checkpointing the new result # rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. Ive been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/7/14 11:57 PM: Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def makeMyRdd(sc: SparkContext): RDD[Int] = { sc.parallelize(1 to 100).repartition(4).cache() } def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = makeMyRdd(sc) rdd.checkpoint() rdd.count() val rdd2 = makeMyRdd(sc) val newRdd = rdd.union(rdd2).coalesce(4).cache() newRdd.checkpoint() newRdd.count() } {code} 3. runMyJob(sc) It should be stuck at task 1/4. Note that with local-cluster and (local) standalone mode, it pauses a little at 1/4 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[8] (or simply local) 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def makeMyRdd(sc: SparkContext): RDD[Int] = { sc.parallelize(1 to 100).repartition(4).cache() } def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = makeMyRdd(sc) rdd.checkpoint() rdd.count() val rdd2 = makeMyRdd(sc) val newRdd = rdd.union(rdd2).coalesce(4).cache() newRdd.checkpoint() newRdd.count() } {code} 3. runMyJob(sc) It should be stuck at task 7/8. Deadlock in complex spark job in local mode --- Key: SPARK-4759 URL: https://issues.apache.org/jira/browse/SPARK-4759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0, 1.3.0 Environment: Java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mac OSX 10.10.1 Using local spark context Reporter: Davis Shepherd Assignee: Andrew Or Priority: Critical Attachments: SparkBugReplicator.scala The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves # taking new data merging it with the previous result # caching and checkpointing the new result # rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. Ive been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 12:17 AM: Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = sc.parallelize(1 to 100).repartition(4).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(4) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 4/8. Note that with local-cluster and (local) standalone mode, it pauses a little at 4/8 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def makeMyRdd(sc: SparkContext): RDD[Int] = { sc.parallelize(1 to 100).repartition(4).cache() } def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = makeMyRdd(sc) rdd.checkpoint() rdd.count() val rdd2 = makeMyRdd(sc) val newRdd = rdd.union(rdd2).coalesce(4).cache() newRdd.checkpoint() newRdd.count() } {code} 3. runMyJob(sc) It should be stuck at task 1/4. Note that with local-cluster and (local) standalone mode, it pauses a little at 1/4 too, but finishes shortly afterwards. Deadlock in complex spark job in local mode --- Key: SPARK-4759 URL: https://issues.apache.org/jira/browse/SPARK-4759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0, 1.3.0 Environment: Java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mac OSX 10.10.1 Using local spark context Reporter: Davis Shepherd Assignee: Andrew Or Priority: Critical Attachments: SparkBugReplicator.scala The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves # taking new data merging it with the previous result # caching and checkpointing the new result # rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. Ive been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 12:20 AM: Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = sc.parallelize(1 to 100).repartition(4).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(4) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 4/8. Note that with local-cluster and (local) standalone mode, it pauses a little at 4/8 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = sc.parallelize(1 to 100).repartition(4).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(4) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 4/8. Note that with local-cluster and (local) standalone mode, it pauses a little at 4/8 too, but finishes shortly afterwards. Deadlock in complex spark job in local mode --- Key: SPARK-4759 URL: https://issues.apache.org/jira/browse/SPARK-4759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0, 1.3.0 Environment: Java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mac OSX 10.10.1 Using local spark context Reporter: Davis Shepherd Assignee: Andrew Or Priority: Critical Attachments: SparkBugReplicator.scala The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves # taking new data merging it with the previous result # caching and checkpointing the new result # rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. Ive been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 12:20 AM: Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = sc.parallelize(1 to 100).repartition(4).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(4) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 4/8. Note that with local-cluster and (local) standalone mode, it pauses a little at 4/8 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = sc.parallelize(1 to 100).repartition(4).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(4) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 4/8. Note that with local-cluster and (local) standalone mode, it pauses a little at 4/8 too, but finishes shortly afterwards. Deadlock in complex spark job in local mode --- Key: SPARK-4759 URL: https://issues.apache.org/jira/browse/SPARK-4759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0, 1.3.0 Environment: Java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mac OSX 10.10.1 Using local spark context Reporter: Davis Shepherd Assignee: Andrew Or Priority: Critical Attachments: SparkBugReplicator.scala The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves # taking new data merging it with the previous result # caching and checkpointing the new result # rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. Ive been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 12:25 AM: Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = sc.parallelize(1 to 100).coalesce(4).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).coalesce(4) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 1/2. Note that with local-cluster and (local) standalone mode, it pauses a little at 1/2 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = sc.parallelize(1 to 100).repartition(4).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(4) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 4/8. Note that with local-cluster and (local) standalone mode, it pauses a little at 4/8 too, but finishes shortly afterwards. Deadlock in complex spark job in local mode --- Key: SPARK-4759 URL: https://issues.apache.org/jira/browse/SPARK-4759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0, 1.3.0 Environment: Java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mac OSX 10.10.1 Using local spark context Reporter: Davis Shepherd Assignee: Andrew Or Priority: Critical Attachments: SparkBugReplicator.scala The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves # taking new data merging it with the previous result # caching and checkpointing the new result # rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. Ive been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 12:36 AM: Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 5/12. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/12 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = sc.parallelize(1 to 100).coalesce(4).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).coalesce(4) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 1/2. Note that with local-cluster and (local) standalone mode, it pauses a little at 1/2 too, but finishes shortly afterwards. Deadlock in complex spark job in local mode --- Key: SPARK-4759 URL: https://issues.apache.org/jira/browse/SPARK-4759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0, 1.3.0 Environment: Java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mac OSX 10.10.1 Using local spark context Reporter: Davis Shepherd Assignee: Andrew Or Priority: Critical Attachments: SparkBugReplicator.scala The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves # taking new data merging it with the previous result # caching and checkpointing the new result # rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. Ive been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4783) Remove all System.exit calls from sparkcontext
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237341#comment-14237341 ] Patrick Wendell commented on SPARK-4783: For code cleanliness, we should go through and look everywhere we call System.exit() and see which ones can be converted safely to exceptions. Most of our use of System.exit is on the executor side, but there may be a few on the driver/SparkContext side. That said, if there is a fatal exception in the SparkContext, I don't think your app can safely just catch the exception, log it, and create a new SparkContext. Is that what you are trying to do? In that case there could be static state around that is not properly cleaned up and will cause the new context to be buggy. Remove all System.exit calls from sparkcontext -- Key: SPARK-4783 URL: https://issues.apache.org/jira/browse/SPARK-4783 Project: Spark Issue Type: Bug Reporter: David Semeria A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running sparkcontexts. A typical server is created with the following pseudo code: var continue = true while (continue){ try { server.run() } catch (e) { continue = log_and_examine_error(e) } The problem is that sparkcontext frequently calls System.exit when it encounters a problem which means the server can only be re-spawned at the process level, which is much more messy than the simple code above. Therefore, I believe it makes sense to replace all System.exit calls in sparkcontext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 1:29 AM: --- Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { sc.setCheckpointDir(/tmp/spark-test) val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. Deadlock in complex spark job in local mode --- Key: SPARK-4759 URL: https://issues.apache.org/jira/browse/SPARK-4759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0, 1.3.0 Environment: Java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mac OSX 10.10.1 Using local spark context Reporter: Davis Shepherd Assignee: Andrew Or Priority: Critical Attachments: SparkBugReplicator.scala The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves # taking new data merging it with the previous result # caching and checkpointing the new result # rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. Ive been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 1:33 AM: --- Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob() It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. Deadlock in complex spark job in local mode --- Key: SPARK-4759 URL: https://issues.apache.org/jira/browse/SPARK-4759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0, 1.3.0 Environment: Java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mac OSX 10.10.1 Using local spark context Reporter: Davis Shepherd Assignee: Andrew Or Priority: Critical Attachments: SparkBugReplicator.scala The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves # taking new data merging it with the previous result # caching and checkpointing the new result # rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. Ive been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237374#comment-14237374 ] Andrew Or commented on SPARK-4759: -- [~dgshep] That's strange. I am able to reproduce this every time, and I only need to call runMyJob once. What master are you running? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237376#comment-14237376 ] Davis Shepherd commented on SPARK-4759: --- local[2] on the latest commit of branch 1.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237374#comment-14237374 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 2:36 AM: --- [~dgshep] That's strange. I am able to reproduce this every time, and I only need to call runMyJob once. What master are you running? I just tried local, local[6], and local[*] and they all reproduced the deadlock. I am running the master branch with this commit: 6eb1b6f6204ea3c8083af3fb9cd990d9f3dac89d was (Author: andrewor14): [~dgshep] That's strange. I am able to reproduce this every time, and I only need to call runMyJob once. What master are you running? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237377#comment-14237377 ] Andrew Or commented on SPARK-4759: -- Hm I'll try branch 1.1 again later tonight. There might very well be more than one issue that causes this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4783: --- Summary: System.exit() calls in SparkContext disrupt applications embedding Spark (was: Remove all System.exit calls from sparkcontext) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
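To make the failure mode in SPARK-4783 concrete, here is a minimal, hypothetical Scala sketch of the embedding pattern the reporter describes; GatewayServer, run, and logAndExamineError are illustrative names, not Spark APIs. The retry loop only works if SparkContext surfaces fatal conditions as exceptions instead of calling System.exit, which is exactly the change the ticket asks for.
{code}
// Hypothetical gateway that embeds a long-running SparkContext.
// If Spark throws instead of exiting the JVM, the loop can log the failure,
// decide whether to rebuild the context, and keep serving without a process restart.
class GatewayServer(run: () => Unit, logAndExamineError: Throwable => Boolean) {
  def serveForever(): Unit = {
    var keepGoing = true
    while (keepGoing) {
      try {
        run() // submit jobs on the embedded SparkContext
      } catch {
        case e: Throwable => // reachable only when Spark throws rather than calling System.exit
          keepGoing = logAndExamineError(e)
      }
    }
  }
}
{code}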
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237383#comment-14237383 ] Davis Shepherd commented on SPARK-4759: --- Ok your version does reproduce the issue against the spark-core 1.1.1 artifact if I copy and paste your code into the original SparkBugReplicator, but it only seems to hang on the second time the job is run :P. This smells of a race condition... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237393#comment-14237393 ] Davis Shepherd commented on SPARK-4759: --- I still cannot reproduce the issue in the spark shell on tag v1.1.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237393#comment-14237393 ] Davis Shepherd edited comment on SPARK-4759 at 12/8/14 3:15 AM: I still cannot reproduce the issue with your snippet in the spark shell on tag v1.1.1 was (Author: dgshep): I still cannot reproduce the issue in the spark shell on tag v1.1.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237299#comment-14237299 ] koert kuipers edited comment on SPARK-3655 at 12/8/14 3:24 AM: --- I have a new pullreq that implements just groupByKeyAndSortValues in Scala and Java. I will need some help with Python. The pullreq is here: https://github.com/apache/spark/pull/3632 I changed the methods to return RDD[(K, TraversableOnce[V])] instead of RDD[(K, Iterable[V])], since I don't see a reasonable way to implement it so that it returns Iterables without resorting to keeping the data in memory. The assumption made is that once you move on to the next key within a partition, the previous value (the TraversableOnce[V]) will no longer be used. I personally find this API too generic, and too easy to abuse or make mistakes with, so I prefer a more constrained API like foldLeft. Or perhaps groupByKeyAndSortValues could be DeveloperApi? was (Author: koert): I have a new pullreq that implements just groupByKeyAndSortValues in Scala and Java. I will need some help with Python. The pullreq is here: https://github.com/apache/spark/pull/3632 I changed the methods to return RDD[(K, TraversableOnce[V])] instead of RDD[(K, Iterable[V])], since I don't see a reasonable way to implement it so that it returns Iterables without resorting to keeping the data in memory. The assumption made is that once you move on to the next key within a partition, the previous value (the TraversableOnce[V]) will no longer be used. I personally find this API too generic, and too easy to abuse or make mistakes with, so I prefer a more constrained API like foldLeft. Support sorting of values in addition to keys (i.e. secondary sort) --- Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: koert kuipers Assignee: Koert Kuipers Priority: Minor Now that Spark has a sort-based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
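For readers following the API discussion above, here is a minimal sketch of the secondary-sort pattern expressed with the existing public RDD API (repartitionAndSortWithinPartitions, available since Spark 1.2) rather than the proposed groupByKeyAndSortValues; KeyPartitioner and valuesSortedPerKey are illustrative names, not Spark classes. The idea is to promote the value into the key so the shuffle does the sorting, while partitioning only on the original key.
{code}
import org.apache.spark.{Partitioner, SparkContext}
import org.apache.spark.SparkContext._ // pair/ordered RDD conversions in Spark 1.x

// Partition on the original key only, so every value of a key lands in the same
// partition while the shuffle sorts the composite (key, value) pair.
class KeyPartitioner(partitions: Int) extends Partitioner {
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = key match {
    case (k: Int, _) => (k.hashCode % partitions + partitions) % partitions
  }
}

def valuesSortedPerKey(sc: SparkContext): Unit = {
  val pairs = sc.parallelize(Seq((1, 5), (2, 9), (1, 2), (1, 7)))
  val sorted = pairs
    .map { case (k, v) => ((k, v), ()) }                        // promote the value into the key
    .repartitionAndSortWithinPartitions(new KeyPartitioner(4))  // sorting happens in the shuffle
    .map { case ((k, v), _) => (k, v) }                         // values now arrive sorted per key
  sorted.foreachPartition(_.foreach(println))
}
{code}
Grouping consecutive identical keys inside each partition is then a single streaming pass over the iterator, which is the same no-buffering constraint the TraversableOnce return type in the comment tries to express.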
[jira] [Resolved] (SPARK-4646) Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark
[ https://issues.apache.org/jira/browse/SPARK-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-4646. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3507 [https://github.com/apache/spark/pull/3507] Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark -- Key: SPARK-4646 URL: https://issues.apache.org/jira/browse/SPARK-4646 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor Fix For: 1.2.0 This patch just replaces a native quick sorter with Sorter(TimSort) in Spark. It could get performance gains by ~8% in my quick experiments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
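As a rough, self-contained illustration of the swap this ticket describes (the actual change goes through Spark's internal Sorter wrapper rather than a direct call to Arrays.sort), the difference is between Scala's in-place quicksort and the JDK's TimSort for object arrays:
{code}
// A scala.math.Ordering is also a java.util.Comparator, so one ordering serves both calls.
val edges: Array[(Int, Int)] = Array((3, 1), (1, 2), (2, 3))
val ord = implicitly[Ordering[(Int, Int)]]

scala.util.Sorting.quickSort(edges)(ord) // quicksort: in place, not stable
java.util.Arrays.sort(edges, ord)        // TimSort: stable and fast on partially sorted input
{code}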
[jira] [Resolved] (SPARK-4620) Add unpersist in Graph/GraphImpl
[ https://issues.apache.org/jira/browse/SPARK-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-4620. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3476 [https://github.com/apache/spark/pull/3476] Add unpersist in Graph/GraphImpl Key: SPARK-4620 URL: https://issues.apache.org/jira/browse/SPARK-4620 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Trivial Fix For: 1.2.0 Add an interface to uncache both vertices and edges of Graph/GraphImpl. This interface is useful when iterative graph operations build a new graph in each iteration, and the vertices and edges of previous iterations are no longer needed for the following iterations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
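A sketch of the usage pattern this ticket targets, written against the two unpersist calls that already exist on Graph (the resolved pull request adds a single combined method on top of this); the iterate helper and its update function are illustrative, not GraphX APIs.
{code}
import org.apache.spark.graphx.Graph

// Hypothetical iterative job: each step derives a new graph from the previous one and
// releases the previous iteration's cached vertices and edges.
def iterate(initial: Graph[Double, Int], steps: Int)
           (update: Graph[Double, Int] => Graph[Double, Int]): Graph[Double, Int] = {
  var g = initial.cache()
  for (_ <- 1 to steps) {
    val next = update(g).cache()
    next.vertices.count()                 // materialize the new graph before dropping the old one
    g.unpersistVertices(blocking = false) // release the old cached vertices
    g.edges.unpersist(blocking = false)   // release the old cached edges
    g = next
  }
  g
}
{code}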
[jira] [Created] (SPARK-4784) Model.fittingParamMap should store all Params
Joseph K. Bradley created SPARK-4784: Summary: Model.fittingParamMap should store all Params Key: SPARK-4784 URL: https://issues.apache.org/jira/browse/SPARK-4784 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor spark.ml's Model class should store all parameters in the fittingParamMap, not just the ones which were explicitly set. [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4784) Model.fittingParamMap should store all Params
[ https://issues.apache.org/jira/browse/SPARK-4784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-4784: - Description: spark.ml's Model class should store all parameters in the fittingParamMap, not just the ones which were explicitly set. CC: [~mengxr] was: spark.ml's Model class should store all parameters in the fittingParamMap, not just the ones which were explicitly set. [~mengxr] Model.fittingParamMap should store all Params - Key: SPARK-4784 URL: https://issues.apache.org/jira/browse/SPARK-4784 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor spark.ml's Model class should store all parameters in the fittingParamMap, not just the ones which were explicitly set. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
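To illustrate the intent of the request, here is a small, self-contained sketch (not the spark.ml API; Param, fittingParams, and the example parameters are stand-ins) of recording defaults alongside explicitly set values:
{code}
// Illustrative only: a fitted model should record every Param it was trained with,
// falling back to the default for anything the caller did not set explicitly.
case class Param(name: String, default: Any)

def fittingParams(allParams: Seq[Param], explicitlySet: Map[Param, Any]): Map[Param, Any] =
  allParams.map(p => p -> explicitlySet.getOrElse(p, p.default)).toMap

val maxIter = Param("maxIter", 100)
val regParam = Param("regParam", 0.0)
// maxIter was left at its default, but it still shows up in the recorded map.
val recorded = fittingParams(Seq(maxIter, regParam), Map(regParam -> 0.1))
{code}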
[jira] [Commented] (SPARK-1350) YARN ContainerLaunchContext should use cluster's JAVA_HOME
[ https://issues.apache.org/jira/browse/SPARK-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237525#comment-14237525 ] Aniket Bhatnagar commented on SPARK-1350: - [~sandyr] Using Environment.JAVA_HOME.$() causes issues while submitting Spark applications from a Windows box into a YARN cluster running on Linux (with the Spark master set to yarn-client). This is because Environment.JAVA_HOME.$() resolves to %JAVA_HOME%, which is not a valid executable path on Linux. Is this a Spark issue or a YARN issue? YARN ContainerLaunchContext should use cluster's JAVA_HOME -- Key: SPARK-1350 URL: https://issues.apache.org/jira/browse/SPARK-1350 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 0.9.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.0.0 {code} var javaCommand = "java" val javaHome = System.getenv("JAVA_HOME") if ((javaHome != null && !javaHome.isEmpty()) || env.isDefinedAt("JAVA_HOME")) { javaCommand = Environment.JAVA_HOME.$() + "/bin/java" } {code} Currently, if JAVA_HOME is specified on the client, it will be used instead of the value given on the cluster. This makes it so that Java must be installed in the same place on the client as on the cluster. This is a possibly incompatible change that we should get in before 1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4785) When called with arguments referring column fields, PMOD throws NPE
Cheng Lian created SPARK-4785: - Summary: When called with arguments referring column fields, PMOD throws NPE Key: SPARK-4785 URL: https://issues.apache.org/jira/browse/SPARK-4785 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Lian Priority: Blocker Reproduction steps with {{hive/console}}: {code} scala> loadTestTable("src") scala> sql("SELECT PMOD(key, 10) FROM src LIMIT 1").collect() ... 14/12/08 15:11:31 INFO DAGScheduler: Job 0 failed: runJob at basicOperators.scala:141, took 0.235788 s org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseNumeric.initialize(GenericUDFBaseNumeric.java:109) at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:116) at org.apache.spark.sql.hive.HiveGenericUdf.returnInspector$lzycompute(hiveUdfs.scala:156) at org.apache.spark.sql.hive.HiveGenericUdf.returnInspector(hiveUdfs.scala:155) at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:174) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:92) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$10.next(Iterator.scala:312) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487)
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237550#comment-14237550 ] Andrew Or commented on SPARK-4759: -- Ok yeah you're right, I can't reproduce it from the code snippet in branch 1.1 either. There seem to be at least two issues going on here... Can you confirm that the snippet does reproduce the deadlock in the master branch? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 7:28 AM: --- Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob() It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. - EDIT - This seems to reproduce it only on the master branch. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob() It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 7:29 AM: --- Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob() It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. === EDIT === This seems to reproduce it only on the master branch, but not 1.1. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob() It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. === EDIT === This seems to reproduce it only on the master branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 7:29 AM: --- Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob() It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. === EDIT === This seems to reproduce it only on the master branch. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob() It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. - EDIT - This seems to reproduce it only on the master branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4781) Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext)
[ https://issues.apache.org/jira/browse/SPARK-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-4781: - Issue Type: Bug (was: Improvement) Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext) -- Key: SPARK-4781 URL: https://issues.apache.org/jira/browse/SPARK-4781 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang Say I have a table created as follows: {code} CREATE EXTERNAL TABLE pmt ( `sorted::cre_ts` string ) STORED AS PARQUET LOCATION '...' {code} And I renamed the column from sorted::cre_ts to cre_ts by doing: {code} ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string {code} After renaming the column, the values in the column become all NULLs. {noformat} Before renaming: scala> sql("select `sorted::cre_ts` from pmt limit 1").collect res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54]) Execute renaming: scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string") res13: org.apache.spark.sql.SchemaRDD = SchemaRDD[972] at RDD at SchemaRDD.scala:108 == Query Plan == Native command: executed by Hive After renaming: scala> sql("select cre_ts from pmt limit 1").collect res16: Array[org.apache.spark.sql.Row] = Array([null]) {noformat} Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org