[jira] [Created] (SPARK-4783) Remove all System.exit calls from sparkcontext
David Semeria created SPARK-4783: Summary: Remove all System.exit calls from sparkcontext Key: SPARK-4783 URL: https://issues.apache.org/jira/browse/SPARK-4783 Project: Spark Issue Type: Bug Reporter: David Semeria A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running sparkcontexts. A typical server is created with the following pseudo code: ``` var continue = true while (continue){ try { server.run() } catch (e) { continue = log_and_examine_error(e) } ``` The problem is that sparkcontext frequently calls System.exit when it encounters a problem which means the server can only be re-spawned at the process level, which is much more messy than the simple code above. Therefore, I believe it makes sense to replace all System.exit calls in sparkcontext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
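The gateway loop above, written out as a runnable Scala sketch. GatewayServer and logAndExamineError are hypothetical stand-ins for the application's own server and error-triage logic, and the loop only behaves as intended if Spark surfaces failures as exceptions rather than calling System.exit, which is exactly the change this ticket proposes:

{code}
import scala.util.control.NonFatal

// Hypothetical gateway wrapping one or more long-running SparkContexts.
class GatewayServer {
  def run(): Unit = {
    // submit Spark jobs against the long-running SparkContext(s) here
  }
}

object Gateway {
  // Placeholder triage policy: log the error and decide whether to keep serving.
  def logAndExamineError(e: Throwable): Boolean = {
    e.printStackTrace()
    NonFatal(e) // retry only on non-fatal errors
  }

  def main(args: Array[String]): Unit = {
    val server = new GatewayServer
    var keepRunning = true // `continue` is a reserved word in Scala/Java, so renamed
    while (keepRunning) {
      try {
        server.run()
        keepRunning = false // clean shutdown
      } catch {
        case e: Throwable => keepRunning = logAndExamineError(e)
      }
    }
  }
}
{code}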
[jira] [Updated] (SPARK-4783) Remove all System.exit calls from sparkcontext
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Semeria updated SPARK-4783: - Description: A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running sparkcontexts. A typical server is created with the following pseudo code: var continue = true while (continue){ try { server.run() } catch (e) { continue = log_and_examine_error(e) } The problem is that sparkcontext frequently calls System.exit when it encounters a problem which means the server can only be re-spawned at the process level, which is much more messy than the simple code above. Therefore, I believe it makes sense to replace all System.exit calls in sparkcontext with the throwing of a fatal error. was: A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running sparkcontexts. A typical server is created with the following pseudo code: ``` var continue = true while (continue){ try { server.run() } catch (e) { continue = log_and_examine_error(e) } ``` The problem is that sparkcontext frequently calls System.exit when it encounters a problem which means the server can only be re-spawned at the process level, which is much more messy than the simple code above. Therefore, I believe it makes sense to replace all System.exit calls in sparkcontext with the throwing of a fatal error. > Remove all System.exit calls from sparkcontext > -- > > Key: SPARK-4783 > URL: https://issues.apache.org/jira/browse/SPARK-4783 > Project: Spark > Issue Type: Bug >Reporter: David Semeria > > A common architectural choice for integrating Spark within a larger > application is to employ a gateway to handle Spark jobs. The gateway is a > server which contains one or more long-running sparkcontexts. > A typical server is created with the following pseudo code: > var continue = true > while (continue){ > try { > server.run() > } catch (e) { > continue = log_and_examine_error(e) > } > The problem is that sparkcontext frequently calls System.exit when it > encounters a problem which means the server can only be re-spawned at the > process level, which is much more messy than the simple code above. > Therefore, I believe it makes sense to replace all System.exit calls in > sparkcontext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2808) update kafka to version 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237150#comment-14237150 ] Helena Edelson commented on SPARK-2808: --- I've done the migration, am testing the changes against Scala 2.10.4 since the parent spark pom uses 2.10.4 2.10 This will allow usage of the new producer, among other nice additions in 0.8.2. Though I do not know when it will be GA. > update kafka to version 0.8.2 > - > > Key: SPARK-2808 > URL: https://issues.apache.org/jira/browse/SPARK-2808 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Reporter: Anand Avati > > First kafka_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
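For context, "the new producer" refers to the org.apache.kafka.clients.producer API that ships with Kafka 0.8.2. A minimal sketch of its use; the broker address and topic name are placeholders, not values from this ticket:

{code}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object NewProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder broker list
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // send() is asynchronous and returns a java.util.concurrent.Future
    producer.send(new ProducerRecord[String, String]("test-topic", "key", "value")).get()
    producer.close()
  }
}
{code}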
[jira] [Commented] (SPARK-4181) Create separate options to control the client-mode AM resource allocation request
[ https://issues.apache.org/jira/browse/SPARK-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237163#comment-14237163 ] WangTaoTheTonic commented on SPARK-4181: I couldn't think of a use case for extraLibraryPath either. But we should honor extraClassPath, since some functionality, for instance the logging system, may load resources from a user-defined classpath. > Create separate options to control the client-mode AM resource allocation > request > - > > Key: SPARK-4181 > URL: https://issues.apache.org/jira/browse/SPARK-4181 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: WangTaoTheTonic >Priority: Minor > > I found related discussion in https://github.com/apache/spark/pull/2115, > SPARK-1953 and SPARK-1507. Recently I found some inconvenience in > configuring properties such as logging when using yarn-client mode. If no > one else does the work, I will try it, probably starting in a few days and > completing within the next one or two weeks. > As I am not very familiar with Spark on YARN, any discussion and feedback is > welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
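A sketch of the separation being discussed. The spark.yarn.am.* property names below are assumptions illustrating the kind of AM-only options proposed here, not settings that exist at the time of this discussion; only the spark.driver.* key is an existing setting:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// In yarn-client mode the driver runs in the client JVM, so spark.driver.*
// settings do not reach the separate Application Master container. The
// spark.yarn.am.* keys below are hypothetical AM-only options of the kind
// this ticket proposes.
val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("am-options-sketch")
  .set("spark.driver.extraClassPath", "/opt/myapp/conf") // affects the driver JVM only
  .set("spark.yarn.am.memory", "1g") // proposed: AM-only memory
  .set("spark.yarn.am.extraJavaOptions", "-Dlog4j.configuration=file:/opt/myapp/conf/log4j.properties") // proposed: AM-only logging config

val sc = new SparkContext(conf)
{code}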
[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237178#comment-14237178 ] Zhang, Liye commented on SPARK-4740: [~pwendell], I have the same concern first, and I verified that, and replaced the assembly jar files on all nodes, and also restart the cluster, but the problem still exists. One more solid proof is that the better performance is not stick to one machine, maybe one node performs better for this test and another node for other test. > Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey > > > Key: SPARK-4740 > URL: https://issues.apache.org/jira/browse/SPARK-4740 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 1.2.0 >Reporter: Zhang, Liye >Assignee: Reynold Xin >Priority: Blocker > Attachments: (rxin patch better executor)TestRunner sort-by-key - > Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner > sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report > 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner > sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, > TestRunner sort-by-key - Thread dump for executor 1_files (Nio-48 cores per > node).zip > > > When testing current spark master (1.3.0-snapshot) with spark-perf > (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService > takes much longer time than NIO based shuffle transferService. The network > throughput of Netty is only about half of that of NIO. > We tested with standalone mode, and the data set we used for test is 20 > billion records, and the total size is about 400GB. Spark-perf test is > Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each > executor memory is 64GB. The reduce tasks number is set to 1000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
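For reference, a sketch of how the two shuffle transfer services compared in this ticket are selected, assuming the spark.shuffle.blockTransferService key used by Spark 1.2; the application name is a placeholder:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// "netty" is the new default in 1.2.0; "nio" falls back to the previous
// NIO-based shuffle transfer service being compared against here.
val conf = new SparkConf()
  .setAppName("spark-perf-sort-by-key") // placeholder
  .set("spark.shuffle.blockTransferService", "nio") // or "netty"

val sc = new SparkContext(conf)
{code}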
[jira] [Comment Edited] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237178#comment-14237178 ] Zhang, Liye edited comment on SPARK-4740 at 12/7/14 3:32 PM: - [~pwendell], I have the same concern at first. I verified for several times, and replaced the assembly jar files on all nodes, and also restart the cluster, but the problem still exists. One more solid proof is that the better performance node is not stick to one machine, maybe one node performs better for this test and may switch to another node for next test. was (Author: liyezhang556520): [~pwendell], I have the same concern first, and I verified that, and replaced the assembly jar files on all nodes, and also restart the cluster, but the problem still exists. One more solid proof is that the better performance is not stick to one machine, maybe one node performs better for this test and another node for other test. > Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey > > > Key: SPARK-4740 > URL: https://issues.apache.org/jira/browse/SPARK-4740 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 1.2.0 >Reporter: Zhang, Liye >Assignee: Reynold Xin >Priority: Blocker > Attachments: (rxin patch better executor)TestRunner sort-by-key - > Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner > sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report > 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner > sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, > TestRunner sort-by-key - Thread dump for executor 1_files (Nio-48 cores per > node).zip > > > When testing current spark master (1.3.0-snapshot) with spark-perf > (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService > takes much longer time than NIO based shuffle transferService. The network > throughput of Netty is only about half of that of NIO. > We tested with standalone mode, and the data set we used for test is 20 > billion records, and the total size is about 400GB. Spark-perf test is > Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each > executor memory is 64GB. The reduce tasks number is set to 1000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237179#comment-14237179 ] Zhang, Liye commented on SPARK-4740: Hi [~adav], I kept speculation as default (false). > Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey > > > Key: SPARK-4740 > URL: https://issues.apache.org/jira/browse/SPARK-4740 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 1.2.0 >Reporter: Zhang, Liye >Assignee: Reynold Xin >Priority: Blocker > Attachments: (rxin patch better executor)TestRunner sort-by-key - > Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner > sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report > 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner > sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, > TestRunner sort-by-key - Thread dump for executor 1_files (Nio-48 cores per > node).zip > > > When testing current spark master (1.3.0-snapshot) with spark-perf > (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService > takes much longer time than NIO based shuffle transferService. The network > throughput of Netty is only about half of that of NIO. > We tested with standalone mode, and the data set we used for test is 20 > billion records, and the total size is about 400GB. Spark-perf test is > Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each > executor memory is 64GB. The reduce tasks number is set to 1000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2808) update kafka to version 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237150#comment-14237150 ] Helena Edelson edited comment on SPARK-2808 at 12/7/14 3:35 PM: I've done the migration, am testing the changes against Scala 2.10.4 since the parent spark pom uses 2.10.4 2.10 I do not know when it will be GA. was (Author: helena_e): I've done the migration, am testing the changes against Scala 2.10.4 since the parent spark pom uses 2.10.4 2.10 This will allow usage of the new producer, among other nice additions in 0.8.2. Though I do not know when it will be GA. > update kafka to version 0.8.2 > - > > Key: SPARK-2808 > URL: https://issues.apache.org/jira/browse/SPARK-2808 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Reporter: Anand Avati > > First kafka_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2808) update kafka to version 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237150#comment-14237150 ] Helena Edelson edited comment on SPARK-2808 at 12/7/14 3:52 PM: I've done the migration, am testing the changes against Scala 2.10.4 since the parent spark pom uses 2.10.4 2.10 I do not know when it will be GA. https://github.com/apache/spark/pull/3631 was (Author: helena_e): I've done the migration, am testing the changes against Scala 2.10.4 since the parent spark pom uses 2.10.4 2.10 I do not know when it will be GA. > update kafka to version 0.8.2 > - > > Key: SPARK-2808 > URL: https://issues.apache.org/jira/browse/SPARK-2808 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Reporter: Anand Avati > > First kafka_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237233#comment-14237233 ] koert kuipers commented on SPARK-3655: -- [~sandyr] One issue with groupByKeyAndSortValues that I did not have with foldLeft is that a user is under no obligation to finish reading all of the sorted values from the Iterable[V]. This leads to wrong results in the current implementation, since the underlying iterator is then wrongly positioned for the next call. I will add a unit test for this and see if there is an elegant and efficient way to handle it. > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: koert kuipers >Assignee: Koert Kuipers >Priority: Minor > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
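A minimal, Spark-free sketch of the failure mode described above, under the assumption that each per-key group is a lazy view over one shared, already-sorted iterator: a consumer that stops early leaves unread values queued up, so the next group starts in the wrong place.

{code}
// A partition's (key, value) records, already sorted by (key, value).
val sorted = Iterator(("a", 1), ("a", 3), ("b", 2), ("b", 5)).buffered

// Naive streaming grouping: each group's values are read lazily from the SAME iterator.
def nextGroup(): (String, Iterator[Int]) = {
  val key = sorted.head._1
  val values = new Iterator[Int] {
    def hasNext = sorted.hasNext && sorted.head._1 == key
    def next() = sorted.next()._2
  }
  (key, values)
}

val (k1, v1) = nextGroup()
println((k1, v1.next()))   // reads only ONE value of "a": prints (a,1)

// Because v1 was not drained, the shared iterator is still inside key "a",
// so the "next" group is wrongly (a, List(3)) instead of key "b".
val (k2, v2) = nextGroup()
println((k2, v2.toList))
{code}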
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237254#comment-14237254 ] koert kuipers commented on SPARK-3655: -- I also don't like the signature def groupByKeyAndSortValues(valueOrdering: Ordering[V], partitioner: Partitioner): RDD[(K, Iterable[V])], since I doubt it can be implemented efficiently. I would much prefer def groupByKeyAndSortValues(valueOrdering: Ordering[V], partitioner: Partitioner): RDD[(K, TraversableOnce[V])], but that is inconsistent with groupByKey (which I guess has Iterable in its return type for historical reasons; it used to be Seq). > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: koert kuipers >Assignee: Koert Kuipers >Priority: Minor > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
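For comparison, a sketch of the more constrained foldLeft-style shape referred to above; the name and signature are illustrative only, not an existing Spark API:

{code}
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// The fold consumes each sorted value exactly once, so a caller can never leave
// the underlying iterator half-consumed the way an exposed collection type allows.
def foldLeftByKey[K, V, B](rdd: RDD[(K, V)], zero: B,
                           valueOrdering: Ordering[V],
                           partitioner: Partitioner)(f: (B, V) => B): RDD[(K, B)] = ???

// Example shape of a call: keep the largest value per key, seen in sorted order.
// foldLeftByKey(pairs, Int.MinValue, Ordering.Int, new HashPartitioner(8))((acc, v) => math.max(acc, v))
{code}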
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode with multiple cores
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237264#comment-14237264 ] Davis Shepherd commented on SPARK-4759: --- It is possible to reproduce the issue with single core local mode. Simply change the partitions parameter to > 1 (in the attached version it uses the defaultParallelism of the spark context, which in single core mode is 1) in either of the coalesce calls. > Deadlock in complex spark job in local mode with multiple cores > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237298#comment-14237298 ] Apache Spark commented on SPARK-3655: - User 'koertkuipers' has created a pull request for this issue: https://github.com/apache/spark/pull/3632 > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: koert kuipers >Assignee: Koert Kuipers >Priority: Minor > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237299#comment-14237299 ] koert kuipers commented on SPARK-3655: -- I have a new pull request that implements just groupByKeyAndSortValues in Scala and Java; I will need some help with Python. The pull request is here: https://github.com/apache/spark/pull/3632 I changed the methods to return RDD[(K, TraversableOnce[V])] instead of RDD[(K, Iterable[V])], since I don't see a reasonable way to return Iterables without resorting to keeping the data in memory. The assumption made is that once you move on to the next key within a partition, the previous value (the TraversableOnce[V]) will no longer be used. I personally find this API too generic and too easy to abuse or make mistakes with, so I prefer a more constrained API like foldLeft. > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: koert kuipers >Assignee: Koert Kuipers >Priority: Minor > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-3655: - Affects Version/s: 1.2.0 > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: koert kuipers >Assignee: Koert Kuipers >Priority: Minor > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode with multiple cores
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237321#comment-14237321 ] Andrew Or commented on SPARK-4759: -- Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[8] 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def makeMyRdd(sc: SparkContext): RDD[Int] = { sc.parallelize(1 to 100).repartition(sc.defaultParallelism).cache() } def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = makeMyRdd(sc) rdd.checkpoint() rdd.count() val rdd2 = makeMyRdd(sc) val newRdd = rdd.union(rdd2).coalesce(sc.defaultParallelism).cache() newRdd.checkpoint() newRdd.count() } {code} 3. runMyJob(sc) It should be stuck at task 7/8. > Deadlock in complex spark job in local mode with multiple cores > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode with multiple cores
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/7/14 11:53 PM: Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[8] 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def makeMyRdd(sc: SparkContext): RDD[Int] = { sc.parallelize(1 to 100).repartition(4).cache() } def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = makeMyRdd(sc) rdd.checkpoint() rdd.count() val rdd2 = makeMyRdd(sc) val newRdd = rdd.union(rdd2).coalesce(4).cache() newRdd.checkpoint() newRdd.count() } {code} 3. runMyJob(sc) It should be stuck at task 7/8. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[8] 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def makeMyRdd(sc: SparkContext): RDD[Int] = { sc.parallelize(1 to 100).repartition(sc.defaultParallelism).cache() } def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = makeMyRdd(sc) rdd.checkpoint() rdd.count() val rdd2 = makeMyRdd(sc) val newRdd = rdd.union(rdd2).coalesce(sc.defaultParallelism).cache() newRdd.checkpoint() newRdd.count() } {code} 3. runMyJob(sc) It should be stuck at task 7/8. > Deadlock in complex spark job in local mode with multiple cores > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4759: - Summary: Deadlock in complex spark job in local mode (was: Deadlock in complex spark job in local mode with multiple cores) > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode with multiple cores
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/7/14 11:53 PM: Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[8] (or simply local) 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def makeMyRdd(sc: SparkContext): RDD[Int] = { sc.parallelize(1 to 100).repartition(4).cache() } def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = makeMyRdd(sc) rdd.checkpoint() rdd.count() val rdd2 = makeMyRdd(sc) val newRdd = rdd.union(rdd2).coalesce(4).cache() newRdd.checkpoint() newRdd.count() } {code} 3. runMyJob(sc) It should be stuck at task 7/8. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[8] 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def makeMyRdd(sc: SparkContext): RDD[Int] = { sc.parallelize(1 to 100).repartition(4).cache() } def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = makeMyRdd(sc) rdd.checkpoint() rdd.count() val rdd2 = makeMyRdd(sc) val newRdd = rdd.union(rdd2).coalesce(4).cache() newRdd.checkpoint() newRdd.count() } {code} 3. runMyJob(sc) It should be stuck at task 7/8. > Deadlock in complex spark job in local mode with multiple cores > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/7/14 11:57 PM: Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def makeMyRdd(sc: SparkContext): RDD[Int] = { sc.parallelize(1 to 100).repartition(4).cache() } def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = makeMyRdd(sc) rdd.checkpoint() rdd.count() val rdd2 = makeMyRdd(sc) val newRdd = rdd.union(rdd2).coalesce(4).cache() newRdd.checkpoint() newRdd.count() } {code} 3. runMyJob(sc) It should be stuck at task 1/4. Note that with local-cluster and (local) standalone mode, it pauses a little at 1/4 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[8] (or simply local) 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def makeMyRdd(sc: SparkContext): RDD[Int] = { sc.parallelize(1 to 100).repartition(4).cache() } def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = makeMyRdd(sc) rdd.checkpoint() rdd.count() val rdd2 = makeMyRdd(sc) val newRdd = rdd.union(rdd2).coalesce(4).cache() newRdd.checkpoint() newRdd.count() } {code} 3. runMyJob(sc) It should be stuck at task 7/8. > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 12:17 AM: Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = sc.parallelize(1 to 100).repartition(4).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(4) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 4/8. Note that with local-cluster and (local) standalone mode, it pauses a little at 4/8 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def makeMyRdd(sc: SparkContext): RDD[Int] = { sc.parallelize(1 to 100).repartition(4).cache() } def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = makeMyRdd(sc) rdd.checkpoint() rdd.count() val rdd2 = makeMyRdd(sc) val newRdd = rdd.union(rdd2).coalesce(4).cache() newRdd.checkpoint() newRdd.count() } {code} 3. runMyJob(sc) It should be stuck at task 1/4. Note that with local-cluster and (local) standalone mode, it pauses a little at 1/4 too, but finishes shortly afterwards. > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 12:20 AM: Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = sc.parallelize(1 to 100).repartition(4).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(4) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 4/8. Note that with local-cluster and (local) standalone mode, it pauses a little at 4/8 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = sc.parallelize(1 to 100).repartition(4).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(4) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 4/8. Note that with local-cluster and (local) standalone mode, it pauses a little at 4/8 too, but finishes shortly afterwards. > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 12:20 AM: Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = sc.parallelize(1 to 100).repartition(4).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(4) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 4/8. Note that with local-cluster and (local) standalone mode, it pauses a little at 4/8 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD def runMyJob(sc: SparkContext): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = sc.parallelize(1 to 100).repartition(4).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(4) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 4/8. Note that with local-cluster and (local) standalone mode, it pauses a little at 4/8 too, but finishes shortly afterwards. > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 12:25 AM: Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = sc.parallelize(1 to 100).coalesce(4).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).coalesce(4) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 1/2. Note that with local-cluster and (local) standalone mode, it pauses a little at 1/2 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = sc.parallelize(1 to 100).repartition(4).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(4) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 4/8. Note that with local-cluster and (local) standalone mode, it pauses a little at 4/8 too, but finishes shortly afterwards. > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 12:36 AM: Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 5/12. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/12 too, but finishes shortly afterwards. > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 12:36 AM: Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 5/12. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/12 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = sc.parallelize(1 to 100).coalesce(4).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).coalesce(4) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 1/2. Note that with local-cluster and (local) standalone mode, it pauses a little at 1/2 too, but finishes shortly afterwards. > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4783) Remove all System.exit calls from sparkcontext
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237341#comment-14237341 ] Patrick Wendell commented on SPARK-4783: For code cleanliness, we should go through and look everywhere we call System.exit() and see which ones can be converted safely to exceptions. Most of our use of System.exit is on the executor side, but there may be a few on the driver/SparkContext side. That said, if there is a fatal exception in the SparkContext, I don't think your app can safely just catch the exception, log it, and create a new SparkContext. Is that what you are trying to do? In that case there could be static state around that is not properly cleaned up and will cause the new context to be buggy. > Remove all System.exit calls from sparkcontext > -- > > Key: SPARK-4783 > URL: https://issues.apache.org/jira/browse/SPARK-4783 > Project: Spark > Issue Type: Bug >Reporter: David Semeria > > A common architectural choice for integrating Spark within a larger > application is to employ a gateway to handle Spark jobs. The gateway is a > server which contains one or more long-running sparkcontexts. > A typical server is created with the following pseudo code: > var continue = true > while (continue){ > try { > server.run() > } catch (e) { > continue = log_and_examine_error(e) > } > The problem is that sparkcontext frequently calls System.exit when it > encounters a problem which means the server can only be re-spawned at the > process level, which is much more messy than the simple code above. > Therefore, I believe it makes sense to replace all System.exit calls in > sparkcontext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
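A sketch of the restart pattern in question, reflecting the caveats above: it only becomes possible if SparkContext throws instead of exiting, the failed context should be stopped before a replacement is created, and leftover static state may still make the new context unreliable. runJobs and the retry policy are placeholders:

{code}
import scala.util.control.NonFatal
import org.apache.spark.{SparkConf, SparkContext}

def runJobs(sc: SparkContext): Unit = { /* the gateway's actual workload goes here */ }

val maxAttempts = 3
var attempt = 0
var done = false
while (!done && attempt < maxAttempts) {
  attempt += 1
  val sc = new SparkContext(new SparkConf().setAppName("gateway").setMaster("local[*]"))
  try {
    runJobs(sc)
    done = true
  } catch {
    case NonFatal(e) =>
      // Recoverable only if Spark throws rather than calling System.exit.
      e.printStackTrace()
  } finally {
    sc.stop() // release the old context before creating a new one
  }
}
{code}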
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 1:29 AM: --- Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.checkpoint() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 1:30 AM: --- Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { sc.setCheckpointDir("/tmp/spark-test") val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 1:33 AM: --- Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob() It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob(sc) It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237371#comment-14237371 ] Davis Shepherd commented on SPARK-4759: --- This doesn't seem to reproduce the issue for me. The job finishes regardless of how many times I call runMyJob() > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237372#comment-14237372 ] Andrew Or commented on SPARK-4759: -- Found the issue. The task scheduler schedules tasks based on the preferred locations specified by the partition. In CoalescedRDD's partitions, we use the empty string as the default preferred location, even though this does not actually represent a real host: https://github.com/apache/spark/blob/e895e0cbecbbec1b412ff21321e57826d2d0a982/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L41 As a result, the task scheduler doesn't schedule a subset of the tasks on the local executor because these tasks are supposed to be scheduled on the host "" (empty string) that doesn't actually exist. I have not dug into the details of PartitionCoalescer as to why this is only specific to local mode. I'll submit a fix shortly. > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
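The mechanism described above can be reduced to a small sketch (field and method names are simplified, not a verbatim copy of CoalescedRDD):

{code}
// Before: an empty string is reported as a preferred host, so in local mode the
// scheduler keeps waiting for locality on a host named "" that never registers.
case class CoalescedPartitionBefore(index: Int, preferredLocation: String = "")
def preferredLocationsBefore(p: CoalescedPartitionBefore): Seq[String] =
  Seq(p.preferredLocation)

// Direction of a fix: report no preference at all, so the task can run anywhere.
case class CoalescedPartitionAfter(index: Int, preferredLocation: Option[String] = None)
def preferredLocationsAfter(p: CoalescedPartitionAfter): Seq[String] =
  p.preferredLocation.toSeq
{code}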
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237374#comment-14237374 ] Andrew Or commented on SPARK-4759: -- [~dgshep] That's strange. I am able to reproduce this every time, and I only need to call "runMyJob" once. What master are you running? > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237376#comment-14237376 ] Davis Shepherd commented on SPARK-4759: --- local[2] on the latest commit of branch 1.1 > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237374#comment-14237374 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 2:36 AM: --- [~dgshep] That's strange. I am able to reproduce this every time, and I only need to call "runMyJob" once. What master are you running? I just tried local, local[6], and local[*] and they all reproduced the deadlock. I am running the master branch with this commit: 6eb1b6f6204ea3c8083af3fb9cd990d9f3dac89d was (Author: andrewor14): [~dgshep] That's strange. I am able to reproduce this every time, and I only need to call "runMyJob" once. What master are you running? > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237377#comment-14237377 ] Andrew Or commented on SPARK-4759: -- Hm I'll try branch 1.1 again later tonight. There might very well be more than one issue that causes this. > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4783: --- Summary: System.exit() calls in SparkContext disrupt applications embedding Spark (was: Remove all System.exit calls from sparkcontext) > System.exit() calls in SparkContext disrupt applications embedding Spark > > > Key: SPARK-4783 > URL: https://issues.apache.org/jira/browse/SPARK-4783 > Project: Spark > Issue Type: Bug >Reporter: David Semeria > > A common architectural choice for integrating Spark within a larger > application is to employ a gateway to handle Spark jobs. The gateway is a > server which contains one or more long-running sparkcontexts. > A typical server is created with the following pseudo code: > var continue = true > while (continue){ > try { > server.run() > } catch (e) { > continue = log_and_examine_error(e) > } > The problem is that sparkcontext frequently calls System.exit when it > encounters a problem which means the server can only be re-spawned at the > process level, which is much more messy than the simple code above. > Therefore, I believe it makes sense to replace all System.exit calls in > sparkcontext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
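For reference, a runnable Scala version of the gateway loop sketched in the description could look like the following (GatewayServer and logAndExamineError are illustrative placeholders); it only behaves as intended once SparkContext failures surface as catchable exceptions rather than System.exit calls:

{code}
object GatewayLoop {
  // Hypothetical stand-ins for the reporter's server and error handler.
  trait GatewayServer { def run(): Unit }

  def logAndExamineError(e: Throwable): Boolean = {
    Console.err.println(s"Gateway failed: ${e.getMessage}")
    !e.isInstanceOf[Error] // give up only on truly fatal errors
  }

  def serve(server: GatewayServer): Unit = {
    var keepRunning = true
    while (keepRunning) {
      try server.run()
      catch { case e: Throwable => keepRunning = logAndExamineError(e) }
    }
  }
}
{code}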
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237383#comment-14237383 ] Davis Shepherd commented on SPARK-4759: --- Ok your version does reproduce the issue against the spark-core 1.1.1 artifact if I copy and paste your code into the original SparkBugReplicator, but it only seems to hang on the second time the job is run :P. This smells of a race condition... > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237393#comment-14237393 ] Davis Shepherd commented on SPARK-4759: --- I still cannot reproduce the issue in the spark shell on tag v1.1.1 > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237393#comment-14237393 ] Davis Shepherd edited comment on SPARK-4759 at 12/8/14 3:15 AM: I still cannot reproduce the issue with your snippet in the spark shell on tag v1.1.1 was (Author: dgshep): I still cannot reproduce the issue in the spark shell on tag v1.1.1 > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237299#comment-14237299 ] koert kuipers edited comment on SPARK-3655 at 12/8/14 3:24 AM: --- i have a new pullreq that implements just groupByKeyAndSortValues in scala and java. i will need some help with python. pullreq is here: https://github.com/apache/spark/pull/3632 i changed methods to return RDD[(K, TraversableOnce[V])] instead of RDD[(K, Iterable[V])], since i dont see a reasonable way to implement it so that it returns Iterables without resorting to keeping the data in memory. The assumption made is that once you move on to the next key within a partition that the previous value (so the TraversableOnce[V]) will no longer be used. I personally find this API too generic, and too easy to abuse or make mistakes with. So i prefer a more constrained API like foldLeft. Or perhaps groupByKeyAndSortValues could be DeveloperAPI? was (Author: koert): i have a new pullreq that implements just groupByKeyAndSortValues in scala and java. i will need some help with python. pullreq is here: https://github.com/apache/spark/pull/3632 i changed methods to return RDD[(K, TraversableOnce[V])] instead of RDD[(K, Iterable[V])], since i dont see a reasonable way to implement it so that it returns Iterables without resorting to keeping the data in memory. The assumption made is that once you move on to the next key within a partition that the previous value (so the TraversableOnce[V]) will no longer be used. I personally find this API too generic, and too easy to abuse or make mistakes with. So i prefer a more constrained API like foldLeft. > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: koert kuipers >Assignee: Koert Kuipers >Priority: Minor > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
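As an aside for readers following the thread, a similar effect can be approximated with a composite key and repartitionAndSortWithinPartitions (available on pair RDDs as of Spark 1.2). This is a sketch of that workaround under those assumptions, not the groupByKeyAndSortValues API proposed in the pull request:

{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Secondary-sort workaround: partition by the real key, but sort each partition
// by the composite (key, value) so values arrive in order per key.
def sortedWithinKeys(data: RDD[(String, Int)], numPartitions: Int): RDD[((String, Int), Int)] = {
  val byFirstComponent = new HashPartitioner(numPartitions) {
    override def getPartition(key: Any): Int = key match {
      case (k, _) => super.getPartition(k) // keep all values of a key together
      case other  => super.getPartition(other)
    }
  }
  data.map { case (k, v) => ((k, v), v) }
      .repartitionAndSortWithinPartitions(byFirstComponent)
}
{code}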
[jira] [Resolved] (SPARK-4646) Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark
[ https://issues.apache.org/jira/browse/SPARK-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-4646. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3507 [https://github.com/apache/spark/pull/3507] > Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark > -- > > Key: SPARK-4646 > URL: https://issues.apache.org/jira/browse/SPARK-4646 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: Takeshi Yamamuro >Priority: Minor > Fix For: 1.2.0 > > > This patch just replaces a native quick sorter with Sorter(TimSort) in Spark. > It could get performance gains by ~8% in my quick experiments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
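The trade-off behind the patch can be illustrated outside Spark: scala.util.Sorting.quickSort is the sorter being replaced, while java.util.Arrays.sort with a comparator uses TimSort, which exploits existing runs in the data. This stand-alone micro-comparison is only indicative and does not use Spark's internal Sorter class:

{code}
import java.util.{Arrays => JArrays}
import scala.util.Sorting

object SortComparison extends App {
  def time(label: String)(body: => Unit): Unit = {
    val start = System.nanoTime()
    body
    println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  }

  // Mostly sorted strings with a few random swaps: the case TimSort handles well.
  def data(): Array[String] = {
    val a = Array.tabulate(1 << 20)(i => f"item-$i%08d")
    val rnd = new scala.util.Random(42)
    for (_ <- 1 to 1000) {
      val i = rnd.nextInt(a.length); val j = rnd.nextInt(a.length)
      val t = a(i); a(i) = a(j); a(j) = t
    }
    a
  }

  val a1 = data()
  val a2 = data()
  time("quickSort (Scala)") { Sorting.quickSort(a1) }
  time("TimSort (JDK)") { JArrays.sort(a2, Ordering.String) }
}
{code}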
[jira] [Resolved] (SPARK-4620) Add unpersist in Graph/GraphImpl
[ https://issues.apache.org/jira/browse/SPARK-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-4620. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3476 [https://github.com/apache/spark/pull/3476] > Add unpersist in Graph/GraphImpl > > > Key: SPARK-4620 > URL: https://issues.apache.org/jira/browse/SPARK-4620 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: Takeshi Yamamuro >Priority: Trivial > Fix For: 1.2.0 > > > Add an IF to uncache both vertices and edges of Graph/GraphImpl. > This IF is useful when iterative graph operations build a new graph in each > iteration, and the vertices and edges of previous iterations are no longer > needed for following iterations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
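An illustrative use of the new hook in an iterative computation (a sketch only; Graph.unpersist with a blocking flag is assumed to match the interface added here, and the update step is arbitrary):

{code}
import org.apache.spark.graphx.Graph

// Release the previous iteration's cached vertices and edges once the next
// graph has been materialized, as the issue description suggests.
def iterate(initial: Graph[Double, Double], numIters: Int): Graph[Double, Double] = {
  var g = initial.cache()
  for (_ <- 1 to numIters) {
    val next = g.mapVertices((_, attr) => attr * 0.85).cache()
    next.vertices.count()          // force materialization before dropping the old graph
    g.unpersist(blocking = false)  // the hook added by this issue (assumed signature)
    g = next
  }
  g
}
{code}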
[jira] [Created] (SPARK-4784) Model.fittingParamMap should store all Params
Joseph K. Bradley created SPARK-4784: Summary: Model.fittingParamMap should store all Params Key: SPARK-4784 URL: https://issues.apache.org/jira/browse/SPARK-4784 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor spark.ml's Model class should store all parameters in the fittingParamMap, not just the ones which were explicitly set. [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4784) Model.fittingParamMap should store all Params
[ https://issues.apache.org/jira/browse/SPARK-4784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-4784: - Description: spark.ml's Model class should store all parameters in the fittingParamMap, not just the ones which were explicitly set. CC: [~mengxr] was: spark.ml's Model class should store all parameters in the fittingParamMap, not just the ones which were explicitly set. [~mengxr] > Model.fittingParamMap should store all Params > - > > Key: SPARK-4784 > URL: https://issues.apache.org/jira/browse/SPARK-4784 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Minor > > spark.ml's Model class should store all parameters in the fittingParamMap, > not just the ones which were explicitly set. > CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
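The distinction the issue is asking for can be sketched with plain maps (the names and values below are hypothetical, not the spark.ml classes):

{code}
// Hypothetical illustration: the params recorded with a fitted model should be
// the defaults merged with the explicitly set values, not the explicit values alone.
val defaults      = Map("maxIter" -> 100.0, "regParam" -> 0.0)
val explicitlySet = Map("regParam" -> 0.1)

val fittingParamMapCurrent  = explicitlySet             // "maxIter" is lost
val fittingParamMapProposed = defaults ++ explicitlySet // every Param is recorded
{code}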
[jira] [Commented] (SPARK-1350) YARN ContainerLaunchContext should use cluster's JAVA_HOME
[ https://issues.apache.org/jira/browse/SPARK-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237525#comment-14237525 ] Aniket Bhatnagar commented on SPARK-1350: - [~sandyr] Using Environment.JAVA_HOME.$() causes issues while submitting spark applications from a windows box into a Yarn cluster running on Linux (with spark master set as yarn-client). This is because Environment.JAVA_HOME.$() resolves to %JAVA_HOME% which results in not a valid executable on Linux. Is this a Spark issue or a YARN issue? > YARN ContainerLaunchContext should use cluster's JAVA_HOME > -- > > Key: SPARK-1350 > URL: https://issues.apache.org/jira/browse/SPARK-1350 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 0.9.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Fix For: 1.0.0 > > > {code} > var javaCommand = "java" > val javaHome = System.getenv("JAVA_HOME") > if ((javaHome != null && !javaHome.isEmpty()) || > env.isDefinedAt("JAVA_HOME")) { > javaCommand = Environment.JAVA_HOME.$() + "/bin/java" > } > {code} > Currently, if JAVA_HOME is specified on the client, it will be used instead > of the value given on the cluster. This makes it so that Java must be > installed in the same place on the client as on the cluster. > This is a possibly incompatible change that we should get in before 1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
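A hedged sketch of one possible direction, not necessarily what Spark or YARN ended up doing: newer Hadoop releases expose Environment.$$(), whose placeholder is expanded by the NodeManager on the cluster side, so the submitting client's OS no longer determines the syntax baked into the container launch command.

{code}
import org.apache.hadoop.yarn.api.ApplicationConstants.Environment

// Client-side expansion (the problematic case): Environment.JAVA_HOME.$() picks
// "%JAVA_HOME%" or "$JAVA_HOME" based on the client's OS.
// Cluster-side expansion (assumed available only on Hadoop versions that ship $$()):
val javaCommand = Environment.JAVA_HOME.$$() + "/bin/java"
{code}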
[jira] [Created] (SPARK-4785) When called with arguments referring column fields, PMOD throws NPE
Cheng Lian created SPARK-4785: - Summary: When called with arguments referring column fields, PMOD throws NPE Key: SPARK-4785 URL: https://issues.apache.org/jira/browse/SPARK-4785 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Lian Priority: Blocker Reproduction steps with {{hive/console}}: {code} scala> loadTestTable("src") scala> sql("SELECT PMOD(key, 10) FROM src LIMIT 1").collect() ... 14/12/08 15:11:31 INFO DAGScheduler: Job 0 failed: runJob at basicOperators.scala:141, took 0.235788 s org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseNumeric.initialize(GenericUDFBaseNumeric.java:109) at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:116) at org.apache.spark.sql.hive.HiveGenericUdf.returnInspector$lzycompute(hiveUdfs.scala:156) at org.apache.spark.sql.hive.HiveGenericUdf.returnInspector(hiveUdfs.scala:155) at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:174) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:92) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$10.next(Iterator.scala:312) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487)
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237550#comment-14237550 ] Andrew Or commented on SPARK-4759: -- Ok yeah you're right, I can't reproduce it from the code snippet in branch 1.1 either. There seems to be at least two issues going on here... Can you confirm that the snippet does reproduce the lock in master branch? > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 7:28 AM: --- Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob() It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. - EDIT - This seems to reproduce it only on the master branch. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob() It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 7:29 AM: --- Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob() It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. === EDIT === This seems to reproduce it only on the master branch, but not 1.1. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob() It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. === EDIT === This seems to reproduce it only on the master branch. > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237321#comment-14237321 ] Andrew Or edited comment on SPARK-4759 at 12/8/14 7:29 AM: --- Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob() It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. === EDIT === This seems to reproduce it only on the master branch. was (Author: andrewor14): Hey I came up with a much smaller reproduction for this from your program. 1. Start spark-shell with --master local[N] where N can be anything (or simply local with 1 core) 2. Copy and paste the following into your REPL {code} def runMyJob(): Unit = { val rdd = sc.parallelize(1 to 100).repartition(5).cache() rdd.count() val rdd2 = sc.parallelize(1 to 100).repartition(12) rdd.union(rdd2).count() } {code} 3. runMyJob() It should be stuck at task 5/17. Note that with local-cluster and (local) standalone mode, it pauses a little at 5/17 too, but finishes shortly afterwards. - EDIT - This seems to reproduce it only on the master branch. > Deadlock in complex spark job in local mode > --- > > Key: SPARK-4759 > URL: https://issues.apache.org/jira/browse/SPARK-4759 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.1, 1.2.0, 1.3.0 > Environment: Java version "1.7.0_51" > Java(TM) SE Runtime Environment (build 1.7.0_51-b13) > Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) > Mac OSX 10.10.1 > Using local spark context >Reporter: Davis Shepherd >Assignee: Andrew Or >Priority: Critical > Attachments: SparkBugReplicator.scala > > > The attached test class runs two identical jobs that perform some iterative > computation on an RDD[(Int, Int)]. This computation involves > # taking new data merging it with the previous result > # caching and checkpointing the new result > # rinse and repeat > The first time the job is run, it runs successfully, and the spark context is > shut down. The second time the job is run with a new spark context in the > same process, the job hangs indefinitely, only having scheduled a subset of > the necessary tasks for the final stage. > Ive been able to produce a test case that reproduces the issue, and I've > added some comments where some knockout experimentation has left some > breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4781) Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext)
[ https://issues.apache.org/jira/browse/SPARK-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-4781: - Issue Type: Bug (was: Improvement) > Column values become all NULL after doing ALTER TABLE CHANGE for renaming > column names (Parquet external table in HiveContext) > -- > > Key: SPARK-4781 > URL: https://issues.apache.org/jira/browse/SPARK-4781 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.3.0, 1.2.1 >Reporter: Jianshi Huang > > I have a table say created like follows: > {code} > CREATE EXTERNAL TABLE pmt ( > `sorted::cre_ts` string > ) > STORED AS PARQUET > LOCATION '...' > {code} > And I renamed the column from sorted::cre_ts to cre_ts by doing: > {code} > ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string > {code} > After renaming the column, the values in the column become all NULLs. > {noformat} > Before renaming: > scala> sql("select `sorted::cre_ts` from pmt limit 1").collect > res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54]) > Execute renaming: > scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string") > res13: org.apache.spark.sql.SchemaRDD = > SchemaRDD[972] at RDD at SchemaRDD.scala:108 > == Query Plan == > > After renaming: > scala> sql("select cre_ts from pmt limit 1").collect > res16: Array[org.apache.spark.sql.Row] = Array([null]) > {noformat} > Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org