[jira] [Updated] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores
[ https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davis Shepherd updated SPARK-20483:
-----------------------------------
    Description:
If {{spark.cores.max = 10}}, for example, and {{spark.executor.cores = 4}}, two executors will be launched, so {{totalCoresAcquired = 8}}. No future Mesos offer will have tasks launched on it, because {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= maxCores}} will always evaluate to false. However, in {{handleMatchedOffers}} we check {{totalCoresAcquired >= maxCores}} to determine whether we should decline the offer "for a configurable amount of time to avoid starving other frameworks", and in the scenario above this will also always evaluate to false. This leaves the framework in limbo: it will never launch any new executors, yet it only declines offers for the Mesos default of 5 seconds, thus starving other frameworks of offers.

  was: (the same description text, followed by "Relates to: SPARK-12554, SPARK-19702")

> Mesos Coarse mode may starve other Mesos frameworks if max cores is not a
> multiple of executor cores
>
>                 Key: SPARK-20483
>                 URL: https://issues.apache.org/jira/browse/SPARK-20483
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos
>    Affects Versions: 2.1.0
>            Reporter: Davis Shepherd
>            Priority: Minor

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
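The arithmetic in the description can be sketched as a toy model. This is plain, illustrative Python standing in for Spark's actual Scala scheduler code; {{run_offer_cycle}} and its parameter names are inventions for this sketch, not Spark or Mesos API, and it ignores memory and other resource constraints:

```python
# Toy model of the two checks described above (illustrative only --
# not Spark's actual Scala scheduler code).

def run_offer_cycle(max_cores, executor_cores, offers):
    """Simulate a stream of Mesos offers (cores per offer).

    Returns (executors_launched, short_declines), where a "short decline"
    is an offer declined only for the Mesos default filter (~5 seconds).
    """
    total_cores_acquired = 0
    executors_launched = 0
    short_declines = 0
    for offered_cores in offers:
        if total_cores_acquired >= max_cores:
            # Would decline "for a configurable amount of time to avoid
            # starving other frameworks" -- never reached while 8 < 10.
            continue
        if (executor_cores + total_cores_acquired <= max_cores
                and offered_cores >= executor_cores):
            total_cores_acquired += executor_cores
            executors_launched += 1
        else:
            short_declines += 1  # the limbo state: no launch, brief decline
    return executors_launched, short_declines
```

With {{max_cores = 10}} and {{executor_cores = 4}}, two executors launch and every subsequent offer falls through to the short decline; with {{max_cores = 8}} (an exact multiple) the long-decline branch is reached instead and no offer is held in limbo.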
[jira] [Updated] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores
[ https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davis Shepherd updated SPARK-20483:
-----------------------------------
    Description: (description text unchanged) Relates to: SPARK-12554, SPARK-19702

  was: (description text unchanged) Relates to: SPARK-12554
[jira] [Updated] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores
[ https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davis Shepherd updated SPARK-20483:
-----------------------------------
    Description: (unchanged in substance; `...` code literals reformatted as {{...}} markup, and the typo "lauched" corrected to "launched")

Relates to: SPARK-12554
[jira] [Created] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores
Davis Shepherd created SPARK-20483:
-----------------------------------

             Summary: Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores
                 Key: SPARK-20483
                 URL: https://issues.apache.org/jira/browse/SPARK-20483
             Project: Spark
          Issue Type: Bug
          Components: Mesos
    Affects Versions: 2.1.0
            Reporter: Davis Shepherd
            Priority: Minor


If `spark.cores.max = 10` for example and `spark.executor.cores = 4`, 2 executors will get launched, thus `totalCoresAcquired = 8`. All future Mesos offers will not get tasks launched because `sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= maxCores` will always evaluate to false. However, in `handleMatchedOffers` we check if `totalCoresAcquired >= maxCores` to determine if we should decline the offer "for a configurable amount of time to avoid starving other frameworks", and this will always evaluate to false in the above scenario. This leaves the framework in a state of limbo where it will never launch any new executors, but only decline offers for the Mesos default of 5 seconds, thus starving other frameworks of offers.

Relates to: SPARK-12554
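As the summary says, this can only happen when {{spark.cores.max}} is not an exact multiple of {{spark.executor.cores}}: the leftover {{maxCores % executorCores}} cores are too few for another executor, yet they keep {{totalCoresAcquired}} strictly below {{maxCores}}. A quick sanity check of that claim, again in illustrative Python ({{ends_in_limbo}} is an invented name for this sketch), assuming offers are plentiful and large enough:

```python
# Assumption: plenty of sufficiently large offers, so the scheduler acquires
# executors until the next one would exceed max_cores. Illustrative only.

def ends_in_limbo(max_cores, executor_cores):
    total_acquired = (max_cores // executor_cores) * executor_cores
    can_launch_more = executor_cores + total_acquired <= max_cores  # launch check
    long_decline = total_acquired >= max_cores                      # starvation guard
    # Limbo: cannot launch another executor, yet never takes the
    # decline-for-a-long-time branch.
    return not can_launch_more and not long_decline

# Limbo occurs exactly when max_cores is not a multiple of executor_cores:
assert all(
    ends_in_limbo(m, e) == (m % e != 0)
    for m in range(1, 100)
    for e in range(1, m + 1)
)
```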
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238221#comment-14238221 ]

Davis Shepherd commented on SPARK-4759:
---------------------------------------

The fix appears to resolve the issue in master for both repros.

> Deadlock in complex spark job in local mode
> -------------------------------------------
>
>                 Key: SPARK-4759
>                 URL: https://issues.apache.org/jira/browse/SPARK-4759
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.1, 1.2.0, 1.3.0
>         Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>            Reporter: Davis Shepherd
>            Assignee: Andrew Or
>            Priority: Critical
>         Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative
> computation on an RDD[(Int, Int)]. This computation involves
> # taking new data and merging it with the previous result
> # caching and checkpointing the new result
> # rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is
> shut down. The second time the job is run with a new spark context in the
> same process, the job hangs indefinitely, only having scheduled a subset of
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've
> added some comments where some knockout experimentation has left some
> breadcrumbs as to where the issue might be.
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237393#comment-14237393 ]

Davis Shepherd edited comment on SPARK-4759 at 12/8/14 3:15 AM:
----------------------------------------------------------------

I still cannot reproduce the issue with your snippet in the spark shell on tag v1.1.1

was (Author: dgshep):
I still cannot reproduce the issue in the spark shell on tag v1.1.1
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237393#comment-14237393 ]

Davis Shepherd commented on SPARK-4759:
---------------------------------------

I still cannot reproduce the issue in the spark shell on tag v1.1.1
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237383#comment-14237383 ]

Davis Shepherd commented on SPARK-4759:
---------------------------------------

Ok your version does reproduce the issue against the spark-core 1.1.1 artifact if I copy and paste your code into the original SparkBugReplicator, but it only seems to hang on the second time the job is run :P. This smells of a race condition...
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237376#comment-14237376 ]

Davis Shepherd commented on SPARK-4759:
---------------------------------------

local[2] on the latest commit of branch 1.1
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237371#comment-14237371 ]

Davis Shepherd commented on SPARK-4759:
---------------------------------------

This doesn't seem to reproduce the issue for me. The job finishes regardless of how many times I call runMyJob()
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode with multiple cores
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237264#comment-14237264 ]

Davis Shepherd commented on SPARK-4759:
---------------------------------------

It is possible to reproduce the issue with single-core local mode. Simply change the partitions parameter to > 1 in either of the coalesce calls (in the attached version it uses the defaultParallelism of the spark context, which in single-core mode is 1).
[jira] [Commented] (SPARK-4759) Deadlock in complex spark job.
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14235703#comment-14235703 ]

Davis Shepherd commented on SPARK-4759:
---------------------------------------

Fair enough. As far as I can work out, it appears that there are 9 expected tasks in the final stage, but only 6 ever get scheduled. I suppose that leaves only one thread waiting on something that will never happen. ;)
[jira] [Updated] (SPARK-4759) Deadlock in complex spark job.
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davis Shepherd updated SPARK-4759:
----------------------------------
    Description:
The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves
# taking new data and merging it with the previous result
# caching and checkpointing the new result
# rinse and repeat

The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage.

I've been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be.

  was: (the same description text, with the list items marked "*" instead of "#")
[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job.
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14235651#comment-14235651 ]

Davis Shepherd edited comment on SPARK-4759 at 12/5/14 3:52 PM:
----------------------------------------------------------------

Here is a thread dump with superfluous threads omitted (namely jetty qtp threads)

{noformat}
"sparkDriver-akka.actor.default-dispatcher-18" daemon prio=5 tid=7fc3853ef800 nid=0x119e87000 waiting on condition [119e86000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <7c33229d0> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
	at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-17" daemon prio=5 tid=7fc38c007000 nid=0x119d84000 waiting on condition [119d83000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <7c33229d0> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
	at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-14" daemon prio=5 tid=7fc38b003800 nid=0x119b96000 waiting on condition [119b95000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <7c33229d0> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
	at scala.concurrent.forkjoin.ForkJoinPool.idleAwaitWork(ForkJoinPool.java:1626)
	at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1579)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-15" daemon prio=5 tid=7fc38b003000 nid=0x119a93000 waiting on condition [119a92000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <7c33229d0> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
	at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-16" daemon prio=5 tid=7fc38c006000 nid=0x11971d000 waiting on condition [11971c000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <7c33229d0> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
	at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-13" daemon prio=5 tid=7fc38681f000 nid=0x1193d1000 waiting on condition [1193d]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <7c33229d0> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
	at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"task-result-getter-3" daemon prio=5 tid=7fc3871a7800 nid=0x1192ce000 waiting on condition [1192cd000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <7bbeaace0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:957)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:917)
	at java.lang.Thread.run(Thread.java:695)

"task-result-getter-2" daemon prio=5 tid=7fc3871a6800 nid=0x1191cb000 waiting on condition [1191ca000]
   java.lang.Thread.State: WAITING (parkin
[jira] [Updated] (SPARK-4759) Deadlock in complex spark job.
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd updated SPARK-4759: -- Attachment: SparkBugReplicator.scala
> Deadlock in complex spark job.
> --
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.1.1
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
> Reporter: Davis Shepherd
> Attachments: SparkBugReplicator.scala
>
> The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves:
> * taking new data and merging it with the previous result
> * caching and checkpointing the new result
> * rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4759) Deadlock in complex spark job.
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd updated SPARK-4759: -- Attachment: (was: SparkBugReplicator.scala)
[jira] [Updated] (SPARK-4759) Deadlock in complex spark job.
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd updated SPARK-4759: -- Summary: Deadlock in complex spark job. (was: Deadlock in pathological spark job.)
[jira] [Updated] (SPARK-4759) Deadlock in pathological spark job.
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd updated SPARK-4759: -- Description:
The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves:
* taking new data and merging it with the previous result
* caching and checkpointing the new result
* rinse and repeat
The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. I've been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be.

was:
The attached test class runs two identical jobs that perform some iterative computation on and RDD[Int, Int]. This computation involves taking new data merging it with the previous result caching and checkpointing the new result rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. Ive been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be.
[jira] [Updated] (SPARK-4759) Deadlock in pathological spark job.
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd updated SPARK-4759: -- Attachment: SparkBugReplicator.scala
[jira] [Created] (SPARK-4759) Deadlock in pathological spark job.
Davis Shepherd created SPARK-4759:
-
Summary: Deadlock in pathological spark job.
Key: SPARK-4759
URL: https://issues.apache.org/jira/browse/SPARK-4759
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.1.1
Environment: Java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
Mac OSX 10.10.1
Using local spark context
Reporter: Davis Shepherd

The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves:
* taking new data and merging it with the previous result
* caching and checkpointing the new result
* rinse and repeat
The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. I've been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be.
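For orientation, the merge/cache/checkpoint loop the description outlines can be sketched in plain Spark code. This is a minimal, hypothetical sketch only: object, path, and variable names are illustrative, and the actual test case is the attached SparkBugReplicator.scala, which is not reproduced here.

```scala
// Illustrative sketch (NOT the attached replicator) of the reported pattern:
// iteratively merge new data into the previous result, cache and checkpoint
// it, and run the whole job twice in the same JVM with fresh SparkContexts.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DeadlockSketch {
  def runJob(): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[4]").setAppName("deadlock-sketch"))
    sc.setCheckpointDir("/tmp/deadlock-sketch-checkpoints") // illustrative path

    var result: RDD[(Int, Int)] = sc.parallelize(Seq.empty[(Int, Int)])
    for (i <- 1 to 5) {
      val newData = sc.parallelize((1 to 1000).map(k => (k % 100, i)))
      // take new data and merge it with the previous result
      result = result.union(newData).reduceByKey(_ + _)
      // cache and checkpoint the new result
      result.cache()
      result.checkpoint()
      // force evaluation; per the report, the second run hangs in a stage here
      result.count()
    }
    sc.stop()
  }

  def main(args: Array[String]): Unit = {
    runJob() // first run completes and the context is shut down
    runJob() // second run with a new context in the same process
  }
}
```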
[jira] [Issue Comment Deleted] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd updated SPARK-3882: -- Comment: was deleted (was: This is also a serious memory leak that will cause long running drivers for spark streaming jobs to exhaust their heap.)
> JobProgressListener gets permanently out of sync with long running job
> --
>
> Key: SPARK-3882
> URL: https://issues.apache.org/jira/browse/SPARK-3882
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 1.0.2
> Reporter: Davis Shepherd
> Attachments: Screen Shot 2014-10-03 at 12.50.59 PM.png
>
> A long running spark context (non-streaming) will eventually start throwing the following in the driver:
> java.util.NoSuchElementException: key not found: 12771
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
> at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
> at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
> at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
> at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
> at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
> at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
> at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
> at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
> at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
> at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
> at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
> at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
> 2014-10-09 18:45:33,523 [SparkListenerBus] ERROR org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener threw an exception
> java.util.NoSuchElementException: key not found: 12782
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
> at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
> at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
> at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
> at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
> at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
> at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
> at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
> at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
> at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
> at org.apache.spark.scheduler.
[jira] [Commented] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173428#comment-14173428 ] Davis Shepherd commented on SPARK-3882: --- This is also a serious memory leak that will cause long running drivers for spark streaming jobs to exhaust their heap.
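For context on the "key not found" failure mode reported above: it is a plain `mutable.HashMap.apply` lookup on a stage id that is no longer in the map. The following is a minimal plain-Scala sketch (illustrative names, no Spark involved) contrasting the throwing lookup with a defensive one:

```scala
// Sketch of the lookup pattern behind "key not found": HashMap.apply throws
// NoSuchElementException when the key is absent, which kills the listener
// callback; getOrElse supplies a fallback instead.
import scala.collection.mutable

object StageLookupSketch {
  def main(args: Array[String]): Unit = {
    val stageIdToData = mutable.HashMap(12770 -> "stage data")

    // Throwing style: apply() on a missing stage id raises
    // NoSuchElementException, as in the stack traces above.
    try {
      stageIdToData(12771)
    } catch {
      case e: NoSuchElementException =>
        println(s"listener callback would die: ${e.getMessage}")
    }

    // Defensive style: degrade gracefully with a fallback value.
    val data = stageIdToData.getOrElse(12771, "unknown stage")
    println(data) // prints "unknown stage"
  }
}
```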
[jira] [Comment Edited] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165535#comment-14165535 ] Davis Shepherd edited comment on SPARK-3882 at 10/9/14 6:51 PM:

Attached web ui screenshot.

was (Author: dgshep): Lots of orphaned jobs.

> JobProgressListener gets permanently out of sync with long running job
> -----------------------------------------------------------------------
>
> Key: SPARK-3882
> URL: https://issues.apache.org/jira/browse/SPARK-3882
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 1.0.2
> Reporter: Davis Shepherd
> Attachments: Screen Shot 2014-10-03 at 12.50.59 PM.png
>
> A long running spark context (non-streaming) will eventually start throwing the following in the driver:
>
> java.util.NoSuchElementException: key not found: 12771
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>   at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
>   at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
>   at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
>   at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
>   at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
>   at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at scala.Option.foreach(Option.scala:236)
>   at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
>   at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
>   at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
>   at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
>
> 2014-10-09 18:45:33,523 [SparkListenerBus] ERROR org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener threw an exception
> java.util.NoSuchElementException: key not found: 12782
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   ... (remainder of stack trace identical to the one above; truncated in the original message)
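The `key not found` failure above arises when listener state for a stage is evicted from a bounded map before the corresponding completion event is delivered. Here is a minimal Python sketch of that failure mode; it is toy code, not Spark's actual implementation, and the `ProgressListener` class and its method names are invented for illustration:

```python
from collections import OrderedDict

class ProgressListener:
    """Toy listener mirroring the failure mode: stage state is evicted
    from a bounded map while completion events for it are still in flight."""
    def __init__(self, retained_stages=2):
        self.retained_stages = retained_stages
        self.stage_to_pool = OrderedDict()  # stage id -> pool name

    def on_stage_submitted(self, stage_id, pool="default"):
        self.stage_to_pool[stage_id] = pool
        while len(self.stage_to_pool) > self.retained_stages:
            self.stage_to_pool.popitem(last=False)  # evict oldest entry

    def on_stage_completed(self, stage_id):
        # A plain lookup raises KeyError (Scala's NoSuchElementException)
        # once the entry has been evicted before the completion arrived.
        return self.stage_to_pool[stage_id]

listener = ProgressListener(retained_stages=2)
for sid in (1, 2, 3):          # submitting stage 3 evicts stage 1
    listener.on_stage_submitted(sid)
try:
    listener.on_stage_completed(1)
except KeyError as e:
    print("key not found:", e)  # analogous to the driver log above
```

A defensive lookup (e.g. `dict.get` with a fallback) in the completion handler would avoid crashing the listener, though the underlying bookkeeping would still be out of sync.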
[jira] [Updated] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd updated SPARK-3882:

Attachment: Screen Shot 2014-10-03 at 12.50.59 PM.png

Lots of orphaned jobs.
[jira] [Created] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
Davis Shepherd created SPARK-3882:

Summary: JobProgressListener gets permanently out of sync with long running job
Key: SPARK-3882
URL: https://issues.apache.org/jira/browse/SPARK-3882
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 1.0.2
Reporter: Davis Shepherd

A long running spark context (non-streaming) will eventually start throwing the following in the driver:

java.util.NoSuchElementException: key not found: 12771
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  ... (full stack traces identical to those quoted above)

And the ui will show running jobs that are in fact no longer running and never clean them up (see attached screenshot). The result is that the ui becomes unusable, and the JobProgressListener leaks memory.
[jira] [Created] (SPARK-1432) Potential memory leak in stageIdToExecutorSummaries in JobProgressTracker
Davis Shepherd created SPARK-1432:

Summary: Potential memory leak in stageIdToExecutorSummaries in JobProgressTracker
Key: SPARK-1432
URL: https://issues.apache.org/jira/browse/SPARK-1432
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 0.9.0
Reporter: Davis Shepherd

JobProgressTracker continuously cleans up old metadata as per the spark.ui.retainedStages configuration parameter. It seems, however, that not all metadata maps are being cleaned; in particular, stageIdToExecutorSummaries could grow in an unbounded manner in a long running application.

-- This message was sent by Atlassian JIRA (v6.2#6252)
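The leak pattern described in SPARK-1432 can be sketched in a few lines of Python: one metadata map is trimmed on each cleanup pass while a sibling map keyed by the same stage ids is forgotten. This is toy code, not Spark's implementation; the variable names merely echo the maps named in the report:

```python
RETAINED_STAGES = 100

stage_id_to_data = {}
stage_id_to_executor_summaries = {}  # never trimmed -> unbounded growth

def on_stage_completed(stage_id):
    stage_id_to_data[stage_id] = "stage metadata"
    stage_id_to_executor_summaries[stage_id] = {"executor-1": "summary"}
    # Cleanup only covers one of the two maps:
    if len(stage_id_to_data) > RETAINED_STAGES:
        oldest = min(stage_id_to_data)
        del stage_id_to_data[oldest]
        # BUG (as reported): the matching entry in
        # stage_id_to_executor_summaries is left behind.

for sid in range(1000):
    on_stage_completed(sid)

print(len(stage_id_to_data))                # 100: bounded as intended
print(len(stage_id_to_executor_summaries))  # 1000: grows without bound
```

The fix is simply to delete from every per-stage map in the same cleanup pass.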
[jira] [Commented] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.
[ https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959664#comment-13959664 ] Davis Shepherd commented on SPARK-1411:

I submitted a pull request to fix 1337 as well. — Davis

> When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.
> --------------------------------------------------------------------------------------------
>
> Key: SPARK-1411
> URL: https://issues.apache.org/jira/browse/SPARK-1411
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 0.9.0
> Environment: Ubuntu 12.04 LTS Precise, 4 Nodes, 16 cores each, 196GB RAM each.
> Reporter: Davis Shepherd
> Attachments: Screen Shot 2014-04-03 at 5.35.00 PM.png
>
> For any long running job with many stages, the web ui only shows the first n stages of the job (where spark.ui.retainedStages=n). The most recent stages are immediately dropped and are only visible for a brief time. This renders the UI pretty useless after a pretty short amount of time for a long running non-streaming job. I am unsure as to whether similar results appear for streaming jobs.
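The retention bug reported in SPARK-1411 is the difference between two eviction policies: keeping the most recent n stages (evict oldest) versus keeping the first n stages seen (drop everything after the buffer fills). A minimal Python sketch of the contrast, using invented helper names and not Spark's actual code:

```python
from collections import deque

def retain_most_recent(stage_ids, n):
    """Intended behavior: a bounded buffer that evicts the oldest entry,
    so the UI always shows the n most recent stages."""
    buf = deque(maxlen=n)
    for sid in stage_ids:
        buf.append(sid)
    return list(buf)

def retain_first(stage_ids, n):
    """Reported behavior: once the buffer holds n stages, every newer
    stage is dropped immediately and only the first n survive."""
    kept = []
    for sid in stage_ids:
        if len(kept) < n:
            kept.append(sid)
    return kept

stages = range(1, 11)
print(retain_most_recent(stages, 3))  # [8, 9, 10]
print(retain_first(stages, 3))        # [1, 2, 3], the reported behavior
```

With the buggy policy, a long running job's recent stages are exactly the ones that vanish from the UI, matching the report.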
[jira] [Closed] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.
[ https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd closed SPARK-1411.

Resolution: Duplicate
[jira] [Updated] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.
[ https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd updated SPARK-1411:

Attachment: Screen Shot 2014-04-03 at 5.35.00 PM.png

Notice the large gap in stage ids and submitted time between the top two stages.
[jira] [Updated] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.
[ https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Shepherd updated SPARK-1411:

Priority: Major (was: Minor)
[jira] [Created] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.
Davis Shepherd created SPARK-1411:

Summary: When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.
Key: SPARK-1411
URL: https://issues.apache.org/jira/browse/SPARK-1411
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 0.9.0
Environment: Ubuntu 12.04 LTS Precise, 4 Nodes, 16 cores each, 196GB RAM each.
Reporter: Davis Shepherd
Priority: Minor

(Description identical to the issue text quoted above.)