[jira] [Updated] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores

2017-04-26 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-20483:
---
Description: if {{spark.cores.max = 10}} for example and 
{{spark.executor.cores = 4}}, 2 executors will get launched thus 
{{totalCoresAcquired = 8}}. All future Mesos offers will not get tasks launched 
because {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
maxCores}} will always evaluate to false.  However, in {{handleMatchedOffers}} 
we check if {{totalCoresAcquired >= maxCores}} to determine if we should 
decline the offer "for a configurable amount of time to avoid starving other 
frameworks", and this will always evaluate to false in the above scenario. This 
leaves the framework in a state of limbo where it will never launch any new 
executors, but only decline offers for the Mesos default of 5 seconds, thus 
starving other frameworks of offers.  (was: if {{spark.cores.max = 10}} for 
example and {{spark.executor.cores = 4}}, 2 executors will get launched thus 
{{totalCoresAcquired = 8}}. All future Mesos offers will not get tasks launched 
because {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
maxCores}} will always evaluate to false.  However, in {{handleMatchedOffers}} 
we check if {{totalCoresAcquired >= maxCores}} to determine if we should 
decline the offer "for a configurable amount of time to avoid starving other 
frameworks", and this will always evaluate to false in the above scenario. This 
leaves the framework in a state of limbo where it will never launch any new 
executors, but only decline offers for the Mesos default of 5 seconds, thus 
starving other frameworks of offers.

Relates to: SPARK-12554, SPARK-19702)

> Mesos Coarse mode may starve other Mesos frameworks if max cores is not a 
> multiple of executor cores
> 
>
> Key: SPARK-20483
> URL: https://issues.apache.org/jira/browse/SPARK-20483
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Davis Shepherd
>Priority: Minor
>
> if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 
> executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos 
> offers will not get tasks launched because 
> {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
> maxCores}} will always evaluate to false.  However, in 
> {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to 
> determine if we should decline the offer "for a configurable amount of time 
> to avoid starving other frameworks", and this will always evaluate to false 
> in the above scenario. This leaves the framework in a state of limbo where it 
> will never launch any new executors, but only decline offers for the Mesos 
> default of 5 seconds, thus starving other frameworks of offers.
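
To make the arithmetic above concrete, here is a minimal Scala sketch of the two checks as described (simplified and hypothetical; the names follow the description rather than the actual Spark Mesos scheduler source):

{code:scala}
// Simplified, hypothetical sketch of the accounting described above;
// names follow the description, not the actual Spark Mesos scheduler code.
object CoreAccountingSketch {
  def main(args: Array[String]): Unit = {
    val maxCores      = 10   // spark.cores.max
    val executorCores = 4    // spark.executor.cores
    var totalCoresAcquired = 0

    // Launch check: a new executor is accepted only if it fits under maxCores.
    def canLaunchExecutor: Boolean = executorCores + totalCoresAcquired <= maxCores

    while (canLaunchExecutor) totalCoresAcquired += executorCores
    println(s"totalCoresAcquired = $totalCoresAcquired")   // 8: a third executor would need 12 > 10

    // Long-decline check: only fires once the cap is fully reached.
    val reachedMaxCores = totalCoresAcquired >= maxCores    // 8 >= 10 is false

    // Neither condition holds, so offers are neither used for new executors nor
    // declined for the longer configurable period, which is the limbo described above.
    println(s"canLaunchExecutor = $canLaunchExecutor, reachedMaxCores = $reachedMaxCores")
  }
}
{code}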






[jira] [Updated] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores

2017-04-26 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-20483:
---
Description: 
if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 
executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos 
offers will not get tasks launched because 
{{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
maxCores}} will always evaluate to false.  However, in {{handleMatchedOffers}} 
we check if {{totalCoresAcquired >= maxCores}} to determine if we should 
decline the offer "for a configurable amount of time to avoid starving other 
frameworks", and this will always evaluate to false in the above scenario. This 
leaves the framework in a state of limbo where it will never launch any new 
executors, but only decline offers for the Mesos default of 5 seconds, thus 
starving other frameworks of offers.

Relates to: SPARK-12554, SPARK-19702

  was:
if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 
executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos 
offers will not get tasks launched because 
{{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
maxCores}} will always evaluate to false.  However, in {{handleMatchedOffers}} 
we check if {{totalCoresAcquired >= maxCores}} to determine if we should 
decline the offer "for a configurable amount of time to avoid starving other 
frameworks", and this will always evaluate to false in the above scenario. This 
leaves the framework in a state of limbo where it will never launch any new 
executors, but only decline offers for the Mesos default of 5 seconds, thus 
starving other frameworks of offers.

Relates to: SPARK-12554


> Mesos Coarse mode may starve other Mesos frameworks if max cores is not a 
> multiple of executor cores
> 
>
> Key: SPARK-20483
> URL: https://issues.apache.org/jira/browse/SPARK-20483
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Davis Shepherd
>Priority: Minor
>
> if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 
> executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos 
> offers will not get tasks launched because 
> {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
> maxCores}} will always evaluate to false.  However, in 
> {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to 
> determine if we should decline the offer "for a configurable amount of time 
> to avoid starving other frameworks", and this will always evaluate to false 
> in the above scenario. This leaves the framework in a state of limbo where it 
> will never launch any new executors, but only decline offers for the Mesos 
> default of 5 seconds, thus starving other frameworks of offers.
> Relates to: SPARK-12554, SPARK-19702
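
On the "decline the offer for a configurable amount of time" point, here is a hedged sketch of how a Mesos framework can decline an offer with a longer refuse filter using the standard org.apache.mesos Java bindings (illustrative only, not the exact Spark scheduler code):

{code:scala}
import org.apache.mesos.{Protos, SchedulerDriver}

// Illustrative sketch only: decline an offer so that Mesos withholds these
// resources from this framework for refuseSeconds, leaving them available to
// other frameworks, instead of the default filter of roughly 5 seconds.
object DeclineSketch {
  def declineForAWhile(driver: SchedulerDriver,
                       offer: Protos.Offer,
                       refuseSeconds: Double): Unit = {
    val filters = Protos.Filters.newBuilder().setRefuseSeconds(refuseSeconds).build()
    driver.declineOffer(offer.getId, filters)
  }
}
{code}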






[jira] [Updated] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores

2017-04-26 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-20483:
---
Description: 
if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 
executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos 
offers will not get tasks launched because 
{{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
maxCores}} will always evaluate to false.  However, in {{handleMatchedOffers}} 
we check if {{totalCoresAcquired >= maxCores}} to determine if we should 
decline the offer "for a configurable amount of time to avoid starving other 
frameworks", and this will always evaluate to false in the above scenario. This 
leaves the framework in a state of limbo where it will never launch any new 
executors, but only decline offers for the Mesos default of 5 seconds, thus 
starving other frameworks of offers.

Relates to: SPARK-12554

  was:
if `spark.cores.max = 10` for example and `spark.executor.cores = 4`, 2 
executors will get launched thus `totalCoresAcquired = 8`. All future Mesos 
offers will not get tasks launched because 
`sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= maxCores` 
will always evaluate to false.  However, in `handleMatchedOffers` we check if 
`totalCoresAcquired >= maxCores` to determine if we should decline the offer 
"for a configurable amount of time to avoid starving other frameworks", and 
this will always evaluate to false in the above scenario. This leaves the 
framework in a state of limbo where it will never launch any new executors, but 
only decline offers for the Mesos default of 5 seconds, thus starving other 
frameworks of offers.

Relates to: SPARK-12554


> Mesos Coarse mode may starve other Mesos frameworks if max cores is not a 
> multiple of executor cores
> 
>
> Key: SPARK-20483
> URL: https://issues.apache.org/jira/browse/SPARK-20483
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Davis Shepherd
>Priority: Minor
>
> if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 
> executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos 
> offers will not get tasks launched because 
> {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
> maxCores}} will always evaluate to false.  However, in 
> {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to 
> determine if we should decline the offer "for a configurable amount of time 
> to avoid starving other frameworks", and this will always evaluate to false 
> in the above scenario. This leaves the framework in a state of limbo where it 
> will never launch any new executors, but only decline offers for the Mesos 
> default of 5 seconds, thus starving other frameworks of offers.
> Relates to: SPARK-12554






[jira] [Created] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores

2017-04-26 Thread Davis Shepherd (JIRA)
Davis Shepherd created SPARK-20483:
--

 Summary: Mesos Coarse mode may starve other Mesos frameworks if 
max cores is not a multiple of executor cores
 Key: SPARK-20483
 URL: https://issues.apache.org/jira/browse/SPARK-20483
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.1.0
Reporter: Davis Shepherd
Priority: Minor


if `spark.cores.max = 10` for example and `spark.executor.cores = 4`, 2 
executors will get launched thus `totalCoresAcquired = 8`. All future Mesos 
offers will not get tasks launched because 
`sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= maxCores` 
will always evaluate to false.  However, in `handleMatchedOffers` we check if 
`totalCoresAcquired >= maxCores` to determine if we should decline the offer 
"for a configurable amount of time to avoid starving other frameworks", and 
this will always evaluate to false in the above scenario. This leaves the 
framework in a state of limbo where it will never launch any new executors, but 
only decline offers for the Mesos default of 5 seconds, thus starving other 
frameworks of offers.

Relates to: SPARK-12554






[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode

2014-12-08 Thread Davis Shepherd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238221#comment-14238221
 ] 

Davis Shepherd commented on SPARK-4759:
---

The fix appears to resolve the issue in master for both repros.

> Deadlock in complex spark job in local mode
> ---
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1, 1.2.0, 1.3.0
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
>Assignee: Andrew Or
>Priority: Critical
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   # taking new data merging it with the previous result
>   # caching and checkpointing the new result
>   # rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, only having scheduled a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left some 
> breadcrumbs as to where the issue might be.  






[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job in local mode

2014-12-07 Thread Davis Shepherd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237393#comment-14237393
 ] 

Davis Shepherd edited comment on SPARK-4759 at 12/8/14 3:15 AM:


I still cannot reproduce the issue with your snippet in the spark shell on tag 
v1.1.1


was (Author: dgshep):
I still cannot reproduce the issue in the spark shell on tag v1.1.1

> Deadlock in complex spark job in local mode
> ---
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1, 1.2.0, 1.3.0
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
>Assignee: Andrew Or
>Priority: Critical
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   # taking new data merging it with the previous result
>   # caching and checkpointing the new result
>   # rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, only having scheduled a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left some 
> breadcrumbs as to where the issue might be.  






[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode

2014-12-07 Thread Davis Shepherd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237393#comment-14237393
 ] 

Davis Shepherd commented on SPARK-4759:
---

I still cannot reproduce the issue in the spark shell on tag v1.1.1

> Deadlock in complex spark job in local mode
> ---
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1, 1.2.0, 1.3.0
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
>Assignee: Andrew Or
>Priority: Critical
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   # taking new data merging it with the previous result
>   # caching and checkpointing the new result
>   # rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, only having scheduled a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left some 
> breadcrumbs as to where the issue might be.  






[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode

2014-12-07 Thread Davis Shepherd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237383#comment-14237383
 ] 

Davis Shepherd commented on SPARK-4759:
---

OK, your version does reproduce the issue against the spark-core 1.1.1 artifact 
if I copy and paste your code into the original SparkBugReplicator, but it only 
seems to hang the second time the job is run :P. This smells of a race 
condition...

> Deadlock in complex spark job in local mode
> ---
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1, 1.2.0, 1.3.0
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
>Assignee: Andrew Or
>Priority: Critical
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   # taking new data merging it with the previous result
>   # caching and checkpointing the new result
>   # rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, only having scheduled a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left some 
> breadcrumbs as to where the issue might be.  






[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode

2014-12-07 Thread Davis Shepherd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237376#comment-14237376
 ] 

Davis Shepherd commented on SPARK-4759:
---

local[2] on the latest commit of branch 1.1

> Deadlock in complex spark job in local mode
> ---
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1, 1.2.0, 1.3.0
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
>Assignee: Andrew Or
>Priority: Critical
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   # taking new data merging it with the previous result
>   # caching and checkpointing the new result
>   # rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, only having scheduled a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left some 
> breadcrumbs as to where the issue might be.  






[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode

2014-12-07 Thread Davis Shepherd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237371#comment-14237371
 ] 

Davis Shepherd commented on SPARK-4759:
---

This doesn't seem to reproduce the issue for me. The job finishes regardless of 
how many times I call runMyJob().


> Deadlock in complex spark job in local mode
> ---
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1, 1.2.0, 1.3.0
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
>Assignee: Andrew Or
>Priority: Critical
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   # taking new data merging it with the previous result
>   # caching and checkpointing the new result
>   # rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, only having scheduled a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left some 
> breadcrumbs as to where the issue might be.  






[jira] [Commented] (SPARK-4759) Deadlock in complex spark job in local mode with multiple cores

2014-12-07 Thread Davis Shepherd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237264#comment-14237264
 ] 

Davis Shepherd commented on SPARK-4759:
---

It is possible to reproduce the issue in single-core local mode.

Simply change the partitions parameter to > 1 (the attached version uses the 
SparkContext's defaultParallelism, which in single-core mode is 1) in either of 
the coalesce calls.
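
For illustration, a minimal self-contained Scala sketch of that change (hedged; names and values are illustrative and not taken from the attached SparkBugReplicator.scala):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits for reduceByKey on Spark 1.x

object CoalesceSketch {
  def main(args: Array[String]): Unit = {
    // Single-core local mode: sc.defaultParallelism is 1 here.
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("coalesce-sketch"))
    val previous = sc.parallelize(Seq((1, 1), (2, 2)))
    val fresh    = sc.parallelize(Seq((1, 3), (3, 4)))
    // Passing 2 rather than sc.defaultParallelism is the change described above.
    val merged = previous.union(fresh).reduceByKey(_ + _).coalesce(2)
    println(merged.count())
    sc.stop()
  }
}
{code}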

> Deadlock in complex spark job in local mode with multiple cores
> ---
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1, 1.2.0, 1.3.0
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
>Assignee: Andrew Or
>Priority: Critical
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   # taking new data merging it with the previous result
>   # caching and checkpointing the new result
>   # rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, only having scheduled a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left some 
> breadcrumbs as to where the issue might be.  






[jira] [Commented] (SPARK-4759) Deadlock in complex spark job.

2014-12-05 Thread Davis Shepherd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14235703#comment-14235703
 ] 

Davis Shepherd commented on SPARK-4759:
---

Fair enough. As far as I can work out, it appears that there are 9 expected 
tasks in the final stage, but only 6 ever get scheduled.  I suppose that leaves 
only one thread waiting on something that will never happen. ;)

> Deadlock in complex spark job.
> --
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   # taking new data merging it with the previous result
>   # caching and checkpointing the new result
>   # rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, only having scheduled a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left some 
> breadcrumbs as to where the issue might be.  






[jira] [Updated] (SPARK-4759) Deadlock in complex spark job.

2014-12-05 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-4759:
--
Description: 
The attached test class runs two identical jobs that perform some iterative 
computation on an RDD[(Int, Int)]. This computation involves 
  # taking new data merging it with the previous result
  # caching and checkpointing the new result
  # rinse and repeat

The first time the job is run, it runs successfully, and the spark context is 
shut down. The second time the job is run with a new spark context in the same 
process, the job hangs indefinitely, only having scheduled a subset of the 
necessary tasks for the final stage.

I've been able to produce a test case that reproduces the issue, and I've added 
some comments where some knockout experimentation has left some breadcrumbs as 
to where the issue might be.  


  was:
The attached test class runs two identical jobs that perform some iterative 
computation on an RDD[(Int, Int)]. This computation involves 
  *taking new data merging it with the previous result
  *caching and checkpointing the new result
  *rinse and repeat

The first time the job is run, it runs successfully, and the spark context is 
shut down. The second time the job is run with a new spark context in the same 
process, the job hangs indefinitely, only having scheduled a subset of the 
necessary tasks for the final stage.

I've been able to produce a test case that reproduces the issue, and I've added 
some comments where some knockout experimentation has left some breadcrumbs as 
to where the issue might be.  



> Deadlock in complex spark job.
> --
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   # taking new data merging it with the previous result
>   # caching and checkpointing the new result
>   # rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, only having scheduled a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left some 
> breadcrumbs as to where the issue might be.  






[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job.

2014-12-05 Thread Davis Shepherd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14235651#comment-14235651
 ] 

Davis Shepherd edited comment on SPARK-4759 at 12/5/14 3:52 PM:


Here is a thread dump with superfluous threads omitted (namely Jetty qtp threads):

{noformat}
"sparkDriver-akka.actor.default-dispatcher-18" daemon prio=5 tid=7fc3853ef800 
nid=0x119e87000 waiting on condition [119e86000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-17" daemon prio=5 tid=7fc38c007000 
nid=0x119d84000 waiting on condition [119d83000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-14" daemon prio=5 tid=7fc38b003800 
nid=0x119b96000 waiting on condition [119b95000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at 
scala.concurrent.forkjoin.ForkJoinPool.idleAwaitWork(ForkJoinPool.java:1626)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1579)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-15" daemon prio=5 tid=7fc38b003000 
nid=0x119a93000 waiting on condition [119a92000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-16" daemon prio=5 tid=7fc38c006000 
nid=0x11971d000 waiting on condition [11971c000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-13" daemon prio=5 tid=7fc38681f000 
nid=0x1193d1000 waiting on condition [1193d]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"task-result-getter-3" daemon prio=5 tid=7fc3871a7800 nid=0x1192ce000 waiting 
on condition [1192cd000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7bbeaace0> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:957)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:917)
at java.lang.Thread.run(Thread.java:695)

"task-result-getter-2" daemon prio=5 tid=7fc3871a6800 nid=0x1191cb000 waiting 
on condition [1191ca000]
   java.lang.Thread.State: WAITING (parkin

[jira] [Comment Edited] (SPARK-4759) Deadlock in complex spark job.

2014-12-05 Thread Davis Shepherd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14235651#comment-14235651
 ] 

Davis Shepherd edited comment on SPARK-4759 at 12/5/14 3:48 PM:


{noformat}
"sparkDriver-akka.actor.default-dispatcher-18" daemon prio=5 tid=7fc3853ef800 
nid=0x119e87000 waiting on condition [119e86000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-17" daemon prio=5 tid=7fc38c007000 
nid=0x119d84000 waiting on condition [119d83000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-14" daemon prio=5 tid=7fc38b003800 
nid=0x119b96000 waiting on condition [119b95000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at 
scala.concurrent.forkjoin.ForkJoinPool.idleAwaitWork(ForkJoinPool.java:1626)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1579)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-15" daemon prio=5 tid=7fc38b003000 
nid=0x119a93000 waiting on condition [119a92000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-16" daemon prio=5 tid=7fc38c006000 
nid=0x11971d000 waiting on condition [11971c000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-13" daemon prio=5 tid=7fc38681f000 
nid=0x1193d1000 waiting on condition [1193d]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"task-result-getter-3" daemon prio=5 tid=7fc3871a7800 nid=0x1192ce000 waiting 
on condition [1192cd000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7bbeaace0> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:957)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:917)
at java.lang.Thread.run(Thread.java:695)

"task-result-getter-2" daemon prio=5 tid=7fc3871a6800 nid=0x1191cb000 waiting 
on condition [1191ca000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7b

[jira] [Commented] (SPARK-4759) Deadlock in complex spark job.

2014-12-05 Thread Davis Shepherd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14235651#comment-14235651
 ] 

Davis Shepherd commented on SPARK-4759:
---

{{"sparkDriver-akka.actor.default-dispatcher-18" daemon prio=5 tid=7fc3853ef800 
nid=0x119e87000 waiting on condition [119e86000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-17" daemon prio=5 tid=7fc38c007000 
nid=0x119d84000 waiting on condition [119d83000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-14" daemon prio=5 tid=7fc38b003800 
nid=0x119b96000 waiting on condition [119b95000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at 
scala.concurrent.forkjoin.ForkJoinPool.idleAwaitWork(ForkJoinPool.java:1626)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1579)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-15" daemon prio=5 tid=7fc38b003000 
nid=0x119a93000 waiting on condition [119a92000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-16" daemon prio=5 tid=7fc38c006000 
nid=0x11971d000 waiting on condition [11971c000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"sparkDriver-akka.actor.default-dispatcher-13" daemon prio=5 tid=7fc38681f000 
nid=0x1193d1000 waiting on condition [1193d]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7c33229d0> (a 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:1594)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

"task-result-getter-3" daemon prio=5 tid=7fc3871a7800 nid=0x1192ce000 waiting 
on condition [1192cd000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7bbeaace0> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:957)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:917)
at java.lang.Thread.run(Thread.java:695)

"task-result-getter-2" daemon prio=5 tid=7fc3871a6800 nid=0x1191cb000 waiting 
on condition [1191ca000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <7bbeaace0> (a 
java.util.concurrent.locks.AbstractQueuedSync

[jira] [Updated] (SPARK-4759) Deadlock in complex spark job.

2014-12-04 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-4759:
--
Attachment: SparkBugReplicator.scala

> Deadlock in complex spark job.
> --
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   *taking new data merging it with the previous result
>   *caching and checkpointing the new result
>   *rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, only having scheduled a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left some 
> breadcrumbs as to where the issue might be.  






[jira] [Updated] (SPARK-4759) Deadlock in complex spark job.

2014-12-04 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-4759:
--
Attachment: (was: SparkBugReplicator.scala)

> Deadlock in complex spark job.
> --
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   *taking new data merging it with the previous result
>   *caching and checkpointing the new result
>   *rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, only having scheduled a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left some 
> breadcrumbs as to where the issue might be.  






[jira] [Updated] (SPARK-4759) Deadlock in complex spark job.

2014-12-04 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-4759:
--
Summary: Deadlock in complex spark job.  (was: Deadlock in pathological 
spark job.)

> Deadlock in complex spark job.
> --
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   *taking new data merging it with the previous result
>   *caching and checkpointing the new result
>   *rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, only having scheduled a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left some 
> breadcrumbs as to where the issue might be.  






[jira] [Updated] (SPARK-4759) Deadlock in pathological spark job.

2014-12-04 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-4759:
--
Description: 
The attached test class runs two identical jobs that perform some iterative 
computation on an RDD[(Int, Int)]. This computation involves 
  *taking new data merging it with the previous result
  *caching and checkpointing the new result
  *rinse and repeat

The first time the job is run, it runs successfully, and the spark context is 
shut down. The second time the job is run with a new spark context in the same 
process, the job hangs indefinitely, only having scheduled a subset of the 
necessary tasks for the final stage.

I've been able to produce a test case that reproduces the issue, and I've added 
some comments where some knockout experimentation has left some breadcrumbs as 
to where the issue might be.  


  was:
The attached test class runs two identical jobs that perform some iterative 
computation on an RDD[(Int, Int)]. This computation involves 
  taking new data merging it with the previous result
  caching and checkpointing the new result
  rinse and repeat

The first time the job is run, it runs successfully, and the spark context is 
shut down. The second time the job is run with a new spark context in the same 
process, the job hangs indefinitely, only having scheduled a subset of the 
necessary tasks for the final stage.

I've been able to produce a test case that reproduces the issue, and I've added 
some comments where some knockout experimentation has left some breadcrumbs as 
to where the issue might be.  



> Deadlock in pathological spark job.
> ---
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   *taking new data merging it with the previous result
>   *caching and checkpointing the new result
>   *rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, only having scheduled a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left some 
> breadcrumbs as to where the issue might be.  






[jira] [Updated] (SPARK-4759) Deadlock in pathological spark job.

2014-12-04 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-4759:
--
Attachment: SparkBugReplicator.scala

> Deadlock in pathological spark job.
> ---
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   taking new data merging it with the previous result
>   caching and checkpointing the new result
>   rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, only having scheduled a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left some 
> breadcrumbs as to where the issue might be.  






[jira] [Created] (SPARK-4759) Deadlock in pathological spark job.

2014-12-04 Thread Davis Shepherd (JIRA)
Davis Shepherd created SPARK-4759:
-

 Summary: Deadlock in pathological spark job.
 Key: SPARK-4759
 URL: https://issues.apache.org/jira/browse/SPARK-4759
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.1
 Environment: Java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

Mac OSX 10.10.1
Using local spark context
Reporter: Davis Shepherd


The attached test class runs two identical jobs that perform some iterative 
computation on an RDD[(Int, Int)]. This computation involves 
  taking new data merging it with the previous result
  caching and checkpointing the new result
  rinse and repeat

The first time the job is run, it runs successfully, and the spark context is 
shut down. The second time the job is run with a new spark context in the same 
process, the job hangs indefinitely, only having scheduled a subset of the 
necessary tasks for the final stage.

I've been able to produce a test case that reproduces the issue, and I've added 
some comments where some knockout experimentation has left some breadcrumbs as 
to where the issue might be.  
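
Since the attachment itself is not inlined in this archive, here is a minimal, hedged Scala sketch of the job shape described above (hypothetical names and data; it is not the attached SparkBugReplicator.scala):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits on Spark 1.x
import org.apache.spark.rdd.RDD

// Hypothetical sketch of the iterative merge / cache / checkpoint pattern described
// above, run twice in the same JVM with a fresh SparkContext each time.
object IterativeJobSketch {
  def runMyJob(): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("repro-sketch"))
    sc.setCheckpointDir("/tmp/spark-checkpoints")
    var result: RDD[(Int, Int)] = sc.parallelize(Seq.empty[(Int, Int)])
    for (i <- 1 to 5) {
      val newData = sc.parallelize(1 to 100).map(k => (k % 10, i))
      // take new data and merge it with the previous result
      result = result.union(newData).reduceByKey(_ + _).coalesce(sc.defaultParallelism)
      // cache and checkpoint the new result, then rinse and repeat
      result.cache()
      result.checkpoint()
      result.count()
    }
    sc.stop()
  }

  def main(args: Array[String]): Unit = {
    runMyJob()   // first run completes and the context is shut down
    runMyJob()   // second run with a new SparkContext in the same process
  }
}
{code}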







[jira] [Issue Comment Deleted] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job

2014-10-15 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-3882:
--
Comment: was deleted

(was: This is also a serious memory leak that will cause long running drivers 
for spark streaming jobs to exhaust their heap.)

> JobProgressListener gets permanently out of sync with long running job
> --
>
> Key: SPARK-3882
> URL: https://issues.apache.org/jira/browse/SPARK-3882
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.0.2
>Reporter: Davis Shepherd
> Attachments: Screen Shot 2014-10-03 at 12.50.59 PM.png
>
>
> A long running spark context (non-streaming) will eventually start throwing 
> the following in the driver:
> java.util.NoSuchElementException: key not found: 12771
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>   at 
> org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
> 2014-10-09 18:45:33,523 [SparkListenerBus] ERROR 
> org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener 
> threw an exception
> java.util.NoSuchElementException: key not found: 12782
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>   at 
> org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
>   at 
> org.apache.spark.scheduler.

[jira] [Commented] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job

2014-10-15 Thread Davis Shepherd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173428#comment-14173428
 ] 

Davis Shepherd commented on SPARK-3882:
---

This is also a serious memory leak that will cause long running drivers for 
spark streaming jobs to exhaust their heap.

> JobProgressListener gets permanently out of sync with long running job
> --
>
> Key: SPARK-3882
> URL: https://issues.apache.org/jira/browse/SPARK-3882
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.0.2
>Reporter: Davis Shepherd
> Attachments: Screen Shot 2014-10-03 at 12.50.59 PM.png
>
>
> A long running spark context (non-streaming) will eventually start throwing 
> the following in the driver:
> java.util.NoSuchElementException: key not found: 12771
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>   at 
> org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
> 2014-10-09 18:45:33,523 [SparkListenerBus] ERROR 
> org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener 
> threw an exception
> java.util.NoSuchElementException: key not found: 12782
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>   at 
> org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
>   at 
> or

[jira] [Comment Edited] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job

2014-10-09 Thread Davis Shepherd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165535#comment-14165535
 ] 

Davis Shepherd edited comment on SPARK-3882 at 10/9/14 6:51 PM:


Attached web ui screenshot.


was (Author: dgshep):
Lots of orphaned jobs.

> JobProgressListener gets permanently out of sync with long running job
> --
>
> Key: SPARK-3882
> URL: https://issues.apache.org/jira/browse/SPARK-3882
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.0.2
>Reporter: Davis Shepherd
> Attachments: Screen Shot 2014-10-03 at 12.50.59 PM.png
>
>
> A long running spark context (non-streaming) will eventually start throwing 
> the following in the driver:
> java.util.NoSuchElementException: key not found: 12771
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>   at 
> org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
> 2014-10-09 18:45:33,523 [SparkListenerBus] ERROR 
> org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener 
> threw an exception
> java.util.NoSuchElementException: key not found: 12782
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>   at 
> org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
>   at 
>

[jira] [Updated] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job

2014-10-09 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-3882:
--
Attachment: Screen Shot 2014-10-03 at 12.50.59 PM.png

Lots of orphaned jobs.

> JobProgressListener gets permanently out of sync with long running job
> --
>
> Key: SPARK-3882
> URL: https://issues.apache.org/jira/browse/SPARK-3882
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.0.2
>Reporter: Davis Shepherd
> Attachments: Screen Shot 2014-10-03 at 12.50.59 PM.png
>
>
> A long running spark context (non-streaming) will eventually start throwing 
> the following in the driver:
> java.util.NoSuchElementException: key not found: 12771
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>   at 
> org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
> 2014-10-09 18:45:33,523 [SparkListenerBus] ERROR 
> org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener 
> threw an exception
> java.util.NoSuchElementException: key not found: 12782
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>   at 
> org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
>

[jira] [Created] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job

2014-10-09 Thread Davis Shepherd (JIRA)
Davis Shepherd created SPARK-3882:
-

 Summary: JobProgressListener gets permanently out of sync with 
long running job
 Key: SPARK-3882
 URL: https://issues.apache.org/jira/browse/SPARK-3882
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.0.2
Reporter: Davis Shepherd


A long running spark context (non-streaming) will eventually start throwing the 
following in the driver:

java.util.NoSuchElementException: key not found: 12771
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:58)
  at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
  at 
org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
  at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
  at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
  at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
  at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at 
org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
  at 
org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
  at 
org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
  at scala.Option.foreach(Option.scala:236)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
2014-10-09 18:45:33,523 [SparkListenerBus] ERROR 
org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener threw 
an exception
java.util.NoSuchElementException: key not found: 12782
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:58)
  at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
  at 
org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
  at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
  at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
  at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
  at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at 
org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
  at 
org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
  at 
org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
  at scala.Option.foreach(Option.scala:236)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)

And the ui will show running jobs that are in fact no longer running and never 
clean them up. (see attached screenshot)

The result is that the ui becomes unusable, and the JobProgressListener leaks 
memory.
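
The stack trace above shows HashMap.apply throwing inside 
JobProgressListener.onStageCompleted, i.e. the listener looks up a stage id 
that has already been trimmed from one of its maps. The sketch below is a 
hypothetical simplification of that failure mode, not the actual listener code: 
an unguarded apply() crashes once an entry has been evicted, while a guarded 
lookup lets the listener keep running.

{code:scala}
import scala.collection.mutable

// Hypothetical simplification of the failure mode, not Spark's JobProgressListener.
object ListenerLookupSketch {
  private val activeStagesByStageId = mutable.HashMap[Int, Int]()

  def onStageCompletedUnguarded(stageId: Int): Unit = {
    val running = activeStagesByStageId(stageId) // throws "key not found" if stageId was evicted
    activeStagesByStageId(stageId) = running - 1
  }

  def onStageCompletedGuarded(stageId: Int): Unit = {
    activeStagesByStageId.get(stageId) match {
      case Some(running) => activeStagesByStageId(stageId) = running - 1
      case None          => () // stage already trimmed; skip rather than crash the listener thread
    }
  }
}
{code}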

[jira] [Created] (SPARK-1432) Potential memory leak in stageIdToExecutorSummaries in JobProgressTracker

2014-04-07 Thread Davis Shepherd (JIRA)
Davis Shepherd created SPARK-1432:
-

 Summary: Potential memory leak in stageIdToExecutorSummaries in 
JobProgressTracker
 Key: SPARK-1432
 URL: https://issues.apache.org/jira/browse/SPARK-1432
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 0.9.0
Reporter: Davis Shepherd


JobProgressTracker continuously cleans up old metadata as per the 
spark.ui.retainedStages configuration parameter. It seems, however, that not all 
metadata maps are being cleaned; in particular, stageIdToExecutorSummaries could 
grow in an unbounded manner in a long-running application.
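
As a rough illustration of the pattern being described (not the actual Spark 
code), the sketch below shows a trim routine that iterates over a fixed set of 
per-stage maps; the field names are hypothetical. Leaving one map out of the 
loop is exactly the kind of omission that lets it grow without bound in a 
long-running application.

{code:scala}
import scala.collection.mutable

// Hypothetical field names; only the shape of the cleanup is illustrated.
object RetainedStagesTrimSketch {
  val retainedStages = 1000
  val completedStageIds        = mutable.ArrayBuffer[Int]()
  val stageIdToSubmissionTime  = mutable.HashMap[Int, Long]()
  val stageIdToExecutorSummary = mutable.HashMap[Int, String]()

  def trimIfNecessary(): Unit = {
    if (completedStageIds.size > retainedStages) {
      val toRemove = completedStageIds.size - retainedStages
      completedStageIds.take(toRemove).foreach { id =>
        stageIdToSubmissionTime.remove(id)
        stageIdToExecutorSummary.remove(id) // omitting this line reproduces the kind of leak reported here
      }
      completedStageIds.remove(0, toRemove)
    }
  }
}
{code}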



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.

2014-04-03 Thread Davis Shepherd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959664#comment-13959664
 ] 

Davis Shepherd commented on SPARK-1411:
---

I submitted a pull request to fix 1337 as well.

Davis




> When using spark.ui.retainedStages=n only the first n stages are kept, not 
> the most recent.
> ---
>
> Key: SPARK-1411
> URL: https://issues.apache.org/jira/browse/SPARK-1411
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.9.0
> Environment: Ubuntu 12.04 LTS Precise
> 4 Nodes, 
> 16 cores each, 
> 196GB RAM each.
>Reporter: Davis Shepherd
> Attachments: Screen Shot 2014-04-03 at 5.35.00 PM.png
>
>
> For any long-running job with many stages, the web UI only shows the first n 
> stages of the job (where spark.ui.retainedStages=n). The most recent stages 
> are immediately dropped and are only visible for a brief time. This renders 
> the UI effectively useless after a short amount of time for a long-running 
> non-streaming job. I am unsure whether similar results appear for streaming 
> jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.

2014-04-03 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd closed SPARK-1411.
-

Resolution: Duplicate

> When using spark.ui.retainedStages=n only the first n stages are kept, not 
> the most recent.
> ---
>
> Key: SPARK-1411
> URL: https://issues.apache.org/jira/browse/SPARK-1411
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.9.0
> Environment: Ubuntu 12.04 LTS Precise
> 4 Nodes, 
> 16 cores each, 
> 196GB RAM each.
>Reporter: Davis Shepherd
> Attachments: Screen Shot 2014-04-03 at 5.35.00 PM.png
>
>
> For any long-running job with many stages, the web UI only shows the first n 
> stages of the job (where spark.ui.retainedStages=n). The most recent stages 
> are immediately dropped and are only visible for a brief time. This renders 
> the UI effectively useless after a short amount of time for a long-running 
> non-streaming job. I am unsure whether similar results appear for streaming 
> jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.

2014-04-03 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-1411:
--

Attachment: Screen Shot 2014-04-03 at 5.35.00 PM.png

Notice the large gap in stage ids and submitted time between the top two stages.

> When using spark.ui.retainedStages=n only the first n stages are kept, not 
> the most recent.
> ---
>
> Key: SPARK-1411
> URL: https://issues.apache.org/jira/browse/SPARK-1411
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.9.0
> Environment: Ubuntu 12.04 LTS Precise
> 4 Nodes, 
> 16 cores each, 
> 196GB RAM each.
>Reporter: Davis Shepherd
>Priority: Minor
> Attachments: Screen Shot 2014-04-03 at 5.35.00 PM.png
>
>
> For any long-running job with many stages, the web UI only shows the first n 
> stages of the job (where spark.ui.retainedStages=n). The most recent stages 
> are immediately dropped and are only visible for a brief time. This renders 
> the UI effectively useless after a short amount of time for a long-running 
> non-streaming job. I am unsure whether similar results appear for streaming 
> jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.

2014-04-03 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-1411:
--

Priority: Major  (was: Minor)

> When using spark.ui.retainedStages=n only the first n stages are kept, not 
> the most recent.
> ---
>
> Key: SPARK-1411
> URL: https://issues.apache.org/jira/browse/SPARK-1411
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.9.0
> Environment: Ubuntu 12.04 LTS Precise
> 4 Nodes, 
> 16 cores each, 
> 196GB RAM each.
>Reporter: Davis Shepherd
> Attachments: Screen Shot 2014-04-03 at 5.35.00 PM.png
>
>
> For any long-running job with many stages, the web UI only shows the first n 
> stages of the job (where spark.ui.retainedStages=n). The most recent stages 
> are immediately dropped and are only visible for a brief time. This renders 
> the UI effectively useless after a short amount of time for a long-running 
> non-streaming job. I am unsure whether similar results appear for streaming 
> jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1411) When using spark.ui.retainedStages=n only the first n stages are kept, not the most recent.

2014-04-03 Thread Davis Shepherd (JIRA)
Davis Shepherd created SPARK-1411:
-

 Summary: When using spark.ui.retainedStages=n only the first n 
stages are kept, not the most recent.
 Key: SPARK-1411
 URL: https://issues.apache.org/jira/browse/SPARK-1411
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 0.9.0
 Environment: Ubuntu 12.04 LTS Precise
4 Nodes, 
16 cores each, 
196GB RAM each.
Reporter: Davis Shepherd
Priority: Minor


For any long-running job with many stages, the web UI only shows the first n 
stages of the job (where spark.ui.retainedStages=n). The most recent stages are 
immediately dropped and are only visible for a brief time. This renders the UI 
effectively useless after a short amount of time for a long-running 
non-streaming job. I am unsure whether similar results appear for streaming 
jobs.
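
As a rough illustration of the difference between the reported behaviour and 
the expected one (hypothetical names, not the actual web UI code): with a 
retention limit of n, refusing to add new entries once the buffer is full keeps 
the first n stages forever, whereas evicting from the front keeps the most 
recent n stages.

{code:scala}
import scala.collection.mutable

// Hypothetical sketch contrasting the two retention policies.
object RetainedStagesOrderSketch {
  val retainedStages = 3
  val keepsFirstN      = mutable.ArrayBuffer[Int]()
  val keepsMostRecentN = mutable.ArrayBuffer[Int]()

  def record(stageId: Int): Unit = {
    // reported behaviour: newest stages are silently dropped
    if (keepsFirstN.size < retainedStages) keepsFirstN += stageId

    // expected behaviour: oldest stages are evicted instead
    keepsMostRecentN += stageId
    if (keepsMostRecentN.size > retainedStages) keepsMostRecentN.remove(0)
  }

  def main(args: Array[String]): Unit = {
    (1 to 6).foreach(record)
    println(keepsFirstN)      // ArrayBuffer(1, 2, 3)
    println(keepsMostRecentN) // ArrayBuffer(4, 5, 6)
  }
}
{code}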



--
This message was sent by Atlassian JIRA
(v6.2#6252)