[ 
https://issues.apache.org/jira/browse/SPARK-16929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889435#comment-15889435
 ] 

Apache Spark commented on SPARK-16929:
--------------------------------------

User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/17112

> Speculation-related synchronization bottleneck in checkSpeculatableTasks
> ------------------------------------------------------------------------
>
>                 Key: SPARK-16929
>                 URL: https://issues.apache.org/jira/browse/SPARK-16929
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: Nicholas Brown
>
> Our cluster has been running slowly since I got speculation working.  I looked
> into it and noticed that stderr was reporting some tasks as taking almost an
> hour to run, even though the application logs on the nodes showed that those
> tasks took only a minute or so.  Digging into the thread dump for the master
> node, I noticed that a number of threads were blocked, apparently by the
> speculation thread.  At line 476 of TaskSchedulerImpl, it grabs a lock on the
> TaskScheduler while it looks through the tasks to see what needs to be rerun.
> Unfortunately, that code loops through each of the tasks, so when you have
> even just a couple hundred thousand tasks, it can be prohibitively slow to
> run inside a synchronized block.  Once I disabled speculation, the job went
> back to having acceptable performance.
> There are no comments around that lock indicating why it was added, and the
> git history contains a couple of refactorings, so it's hard to find where it
> was added.  I'm tempted to believe it is the result of someone assuming that
> an extra synchronized block never hurt anyone (in reality I've probably seen
> just as many bugs caused by over-synchronization as by too little), as it
> looks too broad to be actually guarding against any potential concurrency
> issue.  But since concurrency issues can be tricky to reproduce (and yes, I
> understand that's an extreme understatement), I'm not sure that blindly
> removing it without being familiar with the history is necessarily safe.
> Can someone look into this?  Or at least make a note in the documentation 
> that speculation should not be used with large clusters?
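
The locking pattern described in the report can be illustrated with a minimal
sketch. This is not Spark's code and not the change made in the linked pull
request: SchedulerSketch, TaskInfo, and every method name below are
hypothetical stand-ins. The first method scans all tasks while holding the
scheduler-wide lock, as the report describes; the second holds the lock only
long enough to copy the task list and scans the copy afterwards. Whether such
a change would be safe in Spark depends on what the lock actually guards,
which the report itself leaves open.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a scheduler; none of these names come from Spark. */
class SchedulerSketch {

    /** Immutable record of one running task (hypothetical). */
    static final class TaskInfo {
        final int id;
        final long runtimeMs;

        TaskInfo(int id, long runtimeMs) {
            this.id = id;
            this.runtimeMs = runtimeMs;
        }
    }

    // Guarded by the intrinsic lock on this object.
    private final List<TaskInfo> running = new ArrayList<>();

    /** Registers a task; uses the same lock as the scans below. */
    synchronized void addTask(int id, long runtimeMs) {
        running.add(new TaskInfo(id, runtimeMs));
    }

    /**
     * Pattern described in the report: the whole scan runs while the
     * scheduler-wide lock is held, so every other synchronized method
     * on this object blocks until the loop finishes.
     */
    synchronized List<Integer> findSlowTasksHoldingLock(long thresholdMs) {
        return scan(running, thresholdMs);
    }

    /**
     * Alternative sketch: hold the lock only long enough to copy the list,
     * then evaluate the copy without the lock. The copy is still linear in
     * the number of tasks, but the per-task check no longer blocks others.
     */
    List<Integer> findSlowTasksWithSnapshot(long thresholdMs) {
        List<TaskInfo> snapshot;
        synchronized (this) {
            snapshot = new ArrayList<>(running);
        }
        return scan(snapshot, thresholdMs);
    }

    /** Returns the ids of tasks whose runtime exceeds the threshold. */
    private static List<Integer> scan(List<TaskInfo> tasks, long thresholdMs) {
        List<Integer> slow = new ArrayList<>();
        for (TaskInfo t : tasks) {
            if (t.runtimeMs > thresholdMs) {
                slow.add(t.id);
            }
        }
        return slow;
    }
}
```

Both methods return the same ids for the same input; they differ only in how
long the lock is held. The snapshot variant trades an extra list copy for a
shorter critical section, and it can act on slightly stale data.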



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

