Nicholas Brown created SPARK-16929:
--------------------------------------

             Summary: Bad synchronization with regard to speculation
                 Key: SPARK-16929
                 URL: https://issues.apache.org/jira/browse/SPARK-16929
             Project: Spark
          Issue Type: Bug
          Components: Scheduler
            Reporter: Nicholas Brown


Our cluster has been running slowly since I enabled speculation. Looking into it, I 
noticed stderr reporting that some tasks were taking almost an hour to run, even 
though the application logs on the nodes show those same tasks finishing in a minute 
or so.  Digging into a thread dump for the master node, I saw that a number of 
threads are blocked, apparently by the speculation thread.  At line 476 of 
TaskSchedulerImpl it grabs a lock on the TaskScheduler while it looks through the 
tasks to see what needs to be rerun.  Unfortunately that code loops over every task, 
so with even just a couple hundred thousand tasks the scan is prohibitively slow to 
run inside a synchronized block.  Once I disabled speculation, the job went back to 
acceptable performance.
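
To make the contention pattern concrete, here is a minimal, self-contained Scala 
sketch of the shape described above.  It is not the actual TaskSchedulerImpl code; 
the names (TaskInfo, schedulerLock, runningTasks) are made up for illustration.  The 
point is that the speculation check holds the scheduler's lock for an O(n) walk over 
all tasks, so any other thread that needs the same lock waits out the whole scan:

{code:scala}
import scala.collection.mutable.ArrayBuffer

object SpeculationContentionSketch {
  // Hypothetical stand-in for the scheduler's per-task bookkeeping; names are made up.
  final case class TaskInfo(id: Long, runtimeMs: Long)

  private val schedulerLock = new Object
  private val runningTasks: ArrayBuffer[TaskInfo] =
    ArrayBuffer.tabulate(200000)(i => TaskInfo(i.toLong, (i % 1000).toLong))

  // Periodic speculation check: holds the lock for the entire O(n) scan,
  // which is the pattern this report describes.
  def checkSpeculatableTasks(thresholdMs: Long): Seq[Long] =
    schedulerLock.synchronized {
      runningTasks.filter(_.runtimeMs > thresholdMs).map(_.id).toSeq
    }

  // Any other scheduler work that needs the same lock (launching tasks,
  // handling completions, making offers) queues up behind that scan.
  def handleTaskCompletion(taskId: Long): Unit =
    schedulerLock.synchronized {
      val idx = runningTasks.indexWhere(_.id == taskId)
      if (idx >= 0) runningTasks.remove(idx)
    }
}
{code}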

There are no comments around that lock explaining why it was added, and the git 
history includes a couple of refactorings, so it's hard to find where it came from.  
I'm tempted to believe it is the result of someone assuming an extra synchronized 
block never hurt anyone (in reality I've probably seen as many bugs caused by 
over-synchronization as by too little), since it looks too broad to actually be 
guarding any real concurrency issue.  But since concurrency issues can be tricky to 
reproduce (and yes, I understand that's an extreme understatement), I'm not sure 
blindly removing the lock without being familiar with the history is necessarily 
safe.
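
If someone who does know the history concludes the lock only needs to protect a 
consistent read of the task state rather than the whole scan, one possible shape 
(sketched here against the toy example above, as an illustration rather than a 
vetted fix) would be to copy the needed state under the lock and do the expensive 
filtering outside it:

{code:scala}
// Sketch only, continuing the toy example above: take a cheap snapshot inside
// the critical section, then do the expensive scan without holding the lock.
def checkSpeculatableTasksWithSnapshot(thresholdMs: Long): Seq[Long] = {
  val snapshot = schedulerLock.synchronized { runningTasks.toVector } // short critical section
  snapshot.filter(_.runtimeMs > thresholdMs).map(_.id)                // long scan runs lock-free
}
{code}

Whether a slightly stale snapshot is acceptable is exactly the question the missing 
comments and history would answer, which is why I'm not proposing this as the fix.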

Can someone look into this?  Or at least add a note to the documentation that 
speculation should not be used for jobs with very large numbers of tasks?
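
For reference, the knobs such a documentation note would presumably point at are the 
speculation configs (names are from the config docs; the defaults shown are my 
understanding, so please double-check), e.g. set through SparkConf:

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")             // defaults to false; enabling it turns on the periodic scan described above
  .set("spark.speculation.interval", "100ms")   // how often the speculation check runs
  .set("spark.speculation.quantile", "0.75")    // fraction of tasks that must finish before speculating
  .set("spark.speculation.multiplier", "1.5")   // how many times slower than the median a task must be to be speculated
{code}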



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
