[ https://issues.apache.org/jira/browse/SPARK-16929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889435#comment-15889435 ]
Apache Spark commented on SPARK-16929: -------------------------------------- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/17112 > Speculation-related synchronization bottleneck in checkSpeculatableTasks > ------------------------------------------------------------------------ > > Key: SPARK-16929 > URL: https://issues.apache.org/jira/browse/SPARK-16929 > Project: Spark > Issue Type: Bug > Components: Scheduler > Reporter: Nicholas Brown > > Our cluster has been running slowly since I got speculation working, I looked > into it and noticed that stderr was saying some tasks were taking almost an > hour to run even though in the application logs on the nodes that task only > took a minute or so to run. Digging into the thread dump for the master node > I noticed a number of threads are blocked, apparently by speculation thread. > At line 476 of TaskSchedulerImpl it grabs a lock on the TaskScheduler while > it looks through the tasks to see what needs to be rerun. Unfortunately that > code loops through each of the tasks, so when you have even just a couple > hundred thousand tasks to run that can be prohibitively slow to run inside of > a synchronized block. Once I disabled speculation, the job went back to > having acceptable performance. > There are no comments around that lock indicating why it was added, and the > git history seems to have a couple refactorings so its hard to find where it > was added. I'm tempted to believe it is the result of someone assuming that > an extra synchronized block never hurt anyone (in reality I've probably just > as many bugs caused by over synchronization as too little) as it looks too > broad to be actually guarding any potential concurrency issue. But, since > concurrency issues can be tricky to reproduce (and yes, I understand that's > an extreme understatement) I'm not sure just blindly removing it without > being familiar with the history is necessarily safe. > Can someone look into this? Or at least make a note in the documentation > that speculation should not be used with large clusters? -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org