[ https://issues.apache.org/jira/browse/SPARK-14327 ]
Chris Bannister updated SPARK-14327:
------------------------------------
    Attachment: driver.jstack

jstack of the driver

> Scheduler holds locks which cause huge scheduler delays and executor timeouts
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-14327
>                 URL: https://issues.apache.org/jira/browse/SPARK-14327
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 1.6.1
>            Reporter: Chris Bannister
>         Attachments: driver.jstack
>
>
> I have a job that, partway through one of its stages, grinds to a halt: it goes from
> processing around 300k tasks in 15 minutes to fewer than 1,000 in the next hour. The
> driver ends up using 100% CPU on a single core (out of 4), the executors start failing
> to receive heartbeat responses, tasks are not scheduled, and results only trickle in.
> For this stage the max scheduler delay is 15 minutes, while the 75th percentile is 4 ms.
> It appears that TaskSchedulerImpl does most of its work whilst holding the class's
> global synchronised lock, which is shared between at least:
> TaskSetManager.canFetchMoreResults
> TaskSchedulerImpl.handleSuccessfulTask
> TaskSchedulerImpl.executorHeartbeatReceived
> TaskSchedulerImpl.statusUpdate
> TaskSchedulerImpl.checkSpeculatableTasks
> This looks to severely limit the latency and throughput of the scheduler, and causes
> my job to fail outright because it takes too long. A minimal sketch of the contention
> pattern follows below.
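
To make the contention pattern concrete, here is a minimal Scala sketch. It is not the
actual TaskSchedulerImpl source; the class name SchedulerSketch, the method signatures,
and the HashMap bookkeeping are all hypothetical. It only illustrates the pattern
described above: every method synchronizes on the same instance, so one slow call
(for example, handling a large task result) blocks heartbeat processing and status
updates for its entire duration.

    // Hypothetical sketch of the lock-contention pattern, not the real TaskSchedulerImpl.
    // Every method synchronizes on the same object, so they execute strictly one at a time.
    class SchedulerSketch {
      private val finishedTasks = scala.collection.mutable.HashMap.empty[Long, Array[Byte]]

      // Runs when a task result comes back; if this work is slow, the lock is held
      // for the whole duration and everything else queues behind it.
      def handleSuccessfulTask(tid: Long, serializedResult: Array[Byte]): Unit = synchronized {
        finishedTasks(tid) = serializedResult
      }

      // Called for every executor heartbeat; blocked while the method above runs.
      def executorHeartbeatReceived(executorId: String): Boolean = synchronized {
        finishedTasks.nonEmpty  // placeholder for the real per-heartbeat bookkeeping
      }

      // Called for every task state change reported by executors; also blocked.
      def statusUpdate(tid: Long, state: String): Unit = synchronized {
        if (state == "FINISHED") finishedTasks.remove(tid)
      }
    }

Under this pattern, heartbeat handling and status updates cannot proceed concurrently
with result handling even though they touch largely independent state, which matches
the single saturated driver core and the heartbeat timeouts reported above.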