Chris Bannister created SPARK-14327:
---------------------------------------

             Summary: Scheduler holds locks which cause huge scheduler delays 
and executor timeouts
                 Key: SPARK-14327
                 URL: https://issues.apache.org/jira/browse/SPARK-14327
             Project: Spark
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: 1.6.1
            Reporter: Chris Bannister


I have a job which, after a while in one of its stages, grinds to a halt: it 
goes from processing around 300k tasks in 15 minutes to fewer than 1000 in the 
next hour. The driver ends up using 100% CPU on a single core (out of 4), the 
executors start failing to receive heartbeat responses, tasks are not 
scheduled, and results trickle in.

For this stage the max scheduler delay is 15 minutes, and the 75th percentile 
is 4ms.

It appears that TaskSchedulerImpl does most of its work whilst holding the 
global synchronised lock for the class. This synchronised lock is shared 
between at least:

TaskSetManager.canFetchMoreResults
TaskSchedulerImpl.handleSuccessfulTask
TaskSchedulerImpl.executorHeartbeatReceived
TaskSchedulerImpl.statusUpdate
TaskSchedulerImpl.checkSpeculatableTasks

This appears to severely hurt the latency and throughput of the scheduler, and 
causes my job to fail outright because it takes too long.
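
The contention pattern described above can be illustrated with a minimal 
sketch. This is not Spark's code: CoarseScheduler is a hypothetical stand-in 
whose two methods borrow names from the list above and share one monitor, and 
the method bodies are invented. It assumes only standard JVM behaviour, namely 
that a thread inside one synchronized method blocks every other thread calling 
a synchronized method on the same object.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

final class CoarseLockDemo {

    // Hypothetical stand-in for a scheduler guarded by a single monitor.
    static final class CoarseScheduler {
        // Runs caller-supplied work while holding the monitor (simulates a slow update).
        synchronized void statusUpdate(Runnable work) {
            work.run();
        }

        // Cheap on its own, but must wait for the same monitor.
        synchronized boolean executorHeartbeatReceived() {
            return true;
        }
    }

    // Returns true if the heartbeat could not be answered while the update
    // held the monitor, and was answered once the update finished.
    static boolean heartbeatBlockedWhileUpdating() {
        CoarseScheduler scheduler = new CoarseScheduler();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch finishUpdate = new CountDownLatch(1);
        AtomicBoolean heartbeatAnswered = new AtomicBoolean(false);

        Thread updater = new Thread(() -> scheduler.statusUpdate(() -> {
            lockHeld.countDown();              // signal: monitor is now held
            try {
                finishUpdate.await();          // stay inside the lock until released
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        Thread heartbeat = new Thread(() -> {
            scheduler.executorHeartbeatReceived();
            heartbeatAnswered.set(true);
        });

        try {
            updater.start();
            lockHeld.await();                  // updater owns the monitor from here
            heartbeat.start();
            heartbeat.join(200);               // heartbeat cannot enter, so this times out
            boolean blocked = !heartbeatAnswered.get();

            finishUpdate.countDown();          // let the update finish and release the lock
            updater.join();
            heartbeat.join();
            return blocked && heartbeatAnswered.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo interrupted", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("heartbeat blocked while update held the lock: "
                + heartbeatBlockedWhileUpdating());
    }
}
```

In this sketch the heartbeat itself does almost no work, yet its response time 
is bounded below by however long the other caller stays inside the lock, which 
matches the symptom of executors missing heartbeat responses while the driver 
is busy on one core.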



