GitHub user pgandhi999 opened a pull request: https://github.com/apache/spark/pull/22221
[SPARK-25231] : Executor Heartbeat Receiver does not need to synchronize on the TaskSchedulerImpl object

Running a large Spark job with speculation turned on was causing executor heartbeats to time out on the driver end after some time. Eventually, after hitting the max number of executor failures, the job would fail.

## What changes were proposed in this pull request?

The main reason for the heartbeat timeouts was that the heartbeat-receiver-event-loop-thread was blocked waiting on the TaskSchedulerImpl object, whose lock was being held by one of the dispatcher-event-loop threads executing the method dequeueSpeculativeTasks() in TaskSetManager.scala. On further analysis of the heartbeat receiver method, it turns out there is no need to hold the lock on the whole object. The block of code in the method only uses one global HashMap, taskIdToTaskSetManager. Making that map a ConcurrentHashMap keeps individual operations on it atomic and speeds up the heartbeat receiver thread, which no longer has to wait for the TaskSchedulerImpl lock.

## How was this patch tested?

Screenshots of the thread dump have been attached below:

**heartbeat-receiver-event-loop-thread:**
<img width="1409" alt="screen shot 2018-08-24 at 9 19 57 am" src="https://user-images.githubusercontent.com/22228190/44593413-e25df780-a788-11e8-9520-176a18401a59.png">

**dispatcher-event-loop-thread:**
<img width="1409" alt="screen shot 2018-08-24 at 9 21 56 am" src="https://user-images.githubusercontent.com/22228190/44593484-13d6c300-a789-11e8-8d88-34b1d51d4541.png">

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pgandhi999/spark SPARK-25231

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22221.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22221

----
commit a0dcde583c76cb96f5112f4ff863874415ec9140
Author: pgandhi <pgandhi@...>
Date:   2018-08-24T15:27:01Z

    [SPARK-25231] : Executor Heartbeat Receiver does not need to synchronize on the TaskSchedulerImpl object

    The main reason for the heartbeat timeouts was that the heartbeat-receiver-event-loop-thread was blocked waiting on the TaskSchedulerImpl object which was being held by one of the dispatcher-event-loop threads executing the method dequeueSpeculativeTasks() in TaskSetManager.scala. On further analysis of the heartbeat receiver method, it turns out there is no need to hold the lock on the whole object. The block of code in the method only uses one global HashMap taskIdToTaskSetManager. Making that map a ConcurrentHashMap, we are ensuring atomicity of operations and speeding up the heartbeat receiver thread operation.

----
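
For readers who want to see the shape of the change, the idea described in this pull request can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual patch: `SchedulerSketch`, `TaskSetManagerStub`, `registerTask`, `taskFinished`, and `managersForHeartbeat` are hypothetical names invented for the example. Only `taskIdToTaskSetManager` and the use of `java.util.concurrent.ConcurrentHashMap` come from the description above.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for Spark's TaskSetManager; only what the sketch needs.
final class TaskSetManagerStub(val name: String)

// Hypothetical, heavily simplified scheduler; not Spark's actual TaskSchedulerImpl.
class SchedulerSketch {
  // A ConcurrentHashMap makes individual get/put/remove calls thread-safe
  // without the caller holding the scheduler-wide monitor.
  private val taskIdToTaskSetManager =
    new ConcurrentHashMap[Long, TaskSetManagerStub]()

  // Scheduling paths may still synchronize on `this` for their own invariants.
  def registerTask(taskId: Long, manager: TaskSetManagerStub): Unit = synchronized {
    taskIdToTaskSetManager.put(taskId, manager)
  }

  def taskFinished(taskId: Long): Unit = synchronized {
    taskIdToTaskSetManager.remove(taskId)
  }

  // Heartbeat path: no `synchronized`, so it does not queue behind a
  // long-running scheduling call that currently holds the scheduler lock.
  def managersForHeartbeat(taskIds: Seq[Long]): Seq[(Long, TaskSetManagerStub)] =
    taskIds.flatMap { id =>
      // get returns null for unknown ids; Option(...) turns that into None.
      Option(taskIdToTaskSetManager.get(id)).map(m => id -> m)
    }
}
```

In this sketch the heartbeat lookup only reads the map, so each lookup is safe on its own. Any code that needs several map operations to appear as one atomic step would still need its own coordination.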