[ https://issues.apache.org/jira/browse/SPARK-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143259#comment-15143259 ]
Sital Kedia commented on SPARK-13279:
-------------------------------------

As you can see from the jstack of the driver (http://pastebin.com/m8CP6VMv), the dag-scheduler-event-loop thread has taken a lock and is spending a lot of time in the addPendingTask function. For each task added, it iterates over the list of pending tasks to check for duplicates, which makes the whole operation O(n^2); when the number of tasks is huge, it takes more than 5 minutes. As mentioned in the comment, addPendingTask does not really need to check for duplicates, because dequeueTaskFromList will skip already-running tasks. If we remove the duplicate check from addPendingTask, the lock is held only very briefly and things work fine. We cannot replace the list with a set, because we treat the list of pending tasks as a stack; see https://github.com/sitalkedia/spark/blob/fix_stuck_driver/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L113.

Please note that this is a regression from Spark 1.5, introduced in facebook/FB-Spark@3535b91#diff-bad3987c83bd22d46416d3dd9d208e76L789.

> Spark driver stuck holding a global lock when there are 200k tasks submitted
> in a stage
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-13279
>                 URL: https://issues.apache.org/jira/browse/SPARK-13279
>             Project: Spark
>          Issue Type: Bug
>      Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Sital Kedia
>
> While running a large pipeline with 200k tasks, we found that the executors
> were not able to register with the driver because the driver was stuck
> holding a global lock in the TaskSchedulerImpl.submitTasks function.
> jstack of the driver - http://pastebin.com/m8CP6VMv
> executor log - http://pastebin.com/2NPS1mXC
> From the jstack I see that the thread handling the resource offer from
> executors (dispatcher-event-loop-9) is blocked on a lock held by the thread
> "dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer
> when adding pending tasks. So when we have 200k pending tasks, because of
> this O(n^2) operation, the driver just hangs for more than 5 minutes.
> Solution - In the addPendingTask function, we don't really need a duplicate
> check. It's okay if we add a task to the same queue twice, because
> dequeueTaskFromList will skip already-running tasks.
> Please note that this is a regression from Spark 1.5.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
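The proposed fix can be sketched as a minimal model: append to the pending list without a duplicate check (O(1) instead of the O(n) `contains` scan that made re-enqueueing n tasks O(n^2)), and let the dequeue side tolerate duplicates by skipping tasks that are already running. The object and method bodies below are illustrative stand-ins for Spark's TaskSetManager internals, not the actual patch:

```scala
import scala.collection.mutable

// Hypothetical, simplified model of the pending-task queue for one
// locality level. Names mirror the issue's description
// (addPendingTask, dequeueTaskFromList), not Spark's real code.
object PendingQueue {
  // Pending task indices; kept as a buffer, not a Set, because the
  // scheduler treats it as a stack (most recently added runs first).
  val pendingTasks = new mutable.ArrayBuffer[Int]
  // Tasks already launched; dequeue must skip duplicates of these.
  val running = mutable.Set[Int]()

  // O(1) append with no duplicate check, matching the proposed fix.
  // The old behavior guarded with `if (!pendingTasks.contains(index))`,
  // an O(n) scan per call performed while holding the scheduler lock.
  def addPendingTask(index: Int): Unit = pendingTasks += index

  // Pop from the tail (stack/LIFO order), silently dropping entries
  // that duplicate an already-running task.
  def dequeueTaskFromList(): Option[Int] = {
    while (pendingTasks.nonEmpty) {
      val index = pendingTasks.remove(pendingTasks.size - 1)
      if (!running.contains(index)) {
        running += index
        return Some(index)
      }
    }
    None
  }
}
```

Because stale duplicates are filtered out at dequeue time, enqueueing the same index twice is harmless, which is exactly why the O(n) duplicate scan in addPendingTask can be dropped.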