[ https://issues.apache.org/jira/browse/SPARK-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sital Kedia updated SPARK-13279:
--------------------------------
Description:

While running a large pipeline with 200k tasks, we found that the executors were not able to register with the driver because the driver was stuck holding a global lock in the TaskSchedulerImpl.submitTasks function.

jstack of the driver - http://pastebin.com/m8CP6VMv
executor log - http://pastebin.com/2NPS1mXC

From the jstack I see that the thread handling the resource offers from executors (dispatcher-event-loop-9) is blocked on a lock held by the thread "dag-scheduler-event-loop", which iterates over an entire ArrayBuffer when adding a pending task. So when we have 200k pending tasks, because of this O(n^2) operation, the driver hangs for more than 5 minutes.

Solution - Instead of an ArrayBuffer, we can use a LinkedHashSet, which gives us O(1) lookup while still maintaining insertion order.
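The proposed fix replaces the ArrayBuffer with a LinkedHashSet so the duplicate check on insert becomes constant-time while insertion order is preserved. A minimal sketch of the difference, using the analogous java.util collections rather than Spark's actual Scala code (the class and variable names here are illustrative, not from the patch):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class PendingTasks {
    public static void main(String[] args) {
        // ArrayBuffer-style: contains() scans the whole list, O(n) per
        // lookup, so n inserts with duplicate checks cost O(n^2) overall.
        List<Integer> buffer = new ArrayList<>();

        // LinkedHashSet: a hash table gives O(1) lookup, and a linked
        // list threaded through the entries preserves insertion order.
        Set<Integer> pending = new LinkedHashSet<>();

        for (int task = 0; task < 5; task++) {
            if (!buffer.contains(task)) {  // linear scan each time
                buffer.add(task);
            }
            pending.add(task);             // constant-time membership check
        }

        // Iteration order matches insertion order for both structures.
        System.out.println(pending);       // [0, 1, 2, 3, 4]
    }
}
```

With 200k pending tasks, the linear scan on every insert is what kept the dag-scheduler-event-loop thread inside the lock long enough to block executor registration; the hash-based check removes that quadratic cost.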
> Spark driver stuck holding a global lock when there are 200k tasks submitted
> in a stage
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-13279
>                 URL: https://issues.apache.org/jira/browse/SPARK-13279
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Sital Kedia
>             Fix For: 1.6.0
>
> While running a large pipeline with 200k tasks, we found that the executors
> were not able to register with the driver because the driver was stuck
> holding a global lock in the TaskSchedulerImpl.submitTasks function.
>
> jstack of the driver - http://pastebin.com/m8CP6VMv
> executor log - http://pastebin.com/2NPS1mXC
>
> From the jstack I see that the thread handling the resource offers from
> executors (dispatcher-event-loop-9) is blocked on a lock held by the thread
> "dag-scheduler-event-loop", which iterates over an entire ArrayBuffer when
> adding a pending task. So when we have 200k pending tasks, because of this
> O(n^2) operation, the driver hangs for more than 5 minutes.
>
> Solution - Instead of an ArrayBuffer, we can use a LinkedHashSet, which
> gives us O(1) lookup while still maintaining insertion order.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org