[ https://issues.apache.org/jira/browse/SPARK-13279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-13279:
------------------------------------

    Assignee:     (was: Apache Spark)

> Spark driver stuck holding a global lock when there are 200k tasks submitted in a stage
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-13279
>                 URL: https://issues.apache.org/jira/browse/SPARK-13279
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Sital Kedia
>             Fix For: 1.6.0
>
> While running a large pipeline with 200k tasks, we found that the executors were not able to register with the driver because the driver was stuck holding a global lock in the TaskSchedulerImpl.submitTasks function.
> jstack of the driver - http://pastebin.com/m8CP6VMv
> executor log - http://pastebin.com/2NPS1mXC
> From the jstack I see that the thread handling the resource offer from executors (dispatcher-event-loop-9) is blocked on a lock held by the thread "dag-scheduler-event-loop", which is iterating over an entire ArrayBuffer when adding a pending task. So when we have 200k pending tasks, because of this O(n^2) operation, the driver is hung for more than 5 minutes.
> Solution - In the addPendingTask function, we don't really need a duplicate check. It's okay if we add a task to the same queue twice, because dequeueTaskFromList will skip already-running tasks. (A sketch of this pattern appears below.)
> Please note that this is a regression from Spark 1.5.
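To make the O(n^2) behavior and the proposed fix concrete, here is a minimal Scala sketch of the pattern described above. This is not the actual Spark source; the simplified signatures and the runningTasks set are hypothetical stand-ins for TaskSetManager's internal state, kept only to show why the unconditional append is safe.

{code:scala}
import scala.collection.mutable.{ArrayBuffer, HashSet}

object PendingTaskSketch {
  // Pending task indices; duplicates are tolerated after the fix.
  private val pendingTasks = new ArrayBuffer[Int]
  // Indices of tasks that have already been launched (hypothetical stand-in).
  private val runningTasks = new HashSet[Int]

  // Pre-fix shape: contains() is a linear scan of the ArrayBuffer, so
  // enqueuing n tasks costs O(n^2) while the scheduler lock is held.
  def addPendingTaskWithDupCheck(index: Int): Unit = {
    if (!pendingTasks.contains(index)) {
      pendingTasks += index
    }
  }

  // Post-fix shape: append unconditionally, O(1) per task.
  def addPendingTask(index: Int): Unit = {
    pendingTasks += index
  }

  // Dequeue from the back, lazily discarding entries that are already
  // running; this lazy skip is what makes duplicate entries harmless.
  def dequeueTaskFromList(): Option[Int] = {
    while (pendingTasks.nonEmpty) {
      val index = pendingTasks.remove(pendingTasks.size - 1)
      if (!runningTasks.contains(index)) {
        runningTasks += index
        return Some(index)
      }
    }
    None
  }
}
{code}

The design point is that correctness lives on the dequeue side: since dequeueTaskFromList already filters out tasks that are no longer pending, the add path can drop its per-call scan entirely, turning stage submission from quadratic to linear in the number of tasks.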