Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/7385#issuecomment-122333250

It looks like the hanging test is DAGSchedulerSuite's "misbehaved resultHandler should not crash DAGScheduler and SparkContext" test. Here's the relevant part of the unit tests log:

```
15/07/17 09:05:49.333 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO SparkContext: Starting job: runJob at DAGSchedulerSuite.scala:762
15/07/17 09:05:49.346 dag-scheduler-event-loop INFO DAGScheduler: Got job 0 (runJob at DAGSchedulerSuite.scala:762) with 1 output partitions (allowLocal=true)
15/07/17 09:05:49.346 dag-scheduler-event-loop INFO DAGScheduler: Final stage: ResultStage 0(runJob at DAGSchedulerSuite.scala:762)
15/07/17 09:05:49.347 dag-scheduler-event-loop INFO DAGScheduler: Parents of final stage: List()
15/07/17 09:05:49.347 dag-scheduler-event-loop INFO DAGScheduler: Missing parents: List()
15/07/17 09:05:49.348 dag-scheduler-event-loop INFO DAGScheduler: Computing the requested partition locally
15/07/17 09:05:49.353 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO DAGScheduler: Job 0 failed: runJob at DAGSchedulerSuite.scala:762, took 0.019835 s
15/07/17 09:05:49.362 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO SparkContext: Starting job: runJob at DAGSchedulerSuite.scala:773
15/07/17 09:05:49.363 dag-scheduler-event-loop INFO DAGScheduler: Got job 1 (runJob at DAGSchedulerSuite.scala:773) with 2 output partitions (allowLocal=false)
15/07/17 09:05:49.363 dag-scheduler-event-loop INFO DAGScheduler: Final stage: ResultStage 1(runJob at DAGSchedulerSuite.scala:773)
15/07/17 09:05:49.363 dag-scheduler-event-loop INFO DAGScheduler: Parents of final stage: List()
15/07/17 09:05:49.363 dag-scheduler-event-loop INFO DAGScheduler: Missing parents: List()
15/07/17 09:05:49.366 dag-scheduler-event-loop INFO DAGScheduler: Submitting ResultStage 1 (ParallelCollectionRDD[1] at parallelize at DAGSchedulerSuite.scala:772), which has no missing parents
15/07/17 09:05:49.452 dag-scheduler-event-loop INFO MemoryStore: ensureFreeSpace(1048) called with curMem=0, maxMem=1691332116
15/07/17 09:05:49.453 dag-scheduler-event-loop INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1048.0 B, free 1613.0 MB)
15/07/17 09:05:49.459 dag-scheduler-event-loop INFO MemoryStore: ensureFreeSpace(769) called with curMem=1048, maxMem=1691332116
15/07/17 09:05:49.459 dag-scheduler-event-loop INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 769.0 B, free 1613.0 MB)
15/07/17 09:05:49.461 sparkDriver-akka.actor.default-dispatcher-3 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:61512 (size: 769.0 B, free: 1613.0 MB)
15/07/17 09:05:49.463 dag-scheduler-event-loop INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:901
15/07/17 09:05:49.465 dag-scheduler-event-loop INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (ParallelCollectionRDD[1] at parallelize at DAGSchedulerSuite.scala:772)
15/07/17 09:05:49.466 dag-scheduler-event-loop INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
15/07/17 09:05:49.488 sparkDriver-akka.actor.default-dispatcher-3 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, localhost, PROCESS_LOCAL, 1369 bytes)
15/07/17 09:05:49.491 Executor task launch worker-0 INFO Executor: Running task 0.0 in stage 1.0 (TID 0)
15/07/17 09:05:49.504 Executor task launch worker-0 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0). 810 bytes result sent to driver
15/07/17 09:05:49.506 sparkDriver-akka.actor.default-dispatcher-3 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1426 bytes)
15/07/17 09:05:49.506 Executor task launch worker-0 INFO Executor: Running task 1.0 in stage 1.0 (TID 1)
15/07/17 09:05:49.507 task-result-getter-0 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 0) in 30 ms on localhost (1/2)
15/07/17 09:05:49.508 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO DAGScheduler: Job 1 failed: runJob at DAGSchedulerSuite.scala:773, took 0.145836 s
15/07/17 09:05:49.509 Executor task launch worker-0 INFO Executor: Finished task 1.0 in stage 1.0 (TID 1). 810 bytes result sent to driver
15/07/17 09:05:49.511 task-result-getter-1 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 1) in 6 ms on localhost (2/2)
15/07/17 09:05:49.512 dag-scheduler-event-loop INFO DAGScheduler: ResultStage 1 (runJob at DAGSchedulerSuite.scala:773) finished in 0.041 s
15/07/17 09:05:49.512 task-result-getter-1 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/07/17 09:05:49.516 dag-scheduler-event-loop ERROR DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting down SparkContext
java.lang.IllegalStateException: Promise already completed.
	at scala.concurrent.Promise$class.complete(Promise.scala:55)
	at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
	at scala.concurrent.Promise$class.failure(Promise.scala:107)
	at scala.concurrent.impl.Promise$DefaultPromise.failure(Promise.scala:153)
	at org.apache.spark.scheduler.JobWaiter.jobFailed(JobWaiter.scala:86)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1508)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1472)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1461)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/07/17 09:05:49.516 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO SparkContext: Starting job: count at DAGSchedulerSuite.scala:783
15/07/17 09:05:49.519 dag-scheduler-event-loop INFO DAGScheduler: Stopping DAGScheduler
15/07/17 09:08:48.579 sparkDriver-akka.actor.default-dispatcher-4 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 179561 ms exceeds timeout 120000 ms
15/07/17 09:08:48.580 sparkDriver-akka.actor.default-dispatcher-4 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 179561 ms
```

It looks like the problem is that `jobFailed` can be called multiple times if a task that runs in `DAGScheduler.runLocallyWithinThread` fails: we get one call from `runLocallyWithinThread` and another from inside `DAGScheduler.handleTaskCompletion`. The second call trips the `Promise already completed` exception, which kills the event loop and shuts down the SparkContext, leaving the next job hanging. I guess this is a subtlety that was overlooked in the old code. One approach would be to make `jobFailed` idempotent. Let me try that now...
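The idempotency idea can be sketched with Scala's `Promise.tryFailure`, which returns `false` instead of throwing when the promise is already completed. This is a simplified stand-in for `org.apache.spark.scheduler.JobWaiter`, not Spark's actual code:

```scala
import scala.concurrent.{Future, Promise}

// Simplified, hypothetical JobWaiter showing an idempotent jobFailed.
// Promise.failure throws IllegalStateException on a second call (as seen
// in the stack trace above); Promise.tryFailure instead returns false,
// so a duplicate failure notification (one from runLocallyWithinThread,
// one from handleTaskCompletion) becomes a harmless no-op.
class IdempotentJobWaiter {
  private val jobPromise = Promise[Unit]()

  def jobFailed(exception: Exception): Unit = {
    if (!jobPromise.tryFailure(exception)) {
      // Promise was already completed; silently ignore the duplicate.
    }
  }

  def completionFuture: Future[Unit] = jobPromise.future
}

val waiter = new IdempotentJobWaiter
waiter.jobFailed(new Exception("task failed"))
waiter.jobFailed(new Exception("duplicate failure")) // no longer throws
```

With this change a duplicate `jobFailed` simply loses the race to complete the promise instead of crashing the DAGScheduler event loop.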