Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/7385#issuecomment-122333250
  
    It looks like the hanging test is DAGSchedulerSuite's "misbehaved resultHandler should not crash DAGScheduler and SparkContext" test.  Here's the relevant part of the unit-tests log:
    
    ```
    15/07/17 09:05:49.333 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO SparkContext: Starting job: runJob at DAGSchedulerSuite.scala:762
    15/07/17 09:05:49.346 dag-scheduler-event-loop INFO DAGScheduler: Got job 0 (runJob at DAGSchedulerSuite.scala:762) with 1 output partitions (allowLocal=true)
    15/07/17 09:05:49.346 dag-scheduler-event-loop INFO DAGScheduler: Final stage: ResultStage 0(runJob at DAGSchedulerSuite.scala:762)
    15/07/17 09:05:49.347 dag-scheduler-event-loop INFO DAGScheduler: Parents of final stage: List()
    15/07/17 09:05:49.347 dag-scheduler-event-loop INFO DAGScheduler: Missing parents: List()
    15/07/17 09:05:49.348 dag-scheduler-event-loop INFO DAGScheduler: Computing the requested partition locally
    15/07/17 09:05:49.353 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO DAGScheduler: Job 0 failed: runJob at DAGSchedulerSuite.scala:762, took 0.019835 s
    15/07/17 09:05:49.362 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO SparkContext: Starting job: runJob at DAGSchedulerSuite.scala:773
    15/07/17 09:05:49.363 dag-scheduler-event-loop INFO DAGScheduler: Got job 1 (runJob at DAGSchedulerSuite.scala:773) with 2 output partitions (allowLocal=false)
    15/07/17 09:05:49.363 dag-scheduler-event-loop INFO DAGScheduler: Final stage: ResultStage 1(runJob at DAGSchedulerSuite.scala:773)
    15/07/17 09:05:49.363 dag-scheduler-event-loop INFO DAGScheduler: Parents of final stage: List()
    15/07/17 09:05:49.363 dag-scheduler-event-loop INFO DAGScheduler: Missing parents: List()
    15/07/17 09:05:49.366 dag-scheduler-event-loop INFO DAGScheduler: Submitting ResultStage 1 (ParallelCollectionRDD[1] at parallelize at DAGSchedulerSuite.scala:772), which has no missing parents
    15/07/17 09:05:49.452 dag-scheduler-event-loop INFO MemoryStore: ensureFreeSpace(1048) called with curMem=0, maxMem=1691332116
    15/07/17 09:05:49.453 dag-scheduler-event-loop INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1048.0 B, free 1613.0 MB)
    15/07/17 09:05:49.459 dag-scheduler-event-loop INFO MemoryStore: ensureFreeSpace(769) called with curMem=1048, maxMem=1691332116
    15/07/17 09:05:49.459 dag-scheduler-event-loop INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 769.0 B, free 1613.0 MB)
    15/07/17 09:05:49.461 sparkDriver-akka.actor.default-dispatcher-3 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:61512 (size: 769.0 B, free: 1613.0 MB)
    15/07/17 09:05:49.463 dag-scheduler-event-loop INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:901
    15/07/17 09:05:49.465 dag-scheduler-event-loop INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (ParallelCollectionRDD[1] at parallelize at DAGSchedulerSuite.scala:772)
    15/07/17 09:05:49.466 dag-scheduler-event-loop INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
    15/07/17 09:05:49.488 sparkDriver-akka.actor.default-dispatcher-3 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, localhost, PROCESS_LOCAL, 1369 bytes)
    15/07/17 09:05:49.491 Executor task launch worker-0 INFO Executor: Running task 0.0 in stage 1.0 (TID 0)
    15/07/17 09:05:49.504 Executor task launch worker-0 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0). 810 bytes result sent to driver
    15/07/17 09:05:49.506 sparkDriver-akka.actor.default-dispatcher-3 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1426 bytes)
    15/07/17 09:05:49.506 Executor task launch worker-0 INFO Executor: Running task 1.0 in stage 1.0 (TID 1)
    15/07/17 09:05:49.507 task-result-getter-0 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 0) in 30 ms on localhost (1/2)
    15/07/17 09:05:49.508 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO DAGScheduler: Job 1 failed: runJob at DAGSchedulerSuite.scala:773, took 0.145836 s
    15/07/17 09:05:49.509 Executor task launch worker-0 INFO Executor: Finished task 1.0 in stage 1.0 (TID 1). 810 bytes result sent to driver
    15/07/17 09:05:49.511 task-result-getter-1 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 1) in 6 ms on localhost (2/2)
    15/07/17 09:05:49.512 dag-scheduler-event-loop INFO DAGScheduler: ResultStage 1 (runJob at DAGSchedulerSuite.scala:773) finished in 0.041 s
    15/07/17 09:05:49.512 task-result-getter-1 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
    15/07/17 09:05:49.516 dag-scheduler-event-loop ERROR DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting down SparkContext
    java.lang.IllegalStateException: Promise already completed.
        at scala.concurrent.Promise$class.complete(Promise.scala:55)
        at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
        at scala.concurrent.Promise$class.failure(Promise.scala:107)
        at scala.concurrent.impl.Promise$DefaultPromise.failure(Promise.scala:153)
        at org.apache.spark.scheduler.JobWaiter.jobFailed(JobWaiter.scala:86)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1059)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1508)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1472)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1461)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    15/07/17 09:05:49.516 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO SparkContext: Starting job: count at DAGSchedulerSuite.scala:783
    15/07/17 09:05:49.519 dag-scheduler-event-loop INFO DAGScheduler: Stopping DAGScheduler
    15/07/17 09:08:48.579 sparkDriver-akka.actor.default-dispatcher-4 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 179561 ms exceeds timeout 120000 ms
    15/07/17 09:08:48.580 sparkDriver-akka.actor.default-dispatcher-4 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 179561 ms
    ```
    
    It looks like the problem is that `jobFailed` can be called multiple times when a task that runs in `DAGScheduler.runLocallyWithinThread` fails: once from `runLocallyWithinThread` itself, and again from inside `DAGScheduler.handleTaskCompletion`.
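
    For reference, the `Promise already completed.` error is easy to reproduce outside of Spark: `scala.concurrent.Promise` throws on any second completion attempt. A minimal standalone repro (names here are illustrative, not Spark code):

    ```scala
    import scala.concurrent.Promise
    import scala.util.Try

    object PromiseDoubleFailure extends App {
      val p = Promise[Unit]()
      p.failure(new RuntimeException("first failure"))  // completes the promise
      // A second completion attempt throws; Try captures the exception:
      println(Try(p.failure(new RuntimeException("second failure"))))
      // Failure(java.lang.IllegalStateException: Promise already completed.)
    }
    ```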
    
    I guess this is actually a subtlety that was overlooked in the old code. One approach would be to make `jobFailed` idempotent (see the sketch below). Let me try that now...
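
    A minimal sketch of what I mean (illustrative only; `IdempotentWaiter` stands in for the real `JobWaiter`): `Promise.tryFailure` returns `false` instead of throwing when the promise is already completed, so a duplicate `jobFailed` call becomes a harmless no-op:

    ```scala
    import scala.concurrent.Promise

    class IdempotentWaiter[T] {
      private val jobPromise = Promise[T]()

      def jobFailed(exception: Exception): Unit = {
        // tryFailure returns false (rather than throwing) if the promise is
        // already completed, so duplicate calls can't kill the event loop.
        if (!jobPromise.tryFailure(exception)) {
          println(s"Ignoring duplicate jobFailed() call: ${exception.getMessage}")
        }
      }
    }

    object IdempotentWaiterDemo extends App {
      val waiter = new IdempotentWaiter[Unit]
      waiter.jobFailed(new Exception("failure from runLocallyWithinThread"))
      waiter.jobFailed(new Exception("failure from handleTaskCompletion"))  // no-op
    }
    ```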

