Huiqiang Liu created SPARK-14699:
------------------------------------

             Summary: Driver is marked as failed even though it runs successfully
                 Key: SPARK-14699
                 URL: https://issues.apache.org/jira/browse/SPARK-14699
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.6.1, 1.6.0
         Environment: Standalone deployment
            Reporter: Huiqiang Liu
We recently upgraded Spark from 1.5.2 to 1.6.0 and found that all of our batch jobs were being marked as failed. To investigate, we wrote a simple test application that just sums the integers from 1 to 10000 (a sketch appears after the log excerpt below); it is likewise marked as failed, even though its result is correct. Here is the typical stderr output; note the "ERROR worker.WorkerWatcher: Lost connection to worker rpc ..." message as the driver exits.

{noformat}
16/04/14 06:20:41 INFO scheduler.DAGScheduler: ResultStage 1 (sum at SparkBatchTest.scala:19) finished in 0.052 s
16/04/14 06:20:41 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/04/14 06:20:41 INFO scheduler.DAGScheduler: Job 1 finished: sum at SparkBatchTest.scala:19, took 0.061177 s
16/04/14 06:20:41 ERROR worker.WorkerWatcher: Lost connection to worker rpc endpoint spark://wor...@spark-worker-ltv-prod-006.prod.vungle.com:7078. Exiting.
16/04/14 06:20:41 INFO spark.SparkContext: Invoking stop() from shutdown hook
16/04/14 06:20:41 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on 172.16.33.187:36442 in memory (size: 1452.0 B, free: 511.1 MB)
16/04/14 06:20:41 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on ip-172-16-31-86.ec2.internal:29708 in memory (size: 1452.0 B, free: 511.1 MB)
16/04/14 06:20:41 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on ip-172-16-32-207.ec2.internal:21259 in memory (size: 1452.0 B, free: 511.1 MB)
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/04/14 06:20:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/04/14 06:20:41 INFO spark.ContextCleaner: Cleaned accumulator 2
16/04/14 06:20:41 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on 172.16.33.187:36442 in memory (size: 804.0 B, free: 511.1 MB)
16/04/14 06:20:41 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on ip-172-16-31-86.ec2.internal:29708 in memory (size: 804.0 B, free: 511.1 MB)
16/04/14 06:20:41 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on ip-172-16-32-207.ec2.internal:21259 in memory (size: 804.0 B, free: 511.1 MB)
16/04/14 06:20:41 INFO spark.ContextCleaner: Cleaned accumulator 1
16/04/14 06:20:41 INFO ui.SparkUI: Stopped Spark web UI at http://172.16.31.56:4040
16/04/14 06:20:41 INFO cluster.SparkDeploySchedulerBackend: Shutting down all executors
16/04/14 06:20:41 INFO cluster.SparkDeploySchedulerBackend: Asking each executor to shut down
16/04/14 06:20:41 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/04/14 06:20:41 INFO storage.MemoryStore: MemoryStore cleared
16/04/14 06:20:41 INFO storage.BlockManager: BlockManager stopped
16/04/14 06:20:41 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
16/04/14 06:20:41 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/04/14 06:20:41 INFO spark.SparkContext: Successfully stopped SparkContext
16/04/14 06:20:41 INFO util.ShutdownHookManager: Shutdown hook called
16/04/14 06:20:41 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a4c7f9ac-bf40-4924-b977-ec0be4c5fe93
{noformat}
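For reference, a minimal sketch of the test application described above (the object name SparkBatchTest matches the log output; the SparkConf setup is an assumption, not taken from the original job):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Trivial batch job: sums the integers 1 to 10000 and prints the result.
// The computation succeeds, yet the driver is reported as FAILED in
// standalone cluster mode.
object SparkBatchTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkBatchTest")
    val sc = new SparkContext(conf)
    val total = sc.parallelize(1 to 10000).sum() // expected: 50005000.0
    println(s"sum = $total")
    sc.stop()
  }
}
{code}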
Since this looked RPC-related, we tested with Akka, which was the default RPC implementation before 1.6.0, by setting "spark.rpc=akka". The error message then disappears, and only a warning complains about the same disassociation from the worker. It seems that with Akka the connection is gated for another 5 seconds, which lets the driver exit gracefully.

{noformat}
16/04/12 15:41:28 INFO DAGScheduler: ResultStage 1 (sum at SparkBatchTest.scala:19) finished in 0.053 s
16/04/12 15:41:28 INFO DAGScheduler: Job 1 finished: sum at SparkBatchTest.scala:19, took 0.060660 s
16/04/12 15:41:28 INFO SparkContext: Invoking stop() from shutdown hook
16/04/12 15:41:28 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/04/12 15:41:28 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/04/12 15:41:28 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/04/12 15:41:28 INFO SparkUI: Stopped Spark web UI at http://sparkbox:8830
16/04/12 15:41:28 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/04/12 15:41:28 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/04/12 15:41:28 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@sparkbox:58832] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
16/04/12 15:41:28 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/04/12 15:41:28 INFO MemoryStore: MemoryStore cleared
16/04/12 15:41:28 INFO BlockManager: BlockManager stopped
16/04/12 15:41:28 INFO BlockManagerMaster: BlockManagerMaster stopped
16/04/12 15:41:28 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/04/12 15:41:28 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/04/12 15:41:28 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/04/12 15:41:28 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/04/12 15:41:28 INFO SparkContext: Successfully stopped SparkContext
16/04/12 15:41:28 INFO ShutdownHookManager: Shutdown hook called
16/04/12 15:41:28 INFO ShutdownHookManager: Deleting directory /tmp/spark-138182c2-f1bc-4ea1-b5b8-2adceb27c083
{noformat}
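For anyone reproducing this, the Akka fallback can be selected at submit time; a sketch, assuming cluster deploy mode against a standalone master (the master URL and jar path are placeholders):

{noformat}
spark-submit \
  --master spark://spark-master:7077 \
  --deploy-mode cluster \
  --conf spark.rpc=akka \
  --class SparkBatchTest \
  spark-batch-test.jar
{noformat}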
This only occurs with standalone deployment, since DriverWrapper is reportedly used only in that cluster mode:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/DriverWrapper.scala#L43

In particular it breaks the '--supervise' option for batch jobs, because such drivers are always considered failed and are restarted again and again:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala#L214
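The behavior is consistent with a race between the driver's shutdown hook and a watcher that notices the dropped worker connection and exits the JVM with a non-zero status, which DriverRunner then treats as a failure. A self-contained sketch of that race (this is not Spark code; the names and messages are illustrative):

{code:scala}
// If the watcher thread calls System.exit(-1) before the main thread
// returns, the JVM exits non-zero even though the job itself succeeded.
// Depending on scheduling, repeated runs exit with status 0 or 255.
object ExitRaceDemo {
  def main(args: Array[String]): Unit = {
    sys.addShutdownHook {
      println("shutdown hook: cleaning up (like SparkContext.stop())")
    }
    val watcher = new Thread(new Runnable {
      override def run(): Unit = {
        // Stands in for the watcher noticing the lost worker connection
        // while the driver is shutting down.
        println("watcher: lost connection to worker, exiting")
        sys.exit(-1) // non-zero status => driver marked as FAILED
      }
    })
    watcher.setDaemon(true)
    watcher.start()
    println("main: job finished successfully")
    // A normal return here would exit with status 0,
    // but the watcher may win the race.
  }
}
{code}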