Hi Spark experts, I am using Spark 1.5.2 on YARN with dynamic allocation enabled. I see in the driver/application master logs that the app is marked as SUCCEEDED and then SparkContext stop is called. However, this stop sequence takes > 10 minutes to complete, and YARN resource manager kills the application master as it didn’t receive a heartbeat within the last 10 minutes. The resource manager then kills the application master. Any ideas about what may be going on?
Here are the relevant logs: *6/03/18 21:26:58 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0 16/03/18 21:26:58 INFO spark.SparkContext: Invoking stop() from shutdown hook*16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static/sql,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution/json,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/json,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static/sql,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution/json,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/json,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}16/03/18 21:26:58 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}16/03/18 21:26:58 INFO ui.SparkUI: Stopped Spark web UI at http://10.143.240.240:5270616/03/18 21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 113516/03/18 21:27:58 INFO yarn.YarnAllocator: Driver requested a total number of 208 executor(s).16/03/18 21:27:58 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 1135.16/03/18 21:27:58 INFO spark.ExecutorAllocationManager: Removing executor 1135 because it has been idle for 60 seconds (new desired total will be 208)16/03/18 21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 112316/03/18 21:27:58 INFO yarn.YarnAllocator: Driver requested a total number of 207 executor(s).16/03/18 21:27:58 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 1123.16/03/18 21:27:58 INFO spark.ExecutorAllocationManager: Removing executor 1123 because it has been idle for 60 seconds (new desired total will be 207)16/03/18 21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 111716/03/18 21:27:58 INFO yarn.YarnAllocator: Driver requested a total number of 206 executor(s).16/03/18 21:27:58 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 1117.16/03/18 21:27:58 INFO spark.ExecutorAllocationManager: Removing executor 1117 because it has been idle for 60 seconds (new desired total will be 206)16/03/18 21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 118516/03/18 21:27:58 INFO yarn.YarnAllocator: Driver requested a total number of 205 executor(s).16/03/18 21:27:58 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 1185.16/03/18 21:27:58 INFO spark.ExecutorAllocationManager: Removing executor 1185 because it has been idle for 60 seconds (new desired total will be 205)16/03/18 21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 115216/03/18 21:27:58 INFO yarn.YarnAllocator: Driver requested a total number of 204 executor(s).16/03/18 21:27:58 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 1152.16/03/18 21:27:58 INFO spark.ExecutorAllocationManager: Removing executor 1152 because it has been idle for 60 seconds (new desired total will be 204)16/03/18 21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 114016/03/18 21:27:58 INFO yarn.YarnAllocator: Driver requested a total number of 203 executor(s).16/03/18 21:27:58 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 1140.16/03/18 21:27:58 INFO spark.ExecutorAllocationManager: Removing executor 1140 because it has been idle for 60 seconds (new desired total will be 203)16/03/18 21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 114916/03/18 21:27:58 INFO yarn.YarnAllocator: Driver requested a total number of 202 executor(s).16/03/18 21:27:58 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 1149.16/03/18 21:27:58 INFO spark.ExecutorAllocationManager: Removing executor 1149 because it has been idle for 60 seconds (new desired total will be 202)16/03/18 21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 115416/03/18 21:27:58 INFO yarn.YarnAllocator: Driver requested a total number of 201 executor(s).16/03/18 21:27:58 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 1154.16/03/18 21:27:58 INFO spark.ExecutorAllocationManager: Removing executor 1154 because it has been idle for 60 seconds (new desired total will be 201)16/03/18 21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 113616/03/18 21:27:58 INFO yarn.YarnAllocator: Driver requested a total number of 200 executor(s).16/03/18 21:27:58 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 1136.16/03/18 21:27:58 INFO spark.ExecutorAllocationManager: Removing executor 1136 because it has been idle for 60 seconds (new desired total will be 200)*16/03/18 21:38:17 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM *