Hi Spark experts,
I am using Spark 1.5.2 on YARN with dynamic allocation enabled. I see in
the driver/application master logs that the app is marked as SUCCEEDED and
then SparkContext stop is called. However, this stop sequence takes > 10
minutes to complete, and YARN resource manager kills the application master
as it didn’t receive a heartbeat within the last 10 minutes. The resource
manager then kills the application master. Any ideas about what may be
going on?

Here are the relevant logs:

*6/03/18 21:26:58 INFO yarn.ApplicationMaster: Final app status:
SUCCEEDED, exitCode: 0
16/03/18 21:26:58 INFO spark.SparkContext: Invoking stop() from
shutdown hook*16/03/18 21:26:58 INFO handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/static/sql,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/SQL/execution/json,null}16/03/18
21:26:58 INFO handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/SQL/execution,null}16/03/18 21:26:58
INFO handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/SQL/json,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/SQL,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/static/sql,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/SQL/execution/json,null}16/03/18
21:26:58 INFO handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/SQL/execution,null}16/03/18 21:26:58
INFO handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/SQL/json,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/SQL,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/metrics/json,null}16/03/18 21:26:58
INFO handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage/kill,null}16/03/18
21:26:58 INFO handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/api,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/static,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}16/03/18
21:26:58 INFO handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/threadDump,null}16/03/18
21:26:58 INFO handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/json,null}16/03/18 21:26:58
INFO handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/environment/json,null}16/03/18 21:26:58
INFO handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/environment,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/rdd/json,null}16/03/18 21:26:58
INFO handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/rdd,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/json,null}16/03/18 21:26:58
INFO handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/pool/json,null}16/03/18 21:26:58
INFO handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/pool,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage/json,null}16/03/18
21:26:58 INFO handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage,null}16/03/18 21:26:58
INFO handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/json,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/job/json,null}16/03/18 21:26:58
INFO handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/job,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/json,null}16/03/18 21:26:58 INFO
handler.ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs,null}16/03/18 21:26:58 INFO
ui.SparkUI: Stopped Spark web UI at
http://10.143.240.240:5270616/03/18 21:27:58 INFO
cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s)
113516/03/18 21:27:58 INFO yarn.YarnAllocator: Driver requested a
total number of 208 executor(s).16/03/18 21:27:58 INFO
yarn.ApplicationMaster$AMEndpoint: Driver requested to kill
executor(s) 1135.16/03/18 21:27:58 INFO
spark.ExecutorAllocationManager: Removing executor 1135 because it has
been idle for 60 seconds (new desired total will be 208)16/03/18
21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill
executor(s) 112316/03/18 21:27:58 INFO yarn.YarnAllocator: Driver
requested a total number of 207 executor(s).16/03/18 21:27:58 INFO
yarn.ApplicationMaster$AMEndpoint: Driver requested to kill
executor(s) 1123.16/03/18 21:27:58 INFO
spark.ExecutorAllocationManager: Removing executor 1123 because it has
been idle for 60 seconds (new desired total will be 207)16/03/18
21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill
executor(s) 111716/03/18 21:27:58 INFO yarn.YarnAllocator: Driver
requested a total number of 206 executor(s).16/03/18 21:27:58 INFO
yarn.ApplicationMaster$AMEndpoint: Driver requested to kill
executor(s) 1117.16/03/18 21:27:58 INFO
spark.ExecutorAllocationManager: Removing executor 1117 because it has
been idle for 60 seconds (new desired total will be 206)16/03/18
21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill
executor(s) 118516/03/18 21:27:58 INFO yarn.YarnAllocator: Driver
requested a total number of 205 executor(s).16/03/18 21:27:58 INFO
yarn.ApplicationMaster$AMEndpoint: Driver requested to kill
executor(s) 1185.16/03/18 21:27:58 INFO
spark.ExecutorAllocationManager: Removing executor 1185 because it has
been idle for 60 seconds (new desired total will be 205)16/03/18
21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill
executor(s) 115216/03/18 21:27:58 INFO yarn.YarnAllocator: Driver
requested a total number of 204 executor(s).16/03/18 21:27:58 INFO
yarn.ApplicationMaster$AMEndpoint: Driver requested to kill
executor(s) 1152.16/03/18 21:27:58 INFO
spark.ExecutorAllocationManager: Removing executor 1152 because it has
been idle for 60 seconds (new desired total will be 204)16/03/18
21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill
executor(s) 114016/03/18 21:27:58 INFO yarn.YarnAllocator: Driver
requested a total number of 203 executor(s).16/03/18 21:27:58 INFO
yarn.ApplicationMaster$AMEndpoint: Driver requested to kill
executor(s) 1140.16/03/18 21:27:58 INFO
spark.ExecutorAllocationManager: Removing executor 1140 because it has
been idle for 60 seconds (new desired total will be 203)16/03/18
21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill
executor(s) 114916/03/18 21:27:58 INFO yarn.YarnAllocator: Driver
requested a total number of 202 executor(s).16/03/18 21:27:58 INFO
yarn.ApplicationMaster$AMEndpoint: Driver requested to kill
executor(s) 1149.16/03/18 21:27:58 INFO
spark.ExecutorAllocationManager: Removing executor 1149 because it has
been idle for 60 seconds (new desired total will be 202)16/03/18
21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill
executor(s) 115416/03/18 21:27:58 INFO yarn.YarnAllocator: Driver
requested a total number of 201 executor(s).16/03/18 21:27:58 INFO
yarn.ApplicationMaster$AMEndpoint: Driver requested to kill
executor(s) 1154.16/03/18 21:27:58 INFO
spark.ExecutorAllocationManager: Removing executor 1154 because it has
been idle for 60 seconds (new desired total will be 201)16/03/18
21:27:58 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill
executor(s) 113616/03/18 21:27:58 INFO yarn.YarnAllocator: Driver
requested a total number of 200 executor(s).16/03/18 21:27:58 INFO
yarn.ApplicationMaster$AMEndpoint: Driver requested to kill
executor(s) 1136.16/03/18 21:27:58 INFO
spark.ExecutorAllocationManager: Removing executor 1136 because it has
been idle for 60 seconds (new desired total will be 200)*16/03/18
21:38:17 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
*

​

Reply via email to