Adrian Tanase created SPARK-10792:
-------------------------------------

             Summary: Spark streaming + YARN – executor is not re-created on machine restart
                 Key: SPARK-10792
                 URL: https://issues.apache.org/jira/browse/SPARK-10792
             Project: Spark
          Issue Type: Bug
          Components: Streaming, YARN
    Affects Versions: 1.4.0
         Environment: - centos7 deployed on AWS
- yarn / hadoop 2.6.0-cdh5.4.2
- spark 1.4.0 compiled with hadoop 2.6
            Reporter: Adrian Tanase


We’re using Spark Streaming (1.4.0), deployed on AWS through YARN. It’s a 
stateful app that reads from Kafka (with the new direct API) and checkpoints 
its state to HDFS.
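
For context, the app follows the standard checkpoint-recovery pattern – roughly 
the sketch below (heavily simplified; the topic, brokers, checkpoint path and 
state function are placeholders, not our actual code):

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Placeholder checkpoint location; the real app uses a different HDFS path.
val checkpointDir = "hdfs:///user/spark/checkpoints/streaming-app"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("stateful-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)

  // Kafka direct API (no receivers); brokers and topic are placeholders.
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
  val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("events"))

  // Stateful processing; this running state is what the HDFS checkpoint restores.
  messages.map(_._2)
    .map(word => (word, 1L))
    .updateStateByKey[Long]((values, state) => Some(state.getOrElse(0L) + values.sum))
    .print()

  ssc
}

// On (re)submission the driver rebuilds the full context and state from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
{code}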

During some resilience testing, we restarted one of the machines and brought it 
back online. While the machine was offline, the YARN cluster did not have 
enough resources to re-create the missing executor.
After starting all the services on the machine, it correctly re-joined the YARN 
cluster; however, the Spark Streaming app does not seem to notice that the 
resources are back and has not re-created the missing executor.

The app keeps running correctly with 6 out of 7 executors, but it’s running 
under capacity.
If we manually kill the driver and re-submit the app to YARN, all the state is 
correctly recreated from the checkpoint and all 7 executors come back online – 
however this seems like a brutal workaround.
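
(Side note: the under-capacity state is easy to confirm from the driver. A 
minimal sketch, assuming access to the streaming context's underlying 
SparkContext – getExecutorMemoryStatus also counts the driver's own 
BlockManager, hence the -1:)

{code}
// Rough check of how many executors are currently registered with the driver.
// The map returned by getExecutorMemoryStatus includes the driver's own
// BlockManager, so subtract 1 to get the executor count.
val liveExecutors = ssc.sparkContext.getExecutorMemoryStatus.size - 1
println(s"Live executors: $liveExecutors")   // 6 after the node restart, expected 7
{code}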

Scenarios tested to isolate the issue:

The expected outcome after a machine reboot + services coming back is that 
processing continues on it. *FAILED* below means that processing continues at 
reduced capacity, as the lost machine rarely re-joins as a container/executor 
even though YARN sees it as a healthy node.

|| No || Failure scenario || Test result || Data loss || Notes ||
| 1 | Single node restart | FAILED | NO | Executor NOT redeployed when the machine comes back and services are restarted |
| 2 | Multi-node restart (quick succession) | FAILED | YES | If we don't restore services on the machines that are down, the app OR Kafka OR ZooKeeper metadata gets corrupted; the app crashes and can't be restarted without clearing the checkpoint -> data loss. Root cause is an unhealthy cluster when too many machines are lost. |
| 3 | Multi-node restart (rolling) | FAILED | NO | Same as single node restart; the driver does not crash |
| 4 | Graceful services restart | FAILED | NO | Behaves just like a single node restart even if we take the time to manually stop services before the machine reboot. |
| 5 | Adding nodes to an incomplete cluster | SUCCESS | NO | The Spark app will usually start even if YARN can't fulfill all the resource requests (e.g. 5 out of 7 nodes are up when the app is started). However, when the nodes are added to YARN, we see that Spark deploys executors on them, as expected, in all the scenarios. |
| 6 | Restart executor process | PARTIAL SUCCESS | NO | In 1 out of 5 attempts it behaves like a machine restart – the rest work as expected, and the container/executor is redeployed in a matter of seconds |
| 7 | Node restart on bigger cluster | FAILED | NO | We were trying to validate whether the behavior is caused by maxing out the cluster and having no slack to redeploy a crashed node. It still behaves like a single node restart even with lots of extra capacity in YARN – nodes, cores and RAM. |

*Logs for Scenario 6 – correct behavior on process restart*
{noformat}
2015-09-21 11:00:11,193 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Completed container 
container_1442827158253_0004_01_000004 (state: COMPLETE, exit status: 137)
2015-09-21 11:00:11,193 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: 
container_1442827158253_0004_01_000004. Exit status: 137. Diagnostics: 
Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal

..
(logical continuation from earlier restart attempt)

2015-09-21 10:33:20,658 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor 
containers, each with 14 cores and 18022 MB memory including 1638 MB overhead
2015-09-21 10:33:20,658 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Container request (host: Any, 
capability: <memory:18022, vCores:14>)

..

2015-09-21 10:33:25,663 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Launching container 
container_1442827158253_0004_01_000012 for on host ip-10-0-1-16.ec2.internal
2015-09-21 10:33:25,664 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Launching ExecutorRunnable. 
driverUrl: akka.tcp://sparkDriver@10.0.1.14:32938/user/CoarseGrainedScheduler,  
executorHostname: ip-10-0-1-16.ec2.internal
2015-09-21 10:33:25,664 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Received 1 containers from YARN, 
launching executors on 1 of them.
{noformat}


*Logs for Scenario 1 – weird resource requests / behavior on node restart*

{noformat}
2015-09-21 10:36:57,352 [sparkDriver-akka.actor.default-dispatcher-31] INFO  
org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint - Driver terminated 
or disconnected! Shutting down. ip-10-0-1-16.ec2.internal:34741
2015-09-21 10:36:57,352 [sparkDriver-akka.actor.default-dispatcher-24] ERROR 
org.apache.spark.scheduler.cluster.YarnClusterScheduler - Lost executor 8 on 
ip-10-0-1-16.ec2.internal: remote Rpc client disassociated
2015-09-21 10:36:57,352 [sparkDriver-akka.actor.default-dispatcher-24] INFO  
org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint - Driver terminated 
or disconnected! Shutting down. ip-10-0-1-16.ec2.internal:34741
2015-09-21 10:36:57,352 [sparkDriver-akka.actor.default-dispatcher-31] WARN  
akka.remote.ReliableDeliverySupervisor - Association with remote system 
[akka.tcp://sparkExecutor@ip-10-0-1-16.ec2.internal:34741] has failed, address 
is now gated for [5000] ms. Reason is: [Disassociated].
2015-09-21 10:36:57,352 [dag-scheduler-event-loop] INFO  
org.apache.spark.scheduler.DAGScheduler - Executor lost: 8 (epoch 995)
2015-09-21 10:36:57,352 [sparkDriver-akka.actor.default-dispatcher-31] INFO  
org.apache.spark.storage.BlockManagerMasterEndpoint - Trying to remove executor 
8 from BlockManagerMaster.
2015-09-21 10:36:57,352 [sparkDriver-akka.actor.default-dispatcher-31] INFO  
org.apache.spark.storage.BlockManagerMasterEndpoint - Removing block manager 
BlockManagerId(8, ip-10-0-1-16.ec2.internal, 35415)
2015-09-21 10:36:57,352 [dag-scheduler-event-loop] INFO  
org.apache.spark.storage.BlockManagerMaster - Removed 8 successfully in 
removeExecutor

...

2015-09-21 10:39:44,320 [sparkDriver-akka.actor.default-dispatcher-31] WARN  
org.apache.spark.HeartbeatReceiver - Removing executor 8 with no recent 
heartbeats: 168535 ms exceeds timeout 120000 ms
2015-09-21 10:39:44,320 [sparkDriver-akka.actor.default-dispatcher-31] ERROR 
org.apache.spark.scheduler.cluster.YarnClusterScheduler - Lost an executor 8 
(already removed): Executor heartbeat timed out after 168535 ms
2015-09-21 10:39:44,320 [kill-executor-thread] INFO  
org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend - Requesting to 
kill executor(s) 8
2015-09-21 10:39:44,320 [kill-executor-thread] WARN  
org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend - Executor to 
kill 8 does not exist!
2015-09-21 10:39:44,320 [sparkDriver-akka.actor.default-dispatcher-31] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Driver requested a total number of 
5 executor(s).
2015-09-21 10:39:44,321 [sparkDriver-akka.actor.default-dispatcher-31] INFO  
org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint - Driver requested to 
kill executor(s) .
2015-09-21 10:39:45,793 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Canceling requests for 0 executor 
containers
2015-09-21 10:39:45,793 [Reporter] WARN  
org.apache.spark.deploy.yarn.YarnAllocator - Expected to find pending requests, 
but found none.


... every 5 seconds

2015-09-21 10:40:05,800 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Canceling requests for 0 executor 
containers
2015-09-21 10:40:05,800 [Reporter] WARN  
org.apache.spark.deploy.yarn.YarnAllocator - Expected to find pending requests, 
but found none.


..

2015-09-21 10:43:55,876 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Canceling requests for 0 executor 
containers
2015-09-21 10:43:55,876 [Reporter] WARN  
org.apache.spark.deploy.yarn.YarnAllocator - Expected to find pending requests, 
but found none.

...


2015-09-21 10:49:20,979 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Canceling requests for 0 executor 
containers
2015-09-21 10:49:20,979 [Reporter] WARN  
org.apache.spark.deploy.yarn.YarnAllocator - Expected to find pending requests, 
but found none.
2015-09-21 10:49:20,980 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Completed container 
container_1442827158253_0004_01_000012 (state: COMPLETE, exit status: -100)
2015-09-21 10:49:20,980 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: 
container_1442827158253_0004_01_000012. Exit status: -100. Diagnostics: 
Container released on a *lost* node


.. done


=======

(ANOTHER RESTART ATTEMPT – note how it is now requesting a total of 4 executors; it should be 7!)

2015-09-21 10:51:28,226 [sparkDriver-akka.actor.default-dispatcher-19] INFO  
org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint - Driver terminated 
or disconnected! Shutting down. ip-10-0-1-15.ec2.internal:34332
2015-09-21 10:51:28,226 [sparkDriver-akka.actor.default-dispatcher-31] ERROR 
org.apache.spark.scheduler.cluster.YarnClusterScheduler - Lost executor 1 on 
ip-10-0-1-15.ec2.internal: remote Rpc client disassociated
2015-09-21 10:51:28,226 [sparkDriver-akka.actor.default-dispatcher-30] WARN  
akka.remote.ReliableDeliverySupervisor - Association with remote system 
[akka.tcp://sparkExecutor@ip-10-0-1-15.ec2.internal:34332] has failed, address 
is now gated for [5000] ms. Reason is: [Disassociated].
2015-09-21 10:51:28,226 [sparkDriver-akka.actor.default-dispatcher-31] INFO  
org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint - Driver terminated 
or disconnected! Shutting down. ip-10-0-1-15.ec2.internal:34332
2015-09-21 10:51:28,226 [dag-scheduler-event-loop] INFO  
org.apache.spark.scheduler.DAGScheduler - Executor lost: 1 (epoch 1431)
2015-09-21 10:51:28,226 [sparkDriver-akka.actor.default-dispatcher-32] INFO  
org.apache.spark.storage.BlockManagerMasterEndpoint - Trying to remove executor 
1 from BlockManagerMaster.
2015-09-21 10:51:28,227 [sparkDriver-akka.actor.default-dispatcher-32] INFO  
org.apache.spark.storage.BlockManagerMasterEndpoint - Removing block manager 
BlockManagerId(1, ip-10-0-1-15.ec2.internal, 36311)
2015-09-21 10:51:28,227 [dag-scheduler-event-loop] INFO  
org.apache.spark.storage.BlockManagerMaster - Removed 1 successfully in 
removeExecutor

...

2015-09-21 10:53:44,320 [sparkDriver-akka.actor.default-dispatcher-32] WARN  
org.apache.spark.HeartbeatReceiver - Removing executor 1 with no recent 
heartbeats: 140055 ms exceeds timeout 120000 ms
2015-09-21 10:53:44,320 [sparkDriver-akka.actor.default-dispatcher-32] ERROR 
org.apache.spark.scheduler.cluster.YarnClusterScheduler - Lost an executor 1 
(already removed): Executor heartbeat timed out after 140055 ms
2015-09-21 10:53:44,320 [kill-executor-thread] INFO  
org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend - Requesting to 
kill executor(s) 1
2015-09-21 10:53:44,320 [kill-executor-thread] WARN  
org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend - Executor to 
kill 1 does not exist!
2015-09-21 10:53:44,320 [sparkDriver-akka.actor.default-dispatcher-32] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Driver requested a total number of 
4 executor(s).
2015-09-21 10:53:44,321 [sparkDriver-akka.actor.default-dispatcher-32] INFO  
org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint - Driver requested to 
kill executor(s) .
{noformat}
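
One idea we have not actually tried (mentioning it only as a possible manual 
workaround, not something we've validated): asking for a replacement executor 
from the driver through the developer API on SparkContext, to see whether the 
YarnAllocator raises its target back towards the original 7:

{code}
// Untested idea: request one additional executor from the cluster manager.
// requestExecutors is a @DeveloperApi on SparkContext and only applies to
// coarse-grained modes such as YARN; returns true if the request was acknowledged.
val granted = ssc.sparkContext.requestExecutors(1)
println(s"requestExecutors acknowledged: $granted")
{code}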



