[ https://issues.apache.org/jira/browse/SPARK-15725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-15725:
------------------------------------

    Assignee:     (was: Apache Spark)

> Dynamic allocation hangs YARN app when executors time out
> ----------------------------------------------------------
>
>                 Key: SPARK-15725
>                 URL: https://issues.apache.org/jira/browse/SPARK-15725
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1, 2.0.0
>            Reporter: Ryan Blue
>
> We've had a problem with dynamic allocation and YARN (since 1.6) where a
> large stage causes a lot of executors to be killed around the same time,
> causing the driver and AM to lock up and wait forever. This can happen even
> with a small number of executors (~100).
> When executors are killed by the driver, the [network connection to the
> driver disconnects|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L201].
> That results in a call to the AM to find out why the executor died, followed
> by a [blocking and retrying `RemoveExecutor` RPC
> call|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L227]
> that results in a second `KillExecutor` call to the AM. When a lot of
> executors are killed around the same time, the driver's AM threads are all
> taken up blocking and waiting on the AM (see the stack trace below, which
> was identical for 42 threads; a toy sketch of the blocking pattern follows
> the trace). I think this behavior, the network disconnect and subsequent
> cleanup, is unique to YARN.
> {code:title=Driver AM thread stack}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> scala.concurrent.Await$.result(package.scala:190)
> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$2.apply$mcV$sp(YarnSchedulerBackend.scala:286)
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$2.apply(YarnSchedulerBackend.scala:286)
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$2.apply(YarnSchedulerBackend.scala:286)
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
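> For illustration, here is a minimal, self-contained Scala sketch of that
> blocking pattern (toy names and sizes, not the actual Spark code): each
> blocked ask parks a pool thread until a reply arrives, so a burst of
> executor losses can park every thread in a bounded pool at once.
> {code:title=Sketch: blocking asks exhaust a bounded thread pool (hypothetical)}
> import java.util.concurrent.Executors
> import scala.concurrent.duration._
> import scala.concurrent.{Await, ExecutionContext, Future, Promise}
>
> object BlockingAskSketch {
>   def main(args: Array[String]): Unit = {
>     // Stand-in for the driver's bounded pool that services AM RPCs.
>     val executor = Executors.newFixedThreadPool(8)
>     val pool = ExecutionContext.fromExecutor(executor)
>
>     // Stand-in for AM replies that only arrive once the allocator is free.
>     val replies = (1 to 100).map(_ => Promise[Unit]())
>
>     // One "RemoveExecutor" ask per lost executor; Await.result parks the
>     // pool thread, like the askWithRetry frames in the stack trace above.
>     val asks = replies.map { p =>
>       Future { Await.result(p.future, 120.seconds) }(pool)
>     }
>
>     Thread.sleep(1000)
>     // All 8 threads are parked and 92 asks are still queued; no other work
>     // submitted to this pool can run until the replies arrive.
>     println(s"completed asks after 1s: ${asks.count(_.isCompleted)}")
>
>     replies.foreach(_.success(()))   // let the sketch finish cleanly
>     asks.foreach(Await.result(_, 10.seconds))
>     executor.shutdown()
>   }
> }
> {code}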
> The RPC calls to the AM aren't returning because the `YarnAllocator` is
> spending all of its time in the `allocateResources` method. That class's
> public methods are synchronized, so only one RPC can be serviced at a time.
> It is constantly calling `allocateResources` because [its
> thread|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L467]
> is [woken
> up|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L686]
> by calls to get the failure reason for an executor -- which is part of the
> chain of events in the driver for each executor that goes down.
> The net result is that the `YarnAllocator` stops responding to RPC calls
> for long enough that calls time out and replies to non-blocking calls are
> dropped. The application is then unable to do any work, because everything
> retries or exits, and it *hangs for 24+ hours* until enough errors
> accumulate that it dies.
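> Going back to the synchronization point above, here is a toy sketch
> (hypothetical class, not the real `YarnAllocator`) of one lock shared
> between a busy allocation loop and the RPC handlers that need answers:
> {code:title=Sketch: a busy synchronized loop starves RPC handlers (hypothetical)}
> // Toy stand-in for YarnAllocator: every public method takes the same
> // monitor, so one long allocateResources call blocks all other callers.
> class ToyAllocator {
>   def allocateResources(): Unit = synchronized {
>     Thread.sleep(5000) // stands in for a slow resource-manager round trip
>   }
>   def getFailureReason(id: String): String = synchronized {
>     s"executor $id was killed" // canned answer for the sketch
>   }
> }
>
> object AllocatorLockDemo {
>   def main(args: Array[String]): Unit = {
>     val alloc = new ToyAllocator
>     // The allocation thread keeps re-entering allocateResources because
>     // every failure-reason query wakes it up, so the lock is rarely free.
>     val reporter = new Thread(new Runnable {
>       def run(): Unit = while (true) alloc.allocateResources()
>     })
>     reporter.setDaemon(true)
>     reporter.start()
>     Thread.sleep(100) // let the reporter grab the monitor first
>
>     val t0 = System.nanoTime()
>     alloc.getFailureReason("42") // an RPC handler stalls here
>     println(f"reply took ${(System.nanoTime() - t0) / 1e6}%.0f ms") // ~5000
>   }
> }
> {code}
> In the real failure the two effects feed each other: each blocked RPC
> triggers another wake-up, which keeps the allocator pinned in
> `allocateResources` and the lock held.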