[ 
https://issues.apache.org/jira/browse/SPARK-11928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046393#comment-15046393
 ] 

Xu Chen commented on SPARK-11928:
---------------------------------

Hi all, this issue has been fixed in a newer version.

> Master retry deadlock
> ---------------------
>
>                 Key: SPARK-11928
>                 URL: https://issues.apache.org/jira/browse/SPARK-11928
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.2
>            Reporter: Richard Marscher
>
> During hardening tests for our upgrade to Spark 1.5.2, we noticed a deadlock 
> in the master connection retry code in AppClient: 
> https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L96
> On a cursory analysis, the Runnable blocks at line 102, waiting on the rpcEnv 
> for the masterRef. The Runnable does not check whether the current thread has 
> been interrupted, so it appears to keep blocking for as long as the rpcEnv 
> blocks and to ignore the cancel(true) call in registerWithMaster. The thread 
> pool only has enough threads to run tryRegisterAllMasters once at a time.
> tryRegisterAllMasters is called from registerWithMaster, which retries every 
> 20 seconds by default 
> (https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L49).
> Meanwhile, the rpcEnv timeout defaults to 120 seconds 
> (https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/util/RpcUtils.scala#L60).
> So once registerWithMaster recursively calls itself, it provokes a deadlock 
> in tryRegisterAllMasters.
> Exception
> {code}
> 15/11/23 15:42:19 INFO o.a.s.d.c.AppClient$ClientEndpoint: Connecting to master spark://ip-172-22-121-44:7077...
> 15/11/23 15:42:39 ERROR o.a.s.u.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[appclient-registration-retry-thread,5,main]
> java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@1d96b2c9 rejected from java.util.concurrent.ThreadPoolExecutor@10b3b94c[Running, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
>       at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048) ~[na:1.7.0_91]
>       at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821) [na:1.7.0_91]
>       at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372) [na:1.7.0_91]
>       at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110) ~[na:1.7.0_91]
>       at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96) ~[org.apache.spark.spark-core_2.10-1.5.2.jar:1.5.2]
>       at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:95) ~[org.apache.spark.spark-core_2.10-1.5.2.jar:1.5.2]
> {code}
> Thread dump excerpt
> {code}
> "appclient-registration-retry-thread" daemon prio=10 tid=0x00007f92dc00f800 nid=0x6939 waiting on condition [0x00007f927dada000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x00000000f7e7aef8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
>       at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1090)
>       at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
>       at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> "appclient-register-master-threadpool-0" daemon prio=10 tid=0x00007f92dc004000 nid=0x6938 waiting on condition [0x00007f927dbdb000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x00000000f7e7a078> (a java.util.concurrent.SynchronousQueue$TransferStack)
>       at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
>       at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
>       at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:359)
>       at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:942)
>       at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
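
The mechanism described in the quoted report can be reproduced in isolation with plain java.util.concurrent, without Spark. The following is a minimal sketch, not the AppClient code: it assumes a pool with exactly one thread and a SynchronousQueue (consistent with "pool size = 1" and the SynchronousQueue frames in the thread dump), and the names RegistrationRetrySketch, blockingCall and retryAccepted are hypothetical. A latch stands in for the blocking rpcEnv call. It compares a task that ignores interruption with one that respects it, and checks whether a retry submitted after cancel(true) is accepted.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RegistrationRetrySketch {

    /** Stand-in for a blocking call (hypothetical). Blocks until `release` opens. */
    static void blockingCall(CountDownLatch release, boolean respectInterrupt) {
        while (true) {
            try {
                release.await();
                return;
            } catch (InterruptedException e) {
                if (respectInterrupt) {
                    // Restore the flag and give the pool thread back.
                    Thread.currentThread().interrupt();
                    return;
                }
                // Otherwise swallow the interrupt and keep blocking.
            }
        }
    }

    /** Returns true if a retry submitted after cancel(true) is accepted within waitMillis. */
    static boolean retryAccepted(boolean respectInterrupt, long waitMillis)
            throws InterruptedException {
        // Assumed configuration: one thread, no queue capacity, default AbortPolicy.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 60L, TimeUnit.SECONDS, new SynchronousQueue<Runnable>());
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            Future<?> first = pool.submit(() -> {
                started.countDown();
                blockingCall(release, respectInterrupt);
            });
            started.await();
            first.cancel(true); // interrupts the worker thread running the first attempt

            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(waitMillis);
            while (System.nanoTime() < deadline) {
                try {
                    pool.submit(() -> { }); // the retry
                    return true;
                } catch (RejectedExecutionException e) {
                    Thread.sleep(10); // the only thread is still busy or not yet idle
                }
            }
            return false;
        } finally {
            release.countDown(); // unblock the stuck task so the pool can shut down
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("ignores interrupt  -> retry accepted: " + retryAccepted(false, 300));
        System.out.println("respects interrupt -> retry accepted: " + retryAccepted(true, 2000));
    }
}
```

Under these assumptions, the first call should report false (every retry is rejected while the uninterruptible task holds the only thread) and the second should report true. The retry loop is there because, with a SynchronousQueue, a submission is only accepted once the worker is idle and waiting for its next task.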



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)