[ https://issues.apache.org/jira/browse/SPARK-11928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046393#comment-15046393 ]
Xu Chen commented on SPARK-11928: --------------------------------- hi guys this issue have fixed at newly version > Master retry deadlock > --------------------- > > Key: SPARK-11928 > URL: https://issues.apache.org/jira/browse/SPARK-11928 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.5.2 > Reporter: Richard Marscher > > In our hardening testing during upgrade to Spark 1.5.2 we are noticing that > there is a deadlock issue in the master connection retry code in AppClient: > https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L96 > On a cursory analysis, the Runnable blocks at line 102 waiting on the rpcEnv > for the masterRef. This Runnable is not checking for interruption on the > current thread so it seems like it will continue to block as long as the > rpcEnv is blocking and not respect the cancel(true) call in > registerWithMaster. The thread pool only has enough threads to run > tryRegisterAllMasters once at a time. > It is being called from registerWithMaster which is going to retry every 20 > seconds by default > (https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L49). > Meanwhile the rpcEnv default timeout is 120 seconds by default > (https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/util/RpcUtils.scala#L60). > So once registerWithMaster recursively calls itself it will provoke a > deadlock in tryRegisterAllMasters. > Exception > {code} > 15/11/23 15:42:19 INFO o.a.s.d.c.AppClient$ClientEndpoint: Connecting to > master spark://ip-172-22-121-44:7077... > 15/11/23 15:42:39 ERROR o.a.s.u.SparkUncaughtExceptionHandler: Uncaught > exception in thread Thread[appclient-registration-retry-thread,5,main] > java.util.concurrent.RejectedExecutionException: Task > java.util.concurrent.FutureTask@1d96b2c9 rejected from > java.util.concurrent.ThreadPoolExecutor@10b3b94c[Running, pool size = 1, > active threads = 0, queued tasks = 0, completed tasks = 1] > at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048) > ~[na:1.7.0_91] > at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821) > [na:1.7.0_91] > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372) > [na:1.7.0_91] > at > java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110) > ~[na:1.7.0_91] > at > org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96) > ~[org.apache.spark.spark-core_2.10-1.5.2.jar:1.5.2] > at > org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:95) > ~[org.apache.spark.spark-core_2.10-1.5.2.jar:1.5.2] > {code} > Thread dump excerpt > {code} > "appclient-registration-retry-thread" daemon prio=10 tid=0x00007f92dc00f800 > nid=0x6939 waiting on condition [0x00007f927dada000] > java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00000000f7e7aef8> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082) > at > java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1090) > at > java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807) > at > java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > "appclient-register-master-threadpool-0" daemon prio=10 > tid=0x00007f92dc004000 nid=0x6938 waiting on condition [0x00007f927dbdb000] > java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00000000f7e7a078> (a > java.util.concurrent.SynchronousQueue$TransferStack) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) > at > java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460) > at > java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:359) > at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:942) > at > java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org