[ 
https://issues.apache.org/jira/browse/SPARK-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zongheng Yang updated SPARK-2865:
---------------------------------

    Description: 
In the application I tested, most of the 128 tasks finished, but sometimes 
(pretty deterministically) either 1 or 3 tasks would hang forever with the 
following stack trace. The UI showed no failures, and the nodes running the 
stuck tasks showed no apparent memory/CPU/disk pressure.

{noformat}
"Executor task launch worker-0" daemon prio=10 tid=0x00007f32ec003800 nid=0xaac 
waiting on condition [0x00007f33f4428000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f3e0d7198e8> (a 
scala.concurrent.impl.Promise$CompletionLatch)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
        at 
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
        at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at 
org.apache.spark.network.ConnectionManager.sendMessageReliablySync(ConnectionManager.scala:832)
        at 
org.apache.spark.storage.BlockManagerWorker$.syncGetBlock(BlockManagerWorker.scala:122)
        at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:497)
        at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:495)
        at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at 
org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:495)
        at 
org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:481)
        at org.apache.spark.storage.BlockManager.get(BlockManager.scala:524)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
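
For illustration only (this is not Spark's code; the names below are made up): 
the trace shows the task thread parked in scala.concurrent.Await.result, called 
from ConnectionManager.sendMessageReliablySync while fetching a remote block. If 
the Promise behind that fetch is never completed (for example, the remote 
connection dies without the failure being propagated back), the thread stays 
parked on the CompletionLatch indefinitely, which matches the WAITING (parking) 
state above. A minimal standalone sketch of that blocking pattern:

{code:scala}
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

object HangSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for the reply to a block-fetch message whose connection
    // silently died: a Promise that nobody ever completes.
    val ack = Promise[Array[Byte]]()

    // The pattern at the top of the stack trace: an unbounded Await.result
    // parks the calling thread on a CompletionLatch until the Promise is
    // completed, which here never happens. Uncommenting this line
    // reproduces the WAITING (parking) state from the thread dump.
    // Await.result(ack.future, Duration.Inf)

    // A bounded wait at least turns the silent hang into a visible error:
    try {
      Await.result(ack.future, 5.seconds)
    } catch {
      case _: java.util.concurrent.TimeoutException =>
        println("fetch did not complete in time; a task could retry or fail here instead of hanging")
    }
  }
}
{code}

A bounded wait (or some retry/failure path) would at least surface this as a 
task failure the scheduler can act on, rather than a task that hangs forever.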

This behavior does *not* appear on 1.0 (reusing the same cluster), but appears 
on the master branch as of Aug 4, 2014 *and* 1.0.1. Further, I tried out [this 
patch|https://github.com/apache/spark/pull/1758], and it didn't fix the 
behavior.

When this behavior happened, the driver printed out the following line 
repeatedly:

{noformat}
14/08/04 23:32:42 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-172-31-6-74.us-west-1.compute.internal, 59408, 0) with no recent heart beats: 67331ms exceeds 45000ms
{noformat}
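
One way to test whether the hang is tied to this heartbeat timeout (the master 
dropping a busy-but-alive executor's BlockManager mid-fetch, so the pending 
block request never gets a reply) is to raise the timeout well above the 
observed 67331ms gap. This is a diagnostic/workaround sketch only, not a fix, 
and it assumes spark.storage.blockManagerSlaveTimeoutMs is the key behind the 
45000ms threshold in this version; that should be verified first:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object Spark2865TimeoutTest {
  def main(args: Array[String]): Unit = {
    // Workaround sketch: raise the BlockManager heartbeat timeout so the master
    // does not remove a slow-to-heartbeat executor's BlockManager mid-fetch.
    // The config key name is an assumption about this Spark version.
    val conf = new SparkConf()
      .setAppName("spark-2865-heartbeat-timeout-test")
      .set("spark.storage.blockManagerSlaveTimeoutMs", "300000") // 5 minutes
    val sc = new SparkContext(conf)
    // ... run the same 128-task job here and check whether any tasks still hang ...
    sc.stop()
  }
}
{code}

If the stuck tasks disappear with the larger timeout, that would point at the 
premature BlockManager removal rather than the blocking fetch itself.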



> Potential deadlock: tasks could hang forever waiting to fetch a remote block even though most tasks finish
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-2865
>                 URL: https://issues.apache.org/jira/browse/SPARK-2865
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.0.1, 1.1.0
>         Environment: 16-node EC2 r3.2xlarge cluster
>            Reporter: Zongheng Yang
>            Priority: Blocker
>


