Zongheng Yang created SPARK-2865:
------------------------------------

             Summary: Potential deadlock: tasks could hang forever waiting to fetch a remote block even though most tasks finish
                 Key: SPARK-2865
                 URL: https://issues.apache.org/jira/browse/SPARK-2865
             Project: Spark
          Issue Type: Bug
          Components: Shuffle, Spark Core
    Affects Versions: 1.0.1, 1.1.0
         Environment: 16-node EC2 r3.2xlarge cluster
            Reporter: Zongheng Yang
            Priority: Blocker


In the application I tested, most of the 128 tasks would finish, but sometimes (fairly deterministically) either 1 or 3 tasks would hang forever with the following stack trace. The UI showed no apparent failures, and the nodes running the stuck tasks showed no apparent memory/CPU/disk pressure.

{noformat}
"Executor task launch worker-0" daemon prio=10 tid=0x00007f32ec003800 nid=0xaac 
waiting on condition [0x00007f33f4428000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f3e0d7198e8> (a 
scala.concurrent.impl.Promise$CompletionLatch)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
        at 
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
        at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at 
org.apache.spark.network.ConnectionManager.sendMessageReliablySync(ConnectionManager.scala:832)
        at 
org.apache.spark.storage.BlockManagerWorker$.syncGetBlock(BlockManagerWorker.scala:122)
        at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:497)
        at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:495)
        at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at 
org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:495)
        at 
org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:481)
        at org.apache.spark.storage.BlockManager.get(BlockManager.scala:524)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
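
For context, the trace shows the task parked inside scala.concurrent.Await.result on a Promise$CompletionLatch that is apparently never released, and no timeout ever fires. The snippet below is only an illustrative sketch of that blocking pattern (it is *not* the actual ConnectionManager code): if the reply promise is never completed, e.g. because the remote end went away, an unbounded Await hangs forever, whereas a bounded wait would at least fail loudly.

{code:scala}
import scala.concurrent.{Await, Promise, TimeoutException}
import scala.concurrent.duration._

object BlockedAwaitSketch {
  def main(args: Array[String]): Unit = {
    // A reply Promise that is never completed, standing in for a remote
    // block fetch whose response never arrives (hypothetical, for illustration).
    val reply = Promise[Array[Byte]]()

    // The pattern visible in the stack trace: an unbounded Await parks the
    // calling thread forever when the promise is never fulfilled.
    //   Await.result(reply.future, Duration.Inf)   // would hang indefinitely

    // A bounded wait at least surfaces the failure as a TimeoutException.
    try {
      Await.result(reply.future, 5.seconds)
    } catch {
      case _: TimeoutException =>
        println("remote block fetch did not complete within 5s")
    }
  }
}
{code}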

This behavior does *not* appear on 1.0 (reusing the same cluster), but does appear on the master branch as of Aug 4, 2014 *and* on 1.0.1. I also tried out [this patch|https://github.com/apache/spark/pull/1758], and it did not fix the behavior.

Further, when this behavior happened, the driver printed out the following line 
repeatedly:

{noformat}
14/08/04 23:32:42 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-172-31-6-74.us-west-1.compute.internal, 59408, 0) with no recent heart beats: 67331ms exceeds 45000ms
{noformat}


