Task hang problem
Hi,

I've had this nagging problem where a task will hang and the entire job hangs. Using PySpark, Spark 1.5.1.

The job output looks like this, and hangs after the last task:

...
15/12/29 17:00:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.65.143.174:34385 (size: 5.8 KB, free: 2.1 GB)
15/12/29 17:00:39 INFO TaskSetManager: Finished task 15.0 in stage 0.0 (TID 15) in 11668 ms on 10.65.143.174 (29/32)
15/12/29 17:00:39 INFO TaskSetManager: Finished task 23.0 in stage 0.0 (TID 23) in 11684 ms on 10.65.143.174 (30/32)
15/12/29 17:00:39 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 11717 ms on 10.65.143.174 (31/32)
{nothing here for a while, ~6 mins}

Here is the status of the hung task, from the UI (index, TID, attempt, status, locality, executor ID / host, launch time, duration, then timing and metric columns):

31  31  0  RUNNING  PROCESS_LOCAL  2 / 10.65.143.174  2015/12/29 17:00:28  6.8 min  0 ms  0 ms  60 ms  0 ms  0 ms  0.0 B

Here is the log from executor 2 on 10.65.143.174. I never see task 31 reach the executor. Any ideas?

...
15/12/29 17:00:38 INFO TorrentBroadcast: Started reading broadcast variable 0
15/12/29 17:00:38 INFO MemoryStore: ensureFreeSpace(5979) called with curMem=0, maxMem=2223023063
15/12/29 17:00:38 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.8 KB, free 2.1 GB)
15/12/29 17:00:38 INFO TorrentBroadcast: Reading broadcast variable 0 took 208 ms
15/12/29 17:00:38 INFO MemoryStore: ensureFreeSpace(8544) called with curMem=5979, maxMem=2223023063
15/12/29 17:00:38 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 8.3 KB, free 2.1 GB)
15/12/29 17:00:39 INFO PythonRunner: Times: total = 913, boot = 747, init = 166, finish = 0
15/12/29 17:00:39 INFO Executor: Finished task 15.0 in stage 0.0 (TID 15). 967 bytes result sent to driver
15/12/29 17:00:39 INFO PythonRunner: Times: total = 955, boot = 735, init = 220, finish = 0
15/12/29 17:00:39 INFO Executor: Finished task 23.0 in stage 0.0 (TID 23). 967 bytes result sent to driver
15/12/29 17:00:39 INFO PythonRunner: Times: total = 970, boot = 812, init = 158, finish = 0
15/12/29 17:00:39 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 967 bytes result sent to driver
[root@ip-10-65-143-174 2]$
Re: Task hang problem
Can you log onto 10.65.143.174, find task 31, and take a stack trace? Thanks

On Tue, Dec 29, 2015 at 9:19 AM, Darren Govoni wrote:
> Hi, I've had this nagging problem where a task will hang and the entire job
> hangs. Using pyspark. Spark 1.5.1
> [snip]
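For anyone following along, here is a sketch of taking that trace on the worker host. It assumes a JDK is installed there (jps and jstack ship with the JDK, not the JRE); pick_executor_pid is a hypothetical helper, not a Spark tool.

```shell
# Pull the executor JVM's PID out of `jps -lm` output. Spark 1.5 executors
# run with main class org.apache.spark.executor.CoarseGrainedExecutorBackend.
pick_executor_pid() {
  grep CoarseGrainedExecutorBackend | awk '{print $1}' | head -n 1
}

# Usage on 10.65.143.174:
#   EXEC_PID=$(jps -lm | pick_executor_pid)
#   jstack "$EXEC_PID" > /tmp/executor-threads.txt
#   # If jstack itself hangs, SIGQUIT makes the JVM dump threads to its stdout log:
#   kill -QUIT "$EXEC_PID"
```

The Thread Dump link on the executor's page in the Spark web UI gives the same information without shelling in, if the UI is reachable.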
Re: Task hang problem
Here's the executor trace.

Thread 58: Executor task launch worker-3 (RUNNABLE)
    java.net.SocketInputStream.socketRead0(Native Method)
    java.net.SocketInputStream.read(SocketInputStream.java:152)
    java.net.SocketInputStream.read(SocketInputStream.java:122)
    java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    java.io.BufferedInputStream.read(BufferedInputStream.java:254)
    java.io.DataInputStream.readInt(DataInputStream.java:387)
    org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:139)
    org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    org.apache.spark.scheduler.Task.run(Task.scala:88)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)

Thread 41: BLOCK_MANAGER cleanup timer (WAITING)
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:503)
    java.util.TimerThread.mainLoop(Timer.java:526)
    java.util.TimerThread.run(Timer.java:505)

Thread 42: BROADCAST_VARS cleanup timer (WAITING)
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:503)
    java.util.TimerThread.mainLoop(Timer.java:526)
    java.util.TimerThread.run(Timer.java:505)

Thread 54: driver-heartbeater (TIMED_WAITING)
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
    java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1090)
    java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
    java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)

Thread 3: Finalizer (WAITING)
    java.lang.Object.wait(Native Method)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
    java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

Thread 25: ForkJoinPool-3-worker-15 (WAITING)
    sun.misc.Unsafe.park(Native Method)
    scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
    scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Thread 35: Hashed wheel timer #2 (TIMED_WAITING)
    java.lang.Thread.sleep(Native Method)
    org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:483)
    org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:392)
    org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    java.lang.Thread.run(Thread.java:745)

Thread 68: Idle Worker Monitor for /usr/bin/python2.7 (TIMED_WAITING)
    java.lang.Thread.sleep(Native Method)
    org.apache.spark.api.python.PythonWorkerFactory$MonitorThread.run(PythonWorkerFactory.scala:229)

Thread 1: main (WAITING)
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
    java.util.concurrent.CountDownLatch.await(CountDownLatch.java:236)
    akka.actor.ActorSystemImpl$TerminationCallbacks.ready(ActorSystem.scala:819)
    akka.actor.ActorSystemImpl$TerminationCallbacks.ready(ActorSystem.scala:788)
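For what it's worth, the telling thread in that dump appears to be worker-3: it is RUNNABLE but parked in socketRead0 under DataInputStream.readInt in the PythonRDD read path, i.e. the JVM side is waiting indefinitely for bytes from the external Python worker process, so the next place to look is the python2.7 worker itself. A small sketch of scanning a saved dump for threads in that state (function name and matching rules are mine, not Spark's):

```python
def stuck_pyspark_workers(dump_text):
    """Return headers of threads whose stack shows a blocking socket read
    inside the PySpark runner (a PythonRDD frame plus socketRead0)."""
    def is_stuck(frames):
        return (any("PythonRDD" in f for f in frames)
                and any("socketRead0" in f for f in frames))

    stuck, header, frames = [], None, []
    for raw in dump_text.splitlines():
        line = raw.strip()
        # Thread headers look like 'Thread 58: ...' here, or '"name" ...'
        # in plain jstack output.
        if line.startswith("Thread ") or line.startswith('"'):
            if header and is_stuck(frames):
                stuck.append(header)
            header, frames = line, []
        elif line:
            frames.append(line)
    if header and is_stuck(frames):  # flush the last thread in the dump
        stuck.append(header)
    return stuck
```

Running it over a dump taken a minute apart and seeing the same worker thread in the same read frame both times is a reasonable confirmation that the task is truly wedged rather than just slow.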