Hi, I have a 10-node spark (0.8.1 with cdh) cluster. My job hangs at some point, in that the driver UI is still available at port 4040, but the master UI says the job failed.
The job seems to always hang in a reduceByKey(or reduceByKeyLocally). Some of my configuration parameters are - -Dspark.akka.frameSize=160 -Dspark.akka.timeout=100 -Dspark.akka.askTimeout=30 The job hangs with the driver saying an executor was lost. Within a few seconds, it loses all executors. Some of the executors say that a SendingConnectionManagerId was not found. I've noticed that in all cases, the executors *don't log for 15 seconds *before they log some more and then finish. *Driver log: * 14/02/04 14:12:46 INFO scheduler.DAGScheduler: Executor lost: 7 (epoch 3) 14/02/04 14:12:46 INFO storage.BlockManagerMasterActor: Trying to remove executor 7 from BlockManagerMaster. 14/02/04 14:12:46 INFO storage.BlockManagerMaster: Removed 7 successfully in removeExecutor 14/02/04 14:12:46 INFO cluster.ClusterTaskSetManager: Serialized task 10.0:49 as 6065714 bytes in 9 ms 14/02/04 14:12:46 INFO cluster.ClusterTaskSetManager: Starting task 10.0:54 as TID 881 on executor 9: node1.staging (PROCESS_LOCAL) 14/02/04 14:12:46 INFO scheduler.Stage: Stage 3 is now unavailable on executor 7 (72/80, false) 14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 4 14/02/04 14:12:46 INFO scheduler.Stage: Stage 6 is now unavailable on executor 7 (72/80, false) 14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 5 14/02/04 14:12:46 INFO client.Client$ClientActor: Executor updated: app-20140204140954-0076/7 is now FAILED (Command exited with code 137) 14/02/04 14:12:46 INFO scheduler.Stage: Stage 1 is now unavailable on executor 7 (72/80, false) 14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 6 *Last log lines from first executor that was lost:* 14/02/04 14:12:28 DEBUG BlockManager: Getting local block rdd_4_22 14/02/04 14:12:28 DEBUG BlockManager: Level for block rdd_4_22 is StorageLevel(false, true, true, 1) 14/02/04 14:12:28 DEBUG BlockManager: Getting block rdd_4_22 from memory 14/02/04 14:12:28 INFO BlockManager: Found block rdd_4_22 locally 14/02/04 14:12:29 DEBUG Executor: Task 876's epoch is 3 14/02/04 14:12:29 DEBUG CacheManager: Looking for partition rdd_4_23 14/02/04 14:12:29 DEBUG BlockManager: Getting local block rdd_4_23 14/02/04 14:12:29 DEBUG BlockManager: Level for block rdd_4_23 is StorageLevel(false, true, true, 1) 14/02/04 14:12:29 DEBUG BlockManager: Getting block rdd_4_23 from memory 14/02/04 14:12:29 INFO BlockManager: Found block rdd_4_23 locally 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(10.0.15.122,58862) 14/02/04 14:12:46 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@469c17ba 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(59b987.staging,58862) 14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(10.0.15.122,58862) 14/02/04 14:12:46 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found 14/02/04 14:12:46 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@469c17ba java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:341) at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98) 14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(10.0.15.122,58862) 14/02/04 14:12:46 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found 14/02/04 14:12:46 DEBUG ConnectionManager: Ignoring exception java.nio.channels.CancelledKeyException at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73) at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:82) at org.apache.spark.network.ConnectionManager.triggerForceCloseByException(ConnectionManager.scala:201) at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:347) at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98) 14/02/04 14:12:46 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@4ed29790 14/02/04 14:12:46 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@4ed29790 java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:341) at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98) 14/02/04 14:12:46 DEBUG ConnectionManager: Ignoring exception java.nio.channels.CancelledKeyException at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73) at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:82) at org.apache.spark.network.ConnectionManager.triggerForceCloseByException(ConnectionManager.scala:201) at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:347) at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98) 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys *Final few logs from another executor:* 14/02/04 14:12:29 DEBUG BlockManager: Getting local block rdd_4_79 14/02/04 14:12:29 DEBUG BlockManager: Level for block rdd_4_79 is StorageLevel(false, true, true, 1) 14/02/04 14:12:29 DEBUG BlockManager: Getting block rdd_4_79 from memory 14/02/04 14:12:29 INFO BlockManager: Found block rdd_4_79 locally 14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(10.0.15.122,58862) 14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(10.0.15.122,58862) 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(10.0.15.122,58862) 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(59b987.staging,58862) 14/02/04 14:12:46 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found 14/02/04 14:12:46 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(10.0.15.122,58862) java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:236) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:279) at org.apache.spark.network.SendingConnection.read(Connection.scala:398) at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:158) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) 14/02/04 14:12:46 INFO ConnectionManager: Handling connection error on connection to ConnectionManagerId(10.0.15.122,58862) 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(10.0.15.122,58862) 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(10.0.15.122,58862) 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys Thanks in advance! Roshan