Has anyone had this happen before?

Roshan
On Tue, Feb 4, 2014 at 9:13 PM, Roshan Nair <ros...@indix.com> wrote:
> Hi,
>
> I have a 10-node Spark (0.8.1 with CDH) cluster. My job hangs at some
> point, in that the driver UI is still available at port 4040, but the
> master UI says the job failed.
>
> The job seems to always hang in a reduceByKey (or reduceByKeyLocally).
>
> Some of my configuration parameters are:
> -Dspark.akka.frameSize=160 -Dspark.akka.timeout=100 -Dspark.akka.askTimeout=30
>
> The job hangs with the driver saying an executor was lost. Within a few
> seconds, it loses all executors.
> Some of the executors say that a SendingConnectionManagerId was not found.
> I've noticed that in all cases, the executors *don't log for 15 seconds*
> before they log some more and then finish.
>
> *Driver log:*
> 14/02/04 14:12:46 INFO scheduler.DAGScheduler: Executor lost: 7 (epoch 3)
> 14/02/04 14:12:46 INFO storage.BlockManagerMasterActor: Trying to remove executor 7 from BlockManagerMaster.
> 14/02/04 14:12:46 INFO storage.BlockManagerMaster: Removed 7 successfully in removeExecutor
> 14/02/04 14:12:46 INFO cluster.ClusterTaskSetManager: Serialized task 10.0:49 as 6065714 bytes in 9 ms
> 14/02/04 14:12:46 INFO cluster.ClusterTaskSetManager: Starting task 10.0:54 as TID 881 on executor 9: node1.staging (PROCESS_LOCAL)
> 14/02/04 14:12:46 INFO scheduler.Stage: Stage 3 is now unavailable on executor 7 (72/80, false)
> 14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 4
> 14/02/04 14:12:46 INFO scheduler.Stage: Stage 6 is now unavailable on executor 7 (72/80, false)
> 14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 5
> 14/02/04 14:12:46 INFO client.Client$ClientActor: Executor updated: app-20140204140954-0076/7 is now FAILED (Command exited with code 137)
> 14/02/04 14:12:46 INFO scheduler.Stage: Stage 1 is now unavailable on executor 7 (72/80, false)
> 14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 6
>
> *Last log lines from first executor that was lost:*
> 14/02/04 14:12:28 DEBUG BlockManager: Getting local block rdd_4_22
> 14/02/04 14:12:28 DEBUG BlockManager: Level for block rdd_4_22 is StorageLevel(false, true, true, 1)
> 14/02/04 14:12:28 DEBUG BlockManager: Getting block rdd_4_22 from memory
> 14/02/04 14:12:28 INFO BlockManager: Found block rdd_4_22 locally
> 14/02/04 14:12:29 DEBUG Executor: Task 876's epoch is 3
> 14/02/04 14:12:29 DEBUG CacheManager: Looking for partition rdd_4_23
> 14/02/04 14:12:29 DEBUG BlockManager: Getting local block rdd_4_23
> 14/02/04 14:12:29 DEBUG BlockManager: Level for block rdd_4_23 is StorageLevel(false, true, true, 1)
> 14/02/04 14:12:29 DEBUG BlockManager: Getting block rdd_4_23 from memory
> 14/02/04 14:12:29 INFO BlockManager: Found block rdd_4_23 locally
> 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@469c17ba
> 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(59b987.staging,58862)
> 14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> 14/02/04 14:12:46 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@469c17ba
> java.nio.channels.CancelledKeyException
>         at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:341)
>         at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
> 14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> 14/02/04 14:12:46 DEBUG ConnectionManager: Ignoring exception
> java.nio.channels.CancelledKeyException
>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:82)
>         at org.apache.spark.network.ConnectionManager.triggerForceCloseByException(ConnectionManager.scala:201)
>         at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:347)
>         at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
> 14/02/04 14:12:46 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@4ed29790
> 14/02/04 14:12:46 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@4ed29790
> java.nio.channels.CancelledKeyException
>         at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:341)
>         at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
> 14/02/04 14:12:46 DEBUG ConnectionManager: Ignoring exception
> java.nio.channels.CancelledKeyException
>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:82)
>         at org.apache.spark.network.ConnectionManager.triggerForceCloseByException(ConnectionManager.scala:201)
>         at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:347)
>         at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
> 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
> 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
> 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
>
> *Final few logs from another executor:*
> 14/02/04 14:12:29 DEBUG BlockManager: Getting local block rdd_4_79
> 14/02/04 14:12:29 DEBUG BlockManager: Level for block rdd_4_79 is StorageLevel(false, true, true, 1)
> 14/02/04 14:12:29 DEBUG BlockManager: Getting block rdd_4_79 from memory
> 14/02/04 14:12:29 INFO BlockManager: Found block rdd_4_79 locally
> 14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(59b987.staging,58862)
> 14/02/04 14:12:46 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
> 14/02/04 14:12:46 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(10.0.15.122,58862)
> java.nio.channels.ClosedChannelException
>         at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:236)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:279)
>         at org.apache.spark.network.SendingConnection.read(Connection.scala:398)
>         at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:158)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:722)
> 14/02/04 14:12:46 INFO ConnectionManager: Handling connection error on connection to ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
> 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
> 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
> 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
>
> Thanks in advance!
>
> Roshan
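For anyone trying to reproduce the setup: the quoted `-D` flags are the kind of settings that, on Spark 0.8.x standalone, are usually passed to the driver and executors through `SPARK_JAVA_OPTS`. A minimal sketch (the master URL here is hypothetical, not from the post):

```shell
# Sketch only: how the -D system properties quoted above are typically
# supplied to a Spark 0.8.x job. SPARK_JAVA_OPTS is read by the driver
# and forwarded to executors in standalone mode.
export SPARK_JAVA_OPTS="-Dspark.akka.frameSize=160 \
  -Dspark.akka.timeout=100 \
  -Dspark.akka.askTimeout=30"

# Hypothetical master URL for a standalone cluster.
export MASTER="spark://master.staging:7077"

# The same properties can alternatively be set in the driver with
# System.setProperty(...) before the SparkContext is constructed.
```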