Hi,

I have a 10-node spark (0.8.1 with cdh) cluster. My job hangs at some
point, in that the driver UI is still available at port 4040, but the
master UI says the job failed.

The job seems to always hang in a reduceByKey(or reduceByKeyLocally).

Some of my configuration parameters are -
-Dspark.akka.frameSize=160 -Dspark.akka.timeout=100
-Dspark.akka.askTimeout=30

The job hangs with the driver saying an executor was lost. Within a few
seconds, it loses all executors.
Some of the executors say that a SendingConnectionManagerId was not found.
I've noticed that in all cases, the executors *don't log for 15 seconds *before
they log some more and then finish.


*Driver log: *
14/02/04 14:12:46 INFO scheduler.DAGScheduler: Executor lost: 7 (epoch 3)
14/02/04 14:12:46 INFO storage.BlockManagerMasterActor: Trying to remove
executor 7 from BlockManagerMaster.
14/02/04 14:12:46 INFO storage.BlockManagerMaster: Removed 7 successfully
in removeExecutor
14/02/04 14:12:46 INFO cluster.ClusterTaskSetManager: Serialized task
10.0:49 as 6065714 bytes in 9 ms
14/02/04 14:12:46 INFO cluster.ClusterTaskSetManager: Starting task 10.0:54
as TID 881 on executor 9: node1.staging (PROCESS_LOCAL)
14/02/04 14:12:46 INFO scheduler.Stage: Stage 3 is now unavailable on
executor 7 (72/80, false)
14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 4
14/02/04 14:12:46 INFO scheduler.Stage: Stage 6 is now unavailable on
executor 7 (72/80, false)
14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 5
14/02/04 14:12:46 INFO client.Client$ClientActor: Executor updated:
app-20140204140954-0076/7 is now FAILED (Command exited with code 137)
14/02/04 14:12:46 INFO scheduler.Stage: Stage 1 is now unavailable on
executor 7 (72/80, false)
14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 6

*Last log lines from first executor that was lost:*
14/02/04 14:12:28 DEBUG BlockManager: Getting local block rdd_4_22
14/02/04 14:12:28 DEBUG BlockManager: Level for block rdd_4_22 is
StorageLevel(false, true, true, 1)
14/02/04 14:12:28 DEBUG BlockManager: Getting block rdd_4_22 from memory
14/02/04 14:12:28 INFO BlockManager: Found block rdd_4_22 locally
14/02/04 14:12:29 DEBUG Executor: Task 876's epoch is 3
14/02/04 14:12:29 DEBUG CacheManager: Looking for partition rdd_4_23
14/02/04 14:12:29 DEBUG BlockManager: Getting local block rdd_4_23
14/02/04 14:12:29 DEBUG BlockManager: Level for block rdd_4_23 is
StorageLevel(false, true, true, 1)
14/02/04 14:12:29 DEBUG BlockManager: Getting block rdd_4_23 from memory
14/02/04 14:12:29 INFO BlockManager: Found block rdd_4_23 locally
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 INFO ConnectionManager: Key not valid ?
sun.nio.ch.SelectionKeyImpl@469c17ba
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(59b987.staging,58862)
14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to
ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 ERROR ConnectionManager: Corresponding
SendingConnectionManagerId not found
14/02/04 14:12:46 INFO ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@469c17ba
java.nio.channels.CancelledKeyException
at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:341)
at
org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to
ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 ERROR ConnectionManager: Corresponding
SendingConnectionManagerId not found
14/02/04 14:12:46 DEBUG ConnectionManager: Ignoring exception
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:82)
at
org.apache.spark.network.ConnectionManager.triggerForceCloseByException(ConnectionManager.scala:201)
at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:347)
at
org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/02/04 14:12:46 INFO ConnectionManager: Key not valid ?
sun.nio.ch.SelectionKeyImpl@4ed29790
14/02/04 14:12:46 INFO ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@4ed29790
java.nio.channels.CancelledKeyException
at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:341)
at
org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/02/04 14:12:46 DEBUG ConnectionManager: Ignoring exception
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:82)
at
org.apache.spark.network.ConnectionManager.triggerForceCloseByException(ConnectionManager.scala:201)
at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:347)
at
org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys

*Final few logs from another executor:*
14/02/04 14:12:29 DEBUG BlockManager: Getting local block rdd_4_79
14/02/04 14:12:29 DEBUG BlockManager: Level for block rdd_4_79 is
StorageLevel(false, true, true, 1)
14/02/04 14:12:29 DEBUG BlockManager: Getting block rdd_4_79 from memory
14/02/04 14:12:29 INFO BlockManager: Found block rdd_4_79 locally
14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to
ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to
ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(59b987.staging,58862)
14/02/04 14:12:46 ERROR ConnectionManager: Corresponding
SendingConnectionManagerId not found
14/02/04 14:12:46 ERROR SendingConnection: Exception while reading
SendingConnection to ConnectionManagerId(10.0.15.122,58862)
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:236)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:279)
at org.apache.spark.network.SendingConnection.read(Connection.scala:398)
at
org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:158)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
14/02/04 14:12:46 INFO ConnectionManager: Handling connection error on
connection to ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(10.0.15.122,58862)
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys

Thanks in advance!

Roshan

Reply via email to