Has anyone had this happen before?

Roshan


On Tue, Feb 4, 2014 at 9:13 PM, Roshan Nair <ros...@indix.com> wrote:

> Hi,
>
> I have a 10-node Spark (0.8.1 with CDH) cluster. My job hangs at some
> point: the driver UI is still available at port 4040, but the master UI
> says the job failed.
>
> The job always seems to hang in a reduceByKey (or reduceByKeyLocally).
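For context on the two operations named above: a minimal sketch of their reduction semantics, modelled with plain Scala collections rather than an actual RDD (assumption: the relevant difference here is that reduceByKeyLocally returns its result as a Map to the driver, while reduceByKey keeps the result distributed).

```scala
// Tiny key/value set standing in for an RDD of pairs.
val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

// The per-key reduction both operations perform: sum values sharing a key.
// reduceByKey would leave this distributed as an RDD[(String, Int)].
val reduced: Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

// reduceByKeyLocally performs the same reduction but collects the whole
// Map back to the driver JVM, so a large key set means a large result
// shipped over the network to one process.
println(reduced)
```

If the key set is large, that final collect step is one place where result size can interact with limits such as spark.akka.frameSize.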
>
> Some of my configuration parameters are:
> -Dspark.akka.frameSize=160 -Dspark.akka.timeout=100
> -Dspark.akka.askTimeout=30
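For reference, one common way to pass JVM options like these in Spark 0.8.x is via SPARK_JAVA_OPTS in conf/spark-env.sh (a sketch, assuming a standalone deployment with the same file on driver and workers; not necessarily how the job above was configured):

```shell
# conf/spark-env.sh -- JVM options picked up by Spark 0.8.x daemons and jobs
SPARK_JAVA_OPTS="-Dspark.akka.frameSize=160 -Dspark.akka.timeout=100 -Dspark.akka.askTimeout=30"
export SPARK_JAVA_OPTS
```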
>
> The job hangs with the driver reporting that an executor was lost. Within
> a few seconds, it loses all executors.
> Some of the executors report that a SendingConnectionManagerId was not
> found. I've noticed that in all cases, the executors *don't log for 15
> seconds* before they log a few more lines and then finish.
>
>
> *Driver log:*
> 14/02/04 14:12:46 INFO scheduler.DAGScheduler: Executor lost: 7 (epoch 3)
> 14/02/04 14:12:46 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 7 from BlockManagerMaster.
> 14/02/04 14:12:46 INFO storage.BlockManagerMaster: Removed 7 successfully
> in removeExecutor
> 14/02/04 14:12:46 INFO cluster.ClusterTaskSetManager: Serialized task
> 10.0:49 as 6065714 bytes in 9 ms
> 14/02/04 14:12:46 INFO cluster.ClusterTaskSetManager: Starting task
> 10.0:54 as TID 881 on executor 9: node1.staging (PROCESS_LOCAL)
> 14/02/04 14:12:46 INFO scheduler.Stage: Stage 3 is now unavailable on
> executor 7 (72/80, false)
> 14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 4
> 14/02/04 14:12:46 INFO scheduler.Stage: Stage 6 is now unavailable on
> executor 7 (72/80, false)
> 14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 5
> 14/02/04 14:12:46 INFO client.Client$ClientActor: Executor updated:
> app-20140204140954-0076/7 is now FAILED (Command exited with code 137)
> 14/02/04 14:12:46 INFO scheduler.Stage: Stage 1 is now unavailable on
> executor 7 (72/80, false)
> 14/02/04 14:12:46 DEBUG spark.MapOutputTracker: Increasing epoch to 6
>
> *Last log lines from first executor that was lost:*
> 14/02/04 14:12:28 DEBUG BlockManager: Getting local block rdd_4_22
> 14/02/04 14:12:28 DEBUG BlockManager: Level for block rdd_4_22 is
> StorageLevel(false, true, true, 1)
> 14/02/04 14:12:28 DEBUG BlockManager: Getting block rdd_4_22 from memory
> 14/02/04 14:12:28 INFO BlockManager: Found block rdd_4_22 locally
> 14/02/04 14:12:29 DEBUG Executor: Task 876's epoch is 3
> 14/02/04 14:12:29 DEBUG CacheManager: Looking for partition rdd_4_23
> 14/02/04 14:12:29 DEBUG BlockManager: Getting local block rdd_4_23
> 14/02/04 14:12:29 DEBUG BlockManager: Level for block rdd_4_23 is
> StorageLevel(false, true, true, 1)
> 14/02/04 14:12:29 DEBUG BlockManager: Getting block rdd_4_23 from memory
> 14/02/04 14:12:29 INFO BlockManager: Found block rdd_4_23 locally
> 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to
> ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 INFO ConnectionManager: Key not valid ?
> sun.nio.ch.SelectionKeyImpl@469c17ba
> 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to
> ConnectionManagerId(59b987.staging,58862)
> 14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to
> ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 ERROR ConnectionManager: Corresponding
> SendingConnectionManagerId not found
> 14/02/04 14:12:46 INFO ConnectionManager: key already cancelled ?
> sun.nio.ch.SelectionKeyImpl@469c17ba
> java.nio.channels.CancelledKeyException
> at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:341)
> at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
> 14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to
> ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 ERROR ConnectionManager: Corresponding
> SendingConnectionManagerId not found
> 14/02/04 14:12:46 DEBUG ConnectionManager: Ignoring exception
> java.nio.channels.CancelledKeyException
> at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
> at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:82)
> at org.apache.spark.network.ConnectionManager.triggerForceCloseByException(ConnectionManager.scala:201)
> at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:347)
> at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
> 14/02/04 14:12:46 INFO ConnectionManager: Key not valid ?
> sun.nio.ch.SelectionKeyImpl@4ed29790
> 14/02/04 14:12:46 INFO ConnectionManager: key already cancelled ?
> sun.nio.ch.SelectionKeyImpl@4ed29790
> java.nio.channels.CancelledKeyException
> at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:341)
> at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
> 14/02/04 14:12:46 DEBUG ConnectionManager: Ignoring exception
> java.nio.channels.CancelledKeyException
> at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
> at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:82)
> at org.apache.spark.network.ConnectionManager.triggerForceCloseByException(ConnectionManager.scala:201)
> at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:347)
> at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)
> 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
> 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
> 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
>
> *Final few logs from another executor:*
> 14/02/04 14:12:29 DEBUG BlockManager: Getting local block rdd_4_79
> 14/02/04 14:12:29 DEBUG BlockManager: Level for block rdd_4_79 is
> StorageLevel(false, true, true, 1)
> 14/02/04 14:12:29 DEBUG BlockManager: Getting block rdd_4_79 from memory
> 14/02/04 14:12:29 INFO BlockManager: Found block rdd_4_79 locally
> 14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to
> ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 INFO ConnectionManager: Removing ReceivingConnection to
> ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to
> ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to
> ConnectionManagerId(59b987.staging,58862)
> 14/02/04 14:12:46 ERROR ConnectionManager: Corresponding
> SendingConnectionManagerId not found
> 14/02/04 14:12:46 ERROR SendingConnection: Exception while reading
> SendingConnection to ConnectionManagerId(10.0.15.122,58862)
> java.nio.channels.ClosedChannelException
> at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:236)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:279)
> at org.apache.spark.network.SendingConnection.read(Connection.scala:398)
> at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:158)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> 14/02/04 14:12:46 INFO ConnectionManager: Handling connection error on
> connection to ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
> 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to
> ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
> 14/02/04 14:12:46 INFO ConnectionManager: Removing SendingConnection to
> ConnectionManagerId(10.0.15.122,58862)
> 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
> 14/02/04 14:12:46 DEBUG ConnectionManager: Selector selected 0 of 33 keys
>
> Thanks in advance!
>
> Roshan
>
