Hi,

I am running Spark 0.9.2 on an EC2 cluster of about 16 r3.4xlarge machines.
The cluster is running Spark standalone and was launched with the ec2
scripts.
In my Spark job, I am using the ephemeral HDFS to checkpoint some of my RDDs.
I'm also reading from and writing to S3, and the job involves a large amount
of shuffling.
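
For context, the checkpointing and S3 I/O are roughly along these lines (a
minimal sketch; the paths, bucket names, and the groupByKey stand-in for the
real shuffles are illustrative, not the actual job):

import org.apache.spark.{SparkConf, SparkContext}

object JobSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://<master-host>:7077")   // standalone master from the ec2 scripts
      .setAppName("job-sketch")
    val sc = new SparkContext(conf)

    // Checkpoint into the ephemeral HDFS set up by the ec2 scripts (path illustrative).
    sc.setCheckpointDir("hdfs://<master-host>:9000/checkpoints")

    // Read from S3, shuffle, and checkpoint the intermediate result.
    val input   = sc.textFile("s3n://my-bucket/input/")     // bucket is illustrative
    val keyed   = input.map(line => (line.length % 1024, line))
    val grouped = keyed.groupByKey()                        // wide dependency -> shuffle
    grouped.checkpoint()
    grouped.count()                                         // action forces materialization

    // A later step reuses the checkpointed RDD and writes back to S3.
    grouped.map { case (k, vs) => k + "\t" + vs.size }
      .saveAsTextFile("s3n://my-bucket/output/")

    sc.stop()
  }
}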

I run the same job on multiple sets of data, and for 50-70% of these runs
the job completes with no issues. (Typically a rerun will allow the
"failures" to complete as well.)

However, on the remaining 30-50% of runs I see a bunch of different kinds of
issues pop up (which go away if I rerun the same job):

(1) Checkpointing silently fails (I assume): the checkpoint dir exists in
HDFS, but no data files are written out. A later step in the job then tries
to reload these RDDs and I get a failure about not being able to read from
HDFS. -- Usually a stop-dfs/start-dfs cycle "cures" this.
*Q: What could be the cause of this? Timeouts?*
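
In case it helps rule things out, this is roughly the kind of sanity check I
could add right after forcing the checkpoint, to catch the "directory exists
but no data files" case early (a sketch only; it assumes the sc and grouped
names from the sketch above, and is not something the job does today):

import org.apache.hadoop.fs.{FileSystem, Path}

grouped.checkpoint()
grouped.count()   // the action that should trigger writing the checkpoint files

println("isCheckpointed  = " + grouped.isCheckpointed)
println("checkpoint file = " + grouped.getCheckpointFile)

// Independently list the checkpoint directory in HDFS to see whether part files exist.
grouped.getCheckpointFile.foreach { dir =>
  val fs = FileSystem.get(new java.net.URI(dir), sc.hadoopConfiguration)
  fs.listStatus(new Path(dir)).foreach(status => println(status.getPath))
}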


(2) Other times I get the following, and I have no idea who or what is
causing it.
In the master's /spark/logs:
2014-08-21 16:46:15 ERROR EndpointWriter: AssociationError [akka.tcp://
sparkmas...@ec2-54-218-216-19.us-west-2.compute.amazonaws.com:7077] ->
[akka.tcp://sp...@ip-10-34-2-246.us-west-2.compute.internal:37681]: Error
[Association failed with [akka.tcp://spark@ip-10
-34-2-246.us-west-2.compute.internal:37681]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sp...@ip-10-34-2-246.us-west-2.compute.internal:37681]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: ip-10-34-2-246.us-west-2.compute.internal/
10.34.2.246:37681
]

Slave Log:
2014-08-21 16:46:47 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(ip-10-33-7-4.us-west-2.compute.internal,33242)
2014-08-21 16:46:47 ERROR SendingConnection: Exception while reading
SendingConnection to
ConnectionManagerId(ip-10-33-7-4.us-west-2.compute.internal,33242)
java.nio.channels.ClosedChannelException
        at
sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
        at
org.apache.spark.network.SendingConnection.read(Connection.scala:398)
        at
org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:158)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
*Q: Where do I even start debugging this kind of issue? Are the machines
too loaded, so that timeouts are getting hit? Am I not setting some
configuration value correctly? I would be grateful for some hints on where
to start looking!*
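
For concreteness, this is the kind of tuning I have been wondering about
(property names taken from the 0.9 configuration docs as best I can tell,
and the values are just guesses -- please correct me if any of these are off
base):

import org.apache.spark.SparkConf

// Candidate knobs for flaky Akka associations on heavily loaded nodes (values are guesses).
val conf = new SparkConf()
  .set("spark.akka.timeout", "300")      // seconds
  .set("spark.akka.askTimeout", "120")   // seconds
  .set("spark.akka.frameSize", "50")     // MB, in case large task results are a factor
// On the master side, spark.worker.timeout (seconds) is the other knob I am aware of.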


(3) Often (2) will be preceded by the following in the Spark logs:
2014-08-21 16:34:10 WARN TaskSetManager: Lost TID 102135 (task 398.0:147)
2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure from
BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371, 0)
2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure from
BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371, 0)
2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure from
BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371, 0)
Not sure if this is an indication of the underlying problem.



I'll be very grateful for any ideas on how to start debugging these.
Is there anything I should be monitoring -- CPU usage on the master/slaves,
number of executors per CPU, Akka threads, etc.?

Cheers,
shay
