Hi Shay,

You can try setting spark.storage.blockManagerSlaveTimeoutMs to a higher value.
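For example, something along these lines (a minimal sketch; the 120000 ms value
and the app name are just illustrations, tune them for your workload):

    // Raise the block manager slave timeout before creating the SparkContext
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-job")  // hypothetical app name
      .set("spark.storage.blockManagerSlaveTimeoutMs", "120000")  // bump the timeout up
    val sc = new SparkContext(conf)

If you'd rather set it cluster-wide, the same property can be passed as a
system property (e.g. via SPARK_JAVA_OPTS with
-Dspark.storage.blockManagerSlaveTimeoutMs=120000).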
Cheers,
Jayant

On Thu, Aug 21, 2014 at 1:33 PM, Shay Seng <s...@urbanengines.com> wrote:

> Unfortunately it doesn't look like my executors are OOM. On the slave
> machines I checked both the logs in /spark/log (which I assume are from the
> slave daemon?) and in /spark/work/... which I assume are from each
> worker/executor.
>
> On Thu, Aug 21, 2014 at 11:19 AM, Yana Kadiyska <yana.kadiy...@gmail.com>
> wrote:
>
>> Whenever I've seen this exception it has ALWAYS been the case of an
>> executor running out of memory. I don't use checkpointing, so I'm not too
>> sure about the first item. The rest of them I believe would happen if an
>> executor fails and the worker spawns a new executor. Usually a good way to
>> verify this is to look in the driver log where it says "Lost TID 102135"
>> and see which worker TID 102135 was sent to. If I'm correct and an executor
>> has rolled, you would see two executor logs for your application -- the
>> first one usually contains an OOM. I run 0.9.1, but I believe it should be
>> a pretty similar setup.
>>
>> On Thu, Aug 21, 2014 at 1:23 PM, Shay Seng <s...@urbanengines.com> wrote:
>>
>>> Hi,
>>>
>>> I am running Spark 0.9.2 on an EC2 cluster with about 16 r3.4xlarge
>>> machines. The cluster is running Spark standalone and is launched with
>>> the EC2 scripts.
>>> In my Spark job, I am using ephemeral HDFS to checkpoint some of my
>>> RDDs. I'm also reading and writing to S3. My jobs also involve a large
>>> amount of shuffling.
>>>
>>> I run the same job on multiple sets of data, and for 50-70% of these
>>> runs the job completes with no issues. (Typically a rerun will allow the
>>> "failures" to complete as well.)
>>>
>>> However, on the remaining 30% I see a bunch of different kinds of issues
>>> pop up (which go away if I rerun the same job):
>>>
>>> (1) Checkpointing silently fails (I assume). The checkpoint dir exists
>>> in HDFS, but no data files are written out, and a later step in the job
>>> that tries to reload these RDDs fails because it cannot read from HDFS.
>>> -- Usually a stop-dfs/start-dfs "cures" this.
>>> *Q: What could be the cause of this? Timeouts?*
>>>
>>> (2) Other times I get the following -- no idea who or what is causing it...
>>> In the master log (/spark/logs):
>>> 2014-08-21 16:46:15 ERROR EndpointWriter: AssociationError [akka.tcp://
>>> sparkmas...@ec2-54-218-216-19.us-west-2.compute.amazonaws.com:7077] ->
>>> [akka.tcp://sp...@ip-10-34-2-246.us-west-2.compute.internal:37681]:
>>> Error [Association failed with [akka.tcp://spark@ip-10
>>> -34-2-246.us-west-2.compute.internal:37681]] [
>>> akka.remote.EndpointAssociationException: Association failed with
>>> [akka.tcp://sp...@ip-10-34-2-246.us-west-2.compute.internal:37681]
>>> Caused by:
>>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>>> Connection refused: ip-10-34-2-246.us-west-2.compute.internal/
>>> 10.34.2.246:37681
>>> ]
>>>
>>> Slave log:
>>> 2014-08-21 16:46:47 INFO ConnectionManager: Removing SendingConnection
>>> to ConnectionManagerId(ip-10-33-7-4.us-west-2.compute.internal,33242)
>>> 2014-08-21 16:46:47 ERROR SendingConnection: Exception while reading
>>> SendingConnection to
>>> ConnectionManagerId(ip-10-33-7-4.us-west-2.compute.internal,33242)
>>> java.nio.channels.ClosedChannelException
>>>     at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
>>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
>>>     at org.apache.spark.network.SendingConnection.read(Connection.scala:398)
>>>     at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:158)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>     at java.lang.Thread.run(Thread.java:744)
>>> *Q: Where do I even start debugging these kinds of issues? Are the
>>> machines too loaded, so timeouts are getting hit? Am I not setting some
>>> configuration value correctly? I would be grateful for some hints on
>>> where to start looking!*
>>>
>>> (3) Often (2) is preceded by the following in /spark/logs:
>>> 2014-08-21 16:34:10 WARN TaskSetManager: Lost TID 102135 (task 398.0:147)
>>> 2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure
>>> from BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371, 0)
>>> 2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure
>>> from BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371, 0)
>>> 2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure
>>> from BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371, 0)
>>> Not sure if this is an indication...
>>>
>>> I'd be very grateful for any ideas on how to start debugging these.
>>> Is there anything I should be monitoring -- CPU usage on master/slaves,
>>> number of executors per CPU, akka threads, etc.?
>>>
>>> Cheers,
>>> shay
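Regarding (1): keep in mind that rdd.checkpoint() is lazy -- the checkpoint
files are only written to HDFS after the RDD is actually computed by an
action, so it can help to force and verify the checkpoint explicitly. A
minimal sketch (the HDFS and S3 paths are hypothetical):

    // Checkpoint to ephemeral HDFS and verify the data was really written
    sc.setCheckpointDir("hdfs:///checkpoints")  // hypothetical dir on the ephemeral HDFS

    val rdd = sc.textFile("s3n://some-bucket/input").map(_.length)  // hypothetical input
    rdd.checkpoint()   // only marks the RDD for checkpointing
    rdd.count()        // an action is needed to compute and persist it

    println(rdd.isCheckpointed)     // should be true once the files exist
    println(rdd.getCheckpointFile)  // Some(hdfs://...) if the write succeeded

That at least separates "the checkpoint never ran" from "the write to HDFS
failed" before the later step tries to reload the data.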