Hi Shay,

You can try setting spark.storage.blockManagerSlaveTimeoutMs to a higher value.
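For example, something along these lines (a minimal sketch; the 120000 ms value
and the app name are just illustrations, tune them for your workload):

    // Raise the block manager slave timeout before creating the SparkContext
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-job")  // hypothetical app name
      .set("spark.storage.blockManagerSlaveTimeoutMs", "120000")  // bump the timeout up
    val sc = new SparkContext(conf)

If you'd rather set it cluster-wide, the same property can be passed as a
system property (e.g. via SPARK_JAVA_OPTS with
-Dspark.storage.blockManagerSlaveTimeoutMs=120000).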
Cheers,
Jayant

On Thu, Aug 21, 2014 at 1:33 PM, Shay Seng <s...@urbanengines.com> wrote:

> Unfortunately it doesn't look like my executors are OOM. On the slave
> machines I checked both the logs in /spark/log (which I assume are from the
> slave daemon?) and in /spark/work/... which I assume are from each
> worker/executor.
>
> On Thu, Aug 21, 2014 at 11:19 AM, Yana Kadiyska <yana.kadiy...@gmail.com>
> wrote:
>
>> Whenever I've seen this exception it has ALWAYS been the case of an
>> executor running out of memory. I don't use checkpointing, so I'm not too
>> sure about the first item. The rest of them I believe would happen if an
>> executor fails and the worker spawns a new executor. Usually a good way to
>> verify this is to look in the driver log where it says "Lost TID 102135"
>> and see which worker TID 102135 was sent to. If I'm correct and an executor
>> has rolled, you would see two executor logs for your application -- the
>> first one usually contains an OOM. I run 0.9.1, but I believe it should be
>> a pretty similar setup.
>>
>> On Thu, Aug 21, 2014 at 1:23 PM, Shay Seng <s...@urbanengines.com> wrote:
>>
>>> Hi,
>>>
>>> I am running Spark 0.9.2 on an EC2 cluster with about 16 r3.4xlarge
>>> machines. The cluster is running Spark standalone and is launched with
>>> the EC2 scripts.
>>> In my Spark job, I am using ephemeral HDFS to checkpoint some of my
>>> RDDs. I'm also reading and writing to S3. My jobs also involve a large
>>> amount of shuffling.
>>>
>>> I run the same job on multiple sets of data, and for 50-70% of these
>>> runs the job completes with no issues. (Typically a rerun will allow the
>>> "failures" to complete as well.)
>>>
>>> However, on the remaining 30% I see a bunch of different kinds of issues
>>> pop up (which go away if I rerun the same job):
>>>
>>> (1) Checkpointing silently fails (I assume). The checkpoint dir exists
>>> in HDFS, but no data files are written out, and a later step in the job
>>> that tries to reload these RDDs fails because it cannot read from HDFS.
>>> -- Usually a stop-dfs/start-dfs "cures" this.
>>> *Q: What could be the cause of this? Timeouts?*
>>>
>>> (2) Other times I get the following -- no idea who or what is causing it...
>>> In the master log (/spark/logs):
>>> 2014-08-21 16:46:15 ERROR EndpointWriter: AssociationError [akka.tcp://
>>> sparkmas...@ec2-54-218-216-19.us-west-2.compute.amazonaws.com:7077] ->
>>> [akka.tcp://sp...@ip-10-34-2-246.us-west-2.compute.internal:37681]:
>>> Error [Association failed with [akka.tcp://spark@ip-10
>>> -34-2-246.us-west-2.compute.internal:37681]] [
>>> akka.remote.EndpointAssociationException: Association failed with
>>> [akka.tcp://sp...@ip-10-34-2-246.us-west-2.compute.internal:37681]
>>> Caused by:
>>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>>> Connection refused: ip-10-34-2-246.us-west-2.compute.internal/
>>> 10.34.2.246:37681
>>> ]
>>>
>>> Slave log:
>>> 2014-08-21 16:46:47 INFO ConnectionManager: Removing SendingConnection
>>> to ConnectionManagerId(ip-10-33-7-4.us-west-2.compute.internal,33242)
>>> 2014-08-21 16:46:47 ERROR SendingConnection: Exception while reading
>>> SendingConnection to
>>> ConnectionManagerId(ip-10-33-7-4.us-west-2.compute.internal,33242)
>>> java.nio.channels.ClosedChannelException
>>>     at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
>>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
>>>     at org.apache.spark.network.SendingConnection.read(Connection.scala:398)
>>>     at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:158)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>     at java.lang.Thread.run(Thread.java:744)
>>> *Q: Where do I even start debugging these kinds of issues? Are the
>>> machines too loaded, so timeouts are getting hit? Am I not setting some
>>> configuration value correctly? I would be grateful for some hints on
>>> where to start looking!*
>>>
>>> (3) Often (2) is preceded by the following in /spark/logs:
>>> 2014-08-21 16:34:10 WARN TaskSetManager: Lost TID 102135 (task 398.0:147)
>>> 2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure
>>> from BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371, 0)
>>> 2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure
>>> from BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371, 0)
>>> 2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure
>>> from BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371, 0)
>>> Not sure if this is an indication...
>>>
>>> I'd be very grateful for any ideas on how to start debugging these.
>>> Is there anything I should be monitoring -- CPU usage on master/slaves,
>>> number of executors per CPU, akka threads, etc.?
>>>
>>> Cheers,
>>> shay
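Regarding (1): keep in mind that rdd.checkpoint() is lazy -- the checkpoint
files are only written to HDFS after the RDD is actually computed by an
action, so it can help to force and verify the checkpoint explicitly. A
minimal sketch (the HDFS and S3 paths are hypothetical):

    // Checkpoint to ephemeral HDFS and verify the data was really written
    sc.setCheckpointDir("hdfs:///checkpoints")  // hypothetical dir on the ephemeral HDFS

    val rdd = sc.textFile("s3n://some-bucket/input").map(_.length)  // hypothetical input
    rdd.checkpoint()   // only marks the RDD for checkpointing
    rdd.count()        // an action is needed to compute and persist it

    println(rdd.isCheckpointed)     // should be true once the files exist
    println(rdd.getCheckpointFile)  // Some(hdfs://...) if the write succeeded

That at least separates "the checkpoint never ran" from "the write to HDFS
failed" before the later step tries to reload the data.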