Debugging cluster stability, configuration issues

2014-08-21 Thread Shay Seng
Hi, I am running Spark 0.9.2 on an EC2 cluster with about 16 r3.4xlarge machines. The cluster is running Spark standalone and was launched with the ec2 scripts. In my Spark job, I am using ephemeral HDFS to checkpoint some of my RDDs. I'm also reading from and writing to S3. My jobs also involve a large

Re: Debugging cluster stability, configuration issues

2014-08-21 Thread Shay Seng
Unfortunately it doesn't look like my executors are OOM. On the slave machines I checked both the logs in /spark/log (which I assume are from the slave driver?) and in /spark/work/... which I assume are from each worker/executor. On Thu, Aug 21, 2014 at 11:19 AM, Yana Kadiyska

Re: Debugging cluster stability, configuration issues

2014-08-21 Thread Jayant Shekhar
Hi Shay, You can try setting spark.storage.blockManagerSlaveTimeoutMs to a higher value. Cheers, Jayant On Thu, Aug 21, 2014 at 1:33 PM, Shay Seng s...@urbanengines.com wrote: Unfortunately it doesn't look like my executors are OOM. On the slave machines I checked both the logs in
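
For reference, one way to apply Jayant's suggestion is to set the property on the SparkConf before the SparkContext is created. This is only a sketch: the application name, the master URL placeholder, and the 120000 ms value are illustrative assumptions, not values taken from this thread, and a suitable timeout depends on the workload.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TimeoutExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("timeout-example")          // hypothetical application name
      .setMaster("spark://<master-host>:7077") // placeholder; use your cluster's master URL
      // Property named in the reply above; the value is in milliseconds.
      // 120000 is an assumed example value, not a recommendation from the thread.
      .set("spark.storage.blockManagerSlaveTimeoutMs", "120000")

    // The property has to be in place before the context is constructed.
    val sc = new SparkContext(conf)
    try {
      // ... job code goes here ...
    } finally {
      sc.stop()
    }
  }
}
```

The same property can also be passed as a JVM system property (-Dspark.storage.blockManagerSlaveTimeoutMs=...) when launching the driver, since SparkConf picks up system properties that start with "spark.".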