Hi Darin,

In our case, we were getting this error due to long GC pauses in our app. Fixing the underlying code made the error go away. This is also mentioned as point 1 in the link below:
http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/%3cca+-p3ah5aamgtke6viycwb24ohsnmaqm1q9x53abwb_arvw...@mail.gmail.com%3E

Best Regards,
Sonal
Founder, Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>

On Fri, Nov 14, 2014 at 1:01 AM, Darin McBeath <ddmcbe...@yahoo.com.invalid> wrote:

> For one of my Spark jobs, my workers/executors are dying and leaving the
> cluster.
>
> On the master, I see something like the following in the log file. I'm
> surprised to see the '60' seconds in the master log below because I
> explicitly set it to '600' (or so I thought) in my spark job (see below).
> This is happening at the end of my job when I'm trying to persist a large
> RDD (probably around 300+GB) back to S3 (in 256 partitions). My cluster
> consists of 6 r3.8xlarge machines. The job successfully works when I'm
> outputting 100GB or 200GB.
>
> If you have any thoughts/insights, it would be appreciated.
>
> Thanks.
>
> Darin.
>
> Here is where I'm setting the 'timeout' in my spark job.
>
> SparkConf conf = new SparkConf()
>     .setAppName("SparkSync Application")
>     .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>     .set("spark.rdd.compress","true")
>     .set("spark.core.connection.ack.wait.timeout","600");
>
> On the master, I see the following in the log file.
>
> 14/11/13 17:20:39 WARN master.Master: Removing
> worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 because we got no
> heartbeat in 60 seconds
> 14/11/13 17:20:39 INFO master.Master: Removing worker
> worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 on
> ip-10-35-184-232.ec2.internal:51877
> 14/11/13 17:20:39 INFO master.Master: Telling app of lost executor: 2
>
> On a worker, I see something like the following in the log file.
>
> 14/11/13 17:20:58 WARN util.AkkaUtils: Error sending message in 1 attempts
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>     at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>     at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>     at scala.concurrent.Await$.result(package.scala:107)
>     at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
>     at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:362)
> 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: I/O exception
> (java.net.SocketException) caught when processing request: Broken pipe
> 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:32 INFO httpclient.HttpMethodDirector: I/O exception
> (java.net.SocketException) caught when processing request: Broken pipe
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 times, which
> exceeds the maximum retry count of 5
> 14/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 times, which
> exceeds the maximum retry count of 5
> 14/11/13 17:22:57 WARN util.AkkaUtils: Error sending message in 1 attempts
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>
>
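
On the GC point at the top of this reply: a rough way to see how much time a JVM is spending in garbage collection is to read the standard GC MXBeans from java.lang.management. The sketch below is only an illustration, not the code change mentioned above. The class and method names (GcTimeCheck, totalGcMillis, gcFraction) are made up for this example, and it reports on whichever JVM it runs in, so it would have to be called from inside task code to say anything about an executor.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper (not part of Spark): reports accumulated GC time
// for the JVM it is running in, using only the standard library.
public class GcTimeCheck {

    // Sum of approximate accumulated GC time in milliseconds across all collectors.
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if this collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Accumulated GC time divided by JVM uptime; 0.0 if uptime is not yet positive.
    public static double gcFraction() {
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMillis <= 0 ? 0.0 : (double) totalGcMillis() / uptimeMillis;
    }

    public static void main(String[] args) {
        System.out.printf("GC time: %d ms (%.1f%% of uptime)%n",
                totalGcMillis(), 100.0 * gcFraction());
    }
}
```

If the reported fraction climbs sharply while the large RDD is being written out, that would be consistent with the executor being too busy collecting garbage to send its heartbeat in time, but it is only a hint and not proof of the cause.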