I’m getting the same issue on Spark 1.2.0. Despite having set
“spark.core.connection.ack.wait.timeout” in spark-defaults.conf and verifying it
in the application web UI (port 4040) Environment tab, I still get the “no
heartbeat in 60 seconds” error.

spark.core.connection.ack.wait.timeout=3600

15/01/22 07:29:36 WARN master.Master: Removing worker-20150121231529-numaq1-4-34948 because we got no heartbeat in 60 seconds
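
For reference, the programmatic equivalent of that spark-defaults.conf entry, in the same style as the SparkConf snippet quoted further down, would look something like this (an untested sketch; the app name is just a placeholder):

    import org.apache.spark.SparkConf;

    SparkConf conf = new SparkConf()
        .setAppName("MyApp")                                     // placeholder app name
        .set("spark.core.connection.ack.wait.timeout", "3600");  // value is in seconds, same as above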


On 14 Nov, 2014, at 3:04 pm, Reynold Xin <r...@databricks.com> wrote:

> Darin,
> 
> You might want to increase these config options also:
> 
> spark.akka.timeout 300
> spark.storage.blockManagerSlaveTimeoutMs 300000
> 
> On Thu, Nov 13, 2014 at 11:31 AM, Darin McBeath <ddmcbe...@yahoo.com.invalid> 
> wrote:
> For one of my Spark jobs, my workers/executors are dying and leaving the 
> cluster.
> 
> On the master, I see something like the following in the log file. I'm
> surprised to see '60 seconds' in the master log below, because I explicitly
> set the timeout to '600' (or so I thought) in my Spark job (see below).
> This is happening at the end of my job, when I'm trying to persist a large
> RDD (probably around 300+ GB) back to S3 in 256 partitions. My cluster
> consists of 6 r3.8xlarge machines. The job works successfully when I'm
> outputting 100 GB or 200 GB.
> 
> If you have any thoughts/insights, they would be appreciated.
> 
> Thanks.
> 
> Darin.
> 
> Here is where I'm setting the timeout in my Spark job.
> 
> SparkConf conf = new SparkConf()
>     .setAppName("SparkSync Application")
>     .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>     .set("spark.rdd.compress", "true")
>     .set("spark.core.connection.ack.wait.timeout", "600");
> 
> On the master, I see the following in the log file.
> 
> 14/11/13 17:20:39 WARN master.Master: Removing worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 because we got no heartbeat in 60 seconds
> 14/11/13 17:20:39 INFO master.Master: Removing worker worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 on ip-10-35-184-232.ec2.internal:51877
> 14/11/13 17:20:39 INFO master.Master: Telling app of lost executor: 2
> 
> On a worker, I see something like the following in the log file.
> 
> 14/11/13 17:20:58 WARN util.AkkaUtils: Error sending message in 1 attempts
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
>   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:362)
> 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) caught when processing request: Broken pipe
> 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:32 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) caught when processing request: Broken pipe
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> [the preceding "Resetting to invalid mark" / "Retrying request" pair repeats seven more times at 17:21:34]
> 14/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 times, which exceeds the maximum retry count of 5
> 14/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 times, which exceeds the maximum retry count of 5
> 14/11/13 17:22:57 WARN util.AkkaUtils: Error sending message in 1 attempts
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
> 
> 
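
For anyone else hitting this: the two properties Reynold suggests above can be combined with the ack wait timeout in a single SparkConf, along the lines of the snippet below (a rough, untested sketch; the values are the ones mentioned in this thread, and the app name is a placeholder):

    import org.apache.spark.SparkConf;

    SparkConf conf = new SparkConf()
        .setAppName("MyApp")                                        // placeholder app name
        .set("spark.core.connection.ack.wait.timeout", "3600")      // seconds
        .set("spark.akka.timeout", "300")                           // seconds
        .set("spark.storage.blockManagerSlaveTimeoutMs", "300000"); // milliseconds

The same three properties can of course go into spark-defaults.conf instead, one per line, as in the entries quoted above.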
