Re: Confused why I'm losing workers/executors when writing a large file to S3

2015-01-21 Thread Tsai Li Ming
I’m getting the same issue on Spark 1.2.0. Despite having set
“spark.core.connection.ack.wait.timeout” in spark-defaults.conf and verified it in
the Environment tab of the job UI (port 4040), I still get the “no heartbeat in 60
seconds” error.

spark.core.connection.ack.wait.timeout=3600

15/01/22 07:29:36 WARN master.Master: Removing 
worker-20150121231529-numaq1-4-34948 because we got no heartbeat in 60 seconds
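
One thing worth checking (an assumption on my part, not confirmed in this thread): the standalone master removes workers based on spark.worker.timeout (default 60 seconds), which is a separate setting from the ack timeout above. A minimal spark-defaults.conf sketch raising both, with illustrative values only:

spark.core.connection.ack.wait.timeout   3600
spark.worker.timeout                     600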


On 14 Nov, 2014, at 3:04 pm, Reynold Xin r...@databricks.com wrote:

 Darin,
 
 You might want to increase these config options also:
 
 spark.akka.timeout 300
 spark.storage.blockManagerSlaveTimeoutMs 300000
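 
 For example, a minimal sketch merging these into the existing SparkConf (assuming they can be set programmatically like the other options; values untested):
 
 SparkConf conf = new SparkConf()
     .setAppName("SparkSync Application")
     .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
     .set("spark.rdd.compress", "true")
     .set("spark.core.connection.ack.wait.timeout", "600")
     // suggested additions (values as above)
     .set("spark.akka.timeout", "300")
     .set("spark.storage.blockManagerSlaveTimeoutMs", "300000");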
 
 On Thu, Nov 13, 2014 at 11:31 AM, Darin McBeath ddmcbe...@yahoo.com.invalid 
 wrote:
 For one of my Spark jobs, my workers/executors are dying and leaving the 
 cluster.
 
 On the master, I see something like the following in the log file.  I'm 
 surprised to see the '60' seconds in the master log below because I 
 explicitly set it to '600' (or so I thought) in my spark job (see below).   
 This is happening at the end of my job when I'm trying to persist a large RDD 
 (probably around 300+GB) back to S3 (in 256 partitions).  My cluster consists 
 of 6 r3.8xlarge machines.  The job successfully works when I'm outputting 
 100GB or 200GB.
 
 If  you have any thoughts/insights, it would be appreciated. 
 
 Thanks.
 
 Darin.
 
 Here is where I'm setting the 'timeout' in my spark job.
 
 SparkConf conf = new SparkConf()
   .setAppName("SparkSync Application")
   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
   .set("spark.rdd.compress", "true")
   .set("spark.core.connection.ack.wait.timeout", "600");
 ​
 On the master, I see the following in the log file.
 
 14/11/13 17:20:39 WARN master.Master: Removing 
 worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 because we got no 
 heartbeat in 60 seconds
 14/11/13 17:20:39 INFO master.Master: Removing worker 
 worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 on 
 ip-10-35-184-232.ec2.internal:51877
 14/11/13 17:20:39 INFO master.Master: Telling app of lost executor: 2
 
 On a worker, I see something like the following in the log file.
 
 14/11/13 17:20:58 WARN util.AkkaUtils: Error sending message in 1 attempts
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
   at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
   at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
   at scala.concurrent.Await$.result(package.scala:107)
   at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:362)
 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.net.SocketException) caught when processing request: Broken pipe
 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:32 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.net.SocketException) caught when processing request: Broken pipe
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.io.IOException) caught when processing request: Resetting to invalid 
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.io.IOException) caught when processing request: Resetting to invalid 
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.io.IOException) caught when processing request: Resetting to invalid 
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.io.IOException) caught when processing request: Resetting to invalid 
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.io.IOException) caught when processing request: Resetting to invalid 
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.io.IOException) caught when processing request: Resetting to invalid 
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.io.IOException) caught when processing request: Resetting to invalid 
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception 
 (java.io.IOException) caught when processing request: Resetting to invalid 
 mark
 14/11/13 17:21:34 INFO 

Confused why I'm losing workers/executors when writing a large file to S3

2014-11-13 Thread Darin McBeath
For one of my Spark jobs, my workers/executors are dying and leaving the 
cluster.

On the master, I see something like the following in the log file.  I'm 
surprised to see the '60' seconds in the master log below because I explicitly 
set it to '600' (or so I thought) in my spark job (see below).   This is 
happening at the end of my job when I'm trying to persist a large RDD (probably 
around 300+GB) back to S3 (in 256 partitions).  My cluster consists of 6 
r3.8xlarge machines.  The job successfully works when I'm outputting 100GB or 
200GB.

If  you have any thoughts/insights, it would be appreciated. 

Thanks.

Darin.

Here is where I'm setting the 'timeout' in my spark job.
SparkConf conf = new SparkConf()
    .setAppName("SparkSync Application")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.rdd.compress", "true")
    .set("spark.core.connection.ack.wait.timeout", "600");
On the master, I see the following in the log file.

14/11/13 17:20:39 WARN master.Master: Removing worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 because we got no heartbeat in 60 seconds
14/11/13 17:20:39 INFO master.Master: Removing worker worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 on ip-10-35-184-232.ec2.internal:51877
14/11/13 17:20:39 INFO master.Master: Telling app of lost executor: 2

On a worker, I see something like the following in the log file.

14/11/13 17:20:58 WARN util.AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
  at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
  at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
  at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
  at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
  at scala.concurrent.Await$.result(package.scala:107)
  at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
  at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:362)
14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) caught when processing request: Broken pipe
14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:32 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) caught when processing request: Broken pipe
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 times, which exceeds the maximum retry count of 5
14/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 times, which exceeds the maximum retry count of 5
14/11/13 17:22:57 WARN util.AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]


Re: Confused why I'm losing workers/executors when writing a large file to S3

2014-11-13 Thread Sonal Goyal
Hi Darin,

In our case, we were getting the error due to long GC pauses in our app.
Fixing the underlying code helped us remove this error. This is also
mentioned as point 1 in the link below:

http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/%3cca+-p3ah5aamgtke6viycwb24ohsnmaqm1q9x53abwb_arvw...@mail.gmail.com%3E
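
If it helps to confirm GC as the culprit, one option (a sketch only, not from this thread) is to turn on GC logging on the executors and look for long pauses around the time of the heartbeat/ack errors:

// Hypothetical sketch: print executor GC activity so long pauses can be
// correlated with the timeout warnings in the worker logs.
SparkConf conf = new SparkConf()
    .set("spark.executor.extraJavaOptions",
         "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");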


Best Regards,
Sonal
Founder, Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal



On Fri, Nov 14, 2014 at 1:01 AM, Darin McBeath ddmcbe...@yahoo.com.invalid
wrote:

 For one of my Spark jobs, my workers/executors are dying and leaving the
 cluster.

 On the master, I see something like the following in the log file.  I'm
 surprised to see the '60' seconds in the master log below because I
 explicitly set it to '600' (or so I thought) in my spark job (see below).
 This is happening at the end of my job when I'm trying to persist a large
 RDD (probably around 300+GB) back to S3 (in 256 partitions).  My cluster
 consists of 6 r3.8xlarge machines.  The job successfully works when I'm
 outputting 100GB or 200GB.

 If  you have any thoughts/insights, it would be appreciated.

 Thanks.

 Darin.

 Here is where I'm setting the 'timeout' in my spark job.

 SparkConf conf = new SparkConf()
   .setAppName("SparkSync Application")
   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
   .set("spark.rdd.compress", "true")
   .set("spark.core.connection.ack.wait.timeout", "600");
 ​
 On the master, I see the following in the log file.

 14/11/13 17:20:39 WARN master.Master: Removing
 worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 because we got no
 heartbeat in 60 seconds
 14/11/13 17:20:39 INFO master.Master: Removing worker
 worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 on
 ip-10-35-184-232.ec2.internal:51877
 14/11/13 17:20:39 INFO master.Master: Telling app of lost executor: 2

 On a worker, I see something like the following in the log file.

 14/11/13 17:20:58 WARN util.AkkaUtils: Error sending message in 1 attempts
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
   at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
   at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
   at scala.concurrent.Await$.result(package.scala:107)
   at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:362)
 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: I/O exception
 (java.net.SocketException) caught when processing request: Broken pipe
 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:32 INFO httpclient.HttpMethodDirector: I/O exception
 (java.net.SocketException) caught when processing request: Broken pipe
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
 (java.io.IOException) caught when processing request: Resetting to invalid
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
 (java.io.IOException) caught when processing request: Resetting to invalid
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
 (java.io.IOException) caught when processing request: Resetting to invalid
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
 (java.io.IOException) caught when processing request: Resetting to invalid
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
 (java.io.IOException) caught when processing request: Resetting to invalid
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
 (java.io.IOException) caught when processing request: Resetting to invalid
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
 (java.io.IOException) caught when processing request: Resetting to invalid
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
 (java.io.IOException) caught when processing request: Resetting to invalid
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 times, which
 exceeds the maximum retry count of 5
 14/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 

Re: Confused why I'm losing workers/executors when writing a large file to S3

2014-11-13 Thread Reynold Xin
Darin,

You might want to increase these config options also:

spark.akka.timeout 300
spark.storage.blockManagerSlaveTimeoutMs 300000
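
For reference, a minimal spark-defaults.conf sketch with the same values (assuming these properties are honored there like any other Spark setting; untested):

spark.akka.timeout                        300
spark.storage.blockManagerSlaveTimeoutMs  300000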

On Thu, Nov 13, 2014 at 11:31 AM, Darin McBeath ddmcbe...@yahoo.com.invalid
 wrote:

 For one of my Spark jobs, my workers/executors are dying and leaving the
 cluster.

 On the master, I see something like the following in the log file.  I'm
 surprised to see the '60' seconds in the master log below because I
 explicitly set it to '600' (or so I thought) in my spark job (see below).
 This is happening at the end of my job when I'm trying to persist a large
 RDD (probably around 300+GB) back to S3 (in 256 partitions).  My cluster
 consists of 6 r3.8xlarge machines.  The job successfully works when I'm
 outputting 100GB or 200GB.

 If  you have any thoughts/insights, it would be appreciated.

 Thanks.

 Darin.

 Here is where I'm setting the 'timeout' in my spark job.

 SparkConf conf = new SparkConf()
   .setAppName("SparkSync Application")
   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
   .set("spark.rdd.compress", "true")
   .set("spark.core.connection.ack.wait.timeout", "600");
 ​
 On the master, I see the following in the log file.

 14/11/13 17:20:39 WARN master.Master: Removing
 worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 because we got no
 heartbeat in 60 seconds
 14/11/13 17:20:39 INFO master.Master: Removing worker
 worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 on
 ip-10-35-184-232.ec2.internal:51877
 14/11/13 17:20:39 INFO master.Master: Telling app of lost executor: 2

 On a worker, I see something like the following in the log file.

 14/11/13 17:20:58 WARN util.AkkaUtils: Error sending message in 1 attempts
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
   at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
   at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
   at scala.concurrent.Await$.result(package.scala:107)
   at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:362)
 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: I/O exception
 (java.net.SocketException) caught when processing request: Broken pipe
 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:32 INFO httpclient.HttpMethodDirector: I/O exception
 (java.net.SocketException) caught when processing request: Broken pipe
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
 (java.io.IOException) caught when processing request: Resetting to invalid
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
 (java.io.IOException) caught when processing request: Resetting to invalid
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
 (java.io.IOException) caught when processing request: Resetting to invalid
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
 (java.io.IOException) caught when processing request: Resetting to invalid
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
 (java.io.IOException) caught when processing request: Resetting to invalid
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
 (java.io.IOException) caught when processing request: Resetting to invalid
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
 (java.io.IOException) caught when processing request: Resetting to invalid
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
 (java.io.IOException) caught when processing request: Resetting to invalid
 mark
 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
 14/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 times, which
 exceeds the maximum retry count of 5
 14/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 times, which
 exceeds the maximum retry count of 5
 14/11/13 17:22:57 WARN util.AkkaUtils: Error sending message in 1 attempts
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]