Re: Confused why I'm losing workers/executors when writing a large file to S3
I'm getting the same issue on Spark 1.2.0. Despite having set "spark.core.connection.ack.wait.timeout" in spark-defaults.conf, and having verified it in the job UI (port 4040) Environment tab, I still get the "no heartbeat in 60 seconds" error.

spark.core.connection.ack.wait.timeout=3600

15/01/22 07:29:36 WARN master.Master: Removing worker-20150121231529-numaq1-4-34948 because we got no heartbeat in 60 seconds

On 14 Nov 2014, at 3:04 pm, Reynold Xin r...@databricks.com wrote:

Darin,

You might want to increase these config options also:

spark.akka.timeout 300
spark.storage.blockManagerSlaveTimeoutMs 30
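As a further sanity check alongside the spark-defaults.conf entry and the UI Environment tab, the effective value can be read back from the driver's configuration at runtime. The sketch below is illustrative only (the class name and fallback string are invented for this example, not taken from the thread); it simply prints whatever value the submitted job actually picked up.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class AckTimeoutCheck {
        public static void main(String[] args) {
            // An empty SparkConf picks up values from spark-defaults.conf
            // (and any --conf flags) at submit time.
            SparkConf conf = new SparkConf().setAppName("AckTimeoutCheck");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Print the value the driver actually sees; falls back if the key is unset.
            System.out.println("spark.core.connection.ack.wait.timeout = "
                + sc.sc().getConf().get("spark.core.connection.ack.wait.timeout", "<not set>"));

            sc.stop();
        }
    }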
Confused why I'm losing workers/executors when writing a large file to S3
For one of my Spark jobs, my workers/executors are dying and leaving the cluster. On the master, I see something like the following in the log file. I'm surprised to see the '60' seconds in the master log below because I explicitly set it to '600' (or so I thought) in my Spark job (see below). This is happening at the end of my job when I'm trying to persist a large RDD (probably around 300+ GB) back to S3 (in 256 partitions). My cluster consists of 6 r3.8xlarge machines. The job works when I'm outputting 100 GB or 200 GB.

If you have any thoughts/insights, they would be appreciated.

Thanks.

Darin.

Here is where I'm setting the timeout in my Spark job:

    SparkConf conf = new SparkConf()
        .setAppName("SparkSync Application")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.rdd.compress", "true")
        .set("spark.core.connection.ack.wait.timeout", "600");

On the master, I see the following in the log file:

14/11/13 17:20:39 WARN master.Master: Removing worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 because we got no heartbeat in 60 seconds
14/11/13 17:20:39 INFO master.Master: Removing worker worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 on ip-10-35-184-232.ec2.internal:51877
14/11/13 17:20:39 INFO master.Master: Telling app of lost executor: 2

On a worker, I see something like the following in the log file:

14/11/13 17:20:58 WARN util.AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
        at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:362)
14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) caught when processing request: Broken pipe
14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:32 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) caught when processing request: Broken pipe
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 times, which exceeds the maximum retry count of 5
14/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 times, which exceeds the maximum retry count of 5
14/11/13 17:22:57 WARN util.AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
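For context, here is a minimal sketch of the kind of write described above: a large RDD repartitioned to 256 partitions and saved back to S3. The bucket, paths, and input are hypothetical placeholders; the thread does not show the actual save call.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkSyncSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("SparkSync Application")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .set("spark.rdd.compress", "true")
                .set("spark.core.connection.ack.wait.timeout", "600");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Hypothetical input; the real job builds an RDD of roughly 300+ GB.
            JavaRDD<String> records = sc.textFile("s3n://example-bucket/input/");

            // Write the result back to S3 in 256 partitions, as described above.
            records.repartition(256)
                   .saveAsTextFile("s3n://example-bucket/output/");

            sc.stop();
        }
    }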
Re: Confused why I'm losing workers/executors when writing a large file to S3
Hi Darin,

In our case, we were getting the error due to long GC pauses in our app. Fixing the underlying code helped us remove this error. This is also mentioned as point 1 in the link below:

http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/%3cca+-p3ah5aamgtke6viycwb24ohsnmaqm1q9x53abwb_arvw...@mail.gmail.com%3E

Best Regards,
Sonal
Founder, Nube Technologies
http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
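If long GC pauses are the suspect, as described above, one way to confirm is to enable GC logging on the executors and compare pause lengths against the 60-second heartbeat window. A minimal sketch, assuming standard HotSpot GC flags; the exact option string is illustrative, not taken from the thread.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class GcLoggingSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("GcLoggingSketch")
                // GC events then show up in each executor's stdout/stderr, where
                // long stop-the-world pauses can be compared against the heartbeat timeout.
                .set("spark.executor.extraJavaOptions",
                     "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // ... job body would go here ...

            sc.stop();
        }
    }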
Re: Confused why I'm losing workers/executors when writing a large file to S3
Darin,

You might want to increase these config options also:

spark.akka.timeout 300
spark.storage.blockManagerSlaveTimeoutMs 30
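A minimal sketch of applying these suggestions programmatically, alongside the ack timeout the job already sets. The values are illustrative placeholders; the millisecond value in particular is an assumption for this example, not a recommendation from the thread.

    import org.apache.spark.SparkConf;

    public class TimeoutConfSketch {
        public static void main(String[] args) {
            // Illustrative values only; tune for the actual workload.
            SparkConf conf = new SparkConf()
                .setAppName("SparkSync Application")
                .set("spark.core.connection.ack.wait.timeout", "600")
                .set("spark.akka.timeout", "300")                            // seconds
                .set("spark.storage.blockManagerSlaveTimeoutMs", "300000");  // placeholder, in milliseconds

            // Print the resulting configuration for inspection.
            System.out.println(conf.toDebugString());
        }
    }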