Re: Confused why I'm losing workers/executors when writing a large file to S3

2015-01-21 Thread Tsai Li Ming
I’m getting the same issue on Spark 1.2.0. Despite having set “spark.core.connection.ack.wait.timeout” in spark-defaults.conf and verified it in the job UI (port 4040) Environment tab, I still get the “no heartbeat in 60 seconds” error.

spark.core.connection.ack.wait.timeout=3600
15/01/22
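For anyone comparing notes, a minimal sketch (my own illustration, not from the message) of reading the property back from a running Scala job to confirm the spark-defaults.conf value actually took effect; the property name and the 3600-second value are taken from the post above.

// Minimal sketch (assumed job code): read the property back from the live
// SparkContext to confirm the spark-defaults.conf setting was picked up.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ack-timeout-check"))  // illustrative app name

// Value is in seconds; 3600 is the value quoted in the post above.
// getOption avoids an exception if the property was never set at all.
val ackTimeout = sc.getConf.getOption("spark.core.connection.ack.wait.timeout")
println(s"spark.core.connection.ack.wait.timeout = ${ackTimeout.getOrElse("<not set>")}")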

Confused why I'm losing workers/executors when writing a large file to S3

2014-11-13 Thread Darin McBeath
For one of my Spark jobs, my workers/executors are dying and leaving the cluster. On the master, I see something like the following in the log file.  I'm surprised to see the '60' seconds in the master log below because I explicitly set it to '600' (or so I thought) in my spark job (see
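One common reason a value set in job code does not take effect is that it is applied only after the SparkContext has already been created. A minimal sketch of setting it up front, assuming a Scala job (this is illustrative code, not Darin's actual job):

// Minimal sketch (assumed code): timeout-related properties must be set on the
// SparkConf *before* the SparkContext is constructed, otherwise the default
// (60 seconds for spark.core.connection.ack.wait.timeout) stays in force.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("s3-large-file-write")                        // illustrative app name
  .set("spark.core.connection.ack.wait.timeout", "600")     // seconds, the value the poster intended

val sc = new SparkContext(conf)  // picks up the setting at construction time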

Re: Confused why I'm losing workers/executors when writing a large file to S3

2014-11-13 Thread Sonal Goyal
Hi Darin, In our case, we were getting the error due to long GC pauses in our app. Fixing the underlying code helped us remove this error. This is also mentioned as point 1 in the link below:
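For what it's worth, a minimal sketch (my own illustration, not from Sonal's message) of turning on GC logging for the executors, which makes long stop-the-world pauses visible in each executor's stderr log and helps confirm this diagnosis; the flags are standard HotSpot options on the Java 7/8 JVMs typically used with Spark 1.x.

// Sketch (illustrative, not from the thread): enable GC logging on executors so
// long pauses show up in the executor logs alongside the lost-heartbeat errors.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("gc-pause-diagnosis")   // illustrative app name
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")

val sc = new SparkContext(conf)
// If the GC log shows pauses approaching or exceeding the 60-second heartbeat
// window, the executor can be reported as lost even though the process is alive.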

Re: Confused why I'm losing workers/executors when writing a large file to S3

2014-11-13 Thread Reynold Xin
Darin,

You might want to increase these config options also:

spark.akka.timeout 300
spark.storage.blockManagerSlaveTimeoutMs 30

On Thu, Nov 13, 2014 at 11:31 AM, Darin McBeath ddmcbe...@yahoo.com.invalid wrote: For one of my Spark jobs, my workers/executors are dying and leaving the
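For reference, a sketch of applying these suggestions from a Scala job, which is an assumption on my part rather than anything shown in the thread. spark.akka.timeout is in seconds and spark.storage.blockManagerSlaveTimeoutMs in milliseconds; the value for the latter is cut off in the digest above, so the number below is illustrative only.

// Sketch (assumed code, not from the thread): raise the Akka and block-manager
// timeouts alongside the ack wait timeout discussed earlier in the thread.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("s3-large-file-write")                          // illustrative app name
  .set("spark.core.connection.ack.wait.timeout", "600")       // seconds
  .set("spark.akka.timeout", "300")                           // seconds, value from the message
  .set("spark.storage.blockManagerSlaveTimeoutMs", "300000")  // milliseconds; illustrative value,
                                                              // the number in the digest is truncated

val sc = new SparkContext(conf)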