Re: spark time out

2014-09-23 Thread Chen Song
I am running the job on 500 executors, each with 8 GB of memory and 1 core.

I see a lot of fetch failures in the reduce stage when running a simple
reduceByKey.

map tasks - 4000
reduce tasks - 200
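
To put those numbers in perspective, here is a rough back-of-the-envelope sketch of the shuffle fan-out for this job shape. It only does arithmetic on the figures above. The function name and the assumptions (every map task produces output for every reduce partition, and map tasks are spread evenly over executors) are illustrative upper bounds, not measurements or a diagnosis.

```python
# Job shape taken from the message above.
MAP_TASKS = 4000
REDUCE_TASKS = 200
EXECUTORS = 500  # 1 core each, so at most 500 tasks run at the same time


def shuffle_fan_out(map_tasks, reduce_tasks, executors):
    """Return rough upper bounds on shuffle fetch activity (hypothetical helper)."""
    # Each reduce task needs its partition from every map task's output.
    blocks_per_reducer = map_tasks
    total_blocks = map_tasks * reduce_tasks
    # With one core per executor, concurrent reducers cannot exceed the executor count.
    concurrent_reducers = min(reduce_tasks, executors)
    # Assumption: map tasks were distributed perfectly evenly across executors.
    map_outputs_per_executor = map_tasks / executors
    return {
        "blocks_per_reducer": blocks_per_reducer,
        "total_blocks": total_blocks,
        "concurrent_reducers": concurrent_reducers,
        "map_outputs_per_executor": map_outputs_per_executor,
    }


stats = shuffle_fan_out(MAP_TASKS, REDUCE_TASKS, EXECUTORS)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Under those assumptions, each reducer may request up to 4000 blocks, for up to 800,000 block fetches across the stage, with all 200 reducers able to run at once.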



On Mon, Sep 22, 2014 at 12:22 PM, Chen Song chen.song...@gmail.com wrote:

 I am using Spark 1.1.0 and have seen a lot of Fetch Failures due to the
 following exception.

 java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
     at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)
     at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:852)
     at scala.Option.foreach(Option.scala:236)
     at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:852)
     at java.util.TimerThread.mainLoop(Timer.java:555)
     at java.util.TimerThread.run(Timer.java:505)

 I have increased spark.core.connection.ack.wait.timeout to 120 seconds.
 That helped, but only slightly. I am fairly confident the failures are not
 caused by GC on the executors. What could be the reason for this?

 Chen




-- 
Chen Song


spark time out

2014-09-22 Thread Chen Song
I am using Spark 1.1.0 and have seen a lot of Fetch Failures due to the
following exception.

java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
    at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)
    at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:852)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:852)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)

I have increased spark.core.connection.ack.wait.timeout to 120 seconds.
That helped, but only slightly. I am fairly confident the failures are not
caused by GC on the executors. What could be the reason for this?
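
For anyone wanting to reproduce the setting, here is a minimal sketch of one way such a property could be passed on the command line. It assumes the spark-submit in use accepts `--conf key=value` arguments. The helper name `conf_args` and the application file `my_job.py` are placeholders, not part of this job. The snippet only builds and prints the command, it does not launch anything.

```python
def conf_args(settings):
    """Render a dict of Spark properties as `--conf key=value` arguments (hypothetical helper)."""
    args = []
    for key in sorted(settings):
        args.extend(["--conf", f"{key}={settings[key]}"])
    return args


# Value in seconds, as described above; the exception reports a 60 sec wait before the change.
settings = {"spark.core.connection.ack.wait.timeout": 120}

# "my_job.py" is a placeholder application name, not something from this thread.
command = ["spark-submit"] + conf_args(settings) + ["my_job.py"]
print(" ".join(command))
```

The same key/value pair could instead be placed in the application's Spark configuration; the command-line form is shown only because it is easy to inspect.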

Chen