I am running the job on 500 executors, each with 8G memory and 1 core.
I see lots of fetch failures in the reduce stage when running a simple
reduceByKey.
map tasks: 4000
reduce tasks: 200
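For readers unfamiliar with the operation in question: reduceByKey merges all values for each key with a combining function, which is what forces the shuffle (and the fetches) between the 4000 map tasks and the 200 reduce tasks. Its per-key semantics can be sketched locally in plain Scala, no cluster needed; `ReduceByKeyDemo` and `reduceByKeyLocal` are illustrative names, not Spark API:

```scala
// Local illustration of what Spark's rdd.reduceByKey(f) computes per key.
// In a real job this would be sc.parallelize(data).reduceByKey(_ + _).
object ReduceByKeyDemo {
  // Group pairs by key, then fold each key's values with f.
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  def main(args: Array[String]): Unit = {
    val data = Seq(("a", 1), ("b", 2), ("a", 3))
    println(reduceByKeyLocal(data)(_ + _)) // keys "a" and "b" with summed values
  }
}
```

In the distributed case each of the 200 reduce tasks must fetch its slice of map output from every map task, which is why a flaky or overloaded connection surfaces as fetch failures at this stage.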
On Mon, Sep 22, 2014 at 12:22 PM, Chen Song chen.song...@gmail.com wrote:
I am using Spark 1.1.0 and have seen a lot of Fetch Failures due to the
following exception.
java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
    at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)
    at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:852)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:852)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
I have increased spark.core.connection.ack.wait.timeout to 120 seconds.
That relieved the situation somewhat, but not by much. I am fairly
confident it is not due to GC on the executors. What could be the
reason for this?
Chen
--
Chen Song
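For reference, the timeout discussed above can be set at submit time rather than in code. This is a sketch assuming a YARN deployment (--num-executors is YARN-specific in Spark 1.x); the resource values mirror the job described in this thread, and your-job.jar is a placeholder:

```shell
# Raise the ConnectionManager ack timeout (seconds in Spark 1.1) from the
# default of 60, and request the executor layout described above.
spark-submit \
  --conf spark.core.connection.ack.wait.timeout=120 \
  --num-executors 500 \
  --executor-memory 8g \
  --executor-cores 1 \
  your-job.jar
```

Note that raising the timeout only masks slow acks; if the cause is network saturation or an overloaded executor serving shuffle blocks, the fetch failures tend to persist, which matches the "relieved but not much" observation above.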