Hi, I am hitting a FetchFailed issue when the driver collects about 2.5M short strings (about 10 characters each) from a YARN cluster with 400 nodes:
14/08/22 11:43:27 WARN scheduler.TaskSetManager: Lost task 205.0 in stage 0.0 (TID 1228, aaa.xxx.com): FetchFailed(BlockManagerId(220, aaa.xxx.com, 37899, 0), shuffleId=0, mapId=420, reduceId=205)
14/08/22 11:43:27 WARN scheduler.TaskSetManager: Lost task 603.0 in stage 0.0 (TID 1626, aaa.xxx.com): FetchFailed(BlockManagerId(220, aaa.xxx.com, 37899, 0), shuffleId=0, mapId=420, reduceId=603)

Other than these FetchFailed warnings, I cannot find anything else in the log files (no OOM errors). The problem does not occur when collecting only 2M lines. I suspected the Akka message size limit, so I set the following:

spark.akka.frameSize 100
spark.akka.timeout 200

but that did not help either. Has anyone experienced similar problems?

Thanks,
Jiayu

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/FetchFailed-when-collect-at-YARN-cluster-tp12670.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
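In case I am setting the options in the wrong place: below is a sketch of how such settings are typically passed, either in conf/spark-defaults.conf or as --conf flags to spark-submit (note spark.akka.frameSize is interpreted in MB):

    # conf/spark-defaults.conf (frameSize value is in MB)
    spark.akka.frameSize   100
    spark.akka.timeout     200

    # or equivalently on the spark-submit command line
    spark-submit --conf spark.akka.frameSize=100 \
                 --conf spark.akka.timeout=200 \
                 ...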