Hi,

I am running a simple job on Spark 1.6 in which I leftOuterJoin a large RDD with a much smaller one. I am not broadcasting the smaller RDD yet, but I am still running into FetchFailed errors, and the job eventually gets killed.
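For reference, the join itself looks roughly like the sketch below. The paths, key extraction, and names are illustrative placeholders, not my actual code; the 5000-partition count is the real one:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.HashPartitioner

    object LeftJoinJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("left-outer-join"))

        // Stand-ins for my real inputs; both RDDs are keyed the same way
        val bigRdd   = sc.textFile("hdfs:///data/big").map(l => (l.split('\t')(0), l))
        val smallRdd = sc.textFile("hdfs:///data/small").map(l => (l.split('\t')(0), l))

        // Shuffle join of the big RDD against the small one at 5000 partitions
        val joined = bigRdd
          .partitionBy(new HashPartitioner(5000))
          .leftOuterJoin(smallRdd)

        joined.saveAsTextFile("hdfs:///data/joined")
        sc.stop()
      }
    }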
I have already partitioned the data into 5000 partitions, and on every run the first 2,000 to 3,000 tasks finish without errors before the job starts hitting this exception. Digging further into the stack traces, for some tasks I see errors like the ones below, but if there were a genuine network issue the first 2,000+ tasks should not have succeeded either:

    Caused by: java.io.IOException: Connection reset by peer
    Caused by: java.io.IOException: Failed to connect to <host>

I am running on the YARN cluster manager with 200 executors and 6 GB of heap for both the executors and the driver. In an earlier run I saw errors related to spark.yarn.executor.memoryOverhead, so I have set it to 1.5 GB and no longer see those errors. The exact submit settings are below.
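For completeness, this is roughly how I am submitting the job. The class and jar names are placeholders matching the sketch above; the memory and executor settings are the real ones, and spark.yarn.executor.memoryOverhead is specified in MB, so 1.5 GB is 1536:

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 200 \
      --executor-memory 6g \
      --driver-memory 6g \
      --conf spark.yarn.executor.memoryOverhead=1536 \
      --class com.example.LeftJoinJob \
      left-join-job.jar

Any help will be much appreciated.

Thanks,
Ankur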