Job keeps aborting because of org.apache.spark.shuffle.FetchFailedException: Failed to connect to server/ip:39232

jeff saremi Fri, 28 Jul 2017 09:57:39 -0700

We have a Spark job, neither very complex nor very large, that keeps dying 
with the error in the subject line.


I have researched it and have not found a convincing explanation of why it happens.

I am not using a shuffle service. Which server is the one that is refusing the 
connection?
If I go to the server that is being reported in the error message, I see a lot 
of these errors towards the end:


java.io.FileNotFoundException: 
D:\data\yarnnm\local\usercache\hadoop\appcache\application_1500970459432_1024\blockmgr-7f3a1abc-2b8b-4e51-9072-8c12495ec563\0e\shuffle_0_4107_0.index

(may or may not be related to the problem at all)


If you look further on this machine, there are also FetchFailedExceptions 
involving other machines, and so on down the chain.

This is Spark 1.6 on Yarn-master

Could anyone provide some insight or solution to this?

Thanks
