Thanks Juan for taking the time. Here's more info:

- This is running on YARN in cluster mode (yarn-cluster)
- See config params below
- This is a corporate environment. In general, nodes should not be added to or removed from the cluster that often. Even when that does happen, I would expect it to affect one or two servers. In my case I get hundreds of these errors before the job fails.

--master yarn-cluster ^
--driver-memory 96G ^
--executor-memory 48G ^
--num-executors 150 ^
--executor-cores 8 ^
--driver-cores 8 ^
--conf spark.yarn.executor.memoryOverhead=36000 ^
--conf spark.shuffle.service.enabled=true ^
--conf spark.yarn.submit.waitAppCompletion=false ^
--conf spark.yarn.submit.file.replication=64 ^
--conf spark.yarn.maxAppAttempts=1 ^
--conf spark.speculation=true ^
--conf spark.speculation.quantile=0.9 ^
--conf spark.yarn.executor.nodeLabelExpression="prod" ^
--conf spark.yarn.am.nodeLabelExpression="prod" ^
--conf spark.stage.maxConsecutiveAttempts=1000 ^
--conf spark.yarn.scheduler.heartbeat.interval-ms=15000 ^
--conf spark.yarn.launchContainer.count.simultaneously=50 ^
--conf spark.driver.maxResultSize=16G ^
--conf spark.network.timeout=1000s ^

________________________________
From: Juan Rodríguez Hortalá <juan.rodriguez.hort...@gmail.com>
Sent: Friday, July 28, 2017 4:20:40 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: Job keeps aborting because of org.apache.spark.shuffle.FetchFailedException: Failed to connect to server/ip:39232

Hi Jeff,

Can you provide more information about how you are running your job? In particular:

- Which cluster manager are you using? Is it YARN, Mesos, or Spark Standalone?
- Which configuration options are you using to submit the job? In particular, are you using dynamic allocation or the external shuffle service? You should be able to see this in the Environment tab of the Spark UI, looking for spark.dynamicAllocation.enabled and spark.shuffle.service.enabled.
- In which environment are you running the jobs? Is this an on-premise cluster or some cloud provider? Are you adding or removing nodes from the cluster during job execution?

FetchFailedException errors happen during execution when an executor is not able to read the shuffle blocks for a previous stage that are served by another executor. That might happen if the executor that has to serve the files dies and internal shuffle (i.e., no external shuffle service) is used, although there can be other reasons, like network errors. If you are using dynamic allocation, then you should also enable the external shuffle service so shuffle blocks can be served by the node manager after the executor that created the blocks is terminated; see https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation for more details.

On Fri, Jul 28, 2017 at 9:57 AM, jeff saremi <jeffsar...@hotmail.com> wrote:

We have a not-too-complex and not-too-large Spark job that keeps dying with this error. I have researched it and have not seen any convincing explanation of why.

I am not using a shuffle service. Which server is the one that is refusing the connection? If I go to the server that is being reported in the error message, I see a lot of these errors towards the end:

java.io.FileNotFoundException: D:\data\yarnnm\local\usercache\hadoop\appcache\application_1500970459432_1024\blockmgr-7f3a1abc-2b8b-4e51-9072-8c12495ec563\0e\shuffle_0_4107_0.index

(which may or may not be related to the problem at all), and if you examine further on this machine there are FetchFailedExceptions resulting from other machines, and so on and so forth.

This is Spark 1.6 with the YARN master.

Could anyone provide some insight or a solution to this?

thanks
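
A note on checking the settings Juan mentions above: besides the Environment tab, the effective configuration can also be printed from the driver. A minimal sketch in Scala, assuming a spark-shell session against this cluster where sc is the predefined SparkContext (Spark 1.6):

// Print the shuffle- and dynamic-allocation-related settings in effect for
// this application (the same values shown in the Environment tab of the UI).
sc.getConf.getAll
  .filter { case (k, _) =>
    k.startsWith("spark.shuffle") || k.startsWith("spark.dynamicAllocation")
  }
  .foreach { case (k, v) => println(s"$k = $v") }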
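
For reference, the external shuffle service setup Juan recommends involves both the job configuration and the YARN NodeManagers. A minimal sketch, assuming Spark on YARN as in this thread; the property names follow the Spark-on-YARN documentation, and the aux-services value shown is illustrative (it should preserve whatever aux services the cluster already runs):

# spark-submit flags, in addition to the ones above
--conf spark.dynamicAllocation.enabled=true ^
--conf spark.shuffle.service.enabled=true ^

# yarn-site.xml on every NodeManager (the spark-<version>-yarn-shuffle.jar must be
# on the NodeManager classpath, and the NodeManagers restarted afterwards)
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>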