Thanks, Juan, for taking the time.

Here's more info:
- This is running on YARN in cluster mode (yarn-cluster)

- See config params below

- This is a corporate environment. In general, nodes are not added to or 
removed from the cluster very often, and even when that happens I would expect 
it to affect only one or two servers. In my case I get hundreds of these errors 
before the job fails.


  --master yarn-cluster ^
  --driver-memory 96G ^
  --executor-memory 48G ^
  --num-executors 150 ^
  --executor-cores 8 ^
  --driver-cores 8 ^
  --conf spark.yarn.executor.memoryOverhead=36000 ^
  --conf spark.shuffle.service.enabled=true ^
  --conf spark.yarn.submit.waitAppCompletion=false ^
  --conf spark.yarn.submit.file.replication=64 ^
  --conf spark.yarn.maxAppAttempts=1 ^
  --conf spark.speculation=true ^
  --conf spark.speculation.quantile=0.9 ^
  --conf spark.yarn.executor.nodeLabelExpression="prod" ^
  --conf spark.yarn.am.nodeLabelExpression="prod" ^
  --conf spark.stage.maxConsecutiveAttempts=1000 ^
  --conf spark.yarn.scheduler.heartbeat.interval-ms=15000 ^
  --conf spark.yarn.launchContainer.count.simultaneously=50 ^
  --conf spark.driver.maxResultSize=16G ^
  --conf spark.network.timeout=1000s ^
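
A note on the flags above: on YARN, spark.shuffle.service.enabled=true only 
takes effect if each NodeManager also runs Spark's auxiliary shuffle service. 
Per the Spark-on-YARN docs, the yarn-site.xml registration looks roughly like 
this (spark_shuffle goes alongside whatever aux services are already 
configured):

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>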


________________________________
From: Juan Rodríguez Hortalá <juan.rodriguez.hort...@gmail.com>
Sent: Friday, July 28, 2017 4:20:40 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: Job keeps aborting because of 
org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
server/ip:39232

Hi Jeff,

Can you provide more information about how you are running your job? In 
particular:
  - Which cluster manager are you using? Is it YARN, Mesos, or Spark Standalone?
  - Which configuration options are you using to submit the job? In particular, 
are you using dynamic allocation or the external shuffle service? You should be 
able to see this in the Environment tab of the Spark UI, looking for 
spark.dynamicAllocation.enabled and spark.shuffle.service.enabled (see the 
snippet after this list).
  - In which environment are you running the jobs? Is this an on-premises 
cluster or some cloud provider? Are you adding or removing nodes from the 
cluster during the job execution?
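
If the Spark UI is not handy, here is a quick sketch of checking the same 
settings from spark-shell (Scala; the "false" fallbacks are just assumptions 
for when the keys are unset):

  // sc is the SparkContext provided by spark-shell; get(key, default)
  // returns the configured value, or the default if the key is unset.
  sc.getConf.get("spark.dynamicAllocation.enabled", "false")
  sc.getConf.get("spark.shuffle.service.enabled", "false")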

FetchFailedException errors happen during execution when an executor is not 
able to read the shuffle blocks for a previous stage that are served by another 
executor. That can happen if the executor that has to serve the files dies and 
internal shuffle is used, although there can be other reasons, like network 
errors. If you are using dynamic allocation then you should also enable the 
external shuffle service, so shuffle blocks can be served by the node manager 
after the executor that created the blocks is terminated; see 
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
 for more details.
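
For example, here is a minimal sketch of the relevant spark-submit flags 
(property names as in the Spark docs; the executor counts are just illustrative 
values):

  --conf spark.dynamicAllocation.enabled=true
  --conf spark.shuffle.service.enabled=true
  --conf spark.dynamicAllocation.minExecutors=2
  --conf spark.dynamicAllocation.maxExecutors=150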



On Fri, Jul 28, 2017 at 9:57 AM, jeff saremi 
<jeffsar...@hotmail.com> wrote:

We have a not-too-complex and not-too-large Spark job that keeps dying with 
this error.

I have researched it and I have not seen any convincing explanation as to why.

I am not using a shuffle service. Which server is the one that is refusing the 
connection?
If I go to the server that is being reported in the error message, I see a lot 
of these errors towards the end:


java.io.FileNotFoundException: 
D:\data\yarnnm\local\usercache\hadoop\appcache\application_1500970459432_1024\blockmgr-7f3a1abc-2b8b-4e51-9072-8c12495ec563\0e\shuffle_0_4107_0.index

(may or may not be related to the problem at all)


And if you examine this machine further, there are FetchFailedExceptions 
resulting from other machines, and so on and so forth.

This is Spark 1.6 on YARN in cluster mode.

Could anyone provide some insight into, or a solution for, this?

thanks

