I think you should check the RPC target first; the NodeManager may have a memory problem such as GC, or some other issue. Also, why did you set --executor-cores 8?
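To put the submit options quoted below in perspective, here is a rough sketch of the arithmetic in Python. The numbers are copied from the quoted command line; the helper name container_request_mb is only illustrative, and the calculation assumes the usual YARN behaviour that each executor container requests the executor heap plus spark.yarn.executor.memoryOverhead (YARN may round this up further), with spark.task.cpus left at its default of 1.

```python
# Values copied from the spark-submit options quoted in this thread.
EXECUTOR_MEMORY_GB = 48      # --executor-memory 48G
MEMORY_OVERHEAD_MB = 36000   # spark.yarn.executor.memoryOverhead=36000
NUM_EXECUTORS = 150          # --num-executors 150
EXECUTOR_CORES = 8           # --executor-cores 8


def container_request_mb(executor_memory_gb: int, overhead_mb: int) -> int:
    """Memory requested from YARN per executor: heap plus overhead, in MB.

    Illustrative helper; YARN may round the request up to its own
    allocation increment, which is not modelled here.
    """
    return executor_memory_gb * 1024 + overhead_mb


per_container_mb = container_request_mb(EXECUTOR_MEMORY_GB, MEMORY_OVERHEAD_MB)
total_mb = per_container_mb * NUM_EXECUTORS

# Assumes spark.task.cpus=1 (the default), so cores == concurrent tasks.
total_task_slots = NUM_EXECUTORS * EXECUTOR_CORES
heap_per_task_gb = EXECUTOR_MEMORY_GB / EXECUTOR_CORES

print(f"per executor container: {per_container_mb} MB")
print(f"all executors:          {total_mb} MB")
print(f"concurrent task slots:  {total_task_slots}")
print(f"heap per running task:  {heap_per_task_gb:.1f} GB (rough)")
```

Whether these requests are too large depends on how much memory and how many cores each NodeManager node has, which the thread does not say.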
2017-07-29 7:40 GMT+08:00 jeff saremi <[email protected]>:

> asking this on a tangent:
>
> Is there anyway for the shuffle data to be replicated to more than one
> server?
>
> thanks
>
> ------------------------------
> *From:* jeff saremi <[email protected]>
> *Sent:* Friday, July 28, 2017 4:38:08 PM
> *To:* Juan Rodríguez Hortalá
> *Cc:* [email protected]
> *Subject:* Re: Job keeps aborting because of
> org.apache.spark.shuffle.FetchFailedException:
> Failed to connect to server/ip:39232
>
> Thanks Juan for taking the time
>
> Here's more info:
> - This is running on Yarn in Master mode
>
> - See config params below
>
> - This is a corporate environment. In general nodes should not be added or
> removed that often to the cluster. Even if that is the case I would expect
> that to be one or 2 servers. In my case I get hundreds of these errors
> before the job fails.
>
> --master yarn-cluster ^
> --driver-memory 96G ^
> --executor-memory 48G ^
> --num-executors 150 ^
> --executor-cores 8 ^
> --driver-cores 8 ^
> --conf spark.yarn.executor.memoryOverhead=36000 ^
> --conf spark.shuffle.service.enabled=true ^
> --conf spark.yarn.submit.waitAppCompletion=false ^
> --conf spark.yarn.submit.file.replication=64 ^
> --conf spark.yarn.maxAppAttempts=1 ^
> --conf spark.speculation=true ^
> --conf spark.speculation.quantile=0.9 ^
> --conf spark.yarn.executor.nodeLabelExpression="prod" ^
> --conf spark.yarn.am.nodeLabelExpression="prod" ^
> --conf spark.stage.maxConsecutiveAttempts=1000 ^
> --conf spark.yarn.scheduler.heartbeat.interval-ms=15000 ^
> --conf spark.yarn.launchContainer.count.simultaneously=50 ^
> --conf spark.driver.maxResultSize=16G ^
> --conf spark.network.timeout=1000s ^
>
> ------------------------------
> *From:* Juan Rodríguez Hortalá <[email protected]>
> *Sent:* Friday, July 28, 2017 4:20:40 PM
> *To:* jeff saremi
> *Cc:* [email protected]
> *Subject:* Re: Job keeps aborting because of
> org.apache.spark.shuffle.FetchFailedException:
> Failed to connect to server/ip:39232
>
> Hi Jeff,
>
> Can you provide more information about how are you running your job? In
> particular:
> - which cluster manager are you using? It is YARN, Mesos, Spark
> Standalone?
> - with configuration options are you using to submit the job? In
> particular are you using dynamic allocation or external shuffle? You should
> be able to see this in the Environment tab of the Spark UI, looking
> for spark.dynamicAllocation.enabled and spark.shuffle.service.enabled.
> - in which environment are you running the jobs? Is this an on premise
> cluster or some cloud provider? Are you adding or removing nodes from the
> cluster during the job execution?
>
> FetchFailedException errors happen during execution when an executor is
> not able to read the shuffle blocks for a previous stage that are served by
> other executor. That might happen if the executor that has to serve the
> files dies and internal shuffle is used, although there can be other
> reasons like network errors. If you are using dynamic allocation then you
> should also enable external shuffle service so shuffle blocks can be served
> by the node manager after the executor that created the blocks is
> terminated, see
> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
> for more details.
>
> On Fri, Jul 28, 2017 at 9:57 AM, jeff saremi <[email protected]>
> wrote:
>
>> We have a not too complex and not too large spark job that keeps dying
>> with this error
>>
>> I have researched it and I have not seen any convincing explanation on why
>>
>> I am not using a shuffle service. Which server is the one that is
>> refusing the connection?
>> If I go to the server that is being reported in the error message, I see
>> a lot of these errors towards the end:
>>
>> java.io.FileNotFoundException:
>> D:\data\yarnnm\local\usercache\hadoop\appcache\application_1500970459432_1024\blockmgr-7f3a1abc-2b8b-4e51-9072-8c12495ec563\0e\shuffle_0_4107_0.index
>>
>> (may or may not be related to the problem at all)
>>
>> and if you examine further on this machine there are
>> fetchfailedexceptions resulting from other machines and so on and so forth
>>
>>
>> This is Spark 1.6 on Yarn-master
>>
>>
>> Could anyone provide some insight or solution to this?
>>
>> thanks
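
On Juan's point in the quoted thread that dynamic allocation should be paired with the external shuffle service: a small sketch of that check in Python, in case it helps when reviewing submit options. The two property names are the ones Juan mentions; the function names check_shuffle_settings and to_submit_args are hypothetical helpers, not part of Spark, and the check assumes both properties default to false when absent.

```python
def check_shuffle_settings(conf: dict) -> list:
    """Return warnings for the setting combination described in the thread.

    Hypothetical helper: `conf` maps Spark property names to string values.
    """
    def enabled(key: str) -> bool:
        # Assumption: a missing property counts as "false".
        return str(conf.get(key, "false")).lower() == "true"

    warnings = []
    if enabled("spark.dynamicAllocation.enabled") and not enabled(
        "spark.shuffle.service.enabled"
    ):
        warnings.append(
            "dynamic allocation is on but the external shuffle service is off: "
            "shuffle blocks cannot be served once their executor is removed"
        )
    return warnings


def to_submit_args(conf: dict) -> list:
    """Render the properties as spark-submit `--conf key=value` arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    # The one shuffle-related property present in the quoted submit command.
    quoted = {"spark.shuffle.service.enabled": "true"}
    print(check_shuffle_settings(quoted))
    print(" ".join(to_submit_args(quoted)))
```

This only checks the submit-side flags; it does not verify that the shuffle service is actually running on the NodeManagers.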
