Re: Job keeps aborting because of org.apache.spark.shuffle.FetchFailedException: Failed to connect to server/ip:39232

周康 Sat, 29 Jul 2017 21:54:36 -0700

I think you should check the RPC target first; the NodeManager may have a
memory problem, such as long GC pauses, or some other issue.
Also, I wonder why you set --executor-cores 8?
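
On the sizing question, here is a rough sketch of what the flags quoted below add up to. It assumes that on YARN each executor container requests the executor memory plus spark.yarn.executor.memoryOverhead, before YARN rounds the request up to its allocation increment; the constant and function names are only illustrative.

```python
# Rough container sizing implied by the quoted spark-submit flags.
# Assumption: each executor container requests executor memory plus
# spark.yarn.executor.memoryOverhead (before YARN rounds the request up).

EXECUTOR_MEMORY_MB = 48 * 1024   # --executor-memory 48G
MEMORY_OVERHEAD_MB = 36000       # spark.yarn.executor.memoryOverhead=36000
NUM_EXECUTORS = 150              # --num-executors 150
EXECUTOR_CORES = 8               # --executor-cores 8


def container_mb(executor_mb: int, overhead_mb: int) -> int:
    """Memory requested from YARN for one executor container, in MB."""
    return executor_mb + overhead_mb


per_container_mb = container_mb(EXECUTOR_MEMORY_MB, MEMORY_OVERHEAD_MB)
total_mb = per_container_mb * NUM_EXECUTORS
total_cores = NUM_EXECUTORS * EXECUTOR_CORES

print(f"per executor container: {per_container_mb} MB "
      f"(~{per_container_mb / 1024:.1f} GB)")
print(f"all executors:          {total_mb} MB")
print(f"concurrent task slots:  {total_cores}")
```

With these numbers each container asks for about 83 GB and the job can run 1200 tasks at once, so it is worth confirming that the nodes can actually host containers of that size.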


2017-07-29 7:40 GMT+08:00 jeff saremi <[email protected]>:

> Asking this on a tangent:
>
> Is there any way for the shuffle data to be replicated to more than one
> server?
>
> thanks
>
> ------------------------------
> *From:* jeff saremi <[email protected]>
> *Sent:* Friday, July 28, 2017 4:38:08 PM
> *To:* Juan Rodríguez Hortalá
>
> *Cc:* [email protected]
> *Subject:* Re: Job keeps aborting because of 
> org.apache.spark.shuffle.FetchFailedException:
> Failed to connect to server/ip:39232
>
>
> Thanks Juan for taking the time
>
> Here's more info:
> - This is running on Yarn in Master mode
>
> - See config params below
>
> - This is a corporate environment. In general, nodes should not be added to
> or removed from the cluster that often. Even if that happens, I would expect
> it to involve one or two servers. In my case I get hundreds of these errors
> before the job fails.
>
>   --master yarn-cluster ^
>   --driver-memory 96G ^
>   --executor-memory 48G ^
>   --num-executors 150 ^
>   --executor-cores 8 ^
>   --driver-cores 8 ^
>   --conf spark.yarn.executor.memoryOverhead=36000 ^
>   --conf spark.shuffle.service.enabled=true ^
>   --conf spark.yarn.submit.waitAppCompletion=false ^
>   --conf spark.yarn.submit.file.replication=64 ^
>   --conf spark.yarn.maxAppAttempts=1 ^
>   --conf spark.speculation=true ^
>   --conf spark.speculation.quantile=0.9 ^
>   --conf spark.yarn.executor.nodeLabelExpression="prod" ^
>   --conf spark.yarn.am.nodeLabelExpression="prod" ^
>   --conf spark.stage.maxConsecutiveAttempts=1000 ^
>   --conf spark.yarn.scheduler.heartbeat.interval-ms=15000 ^
>   --conf spark.yarn.launchContainer.count.simultaneously=50 ^
>   --conf spark.driver.maxResultSize=16G ^
>   --conf spark.network.timeout=1000s ^
>
> ------------------------------
> *From:* Juan Rodríguez Hortalá <[email protected]>
> *Sent:* Friday, July 28, 2017 4:20:40 PM
> *To:* jeff saremi
> *Cc:* [email protected]
> *Subject:* Re: Job keeps aborting because of 
> org.apache.spark.shuffle.FetchFailedException:
> Failed to connect to server/ip:39232
>
> Hi Jeff,
>
> Can you provide more information about how you are running your job? In
> particular:
>   - Which cluster manager are you using? Is it YARN, Mesos, or Spark
> Standalone?
>   - Which configuration options are you using to submit the job? In
> particular, are you using dynamic allocation or the external shuffle
> service? You should be able to see this in the Environment tab of the Spark
> UI, looking for spark.dynamicAllocation.enabled and
> spark.shuffle.service.enabled.
>   - In which environment are you running the jobs? Is this an on-premises
> cluster or a cloud provider? Are you adding or removing nodes from the
> cluster during job execution?
>
> FetchFailedException errors happen during execution when an executor is
> not able to read the shuffle blocks of a previous stage that are served by
> another executor. That can happen if the executor that has to serve the
> files dies and the internal shuffle is used, although there can be other
> causes, such as network errors. If you are using dynamic allocation, you
> should also enable the external shuffle service so that shuffle blocks can
> be served by the node manager after the executor that created them is
> terminated. See
> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
> for more details.
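
As a small illustration of the pairing described above, the sketch below checks a set of Spark properties for the case where dynamic allocation is on but the external shuffle service is off. The helper check_shuffle_conf is hypothetical and not part of Spark; only the two property names come from the message, and both are assumed to default to false when unset.

```python
from typing import Dict, List


def check_shuffle_conf(conf: Dict[str, str]) -> List[str]:
    """Hypothetical helper (not a Spark API): warn about shuffle settings."""

    def enabled(key: str) -> bool:
        # Treat a missing property as "false".
        return str(conf.get(key, "false")).strip().lower() == "true"

    warnings: List[str] = []
    dynamic = enabled("spark.dynamicAllocation.enabled")
    external = enabled("spark.shuffle.service.enabled")

    if dynamic and not external:
        warnings.append(
            "dynamic allocation is on but the external shuffle service is "
            "off: blocks from released executors cannot be fetched"
        )
    elif not external:
        warnings.append(
            "shuffle blocks are served by the executors themselves: "
            "an executor that dies takes its blocks with it"
        )
    return warnings


# Setting taken from the submit command quoted in this thread.
submitted = {"spark.shuffle.service.enabled": "true"}
for line in check_shuffle_conf(submitted) or ["no shuffle warnings"]:
    print(line)
```

For the flags quoted in this thread the check reports nothing, so the property itself is set; whether the shuffle service is actually running and healthy on every NodeManager is a separate question.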
>
>
>
> On Fri, Jul 28, 2017 at 9:57 AM, jeff saremi <[email protected]>
> wrote:
>
>> We have a not too complex and not too large spark job that keeps dying
>> with this error
>>
>> I have researched it and have not seen any convincing explanation of why.
>>
>> I am not using a shuffle service. Which server is the one that is
>> refusing the connection?
>> If I go to the server that is being reported in the error message, I see
>> a lot of these errors towards the end:
>>
>> java.io.FileNotFoundException: 
>> D:\data\yarnnm\local\usercache\hadoop\appcache\application_1500970459432_1024\blockmgr-7f3a1abc-2b8b-4e51-9072-8c12495ec563\0e\shuffle_0_4107_0.index
>>
>> (may or may not be related to the problem at all)
>>
>> If you look further on this machine, there are FetchFailedException
>> errors caused by other machines, and so on and so forth.
>>
>>
>> This is Spark 1.6 on Yarn-master
>>
>>
>> Could anyone provide some insight or solution to this?
>>
>> thanks
>>
>>
>>
>
