[ 
https://issues.apache.org/jira/browse/SPARK-18691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Brown updated SPARK-18691:
-----------------------------------
    Attachment: retries.txt

> Spark can hang if a node goes down during a shuffle
> ---------------------------------------------------
>
>                 Key: SPARK-18691
>                 URL: https://issues.apache.org/jira/browse/SPARK-18691
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.6.1
>         Environment: Running on an AWS EMR cluster using yarn
>            Reporter: Nicholas Brown
>         Attachments: retries.txt
>
>
> We have a 200-node cluster that sometimes hangs if a node goes down.  Spark 
> appears to detect the failure and spin up a replacement node just fine.  However, 
> the other 199 nodes then become unresponsive.  From looking at the logs and the 
> code, this is what appears to be happening:
> Each of the other nodes has 36 threads (processing the shuffle, as far as I can 
> tell) that time out while trying to fetch data from the now-dead node.  With the 
> default configuration, each fetch is supposed to retry 3 times with 5 seconds 
> between retries, for a maximum delay of 15 seconds.  However, because 
> spark.shuffle.io.numConnectionsPerPeer is set to its default value of 1, those 
> threads can only retry one at a time.  Combined with the two-minute default 
> network timeout, the delay stretches into several hours, which is obviously 
> unacceptable.  I'll attach a portion of our logs, which shows the retries falling 
> further and further behind.
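> For reference, here is a minimal sketch of the settings involved and a rough 
> worked estimate, using the documented 1.6.x defaults.  This is illustrative 
> SparkConf code, not our actual job configuration:
>
>     import org.apache.spark.SparkConf
>
>     // Defaults as documented for Spark 1.6.x
>     val conf = new SparkConf()
>       .set("spark.shuffle.io.maxRetries", "3")             // 3 retries per fetch
>       .set("spark.shuffle.io.retryWait", "5s")             // 5 seconds between retries
>       .set("spark.shuffle.io.numConnectionsPerPeer", "1")  // one connection per remote node
>       .set("spark.network.timeout", "120s")                // each attempt can block this long
>
>     // Expected worst case per fetch:  3 retries * 5s wait = 15 seconds.
>     // With a single connection per peer, the 36 fetch threads queue up behind
>     // each other, so the wait is closer to 36 * 3 * 120s, roughly 3.6 hours.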
> I can partially work around this by playing with the configuration options 
> mentioned above, particularly by upping numConnectionsPerPeer.  But it seems 
> Spark should be able to handle this on its own.  One thought is to make 
> maxRetries apply across all of the threads that are fetching from the same node.
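> A hedged sketch of the workaround I mean; the value is illustrative, not 
> something I am proposing as a new default:
>
>     import org.apache.spark.SparkConf
>
>     // Allow several fetch threads to talk to the same peer at once, so retries
>     // against a dead node are not serialized behind a single connection.
>     val tuned = new SparkConf()
>       .set("spark.shuffle.io.numConnectionsPerPeer", "8")  // default is 1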
> I've seen this on 1.6.1, but from looking at the code I suspect it exists in 
> the latest version as well.


