[ https://issues.apache.org/jira/browse/SPARK-18691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicholas Brown updated SPARK-18691:
-----------------------------------
    Attachment: retries.txt

> Spark can hang if a node goes down during a shuffle
> ---------------------------------------------------
>
>                 Key: SPARK-18691
>                 URL: https://issues.apache.org/jira/browse/SPARK-18691
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.6.1
>         Environment: Running on an AWS EMR cluster using yarn
>            Reporter: Nicholas Brown
>         Attachments: retries.txt
>
>
> We have a 200 node cluster that sometimes hangs if a node goes down. Spark detects the failure and spins up a replacement node just fine; however, the other 199 nodes then appear to become unresponsive. From looking at the logs and the code, this is what appears to be happening.
>
> The other nodes each have 36 threads (as far as I can tell, they are processing the shuffle) that time out while trying to fetch data from the now-dead node. With the default configuration, each fetch is supposed to retry 3 times with 5 seconds between retries, for a maximum delay of 15 seconds. However, because spark.shuffle.io.numConnectionsPerPeer is left at its default value of 1, those retries are serialized through a single connection and can only proceed one at a time. Combined with the default two minute network timeout, the total delay grows to several hours, which is obviously unacceptable. I'll attach a portion of our logs that shows the retries getting further and further behind.
>
> I can partially work around this by tuning the configuration options mentioned above, particularly by raising numConnectionsPerPeer, but it seems Spark should be able to handle this on its own. One thought is to make maxRetries apply across all threads that are fetching from the same node.
>
> I've seen this on 1.6.1, but from looking at the code I suspect it exists in the latest version as well.
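>
> For reference, a back-of-envelope worst case and a minimal sketch of the configuration workaround described above. The property names are the standard Spark shuffle/network settings already mentioned; the value of 4 for numConnectionsPerPeer is only an illustrative guess, not a tested recommendation:
>
>   import org.apache.spark.SparkConf
>
>   // Worst case with the defaults: 36 fetch threads per node, up to 3 attempts each,
>   // all serialized through a single connection that can block for the full 120 s
>   // network timeout: 36 * 3 * 120 s is roughly 3.6 hours, i.e. "several hours".
>   val conf = new SparkConf()
>     .set("spark.shuffle.io.numConnectionsPerPeer", "4") // default 1; more connections let retries proceed in parallel
>     .set("spark.shuffle.io.maxRetries", "3")            // default: 3 retries per fetch
>     .set("spark.shuffle.io.retryWait", "5s")            // default: 5 seconds between retries
>     .set("spark.network.timeout", "120s")               // default: 2 minutes; lowering it also shortens the worst case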