[ https://issues.apache.org/jira/browse/SPARK-31179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Graves resolved SPARK-31179.
-----------------------------------
    Fix Version/s: 3.1.0
         Assignee: feiwang
       Resolution: Fixed

> Fast fail the connection while last shuffle connection failed in the last retry IO wait
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-31179
>                 URL: https://issues.apache.org/jira/browse/SPARK-31179
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>   Affects Versions: 3.1.0
>            Reporter: feiwang
>            Assignee: feiwang
>            Priority: Major
>             Fix For: 3.1.0
>
>
> When reading shuffle data, several fetch requests may be sent to the same
> shuffle server.
> There is a client pool, and these requests may share the same client.
> When the shuffle server is busy, the connection requests may time out.
> For example, suppose there are two connection requests, rc1 and rc2,
> io.numConnectionsPerPeer is 1, and the connection timeout is 2 minutes.
> 1: rc1 holds the client lock and times out after 2 minutes.
> 2: rc2 holds the client lock and times out after 2 minutes.
> 3: rc1 starts its second attempt, holds the lock, and times out after 2 minutes.
> 4: rc2 starts its second attempt, holds the lock, and times out after 2 minutes.
> 5: rc1 starts its third attempt, holds the lock, and times out after 2 minutes.
> 6: rc2 starts its third attempt, holds the lock, and times out after 2 minutes.
> This wastes a lot of time: six consecutive 2-minute timeouts, 12 minutes in total.
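
The timeline in the issue can be checked with a small simulation. This is a minimal Python sketch, not Spark's implementation: the function `total_wait`, the 5-second `RETRY_WAIT_S`, and the rule that a request fails immediately when the last real connection failure happened less than one retry wait ago are all assumptions made for illustration, based only on the issue title.

```python
CONNECT_TIMEOUT_S = 120  # from the issue: connection timeout is 2 minutes
MAX_ATTEMPTS = 3         # from the issue: each request tries to connect three times
RETRY_WAIT_S = 5         # assumption: wait between retries, not given in the issue


def total_wait(num_requests: int, fast_fail: bool) -> int:
    """Return simulated seconds until every request has used up its attempts."""
    ready = [0] * num_requests              # when each request next wants the lock
    attempts_left = [MAX_ATTEMPTS] * num_requests
    lock_free = 0                           # when the shared client lock is released
    last_failure = None                     # time of the latest real connection failure
    finished = 0

    while any(attempts_left):
        # FIFO lock: the pending request that became ready earliest goes next.
        pending = [i for i in range(num_requests) if attempts_left[i] > 0]
        i = min(pending, key=lambda k: ready[k])
        start = max(ready[i], lock_free)

        recently_failed = (
            last_failure is not None and start - last_failure < RETRY_WAIT_S
        )
        if fast_fail and recently_failed:
            end = start                     # give up at once, skip the busy server
        else:
            end = start + CONNECT_TIMEOUT_S  # busy server: the attempt times out
            last_failure = end

        lock_free = end
        attempts_left[i] -= 1
        ready[i] = end + RETRY_WAIT_S
        finished = max(finished, end)

    return finished


print(total_wait(2, fast_fail=False))  # 720
print(total_wait(2, fast_fail=True))   # 370
```

Under these assumptions the two requests give up after 370 seconds instead of 720, because rc2 never waits on the busy server: each time it obtains the lock, rc1 has just failed, so rc2 fails immediately.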