[ https://issues.apache.org/jira/browse/SPARK-20640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan resolved SPARK-20640.
---------------------------------
    Resolution: Fixed
    Fix Version/s: 2.3.0

Issue resolved by pull request 18092
[https://github.com/apache/spark/pull/18092]

> Make rpc timeout and retry for shuffle registration configurable
> ----------------------------------------------------------------
>
>                 Key: SPARK-20640
>                 URL: https://issues.apache.org/jira/browse/SPARK-20640
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 2.0.2
>            Reporter: Sital Kedia
>             Fix For: 2.3.0
>
>
> Currently the shuffle service registration timeout and retry settings are
> hardcoded (see
> https://github.com/sitalkedia/spark/blob/master/network/shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleClient.java#L144
> and
> https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L197).
> This works well for small workloads. Under heavy workloads, however, when the
> shuffle service is busy transferring large amounts of data, it responds to
> registration requests with significant delay. As a result, executors often
> fail to register with the shuffle service, which eventually fails the job.
> We need to make these two parameters configurable.
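
The idea behind the request can be illustrated with a small, self-contained sketch. This is not Spark's implementation: `RegistrationConfig`, `register_with_retry`, and all field names are hypothetical, chosen only to show a registration call whose per-attempt timeout and maximum attempt count come from configuration instead of constants.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RegistrationConfig:
    """Hypothetical settings; the names are illustrative, not Spark's."""
    timeout_ms: int      # how long one registration RPC may take
    max_attempts: int    # total tries before giving up
    retry_wait_ms: int   # pause between failed attempts


def register_with_retry(
    register: Callable[[int], None],
    config: RegistrationConfig,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call register(timeout_ms) until it succeeds or attempts run out.

    Returns the number of attempts used. Re-raises the last TimeoutError
    once max_attempts is reached.
    """
    if config.max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, config.max_attempts + 1):
        try:
            register(config.timeout_ms)
            return attempt
        except TimeoutError:
            if attempt == config.max_attempts:
                raise
            # Give a busy shuffle service time to catch up before retrying.
            sleep(config.retry_wait_ms / 1000.0)
    raise AssertionError("unreachable")
```

With both values supplied by the caller, a deployment whose shuffle service is slow to answer can raise `timeout_ms` or `max_attempts` without a code change. In Spark 2.3.0 and later the corresponding settings should be `spark.shuffle.registration.timeout` and `spark.shuffle.registration.maxAttempts`; the configuration page for the Spark version in use is the place to confirm the exact names and defaults.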