[ https://issues.apache.org/jira/browse/SPARK-20640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan reassigned SPARK-20640:
-----------------------------------

    Assignee: Li Yichao

> Make rpc timeout and retry for shuffle registration configurable
> ----------------------------------------------------------------
>
>                 Key: SPARK-20640
>                 URL: https://issues.apache.org/jira/browse/SPARK-20640
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 2.0.2
>            Reporter: Sital Kedia
>            Assignee: Li Yichao
>             Fix For: 2.3.0
>
>
> Currently the shuffle service registration timeout and retry count are
> hardcoded (see
> https://github.com/sitalkedia/spark/blob/master/network/shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleClient.java#L144
> and
> https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L197).
> This works well for small workloads, but under heavy load, when the
> shuffle service is busy transferring large amounts of data, it responds to
> registration requests with significant delay. As a result, executors often
> fail to register with the shuffle service, which eventually fails the job.
> We need to make these two parameters configurable.
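
To illustrate the request above, here is a minimal sketch of a registration call whose per-attempt timeout and attempt limit are passed in as parameters instead of being hardcoded. It is not the code from ExternalShuffleClient.java or BlockManager.scala: the class RegistrationRetry, the method registerWithRetry, and its parameters are hypothetical names chosen for this example, and the sketch retries only on timeouts. For reference, the Spark 2.3.0 configuration documentation lists spark.shuffle.registration.timeout and spark.shuffle.registration.maxAttempts for this purpose; the two parameters below play the same roles.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Spark: retries a registration call
// with a caller-supplied timeout and attempt limit.
public class RegistrationRetry {

    /**
     * Runs the registration call, waiting at most timeoutMs per attempt and
     * trying at most maxAttempts times. Returns the number of the attempt
     * that succeeded. Rethrows the last TimeoutException when every attempt
     * times out; any other failure propagates immediately (an assumption of
     * this sketch, a real client might retry on I/O errors as well).
     */
    public static int registerWithRetry(Callable<Void> registration,
                                        long timeoutMs,
                                        int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        // A cached pool lets a new attempt start even if a cancelled one
        // is still winding down.
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (int attempt = 1; ; attempt++) {
                Future<Void> pending = pool.submit(registration);
                try {
                    pending.get(timeoutMs, TimeUnit.MILLISECONDS);
                    return attempt; // registration completed in time
                } catch (TimeoutException e) {
                    pending.cancel(true); // interrupt the slow attempt
                    if (attempt >= maxAttempts) {
                        throw e; // out of attempts, surface the timeout
                    }
                }
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A caller would read the two values from its configuration and pass them in, for example registerWithRetry(call, 5000L, 3). Raising either value gives a busy shuffle service more time to answer before the executor gives up.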