Sital Kedia created SPARK-20640:
-----------------------------------

             Summary: Make rpc timeout and retry for shuffle registration configurable
                 Key: SPARK-20640
                 URL: https://issues.apache.org/jira/browse/SPARK-20640
             Project: Spark
          Issue Type: Bug
          Components: Shuffle
    Affects Versions: 2.0.2
            Reporter: Sital Kedia


Currently the shuffle service registration timeout and retry count are 
hardcoded (see 
https://github.com/sitalkedia/spark/blob/master/network/shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleClient.java#L144
 and 
https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L197).
This works well for small workloads. Under heavy load, however, when the 
shuffle service is busy transferring large amounts of data, it responds to 
registration requests with significant delay. As a result, executors often 
fail to register with the shuffle service, which eventually fails the job. 
We need to make these two parameters configurable.
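
A minimal sketch of what "configurable" could look like, assuming the 
registration call accepts a per-attempt timeout and is retried in a loop. 
The class name, the RegisterCall interface, the configuration keys, and the 
default values below are all hypothetical and chosen for illustration only; 
they are not Spark's actual API or settings.

```java
import java.io.IOException;
import java.util.Map;

public class RegistrationRetry {
    // Hypothetical configuration keys; not actual Spark setting names.
    static final String TIMEOUT_KEY = "example.shuffle.registration.timeoutMs";
    static final String MAX_ATTEMPTS_KEY = "example.shuffle.registration.maxAttempts";

    // Assumed fallback values, used when the keys are not set.
    static final long DEFAULT_TIMEOUT_MS = 5000L;
    static final int DEFAULT_MAX_ATTEMPTS = 3;

    /** Stand-in for a blocking registration RPC that honors a timeout. */
    interface RegisterCall {
        void send(long timeoutMs) throws IOException;
    }

    /** Reads the per-attempt timeout from the config, or returns the default. */
    static long timeoutMs(Map<String, String> conf) {
        String value = conf.get(TIMEOUT_KEY);
        return value == null ? DEFAULT_TIMEOUT_MS : Long.parseLong(value);
    }

    /** Reads the maximum number of attempts from the config, or returns the default. */
    static int maxAttempts(Map<String, String> conf) {
        String value = conf.get(MAX_ATTEMPTS_KEY);
        return value == null ? DEFAULT_MAX_ATTEMPTS : Integer.parseInt(value);
    }

    /**
     * Tries the registration call up to maxAttempts times, passing the
     * configured timeout to each attempt. Returns the number of attempts
     * used on success; throws once every attempt has failed.
     */
    static int registerWithRetry(Map<String, String> conf, RegisterCall call)
            throws IOException {
        long timeout = timeoutMs(conf);
        int attempts = maxAttempts(conf);
        IOException last = null;
        for (int attempt = 1; attempt <= attempts; attempt++) {
            try {
                call.send(timeout);
                return attempt;
            } catch (IOException e) {
                last = e; // remember the failure and try again
            }
        }
        throw new IOException(
            "Registration failed after " + attempts + " attempts", last);
    }
}
```

With something along these lines, a deployment whose shuffle service is slow 
to respond could raise the timeout or the attempt count through 
configuration instead of patching the source.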


