[incubator-celeborn] branch branch-0.3 updated: [CELEBORN-803] Increase default timeout for commit files

zhouky Mon, 17 Jul 2023 07:34:58 -0700

This is an automated email from the ASF dual-hosted git repository.

zhouky pushed a commit to branch branch-0.3
in repository https://gitbox.apache.org/repos/asf/incubator-celeborn.git



The following commit(s) were added to refs/heads/branch-0.3 by this push:
     new c21ae16ab [CELEBORN-803] Increase default timeout for commit files
c21ae16ab is described below

commit c21ae16abca91ba15faa245276cff61798e8b458
Author: zky.zhoukeyong <[email protected]>
AuthorDate: Mon Jul 17 22:31:36 2023 +0800

    [CELEBORN-803] Increase default timeout for commit files
    
    ### What changes were proposed in this pull request?
    As title.
    
    ### Why are the changes needed?
    In 0.2.1-incubating, commit files default timeout is ```NETWORK_TIMEOUT```, 
which is 240s.
    It's more reasonable because commit files costs relatively long time. In my 
testing with tough disks,
    30s timeout with 2 retires is not enough.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Passes GA and manual test.
    
    Closes #1724 from waitinfuture/803.
    
    Authored-by: zky.zhoukeyong <[email protected]>
    Signed-off-by: zky.zhoukeyong <[email protected]>
---
 common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala | 5 +++--
 docs/configuration/client.md                                        | 2 +-
 docs/configuration/worker.md                                        | 2 +-
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git 
a/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala 
b/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
index d799635f3..42d85efe9 100644
--- a/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
+++ b/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
@@ -2147,7 +2147,8 @@ object CelebornConf extends Logging {
       .doc("Timeout for a Celeborn worker to commit files of a shuffle. " +
         "It's recommended to set at least `240s` when `HDFS` is enabled in 
`celeborn.storage.activeTypes`.")
       .version("0.3.0")
-      .fallbackConf(RPC_ASK_TIMEOUT)
+      .timeConf(TimeUnit.MILLISECONDS)
+      .createWithDefaultString("120s")
 
   val PARTITION_SORTER_SORT_TIMEOUT: ConfigEntry[Long] =
     buildConf("celeborn.worker.sortPartition.timeout")
@@ -3190,7 +3191,7 @@ object CelebornConf extends Logging {
       .version("0.3.0")
       .intConf
       .checkValue(v => v > 0, "value must be positive")
-      .createWithDefault(2)
+      .createWithDefault(3)
 
   val CLIENT_COMMIT_IGNORE_EXCLUDED_WORKERS: ConfigEntry[Boolean] =
     buildConf("celeborn.client.commitFiles.ignoreExcludedWorker")
diff --git a/docs/configuration/client.md b/docs/configuration/client.md
index beeb60a45..e6f3d6ba9 100644
--- a/docs/configuration/client.md
+++ b/docs/configuration/client.md
@@ -62,7 +62,7 @@ license: |
 | celeborn.client.push.timeout | 120s | Timeout for a task to push data rpc 
message. This value should better be more than twice of 
`celeborn.<module>.push.timeoutCheck.interval` | 0.3.0 | 
 | celeborn.client.registerShuffle.maxRetries | 3 | Max retry times for client 
to register shuffle. | 0.3.0 | 
 | celeborn.client.registerShuffle.retryWait | 3s | Wait time before next retry 
if register shuffle failed. | 0.3.0 | 
-| celeborn.client.requestCommitFiles.maxRetries | 2 | Max retry times for 
requestCommitFiles RPC. | 0.3.0 | 
+| celeborn.client.requestCommitFiles.maxRetries | 3 | Max retry times for 
requestCommitFiles RPC. | 0.3.0 | 
 | celeborn.client.reserveSlots.maxRetries | 3 | Max retry times for client to 
reserve slots. | 0.3.0 | 
 | celeborn.client.reserveSlots.rackware.enabled | false | Whether need to 
place different replicates on different racks when allocating slots. | 0.3.0 | 
 | celeborn.client.reserveSlots.retryWait | 3s | Wait time before next retry if 
reserve slots failed. | 0.3.0 | 
diff --git a/docs/configuration/worker.md b/docs/configuration/worker.md
index 372f91cf4..7d37707f2 100644
--- a/docs/configuration/worker.md
+++ b/docs/configuration/worker.md
@@ -27,7 +27,7 @@ license: |
 | celeborn.worker.bufferStream.threadsPerMountpoint | 8 | Threads count for 
read buffer per mount point. | 0.3.0 | 
 | celeborn.worker.closeIdleConnections | false | Whether worker will close 
idle connections. | 0.2.0 | 
 | celeborn.worker.commitFiles.threads | 32 | Thread number of worker to commit 
shuffle data files asynchronously. It's recommended to set at least `128` when 
`HDFS` is enabled in `celeborn.storage.activeTypes`. | 0.3.0 | 
-| celeborn.worker.commitFiles.timeout | &lt;value of 
celeborn.rpc.askTimeout&gt; | Timeout for a Celeborn worker to commit files of 
a shuffle. It's recommended to set at least `240s` when `HDFS` is enabled in 
`celeborn.storage.activeTypes`. | 0.3.0 | 
+| celeborn.worker.commitFiles.timeout | 120s | Timeout for a Celeborn worker 
to commit files of a shuffle. It's recommended to set at least `240s` when 
`HDFS` is enabled in `celeborn.storage.activeTypes`. | 0.3.0 | 
 | celeborn.worker.congestionControl.enabled | false | Whether to enable 
congestion control or not. | 0.3.0 | 
 | celeborn.worker.congestionControl.high.watermark | &lt;undefined&gt; | If 
the total bytes in disk buffer exceeds this configure, will start to 
congestusers whose produce rate is higher than the potential average consume 
rate. The congestion will stop if the produce rate is lower or equal to the 
average consume rate, or the total pending bytes lower than 
celeborn.worker.congestionControl.low.watermark | 0.3.0 | 
 | celeborn.worker.congestionControl.low.watermark | &lt;undefined&gt; | Will 
stop congest users if the total pending bytes of disk buffer is lower than this 
configuration | 0.3.0 |

[incubator-celeborn] branch branch-0.3 updated: [CELEBORN-803] Increase default timeout for commit files

Reply via email to