This is an automated email from the ASF dual-hosted git repository.
zhouky pushed a commit to branch branch-0.3
in repository https://gitbox.apache.org/repos/asf/incubator-celeborn.git
The following commit(s) were added to refs/heads/branch-0.3 by this push:
new c21ae16ab [CELEBORN-803] Increase default timeout for commit files
c21ae16ab is described below
commit c21ae16abca91ba15faa245276cff61798e8b458
Author: zky.zhoukeyong <[email protected]>
AuthorDate: Mon Jul 17 22:31:36 2023 +0800
[CELEBORN-803] Increase default timeout for commit files
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
In 0.2.1-incubating, the default timeout for commit files is ```NETWORK_TIMEOUT```,
which is 240s.
That longer value is more reasonable because committing files takes a relatively
long time. In my testing with tough disks,
a 30s timeout with 2 retries is not enough.
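If the new defaults are still too short for a given deployment, both values can be overridden. A minimal sketch, assuming the worker reads `conf/celeborn-defaults.conf` in whitespace-separated `key value` form and that client settings are passed through the Spark configuration with a `spark.` prefix; the `240s` figure is the HDFS recommendation from the config doc, not a separately tested value:

```
# Worker side (assumed location: conf/celeborn-defaults.conf)
celeborn.worker.commitFiles.timeout 240s

# Client side (assumed: set through Spark conf with the spark. prefix)
spark.celeborn.client.requestCommitFiles.maxRetries 3
```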
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passes GA and manual test.
Closes #1724 from waitinfuture/803.
Authored-by: zky.zhoukeyong <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
---
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala | 5 +++--
docs/configuration/client.md | 2 +-
docs/configuration/worker.md | 2 +-
3 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala b/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
index d799635f3..42d85efe9 100644
--- a/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
+++ b/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
@@ -2147,7 +2147,8 @@ object CelebornConf extends Logging {
.doc("Timeout for a Celeborn worker to commit files of a shuffle. " +
"It's recommended to set at least `240s` when `HDFS` is enabled in `celeborn.storage.activeTypes`.")
.version("0.3.0")
- .fallbackConf(RPC_ASK_TIMEOUT)
+ .timeConf(TimeUnit.MILLISECONDS)
+ .createWithDefaultString("120s")
val PARTITION_SORTER_SORT_TIMEOUT: ConfigEntry[Long] =
buildConf("celeborn.worker.sortPartition.timeout")
@@ -3190,7 +3191,7 @@ object CelebornConf extends Logging {
.version("0.3.0")
.intConf
.checkValue(v => v > 0, "value must be positive")
- .createWithDefault(2)
+ .createWithDefault(3)
val CLIENT_COMMIT_IGNORE_EXCLUDED_WORKERS: ConfigEntry[Boolean] =
buildConf("celeborn.client.commitFiles.ignoreExcludedWorker")
diff --git a/docs/configuration/client.md b/docs/configuration/client.md
index beeb60a45..e6f3d6ba9 100644
--- a/docs/configuration/client.md
+++ b/docs/configuration/client.md
@@ -62,7 +62,7 @@ license: |
| celeborn.client.push.timeout | 120s | Timeout for a task to push data rpc message. This value should better be more than twice of `celeborn.<module>.push.timeoutCheck.interval` | 0.3.0 |
| celeborn.client.registerShuffle.maxRetries | 3 | Max retry times for client to register shuffle. | 0.3.0 |
| celeborn.client.registerShuffle.retryWait | 3s | Wait time before next retry if register shuffle failed. | 0.3.0 |
-| celeborn.client.requestCommitFiles.maxRetries | 2 | Max retry times for requestCommitFiles RPC. | 0.3.0 |
+| celeborn.client.requestCommitFiles.maxRetries | 3 | Max retry times for requestCommitFiles RPC. | 0.3.0 |
| celeborn.client.reserveSlots.maxRetries | 3 | Max retry times for client to reserve slots. | 0.3.0 |
| celeborn.client.reserveSlots.rackware.enabled | false | Whether need to place different replicates on different racks when allocating slots. | 0.3.0 |
| celeborn.client.reserveSlots.retryWait | 3s | Wait time before next retry if reserve slots failed. | 0.3.0 |
diff --git a/docs/configuration/worker.md b/docs/configuration/worker.md
index 372f91cf4..7d37707f2 100644
--- a/docs/configuration/worker.md
+++ b/docs/configuration/worker.md
@@ -27,7 +27,7 @@ license: |
| celeborn.worker.bufferStream.threadsPerMountpoint | 8 | Threads count for read buffer per mount point. | 0.3.0 |
| celeborn.worker.closeIdleConnections | false | Whether worker will close idle connections. | 0.2.0 |
| celeborn.worker.commitFiles.threads | 32 | Thread number of worker to commit shuffle data files asynchronously. It's recommended to set at least `128` when `HDFS` is enabled in `celeborn.storage.activeTypes`. | 0.3.0 |
-| celeborn.worker.commitFiles.timeout | <value of celeborn.rpc.askTimeout> | Timeout for a Celeborn worker to commit files of a shuffle. It's recommended to set at least `240s` when `HDFS` is enabled in `celeborn.storage.activeTypes`. | 0.3.0 |
+| celeborn.worker.commitFiles.timeout | 120s | Timeout for a Celeborn worker to commit files of a shuffle. It's recommended to set at least `240s` when `HDFS` is enabled in `celeborn.storage.activeTypes`. | 0.3.0 |
| celeborn.worker.congestionControl.enabled | false | Whether to enable congestion control or not. | 0.3.0 |
| celeborn.worker.congestionControl.high.watermark | <undefined> | If the total bytes in disk buffer exceeds this configure, will start to congestusers whose produce rate is higher than the potential average consume rate. The congestion will stop if the produce rate is lower or equal to the average consume rate, or the total pending bytes lower than celeborn.worker.congestionControl.low.watermark | 0.3.0 |
| celeborn.worker.congestionControl.low.watermark | <undefined> | Will stop congest users if the total pending bytes of disk buffer is lower than this configuration | 0.3.0 |