This is an automated email from the ASF dual-hosted git repository.
feiwang pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/celeborn.git
The following commit(s) were added to refs/heads/main by this push:
new 7ab6268e3 [CELEBORN-2083] For `WorkerStatusTracker`, log error for `recordWorkerFailure`
7ab6268e3 is described below
commit 7ab6268e38a30699c1be86dd9298fd8233564f77
Author: Wang, Fei <[email protected]>
AuthorDate: Sun Jul 27 22:46:20 2025 -0700
[CELEBORN-2083] For `WorkerStatusTracker`, log error for `recordWorkerFailure`
### What changes were proposed in this pull request?
In `WorkerStatusTracker`, log at error level in `recordWorkerFailure` so that
these messages can be told apart from status changes reported in application
heartbeat responses.
### Why are the changes needed?
Currently, `WorkerStatusTracker` logs at warning level in two cases:
1. a status change from an application heartbeat response
https://github.com/apache/celeborn/blob/ae40222351cbeb1a9bdd398d461255a0739f3cac/client/src/main/scala/org/apache/celeborn/client/WorkerStatusTracker.scala#L213-L214
2. `recordWorkerFailure` on certain failures, such as `connectFailedWorkers`.
In our use case, the Celeborn cluster is very large and worker status changes
frequently, so the log for case 1 is very noisy.
I think that:
1. Case 2 is more critical and should use the error level.
2. Case 1 can be normal for a large Celeborn cluster, so the warning level is
fine.
With separate log levels, we can mute the noisy status-change messages from
application heartbeat responses by setting the log level for
`WorkerStatusTracker` to error.
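As a rough illustration only, and assuming the client process is configured through a Log4j 2 properties file (an assumption; the logging backend depends on the deployment), raising the level for this class might look like the fragment below. The key `workerStatusTracker` is an arbitrary label chosen for this sketch, and the logger name is inferred from the file path touched by this commit.

```properties
# Hypothetical log4j2.properties fragment; adjust to the deployment's logging setup.
# Only error-level messages from WorkerStatusTracker (case 2) are emitted,
# which silences the warning-level status-change messages (case 1).
logger.workerStatusTracker.name = org.apache.celeborn.client.WorkerStatusTracker
logger.workerStatusTracker.level = error
```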
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Code review.
Closes #3392 from turboFei/log_level_worker_status.
Authored-by: Wang, Fei <[email protected]>
Signed-off-by: Wang, Fei <[email protected]>
---
.../src/main/scala/org/apache/celeborn/client/WorkerStatusTracker.scala | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/client/src/main/scala/org/apache/celeborn/client/WorkerStatusTracker.scala b/client/src/main/scala/org/apache/celeborn/client/WorkerStatusTracker.scala
index f065f2c3e..698032014 100644
--- a/client/src/main/scala/org/apache/celeborn/client/WorkerStatusTracker.scala
+++ b/client/src/main/scala/org/apache/celeborn/client/WorkerStatusTracker.scala
@@ -124,7 +124,7 @@ class WorkerStatusTracker(
val failedWorkersMsg = failedWorkers.asScala.map { case (worker, (status, time)) =>
s"${worker.readableAddress()} ${status.name()} ${Utils.formatTimestamp(time)}"
}.mkString("\n")
- logWarning(
+ logError(
s"""
|Reporting failed workers:
|$failedWorkersMsg$currentFailedWorkers""".stripMargin)