This is an automated email from the ASF dual-hosted git repository.
rexxiong pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/celeborn.git
The following commit(s) were added to refs/heads/main by this push:
new 4540b5772 [MINOR] Document introduced metrics into monitoring.md
4540b5772 is described below
commit 4540b5772bb946f9afb502e6f1eca171ffe6c9b3
Author: SteNicholas <[email protected]>
AuthorDate: Tue Jul 29 14:33:46 2025 +0800
[MINOR] Document introduced metrics into monitoring.md
### What changes were proposed in this pull request?
Document the introduced metrics in `monitoring.md`, including
`FetchChunkTransferTime`, `FetchChunkTransferSize`, `FlushWorkingQueueSize`,
`LocalFlushCount`, `LocalFlushSize`, `HdfsFlushCount`, `HdfsFlushSize`,
`OssFlushCount`, `OssFlushSize`, `S3FlushCount`, `S3FlushSize`.
### Why are the changes needed?
The introduced metrics `FetchChunkTransferTime`, `FetchChunkTransferSize`,
`FlushWorkingQueueSize`, `LocalFlushCount`, `LocalFlushSize`, `HdfsFlushCount`,
`HdfsFlushSize`, `OssFlushCount`, `OssFlushSize`, `S3FlushCount`, `S3FlushSize`
are not documented in `monitoring.md`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #3398 from SteNicholas/document-monitoring.
Authored-by: SteNicholas <[email protected]>
Signed-off-by: Shuang <[email protected]>
---
docs/monitoring.md | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/docs/monitoring.md b/docs/monitoring.md
index 89558cb42..8d750c843 100644
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -43,7 +43,7 @@ _instances_ corresponding to Celeborn components. The following instances are c
Each instance can report to zero or more _sinks_. Sinks are contained in the
`org.apache.celeborn.common.metrics.sink` package:
-* `CSVSink`: Exports metrics data to CSV files at regular intervals.
+* `CsvSink`: Exports metrics data to CSV files at regular intervals.
* `PrometheusServlet`: Adds a servlet within the existing Celeborn REST API to serve metrics data in Prometheus format.
* `JsonServlet`: Adds a servlet within the existing Celeborn REST API to serve metrics data in JSON format.
* `GraphiteSink`: Sends metrics to a Graphite node.
@@ -185,11 +185,13 @@ These metrics are exposed by Celeborn worker.
| ActiveShuffleFileCount | The active shuffle file count of a worker including master replica and slave replica. |
| OpenStreamTime | The time for a worker to process openStream RPC and return StreamHandle. |
| FetchChunkTime | The time for a worker to fetch a chunk which is 8MB by default from a reduced partition. |
+ | FetchChunkTransferTime | The time for a worker to transfer for fetching a chunk from a reduced partition. |
| ActiveChunkStreamCount | Active stream count for reduce partition reading streams. |
| OpenStreamSuccessCount | The count of opening stream succeed in current worker. |
| OpenStreamFailCount | The count of opening stream failed in current worker. |
| FetchChunkSuccessCount | The count of fetching chunk succeed in current worker. |
| FetchChunkFailCount | The count of fetching chunk failed in current worker. |
+ | FetchChunkTransferSize | The size of transfer for fetching chunk in current worker. |
| PrimaryPushDataTime | The time for a worker to handle a pushData RPC sent from a celeborn client. |
| ReplicaPushDataTime | The time for a worker to handle a pushData RPC sent from a celeborn worker by replicating. |
| PrimarySegmentStartTime | The time for a worker to handle a segmentStart RPC sent from a celeborn client. |
@@ -230,7 +232,7 @@ These metrics are exposed by Celeborn worker.
| SortTime | The time for a worker to sort a shuffle file. |
| SortMemory | The memory used by sorting shuffle files. |
| SortingFiles | The count of sorting shuffle files. |
- | PendingSortTaks | The count of sort tasks waiting to be submitted to FileSorterExecutors. |
+ | PendingSortTasks | The count of sort tasks waiting to be submitted to FileSorterExecutors. |
| SortedFiles | The count of sorted shuffle files. |
| SortedFileSize | The count of sorted shuffle files 's total size. |
| DiskBuffer | The memory occupied by pushData and pushMergedData which should be written to disk. |
@@ -256,6 +258,15 @@ These metrics are exposed by Celeborn worker.
| EvictedFileCount | The count of files evicted from Memory Storage to Disk |
| DirectMemoryUsageRatio | Ratio of direct memory used and max direct memory. |
| RegisterWithMasterFailCount | The count of failures in register with master request. |
+ | FlushWorkingQueueSize | The size of flush working queue for mount point. |
+ | LocalFlushCount | The amount of data flushed to local. |
+ | LocalFlushSize | The size of data flushed to local. |
+ | HdfsFlushCount | The amount of data flushed to HDFS. |
+ | HdfsFlushSize | The size of data flushed to HDFS. |
+ | OssFlushCount | The amount of data flushed to OSS. |
+ | OssFlushSize | The size of data flushed to OSS. |
+ | S3FlushCount | The amount of data flushed to S3. |
+ | S3FlushSize | The size of data flushed to S3. |
| push_usedHeapMemory | |
| push_usedDirectMemory | |
| push_numHeapArenas | |