Copilot commented on code in PR #9788:
URL: https://github.com/apache/ozone/pull/9788#discussion_r2825285364
##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/defrag/SnapshotDefragService.java:
##########
@@ -638,6 +705,12 @@ boolean checkAndDefragSnapshot(SnapshotChainManager chainManager, UUID snapshotI
      }
      LOG.info("Completed defragmentation for snapshot: {} (ID: {}) in {} ms", snapshotInfo.getTableKey(),
          snapshotInfo.getSnapshotId(), Time.monotonicNow() - start);
+      snapshotMetrics.incNumSnapshotDefrag();
+      if (isFullDefrag) {
+        snapshotMetrics.incNumSnapshotFullDefrag();
+      } else {
+        snapshotMetrics.incNumSnapshotIncDefrag();
+      }
Review Comment:
Metrics are double-counted when defragmentation succeeds: both `incNumSnapshotDefrag()` (line 708) and either `incNumSnapshotFullDefrag()` or `incNumSnapshotIncDefrag()` (lines 710/712) are called for every successful defragmentation. According to the PR description, `numSnapshotDefrag` should track "Total successful defrag ops (full + incremental)", which implies it should be the sum of the two type counters. With the current implementation, 3 full and 2 incremental defragmentations yield `numSnapshotDefrag` = 5, `numSnapshotFullDefrag` = 3, and `numSnapshotIncDefrag` = 2, so the counters are redundant rather than complementary.
Consider either (1) removing lines 708-713 and keeping only the increments inside the try blocks (lines 671/682), or (2) removing the increments from the try blocks and keeping only lines 708-713, depending on whether the intent is to track only the per-type counts or also maintain a total counter (see the sketch below).
```suggestion
```
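A minimal sketch of option (2), reusing `snapshotMetrics` and `isFullDefrag` from the diff: keep a single success-accounting site so the total stays equal to the sum of the per-type counters by construction. `markDefragSuccess` is a hypothetical helper, not part of the PR, and this is a fragment of the surrounding class rather than standalone code.
```java
// Hypothetical helper (not in the PR): the only place success metrics are
// incremented, called once after the whole defrag flow completes.
private void markDefragSuccess(boolean isFullDefrag) {
  if (isFullDefrag) {
    snapshotMetrics.incNumSnapshotFullDefrag();
  } else {
    snapshotMetrics.incNumSnapshotIncDefrag();
  }
  // Incremented exactly once per successful defrag, next to the type counter,
  // so numSnapshotDefrag == numSnapshotFullDefrag + numSnapshotIncDefrag holds.
  snapshotMetrics.incNumSnapshotDefrag();
}
```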
##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/defrag/SnapshotDefragService.java:
##########
@@ -608,18 +653,37 @@ boolean checkAndDefragSnapshot(SnapshotChainManager chainManager, UUID snapshotI
    Path checkpointLocation = checkpointMetadataManager.getStore().getDbLocation().toPath();
    try {
      DBStore checkpointDBStore = checkpointMetadataManager.getStore();
+      if (LOG.isTraceEnabled()) {
+        Path snapshotDbDir = OmSnapshotManager.getSnapshotPath(
+            ozoneManager.getMetadataManager(), snapshotId, needsDefragVersionPair.getValue());
+        logDirStats("Before defrag", snapshotInfo, snapshotDbDir);
+      }
      TablePrefixInfo prefixInfo = ozoneManager.getMetadataManager().getTableBucketPrefix(snapshotInfo.getVolumeName(),
          snapshotInfo.getBucketName());
      // If first snapshot in the chain perform full defragmentation.
-      if (snapshotInfo.getPathPreviousSnapshotId() == null) {
+      boolean isFullDefrag = snapshotInfo.getPathPreviousSnapshotId() == null;
+      long defragStart = Time.monotonicNow();
+      if (isFullDefrag) {
        LOG.info("Performing full defragmentation for snapshot: {} (ID: {})", snapshotInfo.getTableKey(),
            snapshotInfo.getSnapshotId());
-        performFullDefragmentation(checkpointDBStore, prefixInfo, COLUMN_FAMILIES_TO_TRACK_IN_SNAPSHOT);
+        try {
+          performFullDefragmentation(checkpointDBStore, prefixInfo, COLUMN_FAMILIES_TO_TRACK_IN_SNAPSHOT);
+          perfMetrics.setSnapshotDefragServiceFullLatencyMs(Time.monotonicNow() - defragStart);
+        } catch (IOException e) {
+          snapshotMetrics.incNumSnapshotFullDefragFails();
+          throw e;
+        }
      } else {
        LOG.info("Performing incremental defragmentation for snapshot: {} (ID: {})", snapshotInfo.getTableKey(),
            snapshotInfo.getSnapshotId());
-        performIncrementalDefragmentation(checkpointSnapshotInfo, snapshotInfo, needsDefragVersionPair.getValue(),
-            checkpointDBStore, prefixInfo, COLUMN_FAMILIES_TO_TRACK_IN_SNAPSHOT);
+        try {
+          performIncrementalDefragmentation(checkpointSnapshotInfo, snapshotInfo, needsDefragVersionPair.getValue(),
+              checkpointDBStore, prefixInfo, COLUMN_FAMILIES_TO_TRACK_IN_SNAPSHOT);
+          perfMetrics.setSnapshotDefragServiceIncLatencyMs(Time.monotonicNow() - defragStart);
+        } catch (IOException e) {
+          snapshotMetrics.incNumSnapshotIncDefragFails();
+          throw e;
+        }
Review Comment:
When an `IOException` occurs during `performFullDefragmentation` or `performIncrementalDefragmentation`, the specific failure metric (`incNumSnapshotFullDefragFails` or `incNumSnapshotIncDefragFails`) is incremented, but the general failure metric `incNumSnapshotDefragFails` is only incremented at the outer level (line 760), when the exception propagates out of `checkAndDefragSnapshot`. This creates an inconsistency: the specific failure counts may not add up to the total failure count if exceptions from the inner try blocks are caught and handled at the outer level without being re-thrown. Consider whether the general failure metric should also be incremented in the inner catch blocks (lines 673 and 684) so the total count stays consistent with the sum of the specific failures.
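A hedged sketch of that suggestion for the full-defrag branch, reusing the names from the diff (the incremental branch would mirror it); this is a fragment of the surrounding method, not standalone code.
```java
try {
  performFullDefragmentation(checkpointDBStore, prefixInfo, COLUMN_FAMILIES_TO_TRACK_IN_SNAPSHOT);
  perfMetrics.setSnapshotDefragServiceFullLatencyMs(Time.monotonicNow() - defragStart);
} catch (IOException e) {
  // Bump both the specific and the general failure counters at the point of failure,
  // so the total always equals the sum of the per-type failures.
  snapshotMetrics.incNumSnapshotFullDefragFails();
  snapshotMetrics.incNumSnapshotDefragFails();
  throw e;
}
```
If the general counter is incremented here, the outer-level increment (line 760) would need to be removed or guarded so a re-thrown exception is not counted twice.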
##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/defrag/SnapshotDefragService.java:
##########
@@ -608,18 +653,37 @@ boolean checkAndDefragSnapshot(SnapshotChainManager chainManager, UUID snapshotI
Review Comment:
Latency metrics are set only when the inner defragmentation operation (`performFullDefragmentation` or `performIncrementalDefragmentation`) succeeds, and they do not cover the later steps of the process (e.g. `ingestNonIncrementalTables`, `atomicSwitchSnapshotDB`, or `deleteSnapshotCheckpointDirectories`) or failures in those steps. This is inconsistent: the success counters are incremented only after the whole flow completes (lines 708-713), yet the latency gauge does not reflect the total time taken by the attempt that actually completed. Consider moving the latency recording to the point where all operations have completed successfully (around line 707), or ensure the latency reflects only the specific operation it is named after.
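A hedged sketch of the first alternative, reusing the names from the diff: record the latency at the same completion point as the success counters (around line 707), so the gauge covers ingest, DB switch, and cleanup as well; the setters inside the inner try blocks would then be dropped. Fragment of the surrounding method only, not standalone code.
```java
LOG.info("Completed defragmentation for snapshot: {} (ID: {}) in {} ms",
    snapshotInfo.getTableKey(), snapshotInfo.getSnapshotId(), Time.monotonicNow() - start);
// Elapsed time of the whole attempt, measured from the defragStart taken before
// the inner defrag call and recorded only once everything has succeeded.
long elapsedMs = Time.monotonicNow() - defragStart;
if (isFullDefrag) {
  perfMetrics.setSnapshotDefragServiceFullLatencyMs(elapsedMs);
  snapshotMetrics.incNumSnapshotFullDefrag();
} else {
  perfMetrics.setSnapshotDefragServiceIncLatencyMs(elapsedMs);
  snapshotMetrics.incNumSnapshotIncDefrag();
}
snapshotMetrics.incNumSnapshotDefrag();
```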
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]