Copilot commented on code in PR #9788:
URL: https://github.com/apache/ozone/pull/9788#discussion_r2825285364
##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/defrag/SnapshotDefragService.java:
##########
@@ -638,6 +705,12 @@ boolean checkAndDefragSnapshot(SnapshotChainManager chainManager, UUID snapshotI
      }
      LOG.info("Completed defragmentation for snapshot: {} (ID: {}) in {} ms", snapshotInfo.getTableKey(),
          snapshotInfo.getSnapshotId(), Time.monotonicNow() - start);
+      snapshotMetrics.incNumSnapshotDefrag();
+      if (isFullDefrag) {
+        snapshotMetrics.incNumSnapshotFullDefrag();
+      } else {
+        snapshotMetrics.incNumSnapshotIncDefrag();
+      }
Review Comment:
Metrics are double-counted when defragmentation succeeds: both `incNumSnapshotDefrag()` (line 708) and either `incNumSnapshotFullDefrag()` or `incNumSnapshotIncDefrag()` (lines 710/712) are called for every successful defragmentation. According to the PR description, `numSnapshotDefrag` should track "Total successful defrag ops (full + incremental)", which implies it should be the sum of the two type counters. With the current implementation, 3 full and 2 incremental defragmentations yield `numSnapshotDefrag` = 5, `numSnapshotFullDefrag` = 3, and `numSnapshotIncDefrag` = 2, so the counters are redundant rather than complementary.
Consider either (1) removing lines 708-713 and keeping only the increments inside the try blocks (lines 671/682), or (2) removing the increments from the try blocks and keeping only lines 708-713, depending on whether the intent is to track only the per-type counts or also maintain a total counter (see the sketch below).
```suggestion
```
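A minimal sketch of option (2), reusing `snapshotMetrics` and `isFullDefrag` from the diff: keep a single success-accounting site so the total stays equal to the sum of the per-type counters by construction. `markDefragSuccess` is a hypothetical helper, not part of the PR, and this is a fragment of the surrounding class rather than standalone code.
```java
// Hypothetical helper (not in the PR): the only place success metrics are
// incremented, called once after the whole defrag flow completes.
private void markDefragSuccess(boolean isFullDefrag) {
  if (isFullDefrag) {
    snapshotMetrics.incNumSnapshotFullDefrag();
  } else {
    snapshotMetrics.incNumSnapshotIncDefrag();
  }
  // Incremented exactly once per successful defrag, next to the type counter,
  // so numSnapshotDefrag == numSnapshotFullDefrag + numSnapshotIncDefrag holds.
  snapshotMetrics.incNumSnapshotDefrag();
}
```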
##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/defrag/SnapshotDefragService.java:
##########
@@ -608,18 +653,37 @@ boolean checkAndDefragSnapshot(SnapshotChainManager chainManager, UUID snapshotI
    Path checkpointLocation = checkpointMetadataManager.getStore().getDbLocation().toPath();
    try {
      DBStore checkpointDBStore = checkpointMetadataManager.getStore();
+      if (LOG.isTraceEnabled()) {
+        Path snapshotDbDir = OmSnapshotManager.getSnapshotPath(
+            ozoneManager.getMetadataManager(), snapshotId, needsDefragVersionPair.getValue());
+        logDirStats("Before defrag", snapshotInfo, snapshotDbDir);
+      }
      TablePrefixInfo prefixInfo = ozoneManager.getMetadataManager().getTableBucketPrefix(snapshotInfo.getVolumeName(),
          snapshotInfo.getBucketName());
      // If first snapshot in the chain perform full defragmentation.
-      if (snapshotInfo.getPathPreviousSnapshotId() == null) {
+      boolean isFullDefrag = snapshotInfo.getPathPreviousSnapshotId() == null;
+      long defragStart = Time.monotonicNow();
+      if (isFullDefrag) {
        LOG.info("Performing full defragmentation for snapshot: {} (ID: {})", snapshotInfo.getTableKey(),
            snapshotInfo.getSnapshotId());
-        performFullDefragmentation(checkpointDBStore, prefixInfo, COLUMN_FAMILIES_TO_TRACK_IN_SNAPSHOT);
+        try {
+          performFullDefragmentation(checkpointDBStore, prefixInfo, COLUMN_FAMILIES_TO_TRACK_IN_SNAPSHOT);
+          perfMetrics.setSnapshotDefragServiceFullLatencyMs(Time.monotonicNow() - defragStart);
+        } catch (IOException e) {
+          snapshotMetrics.incNumSnapshotFullDefragFails();
+          throw e;
+        }
      } else {
        LOG.info("Performing incremental defragmentation for snapshot: {} (ID: {})", snapshotInfo.getTableKey(),
            snapshotInfo.getSnapshotId());
-        performIncrementalDefragmentation(checkpointSnapshotInfo, snapshotInfo, needsDefragVersionPair.getValue(),
-            checkpointDBStore, prefixInfo, COLUMN_FAMILIES_TO_TRACK_IN_SNAPSHOT);
+        try {
+          performIncrementalDefragmentation(checkpointSnapshotInfo, snapshotInfo, needsDefragVersionPair.getValue(),
+              checkpointDBStore, prefixInfo, COLUMN_FAMILIES_TO_TRACK_IN_SNAPSHOT);
+          perfMetrics.setSnapshotDefragServiceIncLatencyMs(Time.monotonicNow() - defragStart);
+        } catch (IOException e) {
+          snapshotMetrics.incNumSnapshotIncDefragFails();
+          throw e;
+        }
Review Comment:
When an `IOException` occurs during `performFullDefragmentation` or `performIncrementalDefragmentation`, the specific failure metric (`incNumSnapshotFullDefragFails` or `incNumSnapshotIncDefragFails`) is incremented, but the general failure metric `incNumSnapshotDefragFails` is only incremented at the outer level (line 760), when the exception propagates out of `checkAndDefragSnapshot`. This creates an inconsistency: the specific failure counts may not add up to the total failure count if exceptions from the inner try blocks are caught and handled at the outer level without being re-thrown. Consider whether the general failure metric should also be incremented in the inner catch blocks (lines 673 and 684) so the total count stays consistent with the sum of the specific failures.
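A hedged sketch of that suggestion for the full-defrag branch, reusing the names from the diff (the incremental branch would mirror it); this is a fragment of the surrounding method, not standalone code.
```java
try {
  performFullDefragmentation(checkpointDBStore, prefixInfo, COLUMN_FAMILIES_TO_TRACK_IN_SNAPSHOT);
  perfMetrics.setSnapshotDefragServiceFullLatencyMs(Time.monotonicNow() - defragStart);
} catch (IOException e) {
  // Bump both the specific and the general failure counters at the point of failure,
  // so the total always equals the sum of the per-type failures.
  snapshotMetrics.incNumSnapshotFullDefragFails();
  snapshotMetrics.incNumSnapshotDefragFails();
  throw e;
}
```
If the general counter is incremented here, the outer-level increment (line 760) would need to be removed or guarded so a re-thrown exception is not counted twice.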
##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/defrag/SnapshotDefragService.java:
##########
@@ -608,18 +653,37 @@ boolean checkAndDefragSnapshot(SnapshotChainManager chainManager, UUID snapshotI
Review Comment:
Latency metrics are set only when the inner defragmentation operation (`performFullDefragmentation` or `performIncrementalDefragmentation`) succeeds, and they do not cover the later steps of the process (e.g. `ingestNonIncrementalTables`, `atomicSwitchSnapshotDB`, or `deleteSnapshotCheckpointDirectories`) or failures in those steps. This is inconsistent: the success counters are incremented only after the whole flow completes (lines 708-713), yet the latency gauge does not reflect the total time taken by the attempt that actually completed. Consider moving the latency recording to the point where all operations have completed successfully (around line 707), or ensure the latency reflects only the specific operation it is named after.
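A hedged sketch of the first alternative, reusing the names from the diff: record the latency at the same completion point as the success counters (around line 707), so the gauge covers ingest, DB switch, and cleanup as well; the setters inside the inner try blocks would then be dropped. Fragment of the surrounding method only, not standalone code.
```java
LOG.info("Completed defragmentation for snapshot: {} (ID: {}) in {} ms",
    snapshotInfo.getTableKey(), snapshotInfo.getSnapshotId(), Time.monotonicNow() - start);
// Elapsed time of the whole attempt, measured from the defragStart taken before
// the inner defrag call and recorded only once everything has succeeded.
long elapsedMs = Time.monotonicNow() - defragStart;
if (isFullDefrag) {
  perfMetrics.setSnapshotDefragServiceFullLatencyMs(elapsedMs);
  snapshotMetrics.incNumSnapshotFullDefrag();
} else {
  perfMetrics.setSnapshotDefragServiceIncLatencyMs(elapsedMs);
  snapshotMetrics.incNumSnapshotIncDefrag();
}
snapshotMetrics.incNumSnapshotDefrag();
```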
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]