hemantk-12 opened a new pull request, #7112:
URL: https://github.com/apache/ozone/pull/7112

   ## What changes were proposed in this pull request?
   A follower node can sometimes lag in flushing the DoubleBuffer. When that 
happens, a single batch may accumulate `purgeKeys` from a snapshot (or 
`setSnapshotProperty`, or `snapshotMoveDeletedKeys`) together with 
`purgeSnapshot`. In that case, `addToBatch` fails for `purgeKeys` (or 
`setSnapshotProperty`, or `snapshotMoveDeletedKeys`) because the snapshot has 
already been purged from the `snapshotInfoTable` cache. 
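
The ordering problem described above can be modelled in a few lines. This is only a minimal sketch: `LaggingFlushSketch`, `QueuedResponse`, `snapshotInfoCache`, and the key `/vol/bucket/snap1` are hypothetical names invented for illustration and are not Ozone classes. It assumes that applying a request updates the cache immediately, while the queued responses are written out later in one batch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, heavily simplified model; none of these types are Ozone classes.
final class LaggingFlushSketch {
  // Stand-in for the snapshotInfoTable cache, updated when a request is applied.
  static final Map<String, String> snapshotInfoCache = new HashMap<>();

  // A queued response that is only written out when the buffer is flushed.
  interface QueuedResponse {
    void addToBatch(List<String> batch);
  }

  // Response that re-reads the snapshot from the cache at flush time.
  static QueuedResponse purgeKeysRereading(String snapshotKey) {
    return batch -> {
      String info = snapshotInfoCache.get(snapshotKey);
      if (info == null) {
        throw new IllegalStateException("Snapshot '" + snapshotKey + "' is not found.");
      }
      batch.add("purgeKeys:" + info);
    };
  }

  // Applying the purge removes the cache entry right away; the write is deferred.
  static QueuedResponse purgeSnapshot(String snapshotKey) {
    snapshotInfoCache.remove(snapshotKey);
    return batch -> batch.add("purgeSnapshot:" + snapshotKey);
  }

  // Applies both requests before flushing; returns the failure message, or null.
  static String runLaggingFlush() {
    snapshotInfoCache.clear();
    snapshotInfoCache.put("/vol/bucket/snap1", "snap1-info");

    List<QueuedResponse> pending = new ArrayList<>();
    pending.add(purgeKeysRereading("/vol/bucket/snap1"));
    pending.add(purgeSnapshot("/vol/bucket/snap1"));

    // The flush happens only now, after the cache has already lost the entry.
    List<String> batch = new ArrayList<>();
    try {
      for (QueuedResponse response : pending) {
        response.addToBatch(batch);
      }
      return null;
    } catch (IllegalStateException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(runLaggingFlush());
  }
}
```

In this model the lookup fails only because both requests were applied before the first response was flushed; flushing after each request would succeed.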
   
   ***PurgeKeys***:
   ```
   2024-08-22 06:48:44,726 ERROR [om113-OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer: Terminating with exit status 1: During flush to DB encountered error in OMDoubleBuffer flush thread om113-OMDoubleBufferFlushThread when handling OMRequest: cmdType: PurgeKeys
   traceID: ""
   success: true
   status: OK
   KEY_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Snapshot '/testvol/buckecobs/snap1724234145' is not found.
           at org.apache.hadoop.ozone.om.snapshot.SnapshotUtils.getSnapshotInfo(SnapshotUtils.java:84)
           at org.apache.hadoop.ozone.om.OmSnapshotManager.getSnapshot(OmSnapshotManager.java:666)
           at org.apache.hadoop.ozone.om.OmSnapshotManager.getSnapshot(OmSnapshotManager.java:661)
           at org.apache.hadoop.ozone.om.OmSnapshotManager.getSnapshot(OmSnapshotManager.java:644)
           at org.apache.hadoop.ozone.om.response.key.OMKeyPurgeResponse.addToDBBatch(OMKeyPurgeResponse.java:82)
           at org.apache.hadoop.ozone.om.response.OMClientResponse.checkAndUpdateDB(OMClientResponse.java:66)
           at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$8(OzoneManagerDoubleBuffer.java:408)
           at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.addToBatchWithTrace(OzoneManagerDoubleBuffer.java:253)
           ...
           ...
   ```
   
   ***SnapshotMoveDeletedKeys***:
   ```
   2024-07-11 16:39:41,419 ERROR [om136-OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer: Terminating with exit status 1: During flush to DB encountered error in OMDoubleBuffer flush thread om136-OMDoubleBufferFlushThread when handling OMRequest: cmdType: SnapshotMoveDeletedKeys
   traceID: ""
   success: true
   status: OK
   java.io.IOException: Rocks Database is closed
           at org.apache.hadoop.hdds.utils.db.RocksDatabase.acquire(RocksDatabase.java:439)
           at org.apache.hadoop.hdds.utils.db.RocksDatabase.batchWrite(RocksDatabase.java:776)
           at org.apache.hadoop.hdds.utils.db.RocksDatabase.batchWrite(RocksDatabase.java:785)
           at org.apache.hadoop.hdds.utils.db.RDBBatchOperation.commit(RDBBatchOperation.java:348)
           at org.apache.hadoop.hdds.utils.db.RDBStore.commitBatchOperation(RDBStore.java:285)
           at org.apache.hadoop.ozone.om.response.snapshot.OMSnapshotMoveDeletedKeysResponse.addToDBBatch(OMSnapshotMoveDeletedKeysResponse.java:136)
           at org.apache.hadoop.ozone.om.response.OMClientResponse.checkAndUpdateDB(OMClientResponse.java:66)
           at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$8(OzoneManagerDoubleBuffer.java:408)
           at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.addToBatchWithTrace(OzoneManagerDoubleBuffer.java:253)
           ...
           ...
   ```
   
   I simulated the same scenario in an integration test, and it failed every 
time. [Test without the 
fix](https://github.com/hemantk-12/ozone/actions/runs/10533485504/job/29190017932).
   
   To fix this problem, we are making two main changes:
   * `OmXyzResponse` should use the `snapshotInfo` passed by the 
`OmXyzRequest`. (Note: this should be the general way to handle 
Request/Response; the response should not re-read the value.)
   * For SnapshotCache, check for the snapshot in the cache first and then 
fall back to SnapshotInfoTable.
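
The two changes listed above can be sketched roughly as follows. All names here (`SnapshotFixSketch`, `PurgeKeysResponse`, `lookup`, the plain `Map` stand-ins) are hypothetical and do not reflect the actual Ozone code. The sketch also assumes that the second change amounts to a cache-first lookup with a table fallback, which is only one reading of the description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the two changes; these are not the Ozone classes.
final class SnapshotFixSketch {
  // Change 1: the response keeps the snapshot info that the request already
  // resolved, so nothing is re-read at flush time.
  static final class PurgeKeysResponse {
    private final String snapshotInfo;

    PurgeKeysResponse(String snapshotInfo) {
      this.snapshotInfo = snapshotInfo;
    }

    String addToBatch() {
      return "purgeKeys:" + snapshotInfo;
    }
  }

  // Change 2 (as assumed here): consult the cache first, then the backing table.
  static String lookup(Map<String, String> cache, Map<String, String> table, String key) {
    String cached = cache.get(key);
    return cached != null ? cached : table.get(key);
  }

  static String demo() {
    Map<String, String> cache = new HashMap<>();
    Map<String, String> table = new HashMap<>();
    String key = "/vol/bucket/snap1";
    cache.put(key, "snap1-info");

    // The request resolves the snapshot while it is still visible.
    PurgeKeysResponse response = new PurgeKeysResponse(lookup(cache, table, key));
    // A later purgeSnapshot removes the entry before the flush happens.
    cache.remove(key);
    // The flush still succeeds because the response carries its own copy.
    return response.addToBatch();
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Because the response holds the value resolved at request time, the order in which later requests mutate the cache no longer affects the flush.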
   
   ## What is the link to the Apache JIRA
   HDDS-11152
   
   ## How was this patch tested?
   Added integration tests; the CI/CD run is green: 
https://github.com/hemantk-12/ozone/actions/runs/10533935479.
   


