hemantk-12 opened a new pull request, #7112:
URL: https://github.com/apache/ozone/pull/7112
## What changes were proposed in this pull request?
A follower node can lag behind in flushing the DoubleBuffer. When it does, a
single batch may contain `purgeKeys` from a snapshot (or
`setSnapshotProperty`, or `snapshotMoveDeletedKeys`) together with the
`purgeSnapshot` for that same snapshot. In that case, `addToDBBatch` fails for
`purgeKeys` (or `setSnapshotProperty`, or `snapshotMoveDeletedKeys`) because
the snapshot has already been purged from the `snapshotInfoTable` cache.
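The ordering problem described above can be sketched with a toy model. This is only an illustration under stated assumptions: `LaggingFlushSketch`, its fields, and its methods are hypothetical names, not the Ozone Manager API, and the snapshot key in the usage note is made up. The sketch assumes that applying a request updates the cache immediately, while the queued response looks the snapshot up again only at flush time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these are NOT the real Ozone classes or method names.
final class LaggingFlushSketch {
  // Hypothetical stand-in for the snapshotInfoTable cache, updated at apply time.
  private final Map<String, String> tableCache = new HashMap<>();
  // Snapshot keys of purgeKeys responses that are queued but not yet flushed.
  private final List<String> pendingPurgeKeys = new ArrayList<>();

  void createSnapshot(String snapshotKey) {
    tableCache.put(snapshotKey, "info:" + snapshotKey);
  }

  // Apply phase: the request is checked against the cache, then a response is queued.
  void applyPurgeKeys(String snapshotKey) {
    if (!tableCache.containsKey(snapshotKey)) {
      throw new IllegalStateException("Snapshot not found: " + snapshotKey);
    }
    pendingPurgeKeys.add(snapshotKey);
  }

  // Apply phase: the snapshot disappears from the cache right away.
  void applyPurgeSnapshot(String snapshotKey) {
    tableCache.remove(snapshotKey);
  }

  // Flush phase: every queued response re-reads the cache.
  // Returns the snapshot keys whose lookup failed during this flush.
  List<String> flush() {
    List<String> failed = new ArrayList<>();
    for (String snapshotKey : pendingPurgeKeys) {
      if (!tableCache.containsKey(snapshotKey)) {
        failed.add(snapshotKey);
      }
    }
    pendingPurgeKeys.clear();
    return failed;
  }
}
```

In this model, calling `applyPurgeKeys("/vol/bucket/snap1")` and then `applyPurgeSnapshot("/vol/bucket/snap1")` before `flush()` reports the key as failed, whereas flushing between the two calls reports nothing.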
***PurgeKeys***:
```
2024-08-22 06:48:44,726 ERROR [om113-OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer: Terminating with exit status 1: During flush to DB encountered error in OMDoubleBuffer flush thread om113-OMDoubleBufferFlushThread when handling OMRequest: cmdType: PurgeKeys
traceID: ""
success: true
status: OK
KEY_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Snapshot '/testvol/buckecobs/snap1724234145' is not found.
    at org.apache.hadoop.ozone.om.snapshot.SnapshotUtils.getSnapshotInfo(SnapshotUtils.java:84)
    at org.apache.hadoop.ozone.om.OmSnapshotManager.getSnapshot(OmSnapshotManager.java:666)
    at org.apache.hadoop.ozone.om.OmSnapshotManager.getSnapshot(OmSnapshotManager.java:661)
    at org.apache.hadoop.ozone.om.OmSnapshotManager.getSnapshot(OmSnapshotManager.java:644)
    at org.apache.hadoop.ozone.om.response.key.OMKeyPurgeResponse.addToDBBatch(OMKeyPurgeResponse.java:82)
    at org.apache.hadoop.ozone.om.response.OMClientResponse.checkAndUpdateDB(OMClientResponse.java:66)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$8(OzoneManagerDoubleBuffer.java:408)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.addToBatchWithTrace(OzoneManagerDoubleBuffer.java:253)
...
...
```
***SnapshotMoveDeletedKeys***:
```
2024-07-11 16:39:41,419 ERROR [om136-OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer: Terminating with exit status 1: During flush to DB encountered error in OMDoubleBuffer flush thread om136-OMDoubleBufferFlushThread when handling OMRequest: cmdType: SnapshotMoveDeletedKeys
traceID: ""
success: true
status: OK
java.io.IOException: Rocks Database is closed
    at org.apache.hadoop.hdds.utils.db.RocksDatabase.acquire(RocksDatabase.java:439)
    at org.apache.hadoop.hdds.utils.db.RocksDatabase.batchWrite(RocksDatabase.java:776)
    at org.apache.hadoop.hdds.utils.db.RocksDatabase.batchWrite(RocksDatabase.java:785)
    at org.apache.hadoop.hdds.utils.db.RDBBatchOperation.commit(RDBBatchOperation.java:348)
    at org.apache.hadoop.hdds.utils.db.RDBStore.commitBatchOperation(RDBStore.java:285)
    at org.apache.hadoop.ozone.om.response.snapshot.OMSnapshotMoveDeletedKeysResponse.addToDBBatch(OMSnapshotMoveDeletedKeysResponse.java:136)
    at org.apache.hadoop.ozone.om.response.OMClientResponse.checkAndUpdateDB(OMClientResponse.java:66)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$8(OzoneManagerDoubleBuffer.java:408)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.addToBatchWithTrace(OzoneManagerDoubleBuffer.java:253)
...
...
```
I reproduced this scenario in an integration test, and without the fix it
failed on every run:
[Test without the fix](https://github.com/hemantk-12/ozone/actions/runs/10533485504/job/29190017932).
To fix this problem, we are making two main changes:
* `OmXyzResponse` should use the snapshotInfo passed in by `OmXyzRequest`
instead of re-reading it. (Note: this should be the general way to handle
Request/Response; the response should not re-read a value the request
already has.)
* `SnapshotCache` should check for the snapshot in the cache first and then
fall back to `snapshotInfoTable`.
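
The second change, a cache-first lookup with a table fallback, could look roughly like the following. This is a minimal sketch and not the code in this PR: `SnapshotLookupSketch`, its two maps, and its method names are hypothetical stand-ins, and it assumes a snapshot that was loaded before the purge remains in the cache until the pending batch has been flushed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Toy model only: these are NOT the real Ozone classes or method names.
final class SnapshotLookupSketch {
  // Hypothetical stand-in for SnapshotCache: snapshots that are already loaded.
  private final Map<String, String> loadedSnapshots = new HashMap<>();
  // Hypothetical stand-in for snapshotInfoTable (including its table cache).
  private final Map<String, String> snapshotInfoTable = new HashMap<>();

  void createSnapshot(String snapshotKey, String info) {
    snapshotInfoTable.put(snapshotKey, info);
  }

  // Loading a snapshot reads the table once and keeps the result in the cache.
  Optional<String> load(String snapshotKey) {
    Optional<String> info = Optional.ofNullable(snapshotInfoTable.get(snapshotKey));
    info.ifPresent(value -> loadedSnapshots.put(snapshotKey, value));
    return info;
  }

  // A purge applied ahead of the flush removes only the table entry in this model.
  void purgeSnapshot(String snapshotKey) {
    snapshotInfoTable.remove(snapshotKey);
  }

  // Lookup that always re-reads the table; it misses snapshots purged ahead of the flush.
  Optional<String> lookupTableOnly(String snapshotKey) {
    return Optional.ofNullable(snapshotInfoTable.get(snapshotKey));
  }

  // Lookup that checks the cache first and falls back to the table on a miss.
  Optional<String> lookupCacheFirst(String snapshotKey) {
    String cached = loadedSnapshots.get(snapshotKey);
    return cached != null ? Optional.of(cached) : lookupTableOnly(snapshotKey);
  }
}
```

With a snapshot that is loaded and then purged, `lookupTableOnly` returns an empty `Optional` while `lookupCacheFirst` still returns the cached value; for a snapshot that was never loaded, both fall through to the table.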
## What is the link to the Apache JIRA
HDDS-11152
## How was this patch tested?
Added integration tests; the CI run is green:
https://github.com/hemantk-12/ozone/actions/runs/10533935479.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]