[ https://issues.apache.org/jira/browse/HDDS-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728132#comment-17728132 ]
Siyao Meng edited comment on HDDS-8732 at 5/31/23 9:14 PM:
-----------------------------------------------------------
-This looks to be a bug in "snapshot list" command. This would fail whenever
double buffer is not flushed fast enough.-
-When snapshot listing is iterating over the DB it doesn't check whether the
entry is already populated from the cache and is incorrectly overwriting it in
{{appendSnapshotFromDBToMap}}:-
https://github.com/apache/ozone/blob/586a20274c5505c0ea083101a4d4e681529098fb/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmMetadataManagerImpl.java#L1298-L1299
-The solution would be to *either*:-
-1. in {{appendSnapshotFromDBToMap}}, skip the result entry population if the
entry already exists because it is already populated from the cache (which is
the source of truth)-
-2. or swap the order of calls to {{appendSnapshotFromDBToMap}} and
{{appendSnapshotFromCacheToMap}}.-
Thanks [~aswinshakil] for pointing this out.
cc [~wfps1210] [~ppogde]
Oh, never mind. It looks like the check is indeed there already.
[~dteng] You could double-check whether there are any unhandled edge cases in
the listing logic, or anywhere else that could go wrong.
Other reference points:
1. The cache update is done here, in OMSnapshotDeleteRequest, in the Ratis state machine:
https://github.com/apache/ozone/blob/e76f312183ca29cda8abcf963963d32223388b77/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/snapshot/OMSnapshotDeleteRequest.java#L177-L185
2. The DB update is done here, in OMSnapshotDeleteResponse, during the double
buffer flush (this should happen very quickly when the double buffer thread is
not busy):
https://github.com/apache/ozone/blob/ccc814ee7f8a7cac70b6d64dcae12056e375199d/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/response/snapshot/OMSnapshotDeleteResponse.java#L60-L65
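To make the two reference points above easier to reason about, here is a minimal toy model of the behaviour described in this comment: a delete first marks the snapshot in the cache, the DB catches up only at the double buffer flush, and a listing must let the cache entry win in the meantime. This is a sketch under stated assumptions, not the actual Ozone code. The class name, the method names, the plain string keys, and the use of in-memory maps in place of the table cache and RocksDB are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy model only; every name here is hypothetical, not an Ozone API. */
public class SnapshotListingSketch {
    // Stand-in for the table cache (assumed to be the source of truth).
    private final Map<String, String> cache = new HashMap<>();
    // Stand-in for the on-disk table that lags until the flush.
    private final Map<String, String> db = new TreeMap<>();

    /** Simulates a snapshot that was created and already flushed to the DB. */
    public void createAndFlush(String key) {
        db.put(key, "SNAPSHOT_ACTIVE");
    }

    /** Request phase: only the cache sees the new status. */
    public void markDeletedInCache(String key) {
        cache.put(key, "SNAPSHOT_DELETED");
    }

    /** Response phase: the flush copies pending cache entries to the DB. */
    public void flush() {
        db.putAll(cache);
        cache.clear();
    }

    /** Listing: cache entries first, then DB entries only for absent keys. */
    public Map<String, String> list() {
        Map<String, String> result = new TreeMap<>(cache);
        for (Map.Entry<String, String> entry : db.entrySet()) {
            // putIfAbsent keeps the cache value when the DB row is stale.
            result.putIfAbsent(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        SnapshotListingSketch om = new SnapshotListingSketch();
        String key = "snapvolume-1/snapbucket-1/snapshot2"; // illustrative key
        om.createAndFlush(key);
        om.markDeletedInCache(key);
        System.out.println("before flush: " + om.list());
        om.flush();
        System.out.println("after flush:  " + om.list());
    }
}
```

In this model the listing reports SNAPSHOT_DELETED both before and after the flush. A listing that used a plain put for the DB entries would instead report SNAPSHOT_ACTIVE until the flush completes, which is the symptom seen in the failing test output.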
> Intermittent failure at Delete snapshot in upgrade test
> -------------------------------------------------------
>
> Key: HDDS-8732
> URL: https://issues.apache.org/jira/browse/HDDS-8732
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: Snapshot
> Affects Versions: 1.4.0
> Reporter: Attila Doroszlai
> Assignee: Dave Teng
> Priority: Critical
>
> {code:title=https://github.com/apache/ozone/actions/runs/5127514810/jobs/9223531259#step:5:899}
> Delete snapshot | FAIL |
> '[ {
> "volumeName" : "snapvolume-1",
> "bucketName" : "snapbucket-1",
> "name" : "snapshot2",
> "creationTime" : 1685495519725,
> "snapshotStatus" : "SNAPSHOT_ACTIVE",
> "snapshotID" : "206b010f-499e-4ce3-95c4-cc737d6f1003",
> "snapshotPath" : "snapvolume-1/snapbucket-1",
> "checkpointDir" : "-206b010f-499e-4ce3-95c4-cc737d6f1003"
> } ]' does not contain 'SNAPSHOT_DELETED'
> ------------------------------------------------------------------------------
> Upgrade-Snapshot-Check :: Smoketest ozone cluster snapshot feature | FAIL |
> {code}
> CC [~dteng], [~smeng]