[ 
https://issues.apache.org/jira/browse/HDDS-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728132#comment-17728132
 ] 

Siyao Meng edited comment on HDDS-8732 at 5/31/23 9:14 PM:
-----------------------------------------------------------

-This looks to be a bug in "snapshot list" command. This would fail whenever 
double buffer is not flushed fast enough.-

-When snapshot listing is iterating over the DB it doesn't check whether the 
entry is already populated from the cache and is incorrectly overwriting it in 
{{appendSnapshotFromDBToMap}}:-

https://github.com/apache/ozone/blob/586a20274c5505c0ea083101a4d4e681529098fb/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmMetadataManagerImpl.java#L1298-L1299

-The solution would be to *either*:-

-1. in {{appendSnapshotFromDBToMap}}, skip the result entry population if the 
entry already exists because it is already populated from the cache (which is 
the source of truth)-
-2. or swap the order of calls to {{appendSnapshotFromDBToMap}} and 
{{appendSnapshotFromCacheToMap}}.-

Thanks [~aswinshakil] for pointing this out.

cc [~wfps1210] [~ppogde]

Never mind, it looks like the check for an already-populated entry is indeed there.

[~dteng] You could double-check whether there are any unhandled edge cases in the 
listing logic, or anywhere else that could go wrong.

Other reference points:

1. The cache update is done here in {{OMSnapshotDeleteRequest}}, in the Ratis state machine:

https://github.com/apache/ozone/blob/e76f312183ca29cda8abcf963963d32223388b77/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/snapshot/OMSnapshotDeleteRequest.java#L177-L185

2. The DB update is done here in {{OMSnapshotDeleteResponse}} during the double buffer 
flush (this should happen very quickly when the double buffer thread is not busy):

https://github.com/apache/ozone/blob/ccc814ee7f8a7cac70b6d64dcae12056e375199d/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/response/snapshot/OMSnapshotDeleteResponse.java#L60-L65
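
To make the ordering between the two reference points concrete, here is a minimal sketch of the cache-then-DB pattern described above. It is not the actual Ozone code: {{SnapshotListSketch}}, its two maps, and all of its method names are hypothetical stand-ins for the table cache, the DB table, and the request/response/list paths, and the flush is simplified to "copy the cache into the DB and clear it". It only shows why listing must treat the cache as the source of truth while the double buffer has not flushed yet.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified model. None of these names exist in Ozone.
class SnapshotListSketch {
    enum Status { SNAPSHOT_ACTIVE, SNAPSHOT_DELETED }

    // Stand-ins for the in-memory table cache and the on-disk DB table.
    private final Map<String, Status> cache = new TreeMap<>();
    private final Map<String, Status> db = new TreeMap<>();

    // Assumed starting point: a snapshot that was created and already flushed.
    void createAndFlush(String key) {
        db.put(key, Status.SNAPSHOT_ACTIVE);
    }

    // Reference point 1: the delete request updates only the cache.
    void deleteSnapshot(String key) {
        cache.put(key, Status.SNAPSHOT_DELETED);
    }

    // Reference point 2: the flush persists pending entries to the DB.
    // Clearing the cache afterwards is a simplification of cache eviction.
    void flush() {
        db.putAll(cache);
        cache.clear();
    }

    // Listing: populate from the cache first, then add DB entries only
    // when the key is not already present, so stale DB rows never win.
    Map<String, Status> listSnapshots() {
        Map<String, Status> result = new TreeMap<>(cache);
        for (Map.Entry<String, Status> e : db.entrySet()) {
            result.putIfAbsent(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        SnapshotListSketch om = new SnapshotListSketch();
        String key = "/snapvolume-1/snapbucket-1/snapshot2";
        om.createAndFlush(key);
        om.deleteSnapshot(key);
        // Listed before the flush: the cache entry must take precedence.
        System.out.println(om.listSnapshots());
        om.flush();
        // Listed after the flush: the DB now holds the same state.
        System.out.println(om.listSnapshots());
    }
}
```

If the DB loop used {{put}} instead of {{putIfAbsent}}, a listing issued between the delete request and the flush would report the stale {{SNAPSHOT_ACTIVE}} row, which matches the symptom in the failing test. Since the real code reportedly has this check already, the sketch is only a baseline for reasoning about what other edge cases could produce the same output.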


> Intermittent failure at Delete snapshot in upgrade test
> -------------------------------------------------------
>
>                 Key: HDDS-8732
>                 URL: https://issues.apache.org/jira/browse/HDDS-8732
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: Snapshot
>    Affects Versions: 1.4.0
>            Reporter: Attila Doroszlai
>            Assignee: Dave Teng
>            Priority: Critical
>
> {code:title=https://github.com/apache/ozone/actions/runs/5127514810/jobs/9223531259#step:5:899}
> Delete snapshot                                                       | FAIL |
> '[ {
>   "volumeName" : "snapvolume-1",
>   "bucketName" : "snapbucket-1",
>   "name" : "snapshot2",
>   "creationTime" : 1685495519725,
>   "snapshotStatus" : "SNAPSHOT_ACTIVE",
>   "snapshotID" : "206b010f-499e-4ce3-95c4-cc737d6f1003",
>   "snapshotPath" : "snapvolume-1/snapbucket-1",
>   "checkpointDir" : "-206b010f-499e-4ce3-95c4-cc737d6f1003"
> } ]' does not contain 'SNAPSHOT_DELETED'
> ------------------------------------------------------------------------------
> Upgrade-Snapshot-Check :: Smoketest ozone cluster snapshot feature    | FAIL |
> {code}
> CC [~dteng], [~smeng]


