[ https://issues.apache.org/jira/browse/HDDS-11267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870252#comment-17870252 ]
Arafat Khan commented on HDDS-11267:
------------------------------------

After investigation, we identified the root cause of datanodes reporting negative container sizes. The problem was particularly noticeable when deleting containers that had already been marked for deletion: some containers returned negative values for the used-bytes and block-count metrics. For example:

{code:java}
sh-4.2$ ozone admin container list | jq '. | {state: .state, containerID: .containerID, usedBytes: .usedBytes}'
{
  "state": "DELETED",
  "containerID": 1,
  "usedBytes": -100000000
}
{
  "state": "DELETED",
  "containerID": 2,
  "usedBytes": -95420416
}
{
  "state": "DELETED",
  "containerID": 3,
  "usedBytes": -97517568
}
{code}

We examined the deletion process in detail:

# *Normal flow:*
** The OM keeps track of blocks and keys. When keys are deleted, OM prepares the list of blocks associated with them and sends a deletion request to SCM.
** SCM assigns a new transaction ID to the deletion request and sends it to the datanodes hosting the containers that hold those blocks.
** The datanode retrieves the block information from its *{{blockInfo}}* table, deletes the blocks, and decrements the used-bytes and block-count metrics accordingly.
# *Issue with duplicate requests:*
** OM may retry a delete-block request if the same key is picked up in the next iteration before the previous transaction has been flushed to the database. Retries after failures can also resend the same key's block deletions.
** SCM, unaware of the duplication, assigns a new transaction ID and forwards the request to the datanode.
** When the datanode receives the duplicate request, it attempts to delete the already-deleted blocks. It fails to find them, but still updates the metrics, driving the values negative.

We confirmed this by adding extra logging.
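The failure mode above can be sketched as follows: a datanode that decrements metrics per *requested* block goes negative on a replayed request, whereas one that decrements only for blocks actually found in its {{blockInfo}} table does not. This is a minimal illustration only; the class and method names here are hypothetical, not the actual Ozone code:

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch (names are illustrative, not real Ozone classes).
public class BlockDeleter {

  // Stand-in for the datanode's blockInfo table: localID -> block length.
  private final Map<Long, Long> blockInfo = new HashMap<>();
  public final AtomicLong usedBytes = new AtomicLong();
  public final AtomicLong blockCount = new AtomicLong();

  public void putBlock(long localId, long length) {
    blockInfo.put(localId, length);
    usedBytes.addAndGet(length);
    blockCount.incrementAndGet();
  }

  /** Deletes the requested blocks, updating metrics only for blocks found. */
  public void deleteBlocks(List<Long> localIds) {
    for (long id : localIds) {
      Long length = blockInfo.remove(id);
      if (length == null) {
        // Block already deleted (e.g. a duplicate transaction):
        // skip the metric update instead of blindly decrementing.
        continue;
      }
      usedBytes.addAndGet(-length);
      blockCount.decrementAndGet();
    }
  }
}
{code}

With this shape, replaying the same deletion transaction leaves {{usedBytes}} and {{blockCount}} at zero rather than pushing them negative.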
For example:

{code:java}
// The first, valid request
2024-07-29 12:00:30 2024-07-29 06:30:30,815 [DeleteBlocksCommandHandlerThread-1] INFO commandhandler.DeleteBlocksCommandHandler: isDuplicateTransaction called with containerId: 2, containerDataDeleteTxnID: 0, delTX-ID: 2
2024-07-29 12:00:30 localID: 113750153625600011
2024-07-29 12:00:30 localID: 113750153625600014
2024-07-29 12:00:30 localID: 113750153625600017

// The second, duplicate request
2024-07-29 12:00:30 2024-07-29 06:30:30,846 [DeleteBlocksCommandHandlerThread-2] INFO commandhandler.DeleteBlocksCommandHandler: isDuplicateTransaction called with containerId: 2, containerDataDeleteTxnID: 2, delTX-ID: 6
2024-07-29 12:00:30 localID: 113750153625600011
2024-07-29 12:00:30 localID: 113750153625600014
2024-07-29 12:00:30 localID: 113750153625600017
{code}

Upon receiving the duplicate request, no blocks are found to delete, resulting in the following log:

{code:java}
2024-07-31 13:32:28 2024-07-31 08:02:28,869 [BlockDeletingService#3] WARN impl.FilePerBlockStrategy: Block file to be deleted does not exist: /data/hdds/.../chunks/113750153625600011.block
{code}

> Ozone Datanode Reporting Negative Container values for UsedBytes and
> BlockCount parameters
> ------------------------------------------------------------------------------------------
>
>                 Key: HDDS-11267
>                 URL: https://issues.apache.org/jira/browse/HDDS-11267
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode, Ozone Recon
>            Reporter: Arafat Khan
>            Assignee: Arafat Khan
>            Priority: Major
>
> The issue involves datanodes in Ozone reporting negative container sizes for
> the {{usedBytes}} and block count metrics. This occurs when the Ozone Manager
> sends duplicate block deletion requests to the Storage Container Manager. Due
> to a delay in processing the original request, OM may mistakenly send a
> duplicate request.
> The datanode, upon receiving the duplicate request, attempts to delete
> blocks that have already been deleted, but still updates the metrics,
> leading to negative values. The proposed solution is to modify the deletion
> process in the datanode to track and ignore duplicate block deletion
> requests, ensuring metrics are not updated incorrectly.
>
> Recon reported the following negative-sized containers:
> {code:java}
> sh-4.2$ ozone admin container list | jq '. | {state: .state, containerID: .containerID, usedBytes: .usedBytes}'
> {
>   "state": "DELETED",
>   "containerID": 1,
>   "usedBytes": -100000000
> }
> {
>   "state": "DELETED",
>   "containerID": 2,
>   "usedBytes": -95420416
> }
> {
>   "state": "DELETED",
>   "containerID": 3,
>   "usedBytes": -97517568
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
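At the file level, the "Block file to be deleted does not exist" warning above suggests a simple guard: decrement metrics only when the block file was actually present and removed, so a replayed deletion is a no-op. A minimal sketch under that assumption (the class and parameter names are hypothetical, not the actual {{FilePerBlockStrategy}} code):

{code:java}
import java.io.File;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: account for a deletion only when it really happened.
public final class SafeBlockFileDelete {

  /**
   * Deletes a block file and decrements the container metrics only when
   * the file was actually present and removed. A duplicate request finds
   * no file and leaves the metrics untouched.
   */
  public static boolean deleteAndAccount(File blockFile, long blockLength,
      AtomicLong usedBytes, AtomicLong blockCount) {
    if (!blockFile.exists()) {
      return false; // already gone (duplicate request): skip metric update
    }
    if (!blockFile.delete()) {
      return false; // deletion failed: metrics still reflect the file
    }
    usedBytes.addAndGet(-blockLength);
    blockCount.decrementAndGet();
    return true;
  }
}
{code}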