[ https://issues.apache.org/jira/browse/HDDS-11267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870252#comment-17870252 ]

Arafat Khan commented on HDDS-11267:
------------------------------------

After investigation, we identified the root cause of datanodes reporting 
negative container sizes. The problem was particularly noticeable when 
attempting to delete containers that had already been marked for deletion: 
some of those containers reported negative values for the used bytes and 
block count metrics.

For example:
{code:java}
sh-4.2$ ozone admin container list | jq '. | {state: .state, containerID: .containerID, usedBytes: .usedBytes}'
{
  "state": "DELETED",
  "containerID": 1,
  "usedBytes": -100000000
}
{
  "state": "DELETED",
  "containerID": 2,
  "usedBytes": -95420416
}
{
  "state": "DELETED",
  "containerID": 3,
  "usedBytes": -97517568
} {code}
We examined the deletion process in detail:
 # *Normal Flow:*
 ** The OM keeps track of the blocks and keys. When keys are deleted, OM 
prepares a list of blocks associated with them and sends a deletion request to 
SCM.
 ** SCM assigns a new transaction ID to the deletion request and sends it to 
the datanodes holding the containers with those blocks.
 ** The datanode retrieves block information from its *{{blockInfo}}* table, 
deletes the blocks, and updates the metrics for used bytes and block count 
accordingly.
 # *Issue with Duplicate Requests:*
 ** OM may resend delete block requests if the same key is picked up again in 
the next iteration before the previous transaction has been flushed to the 
database. It can also retry deletion of the same key's blocks after a failure.
 ** SCM, unaware of the duplication, assigns a new transaction ID and forwards 
the request to the datanode.
 ** When the datanode receives this duplicate request, it attempts to delete 
the already-deleted blocks. It fails to find them, but still updates the 
metrics, driving them negative (see the sketch after this list).
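
To make the failure mode concrete, here is a minimal, self-contained sketch 
(not the actual datanode code; the class and field names are invented for 
illustration) that applies the same delete batch twice and decrements the 
container metrics unconditionally, which is enough to drive them negative:
{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a container's metadata; not the real ContainerData class.
class ToyContainer {
    long usedBytes = 100_000_000L;
    long blockCount = 3;
    final Set<Long> liveBlocks = new HashSet<>(List.of(11L, 14L, 17L));

    // Buggy pattern: metrics are decremented for every requested block,
    // whether or not the block still exists in the container.
    void deleteBlocks(List<Long> localIds, long bytesPerBlock) {
        for (long id : localIds) {
            liveBlocks.remove(id);          // no-op on the second (duplicate) request
            usedBytes -= bytesPerBlock;     // still decremented
            blockCount -= 1;                // still decremented
        }
    }
}

public class DuplicateDeleteDemo {
    public static void main(String[] args) {
        ToyContainer container = new ToyContainer();
        List<Long> batch = List.of(11L, 14L, 17L);   // same blocks in both transactions
        container.deleteBlocks(batch, 33_000_000L);  // first delete txn: metrics reach roughly zero
        container.deleteBlocks(batch, 33_000_000L);  // duplicate txn with a new ID: metrics go negative
        System.out.println("usedBytes=" + container.usedBytes
            + " blockCount=" + container.blockCount);
    }
}
{code}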

We confirmed this issue by adding extra logs. For example:
{code:java}
// The first valid request
2024-07-29 12:00:30 2024-07-29 06:30:30,815 
[DeleteBlocksCommandHandlerThread-1] INFO 
commandhandler.DeleteBlocksCommandHandler: isDuplicateTransaction called with 
containerId: 2, containerDataDeleteTxnID: 0, delTX-ID: 2
2024-07-29 12:00:30 localID: 113750153625600011
2024-07-29 12:00:30 localID: 113750153625600014
2024-07-29 12:00:30 localID: 113750153625600017

// The second duplicate request
2024-07-29 12:00:30 2024-07-29 06:30:30,846 
[DeleteBlocksCommandHandlerThread-2] INFO 
commandhandler.DeleteBlocksCommandHandler: isDuplicateTransaction called with 
containerId: 2, containerDataDeleteTxnID: 2, delTX-ID: 6
2024-07-29 12:00:30 localID: 113750153625600011
2024-07-29 12:00:30 localID: 113750153625600014
2024-07-29 12:00:30 localID: 113750153625600017{code}
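
These log lines also show why the existing transaction-ID based duplicate check 
does not catch the retry: the container's recorded delete transaction ID is 2, 
while the retried batch arrives with a fresh ID of 6 assigned by SCM. Below is 
a simplified illustration of that comparison (the real {{isDuplicateTransaction}} 
logic may differ in detail; the values are taken from the logs above):
{code:java}
public class TxnCheckDemo {
    // Simplified form of a transaction-ID based duplicate check: a batch is
    // treated as a duplicate only if its ID is not newer than the last delete
    // transaction ID recorded on the container.
    static boolean isDuplicate(long containerDeleteTxnId, long requestTxnId) {
        return requestTxnId <= containerDeleteTxnId;
    }

    public static void main(String[] args) {
        // First request: container has seen no delete txn yet (0), request ID is 2.
        System.out.println(isDuplicate(0, 2));   // false -> processed, container records txn ID 2

        // Retried batch for the same blocks: SCM assigned a new ID (6), container is at 2.
        System.out.println(isDuplicate(2, 6));   // false -> processed again even though it is a retry
    }
}
{code}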
When the datanode processes the duplicate request, it finds no blocks to 
delete, resulting in the following log:
{code:java}
2024-07-31 13:32:28 2024-07-31 08:02:28,869 [BlockDeletingService#3] WARN 
impl.FilePerBlockStrategy: Block file to be deleted does not exist: 
/data/hdds/.../chunks/113750153625600011.block {code}
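
The issue description below proposes tracking and ignoring duplicate block 
deletion requests; a narrower guard that would at least keep the metrics 
consistent is to update them only for blocks that were actually found and 
removed. A rough, hypothetical sketch of that idea (not the actual Ozone 
patch; the class and method names are invented):
{code:java}
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical guard, not the actual Ozone patch: container metrics are only
// adjusted for blocks whose files were really present and removed.
public class GuardedDeleteDemo {
    long usedBytes = 1024;
    long blockCount = 1;

    void deleteBlock(Path blockFile, long blockBytes) throws Exception {
        // Files.deleteIfExists returns false when the file is already gone,
        // e.g. because an earlier delete transaction removed it.
        boolean removed = Files.deleteIfExists(blockFile);
        if (!removed) {
            return;  // duplicate request: skip the metric update
        }
        usedBytes -= blockBytes;
        blockCount -= 1;
    }

    public static void main(String[] args) throws Exception {
        GuardedDeleteDemo container = new GuardedDeleteDemo();
        Path block = Files.createTempFile("block", ".block");
        container.deleteBlock(block, 1024);  // first delete: metrics drop to zero
        container.deleteBlock(block, 1024);  // duplicate: metrics stay at zero instead of going negative
        System.out.println("usedBytes=" + container.usedBytes
            + " blockCount=" + container.blockCount);
    }
}
{code}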

> Ozone Datanode Reporting Negative Container values for UsedBytes and 
> BlockCount parameters
> ------------------------------------------------------------------------------------------
>
>                 Key: HDDS-11267
>                 URL: https://issues.apache.org/jira/browse/HDDS-11267
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode, Ozone Recon
>            Reporter: Arafat Khan
>            Assignee: Arafat Khan
>            Priority: Major
>
> The issue involves datanodes in Ozone reporting negative container sizes for 
> the {{usedBytes}} and block count metrics. This occurs when the Ozone Manager 
> sends duplicate block deletion requests to the Storage Container Manager. Due 
> to a delay in processing the original request, OM may mistakenly send a 
> duplicate request. The datanode, upon receiving the duplicate request, 
> attempts to delete blocks that have already been deleted, but still updates 
> the metrics, leading to negative values. The proposed solution is to modify 
> the deletion process in the datanode to track and ignore duplicate block 
> deletion requests, ensuring metrics are not updated incorrectly.
> Recon reported the following negative-sized containers:
> {code:java}
> sh-4.2$ ozone admin container list | jq '. | {state: .state, containerID: .containerID, usedBytes: .usedBytes}'
> {
>   "state": "DELETED",
>   "containerID": 1,
>   "usedBytes": -100000000
> }
> {
>   "state": "DELETED",
>   "containerID": 2,
>   "usedBytes": -95420416
> }
> {
>   "state": "DELETED",
>   "containerID": 3,
>   "usedBytes": -97517568
> }{code}


