[jira] [Commented] (HDDS-15651) [DiskBalancer] markContainerForDelete failure is treated as a successful move

Arun Sarin (Jira) Wed, 24 Jun 2026 10:31:09 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18091266#comment-18091266
 ]


Arun Sarin commented on HDDS-15651:
-----------------------------------

Hi [~gargijaiswal] [~sammichen] ,

I have a few questions on the DiskBalancer cleanup flow. Sharing my assumptions 
below, please let me know if these are correct.

*Move and cleanup behavior*
 * After a successful move, I assume the source replica stays on disk until 
{{replica.deletion.delay}} expires (default 5 minutes), even though 
ContainerSet already points to the destination. Is that temporary double disk 
usage expected?
 * *If {{markContainerForDelete()}} fails after import and ContainerSet update, 
I assume the move is still counted as success and the old replica is queued for 
lazy deletion (since {{moveSucceeded}} is set before mark). Is that the 
intended behavior?*
 * I see DiskBalancer lazy deletion calls {{removeContainer()}} directly and 
does not check DELETED state on disk. If {{markContainerForDelete()}} fails to 
persist DELETED to the {{.container}} file, is the old source replica still 
removed after the delay based on container path alone?

*Failure handling*
 * {{markContainerForDelete()}} catches {{IOException}} and does not rethrow. I 
assume DiskBalancer often cannot detect when mark actually failed. Is that a 
known limitation?
 * In {{{}tryCleanupOnePendingDeletionContainer(){}}}, the queue entry is 
removed before delete runs. If {{deleteContainer()}} fails, I assume there is 
no retry and the source copy stays on disk. Is there any other in-process 
recovery path?
 * The log says failed deletes will be handled by background scanners. Since 
the old source replica is not in ContainerSet after a move, I assume scanners 
may not clean it up. Is that log accurate for DiskBalancer leftovers?

*Restart and recovery*
 * If lazy deletion fails, I assume the orphaned source copy may remain until 
DN restart, and for Ratis containers {{ContainerReader.resolveDuplicate()}} 
removes one duplicate. Is restart the expected operator recovery in that case?
 * For EC containers, {{resolveDuplicate()}} leaves both copies on disk and 
HDDS-12722 is not implemented yet. *If lazy deletion fails on an EC move, can 
duplicates persist even after restart?*
 * *{{pendingDeletionContainers}} is in-memory. If the DN restarts before the 
delay expires, I assume the queue is lost and cleanup depends on 
ContainerReader at startup. Is that correct?*

> [DiskBalancer] markContainerForDelete failure is treated as a successful move
> -----------------------------------------------------------------------------
>
>                 Key: HDDS-15651
>                 URL: https://issues.apache.org/jira/browse/HDDS-15651
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode
>            Reporter: Arun Sarin
>            Assignee: Gargi Jaiswal
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: repro_HDDS_markContainerForDelete_BEFORE_fix.log
>
>
> During a DiskBalancer container move, the datanode copies the container to the
> destination volume, updates ContainerSet to point to the new replica, and then
> calls markContainerForDelete() on the old source replica.
>  
> If markContainerForDelete() fails, DiskBalancer still treats the move as
> successful. Success metrics are updated and the old replica may be queued for
> delayed deletion, even though the source replica was never properly marked
> DELETED. This can leave duplicate replicas on disk and make disk usage and
> balancer status misleading.
>  
> While reviewing DiskBalancerService.DiskBalancerTask.call(), I noticed that
> moveSucceeded is set to true before markContainerForDelete() is called. If 
> mark
> fails, the error is only logged and the move is still counted as success.
>  
> I added a unit test to reproduce this:
> TestDiskBalancerTask#moveSucceedsDespiteMarkContainerForDeleteFailure
>  
> The test simulates a markContainerForDelete() failure and checks that the move
> should be reported as failed, with no duplicate replica left active on the
> destination.
>  
> *Steps to reproduce*
> 1. Run the unit test:
>  
> mvn test -pl hadoop-hdds/container-service -am \
> -Dtest=TestDiskBalancerTask#moveSucceedsDespiteMarkContainerForDeleteFailure \
> -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
>  
> 2. On current master (before fix), the test fails with 8 failures (one per
> container schema variant). Example:
>  
> successCount should be 0 when markContainerForDelete fails
> expected: 0 but was: 1
>  
> *Expected behavior*
> - The move should be counted as a failure (failureCount increases, 
> successCount
> stays 0).
> - ContainerSet should keep the source replica as the active one.
> - Source and destination volume used space should not reflect a completed 
> move.
> - Any partially created destination replica should be cleaned up.
>  
> *Actual behavior*
> - successCount is incremented and successBytes is updated.
> - Log message: "Failed to mark the old container <id> for delete. It will be
> handled after DN restart."
> - ContainerSet points to the new replica on the destination volume.
> - The old source replica directory still exists on disk.
> - The old replica is still queued for delayed deletion.
>  
> *Impact*
> - Operators see a successful move in DiskBalancer metrics when it did not
> fully complete.
> - Duplicate replicas can consume extra disk space on the datanode.
> - Source volume may stay over-utilized and balancing may not progress as
> expected until datanode restart.
>  
> *Suggested fix*
> Only mark the move as successful after markContainerForDelete() succeeds. If
> mark fails, roll back the move: restore ContainerSet to the source replica,
> revert destination volume used space, and delete the destination replica
> directory. Do not update success metrics or queue the old replica for 
> deletion.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HDDS-15651) [DiskBalancer] markContainerForDelete failure is treated as a successful move

Reply via email to