[ 
https://issues.apache.org/jira/browse/HDDS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18091142#comment-18091142
 ] 

Gargi Jaiswal commented on HDDS-15651:
--------------------------------------

*Why treating the move as successful can be correct behavior?*

The DiskBalancer move has two phases:
 # *Move phase —* copy container to destination, import it, update 
{{ContainerSet}} to the new replica.
 # *Cleanup phase —* mark/delete the old source replica (with lazy deletion for 
in-flight reads).

If phase 1 completes, the operational goal of the move is achieved: the active 
replica is on the destination volume and {{ContainerSet}} points there. New 
reads/writes use the new path. The source cleanup failure is a cleanup problem, 
not a move failure.

That matches the current design:
 * {{moveSucceeded = true}} is set after import + {{ContainerSet}} update, 
before source cleanup.
 * The code explicitly says mark failure should not fail the whole process, 
because rollback would discard a valid destination copy and undo work that 
already succeeded.
 * Lazy deletion exists for a reason ({{{}replica.deletion.delay{}}}, default 5 
min): source is not removed immediately so in-flight reads on the old path can 
finish. So “move succeeded” does not mean “source is gone immediately.”
 * Even if {{markContainerForDelete()}} fails, the old replica is still queued 
in {{{}pendingDeletionContainers{}}}, and {{deleteContainer()}} → 
{{removeContainer()}} does not require {{DELETED}} state — it removes the 
directory directly. So cleanup can still happen asynchronously.
 * For Ratis, HDDS-9322 provides another safety net on DN restart if lazy 
deletion also fails.

So reporting success when import + {{ContainerSet}} update succeed is 
reasonable: the container was moved; only source cleanup is pending or 
best-effort.

> [DiskBalancer] markContainerForDelete failure is treated as a successful move
> -----------------------------------------------------------------------------
>
>                 Key: HDDS-15651
>                 URL: https://issues.apache.org/jira/browse/HDDS-15651
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode
>            Reporter: Arun Sarin
>            Assignee: Gargi Jaiswal
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: repro_HDDS_markContainerForDelete_BEFORE_fix.log
>
>
> During a DiskBalancer container move, the datanode copies the container to the
> destination volume, updates ContainerSet to point to the new replica, and then
> calls markContainerForDelete() on the old source replica.
>  
> If markContainerForDelete() fails, DiskBalancer still treats the move as
> successful. Success metrics are updated and the old replica may be queued for
> delayed deletion, even though the source replica was never properly marked
> DELETED. This can leave duplicate replicas on disk and make disk usage and
> balancer status misleading.
>  
> While reviewing DiskBalancerService.DiskBalancerTask.call(), I noticed that
> moveSucceeded is set to true before markContainerForDelete() is called. If 
> mark
> fails, the error is only logged and the move is still counted as success.
>  
> I added a unit test to reproduce this:
> TestDiskBalancerTask#moveSucceedsDespiteMarkContainerForDeleteFailure
>  
> The test simulates a markContainerForDelete() failure and checks that the move
> should be reported as failed, with no duplicate replica left active on the
> destination.
>  
> *Steps to reproduce*
> 1. Run the unit test:
>  
> mvn test -pl hadoop-hdds/container-service -am \
> -Dtest=TestDiskBalancerTask#moveSucceedsDespiteMarkContainerForDeleteFailure \
> -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
>  
> 2. On current master (before fix), the test fails with 8 failures (one per
> container schema variant). Example:
>  
> successCount should be 0 when markContainerForDelete fails
> expected: 0 but was: 1
>  
> *Expected behavior*
> - The move should be counted as a failure (failureCount increases, 
> successCount
> stays 0).
> - ContainerSet should keep the source replica as the active one.
> - Source and destination volume used space should not reflect a completed 
> move.
> - Any partially created destination replica should be cleaned up.
>  
> *Actual behavior*
> - successCount is incremented and successBytes is updated.
> - Log message: "Failed to mark the old container <id> for delete. It will be
> handled after DN restart."
> - ContainerSet points to the new replica on the destination volume.
> - The old source replica directory still exists on disk.
> - The old replica is still queued for delayed deletion.
>  
> *Impact*
> - Operators see a successful move in DiskBalancer metrics when it did not
> fully complete.
> - Duplicate replicas can consume extra disk space on the datanode.
> - Source volume may stay over-utilized and balancing may not progress as
> expected until datanode restart.
>  
> *Suggested fix*
> Only mark the move as successful after markContainerForDelete() succeeds. If
> mark fails, roll back the move: restore ContainerSet to the source replica,
> revert destination volume used space, and delete the destination replica
> directory. Do not update success metrics or queue the old replica for 
> deletion.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to