[
https://issues.apache.org/jira/browse/HDDS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18091266#comment-18091266
]
Arun Sarin commented on HDDS-15651:
-----------------------------------
Hi [~gargijaiswal] [~sammichen] ,
I have a few questions on the DiskBalancer cleanup flow. Sharing my assumptions
below, please let me know if these are correct.
*Move and cleanup behavior*
* After a successful move, I assume the source replica stays on disk until
{{replica.deletion.delay}} expires (default 5 minutes), even though
ContainerSet already points to the destination. Is that temporary double disk
usage expected?
* *If {{markContainerForDelete()}} fails after import and ContainerSet update,
I assume the move is still counted as success and the old replica is queued for
lazy deletion (since {{moveSucceeded}} is set before mark). Is that the
intended behavior?*
* I see DiskBalancer lazy deletion calls {{removeContainer()}} directly and
does not check DELETED state on disk. If {{markContainerForDelete()}} fails to
persist DELETED to the {{.container}} file, is the old source replica still
removed after the delay based on container path alone?
*Failure handling*
* {{markContainerForDelete()}} catches {{IOException}} and does not rethrow. I
assume DiskBalancer often cannot detect when mark actually failed. Is that a
known limitation?
* In {{{}tryCleanupOnePendingDeletionContainer(){}}}, the queue entry is
removed before delete runs. If {{deleteContainer()}} fails, I assume there is
no retry and the source copy stays on disk. Is there any other in-process
recovery path?
* The log says failed deletes will be handled by background scanners. Since
the old source replica is not in ContainerSet after a move, I assume scanners
may not clean it up. Is that log accurate for DiskBalancer leftovers?
*Restart and recovery*
* If lazy deletion fails, I assume the orphaned source copy may remain until
DN restart, and for Ratis containers {{ContainerReader.resolveDuplicate()}}
removes one duplicate. Is restart the expected operator recovery in that case?
* For EC containers, {{resolveDuplicate()}} leaves both copies on disk and
HDDS-12722 is not implemented yet. *If lazy deletion fails on an EC move, can
duplicates persist even after restart?*
* *{{pendingDeletionContainers}} is in-memory. If the DN restarts before the
delay expires, I assume the queue is lost and cleanup depends on
ContainerReader at startup. Is that correct?*
> [DiskBalancer] markContainerForDelete failure is treated as a successful move
> -----------------------------------------------------------------------------
>
> Key: HDDS-15651
> URL: https://issues.apache.org/jira/browse/HDDS-15651
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode
> Reporter: Arun Sarin
> Assignee: Gargi Jaiswal
> Priority: Major
> Labels: pull-request-available
> Attachments: repro_HDDS_markContainerForDelete_BEFORE_fix.log
>
>
> During a DiskBalancer container move, the datanode copies the container to the
> destination volume, updates ContainerSet to point to the new replica, and then
> calls markContainerForDelete() on the old source replica.
>
> If markContainerForDelete() fails, DiskBalancer still treats the move as
> successful. Success metrics are updated and the old replica may be queued for
> delayed deletion, even though the source replica was never properly marked
> DELETED. This can leave duplicate replicas on disk and make disk usage and
> balancer status misleading.
>
> While reviewing DiskBalancerService.DiskBalancerTask.call(), I noticed that
> moveSucceeded is set to true before markContainerForDelete() is called. If
> mark
> fails, the error is only logged and the move is still counted as success.
>
> I added a unit test to reproduce this:
> TestDiskBalancerTask#moveSucceedsDespiteMarkContainerForDeleteFailure
>
> The test simulates a markContainerForDelete() failure and checks that the move
> should be reported as failed, with no duplicate replica left active on the
> destination.
>
> *Steps to reproduce*
> 1. Run the unit test:
>
> mvn test -pl hadoop-hdds/container-service -am \
> -Dtest=TestDiskBalancerTask#moveSucceedsDespiteMarkContainerForDeleteFailure \
> -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
>
> 2. On current master (before fix), the test fails with 8 failures (one per
> container schema variant). Example:
>
> successCount should be 0 when markContainerForDelete fails
> expected: 0 but was: 1
>
> *Expected behavior*
> - The move should be counted as a failure (failureCount increases,
> successCount
> stays 0).
> - ContainerSet should keep the source replica as the active one.
> - Source and destination volume used space should not reflect a completed
> move.
> - Any partially created destination replica should be cleaned up.
>
> *Actual behavior*
> - successCount is incremented and successBytes is updated.
> - Log message: "Failed to mark the old container <id> for delete. It will be
> handled after DN restart."
> - ContainerSet points to the new replica on the destination volume.
> - The old source replica directory still exists on disk.
> - The old replica is still queued for delayed deletion.
>
> *Impact*
> - Operators see a successful move in DiskBalancer metrics when it did not
> fully complete.
> - Duplicate replicas can consume extra disk space on the datanode.
> - Source volume may stay over-utilized and balancing may not progress as
> expected until datanode restart.
>
> *Suggested fix*
> Only mark the move as successful after markContainerForDelete() succeeds. If
> mark fails, roll back the move: restore ContainerSet to the source replica,
> revert destination volume used space, and delete the destination replica
> directory. Do not update success metrics or queue the old replica for
> deletion.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]