[ https://issues.apache.org/jira/browse/HUDI-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7655:
---------------------------------
    Labels: clean pull-request-available  (was: clean)

> Support configuration for clean to fail execution if there is at least one
> file is marked as a failed delete
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-7655
>                 URL: https://issues.apache.org/jira/browse/HUDI-7655
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Krishen Bhan
>            Assignee: sivabalan narayanan
>            Priority: Minor
>              Labels: clean, pull-request-available
>
> When a HUDI clean plan is executed, any targeted file that was not confirmed
> as deleted (or non-existing) will be marked as a "failed delete". Although
> these failed deletes will be added to `.clean` metadata, if incremental clean
> is used then these files might never be picked up again by a future clean
> plan, unless a "full-scan" clean ends up being scheduled. In addition to
> leaving more files unnecessarily taking up storage space for longer, this
> can lead to the following dataset consistency issue for COW datasets:
> # Insert at C1 creates file group f1 in partition
> # Replacecommit at RC2 creates file group f2 in partition, and replaces f1
> # Any reader of partition that calls HUDI API (with or without using MDT)
> will recognize that f1 should be ignored, as it has been replaced. This is
> because the RC2 instant file is in the active timeline
> # Some completed instants later, an incremental clean is scheduled. It moves
> the "earliest commit to retain" to a time after instant time RC2, so it
> targets f1 for deletion. But during execution of the plan, it fails to delete
> f1.
> # An archive job is eventually triggered, and archives C1 and RC2. Note that
> f1 is still in partition
> At this point, any job/query that reads the aforementioned partition directly
> through DFS file system calls (without directly using MDT FILES partition)
> will consider both f1 and f2 as valid file groups, since RC2 is no longer in
> the active timeline. This is a data consistency issue, and will only be
> resolved if a "full-scan" clean is triggered and deletes f1.
> This specific scenario can be avoided if the user can configure HUDI clean to
> fail execution of a clean plan unless all files are confirmed as deleted (or
> not existing in DFS already), "blocking" the clean. The next clean attempt
> will re-execute this existing plan, since clean plans cannot be "rolled
> back".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
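
For illustration only, the "blocking" behaviour proposed in the description could be sketched roughly as follows. This is not Hudi code: `execute_clean_plan`, `CleanResult`, `CleanFailedError`, and the `fail_on_failed_deletes` flag are hypothetical names invented for this sketch, and the real configuration key is whatever the linked pull request defines.

```python
from dataclasses import dataclass, field


class CleanFailedError(RuntimeError):
    """Hypothetical error: the clean plan left files that were not deleted."""


@dataclass
class CleanResult:
    deleted: list = field(default_factory=list)  # confirmed deleted or already absent
    failed: list = field(default_factory=list)   # "failed deletes"


def execute_clean_plan(targets, delete_file, fail_on_failed_deletes=False):
    """Attempt to delete every file targeted by a clean plan.

    `delete_file(path)` is a caller-supplied callable (an assumption of this
    sketch) returning True when the file is confirmed deleted or non-existing,
    and False otherwise.
    """
    result = CleanResult()
    for path in targets:
        if delete_file(path):
            result.deleted.append(path)
        else:
            result.failed.append(path)

    if fail_on_failed_deletes and result.failed:
        # Block the clean instead of completing it with failed deletes, so
        # that the next clean attempt re-executes the same plan.
        raise CleanFailedError(
            f"{len(result.failed)} file(s) not deleted: {result.failed}"
        )
    return result
```

With the flag off, a call such as `execute_clean_plan(["f1", "old"], lambda p: p != "f1")` returns a result whose `failed` list contains `f1`, which mirrors the current behaviour of recording failed deletes and moving on. With the flag on, the same call raises instead, so f1 remains targeted by a pending plan.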