Krishen Bhan created HUDI-7655:
----------------------------------

             Summary: Support configuration for clean to fail execution if 
at least one file is marked as a failed delete
                 Key: HUDI-7655
                 URL: https://issues.apache.org/jira/browse/HUDI-7655
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Krishen Bhan


When a HUDI clean plan is executed, any targeted file that was not confirmed 
as deleted (or already non-existent) will be marked as a "failed delete". 
Although these failed deletes will be added to `.clean` metadata, if 
incremental clean is used then these files may never be picked up again by a 
future clean plan, unless a "full-scan" clean ends up being scheduled. Besides 
leaving files taking up storage space for longer than necessary, this can 
lead to the following dataset consistency issue for COW datasets:
 # Insert at C1 creates file group f1 in partition
 # Replacecommit at RC2 creates file group f2 in partition, and replaces f1
 # Any reader of the partition that calls the HUDI API (with or without using 
MDT) will recognize that f1 should be ignored, as it has been replaced. This 
is because the RC2 instant file is in the active timeline
 # Some completed instants later, an incremental clean is scheduled. It moves 
the "earliest commit to retain" to a time after instant time RC2, so it 
targets f1 for deletion. But during execution of the plan, it fails to delete 
f1.
 # An archive job is eventually triggered, and archives C1. Note that f1 is 
still in the partition

At this point, any job/query that reads the aforementioned partition directly 
through DFS file-listing calls (without using the MDT FILES partition) will 
consider both f1 and f2 valid file groups, since RC2 is no longer in the 
active timeline. This is a data consistency issue, and it will only be 
resolved if a "full-scan" clean is triggered and deletes f1.

This specific scenario can be avoided if the user can configure HUDI clean to 
fail execution of a clean plan unless all files are confirmed as deleted (or 
already absent from DFS), "blocking" the clean. The next clean attempt will 
re-execute this existing plan, since clean plans cannot be "rolled back".
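
A minimal sketch of what the proposed guard could look like. The config key 
below is hypothetical (named only for illustration, not an existing Hudi 
config); HoodieCleanStat#getFailedDeleteFiles is an existing accessor:

```java
// Sketch of the proposed post-execution check: if the (hypothetical) flag is
// enabled and any file could not be confirmed as deleted, fail the clean.
import java.util.List;

import org.apache.hudi.common.HoodieCleanStat;
import org.apache.hudi.exception.HoodieException;

public class CleanFailureGuard {

  // Hypothetical config key, named here only for illustration.
  public static final String FAIL_ON_FAILED_DELETES_KEY =
      "hoodie.clean.fail.on.failed.deletes";

  // Called after the clean plan has executed, with the per-partition stats.
  public static void validate(List<HoodieCleanStat> cleanStats,
                              boolean failOnFailedDeletes) {
    if (!failOnFailedDeletes) {
      return; // current behavior: failed deletes are recorded, not fatal
    }
    long failedDeletes = cleanStats.stream()
        .mapToLong(stat -> stat.getFailedDeleteFiles().size())
        .sum();
    if (failedDeletes > 0) {
      // Failing here "blocks" the clean: the plan stays pending, and the next
      // clean attempt re-executes it, since clean plans cannot be rolled back.
      throw new HoodieException("Clean execution failed: " + failedDeletes
          + " file(s) could not be confirmed as deleted");
    }
  }
}
```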


