Prashant Wason created HUDI-2925:
------------------------------------

             Summary: Cleaner may attempt to delete the same file twice when 
metadata table is enabled
                 Key: HUDI-2925
                 URL: https://issues.apache.org/jira/browse/HUDI-2925
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Prashant Wason
            Assignee: Prashant Wason
             Fix For: 0.10.0


This issue happens only when the TimelineServer is disabled (reason in the next 
comment). Our pipelines execute a write (insert or upsert) along with an 
asynchronous clean. The metadata table is enabled.

 

Assume the timelines are as follows:

Dataset:   100.commit        101.commit   102.clean.inflight
Metadata: 100.deltacommit

(This happened because the pipeline failed due to non-Hudi issues while 
executing 101 and 102.)

 

In the next run of the pipeline, more data is available, so a commit will 
take place (103.commit.requested). Along with it, an asynchronous clean starts 
(104.clean.requested). The 
[BaseCleanActionExecutor detects the previously unfinished clean|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java#L231]
 (102.clean.inflight) and attempts to complete it first. So the order of cleans 
will be 102.clean followed by 104.clean.

 

102.clean => Suppose this deletes files from 90.commit

104.clean  => This should delete files from 91.commit

 

The issue is that while executing 104.clean, the file system view is still the 
one that was used during 102.clean (i.e. the file system view is not synced 
after the clean completes). When the metadata table is enabled, 
HoodieMetadataFileSystemView is used, which holds a metadata reader. This 
metadata reader opens the metadata table as of a particular instant (101.commit 
here, as that was the last completed action). Even after 102.clean completes, 
the HoodieMetadataFileSystemView keeps using the cached metadata reader. Hence, 
the reader still returns the files from 90.commit, which have already been 
deleted by 102.clean, and 104.clean attempts to delete them a second time.
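
The stale-view behaviour described above can be illustrated with a small standalone simulation. This is only a sketch and does not use any Hudi classes: Storage, CachedView, clean and simulate are hypothetical names, the file names are made up, and the "delete the oldest commit visible in the view" rule is an assumed simplification of the cleaner policy.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StaleViewDemo {

    /** Hypothetical stand-in for the file system: the files that really exist. */
    static final class Storage {
        private final Set<String> files = new HashSet<>();

        void add(String file) { files.add(file); }

        /** Returns false when the file is already gone. */
        boolean delete(String file) { return files.remove(file); }
    }

    /** Hypothetical stand-in for a view whose listing comes from a cached reader. */
    static final class CachedView {
        private final Storage storage;
        private List<String> snapshot;

        CachedView(Storage storage) {
            this.storage = storage;
            sync();
        }

        /** Re-reads the listing; skipping this call keeps the old snapshot. */
        void sync() {
            snapshot = new ArrayList<>(storage.files);
            Collections.sort(snapshot);
        }

        List<String> listFiles() { return snapshot; }
    }

    /** Commit number encoded in the (made-up) file name "<commit>/<name>". */
    static int commitOf(String file) {
        return Integer.parseInt(file.substring(0, file.indexOf('/')));
    }

    /**
     * Assumed simplified policy: delete every file of the oldest commit that the
     * view reports. Files that were already deleted are added to alreadyGone.
     */
    static void clean(CachedView view, Storage storage, List<String> alreadyGone) {
        int oldest = Integer.MAX_VALUE;
        for (String file : view.listFiles()) {
            oldest = Math.min(oldest, commitOf(file));
        }
        for (String file : view.listFiles()) {
            if (commitOf(file) == oldest && !storage.delete(file)) {
                alreadyGone.add(file);
            }
        }
    }

    /** Runs two cleans back to back and returns the repeated delete attempts. */
    static List<String> simulate(boolean syncBetweenCleans) {
        Storage storage = new Storage();
        storage.add("90/a.parquet");
        storage.add("90/b.parquet");
        storage.add("91/a.parquet");
        storage.add("92/a.parquet");

        CachedView view = new CachedView(storage);
        List<String> alreadyGone = new ArrayList<>();

        clean(view, storage, alreadyGone);   // plays the role of 102.clean
        if (syncBetweenCleans) {
            view.sync();                     // refresh the view after the clean
        }
        clean(view, storage, alreadyGone);   // plays the role of 104.clean
        return alreadyGone;
    }

    public static void main(String[] args) {
        System.out.println("without sync: " + simulate(false));
        System.out.println("with sync:    " + simulate(true));
    }
}
```

Without the sync, the second clean sees the same listing as the first and retries the two files from commit 90. With the sync, the second clean moves on to commit 91 and no delete is repeated.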

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
