hudi-bot opened a new issue, #16464:
URL: https://github.com/apache/hudi/issues/16464

   When archival runs, it may consider an instant a candidate for archival 
even if the file groups that the instant replaced/updated still need to undergo a 
`clean`. For example, consider the following scenario, with clean and archival 
scheduled/executed independently in different jobs:
    # Insert at C1 creates file group f1 in the partition
    # Replacecommit at RC2 creates file group f2 in the partition and replaces f1
    # Any reader of the partition that calls the Hudi API (with or without using MDT) 
will recognize that f1 should be ignored, as it has been replaced. This is 
because the RC2 instant file is in the active timeline
    # Some more instants are added to the timeline. RC2 is now eligible to be 
cleaned (as per the table writers' clean policy). Assume, though, that the file 
groups replaced by RC2 haven't been deleted yet, for example because clean 
keeps failing, async clean has not been scheduled yet, or the clean failed 
to delete said file groups.
    # An archive job is eventually triggered and archives C1 and RC2. Note 
that f1 is still in the partition
   
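   Now the table has the same consistency issue as seen in 
https://issues.apache.org/jira/browse/HUDI-7655 , where replaced file groups 
are still in the partition and readers may see inconsistent data. With RC2 no 
longer in the active timeline, a reader has nothing telling it that f1 was 
replaced.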
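
   The scenario above can be sketched as a toy model. This is only an illustration, not Hudi code: the class `ReplacedFileGroupVisibility`, the method `visibleFileGroups`, and the map of "active instant to replaced file groups" are all hypothetical stand-ins, and the model assumes a reader hides a file group only when an instant in the active timeline says it was replaced.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only; none of these names are Hudi classes or APIs.
public class ReplacedFileGroupVisibility {

    /**
     * Returns the file groups a reader would treat as live in one partition.
     * Assumption: a file group is hidden only if some instant that is still in
     * the active timeline records it as replaced.
     */
    static List<String> visibleFileGroups(List<String> fileGroupsOnStorage,
                                          Map<String, Set<String>> replacedByActiveInstant) {
        Set<String> replaced = new HashSet<>();
        for (Set<String> groups : replacedByActiveInstant.values()) {
            replaced.addAll(groups);
        }
        List<String> visible = new ArrayList<>();
        for (String fileGroup : fileGroupsOnStorage) {
            if (!replaced.contains(fileGroup)) {
                visible.add(fileGroup);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        // f1 was never cleaned, so both file groups are still on storage.
        List<String> onStorage = List.of("f1", "f2");

        Map<String, Set<String>> activeTimeline = new LinkedHashMap<>();
        activeTimeline.put("C1", Set.of());
        activeTimeline.put("RC2", Set.of("f1"));

        // RC2 is active: the reader knows f1 was replaced.
        System.out.println(visibleFileGroups(onStorage, activeTimeline)); // [f2]

        // Archival removes C1 and RC2 from the active timeline before clean ran.
        activeTimeline.remove("C1");
        activeTimeline.remove("RC2");
        System.out.println(visibleFileGroups(onStorage, activeTimeline)); // [f1, f2]
    }
}
```

   In this model the second call returns both f1 and f2, which corresponds to the reader seeing the replaced file group again.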
   
    
   
   This situation can be avoided by ensuring that archival will "block" and not 
go past an older instant if it sees that said instant hasn't undergone a 
clean yet. 
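
   A minimal sketch of that "blocking" rule follows. It is not the actual Hudi implementation: `ArchivalBoundarySketch`, `InstantInfo`, and `archivableInstants` are hypothetical names, C3 stands in for the "some more instants" in the scenario, and "has not undergone a clean" is approximated here as "at least one file group the instant replaced is still on storage".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; these are not Hudi classes or APIs.
public class ArchivalBoundarySketch {

    /** Hypothetical stand-in for a completed instant on the timeline. */
    static final class InstantInfo {
        final String time;
        final Set<String> replacedFileGroups;

        InstantInfo(String time, Set<String> replacedFileGroups) {
            this.time = time;
            this.replacedFileGroups = replacedFileGroups;
        }
    }

    /**
     * Walks the archival candidates from oldest to newest and stops at the
     * first instant whose replaced file groups are still on storage, so that
     * archival never goes past an instant that still needs a clean.
     */
    static List<String> archivableInstants(List<InstantInfo> oldestFirst,
                                           Set<String> fileGroupsOnStorage) {
        List<String> safeToArchive = new ArrayList<>();
        for (InstantInfo instant : oldestFirst) {
            boolean cleanPending = instant.replacedFileGroups.stream()
                    .anyMatch(fileGroupsOnStorage::contains);
            if (cleanPending) {
                break; // block: keep this instant and everything newer active
            }
            safeToArchive.add(instant.time);
        }
        return safeToArchive;
    }

    public static void main(String[] args) {
        List<InstantInfo> candidates = List.of(
                new InstantInfo("C1", Set.of()),
                new InstantInfo("RC2", Set.of("f1")),
                new InstantInfo("C3", Set.of()));

        // f1 has not been cleaned yet: archival stops before RC2.
        System.out.println(archivableInstants(candidates, Set.of("f1", "f2"))); // [C1]

        // After clean deletes f1, all candidates can be archived.
        System.out.println(archivableInstants(candidates, Set.of("f2"))); // [C1, RC2, C3]
    }
}
```

   Stopping at the first blocked instant, rather than skipping it, keeps the archived range contiguous. Whether the real fix should check storage directly or rely on clean metadata is left open here.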
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-7687
   - Type: Improvement
   - Fix version(s):
     - 1.1.0

