hudi-bot opened a new issue, #16464:
URL: https://github.com/apache/hudi/issues/16464
When archival runs, it may treat an instant as a candidate for archival
even if the file groups that instant replaced/updated still need to undergo a
`clean`. For example, consider the following scenario, with clean and archival
scheduled/executed independently in different jobs:
# Insert at C1 creates file group f1 in partition
# Replacecommit at RC2 creates file group f2 in partition, and replaces f1
# Any reader of the partition that calls the HUDI API (with or without using MDT)
will recognize that f1 should be ignored, as it has been replaced. This is
because the RC2 instant file is in the active timeline
# Some more instants are added to the timeline. RC2 is now eligible to be
cleaned (as per the table writers' clean policy). Assume, though, that the file
groups replaced by RC2 haven't been deleted yet, for example due to clean
repeatedly failing, async clean not being scheduled yet, or the clean failing
to delete said file groups.
# An archive job is eventually triggered, and archives C1 and RC2. Note
that f1 is still in the partition
Now the table has the same consistency issue as seen in
https://issues.apache.org/jira/browse/HUDI-7655 , where replaced file groups
are still in partition and readers may see inconsistent data.
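
The scenario above can be reproduced with a small toy model. This is only a sketch: `ToyTable`, `visible_file_groups`, and `archive` are hypothetical names made up for illustration, not Hudi APIs, and the model assumes a reader hides a file group only while the instant that replaced it is still in the active timeline.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy stand-in for a table; none of these names are Hudi APIs."""
    # File groups physically present in the partition.
    files_on_storage: set = field(default_factory=set)
    # Active timeline: instant -> file groups that instant replaced.
    active_timeline: dict = field(default_factory=dict)

    def visible_file_groups(self):
        # A reader ignores any file group that an *active* instant replaced.
        replaced = set()
        for replaced_by_instant in self.active_timeline.values():
            replaced |= replaced_by_instant
        return self.files_on_storage - replaced

    def archive(self, instants):
        # Archival removes instants from the active timeline only;
        # it never deletes data files (that is the cleaner's job).
        for instant in instants:
            self.active_timeline.pop(instant, None)


table = ToyTable()

# C1 inserts f1; RC2 creates f2 and replaces f1.
table.files_on_storage.add("f1")
table.active_timeline["C1"] = set()
table.files_on_storage.add("f2")
table.active_timeline["RC2"] = {"f1"}

before = table.visible_file_groups()  # RC2 is active, so f1 is hidden

# Clean never deleted f1, but archival removes C1 and RC2 anyway.
table.archive(["C1", "RC2"])

after = table.visible_file_groups()   # nothing marks f1 as replaced anymore
```

In this model, `before` contains only `f2`, while `after` contains both `f1` and `f2`, which is the inconsistency described above.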
This situation can be avoided by ensuring that archival will "block" and not
go past an older instant time if it sees that said instant hasn't undergone a
clean yet.
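
A minimal sketch of such a "blocking" check is shown below. It is not the Hudi implementation: `archivable_instants` and `needs_clean` are hypothetical names, and the sketch assumes the caller can already tell which instants still have replaced/updated file groups waiting for a clean.

```python
def archivable_instants(candidates, needs_clean):
    """Return the prefix of `candidates` that is safe to archive.

    candidates:  instant times, ordered oldest first (hypothetical input).
    needs_clean: instants whose replaced/updated file groups have not
                 been removed by a clean yet (hypothetical input).
    """
    allowed = []
    for instant in candidates:
        if instant in needs_clean:
            # Block here: do not archive this instant or anything newer.
            break
        allowed.append(instant)
    return allowed


# RC2 replaced f1, but f1 has not been cleaned yet, so archival stops at RC2.
blocked = archivable_instants(["C1", "RC2", "C3"], needs_clean={"RC2"})

# Once the clean has removed f1, nothing blocks archival.
unblocked = archivable_instants(["C1", "RC2", "C3"], needs_clean=set())
```

With this guard, RC2 stays in the active timeline until the clean succeeds, so readers keep ignoring f1.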
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-7687
- Type: Improvement
- Fix version(s): 1.1.0
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.