hudi-bot opened a new issue, #15072:
URL: https://github.com/apache/hudi/issues/15072
End-to-end streaming processing is becoming more and more popular among
Flink users, and the most typical checkpoint interval for streaming
ingestion is within minutes (1 min, 5 mins, ...). Say the user sets the
interval to 1 minute; then there are about 60 write commits on the
timeline per hour:
{t1, t2, t3, t4 ...t60}
Now consider the very popular streaming read scenario: people want to
keep the history data for a medium retention time (usually 1 day, or even
1 week). Say the user configures the cleaning retained-commits number as:

1 (day) × 24 (hours) × 60 (commits per hour) = **1440 commits**
Given the current restriction on the cleaning retained commits:

`num_retain_commits < min_archive_commits_num`
we must keep at least 1440 commits on the active timeline. Since each
commit instant leaves three meta files on the timeline (`.requested`,
`.inflight`, and the completed commit file), that means at least

1440 × 3 = **4320**

files on the timeline, which puts pressure on file IO and metadata
scanning (the metadata client). On the other hand, if we do not configure
a long enough retention, the writer may remove old files and the reader
may encounter `FileNotFoundException`.
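The arithmetic above can be sketched as a quick check (a minimal sketch; the three-files-per-commit factor assumes one `.requested`, one `.inflight`, and one completed meta file per instant):

```python
# Quick arithmetic for the scenario above: 1-minute checkpoint interval,
# 1-day retention, 3 meta files per commit on the active timeline.
def retained_commits(retention_days: int, commits_per_hour: int) -> int:
    """Commits that cleaning must retain for the given retention window."""
    return retention_days * 24 * commits_per_hour

def timeline_files(commits: int, files_per_commit: int = 3) -> int:
    """Meta files kept on the active timeline for that many commits."""
    return commits * files_per_commit

commits = retained_commits(retention_days=1, commits_per_hour=60)
print(commits)                  # 1440 commits must stay active
print(timeline_files(commits))  # 4320 meta files on the timeline
```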
So we should find a way to lift the restriction that the number of
commits on the active timeline must be greater than the number of
cleaning retained commits.
One way I can think of is to remember the last completed cleaning
instant and only check that when cleaning (suitable for the hours-based
cleaning strategy). With the `num_commits` cleaning strategy we may need
to scan the archived timeline (or the metadata table, if it is enabled?)
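To make the restriction concrete, here is a minimal sketch (illustrative names, not Hudi's actual code) of the constraint between the cleaner retention and the archival thresholds, which is what forces the active timeline to stay large:

```python
# Illustrative sketch of the current restriction (names are hypothetical,
# not Hudi's actual API): the archival thresholds must exceed the number
# of commits the cleaner retains, so a large retention forces a large
# active timeline.
def validate_retention(num_retain_commits: int,
                       min_archive_commits: int,
                       max_archive_commits: int) -> None:
    """Raise if archival could trim commits the cleaner still needs."""
    if not (num_retain_commits < min_archive_commits <= max_archive_commits):
        raise ValueError(
            f"retained commits ({num_retain_commits}) must be < "
            f"min archive commits ({min_archive_commits}) <= "
            f"max archive commits ({max_archive_commits})")

# With 1440 retained commits, archival may only trim the timeline down
# to something above 1440 active commits:
validate_retention(1440, 1450, 1500)  # passes, but the timeline stays huge
```

Lifting the restriction would mean cleaning could consult commits beyond the active timeline (e.g. the last completed clean instant, or the archived timeline) instead of forcing them all to stay active.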
In any case, a solution is eagerly needed now!
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-3657
- Type: Improvement
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]