hudi-bot opened a new issue, #15072:
URL: https://github.com/apache/hudi/issues/15072

   End-to-end streaming processing is increasingly popular among Flink users, 
and the typical checkpoint interval for streaming ingestion is a few minutes 
(1 min, 5 mins, ...). Say the user sets the interval to 1 minute; there are 
then about 60 write commits on the timeline per hour:
   
   {t1, t2, t3, t4, ..., t60}
   
   Now consider the very common streaming read scenario: people want to keep 
the history data for a medium retention period (usually 1 day or even 1 week). 
Say the user configures the number of commits retained by cleaning as:
   
   1 (day) * 24 (hours) * 60 (commits per hour) = 1440 commits
   
   However, cleaning is currently subject to this restriction on retained 
commits:
   
   _num_retain_commits < min_archive_commits_num_
   
   We must therefore keep at least 1440 commits on the active timeline, which 
means we have at least:
   
   _1440 * 3 = 4320_
   
files on the timeline (three per commit, presumably the requested, inflight 
and completed files). This puts pressure on file IO and on metadata scanning 
(the metadata client). If we do not configure a long enough retention, the 
writer may remove old files and the reader then encounters a 
`FileNotFoundException`.
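   
   As a rough sanity check of the numbers above, here is a minimal sketch of 
the arithmetic in Python. The function names and the FILES_PER_COMMIT constant 
are made up for illustration and are not Hudi APIs; the factor of 3 is simply 
taken from the 1440 * 3 estimate above.

```python
# Assumption: each commit contributes 3 files to the active timeline,
# as implied by the "1440 * 3" estimate in this issue.
FILES_PER_COMMIT = 3


def retained_commits(retention_minutes: int, checkpoint_interval_minutes: int) -> int:
    """Commits to retain so that `retention_minutes` of history stays readable,
    assuming one write commit per checkpoint interval."""
    if checkpoint_interval_minutes <= 0:
        raise ValueError("checkpoint interval must be positive")
    return retention_minutes // checkpoint_interval_minutes


def timeline_files(commits: int) -> int:
    """Lower bound on the number of files kept on the active timeline."""
    return commits * FILES_PER_COMMIT


one_day = 24 * 60  # retention in minutes
commits = retained_commits(one_day, checkpoint_interval_minutes=1)
print(commits, timeline_files(commits))
```

   Under the same assumptions, a 5 minute interval with one day of retention 
gives 288 commits and 864 timeline files, so the problem is mostly felt with 
short checkpoint intervals and long retention.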
   
   So we need a way to lift the restriction that the number of commits on the 
active timeline must be greater than the number of commits retained by cleaning.
   
   One way I can think of is to remember the last committed cleaning instant 
and check only that when cleaning (suitable for the hours-based cleaning 
strategy). With the num_commits cleaning strategy we may need to scan the 
archived timeline (or the metadata table, if it is enabled?).
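   
   To make the idea a bit more concrete, here is a toy model in Python. It is 
not Hudi code: the Timeline class, its fields and both functions are 
hypothetical, and it assumes instant times are strings that sort 
lexicographically. It only illustrates counting retained commits across the 
active and archived timelines, and checking a reader's start instant against a 
remembered cleaning marker.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Timeline:
    """Toy stand-in for a table timeline (hypothetical, not a Hudi class)."""
    active: List[str]    # instant times still on the active timeline
    archived: List[str]  # instant times already moved to the archive
    # Hypothetical marker: earliest instant kept by the last completed clean.
    last_clean_earliest_retained: Optional[str] = None


def earliest_commit_to_retain(timeline: Timeline, num_retain_commits: int) -> Optional[str]:
    """num_commits style: count back over archived + active commits, so the
    retained window no longer has to fit inside the active timeline."""
    if num_retain_commits <= 0:
        raise ValueError("num_retain_commits must be positive")
    all_commits = sorted(timeline.archived + timeline.active)
    if len(all_commits) <= num_retain_commits:
        return None  # fewer commits than the retention window: nothing to clean
    return all_commits[-num_retain_commits]


def can_read_from(timeline: Timeline, start_instant: str) -> bool:
    """A reader is safe if it starts at or after the remembered marker."""
    marker = timeline.last_clean_earliest_retained
    return marker is None or start_instant >= marker


tl = Timeline(
    active=["058", "059", "060"],
    archived=[f"{i:03d}" for i in range(1, 58)],
)
tl.last_clean_earliest_retained = earliest_commit_to_retain(tl, 10)
print(tl.last_clean_earliest_retained, can_read_from(tl, "050"))
```

   In this sketch only 3 commits stay on the active timeline while 10 are 
retained for readers, which is the decoupling asked for above. Whether the 
marker should live in the clean metadata, the archive or the metadata table is 
left open.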
   
   In any case, a solution is urgently needed!
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-3657
   - Type: Improvement

