I updated the FAQ section to set the defaults correctly and to add more information on this: https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-WhatdoestheHudicleanerdo
The cleaner retention configuration is based on counts (the number of commits to be retained), with the assumption that users provide a conservative number. The historical reason is that ingestion used to run at a specific cadence (e.g., every 30 mins), with an ingestion run normally taking less than 30 mins. With this model, it was simpler to represent the configuration as a count of commits that approximates the retention time.

With delta-streamer continuous mode, ingestion is allowed to be scheduled immediately after the previous run, so a commit count no longer maps to a predictable amount of time. I think it would make sense to introduce a time-based retention. I have created a newbie ticket for this: https://jira.apache.org/jira/browse/HUDI-349

Pratyaksh,

In sum, if the defaults are too low, use a conservative number based on the number of ingestion runs you see in your setup. The default referenced in the code comments needs to change (from 24 to 10): https://jira.apache.org/jira/browse/HUDI-350

Thanks,
Balaji.V

On Tue, Nov 19, 2019 at 1:40 AM Pratyaksh Sharma <pratyaks...@gmail.com> wrote:

> Hi,
>
> We are assuming the following in getDeletePaths() method in cleaner flow in
> case of KEEP_LATEST_COMMITS policy -
>
> /**
>  * Selects the versions for file for cleaning, such that it
>  * <p>
>  * - Leaves the latest version of the file untouched - For older versions, - It leaves all the commits untouched which
>  * has occured in last <code>config.getCleanerCommitsRetained()</code> commits - It leaves ONE commit before this
>  * window. We assume that the max(query execution time) == commit_batch_time * config.getCleanerCommitsRetained().
>  * This is 12 hours by default. This is essential to leave the file used by the query thats running for the max time.
>  * <p>
>  * This provides the effect of having lookback into all changes that happened in the last X commits. (eg: if you
>  * retain 24 commits, and commit batch time is 30 mins, then you have 12 hrs of lookback)
>  * <p>
>  * This policy is the default.
>  */
>
> I want to understand the term commit_batch_time in this assumption and the
> assumption as a whole. As per my understanding, this term refers to the
> time taken in one iteration of DeltaSync end to end (which is hardly 7-8
> minutes in my case). If my understanding is correct, then this time will
> vary depending on the size of incoming RDD. So in that case, the time
> needed for the longest query is effectively a variable. So in that case
> what is a safe option to keep for the config
> <code>config.getCleanerCommitsRetained()</code>.
>
> Basically I want to set the config
> <code>config.getCleanerCommitsRetained()</code> properly for my Hudi
> instance and hence I am trying to understand the assumption. Its default
> value is 10, I want to understand if this can be reduced further without
> any query failing.
>
> Please help me with this.
>
> Regards
> Pratyaksh
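
For reference, the arithmetic in the quoted comment (lookback = commits retained x commit batch time) and its inverse can be sketched roughly as below. This is only an illustration in Python, not Hudi code: the function names `lookback_window` and `commits_to_retain` and the `safety_factor` padding are hypothetical, and the sketch assumes commits land at a roughly steady interval.

```python
import math
from datetime import timedelta


def lookback_window(commits_retained: int, commit_interval: timedelta) -> timedelta:
    """Approximate lookback: commits retained multiplied by the time between commits."""
    return commits_retained * commit_interval


def commits_to_retain(longest_query: timedelta,
                      commit_interval: timedelta,
                      safety_factor: float = 2.0) -> int:
    """Hypothetical helper: smallest commit count whose lookback covers the
    longest-running query, padded by a safety factor (an assumption made for
    this sketch, not a rule taken from Hudi)."""
    if commit_interval <= timedelta(0):
        raise ValueError("commit_interval must be positive")
    needed = (longest_query * safety_factor) / commit_interval
    return max(1, math.ceil(needed))


# The example from the code comment: 24 commits at 30 mins each.
print(lookback_window(24, timedelta(minutes=30)))

# A setup where a commit lands every 8 mins and the longest query takes 1 hour.
print(commits_to_retain(timedelta(hours=1), timedelta(minutes=8)))
```

Under these assumptions, a count of 10 with a commit landing every 8 minutes gives roughly 80 minutes of lookback, so whether the count can be lowered depends on how long the longest query runs compared with that window.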