Hello all,

So it all boils down to:

Make the change now, which could break existing backup workflows, OR
do not make the change now, which leads to more and more users experiencing
server crashes due to OOM (on top of the performance overhead) under certain
workloads.

It is a tough call to make, but perhaps we could display a message in the
WebUI/logs/ksck based on monitoring diff scans. For example, if a server has
been up for more than 7 days and we have not seen any diff scan requesting
data with a timestamp older than 15 minutes, we can alert the user through
the WebUI/ksck/logs.
We could go a step further: since this is marked as a runtime flag, we
might as well lower it to 15 minutes (or to the longest time period seen
in those past 7 days) and log the configuration change.
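
To make the heuristic above a bit more concrete, here is a rough Python
sketch (Kudu itself is C++, so this is only pseudocode-level illustration).
The names DiffScanStats and suggest_history_max_age are made up for this
email and are not existing Kudu code; how the server would actually track
the oldest timestamp requested by diff scans is an open assumption.

```python
from dataclasses import dataclass
from typing import Optional

SEVEN_DAYS_SEC = 7 * 24 * 60 * 60
FIFTEEN_MIN_SEC = 15 * 60


@dataclass
class DiffScanStats:
    """Hypothetical per-server bookkeeping; not an existing Kudu structure."""
    uptime_sec: int
    # Largest age (now minus the requested start timestamp) asked for by any
    # diff scan in the observation window, or None if no diff scan was seen.
    max_requested_age_sec: Optional[int]


def suggest_history_max_age(stats: DiffScanStats,
                            current_max_age_sec: int) -> Optional[int]:
    """Return a lower value to suggest for the flag, or None to leave it."""
    if stats.uptime_sec <= SEVEN_DAYS_SEC:
        return None  # not enough observation time yet
    longest_seen = stats.max_requested_age_sec or 0
    # Never go below 15 minutes; otherwise use the longest age actually seen.
    suggested = max(FIFTEEN_MIN_SEC, longest_seen)
    if suggested >= current_max_age_sec:
        return None  # current setting is already at or below the suggestion
    return suggested
```

A caller would then either just warn (WebUI/ksck/logs) with the returned
value, or apply it to the runtime flag and log the change.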

Thanks,
Abhishek

On Wed, Feb 12, 2025 at 7:17 AM Ashwani Raina <ara...@cloudera.com.invalid>
wrote:

> While I concur with Attila on most of the points, I still think it would be
> worth reducing the tablet history.
> Since release notes do not give us any guarantee that users will notice the
> change, I am wondering whether we could make this flag mandatory.
> Just like the flags that users are already required to configure (master
> address, dir paths, etc.), this flag could fall under the same category.
> Also, we can provide guidance on the value based on use cases, along with
> limitations.
> Users will get to choose a value as they see fit instead of relying on some
> default value.
>
> Alternatively, we could think of reducing the value to 24 hours since that
> seems to be a typical backup schedule.
>
> Regards,
> Ashwani
>
> On Wed, Feb 12, 2025 at 7:57 PM Attila Bukor <abu...@apache.org> wrote:
>
> > Hi Alexey,
> >
> > Thanks for bringing this up.
> >
> > > Does this look like a drastic and maybe a breaking change to anybody?
> >
> > It sounds like a breaking change to me as it can easily break existing
> > use-cases after an upgrade if someone relies on the default values. Even
> > if this is mentioned in the release notes, there's no guarantee that the
> > person performing the upgrade knows about all the use-cases that rely on
> > a long AHM (or that they read the release notes carefully before
> > upgrading).
> >
> > Furthermore, if someone has backup/restore enabled and forgets to change
> > the value while upgrading, an incremental backup cannot be retried after
> > setting it, because the historical data is gone, so they have to do a
> > full backup again.
> >
> > > If yes, what alternatives should we consider instead?
> >
> > I would simply suggest adding it to the known issues section on the
> > website with the suggested workaround. Additionally, we could log a
> > warning at startup linking to the known issues page if the value of
> > --tablet_history_max_age_sec is higher than 15 minutes.
> >
> > - Attila
> >
>
