On 2023-09-27 19:09:41 -0400, Melanie Plageman wrote:
> On Wed, Sep 27, 2023 at 3:25 PM Robert Haas <robertmh...@gmail.com> wrote:
> >
> > On Wed, Sep 27, 2023 at 12:34 PM Andres Freund <and...@anarazel.de> wrote:
> > > One way to deal with that would be to not track the average age in
> > > LSN-difference-bytes, but convert the value to some age metric at that
> > > time. If we e.g. were to convert the byte-age into an approximate age in
> > > checkpoints, with quadratic bucketing (e.g. 0 -> current checkpoint,
> > > 1 -> 1 checkpoint, 2 -> 2 checkpoints ago, 3 -> 4 checkpoints ago, ...),
> > > using a mean of that age would probably be fine.
> >
> > Yes. I think it's possible that we could even get by with just two
> > buckets. Say current checkpoint and not. Or current-or-previous
> > checkpoint and not. And just look at what percentage of accesses fall
> > into this first bucket -- it should be small or we're doing it wrong.
> > It seems like the only thing we actually need to avoid is freezing the
> > same ages over and over again in a tight loop.
>
> At the risk of seeming too execution-focused, I want to try and get more
> specific.

I think that's a good intuition :)

> Here is a description of an example implementation to test my
> understanding:
>
> In table-level stats, save two numbers: younger_than_cpt/older_than_cpt
> storing the number of instances of unfreezing a page which is either
> younger or older than the start of the most recent checkpoint at the
> time of its unfreezing

> This has the downside of counting most unfreezings directly after a
> checkpoint in the older_than_cpt bucket. That is: older_than_cpt !=
> longer_frozen_duration at certain times in the checkpoint cycle.

Yea - I don't think just using before/after checkpoint is a good measure. As
you say, it'd be quite jumpy around checkpoints - even though the freezing
behaviour hasn't materially changed. I think using the *distance* between
checkpoints would be a more reliable measure, i.e. if (insert_lsn - page_lsn)
< recent_average_lsn_diff_between_checkpoints, then it's recently modified,
otherwise not.
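
To make that concrete, here is a minimal sketch of the distance-based check.
The lsn_t typedef is only a stand-in for XLogRecPtr, and the helper name
page_recently_modified() is hypothetical, not existing code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t lsn_t;

/*
 * Hypothetical helper: true if the page was last modified within roughly one
 * checkpoint's worth of WAL. The test uses the distance from the current
 * insert position, not whether page_lsn lies before or after the start of
 * the most recent checkpoint, so the answer does not jump when a checkpoint
 * happens.
 */
static bool
page_recently_modified(lsn_t insert_lsn, lsn_t page_lsn,
                       lsn_t recent_average_lsn_diff_between_checkpoints)
{
    /* Guard against a page LSN that is ahead of the sampled insert LSN. */
    lsn_t page_lsn_age = insert_lsn > page_lsn ? insert_lsn - page_lsn : 0;

    return page_lsn_age < recent_average_lsn_diff_between_checkpoints;
}
```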

One problem with using checkpoint "distances" to control things is
forced/immediate checkpoints. The fact that a base backup was started (and
thus a checkpoint completed much earlier than it would have otherwise)
shouldn't make our system assume that the overall behaviour is quite different
going forward.
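
Purely as an illustration of one way to keep such checkpoints from skewing
the measure: maintain a smoothed average of the LSN distance between
checkpoints and drop the sample that ends at a forced checkpoint. Everything
here is an assumption for the sake of the sketch: the struct, the function
name, the smoothing weight, and the idea that the caller knows whether a
checkpoint was forced.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t lsn_t;

/* Assumed smoothing weight for the moving average; not a tuned value. */
#define CHECKPOINT_DISTANCE_SMOOTHING 0.25

/* Hypothetical running state; none of these names exist in PostgreSQL. */
typedef struct CheckpointDistanceState
{
    lsn_t  last_redo_lsn;    /* redo LSN of the previous checkpoint */
    double average_lsn_bytes_per_checkpoint;
    bool   have_baseline;    /* last_redo_lsn is valid */
    bool   have_average;     /* at least one sample was folded in */
} CheckpointDistanceState;

/*
 * Called once per checkpoint. A forced checkpoint only moves the baseline:
 * the unusually short distance leading up to it is not folded into the
 * average, so a base backup does not change the estimate.
 */
static void
checkpoint_distance_update(CheckpointDistanceState *state,
                           lsn_t redo_lsn, bool forced)
{
    if (state->have_baseline && !forced && redo_lsn >= state->last_redo_lsn)
    {
        double distance = (double) (redo_lsn - state->last_redo_lsn);

        if (!state->have_average)
        {
            state->average_lsn_bytes_per_checkpoint = distance;
            state->have_average = true;
        }
        else
            state->average_lsn_bytes_per_checkpoint +=
                CHECKPOINT_DISTANCE_SMOOTHING *
                (distance - state->average_lsn_bytes_per_checkpoint);
    }

    state->last_redo_lsn = redo_lsn;
    state->have_baseline = true;
}
```

The regular checkpoint that follows a forced one is still measured from the
forced checkpoint's redo pointer, so it contributes a full-length sample.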


> Now, I'm trying to imagine how this would interact in a meaningful way
> with opportunistic freezing behavior during vacuum.
>
> You would likely want to combine it with one of the other heuristics we
> discussed.
>
> For example:
> For a table with only 20% younger unfreezings, when vacuuming that page,

Fwiw, I wouldn't say that unfreezing 20% of recently frozen pages is a low
value.


>   if insert LSN - RedoRecPtr < insert LSN - page LSN
>   page is older than the most recent checkpoint start, so freeze it
>   regardless of whether or not it would emit an FPI
>
> What aggressiveness levels should there be? What should change at each
> level? What criteria should pages have to meet to be subject to the
> aggressiveness level?

I'm thinking something very roughly along these lines could make sense:

page_lsn_age = insert_lsn - page_lsn;

if (dirty && !fpi)
{
   /*
    * If we can freeze without an FPI, be quite aggressive about
    * opportunistically freezing. We just need to prevent freezing
    * when the table is constantly being rewritten. It's ok to make mistakes
    * initially - the rate of unfreezes will quickly stop us from making
    * mistakes as often.
    */
#define NO_FPI_FREEZE_FACTOR 10.0
   if (page_lsn_age >
       average_lsn_bytes_per_checkpoint * (1 - recent_unfreeze_ratio) * NO_FPI_FREEZE_FACTOR)
      freeze = true;
}
else
{
   /*
    * Freezing would emit an FPI and/or dirty the page, making freezing quite
    * a bit more costly. Be more hesitant about freezing recently modified
    * data, unless it's very rare that we unfreeze recently modified data.
    * For insert-only/mostly tables, unfreezes should be rare, so we'll still
    * freeze most of the time.
    */
#define FPI_FREEZE_FACTOR 1
   if (page_lsn_age >
       average_lsn_bytes_per_checkpoint * (1 - recent_unfreeze_ratio) * FPI_FREEZE_FACTOR)
      freeze = true;
}
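
For experimenting with the numbers, the rough logic above can be written as
a self-contained function. This is a literal transcription of the sketch and
not a proposed patch: lsn_t stands in for XLogRecPtr, the function name is
hypothetical, and the inputs (average_lsn_bytes_per_checkpoint,
recent_unfreeze_ratio) are assumed to be supplied by stats that do not exist
yet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t lsn_t;

#define NO_FPI_FREEZE_FACTOR 10.0
#define FPI_FREEZE_FACTOR    1.0

/*
 * Hypothetical helper mirroring the two branches of the sketch:
 *   dirty && !fpi  -> freezing needs no FPI, use NO_FPI_FREEZE_FACTOR
 *   otherwise      -> freezing emits an FPI and/or dirties the page
 * recent_unfreeze_ratio is expected to be in [0, 1].
 */
static bool
should_opportunistically_freeze(lsn_t insert_lsn, lsn_t page_lsn,
                                bool dirty, bool fpi,
                                double average_lsn_bytes_per_checkpoint,
                                double recent_unfreeze_ratio)
{
    /* Guard against a page LSN that is ahead of the sampled insert LSN. */
    double page_lsn_age =
        insert_lsn > page_lsn ? (double) (insert_lsn - page_lsn) : 0.0;
    double factor =
        (dirty && !fpi) ? NO_FPI_FREEZE_FACTOR : FPI_FREEZE_FACTOR;
    double threshold = average_lsn_bytes_per_checkpoint *
        (1 - recent_unfreeze_ratio) * factor;

    return page_lsn_age > threshold;
}
```

For example, with average_lsn_bytes_per_checkpoint = 1000 and
recent_unfreeze_ratio = 0.5, the threshold is 5000 bytes of LSN age in the
no-FPI branch and 500 in the FPI branch. Read literally, a larger factor and
a smaller unfreeze ratio both raise the threshold, so the constants and the
(1 - recent_unfreeze_ratio) term would likely need adjusting to match the
intent described in the comments.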

Greetings,

Andres Freund

