On 2023-09-27 19:09:41 -0400, Melanie Plageman wrote: > On Wed, Sep 27, 2023 at 3:25 PM Robert Haas <robertmh...@gmail.com> wrote: > > > > On Wed, Sep 27, 2023 at 12:34 PM Andres Freund <and...@anarazel.de> wrote: > > > One way to deal with that would be to not track the average age in > > > LSN-difference-bytes, but convert the value to some age metric at that > > > time. If we e.g. were to convert the byte-age into an approximate age in > > > checkpoints, with quadratic bucketing (e.g. 0 -> current checkpoint, 1 -> > > > 1 > > > checkpoint, 2 -> 2 checkpoints ago, 3 -> 4 checkpoints ago, ...), using a > > > mean > > > of that age would probably be fine. > > > > Yes. I think it's possible that we could even get by with just two > > buckets. Say current checkpoint and not. Or current-or-previous > > checkpoint and not. And just look at what percentage of accesses fall > > into this first bucket -- it should be small or we're doing it wrong. > > It seems like the only thing we actually need to avoid is freezing the > > same ages over and over again in a tight loop. > > At the risk of seeming too execution-focused, I want to try and get more > specific.
I think that's a good intuition :) > Here is a description of an example implementation to test my > understanding: > > In table-level stats, save two numbers: younger_than_cpt/older_than_cpt > storing the number of instances of unfreezing a page which is either > younger or older than the start of the most recent checkpoint at the > time of its unfreezing > This has the downside of counting most unfreezings directly after a > checkpoint in the older_than_cpt bucket. That is: older_than_cpt != > longer_frozen_duration at certain times in the checkpoint cycle. Yea - I don't think just using before/after checkpoint is a good measure. As you say, it'd be quite jumpy around checkpoints - even though the freezing behaviour hasn't materially changed. I think using the *distance* between checkpoints would be a more reliable measure, i.e. if (insert_lsn - page_lsn) < recent_average_lsn_diff_between_checkpoints, then it's recently modified, otherwise not. One problem with using checkpoints "distances" to control things is forced/immediate checkpoints. The fact that a base backup was started (and thus a checkpoint completed much earlier than it would have otherwise) shouldn't make our system assume that the overall behaviour is quite different going forward. > Now, I'm trying to imagine how this would interact in a meaningful way > with opportunistic freezing behavior during vacuum. > > You would likely want to combine it with one of the other heuristics we > discussed. > > For example: > For a table with only 20% younger unfreezings, when vacuuming that page, Fwiw, I wouldn't say that unfreezing 20% of recently frozen pages is a low value. > if insert LSN - RedoRecPtr < insert LSN - page LSN > page is older than the most recent checkpoint start, so freeze it > regardless of whether or not it would emit an FPI > > What aggressiveness levels should there be? What should change at each > level? What criteria should pages have to meet to be subject to the > aggressiveness level? I'm thinking something very roughly along these lines could make sense: page_lsn_age = insert_lsn - page_lsn; if (dirty && !fpi) { /* * If we can freeze without an FPI, be quite agressive about * opportunistically freezing. We just need to prevent freezing * when the table is constantly being rewritten. It's ok to make mistakes * initially - the rate of unfreezes will quickly stop us from making * mistakes as often. */ #define NO_FPI_FREEZE_FACTOR 10.0 if (page_lsn_age > average_lsn_bytes_per_checkpoint * (1 - recent_unfreeze_ratio) * NO_FPI_FREEZE_FACTOR) freeze = true; } else { /* * Freezing would emit an FPI and/or dirty the page, making freezing quite * a bit more costly. Be more hesitant about freezing recently modified * data, unless it's very rare that we unfreeze recently modified data. * For insert-only/mostly tables, unfreezes should be rare, so we'll still * freeze most of the time. */ #define FPI_FREEZE_FACTOR 1 if (page_lsn_age > average_lsn_bytes_per_checkpoint * (1 - recent_unfreeze_ratio) * FPI_FREEZE_FACTOR) freeze = true; } Greetings, Andres Freund