On Thu, Dec 15, 2022 at 11:48 PM John Naylor <john.nay...@enterprisedb.com> wrote:
> Thanks for this. This is the kind of concrete, data-based evidence that I
> find much more convincing, or at least easy to reason about.
I'm glad to hear that it helped. It's always difficult to judge where other
people are coming from, especially when it's not clear how much context is
shared. Face time would have helped here, too.

> One motivating example mentioned is the append-only table. If we detected
> that case, which I assume we can because autovacuum_vacuum_insert_* GUCs
> exist, we could use that information as one way to drive eager freezing
> independently of size. At least in theory -- it's very possible size will be
> a necessary part of the decision, but it's less clear that it's as useful as
> a user-tunable knob.

I am not strongly opposed to that idea, though I have my doubts about it. I
have thought about it already, and it wouldn't be hard to get the information
to vacuumlazy.c (I plan on doing it as part of related work on antiwraparound
autovacuum, in fact [1]).

I'm skeptical of the general idea that autovacuum.c has enough reliable
information to give detailed recommendations as to how vacuumlazy.c should
process the table. I have pointed out several major flaws with the
autovacuum.c dead tuple accounting in the past [2][3], but I also think that
there are significant problems with the tuples-inserted accounting.
Basically, I think that there are effects which are arguably an example of
the inspection paradox [4].

Insert-based autovacuums occur on a timeline determined by the "inserted
since last autovacuum" statistics. These statistics are (in part) maintained
by autovacuum/VACUUM itself, which has no specific understanding of how it
might end up chasing its own tail.

Let me be more concrete about what I mean by autovacuum chasing its own tail.
The autovacuum_vacuum_insert_threshold mechanism works by triggering an
autovacuum whenever the number of tuples inserted since the last
autovacuum/VACUUM reaches a certain threshold -- usually some fixed
proportion of pg_class.reltuples. But the tuples-inserted-since-last-VACUUM
counter gets reset at the end of VACUUM, not at the start, whereas VACUUM
itself processes only the subset of pages that needed to be vacuumed as of
the start of the VACUUM. There is no attempt to compensate for that
disparity. This *isn't* really a measure of "unvacuumed tuples" (you'd need
to compensate to get that).
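To make that concrete, here is a minimal sketch of the trigger condition,
loosely following the logic in autovacuum.c (variable names and defaults are
approximations for illustration, not the actual code):

#include <stdbool.h>
#include <stdint.h>

/* roughly autovacuum_vacuum_insert_threshold and _scale_factor */
static const double vac_ins_base_thresh = 1000;
static const double vac_ins_scale_factor = 0.2;

static bool
needs_insert_vacuum(double reltuples, int64_t ins_since_vacuum)
{
    /* fixed base plus a proportion of pg_class.reltuples */
    double vacinsthresh = vac_ins_base_thresh +
        vac_ins_scale_factor * reltuples;

    /*
     * ins_since_vacuum is only zeroed when a VACUUM finishes, even though
     * that VACUUM only saw the pages that existed when it started.
     */
    return ins_since_vacuum > vacinsthresh;
}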
This "at the start vs at the end" difference won't matter at all with smaller
tables. And even in larger tables we might hope that the effect would kind of
average out. But what about cases where one particular VACUUM operation takes
an unusually long time, out of a sequence of successive VACUUMs that run
against the same table? For example, the sequence that you see on the Wiki
page, when Postgres HEAD autovacuum does an aggressive VACUUM on one
occasion, which takes dramatically longer [5].

Notice that the sequence in [5] shows that the patch does one more autovacuum
operation in total, compared to HEAD/master. That's a lot more -- we're
talking about VACUUMs that each take 40+ minutes. That can be explained by
the fact that VACUUM (quite naturally) resets the "tuples inserted since last
VACUUM" counter at the end of that unusually long running aggressive
autovacuum -- just like any other VACUUM would.

That seems very weird to me. If (say) we happened to have a much higher
vacuum_freeze_table_age setting, then we wouldn't have had an aggressive
VACUUM until much later on (or never, because the benchmark would just end).
And the VACUUM that was aggressive would have been a regular VACUUM instead,
and would therefore have completed far sooner, and would therefore have had a
*totally* different cadence, compared to what we actually saw -- it becomes
distorted in a way that outlasts the aggressive VACUUM.

With a far higher vacuum_freeze_table_age, we might have even managed to do
two regular autovacuums in the same period that it took a single aggressive
VACUUM to run in (that's not too far from what actually happened with the
patch). The *second* regular autovacuum would then end up resetting the
"inserted since last VACUUM" counter to 0 at the same time as the long
running aggressive VACUUM actually did so (same wall clock time, same time
since the start of the benchmark).

Notice that we'll have done much less useful work (on cleaning up bloat and
setting newer pages all-visible) with the "one long aggressive mode VACUUM"
scenario -- we'll be way behind -- but the statistics will nevertheless look
about the same as they do in the "two fast autovacuums instead of one slow
autovacuum" counterfactual scenario.

In short, autovacuum.c fails to appreciate that a lot of stuff about the
table changes when VACUUM runs. Time hasn't stood still -- the table was
modified and extended throughout. So autovacuum.c hasn't compensated for how
VACUUM actually performed, and, in effect, forgets how far it has fallen
behind. It should be eager to start the next autovacuum very quickly, having
fallen behind, but it isn't eager.

This is all the more reason to get rid of aggressive mode, but that's not my
point -- my point is that the statistics driving things seem quite dubious,
in all sorts of ways.

> Aside from that, I've only given the patches a brief reading.

Thanks for taking a look.

> Having seen the VM snapshot in practice (under "Scanned pages, visibility map
> snapshot" in the wiki page), it's neat to see fewer pages being scanned.
> Prefetching not only seems superior to SKIP_PAGES_THRESHOLD, but anticipates
> asynchronous IO.

All of that is true, but more than anything else the VM snapshot concept
appeals to me because it seems to make VACUUMs of large tables more similar
to VACUUMs of small tables. Particularly when one individual VACUUM happens
to take an unusually long amount of time, for whatever reason (the best
example right now is aggressive mode, but there are other ways in which
VACUUM can take far longer than expected).

That approach seems much more logical. I also think that it'll make it easier
to teach VACUUM to "pick up where the last VACUUM left off" in the future.

I understand why you haven't seriously investigated using the same
information for the Radix tree dead_items project. I certainly don't object.
But I still think that having one integrated data structure (VM snapshots +
dead_items) is worth exploring in the future. It's something that I think is
quite promising.

> I do have a cosmetic, but broad-reaching, nitpick about terms regarding
> "skipping strategy". That's phrased as a kind of negative -- what we're *not*
> doing. Many times I had to pause and compute in my head what we're *doing*,
> i.e. the "scanning strategy". For example, I wonder if the VM strategies
> would be easier to read as:
>
> VMSNAP_SKIP_ALL_VISIBLE -> VMSNAP_SCAN_LAZY
> VMSNAP_SKIP_ALL_FROZEN -> VMSNAP_SCAN_EAGER
> VMSNAP_SKIP_NONE -> VMSNAP_SCAN_ALL
>
> Notice here they're listed in order of increasing eagerness.

I agree that the terminology around skipping strategies is confusing, and
plan to address that in the next version. I'll consider using this scheme
for v10.
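For what it's worth, that scheme might end up looking something like the
following in the patch (just an illustration of the proposed names -- the
actual enum and comments could certainly differ):

typedef enum vmsnapscanstrategy
{
    VMSNAP_SCAN_LAZY,   /* was VMSNAP_SKIP_ALL_VISIBLE: skip all-visible pages */
    VMSNAP_SCAN_EAGER,  /* was VMSNAP_SKIP_ALL_FROZEN: skip only all-frozen pages */
    VMSNAP_SCAN_ALL     /* was VMSNAP_SKIP_NONE: scan every page */
} vmsnapscanstrategy;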
[1] https://commitfest.postgresql.org/41/4027/
[2] https://postgr.es/m/CAH2-Wz=MGFwJEpEjVzXwEjY5yx=uunpza6bt4dsmasrgluq...@mail.gmail.com
[3] https://postgr.es/m/cah2-wznrzc-ohkb+qzqs65o+8_jtj6rxadjh+8ebqjrd1f8...@mail.gmail.com
[4] https://towardsdatascience.com/the-inspection-paradox-is-everywhere-2ef1c2e9d709
[5] https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Scanned_pages.2C_visibility_map_snapshot

--
Peter Geoghegan