On Thu, Dec 15, 2022 at 11:48 PM John Naylor <john.nay...@enterprisedb.com> wrote:
> Thanks for this. This is the kind of concrete, data-based evidence that I
> find much more convincing, or at least easy to reason about.
I'm glad to hear that it helped. It's always difficult to judge where other
people are coming from, especially when it's not clear how much context is
shared. Face time would have helped here, too.

> One motivating example mentioned is the append-only table. If we detected
> that case, which I assume we can because autovacuum_vacuum_insert_* GUCs
> exist, we could use that information as one way to drive eager freezing
> independently of size. At least in theory -- it's very possible size will be
> a necessary part of the decision, but it's less clear that it's as useful as
> a user-tunable knob.

I am not strongly opposed to that idea, though I have my doubts about it. I
have thought about it already, and it wouldn't be hard to get the information
to vacuumlazy.c (I plan on doing it as part of related work on antiwraparound
autovacuum, in fact [1]).

I'm skeptical of the general idea that autovacuum.c has enough reliable
information to give detailed recommendations as to how vacuumlazy.c should
process the table. I have pointed out several major flaws with the
autovacuum.c dead tuple accounting in the past [2][3], but I also think that
there are significant problems with the tuples-inserted accounting.
Basically, I think that there are effects which are arguably an example of
the inspection paradox [4].

Insert-based autovacuums occur on a timeline determined by the "inserted
since last autovacuum" statistics. These statistics are (in part) maintained
by autovacuum/VACUUM itself, which has no specific understanding of how it
might end up chasing its own tail.

Let me be more concrete about what I mean by autovacuum chasing its own tail.
The autovacuum_vacuum_insert_threshold mechanism works by triggering an
autovacuum whenever the number of tuples inserted since the last
autovacuum/VACUUM reaches a certain threshold -- usually some fixed
proportion of pg_class.reltuples. But the tuples-inserted-since-last-VACUUM
counter gets reset at the end of VACUUM, not at the start, whereas VACUUM
itself processes only the subset of pages that needed to be vacuumed as of
the start of the VACUUM. There is no attempt to compensate for that
disparity. This *isn't* really a measure of "unvacuumed tuples" (you'd need
to compensate to get that).
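To make that concrete, here is a minimal sketch of the trigger condition,
loosely following the logic in autovacuum.c (variable names and defaults are
approximations for illustration, not the actual code):

#include <stdbool.h>
#include <stdint.h>

/* roughly autovacuum_vacuum_insert_threshold and _scale_factor */
static const double vac_ins_base_thresh = 1000;
static const double vac_ins_scale_factor = 0.2;

static bool
needs_insert_vacuum(double reltuples, int64_t ins_since_vacuum)
{
    /* fixed base plus a proportion of pg_class.reltuples */
    double vacinsthresh = vac_ins_base_thresh +
        vac_ins_scale_factor * reltuples;

    /*
     * ins_since_vacuum is only zeroed when a VACUUM finishes, even though
     * that VACUUM only saw the pages that existed when it started.
     */
    return ins_since_vacuum > vacinsthresh;
}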
This "at the start vs at the end" difference won't matter at all with smaller
tables. And even in larger tables we might hope that the effect would kind of
average out. But what about cases where one particular VACUUM operation takes
an unusually long time, out of a sequence of successive VACUUMs that run
against the same table? For example, the sequence that you see on the Wiki
page, when Postgres HEAD autovacuum does an aggressive VACUUM on one
occasion, which takes dramatically longer [5].

Notice that the sequence in [5] shows that the patch does one more autovacuum
operation in total, compared to HEAD/master. That's a lot more -- we're
talking about VACUUMs that each take 40+ minutes. That can be explained by
the fact that VACUUM (quite naturally) resets the "tuples inserted since last
VACUUM" counter at the end of that unusually long running aggressive
autovacuum -- just like any other VACUUM would.

That seems very weird to me. If (say) we happened to have a much higher
vacuum_freeze_table_age setting, then we wouldn't have had an aggressive
VACUUM until much later on (or never, because the benchmark would just end).
And the VACUUM that was aggressive would have been a regular VACUUM instead,
and would therefore have completed far sooner, and would therefore have had a
*totally* different cadence, compared to what we actually saw -- it becomes
distorted in a way that outlasts the aggressive VACUUM.

With a far higher vacuum_freeze_table_age, we might have even managed to do
two regular autovacuums in the same period that it took a single aggressive
VACUUM to run in (that's not too far from what actually happened with the
patch). The *second* regular autovacuum would then end up resetting the
"inserted since last VACUUM" counter to 0 at the same time as the long
running aggressive VACUUM actually did so (same wall clock time, same time
since the start of the benchmark).

Notice that we'll have done much less useful work (on cleaning up bloat and
setting newer pages all-visible) with the "one long aggressive mode VACUUM"
scenario -- we'll be way behind -- but the statistics will nevertheless look
about the same as they do in the "two fast autovacuums instead of one slow
autovacuum" counterfactual scenario.

In short, autovacuum.c fails to appreciate that a lot of stuff about the
table changes when VACUUM runs. Time hasn't stood still -- the table was
modified and extended throughout. So autovacuum.c hasn't compensated for how
VACUUM actually performed, and, in effect, forgets how far it has fallen
behind. It should be eager to start the next autovacuum very quickly, having
fallen behind, but it isn't eager.

This is all the more reason to get rid of aggressive mode, but that's not my
point -- my point is that the statistics driving things seem quite dubious,
in all sorts of ways.

> Aside from that, I've only given the patches a brief reading.

Thanks for taking a look.

> Having seen the VM snapshot in practice (under "Scanned pages, visibility map
> snapshot" in the wiki page), it's neat to see fewer pages being scanned.
> Prefetching not only seems superior to SKIP_PAGES_THRESHOLD, but anticipates
> asynchronous IO.

All of that is true, but more than anything else the VM snapshot concept
appeals to me because it seems to make VACUUMs of large tables more similar
to VACUUMs of small tables. Particularly when one individual VACUUM happens
to take an unusually long amount of time, for whatever reason (the best
example right now is aggressive mode, but there are other ways in which
VACUUM can take far longer than expected).

That approach seems much more logical. I also think that it'll make it easier
to teach VACUUM to "pick up where the last VACUUM left off" in the future.

I understand why you haven't seriously investigated using the same
information for the Radix tree dead_items project. I certainly don't object.
But I still think that having one integrated data structure (VM snapshots +
dead_items) is worth exploring in the future. It's something that I think is
quite promising.

> I do have a cosmetic, but broad-reaching, nitpick about terms regarding
> "skipping strategy". That's phrased as a kind of negative -- what we're *not*
> doing. Many times I had to pause and compute in my head what we're *doing*,
> i.e. the "scanning strategy". For example, I wonder if the VM strategies
> would be easier to read as:
>
> VMSNAP_SKIP_ALL_VISIBLE -> VMSNAP_SCAN_LAZY
> VMSNAP_SKIP_ALL_FROZEN -> VMSNAP_SCAN_EAGER
> VMSNAP_SKIP_NONE -> VMSNAP_SCAN_ALL
>
> Notice here they're listed in order of increasing eagerness.

I agree that the terminology around skipping strategies is confusing, and
plan to address that in the next version. I'll consider using this scheme
for v10.
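For what it's worth, that scheme might end up looking something like the
following in the patch (just an illustration of the proposed names -- the
actual enum and comments could certainly differ):

typedef enum vmsnapscanstrategy
{
    VMSNAP_SCAN_LAZY,   /* was VMSNAP_SKIP_ALL_VISIBLE: skip all-visible pages */
    VMSNAP_SCAN_EAGER,  /* was VMSNAP_SKIP_ALL_FROZEN: skip only all-frozen pages */
    VMSNAP_SCAN_ALL     /* was VMSNAP_SKIP_NONE: scan every page */
} vmsnapscanstrategy;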
[1] https://commitfest.postgresql.org/41/4027/
[2] https://postgr.es/m/CAH2-Wz=MGFwJEpEjVzXwEjY5yx=uunpza6bt4dsmasrgluq...@mail.gmail.com
[3] https://postgr.es/m/cah2-wznrzc-ohkb+qzqs65o+8_jtj6rxadjh+8ebqjrd1f8...@mail.gmail.com
[4] https://towardsdatascience.com/the-inspection-paradox-is-everywhere-2ef1c2e9d709
[5] https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Scanned_pages.2C_visibility_map_snapshot

--
Peter Geoghegan