Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Thu, Apr 14, 2022 at 4:19 PM Jim Nasby wrote: > > - percentage of non-yet-removable vs removable tuples > > This'd give you an idea how bad your long-running-transaction problem is. VACUUM fundamentally works by removing those tuples that are considered dead according to an XID-based cutoff established when the operation begins. And so many very long running VACUUM operations will see dead-but-not-removable tuples even when there are absolutely no long running transactions (nor any other VACUUM operations). The only long running thing involved might be our own long running VACUUM operation. I would like to reduce the number of non-removable dead tuples encountered by VACUUM by "locking in" heap pages that we'd like to scan up front. This would work by having VACUUM create its own local in-memory copy of the visibility map before it even starts scanning heap pages. That way VACUUM won't end up visiting heap pages just because they were concurrently modified halfway through our VACUUM (by some other transactions). We don't really need to scan these pages at all -- they have dead tuples, but not tuples that are "dead to VACUUM". The key idea here is to remove a big unnatural downside to slowing VACUUM down. The cutoff would almost work like an MVCC snapshot that describes precisely the work that VACUUM needs to do (which pages to scan) up-front. Once that's locked in, the amount of work we're required to do cannot go up as we're doing it (or it'll be less of an issue, at least). It would also help if VACUUM didn't scan pages that it already knows don't have any dead tuples. The current SKIP_PAGES_THRESHOLD rule could easily be improved. That's almost the same problem. -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Mon, Apr 4, 2022 at 8:25 PM Peter Geoghegan wrote: > Right. The reason I used WARNINGs was because it matches vaguely > related WARNINGs in vac_update_relstats()'s sibling function, > vacuum_set_xid_limits(). Okay, pushed the relfrozenxid warning patch. Thanks -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Mon, Apr 4, 2022 at 8:18 PM Andres Freund wrote: > The remaining patch is the warnings in vac_update_relstats(), correct? I > guess one could argue they should be LOG rather than WARNING, but I find the > project stance on that pretty impractical. So warning's ok with me. Right. The reason I used WARNINGs was because it matches vaguely related WARNINGs in vac_update_relstats()'s sibling function, vacuum_set_xid_limits(). > Not sure why you used errmsg_internal()? The usual reason for using errmsg_internal(), I suppose. I tend to do that with corruption related messages on the grounds that they're usually highly obscure issues that are (by definition) never supposed to happen. The only thing that a user can be expected to do with the information from the message is to report it to -bugs, or find some other similar report. -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-04-04 19:32:13 -0700, Peter Geoghegan wrote: > On Fri, Apr 1, 2022 at 10:54 AM Peter Geoghegan wrote: > > I also refined the WARNING patch in v15. It now actually issues > > WARNINGs (rather than PANICs, which were just a temporary debugging > > measure in v14). > > Going to commit this remaining patch tomorrow, barring objections. The remaining patch is the warnings in vac_update_relstats(), correct? I guess one could argue they should be LOG rather than WARNING, but I find the project stance on that pretty impractical. So warning's ok with me. Not sure why you used errmsg_internal()? Otherwise LGTM. Greetings, Andres Freund
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Apr 1, 2022 at 10:54 AM Peter Geoghegan wrote: > I also refined the WARNING patch in v15. It now actually issues > WARNINGs (rather than PANICs, which were just a temporary debugging > measure in v14). Going to commit this remaining patch tomorrow, barring objections. -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sun, Apr 3, 2022 at 12:05 PM Andres Freund wrote: > Just saw that you committed: Wee! I think this will be a substantial > improvement for our users. I hope so! I think that it's much more useful as the basis for future work than as a standalone thing. Users of Postgres 15 might not notice a huge difference. But it opens up a lot of new directions to take VACUUM in. I would like to get rid of anti-wraparound VACUUMs and aggressive VACUUMs in Postgres 16. This isn't as radical as it sounds. It seems quite possible to find a way for *every* VACUUM to become aggressive progressively and dynamically. We'll still need to have autovacuum.c know about wraparound, but it should be just another threshold, not fundamentally different to the other thresholds (except that it's still used when autovacuum is nominally disabled). The behavior around autovacuum cancellations is probably still going to be necessary when age(relfrozenxid) gets too high, but it shouldn't be conditioned on what age(relfrozenxid) *used to be*, when the autovacuum started. That could have been a long time ago. It should be based on what's happening *right now*. > While I was writing the above I, again, realized that it'd be awfully nice to > have some accumulated stats about (auto-)vacuum's effectiveness. For us to get > feedback about improvements more easily and for users to know what aspects > they need to tune. Strongly agree. And I'm excited about the potential of the shared memory stats patch to enable more thorough instrumentation, which allows us to improve things with feedback that we just can't get right now. VACUUM is still too complicated -- that makes this kind of analysis much harder, even for experts. You need more continuous behavior to get value from this kind of analysis. There are too many things that might end up mattering, that really shouldn't ever matter. Too much potential for strange illogical discontinuities in performance over time.
Having only one type of VACUUM (excluding VACUUM FULL) will be much easier for users to reason about. But I also think that it'll be much easier for us to reason about. For example, better autovacuum scheduling will be made much easier if autovacuum.c can just assume that every VACUUM operation will do the same amount of work. (Another problem with the scheduling is that it uses ANALYZE statistics (sampling) in a way that just doesn't make any sense for something like VACUUM, which is an inherently dynamic and cyclic process.) None of this stuff has to rely on my patch for freezing. We don't necessarily have to make every VACUUM advance relfrozenxid to do all this. The important point is that we definitely shouldn't be putting off *all* freezing of all-visible pages in non-aggressive VACUUMs (or in VACUUMs that are not expected to advance relfrozenxid). Even a very conservative implementation could achieve all this; we need only spread out the burden of freezing all-visible pages over time, across multiple VACUUM operations. Make the behavior continuous. > Knowing how many times a table was vacuumed doesn't really tell that much, and > requiring to enable log_autovacuum_min_duration and then aggregating those > results is pretty painful (and version dependent). Yeah. Ideally we could avoid making the output of log_autovacuum_min_duration into an API, by having a real API instead. The output probably needs to evolve some more. A lot of very basic information wasn't there until recently. > If we just collected something like: > - number of heap passes > - time spent heap vacuuming > - number of index scans > - time spent index vacuuming > - time spent delaying You forgot FPIs. > - percentage of non-yet-removable vs removable tuples I think that we should address this directly too. By "taking a snapshot of the visibility map", so we at least don't scan/vacuum heap pages that don't really need it. 
This is also valuable because it makes slowing down VACUUM (maybe slowing it down a lot) have fewer downsides. At least we'll have "locked in" our scanned_pages, which we can figure out in full before we really scan even one page. > it'd start to be a heck of a lot easier to judge how well autovacuum is > coping. What about the potential of the shared memory stats stuff to totally replace the use of ANALYZE stats in autovacuum.c? Possibly with help from vacuumlazy.c, and the visibility map? I see a lot of potential for exploiting the visibility map more, both within vacuumlazy.c itself, and for autovacuum.c scheduling [1]. I'd probably start with the scheduling stuff, and only then work out how to show users more actionable information. [1] https://postgr.es/m/cah2-wzkt9ey9nnm7q9nsaw5jdbjvsaq3yvb4ut4m93uajvd...@mail.gmail.com -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-04-01 10:54:14 -0700, Peter Geoghegan wrote: > On Thu, Mar 31, 2022 at 11:19 AM Peter Geoghegan wrote: > > The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)". > > Attached is v15. I plan to commit the first two patches (the most > substantial two patches by far) in the next couple of days, barring > objections. Just saw that you committed: Wee! I think this will be a substantial improvement for our users. While I was writing the above I, again, realized that it'd be awfully nice to have some accumulated stats about (auto-)vacuum's effectiveness. For us to get feedback about improvements more easily and for users to know what aspects they need to tune. Knowing how many times a table was vacuumed doesn't really tell that much, and requiring to enable log_autovacuum_min_duration and then aggregating those results is pretty painful (and version dependent). If we just collected something like: - number of heap passes - time spent heap vacuuming - number of index scans - time spent index vacuuming - time spent delaying - percentage of non-yet-removable vs removable tuples it'd start to be a heck of a lot easier to judge how well autovacuum is coping. If we tracked the related pieces above in the index stats (or perhaps additionally there), it'd also make it easier to judge the cost of different indexes. - Andres
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Thu, Mar 31, 2022 at 11:19 AM Peter Geoghegan wrote: > The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)". Attached is v15. I plan to commit the first two patches (the most substantial two patches by far) in the next couple of days, barring objections. v15 removes this "Assert(diff > 0)" assertion from 0001. It's not adding any value, now that the underlying issue that it accidentally brought to light is well understood (there are still more robust assertions covering the relfrozenxid/relminmxid invariants). "Assert(diff > 0)" is liable to fail until the underlying bug on HEAD is fixed, which can be treated as separate work. I also refined the WARNING patch in v15. It now actually issues WARNINGs (rather than PANICs, which were just a temporary debugging measure in v14). Also fixed a compiler warning in this patch, based on a complaint from CFBot's CompilerWarnings task. I can delay committing this WARNING patch until right before feature freeze. Seems best to give others more opportunity for comments. -- Peter Geoghegan

v15-0003-Have-VACUUM-warn-on-relfrozenxid-from-the-future.patch Description: Binary data
v15-0004-vacuumlazy.c-Move-resource-allocation-to-heap_va.patch Description: Binary data
v15-0002-Generalize-how-VACUUM-skips-all-frozen-pages.patch Description: Binary data
v15-0001-Set-relfrozenxid-to-oldest-extant-XID-seen-by-VA.patch Description: Binary data
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Thu, Mar 31, 2022 at 10:50 AM Andres Freund wrote: > > So, to be clear: vac_update_relstats() never actually considered the > > new relfrozenxid value from its vacuumlazy.c caller to be "in the > > future"? > > No, I added separate debug messages for those, and also applied your patch, > and it didn't trigger. The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)". Plus the other related assert I mentioned did not trigger. So when this "diff" assert did trigger, the value of "diff" must have been 0 (not a negative value). While this state does technically indicate that the "existing" relfrozenxid value (actually a stale version) appears to be "in the future" (because the OldestXmin XID might still never have been allocated), it won't ever be in the future according to vac_update_relstats() (even if it used that version). I suppose that I might be wrong about that, somehow -- anything is possible. The important point is that there is currently no evidence that this bug (or any very recent bug) could ever allow vac_update_relstats() to actually believe that it needs to update relfrozenxid/relminmxid, purely because the existing value is in the future. The fact that vac_update_relstats() doesn't log/warn when this happens is very unfortunate, but there is nevertheless no evidence that that would have informed us of any bug on HEAD, even including the actual bug here, which is a bug in vacuumlazy.c (not in vac_update_relstats). > I do think we should apply a version of the warnings you have (with a WARNING > instead of PANIC obviously). I think it's bordering on insanity that we have > so many paths to just silently fix stuff up around vacuum. It's like we want > things to be undebuggable, and to give users no warnings about something being > up. Yeah, it's just totally self defeating to not at least log it. I mean this is a code path that is only hit once per VACUUM, so there is practically no risk of that causing any new problems. 
> Can you repro the issue with my recipe? FWIW, adding log_min_messages=debug5 > and fsync=off made the crash trigger more quickly. I'll try to do that today. I'm not feeling the most energetic right now, to be honest. -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-03-31 10:12:49 -0700, Peter Geoghegan wrote: > On Wed, Mar 30, 2022 at 9:59 PM Andres Freund wrote: > > I'm not sure there's a proper bug on HEAD here. I think at worst it can > > delay > > the horizon increasing a bunch, by falsely not using an aggressive vacuum > > when > > we should have - might even be limited to a single autovacuum cycle. > > So, to be clear: vac_update_relstats() never actually considered the > new relfrozenxid value from its vacuumlazy.c caller to be "in the > future"? No, I added separate debug messages for those, and also applied your patch, and it didn't trigger. I don't immediately see how we could end up computing a frozenxid value that would be problematic? The pgcform->relfrozenxid value will always be the "local" value, which afaics can be behind the other database's value (and thus behind the value from the relcache init file). But it can't be ahead, we have the proper invalidations for that (I think). I do think we should apply a version of the warnings you have (with a WARNING instead of PANIC obviously). I think it's bordering on insanity that we have so many paths to just silently fix stuff up around vacuum. It's like we want things to be undebuggable, and to give users no warnings about something being up. > It just looked that way to the failing assertion in > vacuumlazy.c, because its own version of the original relfrozenxid was > stale from the beginning? And so the worst problem is probably just > that we don't use aggressive VACUUM when we really should in rare > cases? Yes, I think that's right. Can you repro the issue with my recipe? FWIW, adding log_min_messages=debug5 and fsync=off made the crash trigger more quickly. Greetings, Andres Freund
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Thu, Mar 31, 2022 at 10:11 AM Andres Freund wrote: > I don't think we should weaken defenses against xids from before relfrozenxid > in vacuum / amcheck / If anything we should strengthen them. > > Isn't it also just plainly required for correctness? We'd not necessarily > trigger a vacuum in time to remove the xid before approaching wraparound if we > put in an xid before relfrozenxid? That happening in prune_xid is obviously > less bad than on actual data, but still. Yeah, you're right. Ambiguity about stuff like this should be avoided on general principle. > ISTM we should just use our own xid. Yes, it might delay cleanup a bit > longer. But unless there's already crud on the page (with prune_xid already > set), the abort of the speculative insertion isn't likely to make the > difference? Speculative insertion abort is pretty rare in the real world, I bet. The speculative insertion precheck is very likely to work almost always with real workloads. -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Wed, Mar 30, 2022 at 9:59 PM Andres Freund wrote: > I'm not sure there's a proper bug on HEAD here. I think at worst it can delay > the horizon increasing a bunch, by falsely not using an aggressive vacuum when > we should have - might even be limited to a single autovacuum cycle. So, to be clear: vac_update_relstats() never actually considered the new relfrozenxid value from its vacuumlazy.c caller to be "in the future"? It just looked that way to the failing assertion in vacuumlazy.c, because its own version of the original relfrozenxid was stale from the beginning? And so the worst problem is probably just that we don't use aggressive VACUUM when we really should in rare cases? -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-03-31 09:58:18 -0700, Peter Geoghegan wrote: > On Thu, Mar 31, 2022 at 9:37 AM Andres Freund wrote: > > The only place that appears to access rd_rel->relfrozenxid outside of DDL is > > heap_abort_speculative(). > > I wonder how necessary that really is. Even if the XID is before > relfrozenxid, does that in itself really make it "in the future"? > Obviously it's often necessary to make the assumption that allowing > wraparound amounts to allowing XIDs "from the future" to exist, which > is dangerous. But why here? Won't pruning by VACUUM eventually correct > the issue anyway? I don't think we should weaken defenses against xids from before relfrozenxid in vacuum / amcheck / If anything we should strengthen them. Isn't it also just plainly required for correctness? We'd not necessarily trigger a vacuum in time to remove the xid before approaching wraparound if we put in an xid before relfrozenxid? That happening in prune_xid is obviously less bad than on actual data, but still. ISTM we should just use our own xid. Yes, it might delay cleanup a bit longer. But unless there's already crud on the page (with prune_xid already set), the abort of the speculative insertion isn't likely to make the difference? Greetings, Andres Freund
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Thu, Mar 31, 2022 at 9:37 AM Andres Freund wrote: > Perhaps we should explicitly mask out parts of relcache entries in the shared > init file that we know to be unreliable. I.e. set relfrozenxid, relminmxid to > Invalid* or such. That has the advantage of being more honest. If you're going to break the abstraction, then it seems best to break it in an obvious way, that leaves no doubts about what you're supposed to be relying on. This bug doesn't seem like the kind of thing that should be left as-is. If only because it makes it hard to add something like a WARNING when we make relfrozenxid go backwards (on the basis of the existing value apparently being in the future), which we really should have been doing all along. The whole reason why we overwrite pg_class.relfrozenxid values from the future is to ameliorate the effects of more serious bugs like the pg_upgrade/pg_resetwal one fixed in commit 74cf7d46 not so long ago (mid last year). We had essentially the same pg_upgrade "from the future" bug twice (once for relminmxid in the MultiXact bug era, another more recent version affecting relfrozenxid). > The only place that appears to access rd_rel->relfrozenxid outside of DDL is > heap_abort_speculative(). I wonder how necessary that really is. Even if the XID is before relfrozenxid, does that in itself really make it "in the future"? Obviously it's often necessary to make the assumption that allowing wraparound amounts to allowing XIDs "from the future" to exist, which is dangerous. But why here? Won't pruning by VACUUM eventually correct the issue anyway? -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-03-30 21:59:15 -0700, Andres Freund wrote: > On 2022-03-30 21:29:16 -0700, Peter Geoghegan wrote: > > On Wed, Mar 30, 2022 at 9:20 PM Andres Freund wrote: > > > Perhaps we should just fetch the horizons from the "local" catalog for > > > shared > > > rels? > > > > Not sure what you mean. > > Basically, instead of relying on the relcache, which for shared relations is > vulnerable to seeing "too new" horizons due to the shared relcache init file, > explicitly load relfrozenxid / relminmxid from the catalog / syscache. > > I.e. fetch the relevant pg_class row in heap_vacuum_rel() (using > SearchSysCache[Copy1](RELID)). And use that to set vacrel->relfrozenxid > etc. Whereas right now we only fetch the pg_class row in > vac_update_relstats(), but use the relcache before. Perhaps we should explicitly mask out parts of relcache entries in the shared init file that we know to be unreliable. I.e. set relfrozenxid, relminmxid to Invalid* or such. I even wonder if we should just generally move those out of the fields we have in the relcache, not just for shared rels loaded from the init fork. Presumably by just moving them into the CATALOG_VARLEN ifdef. The only place that appears to access rd_rel->relfrozenxid outside of DDL is heap_abort_speculative(). Greetings, Andres Freund
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-03-30 21:29:16 -0700, Peter Geoghegan wrote: > On Wed, Mar 30, 2022 at 9:20 PM Andres Freund wrote: > > But the debug elog reports that > > > > relfrozenxid updated 714 -> 717 > > relminmxid updated 1 -> 6 > > > > The problem is that the crashing backend reads the relfrozenxid/relminmxid > > from the shared relcache init file written by another backend: > > We should have added logging of relfrozenxid and relminmxid a long time ago. At least at DEBUG1 or such. > > This is basically the inverse of a54e1f15 - we read a *newer* horizon. > > That's > > normally fairly harmless - I think. > > Is this one pretty old? What do you mean with "this one"? The cause for the assert failure? I'm not sure there's a proper bug on HEAD here. I think at worst it can delay the horizon increasing a bunch, by falsely not using an aggressive vacuum when we should have - might even be limited to a single autovacuum cycle. > > Perhaps we should just fetch the horizons from the "local" catalog for > > shared > > rels? > > Not sure what you mean. Basically, instead of relying on the relcache, which for shared relations is vulnerable to seeing "too new" horizons due to the shared relcache init file, explicitly load relfrozenxid / relminmxid from the catalog / syscache. I.e. fetch the relevant pg_class row in heap_vacuum_rel() (using SearchSysCache[Copy1](RELID)). And use that to set vacrel->relfrozenxid etc. Whereas right now we only fetch the pg_class row in vac_update_relstats(), but use the relcache before. Greetings, Andres Freund
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Wed, Mar 30, 2022 at 9:29 PM Peter Geoghegan wrote: > > Perhaps we should just fetch the horizons from the "local" catalog for > > shared > > rels? > > Not sure what you mean. Wait, you mean use vacrel->relfrozenxid directly? Seems kind of ugly... -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Wed, Mar 30, 2022 at 9:20 PM Andres Freund wrote: > But the debug elog reports that > > relfrozenxid updated 714 -> 717 > relminmxid updated 1 -> 6 > > The problem is that the crashing backend reads the relfrozenxid/relminmxid > from the shared relcache init file written by another backend: We should have added logging of relfrozenxid and relminmxid a long time ago. > This is basically the inverse of a54e1f15 - we read a *newer* horizon. That's > normally fairly harmless - I think. Is this one pretty old? > Perhaps we should just fetch the horizons from the "local" catalog for shared > rels? Not sure what you mean. -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-03-30 21:11:48 -0700, Peter Geoghegan wrote: > On Wed, Mar 30, 2022 at 9:04 PM Andres Freund wrote: > > (gdb) p vacrel->NewRelfrozenXid > > $3 = 717 > > (gdb) p vacrel->relfrozenxid > > $4 = 717 > > (gdb) p OldestXmin > > $5 = 5112 > > (gdb) p aggressive > > $6 = false > > Does this OldestXmin seem reasonable at this point in execution, based > on context? Does it look too high? Something else? Reasonable: (gdb) p *ShmemVariableCache $1 = {nextOid = 78969, oidCount = 2951, nextXid = {value = 21411}, oldestXid = 714, xidVacLimit = 20714, xidWarnLimit = 2107484361, xidStopLimit = 2144484361, xidWrapLimit = 2147484361, oldestXidDB = 1, oldestCommitTsXid = 0, newestCommitTsXid = 0, latestCompletedXid = {value = 21408}, xactCompletionCount = 1635, oldestClogXid = 714} I think the explanation I just sent explains the problem, without "in-memory" confusion about what's running and what's not. Greetings, Andres Freund
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-03-30 21:04:07 -0700, Andres Freund wrote: > On 2022-03-30 20:35:25 -0700, Peter Geoghegan wrote: > > On Wed, Mar 30, 2022 at 8:28 PM Andres Freund wrote: > > > I triggered twice now, but it took a while longer the second time. > > > > Great. > > > > I wonder if you can get an RR recording... > > Started it, but looks like it's too slow. > > (gdb) p vacrel->NewRelfrozenXid > $3 = 717 > (gdb) p vacrel->relfrozenxid > $4 = 717 > (gdb) p OldestXmin > $5 = 5112 > (gdb) p aggressive > $6 = false I added a bunch of debug elogs to see what sets *frozenxid_updated to true.

(gdb) p *vacrel
$1 = {rel = 0x7fe24f3e0148, indrels = 0x7fe255c17ef8, nindexes = 2, aggressive = false, skipwithvm = true, failsafe_active = false, consider_bypass_optimization = true, do_index_vacuuming = true, do_index_cleanup = true, do_rel_truncate = true, bstrategy = 0x7fe255bb0e28, pvs = 0x0, relfrozenxid = 717, relminmxid = 6, old_live_tuples = 42, OldestXmin = 20751, vistest = 0x7fe255058970 , FreezeLimit = 4244988047, MultiXactCutoff = 4289967302, NewRelfrozenXid = 717, NewRelminMxid = 6, skippedallvis = false, relnamespace = 0x7fe255c17bf8 "pg_catalog", relname = 0x7fe255c17cb8 "pg_database", indname = 0x0, blkno = 4294967295, offnum = 0, phase = VACUUM_ERRCB_PHASE_SCAN_HEAP, verbose = false, dead_items = 0x7fe255c131d0, rel_pages = 8, scanned_pages = 8, removed_pages = 0, lpdead_item_pages = 0, missed_dead_pages = 0, nonempty_pages = 8, new_rel_tuples = 124, new_live_tuples = 42, indstats = 0x7fe255c18320, num_index_scans = 0, tuples_deleted = 0, lpdead_items = 0, live_tuples = 42, recently_dead_tuples = 82, missed_dead_tuples = 0}

But the debug elog reports that

relfrozenxid updated 714 -> 717
relminmxid updated 1 -> 6

The problem is that the crashing backend reads the relfrozenxid/relminmxid from the shared relcache init file written by another backend:

2022-03-30 21:10:47.626 PDT [2625038][autovacuum worker][6/433:0][] LOG: automatic vacuum
of table "contrib_regression_postgres_fdw.pg_catalog.pg_database": index scans: 1
pages: 0 removed, 8 remain, 8 scanned (100.00% of total)
tuples: 4 removed, 114 remain, 72 are dead but not yet removable
removable cutoff: 20751, older by 596 xids when operation ended
new relfrozenxid: 717, which is 3 xids ahead of previous value
new relminmxid: 6, which is 5 mxids ahead of previous value
index scan needed: 3 pages from table (37.50% of total) had 8 dead item identifiers removed
index "pg_database_datname_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_database_oid_index": pages: 6 in total, 0 newly deleted, 2 currently deleted, 2 reusable
I/O timings: read: 0.050 ms, write: 0.102 ms
avg read rate: 209.860 MB/s, avg write rate: 76.313 MB/s
buffer usage: 42 hits, 22 misses, 8 dirtied
WAL usage: 13 records, 5 full page images, 33950 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
...
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][:0][] DEBUG: InitPostgres
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/0:0][] DEBUG: my backend ID is 6
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/0:0][] LOG: reading shared init file
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/443:0][] DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/sub>
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/443:0][] LOG: reading non-shared init file

This is basically the inverse of a54e1f15 - we read a *newer* horizon. That's normally fairly harmless - I think. Perhaps we should just fetch the horizons from the "local" catalog for shared rels? Greetings, Andres Freund
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Wed, Mar 30, 2022 at 9:04 PM Andres Freund wrote: > (gdb) p vacrel->NewRelfrozenXid > $3 = 717 > (gdb) p vacrel->relfrozenxid > $4 = 717 > (gdb) p OldestXmin > $5 = 5112 > (gdb) p aggressive > $6 = false Does this OldestXmin seem reasonable at this point in execution, based on context? Does it look too high? Something else? -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-03-30 20:35:25 -0700, Peter Geoghegan wrote: > On Wed, Mar 30, 2022 at 8:28 PM Andres Freund wrote: > > I triggered twice now, but it took a while longer the second time. > > Great. > > I wonder if you can get an RR recording... Started it, but looks like it's too slow.

(gdb) p MyProcPid
$1 = 2172500
(gdb) p vacrel->NewRelfrozenXid
$3 = 717
(gdb) p vacrel->relfrozenxid
$4 = 717
(gdb) p OldestXmin
$5 = 5112
(gdb) p aggressive
$6 = false

There was another autovacuum of pg_database 10s before:

2022-03-30 20:35:17.622 PDT [2165344][autovacuum worker][5/3:0][] LOG: automatic vacuum of table "postgres.pg_catalog.pg_database": index scans: 1
pages: 0 removed, 3 remain, 3 scanned (100.00% of total)
tuples: 61 removed, 4 remain, 1 are dead but not yet removable
removable cutoff: 1921, older by 3 xids when operation ended
new relfrozenxid: 717, which is 3 xids ahead of previous value
index scan needed: 3 pages from table (100.00% of total) had 599 dead item identifiers removed
index "pg_database_datname_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_database_oid_index": pages: 4 in total, 0 newly deleted, 0 currently deleted, 0 reusable
I/O timings: read: 0.029 ms, write: 0.034 ms
avg read rate: 134.120 MB/s, avg write rate: 89.413 MB/s
buffer usage: 35 hits, 12 misses, 8 dirtied
WAL usage: 12 records, 5 full page images, 27218 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

The dying backend:

2022-03-30 20:35:27.668 PDT [2172500][autovacuum worker][7/0:0][] DEBUG: autovacuum: processing database "contrib_regression_hstore"
...
2022-03-30 20:35:27.690 PDT [2172500][autovacuum worker][7/674:0][] CONTEXT: while cleaning up index "pg_database_oid_index" of relation "pg_catalog.pg_database"

Greetings, Andres Freund
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi,

On 2022-03-30 20:28:44 -0700, Andres Freund wrote:
> I was able to trigger the crash.
>
> cat ~/tmp/pgbench-createdb.sql
> CREATE DATABASE pgb_:client_id;
> DROP DATABASE pgb_:client_id;
>
> pgbench -n -P1 -c 10 -j10 -T100 -f ~/tmp/pgbench-createdb.sql
>
> while I was also running
>
> for i in $(seq 1 100); do echo iteration $i; make -Otarget -C contrib/ -s installcheck -j48 -s prove_installcheck=true USE_MODULE_DB=1 > /tmp/ci-$i.log 2>&1; done
>
> I triggered twice now, but it took a while longer the second time.

Forgot to say how postgres was started. Via my usual devenv script, which results in:

+ /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres -c hba_file=/home/andres/tmp/pgdev/pg_hba.conf -D /srv/dev/pgdev-dev/ -p 5440 -c shared_buffers=2GB -c wal_level=hot_standby -c max_wal_senders=10 -c track_io_timing=on -c restart_after_crash=false -c max_prepared_transactions=20 -c log_checkpoints=on -c min_wal_size=48MB -c max_wal_size=150GB -c 'cluster_name=dev assert' -c ssl_cert_file=/home/andres/tmp/pgdev/ssl-cert-snakeoil.pem -c ssl_key_file=/home/andres/tmp/pgdev/ssl-cert-snakeoil.key -c 'log_line_prefix=%m [%p][%b][%v:%x][%a] ' -c shared_buffers=16MB -c log_min_messages=debug1 -c log_connections=on -c allow_in_place_tablespaces=1 -c log_autovacuum_min_duration=0 -c log_lock_waits=true -c autovacuum_naptime=10s -c fsync=off

Greetings,

Andres Freund
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Wed, Mar 30, 2022 at 8:28 PM Andres Freund wrote: > I triggered twice now, but it took a while longer the second time. Great. I wonder if you can get an RR recording... -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi,

I was able to trigger the crash.

cat ~/tmp/pgbench-createdb.sql
CREATE DATABASE pgb_:client_id;
DROP DATABASE pgb_:client_id;

pgbench -n -P1 -c 10 -j10 -T100 -f ~/tmp/pgbench-createdb.sql

while I was also running

for i in $(seq 1 100); do echo iteration $i; make -Otarget -C contrib/ -s installcheck -j48 -s prove_installcheck=true USE_MODULE_DB=1 > /tmp/ci-$i.log 2>&1; done

I triggered twice now, but it took a while longer the second time.

(gdb) bt full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
        set = {__val = {4194304, 0, 0, 0, 0, 0, 216172782113783808, 2, 2377909399344644096, 18446497967838863616, 0, 0, 0, 0, 0, 0}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x7fe49a2db546 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}}, sa_flags = 0, sa_restorer = 0x107e0}
        sigs = {__val = {32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}}
#2  0x7fe49b9706f1 in ExceptionalCondition (conditionName=0x7fe49ba0618d "diff > 0",
        errorType=0x7fe49ba05bd1 "FailedAssertion",
        fileName=0x7fe49ba05b90 "/home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c",
        lineNumber=724) at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69
        No locals.
#3 0x7fe49b2fc739 in heap_vacuum_rel (rel=0x7fe497a8d148, params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10) at /home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:724 buf = { data = 0x7fe49c17e238 "automatic vacuum of table \"contrib_regression_dict_int.pg_catalog.pg_database\": index scans: 1\npages: 0 removed, 3 remain, 3 scanned (100.00% of total)\ntuples: 49 removed, 53 remain, 9 are dead but no"..., len = 279, maxlen = 1024, cursor = 0} msgfmt = 0x7fe49ba06038 "automatic vacuum of table \"%s.%s.%s\": index scans: %d\n" diff = 0 endtime = 702011687982080 vacrel = 0x7fe49c19b5b8 verbose = false instrument = true ru0 = {tv = {tv_sec = 1648696487, tv_usec = 975963}, ru = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 3086}, { --Type for more, q to quit, c to continue without paging--c ru_maxrss = 10824, __ru_maxrss_word = 10824}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 449, __ru_minflt_word = 449}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word = 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word = 0}, {ru_nvcsw = 2, __ru_nvcsw_word = 2}, {ru_nivcsw = 0, __ru_nivcsw_word = 0}}} starttime = 702011687975964 walusage_start = {wal_records = 0, wal_fpi = 0, wal_bytes = 0} walusage = {wal_records = 11, wal_fpi = 7, wal_bytes = 30847} secs = 0 usecs = 6116 read_rate = 16.606033355134073 write_rate = 7.6643230869849575 aggressive = false skipwithvm = true frozenxid_updated = true minmulti_updated = true orig_rel_pages = 3 new_rel_pages = 3 new_rel_allvisible = 0 indnames = 0x7fe49c19bb28 errcallback = {previous = 0x0, callback = 0x7fe49b3012fd , arg = 0x7fe49c19b5b8} startreadtime = 180 startwritetime = 0 OldestXmin = 67552 FreezeLimit = 4245034848 OldestMxact = 224 MultiXactCutoff 
= 4289967520 __func__ = "heap_vacuum_rel" #4 0x7fe49b523d92 in table_relation_vacuum (rel=0x7fe497a8d148, params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10) at /home/andres/src/postgresql/src/include/access/tableam.h:1680 No locals. #5 0x7fe49b527032 in vacuum_rel (relid=1262, relation=0x7fe49c1ae360, params=0x7fe49c130d7c) at /home/andres/src/postgresql/src/backend/commands/vacuum.c:2065 lmode = 4 rel = 0x7fe497a8d148 lockrelid = {relId = 1262, dbId = 0} toast_relid = 0 save_userid = 10 save_sec_context = 0 save_nestlevel = 2 __func__ = "vacuum_rel" #6 0x7fe49b524c3b in vacuum (relations=0x7fe49c1b03a8, params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10, isTopLevel=true) at /home/andres/src/postgresql/src/backend/commands/vacuum.c:482 vrel = 0x7fe49c1ae3b8 cur__state = {l = 0x7fe49c1b03a8, i = 0} cur = 0x7fe49c1b03c0 _save_exception_stack = 0x7fff97e35a10 _save_context_stack = 0x0 _local_sigjmp_buf = {{__jmpbuf = {140735741652128, 6126579318940970843, 9223372036854775747, 0, 0, 0, 6126579318957748059, 6139499258682879835}, __mask_was_saved = 0, __saved_mask = {__val = {32, 140619848279000, 8590910454, 140619848278592, 32, 140619848278944, 7784,
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Wed, Mar 30, 2022 at 7:37 PM Peter Geoghegan wrote:
> Yeah, a WARNING would be good here. I can write a new version of my
> patch series with a separate patch for that this evening. Actually,
> better make it a PANIC for now...

Attached is v14, which includes a new patch (0003) that PANICs like that in vac_update_relstats(). Unlike the failing assertion, which is in instrumentation code (though VACUUM VERBOSE might hit it), this approach also covers manual VACUUMs. I definitely think that something like this should be committed. Silently ignoring system catalog corruption isn't okay.

--
Peter Geoghegan

v14-0003-PANIC-on-relfrozenxid-from-the-future.patch
Description: Binary data

v14-0001-Set-relfrozenxid-to-oldest-extant-XID-seen-by-VA.patch
Description: Binary data

v14-0002-Generalize-how-VACUUM-skips-all-frozen-pages.patch
Description: Binary data

v14-0004-vacuumlazy.c-Move-resource-allocation-to-heap_va.patch
Description: Binary data
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Wed, Mar 30, 2022 at 7:00 PM Andres Freund wrote: > Something vaguely like EXPECT_EQ_U32 in regress.c. Maybe > AssertCmp(type, a, op, b), > > Then the assertion could have been something like >AssertCmp(int32, diff, >, 0) I'd definitely use them if they were there. > Does the line number in the failed run actually correspond to the xid, rather > than the mxid case? I didn't check. Yes, I verified -- definitely relfrozenxid. > You can do something similar locally on linux with > make -Otarget -C contrib/ -j48 -s USE_MODULE_DB=1 installcheck > prove_installcheck=true > (the prove_installcheck=true to prevent tap tests from running, we don't seem > to have another way for that) > > I don't think windows uses USE_MODULE_DB=1, but it allows to cause a lot more > load concurrently than running tests serially... Can't get it to fail locally with that recipe. > > Assert(vacrel->NewRelfrozenXid == OldestXmin || > >TransactionIdPrecedesOrEquals(vacrel->relfrozenxid, > > vacrel->NewRelfrozenXid)); > > The comment in your patch says "is either older or newer than FreezeLimit" - I > assume that's some rephrasing damage? Both the comment and the assertion are correct. I see what you mean, though. > Perhaps it's worth commiting improved assertions on master? If this is indeed > a pre-existing bug, and we're just missing due to slightly less stringent > asserts, we could rectify that separately. I don't think there's much chance of the assertion actually hitting without the rest of the patch series. The new relfrozenxid value is always going to be OldestXmin - vacuum_min_freeze_age on HEAD, while with the patch it's sometimes close to OldestXmin. Especially when you have lots of dead tuples that you churn through constantly (like pgbench_tellers, or like these system catalogs on the CI test machine). > Hm. This triggers some vague memories. There's some oddities around shared > relations being vacuumed separately in all the databases and thus having > separate horizons. 
That's what I was thinking of, obviously.

> After "remembering" that, I looked in the cirrus log for the failed run, and
> the worker was processing a shared relation last:
>
> 2022-03-30 03:48:30.238 GMT [5984][autovacuum worker] LOG: automatic analyze
> of table "contrib_regression.pg_catalog.pg_authid"

I noticed the same thing myself. Should have said sooner.

> Perhaps this ought to be an elog() instead of an Assert()? Something has gone
> pear shaped if we get here... It's a bit annoying though, because it'd have to
> be a PANIC to be visible on the bf / CI :(.

Yeah, a WARNING would be good here. I can write a new version of my patch series with a separate patch for that this evening. Actually, better make it a PANIC for now...

--
Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi,

On 2022-03-30 17:50:42 -0700, Peter Geoghegan wrote:
> I tried several times to recreate this issue on CI. No luck with that,
> though -- can't get it to fail again after 4 attempts.

It's really annoying that we don't have Assert variants that show the compared values; that might make it easier to interpret what's going on. Something vaguely like EXPECT_EQ_U32 in regress.c. Maybe AssertCmp(type, a, op, b)? Then the assertion could have been something like AssertCmp(int32, diff, >, 0)

Does the line number in the failed run actually correspond to the xid, rather than the mxid case? I didn't check.

You could try to increase the likelihood of reproducing the failure by duplicating the invocation that led to the crash a few times in the .cirrus.yml file in your dev branch. That might allow hitting the problem more quickly. Maybe reduce autovacuum_naptime in src/tools/ci/pg_ci_base.conf?

Or locally - one thing that windows CI does differently from the other platforms is that it runs isolation, contrib and a bunch of other tests using the same cluster. Which of course increases the likelihood of autovacuum having stuff to do, *particularly* on shared relations - normally there's probably not enough changes for that.

You can do something similar locally on linux with

make -Otarget -C contrib/ -j48 -s USE_MODULE_DB=1 installcheck prove_installcheck=true

(the prove_installcheck=true to prevent tap tests from running, we don't seem to have another way for that)

I don't think windows uses USE_MODULE_DB=1, but it allows to cause a lot more load concurrently than running tests serially...

> We know that this particular assertion did not fail during the same VACUUM:
>
> Assert(vacrel->NewRelfrozenXid == OldestXmin ||
>        TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
>                                      vacrel->NewRelfrozenXid));

The comment in your patch says "is either older or newer than FreezeLimit" - I assume that's some rephrasing damage?
> So it's hard to see how this could be a bug in the patch -- the final
> new relfrozenxid is presumably equal to VACUUM's OldestXmin in the
> problem scenario seen on the CI Windows instance yesterday (that's why
> this earlier assertion didn't fail).

Perhaps it's worth committing improved assertions on master? If this is indeed a pre-existing bug, and we're just missing it due to slightly less stringent asserts, we could rectify that separately.

> The surprising part of the CI failure must have taken place just after
> this assertion, when VACUUM's call to vacuum_set_xid_limits() actually
> updates pg_class.relfrozenxid with vacrel->NewRelfrozenXid --
> presumably because the existing relfrozenxid appeared to be "in the
> future" when we examine it in pg_class again. We see evidence that
> this must have happened afterwards, when the closely related assertion
> (used only in instrumentation code) fails:

Hm. This triggers some vague memories. There's some oddities around shared relations being vacuumed separately in all the databases and thus having separate horizons.

After "remembering" that, I looked in the cirrus log for the failed run, and the worker was processing a shared relation last:

2022-03-30 03:48:30.238 GMT [5984][autovacuum worker] LOG: automatic analyze of table "contrib_regression.pg_catalog.pg_authid"

Obviously that's not a guarantee that the next table processed also is a shared catalog, but ... Oh, the relid is actually in the stack trace. 0x4ee = 1262 = pg_database. Which makes sense: the test ends up with a high percentage of dead rows in pg_database, due to all the different contrib tests creating/dropping a database.
> From my patch:
>
> >  if (frozenxid_updated)
> >  {
> > -     diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
> > +     diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
> > +     Assert(diff > 0);
> >      appendStringInfo(&buf,
> >                       _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
> > -                      FreezeLimit, diff);
> > +                      vacrel->NewRelfrozenXid, diff);
> >  }

Perhaps this ought to be an elog() instead of an Assert()? Something has gone pear shaped if we get here... It's a bit annoying though, because it'd have to be a PANIC to be visible on the bf / CI :(.

Greetings,

Andres Freund
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Wed, Mar 30, 2022 at 12:01 AM Peter Geoghegan wrote: > Perhaps something is amiss inside vac_update_relstats(), where the > boolean flag that indicates that pg_class.relfrozenxid was advanced is > set: > > if (frozenxid_updated) > *frozenxid_updated = false; > if (TransactionIdIsNormal(frozenxid) && > pgcform->relfrozenxid != frozenxid && > (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) || > TransactionIdPrecedes(ReadNextTransactionId(), >pgcform->relfrozenxid))) > { > if (frozenxid_updated) > *frozenxid_updated = true; > pgcform->relfrozenxid = frozenxid; > dirty = true; > } > > Maybe the "existing relfrozenxid is in the future, silently update > relfrozenxid" part of the condition (which involves > ReadNextTransactionId()) somehow does the wrong thing here. But how? I tried several times to recreate this issue on CI. No luck with that, though -- can't get it to fail again after 4 attempts. This was a VACUUM of pg_database, run from an autovacuum worker. I am vaguely reminded of the two bugs fixed by Andres in commit a54e1f15. Both were issues with the shared relcache init file affecting shared and nailed catalog relations. Those bugs had symptoms like " ERROR: found xmin ... from before relfrozenxid ..." for various system catalogs. We know that this particular assertion did not fail during the same VACUUM: Assert(vacrel->NewRelfrozenXid == OldestXmin || TransactionIdPrecedesOrEquals(vacrel->relfrozenxid, vacrel->NewRelfrozenXid)); So it's hard to see how this could be a bug in the patch -- the final new relfrozenxid is presumably equal to VACUUM's OldestXmin in the problem scenario seen on the CI Windows instance yesterday (that's why this earlier assertion didn't fail). The assertion I'm showing here needs the "vacrel->NewRelfrozenXid == OldestXmin" part of the condition to account for the fact that OldestXmin/GetOldestNonRemovableTransactionId() is known to "go backwards". Without that the regression tests will fail quite easily. 
The surprising part of the CI failure must have taken place just after this assertion, when VACUUM's call to vacuum_set_xid_limits() actually updates pg_class.relfrozenxid with vacrel->NewRelfrozenXid -- presumably because the existing relfrozenxid appeared to be "in the future" when we examine it in pg_class again. We see evidence that this must have happened afterwards, when the closely related assertion (used only in instrumentation code) fails.

From my patch:

>  if (frozenxid_updated)
>  {
> -     diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
> +     diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
> +     Assert(diff > 0);
>      appendStringInfo(&buf,
>                       _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
> -                      FreezeLimit, diff);
> +                      vacrel->NewRelfrozenXid, diff);
>  }

Does anybody have any ideas about what might be going on here?

--
Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Tue, Mar 29, 2022 at 11:10 PM Justin Pryzby wrote:
> > +     diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
> > +     Assert(diff > 0);
>
> Did you see that this crashed on windows cfbot?
>
> https://api.cirrus-ci.com/v1/artifact/task/4592929254670336/log/tmp_check/postmaster.log
> TRAP: FailedAssertion("diff > 0", File: "c:\cirrus\src\backend\access\heap\vacuumlazy.c", Line: 724, PID: 5984)

That's weird. There are very similar assertions a little earlier, that must have *not* failed here, before the call to vac_update_relstats(). I was actually thinking of removing this assertion for that reason -- I thought that it was redundant.

Perhaps something is amiss inside vac_update_relstats(), where the boolean flag that indicates that pg_class.relfrozenxid was advanced is set:

    if (frozenxid_updated)
        *frozenxid_updated = false;
    if (TransactionIdIsNormal(frozenxid) &&
        pgcform->relfrozenxid != frozenxid &&
        (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
         TransactionIdPrecedes(ReadNextTransactionId(),
                               pgcform->relfrozenxid)))
    {
        if (frozenxid_updated)
            *frozenxid_updated = true;
        pgcform->relfrozenxid = frozenxid;
        dirty = true;
    }

Maybe the "existing relfrozenxid is in the future, silently update relfrozenxid" part of the condition (which involves ReadNextTransactionId()) somehow does the wrong thing here. But how?

The other assertions take into account the fact that OldestXmin can itself "go backwards" across VACUUM operations against the same table:

    Assert(!aggressive || vacrel->NewRelfrozenXid == OldestXmin ||
           TransactionIdPrecedesOrEquals(FreezeLimit,
                                         vacrel->NewRelfrozenXid));

Note the "vacrel->NewRelfrozenXid == OldestXmin", without which the assertion will fail pretty easily when the regression tests are run. Perhaps I need to do something like that with the other assertion as well (or more likely just get rid of it). Will figure it out tomorrow.

--
Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
+ diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
+ Assert(diff > 0);

Did you see that this crashed on windows cfbot?

https://api.cirrus-ci.com/v1/artifact/task/4592929254670336/log/tmp_check/postmaster.log

TRAP: FailedAssertion("diff > 0", File: "c:\cirrus\src\backend\access\heap\vacuumlazy.c", Line: 724, PID: 5984)
abort() has been called
2022-03-30 03:48:30.267 GMT [5316][client backend] [pg_regress/tablefunc][3/15389:0] ERROR: infinite recursion detected
2022-03-30 03:48:38.031 GMT [5592][postmaster] LOG: server process (PID 5984) was terminated by exception 0xC354
2022-03-30 03:48:38.031 GMT [5592][postmaster] DETAIL: Failed process was running: autovacuum: VACUUM ANALYZE pg_catalog.pg_database
2022-03-30 03:48:38.031 GMT [5592][postmaster] HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.

https://cirrus-ci.com/task/4592929254670336

`007ff130 0001`400b4ef8 postgres!ExceptionalCondition(
        char * conditionName = 0x0001`40a915d8 "diff > 0",
        char * errorType = 0x0001`40a915c8 "FailedAssertion",
        char * fileName = 0x0001`40a91598 "c:\cirrus\src\backend\access\heap\vacuumlazy.c",
        int lineNumber = 0n724)+0x8d [c:\cirrus\src\backend\utils\error\assert.c @ 70]
`007ff170 0001`402a0914 postgres!heap_vacuum_rel(
        struct RelationData * rel = 0x`00a51088,
        struct VacuumParams * params = 0x`00a8420c,
        struct BufferAccessStrategyData * bstrategy = 0x`00a842a0)+0x1038 [c:\cirrus\src\backend\access\heap\vacuumlazy.c @ 724]
`007ff350 0001`402a4686 postgres!table_relation_vacuum(
        struct RelationData * rel = 0x`00a51088,
        struct VacuumParams * params = 0x`00a8420c,
        struct BufferAccessStrategyData * bstrategy = 0x`00a842a0)+0x34 [c:\cirrus\src\include\access\tableam.h @ 1681]
`007ff380 0001`402a1a2d postgres!vacuum_rel(
        unsigned int relid = 0x4ee,
        struct RangeVar * relation = 0x`01799ae0,
        struct VacuumParams * params = 0x`00a8420c)+0x5a6 [c:\cirrus\src\backend\commands\vacuum.c @ 2068]
`007ff400 0001`4050f1ef postgres!vacuum(
        struct List * relations = 0x`0179df58,
        struct VacuumParams * params = 0x`00a8420c,
        struct BufferAccessStrategyData * bstrategy = 0x`00a842a0,
        bool isTopLevel = true)+0x69d [c:\cirrus\src\backend\commands\vacuum.c @ 482]
`007ff5f0 0001`4050dc95 postgres!autovacuum_do_vac_analyze(
        struct autovac_table * tab = 0x`00a84208,
        struct BufferAccessStrategyData * bstrategy = 0x`00a842a0)+0x8f [c:\cirrus\src\backend\postmaster\autovacuum.c @ 3248]
`007ff640 0001`4050b4e3 postgres!do_autovacuum(void)+0xef5 [c:\cirrus\src\backend\postmaster\autovacuum.c @ 2503]

It seems like there should be even more logs, especially since it says:

[03:48:43.119] Uploading 3 artifacts for c:\cirrus\**\*.diffs
[03:48:43.122] Uploaded c:\cirrus\contrib\tsm_system_rows\regression.diffs
[03:48:43.125] Uploaded c:\cirrus\contrib\tsm_system_time\regression.diffs
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Tue, Mar 29, 2022 at 11:58 AM Peter Geoghegan wrote: > > I think I understand what the first paragraph of the header comment > > for heap_tuple_needs_freeze() is trying to say, but the second one is > > quite confusing. I think this is again because it veers into talking > > about what the caller should do rather than explaining what the > > function itself does. > > I wouldn't have done it that way if the function wasn't called > heap_tuple_needs_freeze(). > > I would be okay with removing this paragraph if the function was > renamed to reflect the fact it now tells the caller something about > the tuple having an old XID/MXID relative to the caller's own XID/MXID > cutoffs. Maybe the function name should be heap_tuple_would_freeze(), > making it clear that the function merely tells caller what > heap_prepare_freeze_tuple() *would* do, without presuming to tell the > vacuumlazy.c caller what it *should* do about any of the information > it is provided. Attached is v13, which does it that way. This does seem like a real increase in clarity, albeit one that comes at the cost of renaming heap_tuple_needs_freeze(). v13 also addresses all of the other items from Robert's most recent round of feedback. I would like to commit something close to v13 on Friday or Saturday. Thanks -- Peter Geoghegan v13-0003-vacuumlazy.c-Move-resource-allocation-to-heap_va.patch Description: Binary data v13-0002-Generalize-how-VACUUM-skips-all-frozen-pages.patch Description: Binary data v13-0001-Set-relfrozenxid-to-oldest-extant-XID-seen-by-VA.patch Description: Binary data
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Tue, Mar 29, 2022 at 10:03 AM Robert Haas wrote: > I can understand this version of the commit message. Woohoo! I like > understanding things. That's good news. > I think the header comments for FreezeMultiXactId() focus way too much > on what the caller is supposed to do and not nearly enough on what > FreezeMultiXactId() itself does. I think to some extent this also > applies to the comments within the function body. To some extent this is a legitimate difference in style. I myself don't think that it's intrinsically good to have these sorts of comments. I just think that it can be the least worst thing when a function is intrinsically written with one caller and one very specific set of requirements in mind. That is pretty much a matter of taste, though. > I think I understand what the first paragraph of the header comment > for heap_tuple_needs_freeze() is trying to say, but the second one is > quite confusing. I think this is again because it veers into talking > about what the caller should do rather than explaining what the > function itself does. I wouldn't have done it that way if the function wasn't called heap_tuple_needs_freeze(). I would be okay with removing this paragraph if the function was renamed to reflect the fact it now tells the caller something about the tuple having an old XID/MXID relative to the caller's own XID/MXID cutoffs. Maybe the function name should be heap_tuple_would_freeze(), making it clear that the function merely tells caller what heap_prepare_freeze_tuple() *would* do, without presuming to tell the vacuumlazy.c caller what it *should* do about any of the information it is provided. Then it becomes natural to see the boolean return value and the changes the function makes to caller's relfrozenxid/relminmxid tracker variables as independent. > I don't like the statement-free else block in lazy_scan_noprune(). I > think you could delete the else{} and just put that same comment there > with one less level of indentation. 
There's a clear "return false" > just above so it shouldn't be confusing what's happening. Okay, will fix. > The comment hunk at the end of lazy_scan_noprune() would probably be > better if it said something more specific than "caller can tolerate > reduced processing." My guess is that it would be something like > "caller does not need to do something or other." I meant "caller can tolerate not pruning or freezing this particular page". Will fix. > I have my doubts about whether the overwrite-a-future-relfrozenxid > behavior is any good, but that's a topic for another day. I suggest > keeping the words "it seems best to", though, because they convey a > level of tentativeness, which seems appropriate. I agree that it's best to keep a tentative tone here. That code was written following a very specific bug in pg_upgrade several years back. There was a very recent bug fixed only last year, by commit 74cf7d46. FWIW I tend to think that we'd have a much better chance of catching that sort of thing if we'd had better relfrozenxid instrumentation before now. Now you'd see a negative value in the "new relfrozenxid: %u, which is %d xids ahead of previous value" part of the autovacuum log message in the event of such a bug. That's weird enough that I bet somebody would notice and report it. > I am surprised to see you write in maintenance.sgml that the VACUUM > which most recently advanced relfrozenxid will typically be the most > recent aggressive VACUUM. I would have expected something like "(often > the most recent VACUUM)". That's always been true, and will only be slightly less true in Postgres 15 -- the fact is that we only need to skip one all-visible page to lose out, and that's not unlikely with tables that aren't quite small with all the patches from v12 applied (we're still much too naive). 
The work that I'll get into Postgres 15 on VACUUM is very valuable as a basis for future improvements, but not all that valuable to users (improved instrumentation might be the biggest benefit in 15, or maybe relminmxid advancement for certain types of applications). I still think that we need to do more proactive page-level freezing to make relfrozenxid advancement happen in almost every VACUUM, but even that won't quite be enough. There are still cases where we need to make a choice about giving up on relfrozenxid advancement in a non-aggressive VACUUM -- all-visible pages won't completely go away with page-level freezing. At a minimum we'll still have edge cases like the case where heap_lock_tuple() unsets the all-frozen bit. And pg_upgrade'd databases, too. 0002 structures the logic for skipping using the VM in a way that will make the choice to skip or not skip all-visible pages in non-aggressive VACUUMs quite natural. I suspect that SKIP_PAGES_THRESHOLD was always mostly just about relfrozenxid advancement in non-aggressive VACUUM, all along. We can do much better than SKIP_PAGES_THRESHOLD, especially if we preprocess the entire
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sun, Mar 27, 2022 at 11:24 PM Peter Geoghegan wrote: > Attached is v12. My current goal is to commit all 3 patches before > feature freeze. Note that this does not include the more complicated > patch including with previous revisions of the patch series (the > page-level freezing work that appeared in versions before v11). Reviewing 0001, focusing on the words in the patch file much more than the code: I can understand this version of the commit message. Woohoo! I like understanding things. I think the header comments for FreezeMultiXactId() focus way too much on what the caller is supposed to do and not nearly enough on what FreezeMultiXactId() itself does. I think to some extent this also applies to the comments within the function body. On the other hand, the header comments for heap_prepare_freeze_tuple() seem good to me. If I were thinking of calling this function, I would know how to use the new arguments. If I were looking for bugs in it, I could compare the logic in the function to what these comments say it should be doing. Yay. I think I understand what the first paragraph of the header comment for heap_tuple_needs_freeze() is trying to say, but the second one is quite confusing. I think this is again because it veers into talking about what the caller should do rather than explaining what the function itself does. I don't like the statement-free else block in lazy_scan_noprune(). I think you could delete the else{} and just put that same comment there with one less level of indentation. There's a clear "return false" just above so it shouldn't be confusing what's happening. The comment hunk at the end of lazy_scan_noprune() would probably be better if it said something more specific than "caller can tolerate reduced processing." My guess is that it would be something like "caller does not need to do something or other." I have my doubts about whether the overwrite-a-future-relfrozenxid behavior is any good, but that's a topic for another day. 
I suggest keeping the words "it seems best to", though, because they convey a level of tentativeness, which seems appropriate. I am surprised to see you write in maintenance.sgml that the VACUUM which most recently advanced relfrozenxid will typically be the most recent aggressive VACUUM. I would have expected something like "(often the most recent VACUUM)". -- Robert Haas EDB: http://www.enterprisedb.com
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Thu, Mar 24, 2022 at 2:40 PM Peter Geoghegan wrote: > > > This is absolutely mandatory in the aggressive case, because otherwise > > > relfrozenxid advancement might be seen as unsafe. My observation is: > > > Why should we accept the same race in the non-aggressive case? Why not > > > do essentially the same thing in every VACUUM? > > > > Sure, that seems like a good idea. I think I basically agree with the > > goals of the patch. > > Great. Attached is v12. My current goal is to commit all 3 patches before feature freeze. Note that this does not include the more complicated patch including with previous revisions of the patch series (the page-level freezing work that appeared in versions before v11). Changes that appear in this new revision, v12: * Reworking of the commit messages based on feedback from Robert. * General cleanup of the changes to heapam.c from 0001 (the changes to heap_prepare_freeze_tuple and related functions). New and existing code now fits together a bit better. I also added a couple of new documenting assertions, to make the flow a bit easier to understand. * Added new assertions that document OldestXmin/FreezeLimit/relfrozenxid invariants, right at the point we update pg_class within vacuumlazy.c. These assertions would have a decent chance of failing if there were any bugs in the code. * Removed patch that made DISABLE_PAGE_SKIPPING not force aggressive VACUUM, limiting the underlying mechanism to forcing scanning of all pages in lazy_scan_heap (v11 was the first and last revision that included this patch). * Adds a new small patch 0003. This just moves the last piece of resource allocation that still took place at the top of lazy_scan_heap() back into its caller, heap_vacuum_rel(). The work in 0003 probably should have happened as part of the patch that became commit 73f6ec3d -- same idea. It's totally mechanical stuff. 
With 0002 and 0003, there is hardly any lazy_scan_heap code before the main loop that iterates through blocks in rel_pages (and the code that's still there is obviously related to the loop in a direct and obvious way). This seems like a big overall improvement in maintainability. Didn't see a way to split up 0002, per Robert's suggestion 3 days ago. As I said at the time, it's possible to split it up, but not in a way that highlights the underlying issue (since the issue 0002 fixes was always limited to non-aggressive VACUUMs). The commit message may have to suffice. -- Peter Geoghegan v12-0001-Set-relfrozenxid-to-oldest-extant-XID-seen-by-VA.patch Description: Binary data v12-0003-vacuumlazy.c-Move-resource-allocation-to-heap_va.patch Description: Binary data v12-0002-Generalize-how-VACUUM-skips-all-frozen-pages.patch Description: Binary data
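[The OldestXmin/FreezeLimit/relfrozenxid invariants mentioned in the v12 notes above might be sketched roughly as follows. This is an illustrative standalone sketch, not the actual v12 assertion code: it uses plain integer comparison in place of PostgreSQL's wraparound-aware TransactionIdPrecedesOrEquals(), and the function name is made up.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Sketch of the invariants that should hold at the point VACUUM
 * updates pg_class.relfrozenxid:
 *
 *   old relfrozenxid <= new relfrozenxid <= OldestXmin
 *
 * and, for aggressive VACUUMs only:
 *
 *   FreezeLimit <= new relfrozenxid
 *
 * Plain integer comparison stands in for wraparound-aware XID
 * comparison here.
 */
static void
check_new_relfrozenxid(TransactionId old_relfrozenxid,
                       TransactionId new_relfrozenxid,
                       TransactionId oldest_xmin,
                       TransactionId freeze_limit,
                       bool aggressive)
{
    /* relfrozenxid must never go backwards... */
    assert(old_relfrozenxid <= new_relfrozenxid);
    /* ...and can never pass the XID cutoff used by this VACUUM */
    assert(new_relfrozenxid <= oldest_xmin);
    /* aggressive VACUUM guarantees advancement to at least FreezeLimit */
    if (aggressive)
        assert(freeze_limit <= new_relfrozenxid);
}
```

Any combination of values that violates one of these relationships would indicate a bug in the relfrozenxid tracking.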
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Thu, Mar 24, 2022 at 1:21 PM Robert Haas wrote: > > How about: "Set relfrozenxid to oldest extant XID seen by VACUUM" > > Sure, that sounds nice. Cool. > > What you're saying here boils down to this: it doesn't matter what the > > visibility map would say right this microsecond (in the aggressive > > case) were we to call VM_ALL_FROZEN(): we know for sure that the VM > > said that this page was all-frozen *in the recent past*. That's good > > enough; we will never fail to scan a page that might have an XID < > > OldestXmin (ditto for MXIDs) this way, which is all that really > > matters. > > Makes sense. So maybe the commit message should try to emphasize this > point e.g. "If a page is all-frozen at the time we check whether it > can be skipped, don't allow it to affect the relfrozenxid and > relminmxid which we set for the relation. This was previously true for > aggressive vacuums, but not for non-aggressive vacuums, which was > inconsistent. (The reason this is a safe thing to do is that any new > XIDs or MXIDs that appear on the page after we initially observe it to > be frozen must be newer than any relfrozenxid or relminmxid the > current vacuum could possibly consider storing into pg_class.)" Okay, I'll add something more like that. Almost every aspect of relfrozenxid advancement by VACUUM seems simpler when thought about in these terms IMV. Every VACUUM now scans all pages that might have XIDs < OldestXmin, and so every VACUUM can advance relfrozenxid to the oldest extant XID (barring non-aggressive VACUUMs that *choose* to skip some all-visible pages). There are a lot more important details, of course. My "Every VACUUM..." statement works well as an axiom because all of those other details don't create any awkward exceptions. > > This is absolutely mandatory in the aggressive case, because otherwise > > relfrozenxid advancement might be seen as unsafe. My observation is: > > Why should we accept the same race in the non-aggressive case? 
Why not > > do essentially the same thing in every VACUUM? > > Sure, that seems like a good idea. I think I basically agree with the > > goals of the patch. Great. > My concern is just about making the changes > understandable to future readers. This area is notoriously subtle, and > people are going to introduce more bugs even if the comments and code > organization are fantastic. Makes sense. > > And so you could almost say that there is no behavioral change at > > all. > I vigorously object to this part, though. We should always err on the > side of saying that commits *do* have behavioral changes. I think that you've taken my words too literally here. I would never conceal the intent of a piece of work like that. I thought that it would clarify matters to point out that I could in theory "get away with it if I wanted to" in this instance. This was only a means of conveying a subtle point about the behavioral changes from 0002 -- since you couldn't initially see them yourself (even with my commit message). Kind of like Tom Lane's 2011 talk on the query planner. The one where he lied to the audience several times. > > It seems kinda tricky to split up 0002 like that. It's possible, but > > I'm not sure if it's possible to split it in a way that highlights the > > issue that I just described. Because we already avoided the race in > > the aggressive case. > > I do see that there are some difficulties there. I'm not sure what to > do about that. I think a sufficiently clear commit message could > possibly be enough, rather than trying to split the patch. But I also > think splitting the patch should be considered, if that can reasonably > be done. I'll see if I can come up with something. It's hard to be sure about that kind of thing when you're this close to the code. -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Thu, Mar 24, 2022 at 3:28 PM Peter Geoghegan wrote: > But non-aggressive VACUUMs have always been able to do that. > > How about: "Set relfrozenxid to oldest extant XID seen by VACUUM" Sure, that sounds nice. > Believe it or not, I avoided functional changes in 0002 -- at least in > one important sense. That's why you had difficulty spotting any. This > must sound peculiar, since the commit message very clearly says that > the commit avoids a problem seen only in the non-aggressive case. It's > really quite subtle. Well, I think the goal in revising the code is to be as un-subtle as possible. Commits that people can't easily understand breed future bugs. > What you're saying here boils down to this: it doesn't matter what the > visibility map would say right this microsecond (in the aggressive > case) were we to call VM_ALL_FROZEN(): we know for sure that the VM > said that this page was all-frozen *in the recent past*. That's good > enough; we will never fail to scan a page that might have an XID < > OldestXmin (ditto for MXIDs) this way, which is all that really > matters. Makes sense. So maybe the commit message should try to emphasize this point e.g. "If a page is all-frozen at the time we check whether it can be skipped, don't allow it to affect the relfrozenxid and relminmxid which we set for the relation. This was previously true for aggressive vacuums, but not for non-aggressive vacuums, which was inconsistent. (The reason this is a safe thing to do is that any new XIDs or MXIDs that appear on the page after we initially observe it to be frozen must be newer than any relfrozenxid or relminmxid the current vacuum could possibly consider storing into pg_class.)" > This is absolutely mandatory in the aggressive case, because otherwise > relfrozenxid advancement might be seen as unsafe. My observation is: > Why should we accept the same race in the non-aggressive case? Why not > do essentially the same thing in every VACUUM? 
Sure, that seems like a good idea. I think I basically agree with the goals of the patch. My concern is just about making the changes understandable to future readers. This area is notoriously subtle, and people are going to introduce more bugs even if the comments and code organization are fantastic. > And so you could almost say that there is no behavioral change at > all. I vigorously object to this part, though. We should always err on the side of saying that commits *do* have behavioral changes. We should go out of our way to call out in the commit message any possible way that someone might notice the difference between the post-commit situation and the pre-commit situation. It is fine, even good, to also be clear about how we're maintaining continuity and why we don't think it's a problem, but the only commits that should be described as not having any behavioral change are ones that do mechanical code movement, or are just changing comments, or something like that. > It seems kinda tricky to split up 0002 like that. It's possible, but > I'm not sure if it's possible to split it in a way that highlights the > issue that I just described. Because we already avoided the race in > the aggressive case. I do see that there are some difficulties there. I'm not sure what to do about that. I think a sufficiently clear commit message could possibly be enough, rather than trying to split the patch. But I also think splitting the patch should be considered, if that can reasonably be done. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Thu, Mar 24, 2022 at 10:21 AM Robert Haas wrote: > You're probably not going to love hearing this, but I think you're > still explaining things here in ways that are too baroque and hard to > follow. I do think it's probably better. There are a lot of dimensions to this work. It's hard to know which to emphasize here. > But, for example, in the > commit message for 0001, I think you could change the subject line to > "Allow non-aggressive vacuums to advance relfrozenxid" and it would be > clearer. But non-aggressive VACUUMs have always been able to do that. How about: "Set relfrozenxid to oldest extant XID seen by VACUUM" > And then I think you could eliminate about half of the first > paragraph, starting with "There is no fixed relationship", and all of > the third paragraph (which starts with "Later work..."), and I think > removing all that material would make it strictly more clear than it > is currently. I don't think it's the place of a commit message to > speculate too much on future directions or to wax eloquent on > theoretical points. If that belongs anywhere, it's in a mailing list > discussion. Okay, I'll do that. > It seems to me that 0002 mixes code movement with functional changes. Believe it or not, I avoided functional changes in 0002 -- at least in one important sense. That's why you had difficulty spotting any. This must sound peculiar, since the commit message very clearly says that the commit avoids a problem seen only in the non-aggressive case. It's really quite subtle. You wrote this comment and code block (which I propose to remove in 0002), so clearly you already understand the race condition that I'm concerned with here:

-        if (skipping_blocks && blkno < rel_pages - 1)
-        {
-            /*
-             * Tricky, tricky.  If this is in aggressive vacuum, the page
-             * must have been all-frozen at the time we checked whether it
-             * was skippable, but it might not be any more.  We must be
-             * careful to count it as a skipped all-frozen page in that
-             * case, or else we'll think we can't update relfrozenxid and
-             * relminmxid.  If it's not an aggressive vacuum, we don't
-             * know whether it was initially all-frozen, so we have to
-             * recheck.
-             */
-            if (vacrel->aggressive ||
-                VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
-                vacrel->frozenskipped_pages++;
-            continue;
-        }

What you're saying here boils down to this: it doesn't matter what the visibility map would say right this microsecond (in the aggressive case) were we to call VM_ALL_FROZEN(): we know for sure that the VM said that this page was all-frozen *in the recent past*. That's good enough; we will never fail to scan a page that might have an XID < OldestXmin (ditto for MXIDs) this way, which is all that really matters. This is absolutely mandatory in the aggressive case, because otherwise relfrozenxid advancement might be seen as unsafe. My observation is: Why should we accept the same race in the non-aggressive case? Why not do essentially the same thing in every VACUUM? In 0002 we now track if each range that we actually chose to skip had any all-visible (not all-frozen) pages -- if that happens then relfrozenxid advancement becomes unsafe. The existing code uses "vacrel->aggressive" as a proxy for the same condition -- the existing code reasons based on what the visibility map must have said about the page in the recent past. Which makes sense, but only works in the aggressive case. The approach taken in 0002 also makes the code simpler, which is what enabled putting the VM skipping code into its own helper function, but that was just a bonus. And so you could almost say that there is no behavioral change at all. We're skipping pages in the same way, based on the same information (from the visibility map) as before. We're just being a bit more careful than before about how that information is tracked, to avoid this race. 
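[The per-range bookkeeping described here can be illustrated with a small standalone sketch. This is not the 0002 patch itself: the names are hypothetical, and the two arrays stand in for VM_ALL_VISIBLE()/VM_ALL_FROZEN() lookups. The point is that the helper remembers, per skipped range, whether every page was all-frozen, instead of using vacrel->aggressive as a proxy.]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the idea behind 0002: when choosing a range of blocks to
 * skip, remember whether the range contained any all-visible-but-not-
 * all-frozen page.  Skipping such a page makes relfrozenxid
 * advancement unsafe, for aggressive and non-aggressive VACUUMs alike.
 */
typedef struct SkipRange
{
    int  next_unskippable;  /* first block VACUUM must actually scan */
    bool range_all_frozen;  /* was every skipped page all-frozen? */
} SkipRange;

static SkipRange
scan_skip_sketch(const bool *vm_all_visible, const bool *vm_all_frozen,
                 int start_block, int rel_pages)
{
    SkipRange range = { start_block, true };

    while (range.next_unskippable < rel_pages &&
           vm_all_visible[range.next_unskippable])
    {
        /*
         * An all-visible page that is not also all-frozen may contain
         * XIDs < OldestXmin; if we skip it, we must not advance
         * relfrozenxid later on.
         */
        if (!vm_all_frozen[range.next_unskippable])
            range.range_all_frozen = false;
        range.next_unskippable++;
    }
    return range;
}
```

The caller would only act on the range when it is long enough (SKIP_PAGES_THRESHOLD on HEAD), and would clear an "OK to advance relfrozenxid" flag whenever range_all_frozen comes back false for a range it decides to skip.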
A race that we always avoided in the aggressive case is now consistently avoided. > I'm completely on board with moving the code that decides how much to > skip into a function. That seems like a great idea, and probably > overdue. But it is not easy for me to see what has changed > functionally between the old and new code organization, and I bet it > would be possible to split this into two patches, one of which creates > a function, and the other of which fixes the problem, and I think that > would be a useful service to future readers of the code. It seems kinda tricky to split up 0002 like that. It's possible, but I'm not sure if it's possible to split it in a way that highlights the issue that I just described. Because we already avoided the race in the aggressive case. > I also think that the commit message for 0002 is probably longer and > more complex than is really helpful.
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Wed, Mar 23, 2022 at 6:28 PM Peter Geoghegan wrote: > It would be great if you could take a look v11-0002-*, Robert. Does it > make sense to you? You're probably not going to love hearing this, but I think you're still explaining things here in ways that are too baroque and hard to follow. I do think it's probably better. But, for example, in the commit message for 0001, I think you could change the subject line to "Allow non-aggressive vacuums to advance relfrozenxid" and it would be clearer. And then I think you could eliminate about half of the first paragraph, starting with "There is no fixed relationship", and all of the third paragraph (which starts with "Later work..."), and I think removing all that material would make it strictly more clear than it is currently. I don't think it's the place of a commit message to speculate too much on future directions or to wax eloquent on theoretical points. If that belongs anywhere, it's in a mailing list discussion. It seems to me that 0002 mixes code movement with functional changes. I'm completely on board with moving the code that decides how much to skip into a function. That seems like a great idea, and probably overdue. But it is not easy for me to see what has changed functionally between the old and new code organization, and I bet it would be possible to split this into two patches, one of which creates a function, and the other of which fixes the problem, and I think that would be a useful service to future readers of the code. I have a hard time believing that if someone in the future bisects a problem back to this commit, they're going to have an easy time finding the behavior change in here. In fact I can't see it myself. I think the actual functional change is to fix what is described in the second paragraph of the commit message, but I haven't been able to figure out where the logic is actually changing to address that. 
Note that I would be happy with the behavior change happening either before or after the code reorganization. I also think that the commit message for 0002 is probably longer and more complex than is really helpful, and that the subject line is too vague, but since I don't yet understand exactly what's happening here, I cannot comment on how I think it should be revised at this point, except to say that the second paragraph of that commit message looks like the most useful part. I would also like to mention a few things that I do like about 0002. One is that it seems to collapse two different pieces of logic for page skipping into one. That seems good. As mentioned, it's especially good because that logic is abstracted into a function. Also, it looks like it is making a pretty localized change to one (1) aspect of what VACUUM does -- and I definitely prefer patches that change only one thing at a time. Hope that's helpful. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Wed, Mar 23, 2022 at 2:03 PM Thomas Munro wrote: > Yeah, I found it confusing that DISABLE_PAGE_SKIPPING doesn't disable > all page skipping, so 3414099c turned out to be not enough. The proposed change to DISABLE_PAGE_SKIPPING is partly driven by that, and partly driven by a similar concern about aggressive VACUUM. It seems worth emphasizing the idea that an aggressive VACUUM is now just the same as any other VACUUM except for one detail: we're guaranteed to advance relfrozenxid to a value >= FreezeLimit at the end. The non-aggressive case has the choice to do things that make that impossible. But there are only two places where this can happen now:

1. Non-aggressive VACUUMs might decide to skip some all-visible pages in the new lazy_scan_skip() helper routine for skipping with the VM (see v11-0002-*).

2. A non-aggressive VACUUM can *always* decide to ratchet back its target relfrozenxid in lazy_scan_noprune, to avoid waiting for a cleanup lock -- a final value from before FreezeLimit is usually still pretty good.

The first scenario is the only one where it becomes impossible for non-aggressive VACUUM to be able to advance relfrozenxid (with v11-0001-* in place) by any amount. Even that's a choice, made by weighing costs against benefits. There is no behavioral change in v11-0002-* (we're still using the old SKIP_PAGES_THRESHOLD strategy), but the lazy_scan_skip() helper routine could fairly easily be taught a lot more about the downside of skipping all-visible pages (namely how that makes it impossible to advance relfrozenxid). Maybe it's worth skipping all-visible pages (there are lots of them and age(relfrozenxid) is still low), and maybe it isn't worth it. We should get to decide, without implementation details making relfrozenxid advancement unsafe. It would be great if you could take a look at v11-0002-*, Robert. Does it make sense to you? Thanks -- Peter Geoghegan
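[The second scenario, ratcheting back the target relfrozenxid when a cleanup lock can't be had, might be sketched as follows. This is a hypothetical simplification, not the patch code: plain integer comparison stands in for the wraparound-aware XID comparisons used in heapam.c, and the function name is made up.]

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/*
 * Sketch of how VACUUM can arrive at its final new relfrozenxid.  The
 * candidate starts at OldestXmin and is ratcheted back by every extant
 * XID that VACUUM leaves behind.  With a cleanup lock, tuples with
 * XIDs < FreezeLimit get frozen, so they never pull the candidate
 * below FreezeLimit; without one (the lazy_scan_noprune path), an
 * unfrozen XID below FreezeLimit can pull the final value further
 * back -- which is usually still a big improvement.
 */
static TransactionId
vacuum_target_relfrozenxid(const TransactionId *extant_xids, int nxids,
                           const bool *got_cleanup_lock,
                           TransactionId oldest_xmin,
                           TransactionId freeze_limit)
{
    TransactionId new_relfrozenxid = oldest_xmin;

    for (int i = 0; i < nxids; i++)
    {
        TransactionId xid = extant_xids[i];

        /* cleanup lock path: XIDs before FreezeLimit are frozen away */
        if (got_cleanup_lock[i] && xid < freeze_limit)
            continue;
        if (xid < new_relfrozenxid)
            new_relfrozenxid = xid;
    }
    return new_relfrozenxid;
}
```

With cleanup locks acquired everywhere the result is always >= FreezeLimit (the aggressive guarantee); a single no-cleanup-lock page with an old XID is what ratchets the result further back.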
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Thu, Mar 24, 2022 at 9:59 AM Peter Geoghegan wrote: > On Wed, Mar 23, 2022 at 1:53 PM Robert Haas wrote: > > And therefore I favor defining it to mean that we don't > > skip any work at all. > > But even today DISABLE_PAGE_SKIPPING won't do pruning when we cannot > acquire a cleanup lock on a page, unless it happens to have XIDs from > before FreezeLimit (which is probably 50 million XIDs behind > OldestXmin, the vacuum_freeze_min_age default). I don't see much > difference. Yeah, I found it confusing that DISABLE_PAGE_SKIPPING doesn't disable all page skipping, so 3414099c turned out to be not enough.
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Wed, Mar 23, 2022 at 1:53 PM Robert Haas wrote: > I see what you mean about it depending on how you define "skipping". > But I think that DISABLE_PAGE_SKIPPING is intended as a sort of > emergency safeguard when you really, really don't want to leave > anything out. I agree. > And therefore I favor defining it to mean that we don't > skip any work at all. But even today DISABLE_PAGE_SKIPPING won't do pruning when we cannot acquire a cleanup lock on a page, unless it happens to have XIDs from before FreezeLimit (which is probably 50 million XIDs behind OldestXmin, the vacuum_freeze_min_age default). I don't see much difference. Anyway, this isn't important. I'll just drop the third patch. -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Wed, Mar 23, 2022 at 4:49 PM Peter Geoghegan wrote: > On Wed, Mar 23, 2022 at 1:41 PM Robert Haas wrote: > > It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely > > disable skipping pages, we have a problem. > > It depends on how you define skipping. DISABLE_PAGE_SKIPPING was > created at a time when a broader definition of skipping made a lot > more sense. > > > The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named > > DISABLE_PAGE_SKIPPING. > > VACUUM(DISABLE_PAGE_SKIPPING, VERBOSE) will still consistently show > that 100% of all of the pages from rel_pages are scanned. A page that > is "skipped" by lazy_scan_noprune isn't pruned, and won't have any of > its tuples frozen. But every other aspect of processing the page > happens in just the same way as it would in the cleanup > lock/lazy_scan_prune path. I see what you mean about it depending on how you define "skipping". But I think that DISABLE_PAGE_SKIPPING is intended as a sort of emergency safeguard when you really, really don't want to leave anything out. And therefore I favor defining it to mean that we don't skip any work at all. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Wed, Mar 23, 2022 at 1:41 PM Robert Haas wrote: > It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely > disable skipping pages, we have a problem. It depends on how you define skipping. DISABLE_PAGE_SKIPPING was created at a time when a broader definition of skipping made a lot more sense. > The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named > DISABLE_PAGE_SKIPPING. VACUUM(DISABLE_PAGE_SKIPPING, VERBOSE) will still consistently show that 100% of all of the pages from rel_pages are scanned. A page that is "skipped" by lazy_scan_noprune isn't pruned, and won't have any of its tuples frozen. But every other aspect of processing the page happens in just the same way as it would in the cleanup lock/lazy_scan_prune path. We'll even still VACUUM the page if it happens to have some existing LP_DEAD items left behind by opportunistic pruning. We don't need a cleanup lock in either lazy_scan_noprune (a share lock is all we need), nor do we even need one in lazy_vacuum_heap_page (a regular exclusive lock is all we need). -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Wed, Mar 23, 2022 at 3:59 PM Peter Geoghegan wrote: > In other words, since DISABLE_PAGE_SKIPPING doesn't *consistently* > force lazy_scan_noprune to refuse to process a page on HEAD (it all > depends on FreezeLimit/vacuum_freeze_min_age), it is logical for > DISABLE_PAGE_SKIPPING to totally get out of the business of caring > about that -- better to limit it to caring only about the visibility > map (by no longer making it force aggressiveness). It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely disable skipping pages, we have a problem. The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named DISABLE_PAGE_SKIPPING. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sun, Mar 13, 2022 at 9:05 PM Peter Geoghegan wrote: > Attached is v10. While this does still include the freezing patch, > it's not in scope for Postgres 15. As I've said, I still think that it > makes sense to maintain the patch series with the freezing stuff, > since it's structurally related. Attached is v11. Changes:

* No longer includes the patch that adds page-level freezing. It was making it harder to assess code coverage for the patches that I'm targeting Postgres 15 with. And so including it with each new revision no longer seems useful. I'll pick it up for Postgres 16.

* Extensive isolation tests added to v11-0001-*, exercising a lot of hard-to-hit code paths that are reached when VACUUM is unable to immediately acquire a cleanup lock on some heap page. In particular, we now have test coverage for the code in heapam.c that handles tracking the oldest extant XID and MXID in the presence of MultiXacts (on a no-cleanup-lock heap page).

* v11-0002-* (which is the patch that avoids missing out on advancing relfrozenxid in non-aggressive VACUUMs due to a race condition on HEAD) now moves even more of the logic for deciding how VACUUM will skip using the visibility map into its own helper routine. Now lazy_scan_heap just does what the state returned by the helper routine tells it to do for the current skippable range -- it doesn't make any decisions itself anymore. This is far simpler than what we do currently, on HEAD. There are no behavioral changes here, but this approach could be pushed further to improve performance. We could easily determine *every* page that we're going to scan (not skip) up-front in even the largest tables, very early, before we've even scanned one page. This could enable things like I/O prefetching, or capping the size of the dead_items array based on our final scanned_pages (not on rel_pages).

* A new patch (v11-0003-*) alters the behavior of VACUUM's DISABLE_PAGE_SKIPPING option. 
DISABLE_PAGE_SKIPPING no longer forces aggressive VACUUM -- now it only forces scanning of every page (disabling use of the visibility map), since that behavior is totally independent of aggressiveness. I don't feel too strongly about the DISABLE_PAGE_SKIPPING change. It just seems logical to decouple no-vm-skipping from aggressiveness -- it might actually be helpful in testing the work from the patch series in the future. Any page counted in scanned_pages has essentially been processed by VACUUM with this work in place -- that was the idea behind the lazy_scan_noprune stuff from commit 44fa8488. Bear in mind that the relfrozenxid tracking stuff from v11-0001-* makes it almost certain that a DISABLE_PAGE_SKIPPING-without-aggressiveness VACUUM will still manage to advance relfrozenxid -- usually by the same amount as an equivalent aggressive VACUUM would anyway. (Failing to acquire a cleanup lock on some heap page might result in the final relfrozenxid being appreciably older, but probably not, and we'd still almost certainly manage to advance relfrozenxid by *some* small amount.) Of course, anybody that wants both an aggressive VACUUM and a VACUUM that never skips even all-frozen pages in the visibility map will still be able to get that behavior quite easily. For example, VACUUM(DISABLE_PAGE_SKIPPING, FREEZE) will do that. Several of our existing tests must already use both of these options together, because the tests require an effective vacuum_freeze_min_age of 0 (and vacuum_multixact_freeze_min_age of 0) -- DISABLE_PAGE_SKIPPING alone won't do that on HEAD, which seems to confuse the issue (see commit b700f96c for an example of that). 
In other words, since DISABLE_PAGE_SKIPPING doesn't *consistently* force lazy_scan_noprune to refuse to process a page on HEAD (it all depends on FreezeLimit/vacuum_freeze_min_age), it is logical for DISABLE_PAGE_SKIPPING to totally get out of the business of caring about that -- better to limit it to caring only about the visibility map (by no longer making it force aggressiveness). -- Peter Geoghegan v11-0003-Don-t-force-aggressive-mode-for-DISABLE_PAGE_SKI.patch Description: Binary data v11-0002-Generalize-how-VACUUM-skips-all-frozen-pages.patch Description: Binary data v11-0001-Loosen-coupling-between-relfrozenxid-and-freezin.patch Description: Binary data
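[The "determine every page up-front" idea mentioned for v11-0002 could eventually look something like this sketch. It is hypothetical, not patch code: it applies a SKIP_PAGES_THRESHOLD-style rule against an immutable snapshot of the visibility map taken before the first page is read, and merely counts the pages that would be scanned. A small threshold is used here so the example is easy to follow (HEAD uses 32).]

```c
#include <assert.h>
#include <stdbool.h>

#define SKIP_PAGES_THRESHOLD 4  /* 32 on HEAD; small value for the sketch */

/*
 * Sketch of computing final scanned_pages before scanning begins, from
 * a snapshot of the visibility map's all-visible bits.  A run of
 * all-visible pages is only skipped when it is at least
 * SKIP_PAGES_THRESHOLD pages long, mirroring the rule lazy_scan_heap()
 * applies today.  Knowing the scan set up front would enable I/O
 * prefetching and sizing dead_items by scanned_pages, not rel_pages.
 */
static int
count_scanned_pages(const bool *vm_snapshot_all_visible, int rel_pages)
{
    int scanned = 0;

    for (int blkno = 0; blkno < rel_pages;)
    {
        int run = 0;

        /* measure the run of consecutive all-visible pages */
        while (blkno + run < rel_pages &&
               vm_snapshot_all_visible[blkno + run])
            run++;

        if (run >= SKIP_PAGES_THRESHOLD)
            blkno += run;       /* long enough: skip the whole run */
        else
        {
            scanned += run;     /* too short: scan these pages anyway */
            blkno += run;
            if (blkno < rel_pages)
            {
                scanned++;      /* the not-all-visible page itself */
                blkno++;
            }
        }
    }
    return scanned;
}
```

Because the snapshot never changes mid-VACUUM, the result cannot grow while the scan is underway, however long it takes.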
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 25, 2022 at 5:52 PM Peter Geoghegan wrote: > There is an important practical way in which it makes sense to treat > 0001 as separate to 0002. It is true that 0001 is independently quite > useful. In practical terms, I'd be quite happy to just get 0001 into > Postgres 15, without 0002. I think that that's what you meant here, in > concrete terms, and we can agree on that now. Attached is v10. While this does still include the freezing patch, it's not in scope for Postgres 15. As I've said, I still think that it makes sense to maintain the patch series with the freezing stuff, since it's structurally related. So, to be clear, the first two patches from the patch series are in scope for Postgres 15. But not the third. Highlights:

* Changes to terminology and commit messages along the lines suggested by Andres.

* Bug fixes to heap_tuple_needs_freeze()'s MultiXact handling. My testing strategy here still needs work.

* Expanded refactoring in the v10-0002 patch.

The v10-0002 patch (which appeared for the first time in v9) was originally all about fixing a case where non-aggressive VACUUMs were at a gratuitous disadvantage (relative to aggressive VACUUMs) around advancing relfrozenxid -- very much like the lazy_scan_noprune work from commit 44fa8488. And that is still its main purpose. But the refactoring now seems related to Andres' idea of making non-aggressive VACUUMs decide to scan a few extra all-visible pages in order to be able to advance relfrozenxid. The code that sets up skipping the visibility map is made a lot clearer by v10-0002. That patch moves a significant amount of code from lazy_scan_heap() into a new helper routine (so it continues the trend started by the Postgres 14 work that added lazy_scan_prune()). 
Now skipping a range of visibility map pages is fundamentally based on setting up the range up front, and then using the same saved details about the range thereafter -- we no longer have ad-hoc VM_ALL_VISIBLE()/VM_ALL_FROZEN() calls for pages from a range that we already decided to skip (so no calls to those routines from lazy_scan_heap(), at least not until after we finish processing in lazy_scan_prune()). This is more or less what we were doing all along for one special case: aggressive VACUUMs. We had to make sure to either increment frozenskipped_pages or increment scanned_pages for every page from rel_pages -- this issue is described by lazy_scan_heap() comments on HEAD that begin with "Tricky, tricky." (these date back to the freeze map work from 2016). Anyway, there is no reason to not go further with that: we should make whole ranges the basic unit that we deal with when skipping. It's a lot simpler to think in terms of entire ranges (not individual pages) that are determined to be all-visible or all-frozen up-front, without needing to recheck anything (regardless of whether it's an aggressive VACUUM). We don't need to track frozenskipped_pages this way. And it's much more obvious that it's safe for more complicated cases, in particular for aggressive VACUUMs. This kind of approach seems necessary to make non-aggressive VACUUMs do a little more work opportunistically, when they realize that they can advance relfrozenxid relatively easily that way (which I believe Andres favors as part of overhauling freezing). That becomes a lot more natural when you have a clear and unambiguous separation between deciding what range of blocks to skip, and then actually skipping. I can imagine the new helper function added by v10-0002 (which I've called lazy_scan_skip_range()) eventually being taught to do these kinds of tricks. In general I think that all of the details of what to skip need to be decided up front. 
The loop in lazy_scan_heap() should execute skipping based on the instructions it receives from the new helper function, in the simplest way possible. The helper function can become more intelligent about the costs and benefits of skipping in the future, without that impacting lazy_scan_heap(). -- Peter Geoghegan From 43ab00609392ed7ad31be491834bdac348e13653 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Fri, 11 Mar 2022 19:16:02 -0800 Subject: [PATCH v10 3/3] Make page-level characteristics drive freezing. Teach VACUUM to freeze all of the tuples on a page whenever it notices that it would otherwise mark the page all-visible, without also marking it all-frozen. VACUUM typically won't freeze _any_ tuples on the page unless _all_ tuples (that remain after pruning) are all-visible. This makes the overhead of vacuuming much more predictable over time. We avoid the need for large balloon payments during aggressive VACUUMs (typically anti-wraparound autovacuums). Freezing is proactive, so we're much less likely to get into "freezing debt". The new approach to freezing also enables relfrozenxid advancement in non-aggressive VACUUMs, which might be enough to avoid aggressive VACUUMs altogether (with many individual tables/workloads). While the non-aggressive
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Tue, Mar 1, 2022 at 1:46 PM Robert Haas wrote: > I think that this is not really a description of an algorithm -- and I > think that it is far from clear that the third "in-between" category > does not need to exist. But I already described the algorithm. It is very simple mechanistically -- though that in itself means very little. As I have said multiple times now, the hard part is assessing what the implications are. And the even harder part is making a judgement about whether or not those implications are what we generally want. > I think findings like this are very unconvincing. TPC-C may be unrealistic in certain ways, but it is nevertheless vastly more realistic than pgbench. pgbench is really more of a stress test than a benchmark. The main reasons why TPC-C is interesting here are *very* simple, and would likely be equally true with TPC-E (just for example) -- even though TPC-E is a very different kind of OLTP benchmark overall. TPC-C (like TPC-E) features a diversity of transaction types, some of which are more complicated than others -- which is strictly more realistic than having only one highly synthetic OLTP transaction type. Each transaction type doesn't necessarily modify the same tables in the same way. This leads to natural diversity among tables and among transactions, including: * The typical or average number of distinct XIDs per heap page varies significantly from table to table. There are way fewer distinct XIDs per "order line" table heap page than there are per "order" table heap page, for the obvious reason. * Roughly speaking, there are various different ways that free space management ought to work in a system like Postgres. For example it is necessary to make a "fragmentation vs. space utilization" trade-off with the new orders table. * There are joins in some of the transactions! Maybe TPC-C is a crude approximation of reality, but it nevertheless exercises relevant parts of the system to a significant degree.
What else would you expect me to use, for a project like this? To a significant degree the relfrozenxid tracking stuff is interesting because tables tend to have natural differences like the ones I have highlighted on this thread. How could that not be the case? Why wouldn't we want to take advantage of that? There might be some danger in over-optimizing for this particular benchmark, but right now that is so far from being the main problem that the idea seems strange to me. pgbench doesn't need the FSM, at all. In fact pgbench doesn't even really need VACUUM (except for antiwraparound), once heap fillfactor is lowered to 95 or so. pgbench simply isn't relevant, *at all*, except perhaps as a way of measuring regressions in certain synthetic cases that don't benefit. > TPC-C (or any > benchmark really) is so simple as to be a terrible proxy for what > vacuuming is going to look like on real-world systems. Doesn't that amount to "no amount of any kind of testing or benchmarking will convince me of anything, ever"? There is more than one type of real-world system. I think that TPC-C is representative of some real world systems in some regards. But even that's not the important point for me. I find TPC-C generally interesting for one reason: I can clearly see that Postgres does things in a way that just doesn't make much sense, which isn't particularly fundamental to how VACUUM works. My only long term goal is to teach Postgres to *avoid* various pathological cases exhibited by TPC-C (e.g., the B-Tree "split after new tuple" mechanism from commit f21668f328 *avoids* a pathological case from TPC-C). We don't necessarily have to agree on how important each individual case is "in the real world" (which is impossible to know anyway). We only have to agree that what we see is a pathological case (because some reasonable expectation is dramatically violated), and then work out a fix. 
I don't want to teach Postgres to be clever -- I want to teach it to avoid being stupid in cases where it exhibits behavior that really cannot be described any other way. You seem to talk about some of this work as if it was just as likely to have a detrimental effect elsewhere, for some equally plausible workload, which will have a downside that is roughly as bad as the advertised upside. I consider that very unlikely, though. Sure, regressions are quite possible, and a real concern -- but regressions *like that* are unlikely. Avoiding doing what is clearly the wrong thing just seems to work out that way, in general. -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sun, Feb 20, 2022 at 3:27 PM Peter Geoghegan wrote: > > I think that the idea has potential, but I don't think that I > > understand yet what the *exact* algorithm is. > > The algorithm seems to exploit a natural tendency that Andres once > described in a blog post about his snapshot scalability work [1]. To a > surprising extent, we can usefully bucket all tuples/pages into two > simple categories: > > 1. Very, very old ("infinitely old" for all practical purposes). > > 2. Very very new. > > There doesn't seem to be much need for a third "in-between" category > in practice. This seems to be at least approximately true all of the > time. > > Perhaps Andres wouldn't agree with this very general statement -- he > actually said something more specific. I for one believe that the > point he made generalizes surprisingly well, though. I have my own > theories about why this appears to be true. (Executive summary: power > laws are weird, and it seems as if the sparsity-of-effects principle > makes it easy to bucket things at the highest level, in a way that > generalizes well across disparate workloads.) I think that this is not really a description of an algorithm -- and I think that it is far from clear that the third "in-between" category does not need to exist. > Remember when I got excited about how my big TPC-C benchmark run > showed a predictable, tick/tock style pattern across VACUUM operations > against the order and order lines table [2]? It seemed very > significant to me that the OldestXmin of VACUUM operation n > consistently went on to become the new relfrozenxid for the same table > in VACUUM operation n + 1. It wasn't exactly the same XID, but very > close to it (within the range of noise). This pattern was clearly > present, even though VACUUM operation n + 1 might happen as long as 4 > or 5 hours after VACUUM operation n (this was a big table). I think findings like this are very unconvincing. 
TPC-C (or any benchmark really) is so simple as to be a terrible proxy for what vacuuming is going to look like on real-world systems. Like, it's nice that it works, and it shows that something's working, but it doesn't demonstrate that the patch is making the right trade-offs overall. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 25, 2022 at 3:26 PM Andres Freund wrote: > freeze_required_limit, freeze_desired_limit? Or s/limit/cutoff/? Or > s/limit/below/? I kind of like below because that answers < vs <= which I find > hard to remember around freezing. I like freeze_required_limit the most. > That may be true, but I think working more incrementally is better in this > area. I'd rather have a smaller improvement for a release, collect some data, > get another improvement in the next, than see a bunch of reports of large > wins and large regressions. I agree. There is an important practical way in which it makes sense to treat 0001 as separate to 0002. It is true that 0001 is independently quite useful. In practical terms, I'd be quite happy to just get 0001 into Postgres 15, without 0002. I think that that's what you meant here, in concrete terms, and we can agree on that now. However, it is *also* true that there is an important practical sense in which they *are* related. I don't want to ignore that either -- it does matter. Most of the value to be had here comes from the synergy between 0001 and 0002 -- or what I've been calling a "virtuous cycle", the thing that makes it possible to advance relfrozenxid/relminmxid in almost every VACUUM. Having both 0001 and 0002 together (or something along the same lines) is way more valuable than having just one. Perhaps we can even agree on this second point. I am encouraged by the fact that you at least recognize the general validity of the key ideas from 0002. If I am going to commit 0001 (and not 0002) ahead of feature freeze for 15, I better be pretty sure that I have at least roughly the right idea with 0002, too -- since that's the direction that 0001 is going in. It almost seems dishonest to pretend that I wasn't thinking of 0002 when I wrote 0001. I'm glad that you seem to agree that this business of accumulating freezing debt without any natural limit is just not okay. That is really fundamental to me.
I mean, vacuum_freeze_min_age kind of doesn't work as designed. This is a huge problem for us. > > Under these conditions, we will have many more opportunities to > > advance relminmxid for most of the tables (including the larger > > tables) all the way up to current-oldestMxact with the patch series. > > Without needing to freeze *any* MultiXacts early (just freezing some > > XIDs early) to get that benefit. The patch series is not just about > > spreading the burden of freezing, so that non-aggressive VACUUMs > > freeze more -- it's also making relfrozenxid and relminmxid more > > recent and therefore *reliable* indicators of which tables any > > wraparound problems *really* are. > > My concern was explicitly about the case where we have to create new > multixacts... It was a mistake on my part to counter your point about that with this other point about eager relminmxid advancement. As I said in the last email, while that is very valuable, it's not something that needs to be brought into this. > > Does that make sense to you? > > Yes. Okay, great. The fact that you recognize the value in that comes as a relief. > > You mean to change the signature of heap_tuple_needs_freeze, so it > > doesn't return a bool anymore? It just has two bool pointers as > > arguments, can_freeze and need_freeze? > > Something like that. Or return true if there's anything to do, and then rely > on can_freeze and need_freeze for finer details. But it doesn't matter that > much. Got it. > > The problem that all of these heuristics have is that they will tend > > to make it impossible for future non-aggressive VACUUMs to be able to > > advance relfrozenxid. All that it takes is one single all-visible page > > to make that impossible. As I said upthread, I think that being able > > to advance relfrozenxid (and especially relminmxid) by *some* amount > > in every VACUUM has non-obvious value. > > I think that's a laudable goal. 
But I don't think we should go there unless we > are quite confident we've mitigated the potential downsides. True. But that works both ways. We also shouldn't err in the direction of adding these kinds of heuristics (which have real downsides) until the idea of mostly swallowing the cost of freezing whole pages (while making it possible to disable) has lost, fairly. Overall, it looks like the cost is acceptable in most cases. I think that users will find it very reassuring to regularly and reliably see confirmation that wraparound is being kept at bay, by every VACUUM operation, with details that they can relate to their workload. That has real value IMV -- even when it's theoretically unnecessary for us to be so eager with advancing relfrozenxid. I really don't like the idea of falling behind on freezing systematically. You always run the "risk" of freezing being wasted. But that way of looking at it can be penny wise, pound foolish -- maybe we should just accept that trying to predict what will happen in the future (whether or not freezing will be
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 25, 2022 at 3:48 PM Andres Freund wrote: > I don't see why it matters that OldestXmin and OldestMxact are computed at the > same time? It's a question of the workload, not vacuum algorithm. I think it's both. > OldestMxact inherently lags OldestXmin. OldestMxact can only advance after all > members are older than OldestXmin (not quite true, but that's the bound), and > they always have more than one member. > > > How many of these "skewed MultiXacts" can we really expect? > > I don't think they're skewed in any way. It's a fundamental aspect of > multixacts. Having this happen to some degree is fundamental to MultiXacts, sure. But the approach of using FreezeLimit and MultiXactCutoff in the way that we do right now seems like it might make the problem a lot worse. Because they're completely meaningless cutoffs. They are magic numbers that have no relationship whatsoever to each other. There are problems with assuming that OldestXmin and OldestMxact "align" -- no question. But at least it's approximately true -- which is a start. They are at least not arbitrarily, unpredictably different, like FreezeLimit and MultiXactCutoff are, and always will be. I think that that's a meaningful and useful distinction. I am okay with making the most pessimistic possible assumptions about how any changes to how we freeze might cause FreezeMultiXactId() to allocate more MultiXacts than before. And I accept that the patch series shouldn't "get credit" for "offsetting" any problem like that by making relminmxid advancement occur much more frequently (even though that does seem very valuable). All I'm really saying is this: in general, there are probably quite a few opportunities for FreezeMultiXactId() to avoid allocating new XMIDs (just to freeze XIDs) by having the full context. And maybe by making the dialog between lazy_scan_prune and heap_prepare_freeze_tuple a bit more nuanced. -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-02-25 15:28:17 -0800, Peter Geoghegan wrote: > But why should we ever get to the FreezeMultiXactId() loop with the > stuff from 0002 in place? The whole purpose of the loop is to handle > cases where we have to remove *some* (not all) XIDs from before > cutoff_xid that appear in a MultiXact, which requires careful checking > of each XID (this is only possible when the MultiXactId is < > cutoff_multi to begin with, which is OldestMxact in the patch, which > is presumably very recent). > > It's not impossible that we'll get some number of "skewed MultiXacts" > with the patch -- cases that really do necessitate allocating a new > MultiXact, just to "freeze some XIDs from a MultiXact". That is, there > will sometimes be some number of XIDs that are < OldestXmin, but > nevertheless appear in some MultiXactIds >= OldestMxact. This seems > likely to be rare with the patch, though, since VACUUM calculates its > OldestXmin and OldestMxact (which are what cutoff_xid and cutoff_multi > really are in the patch) at the same point in time. Which was the > point I made in my email yesterday. I don't see why it matters that OldestXmin and OldestMxact are computed at the same time? It's a question of the workload, not vacuum algorithm. OldestMxact inherently lags OldestXmin. OldestMxact can only advance after all members are older than OldestXmin (not quite true, but that's the bound), and they always have more than one member. > How many of these "skewed MultiXacts" can we really expect? I don't think they're skewed in any way. It's a fundamental aspect of multixacts. Greetings, Andres Freund
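The lag that Andres describes can be modeled in a few lines of standalone C (a toy model, not PostgreSQL code; the names here are made up): a MultiXact only stops holding back OldestMxact once its *youngest* member falls behind OldestXmin, so OldestMxact trails OldestXmin by construction:

```c
#include <assert.h>

typedef unsigned TxId; /* stand-in for TransactionId; wraparound ignored */

/*
 * Toy model: a MultiXact's "removal bound" is its youngest member XID.
 * The multi keeps holding back OldestMxact until every member -- in
 * particular the youngest -- is older than OldestXmin.
 */
TxId
multi_removal_bound(const TxId *members, int nmembers)
{
    TxId youngest = members[0];

    for (int i = 1; i < nmembers; i++)
        if (members[i] > youngest)
            youngest = members[i];

    return youngest;
}

int
multi_is_removable(const TxId *members, int nmembers, TxId oldest_xmin)
{
    return multi_removal_bound(members, nmembers) < oldest_xmin;
}
```

Since a multi always carries more than one member, the youngest member is strictly newer than the oldest one, which is why OldestMxact inherently lags OldestXmin regardless of when the two values are computed.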
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 25, 2022 at 2:00 PM Peter Geoghegan wrote: > > Hm. I guess I'll have to look at the code for it. It doesn't immediately > > "feel" quite right. > > I kinda think it might be. Please let me know if you see a problem > with what I've said. Oh, wait. I have a better idea of what you meant now. The loop towards the end of FreezeMultiXactId() will indeed "Determine whether to keep this member or ignore it." when we need a new MultiXactId. The loop is exact in the sense that it will only include those XIDs that are truly needed -- those that are still running. But why should we ever get to the FreezeMultiXactId() loop with the stuff from 0002 in place? The whole purpose of the loop is to handle cases where we have to remove *some* (not all) XIDs from before cutoff_xid that appear in a MultiXact, which requires careful checking of each XID (this is only possible when the MultiXactId is < cutoff_multi to begin with, which is OldestMxact in the patch, which is presumably very recent). It's not impossible that we'll get some number of "skewed MultiXacts" with the patch -- cases that really do necessitate allocating a new MultiXact, just to "freeze some XIDs from a MultiXact". That is, there will sometimes be some number of XIDs that are < OldestXmin, but nevertheless appear in some MultiXactIds >= OldestMxact. This seems likely to be rare with the patch, though, since VACUUM calculates its OldestXmin and OldestMxact (which are what cutoff_xid and cutoff_multi really are in the patch) at the same point in time. Which was the point I made in my email yesterday. How many of these "skewed MultiXacts" can we really expect? Seems like there might be very few in practice. But I'm really not sure about that. -- Peter Geoghegan
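The member-by-member decision described above can be sketched as standalone C (again a toy model -- the real logic lives in FreezeMultiXactId(), and the "still running" test is simplified here to a plain XID comparison rather than MultiXactIdIsRunning()-style checks): zero surviving members means the multi can be removed outright, one survivor can be replaced with a plain XID, and only the multiple-survivor case -- the "skewed MultiXact" case -- forces allocating a new MultiXact:

```c
#include <assert.h>

typedef unsigned TxId; /* stand-in for TransactionId; wraparound ignored */

typedef enum
{
    FREEZE_REMOVE,      /* no live members: multi removed entirely */
    FREEZE_PLAIN_XID,   /* one live member: replace multi with plain XID */
    FREEZE_NEW_MULTI    /* several live members: must allocate a new multi */
} FreezeAction;

/*
 * Toy model of the loop at the end of FreezeMultiXactId(): walk the
 * members, keep only those still running, and decide what freezing must
 * do.  In this simplification a member counts as "still running" iff it
 * is >= oldest_running_xid.
 */
FreezeAction
filter_members(const TxId *members, int nmembers, TxId oldest_running_xid,
               TxId *survivor)
{
    int nkept = 0;

    for (int i = 0; i < nmembers; i++)
    {
        if (members[i] >= oldest_running_xid)
        {
            *survivor = members[i];
            nkept++;
        }
    }

    if (nkept == 0)
        return FREEZE_REMOVE;
    if (nkept == 1)
        return FREEZE_PLAIN_XID;
    return FREEZE_NEW_MULTI;
}
```

In these terms, the question in the thread is how often FREEZE_NEW_MULTI really occurs when the cutoffs are the very recent OldestXmin/OldestMxact rather than FreezeLimit/MultiXactCutoff.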
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-02-25 14:00:12 -0800, Peter Geoghegan wrote: > On Thu, Feb 24, 2022 at 11:14 PM Andres Freund wrote: > > I am not a fan of the backstop terminology. It's still the reason we need to > > do freezing for correctness reasons. > > Thanks for the review! > > I'm not wedded to that particular terminology, but I think that we > need something like it. Open to suggestions. > > How about limit-based? Something like that? freeze_required_limit, freeze_desired_limit? Or s/limit/cutoff/? Or s/limit/below/? I kind of like below because that answers < vs <= which I find hard to remember around freezing. > > I'm a tad concerned about replacing mxids that have some members that are > > older than OldestXmin but not older than FreezeLimit. It's not too hard to > > imagine that accelerating mxid consumption considerably. But we can > > probably, > > if not already done, special case that. > > Let's assume for a moment that this is a real problem. I'm not sure if > it is or not myself (it's complicated), but let's say that it is. The > problem may be more than offset by the positive impact on relminmxid > advancement. I have placed a large emphasis on enabling > relfrozenxid/relminmxid advancement in every non-aggressive VACUUM, > for a number of reasons -- this is one of the reasons. Finding a way > for every VACUUM operation to be "vacrel->scanned_pages + > vacrel->frozenskipped_pages == orig_rel_pages" (i.e. making *some* > amount of relfrozenxid/relminmxid advancement possible in every > VACUUM) has a great deal of value. That may be true, but I think working more incrementally is better in this area. I'd rather have a smaller improvement for a release, collect some data, get another improvement in the next, than see a bunch of reports of large wins and large regressions. > As I said recently on the "do only critical work during single-user > vacuum?"
thread, why should the largest tables in databases that > consume too many MXIDs do so evenly, across all their tables? There > are usually one or two large tables, and many more smaller tables. I > think it's much more likely that the largest tables consume > approximately zero MultiXactIds in these databases -- actual > MultiXactId consumption is probably concentrated in just one or two > smaller tables (even when we burn through MultiXacts very quickly). > But we don't recognize these kinds of distinctions at all right now. Recognizing those distinctions seems independent of freezing multixacts with live members. I am happy with freezing them more aggressively if they don't have live members. It's freezing mxids with live members that has me concerned. The limits you're proposing are quite aggressive and can advance quickly. I've seen large tables with plenty of multixacts. Typically concentrated over a value range (often changing over time). > Under these conditions, we will have many more opportunities to > advance relminmxid for most of the tables (including the larger > tables) all the way up to current-oldestMxact with the patch series. > Without needing to freeze *any* MultiXacts early (just freezing some > XIDs early) to get that benefit. The patch series is not just about > spreading the burden of freezing, so that non-aggressive VACUUMs > freeze more -- it's also making relfrozenxid and relminmxid more > recent and therefore *reliable* indicators of which tables any > wraparound problems *really* are. My concern was explicitly about the case where we have to create new multixacts... > Does that make sense to you? Yes. > > You mean to change the signature of heap_tuple_needs_freeze, so it > > doesn't return a bool anymore? It just has two bool pointers as > > arguments, can_freeze and need_freeze? > > Something like that. Or return true if there's anything to do, and then rely > on can_freeze and need_freeze for finer details. But it doesn't matter that > much. Got it. > > The problem that all of these heuristics have is that they will tend > > to make it impossible for future non-aggressive VACUUMs to be able to > > advance relfrozenxid. All that it takes is one single all-visible page > > to make that impossible. As I said upthread, I think that being able > > to advance relfrozenxid (and especially relminmxid) by *some* amount > > in every VACUUM has non-obvious value. > > I think that's a laudable goal.
> > > > It does recognize something like OldestXmin in a more precise and expensive > > way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId(). > > It doesn't look that way to me. > > While it's true that FreezeMultiXactId() will call MultiXactIdIsRunning(), > that's only a cross-check. > This cross-check is made at a point where we've already determined that the > MultiXact in question is < cutoff_multi. In other words, it catches cases > where a "MultiXactId < cutoff_multi" Multi contains an XID *that's still > running* -- a correctness issue. Nothing to do with being smart about > avoiding allocating new MultiXacts during freezing, or exploiting the fact > that "FreezeLimit < OldestXmin" (which is almost always true). If there's <= 1 live members in a mxact, we replace it with a plain xid iff the xid also would get frozen. With the current freezing logic I don't see what passing down OldestXmin would change. Or how it differs to a meaningful degree from heap_prepare_freeze_tuple()'s logic. I don't see how it'd avoid a single new mxact from being allocated. > > > Caller should make temp > > > + * copies of global tracking variables
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Thu, Feb 24, 2022 at 11:14 PM Andres Freund wrote: > I am not a fan of the backstop terminology. It's still the reason we need to > do freezing for correctness reasons. Thanks for the review! I'm not wedded to that particular terminology, but I think that we need something like it. Open to suggestions. How about limit-based? Something like that? > It'd make more sense to me to turn it > around and call the "non-backstop" freezing opportunistic freezing or such. The problem with that scheme is that it leads to a world where "standard freezing" is incredibly rare (it often literally never happens), whereas "opportunistic freezing" is incredibly common. That doesn't make much sense to me. We tend to think of 50 million XIDs (the vacuum_freeze_min_age default) as being not that many. But I think that it can be a huge number, too. Even then, it's unpredictable -- I suspect that it can change without very much changing in the application, from the point of view of users. That's a big part of the problem I'm trying to address -- freezing outside of aggressive VACUUMs is way too rare (it might barely happen at all). FreezeLimit/vacuum_freeze_min_age was designed at a time when there was no visibility map at all, when it made somewhat more sense as the thing that drives freezing. Incidentally, this is part of the problem with anti-wraparound vacuums and freezing debt -- the fact that some quite busy databases take weeks or months to go through 50 million XIDs (or 200 million) increases the pain of the eventual aggressive VACUUM. It's not completely unbounded -- autovacuum_freeze_max_age is not 100% useless here. But the extent to which that stuff bounds the debt can vary enormously, for not-very-good reasons. > > Whenever we opt to > > "freeze a page", the new page-level algorithm *always* uses the most > > recent possible XID and MXID values (OldestXmin and oldestMxact) to > > decide what XIDs/XMIDs need to be replaced. 
That might sound like it'd > be too much, but it only applies to those pages that we actually > decide to freeze (since page-level characteristics drive everything > now). FreezeLimit is only one way of triggering that now (and one of > the least interesting and rarest). > > That largely makes sense to me and doesn't seem weird. I'm very pleased that the main intuition behind 0002 makes sense to you. That's a start, at least. > I'm a tad concerned about replacing mxids that have some members that are > older than OldestXmin but not older than FreezeLimit. It's not too hard to > imagine that accelerating mxid consumption considerably. But we can probably, > if not already done, special case that. Let's assume for a moment that this is a real problem. I'm not sure if it is or not myself (it's complicated), but let's say that it is. The problem may be more than offset by the positive impact on relminmxid advancement. I have placed a large emphasis on enabling relfrozenxid/relminmxid advancement in every non-aggressive VACUUM, for a number of reasons -- this is one of the reasons. Finding a way for every VACUUM operation to be "vacrel->scanned_pages + vacrel->frozenskipped_pages == orig_rel_pages" (i.e. making *some* amount of relfrozenxid/relminmxid advancement possible in every VACUUM) has a great deal of value. As I said recently on the "do only critical work during single-user vacuum?" thread, why should the largest tables in databases that consume too many MXIDs do so evenly, across all their tables? There are usually one or two large tables, and many more smaller tables. I think it's much more likely that the largest tables consume approximately zero MultiXactIds in these databases -- actual MultiXactId consumption is probably concentrated in just one or two smaller tables (even when we burn through MultiXacts very quickly). But we don't recognize these kinds of distinctions at all right now.
Under these conditions, we will have many more opportunities to advance relminmxid for most of the tables (including the larger tables) all the way up to current-oldestMxact with the patch series. Without needing to freeze *any* MultiXacts early (just freezing some XIDs early) to get that benefit. The patch series is not just about spreading the burden of freezing, so that non-aggressive VACUUMs freeze more -- it's also making relfrozenxid and relminmxid more recent and therefore *reliable* indicators of which tables any wraparound problems *really* are. Does that make sense to you? This kind of "virtuous cycle" seems really important to me. It's a subtle point, so I have to ask. > > It seems that heap_prepare_freeze_tuple allocates new MXIDs (when > > freezing old ones) in large part so it can NOT freeze XIDs that it > > would have been useful (and much cheaper) to remove anyway. > > Well, we may have to allocate a new mxid because some members are older than > FreezeLimit but others are still running. When do we not remove xids that > would have been cheaper to remove once we
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-02-24 20:53:08 -0800, Peter Geoghegan wrote: > 0002 makes page-level freezing a first class thing. > heap_prepare_freeze_tuple now has some (limited) knowledge of how this > works. heap_prepare_freeze_tuple's cutoff_xid argument is now always > the VACUUM caller's OldestXmin (not its FreezeLimit, as before). We > still have to pass FreezeLimit to heap_prepare_freeze_tuple, which > helps us to respect FreezeLimit as a backstop, and so now it's passed > via the new backstop_cutoff_xid argument instead. I am not a fan of the backstop terminology. It's still the reason we need to do freezing for correctness reasons. It'd make more sense to me to turn it around and call the "non-backstop" freezing opportunistic freezing or such. > Whenever we opt to > "freeze a page", the new page-level algorithm *always* uses the most > recent possible XID and MXID values (OldestXmin and oldestMxact) to > decide what XIDs/XMIDs need to be replaced. That might sound like it'd > be too much, but it only applies to those pages that we actually > decide to freeze (since page-level characteristics drive everything > now). FreezeLimit is only one way of triggering that now (and one of > the least interesting and rarest). That largely makes sense to me and doesn't seem weird. I'm a tad concerned about replacing mxids that have some members that are older than OldestXmin but not older than FreezeLimit. It's not too hard to imagine that accelerating mxid consumption considerably. But we can probably, if not already done, special case that. > It seems that heap_prepare_freeze_tuple allocates new MXIDs (when > freezing old ones) in large part so it can NOT freeze XIDs that it > would have been useful (and much cheaper) to remove anyway. Well, we may have to allocate a new mxid because some members are older than FreezeLimit but others are still running. When do we not remove xids that would have been cheaper to remove once we decide to actually do work? 
> On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's > OldestXmin at all (it actually just gets FreezeLimit passed as its > cutoff_xid argument). It cannot possibly recognize any of this for itself. It does recognize something like OldestXmin in a more precise and expensive way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId(). > Does that theory about MultiXacts sound plausible? I'm not claiming > that the patch makes it impossible that FreezeMultiXactId() will have > to allocate a new MultiXact to freeze during VACUUM -- the > freeze-the-dead isolation tests already show that that's not true. I > just think that page-level freezing based on page characteristics with > oldestXmin and oldestMxact (not FreezeLimit and MultiXactCutoff) > cutoffs might make it a lot less likely in practice. Hm. I guess I'll have to look at the code for it. It doesn't immediately "feel" quite right. > oldestXmin and oldestMxact map to the same wall clock time, more or less -- > that seems like it might be an important distinction, independent of > everything else. Hm. Multis can be kept alive by fairly "young" member xids. So it may not be removable (without creating a newer multi) until much later than its creation time. So I don't think that's really true. > From 483bc8df203f9df058fcb53e7972e3912e223b30 Mon Sep 17 00:00:00 2001 > From: Peter Geoghegan > Date: Mon, 22 Nov 2021 10:02:30 -0800 > Subject: [PATCH v9 1/4] Loosen coupling between relfrozenxid and freezing. > > When VACUUM set relfrozenxid before now, it set it to whatever value was > used to determine which tuples to freeze -- the FreezeLimit cutoff. > This approach was very naive: the relfrozenxid invariant only requires > that new relfrozenxid values be <= the oldest extant XID remaining in > the table (at the point that the VACUUM operation ends), which in > general might be much more recent than FreezeLimit.
There is no fixed > relationship between the amount of physical work performed by VACUUM to > make it safe to advance relfrozenxid (freezing and pruning), and the > actual number of XIDs that relfrozenxid can be advanced by (at least in > principle) as a result. VACUUM might have to freeze all of the tuples > from a hundred million heap pages just to enable relfrozenxid to be > advanced by no more than one or two XIDs. On the other hand, VACUUM > might end up doing little or no work, and yet still be capable of > advancing relfrozenxid by hundreds of millions of XIDs as a result. > > VACUUM now sets relfrozenxid (and relminmxid) using the exact oldest > extant XID (and oldest extant MultiXactId) from the table, including > XIDs from the table's remaining/unfrozen MultiXacts. This requires that > VACUUM carefully track the oldest unfrozen XID/MultiXactId as it goes. > This optimization doesn't require any changes to the definition of > relfrozenxid, nor does it require changes to the core design of > freezing. > Final relfrozenxid values must still be >=
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sun, Feb 20, 2022 at 12:27 PM Peter Geoghegan wrote: > You've given me a lot of high quality feedback on all of this, which > I'll work through soon. It's hard to get the balance right here, but > it's made much easier by this kind of feedback. Attached is v9. Lots of changes. Highlights:

* Much improved 0001 ("loosen coupling" dynamic relfrozenxid tracking patch). Some of the improvements are due to recent feedback from Robert.

* Much improved 0002 ("Make page-level characteristics drive freezing" patch). Whole new approach to the implementation, though the same algorithm as before.

* No more FSM patch -- that was totally separate work, that I shouldn't have attached to this project.

* There are 2 new patches (these are now 0003 and 0004), both of which are concerned with allowing non-aggressive VACUUM to consistently advance relfrozenxid. I think that 0003 makes sense on general principle, but I'm much less sure about 0004. These aren't too important.

While working on the new approach to freezing taken by v9-0002, I had some insight about the issues that Robert raised around 0001, too. I wasn't expecting that to happen. 0002 makes page-level freezing a first class thing. heap_prepare_freeze_tuple now has some (limited) knowledge of how this works. heap_prepare_freeze_tuple's cutoff_xid argument is now always the VACUUM caller's OldestXmin (not its FreezeLimit, as before). We still have to pass FreezeLimit to heap_prepare_freeze_tuple, which helps us to respect FreezeLimit as a backstop, and so now it's passed via the new backstop_cutoff_xid argument instead. Whenever we opt to "freeze a page", the new page-level algorithm *always* uses the most recent possible XID and MXID values (OldestXmin and oldestMxact) to decide what XIDs/XMIDs need to be replaced. That might sound like it'd be too much, but it only applies to those pages that we actually decide to freeze (since page-level characteristics drive everything now). 
FreezeLimit is only one way of triggering that now (and one of the least interesting and rarest). 0002 also adds an alternative set of relfrozenxid/relminmxid tracker variables, to make the "don't freeze the page" path within lazy_scan_prune simpler (if you don't want to freeze the page, then use the set of tracker variables that go with that choice, which heap_prepare_freeze_tuple knows about and helps with). With page-level freezing, lazy_scan_prune wants to make a decision about the page as a whole, at the last minute, after all heap_prepare_freeze_tuple calls have already been made. So I think that heap_prepare_freeze_tuple needs to know about that aspect of lazy_scan_prune's behavior. When we *don't* want to freeze the page, we more or less need everything related to freezing inside lazy_scan_prune to behave like lazy_scan_noprune, which never freezes the page (that's mostly the point of lazy_scan_noprune). And that's almost what we actually do -- heap_prepare_freeze_tuple now outsources maintenance of this alternative set of "don't freeze the page" relfrozenxid/relminmxid tracker variables to its sibling function, heap_tuple_needs_freeze. That is the same function that lazy_scan_noprune itself actually calls. Now back to Robert's feedback on 0001, which had very complicated comments in the last version. This approach seems to make the "being versus becoming" or "going to freeze versus not going to freeze" distinctions much clearer. This is less true if you assume that 0002 won't be committed but 0001 will be. Even if that happens with Postgres 15, I have to imagine that adding something like 0002 must be the real goal, long term. Without 0002, the value from 0001 is far more limited. You need both together to get the virtuous cycle I've described. 
The approach with always using OldestXmin as cutoff_xid and oldestMxact as our cutoff_multi makes a lot of sense to me, in part because I think that it might well cut down on the tendency of VACUUM to allocate new MultiXacts in order to be able to freeze old ones. AFAICT the only reason that heap_prepare_freeze_tuple does that is because it has no flexibility on FreezeLimit and MultiXactCutoff. These are derived from vacuum_freeze_min_age and vacuum_multixact_freeze_min_age, respectively, and so they're two independent though fairly meaningless cutoffs. On the other hand, OldestXmin and OldestMxact are not independent in the same way. We get both of them at the same time and the same place, in vacuum_set_xid_limits. OldestMxact really is very close to OldestXmin -- only the units differ. It seems that heap_prepare_freeze_tuple allocates new MXIDs (when freezing old ones) in large part so it can NOT freeze XIDs that it would have been useful (and much cheaper) to remove anyway. On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's OldestXmin at all (it actually just gets FreezeLimit passed as its cutoff_xid argument). It cannot possibly recognize any of this for itself. Does that theory about MultiXacts
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sun, Feb 20, 2022 at 7:30 AM Robert Haas wrote: > Right, so we at least need to add a similar comment to what I proposed > for MXIDs, and maybe other changes are needed, too. Agreed. > > Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed > > here, to make all of this less confusing. I only now fully realized > > how confusing all of this stuff is -- very. > > Right. I think I understand all of this, or at least most of it -- but > not from the comment. The question is how the comment can be more > clear. My general suggestion is that function header comments should > have more to do with the behavior of the function than how it fits > into the bigger picture. If it's clear to the reader what conditions > must hold before calling the function and which must hold on return, > it helps a lot. IMHO, it's the job of the comments in the calling > function to clarify why we then choose to call that function at the > place and in the way that we do. You've given me a lot of high quality feedback on all of this, which I'll work through soon. It's hard to get the balance right here, but it's made much easier by this kind of feedback. > I think that the idea has potential, but I don't think that I > understand yet what the *exact* algorithm is. The algorithm seems to exploit a natural tendency that Andres once described in a blog post about his snapshot scalability work [1]. To a surprising extent, we can usefully bucket all tuples/pages into two simple categories:

1. Very, very old ("infinitely old" for all practical purposes).
2. Very, very new.

There doesn't seem to be much need for a third "in-between" category in practice. This seems to be at least approximately true all of the time. Perhaps Andres wouldn't agree with this very general statement -- he actually said something more specific. I for one believe that the point he made generalizes surprisingly well, though. I have my own theories about why this appears to be true. 
(Executive summary: power laws are weird, and it seems as if the sparsity-of-effects principle makes it easy to bucket things at the highest level, in a way that generalizes well across disparate workloads.) > Maybe I need to read the > code, when I have some time for that. I can't form an intelligent > opinion at this stage about whether this is likely to be a net > positive. The code in the v8-0002 patch is a bit sloppy right now. I didn't quite get around to cleaning it up -- I was focussed on performance validation of the algorithm itself. So bear that in mind if you do look at v8-0002 (might want to wait for v9-0002 before looking). I believe that the only essential thing about the algorithm itself is that it freezes all the tuples on a page when it anticipates setting the page all-visible, or (barring edge cases) freezes none at all. (Note that setting the page all-visible/all-frozen may happen just after lazy_scan_prune returns, or in the second pass over the heap, after LP_DEAD items are set to LP_UNUSED -- lazy_scan_prune doesn't care which way it will happen.) There are one or two other design choices that we need to make, like what exact tuples we freeze in the edge case where FreezeLimit/XID age forces us to freeze in lazy_scan_prune. These other design choices don't seem relevant to the issue of central importance, which is whether or not we come out ahead overall with this new algorithm. FreezeLimit will seldom affect our choice to freeze or not freeze now, and so AFAICT the exact way that FreezeLimit affects which precise freezing-eligible tuples we freeze doesn't complicate performance validation. Remember when I got excited about how my big TPC-C benchmark run showed a predictable, tick/tock style pattern across VACUUM operations against the order and order lines table [2]? It seemed very significant to me that the OldestXmin of VACUUM operation n consistently went on to become the new relfrozenxid for the same table in VACUUM operation n + 1. 
It wasn't exactly the same XID, but very close to it (within the range of noise). This pattern was clearly present, even though VACUUM operation n + 1 might happen as long as 4 or 5 hours after VACUUM operation n (this was a big table). This pattern was encouraging to me because it showed (at least for the workload and tables in question) that the amount of unnecessary extra freezing can't have been too bad -- the fact that we can always advance relfrozenxid in the same way is evidence of that. Note that the vacuum_freeze_min_age setting can't have affected our choice of what to freeze (given what we see in the logs), and yet there is a clear pattern where the pages (it's really pages, not tuples) that the new algorithm doesn't freeze in VACUUM operation n will reliably get frozen in VACUUM operation n + 1 instead. And so this pattern seems to lend support to the general idea of letting the workload itself be the primary driver of what pages we freeze (not FreezeLimit, and not anything based
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 18, 2022 at 7:12 PM Peter Geoghegan wrote: > We have to worry about XIDs from MultiXacts (and xmax values more > generally). And we have to worry about the case where we start out > with only xmin frozen (by an earlier VACUUM), and then have to freeze > xmax too. I believe that we have to generally consider xmin and xmax > independently. For example, we cannot ignore xmax, just because we > looked at xmin, since in general xmin alone might have already been > frozen. Right, so we at least need to add a similar comment to what I proposed for MXIDs, and maybe other changes are needed, too. > The difference between the cleanup lock path (in > lazy_scan_prune/heap_prepare_freeze_tuple) and the share lock path (in > lazy_scan_noprune/heap_tuple_needs_freeze) is what is at issue in both > of these confusing comment blocks, really. Note that cutoff_xid is the > name that both heap_prepare_freeze_tuple and heap_tuple_needs_freeze > have for FreezeLimit (maybe we should rename every occurrence of > cutoff_xid in heapam.c to FreezeLimit). > > At a high level, we aren't changing the fundamental definition of an > aggressive VACUUM in any of the patches -- we still need to advance > relfrozenxid up to FreezeLimit in an aggressive VACUUM, just like on > HEAD, today (we may be able to advance it *past* FreezeLimit, but > that's just a bonus). But in a non-aggressive VACUUM, where there is > still no strict requirement to advance relfrozenxid (by any amount), > the code added by 0001 can set relfrozenxid to any known safe value, > which could either be from before FreezeLimit, or after FreezeLimit -- > almost anything is possible (provided we respect the relfrozenxid > invariant, and provided we see that we didn't skip any > all-visible-not-all-frozen pages). 
> > Since we still need to "respect FreezeLimit" in an aggressive VACUUM, > the aggressive case might need to wait for a full cleanup lock the > hard way, having tried and failed to do it the easy way within > lazy_scan_noprune (lazy_scan_noprune will still return false when any > call to heap_tuple_needs_freeze for any tuple returns true) -- same > as on HEAD, today. > > And so the difference at issue here is: FreezeLimit/cutoff_xid only > needs to affect the new NewRelfrozenxid value we use for relfrozenxid in > heap_prepare_freeze_tuple, which is involved in real freezing -- not > in heap_tuple_needs_freeze, whose main purpose is still to help us > avoid freezing where a cleanup lock isn't immediately available. While > the purpose of FreezeLimit/cutoff_xid within heap_tuple_needs_freeze > is to determine its bool return value, which will only be of interest > to the aggressive case (which might have to get a cleanup lock and do > it the hard way), not the non-aggressive case (where ratcheting back > NewRelfrozenxid is generally possible, and generally leaves us with > almost as good of a value). > > In other words: the calls to heap_tuple_needs_freeze made from > lazy_scan_noprune are simply concerned with the page as it actually > is, whereas the similar/corresponding calls to > heap_prepare_freeze_tuple from lazy_scan_prune are concerned with > *what the page will actually become*, after freezing finishes, and > after lazy_scan_prune is done with the page entirely (ultimately > the final NewRelfrozenxid value set in pg_class.relfrozenxid only has > to be <= the oldest extant XID *at the time the VACUUM operation is > just about to end*, not some earlier time, so "being versus becoming" > is an interesting distinction for us). > > Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed > here, to make all of this less confusing. I only now fully realized > how confusing all of this stuff is -- very. Right. 
I think I understand all of this, or at least most of it -- but not from the comment. The question is how the comment can be more clear. My general suggestion is that function header comments should have more to do with the behavior of the function than how it fits into the bigger picture. If it's clear to the reader what conditions must hold before calling the function and which must hold on return, it helps a lot. IMHO, it's the job of the comments in the calling function to clarify why we then choose to call that function at the place and in the way that we do. > As a general rule, we try to freeze all of the remaining live tuples > on a page (following pruning) together, as a group, or none at all. > Most of the time this is triggered by our noticing that the page is > about to be set all-visible (but not all-frozen), and doing work > sufficient to mark it fully all-frozen instead. Occasionally there is > FreezeLimit to consider, which is now more of a backstop thing, used > to make sure that we never get too far behind in terms of unfrozen > XIDs. This is useful in part because it avoids any future > non-aggressive VACUUM that is fundamentally unable to advance > relfrozenxid (you can't skip all-visible pages if there are
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sat, Feb 19, 2022 at 8:54 PM Andres Freund wrote: > > Leaving behind disconnected/orphaned heap-only tuples is pretty much > > pointless anyway, since they'll never be accessible by index scans. > > Even after a REINDEX, since there is no root item from the heap page > > to go in the index. (A dump and restore might work better, though.) > > Given that heap_surgery's raison d'etre is correcting corruption etc, I think > it makes sense for it to do as minimal work as possible. Iterating through a > HOT chain would be a problem if you e.g. tried to repair a page with HOT > corruption. Yeah, I agree. I don't have time to respond to all of these emails thoroughly right now, but I think it's really important that pg_surgery do the exact surgery the user requested, and not any other work. I don't think that page defragmentation should EVER be REQUIRED as a condition of other work. If other code is relying on that, I'd say it's busted. I'm a little more uncertain about the case where we kill the root tuple of a HOT chain, because I can see that this might leave the page a state where sequential scans see the tuple at the end of the chain and index scans don't. I'm not sure whether that should be the responsibility of pg_surgery itself to avoid, or whether that's your problem as a user of it -- although I lean mildly toward the latter view, at the moment. But in any case surely the pruning code can't just decide to go into an infinite loop if that happens. Code that manipulates the states of data pages needs to be as robust against arbitrary on-disk states as we can reasonably make it, because pages get garbled on disk all the time. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sat, Feb 19, 2022 at 8:21 PM Andres Freund wrote: > Why does anything need to happen to it from vacuum's POV? It'll not be a > problem for freezing etc. Until it's deleted vacuum doesn't need to care. > > Probably worth a WARNING, and amcheck definitely needs to detect it, but > otherwise I think it's fine to just continue. Maybe that's true, but it's just really weird to imagine not having an LP_REDIRECT that points to the LIVE item here, without throwing an error. Seems kind of iffy, to say the least. > >I guess it doesn't actually matter if we leave an aborted DEAD tuple > >behind, that we could have pruned away, but didn't. The important > >thing is to be consistent at the level of the page. > > That's not ok, because it opens up dangers of being interpreted differently > after wraparound etc. > > But I don't see any cases where it would happen with the new pruning logic in > your patch and sharing the HTSV status array? Right. Fundamentally, there isn't any reason why it should matter that VACUUM reached the heap page just before (rather than concurrent with or just after) some xact that inserted or updated on the page aborts. Just as long as we have a consistent idea about what's going on at the level of the whole page (or maybe the level of each HOT chain, but the whole page level seems simpler to me). -- Peter Geoghegan
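[Editor's note: Andres's point about states "being interpreted differently after wraparound" comes down to the circular XID comparison rule. The sketch below is a toy model, not PostgreSQL code -- `xid_precedes` is an invented name standing in for TransactionIdPrecedes()-style logic -- showing how a stored XID that is left behind flips from "in the past" to "in the future" once the counter advances far enough:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of circular XID comparison: a precedes b iff the signed
 * 32-bit difference is negative. The comparison is only meaningful
 * while the two XIDs are within about 2^31 transactions of each
 * other -- which is why unfrozen XIDs can't be left behind forever. */
static bool
xid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}
```

[With nearby values the old XID correctly reads as "in the past"; after the counter advances by more than 2^31, the very same stored value stops preceding the current XID, i.e. it appears to be from the future -- the danger Andres is pointing at.]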
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On February 19, 2022 7:56:53 PM PST, Peter Geoghegan wrote: >On Sat, Feb 19, 2022 at 7:47 PM Andres Freund wrote: >> Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). >> The >> problem here is a DEAD orphaned HOT tuples, and those we should be able to >> delete with the new page pruning logic, right? > >Right. But what good does that really do? The problematic page had a >third tuple (at offnum 3) that was LIVE. If we could have done >something about the problematic tuple at offnum 2 (which is where we >got stuck), then we'd still be left with a very unpleasant choice >about what happens to the third tuple. Why does anything need to happen to it from vacuum's POV? It'll not be a problem for freezing etc. Until it's deleted vacuum doesn't need to care. Probably worth a WARNING, and amcheck definitely needs to detect it, but otherwise I think it's fine to just continue. >> I think it might be worth getting rid of the need for the retry approach by >> reusing the same HTSV status array between heap_prune_page and >> lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in >> lazy_scan_prune() would be some form of corruption. And it'd be a pretty >> decent performance boost, HTSV ain't cheap. > >I guess it doesn't actually matter if we leave an aborted DEAD tuple >behind, that we could have pruned away, but didn't. The important >thing is to be consistent at the level of the page. That's not ok, because it opens up dangers of being interpreted differently after wraparound etc. But I don't see any cases where it would happen with the new pruning logic in your patch and sharing the HTSV status array? Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sat, Feb 19, 2022 at 7:47 PM Andres Freund wrote: > I'm not that sure those are that different... Imagine some corruption leading > to two hot chains ending in the same tid, which our fancy new secure pruning > algorithm might detect. I suppose that's possible, but it doesn't seem all that likely to ever happen, what with the xmin -> xmax cross-tuple matching stuff. > Either way, I'm a bit surprised about the logic to not allow killing redirect > items? What if you have a redirect pointing to an unused item? Again, I simply think it boils down to having to treat HOT chains as a whole unit when killing TIDs. > > Let's assume that we don't want to make VACUUM/pruning just treat > > orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise > > status > > I don't think that'd ever be a good idea. Those tuples are visible to a > seqscan after all. I agree (I don't hate it completely, but it seems mostly bad). This is what leads me to the conclusion that pg_surgery has to be able to do this instead. Surely it's not okay to have something that makes VACUUM always end in error, that cannot even be fixed by pg_surgery. > > -- let's say that we want to err in the direction of doing > > nothing at all with the page. Now we have to have a weird error in > > VACUUM instead (not great, but better than just spinning between > > lazy_scan_prune and heap_prune_page). > > Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). The > problem here is a DEAD orphaned HOT tuples, and those we should be able to > delete with the new page pruning logic, right? Right. But what good does that really do? The problematic page had a third tuple (at offnum 3) that was LIVE. If we could have done something about the problematic tuple at offnum 2 (which is where we got stuck), then we'd still be left with a very unpleasant choice about what happens to the third tuple. 
> I think it might be worth getting rid of the need for the retry approach by > reusing the same HTSV status array between heap_prune_page and > lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in > lazy_scan_prune() would be some form of corruption. And it'd be a pretty > decent performance boost, HTSV ain't cheap. I guess it doesn't actually matter if we leave an aborted DEAD tuple behind, that we could have pruned away, but didn't. The important thing is to be consistent at the level of the page. -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-02-19 19:31:21 -0800, Peter Geoghegan wrote: > On Sat, Feb 19, 2022 at 6:16 PM Peter Geoghegan wrote: > > > Given that heap_surgery's raison d'etre is correcting corruption etc, I > > > think > > > it makes sense for it to do as minimal work as possible. Iterating > > > through a > > > HOT chain would be a problem if you e.g. tried to repair a page with HOT > > > corruption. > > > > I guess that's also true. There is at least a legitimate argument to > > be made for not leaving behind any orphaned heap-only tuples. The > > interface is a TID, and so the user may already believe that they're > > killing the heap-only, not just the root item (since ctid suggests > > that the TID of a heap-only tuple is the TID of the root item, which > > is kind of misleading). > > Actually, I would say that heap_surgery's raison d'etre is making > weird errors related to corruption of this or that TID go away, so > that the user can cut their losses. That's how it's advertised. I'm not that sure those are that different... Imagine some corruption leading to two hot chains ending in the same tid, which our fancy new secure pruning algorithm might detect. Either way, I'm a bit surprised about the logic to not allow killing redirect items? What if you have a redirect pointing to an unused item? > Let's assume that we don't want to make VACUUM/pruning just treat > orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise > status I don't think that'd ever be a good idea. Those tuples are visible to a seqscan after all. > -- let's say that we want to err in the direction of doing > nothing at all with the page. Now we have to have a weird error in > VACUUM instead (not great, but better than just spinning between > lazy_scan_prune and heap_prune_page). Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). The problem here is a DEAD orphaned HOT tuples, and those we should be able to delete with the new page pruning logic, right? 
I think it might be worth getting rid of the need for the retry approach by reusing the same HTSV status array between heap_prune_page and lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in lazy_scan_prune() would be some form of corruption. And it'd be a pretty decent performance boost, HTSV ain't cheap. Greetings, Andres Freund
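[Editor's note: the shared-status-array idea can be sketched abstractly. This is a toy model with invented names, not the heapam.c code: `compute_status` stands in for an expensive HeapTupleSatisfiesVacuum()-style check. The point is simply that if the pruning pass records each item's verdict in an array that the later lazy_scan_prune-style pass reads back, the two passes cannot disagree, and the expensive check runs once per item instead of twice:]

```c
#include <assert.h>

/* Toy model of sharing one visibility-status array between pruning
 * and the later per-tuple scan (invented names, not PostgreSQL code). */
typedef enum { TUPLE_LIVE, TUPLE_RECENTLY_DEAD, TUPLE_DEAD } TupleStatus;

/* Stand-in for an expensive HTSV-style visibility check. */
static TupleStatus
compute_status(int offnum)
{
    return (offnum % 2 == 1) ? TUPLE_DEAD : TUPLE_LIVE;
}

/* "Pruning" fills the array once, with one verdict per item. */
static void
prune_page(TupleStatus *statuses, int nitems)
{
    for (int off = 0; off < nitems; off++)
        statuses[off] = compute_status(off);
}

/* The later scan reuses the recorded verdicts instead of calling
 * compute_status() again, so it can never see a "new" DEAD tuple
 * that pruning failed to handle -- removing the need to retry. */
static int
count_remaining_dead(const TupleStatus *statuses, int nitems)
{
    int ndead = 0;

    for (int off = 0; off < nitems; off++)
        if (statuses[off] == TUPLE_DEAD)
            ndead++;
    return ndead;
}
```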
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sat, Feb 19, 2022 at 7:28 PM Andres Freund wrote: > If the vacuum can get the cleanup lock due to the adversarial patch, the > heap_force_kill() doesn't do anything, because the first item is a > redirect. However if it *can't* get a cleanup lock, heap_force_kill() instead > targets the root item. Triggering the endless loop. But it shouldn't matter if the root item is an LP_REDIRECT or a normal (not heap-only) tuple with storage. Either way it's the root of a HOT chain. The fact that pg_surgery treats LP_REDIRECT items differently from the other kind of root items is just arbitrary. It seems to have more to do with freezing tuples than killing tuples. -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sat, Feb 19, 2022 at 6:16 PM Peter Geoghegan wrote: > > Given that heap_surgery's raison d'etre is correcting corruption etc, I > > think > > it makes sense for it to do as minimal work as possible. Iterating through a > > HOT chain would be a problem if you e.g. tried to repair a page with HOT > > corruption. > > I guess that's also true. There is at least a legitimate argument to > be made for not leaving behind any orphaned heap-only tuples. The > interface is a TID, and so the user may already believe that they're > killing the heap-only, not just the root item (since ctid suggests > that the TID of a heap-only tuple is the TID of the root item, which > is kind of misleading). Actually, I would say that heap_surgery's raison d'etre is making weird errors related to corruption of this or that TID go away, so that the user can cut their losses. That's how it's advertised. Let's assume that we don't want to make VACUUM/pruning just treat orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise status -- let's say that we want to err in the direction of doing nothing at all with the page. Now we have to have a weird error in VACUUM instead (not great, but better than just spinning between lazy_scan_prune and heap_prune_page). And we've just created natural demand for heap_surgery to deal with the problem by deleting whole HOT chains (not just root items). If we allow VACUUM to treat orphaned heap-only tuples as DEAD right away, then we might as well do the same thing in heap_surgery, since there is little chance that the user will get to the heap-only tuples before VACUUM does (not something to rely on, at any rate). Either way, I think we probably end up needing to teach heap_surgery to kill entire HOT chains as a group, given a TID. -- Peter Geoghegan
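[Editor's note: the "kill entire HOT chains as a group" idea might look something like the toy model below. This is not the pg_surgery code -- the item kinds and `next` links are simplified stand-ins for line pointers and HOT-chain membership -- but it illustrates the invariant being argued for: given a root TID, every member of the chain is marked unused together, so no orphaned heap-only tuples are left behind:]

```c
#include <assert.h>

#define CHAIN_END (-1)

typedef enum { ITEM_UNUSED, ITEM_NORMAL, ITEM_REDIRECT } ItemKind;

/* Simplified line pointer: 'next' is the chain successor offset
 * (a redirect target or a HOT-updated tuple's successor), or
 * CHAIN_END for the last member. */
typedef struct
{
    ItemKind kind;
    int next;
} Item;

/* Kill an entire HOT chain as a group, starting from its root,
 * rather than killing the root item alone (which would orphan the
 * heap-only members). Real code would also have to defend against
 * cycles, since the whole premise is a corrupted page. */
static void
force_kill_chain(Item *items, int root)
{
    int off = root;

    while (off != CHAIN_END)
    {
        int next = (items[off].kind == ITEM_UNUSED)
            ? CHAIN_END : items[off].next;

        items[off].kind = ITEM_UNUSED;
        items[off].next = CHAIN_END;
        off = next;
    }
}
```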
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-02-19 19:07:39 -0800, Peter Geoghegan wrote: > On Sat, Feb 19, 2022 at 7:01 PM Andres Freund wrote: > > > We can either do that, or we can throw an error concerning corruption > > > when heap_page_prune notices orphaned tuples. Neither seems > > > particularly appealing. But it definitely makes no sense to allow > > > lazy_scan_prune to spin in a futile attempt to reach agreement with > > > heap_page_prune about a DEAD tuple really being DEAD. > > > > Yea, this sucks. I think we should go for the rewrite of the > > heap_prune_chain() logic. The current approach is just never going to be > > robust. > > No, it just isn't robust enough. But it's not that hard to fix. My > patch really wasn't invasive. I think we're in agreement there. We might think at some point about backpatching too, but I'd rather have it stew in HEAD for a bit first. > I confirmed that HeapTupleSatisfiesVacuum() and > heap_prune_satisfies_vacuum() agree that the heap-only tuple at offnum > 2 is HEAPTUPLE_DEAD -- they are in agreement, as expected (so no > reason to think that there is a new bug involved). The problem here is > indeed just that heap_prune_chain() can't "get to" the tuple, given > its current design. Right. The reason that the "adversarial" patch makes a difference is solely that it changes the heap_surgery test to actually kill an item, which it doesn't intend:

create temp table htab2(a int);
insert into htab2 values (100);
update htab2 set a = 200;
vacuum htab2;
-- redirected TIDs should be skipped
select heap_force_kill('htab2'::regclass, ARRAY['(0, 1)']::tid[]);

If the vacuum can get the cleanup lock due to the adversarial patch, the heap_force_kill() doesn't do anything, because the first item is a redirect. However, if it *can't* get a cleanup lock, heap_force_kill() instead targets the root item. Triggering the endless loop. Hm. I think this might be a mild regression in 14. 
In < 14 we'd just skip the tuple in lazy_scan_heap(), but now we have an uninterruptible endless loop. We'd do completely bogus stuff later in < 14 though; I think we'd just leave it in place despite being older than relfrozenxid, which obviously has its own set of issues. Greetings, Andres Freund
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sat, Feb 19, 2022 at 7:01 PM Andres Freund wrote: > It's kind of surprising that this needs this > 0001-Add-adversarial-ConditionalLockBuff to break. I suspect it's a question > of hint bits changing due to lazy_scan_noprune(), which then makes > HeapTupleHeaderIsHotUpdated() have a different return value, preventing the > "If the tuple is DEAD and doesn't chain to anything else" > path from being taken. That makes sense as an explanation. Goes to show just how fragile the "DEAD and doesn't chain to anything else" logic at the top of heap_prune_chain really is. -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sat, Feb 19, 2022 at 7:01 PM Andres Freund wrote: > > We can either do that, or we can throw an error concerning corruption > > when heap_page_prune notices orphaned tuples. Neither seems > > particularly appealing. But it definitely makes no sense to allow > > lazy_scan_prune to spin in a futile attempt to reach agreement with > > heap_page_prune about a DEAD tuple really being DEAD. > > Yea, this sucks. I think we should go for the rewrite of the > heap_prune_chain() logic. The current approach is just never going to be > robust. No, it just isn't robust enough. But it's not that hard to fix. My patch really wasn't invasive. I confirmed that HeapTupleSatisfiesVacuum() and heap_prune_satisfies_vacuum() agree that the heap-only tuple at offnum 2 is HEAPTUPLE_DEAD -- they are in agreement, as expected (so no reason to think that there is a new bug involved). The problem here is indeed just that heap_prune_chain() can't "get to" the tuple, given its current design. For anybody else that doesn't follow what we're talking about: The "doesn't chain to anything else" code at the start of heap_prune_chain() won't get to the heap-only tuple at offnum 2, since the tuple is itself HeapTupleHeaderIsHotUpdated() -- the expectation is that it'll be processed later on, once we locate the HOT chain's root item. Since, of course, the "root item" was already LP_DEAD before we even reached heap_page_prune() (on account of the pg_surgery corruption), there is no possible way that that can happen later on. And so we cannot find the same heap-only tuple and mark it LP_UNUSED (which is how we always deal with HEAPTUPLE_DEAD heap-only tuples) during pruning. -- Peter Geoghegan
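[Editor's note: the stuck state described above can be modeled in a few lines. This is a deliberately simplified re-creation with invented names, not the heap_page_prune() code, but it captures the two rules that interact badly: a HOT-updated heap-only tuple is skipped by the direct path on the assumption that chain traversal from its root will reach it, and an LP_DEAD root starts no chain:]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy item: just the properties relevant to the two traversal rules. */
typedef struct
{
    bool lp_dead;      /* line pointer already LP_DEAD? */
    bool heap_only;    /* tuple has no root item of its own */
    bool hot_updated;  /* tuple has a successor in its HOT chain */
    int  next;         /* chain successor offset, or -1 */
} ToyItem;

/* A DEAD heap-only tuple can be collected either directly (only when
 * it doesn't chain to anything else) or via chain traversal from a
 * usable root item that links to it. */
static bool
dead_tuple_is_reachable(const ToyItem *items, int nitems, int target)
{
    /* Path A: handled at the top of the scan, but only if the tuple
     * is not itself HOT-updated. */
    if (!items[target].hot_updated)
        return true;

    /* Path B: reached by walking a HOT chain from a root item. An
     * LP_DEAD root starts no chain, so nothing ever gets here. */
    for (int off = 0; off < nitems; off++)
    {
        if (!items[off].lp_dead && !items[off].heap_only &&
            items[off].next == target)
            return true;
    }
    return false;
}
```

[On the corrupted page, both paths fail for the DEAD heap-only tuple, while lazy_scan_prune-style logic still sees it as DEAD -- hence the disagreement, and the retry loop.]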
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-02-19 18:16:54 -0800, Peter Geoghegan wrote: > On Sat, Feb 19, 2022 at 5:54 PM Andres Freund wrote: > > How does that cause the endless loop? > > Attached is the page image itself, dumped via gdb (and gzip'd). This > was on recent HEAD (commit 8f388f6f, actually), plus > 0001-Add-adversarial-ConditionalLockBuff[...]. No other changes. No > defragmenting in pg_surgery, nothing like that. > > It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for > > me. So something needs have changed with your patch? > > It doesn't always happen -- only about half the time on my machine. > Maybe it's timing sensitive? Ah, I'd only run the tests three times or so, without it happening. Trying a few more times repro'd it. It's kind of surprising that this needs this 0001-Add-adversarial-ConditionalLockBuff to break. I suspect it's a question of hint bits changing due to lazy_scan_noprune(), which then makes HeapTupleHeaderIsHotUpdated() have a different return value, preventing the "If the tuple is DEAD and doesn't chain to anything else" path from being taken. > We hit the "goto retry" on offnum 2, which is the first tuple with > storage (you can see "the ghost" of the tuple from the LP_DEAD item at > offnum 1, since the page isn't defragmented in pg_surgery). I think > that this happens because the heap-only tuple at offnum 2 is fully > DEAD to lazy_scan_prune, but hasn't been recognized as such by > heap_page_prune. There is no way that they'll ever "agree" on the > tuple being DEAD right now, because pruning still doesn't assume that > an orphaned heap-only tuple is fully DEAD. > We can either do that, or we can throw an error concerning corruption > when heap_page_prune notices orphaned tuples. Neither seems > particularly appealing. But it definitely makes no sense to allow > lazy_scan_prune to spin in a futile attempt to reach agreement with > heap_page_prune about a DEAD tuple really being DEAD. Yea, this sucks. 
I think we should go for the rewrite of the heap_prune_chain() logic. The current approach is just never going to be robust. Greetings, Andres Freund
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sat, Feb 19, 2022 at 5:54 PM Andres Freund wrote: > How does that cause the endless loop? Attached is the page image itself, dumped via gdb (and gzip'd). This was on recent HEAD (commit 8f388f6f, actually), plus 0001-Add-adversarial-ConditionalLockBuff[...]. No other changes. No defragmenting in pg_surgery, nothing like that. > It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for > me. So something needs have changed with your patch? It doesn't always happen -- only about half the time on my machine. Maybe it's timing sensitive? We hit the "goto retry" on offnum 2, which is the first tuple with storage (you can see "the ghost" of the tuple from the LP_DEAD item at offnum 1, since the page isn't defragmented in pg_surgery). I think that this happens because the heap-only tuple at offnum 2 is fully DEAD to lazy_scan_prune, but hasn't been recognized as such by heap_page_prune. There is no way that they'll ever "agree" on the tuple being DEAD right now, because pruning still doesn't assume that an orphaned heap-only tuple is fully DEAD. We can either do that, or we can throw an error concerning corruption when heap_page_prune notices orphaned tuples. Neither seems particularly appealing. But it definitely makes no sense to allow lazy_scan_prune to spin in a futile attempt to reach agreement with heap_page_prune about a DEAD tuple really being DEAD. > Given that heap_surgery's raison d'etre is correcting corruption etc, I think > it makes sense for it to do as minimal work as possible. Iterating through a > HOT chain would be a problem if you e.g. tried to repair a page with HOT > corruption. I guess that's also true. There is at least a legitimate argument to be made for not leaving behind any orphaned heap-only tuples. 
The interface is a TID, and so the user may already believe that they're killing the heap-only tuple, not just the root item (since ctid suggests that the TID of a heap-only tuple is the TID of the root item, which is kind of misleading). Anyway, we can decide on what to do in heap_surgery later, once the main issue is under control. My point was mostly just that orphaned heap-only tuples are definitely not okay, in general. They are the least worst option when corruption has already happened, maybe -- but maybe not. -- Peter Geoghegan corrupt-hot-chain.page.gz Description: GNU Zip compressed data
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-02-19 17:22:33 -0800, Peter Geoghegan wrote: > Looks like pg_surgery isn't processing HOT chains as whole units, > which it really should (at least in the context of killing items via > the heap_force_kill() function). Killing a root item in a HOT chain is > just hazardous -- disconnected/orphaned heap-only tuples are liable to > cause chaos, and should be avoided everywhere (including during > pruning, and within pg_surgery). How does that cause the endless loop? It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for me. So something must have changed with your patch? > It's likely that the hardening I already planned on adding to pruning > [1] (as follow-up work to recent bugfix commit 18b87b201f) will > prevent lazy_scan_prune from getting stuck like this, whatever the > cause happens to be. Yea, we should pick that up again. Not just for robustness or performance. Also because it's just a lot easier to understand. > Leaving behind disconnected/orphaned heap-only tuples is pretty much > pointless anyway, since they'll never be accessible by index scans. > Even after a REINDEX, since there is no root item from the heap page > to go in the index. (A dump and restore might work better, though.) Given that heap_surgery's raison d'etre is correcting corruption etc, I think it makes sense for it to do as minimal work as possible. Iterating through a HOT chain would be a problem if you e.g. tried to repair a page with HOT corruption. Greetings, Andres Freund
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sat, Feb 19, 2022 at 4:22 PM Peter Geoghegan wrote: > This very much looks like a bug in pg_surgery itself now -- attached > is a draft fix. Wait, that's not it either. I jumped the gun -- this isn't sufficient (though the patch I posted might not be a bad idea anyway). Looks like pg_surgery isn't processing HOT chains as whole units, which it really should (at least in the context of killing items via the heap_force_kill() function). Killing a root item in a HOT chain is just hazardous -- disconnected/orphaned heap-only tuples are liable to cause chaos, and should be avoided everywhere (including during pruning, and within pg_surgery). It's likely that the hardening I already planned on adding to pruning [1] (as follow-up work to recent bugfix commit 18b87b201f) will prevent lazy_scan_prune from getting stuck like this, whatever the cause happens to be. The actual page image I see lazy_scan_prune choke on (i.e. exhibit the same infinite loop unpleasantness we've seen before) is not in a consistent state at all (its tuples all belong to a single HOT chain, and the HOT chain is totally inconsistent on account of having an LP_DEAD line pointer as its root item). pg_surgery could in principle do the right thing here by always treating HOT chains as whole units. Leaving behind disconnected/orphaned heap-only tuples is pretty much pointless anyway, since they'll never be accessible by index scans. Even after a REINDEX, since there is no root item from the heap page to go in the index. (A dump and restore might work better, though.) [1] https://postgr.es/m/cah2-wzmnk6v6tqzuuabxoxm8hjrawu6h12toas-bqycliht...@mail.gmail.com -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sat, Feb 19, 2022 at 3:08 PM Peter Geoghegan wrote: > It's quite possible that this is nothing more than a bug in my > adversarial gizmo patch -- since I don't think that > ConditionalLockBufferForCleanup() can ever fail with a temp buffer > (though even that's not completely clear right now). Even if the > behavior that I saw does not indicate a bug on HEAD, it still seems > informative. This very much looks like a bug in pg_surgery itself now -- attached is a draft fix. The temp table thing was a red herring. I found I could get exactly the same kind of failure when htab2 was a permanent table (which was how it originally appeared, before commit 0811f766fd made it into a temp table due to test flappiness issues). The relevant "vacuum freeze htab2" happens at a point after the test has already deliberately corrupted one of its tuples using heap_force_kill(). It's not that we aren't careful enough about the corruption at some point in vacuumlazy.c, which was my second theory. But I quickly discarded that idea, and came up with a third theory: the relevant heap_surgery.c path calls ItemIdSetDead() to kill items without also defragmenting the page to remove the tuples' storage, which is wrong. This meant that we depended on pruning happening (in this case during VACUUM) and defragmenting the page in passing. But there is no reason to not defragment the page within pg_surgery (at least no obvious reason), since we have a cleanup lock anyway. Theoretically you could blame this on lazy_scan_noprune instead, since it thinks it can collect LP_DEAD items while assuming that they have no storage, but that doesn't make much sense to me. There has never been any way of setting a heap item to LP_DEAD without also defragmenting the page, since that's exactly what it means to prune a heap page. (Actually, the same used to be true about heap vacuuming, which worked more like heap pruning before Postgres 14, but that doesn't seem important.)
-- Peter Geoghegan

From 81f01ca623b115647ee78a1b09bbb4458fb35dab Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Sat, 19 Feb 2022 16:13:48 -0800
Subject: [PATCH 2/2] Fix for pg_surgery's heap_force_kill() function.

---
 contrib/pg_surgery/heap_surgery.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index 3e641aa64..a3a193ba5 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -311,7 +311,8 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
 	 */
 	if (did_modify_page)
 	{
-		/* Mark buffer dirty before we write WAL. */
+		/* Defragment and mark buffer dirty before we write WAL. */
+		PageRepairFragmentation(page);
 		MarkBufferDirty(buf);

 		/* XLOG stuff */
-- 
2.30.2
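[Editor's note: the broken invariant and the effect of the one-line fix can be sketched with a toy page model in Python. This is an illustrative assumption, not pg_surgery's actual data structures: items carry a line-pointer state plus tuple storage, `force_kill` stands in for heap_force_kill()'s ItemIdSetDead() calls, and `repair_fragmentation` stands in for the PageRepairFragmentation() call added by the draft patch.]

```python
# Toy page model: each item has a line-pointer state and some tuple storage.
# All names are illustrative; this is not pg_surgery's real representation.

def force_kill(page, offnums):
    """Mimic heap_force_kill: mark items LP_DEAD (ItemIdSetDead), leaving
    their tuple storage untouched -- the buggy behavior."""
    for off in offnums:
        page[off]["state"] = "LP_DEAD"

def repair_fragmentation(page):
    """Mimic PageRepairFragmentation: reclaim storage of non-normal items,
    which is the extra step the draft fix performs under the cleanup lock."""
    for it in page.values():
        if it["state"] != "LP_NORMAL":
            it["storage"] = None

page = {
    1: {"state": "LP_NORMAL", "storage": b"tuple-1"},
    2: {"state": "LP_NORMAL", "storage": b"tuple-2"},
}

force_kill(page, [1])
# Broken invariant: an LP_DEAD item still has storage, a state that
# ordinary pruning never produces.
broken = page[1]["state"] == "LP_DEAD" and page[1]["storage"] is not None

repair_fragmentation(page)  # the draft fix's added step
fixed = page[1]["storage"] is None and page[2]["storage"] == b"tuple-2"
```

Without the repair step, the page depends on a later prune pass to reclaim the dead item's storage, which is exactly the dependency the email describes.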
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, (On phone, so crappy formatting and no source access) On February 19, 2022 3:08:41 PM PST, Peter Geoghegan wrote: >On Fri, Feb 18, 2022 at 5:00 PM Peter Geoghegan wrote: >> Another testing strategy occurs to me: we could stress-test the >> implementation by simulating an environment where the no-cleanup-lock >> path is hit an unusually large number of times, possibly a fixed >> percentage of the time (like 1%, 5%), say by making vacuumlazy.c's >> ConditionalLockBufferForCleanup() call return false randomly. Now that >> we have lazy_scan_noprune for the no-cleanup-lock path (which is as >> similar to the regular lazy_scan_prune path as possible), I wouldn't >> expect this ConditionalLockBufferForCleanup() testing gizmo to be too >> disruptive. > >I tried this out, using the attached patch. It was quite interesting, >even when run against HEAD. I think that I might have found a bug on >HEAD, though I'm not really sure. > >If you modify the patch to simulate conditions under which >ConditionalLockBufferForCleanup() fails about 2% of the time, you get >much better coverage of lazy_scan_noprune/heap_tuple_needs_freeze, >without it being so aggressive as to make "make check-world" fail -- >which is exactly what I expected. If you are much more aggressive >about it, and make it 50% instead (which you can get just by using the >patch as written), then some tests will fail, mostly for reasons that >aren't surprising or interesting (e.g. plan changes). This is also >what I'd have guessed would happen. > >However, it gets more interesting. One thing that I did not expect to >happen at all also happened (with the current 50% rate of simulated >ConditionalLockBufferForCleanup() failure from the patch): if I run >"make check" from the pg_surgery directory, then the Postgres backend >gets stuck in an infinite loop inside lazy_scan_prune, which has been >a symptom of several tricky bugs in the past year (not every time, but >usually). 
Specifically, the VACUUM statement launched by the SQL >command "vacuum freeze htab2;" from the file >contrib/pg_surgery/sql/heap_surgery.sql, at line 54 leads to this >misbehavior. >This is a temp table, which is a choice made by the tests specifically >because they need to "use a temp table so that vacuum behavior doesn't >depend on global xmin". This is convenient way of avoiding spurious >regression tests failures (e.g. from autoanalyze), and relies on the >GlobalVisTempRels behavior established by Andres' 2020 bugfix commit >94bc27b5. We don't have a blocking path for cleanup locks of temporary buffers IIRC (normally not reachable). So I wouldn't be surprised if a cleanup lock failing would cause some odd behavior. >It's quite possible that this is nothing more than a bug in my >adversarial gizmo patch -- since I don't think that >ConditionalLockBufferForCleanup() can ever fail with a temp buffer >(though even that's not completely clear right now). Even if the >behavior that I saw does not indicate a bug on HEAD, it still seems >informative. At the very least, it wouldn't hurt to Assert() that the >target table isn't a temp table inside lazy_scan_noprune, documenting >our assumptions around temp tables and >ConditionalLockBufferForCleanup(). Definitely worth looking into more. This reminds me of a recent thing I noticed in the aio patch. Spgist can end up busy looping when buffers are locked, instead of blocking. Not actually related, of course. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 18, 2022 at 5:00 PM Peter Geoghegan wrote: > Another testing strategy occurs to me: we could stress-test the > implementation by simulating an environment where the no-cleanup-lock > path is hit an unusually large number of times, possibly a fixed > percentage of the time (like 1%, 5%), say by making vacuumlazy.c's > ConditionalLockBufferForCleanup() call return false randomly. Now that > we have lazy_scan_noprune for the no-cleanup-lock path (which is as > similar to the regular lazy_scan_prune path as possible), I wouldn't > expect this ConditionalLockBufferForCleanup() testing gizmo to be too > disruptive. I tried this out, using the attached patch. It was quite interesting, even when run against HEAD. I think that I might have found a bug on HEAD, though I'm not really sure. If you modify the patch to simulate conditions under which ConditionalLockBufferForCleanup() fails about 2% of the time, you get much better coverage of lazy_scan_noprune/heap_tuple_needs_freeze, without it being so aggressive as to make "make check-world" fail -- which is exactly what I expected. If you are much more aggressive about it, and make it 50% instead (which you can get just by using the patch as written), then some tests will fail, mostly for reasons that aren't surprising or interesting (e.g. plan changes). This is also what I'd have guessed would happen. However, it gets more interesting. One thing that I did not expect to happen at all also happened (with the current 50% rate of simulated ConditionalLockBufferForCleanup() failure from the patch): if I run "make check" from the pg_surgery directory, then the Postgres backend gets stuck in an infinite loop inside lazy_scan_prune, which has been a symptom of several tricky bugs in the past year (not every time, but usually). Specifically, the VACUUM statement launched by the SQL command "vacuum freeze htab2;" from the file contrib/pg_surgery/sql/heap_surgery.sql, at line 54 leads to this misbehavior. 
This is a temp table, which is a choice made by the tests specifically because they need to "use a temp table so that vacuum behavior doesn't depend on global xmin". This is a convenient way of avoiding spurious regression test failures (e.g. from autoanalyze), and relies on the GlobalVisTempRels behavior established by Andres' 2020 bugfix commit 94bc27b5. It's quite possible that this is nothing more than a bug in my adversarial gizmo patch -- since I don't think that ConditionalLockBufferForCleanup() can ever fail with a temp buffer (though even that's not completely clear right now). Even if the behavior that I saw does not indicate a bug on HEAD, it still seems informative. At the very least, it wouldn't hurt to Assert() that the target table isn't a temp table inside lazy_scan_noprune, documenting our assumptions around temp tables and ConditionalLockBufferForCleanup(). I haven't actually tried to debug the issue just yet, so take all this with a grain of salt.

-- Peter Geoghegan

From 3f01281af3ba81b35777cb7d717f76e001fd3e10 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Sat, 19 Feb 2022 14:07:35 -0800
Subject: [PATCH] Add adversarial ConditionalLockBufferForCleanup() gizmo to
 vacuumlazy.c.
---
 src/backend/access/heap/vacuumlazy.c | 36 +++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 242511a23..31c6b360e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -50,6 +50,7 @@
 #include "commands/dbcommands.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "optimizer/paths.h"
@@ -748,6 +749,39 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	}
 }

+/*
+ * Adversarial gizmo, simulates excessive failure to get cleanup locks
+ */
+static inline bool
+lazy_conditionallockbufferforcleanup(Buffer buffer)
+{
+	/*
+	 * Artificially fail to get a cleanup lock 50% of the time.
+	 *
+	 * XXX: What about temp tables?  We simulate not getting a cleanup lock
+	 * there, but is that choice actually reasonable?
+	 */
+	if (pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 2))
+		return false;
+
+#if 0
+	/*
+	 * 50% is very very aggressive, while 2% - 5% is still basically
+	 * adversarial but in many ways less annoying.
+	 *
+	 * This version (which injects a failure to get a cleanup lock 2% of the
+	 * time) seems to pass the regression tests, even with my parallel make
+	 * check-world recipe.  Expected query plans don't seem to shift on
+	 * account of unexpected index bloat (nor are there any problems of a
+	 * similar nature) with this variant of the gizmo.
+	 */
+	if (pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 50))
+
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 18, 2022 at 2:11 PM Andres Freund wrote: > I think it'd be good to add a few isolationtest cases for the > can't-get-cleanup-lock paths. I think it shouldn't be hard using cursors. The > slightly harder part is verifying that VACUUM did something reasonable, but > that still should be doable? We could even just extend existing, related tests, from vacuum-reltuples.spec. Another testing strategy occurs to me: we could stress-test the implementation by simulating an environment where the no-cleanup-lock path is hit an unusually large number of times, possibly a fixed percentage of the time (like 1%, 5%), say by making vacuumlazy.c's ConditionalLockBufferForCleanup() call return false randomly. Now that we have lazy_scan_noprune for the no-cleanup-lock path (which is as similar to the regular lazy_scan_prune path as possible), I wouldn't expect this ConditionalLockBufferForCleanup() testing gizmo to be too disruptive. -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 18, 2022 at 1:56 PM Robert Haas wrote: > + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current > + * target relfrozenxid and relminmxid for the relation. Assumption is that > > "maintains" is fuzzy. I think you should be saying something much more > explicit, and the thing you are saying should make it clear that these > arguments are input-output arguments: i.e. the caller must set them > correctly before calling this function, and they will be updated by > the function. Makes sense. > I don't think you have to spell all of that out in every > place where this comes up in the patch, but it needs to be clear from > what you do say. For example, I would be happier with a comment that > said something like "Every call to this function will either set > HEAP_XMIN_FROZEN in the xl_heap_freeze_tuple struct passed as an > argument, or else reduce *NewRelfrozenxid to the xmin of the tuple if > it is currently newer than that. Thus, after a series of calls to this > function, *NewRelfrozenxid represents a lower bound on unfrozen xmin > values in the tuples examined. Before calling this function, caller > should initialize *NewRelfrozenxid to ." We have to worry about XIDs from MultiXacts (and xmax values more generally). And we have to worry about the case where we start out with only xmin frozen (by an earlier VACUUM), and then have to freeze xmax too. I believe that we have to generally consider xmin and xmax independently. For example, we cannot ignore xmax, just because we looked at xmin, since in general xmin alone might have already been frozen. > + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current > + * target relfrozenxid and relminmxid for the relation. Assumption is that > + * caller will never freeze any of the XIDs from the tuple, even when we say > + * that they should. 
If caller opts to go with our recommendation to freeze, > + * then it must account for the fact that it shouldn't trust how we've set > + * NewRelfrozenxid/NewRelminmxid. (In practice aggressive VACUUMs always > take > + * our recommendation because they must, and non-aggressive VACUUMs always > opt > + * to not freeze, preferring to ratchet back NewRelfrozenxid instead). > > I don't understand this one. > > +* (Actually, we maintain NewRelminmxid differently here, because we > +* assume that XIDs that should be frozen according to cutoff_xid > won't > +* be, whereas heap_prepare_freeze_tuple makes the opposite > assumption.) > > This one either. The difference between the cleanup lock path (in lazy_scan_prune/heap_prepare_freeze_tuple) and the share lock path (in lazy_scan_noprune/heap_tuple_needs_freeze) is what is at issue in both of these confusing comment blocks, really. Note that cutoff_xid is the name that both heap_prepare_freeze_tuple and heap_tuple_needs_freeze have for FreezeLimit (maybe we should rename every occurrence of cutoff_xid in heapam.c to FreezeLimit). At a high level, we aren't changing the fundamental definition of an aggressive VACUUM in any of the patches -- we still need to advance relfrozenxid up to FreezeLimit in an aggressive VACUUM, just like on HEAD, today (we may be able to advance it *past* FreezeLimit, but that's just a bonus). But in a non-aggressive VACUUM, where there is still no strict requirement to advance relfrozenxid (by any amount), the code added by 0001 can set relfrozenxid to any known safe value, which could either be from before FreezeLimit, or after FreezeLimit -- almost anything is possible (provided we respect the relfrozenxid invariant, and provided we see that we didn't skip any all-visible-not-all-frozen pages).
Since we still need to "respect FreezeLimit" in an aggressive VACUUM, the aggressive case might need to wait for a full cleanup lock the hard way, having tried and failed to do it the easy way within lazy_scan_noprune (lazy_scan_noprune will still return false when any call to heap_tuple_needs_freeze for any tuple returns true) -- same as on HEAD, today. And so the difference at issue here is: FreezeLimit/cutoff_xid only needs to affect the new NewRelfrozenxid value we use for relfrozenxid in heap_prepare_freeze_tuple, which is involved in real freezing -- not in heap_tuple_needs_freeze, whose main purpose is still to help us avoid freezing where a cleanup lock isn't immediately available. While the purpose of FreezeLimit/cutoff_xid within heap_tuple_needs_freeze is to determine its bool return value, which will only be of interest to the aggressive case (which might have to get a cleanup lock and do it the hard way), not the non-aggressive case (where ratcheting back NewRelfrozenxid is generally possible, and generally leaves us with almost as good a value). In other words: the calls to heap_tuple_needs_freeze made from lazy_scan_noprune are simply concerned with the page as it actually is, whereas the similar/corresponding calls to heap_prepare_freeze_tuple from
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-02-18 15:54:19 -0500, Robert Haas wrote: > > Are there any objections to this plan? > > I really like the idea of reducing the scope of what is being changed > here, and I agree that eagerly advancing relfrozenxid carries much > less risk than the other changes. Sounds good to me too! Greetings, Andres Freund
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Hi, On 2022-02-18 13:09:45 -0800, Peter Geoghegan wrote: > 0001 is tricky in the sense that there are a lot of fine details, and > if you get any one of them wrong the result might be a subtle bug. For > example, the heap_tuple_needs_freeze() code path is only used when we > cannot get a cleanup lock, which is rare -- and some of the branches > within the function are relatively rare themselves. The obvious > concern is: What if some detail of how we track the new relfrozenxid > value (and new relminmxid value) in this seldom-hit codepath is just > wrong, in whatever way we didn't think of? I think it'd be good to add a few isolationtest cases for the can't-get-cleanup-lock paths. I think it shouldn't be hard using cursors. The slightly harder part is verifying that VACUUM did something reasonable, but that still should be doable? Greetings, Andres Freund
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 18, 2022 at 4:10 PM Peter Geoghegan wrote: > It does not change any other behavior. It's totally mechanical. > > 0001 is tricky in the sense that there are a lot of fine details, and > if you get any one of them wrong the result might be a subtle bug. For > example, the heap_tuple_needs_freeze() code path is only used when we > cannot get a cleanup lock, which is rare -- and some of the branches > within the function are relatively rare themselves. The obvious > concern is: What if some detail of how we track the new relfrozenxid > value (and new relminmxid value) in this seldom-hit codepath is just > wrong, in whatever way we didn't think of? Right. I think we have no choice but to accept such risks if we want to make any progress here, and every patch carries them to some degree. I hope that someone else will review this patch in more depth than I have just now, but what I notice reading through it is that some of the comments seem pretty opaque. For instance: + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current + * target relfrozenxid and relminmxid for the relation. Assumption is that "maintains" is fuzzy. I think you should be saying something much more explicit, and the thing you are saying should make it clear that these arguments are input-output arguments: i.e. the caller must set them correctly before calling this function, and they will be updated by the function. I don't think you have to spell all of that out in every place where this comes up in the patch, but it needs to be clear from what you do say. For example, I would be happier with a comment that said something like "Every call to this function will either set HEAP_XMIN_FROZEN in the xl_heap_freeze_tuple struct passed as an argument, or else reduce *NewRelfrozenxid to the xmin of the tuple if it is currently newer than that. 
Thus, after a series of calls to this function, *NewRelfrozenxid represents a lower bound on unfrozen xmin values in the tuples examined. Before calling this function, caller should initialize *NewRelfrozenxid to ." +* Changing nothing, so might have to ratchet back NewRelminmxid, +* NewRelfrozenxid, or both together This comment I like. +* New multixact might have remaining XID older than +* NewRelfrozenxid This one's good, too. + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current + * target relfrozenxid and relminmxid for the relation. Assumption is that + * caller will never freeze any of the XIDs from the tuple, even when we say + * that they should. If caller opts to go with our recommendation to freeze, + * then it must account for the fact that it shouldn't trust how we've set + * NewRelfrozenxid/NewRelminmxid. (In practice aggressive VACUUMs always take + * our recommendation because they must, and non-aggressive VACUUMs always opt + * to not freeze, preferring to ratchet back NewRelfrozenxid instead). I don't understand this one. +* (Actually, we maintain NewRelminmxid differently here, because we +* assume that XIDs that should be frozen according to cutoff_xid won't +* be, whereas heap_prepare_freeze_tuple makes the opposite assumption.) This one either. I haven't really grokked exactly what is happening in heap_tuple_needs_freeze yet, and may not have time to study it further in the near future. Not saying it's wrong, although improving the comments above would likely help me out. > > If there's a way you can make the precise contents of 0002 and 0003 > > more clear, I would like that, too. > > The really big one is 0002 -- even 0003 (the FSM PageIsAllVisible() > thing) wasn't on the table before now. 
0002 is the patch that changes > the basic criteria for freezing, making it block-based rather than > based on the FreezeLimit cutoff (barring edge cases that are important > for correctness, but shouldn't noticeably affect freezing overhead). > > The single biggest practical improvement from 0002 is that it > eliminates what I've called the freeze cliff, which is where many old > tuples (much older than FreezeLimit/vacuum_freeze_min_age) must be > frozen all at once, in a balloon payment during an eventual aggressive > VACUUM. Although it's easy to see that that could be useful, it is > harder to justify (much harder) than anything else. Because we're > freezing more eagerly overall, we're also bound to do more freezing > without benefit in certain cases. Although I think that this can be > justified as the cost of doing business, that's a hard argument to > make. You've used the term "freezing cliff" repeatedly in earlier emails, and this is the first time I've been able to understand what you meant. I'm glad I do, now. But can you describe the algorithm that 0002 uses to accomplish this improvement? Like "if it sees that the page meets criteria X, then it freezes all tuples on the page, else if it sees
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 18, 2022 at 12:54 PM Robert Haas wrote: > I'd like to have a clearer idea of exactly what is in each of the > remaining patches before forming a final opinion. Great. > What's tricky about 0001? Does it change any other behavior, either as > a necessary component of advancing relfrozenxid more eagerly, or > otherwise? It does not change any other behavior. It's totally mechanical. 0001 is tricky in the sense that there are a lot of fine details, and if you get any one of them wrong the result might be a subtle bug. For example, the heap_tuple_needs_freeze() code path is only used when we cannot get a cleanup lock, which is rare -- and some of the branches within the function are relatively rare themselves. The obvious concern is: What if some detail of how we track the new relfrozenxid value (and new relminmxid value) in this seldom-hit codepath is just wrong, in whatever way we didn't think of? On the other hand, we must already be precise in almost the same way within heap_tuple_needs_freeze() today -- it's not all that different (we currently need to avoid leaving any XIDs < FreezeLimit behind, which isn't made that less complicated by the fact that it's a static XID cutoff). Plus, we have experience with bugs like this. There was hardening added to catch stuff like this back in 2017, following the "freeze the dead" bug. > If there's a way you can make the precise contents of 0002 and 0003 > more clear, I would like that, too. The really big one is 0002 -- even 0003 (the FSM PageIsAllVisible() thing) wasn't on the table before now. 0002 is the patch that changes the basic criteria for freezing, making it block-based rather than based on the FreezeLimit cutoff (barring edge cases that are important for correctness, but shouldn't noticeably affect freezing overhead). 
The single biggest practical improvement from 0002 is that it eliminates what I've called the freeze cliff, which is where many old tuples (much older than FreezeLimit/vacuum_freeze_min_age) must be frozen all at once, in a balloon payment during an eventual aggressive VACUUM. Although it's easy to see that that could be useful, it is harder to justify (much harder) than anything else. Because we're freezing more eagerly overall, we're also bound to do more freezing without benefit in certain cases. Although I think that this can be justified as the cost of doing business, that's a hard argument to make. In short, 0001 is mechanically tricky, but easy to understand at a high level. Whereas 0002 is mechanically simple, but tricky to understand at a high level (and therefore far trickier than 0001 overall). -- Peter Geoghegan
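To make the high-level behavior attributed to 0002 concrete, here is a toy sketch of the page-level freezing decision as described in this thread. All names here are invented for illustration (this is not the patch's actual code, which lives in vacuumlazy.c), and the plain `<` comparison stands in for Postgres' wraparound-aware TransactionIdPrecedes():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for Postgres' TransactionId */
typedef uint32_t TransactionId;

/*
 * Block-driven freezing, sketched: freeze every tuple on the page when
 * the page is about to become all-visible (so it can be marked
 * all-frozen at the same time), with FreezeLimit retained only as a
 * backstop for especially old XIDs.  Returns whether this page's
 * tuples get frozen.
 */
static bool
should_freeze_page(bool page_will_become_all_visible,
                   TransactionId oldest_tuple_xid,
                   TransactionId freeze_limit)
{
    if (page_will_become_all_visible)
        return true;            /* freeze whole page, mark all-frozen */

    /*
     * Backstop: an XID from before FreezeLimit forces freezing anyway
     * (toy "<" comparison; real code must be wraparound-aware).
     */
    return oldest_tuple_xid < freeze_limit;
}
```

Under a rule of this shape, differences in XIDs among tuples on the same page stop mattering: barring the backstop, either the whole page is frozen or none of it is.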
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 18, 2022 at 3:41 PM Peter Geoghegan wrote: > Concerns about my general approach to this project (and even the > Postgres 14 VACUUM work) were expressed by Robert and Andres over on > the "Nonrandom scanned_pages distorts pg_class.reltuples set by > VACUUM" thread. Some of what was said honestly shocked me. It now > seems unwise to pursue this project on my original timeline. I even > thought about shelving it indefinitely (which is still on the table). > > I propose the following compromise: the least contentious patch alone > will be in scope for Postgres 15, while the other patches will not be. > I'm referring to the first patch from v8, which adds dynamic tracking > of the oldest extant XID in each heap table, in order to be able to > use it as our new relfrozenxid. I can't imagine that I'll have > difficulty convincing Andres of the merits of this idea, for one, > since it was his idea in the first place. It makes a lot of sense, > independent of any change to how and when we freeze. > > The first patch is tricky, but at least it won't require elaborate > performance validation. It doesn't change any of the basic performance > characteristics of VACUUM. It sometimes allows us to advance > relfrozenxid to a value beyond FreezeLimit (typically only possible in > an aggressive VACUUM), which is an intrinsic good. If it isn't > effective then the overhead seems very unlikely to be noticeable. It's > pretty much a strictly additive improvement. > > Are there any objections to this plan? I really like the idea of reducing the scope of what is being changed here, and I agree that eagerly advancing relfrozenxid carries much less risk than the other changes. I'd like to have a clearer idea of exactly what is in each of the remaining patches before forming a final opinion. What's tricky about 0001? Does it change any other behavior, either as a necessary component of advancing relfrozenxid more eagerly, or otherwise? 
If there's a way you can make the precise contents of 0002 and 0003 more clear, I would like that, too. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 11, 2022 at 8:30 PM Peter Geoghegan wrote: > Attached is v8. No real changes -- just a rebased version. Concerns about my general approach to this project (and even the Postgres 14 VACUUM work) were expressed by Robert and Andres over on the "Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM" thread. Some of what was said honestly shocked me. It now seems unwise to pursue this project on my original timeline. I even thought about shelving it indefinitely (which is still on the table). I propose the following compromise: the least contentious patch alone will be in scope for Postgres 15, while the other patches will not be. I'm referring to the first patch from v8, which adds dynamic tracking of the oldest extant XID in each heap table, in order to be able to use it as our new relfrozenxid. I can't imagine that I'll have difficulty convincing Andres of the merits of this idea, for one, since it was his idea in the first place. It makes a lot of sense, independent of any change to how and when we freeze. The first patch is tricky, but at least it won't require elaborate performance validation. It doesn't change any of the basic performance characteristics of VACUUM. It sometimes allows us to advance relfrozenxid to a value beyond FreezeLimit (typically only possible in an aggressive VACUUM), which is an intrinsic good. If it isn't effective then the overhead seems very unlikely to be noticeable. It's pretty much a strictly additive improvement. Are there any objections to this plan? -- Peter Geoghegan
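The "dynamic tracking of the oldest extant XID" idea can be sketched as a running minimum maintained while VACUUM scans the table. This is an invented illustration, not the patch's code: the real patch threads this tracking through lazy_scan_heap() and heap_tuple_needs_freeze(), and must use wraparound-aware XID comparisons rather than the toy `<` below:

```c
#include <stdint.h>

/* Toy stand-in for Postgres' TransactionId */
typedef uint32_t TransactionId;

#define InvalidTransactionId ((TransactionId) 0)

/*
 * Sketch of oldest-extant-XID tracking: every XID that will remain in
 * the table after this VACUUM ratchets a running minimum downward.
 * VACUUM starts from an optimistic candidate (no older than
 * OldestXmin); the final value becomes the new relfrozenxid.
 */
static TransactionId
track_oldest_extant_xid(const TransactionId *remaining_xids, int n,
                        TransactionId candidate)
{
    for (int i = 0; i < n; i++)
    {
        /*
         * Toy "<" comparison; real code must use wraparound-aware
         * TransactionIdPrecedes(), and must also cover the rare
         * no-cleanup-lock path (heap_tuple_needs_freeze()).
         */
        if (remaining_xids[i] != InvalidTransactionId &&
            remaining_xids[i] < candidate)
            candidate = remaining_xids[i];
    }
    return candidate;
}
```

The "tricky fine details" discussed upthread are exactly the cases this sketch glosses over: MultiXactIds, pages VACUUM couldn't cleanup-lock, and XID wraparound.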
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sat, Jan 29, 2022 at 8:42 PM Peter Geoghegan wrote: > Attached is v7, a revision that overhauls the algorithm that decides > what to freeze. I'm now calling it block-driven freezing in the commit > message. Also included is a new patch, that makes VACUUM record zero > free space in the FSM for an all-visible page, unless the total amount > of free space happens to be greater than one half of BLCKSZ. I pushed the earlier refactoring and instrumentation patches today. Attached is v8. No real changes -- just a rebased version. It will be easier to benchmark and test the page-driven freezing stuff now, since the master/baseline case will now output instrumentation showing how relfrozenxid has been advanced (if at all) -- whether (and to what extent) each VACUUM operation advances relfrozenxid can now be directly compared, just by monitoring the log_autovacuum_min_duration output for a given table over time. -- Peter Geoghegan

From 41136d2a8af434a095ce3e6dfdfbe4b48b9ec338 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Sun, 23 Jan 2022 21:10:38 -0800
Subject: [PATCH v8 3/3] Add all-visible FSM heuristic.

When recording free space in an all-frozen page, record that the page has zero free space when it has less than half BLCKSZ worth of space, according to the traditional definition. Otherwise record free space as usual. Making all-visible pages resistant to change like this can be thought of as a form of hysteresis. The page is given an opportunity to "settle" and permanently stay in the same state when the tuples on the page will never be updated or deleted. But when they are updated or deleted, the page can once again be used to store any tuple. Over time, most pages tend to settle permanently in many workloads, perhaps only on the second or third attempt.
---
 src/backend/access/heap/vacuumlazy.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index ea4b75189..95049ed25 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1231,6 +1231,13 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
 			 */
 			freespace = PageGetHeapFreeSpace(page);
 
+			/*
+			 * An all-visible page should not have its free space
+			 * available from FSM unless it's more than half empty
+			 */
+			if (PageIsAllVisible(page) && freespace < BLCKSZ / 2)
+				freespace = 0;
+
 			UnlockReleaseBuffer(buf);
 			RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
 			continue;
@@ -1368,6 +1375,13 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
 		{
 			Size		freespace = PageGetHeapFreeSpace(page);
 
+			/*
+			 * An all-visible page should not have its free space available
+			 * from FSM unless it's more than half empty
+			 */
+			if (PageIsAllVisible(page) && freespace < BLCKSZ / 2)
+				freespace = 0;
+
 			UnlockReleaseBuffer(buf);
 			RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
 		}
@@ -2537,6 +2551,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 		page = BufferGetPage(buf);
 		freespace = PageGetHeapFreeSpace(page);
 
+		/*
+		 * An all-visible page should not have its free space available from
+		 * FSM unless it's more than half empty
+		 */
+		if (PageIsAllVisible(page) && freespace < BLCKSZ / 2)
+			freespace = 0;
+
 		UnlockReleaseBuffer(buf);
 		RecordPageWithFreeSpace(vacrel->rel, tblk, freespace);
 		vacuumed_pages++;
-- 
2.30.2

From 4838bd1f11b748d2082caedfe4da506b8fe3f67a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Mon, 13 Dec 2021 15:00:49 -0800
Subject: [PATCH v8 2/3] Make block-level characteristics drive freezing.

Teach VACUUM to freeze all of the tuples on a page whenever it notices that it would otherwise mark the page all-visible, without also marking it all-frozen. VACUUM won't freeze _any_ tuples on the page unless _all_ tuples (that remain after pruning) are all-visible.
It may occasionally be necessary to freeze the page due to the presence of a particularly old XID, from before VACUUM's FreezeLimit cutoff. But the FreezeLimit mechanism will seldom have any impact on which pages are frozen anymore -- it is just a backstop now. Freezing can now informally be thought of as something that takes place at the level of an entire page, or not at all -- differences in XIDs among tuples on the same page are not interesting, barring extreme cases. Freezing a page is now practically synonymous with setting the page to all-visible in the visibility map, at least to users. The main upside of the new approach to freezing is that it makes the overhead of vacuuming much more predictable over time. We avoid the need for large balloon payments, since the system no longer accumulates "freezing debt" that can only be paid off by anti-wraparound vacuuming. This seems to have been particularly troublesome with append-only tables, especially in the common case where XIDs from pages that are marked all-visible for the first
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Mon, Feb 7, 2022 at 12:21 PM Robert Haas wrote: > > On Mon, Feb 7, 2022 at 11:43 AM Peter Geoghegan wrote: > > > That's because, if VACUUM is only ever getting triggered by XID > > > age advancement and not by bloat, there's no opportunity for your > > > patch set to advance relfrozenxid any sooner than we're doing now. > > > > We must distinguish between: > > > > 1. "VACUUM is fundamentally never going to need to run unless it is > > forced to, just to advance relfrozenxid" -- this applies to tables > > like the stock and customers tables from the benchmark. > > > > and: > > > > 2. "VACUUM must sometimes run to mark newly appended heap pages > > all-visible, and maybe to also remove dead tuples, but not that often > > -- and yet we current only get expensive and inconveniently timed > > anti-wraparound VACUUMs, no matter what" -- this applies to all the > > other big tables in the benchmark, in particular to the orders and > > order lines tables, but also to simpler cases like pgbench_history. > > It's not really very understandable for me when you refer to the way > table X behaves in Y benchmark, because I haven't studied that in > enough detail to know. If you say things like insert-only table, or a > continuous-random-updates table, or whatever the case is, it's a lot > easier to wrap my head around it. What I've called category 2 tables are the vast majority of big tables in practice. They include pure append-only tables, but also tables that grow and grow from inserts, but also have some updates. The point of the TPC-C order + order lines examples was to show how broad the category really is. And how mixtures of inserts and bloat from updates on one single table confuse the implementation in general. > > Does that make sense? It's pretty subtle, admittedly, and you no doubt > > have (very reasonable) concerns about the extremes, even if you accept > > all that. 
I just want to get the general idea across here, as a > starting point for further discussion. > > Not really. I think you *might* be saying tables which currently get > only wraparound vacuums will end up getting other kinds of vacuums > with your patch because things will improve enough for other tables in > the system that they will be able to get more attention than they do > currently. Yes, I am. > But I'm not sure I am understanding you correctly, and even > if I am I don't understand why that would be so, and even if it is I > think it doesn't help if essentially all the tables in the system are > suffering from the problem. When I say "relfrozenxid advancement has been qualitatively improved by the patch", what I mean is that the rate of relfrozenxid advancement is now far closer to the theoretically optimal rate for our current design, given freezing, 32-bit XIDs, and the invariants for freezing. Consider the extreme case, and generalize. In the simple append-only table case, it is most obvious. The final relfrozenxid is very close to OldestXmin (only tiny noise level differences appear), regardless of XID consumption by the system in general, and even within the append-only table in particular. Other cases are somewhat trickier, but have roughly the same quality, to a surprising degree. Lots of things that never really should have affected relfrozenxid to begin with no longer do, for the first time. -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Mon, Feb 7, 2022 at 11:43 AM Peter Geoghegan wrote: > > That's because, if VACUUM is only ever getting triggered by XID > > age advancement and not by bloat, there's no opportunity for your > > patch set to advance relfrozenxid any sooner than we're doing now. > > We must distinguish between: > > 1. "VACUUM is fundamentally never going to need to run unless it is > forced to, just to advance relfrozenxid" -- this applies to tables > like the stock and customers tables from the benchmark. > > and: > > 2. "VACUUM must sometimes run to mark newly appended heap pages > all-visible, and maybe to also remove dead tuples, but not that often > -- and yet we current only get expensive and inconveniently timed > anti-wraparound VACUUMs, no matter what" -- this applies to all the > other big tables in the benchmark, in particular to the orders and > order lines tables, but also to simpler cases like pgbench_history. It's not really very understandable for me when you refer to the way table X behaves in Y benchmark, because I haven't studied that in enough detail to know. If you say things like insert-only table, or a continuous-random-updates table, or whatever the case is, it's a lot easier to wrap my head around it. > Does that make sense? It's pretty subtle, admittedly, and you no doubt > have (very reasonable) concerns about the extremes, even if you accept > all that. I just want to get the general idea across here, as a > starting point for further discussion. Not really. I think you *might* be saying tables which currently get only wraparound vacuums will end up getting other kinds of vacuums with your patch because things will improve enough for other tables in the system that they will be able to get more attention than they do currently. But I'm not sure I am understanding you correctly, and even if I am I don't understand why that would be so, and even if it is I think it doesn't help if essentially all the tables in the system are suffering from the problem. 
-- Robert Haas EDB: http://www.enterprisedb.com
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Mon, Feb 7, 2022 at 10:08 AM Robert Haas wrote: > But ... if I'm not mistaken, in the kind of case that Greg is > describing, relfrozenxid will be advanced exactly as often as it is > today. But what happens today in a scenario like Greg's is pathological, despite being fairly common (common in large DBs). It doesn't seem informative to extrapolate too much from current experience for that reason. > That's because, if VACUUM is only ever getting triggered by XID > age advancement and not by bloat, there's no opportunity for your > patch set to advance relfrozenxid any sooner than we're doing now. We must distinguish between: 1. "VACUUM is fundamentally never going to need to run unless it is forced to, just to advance relfrozenxid" -- this applies to tables like the stock and customers tables from the benchmark. and: 2. "VACUUM must sometimes run to mark newly appended heap pages all-visible, and maybe to also remove dead tuples, but not that often -- and yet we currently only get expensive and inconveniently timed anti-wraparound VACUUMs, no matter what" -- this applies to all the other big tables in the benchmark, in particular to the orders and order lines tables, but also to simpler cases like pgbench_history. As I've said a few times now, the patch doesn't change anything for 1. But Greg's problem tables very much sound like they're from category 2. And what we see with the master branch for such tables is that they always get anti-wraparound VACUUMs, past a certain size (depends on things like exact XID rate and VACUUM settings, the insert-driven autovacuum scheduling stuff matters). The patch, by contrast, never reaches that point in practice during my testing -- it doesn't even come close. It is true that in theory, as the size of one of these "category 2" tables tends to infinity, the patch ends up behaving the same as master anyway. But I'm pretty sure that that usually doesn't matter at all, or matters less than you'd think.
As I emphasized when presenting the recent v7 TPC-C benchmark, neither of the two "TPC-C big problem tables" (which are particularly interesting/tricky examples of tables from category 2) comes close to getting an anti-wraparound VACUUM (plus, as I said in the same email, it wouldn't matter if they did). > So I think that people in this kind of situation will potentially be > helped or hurt by other things the patch set does, but the eager > relfrozenxid stuff won't make any difference for them. To be clear, I think it would if everything was in place, including the basic relfrozenxid advancement thing, plus the new freezing stuff (though you wouldn't need the experimental FSM thing to get this benefit). Here is a thought experiment that may make the general idea a bit clearer: Imagine I reran the same benchmark as before, with the same settings, and the expectation that everything would be the same as the first time around for the patch series. But to make things more interesting, this time I add an adversarial element: I add an adversarial gizmo that burns XIDs steadily, without doing any useful work. This gizmo doubles the rate of XID consumption for the database as a whole, perhaps by calling "SELECT txid_current()" in a loop, followed by a timed sleep (with a delay chosen with the goal of doubling XID consumption). I imagine that this would also burn CPU cycles, but probably not enough to make more than a noise-level impact -- so we're severely stressing the implementation by adding this gizmo, but the stress is precisely targeted at XID consumption and related implementation details. It's a pretty clean experiment. What happens now? I believe (though haven't checked for myself) that nothing important would change.
We'd still see the same VACUUM operations occur at approximately the same times (relative to the start of the benchmark) that we saw with the original benchmark, and each VACUUM operation would do approximately the same amount of physical work on each occasion. Of course, the autovacuum log output would show that the OldestXmin for each individual VACUUM operation had larger values than first time around for this newly initdb'd TPC-C database (purely as a consequence of the XID burning gizmo), but it would *also* show *concomitant* increases for our newly set relfrozenxid. The system should therefore hardly behave differently at all compared to the original benchmark run, despite this adversarial gizmo. It's fair to wonder: okay, but what if it was 4x, 8x, 16x? What then? That does get a bit more complicated, and we should get into why that is. But for now I'll just say that I think that even that kind of extreme would make much less difference than you might think -- since relfrozenxid advancement has been qualitatively improved by the patch series. It is especially likely that nothing would change if you were willing to increase autovacuum_freeze_max_age to get a bit more breathing room -- room to allow the autovacuums to run at their "natural" times. You
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 4, 2022 at 10:45 PM Peter Geoghegan wrote: > > While I've seen all the above cases triggering anti-wraparound cases > > by far the majority of the cases are not of these pathological forms. > > Right - it's practically inevitable that you'll need an > anti-wraparound VACUUM to advance relfrozenxid right now. Technically > it's possible to advance relfrozenxid in any VACUUM, but in practice > it just never happens on a large table. You only need to get unlucky > with one heap page, either by failing to get a cleanup lock, or (more > likely) by setting even one single page all-visible but not all-frozen > just once (once in any VACUUM that takes place between anti-wraparound > VACUUMs). But ... if I'm not mistaken, in the kind of case that Greg is describing, relfrozenxid will be advanced exactly as often as it is today. That's because, if VACUUM is only ever getting triggered by XID age advancement and not by bloat, there's no opportunity for your patch set to advance relfrozenxid any sooner than we're doing now. So I think that people in this kind of situation will potentially be helped or hurt by other things the patch set does, but the eager relfrozenxid stuff won't make any difference for them. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 4, 2022 at 10:21 PM Greg Stark wrote: > By far the majority of anti-wraparound vacuums are triggered by tables > that are very large and so don't trigger regular vacuums for "long > periods" of time and consistently hit the anti-wraparound threshold > first. That's interesting, because my experience is different. Most of the time when I get asked to look at a system, it turns out that there is a prepared transaction or a forgotten replication slot and nobody noticed until the system hit the wraparound threshold. Or occasionally a long-running transaction or a failing/stuck vacuum that has the same effect. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 4, 2022 at 10:44 PM Peter Geoghegan wrote: > Right - it's practically inevitable that you'll need an > anti-wraparound VACUUM to advance relfrozenxid right now. Technically > it's possible to advance relfrozenxid in any VACUUM, but in practice > it just never happens on a large table. You only need to get unlucky > with one heap page, either by failing to get a cleanup lock, or (more > likely) by setting even one single page all-visible but not all-frozen > just once (once in any VACUUM that takes place between anti-wraparound > VACUUMs). Minor correction: That's a slight exaggeration, since we won't skip groups of all-visible pages that don't exceed SKIP_PAGES_THRESHOLD blocks (32 blocks). -- Peter Geoghegan
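The SKIP_PAGES_THRESHOLD rule mentioned above can be restated as a toy predicate (an invented illustration of the vacuumlazy.c behavior, not its actual code): runs of all-visible pages that don't exceed the threshold are scanned anyway, which is what keeps small patches of all-visible-but-not-all-frozen pages from always blocking relfrozenxid advancement.

```c
#include <assert.h>
#include <stdbool.h>

/* Matches the vacuumlazy.c constant referred to above */
#define SKIP_PAGES_THRESHOLD 32

/*
 * Toy sketch: a non-aggressive VACUUM only skips a run of consecutive
 * all-visible pages when the run is longer than SKIP_PAGES_THRESHOLD
 * blocks; shorter runs are read anyway.
 */
static bool
skip_all_visible_run(int run_length_in_blocks)
{
    return run_length_in_blocks > SKIP_PAGES_THRESHOLD;
}
```

The rationale for scanning short runs is that skipping them saves little I/O (sequential reads are cheap) while forfeiting the chance to advance relfrozenxid.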
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 4, 2022 at 10:21 PM Greg Stark wrote: > On Wed, 15 Dec 2021 at 15:30, Peter Geoghegan wrote: > > My emphasis here has been on making non-aggressive VACUUMs *always* > > advance relfrozenxid, outside of certain obvious edge cases. And so > > with all the patches applied, up to and including the opportunistic > > freezing patch, every autovacuum of every table manages to advance > > relfrozenxid during benchmarking -- usually to a fairly recent value. > > I've focussed on making aggressive VACUUMs (especially anti-wraparound > > autovacuums) a rare occurrence, for truly exceptional cases (e.g., > > user keeps canceling autovacuums, maybe due to automated script that > > performs DDL). That has taken priority over other goals, for now. > > While I've seen all the above cases triggering anti-wraparound cases > by far the majority of the cases are not of these pathological forms. Right - it's practically inevitable that you'll need an anti-wraparound VACUUM to advance relfrozenxid right now. Technically it's possible to advance relfrozenxid in any VACUUM, but in practice it just never happens on a large table. You only need to get unlucky with one heap page, either by failing to get a cleanup lock, or (more likely) by setting even one single page all-visible but not all-frozen just once (once in any VACUUM that takes place between anti-wraparound VACUUMs). > By far the majority of anti-wraparound vacuums are triggered by tables > that are very large and so don't trigger regular vacuums for "long > periods" of time and consistently hit the anti-wraparound threshold > first. autovacuum_vacuum_insert_scale_factor can help with this on 13 and 14, but only if you tune autovacuum_freeze_min_age with that goal in mind. Which probably doesn't happen very often. > There's nothing limiting how long "long periods" is and nothing tying > it to the rate of xid consumption. 
It's quite common to have some > *very* large mostly static tables in databases that have other tables > that are *very* busy. > > The worst I've seen is a table that took 36 hours to vacuum in a > database that consumed about a billion transactions per day... That's > extreme but these days it's quite common to see tables that get > anti-wraparound vacuums every week or so despite having < 1% modified > tuples. And databases are only getting bigger and transaction rates > faster... Sounds very much like what I've been calling the freezing cliff. An anti-wraparound VACUUM throws things off by suddenly dirtying many more pages than the expected amount for a VACUUM against the table, despite there being no change in workload characteristics. If you just had to remove the dead tuples in such a table, then it probably wouldn't matter if it happened earlier than expected. -- Peter Geoghegan
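To illustrate the kind of tuning referred to above, a hypothetical postgresql.conf fragment might look like the following. The values are purely illustrative, not recommendations: autovacuum_vacuum_insert_threshold and autovacuum_vacuum_insert_scale_factor exist on 13+, and vacuum_freeze_min_age is the GUC counterpart of the per-table autovacuum_freeze_min_age reloption.

```
# postgresql.conf fragment -- illustrative values only
autovacuum_vacuum_insert_threshold = 1000      # insert-driven autovacuum trigger
autovacuum_vacuum_insert_scale_factor = 0.05   # vacuum after ~5% of rows inserted
vacuum_freeze_min_age = 0                      # freeze eligible tuples immediately,
                                               # so insert-driven vacuums also freeze
```

The point of pairing the two is that an insert-driven autovacuum only helps avoid the anti-wraparound freeze cliff if it actually freezes the newly appended pages while it is there.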
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Wed, 15 Dec 2021 at 15:30, Peter Geoghegan wrote: > > My emphasis here has been on making non-aggressive VACUUMs *always* > advance relfrozenxid, outside of certain obvious edge cases. And so > with all the patches applied, up to and including the opportunistic > freezing patch, every autovacuum of every table manages to advance > relfrozenxid during benchmarking -- usually to a fairly recent value. > I've focussed on making aggressive VACUUMs (especially anti-wraparound > autovacuums) a rare occurrence, for truly exceptional cases (e.g., > user keeps canceling autovacuums, maybe due to automated script that > performs DDL). That has taken priority over other goals, for now. While I've seen all the above cases triggering anti-wraparound cases by far the majority of the cases are not of these pathological forms. By far the majority of anti-wraparound vacuums are triggered by tables that are very large and so don't trigger regular vacuums for "long periods" of time and consistently hit the anti-wraparound threshold first. There's nothing limiting how long "long periods" is and nothing tying it to the rate of xid consumption. It's quite common to have some *very* large mostly static tables in databases that have other tables that are *very* busy. The worst I've seen is a table that took 36 hours to vacuum in a database that consumed about a billion transactions per day... That's extreme but these days it's quite common to see tables that get anti-wraparound vacuums every week or so despite having < 1% modified tuples. And databases are only getting bigger and transaction rates faster... -- greg
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 4, 2022 at 4:18 PM Robert Haas wrote: > On Fri, Feb 4, 2022 at 3:31 PM Peter Geoghegan wrote: > > Application B will already block pruning by VACUUM operations against > > application A's table, and so effectively blocks recording of the > > resultant free space in the FSM in your scenario. And so application A > > and application B should be considered the same application already. > > That's just how VACUUM works. > > Sure ... but that also sucks. If we consider application A and > application B to be the same application, then we're basing our > decision about what to do on information that is inaccurate. I agree that it sucks, but I don't think that it's particularly relevant to the FSM prototype patch that I included with v7 of the patch series. A heap page cannot be considered "closed" (either in the specific sense from the patch, or in any informal sense) when it has recently dead tuples. At some point we should invent a fallback path for pruning, that migrates recently dead tuples to some other subsidiary structure, retaining only forwarding information in the heap page. But even that won't change what I just said about closed pages (it'll just make it easier to return and fix things up later on). > I don't see it that way. There are cases where avoiding writes is > better, and cases where trying to cram everything into the fewest > possible pages is better. With the right test case you can make either > strategy look superior. The cost of reads is effectively much lower than writes with modern SSDs, in TCO terms. Plus when an FSM strategy like the one from the patch does badly according to a naive measure such as total table size, that in itself doesn't mean that we do worse with reads. In fact, it's quite the opposite. The benchmark showed that v7 of the patch did very slightly worse on overall space utilization, but far, far better on reads. In fact, the benefits for reads were far in excess of any efficiency gains for writes/WAL.
The greatest bottleneck is almost always latency on modern hardware [1]. It follows that keeping logically related data grouped together is crucial. Far more important than potentially using very slightly more space. The story I wanted to tell with the FSM patch was about open and closed pages being the right long term direction. More generally, we should emphasize managing page-level costs, and deemphasize managing tuple-level costs, which are much less meaningful. > What I think your test case has going for it > is that it is similar to something that a lot of people, really a ton > of people, actually do with PostgreSQL. However, it's not going to be > an accurate model of what everybody does, and therein lies some > element of danger. No question -- agreed. > > Then what could you have confidence in? > > Real-world experience. Which is hard to get if we don't ever commit > any patches, but a good argument for (a) having them tested by > multiple different hackers who invent test cases independently and (b) > some configurability where we can reasonably include it, so that if > anyone does experience problems they have an escape. I agree. [1] https://dl.acm.org/doi/10.1145/1022594.1022596 -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 4, 2022 at 3:31 PM Peter Geoghegan wrote: > Application B will already block pruning by VACUUM operations against > application A's table, and so effectively blocks recording of the > resultant free space in the FSM in your scenario. And so application A > and application B should be considered the same application already. > That's just how VACUUM works. Sure ... but that also sucks. If we consider application A and application B to be the same application, then we're basing our decision about what to do on information that is inaccurate. > 5% larger seems like a lot more than would be typical, based on what > I've seen. I don't think that the regression in this scenario can be > characterized as "infinitely worse", or anything like it. On a long > enough timeline, the potential upside of something like this is nearly > unlimited -- it could avoid a huge amount of write amplification. But > the potential downside seems to be small and fixed -- which is the > point (bounding the downside). The mere possibility of getting that > big benefit (avoiding the costs from heap fragmentation) is itself a > benefit, even when it turns out not to pay off in your particular > case. It can be seen as insurance. I don't see it that way. There are cases where avoiding writes is better, and cases where trying to cram everything into the fewest possible pages is better. With the right test case you can make either strategy look superior. What I think your test case has going for it is that it is similar to something that a lot of people, really a ton of people, actually do with PostgreSQL. However, it's not going to be an accurate model of what everybody does, and therein lies some element of danger. > Then what could you have confidence in? Real-world experience.
Which is hard to get if we don't ever commit any patches, but a good argument for (a) having them tested by multiple different hackers who invent test cases independently and (b) some configurability where we can reasonably include it, so that if anyone does experience problems they have an escape. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 4, 2022 at 2:45 PM Robert Haas wrote: > While I agree that there's some case to be made for leaving settled > pages well enough alone, your criterion for settled seems pretty much > accidental. I fully admit that I came up with the FSM heuristic with TPC-C in mind. But you have to start somewhere. Fortunately, the main benefits of this patch series (avoiding the freeze cliff during anti-wraparound VACUUMs, often avoiding anti-wraparound VACUUMs altogether) don't depend on the experimental FSM patch at all. I chose to post that now because it seemed to help with my more general point about qualitatively different pages, and freezing at the page level. > Imagine a system where there are two applications running, > A and B. Application A runs all the time and all the transactions > which it performs are short. Therefore, when a certain page is not > modified by transaction A for a short period of time, the page will > become all-visible and will be considered settled. Application B runs > once a month and performs various transactions all of which are long, > perhaps on a completely separate set of tables. While application B is > running, pages take longer to settle not only for application B but > also for application A. It doesn't make sense to say that the > application is in control of the behavior when, in reality, it may be > some completely separate application that is controlling the behavior. Application B will already block pruning by VACUUM operations against application A's table, and so effectively blocks recording of the resultant free space in the FSM in your scenario. And so application A and application B should be considered the same application already. That's just how VACUUM works. VACUUM isn't a passive observer of the system -- it's another participant. It both influences and is influenced by almost everything else in the system. > I can see that this could have significant advantages under some > circumstances. 
> But I think it could easily be far worse under other > circumstances. I mean, you can have workloads where you do some amount > of read-write work on a table and then go read only and sequential > scan it an infinite number of times. An algorithm that causes the > table to be smaller at the point where we switch to read-only > operations, even by a modest amount, wins infinitely over anything > else. But even if you have no change in the access pattern, is it a > good idea to allow the table to be, say, 5% larger if it means that > correlated data is colocated? In general, probably yes. If that means > that the table fails to fit in shared_buffers instead of fitting, no. > If that means that the table fails to fit in the OS cache instead of > fitting, definitely no. 5% larger seems like a lot more than would be typical, based on what I've seen. I don't think that the regression in this scenario can be characterized as "infinitely worse", or anything like it. On a long enough timeline, the potential upside of something like this is nearly unlimited -- it could avoid a huge amount of write amplification. But the potential downside seems to be small and fixed -- which is the point (bounding the downside). The mere possibility of getting that big benefit (avoiding the costs from heap fragmentation) is itself a benefit, even when it turns out not to pay off in your particular case. It can be seen as insurance. > And to me, that kind of effect is why it's hard to gain much > confidence in regards to stuff like this via laboratory testing. I > mean, I'm glad you're doing such tests. But in a laboratory test, you > tend not to have things like a sudden and complete change in the > workload, or a random other application sometimes sharing the machine, > or only being on the edge of running out of memory. 
> I think in general > people tend to avoid such things in benchmarking scenarios, but even > if you include stuff like this, it's hard to know what to include that > would be representative of real life, because just about anything > *could* happen in real life. Then what could you have confidence in? -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sat, Jan 29, 2022 at 11:43 PM Peter Geoghegan wrote: > When VACUUM sees that all remaining/unpruned tuples on a page are > all-visible, it isn't just important because of cost control > considerations. It's deeper than that. It's also treated as a > tentative signal from the application itself, about the data itself. > Which is: this page looks "settled" -- it may never be updated again, > but if there is an update it likely won't change too much about the > whole page. While I agree that there's some case to be made for leaving settled pages well enough alone, your criterion for settled seems pretty much accidental. Imagine a system where there are two applications running, A and B. Application A runs all the time and all the transactions which it performs are short. Therefore, when a certain page is not modified by transaction A for a short period of time, the page will become all-visible and will be considered settled. Application B runs once a month and performs various transactions all of which are long, perhaps on a completely separate set of tables. While application B is running, pages take longer to settle not only for application B but also for application A. It doesn't make sense to say that the application is in control of the behavior when, in reality, it may be some completely separate application that is controlling the behavior. > The application is in charge, really -- not VACUUM. This is already > the case, whether we like it or not. VACUUM needs to learn to live in > that reality, rather than fighting it. When VACUUM considers a page > settled, and the physical page still has a relatively large amount of > free space (say 45% of BLCKSZ, a borderline case in the new FSM > patch), "losing" so much free space certainly is unappealing. We set > the free space to 0 in the free space map all the same, because we're > cutting our losses at that point. 
> While the exact threshold I've > proposed is tentative, the underlying theory seems pretty sound to me. > The BLCKSZ/2 cutoff (and the way that it extends the general rules for > whole-page freezing) is intended to catch pages that are qualitatively > different, as well as quantitatively different. It is a balancing act, > between not wasting space, and the risk of systemic problems involving > excessive amounts of non-HOT updates that must move a successor > version to another page. I can see that this could have significant advantages under some circumstances. But I think it could easily be far worse under other circumstances. I mean, you can have workloads where you do some amount of read-write work on a table and then go read only and sequential scan it an infinite number of times. An algorithm that causes the table to be smaller at the point where we switch to read-only operations, even by a modest amount, wins infinitely over anything else. But even if you have no change in the access pattern, is it a good idea to allow the table to be, say, 5% larger if it means that correlated data is colocated? In general, probably yes. If that means that the table fails to fit in shared_buffers instead of fitting, no. If that means that the table fails to fit in the OS cache instead of fitting, definitely no. And to me, that kind of effect is why it's hard to gain much confidence in regards to stuff like this via laboratory testing. I mean, I'm glad you're doing such tests. But in a laboratory test, you tend not to have things like a sudden and complete change in the workload, or a random other application sometimes sharing the machine, or only being on the edge of running out of memory. I think in general people tend to avoid such things in benchmarking scenarios, but even if you include stuff like this, it's hard to know what to include that would be representative of real life, because just about anything *could* happen in real life. 
-- Robert Haas EDB: http://www.enterprisedb.com
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 4, 2022 at 2:00 PM John Naylor wrote: > Without having looked at the latest patches, there was something in > the back of my mind while following the discussion upthread -- the > proposed opportunistic freezing made a lot more sense if the > earlier-proposed open/closed pages concept was already available. Yeah, sorry about that. The open/closed pages concept is still something I plan on working on. My prototype (which I never posted to the list) will be rebased, and I'll try to target Postgres 16. > > Freezing whole pages > > > > > It's possible that a higher cutoff (for example a cutoff of 80% of > > BLCKSZ, not 50%) will actually lead to *worse* space utilization, in > > addition to the downsides from fragmentation -- it's far from a simple > > trade-off. (Not that you should believe that 50% is special, it's just > > a starting point for me.) > > How was the space utilization with the 50% cutoff in the TPC-C test? The picture was mixed. To get the raw numbers, compare pg-relation-sizes-after-patch-2.out and pg-relation-sizes-after-master-2.out files from the drive link I provided (to repeat, get them from https://drive.google.com/drive/u/1/folders/1A1g0YGLzluaIpv-d_4o4thgmWbVx3LuR) Highlights: the largest table (the bmsql_order_line table) had a total size of x1.006 relative to master, meaning that we did slightly worse there. However, the index on the same table was slightly smaller instead, probably because reducing heap fragmentation tends to make the index deletion stuff work a bit better than before. Certain small tables (bmsql_district and bmsql_warehouse) were actually significantly smaller (less than half their size on master), probably just because the patch can reliably remove LP_DEAD items from heap pages, even when a cleanup lock isn't available. The bmsql_new_order table was quite a bit larger, but it's not that large anyway (1250 MB on master at the very end, versus 1433 MB with the patch). 
This is a clear trade-off, since we get much less fragmentation in the same table (as evidenced by the VACUUM output, where there are fewer pages with any LP_DEAD items per VACUUM with the patch). The workload for that table is characterized by inserting new orders together, and deleting the same orders as a group later on. So we're bound to pay a cost in space utilization to lower the fragmentation. > > blks_hit | 174,551,067,731 > > tup_fetched | 124,797,772,450 > > > Here is the same pg_stat_database info for master: > > > blks_hit | 283,015,966,386 > > tup_fetched | 237,052,965,901 > > That's impressive. Thanks! It's still possible to get a big improvement like that with something like TPC-C because there are certain behaviors that are clearly suboptimal -- once you look at the details of the workload, and compare an imaginary ideal to the actual behavior of the system. In particular, there is really only one way that the free space management can work for the two big tables that will perform acceptably -- the orders have to be stored in the same place to begin with, and stay in the same place forever (at least to the extent that that's possible). -- Peter Geoghegan
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Sat, Jan 29, 2022 at 11:43 PM Peter Geoghegan wrote: > > On Thu, Jan 20, 2022 at 2:00 PM Peter Geoghegan wrote: > > I do see some value in that, too. Though it's not going to be a way of > > turning off the early freezing stuff, which seems unnecessary (though > > I do still have work to do on getting the overhead for that down). > > Attached is v7, a revision that overhauls the algorithm that decides > what to freeze. I'm now calling it block-driven freezing in the commit > message. Also included is a new patch, that makes VACUUM record zero > free space in the FSM for an all-visible page, unless the total amount > of free space happens to be greater than one half of BLCKSZ. > > The fact that I am now including this new FSM patch (v7-0006-*patch) > may seem like a case of expanding the scope of something that could > well do without it. But hear me out! It's true that the new FSM patch > isn't essential. I'm including it now because it seems relevant to the > approach taken with block-driven freezing -- it may even make my > general approach easier to understand. Without having looked at the latest patches, there was something in the back of my mind while following the discussion upthread -- the proposed opportunistic freezing made a lot more sense if the earlier-proposed open/closed pages concept was already available. > Freezing whole pages > > It's possible that a higher cutoff (for example a cutoff of 80% of > BLCKSZ, not 50%) will actually lead to *worse* space utilization, in > addition to the downsides from fragmentation -- it's far from a simple > trade-off. (Not that you should believe that 50% is special, it's just > a starting point for me.) How was the space utilization with the 50% cutoff in the TPC-C test? 
> TPC-C raw numbers > = > > The single most important number for the patch might be the decrease > in both buffer misses and buffer hits, which I believe is caused by > the patch being able to use index-only scans much more effectively > (with modifications to BenchmarkSQL to improve the indexing strategy > [1]). This is quite clear from pg_stat_database state at the end. > > Patch: > blks_hit | 174,551,067,731 > tup_fetched | 124,797,772,450 > Here is the same pg_stat_database info for master: > blks_hit | 283,015,966,386 > tup_fetched | 237,052,965,901 That's impressive. -- John Naylor EDB: http://www.enterprisedb.com
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Jan 21, 2022 at 12:07 PM Greg Stark wrote: > This confuses me. "Transactions per second" is a headline database > metric that lots of users actually focus on quite heavily -- rather > too heavily imho. But transactions per second is for the whole database, not for individual tables. It's also really a benchmarking thing, where the size and variety of transactions is fixed. With something like pgbench it actually is exactly the same thing, but such a workload is not at all realistic. Even BenchmarkSQL/TPC-C isn't like that, despite the fact that it is a fairly synthetic workload (it's just not super synthetic). > Ok, XID consumption is only a subset of transactions > that are not read-only but that's a detail that's pretty easy to > explain and users get pretty quickly. My point was mostly this: the number of distinct extant unfrozen tuple headers (and the range of the relevant XIDs) is generally highly unpredictable today. And the number of tuples we'll have to freeze to be able to advance relfrozenxid by a good amount is quite variable, in general. For example, if we bulk extend a relation as part of an ETL process, then the number of distinct XIDs could be as low as 1, even though we can expect a great deal of "freeze debt" that will have to be paid off at some point (with the current design, in the common case where the user doesn't account for this effect because they're not already an expert). There are other common cases that are not quite as extreme as that, that still have the same effect -- even an expert will find it hard or impossible to tune autovacuum_freeze_min_age for that. Another case of interest (that illustrates the general principle) is something like pgbench_tellers. We'll never have an aggressive VACUUM of the table with the patch, and we shouldn't ever need to freeze any tuples. 
But, owing to workload characteristics, we'll constantly be able to keep its relfrozenxid very current, because (even if we introduce skew) each individual row cannot go very long without being updated, allowing old XIDs to age out that way. There is also an interesting middle ground, where you get a mixture of both tendencies due to skew. The tuple that's most likely to get updated was the one that was just updated. How are you as a DBA ever supposed to tune autovacuum_freeze_min_age if tuples happen to be qualitatively different in this way? > What I find confuses people much more is the concept of the > oldestxmin. I think most of the autovacuum problems I've seen come > from cases where autovacuum is happily kicking off useless vacuums > because the oldestxmin hasn't actually advanced enough for them to do > any useful work. As it happens, the proposed log output won't use the term oldestxmin anymore -- I think that it makes sense to rename it to "removable cutoff". Here's an example:

LOG:  automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 317308 remain, 250258 skipped using visibility map (78.87% of total)
tuples: 70 removed, 34105925 remain (6830471 newly frozen), 2528 are dead but not yet removable
removable cutoff: 37574752, which is 230115 xids behind next
new relfrozenxid: 35221275, which is 5219310 xids ahead of previous value
index scan needed: 55540 pages from table (17.50% of total) had 3339809 dead item identifiers removed
index "bmsql_oorder_pkey": pages: 144257 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "bmsql_oorder_idx2": pages: 330083 in total, 0 newly deleted, 0 currently deleted, 0 reusable
I/O timings: read: 7928.207 ms, write: 1386.662 ms
avg read rate: 33.107 MB/s, avg write rate: 26.218 MB/s
buffer usage: 220825 hits, 443331 misses, 351084 dirtied
WAL usage: 576110 records, 364797 full page images, 2046767817 bytes
system usage: CPU: user: 10.62 s, system: 7.56 s, elapsed: 104.61 s

Note also that I deliberately made the "new relfrozenxid" line that immediately follows (information that we haven't shown before now) similar, to highlight that they're now closely related concepts. Now if you VACUUM a table that is either empty or has only frozen tuples, VACUUM will set relfrozenxid to oldestxmin/removable cutoff. Internally, oldestxmin is the "starting point" for our final/target relfrozenxid for the table. We ratchet it back dynamically, whenever we see an older-than-current-target XID that cannot be immediately frozen (e.g., when we can't easily get a cleanup lock on the page). -- Peter Geoghegan