Re: Making C function declaration parameter names consistent with corresponding definition names
On Fri, Sep 16, 2022 at 4:19 PM Tom Lane wrote:
> I agree, this has always been a pet peeve of mine as well. I would
> have guessed there were fewer examples than you found, because I've
> generally fixed any such cases I happened to notice.

If you actually go through them all one by one you'll see that the vast majority of individual cases involve an inconsistency that follows some kind of recognizable pattern. For example, a Relation parameter might be spelled "relation" in one place and "rel" in another. I find these more common cases much less noticeable -- perhaps that's why there are more than you thought there'd be?

It's possible to configure the clang-tidy tooling to tolerate various inconsistencies, below some kind of threshold -- it is totally customizable. But I think that a strict, simple rule is the way to go here. (Though without creating busy work for committers that don't want to use clang-tidy all the time.)

-- Peter Geoghegan
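[Editor's note: the clang-tidy check being discussed can be pinned down in a project `.clang-tidy` file roughly as follows. This is a hedged sketch, not something from the thread; the check and its `Strict` option exist in clang-tidy, where `Strict` disables the default tolerance for abbreviated-but-prefix-matching names like "rel" vs "relation".]

```yaml
# Sketch of a .clang-tidy fragment enforcing the strict, simple rule:
# declaration parameter names must exactly match the definition names.
Checks: '-*,readability-inconsistent-declaration-parameter-name'
CheckOptions:
  - key:   readability-inconsistent-declaration-parameter-name.Strict
    value: true
```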
Re: Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans
On Fri, Sep 16, 2022 at 12:30 AM Masahiko Sawada wrote:
> After a quick benchmark, I've confirmed that the amount of WAL records
> for freezing 1 million tuples reduced to about one-fifth (1.2GB vs
> 250MB).

Great.

I think that the really interesting thing about the patch is how this changes the way we should think about freezing costs. It makes page-level batching seem very natural.

The minimum possible size of a Heap2/FREEZE_PAGE record is 64 bytes, once alignment and so on is taken into account (without the patch). Once we already know that we have to freeze *some* tuples on a given heap page, it becomes very reasonable to freeze as many as possible, in batch, just because we know that it'll be much cheaper if we do it now versus doing it later on instead. Even if this extra freezing ends up "going to waste" due to updates against the same tuples that happen a little later on, the *added* cost of freezing "extra" tuples will have been so small that it's unlikely to matter. On the other hand, if it's not wasted we'll be *much* better off.

It's very hard to predict the future, which is kinda what the current FreezeLimit-based approach to freezing does. It's actually quite easy to understand the cost of freezing now versus freezing later, though. At a high level, it makes sense for VACUUM to focus on freezing costs (including the risk that comes with *not* freezing with larger tables), and not worry so much about making accurate predictions. Making accurate predictions about freezing/workload characteristics is overrated.

> True. I've not looked at the patch in depth yet but I think we need
> regression tests for this.

What did you have in mind?

I think that the best way to test something like this is with wal_consistency_checking. That mostly works fine. However, note that heap_mask() won't always be able to preserve the state of a tuple's xmax when modified by freezing.
We sometimes need "hint bits" to actually reliably be set in REDO, when replaying the records for freezing. At other times they really are just hints. We have to conservatively assume that it's just a hint when masking. Not sure if I can do much about that. Note that this optimization is one level below lazy_scan_prune(), and one level above heap_execute_freeze_tuple(). Neither function really changes at all. This seems useful because there are rare pg_upgrade-only paths where xvac fields need to be frozen. That's not tested either. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Sep 15, 2022 at 12:09 AM John Naylor wrote:
> On Wed, Sep 14, 2022 at 11:33 PM Peter Geoghegan wrote:
> > The final number of TIDs doesn't seem like the most interesting
> > information that VM snapshots could provide us when it comes to
> > building the dead_items TID data structure -- the *distribution* of
> > TIDs across heap pages seems much more interesting. The "shape" can be
> > known ahead of time, at least to some degree. It can help with
> > compression, which will reduce cache misses.
>
> My point here was simply that spilling to disk is an admission of
> failure to utilize memory efficiently and thus shouldn't be a selling
> point of VM snapshots. Other selling points could still be valid.

I was trying to explain the goals of this work in a way that was as accessible as possible. It's not easy to get the high level ideas across, let alone all of the details.

It's true that I have largely ignored the question of how VM snapshots will need to spill up until now. There are several reasons for this, most of which you could probably guess. FWIW it wouldn't be at all difficult to add *some* reasonable spilling behavior very soon; the underlying access patterns are highly sequential and predictable, in the obvious way.

-- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Sep 14, 2022 at 3:18 AM John Naylor wrote:
> On Wed, Sep 14, 2022 at 12:53 AM Peter Geoghegan wrote:
> > This is still only scratching the surface of what is possible with
> > dead_items. The visibility map snapshot concept can enable a far more
> > sophisticated approach to resource management in vacuumlazy.c.
>
> I don't quite see how it helps "enable" that.

I have already written a simple throwaway patch that can use the current VM snapshot data structure (which is just a local copy of the VM's pages) to do a cheap precheck ahead of actually doing a binary search in dead_items -- if a TID's heap page is all-visible or all-frozen (depending on the type of VACUUM) then we're 100% guaranteed to not visit it, and so it's 100% guaranteed to not have any dead_items (actually it could have LP_DEAD items by the time the index scan happens, but they won't be in our dead_items array in any case). Since we're working off of an immutable source, this optimization is simple to implement already. Very simple.

I haven't even bothered to benchmark this throwaway patch (I literally wrote it in 5 minutes to show Masahiko what I meant). I can't see why even that throwaway prototype wouldn't significantly improve performance, though. After all, the VM snapshot data structure is far denser than dead_items, and the largest tables often have most heap pages skipped via the VM.

I'm not really interested in pursuing this simple approach because it conflicts with Masahiko's work on the data structure, and there are other good reasons to expect that to help. Plus I'm already very busy with what I have here.

> It'd be more logical to
> me to say the VM snapshot *requires* you to think harder about
> resource management, since a palloc'd snapshot should surely be
> counted as part of the configured memory cap that admins control.

That's clearly true -- it creates a new problem for resource management that will need to be solved.
But that doesn't mean that it can't ultimately make resource management better and easier. Remember, we don't randomly visit some skippable pages for no good reason in the patch, since the SKIP_PAGES_THRESHOLD stuff is completely gone. The VM snapshot isn't just a data structure that vacuumlazy.c uses as it sees fit -- it's actually more like a set of instructions on which pages to scan, that vacuumlazy.c *must* follow. There is no way that vacuumlazy.c can accidentally pick up a few extra dead_items here and there due to concurrent activity that unsets VM pages. We don't need to leave that to chance -- it is locked in from the start.

> I do remember your foreshadowing in the radix tree thread a while
> back, and I do think it's an intriguing idea to combine pages-to-scan
> and dead TIDs in the same data structure. The devil is in the details,
> of course.

It's worth looking into. Of course.

> Looking at the count of index scans, it's pretty much always
> "1", so even if the current approach could scale above 1GB, "no" it
> wouldn't help to raise that limit.

I agree that multiple index scans are rare. But I also think that they're disproportionately involved in really problematic cases for VACUUM. That said, I agree that simply making lookups to dead_items as fast as possible is the single most important way to improve VACUUM by improving dead_items.

> Furthermore, it doesn't have to anticipate the maximum size, so there
> is no up front calculation assuming max-tuples-per-page, so it
> automatically uses less memory for less demanding tables.

The final number of TIDs doesn't seem like the most interesting information that VM snapshots could provide us when it comes to building the dead_items TID data structure -- the *distribution* of TIDs across heap pages seems much more interesting. The "shape" can be known ahead of time, at least to some degree. It can help with compression, which will reduce cache misses.
Andres made remarks about memory usage with sparse dead TID patterns at this point on the "Improve dead tuple storage for lazy vacuum" thread: https://postgr.es/m/20210710025543.37sizjvgybemk...@alap3.anarazel.de I haven't studied the radix tree stuff in great detail, so I am uncertain of how much the VM snapshot concept could help, and where exactly it would help. I'm just saying that it seems promising, especially as a way of addressing concerns like this. -- Peter Geoghegan
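[Editor's note: the "cheap precheck ahead of a binary search in dead_items" described in this message can be sketched in self-contained C. The types and names below are simplified, hypothetical stand-ins for PostgreSQL's ItemPointerData/dead_items structures, not code from the throwaway patch itself.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a heap TID (block number + offset number) */
typedef struct ItemPointer
{
    uint32_t block;
    uint16_t offset;
} ItemPointer;

/*
 * One bit per heap page, copied from the visibility map at the start of
 * VACUUM. Immutable for the life of the VACUUM, so a set bit is a hard
 * guarantee that the page was skipped and contributed no dead_items.
 */
static bool
vmsnap_all_visible(const uint8_t *vmsnap, uint32_t block)
{
    return (vmsnap[block / 8] >> (block % 8)) & 1;
}

/* dead_items lookup with the VM-snapshot fast path in front of it */
static bool
tid_is_dead(const ItemPointer *dead_items, int ndead,
            const uint8_t *vmsnap, ItemPointer tid)
{
    if (vmsnap_all_visible(vmsnap, tid.block))
        return false;           /* skipped page: no binary search needed */

    /* usual binary search over dead_items, sorted by (block, offset) */
    int lo = 0, hi = ndead - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        const ItemPointer *d = &dead_items[mid];

        if (d->block == tid.block && d->offset == tid.offset)
            return true;
        if (d->block < tid.block ||
            (d->block == tid.block && d->offset < tid.offset))
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}
```

The point of the sketch: the bitmap test is one shift and a mask against a dense structure, while the binary search touches a much larger, colder array -- and on large tables most TID probes hit pages the VM snapshot says were skipped.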
Re: wrong shell trap
On Tue, Sep 13, 2022 at 2:01 PM Tom Lane wrote:
> > AFAICT almost all of our shell scripts contain the same mistake. I
> > propose to fix them all as in the attached demo patch, which makes
> > headerscheck exit properly (no silly noise) when interrupted.
>
> Sounds like a good idea.

Might not be a bad idea to run shellcheck against the scripts, to see if that highlights anything. I've found that shellcheck makes working with shell scripts less terrible, especially when portability is a concern. It can be used to enforce consistent coding standards that seem pretty well thought out. It will sometimes produce dubious warnings, of course, but it tends to mostly have the right idea, most of the time.

-- Peter Geoghegan
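[Editor's note: the class of trap mistake being fixed -- and the pattern shellcheck-friendly scripts use instead -- looks roughly like this. A sketch only, with illustrative names; it is not the actual patch. The bug: a signal trap that ends in a plain "exit" converts death-by-SIGINT into a normal exit, producing noise and failing to stop enclosing loops.]

```shell
# Sketch of a signal-safe cleanup pattern for a POSIX shell script.

tmpdir=$(mktemp -d) || exit 1

cleanup() {
    rm -rf "$tmpdir"
}

# Normal exit paths: just clean up; the script's exit status is preserved.
trap cleanup EXIT

# Signal paths: clean up, restore the default handler, then re-raise the
# signal against ourselves, so the parent sees death-by-signal (128+N)
# rather than an ordinary exit.
for sig in INT QUIT TERM; do
    trap "cleanup; trap - $sig; kill -$sig \$\$" "$sig"
done

# ... real work using "$tmpdir" goes here ...
```

The re-raise (`trap - INT; kill -INT $$`) is the piece the common broken version omits: it lets an interactive parent shell see that the child died from SIGINT, which is what makes Ctrl-C quiet and immediate.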
Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans
My ongoing project to make VACUUM more predictable over time by proactive freezing [1] will increase the overall number of tuples frozen by VACUUM significantly (at least in larger tables). It's important that we avoid any new user-visible impact from extra freezing, though. I recently spent a lot of time on adding high-level techniques that aim to avoid extra freezing (e.g. by being lazy about freezing) when that makes sense. Low level techniques aimed at making the mechanical process of freezing cheaper might also help. (In any case it's well worth optimizing.) I'd like to talk about one such technique on this thread.

The attached WIP patch reduces the size of xl_heap_freeze_page records by applying a simple deduplication process. This can be treated as independent work (I think it can, at least). The patch doesn't change anything about the conceptual model used by VACUUM's lazy_scan_prune function to build xl_heap_freeze_page records for a page, and yet still manages to make the WAL records for freeze records over 5x smaller in many important cases. They'll be ~4x-5x smaller with *most* workloads, even.

Each individual tuple entry (each xl_heap_freeze_tuple) adds a full 12 bytes to the WAL record right now -- no matter what. So the existing approach is rather space inefficient by any standard (perhaps because it was developed under time pressure while fixing bugs in Postgres 9.3). More importantly, there is a lot of natural redundancy among each xl_heap_freeze_tuple entry -- each tuple's xl_heap_freeze_tuple details tend to match. We can usually get away with storing each unique combination of values from xl_heap_freeze_tuple once per xl_heap_freeze_page record, while storing associated page offset numbers in a separate area, grouped by their canonical freeze plan (which is a normalized version of the information currently stored in xl_heap_freeze_tuple).
In practice most individual tuples that undergo any kind of freezing only need to have their xmin field frozen. And when xmax is affected at all, it'll usually just get set to InvalidTransactionId. And so the actual low-level processing steps for xmax have a high chance of being shared by other tuples on the page, even in ostensibly tricky cases. While there are quite a few paths that lead to VACUUM setting a tuple's xmax to InvalidTransactionId, they all end up with the same instructional state in the final xl_heap_freeze_tuple entry.

Note that there is a small chance that the patch will be less space efficient by up to 2 bytes per tuple frozen per page in cases where we're allocating new Multis during VACUUM. I think that this should be acceptable on its own -- even in rare bad cases we'll usually still come out ahead -- what are the chances that we won't make up the difference on the same page? Or at least within the same VACUUM? And that's before we talk about a future world in which freezing will batch tuples together at the page level (you don't have to bring the other VACUUM work into this discussion, I think, but it's not *completely* unrelated either).

[1] https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3j0aad6fxk93u-xq...@mail.gmail.com

-- Peter Geoghegan

v1-0001-Shrink-freeze-WAL-records-via-deduplication.patch
Description: Binary data
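[Editor's note: the deduplication scheme described in this message -- store each distinct "freeze plan" once, with per-tuple offset numbers grouped under it -- can be sketched in standalone C. The struct fields below are simplified, hypothetical stand-ins for xl_heap_freeze_tuple; this is not the patch's actual record layout.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* What to do to a tuple (candidate for deduplication) */
typedef struct FreezePlan
{
    uint32_t xmax;       /* often just InvalidTransactionId (0) */
    uint16_t infomask;   /* hint/status bits to set */
    uint16_t infomask2;
    uint8_t  frzflags;
} FreezePlan;

/* Which tuple it applies to */
typedef struct TupleFreeze
{
    FreezePlan plan;
    uint16_t   offnum;   /* page offset number */
} TupleFreeze;

/* Order entries so that identical freeze plans become adjacent */
static int
plan_cmp(const void *a, const void *b)
{
    const FreezePlan *pa = &((const TupleFreeze *) a)->plan;
    const FreezePlan *pb = &((const TupleFreeze *) b)->plan;

    if (pa->xmax != pb->xmax)
        return pa->xmax < pb->xmax ? -1 : 1;
    if (pa->infomask != pb->infomask)
        return (int) pa->infomask - (int) pb->infomask;
    if (pa->infomask2 != pb->infomask2)
        return (int) pa->infomask2 - (int) pb->infomask2;
    return (int) pa->frzflags - (int) pb->frzflags;
}

/*
 * Sort per-tuple entries by freeze plan and count the distinct plans.
 * A real record would then store each distinct plan once, followed by
 * the sorted offset numbers grouped under their plan.
 */
static int
count_unique_plans(TupleFreeze *tuples, int ntuples)
{
    int nplans;

    qsort(tuples, ntuples, sizeof(TupleFreeze), plan_cmp);
    nplans = (ntuples > 0) ? 1 : 0;
    for (int i = 1; i < ntuples; i++)
        if (plan_cmp(&tuples[i - 1], &tuples[i]) != 0)
            nplans++;
    return nplans;
}

/* Rough payload sizes: 12 bytes per tuple today, versus one plan per
 * distinct combination plus a 2-byte offset number per tuple. */
static size_t
payload_size_old(int ntuples)
{
    return (size_t) ntuples * 12;
}

static size_t
payload_size_dedup(int nplans, int ntuples)
{
    return (size_t) nplans * sizeof(FreezePlan) +
           (size_t) ntuples * sizeof(uint16_t);
}
```

With a single shared plan across ~100 frozen tuples -- the "xmin-only" common case -- the deduplicated payload is a small multiple of 2 bytes per tuple, which is where the ~5x shrinkage comes from.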
Re: pg15b4: FailedAssertion("TransactionIdIsValid(xmax)
On Sun, Sep 11, 2022 at 6:42 PM Justin Pryzby wrote:
> I think you're saying is that this can be explained by the
> io_concurrency bug in recovery_prefetch, if run under 15b3.
>
> But yesterday I started from initdb and restored this cluster from backup, and
> started up sqlsmith, and sent some kill -9, and now got more corruption.
> Looks like it took ~10 induced crashes before this happened.

Have you tested fsync on the system? The symptoms here are all over the place. This assertion failure seems like a pretty good sign that the problems happen during recovery, or because basic guarantees needed for crash safety aren't met:

> #2 0x00962c5c in ExceptionalCondition
> (conditionName=conditionName@entry=0x9ce238 "P_ISLEAF(opaque) &&
> !P_ISDELETED(opaque)", errorType=errorType@entry=0x9bad97 "FailedAssertion",
> fileName=fileName@entry=0x9cdcd1 "nbtpage.c",
> lineNumber=lineNumber@entry=1778) at assert.c:69
> #3 0x00507e34 in _bt_rightsib_halfdeadflag
> (rel=rel@entry=0x7f4138a238a8, leafrightsib=leafrightsib@entry=53) at
> nbtpage.c:1778
> #4 0x00507fba in _bt_mark_page_halfdead
> (rel=rel@entry=0x7f4138a238a8, leafbuf=leafbuf@entry=13637,
> stack=stack@entry=0x144ca20) at nbtpage.c:2121

This shows that the basic rules for page deletion have somehow seemingly been violated. It's as if a page deletion went ahead, but didn't work as an atomic operation -- there were some lost writes for some but not all pages. Actually, it looks like a mix of states from before and after both the first and the second phases of page deletion -- so not just one atomic operation.

-- Peter Geoghegan
Re: Index ordering after IS NULL
On Sat, Sep 10, 2022 at 2:28 PM Jeff Janes wrote:
> explain analyze select * from j where b is null order by c limit 10;
> explain analyze select * from j where b =8 order by c limit 10;
>
> The first uses a sort despite it being disabled.

The first/is null query seems to give the result and plan you're looking for if the query is rewritten to order by "b, c", and not just "c".

That in itself doesn't make your complaint any less valid, of course. You don't have to do this with the second query, so why should you have to do it with the first?

-- Peter Geoghegan
Re: Making autovacuum logs indicate if insert-based threshold was the triggering condition
On Fri, Sep 9, 2022 at 11:11 AM Justin Pryzby wrote:
> > It might make sense to *always* show how close we were to hitting each
> > of the thresholds, including the ones that we didn't end up hitting
> > (we may come pretty close quite often, which seems like it might
> > matter). But showing multiple conditions together just because the
> > planets aligned (we hit multiple thresholds together) emphasizes the
> > low-level mechanism, which is pretty far removed from anything that
> > matters. You might as well pick either threshold at random once this
> > happens -- even an expert won't be able to tell the difference.
>
> I don't have strong feelings about it; I'm just pointing out that the
> two of the conditions aren't actually exclusive.

Fair enough. I'm just pointing out that the cutoffs are continuous for all practical purposes, even if there are cases where they seem kind of discrete, due only to implementation details (e.g. autovacuum_naptime stuff). Displaying only one reason for triggering an autovacuum is consistent with the idea that the cutoffs are continuous. It's not literally true that they're continuous, but it might as well be.

I think of it as similar to how it's not literally true that a coin toss is always either heads or tails, though it might as well be true. Sure, even a standard fair coin toss might result in the coin landing on its side. That'll probably never happen even once, but if it does: just flip the coin again! The physical coin toss was never the important part.

> It seems like it could be less confusing to show both. Consider someone who is
> trying to *reduce* how often autovacuum runs, or give priority to some tables
> by raising the thresholds for other tables.
My objection to that sort of approach is that it suggests a difference in what each VACUUM actually does -- as if autovacuum.c actually decided on a particular runtime behavior for the VACUUM up front, based on its own considerations that come from statistics about the workload. I definitely want to avoid creating that false impression in the minds of users. -- Peter Geoghegan
Re: Instrumented pages/tuples frozen in autovacuum's server log output (and VACUUM VERBOSE)
On Mon, Sep 5, 2022 at 12:43 PM Peter Geoghegan wrote: > Barring any objections I will commit this patch within the next few days. Pushed this just now. Thanks -- Peter Geoghegan
Re: Instrumented pages/tuples frozen in autovacuum's server log output (and VACUUM VERBOSE)
On Wed, Aug 31, 2022 at 7:49 PM Jeff Janes wrote:
> I think "frozen:" would be a more suitable line prefix.

Attached revision does it that way.

Barring any objections I will commit this patch within the next few days.

Thanks
-- Peter Geoghegan

v2-0001-Instrument-freezing-in-autovacuum-log-reports.patch
Description: Binary data
Re: Backpatching nbtree VACUUM (page deletion) hardening
On Fri, Sep 2, 2022 at 6:51 PM Peter Geoghegan wrote:
> Yes -- nbtree VACUUM generally can cope quite well, even when the
> index is corrupt. It should mostly manage to do what is expected here,
> even with a misbehaving opclass, because it relies as little as
> possible on user-defined opclass code.

I just backpatched the hardening commit from 14 to every supported branch.

-- Peter Geoghegan
Re: Backpatching nbtree VACUUM (page deletion) hardening
On Fri, Sep 2, 2022 at 6:14 PM Michael Paquier wrote:
> This has been a problem for years, and still for years to come with
> libc updates. I am not much into this stuff, but does running VACUUM
> in this case help with the state of the index that used a past,
> now-invalid, collation (be it libc or ICU) to get a bit cleaned up?

Yes -- nbtree VACUUM generally can cope quite well, even when the index is corrupt. It should mostly manage to do what is expected here, even with a misbehaving opclass, because it relies as little as possible on user-defined opclass code.

Even without the hardening in place, nbtree VACUUM will still do a *surprisingly* good job of recovering when the opclass is broken in some way: VACUUM just needs the insertion scankey operator class code to initially determine roughly where to look for the to-be-deleted page's downlink, one level up in the tree. Even when an operator class is wildly broken (e.g. the comparator gives a result that it determines at random), we still won't see problems in nbtree VACUUM most of the time -- because even being roughly correct is good enough in practice! You have to be quite unlucky to hit this, even when the opclass is wildly broken (which is probably much less common than "moderately broken").

> When written like that, this surely sounds extremely bad and this
> would need more complex chirurgy (or just running with a build that
> includes this patch?).

The patch will fix the case in question, which I have seen internal AWS reports about -- though the initial fix that went into 14 wasn't driven by any complaint from any user. I just happened to notice that we were throwing an ERROR in nbtree VACUUM for no good reason, which is something that should be avoided on general principle. In theory there could be other ways in which you'd run into the same basic problem (in any index AM).
The important point is that we're better off not throwing any errors in the first place, but if we must then they had better not be errors that will be repeated again and again, without any chance of the problem going away naturally. (Not that it never makes sense to just throw an error; there are meaningful gradations of "totally unacceptable problem".) -- Peter Geoghegan
Backpatching nbtree VACUUM (page deletion) hardening
Postgres 14 commit 5b861baa55 added hardening to nbtree page deletion. This had the effect of making nbtree VACUUM robust against misbehaving operator classes -- we just LOG the problem and move on, without throwing an error. In practice a "misbehaving operator class" is often a problem with collation versioning.

I think that this should be backpatched now, to protect users from particularly nasty problems that hitting the error eventually leads to. An error ends the whole VACUUM operation. If VACUUM cannot delete the page the first time, there is no reason to think that it'll be any different on the second or the tenth attempt. The eventual result (absent user/DBA intervention) is that no antiwraparound autovacuum will ever complete, leading to an outage when the system hits xidStopLimit. (Actually this scenario won't result in the system hitting xidStopLimit where the failsafe is available, but that's another thing that is only in 14, so that's not any help.)

This seems low risk. The commit in question is very simple. It just downgrades an old 9.4-era ereport() from ERROR to LOG, and adds a "return false;" immediately after that. The function in question is fundamentally structured in a way that allows it to back out of page deletion because of problems that are far removed from where the caller starts from. When and why we back out of page deletion is already opaque to the caller, so it's very hard to imagine a new problem caused by backpatching. Besides all this, 14 has been out for a while now.

-- Peter Geoghegan
Re: Instrumented pages/tuples frozen in autovacuum's server log output (and VACUUM VERBOSE)
On Wed, Aug 31, 2022 at 7:49 PM Jeff Janes wrote:
> I like this addition, but I would also like to see how many pages got newly
> set to all frozen by the vacuum.

I'd say that that's independent work. Though I'm happy to discuss it now.

It would be fairly straightforward to show something about the VM itself, but it's not entirely clear what aspects of the VM should be emphasized. Are we reporting on the state of the table, or work performed by VACUUM? You said you were interested in the latter already, but why prefer that over a summary of the contents of the VM at the end of the VACUUM? Are you concerned about the cost of setting pages all-visible? Do you have an interest in how VACUUM manages to set VM pages over time? Something else?

We already call visibilitymap_count() at the end of every VACUUM, which scans the authoritative VM to produce a more-or-less consistent summary of the VM at that point in time. This information is then used to update pg_class.relallvisible (we don't do anything with the all-frozen number at all). Why not show that information in VERBOSE/autovacuum's log message? Does it really matter *when* a page became all-visible/all-frozen/unset?

> Also, isn't all of vacuuming about XID processing? I think "frozen:" would
> be a more suitable line prefix.

That also works for me. I have no strong feelings here.

-- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Aug 25, 2022 at 2:21 PM Peter Geoghegan wrote:
> Attached patch series is a completely overhauled version of earlier
> work on freezing. Related work from the Postgres 15 cycle became
> commits 0b018fab, f3c15cbe, and 44fa8488.

Attached is v2. This is just to keep CFTester happy, since v1 now has conflicts when applied against HEAD. There are no notable changes in this v2 compared to v1.

-- Peter Geoghegan

v2-0001-Add-page-level-freezing-to-VACUUM.patch
Description: Binary data

v2-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch
Description: Binary data

v2-0003-Add-eager-freezing-strategy-to-VACUUM.patch
Description: Binary data

v2-0004-Avoid-allocating-MultiXacts-during-VACUUM.patch
Description: Binary data
Re: effective_multixact_freeze_max_age issue
On Tue, Aug 30, 2022 at 8:56 PM Nathan Bossart wrote: > LGTM Pushed, thanks. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Tue, Aug 30, 2022 at 11:28 PM Jeff Davis wrote:
> That clarifies your point. It's still a challenge for me to reason
> about which of these potential new problems really need to be solved in
> v1, though.

I don't claim to understand it that well myself -- not just yet. I feel like I have the right general idea, but the specifics aren't all there (which is very often the case for me at this point in the cycle). That seems like a good basis for further discussion. It's going to be quite a few months before some version of this patchset is committed, at the very earliest. Obviously these are questions that need answers, but the process of getting to those answers is a significant part of the work itself IMV.

> > Why stop at a couple of dozens of lines of code? Why not just change
> > the default of vacuum_freeze_min_age and
> > vacuum_multixact_freeze_min_age to 0?
>
> I don't think that would actually solve the unbounded buildup of
> unfrozen pages. It would still be possible for pages to be marked all
> visible before being frozen, and then end up being skipped until an
> aggressive vacuum is forced, right?

With the 15 work in place, and with the insert-driven autovacuum behavior from 13, it is likely that this would be enough to avoid all antiwraparound vacuums in an append-only table. There is still one case where we can throw away the opportunity to advance relfrozenxid during non-aggressive VACUUMs for no good reason -- I didn't fix them all just yet. But the remaining case (which is in lazy_scan_skip()) is very narrow.

With vacuum_freeze_min_age = 0 and vacuum_multixact_freeze_min_age = 0, any page that is eligible to be set all-visible is also eligible to have its tuples frozen and be set all-frozen instead, immediately. When it isn't then we'll scan it in the next VACUUM anyway. Actually I'm also ignoring some subtleties with Multis that could make this not quite happen, but again, that's only a super obscure corner case.
The idea that just setting vacuum_freeze_min_age = 0 and vacuum_multixact_freeze_min_age = 0 will be enough is definitely true in spirit. You don't need to touch vacuum_freeze_table_age (if you did then you'd get aggressive VACUUMs, and one goal here is to avoid those whenever possible -- especially aggressive antiwraparound autovacuums). -- Peter Geoghegan
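[Editor's note: in configuration terms, the hypothetical "freeze eagerly, but don't force aggressive VACUUMs" setup being discussed amounts to something like the following postgresql.conf sketch. The commented-out value is the documented default; this illustrates the discussion, it is not a tuning recommendation.]

```ini
# Freeze tuples on any page VACUUM can mark all-visible, instead of
# waiting for tuples to age past the 50-million-XID default.
vacuum_freeze_min_age = 0
vacuum_multixact_freeze_min_age = 0

# Deliberately left alone: lowering this forces aggressive VACUUMs,
# which the discussion above is specifically trying to avoid.
#vacuum_freeze_table_age = 150000000
```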
Re: New strategies for freezing, advancing relfrozenxid early
On Tue, Aug 30, 2022 at 9:37 PM Jeff Davis wrote:
> On Tue, 2022-08-30 at 18:50 -0700, Peter Geoghegan wrote:
> > Since VM snapshots are immutable, it should be relatively
> > easy to have the implementation make its final decision on skipping
> > only *after* lazy_scan_heap() returns.
>
> I like this idea.

I was hoping that you would.

I imagine that this idea (with minor variations) could enable an approach that's much closer to what you were thinking of: one that mostly focuses on controlling the number of unfrozen pages, and not so much on advancing relfrozenxid early, just because we can and we might not get another chance for a long time. In other words your idea of a design that can freeze more during a non-aggressive VACUUM, while still potentially skipping all-visible pages.

I said earlier on that we ought to at least have a strong bias in the direction of advancing relfrozenxid in larger tables, especially when we decide to freeze whole pages more eagerly -- we only get one chance to advance relfrozenxid per VACUUM, and those opportunities will naturally be few and far between. We cannot really justify all this extra freezing if it doesn't prevent antiwraparound autovacuums. That was more or less my objection to going in that direction.

But if we can more or less double the number of opportunities to at least ask the question "is now a good time to advance relfrozenxid?" without really paying much for keeping this option open (and I think that we can), my concern about relfrozenxid advancement becomes far less important. Just being able to ask that question is significantly less rare and precious. Plus we'll probably be able to make significantly better decisions about relfrozenxid overall with the "second phase because I changed my mind about skipping" concept in place.

-- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Tue, Aug 30, 2022 at 1:45 PM Peter Geoghegan wrote:
> > d. Can you help give me a sense of scale of the problems solved by
> > visibilitymap snapshots and the cost model? Do those need to be in v1?
>
> I'm not sure. I think that having certainty that we'll be able to scan
> only so many pages up-front is very broadly useful, though. Plus it
> removes the SKIP_PAGES_THRESHOLD stuff, which was intended to enable
> relfrozenxid advancement in non-aggressive VACUUMs, but does so in a
> way that results in scanning many more pages needlessly. See commit
> bf136cf6, which added the SKIP_PAGES_THRESHOLD stuff back in 2009,
> shortly after the visibility map first appeared.

Here is a better example:

Right now the second patch adds both VM snapshots and the skipping strategy stuff. The VM snapshot is used in the second patch, as a source of reliable information about how we need to process the table, in terms of the total number of scanned_pages -- which drives our choice of strategy. Importantly, we can assess the question of which skipping strategy to take (in non-aggressive VACUUM) based on 100% accurate information about how many *extra* pages we'll have to scan in the event of being eager (i.e. in the event that we prioritize early relfrozenxid advancement over skipping some pages). Importantly, that cannot change later on, since VM snapshots are immutable -- everything is locked in.

That already seems quite valuable to me. This general concept could be pushed a lot further without great difficulty. Since VM snapshots are immutable, it should be relatively easy to have the implementation make its final decision on skipping only *after* lazy_scan_heap() returns. We could allow VACUUM to "change its mind about skipping" in cases where it initially thought that skipping was the best strategy, only to discover much later on that that was the wrong choice after all. A huge amount of new, reliable information will come to light from scanning the heap rel.
In particular, the current value of vacrel->NewRelfrozenXid seems like it would be particularly interesting when the time came to consider if a second scan made sense -- if NewRelfrozenXid is a recent-ish value already, then that argues for finishing off the all-visible pages in a second heap pass, with the aim of setting relfrozenxid to a similarly recent value when it happens to be cheap to do so.

The actual process of scanning precisely those all-visible pages that were skipped the first time around during a second call to lazy_scan_heap() can be implemented in the obvious way: by teaching the VM snapshot infrastructure/lazy_scan_skip() to arrange for pages that were skipped the first time around to get scanned during the second pass over the heap instead. Also, those pages that were scanned the first time around can/must be skipped on our second pass (excluding all-frozen pages, which won't be scanned in either heap pass).

I've used the term "second heap pass" here, but that term is slightly misleading. The final outcome of this whole process is that every heap page that the vmsnap says VACUUM will need to scan in order for it to be able to safely advance relfrozenxid will be scanned, precisely once. The overall order that the heap pages are scanned in will of course differ from the simple case, but I don't think that it makes very much difference. In reality there will have only been one heap pass, consisting of two distinct phases. No individual heap page will ever be considered for pruning/freezing more than once, no matter what. This is just a case of *reordering* work. Immutability makes reordering work easy in general.

-- Peter Geoghegan
Re: effective_multixact_freeze_max_age issue
On Mon, Aug 29, 2022 at 3:40 PM Nathan Bossart wrote: > Agreed. Attached is v2, which cleans up the structure of vacuum_set_xid_limits() a bit more. The overall idea was to improve the overall flow/readability of the function by moving the WARNINGs into their own "code stanza", just after the logic for establishing freeze cutoffs and just before the logic for deciding on aggressiveness. That is now the more logical approach (group the stanzas by functionality), since we can't sensibly group the code based on whether it deals with XIDs or with Multis anymore (not since the function was taught to deal with the question of whether caller's VACUUM will be aggressive). Going to push this in the next day or so. In v2 I also removed some local variables that made the function a lot harder to follow. Consider code like this:

-   freezemin = freeze_min_age;
-   if (freezemin < 0)
-       freezemin = vacuum_freeze_min_age;
-   freezemin = Min(freezemin, autovacuum_freeze_max_age / 2);
-   Assert(freezemin >= 0);
+   if (freeze_min_age < 0)
+       freeze_min_age = vacuum_freeze_min_age;
+   freeze_min_age = Min(freeze_min_age, autovacuum_freeze_max_age / 2);
+   Assert(freeze_min_age >= 0);

Why have this freezemin temp variable? Why not just use the vacuum_freeze_min_age function parameter directly instead? That is a better representation of what's going on at the conceptual level. We now assign vacuum_freeze_min_age to the vacuum_freeze_min_age arg (not to the freezemin variable) when our VACUUM caller passes us a value of -1 for that arg. -1 effectively means "whatever the value of the vacuum_freeze_min_age GUC is", which is clearer without the superfluous freezemin variable. -- Peter Geoghegan v2-0001-Derive-freeze-cutoff-from-nextXID-not-OldestXmin.patch Description: Binary data
Re: New strategies for freezing, advancing relfrozenxid early
> Can you help give me a sense of scale of the problems solved by > visibilitymap snapshots and the cost model? Do those need to be in v1? I'm not sure. I think that having certainty that we'll be able to scan only so many pages up-front is very broadly useful, though. Plus it removes the SKIP_PAGES_THRESHOLD stuff, which was intended to enable relfrozenxid advancement in non-aggressive VACUUMs, but does so in a way that results in scanning many more pages needlessly. See commit bf136cf6, which added the SKIP_PAGES_THRESHOLD stuff back in 2009, shortly after the visibility map first appeared. Since relfrozenxid advancement fundamentally works at the table level, it seems natural to make it a top-down, VACUUM-level thing -- even within non-aggressive VACUUMs (I guess it already meets that description in aggressive VACUUMs). And since we really want to advance relfrozenxid when we do extra freezing (for the reasons I just went into), it seems natural to me to view it as one problem. I accept that it's not clear cut, though. [1] https://docs.google.com/presentation/d/1WgP-SlKay5AnSoVDSvOIzmu7edMmtYhdywoa0oAR4JQ/edit?usp=sharing [2] https://disc-projects.bu.edu/compactionary/research.html -- Peter Geoghegan
Re: effective_multixact_freeze_max_age issue
On Mon, Aug 29, 2022 at 2:20 AM Matthias van de Meent wrote: > Apart from the message that this behaviour is changing, I'd prefer > some more description in the commit message as to why this needs > changing. I usually only write a full commit message before posting a patch when it's a full patch series, where it can be helpful to be very explicit about how the parts fit together. The single line commit message is just a placeholder -- I'll definitely write a better one before commit. > Then, on to the patch itself: > > > + * XXX We don't do push back oldestMxact here, which is not > > ideal > > Do you intend to commit this marker, or is this leftover from the > development process? Ordinarily I would never commit an XXX comment, and probably wouldn't even leave one in early revisions of patches that I post to the list. This is a special case, though -- it involves the "snapshot too old" feature, which has many similar XXX/FIXME/TODO comments. I think I might leave it like that when committing. The background here is that the snapshot too old code still has lots of problems -- there is a FIXME comment that gives an overview of this in TransactionIdLimitedForOldSnapshots(). We're going to have to live with the fact that that feature isn't in good shape for the foreseeable future. I can only really work around it. > > +if (*multiXactCutoff < FirstMultiXactId) > [...] > > +if (safeOldestMxact < FirstMultiXactId) > [...] > > +if (aggressiveMXIDCutoff < FirstMultiXactId) > > I prefer !TransactionId/MultiXactIdIsValid() over '< First > [MultiXact/Transaction]Id', even though it is the same in > functionality, because it clarifies the problem we're trying to solve. > I understand that the use of < is pre-existing, but since we're > touching this code shouldn't we try to get this new code up to current > standards? I agree in principle, but there are already 40+ other places that use the same idiom in places like multixact.c. 
Perhaps you can propose a patch to change all of them at once, together? -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
There is definitely no point in scanning only a subset of the table's all-visible pages, as far as relfrozenxid advancement is concerned (and skipping strategy is fundamentally a choice about relfrozenxid advancement vs work avoidance, eagerness vs laziness). Maybe you're right that there is room for additional freezing strategies, besides the two added by the v1-0003-* patch. Definitely seems possible. The freezing strategy concept should be usable as a framework for adding additional strategies, including (just for example) a strategy that decides ahead of time to freeze only so many pages, though not others (without regard for the fact that the pages that we are freezing may not be very different to those we won't be freezing in the current VACUUM). I'm definitely open to that. It's just a matter of characterizing what set of workload characteristics this third strategy would address, how users might opt in or opt out, etc. Both the eager and the lazy freezing strategies are based on some notion of what's important for the table, based on its known characteristics, and based on what seems likely to happen to the table in the future (the next VACUUM, at least). I'm not completely sure how many strategies we'll end up needing. Though it seems like the eager/lazy trade-off is a really important part of how these strategies will need to work, in general. (Thinks some more) I guess that such an alternative freezing strategy would probably have to affect the skipping strategy too. It's tricky to tease apart because it breaks the idea that skipping strategy and freezing strategy are basically distinct questions. That is a factor that makes it a bit more complicated to discuss. In any case, as I said, I have an open mind about alternative freezing strategies beyond the 2 basic lazy/eager freezing strategies from the patch. > What if we thought about this more like a "background freezer".
> It would keep track of the total number of unfrozen pages in the system, > and freeze them at some kind of controlled/adaptive rate. I like the idea of storing metadata in shared memory. And scheduling and deprioritizing running autovacuums. Being able to slow down or even totally halt a given autovacuum worker without much consequence is enabled by the VM snapshot concept. That said, this seems like future work to me. Worth discussing, but trying to keep out of scope in the first version of this that is committed. > Regular autovacuum's job would be to keep advancing relfrozenxid for > all tables and to do other cleanup, and the background freezer's job > would be to keep the absolute number of unfrozen pages under some > limit. Conceptually those two jobs seem different to me. The problem with making it such a sharp distinction is that it can be very useful to manage costs by making it the job of VACUUM to do both -- we can avoid dirtying the same page multiple times. I think that we can accomplish the same thing by giving VACUUM more freedom to do either more or less work, based on the observed characteristics of the table, and some sense of how costs will tend to work over time, across multiple distinct VACUUM operations. In practice that might end up looking very similar to what you describe. It seems undesirable for VACUUM to ever be too sure of itself -- the information that triggers autovacuum may not be particularly reliable, which can be solved to some degree by making as many decisions as possible at runtime, dynamically, based on the most authoritative and recent information. Delaying committing to one particular course of action isn't always possible, but when it is possible (and not too expensive) we should do it that way on general principle. > Also, regarding patch v1-0001-Add-page-level-freezing, do you think > that narrows the conceptual gap between an all-visible page and an all- > frozen page? Yes, definitely.
However, I don't think that we can just get rid of the distinction completely -- though I did think about it for a while. For one thing we need to be able to handle cases like the case where heap_lock_tuple() modifies an all-frozen page, and makes it all-visible without making it completely unskippable to every VACUUM operation. -- Peter Geoghegan
Re: effective_multixact_freeze_max_age issue
On Sun, Aug 28, 2022 at 4:14 PM Nathan Bossart wrote: > The idea seems sound to me, and IMO your patch simplifies things nicely, > which might be reason enough to proceed with it. It is primarily a case of making things simpler. Why would it ever make sense to interpret age differently in the presence of a long running transaction, though only for the FreezeLimit/MultiXactCutoff cutoff calculation? And not for the closely related freeze_table_age/multixact_freeze_table_age calculation? It's hard to imagine that that was ever a deliberate choice. vacuum_set_xid_limits() didn't contain the logic for determining if its caller's VACUUM should be an aggressive VACUUM until quite recently. Postgres 15 commit efa4a9462a put the logic for determining aggressiveness right next to the logic for determining FreezeLimit, which made the inconsistency much more noticeable. It is easy to believe that this was really just an oversight, all along. > However, I'm struggling > to understand when this change would help much in practice. IIUC it will > cause vacuums to freeze a bit more, but outside of extreme cases (maybe > when vacuum_freeze_min_age is set very high and there are long-running > transactions), ISTM it might not have tremendously much impact. Is the > intent to create some sort of long-term behavior change for autovacuum, or > is this mostly aimed towards consistency among the cutoff calculations? I agree that this will have only a negligible impact on the majority (perhaps even the vast majority) of applications. The primary justification for this patch (simplification) seems sufficient, all things considered. Even still, it's possible that it will help in extreme cases. Cases with pathological performance issues, particularly those involving MultiXacts. We already set FreezeLimit to the most aggressive possible value of OldestXmin when OldestXmin has itself already crossed a quasi arbitrary XID-age threshold of autovacuum_freeze_max_age XIDs (i.e. 
when OldestXmin < safeLimit), with analogous rules for MultiXactCutoff/OldestMxact. Consequently, the way that we set the cutoffs for freezing in pathological cases where (say) there is a leaked replication slot will see a sharp discontinuity in how FreezeLimit is set, within and across tables. And for what? Initially, these pathological cases will end up using exactly the same FreezeLimit for every VACUUM against every table (assuming that we're using a system-wide min_freeze_age setting) -- every VACUUM operation will use a FreezeLimit of `OldestXmin - vacuum_freeze_min_age`. At a certain point that'll suddenly flip -- now every VACUUM operation will use a FreezeLimit of `OldestXmin`. OTOH with the patch they'd all have a FreezeLimit that is tied to the time that each VACUUM started -- which is exactly the FreezeLimit behavior that we'd get if there was no leaked replication slot (at least until OldestXmin finally attains an age of vacuum_freeze_min_age, when it finally becomes unavoidable, even with the patch). There is something to be said for preserving the "natural diversity" of the relfrozenxid values among tables, too. The FreezeLimit we use is (at least for now) almost always going to be very close to (if not exactly) the same value as the NewFrozenXid value that we set relfrozenxid to at the end of VACUUM (at least in larger tables). Without the patch, a once-off problem with a leaked replication slot can accidentally result in lasting problems where all of the largest tables get their antiwraparound autovacuums at exactly the same time. The current behavior increases the risk that we'll accidentally "synchronize" the relfrozenxid values for large tables that had an antiwraparound vacuum during the time when OldestXmin was held back. Needlessly using the same FreezeLimit across each VACUUM operation risks disrupting the natural ebb and flow of things. It's hard to say how much of a problem that is in the real world. But why take any chances?
-- Peter Geoghegan
Re: effective_multixact_freeze_max_age issue
On Tue, Aug 2, 2022 at 4:12 PM Peter Geoghegan wrote: > That is, we only need to make sure that the "multiXactCutoff must > always be <= oldestMxact" invariant holds once, by checking for it > directly. The same thing happens with OldestXmin/FreezeLimit. That > seems like a simpler foundation. It's also a lot more logical. Why > should the cutoff for freezing be held back by a long running > transaction, except to the extent that it is strictly necessary to do > so to avoid wrong answers (wrong answers seen by the long running > transaction)? Anybody have any input on this? I'm hoping that this can be committed soon. ISTM that the way that we currently derive FreezeLimit (by starting with OldestXmin rather than starting with the same ReadNextTransactionId() value that gets used for the aggressiveness cutoffs) is just an accident of history. The "Routine vacuuming" docs already describe this behavior in terms that sound closer to the behavior with the patch than the actual current behavior: "When VACUUM scans every page in the table that is not already all-frozen, it should set age(relfrozenxid) to a value just a little more than the vacuum_freeze_min_age setting that was used (more by the number of transactions started since the VACUUM started)" Besides, why should there be an idiosyncratic definition of "age" that is only used with vacuum_freeze_min_age/vacuum_multixact_freeze_min_age? Why would anyone want to do less freezing in the presence of a long running transaction? It simply makes no sense (unless we're forced to do so to preserve basic guarantees needed for correctness, such as the "FreezeLimit <= OldestXmin" invariant). -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Aug 25, 2022 at 4:23 PM Peter Geoghegan wrote: > As a general rule VACUUM will tend to do more eager freezing with the > patch set compared to HEAD, though it should never do less eager > freezing. Not even in corner cases -- never. Come to think of it, I don't think that that's quite true. Though the fourth patch isn't particularly related to the problem. It *is* true that VACUUM will do at least as much freezing of XID based tuple header fields as before. That just leaves MXIDs. It's even true that we will do just as much freezing as before if you go purely by MultiXact age. But I'm the one that likes to point out that age is altogether the wrong approach for stuff like this -- so that won't cut it. More concretely, I think that the patch series will fail to do certain inexpensive eager processing of tuple xmax, that will happen today, regardless of what the user has set vacuum_freeze_min_age or vacuum_multixact_freeze_min_age to. Although we currently only care about XID age when processing simple XIDs, we already sometimes make trade-offs similar to the trade-off I propose to make in the fourth patch for Multis. In other words, on HEAD, we promise to process any MXID < MultiXactCutoff inside FreezeMultiXactId(). But we also manage to do "eager processing of xmax" when it's cheap and easy to do so, without caring about MultiXactCutoff at all -- this is likely the common case, even. This preexisting eager processing of Multis is likely important in many applications. The problem that I think I've created is that page-level freezing as implemented in lazy_scan_prune by the third patch doesn't know that it might be a good idea to execute a subset of freeze plans, in order to remove a multi from a page right away. It mostly has the right idea by holding off on freezing until it looks like a good idea at the level of the whole page, but I think that this is a plausible exception.
Just because we're much more sensitive to leaving behind a Multi, and right now the only code path that can remove a Multi that isn't needed anymore is FreezeMultiXactId(). If xmax was an updater that aborted instead of a multi then we could rely on hint bits being set by pruning to avoid clog lookups. Technically nobody has violated their contract here, I think, but it still seems like it could easily be unacceptable. I need to come up with my own microbenchmark suite for Multis -- that was on my TODO list already. I still believe that the fourth patch addresses Andres' complaint about allocating new Multis during VACUUM. But it seems like I need to think about the nuances of Multis some more. In particular, what the performance impact might be of making a decision on freezing at the page level, in light of the special performance considerations for Multis. Maybe it needs to be more granular than that, more often. Or maybe we can comprehensively solve the problem in some other way entirely. Maybe pruning should do this instead, in general. Something like that might put this right, and be independently useful. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Aug 25, 2022 at 3:35 PM Jeremy Schneider wrote: > We should be careful here. IIUC, the current autovac behavior helps > bound the "spread" or range of active multixact IDs in the system, which > directly determines the number of distinct pages that contain those > multixacts. If the proposed change herein causes the spread/range of > MXIDs to significantly increase, then it will increase the number of > blocks and increase the probability of thrashing on the SLRUs for these > data structures. As a general rule VACUUM will tend to do more eager freezing with the patch set compared to HEAD, though it should never do less eager freezing. Not even in corner cases -- never. With the patch, VACUUM pretty much uses the most aggressive possible XID-wise/MXID-wise cutoffs in almost all cases (though only when we actually decide to freeze a page at all, which is now a separate question). The fourth patch in the patch series introduces a very limited exception, where we use the same cutoffs that we'll always use on HEAD (FreezeLimit + MultiXactCutoff) instead of the aggressive variants (OldestXmin and OldestMxact). This isn't just *any* xmax containing a MultiXact: it's a Multi that contains *some* XIDs that *need* to go away during the ongoing VACUUM, and others that *cannot* go away. Oh, and there usually has to be a need to keep two or more XIDs for this to happen -- if there is only one XID then we can usually swap xmax with that XID without any fuss. > PS. see also > https://www.postgresql.org/message-id/247e3ce4-ae81-d6ad-f54d-7d3e0409a...@ardentperf.com I think that the problem you describe here is very real, though I suspect that it needs to be addressed by making opportunistic cleanup of Multis happen more reliably. Running VACUUM more often just isn't practical once a table reaches a certain size. In general, any kind of processing that is time sensitive probably shouldn't be happening solely during VACUUM -- it's just too risky. 
VACUUM might take a relatively long time to get to the affected page. It might not even be that long in wall clock time or whatever -- just too long to reliably avoid the problem. -- Peter Geoghegan
New strategies for freezing, advancing relfrozenxid early
an the alternative). Reducing the WAL space overhead of freezing === Not included in this new v1 are other patches that control the overhead of added freezing -- my focus since joining AWS has been to get these more strategic patches in shape, and telling the right story about what I'm trying to do here. I'm going to say a little on the patches that I have in the pipeline here, though. Getting the low-level/mechanical overhead of freezing under control will probably require a few complementary techniques, not just high-level strategies (though the strategy stuff is the most important piece). The really interesting omitted-in-v1 patch adds deduplication of xl_heap_freeze_page WAL records. This reduces the space overhead of WAL records used to freeze by ~5x in most cases. It works in the obvious way: we just store the 12 byte freeze plans that appear in each xl_heap_freeze_page record only once, and then store an array of item offset numbers for each entry (rather than naively storing a full 12 bytes per tuple frozen per page-level WAL record). This means that we only need an "extra" ~2 bytes of WAL space per "extra" tuple frozen (2 bytes for an OffsetNumber) once we decide to freeze something on the same page. The *marginal* cost can be much lower than it is today, which makes page-based batching of freezing much more compelling IMV. Thoughts? -- Peter Geoghegan v1-0001-Add-page-level-freezing-to-VACUUM.patch Description: Binary data v1-0004-Avoid-allocating-MultiXacts-during-VACUUM.patch Description: Binary data v1-0003-Add-eager-freezing-strategy-to-VACUUM.patch Description: Binary data v1-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch Description: Binary data
Instrumented pages/tuples frozen in autovacuum's server log output (and VACUUM VERBOSE)
Attached is another autovacuum (and VACUUM VERBOSE) instrumentation patch. This one adds instrumentation about freezing to the report autovacuum makes to the server log. Specifically, it makes the output look like this:

regression=# vacuum (freeze,verbose) foo;
INFO: aggressively vacuuming "regression.public.foo"
INFO: finished vacuuming "regression.public.foo": index scans: 0
pages: 0 removed, 45 remain, 45 scanned (100.00% of total)
tuples: 0 removed, 1 remain, 0 are dead but not yet removable
removable cutoff: 751, which was 0 XIDs old when operation ended
new relfrozenxid: 751, which is 2 XIDs ahead of previous value
XIDs processed: 45 pages from table (100.00% of total) had 1 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
I/O timings: read: 0.023 ms, write: 0.000 ms
avg read rate: 2.829 MB/s, avg write rate: 5.658 MB/s
buffer usage: 95 hits, 2 misses, 4 dirtied
WAL usage: 91 records, 1 full page images, 133380 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM

Notice the new line about freezing, which we always output -- it's the line that begins with "XIDs processed", that appears about half way down. The new line is deliberately placed after the existing "new relfrozenxid" line and before the existing line about dead item identifiers. This placement of the new instrumentation seems logical to me; freezing is related to relfrozenxid (obviously), but doesn't need to be shoehorned into the prominent early line that reports on tuples removed/remain[ing]. Like its neighboring "dead item identifier" line, this new line shows the number of items/tuples affected, and the number of heap pages affected -- with heap pages shown both as an absolute number and as a percentage of rel_pages (in parentheses). The main cost associated with freezing is the WAL overhead, so emphasizing pages here seems like the way to go -- pages are more interesting than tuples.
This format also makes it relatively easy to get a sense of the *relative* costs of the overhead of each distinct class/category of maintenance performed. -- Peter Geoghegan v1-0001-Add-freezing-instrumentation-to-VACUUM-VERBOSE.patch Description: Binary data
Re: pg15b3: crash in paralell vacuum
On Thu, Aug 18, 2022 at 7:25 AM Masahiko Sawada wrote: > Justin, if it's reproducible in your environment, could you please try > it again with the attached patch? Pushed, thanks. I wonder how this issue could have been caught earlier, or even avoided in the first place. Would the bug have been caught if Valgrind had known to mark dynamic shared memory with VALGRIND_MAKE_MEM_UNDEFINED() when it is first allocated? ISTM that we should do something that is analogous to aset.c's Valgrind handling for palloc() requests. Similar work on buffers in shared memory led to us catching a tricky bug involving unsafe access to a buffer, a little while ago -- see bugfix commit 7b7ed046. The bug in question would probably have taken much longer to catch without the instrumentation. In fact, it seems like a good idea to use Valgrind for *anything* where it *might* catch bugs, just in case. Valgrind can work well for shared memory without any extra work. The backend's own idea of the memory (the memory mapping used by the process) is all that Valgrind cares about. You don't have to worry about Valgrind instrumentation in one backend causing confusion in another backend. It's very practical, and very general purpose. I think that most of the protection comes from a basic understanding of "this memory is unsafe to access, this memory contains uninitialized data that cannot be assumed to have any particular value, this memory is initialized and safe". -- Peter Geoghegan
Re: Remaining case where reltuples can become distorted across multiple VACUUM operations
On Thu, Aug 11, 2022 at 1:48 AM Matthias van de Meent wrote: > I think I understand your reasoning, but I don't agree with the > conclusion. The attached patch 0002 does fix that skew too, at what I > consider negligible cost. 0001 is your patch with a new version > number. Your patch added allowSystemTableMods to one of the tests. I guess that this was an oversight? > I'm fine with your patch as is, but would appreciate it if known > estimate mistakes would also be fixed. Why do you think that this particular scenario/example deserves special attention? As I've acknowledged already, it is true that your scenario is one in which we provably give a less accurate estimate, based on already-available information. But other than that, I don't see any underlying principle that would be violated by my original patch (any kind of principle, held by anybody). reltuples is just an estimate. I was thinking of going your way on this, purely because it didn't seem like there'd be much harm in it (why not just handle your case and be done with it?). But I don't think that it's a good idea now. reltuples is usually derived by ANALYZE using a random sample, so the idea that tuple density can be derived accurately enough from a random sample is pretty baked in. You're talking about a case where ignoring just one page ("sampling" all but one of the pages) *isn't* good enough. It just doesn't seem like something that needs to be addressed -- it's quite awkward to do so. Barring any further objections, I plan on committing the original version tomorrow. > An alternative solution could be doing double-vetting, where we ignore > tuples_scanned if <2% of pages AND <2% of previous estimated tuples > was scanned. I'm not sure that I've understood you, but I think that you're talking about remembering more information (in pg_class), which is surely out of scope for a bug fix. -- Peter Geoghegan
Re: Remaining case where reltuples can become distorted across multiple VACUUM operations
On Mon, Aug 8, 2022 at 9:17 AM Matthias van de Meent wrote: > Because if a subset of the pages of a relation contains more tuples > than your current total expected tuples in the table, you should > update your expectations regardless of which blocks or which number of > blocks you've scanned - the previous stored value is a strictly worse > estimation than your last measurement. The previous stored value could be -1, which represents the idea that we don't know the tuple density yet. So it doesn't necessarily follow that the new estimate is strictly better, even in this exact scenario. > A 33-block relation with first 32 1-tuple pages is still enough to > have a last page with 250 tuples, which would be ignored in that > scheme and have a total tuple count of 33 or so. The simple fact is that there is only so much we can do with the limited information/context that we have. Heuristics are not usually free of all bias. Often the bias is the whole point -- the goal can be to make sure that we have the bias that we know we can live with, and not the opposite bias, which is much worse. Details of which are usually very domain specific. I presented my patch with a very simple test case -- a very clear problem. Can you do the same for this scenario? I accept that it is possible that we'll keep an old reltuples which is provably less accurate than doing something with the latest information from vacuumlazy.c. But the conditions under which this can happen are *very* narrow. I am not inclined to do anything about it for that reason. -- Peter Geoghegan
Re: Remaining case where reltuples can become distorted across multiple VACUUM operations
On Mon, Aug 8, 2022 at 8:33 AM Matthias van de Meent wrote: > For example, if currently the measured 2% of the pages contains more > than 100% of the previous count of tuples, or with your patch the last > page contains more than 100% of the previous count of the tuples, that > new count is ignored, which seems silly considering that the vacuum > count is supposed to be authorative. The 2% thing is conditioned on the new relpages value precisely matching the existing relpages from pg_class -- which makes it very targeted. I don't see why scanned_tuples greatly exceeding the existing reltuples from pg_class is interesting (any more interesting than the other way around). We'll always accept scanned_tuples as authoritative when VACUUM actually scans all pages, no matter what. Currently it isn't possible for VACUUM to skip pages in a table that is 32 pages or less in size. So even the new "single page" thing from the patch cannot matter there. -- Peter Geoghegan
Re: Remaining case where reltuples can become distorted across multiple VACUUM operations
On Mon, Aug 8, 2022 at 8:14 AM Matthias van de Meent wrote: > I do not have intimate knowledge of this code, but shouldn't we also > add some sefety guarantees like the following in these blocks? Right > now, we'll keep underestimating the table size even when we know that > the count is incorrect. > > if (scanned_tuples > old_rel_tuples) > return some_weighted_scanned_tuples; Not sure what you mean -- we do something very much like that already. We take the existing tuple density, and assume that that hasn't changed for any unscanned pages -- that is used to build a total number of tuples for the unscanned pages. Then we add the number of live tuples/scanned_tuples that the vacuumlazy.c caller just encountered on scanned_pages. That's often where the final reltuples value comes from. -- Peter Geoghegan
Re: Remaining case where reltuples can become distorted across multiple VACUUM operations
On Fri, Aug 5, 2022 at 5:39 PM Peter Geoghegan wrote: > Attached patch fixes closes the remaining gap. With the patch applied, > the second and subsequent vacuum verbose operations from the test case > will show that reltuples is still 1 (it won't ever change). The > patch just extends an old behavior that was applied when scanned_pages > == 0 to cases where scanned_pages <= 1 (unless we happened to scan all > of the relation's tables, of course). My plan is to commit this later in the week, barring objections. Maybe on Thursday. -- Peter Geoghegan
Re: Making autovacuum logs indicate if insert-based threshold was the triggering condition
On Sat, Aug 6, 2022 at 3:51 PM Justin Pryzby wrote: > > Well, autovacuum.c should have (and/or kind of already has) 3 > > different triggering conditions. These are mutually exclusive > > conditions -- technically autovacuum.c always launches an autovacuum > > against a table because exactly 1 of the 3 thresholds was crossed. > > The issue being that both thresholds can be crossed: > > >> 2022-08-06 16:47:47.674 CDT autovacuum worker[12707] DEBUG: t: VAC: 9 > >> (THRESHOLD 50), INS: 9 (THRESHOLD 1000), anl: 18 (threshold 50) What are the chances that both thresholds will be crossed at *exactly* (not approximately) the same time in a real world case, where the table isn't tiny (tiny relative to the "autovacuum_naptime quantum")? This is a very narrow case. Besides, the same can already be said of how autovacuum.c crosses the XID-based antiwraparound threshold. Yet we still arbitrarily report that it's antiwraparound in the logs, which (at least right now) is generally assumed to mostly be about advancing relfrozenxid. (Or maybe it's the other way around; it doesn't matter.) It might make sense to *always* show how close we were to hitting each of the thresholds, including the ones that we didn't end up hitting (we may come pretty close quite often, which seems like it might matter). But showing multiple conditions together just because the planets aligned (we hit multiple thresholds together) emphasizes the low-level mechanism, which is pretty far removed from anything that matters. You might as well pick either threshold at random once this happens -- even an expert won't be able to tell the difference. -- Peter Geoghegan
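The two thresholds under discussion are computed along these lines (a simplified Python sketch using the documented GUC defaults -- autovacuum_vacuum_threshold = 50, autovacuum_vacuum_insert_threshold = 1000, both scale factors 0.2; the real autovacuum.c code also consults per-table reloptions):

```python
def autovacuum_trigger(dead_tuples, inserted_tuples, reltuples,
                       vac_base=50, vac_scale=0.2,
                       ins_base=1000, ins_scale=0.2):
    """Report which autovacuum threshold (if any) was crossed.

    Mirrors the shape of the two triggering conditions.  When both
    happen to be crossed in the same naptime cycle, only one reason
    can sensibly be reported -- the point made in the email above.
    """
    vac_threshold = vac_base + vac_scale * reltuples
    ins_threshold = ins_base + ins_scale * reltuples
    if dead_tuples > vac_threshold:
        return "dead tuples"
    if inserted_tuples > ins_threshold:
        return "inserted tuples"
    return None
```

The DEBUG line quoted above is exactly this comparison with reltuples near zero: both "VAC: 9 (THRESHOLD 50)" and "INS: 9 (THRESHOLD 1000)" are tuple counts checked against base + scale * reltuples.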
Re: Making autovacuum logs indicate if insert-based threshold was the triggering condition
On Sat, Aug 6, 2022 at 2:50 PM Justin Pryzby wrote: > This sounded familiar, and it seems like I anticipated that it might be an > issue. Here, I was advocating for the new insert-based GUCs to default to -1, > to have insert-based autovacuum fall back to the thresholds specified by the > pre-existing GUCs (20% + 50), which would (in my proposal) remain the > normal > way to tune any type of vacuum. > > https://www.postgresql.org/message-id/20200317233218.gd26...@telsasoft.com > > I haven't heard of anyone who had trouble setting the necessary GUC, but I'm > not surprised if most postgres installations are running versions before 13. ISTM that having insert-based triggering conditions is definitely a good idea, but what we have right now still has problems. It currently won't work very well unless the user goes out of their way to tune freezing to do the right thing. Typically we miss out on the opportunity to freeze early, because without sophisticated intervention from the user there is only a slim chance of *any* freezing taking place outside of the inevitable antiwraparound autovacuum. > > Note that a VACUUM that is an "automatic vacuum for inserted tuples" cannot > > [...] also be a "regular" autovacuum/VACUUM > > Why not ? Well, autovacuum.c should have (and/or kind of already has) 3 different triggering conditions. These are mutually exclusive conditions -- technically autovacuum.c always launches an autovacuum against a table because exactly 1 of the 3 thresholds was crossed. My patch makes sure that it always gives exactly one reason why autovacuum.c decided to VACUUM, so by definition there is only one relevant piece of information for vacuumlazy.c to report in the log. That's fairly simple and high level, and presumably something that users won't have much trouble understanding.
Right now antiwraparound autovacuum "implies aggressive", in that it almost always makes vacuumlazy.c use aggressive mode, but this seems totally arbitrary to me -- they don't have to be virtually synonymous. I think that antiwraparound autovacuum could even be rebranded as "an autovacuum that takes place because the table hasn't had one in a long time". This is much less scary, and makes it clearer that autovacuum.c shouldn't be expected to really understand what will turn out to be important "at runtime". That's the time to make important decisions about what work to do -- when we actually have accurate information. My antiwraparound example is just that: an example. There is a broader idea: we shouldn't be too confident that the exact triggering condition autovacuum.c applied to launch an autovacuum worker turns out to be the best reason to VACUUM, or even a good reason -- vacuumlazy.c should be able to cope with that. The user is kept in the loop about both, by reporting the triggering condition and the details of what really happened at runtime. Maybe vacuumlazy.c can be taught to speed up and slow down based on the conditions it observes as it scans the heap -- there are many possibilities. This broader idea is pretty much what you were getting at with your example, I think. -- Peter Geoghegan
Making autovacuum logs indicate if insert-based threshold was the triggering condition
It's quite possible (and probably very common) for certain tables to have autovacuum scheduling trigger autovacuums based on both the "regular" bloat-oriented thresholds, and the newer insert-based thresholds. It may be far from obvious which triggering condition autovacuum.c must have applied to trigger any given autovacuum, since that information isn't currently passed down to vacuumlazy.c. This seems like a problem to me; how are users supposed to tune autovacuum's thresholds without even basic feedback about how the thresholds get applied? Attached patch teaches autovacuum.c to pass the information down to vacuumlazy.c, which includes the information in the autovacuum log. The approach I've taken is very similar to the existing approach with anti-wraparound autovacuum. It's pretty straightforward. Note that a VACUUM that is an "automatic vacuum for inserted tuples" cannot also be an antiwraparound autovacuum, nor can it also be a "regular" autovacuum/VACUUM -- there are now 3 distinct "triggering conditions" for autovacuum. Although this patch doesn't touch antiwraparound autovacuums at all, I will note in passing that I think that anti-wraparound autovacuums should become just another triggering condition for autovacuum -- IMV they shouldn't be special in *any* way. We'd still need to keep antiwraparound's "cannot automatically cancel autovacuum worker" behavior in some form, but that would become dynamic, a little like the failsafe is today, and would trigger on its own timeline (probably *much* later than we first trigger antiwraparound autovacuum). We'd also need to decouple "aggressiveness" (the behaviors that we associate with aggressive mode in vacuumlazy.c) from the condition of the table/system when VACUUM first began -- those could all become dynamic, too. -- Peter Geoghegan v1-0001-Show-when-autovacuum-is-insert-driven-in-log.patch Description: Binary data
Remaining case where reltuples can become distorted across multiple VACUUM operations
My bugfix commit 74388a1a (which was pushed back in February) added heuristics to VACUUM's reltuples calculation/estimate. This prevented VACUUM from distorting our estimate of reltuples over time, across successive VACUUM operations run against the same table. The problem was that VACUUM could scan the same single heap page again and again, while believing it saw a random sample each time. This eventually leads to a pg_class.reltuples value that is based on the assumption that every single heap page in the table is just like the heap page that gets "sampled" again and again. This was always the last heap page (due to implementation details related to the work done by commit e8429082), which in practice tends to be a particularly poor representation of the overall tuple density of the table. I have discovered a gap in these heuristics: there are remaining cases where the percentage threshold doesn't prevent reltuples distortion as intended. It can still happen with tables that are small enough that a cutoff of 2% of rel_pages is less than a single page, yet still large enough that vacuumlazy.c will consider it worth its while to skip some pages using the visibility map. It will typically skip all but the final heap page from the relation (same as the first time around). Here is a test case that shows how this can still happen on HEAD (and in Postgres 15):

regression=# create table foo(bar int);
CREATE TABLE
regression=# insert into foo select i from generate_series(1, 1) i;
INSERT 0 1

Now run vacuum verbose against the table several times:

regression=# vacuum verbose foo;
*** SNIP ***
regression=# vacuum verbose foo;

The first vacuum shows "tuples: 0 removed, 1 remain...", which is correct. However, each subsequent vacuum verbose revises the estimate downwards, eventually making pg_class.reltuples significantly underestimate tuple density (same as the first time around). Attached patch closes the remaining gap.
With the patch applied, the second and subsequent vacuum verbose operations from the test case will show that reltuples is still 1 (it won't ever change). The patch just extends an old behavior that was applied when scanned_pages == 0 to cases where scanned_pages <= 1 (unless we happened to scan all of the relation's pages, of course). It doesn't remove the original test from commit 74388a1a, which still seems like a good idea to me. -- Peter Geoghegan v1-0001-Avoid-reltuples-distortion-in-very-small-tables.patch Description: Binary data
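The guard the patch extends can be sketched like this (an illustrative Python model of the heuristic, not the committed code; names are made up for the example):

```python
def final_reltuples(rel_pages, scanned_pages, extrapolated, old_rel_tuples):
    """Decide whether to trust an extrapolated reltuples value.

    With scanned_pages <= 1 and pages left unscanned, the "sample" is
    a single page -- typically the last one -- so keep the old value
    instead of letting repeated VACUUMs drag the estimate down.
    """
    if scanned_pages >= rel_pages:
        return extrapolated        # scanned everything: trust it
    if scanned_pages <= 1 and old_rel_tuples >= 0:
        return old_rel_tuples      # refuse a one-page "sample"
    return extrapolated
```

In the test case from the earlier email, every VACUUM after the first scans only the final page, so this keeps reltuples pinned at the previously established value.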
Re: BTMaxItemSize seems to be subtly incorrect
On Fri, Aug 5, 2022 at 10:13 AM Peter Geoghegan wrote: > Update: I discovered that I can get the regression tests to fail (even > on mainstream 64-bit platforms) by MAXALIGN()'ing the expression that > we assign to state->maxpostingsize at the top of _bt_dedup_pass(). Looks like this was nothing more than a silly oversight with how the macro was defined. As written, it would evaluate to the wrong thing at the same point in nbtdedup.c, just because it was used in an expression. Pushed a fix for that just now. -- Peter Geoghegan
Re: BTMaxItemSize seems to be subtly incorrect
On Thu, Aug 4, 2022 at 10:40 PM Peter Geoghegan wrote: > This very likely has something to do with the way nbtdedup.c uses > BTMaxItemSize(), which apparently won't work on these 32-bit systems > now. Update: I discovered that I can get the regression tests to fail (even on mainstream 64-bit platforms) by MAXALIGN()'ing the expression that we assign to state->maxpostingsize at the top of _bt_dedup_pass(). This is surprising; it contradicts existing comments that explain that the existing max is 1/6 of a page by choice, to get better space utilization than the more natural cap of 1/3 of a page. It now looks like that might have actually been strictly necessary, all along. -- Peter Geoghegan
Re: BTMaxItemSize seems to be subtly incorrect
On Thu, Aug 4, 2022 at 10:25 PM Thomas Munro wrote: > FYI florican and lapwing showed: > > 2022-08-05 01:04:29.903 EDT [34485:5] FATAL: deduplication failed to > add heap tid to pending posting list > 2022-08-05 01:04:29.903 EDT [34485:6] CONTEXT: WAL redo at 0/49708D8 > for Btree/DEDUP: nintervals 4; blkref #0: rel 1663/16384/2674, blk 1 This very likely has something to do with the way nbtdedup.c uses BTMaxItemSize(), which apparently won't work on these 32-bit systems now. I'll fix this tomorrow morning. -- Peter Geoghegan
Re: BTMaxItemSize seems to be subtly incorrect
On Thu, Jun 9, 2022 at 11:20 AM Robert Haas wrote: > I think I'd feel more comfortable if you wrote the patch, if that's possible. Okay, pushed a fix just now. Thanks -- Peter Geoghegan
Re: annoyance with .git-blame-ignore-revs
On Mon, Jul 11, 2022 at 12:30 PM Alvaro Herrera wrote: > $ git blame configure > fatal: could not open object name list: .git-blame-ignore-revs > > My first workaround was to add empty .git-blame-ignore-revs in all > checkouts. This was moderately ok (shrug), until after a recent `tig` > upgrade the empty file started to show up in the history as an untracked > file. Ping? Would be nice to get this done soon. I don't think that it requires a great deal of care. If I was doing this myself, I would probably make sure that the backbranch copies of the file won't reference commits from later releases. But even that probably doesn't matter; just backpatching the file from HEAD as-is wouldn't break anybody's workflow. Again, to reiterate: I see no reason to do anything on the backbranches here more than once. I mentioned already that somebody proposed a patch that fixes the problem at the git level, which seems to have stalled. Here is the discussion: https://public-inbox.org/git/xmqq5ywehb69.fsf@gitster.g/T/ ISTM that we're working around what is actually a usability problem with git (imagine that!). I think that that's fine. Just thought that it was worth acknowledging it as such. We're certainly not the first people to run into this exact annoyance. -- Peter Geoghegan
Re: pg15b2: large objects lost on upgrade
On Thu, Aug 4, 2022 at 12:31 PM Robert Haas wrote: > > What about autoanalyze? > > What about it? It has a tendency to consume an XID, here or there, quite unpredictably. I've noticed that this often involves an analyze of pg_statistic. Have you accounted for that? You said upthread that you don't like "fuzzy" tests, because it's too easy for things to look like they're working when they really aren't. I suppose that there may be some truth to that, but ISTM that there is also a lot to be said for a test that can catch failures that weren't specifically anticipated. Users won't be running pg_upgrade with autovacuum disabled. And so ISTM that just testing that relfrozenxid has been carried forward is more precise about one particular detail (more precise than alternative approaches to testing), but less precise about the thing that we actually care about. -- Peter Geoghegan
Re: pg15b2: large objects lost on upgrade
On Thu, Aug 4, 2022 at 12:15 PM Robert Haas wrote: > Given that the old cluster is suffering no new write > transactions, there's probably exactly two values that are legal: one > being the value from the old cluster, which we know, and the other > being whatever a vacuum of that table would produce, which we don't > know, although we do know that it's somewhere in that range. What about autoanalyze? -- Peter Geoghegan
Re: pg15b2: large objects lost on upgrade
On Thu, Aug 4, 2022 at 11:07 AM Tom Lane wrote: > How much will that add to the test's runtime? I could get behind this > idea if it's not exorbitantly expensive. I'm not sure offhand, but I suspect it wouldn't be too bad. -- Peter Geoghegan
Re: pg15b2: large objects lost on upgrade
On Thu, Aug 4, 2022 at 9:44 AM Robert Haas wrote: > But if you don't want to do that, and you also don't want to have > random failures, the only alternatives are weakening the check and > removing the test. It's kind of hard to say which is better, but I'm > inclined to think that if we just weaken the test we're going to think > we've got coverage for this kind of problem when we really don't. Perhaps amcheck's verify_heapam() function can be used here. What could be better than exhaustively verifying that the relfrozenxid (and relminmxid) invariants hold for every single tuple in the table? Those are the exact conditions that we care about, as far as relfrozenxid/relminmxid goes. My sense is that that has a much better chance of detecting a real bug at some point. This approach is arguably an example of property-based testing. -- Peter Geoghegan
Re: pg15b2: large objects lost on upgrade
On Wed, Aug 3, 2022 at 1:47 PM Tom Lane wrote: > Again, this seems to me to be breaking the test's real-world applicability > for a (false?) sense of stability. I agree. A lot of the VACUUM test flappiness issues we've had to deal with in the past now seem like problems with VACUUM itself, the test's design, or both. For example, why should we get a totally different pg_class.reltuples because we couldn't get a cleanup lock on some page? Why not just make sure to give the same answer either way, which happens to be the most useful behavior to the user? That way the test isn't just targeting implementation details. -- Peter Geoghegan
Re: pg15b2: large objects lost on upgrade
On Wed, Aug 3, 2022 at 1:34 PM Tom Lane wrote: > That doesn't seem like it'd be all that thorough: we expect VACUUM > to skip pages whenever possible. I'm also a bit concerned about > the expense, though admittedly this test is ridiculously expensive > already. I bet the SKIP_PAGES_THRESHOLD stuff will be enough to make VACUUM visit every heap page in practice for a test case like this. That is all it takes to be able to safely advance relfrozenxid to whatever the oldest extant XID happened to be. However, I'm no fan of the SKIP_PAGES_THRESHOLD behavior, and already have plans to get rid of it -- so I wouldn't rely on that continuing to be true forever. It's probably not really necessary to have that kind of coverage in this particular test case. VACUUM will complain about weird relfrozenxid values in a large variety of contexts, even without assertions enabled. Mostly I was just saying: if we really do need test coverage of relfrozenxid in this context, then VACUUM is probably the way to go. -- Peter Geoghegan
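The SKIP_PAGES_THRESHOLD behavior referred to here works roughly as follows (a simplified Python sketch; the real skipping logic in vacuumlazy.c also distinguishes all-visible from all-frozen pages and behaves differently in aggressive mode):

```python
SKIP_PAGES_THRESHOLD = 32  # the actual constant in vacuumlazy.c

def pages_to_scan(all_visible, threshold=SKIP_PAGES_THRESHOLD):
    """Return the page numbers VACUUM would actually read.

    Runs of consecutive all-visible pages are only skipped when the
    run is at least `threshold` pages long; shorter runs are read
    anyway, to keep I/O sequential.  A table of 32 pages or less can
    therefore never have any pages skipped -- which is why VACUUM
    visits every page of small tables in practice.
    """
    scan = []
    n = len(all_visible)
    i = 0
    while i < n:
        if not all_visible[i]:
            scan.append(i)
            i += 1
            continue
        # Measure the run of consecutive all-visible pages.
        j = i
        while j < n and all_visible[j]:
            j += 1
        if j - i < threshold:
            scan.extend(range(i, j))  # run too short: read it anyway
        i = j
    return scan
```

For example, 31 all-visible pages followed by one modified page get scanned in full, while a run of 32 all-visible pages is skipped entirely.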
Re: pg15b2: large objects lost on upgrade
On Wed, Aug 3, 2022 at 1:20 PM Andres Freund wrote: > > I don't really like this approach. Imagine that the code got broken in > > such a way that relfrozenxid and relminmxid were set to a value chosen > > at random - say, the contents of 4 bytes of unallocated memory that > > contained random garbage. Well, right now, the chances that this would > > cause a test failure are nearly 100%. With this change, they'd be > > nearly 0%. > > Can't that pretty easily be addressed by subsequently querying txid_current(), > and checking that the value isn't newer than that? It couldn't hurt to do that as well, in passing (at the same time as testing that newrelfrozenxid >= oldrelfrozenxid directly). But deliberately running VACUUM afterwards seems like a good idea. We really ought to expect VACUUM to catch cases where relfrozenxid/relminmxid is faulty, at least in cases where it can be proven wrong by noticing some kind of inconsistency. -- Peter Geoghegan
Re: pg15b2: large objects lost on upgrade
On Wed, Aug 3, 2022 at 6:59 AM Robert Haas wrote: > I don't really like this approach. Imagine that the code got broken in > such a way that relfrozenxid and relminmxid were set to a value chosen > at random - say, the contents of 4 bytes of unallocated memory that > contained random garbage. Well, right now, the chances that this would > cause a test failure are nearly 100%. With this change, they'd be > nearly 0%. If that kind of speculative bug existed, and somehow triggered before the concurrent autovacuum ran (which seems very likely to be the source of the test flappiness), then it would still be caught, most likely. VACUUM itself has the following defenses: * The defensive "can't happen" errors added to heap_prepare_freeze_tuple() and related freezing routines by commit 699bf7d0 in 2017, as hardening following the "freeze the dead" bug. That'll catch XIDs that are before the relfrozenxid at the start of the VACUUM (ditto for MXIDs/relminmxid). * The assertion added in my recent commit 0b018fab, which verifies that we're about to set relfrozenxid to something sane. * VACUUM now warns when it sees a *previous* relfrozenxid that's apparently "in the future", following recent commit e83ebfe6. This problem scenario is associated with several historic bugs in pg_upgrade, where for one reason or another it failed to carry forward correct relfrozenxid and/or relminmxid values for a table (see the commit message for references to those old pg_upgrade bugs). It might make sense to run a manual VACUUM right at the end of the test, so that you reliably get this kind of coverage, even without autovacuum. -- Peter Geoghegan
Re: pg15b2: large objects lost on upgrade
On Tue, Aug 2, 2022 at 12:32 PM Tom Lane wrote: > Hmmm ... now that you mention it, I see nothing in 002_pg_upgrade.pl > that attempts to turn off autovacuum on either the source server or > the destination. So one plausible theory is that autovac moved the > numbers since we checked. It's very easy to believe that my work in commit 0b018fab could make that happen, which is only a few months old. It's now completely routine for non-aggressive autovacuums to advance relfrozenxid by at least a small amount. For example, when autovacuum runs against either the tellers table or the branches table during a pgbench run, it will now advance relfrozenxid, every single time. And to a very recent value. This will happen in spite of the fact that no freezing ever takes place -- it's just a consequence of the oldest extant XID consistently being quite young each time, due to workload characteristics. -- Peter Geoghegan
effective_multixact_freeze_max_age issue
The effective_multixact_freeze_max_age mechanism added by commit 53bb309d2d forces aggressive VACUUMs to take place earlier, as protection against wraparound of the MultiXact member space SLRU. There was also a follow-up bugfix several years later -- commit 6bda2af039 -- which made sure that the MXID-wise cutoff used to determine which MXIDs to freeze in vacuumlazy.c could never exceed oldestMxact (VACUUM generally cannot freeze MultiXacts that are still seen as running by somebody according to oldestMxact). I would like to talk about making the effective_multixact_freeze_max_age stuff more robust, particularly in the presence of a long held snapshot that holds things up even as SLRU space for MultiXact members dwindles. I have to admit that I always found this part of vacuum_set_xid_limits() confusing. I suspect that the problem has something to do with how we calculate vacuumlazy.c's multiXactCutoff (as well as FreezeLimit): vacuum_set_xid_limits() just subtracts a freezemin value from GetOldestMultiXactId(). This is confusingly similar (though different in important ways) to the handling for other related cutoffs that happens nearby. In particular, we start from ReadNextMultiXactId() (not from GetOldestMultiXactId()) for the cutoff that determines if the VACUUM is going to be aggressive. I think that this can be fixed -- see the attached patch. Of course, it wouldn't be safe to allow vacuum_set_xid_limits() to hand off a multiXactCutoff to vacuumlazy.c that is (for whatever reason) less than GetOldestMultiXactId()/oldestMxact (the bug fixed by 6bda2af039 involved just such a scenario). But that doesn't seem like much of a problem to me. We can just handle it directly, as needed. 
The attached patch handles it as follows:

    /* Compute multiXactCutoff, being careful to generate a valid value */
    *multiXactCutoff = nextMXID - mxid_freezemin;
    if (*multiXactCutoff < FirstMultiXactId)
        *multiXactCutoff = FirstMultiXactId;

    /* multiXactCutoff must always be <= oldestMxact */
    if (MultiXactIdPrecedes(*oldestMxact, *multiXactCutoff))
        *multiXactCutoff = *oldestMxact;

That is, we only need to make sure that the "multiXactCutoff must always be <= oldestMxact" invariant holds once, by checking for it directly. The same thing happens with OldestXmin/FreezeLimit. That seems like a simpler foundation. It's also a lot more logical. Why should the cutoff for freezing be held back by a long running transaction, except to the extent that it is strictly necessary to do so to avoid wrong answers (wrong answers seen by the long running transaction)? This allows us to simplify the code that issues a WARNING about oldestMxact/OldestXmin inside vacuum_set_xid_limits(). Why not actually test oldestMxact/OldestXmin directly, without worrying about the limits (multiXactCutoff/FreezeLimit)? That also seems more logical; there is more to be concerned about than freezing being blocked when OldestXmin gets very old. Though we still rely on the autovacuum_freeze_max_age GUC to represent "a wildly unreasonable number of XIDs for OldestXmin to be held back by", just because that's still convenient. -- Peter Geoghegan v1-0001-Derive-VACUUM-s-FreezeLimit-from-next-transaction.patch Description: Binary data
Re: Maximize page freezing
On Fri, Jul 29, 2022 at 5:55 AM Simon Riggs wrote: > > I should be able to post something in a couple of weeks. > > How do you see that affecting this thread? Well, it's clearly duplicative, at least in part. That in itself doesn't mean much, but there are some general questions (that apply to any variant of proactive/batched freezing), particularly around the added overhead, and the question of whether or not we get to advance relfrozenxid substantially in return for that cost. Those parts are quite tricky. I have every intention of addressing these thorny questions in my upcoming patch set, which actually does far more than change the rules about when and how we freeze -- changing the mechanism itself is very much the easy part. I'm taking a holistic approach that involves making an up-front decision about freezing strategy based on the observed characteristics of the table, driven by what we see in the visibility map at the start. Similar questions will also apply to this patch, even though it isn't as aggressive (your patch doesn't trigger freezing when a page is about to be set all-visible in order to make sure that it can be set all-frozen instead). You still want to give the user a clear benefit for any added overhead. It needs a great deal of performance validation, too. -- Peter Geoghegan
Re: Maximize page freezing
On Thu, Jul 28, 2022 at 6:56 AM Matthias van de Meent wrote: > Great idea, yet this patch seems to only freeze those tuples that are > located after the first to-be-frozen tuple. It should probably > re-visit earlier live tuples to potentially freeze those as well. I have a big patch set pending that does this (which I dubbed "page-level freezing"), plus a bunch of other things that control the overhead. The basic idea of freezing all of the tuples on a page together did appear in earlier patches that were posted, though; those were among the things that didn't make it into Postgres 15. I should be able to post something in a couple of weeks. -- Peter Geoghegan
Re: PANIC: wrong buffer passed to visibilitymap_clear
On Fri, Jul 22, 2022 at 1:22 AM 王伟 wrote: > I recently find this problem while testing PG14 with sysbench. The line numbers from your stack trace don't match up with REL_14_STABLE. Is this actually a fork of Postgres 14? (Oh, looks like it's an old beta release.) > Then I look through the emails from pgsql-hackers and find a previous > similar bug which is > https://www.postgresql.org/message-id/flat/2247102.1618008027%40sss.pgh.pa.us. > But the bugfix commit(34f581c39e97e2ea237255cf75cccebccc02d477) is already > patched to PG14. It does seem possible that there is another similar bug somewhere -- another case where we were protected by the fact that VACUUM acquired a full cleanup lock (not just an exclusive buffer lock) during its second heap pass. That changed in Postgres 14 (commit 8523492d4e). But I really don't know -- almost anything is possible. > I'm wondering whether there's another code path to lead this problem > happened. Since, I take a deep dig via gdb which turns out that newbuffer is > not equal to buffer. In other words, the function RelationGetBufferForTuple > must have been called just now. > Besides, why didn't we re-check the flag after RelationGetBufferForTuple was > called? Recheck what flag? And at what point? It's not easy to figure this out from your stack trace, because of the line number issues. It would also be helpful if you told us about the specific table involved. Though the important thing (the essential thing) is to test today's REL_14_STABLE. There have been *lots* of bug fixes since Postgres 14 beta2 was current. -- Peter Geoghegan
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Jul 18, 2022 at 9:10 PM John Naylor wrote: > On Tue, Jul 19, 2022 at 9:24 AM Andres Freund wrote: > > FWIW, I think the best path forward would be to do something similar to the > > simplehash.h approach, so it can be customized to the specific user. > > I figured that would come up at some point. It may be worth doing in the > future, but I think it's way too much to ask for the first use case. I have a prototype patch that creates a read-only snapshot of the visibility map, and has vacuumlazy.c work off of that when determining which pages to skip. The patch also gets rid of the SKIP_PAGES_THRESHOLD stuff. This is very effective with TPC-C, principally because it really cuts down on the number of scanned_pages that are scanned only because the VM bit is unset concurrently by DML. The window for this is very large when the table is large (and naturally takes a long time to scan), resulting in many more "dead but not yet removable" tuples being encountered than necessary. Which itself causes bogus information in the FSM -- information about the space that VACUUM could free from the page, which is often highly misleading. There are remaining questions about how to do this properly. Right now I'm just copying pages from the VM into local memory, right after OldestXmin is first acquired -- we "lock in" a snapshot of the VM at the earliest opportunity, which is what lazy_scan_skip() actually works off now. There needs to be some consideration given to the resource management aspects of this -- it needs to use memory sensibly, which the current prototype patch doesn't do at all. I'm probably going to seriously pursue this as a project soon, and will probably need some kind of data structure for the local copy. The raw pages are usually quite space inefficient, considering we only need an immutable snapshot of the VM. I wonder if it makes sense to use this as part of this project.
It will be possible to know the exact heap pages that will become scanned_pages before scanning even one page with this design (perhaps with caveats about low memory conditions). It could also be very effective as a way of speeding up TID lookups in the reasonably common case where most scanned_pages don't have any LP_DEAD items -- just look it up in our local/materialized copy of the VM first. But even when LP_DEAD items are spread fairly evenly, it could still give us reliable information about the distribution of LP_DEAD items very early on. Maybe the two data structures could even be combined in some way? You can use more memory for the local copy of the VM if you know that you won't need the memory for dead_items. It's kinda the same problem, in a way. -- Peter Geoghegan
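The "locked-in VM snapshot" idea can be sketched minimally like this (a toy Python model; the actual prototype described above copies raw VM pages and would need real memory management, and the class and method names here are purely illustrative):

```python
class VMSnapshot:
    """Immutable snapshot of per-page all-visible bits, taken once at
    the start of VACUUM, right after OldestXmin is acquired.

    All skipping decisions are made against this snapshot, so a VM bit
    concurrently cleared by DML mid-scan no longer forces an extra
    page visit -- and the exact set of scanned_pages is known up
    front, before a single heap page has been read.
    """

    def __init__(self, all_visible_bits):
        self._bits = tuple(all_visible_bits)  # frozen copy of the VM

    def is_all_visible(self, blkno):
        """Cheap pre-check, e.g. before a dead_items TID lookup."""
        return self._bits[blkno]

    def scanned_pages(self):
        """Exact list of heap pages this VACUUM will scan."""
        return [b for b, vis in enumerate(self._bits) if not vis]
```

Whether such a structure could share space-efficient machinery with the dead_items TID store, as mused above, is an open design question.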
Re: Rename some rel truncation related constants at the top of vacuumlazy.c
On Mon, Jul 18, 2022 at 8:55 PM Tom Lane wrote: > Um ... you seem to have removed some useful comments? I don't think that the stuff about making them into a GUC is useful myself. > Personally I wouldn't do this, as I don't think the renaming > brings much benefit, and it will create a hazard for back-patching > any fixes that might be needed in that code. I'm not hugely upset > about it, but that's the way I'd vote if asked. In that case I withdraw the patch. FWIW I wrote the patch during the course of work on new feature development, on a patch that added a couple of similar constants a bit further down. Seemed neater this way, but it's certainly not worth arguing over. -- Peter Geoghegan
Rename some rel truncation related constants at the top of vacuumlazy.c
I propose to rename some of the rel truncation related constants at the top of vacuumlazy.c, per the attached patch. The patch consolidates related constants into a single block/grouping, and imposes a uniform naming scheme. -- Peter Geoghegan 0001-vacuumlazy.c-rename-rel-truncation-constants.patch Description: Binary data
Re: [PATCH v1] remove redundant check of item pointer
On Thu, Jul 14, 2022 at 3:59 PM Tom Lane wrote: > Even in an assert-enabled build, wouldn't you expect the compiler to > optimize away the second assertion as unreachable code? I think that it probably would, even at -O0 (GCC doesn't really allow you to opt out of all optimizations). I did think of that myself, but it seemed rather beside the point. There have been individual cases where individual assertions were deemed a bit too heavyweight. But those have been few and far between. I myself tend to use *lots* of technically-redundant assertions like this for preconditions and postconditions. At worst they're code comments that are all but guaranteed to stay current. -- Peter Geoghegan
Re: [PATCH v1] remove redundant check of item pointer
On Thu, Jul 14, 2022 at 3:31 PM Bruce Momjian wrote: > On Wed, Apr 27, 2022 at 08:04:00PM +0800, Junwang Zhao wrote: > for non-Assert builds, ItemPointerGetOffsetNumberNoCheck() and > ItemPointerGetOffsetNumber() are the same, so I don't see the point to > making this change. Frankly, I don't know why we even have two > functions for this. I am guessing ItemPointerGetOffsetNumberNoCheck is > for cases where you have an Assert build and do not want the check. Sometimes we use ItemPointerData for things that aren't actually TIDs. For example, both GIN and B-Tree type-pun the ItemPointerData field from the IndexTuple struct. Plus we do something like that with UPDATEs that affect a partitioning key in a partitioned table. The proposal doesn't seem like an improvement. Technically the assertion cannot possibly fail here because the earlier assertion would always fail instead, so strictly speaking it is redundant -- at least right now. That is true. But it seems much more important to be consistent about which variant to use. Especially because there is obviously no overhead in builds without assertions enabled. -- Peter Geoghegan
Re: annoyance with .git-blame-ignore-revs
On Mon, Jul 11, 2022 at 12:36 PM Tom Lane wrote: > Only if we had to update all those copies all the time. But > I'm guessing we wouldn't need a branch's copy to be newer than > the last pgindent run affecting that branch? I wouldn't care very much if the file itself was empty in the backbranches, and remained that way -- that would at least suppress annoying error messages on those branches (from my text editor's git blame feature). You might as well have the relevant commits when you backpatch, but that's kind of not the point. At least to me. In any case I don't see a need to maintain the file on the backbranches. -- Peter Geoghegan
Re: annoyance with .git-blame-ignore-revs
On Mon, Jul 11, 2022 at 12:30 PM Alvaro Herrera wrote: > Anybody has any idea how to handle this better? > > A viable option would be to backpatch the addition of > .git-blame-ignore-revs to all live branches. Would that bother anyone? +1. I was thinking of suggesting the same thing myself, for the same reasons. This solution is a good start, but it does leave one remaining problem: commits from before the introduction of .git-blame-ignore-revs still won't have the file. There was actually a patch for git that tried to address the problem directly, but it didn't go anywhere. Maybe just backpatching the file is good enough. -- Peter Geoghegan
Re: Add last_vacuum_index_scans in pg_stat_all_tables
On Fri, Jul 8, 2022 at 10:47 AM Alvaro Herrera wrote: > Saving some sort of history would be much more useful, but of course a > lot more work. I think that storing a certain amount of history would be very useful, for lots of reasons. Not just for instrumentation purposes; I envisage a design where VACUUM itself makes certain decisions based on the history of each VACUUM operation against the table. The direction that things have taken tells us a certain amount about the direction that things are going in -- and that's something we should try to influence. The simplest and best example of how this could help is probably freezing, and freeze debt. Currently, the visibility map interacts with vacuum_freeze_min_age in a way that allows unfrozen all-visible pages to accumulate. These pages won't be frozen until the next aggressive VACUUM. But there is no fixed relationship between the number of XIDs consumed by the system (per unit of wallclock time) and the number of unfrozen all-visible pages (over the same duration). So we might end up having to freeze an absolutely enormous number of pages in the eventual aggressive vacuum. We also might not -- it's really hard to predict, for reasons that just don't make much sense. There are a few things we could do here, but having a sense of history seems like the important part. If (say) the table exceeds a certain size, and the number of all-visible pages grows and grows (without any freezing taking place), then we should "proactively" freeze at least some of the unfrozen all-visible pages in earlier VACUUM operations. In other words, we should (at the very least) spread out the burden of freezing those pages over time, while being careful not to pay too much more than we would with the old approach if and when the workload characteristics change again. More generally, I think that we should blur the distinction between aggressive and non-aggressive autovacuum.
Sure, we'd still need VACUUM to "behave aggressively" in some sense, but that could all happen dynamically, without committing to a particular course of action until the last moment -- being able to change our minds at the last minute can be very valuable, even though we probably won't change our minds too often. -- Peter Geoghegan
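The kind of history-driven rule being discussed could look something like this toy sketch; the threshold and every name here are invented for illustration, and are not taken from any patch:

```c
#include <stdbool.h>

/*
 * Hypothetical policy: once a large enough table has accumulated a big
 * backlog of all-visible-but-unfrozen pages, start freezing some of them
 * during earlier (non-aggressive) VACUUMs, spreading the burden out over
 * time instead of leaving it all for one eventual aggressive VACUUM.
 */
static bool
should_freeze_proactively(long rel_pages, long unfrozen_allvisible_pages)
{
    if (rel_pages < 4096)       /* small tables: the eventual debt is tiny */
        return false;

    /* act once more than 20% of the table is unfrozen-but-all-visible */
    return unfrozen_allvisible_pages * 5 > rel_pages;
}
```

A real version would presumably consult the stored history of previous VACUUM operations rather than a single snapshot, and would remain free to change its mind at the last moment, per the email above.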
Re: explain analyze rows=%.0f
On Thu, Jul 7, 2022 at 1:21 PM Robert Haas wrote:
> Nested Loop (actual time=TIME FOR THIS AND ALL CHILDREN rows=THE REAL
> ROW COUNT loops=1)
> -> Seq Scan on something (actual time=THE TIME IT REALLY TOOK rows=THE
> REAL ROW COUNT loops=1)
> -> Index Scan using someidx on somethingelse (actual time=NOT REALLY
> HOW LONG IT TOOK rows=NOT REALLY HOW MANY ROWS WE GOT loops=HUGE
> NUMBER)
>
> If I'm looking at this plan and trying to find out what's gone wrong,
> I want to know how much time got spent in the nested loop, how much
> time got spent in the Seq Scan, and how much time got spent in the
> Index Scan. It's easy to figure out how much time got spent in the Seq
> Scan, but to find out how much time got spent in the Index Scan, I
> have to multiply the time by the loop count.

I agree that this general state of affairs is very confusing, and seems like something that we should still improve. Just because it's a very old way of presenting the information doesn't mean that it's the best one, or even a particularly good one. Plus you could probably make some kind of concession in the direction of maintaining compatibility with the current approach if you had to. Right? -- Peter Geoghegan
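The bookkeeping Robert describes -- recovering a node's true total from the per-loop averages that EXPLAIN ANALYZE prints -- amounts to a single multiplication, sketched here:

```c
/*
 * EXPLAIN ANALYZE reports "actual time" and "rows" for a node as per-loop
 * averages.  For nodes on the inner side of a nested loop, the true total
 * requires multiplying by the loop count by hand.
 */
static double
node_total_ms(double per_loop_actual_ms, long loops)
{
    return per_loop_actual_ms * (double) loops;
}
```

For example, an Index Scan showing actual time=2.5 with loops=4 really accounts for 10 ms of the parent's runtime, not 2.5 ms.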
Re: [UNVERIFIED SENDER] Re: pg_upgrade can result in early wraparound on databases with high transaction load
On Tue, Jul 5, 2022 at 12:41 PM Peter Geoghegan wrote: > Actually, commit 74cf7d46 was where pg_resetxlog/pg_resetwal's -u > argument was first added, for use by pg_upgrade. That commit is only > about a year old, and was only backpatched to 9.6. I just realized that this thread was where that work was first discussed. That explains why it took a year to discover that we broke 8.4! On further reflection I think that breaking pg_upgrade for 8.4 might have been a good thing. The issue was fairly visible and obvious if you actually ran into it, which is vastly preferable to what would have happened before commit 74cf7d46. -- Peter Geoghegan
Re: [UNVERIFIED SENDER] Re: pg_upgrade can result in early wraparound on databases with high transaction load
On Tue, Jul 5, 2022 at 11:53 AM Andrew Dunstan wrote: > > Sure enough, 8.4's pg_controldata doesn't print anything about > > oldestXID, because that info wasn't there then. > > > > Given the lack of field complaints, it's probably not worth trying > > to do anything to restore that capability. But we really ought to > > update pg_upgrade's code and docs in pre-v15 branches to say that > > the minimum supported source version is 9.0. > > > So it's taken us a year to discover the issue :-( I'm not surprised at all, given the history here. There were at least a couple of bugs affecting how pg_upgrade carries forward information about these cutoffs. See commits 74cf7d46 and a61daa14. Actually, commit 74cf7d46 was where pg_resetxlog/pg_resetwal's -u argument was first added, for use by pg_upgrade. That commit is only about a year old, and was only backpatched to 9.6. Unfortunately the previous approach to carrying forward oldestXID was an accident that usually worked. So...yeah, things are bad here. At least we now have the ability to detect any downstream problems that this might cause by using pg_amcheck. -- Peter Geoghegan
Re: First draft of the PG 15 release notes
On Tue, Jul 5, 2022 at 11:09 AM Bruce Momjian wrote: > > Actually, I was wrong. I thought that we only mentioned that we > computed a more aggressive xid, but now see I was mentioning the _frozen_ > xid. Reading the commit, we do compute the multi-xid and store that too > so I have updated the PG 15 release notes with the attached patch. It might be worth using the "symbol names" directly, since they appear in the documentation already (under "Routine Vacuuming"). These are relfrozenxid and relminmxid. These are implementation details, but they're documented in detail (though admittedly the documentation has *lots* of problems). Here is what I would like this item to hint at, to advanced users with tricky requirements: The new approach to setting relminmxid will improve the behavior of VACUUM in databases that already happen to use lots of MultiXacts. These users will notice that autovacuum now works off of relminmxid values that actually tell us something about each table's consumption of MultiXacts over time. Most individual tables naturally consume *zero* MultiXacts, even in databases that consume many MultiXacts -- due to naturally occurring workload characteristics. The old approach failed to recognize this, leading to very uniform relminmxid values across tables that were in fact very different, MultiXact-wise. The way that we handle relfrozenxid is probably much less likely to make life much easier for any database, at least on its own, in Postgres 15. So from the point of view of a user considering upgrading, the impact on relminmxid is likely to be far more important. Admittedly the most likely scenario by far is that the whole feature just isn't interesting, but a small minority of advanced users (users with painful MultiXact problems) will find the relminmxid thing very compelling. -- Peter Geoghegan
Re: First draft of the PG 15 release notes
> > Okay, thanks Bruce. -- Peter Geoghegan
Re: AIX support - alignment issues
On Sat, Jul 2, 2022 at 12:22 PM Tom Lane wrote: > Now, it's certainly possible that AIX is the only surviving platform > that hasn't adopted bug-compatible-with-glibc interpretations of > POSIX. But I think the standard is the standard, and we ought to > stay within it. So I find value in these fixes. I imagine that there is strong evolutionary pressure pushing minority platforms in the direction of bug-compatible-with-glibc. There is definitely a similar trend around things like endianness and alignment pickiness. But it wasn't always so. It seems fair to wonder if AIX bucks the glibc-compatible trend because it is already on the verge of extinction. If it wasn't just about dead already then somebody would have gone to the trouble of making it bug-compatible-with-glibc by now. (To be clear, I'm not arguing that this is a good thing.) Maybe it is still worth hanging on to AIX support for the time being, but it would be nice to have some idea of where we *will* finally draw the line. If the complaints from Andres aren't a good enough reason now, then what other hypothetical reasons might be good enough in the future? It seems fairly likely that Postgres desupporting AIX will happen (say) at some time in the next decade, no matter what happens today. -- Peter Geoghegan
Re: AIX support - alignment issues
On Sat, Jul 2, 2022 at 11:34 AM Andres Freund wrote: > Personally I think we should just drop AIX. The amount of effort to keep it > working is substantial due to being quite different from other unices ([2]), > the is > very outdated, the whole ecosystem is barely on lifesupport ([3]). And all of > that > for very little real world use. I tend to agree about dropping AIX. But I wonder if there is an argument against that proposal that doesn't rely on AIX being relevant to at least one user. Has supporting AIX ever led to the discovery of a bug that didn't just affect AIX? In other words, are AIX systems peculiar in some particular way that clearly makes them more likely to flush out a certain class of bugs? What is the best argument *against* desupporting AIX that you know of? Desupporting AIX doesn't mean that any AIX users will be left in the lurch immediately. Obviously these users will be able to use a supported version of Postgres for several more years. -- Peter Geoghegan
Re: First draft of the PG 15 release notes
On Fri, Jul 1, 2022 at 9:41 AM Bruce Momjian wrote: > > It might be worth explaining the shift directly in the release notes. > > The new approach is simpler and makes a lot more sense -- why should > > the relfrozenxid be closely tied to freezing? We don't necessarily > > I don't think this is an appropriate detail for the release notes. Okay. What about saying something about relminmxid advancement where the database consumes lots of multixacts? -- Peter Geoghegan
Re: doc: Clarify what "excluded" represents for INSERT ON CONFLICT
On Fri, Jul 1, 2022 at 6:01 AM Robert Haas wrote: > What would probably help more is adding something like this to the > error message: > > HINT: column "b" could refer to any of these relations: "foo", "excluded" > > That could also help people who encounter this error in other > situations. I'm not 100% sure this is a good idea, but I feel like it > would have a much better chance of helping someone in this situation > than the proposed doc patch. I agree with everything you've said here, though I am 100% sure that something like your proposed HINT would be a real usability win. The "Perhaps you meant to reference the column" HINT displayed when the user misspells a column is a surprisingly popular feature. I'm aware of quite a few people singing its praises on Twitter, for example. That hardly ever happens, even with features that we think of as high impact major features. So evidently users really value these kinds of quality of life improvements. -- Peter Geoghegan
Re: doc: Clarify what "excluded" represents for INSERT ON CONFLICT
On Fri, Jul 1, 2022 at 6:40 AM Tom Lane wrote:
> Robert Haas writes:
> > What would probably help more is adding something like this to the
> > error message:
> > HINT: column "b" could refer to any of these relations: "foo", "excluded"
>
> +1, that seems like it could be handy across the board.

The user *will* get a similar HINT if they happen to *also* spell the would-be ambiguous column name slightly incorrectly:

ERROR: column "barr" does not exist
LINE 1: ...lict (bar) do update set bar = excluded.bar where barr != 5;
                                                             ^
HINT: Perhaps you meant to reference the column "foo.bar" or the column "excluded.bar".

-- Peter Geoghegan
Re: doc: Clarify what "excluded" represents for INSERT ON CONFLICT
On Thu, Jun 30, 2022 at 3:07 PM David G. Johnston wrote: > Yes, and based on a single encounter I agree this doesn't seem like a broadly > encountered issue. My takeaway from that eventually led to this proposal. > The "Other Person" who is complaining about the docs is one of the mentors on > the Discord server and works for one of the corporate contributors to the > community. (I suppose Discord is considered public so maybe this redaction is > unnecessary...) My impression from reading this transcript is that the user was confused as to why they needed to qualify the target table name in the ON CONFLICT DO UPDATE's WHERE clause -- they didn't have to qualify it in the targetlist that appears in "SET ... ", so why the need to do it in the WHERE clause? This isn't something that upsert statements need to do all that often, just because adding additional conditions to the WHERE clause isn't usually necessary. That much makes sense to me -- I *can* imagine how that could cause confusion. If that interpretation is correct, then it's not clear what it should mean for how the INSERT documentation describes EXCLUDED. EXCLUDED is involved here, since EXCLUDED is the thing that creates the ambiguity, but that seems almost incidental to me. This user would probably not have been confused if they didn't need to use a WHERE clause (very much the common case), even when expression evaluation involving EXCLUDED in the SET was still used (also common). -- Peter Geoghegan
Re: doc: Clarify what "excluded" represents for INSERT ON CONFLICT
On Thu, Jun 30, 2022 at 2:07 PM David G. Johnston wrote: > Current: > "The SET and WHERE clauses in ON CONFLICT DO UPDATE have access to the > existing row using the table's name (or an alias), and to [rows] proposed > for insertion using the special excluded table." > > The word table in that sentence is wrong and not a useful way to think of the > thing which we've named excluded. It is a single value of a composite type > having the structure of the named table. I think that your reasoning is correct, but I don't agree with your conclusion. The term "special excluded table" is a fudge, but that isn't necessarily a bad thing. Sure, we could add something about the UPDATE being similar to an UPDATE with a self-join, as I said upthread. But I think that that would make the concept harder to grasp. > I'll agree that most people will mentally paper over the difference and go > merrily on their way. At least one person recently did not do that, which > prompted an email to the community Can you provide a reference for this? Didn't see anything like that in the reference you gave upthread. I have a hard time imagining a user that reads the INSERT docs and imagines that "excluded" is a relation that is visible to the query in ways that are not limited to expression evaluation for the UPDATE's WHERE/SET. The way that it works (and doesn't work) follows naturally from what a user would want to do in order to upsert. MySQL's INSERT ... ON DUPLICATE KEY UPDATE feature has a magical UPSERT-only expression instead of "excluded". -- Peter Geoghegan
Re: doc: Clarify what "excluded" represents for INSERT ON CONFLICT
On Thu, Jun 30, 2022 at 1:43 PM Robert Haas wrote:
> rhaas=# insert into foo values (1, 'frob') on conflict (a) do update
> set b = (select b || 'nitz' from excluded);
> ERROR: relation "excluded" does not exist
> LINE 1: ...ct (a) do update set b = (select b || 'nitz' from excluded);
>
> I do find that a bit of a curious error message, because that relation
> clearly DOES exist in the range table.

Let's say that we supported this syntax. I see some problems with that (you may have thought of these already). Thinking of "excluded" as a separate table creates some very thorny questions, such as: How would the user be expected to handle the case where there was more than a single "row" in "excluded"? How could the implementation know what the contents of the "excluded table" were ahead of time in the multi-row-insert case? We'd have to know about *all* of the conflicts for all rows proposed for insertion to do this, which seems impossible in the general case -- because some of those conflicts won't have happened yet, and cannot be predicted. The "excluded" pseudo-table is conceptually similar to a from_item alias used within an UPDATE FROM ... where the target table is also the from_item table (i.e. there is a self-join). There is only one table involved. -- Peter Geoghegan
Re: vacuum verbose no longer reveals anything about pins
On Thu, Jun 30, 2022 at 8:43 AM Robert Haas wrote: > Ah, I missed that. I think that in the test case I was using, there > was a conflicting pin but there were no dead tuples, so that line > wasn't present in the output. Even if there was a DEAD tuple, your test would still have to account for opportunistic pruning before the cursor acquires its blocking pin (I'm guessing that you're using a cursor for this). The earlier pruning could turn the DEAD tuple into either an LP_UNUSED item (in the case of a heap-only tuple) or an LP_DEAD stub line pointer. As I said, any existing LP_DEAD items will get put in the dead_items array by lazy_scan_noprune, very much like it would in the cleanup-lock-acquired/lazy_scan_prune case. > Maybe nothing. I just thought you'd completely removed all reporting > on this, and I'm glad that's not so. The important idea behind all this is that VACUUM now uses a slightly more abstract definition of scanned_pages. This is far easier to work with at a high level, especially in the instrumentation code used by VACUUM VERBOSE. You may have also noticed that VACUUM VERBOSE/autovacuum logging actually shows the number of scanned pages in Postgres 15, for the first time -- even though that's very basic information. The revised definition of scanned_pages enabled that enhancement. We are no longer encumbered by the existence of a no-cleanup-lock-page special case. -- Peter Geoghegan
Re: vacuum verbose no longer reveals anything about pins
On Thu, Jun 30, 2022 at 5:57 AM Robert Haas wrote: > I was dismayed to learn that VACUUM VERBOSE on a table no longer tells > you anything about whether any pages were skipped due to pins. VACUUM VERBOSE will show a dedicated line that reports on the number of pages that we couldn't get a cleanup lock on, if and only if we couldn't do useful work as a result. In practice this means pages that had one or more fully DEAD tuples that couldn't be removed due to our inability to prune. In my view this is strictly better than reporting on the number of "skipped due to pins" pages. In the case of any pages that we couldn't get a cleanup lock on that didn't have any DEAD tuples (pages that are not reported on at all), VACUUM hasn't missed any work whatsoever. It even does heap vacuuming, which doesn't require a cleanup lock in either the first or the second heap pass. What's the problem with not reporting on that? -- Peter Geoghegan
Re: First draft of the PG 15 release notes
On Tue, Jun 28, 2022 at 1:35 PM Bruce Momjian wrote: > Okay, text updated, thanks. Applied patch attached. I have some notes on these items: 1. "Allow vacuum to be more aggressive in setting the oldest frozenxid (Peter Geoghegan)" 2. "Add additional information to VACUUM VERBOSE and autovacuum logging messages (Peter Geoghegan)" The main enhancement to VACUUM for Postgres 15 was item 1, which taught VACUUM to dynamically track the oldest XID (and the oldest MXID) that will remain in the table at the end of the same VACUUM operation. These final/oldest XID/MXID values are what we now use to set relfrozenxid and relminmxid in pg_class. Previously we just set relfrozenxid/relminmxid to whatever XID/MXID value was used to determine which XIDs/MXIDs needed to be frozen. These values often indicated more about VACUUM implementation details (like the vacuum_freeze_min_age GUC's value) than the true contents of the table at the end of the most recent VACUUM. It might be worth explaining the shift directly in the release notes. The new approach is simpler and makes a lot more sense -- why should the relfrozenxid be closely tied to freezing? We don't necessarily have to freeze any tuple to advance relfrozenxid right up to the removable cutoff/OldestXmin used by VACUUM. For example, anti-wraparound VACUUMs that run against static tables now set relfrozenxid/relminmxid to VACUUM's removable cutoff/OldestXmin directly, without freezing anything (after the first time). Same with tables that happen to have every row deleted -- only the actual unfrozen XIDs/MXIDs left in the table matter, and if there happen to be none at all then we can use the same relfrozenxid as we would for a CREATE TABLE. All depends on what the workload allows.
There will also be a real practical benefit for users that allocate a lot of MultiXactIds: We'll now have pg_class.relminmxid values that are much more reliable indicators of what is really going on in the table, MultiXactId-wise. I expect that this will make it much less likely that anti-wraparound VACUUMs will run needlessly against the largest tables, where there probably wasn't ever one single MultiXactId. In other words, the implementation will have more accurate information at the level of each table, and won't have to assume that every table consumes MultiXactIds at the same rate. I think that very uneven consumption of MultiXactIds at the table level is probably common in real databases. Plus VACUUM can usually remove a non-running MultiXact from a tuple's xmax, regardless of whether or not the mxid happens to be before the vacuum_multixact_freeze_min_age-based MXID cutoff -- VACUUM has always just set xmax to InvalidTransactionId in passing when it's possible to do so easily. MultiXacts are inherently pretty short-lived information about row lockers at a point in time. We don't really need to keep them around for very long. We may now be able to truncate the two MultiXact-related SLRUs much more frequently with some workloads. Finally, note that the new VACUUM VERBOSE output (which is now pretty much the same as the autovacuum log output) shows when and how relfrozenxid/relminmxid have advanced. This should make it relatively easy to observe these effects where they exist. Thanks -- Peter Geoghegan
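The core idea behind the Postgres 15 change -- set relfrozenxid from the oldest XID that actually remains in the table, rather than from the freeze cutoff that happened to be used -- can be sketched in miniature. This is a toy model, not the vacuumlazy.c code: TransactionId is simplified to a plain integer, and wraparound-aware XID comparisons are deliberately omitted.

```c
typedef unsigned int TransactionId;

typedef struct VacuumXidTracker
{
    /*
     * Initialized to VACUUM's removable cutoff (OldestXmin).  If the scan
     * leaves no unfrozen XIDs behind at all, relfrozenxid can advance all
     * the way up to that cutoff -- even when nothing was frozen.
     */
    TransactionId oldest_remaining;
} VacuumXidTracker;

/* Called for each XID that will remain in the table after this VACUUM. */
static void
track_remaining_xid(VacuumXidTracker *t, TransactionId xid)
{
    if (xid < t->oldest_remaining)  /* toy comparison; real code must be
                                     * wraparound-aware */
        t->oldest_remaining = xid;
}
```

At the end of the scan, oldest_remaining is the exact value used for the new relfrozenxid, instead of a vacuum_freeze_min_age-derived guess.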
Re: Separate the attribute physical order from logical order
On Tue, Jun 28, 2022 at 11:47 AM Tom Lane wrote: > Now you do need something that will make the three meanings different > in order to test that step. But I'd suggest some bit of throwaway code > that just assigns randomly different logical and physical orders. That seems like a good idea. Might also make sense to make the behavior configurable via a developer-only GUC, to enable exhaustive tests that use every possible permutation of physical/logical mappings for a given table. Perhaps the random behavior itself should work by selecting a value for the GUC at various key points via a PRNG. During CREATE TABLE, for example. This approach could make it easier to reproduce failures on the buildfarm. -- Peter Geoghegan
Re: do only critical work during single-user vacuum?
On Mon, Jun 27, 2022 at 12:36 PM Justin Pryzby wrote: > By chance, I came across this prior thread which advocated the same thing > directly (rather than indirectly as in this year's thread). Revisiting this topic reminded me that PostgreSQL 14 (the first version that had the wraparound failsafe mechanism controlled by vacuum_failsafe_age) has been a stable release for 9 months now. As of today I am still not aware of even one user that ran into the failsafe mechanism in production. It might well have happened by now, of course, but I am not aware of any specific case. Perhaps this will change soon enough -- maybe somebody else will read this and enlighten me. To me the fact that the failsafe seldom kicks in in practice suggests something about workload characteristics in general: that it isn't all that common for users to try to get away with putting off freezing until a table attains an age that is significantly above 1 billion XIDs. When people talk about things like 64-bit XIDs, I tend to wonder: if 2 billion XIDs wasn't enough, why should 4 billion or 8 billion be enough? *Maybe* the system can do better by getting even further into debt than it can today, but you can't expect to avoid freezing altogether (without significant work elsewhere). My general sense is that freezing isn't a particularly good thing to try to do lazily -- even if we ignore the risk of an eventual wraparound failure. -- Peter Geoghegan
Re: some aspects of our qsort might not be ideal
On Thu, Jun 23, 2022 at 7:51 AM Matthias van de Meent wrote: > I think that mostly has to do with reliability / stability of ORDER BY > in combination with LIMIT and OFFSET, even though right now we cannot > fully guarantee such stability due to unstable results from underlying > plan nodes. The current qsort isn't stable. While quicksort is never stable, our particular implementation has fairly significant optimizations that strongly rely on not using a stable sort. In particular, the B&M optimizations for duplicate elements. These optimizations make things like datum tuplesorts for 'SELECT DISTINCT mycol ...' style queries on low cardinality columns extremely fast. We're not really sorting so much as bucketing. This is based on Dijkstra's Dutch national flag problem. -- Peter Geoghegan
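The "bucketing" effect described above comes from three-way partitioning in the Dutch national flag style: equal keys are gathered into the middle region in a single pass and never recursed into. Here is a self-contained sketch of that partitioning step (not the actual tuplesort code):

```c
/*
 * Partition a[0..n-1] into three regions.  On return:
 *   a[0 .. *lt_end-1]        < pivot
 *   a[*lt_end .. *gt_begin-1] == pivot
 *   a[*gt_begin .. n-1]       > pivot
 * On duplicate-heavy input most elements land in the middle region,
 * which needs no further sorting -- hence "bucketing, not sorting".
 */
static void
partition3(int *a, int n, int pivot, int *lt_end, int *gt_begin)
{
    int lt = 0, i = 0, gt = n;

    while (i < gt)
    {
        if (a[i] < pivot)
        {
            int tmp = a[lt]; a[lt] = a[i]; a[i] = tmp;
            lt++;
            i++;
        }
        else if (a[i] > pivot)
        {
            int tmp;
            gt--;
            tmp = a[i]; a[i] = a[gt]; a[gt] = tmp;
        }
        else
            i++;                /* equal to pivot: leave in the middle */
    }
    *lt_end = lt;
    *gt_begin = gt;
}
```

A quicksort built on this only recurses into the < and > regions; a low-cardinality column collapses into a few large middle regions almost immediately.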
Re: Devel docs on website reloading
On Wed, Jun 22, 2022 at 8:45 AM Bruce Momjian wrote: > That is a big help for committers who want to email a URL of the new > doc commit output. Yes, it's a nice "quality of life" improvement. Thanks for finally getting this done! -- Peter Geoghegan
Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size
On Thu, Jun 16, 2022 at 7:15 PM David Rowley wrote: > So, there appears to be no performance regression due to the extra > indirection. There's maybe even some gains due to the smaller step > size. Have you tried this with the insert benchmark [1]? I've run it myself in the past (when working on B-Tree deduplication). It's quite straightforward to set up and run. [1] http://smalldatum.blogspot.com/2017/06/the-insert-benchmark.html -- Peter Geoghegan
Re: better page-level checksums
On Wed, Jun 15, 2022 at 1:27 PM Robert Haas wrote: > I think what will happen, depending on > the encryption mode, is probably that either (a) the page will decrypt > to complete garbage or (b) the page will fail some kind of > verification and you won't be able to decrypt it at all. Either way, > you won't be able to infer anything about what caused the problem. All > you'll know is that something is wrong. That sucks - a lot - and I > don't have a lot of good ideas as to what can be done about it. The > idea that an encrypted page is unintelligible and that small changes > to either the encrypted or unencrypted data should result in large > changes to the other is intrinsic to the nature of encryption. It's > more or less un-debuggable by design. It's pretty clear that there must be a lot of truth to that. But that doesn't mean that there aren't meaningful gradations beyond that. I think that it's worth doing the following exercise (humor me): Why wouldn't it be okay to just encrypt the tuple space and the line pointer array, leaving both the page header and page special area unencrypted? What kind of user would find that trade-off to be unacceptable, and why? What's the nuance of it? For all I know you're right (about encrypting the whole page, metadata and all). I just want to know why that is. I understand that this whole area is one where in general we may have to live with a certain amount of uncertainty about what really matters. > With extended checksums, I don't think the issues are anywhere near as > bad. I'm not deeply opposed to setting a page-level flag but I expect > nominal benefits. I also expect only a small benefit. But that isn't a particularly important factor in my mind. Let's suppose that it turns out to be significantly more useful than we originally expected, for whatever reason. Assuming all that, what else can be said about it now? 
Isn't it now *relatively* likely that including that status bit metadata will be *extremely* valuable, and not merely somewhat more valuable? I guess it doesn't matter much now (since you have all but conceded that using a bit for this makes sense), but FWIW that's the main reason why I almost took it for granted that we'd need to use a status bit (or bits) for this. -- Peter Geoghegan
Re: better page-level checksums
On Tue, Jun 14, 2022 at 7:21 PM Robert Haas wrote: > On Tue, Jun 14, 2022 at 9:56 PM Peter Geoghegan wrote: > > Technically we don't already do that today, with the 16-bit checksums > > that are stored in PageHeaderData.pd_checksum. But we do something > > equivalent: low-level tools can still infer that checksums must not be > > enabled on the page (really the cluster) indirectly in the event of a > > 0 checksum. A 0 value can reasonably be interpreted as a page from a > > cluster without checksums (barring page corruption). This is basically > > reasonable because our implementation of checksums is guaranteed to > > not generate 0 as a valid checksum value. > > I don't think that 'pg_checksums -d' zeroes the checksum values on the > pages in the cluster. Obviously there are limitations on when and how we can infer something about the whole cluster based on one single page image -- it all depends on the context. I'm only arguing that we ought to make this kind of analysis as easy as we reasonably can. I just don't see any downside to having a status bit per checksum or encryption algorithm at the page level, and plenty of upside (especially in the event of bugs). This seems like the absolute bare minimum to me, and I'm genuinely surprised that there is even a question about whether or not we should do that much. -- Peter Geoghegan
Re: better page-level checksums
On Tue, Jun 14, 2022 at 7:17 PM Robert Haas wrote: > But it seems > absolutely clear that our goal ought to be to leak as little > information as possible. But at what cost? Basically I think that this is giving up rather a lot. For example, isn't it possible that we'd have corruption that could be a bug in either the checksum code, or in recovery? I'd feel a lot better about it if there was some sense of both the costs and the benefits. > > Let's assume for now that we don't leave pd_flags unencrypted, as you > > have suggested. We're still discussing new approaches to checksumming > > in the scope of this work, which of course includes many individual > > cases that don't involve any encryption. Plus even with encryption > > there are things like defensive assertions that can be added by using > > a flag bit for this. > > True. I don't think we should be too profligate with those bits just > in case somebody needs a bunch of them for something important in the > future, but it's probably fine to use up one or two. Sure, but how many could possibly be needed for this? I can't see it being more than 2 or 3. Which seems absolutely fine. They *definitely* have no value if nobody ever uses them for anything. -- Peter Geoghegan
Re: better page-level checksums
On Tue, Jun 14, 2022 at 1:32 PM Peter Geoghegan wrote: > On Tue, Jun 14, 2022 at 1:22 PM Robert Haas wrote: > > I still am not clear on precisely what you are proposing here. I do > > agree that there is significant bit space available in pd_flags and > > that consuming some of it wouldn't be stupid, but that doesn't add up > > to a proposal. Maybe the proposal is: figure out how many different > > configurations there are for this new kind of page space, let's say N, > > and then reserve ceil(log2(N)) bits from pd_flags to indicate which > > one we've got. > > I'm just making a general point. Why wouldn't we start out with the > assumption that we use some pd_flags bit space for this stuff? Technically we don't already do that today, with the 16-bit checksums that are stored in PageHeaderData.pd_checksum. But we do something equivalent: low-level tools can still infer that checksums must not be enabled on the page (really the cluster) indirectly in the event of a 0 checksum. A 0 value can reasonably be interpreted as a page from a cluster without checksums (barring page corruption). This is basically reasonable because our implementation of checksums is guaranteed to not generate 0 as a valid checksum value. While pg_filedump does not rely on the 0 checksum convention currently, it doesn't really need to. When the user uses the -k option to verify checksums in passing, pg_filedump can assume that checksums must be enabled ("the user said they must be so expect it" is a reasonable assumption at that point). This also depends on there being only one approach to checksums. -- Peter Geoghegan
Re: better page-level checksums
On Tue, Jun 14, 2022 at 1:22 PM Robert Haas wrote: > I still am not clear on precisely what you are proposing here. I do > agree that there is significant bit space available in pd_flags and > that consuming some of it wouldn't be stupid, but that doesn't add up > to a proposal. Maybe the proposal is: figure out how many different > configurations there are for this new kind of page space, let's say N, > and then reserve ceil(log2(N)) bits from pd_flags to indicate which > one we've got. I'm just making a general point. Why wouldn't we start out with the assumption that we use some pd_flags bit space for this stuff? > One possible problem with this is that, if the page is actually > encrypted, we might want pd_flags to also be encrypted. The existing > contents of pd_flags disclose some information about the tuples that > are on the page, so having them exposed to prying eyes does not seem > appealing. I'm skeptical of the idea that we want to avoid leaving any metadata unencrypted. But I'm not an expert on TDE, and don't want to say too much about it without having done some more research. I would like to see some justification for just encrypting everything on the page without concern for the loss of debuggability, though. What is the underlying theory behind that particular decision? Are there any examples that we can draw from, from other systems or published designs? Let's assume for now that we don't leave pd_flags unencrypted, as you have suggested. We're still discussing new approaches to checksumming in the scope of this work, which of course includes many individual cases that don't involve any encryption. Plus even with encryption there are things like defensive assertions that can be added by using a flag bit for this. -- Peter Geoghegan
Re: better page-level checksums
On Tue, Jun 14, 2022 at 12:13 PM Robert Haas wrote: > Peter, unless I have missed something, this email is the very first > one where you or anyone else have said anything at all about a PD_* > bit. Even here, it's not very clear exactly what you are proposing. > Therefore I have neither said anything bad about it in the past, nor > can I now answer the question as to what is "so bad about it." If you > want to make a concrete proposal, I will be happy to tell you what I > think about it. I am proposing that we not commit ourselves to relying on implicit information about what must be true for every page in the cluster. Just having a little additional page-header metadata (in pd_flags) would accomplish that much, and wouldn't in itself impose any real overhead on TDE. It's not like the PageHeaderData.pd_flags bits are already a precious commodity, in the same way as the heap tuple infomask status bits are. We can afford to use some of them for this purpose, and then some. Why wouldn't we do it that way, just on general principle? You may still find it useful to rely on high level context at the level of code that runs on the server, perhaps for performance reasons (though it's unclear how much it matters). In which case the status bit is technically redundant information as far as the code is concerned. That may well be fine. -- Peter Geoghegan
Re: better page-level checksums
On Tue, Jun 14, 2022 at 11:52 AM Robert Haas wrote: > > Even within TDE, it might be okay to assume that it's a feature that > > the user must commit to using for a whole cluster at initdb time. What > > isn't okay is committing to that assumption now and forever, by > > leaving the door open to a world in which that assumption no longer > > holds. Like when you do finally get around to making TDE something > > that can work at the relation level, for example. Even if there is > > only a small chance of that ever happening, why wouldn't we be > > prepared for it, just on general principle? > > To the extent that we can leave ourselves room to do new things in the > future without incurring unreasonable costs in the present, I'm in > favor of that, as I believe anyone would be. But as you say, a lot > depends on the specifics. Theoretical flexibility that can only be > used in practice by really slow code doesn't help anybody. A tool like pg_filedump or a backup tool can easily afford this overhead. The only cost that TDE has to pay for this added flexibility is that it has to set one of the PD_* bits in a code path that is already bound to be very expensive. What's so bad about that? Honestly, I'm a bit surprised that you're pushing back on this particular point. A nonce for TDE is just something that code in places like bufpage.h ought to know about. It has to be negotiated at that level, because it will in fact affect a lot of callers to the bufpage.h functions. -- Peter Geoghegan
Re: better page-level checksums
On Tue, Jun 14, 2022 at 11:14 AM Robert Haas wrote: > We can have anything we want here, but we can't have everything we > want at the same time. There are irreducible engineering trade-offs > here. If all pages in a given cluster are the same, backends can > compute the values of things that are currently compile-time constants > upon startup and continue to use them for the lifetime of the backend. > If pages can vary, some encrypted or checksummed and others not, then > you have to recompute those values for every page. That's bound to > have some cost. It is also more flexible. Maybe not -- it depends on the particulars of the code. For example, it might be okay for the B-Tree code to assume that B-Tree pages have a special area at a known fixed offset, determined at compile time. At the same time, it might very well not be okay for a backup tool to make any such assumption, because it doesn't have the same context. Even within TDE, it might be okay to assume that it's a feature that the user must commit to using for a whole cluster at initdb time. What isn't okay is committing to that assumption now and forever, by leaving the door open to a world in which that assumption no longer holds. Like when you do finally get around to making TDE something that can work at the relation level, for example. Even if there is only a small chance of that ever happening, why wouldn't we be prepared for it, just on general principle? -- Peter Geoghegan
Re: better page-level checksums
On Tue, Jun 14, 2022 at 10:43 AM Robert Haas wrote: > Hmm, but on the other hand, if you imagine a scenario in which the > "storage system extra blob" is actually a nonce for TDE, you need to > be able to find it before you've decrypted the rest of the page. If > pd_checksum gives you the offset of that data, you need to exclude it > from what gets encrypted, which means that you need encrypt three > separate non-contiguous areas of the page whose combined size is > unlikely to be a multiple of the encryption algorithm's block size. > That kind of sucks (and putting it at the end of the page makes it way > better). I don't have a great understanding of how that cost will be felt in detail right now, because I don't know enough about the project and the requirements for TDE in general. > That said, I certainly agree that finding the special space needs to > be fast. The question in my mind is HOW fast it needs to be, and what > techniques we might be able to use to dodge the problem. For instance, > suppose that, during the startup sequence, we look at the control > file, figure out the size of the 'storage system extra blob', and > based on that each AM figures out the byte-offset of its special space > and caches that in a global variable. Then, instead of > PageGetSpecialSpace(page) it does PageGetBtreeSpecialSpace(page) or > whatever, where the implementation is ((char*) page) + > the_afformentioned_global_variable. Is that going to be too slow? Who knows? For now the important point is that there is a tension between the requirements of TDE, and the requirements of access methods (especially index access methods). It's possible that this will turn out not to be much of a problem. But the burden of proof is yours. Making a big change to the on-disk format like this (a change that affects every access method) should be held to an exceptionally high standard. 
There are bound to be tacit or even explicit assumptions made by access methods that you risk breaking here. The reality is that all of the access method code evolved in an environment where the special space size was constant and generic for a given BLCKSZ. I don't have much sympathy for any suggestion that code written 20 years ago should have known not to make these assumptions. I have a lot more sympathy for the idea that it's a general problem with our infrastructure (particularly code in bufpage.c and the delicate assumptions made by its callers) -- a problem that is worth addressing with a broad solution that enables lots of different work. We don't necessarily get another shot at this if we get it wrong now. > There's going to have to be some compromise here. On the one hand > you're going to have people who want to be able to do run-time > conversions between page formats even at the cost of extra runtime > overhead on top of what the basic feature necessarily implies. On the > other hand you're going to have people who don't think any overhead at > all is acceptable, even if it's purely nominal and only visible on a > microbenchmark. Such arguments can easily become holy wars. How many times has a change to the on-disk format of this magnitude taken place, post-pg_upgrade? I would argue that this would be the first, since it is the moral equivalent of extending the size of the generic page header. For all I know the overhead will be perfectly fine, and everybody wins. I just want to be adamant that we're making the right trade-offs, and maximizing the benefit from any new cost imposed on access method code. -- Peter Geoghegan
Re: better page-level checksums
On Tue, Jun 14, 2022 at 9:26 AM Tom Lane wrote: > It's been some years since I had much to do with pg_filedump, but > my recollection is that the size of the special space is only one > part of its heuristics, because there already *are* collisions. Right, there are collisions even today. The heuristics are kludgey, but they work perfectly in practice. That's not just due to luck -- it's due to people making sure that they continued to work over time. > Moreover, there already are per-AM magic numbers in there that > it uses to resolve those cases. They're not at the front though. > Nobody has ever wanted to break on-disk compatibility just to make > pg_filedump's page-type identification less klugy, so I find it > hard to believe that the above suggestion isn't a non-starter. There is no doubt that it's not worth breaking on-disk compatibility just for pg_filedump. The important principle here is that high-context page formats are bad, and should be avoided whenever possible. Why isn't it possible to avoid it here? We have all the bits we need for it in the page header, and then some. Why should we assume that it'll never be useful to apply encryption selectively, perhaps at the relation level? -- Peter Geoghegan
Re: better page-level checksums
On Tue, Jun 14, 2022 at 8:48 AM Robert Haas wrote: > On Mon, Jun 13, 2022 at 6:26 PM Peter Geoghegan wrote: > > Anyway, I can see how it would be useful to be able to know the offset > > of a nonce or of a hash digest on any given page, without access to a > > running server. But why shouldn't that be possible with other designs, > > including designs closer to what I've outlined? > > I don't know what you mean by this. As far as I'm aware, the only > design you've outlined is one where the space wasn't at the same > offset on every page. I am skeptical of that particular aspect, yes. Though I would define it the other way around (now the true special area struct isn't necessarily at the same offset for a given AM, at least across data directories). My main concern is maintaining the ability to interpret much about the contents of a page without context, and to not make it any harder to grow the special area dynamically -- which is a broader concern. Your patch isn't going to be the last one that wants to do something with the special area. This needs to be carefully considered. I see a huge amount of potential for adding new optimizations that use subsidiary space on the page, presumably implemented via a special area that can grow dynamically. For example, an ad-hoc compression technique for heap pages that temporarily "absorbs" some extra versions in the event of opportunistic pruning running and failing to free enough space. Such a design would operate on similar principles to deduplication in unique indexes, where the goal is to buy time rather than buy space. When we fail to keep the contents of a heap page together today, we often barely fail, so I expect something like this to have an outsized impact on some workloads. > In general, I was imagining that you'd need to look at the control > file to understand how much space had been reserved per page in this > particular cluster. I agree that's a bit awkward, especially for > pg_filedump. 
However, pg_filedump and I think also some code internal > to PostgreSQL try to figure out what kind of page we've got by looking > at the *size* of the special space. It's only good luck that we > haven't had a collision there yet, and continuing to rely on that > seems like a dead end. Perhaps we should start including a per-AM > magic number at the beginning of the special space. It's true that that approach is just a hack -- we probably can do better. I don't think that it's okay to break it, though. At least not without providing a comparable alternative, that doesn't rely on context from the control file. -- Peter Geoghegan
Re: Tightening behaviour for non-immutable behaviour in immutable functions
On Mon, Jun 13, 2022 at 1:51 PM Greg Stark wrote: > By "relatively common" I think we're talking "nigh universal". Afaics > there are no warnings in the docs about worrying about search_path on > IMMUTABLE functions. There is for SECURITY DEFINER but I have to admit > I wasn't aware myself of all the gotchas described there. I didn't realize that it was that bad. Even if it's only 10% as bad as you say, it would still be very valuable to do something about it (ideally with an approach that is non-invasive). -- Peter Geoghegan
Re: 2022-06-16 release announcement draft
On Mon, Jun 13, 2022 at 6:15 PM Jonathan S. Katz wrote: > > Perhaps it is also worth mentioning that you can use REINDEX without > > CONCURRENTLY, even before upgrading. > > I'm hesitant on giving too many options. We did put out the "warning" > announcement providing this as an option. I do think that folks who are > running CIC/RIC are sensitive to locking, and a plain old "REINDEX" may > be viable except in an emergency. The locking implications for plain REINDEX are surprising IMV -- and so I suggest sticking with what you have here. In many cases using plain REINDEX is not meaningfully different to taking a full AccessExclusiveLock on the table (we only acquire an AEL on the index, but in practice that can be a distinction without a difference). We at least went some way towards making the situation with REINDEX locking clearer in a doc patch that recently became commit 8ac700ac. -- Peter Geoghegan