Re: run pgindent on a regular basis / scripted manner
On Sat, Jan 21, 2023 at 12:59 PM Tom Lane wrote: > Hmm, that could be a deal-breaker. It's not going to be acceptable > to have to pgindent different parts of the system on different platforms > ... at least not unless we can segregate them on the file level, and > even that would have a large PITA factor. It's probably something that could be worked around. My remarks are based on some dim memories of dealing with the tool before I arrived at a configuration that works well enough for me. Importantly, clang-format doesn't require you to futz around with Makefiles or objdump or anything like that -- that's a huge plus. It doesn't seem to impose any requirements on how I build Postgres at all (I generally use gcc, not clang). Even if these kinds of issues proved to be a problem for the person tasked with running clang-format against the whole tree periodically, they still likely wouldn't affect most of us. It's quite convenient to use clang-format from an editor -- it can be invoked very incrementally, against a small range of lines at a time. It's pretty much something that I can treat like the built-in indent for my editor. It's vastly different to the typical pgindent workflow. > Still, we won't know unless someone makes a serious experiment with it. There is one thing about clang-format that I find mildly infuriating: it can indent function declarations in the way that I want it to, and it can indent variable declarations in the way that I want it to. It just can't do both at the same time, because they're both controlled by AlignConsecutiveDeclarations. Of course the way that I want to do things is (almost by definition) the pgindent way, at least right now -- it's not necessarily about my fixed preferences (though it can be hard to tell!). It's really not surprising that clang-format cannot quite perfectly simulate pgindent. How flexible can we be about stuff like that? Obviously there is no clear answer right now. -- Peter Geoghegan
Re: run pgindent on a regular basis / scripted manner
On Sat, Jan 21, 2023 at 9:30 AM Bruce Momjian wrote: > I don't see uncrustify or clang-format supporting typedef lists so maybe > they implemented this feedback loop. It would be good to see if we can > get either of these tools to match our formatting. I personally use clang-format for Postgres development work, since it integrates nicely with my text editor, and can be configured to produce approximately the same result as pgindent (certainly good enough when working on a patch that's still far from a committable state). I'm fairly sure that clang-format has access to a full AST from the clang compiler, which is the ideal approach - at least in theory. In practice this approach tends to run into problems when the relevant AST isn't available. For example, if there's code that only builds on Windows, maybe it won't work at all (at least on my Linux system). This doesn't really bother me currently, since I only rely on clang-format as a first pass sort of thing. Maybe I could figure out a better way to deal with such issues, but right now I don't have much incentive to. Another advantage of clang-format is that it's a known quantity. For example there is direct support for it built into meson, with bells and whistles such as CI support: https://mesonbuild.com/Code-formatting.html My guess is that moving to clang-format would require giving up some flexibility, but getting much better integration with text editors and tools like meson in return. It would probably make it practical to have much stronger rules about how committed code must be indented -- rules that are practical, and can actually be enforced. That trade-off seems likely to be worth it in my view, though it's not something that I feel too strongly about. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Fri, Jan 20, 2023 at 12:57 PM Robert Haas wrote: > It doesn't seem that way to me. What am I missing? In that case, the > problem was a DROP TRIGGER command waiting behind autovacuum's lock > and thus causing all new table locks to wait behind DROP TRIGGER's > lock request. But it does not sound like that was a one-off event. It's true that I cannot categorically state that it would have made the crucial difference in this particular case. It comes down to two factors: 1. How many attempts would any given amount of additional XID space head room have bought them in practice? We can be all but certain that the smallest possible number is 1, which is something. 2. Would that have been enough for relfrozenxid to be advanced in practice? I think that it's likely that the answer to 2 is yes, since there was no mention of bloat as a relevant factor at any point in the postmortem. It was all about locking characteristics of antiwraparound autovacuuming in particular, and its interaction with their application. I think that they were perfectly okay with the autovacuum cancellation behavior most of the time. In fact, I don't think that there was any bloat in the table at all -- it was a really huge table (likely an events table), and those tend to be append-only. Even if I'm wrong about this specific case (we'll never know for sure), the patch as written would be virtually guaranteed to make the crucial differences in cases that I have seen up close. For example, a case with TRUNCATE. > It sounds like they used DROP TRIGGER pretty regularly. So I think this > sounds like exactly the kind of case I was talking about, where > autovacuums keep getting cancelled until we decide to stop cancelling > them. I don't know how you can reach that conclusion. The chances of a non-aggressive VACUUM advancing relfrozenxid right now are virtually zero, at least for a big table like this one. It seems quite likely that plenty of non-aggressive autovacuums completed, or would have had the insert-driven autovacuum feature been available. The whole article was about how this DROP TRIGGER pattern worked just fine most of the time, because most of the time autovacuum was just autocancelled. They say this at one point: "The normal autovacuum mechanism is skipped when locks are held in order to minimize service disruption. However, because transaction wraparound is such a severe problem, if the system gets too close to wraparound, an autovacuum is launched that does not back off under lock contention." At another point: "When the outage was resolved, we still had a number of questions: is a wraparound autovacuum always so disruptive? Given that it was blocking all table operations, why does it throttle itself?" ISTM that it was a combination of aggressive vacuuming taking far longer than usual (especially likely because this was pre freeze map), and the no-auto-cancel behavior. Aggressive/antiwraparound VACUUMs are naturally much more likely to coincide with periodic DDL, just because they take so much longer. That is a dangerous combination. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Fri, Jan 20, 2023 at 5:47 AM Robert Haas wrote: > Yeah, this is a major reason why I'm very leery about changes in this > area. A lot of autovacuum behavior is emergent, in the sense that it > wasn't directly intended by whoever wrote the code. It's just a > consequence of other decisions that probably seemed very reasonable at > the time they were made but turned out to have surprising and > unpleasant consequences. I certainly agree with your general description of the ways things are. To a large degree we're living in a world where DBAs have already compensated for some of the autovacuum shortcomings discussed on this thread. For example, by setting autovacuum_vacuum_scale_factor (and even autovacuum_vacuum_insert_scale_factor) to very low values, to compensate for the issues with random sampling of dead tuples by analyze, and to compensate for the way that VACUUM doesn't reason correctly about how the number of dead tuples changes as VACUUM runs. They might not have thought of it that way -- it could have happened as a byproduct of tuning a production system through trial and error -- but it still counts as compensating for a defect in autovacuum scheduling IMV. It's actually quite likely that even a strict improvement to (say) autovacuum scheduling will cause some number of regressions, since now what were effectively mitigations become unnecessary. This is somewhat similar to the dynamic with optimizer improvements, where (say) a selectivity estimate function that's better by every available metric can still easily cause regressions that really cannot be blamed on the improvement itself. I personally believe that it's a price worth paying when it comes to the issues with autovacuum statistics, particularly the dead tuple count issues. Since much of the behavior that we sometimes see is just absurdly bad. We have both water tight theoretical arguments and practical experiences pointing in that direction. > In this particular case, I think that there is a large risk that > postponing auto-cancellation will make things significantly worse, > possibly drastically worse, for a certain class of users - > specifically, those whose vacuums often get auto-cancelled. I agree that that's a real concern for the autocancellation side of things. That seems quite different to the dead tuples accounting issues, in that nobody would claim that the existing behavior is flagrantly wrong (just that it sometimes causes problems instead of preventing them). > That's why I really liked your idea of decoupling auto-cancellation > from XID age. Such an approach can still avoid disabling > auto-cancellation just because autovacuum_freeze_max_age has been hit, > but it can also disable it much earlier when it detects that doing so > is necessary to make progress. To be clear, I didn't think that that's what Andres was proposing, and my recent v5 doesn't actually do that. Even in v5, it's still fundamentally impossible for autovacuums that are triggered by the tuples inserted or dead tuples thresholds to not be autocancellable. ISTM that it doesn't make sense to condition the autocancellation behavior on table XID age in the context of dead tuple VACUUMs. It could either be way too early or way too late at that point. I was rather hoping to not have to build the infrastructure required for fully decoupling the autocancellation behavior from the triggering condition (dead tuples vs table XID age) in the scope of this thread/patch, though I can see the appeal of that. 
The only reason why I'm using table age at all is because that's how it works already, rightly or wrongly. If nothing else, it's pretty clear that there is no theoretical or practical reason why it has to be exactly the same table age as the one for launching autovacuums to advance relfrozenxid/relminmxid. In v5 of the patch, the default is to use 1.8x of the threshold that initially makes autovacuum.c want to launch autovacuums to deal with table age. That's based on a suggestion from Andres, but I'd be almost as happy with a default as low as 1.1x or even 1.05x. That's going to make very little difference to those users that really rely on the no-auto-cancellation behavior, while at the same time making things a lot safer for scenarios like the Joyent/Manta "DROP TRIGGER" outage (not perfectly safe, by any means, but meaningfully safer). -- Peter Geoghegan
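To make the arithmetic concrete, here is a minimal sketch of how such a cutoff could be computed -- illustrative only, with made-up names, and not code taken from the v5 patch:

/*
 * Hypothetical sketch only -- not taken from the v5 patch.  Compute the
 * table age at which an autovacuum stops allowing itself to be
 * auto-cancelled, as a multiple of the age that triggers table-age-driven
 * autovacuums in the first place, capped so that it can never exceed the
 * failsafe cutoff (a failsafe VACUUM clearly shouldn't be cancellable).
 */
static int
no_auto_cancel_table_age(int freeze_max_age, int failsafe_age,
                         double multiplier)
{
    double  cutoff = freeze_max_age * multiplier;   /* e.g. 1.8 by default */

    if (cutoff > failsafe_age)
        cutoff = failsafe_age;

    return (int) cutoff;
}

The exact multiplier matters much less than the fact that the cutoff is derived from the existing table-age threshold and can never exceed the failsafe cutoff.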
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Thu, Jan 19, 2023 at 3:38 PM Andres Freund wrote: > Another version of this could be to integrate analyze.c's scan more closely > with vacuum all the time. It's a bit bonkers that we often sequentially read > blocks, evict them from shared buffers if we read them, just to then > afterwards do random IO for blocks we've already read. That's imo what we > eventually should do, but clearly it's not a small project. Very often, what we're really missing in VACUUM is high level context. That's true of what you say here, about analyze.c, as well as complaints like your vac_estimate_reltuples complaint. The problem scenarios involving vac_estimate_reltuples all involve repeating the same silly action again and again, never realizing that that's what's going on. I've found it very useful to think of one VACUUM as picking up where the last one left off for my work on freezing. This seems related to pre-autovacuum historical details. VACUUM shouldn't really be a command in the same way that CREATE INDEX is a command. I do think that we need to retain a VACUUM command in some form, but it should be something pretty close to a command that just enqueues off-schedule autovacuums. That can do things like coalesce duplicate requests into one. Anyway, I am generally in favor of a design that makes VACUUM and ANALYZE things that are more or less owned by autovacuum. It should be less and less of a problem to blur the distinction between VACUUM and ANALYZE under this model, in each successive release. These distinctions are quite unhelpful, overall, because they make it hard for autovacuum scheduling to work at the whole-system level. > > This wouldn't have to happen every time, but it would happen fairly often. > > Do you have a mechanism for that in mind? Just something vacuum_count % 10 == > 0 like? Or remember scanned_pages in pgstats and re-computing I was thinking of something very simple like that, yes. > I think it'd be fine to just use analyze.c and pass in an option to not > compute column and inheritance stats. That could be fine. Just as long as it's not duplicative in an obvious way. > > Presumably you'll want to add the same I/O prefetching logic to this > > cut-down version, just for example. Since without that there will be > > no competition between it and ANALYZE proper. Besides which, isn't it > > kinda wasteful to not just do a full ANALYZE? Sure, you can avoid > > detoasting overhead that way. But even still. > > It's not just that analyze is expensive, I think it'll also be confusing if > the column stats change after a manual VACUUM without ANALYZE. Possibly, but it doesn't have to happen there. It's not like the rules aren't a bit different compared to autovacuum already. For example, the way TOAST tables are handled by the VACUUM command versus autovacuum. Even if it's valuable to maintain this kind of VACUUM/autovacuum parity (which I tend to doubt), doesn't the same argument work almost as well with whatever stripped down version you come up with? It's also confusing that a manual VACUUM command will be doing an ANALYZE-like thing. Especially in cases where it's really expensive relative to the work of VACUUM, because VACUUM scanned so few pages. You just have to make some kind of trade-off. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Thu, Jan 19, 2023 at 2:54 PM Andres Freund wrote: > Yea. Hence my musing about potentially addressing this by choosing to visit > additional blocks during the heap scan using vacuum's block sampling logic. I'd rather just invent a way for vacuumlazy.c to tell the top-level vacuum.c caller "I didn't update reltuples, but you ought to go ANALYZE the table now that I'm done, even if you weren't already planning to do so". This wouldn't have to happen every time, but it would happen fairly often. > IME most of the time in analyze isn't spent doing IO for the sample blocks > themselves, but CPU time and IO for toasted columns. A trimmed down version > that just computes relallvisible should be a good bit faster. I worry about that from a code maintainability point of view. I'm concerned that it won't be very cut down at all, in the end. Presumably you'll want to add the same I/O prefetching logic to this cut-down version, just for example. Since without that there will be no competition between it and ANALYZE proper. Besides which, isn't it kinda wasteful to not just do a full ANALYZE? Sure, you can avoid detoasting overhead that way. But even still. -- Peter Geoghegan
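As a strawman for how that hand-off could look, a purely hypothetical sketch (none of these names exist in the tree):

#include "postgres.h"

/*
 * Purely hypothetical sketch -- none of these names exist in the tree.
 * vacuumlazy.c would fill in a small summary struct, and vacuum.c would
 * then decide whether to request an (auto)ANALYZE on the table's behalf.
 */
typedef struct LVAnalyzeHint
{
    bool    skipped_reltuples_update;   /* scanned too few pages to say much */
    int64   vacuum_count;               /* VACUUMs since the last ANALYZE */
} LVAnalyzeHint;

static bool
should_request_analyze(const LVAnalyzeHint *hint)
{
    /*
     * Ask for an ANALYZE when this VACUUM couldn't say anything useful
     * about tuple density, but not every single time -- something like the
     * "vacuum_count % 10 == 0" idea that came up in the thread.
     */
    return hint->skipped_reltuples_update && hint->vacuum_count % 10 == 0;
}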
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Thu, Jan 19, 2023 at 12:58 PM Andres Freund wrote: > There's absolutely no guarantee that autoanalyze is triggered > there. Particularly with repeated vacuums triggered due to an relfrozenxid age > that can't be advanced that very well might not happen within days on a large > relation. Arguments like that work far better as arguments in favor of the vac_estimate_reltuples heuristics. That doesn't mean that the heuristics are good in any absolute sense, of course. They were just a band aid intended to ameliorate some of the negative impact that came from treating scanned_pages as a random sample. I think that we both agree that the real problem is that scanned_pages just isn't a random sample, at least not as far as reltuples/live tuples is concerned (for dead tuples it kinda isn't a sample, but is rather something close to an exact count). I now understand that you're in favor of addressing the root problem directly. I am also in favor of that approach. I'd be more than happy to get rid of the band aid as part of that whole effort. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Thu, Jan 19, 2023 at 12:56 PM Andres Freund wrote: > > In other words, ANALYZE sometimes (but not always) produces wrong answers. > > For dead tuples, but not live tuples. > > In other words, VACUUM sometimes (but not always) produces wrong answers. > > For live tuples, but not badly so for dead tuples. Agreed. More generally, there is a need to think about the whole table in some cases (like for regular optimizer statistics including reltuples/live tuples), and the subset of pages that will be scanned by VACUUM in other cases (for dead tuples accounting). Random sampling at the table level is appropriate and works well enough for the former, though not for the latter. > We are, but perhaps not too badly so, because we can choose to believe analyze > more for live tuples, and vacuum for dead tuples. Analyze doesn't compute > reltuples incrementally and vacuum doesn't compute deadtuples incrementally. Good points. > But in contrast to dead_tuples, where I think we can just stop analyze from > updating it unless we crashed recently, I do think we need to update reltuples > in vacuum. So computing an accurate value seems like the least unreasonable > thing I can see. I agree, but there is no reasonable basis for treating scanned_pages as a random sample, especially if it's only a small fraction of all of rel_pages -- treating it as a random sample is completely unjustifiable. And so it seems to me that the only thing that can be done is to either make VACUUM behave somewhat like ANALYZE in at least some cases, or to have it invoke ANALYZE directly (or indirectly) in those same cases. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 7:04 PM Andres Freund wrote: > > You seem to be saying that it's a problem if we don't update reltuples > > -- an estimate -- when less than 2% of the table is scanned by VACUUM. > > But why? Why can't we just do nothing sometimes? I mean in general, > > leaving aside the heuristics I came up with for a moment? > > The problem isn't that we might apply the heuristic once, that'd be fine. But > that there's nothing preventing it from applying until there basically are no > tuples left, as long as the vacuum is frequent enough. > > As a demo: The attached sql script ends up with a table containing 10k rows, > but reltuples being set to 1 million. I saw that too. But then I checked again a few seconds later, and autoanalyze had run, so reltuples was 10k. Just like it would have if there were no VACUUM statements in your script. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 6:10 PM Andres Freund wrote: > > This creates an awkward but logical question, though: what if > > dead_tuples doesn't go down at all? What if VACUUM actually has to > > increase it, because VACUUM runs so slowly relative to the workload? > > Sure, that can happen - but it's not made better by having wrong stats :) Maybe it's that simple. It's reasonable to wonder how far we want to go with letting dead tuples grow and grow, even when VACUUM is running constantly. It's not a question that has an immediate and obvious answer IMV. Maybe the real question is: is this an opportunity to signal to the user (say via a LOG message) that VACUUM cannot keep up? That might be very useful, in a general sort of way (not just to avoid new problems). > We have reasonably sophisticated accounting in pgstats what newly live/dead > rows a transaction "creates". So an obvious (and wrong) idea is just decrement > reltuples by the number of tuples removed by autovacuum. Did you mean dead_tuples? > But we can't do that, > because inserted/deleted tuples reported by backends can be removed by > on-access pruning and vacuumlazy doesn't know about all changes made by its > call to heap_page_prune(). I'm not following here. Perhaps this is a good sign that I should stop working for the day. :-) > But I think that if we add a > pgstat_count_heap_prune(nredirected, ndead, nunused) > around heap_page_prune() and a > pgstat_count_heap_vacuum(nunused) > in lazy_vacuum_heap_page(), we'd likely end up with a better approximation > than what vac_estimate_reltuples() does, in the "partially scanned" case. What does vac_estimate_reltuples() have to do with dead tuples? -- Peter Geoghegan
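For anyone trying to picture the shape of the counters being quoted above, something like the following -- entirely hypothetical, since these pgstat entry points don't exist, and their exact semantics are the open question here:

#include "postgres.h"

/*
 * Entirely hypothetical sketch of the entry points being proposed above.
 * The underlying idea: report pruning/vacuuming work to the cumulative
 * stats system as it happens, rather than overwriting dead_tuples
 * wholesale at the end of VACUUM.
 */
static int64 pending_dead_tuples_delta = 0;

/* would be called right after heap_page_prune() under this scheme */
static void
pgstat_count_heap_prune(int nredirected, int ndead, int nunused)
{
    /* line pointers that became redirected or unused are no longer "dead" */
    pending_dead_tuples_delta -= (int64) (nredirected + nunused);
}

/* would be called from lazy_vacuum_heap_page() under this scheme */
static void
pgstat_count_heap_vacuum(int nunused)
{
    pending_dead_tuples_delta -= (int64) nunused;
}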
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 5:49 PM Andres Freund wrote: > On 2023-01-18 15:28:19 -0800, Peter Geoghegan wrote: > > Perhaps we should make vac_estimate_reltuples focus on the pages that > > VACUUM newly set all-visible each time (not including all-visible > > pages that got scanned despite being all-visible) -- only that subset > > of scanned_pages seems to be truly relevant. That way you wouldn't be > > able to do stuff like this. We'd have to start explicitly tracking the > > number of pages that were newly set in the VM in vacuumlazy.c to be > > able to do that, but that seems like a good idea anyway. > > Can you explain a bit more what you mean with "focus on the pages" means? We don't say anything about pages we didn't scan right now -- only scanned_pages have new information, so we just extrapolate. Why not go even further than that, by only saying something about the pages that were both scanned and newly set all-visible? Under this scheme, the pages that were scanned but couldn't be newly set all-visible are treated just like the pages that weren't scanned at all -- they get a generic estimate from the existing reltuples. > I don't think it's hard to see this causing problems. Set > autovacuum_vacuum_scale_factor to something smaller than 2% or somewhat > frequently vacuum manually. Incrementally delete old data. Unless analyze > saves you - which might not be run or might have a different scale factor or > not be run manually - reltuples will stay exactly the same, despite data > changing substantially. You seem to be saying that it's a problem if we don't update reltuples -- an estimate -- when less than 2% of the table is scanned by VACUUM. But why? Why can't we just do nothing sometimes? I mean in general, leaving aside the heuristics I came up with for a moment? It will cause problems if we remove the heuristics. Much less theoretical problems. What about those? How often does VACUUM scan so few pages, anyway? We've been talking about how ineffective autovacuum_vacuum_scale_factor is, at great length, but now you're saying that it *can* meaningfully trigger not just one VACUUM, but many VACUUMs, where no more than 2% of rel_pages are not all-visible (pages, not tuples)? Not just once, mind you, but many times? And in the presence of some kind of highly variable tuple size, where it actually could matter to the planner at some point? I would be willing to just avoid even these theoretical problems if there was some way to do so, that didn't also create new problems. I have my doubts that that is possible, within the constraints of updating pg_class. Or the present constraints, at least. I am not a miracle worker -- I can only do so much with the information that's available to vac_update_relstats (and/or the information that can easily be made available). -- Peter Geoghegan
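A rough sketch of what that alternative estimator could look like, with made-up names and the corner cases omitted:

#include <math.h>

/*
 * Illustrative sketch only -- made-up names, corner cases omitted.
 * Extrapolate reltuples from the pages that this VACUUM newly set
 * all-visible; every other page (scanned-but-unsettled and unscanned
 * alike) keeps the tuple density implied by the existing pg_class entry.
 */
static double
estimate_reltuples_from_settled_pages(double old_rel_tuples,
                                      double old_rel_pages,
                                      double total_pages,
                                      double pages_newly_all_visible,
                                      double tuples_in_newly_all_visible)
{
    double  old_density = (old_rel_pages > 0) ?
        old_rel_tuples / old_rel_pages : 0.0;
    double  other_pages = total_pages - pages_newly_all_visible;

    return floor(old_density * other_pages +
                 tuples_in_newly_all_visible + 0.5);
}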
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Fri, Jan 13, 2023 at 9:55 PM Andres Freund wrote: > How about a float autovacuum_no_auto_cancel_age where positive values are > treated as absolute values, and negative values are a multiple of > autovacuum_freeze_max_age? And where the "computed" age is capped at > vacuum_failsafe_age? A "failsafe" autovacuum clearly shouldn't be cancelled. > > And maybe a default setting of -1.8 or so? Attached is a new revision, v5. I'm not happy with this, but thought it would be useful to show you where I am with it. It's a bit awkward that we have a GUC (autovacuum_no_auto_cancel_age) that sometimes works as a cutoff similar to both freeze_max_age and multixact_freeze_max_age, but usually works as a multiplier. It's simultaneously an XID age value, an MXID age value, and a multiplier on XID/MXID age values. What if it was just a simple multiplier on freeze_max_age/multixact_freeze_max_age, without changing any other detail? -- Peter Geoghegan

Attachments: v5-0001-Add-autovacuum-trigger-instrumentation.patch, v5-0002-Add-table-age-trigger-concept-to-autovacuum.patch
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 4:37 PM Andres Freund wrote: > I can, it should be just about trivial code-wise. A bit queasy about trying to > forsee the potential consequences. That's always going to be true, though. > A somewhat related issue is that pgstat_report_vacuum() sets dead_tuples to > what VACUUM itself observed, ignoring any concurrently reported dead > tuples. As far as I can tell, when vacuum takes a long time, that can lead to > severely under-accounting dead tuples. Did I not mention that one? There are so many that it can be hard to keep track! That's why I catalog them. As you point out, it's the dead tuples equivalent of my ins_since_vacuum complaint. The problem is exactly analogous to my recent complaint about insert-driven autovacuums. > I wonder if we ought to remember the dead_tuples value at the start of the > heap scan and use that to adjust the final dead_tuples value. I'd lean towards > over-counting rather than under-counting and thus probably would go for > something like > > tabentry->dead_tuples = livetuples + Min(0, tabentry->dead_tuples - > deadtuples_at_start); > > i.e. assuming we might have missed all concurrently reported dead tuples. This is exactly what I was thinking of doing for both issues (the ins_since_vacuum one and this similar dead tuples one). It's completely logical. This creates an awkward but logical question, though: what if dead_tuples doesn't go down at all? What if VACUUM actually has to increase it, because VACUUM runs so slowly relative to the workload? Of course the whole point is to make it more likely that VACUUM will keep up with the workload. I'm just not quite sure that the consequences of doing it that way are strictly a good thing. Bearing in mind that we don't differentiate between recently dead and dead here. Fun fact: autovacuum can spin with pgbench because of recently dead tuples, even absent an old snapshot/long running xact, if you set things aggressively enough: https://postgr.es/m/CAH2-Wz=sJm3tm+FpXbyBhEhX5tbz1trQrhG6eOhYk4-+5uL=w...@mail.gmail.com I think that we probably need to do something like always make sure that dead_items goes down by a small amount at the end of each VACUUM, even when that's a lie. Maybe we also have a log message about autovacuum not keeping up, so as to not feel too guilty about it. You know, to give the user a chance to reconfigure autovacuum so that it stops happening. > Of course we could instead move to something like ins_since_vacuum and reset > it at the *start* of the vacuum. But that'd make the error case harder, > without giving us more accuracy, I think? It would. It seems illogical to me. > I do think this is an argument for splitting up dead_tuples into separate > "components" that we track differently. I.e. tracking the number of dead > items, not-yet-removable rows, and the number of dead tuples reported from DML > statements via pgstats. Is it? Why? I'm all in favor of doing that, of course. I just don't particularly think that it's related to this other problem. One problem is that we count dead tuples incorrectly because we don't account for the fact that things change while VACUUM runs. The other problem is that the thing that is counted isn't broken down into distinct subcategories of things -- things are bunched together that shouldn't be. Oh wait, you were thinking of what I said before -- my "awkward but logical question". Is that it? -- Peter Geoghegan
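Spelling out the accounting being discussed -- a sketch of my reading of it, using hypothetical names rather than anything from a real patch:

#include "postgres.h"

/*
 * Sketch of my reading of the accounting idea -- hypothetical names only.
 * vacuum_dead is the number of dead tuples that VACUUM itself still knows
 * about at the end (e.g. recently dead tuples that it could not remove).
 */
static int64
new_dead_tuples_estimate(int64 vacuum_dead,
                         int64 stats_dead_at_start,
                         int64 stats_dead_at_end)
{
    /* assume we might have missed all concurrently reported dead tuples */
    int64   result = vacuum_dead +
        Min(0, stats_dead_at_end - stats_dead_at_start);

    /*
     * Additionally insist that the reported value goes down at least a
     * little with each VACUUM, even when that's a lie, so that autovacuum
     * can't just spin forever -- ideally paired with a LOG message warning
     * that autovacuum isn't keeping up.
     */
    result = Min(result, stats_dead_at_start - 1);

    return Max(result, 0);
}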
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 4:02 PM Andres Freund wrote:
> vacuum-no  reltuples/n_live_tup  n_dead_tup
> 1          476                   500
> 2          2500077               500
> 3          1250184               500
> 4          625266                500
> 5          312821                500
> 10         10165                 500
>
> Each vacuum halves reltuples. That's going to screw badly with all kinds of
> things. Planner costs completely out of whack etc.

I get that that could be a big problem, even relative to the more immediate problem of VACUUM just spinning like it does in your test case. What do you think we should do about it? What do you think about my idea of focussing on the subset of pages newly set all-visible in the VM?

> I wonder if this is part of the reason for the distortion you addressed with
> 74388a1a / 3097bde7dd1d. I am somewhat doubtful they're right as is. For a
> large relation 2% of blocks is a significant number of rows, and simply never
> adjusting reltuples seems quite problematic. At the very least we ought to
> account for dead tids we removed or such, instead of just freezing reltuples.

As I mentioned, it only kicks in when relpages is *precisely* the same as last time (not one block more or one block less), *and* we only scanned less than 2% of rel_pages. It's quite possible that that's insufficient, but I can't imagine it causing any new problems. I think that we need to be realistic about what's possible while storing a small, fixed amount of information. There is always going to be some distortion of this kind. We can do something about the obviously pathological cases, where errors can grow without bound. But you have to draw the line somewhere, unless you're willing to replace the whole approach with something that stores historic metadata. What kind of tradeoff do you want to make here? I think that you have to make one. -- Peter Geoghegan
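For reference, the heuristic under discussion boils down to an early-return guard along these lines (a paraphrase from memory of the logic added by commit 74388a1a, not a verbatim quote of the tree):

#include "postgres.h"
#include "storage/block.h"

/*
 * Paraphrase of the 74388a1a guard: when relpages is exactly unchanged and
 * under 2% of the table was scanned, just keep the old reltuples rather
 * than letting the same small subset of pages repeatedly drag the tuple
 * density estimate around.
 */
static bool
skip_reltuples_update(BlockNumber old_rel_pages,
                      BlockNumber total_pages,
                      BlockNumber scanned_pages)
{
    return old_rel_pages == total_pages &&
        scanned_pages < (BlockNumber) ((double) total_pages * 0.02);
}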
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 3:28 PM Peter Geoghegan wrote: > The problems in this area tend to be that vac_estimate_reltuples() > behaves as if it sees a random sample, when in fact it's far from > random -- it's the same scanned_pages as last time, and the ten other > times before that. That's a feature of your test case, and other > similar vac_estimate_reltuples test cases that I came up with in the > past. Both scenarios involved skipping using the visibility map in > multiple successive VACUUM operations. FWIW, the problem in your test case just goes away if you just change this line: DELETE FROM foo WHERE i < (1000 * 0.1) To this: DELETE FROM foo WHERE i < (1000 * 0.065) Which is not a huge difference, overall. This effect is a consequence of the heuristics I added in commit 74388a1a, so it's only present on Postgres 15+. Whether or not this is sufficient protection is of course open to interpretation. One problem with those heuristics (as far as your test case is concerned) is that they either work, or they don't work. For example they're conditioned on "old_rel_pages == total_pages". -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 2:37 PM Peter Geoghegan wrote: > Maybe you're right to be concerned to the degree that you're concerned > -- I'm not sure. I'm just adding what I see as important context. The problems in this area tend to be that vac_estimate_reltuples() behaves as if it sees a random sample, when in fact it's far from random -- it's the same scanned_pages as last time, and the ten other times before that. That's a feature of your test case, and other similar vac_estimate_reltuples test cases that I came up with in the past. Both scenarios involved skipping using the visibility map in multiple successive VACUUM operations. Perhaps we should make vac_estimate_reltuples focus on the pages that VACUUM newly set all-visible each time (not including all-visible pages that got scanned despite being all-visible) -- only that subset of scanned_pages seems to be truly relevant. That way you wouldn't be able to do stuff like this. We'd have to start explicitly tracking the number of pages that were newly set in the VM in vacuumlazy.c to be able to do that, but that seems like a good idea anyway. This probably has consequences elsewhere, but maybe that's okay. We know when the existing pg_class has no information, since that is explicitly encoded by a reltuples of -1. Obviously we'd need to be careful about stuff like that. Overall, the danger from assuming that "unsettled" pages (pages that couldn't be newly set all-visible by VACUUM) have a generic tuple density seems less than the danger of assuming that they're representative. We know that we're bound to scan these same pages in the next VACUUM anyway, so they'll have another chance to impact our view of the table's tuple density at that time. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 2:22 PM Andres Freund wrote:
> The problem with the change is here:
>
>     /*
>      * Okay, we've covered the corner cases. The normal calculation is to
>      * convert the old measurement to a density (tuples per page), then
>      * estimate the number of tuples in the unscanned pages using that figure,
>      * and finally add on the number of tuples in the scanned pages.
>      */
>     old_density = old_rel_tuples / old_rel_pages;
>     unscanned_pages = (double) total_pages - (double) scanned_pages;
>     total_tuples = old_density * unscanned_pages + scanned_tuples;
>     return floor(total_tuples + 0.5);

My assumption has always been that vac_estimate_reltuples() is prone to issues like this because it just doesn't have access to very much information each time it runs. It can only see the delta between what VACUUM just saw, and what the last VACUUM (or possibly the last ANALYZE) saw according to pg_class. You're always going to find weaknesses in such a model if you go looking for them. You're always going to find a way to salami slice your way from good information to total nonsense, if you pick the right/wrong test case, which runs VACUUM in a way that allows whatever bias there may be to accumulate. It's sort of like the way floating point values can become very inaccurate through a process that allows many small inaccuracies to accumulate over time. Maybe you're right to be concerned to the degree that you're concerned -- I'm not sure. I'm just adding what I see as important context. -- Peter Geoghegan
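To see how the extrapolation above can drift, consider what happens when each VACUUM scans the same fraction of the table and finds those pages nearly empty: reltuples decays geometrically, which is the halving pattern visible in Andres' table elsewhere in the thread. A tiny self-contained simulation, with illustrative numbers only:

#include <stdio.h>

/*
 * Simulate repeated application of the extrapolation above when the same
 * half of the pages is scanned every time and found (almost) empty.
 */
int
main(void)
{
    double  total_pages = 1000.0;
    double  scanned_pages = 500.0;      /* same pages each time */
    double  scanned_tuples = 50.0;      /* those pages are nearly empty */
    double  reltuples = 100000.0;       /* starting estimate */

    for (int i = 1; i <= 10; i++)
    {
        double  old_density = reltuples / total_pages;
        double  unscanned = total_pages - scanned_pages;

        reltuples = old_density * unscanned + scanned_tuples;
        printf("vacuum %d: reltuples %.0f\n", i, reltuples);
    }
    return 0;                           /* estimate roughly halves each time */
}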
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 1:08 PM Andres Freund wrote: > I suggested nearby to only have ANALYZE dead_tuples it if there's been no > [auto]vacuum since the stats entry was created. That allows recovering from > stats resets, be it via crashes or explicitly. What do you think? I like that idea. It's far simpler than the kind of stuff I was thinking about, and probably just as effective. Even if it introduces some unforeseen problem (which seems unlikely), we can still rely on pgstat_report_vacuum() to set things straight before too long. Are you planning on writing a patch for this? I'd be very interested in seeing this through. Could definitely review it. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 1:02 PM Peter Geoghegan wrote: > Some of what I'm proposing arguably amounts to deliberately adding a > bias. But that's not an unreasonable thing in itself. I think of it as > related to the bias-variance tradeoff, which is a concept that comes > up a lot in machine learning and statistical inference. To be clear, I was thinking of unreservedly trusting what pgstat_report_analyze() says about dead_tuples in the event of its estimate increasing our dead_tuples estimate, while being skeptical (to a varying degree) when it's the other way around. But now I need to go think about what Andres just brought up... -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 12:44 PM Robert Haas wrote: > I don't know enough about the specifics of how this works to have an > intelligent opinion about how likely these particular ideas are to > work out. However, I think it's risky to look at estimates and try to > infer whether they are reliable. It's too easy to be wrong. What we > really want to do is anchor our estimates to some data source that we > know we can trust absolutely. If you trust possibly-bad data less, it > screws up your estimates more slowly, but it still screws them up. Some of what I'm proposing arguably amounts to deliberately adding a bias. But that's not an unreasonable thing in itself. I think of it as related to the bias-variance tradeoff, which is a concept that comes up a lot in machine learning and statistical inference. We can afford to be quite imprecise at times, especially if we choose a bias that we know has much less potential to do us harm -- some mistakes hurt much more than others. We cannot afford to ever be dramatically wrong, though -- especially in the direction of vacuuming less often. Besides, there is something that we *can* place a relatively high degree of trust in that will still be in the loop here: VACUUM itself. If VACUUM runs then it'll call pgstat_report_vacuum(), which will set the record straight in the event of over estimating dead tuples. To some degree the problem of over estimating dead tuples is self-limiting. > If Andres is correct that what really matter is the number of pages > we're going to have to dirty, we could abandon counting dead tuples > altogether and just count not-all-visible pages in the VM map. That's what matters most from a cost point of view IMV. So it's a big part of the overall picture, but not everything. It tells us relatively little about the benefits, except perhaps when most pages are all-visible. -- Peter Geoghegan
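On the idea of counting not-all-visible pages: the visibility map already supports cheap whole-relation counts, so a trigger along those lines could be sketched as follows. The surrounding policy is made up, but visibilitymap_count() is the existing function:

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Illustrative sketch only: estimate how many heap pages a VACUUM would
 * actually have to visit (and potentially dirty) by counting the pages
 * that are not currently all-visible according to the VM.
 */
static BlockNumber
pages_not_all_visible(Relation rel)
{
    BlockNumber all_visible;
    BlockNumber all_frozen;
    BlockNumber rel_pages = RelationGetNumberOfBlocks(rel);

    visibilitymap_count(rel, &all_visible, &all_frozen);

    return rel_pages - all_visible;
}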
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Wed, Jan 18, 2023 at 11:02 AM Robert Haas wrote: > On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan wrote: > > pgstat_report_analyze() will totally override the > > tabentry->dead_tuples information that drives autovacuum.c, based on > > an estimate derived from a random sample -- which seems to me to be an > > approach that just doesn't have any sound theoretical basis. > > Yikes. I think we don't have a choice but to have a method to correct > the information somehow, because AFAIK the statistics system is not > crash-safe. But that approach does seem to carry significant risk of > overwriting correct information with wrong information. This situation is really quite awful, so maybe we should do something about it soon, in the scope of the Postgres 16 work on autovacuum that is already underway. In fact I think that the problem here is so bad that even something slightly less naive would be far more effective. You're right to point out that pgstat_report_analyze needs to update the stats in case there is a hard crash, of course. But there is plenty of context with which to make better decisions close at hand. For example, I bet that pgstat_report_analyze already does a pretty good job of estimating live_tuples -- my spiel about statistics mostly doesn't apply to live_tuples. Suppose that we notice that its new estimate for live_tuples approximately matches what the stats subsystem already thought about live_tuples, while dead_tuples is far far lower. We shouldn't be so credulous as to believe the new dead_tuples estimate at that point. Perhaps we can change nothing about dead_tuples at all when this happens. Or perhaps we can set dead_tuples to a value that is scaled from the old estimate. The new dead_tuples value could be derived by taking the ratio of the old live_tuples to the old dead_tuples, and then using that to scale from the new live_tuples. This is just a first pass, to see what you and others think. Even very simple heuristics seem like they could make things much better. Another angle of attack is the PD_ALL_VISIBLE page-level bit, which acquire_sample_rows() could pay attention to -- especially in larger tables, where the difference between all pages and just the all-visible subset of pages is most likely to matter. The more sampled pages that had PD_ALL_VISIBLE set, the less credible the new dead_tuples estimate will be (relative to existing information), and so pgstat_report_analyze() should prefer the new estimate over the old one in proportion to that. We probably shouldn't change anything about pgstat_report_vacuum as part of this effort to make pgstat_report_analyze less terrible in the near term. It certainly has its problems (what is true for pages that VACUUM scanned at the end of VACUUM is far from representative for new pages!), but it's probably much less of a contributor to issues like those that Andres reports seeing. BTW, one of the nice things about the insert-driven autovacuum stats is that pgstat_report_analyze doesn't have an opinion about how many tuples were inserted since the last VACUUM ran. It does have other problems, but they seem less serious to me. > Yep. I think what we should try to evaluate is which number is > furthest from the truth. My guess is that the threshold is so high > relative to what a reasonable value would be that you can't get any > benefit out of making the dead tuple count more accurate. Like, if the > threshold is 100x too high, or something, then who cares how accurate > the dead tuples number is? Right.
Or if we don't make any reasonable distinction between LP_DEAD items and dead heap-only tuples, then the total number of both things together may matter very little. Better to be approximately correct than exactly wrong. Deliberately introducing a bias to lower the variance is a perfectly valid approach. > Maybe that's even true in some scenarios, but I bet that > it's never the issue when people have really big tables. The fact that > I'm OK with 10MB of bloat in my 100MB table doesn't mean I'm OK with > 1TB of bloat in my 10TB table. Among other problems, I can't even > vacuum away that much bloat in one index pass, because autovacuum > can't use enough work memory for that. Also, the absolute space > wastage matters. I certainly agree with all that. FWIW, part of my mental model with VACUUM is that the rules kind of change in the case of a big table. We're far more vulnerable to issues such as (say) waiting for cleanup locks because the overall cadence used by autovacuum is so infrequent relative to everything else. There are more opportunities for things to go wrong, worse consequences when they do go wrong, and greater potential for the problems to compound. -- Peter Geoghegan
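Going back to the ratio-based idea from earlier in this message, a first-pass sketch might look like this (hypothetical names; nothing here is from an actual patch):

#include "postgres.h"

/*
 * First-pass sketch of the ratio-based idea: scale the preexisting
 * dead_tuples estimate by the change in live_tuples, rather than
 * credulously accepting ANALYZE's sampled dead_tuples.  Only believe
 * ANALYZE outright when it would *increase* our dead_tuples estimate.
 */
static double
scaled_dead_tuples(double old_live_tuples, double old_dead_tuples,
                   double new_live_tuples, double analyze_dead_tuples)
{
    double  scaled;

    if (old_live_tuples <= 0)
        return analyze_dead_tuples;     /* nothing usable to scale from */

    scaled = new_live_tuples * (old_dead_tuples / old_live_tuples);

    return Max(scaled, analyze_dead_tuples);
}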
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
Autovacuum shouldn't ever be a sucker. It should pay attention to disconfirmatory signals. The information that drives its decision making process should be treated as provisional. Even if the information was correct at one point, the contents of the table are constantly changing in a way that could matter enormously. So we should be paying attention to where the table is going -- and even where it might be going -- not just where it is, or was. > True, although it can be overdone. An extra vacuum on a big table with > some large indexes that end up getting scanned can be very expensive > even if the table itself is almost entirely all-visible. We can't > afford to make too many mistakes in the direction of vacuuming early > in such cases. No, but we can afford to make some -- and can detect when it happened after the fact. I would rather err on the side of over-vacuuming, especially if the system is smart enough to self-correct when that turns out to be the wrong approach. One of the advantages of running VACUUM sooner is that it provides us with relatively reliable information about the needs of the table. We can also cheat, sort of. If we find another justification for autovacuuming (e.g., it's a quiet time for the system as a whole), and it works out to help with this other problem, it may be just as good for users. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Tue, Jan 17, 2023 at 2:11 PM Robert Haas wrote: > On Tue, Jan 17, 2023 at 3:08 PM Peter Geoghegan wrote: > > If you assume that there is chronic undercounting of dead tuples > > (which I think is very common), ... > > Why do you think that? For the reasons I gave about statistics, random sampling, the central limit theorem. All that stuff. This matches the experience of Andres. And is obviously the only explanation behind the reliance on antiwraparound autovacuums for cleaning up bloat in larger OLTP databases. It just fits: the dead tuples approach can sometimes be so completely wrong that even an alternative triggering condition based on something that is virtually unrelated to the thing we actually care about can do much better in practice. Consistently, reliably, for a given table/workload. > > How many dead heap-only tuples are equivalent to one LP_DEAD item? > > What about page-level concentrations, and the implication for > > line-pointer bloat? I don't have a good answer to any of these > > questions myself. > > Seems a bit pessimistic. If we had unlimited resources and all > operations were infinitely fast, the optimal strategy would be to > vacuum after every insert, update, or delete. But in reality, that > would be prohibitively expensive, so we're making a trade-off. To a large degree, that's my point. I don't know how to apply this information, so having detailed information doesn't seem like the main problem. > If we had an oracle that could provide us with perfect information, > we'd ask it, among other things, how much work will be required to > vacuum right now, and how much benefit would we get out of doing so. And then what would we do? What about costs? Even if we were omniscient, we still wouldn't be omnipotent. We're still subject to the laws of physics. VACUUM would still be something that more or less works at the level of the whole table, or not at all. So being omniscient seems kinda overrated to me. Adding more information does not in general lead to better outcomes. > The dead tuple count is related to the first question. It's not a > direct, linear relationship, but it's not completely unrelated, > either. Maybe we could refine the estimates by gathering more or > different statistics than we do now, but ultimately it's always going > to be a trade-off between getting the work done sooner (and thus maybe > preventing table growth or a wraparound shutdown) and being able to do > more work at once (and thus being more efficient). The current system > set of counters predates HOT and the visibility map, so it's not > surprising if needs updating, but if you're argue that the whole > concept is just garbage, I think that's an overreaction. What I'm arguing is that principally relying on any one thing is garbage. If you have only one thing that creates pressure to VACUUM then there can be a big impact whenever it turns out to be completely wrong. Whereas if VACUUM can run because of (say) 3 moderate signals taken together, then it's much less likely that we'll be completely wrong. In general my emphasis is on avoiding disaster in all its forms. Vacuuming somewhat early more often is perhaps suboptimal, but far from a disaster. It's the kind of thing that we can manage. By all means, let's make the dead tuples/dead items stuff less naive (e.g. make it distinguish between LP_DEAD items and dead heap-only tuples). But even then, we shouldn't continue to completely rely on it in the way that we do right now. 
In other words, I'm fine with adding more information that is more accurate, as long as we treat it as somewhat suspect, and certainly don't rely on it completely if we can possibly avoid that. In particular, we need to think about both costs and benefits at all times. -- Peter Geoghegan
Re: Rework of collation code, extensibility
On Sat, Jan 14, 2023 at 12:03 PM Jeff Davis wrote: > The first goal I had was simply that the code was really hard to > understand and work on, and refactoring was justified to improve the > situation. > > The second goal, which is somewhat dependent on the first goal, is that > we really need an ability to support multiple ICU libraries, and I > wanted to do some common groundwork that would be needed for any > approach we choose there, and provide some hooks to get us there. You > are right that this goal influenced the first goal. I don't disagree that it was somewhat independent of the first goal. I just think that it makes sense to "round up to fully dependent". Basically it's not independent enough to be worth talking about as an independent thing, just as a practical matter - it's confusing at the level of things like the commit message. There is a clear direction that you're going in here from the start, and your intentions in 0001 do matter to somebody that's just looking at 0001 in isolation. That is my opinion, at least. The second goal is a perfectly good enough goal on its own, and one that I am totally supportive of. Making the code clearer is icing on the cake. > ucol_strcollUTF8() accepts -1 to mean "nul-terminated". I did some > basic testing and it doesn't seem like it's slower than using the > length. If passing the length is faster for some reason, it would > complicate the API because we'd need an entry point that's expecting > nul-termination and lengths, which is awkward (as Peter E. pointed > out). That's good. I'm happy to leave it at that. I was only enquiring. > I felt it was a little clearer amongst the other code, to a casual > reader, but I suppose it's a style thing. I will change it if you > insist. I certainly won't insist. > I'd have to expose the pg_locale_t struct, which didn't seem desirable > to me. Do you think it's enough of a performance concern to be worth > some ugliness there? I don't know. Quite possibly not. It would be nice to have some data on that, though. -- Peter Geoghegan
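For anyone following along, the -1 convention looks like this in ICU's C API -- a minimal standalone example with error handling kept to the bare minimum:

#include <stdio.h>
#include <unicode/ucol.h>

int
main(void)
{
    UErrorCode  status = U_ZERO_ERROR;
    UCollator  *coll = ucol_open("en-US", &status);
    UCollationResult cmp;

    if (U_FAILURE(status))
        return 1;

    /* -1 tells ICU that both inputs are nul-terminated UTF-8 strings */
    cmp = ucol_strcollUTF8(coll, "apple", -1, "banana", -1, &status);
    printf("result: %d\n", (int) cmp);  /* UCOL_LESS / UCOL_EQUAL / UCOL_GREATER */

    ucol_close(coll);
    return U_FAILURE(status) ? 1 : 0;
}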
Re: Update comments in multixact.c
On Tue, Jan 17, 2023 at 1:33 AM shiy.f...@fujitsu.com wrote: > I noticed that commit 5212d447fa updated some comments in multixact.c because > SLRU truncation for multixacts is performed during VACUUM, instead of > checkpoint. Should the following comments which mentioned checkpointer be > changed, too? Yes, I think so. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
> Personally I think something like autovacuum_no_auto_cancel_age > would be an improvement, but I also don't quite feel satisfied with it. I don't either; but it should be strictly less unsatisfactory. > Tracking the number of autovac failures seems uncontroversial and quite > beneficial, even if the rest doesn't make it in. It'd at least let users > monitor for tables where autovac is likely to swoop in in anti-wraparound > mode. I'll see what I can come up with. > Perhaps its worth to separately track the number of times a backend would have > liked to cancel autovac, but couldn't due to anti-wrap? If changes to the > no-auto-cancel behaviour don't make it in, it'd at least allow us to collect > more data about the prevalence of the problem and in what situations it > occurs? Even just adding some logging for that case seems like it'd be an > improvement. Hmm, maybe. > I think with a bit of polish "Add autovacuum trigger instrumentation." ought > to be quickly mergeable. Yeah, I'll try to get that part out of the way quickly. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Mon, Jan 16, 2023 at 8:13 PM Dilip Kumar wrote: > I think that it makes sense to keep 'vacuum_freeze_strategy_threshold' > strictly for freezing. But the point is that the eager scanning > strategy is driven by table freezing needs of the table (tableagefrac) > that make sense, but if we have selected the eager freezing based on > the table age and its freezing need then why don't we force the eager > freezing as well if we have selected eager scanning, after all the > eager scanning is selected for satisfying the freezing need. Don't think of eager scanning as the new name for aggressive mode -- it's a fairly different concept, because we care about costs now. Eager scanning can be chosen just because it's very cheap relative to the alternative of lazy scanning, even when relfrozenxid is still very recent. (This kind of behavior isn't really new [1], but the exact implementation from the patch is new.) Tables such as pgbench_branches and pgbench_tellers will reliably use eager scanning strategy, no matter how any GUC has been set -- just because the added cost is always zero (relative to lazy scanning). It really doesn't matter how far along tableagefrac is here, ever. These same tables will never use eager freezing strategy, unless the vacuum_freeze_strategy_threshold GUC is misconfigured. (This is another example of how scanning strategy and freezing strategy may differ for the same table.) You do have a good point, though. I think that I know what you mean. Note that antiwraparound autovacuums (or VACUUMs of tables very near to that point) *will* always use both the eager freezing strategy and the eager scanning strategy -- which is probably close to what you meant. The important point is that there can be more than one reason to prefer one strategy to another -- and the reasons can be rather different. Occasionally it'll be a combination of two factors together that push things in favor of one strategy over the other -- even though either factor on its own would not have resulted in the same choice. [1] https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Constantly_updated_tables_.28usually_smaller_tables.29 -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Mon, Jan 16, 2023 at 8:25 AM Robert Haas wrote: > I really dislike formulas like Min(freeze_max_age * 2, 1 billion). > That looks completely magical from a user perspective. Some users > aren't going to understand autovacuum behavior at all. Some will, and > will be able to compare age(relfrozenxid) against > autovacuum_freeze_max_age. Very few people are going to think to > compare age(relfrozenxid) against some formula based on > autovacuum_freeze_max_age. I guess if we document it, maybe they will. What do you think of Andres' autovacuum_no_auto_cancel_age proposal? As I've said several times already, I am by no means attached to the current formula. > I do like the idea of driving the auto-cancellation behavior off of the > results of previous attempts to vacuum the table. That could be done > independently of the XID age of the table. Even when the XID age of the table has already significantly surpassed autovacuum_freeze_max_age, say due to autovacuum worker starvation? > If we've failed to vacuum > the table, say, 10 times, because we kept auto-cancelling, it's > probably appropriate to force the issue. I suggested 1000 times upthread. 10 times seems very low, at least if "number of times cancelled" is the sole criterion, without any attention paid to relfrozenxid age or some other tiebreaker. > It doesn't really matter > whether the autovacuum triggered because of bloat or because of XID > age. Letting either of those things get out of control is bad. While inventing a new no-auto-cancel behavior that prevents bloat from getting completely out of control may well have merit, I don't see why it needs to be attached to this other effort. I think that the vast majority of individual tables have autovacuums cancelled approximately never, and so my immediate concern is ameliorating cases where not being able to auto-cancel once in a blue moon causes an outage. Sure, the opposite problem also exists, and I think that it would be really bad if it was made significantly worse as an unintended consequence of a patch that addressed just the first problem. But that doesn't mean we have to solve both problems together at the same time. > But at that point a lot of harm has already > been done. In a frequently updated table, waiting 300 million XIDs to > stop cancelling the vacuum is basically condemning the user to have to > run VACUUM FULL. The table can easily be ten or a hundred times bigger > than it should be by that point. The rate at which relfrozenxid ages is just about useless as a proxy for how much wall clock time has passed with a given workload -- workloads are usually very bursty. It's much worse still as a proxy for what has changed in the table; completely static tables have their relfrozenxid age at exactly the same rate as the most frequently updated table in the same database (the table that "consumes the most XIDs"). So while antiwraparound autovacuum no-auto-cancel behavior may indeed save the user from problems with serious bloat, it will happen pretty much by mistake. Not that it doesn't happen all the same -- of course it does. That factor (the mistake factor) doesn't mean I take the point any less seriously. What I don't take seriously is the idea that the precise XID age was ever crucially important. More generally, I just don't accept that this leaves us with no room for something along the lines of my proposal, such as Andres' autovacuum_no_auto_cancel_age concept.
As I've said already, there will usually be a very asymmetric quality to the problem in cases like the Joyent outage. Even a modest amount of additional XID-space-headroom will very likely be all that will be needed at the critical juncture. It may not be perfect, but it still has every potential to make things safer for some users, without making things any less safe for other users. -- Peter Geoghegan
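For reference, the formula being debated comes down to something like the following sketch. This is a paraphrase of the heuristic under discussion, not committed code; the function name is invented, and 64-bit arithmetic is used here only to sidestep overflow in the sketch.

    #include <stdint.h>

    #define MAX_NO_AUTO_CANCEL_AGE  UINT64_C(1000000000)    /* 1 billion XIDs */

    /*
     * Table age at which an autovacuum stops allowing itself to be
     * auto-cancelled: twice autovacuum_freeze_max_age, capped at 1 billion.
     */
    static uint64_t
    no_auto_cancel_age(uint64_t autovacuum_freeze_max_age)
    {
        uint64_t    cutoff = autovacuum_freeze_max_age * 2;

        return (cutoff < MAX_NO_AUTO_CANCEL_AGE) ? cutoff : MAX_NO_AUTO_CANCEL_AGE;
    }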
Re: Show various offset arrays for heap WAL records
On Wed, Jan 11, 2023 at 3:11 PM Peter Geoghegan wrote: > On Wed, Jan 11, 2023 at 3:00 PM Andres Freund wrote: > > What are your thoughts about the place for the helper functions? You're ok > > with rmgrdesc_utils.[ch]? > > Yeah, that seems okay. BTW, while playing around with this patch today, I noticed that it won't display the number of elements in each offset array directly. Perhaps it's worth including that, too? -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Mon, Jan 16, 2023 at 10:00 AM Peter Geoghegan wrote: > Now we treat the scanning and freezing strategies as two independent > choices. Of course they're not independent in any practical sense, but > I think it's slightly simpler and more elegant that way -- it makes > the GUC vacuum_freeze_strategy_threshold strictly about freezing > strategy, while still leading to VACUUM advancing relfrozenxid in a > way that makes sense. It just happens as a second order effect. Why > add a special case? This might be a better way to explain it: The main page-level freezing commit (commit 1de58df4) already added an optimization that triggers page-level freezing "early" (early relative to vacuum_freeze_min_age). This happens whenever a page already needs to have an FPI logged inside lazy_scan_prune -- even when we're using the lazy freezing strategy. The optimization isn't configurable, and gets applied regardless of freezing strategy (technically there is no such thing as freezing strategies on HEAD just yet, though HEAD still has this optimization). There will be workloads where the FPI optimization will result in freezing many more pages -- especially when data checksums are in use (since then we could easily need to log an FPI just so pruning can set a hint bit). As a result, certain VACUUMs that use the lazy freezing strategy will freeze in almost the same way as an equivalent VACUUM using the eager freezing strategy. Such a "nominally lazy but actually quite eager" VACUUM operation should get the same benefit in terms of relfrozenxid advancement as it would if it really had used the eager freezing strategy instead. It's fairly obvious that we'll get the same benefit in relfrozenxid advancement (comparable relfrozenxid results for comparable freezing work), since the way that VACUUM decides on its scanning strategy is not conditioned on freezing strategy (whether by the ongoing VACUUM or any other VACUUM against the same table). All that matters is the conditions in the table (in particular the added cost of opting for eager scanning over lazy scanning) as indicated by the visibility map at the start of each VACUUM -- how those conditions came about really isn't interesting at that point. And so lazy_scan_strategy doesn't care about them when it chooses VACUUM's scanning strategy. There are even tables/workloads where relfrozenxid will be able to jump forward by a huge amount whenever VACUUM chooses the eager scanning strategy, despite the fact that VACUUM generally does little or no freezing to make that possible: https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_3 -- Peter Geoghegan
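Restated as code, the interaction described above looks roughly like this. It is a deliberately simplified sketch with invented names; the real logic lives in lazy_scan_prune and the page-level freezing code, and applies more conditions than shown here.

    #include <stdbool.h>

    /*
     * Simplified page-level freezing decision.  The FPI-based trigger applies
     * even under the lazy freezing strategy, which is why a "nominally lazy"
     * VACUUM can end up freezing almost as much as an eager one.
     */
    static bool
    should_freeze_page(bool eager_freeze_strategy,
                       bool pruning_emitted_fpi,
                       bool page_has_tuples_past_freeze_min_age)
    {
        if (eager_freeze_strategy)
            return true;        /* freeze pages that will be set all-visible */
        if (pruning_emitted_fpi)
            return true;        /* FPI already paid for; freeze "early" */
        return page_has_tuples_past_freeze_min_age; /* vacuum_freeze_min_age rule */
    }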
Re: New strategies for freezing, advancing relfrozenxid early
On Mon, Jan 16, 2023 at 10:10 AM Peter Geoghegan wrote: > Attached is v16, which incorporates some of Matthias' feedback. 0001 (the freezing strategies patch) is now committable IMV. Or at least will be once I polish the docs a bit more. I plan on committing 0001 some time next week, barring any objections. I should point out that 0001 is far shorter and simpler than the page-level freezing commit that already went in (commit 1de58df4). The only thing in 0001 that seems like it might be a bit controversial (when considered on its own) is the addition of the vacuum_freeze_strategy_threshold GUC/reloption. Note in particular that vacuum_freeze_strategy_threshold doesn't look like any other existing GUC; it gets applied as a threshold on the size of the rel's main fork at the beginning of vacuumlazy.c processing. As far as I know there are no objections to that approach at this time, but it does still seem worth drawing attention to now. 0001 also makes unlogged tables and temp tables always use eager freezing strategy, no matter how the GUC/reloption are set. This seems *very* easy to justify, since the potential downside of such a policy is obviously extremely low, even when we make very pessimistic assumptions. The usual cost we need to worry about when it comes to freezing is the added WAL overhead -- that clearly won't apply when we're vacuuming non-permanent tables. That really just leaves the cost of dirtying extra pages, which in general could have a noticeable system-level impact in the case of unlogged tables. Dirtying extra pages when vacuuming an unlogged table is also a non-issue. Even the eager freezing strategy only freezes "extra" pages ("extra" relative to the lazy strategy behavior) given a page that will be set all-visible in any case [1]. Such a page will need to have its page-level PD_ALL_VISIBLE bit set in any case -- which is already enough to dirty the page. And so there can never be any additional pages dirtied as a result of the special policy 0001 adds for non-permanent relations. [1] https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2 -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
to more cheaply get rid of the > dead tids. Obviously we'd also set them when vacuum decides / was told not to > do index cleanup. Yes, it'd obviously be less effective than lots of the > things we discussed in this area (needing to re-collect the dead tids on the > indicated), but it'd have the advantage of not needing a lot of new > infrastructure. I wonder if it would be possible to split up the work of VACUUM into multiple phases that can be processed independently. The dead_items array could be serialized and stored in a temp file. That's not the same as some of the more complicated stuff we talked about in the last couple of years, such as a dedicated fork for Dead TIDs. It's more like an extremely flexible version of the same basic design for VACUUM, with the ability to slow down and speed up based on a system level view of things (e.g., checkpointing information). And with index vacuuming happening on a highly deferred timeline in some cases. Possibly we could make each slice of work processed by any available autovacuum worker. Autovacuum workers could become "headless". You would need some kind of state machine to make sure that critical dependencies were respected (e.g., always do the heap vacuuming step after all indexes are vacuumed), but that possibly isn't that hard, and still gives you a lot. As for this patch of mine: do you think that it would be acceptable to pursue a version based on your autovacuum_no_auto_cancel_age design for 16? Perhaps this can include something like pgstat_report_autovac_failure(). It's not even the work of implementing pgstat_report_autovac_failure() that creates risk that it'll miss the 16 feature freeze deadline. I'm more concerned that introducing a more complicated design will lead to the patch being bikeshedded to death. -- Peter Geoghegan
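To illustrate the kind of state machine this would need, here is a purely hypothetical sketch -- nothing like this exists in the patch or in core today. The critical dependency is just that the heap vacuuming phase can only run once every index has been vacuumed.

    #include <stdbool.h>

    /* Hypothetical phases for a decomposed VACUUM */
    typedef enum VacuumPhase
    {
        PHASE_SCAN_HEAP,        /* prune, and serialize dead TIDs to a temp file */
        PHASE_VACUUM_INDEXES,   /* one unit of work per index, any worker */
        PHASE_VACUUM_HEAP,      /* mark LP_DEAD items LP_UNUSED */
        PHASE_DONE,
    } VacuumPhase;

    static VacuumPhase
    next_phase(VacuumPhase cur, bool all_indexes_vacuumed)
    {
        switch (cur)
        {
            case PHASE_SCAN_HEAP:
                return PHASE_VACUUM_INDEXES;
            case PHASE_VACUUM_INDEXES:
                /* heap vacuuming must wait until *all* indexes are done */
                return all_indexes_vacuumed ? PHASE_VACUUM_HEAP : PHASE_VACUUM_INDEXES;
            case PHASE_VACUUM_HEAP:
            default:
                return PHASE_DONE;
        }
    }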
Re: New strategies for freezing, advancing relfrozenxid early
On Sun, Jan 15, 2023 at 9:13 PM Dilip Kumar wrote: > I have looked into the patch set, I think 0001 looks good to me about > 0002 I have a few questions, 0003 I haven't yet looked at Thanks for taking a look. > I think '(nextXID - cutoffs->relfrozenxid) / freeze_table_age' should > be the actual fraction right? What is the point of adding 0.5 to the > divisor? If there is a logical reason, maybe we can explain in the > comments. It's just a way of avoiding division by zero. > While looking into the logic of 'lazy_scan_strategy', I think the idea > looks very good but the only thing is that > we have kept eager freeze and eager scan completely independent. > Don't you think that if a table is chosen for an eager scan > then we should force the eager freezing as well? Earlier versions of the patch kind of worked that way. lazy_scan_strategy would actually use twice the GUC setting to determine scanning strategy. That approach could make our "transition from lazy to eager strategies" involve an excessive amount of "catch-up freezing" in the VACUUM operation that advanced relfrozenxid for the first time, which you see an example of here: https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch Now we treat the scanning and freezing strategies as two independent choices. Of course they're not independent in any practical sense, but I think it's slightly simpler and more elegant that way -- it makes the GUC vacuum_freeze_strategy_threshold strictly about freezing strategy, while still leading to VACUUM advancing relfrozenxid in a way that makes sense. It just happens as a second order effect. Why add a special case? In principle the break-even point for eager scanning strategy (i.e. advancing relfrozenxid) is based on the added cost only under this scheme. There is no reason for lazy_scan_strategy to care about what happened in the past to make the eager scanning strategy look like a good idea. Similarly, there isn't any practical reason why lazy_scan_strategy needs to anticipate what will happen in the near future with freezing. -- Peter Geoghegan
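In code form, the calculation under discussion amounts to something like the following. This is illustrative only; the real expression is in the patch and may differ in detail.

    #include <stdint.h>

    /*
     * Fraction of the way towards the freeze_table_age cutoff that this
     * table's relfrozenxid has advanced.  Adding 0.5 to the divisor is just
     * a cheap way of avoiding division by zero when freeze_table_age is 0.
     */
    static double
    compute_tableagefrac(uint32_t nextXID, uint32_t relfrozenxid,
                         uint32_t freeze_table_age)
    {
        double      age = (double) (nextXID - relfrozenxid);

        return age / ((double) freeze_table_age + 0.5);
    }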
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
d behave that way. This phenomenon is sometimes called alarm fatigue. It can be quite dangerous to warn people about non-issues "out of an abundance of caution". > 3) Autovacuums triggered by tuple thresholds persistently getting cancelled > also regularly causes outages, and they make it more likely that an > eventual age-based vacuum will take forever. Really? Outages? I imagine that you'd have to be constantly hammering the table with DDL before it could happen. That's possible, but it seems relatively obvious that doing that is asking for trouble. Whereas the Manta postmortem (and every similar case that I've personally seen) involved very nasty interactions that happened due to the way components interacted with a workload that wasn't like that. Running DDL from a cron job or from the application may not be a great idea, but it's also quite common. > Aspect 1) is addressed to a good degree by the proposed split of anti-wrap > into an age and anti-cancel triggers. And could be improved by reporting > failsafe autovacuums in pg_stat_activity. What you call aspect 2 (the issue with disastrous HW lock traffic jams involving TRUNCATE being run from a cron job, etc) is a big goal of mine for this patch series. You seem unsure of how effective my approach (or an equally simple approach based on table age heuristics) will be. Is that the case? > Perhaps 2) could be improved a bit by emitting a WARNING message when we > didn't cancel AV because it was anti-wraparound? But eventually I think we > somehow need to signal the "intent" of the lock drop down into ProcSleep() or > wherever it'd be. That's doable, but definitely seems like separate work. > I have two ideas around 3): > > First, we could introduce thresholds for the tuple thresholds, after which > autovacuum isn't cancellable anymore. Do you think that's a good idea? I just don't trust those statistics, at all. As in, I think they're often complete garbage. > Second, we could track the number of cancellations since the last [auto]vacuum > in pgstat, and only trigger the anti-cancel behaviour when autovacuum has been > cancelled a number of times. In theory that would be better than an approach along the lines I've proposed, because it is directly based on a less aggressive approach being tried a few times, and failing a few times. That part I like. However, I also don't like several things about this approach. First of all it relies on relatively complicated infrastructure, for something that can be critical. Second, it will be hard to test. Third, perhaps it would make sense to give the earlier/less aggressive approach (a table age av that is still autocancellable) quite a large number of attempts before giving up. If the table age isn't really growing too fast, why not continue to be patient, possibly for quite a long time? Perhaps a hybrid strategy could be useful? Something like what I came up with already, *plus* a mechanism that gives up after (say) 1000 cancellations, and escalates to no-auto-cancel, regardless of table age. It seems sensible to assume that a less aggressive approach is just hopeless relatively quickly (in wall clock time and logical XID time) once we see sufficiently many cancellations against the same table. -- Peter Geoghegan
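A sketch of that hybrid idea follows. It is a strawman, nothing more -- the 1000-cancellation figure is just the number floated above, and the function is invented for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Escalate to the no-auto-cancel behavior either because table age has
     * crossed the usual cutoff, or because the patient, auto-cancellable
     * approach has already failed many times for this table.
     */
    static bool
    prevent_auto_cancel(uint64_t table_age, uint64_t no_auto_cancel_age,
                        uint32_t cancellations_since_last_vacuum)
    {
        if (table_age >= no_auto_cancel_age)
            return true;
        return cancellations_since_last_vacuum >= 1000;     /* strawman threshold */
    }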
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
ze_max_age * 2, 1 billion) > prevent_auto_cancel = reflrozenxid < recentXid - prevent_auto_cancel_age > > Is that right? That summary looks accurate, but I'm a bit confused about why you're asking the question this way. I thought that it was obvious that the patch doesn't change most of these things. The only mechanism that the patch changes is related to "prevent auto-cancellation" behaviors -- which is now what the term "antiwraparound" refers to. It does change the name of "autovacuum based on age", though -- the name is now "table age autovacuum" (the old name was antiwraparound autovacuum, of course). As I pointed out to you already, it's mechanically impossible for any autovacuum to be antiwraparound unless it's an XID table age/MXID table age autovacuum. The naming convention I propose here makes it a little confusing for us to discuss, but it seems like the best thing for users. Users' basic intuitions about antiwraparound autovacuums (that they're scary things needed because wraparound is starting to become a real concern) don't need to change. If anything they become more accurate, because antiwraparound autovacuums become non-routine -- which is really how it should have been when autovacuum was first added IMV. Users have rather good reasons to find antiwraparound autovacuums scary, even though that's kind of wrong (it's really our fault for making it so confusing for them, not their fault for being wrong). > One thing I just noticed: Isn't it completely bonkers that we compute > recentXid/recentMulti once at the start of a worker in > relation_needs_vacanalyze()? That's fine for the calls in do_autovacuum()'s > initial loops over all tables. But seems completely wrong for the later calls > via table_recheck_autovac() -> recheck_relation_needs_vacanalyze() -> > relation_needs_vacanalyze()? > > These variables really shouldn't be globals. It makes sense to cache them > locally in do_autovacuum(), but reusing them > recheck_relation_needs_vacanalyze() and sharing it between launcher and worker > is bad. I am not sure. I do hope that there isn't some subtle way in which the design relies on that. It seems obviously weird, and so I have to wonder if there is a reason behind it that isn't immediately apparent. -- Peter Geoghegan v4-0002-Add-table-age-trigger-concept-to-autovacuum.patch Description: Binary data v4-0001-Add-autovacuum-trigger-instrumentation.patch Description: Binary data
Re: Rework of collation code, extensibility
On Wed, Jan 11, 2023 at 3:44 PM Jeff Davis wrote: > Attached trivial rebase as v6. Some review comments for this v6. Comments on 0001-*: * I think that 0002-* can be squashed with 0001-*, since there isn't any functional reason why you'd want to commit the strcoll() and strxfrm() changes separately. Sometimes it can be useful to break things up, despite the fact that it couldn't possibly make sense to commit just one of the resulting patches on its own. However, I don't think that that's appropriate here. There is no apparent conceptual boundary that you're highlighting by splitting things up like this. strxfrm() and strcoll() are defined in terms of each other -- they're siblings, joined at the hip -- so this seems kinda jarring. * Your commit messages for 0001 (and the other patches) don't set things up by describing what the point is, and what the work anticipates. I think that they should do that. You're adding a layer of indirection that's going to set things up for later patches that add a layer of indirection for versioned ICU libraries (and even libc itself), and some of the details only make sense in that context. This isn't just refactoring work that could just as easily have happened in some quite different context. * I'm not sure that pg_strcoll() should be breaking ties itself. We break ties using strcmp() for historical reasons, but we must not do that for nondeterministic ICU collations -- a distinction that may be obscured by having pg_strcoll() do the tiebreak itself. That means that pg_strcoll()'s relationship to pg_strxfrm() isn't the same as the well known strcoll()/strxfrm() relationship. That kind of makes pg_strcoll() somewhat more than a strcoll() shim, which is inconsistent. Another concern is that the deterministic collation handling isn't confined to any one layer, which would have been nice. Do we need to do things this way? What's it adding? * varstrfastcmp_locale() is no longer capable of calling ucol_strcollUTF8() through the shim interface, meaning that it has to determine string length based on NUL-termination, when in principle it could just use the known length of the string. Presumably this might have performance implications. Have you thought about that? Some comments on 0002-*: * I don't see much point in this new varstr_abbrev_convert() variable: + const size_t max_prefix_bytes = sizeof(Datum); varstr_abbrev_convert() is concerned with packing abbreviated key bytes into Datums, so it's perfectly reasonable to deal with Datums/sizeof(Datum) directly. * Having a separate pg_strxfrm_prefix_libc() function just to throw an error doesn't really add much IMV. Comments on 0003-*: I suggest that some of the very trivial functions you have here (such as pg_locale_deterministic()) be made inline functions. Comments on 0006-*: * get_builtin_libc_library() could be indented in a way that would make it easier to understand. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
make sure that you stay well hydrated -- if the problem doesn't go away after a few days, then call back, reassess. Perhaps it really will be a brain tumor, but there is nothing to gain and everything to lose by taking such drastic action at the first sign of trouble. > If we cancel ourselves too, we're just postponing resolution of the > problem to some future point when we decide to stop cancelling > ourselves. That's not a win. It's also only a very minor loss, relative to what would have happened without any of this. This is something that we can be relatively sure of (unlike anything about final outcomes). It's clear that we have a lot to gain. What do we have to lose, really? > > I think that users will really appreciate having only one kind of > > VACUUM/autovacuum (since the other patch gets rid of discrete > > aggressive mode VACUUMs). I want "table age autovacuuming" (as I > > propose to call it) come to be seen as not any different to any other > > autovacuum, such as an "insert tuples" autovacuum or a "dead tuples" > > autovacuum. The difference is only in how autovacuum.c triggers the > > VACUUM, not in any runtime behavior. That's an important goal here. > > I don't agree with that goal. I think that having different kinds of > autovacuums with different, identifiable names and corresponding, > easily-identifiable behaviors is really important for troubleshooting. You need to distinguish between different types of autovacuums and different types of VACUUMs here. Sure, it's valuable to have information about why autovacuum launched a VACUUM, and the patch greatly improves that. But runtime behavior is another story. It's not really generic behavior -- more like generic policies that produce different behavior under different runtime conditions. VACUUM has always had generic policies about how to do things, at least up until the introduction of the visibility map, which added scan_all/aggressive VACUUMs, and the vacuum_freeze_table_age GUC. The policy should be the same in every VACUUM, which the behavior itself emerges from. > Trying to remove those distinctions and make everything look the same > will not keep autovacuum from getting itself into trouble. It will > just make it harder to understand what's happening when it does. The point isn't to have every VACUUM behave in the same way. The point is to make decisions dynamically, based on the observed conditions in the table. And to delay committing to things until there really is no alternative, to maximize our opportunities to avoid disaster. In short: loose, springy behavior. Imposing absolute obligations on VACUUM has the potential to create lots of problems. It is sometimes necessary, but can easily be overused, making a bad situation much worse. -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Thu, Jan 12, 2023 at 9:12 AM Robert Haas wrote: > I do agree that it's good to slowly increase the aggressiveness of > VACUUM as we get further behind, rather than having big behavior > changes all at once, but I think that should happen by smoothly > varying various parameters rather than by making discrete behavior > changes at a whole bunch of different times. In general I tend to agree, but, as you go on to acknowledge yourself, this particular behavior is inherently discrete. Either the PROC_VACUUM_FOR_WRAPAROUND behavior is in effect, or it isn't. In many important cases the only kind of autovacuum that ever runs against a certain big table is antiwraparound autovacuum. And therefore every autovacuum that runs against the table must necessarily not be auto cancellable. These are the cases where we see disastrous interactions with automated DDL, such as a TRUNCATE run by a cron job (to stop those annoying antiwraparound autovacuums) -- a heavyweight lock traffic jam that causes the application to lock up. All that I really want to do here is give an autovacuum that *can* be auto cancelled *some* non-zero chance to succeed with these kinds of tables. TRUNCATE completes immediately, so the AEL is no big deal. Except when it's blocked behind an antiwraparound autovacuum. That kind of interaction is occasionally just disastrous. Even just the tiniest bit of wiggle room could avoid it in most cases, possibly even almost all cases. > Maybe that's not the right idea, I don't know, and a naive > implementation might be worse than nothing, but I think it has some > chance of being worth consideration. It's a question of priorities. The failsafe isn't supposed to be used (when it is it is a kind of a failure), and so presumably only kicks in on very rare occasions, where nobody was paying attention anyway. So far I've heard no complaints about this, but I've heard lots of complaints about the antiwrap autocancellation behavior. > The behavior already changes when you hit > vacuum_freeze_min_age and then again when you hit > vacuum_freeze_table_age and then there's also > autoovacuum_freeze_max_age and xidWarnLimit and xidStopLimit and a few > others, and these setting all interact in pretty complex ways. The > more conditional logic we add to that, the harder it becomes to > understand what's actually happening. In general I strongly agree. In fact that's a big part of what motivates my ongoing work on VACUUM. The user experience is central. As Andres pointed out, presenting antiwraparound autovacuums as kind of an emergency thing but also somehow a routine thing is just horribly confusing. I want to make them into an emergency thing in every sense -- something that you as a user can reasonably expect to never see (like the failsafe). But if you do see one, then that's a useful signal of an underlying problem with contention, say from automated DDL that pathologically cancels autovacuums again and again. > Now, you might reply to the above by saying, well, some behaviors > can't vary continuously. vacuum_cost_limit can perhaps be phased out > gradually, but autocancellation seems like something that you must > either do, or not do. I would agree with that. But what I'm saying is > that we ought to favor having those kinds of behaviors all engage at > the same point rather than at different times. 
Right now aggressive VACUUMs do just about all freezing at the same time, to the extent that many users seem to think that it's a totally different thing with totally different responsibilities to regular VACUUM. Doing everything at the same time like that causes huge practical problems, and is very confusing. I think that users will really appreciate having only one kind of VACUUM/autovacuum (since the other patch gets rid of discrete aggressive mode VACUUMs). I want "table age autovacuuming" (as I propose to call it) to come to be seen as not any different to any other autovacuum, such as an "insert tuples" autovacuum or a "dead tuples" autovacuum. The difference is only in how autovacuum.c triggers the VACUUM, not in any runtime behavior. That's an important goal here. > I did take a look at the post-mortem to which you linked, but I am not > quite sure how that bears on the behavior change under discussion. The post-mortem involved a single "DROP TRIGGER" that caused chaos when it interacted with the auto cancellation behavior. It would usually complete instantly, so the AEL wasn't actually disruptive, but one day antiwraparound autovacuum made the cron job effectively block all reads and writes for hours. The similar outages I was called in to help with personally had either an automated TRUNCATE or an automated CREATE INDEX. Had autovacuum only been willing to yield once or twice, then it probably would have been fine -- the situation probably would have worked itself out naturally. That's the best outcome you can hope for. -- Peter Geoghegan
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Mon, Jan 9, 2023 at 12:58 PM Peter Geoghegan wrote: > On Mon, Jan 9, 2023 at 11:57 AM Andres Freund wrote: > > Afaict we'll need to backpatch this all the way? > > I thought that we probably wouldn't need to, at first. But I now think > that we really have to. Attached is v4. This is almost the same as v3. The only notable change is in how the issue is explained in comments, and in the commit message. I have revised my opinion on this question once more. Given what recent testing has revealed about the issue, I lean towards a HEAD-only commit once again. What do you think? I still hope to be able to commit this on my original timeline (on Monday or so), without the issue taking up too much more of everybody's time. Thanks -- Peter Geoghegan v4-0001-Tighten-up-VACUUM-s-approach-to-setting-VM-bits.patch Description: Binary data
Re: Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans
On Wed, Jan 11, 2023 at 4:44 PM Andres Freund wrote: > > Attached patch fixes up these issues. It's almost totally mechanical. > > Looks better, thanks! Pushed that just now. Thanks -- Peter Geoghegan
Re: Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans
On Mon, Jan 9, 2023 at 2:18 PM Peter Geoghegan wrote: > I'll try to get back to it this week. Attached patch fixes up these issues. It's almost totally mechanical. (Ended up using "git diff --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space" with this, per your recent tip, which did help.) -- Peter Geoghegan v1-0001-Rename-freeze-plan-dedup-routines.patch Description: Binary data
Re: pgsql: Delay commit status checks until freezing executes.
On Wed, Jan 11, 2023 at 3:06 PM Andres Freund wrote: > > + * We can't use TransactionIdDidAbort here because it won't treat > > transactions > > + * that were in progress during a crash as aborted by now. We determine > > that > > + * transactions aborted/crashed through process of elimination instead. > > s/by now//? Did it that way in the commit I pushed just now. Thanks -- Peter Geoghegan
Re: Show various offset arrays for heap WAL records
On Wed, Jan 11, 2023 at 3:00 PM Andres Freund wrote: > What are your thoughts about the place for the helper functions? You're ok > with rmgrdesc_utils.[ch]? Yeah, that seems okay. We may well need to put more stuff in that file. We're overdue for a big overhaul of the rmgr output, so that everybody uses the same format for everything. We made some progress on that for 16 already, by standardizing on the name snapshotConflictHorizon, but a lot of annoying inconsistencies still remain. Like the punctuation issue you mentioned. Ideally we'd be able to make the output easier to manipulate via the SQL interface from pg_walinspect, or perhaps via scripting. That would require some rules that are imposed top-down, so that consumers of the data can make certain general assumptions. But that's fairly natural. It's not like there is just inherently a great deal of diversity that needs to be considered. For example, the WAL records used by each individual index access method are all very similar. In fact the most important index AM WAL records used by each index AM (e.g. insert, delete, vacuum) have virtually the same format as each other already. -- Peter Geoghegan
Re: Show various offset arrays for heap WAL records
On Tue, Jan 10, 2023 at 11:35 AM Andres Freund wrote: > Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc: > void(*rm_desc) (StringInfo buf, XLogReaderState *record); > > so we'd need to patch all of them. That might be worth doing at some point, > but I don't want to tackle it right now. Okay. Let's just get the basics in soon, then. I would like to have a similar capability for index access methods, but mostly just for investigating performance. Whenever we've really needed something like this for debugging it seems to have been a heapam thing, just because there's a lot more that can go wrong with pruning, which is spread across many different places. -- Peter Geoghegan
Re: pgsql: Delay commit status checks until freezing executes.
On Sat, Jan 7, 2023 at 7:25 PM Andres Freund wrote: > Probably a good idea, although it doesn't neatly fit right now. I'll leave it for now. Attached is v2, which changes things based on your feedback. Would like to get this out of the way soon. > > Also, does amcheck's get_xid_status() need a reference to these rules? > > Don't think so? Whad made you ask? Just the fact that it seems to more or less follow the protocol described at the top of heapam_visibility.c. Not very important, though. -- Peter Geoghegan v2-0001-Improve-TransactionIdDidAbort-comments.patch Description: Binary data
Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode
On Wed, Jan 11, 2023 at 11:18 AM Andres Freund wrote: > I don't like that - it's also quite useful to disable use of ringbuffers when > you actually need to clean up indexes. Especially when we have a lot of dead > tuples we'll rescan indexes over and over... That's a fair point. My vote goes to "REUSE_BUFFERS", then. -- Peter Geoghegan
Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode
On Wed, Jan 11, 2023 at 10:58 AM Andres Freund wrote: > Any idea about the name? The obvious thing is to reference ring buffers in the > option name, but that's more of an implementation detail... What are the chances that anybody using this feature via a manual VACUUM command will also use INDEX_CLEANUP off? It's not really supposed to be used routinely, at all. Right? It's just for emergencies. Perhaps it can be tied to INDEX_CLEANUP=off? That makes it hard to get just the behavior you want when testing VACUUM, but maybe that doesn't matter. Realistically, most of the value here comes from changing the failsafe behavior, which doesn't require the user to know anything about VACUUM. I know that AWS has reduced the vacuum_failsafe_age default on RDS to 1.2 billion (a decision made before I joined Amazon), so it is already something AWS lean on quite a bit where available. -- Peter Geoghegan
Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode
On Wed, Jan 11, 2023 at 10:27 AM Andres Freund wrote: > Therefore I'd like to add an option to the VACUUM command to use to disable > the use of the ringbuffer. Not sure about the name yet. Sounds like a good idea. > I think we should auto-enable that mode once we're using the failsafe mode, > similar to [auto]vacuum cost delays getting disabled > (c.f. lazy_check_wraparound_failsafe()). If things are bad enough that we're > soon going to shut down, we want to be aggressive. +1 -- Peter Geoghegan
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Tue, Jan 10, 2023 at 4:39 PM Peter Geoghegan wrote: > * Run VACUUM FREEZE. We need FREEZE in order to be able to hit the > relevant visibilitymap_set() call site (the one that passes > VISIBILITYMAP_ALL_FROZEN as its flags, without also passing > VISIBILITYMAP_ALL_VISIBLE). > > Now all_visible_according_to_vm is set to true, but we don't have a > lock/pin on the same heap page just yet. > > * Acquire several non-conflicting row locks on a row on the block in > question, so that a new multi is allocated. Forgot to mention that there needs to be a HOT update mixed in with these SELECT ... FOR SHARE row lockers, too, which must abort once its XID has been added to a multi. Obviously heap_lock_tuple() won't ever unset VISIBILITYMAP_ALL_VISIBLE or PD_ALL_VISIBLE (it only ever clears VISIBILITYMAP_ALL_FROZEN) -- so we need a heap_update() to clear all of these status bits. This enables the assertion to fail because: * Pruning can get rid of the aborted successor heap-only tuple right away, so it is not going to block us from setting the page all_visible (that just leaves the original tuple to consider). * The original tuple's xmax is a Multi, so it won't automatically be ineligible for freezing because it's > OldestXmin in this scenario. * FreezeMultiXactId() processing will completely remove xmax, without caring too much about cutoffs like OldestXmin -- it only cares about whether each individual XID needs to be kept or not. (Granted, FreezeMultiXactId() will only remove xmax like this because I deliberately removed its FRM_NOOP handling, but that is a very delicate thing to rely on, especially from such a great distance. As far as I can tell, the only reason it doesn't fail on HEAD is sheer luck.) -- Peter Geoghegan
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Tue, Jan 10, 2023 at 12:08 PM Peter Geoghegan wrote: > Actually, FreezeMultiXactId() can fully remove an xmax that has some > member XIDs >= OldestXmin, provided FRM_NOOP processing isn't > possible, at least when no individual member is still running. Doesn't > have to involve transaction aborts at all. > > Let me go try to break it that way... Attached patch shows how this could break. It adds an assertion that checks that the expected PD_ALL_VISIBLE/VM_ALL_VISIBLE() conditions hold at the right point. It also comments out FreezeMultiXactId()'s FRM_NOOP handling. The FRM_NOOP case is really just an optimization, and shouldn't be needed for correctness. This is amply demonstrated by running "meson test" with the patch applied, which will pass without incident. I can get the PD_ALL_VISIBLE assertion to fail by following this procedure with the patch applied: * Run a plain VACUUM to set all the pages from a table all-visible, but not all-frozen. * Set a breakpoint that will hit after all_visible_according_to_vm is set to true, for an interesting blkno. * Run VACUUM FREEZE. We need FREEZE in order to be able to hit the relevant visibilitymap_set() call site (the one that passes VISIBILITYMAP_ALL_FROZEN as its flags, without also passing VISIBILITYMAP_ALL_VISIBLE). Now all_visible_according_to_vm is set to true, but we don't have a lock/pin on the same heap page just yet. * Acquire several non-conflicting row locks on a row on the block in question, so that a new multi is allocated. * End every session whose XID is stored in our multi (commit/abort). * Within GDB, continue from before -- observe assertion failure. Obviously this scenario doesn't demonstrate the presence of a bug -- not quite. But it does prove that we rely on FRM_NOOP to not allow the VM to become corrupt, which just doesn't make any sense, and can't have been intended. At a minimum, it strongly suggests that the current approach is very fragile. -- Peter Geoghegan 0001-Debug-freeze-map-issue.patch Description: Binary data
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Tue, Jan 10, 2023 at 12:19 PM Robert Haas wrote: > I don't understand what distinction you're making. It seems like > hair-splitting to me. We should be able to reproduce problems like > this reliably, at least with the aid of a debugger and some > breakpoints, before we go changing the code. So we can *never* change something defensively, on the basis of a suspected or theoretical hazard, either in backbranches or just on HEAD? Not under any circumstances, ever? > The risk of being wrong > is quite high because the code is subtle, and the consequences of > being wrong are potentially very bad because the code is critical to > data integrity. If the reproducer doesn't require a debugger or other > extreme contortions, then we should consider reducing it to a TAP test > that can be committed. If you agree with that, then I'm not sure what > your last email was complaining about. I was complaining about your prescribing conditions on proceeding with a commit, based on an understanding of things that you yourself acknowledged as incomplete. I cannot imagine how you read that as an unwillingness to test the issue, especially given that I agreed to work on that before you chimed in. > > I have been unable to reproduce the problem, and think it's possible > > that the issue cannot be triggered in practice. Though only through > > sheer luck. Here's why that is: > I guess I'm not very sure that this is sheer luck. That's just my characterization. Other people can make up their own minds. > For the purposes of clarifying my understanding, is this the code > you're principally worried about? > visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr, > vmbuffer, InvalidTransactionId, > VISIBILITYMAP_ALL_FROZEN); Obviously I meant this call site, since it's the only one that passes VISIBILITYMAP_ALL_FROZEN as its flags, without also passing VISIBILITYMAP_ALL_VISIBLE -- in vacuumlazy.c, and in general. The other visibilitymap_set() callsite that you quoted is from the second heap pass, where LP_DEAD items are vacuumed and become LP_UNUSED items. That isn't buggy, but it is a silly approach, in that it cares about what the visibility map says about the page being all-visible, as if it might take a dissenting view that needs to be taken into consideration (obviously we know what's going on with the page because we just scanned it ourselves, and determined that it was at least all-visible). -- Peter Geoghegan
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Tue, Jan 10, 2023 at 11:47 AM Peter Geoghegan wrote: > In summary, I think that there is currently no way that we can have > the VM (or the PD_ALL_VISIBLE flag) concurrently unset, while leaving > the page all_frozen. It can happen and leave the page all_visible, but > not all_frozen, due to these very fine details. (Assuming I haven't > missed another path to the problem with aborted Multis or something, > but looks like I haven't.) Actually, FreezeMultiXactId() can fully remove an xmax that has some member XIDs >= OldestXmin, provided FRM_NOOP processing isn't possible, at least when no individual member is still running. Doesn't have to involve transaction aborts at all. Let me go try to break it that way... -- Peter Geoghegan
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Tue, Jan 10, 2023 at 10:50 AM Robert Haas wrote: > Look, I don't want to spend time arguing about what seem to me to be > basic principles of good software engineering. When I don't put test > cases into my patches, people complain at me and tell me that I'm a > bad software engineer because I didn't include test cases. Your > argument here seems to be that you're such a good software engineer > that you don't need any test cases to know what the bug is or that > you've fixed it correctly. That's not what I said. This is a straw man. What I actually said was that there is no reason to declare up front that the only circumstances under which a fix could be committed is when a clean repro is available. I never said that a test case has little or no value, and I certainly didn't assert that we definitely don't need a test case to proceed with a commit -- since I am not in the habit of presumptuously attaching conditions to such things well in advance. > I don't particularly appreciate the implication that I either lack > relevant or expertise or don't actually take time, either. The implication was only that you didn't take the time. Clearly you have the expertise. Obviously you're very far from stupid. I have been unable to reproduce the problem, and think it's possible that the issue cannot be triggered in practice. Though only through sheer luck. Here's why that is: While pruning will remove aborted dead tuples, freezing will not remove the xmax of an aborted update unless the XID happens to be < OldestXmin. With my problem scenario, the page will be all_visible in prunestate, but not all_frozen -- so it dodges the relevant visibilitymap_set() call site. That just leaves inserts that abort, I think. An aborted insert will be totally undone by pruning, but that does still leave behind an LP_DEAD item that needs to be vacuumed in the second heap pass. This means that we can only set the page all-visible/all-frozen in the VM in the second heap pass, which also dodges the relevant visibilitymap_set() call site. In summary, I think that there is currently no way that we can have the VM (or the PD_ALL_VISIBLE flag) concurrently unset, while leaving the page all_frozen. It can happen and leave the page all_visible, but not all_frozen, due to these very fine details. (Assuming I haven't missed another path to the problem with aborted Multis or something, but looks like I haven't.) -- Peter Geoghegan
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Mon, Jan 9, 2023 at 12:58 PM Peter Geoghegan wrote: > I didn't realize that affected visibilitymap_set() calls could > generate useless set-VM WAL records until you pointed it out. That's > far more likely to happen than the race condition that I described -- > it has nothing at all to do with concurrency. That's what clinches it > for me. I didn't spend as much time on this as I'd like to so far, but I think that this concern about visibilitymap_set() actually turns out to not apply. The visibilitymap_set() call in question is gated by a "!VM_ALL_FROZEN()", which is enough to avoid the problem with writing useless VM set records. That doesn't make me doubt my original concern about races where the all-frozen bit can be set, without setting the all-visible bit, and without accounting for the fact that it changed underneath us. That scenario will have !VM_ALL_FROZEN(), so that won't save us. (And we won't test VM_ALL_VISIBLE() or PD_ALL_VISIBLE in a way that is sufficient to realize that all_visible_according_to_vm is stale. prunestate.all_visible being set doesn't reliably indicate that it's not stale, but lazy_scan_heap incorrectly believes that it really does work that way.) -- Peter Geoghegan
Re: Show various offset arrays for heap WAL records
On Mon, Jan 9, 2023 at 1:58 PM Andres Freund wrote: > A couple times when investigating data corruption issues, the last time just > yesterday in [1], I needed to see the offsets affected by PRUNE and VACUUM > records. As that's probably not just me, I think we should make that change > in-tree. I remember how useful this was when we were investigating that early bug in 14, that turned out to be in parallel VACUUM. So I'm all in favor of it. > The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM, > XLOG_HEAP2_FREEZE_PAGE. I'm bound to end up doing the same in index access methods. Might make sense for the utility routines to live somewhere more centralized, at least when code reuse is likely. Practically every index AM has WAL records that include a sorted page offset number array, just like these ones. It's a very standard thing, obviously. > I chose to include infomask[2] for the different freeze plans mainly because > it looks odd to see different plans without a visible reason. But I'm not sure > that's the right choice. I don't think that it is particularly necessary to do so in order for the output to make sense -- pg_waldump is inherently a tool for experts. What it comes down to for me is whether or not this information is sufficiently useful to display, and/or can be (or needs to be) controlled via some kind of verbosity knob. I think that it easily could be useful, and I also think that it easily could be a bit annoying. How hard would it be to invent a general mechanism to control the verbosity of what we'll show for each WAL record? -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Mon, Jan 9, 2023 at 5:22 PM Andres Freund wrote: > I've also seen the inverse, with recent versions of postgres: Autovacuum can > only ever make progress if it's an anti-wraparound vacuum, because it'll > always get cancelled otherwise. I'm worried that substantially increasing the > time until an anti-wraparound autovacuum happens will lead to more users > running into out-of-xid shutdowns. > > I don't think it's safe to just increase the time at which anti-wrap vacuums > happen to a hardcoded 1 billion. That's not what the patch does. It doubles the time that the anti-wrap no-autocancellation behaviors kick in, up to a maximum of 1 billion XIDs/MXIDs. So it goes from autovacuum_freeze_max_age to autovacuum_freeze_max_age x 2, without changing the basic fact that we initially launch autovacuums that advance relfrozenxid/relminmxid when the autovacuum_freeze_max_age threshold is first crossed. These heuristics are totally negotiable -- and likely should be thought out in more detail. It's likely that most of the benefit of the patch comes from simply trying to advance relfrozenxid without the special auto-cancellation behavior one single time. The main problem right now is that the threshold that launches most antiwraparound autovacuums is exactly the same as the threshold that activates the auto-cancellation protections. Even doing the latter very slightly later than the former could easily make things much better, while adding essentially no risk of the kind you're concerned about. > I'm also doubtful that it's ok to just make all autovacuums on relations with > an age > 1 billion anti-wraparound ones. For people that use a large > autovacuum_freeze_max_age that will be a rude awakening. Actually, users that have autovacuum_freeze_max_age set to over 1 billion will get exactly the same behavior as before (except that the instrumentation of autovacuum will be better). It'll be identical. If you set autovacuum_freeze_max_age to 2 billion, and a "standard" autovacuum is launched on a table whose relfrozenxid age is 1.5 billion, it'll just be a regular dead tuples/inserted tuples autovacuum, with the same old familiar locking characteristics as today. -- Peter Geoghegan
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Mon, Jan 9, 2023 at 12:51 PM Robert Haas wrote: > I feel that you should at least have a reproducer for these problems > posted to the thread, and ideally a regression test, before committing > things. I think it's very hard to understand what the problems are > right now. Hard to understand relative to what, exactly? We're talking about a very subtle race condition here. I'll try to come up with a reproducer, but I *utterly* reject your assertion that it's a hard requirement, sight unseen. Why should those be the parameters of the discussion? For one thing I'm quite confident that I'm right, with or without a reproducer. And my argument isn't all that hard to follow, if you have relevant expertise, and actually take the time. But even this is unlikely to matter much. Even if I somehow turn out to have been completely wrong about the race condition, it is still self-evident that the problem of uselessly WAL logging non-changes to the VM exists. That doesn't require any concurrent access at all. It's a natural consequence of calling visibilitymap_set() with VISIBILITYMAP_ALL_FROZEN-only flags. You need only look at the code for 2 minutes to see it. > I don't particularly have a problem with the idea of 0001, because if > we use InvalidTransactionId to mean that there cannot be any > conflicts, we do not need FrozenTransactionId to mean the same thing. > Picking one or the other makes sense. We've already picked one, many years ago -- InvalidTransactionId. This is a long established convention, common to all REDO routines that are capable of creating granular conflicts. I already committed 0001 over a week ago. We were calling ResolveRecoveryConflictWithSnapshot with FrozenTransactionId arguments before now, which was 100% guaranteed to be a waste of cycles. I saw no need to wait more than a few days for a +1, given that this particular issue was so completely clear cut. -- Peter Geoghegan
Re: Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans
On Mon, Jan 9, 2023 at 1:43 PM Andres Freund wrote: > ISTM that some of the page level freezing functions are misnamed. In heapam.c > the heap_xlog* routines are for replay, afaict. However > heap_xlog_freeze_plan() is used to WAL log the freeze > plan. heap_xlog_freeze_page() is used to replay that WAL record. Probably your > brain is too used to nbtree/ :). Sometimes I wonder why other people stubbornly insist on not starting every function name with an underscore. :-) > I think s/heap_xlog_freeze/heap_log_freeze/ would mostly do the trick, except > that heap_xlog_new_freeze_plan() doesn't quite fit in the scheme. > The routines then also should be moved a bit up, because right now they're > inbetween other routines doing WAL replay, adding to the confusion. I believe that I used this scheme because of the fact that the new functions were conceptually related to REDO routines, even though they run during original execution. I'm quite happy to revise the code based on your suggestions, though. > The memcpy in heap_xlog_freeze_page() seems a tad odd. I assume that the > alignment guarantees for xl_heap_freeze_plan are too weak? They're not too weak. I'm not sure why the memcpy() was used. I see your point; it makes you wonder if it must be necessary, which then seems to call into question why it's okay to access the main array as an array. I can change this detail, too. I'll try to get back to it this week. -- Peter Geoghegan
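For context, the two idioms at issue are roughly these. This is a generic sketch, not the actual heapam.c code; the struct here is only a stand-in for the real freeze plan type.

    #include <stdint.h>
    #include <string.h>

    typedef struct FreezePlan       /* stand-in for the real xl_heap_freeze_plan */
    {
        uint16_t    t_infomask;
        uint16_t    t_infomask2;
        uint16_t    ntuples;
    } FreezePlan;

    /* memcpy() is safe even if the WAL record buffer is not suitably aligned */
    static FreezePlan
    read_plan_copy(const char *buf)
    {
        FreezePlan  plan;

        memcpy(&plan, buf, sizeof(plan));
        return plan;
    }

    /* Direct array-style access requires the buffer to be properly aligned */
    static const FreezePlan *
    read_plan_direct(const char *buf)
    {
        return (const FreezePlan *) buf;
    }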
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Mon, Jan 9, 2023 at 11:57 AM Andres Freund wrote: > Afaict we'll need to backpatch this all the way? I thought that we probably wouldn't need to, at first. But I now think that we really have to. I didn't realize that affected visibilitymap_set() calls could generate useless set-VM WAL records until you pointed it out. That's far more likely to happen than the race condition that I described -- it has nothing at all to do with concurrency. That's what clinches it for me. > > One of the calls to visibilitymap_set() during VACUUM's initial heap > > pass could unset a page's all-visible bit during the process of setting > > the same page's all-frozen bit. > > As just mentioned upthread, this just seems wrong. I don't know why this sentence ever made sense to me. Anyway, it's not important now. > Do you have a reproducer for this? No, but I'm quite certain that the race can happen. If it's important to have a reproducer then I can probably come up with one. I could likely figure out a way to write an isolation test that reliably triggers the issue. It would have to work by playing games with cleanup lock/buffer pin waits, since that's the only thing that the test can hook into to make things happen in just the right/wrong order. > > elog(WARNING, "page is not marked all-visible but > > visibility map bit is set in relation \"%s\" page %u", > >vacrel->relname, blkno); > > Hm. The message gets a bit less accurate with the change. Perhaps OK? OTOH, it > might be useful to know what bit was wrong when debugging problems. Theoretically it might change again, if we call visibilitymap_get_status() again. Maybe I should just broaden the error message a bit instead? > I still think such changes are inappropriate for a bugfix, particularly one > that needs to be backpatched. I'll remove the changes that are inessential in the next revision. I wouldn't have done it if I'd fully understood the seriousness of the issue from the start. If you're really concerned about diff size then I should point out that the changes to lazy_vacuum_heap_rel() aren't strictly necessary, and probably shouldn't be backpatched. I deemed that in scope because it's part of the same overall problem of updating the visibility map based on potentially stale information. It makes zero sense to check with the visibility map before updating it when we already know that the page is all-visible. I mean, are we trying to avoid the work of needlessly updating the visibility map in cases where its state was corrupt, but then became uncorrupt (relative to the heap page) by mistake? -- Peter Geoghegan
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Mon, Jan 9, 2023 at 11:44 AM Andres Freund wrote: > I think that's just an imprecise formulation though - the problem is that we > can call visibilitymap_set() with just VISIBILITYMAP_ALL_FROZEN, even though > VISIBILITYMAP_ALL_VISIBLE was concurrently unset. That's correct. You're right that my description of the problem from the commit message was confusing. But we're on the same page about the problem now. > ISTM that we ought to update all_visible_according_to_vm from > PageIsAllVisible() once we've locked the page. Even if we avoid this specific > case, it seems a recipe for future bugs to have a potentially outdated > variable around. I basically agree, but some of the details are tricky. As I mentioned already, my work on visibility map snapshots just gets rid of all_visible_according_to_vm, which is my preferred long term approach. We will very likely need to keep all_visible_according_to_vm as a cache for performance reasons for as long as we have SKIP_PAGES_THRESHOLD. Can we just update all_visible_according_to_vm using PageIsAllVisible(), without making all_visible_according_to_vm significantly less useful as a cache? Maybe. Not sure offhand. -- Peter Geoghegan
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Sun, Jan 8, 2023 at 6:43 PM Peter Geoghegan wrote: > On further reflection even v2 won't repair the page-level > PD_ALL_VISIBLE flag in passing in this scenario. ISTM that on HEAD we > might actually leave the all-frozen bit set in the VM, while both the > all-visible bit and the page-level PD_ALL_VISIBLE bit remain unset. > Again, all due to the approach we take with > all_visible_according_to_vm, which can go stale independently of both > the VM bit being unset and the PD_ALL_VISIBLE bit being unset (in my > example problem scenario). Attached is v3, which explicitly checks the need to set the PD_ALL_VISIBLE flag at the relevant visibilitymap_set() call site. It also has improved comments. -- Peter Geoghegan v3-0001-Don-t-accidentally-unset-all-visible-bit-in-VM.patch Description: Binary data
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Sun, Jan 8, 2023 at 4:27 PM Peter Geoghegan wrote: > We're vulnerable to allowing "all-frozen but not all-visible" > inconsistencies because of two factors: this business with not passing > VISIBILITYMAP_ALL_VISIBLE along with VISIBILITYMAP_ALL_FROZEN to > visibilitymap_set(), *and* the use of all_visible_according_to_vm to > set the VM (a local variable that can go stale). We sort of assume > that all_visible_according_to_vm cannot go stale here without our > detecting it. That's almost always the case, but it's not quite > guaranteed. On further reflection even v2 won't repair the page-level PD_ALL_VISIBLE flag in passing in this scenario. ISTM that on HEAD we might actually leave the all-frozen bit set in the VM, while both the all-visible bit and the page-level PD_ALL_VISIBLE bit remain unset. Again, all due to the approach we take with all_visible_according_to_vm, which can go stale independently of both the VM bit being unset and the PD_ALL_VISIBLE bit being unset (in my example problem scenario). FWIW I don't have this remaining problem in my VACUUM freezing/scanning strategies patchset. It just gets rid of all_visible_according_to_vm altogether, which makes things a lot simpler at the point that we set VM bits at the end of lazy_scan_heap -- there is nothing left that can become stale. Quite a lot of the code is just removed; there is exactly one call to visibilitymap_set() at the end of lazy_scan_heap with the patchset, which does everything we need. The patchset also has logic for setting PD_ALL_VISIBLE when it needs to be set, which isn't (and shouldn't be) conditioned on whether we're doing an "all visible -> all frozen" transition or a "neither -> all visible" transition. What it actually needs to be conditioned on is whether it's unset now, and so needs to be set in passing, as part of setting one or both VM bits -- simple as that. -- Peter Geoghegan
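A self-contained sketch of that rule follows. This is hypothetical code, not the patchset itself; the struct is a toy representation used only to show where the condition belongs.

    #include <stdbool.h>

    /* Toy representation of the page-level flag and the two VM bits */
    typedef struct PageVMState
    {
        bool        pd_all_visible;     /* page-level PD_ALL_VISIBLE */
        bool        vm_all_visible;
        bool        vm_all_frozen;
        bool        page_dirtied;
    } PageVMState;

    static void
    set_vm_bits(PageVMState *p, bool set_all_visible, bool set_all_frozen)
    {
        /*
         * Set PD_ALL_VISIBLE in passing iff it is unset right now -- never
         * condition this on which VM transition ("neither -> all-visible"
         * vs. "all-visible -> all-frozen") brought us here.
         */
        if (!p->pd_all_visible)
        {
            p->pd_all_visible = true;
            p->page_dirtied = true;
        }
        if (set_all_visible)
            p->vm_all_visible = true;
        if (set_all_frozen)
            p->vm_all_frozen = true;
    }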
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Thu, Dec 29, 2022 at 7:01 PM Peter Geoghegan wrote: > Attached is v2, which is just to fix bitrot. Attached is v3. We no longer apply vacuum_failsafe_age when determining the cutoff for antiwraparound autovacuuming -- the new approach is a bit simpler. This is a fairly small change overall. Now any "table age driven" autovacuum will also be antiwraparound when its relfrozenxid/relminmxid attains an age that's either double the relevant setting (either autovacuum_freeze_max_age or effective_multixact_freeze_max_age), or 1 billion XIDs/MXIDs -- whichever is less. That makes it completely impossible to disable antiwraparound protections (the special antiwrap autocancellation behavior) for table-age-driven autovacuums once table age exceeds 1 billion XIDs/MXIDs. It's still possible to increase autovacuum_freeze_max_age to well over a billion, of course. It just won't be possible to do that while also avoiding the no-auto-cancellation behavior for those autovacuums that are triggered due to table age crossing the autovacuum_freeze_max_age/effective_multixact_freeze_max_age threshold. -- Peter Geoghegan From 6d78e62226608683d9882b4507d37442164267a1 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Fri, 25 Nov 2022 11:23:20 -0800 Subject: [PATCH v3] Add "table age" trigger concept to autovacuum. Teach autovacuum.c to launch "table age" autovacuums at the same point that it previously triggered antiwraparound autovacuums. Antiwraparound autovacuums are retained, but are only used as a true option of last resort, when regular autovacuum has presumably tried and failed to advance relfrozenxid (likely because the auto-cancel behavior kept cancelling regular autovacuums triggered based on table age). The special auto-cancellation behavior applied by antiwraparound autovacuums is known to cause problems in the field, so it makes sense to avoid it, at least until the point where it starts to look like a proportionate response. Besides, the risk of the system eventually triggering xidStopLimit because of cancellations is a lot lower than it was back when the current auto-cancellation behavior was added by commit acac68b2. For example, there was no visibility map, so restarting antiwraparound autovacuum meant that the next autovacuum would get very little benefit from the work performed by earlier cancelled autovacuums. Also add new instrumentation that lists a triggering condition in the server log whenever an autovacuum is logged. This reports "table age" as the triggering criteria when regular (not antiwraparound) autovacuum runs because we need to advance the table's age. In other cases it will report an autovacuum was launched due to the tuple insert thresholds or the dead tuple thresholds. Note that pg_stat_activity doesn't have any special instrumentation for regular autovacuums that happen to have been triggered based on table age (since it really is just another form of autovacuum, and follows exactly the same rules in vacuumlazy.c and in proc.c). 
Author: Peter Geoghegan Reviewed-By: Jeff Davis Discussion: https://postgr.es/m/CAH2-Wz=S-R_2rO49Hm94Nuvhu9_twRGbTm6uwDRmRu-Sqn_t3w@mail.gmail.com --- src/include/commands/vacuum.h | 18 ++- src/include/storage/proc.h | 2 +- src/backend/access/heap/vacuumlazy.c| 14 ++ src/backend/access/heap/visibilitymap.c | 5 +- src/backend/access/transam/multixact.c | 4 +- src/backend/commands/vacuum.c | 18 ++- src/backend/postmaster/autovacuum.c | 207 +--- src/backend/storage/lmgr/proc.c | 4 +- 8 files changed, 197 insertions(+), 75 deletions(-) diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h index 689dbb770..b70f69fd9 100644 --- a/src/include/commands/vacuum.h +++ b/src/include/commands/vacuum.h @@ -191,6 +191,21 @@ typedef struct VacAttrStats #define VACOPT_SKIP_DATABASE_STATS 0x100 /* skip vac_update_datfrozenxid() */ #define VACOPT_ONLY_DATABASE_STATS 0x200 /* only vac_update_datfrozenxid() */ +/* + * Values used by autovacuum.c to tell vacuumlazy.c about the specific + * threshold type that triggered an autovacuum worker. + * + * AUTOVACUUM_NONE is used when VACUUM isn't running in an autovacuum worker. + */ +typedef enum AutoVacType +{ + AUTOVACUUM_NONE = 0, + AUTOVACUUM_TABLE_XID_AGE, + AUTOVACUUM_TABLE_MXID_AGE, + AUTOVACUUM_DEAD_TUPLES, + AUTOVACUUM_INSERTED_TUPLES, +} AutoVacType; + /* * Values used by index_cleanup and truncate params. * @@ -222,7 +237,8 @@ typedef struct VacuumParams * use default */ int multixact_freeze_table_age; /* multixact age at which to scan * whole table */ - bool is_wraparound; /* force a for-wraparound vacuum */ + bool is_wraparound; /* antiwraparound autovacuum? */ + AutoVacType trigger; /* Autovacuum trigger condition, if any */ int log_min_duration; /* minimum execut
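The cutoff rule described in the message above ("double the relevant setting, capped at 1 billion, whichever is less") reduces to a couple of lines. The sketch below uses invented names and only illustrates the arithmetic, not the patch's actual autovacuum.c changes:

    #include "postgres.h"

    #define MAX_TABLE_AGE_DEFERRAL  1000000000  /* 1 billion XIDs/MXIDs */

    /*
     * Sketch only: table age at which a "table age" autovacuum also becomes
     * an antiwraparound autovacuum (with the no-auto-cancel behavior).
     */
    static uint32
    antiwraparound_age_cutoff(int freeze_max_age)
    {
        uint32      doubled = (uint32) freeze_max_age * 2;

        return Min(doubled, MAX_TABLE_AGE_DEFERRAL);
    }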
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Sun, Jan 8, 2023 at 3:53 PM Andres Freund wrote: > How? See the commit message for the scenario I have in mind, which involves a concurrent HOT update that aborts. We're vulnerable to allowing "all-frozen but not all-visible" inconsistencies because of two factors: this business with not passing VISIBILITYMAP_ALL_VISIBLE along with VISIBILITYMAP_ALL_FROZEN to visibilitymap_set(), *and* the use of all_visible_according_to_vm to set the VM (a local variable that can go stale). We sort of assume that all_visible_according_to_vm cannot go stale here without our detecting it. That's almost always the case, but it's not quite guaranteed. > visibilitymap_set() just adds flags, it doesn't remove any already > existing bits: I know. The concrete scenario I have in mind is very subtle (if the problem was this obvious I'm sure somebody would have noticed it by now, since we do hit this visibilitymap_set() call site reasonably often). A concurrent HOT update will certainly clear all the bits for us, which is enough. > You have plenty of changes like this, which are afaict entirely unrelated to > the issue the commit is fixing, in here. It just makes it hard to review the > patch. I didn't think that it was that big of a deal to tweak the style of one or two details in and around lazy_vacuum_heap_rel() in passing, for consistency with lazy_scan_heap(), since the patch already needs to do some of that. I do take your point, though. -- Peter Geoghegan
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Mon, Jan 2, 2023 at 10:31 AM Peter Geoghegan wrote: > Would be helpful if I could get a +1 on > v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patch, which is > somewhat more substantial than the others. There has been no response on this thread for over a full week at this point. I'm CC'ing Robert now, since the bug is from his commit a892234f83. Attached revision of the "don't unset all-visible bit while unsetting all-frozen bit" patch adds some assertions that verify that visibility_cutoff_xid is InvalidTransactionId as expected when we go to set any page all-frozen in the VM. It also broadens an existing nearby test for corruption, which gives us some chance of detecting and repairing corruption of this sort that might have slipped in in the field. My current plan is to commit something like this in another week or so, barring any objections. -- Peter Geoghegan v2-0001-Don-t-accidentally-unset-all-visible-bit-in-VM.patch Description: Binary data
Re: pgsql: Delay commit status checks until freezing executes.
On Sat, Jan 7, 2023 at 1:47 PM Andres Freund wrote: > > What do you think of the attached patch, which revises comments over > > TransactionIdDidAbort, and adds something about it to the top of > > heapam_visibility.c? > > Mostly looks good to me. I think it'd be good to add a reference to the > heapam_visibility.c comment to the top of transam.c (or move it). Makes sense. > I think it's currently very likely to be true, but I'd weaken the "never" a > bit nonetheless. I think it'd also be good to point to what to do instead. How > about: > Note that TransactionIdDidAbort() returns true only for explicitly aborted > transactions, as transactions implicitly aborted due to a crash will > commonly still appear to be in-progress in the clog. Most of the time > TransactionIdDidCommit(), with a preceding TransactionIdIsInProgress() > check, should be used instead of TransactionIdDidAbort(). That does seem better. Do we need to do anything about this in the "pg_xact and pg_subtrans" section of the transam README? Also, does amcheck's get_xid_status() need a reference to these rules? FWIW, I found an existing comment about this rule in the call to TransactionIdAbortTree() from RecordTransactionAbort() -- which took me quite a while to find. So you might have been remembering that comment before. -- Peter Geoghegan
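For reference, the usage pattern recommended in the suggested comment looks roughly like the sketch below (an illustration, not code from the attached patch; the function name is invented):

    #include "postgres.h"
    #include "access/transam.h"
    #include "storage/procarray.h"

    /*
     * Sketch only: the "TransactionIdIsInProgress() first, then
     * TransactionIdDidCommit()" pattern.  Anything that is neither in
     * progress nor committed is treated as aborted-or-crashed, since
     * TransactionIdDidAbort() can return false for transactions that were
     * implicitly aborted by a crash.
     */
    static bool
    xact_aborted_or_crashed(TransactionId xid)
    {
        if (TransactionIdIsInProgress(xid))
            return false;           /* still running */
        if (TransactionIdDidCommit(xid))
            return false;           /* committed */

        /* explicitly aborted, or crashed while still "in progress" in clog */
        return true;
    }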
Re: New strategies for freezing, advancing relfrozenxid early
> 32-bit transaction IDs and > that we update an epoch when this XID wraps around does not > automatically make the user aware of the issues that surface around > XID wraparound. Retaining the explainer for XID wraparound in the docs > seems like a decent idea - it may be moved, but please don't delete > it. We do need to stop telling users to enter single user mode. It's quite simply obsolete, bad advice, and has been since Postgres 14. It's the worst thing that you could do, in fact. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
> > pg_class.relfrozenxid > > + now take place more proactively, in every > > + VACUUM operation. > > This claim that it happens more proactively in "every" VACUUM > operation is false, so I think the removal of "every" would be better. Good catch. Will fix. > > [PATCH v14 3/3] Finish removing aggressive mode VACUUM. > I don't quite enjoy the refactoring+rewriting of the docs section; > it's difficult to determine what changed when so many things changed > line lengths and were moved around. Tomorrow I'll take a closer look, > but a separation of changes vs moved would be useful for review. I think that I should break out the doc changes some more. The docs are likely the least worked out thing at this point. > > +cutoffs->MinXid = nextXID - (freeze_table_age * 0.95); > > [...] > > +cutoffs->MinMulti = nextMXID - (multixact_freeze_table_age * 0.95); > > Why are these values adjusted down (up?) by 5%? If I configure this > GUC, I'd expect this to be used effectively verbatim; not adjusted by > an arbitrary factor. It is kind of arbitrary, but not in the way that you suggest. This isn't documented in the user docs, and shouldn't really need to be. It should have very little if any noticeable impact on our final relfrozenxid/relminmxid in practice. If it does have any noticeable impact, I strongly suspect it'll be a useful, positive impact. MinXid/MinMulti control the behavior around whether or not lazy_scan_noprune is willing to wait the hard way for a cleanup lock, no matter how long it takes. We do still need something like that, but it can be far looser than it is right now. The problem with aggressive mode is that it absolutely insists on a certain outcome, no matter the cost, and regardless of whether or not a slightly inferior outcome is acceptable. It's extremely rigid. Rigid things tend to break. Loose, springy things much less so. I think that it's an extremely bad idea to wait indefinitely for a cleanup lock. Sure, it'll work out the vast majority of the time -- it's *very* likely to work. But when it doesn't work right away, there is no telling how long the wait will be -- all bets are off. Could be a day, a week, a month -- who knows? The application itself is the crucial factor here, and in general the application can do whatever it wants to do -- that is the reality. So we should be willing to kick the can down the road in almost all cases -- that is actually the responsible thing to do under the circumstances. We need to get on with freezing every other page in the table! There just cannot be very many pages that can't be cleanup locked at any given time, so waiting indefinitely is a very drastic measure in response to a problem that is quite likely to go away on its own. A problem that waiting doesn't really solve anyway. Maybe the only thing that will work is waiting for a very long time, but we have nothing to lose (and everything to gain) by waiting to wait. -- Peter Geoghegan
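As a worked example of the 0.95 factor the review question is about (purely illustrative; the names below don't come from the patch), with vacuum_freeze_table_age set to 150 million the MinXid cutoff is derived from an age of roughly 142.5 million rather than the configured 150 million:

    #include <stdio.h>

    int
    main(void)
    {
        unsigned int freeze_table_age = 150000000;  /* vacuum_freeze_table_age */
        unsigned int adjusted = (unsigned int) (freeze_table_age * 0.95);

        /* cutoffs->MinXid would then be computed as nextXID - adjusted */
        printf("MinXid uses an age of %u, not the configured %u\n",
               adjusted, freeze_table_age);
        return 0;
    }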
Re: pgsql: Delay commit status checks until freezing executes.
On Tue, Jan 3, 2023 at 10:52 PM Peter Geoghegan wrote: > > And it'd make sense to have > > the explanation of why TransactionIdDidAbort() isn't the same as > > !TransactionIdDidCommit(), even for !TransactionIdIsInProgress() xacts, near > > the explanation for doing TransactionIdIsInProgress() first. > > I think that we should definitely have a comment directly over > TransactionIdDidAbort(). Though I wouldn't mind reorganizing these > other comments, or making the comment over TransactionIdDidAbort() > mostly just point to the other comments. What do you think of the attached patch, which revises comments over TransactionIdDidAbort, and adds something about it to the top of heapam_visbility.c? -- Peter Geoghegan v1-0001-Document-TransactionIdDidAbort-hard-crash-behavio.patch Description: Binary data
Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE
On Thu, Jan 5, 2023 at 3:27 PM Peter Geoghegan wrote: > This particular error message is from the hardening added to Postgres > 15 in commit e7428a99. So it's not surprising that Michail didn't see > the same error on 14. Reproduced this on HEAD locally (no docker), without any difficulty. FWIW, I find that the "Assert(ItemIdIsNormal(lp));" at the top of heap_lock_tuple() is the first thing that fails on my assert-enabled build. -- Peter Geoghegan
Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE
On Thu, Jan 5, 2023 at 1:49 PM Matthias van de Meent wrote: > client 29 script 0 aborted in command 2 query 0: ERROR: heap tid from index > tuple (111,1) points past end of heap page line pointer array at offset 262 > of block 1 in index "something_is_wrong_here_pkey" This particular error message is from the hardening added to Postgres 15 in commit e7428a99. So it's not surprising that Michail didn't see the same error on 14. -- Peter Geoghegan
Re: pgsql: Delay commit status checks until freezing executes.
On Wed, Jan 4, 2023 at 10:59 PM Amit Kapila wrote: > You are an extremely valuable person for this project and I wish that > you continue working with the same enthusiasm. Thank you, Amit. Knowing that my efforts are appreciated by colleagues does make it easier to persevere. -- Peter Geoghegan
Re: pgsql: Delay commit status checks until freezing executes.
On Wed, Jan 4, 2023 at 10:41 AM Andres Freund wrote: > > It's currently possible for VACUUM to set the all-frozen bit while > > unsetting the all-visible bit, due to a race condition [1]. This is > > your long standing bug. So apparently nobody is qualified to commit > > patches in this area. > > That's a non-sequitur. Bugs are a fact of programming. I agree. > > About a year ago, there was a massive argument over some earlier work > > in the same general area, by me. Being the subject of a pile-on on > > this mailing list is something that I find deeply upsetting and > > demoralizing. I just cannot take much more of it. At the same time, > > I've made quite an investment in the pending patches, and think that > > it's something that I have to see through. > > I'm, genuinely!, sorry that you feel piled on. That wasn't, and isn't, my > goal. Apology accepted. I am making a simple, practical point here, too: I'm much too selfish a person to continue to put myself in this position. I have nothing to prove, and have little to gain over what I'd get out of working in various other areas. I wasn't hired by my current employer to work on VACUUM in particular. In the recent past I have found ways to be very productive in other areas, without any apparent risk of protracted, stressful fights -- which is something that I plan on getting back to soon. I just don't have the stomach for this. It just isn't worth it to me. > I think the area of code desperately needs work. I complained because I > didn't like the process and was afraid of the consequences and the perceived > need on my part to do post-commit reviews. The work that I did in 15 (in particular commit 0b018fab, the "oldest extant XID" commit) really isn't very useful without the other patches in place -- it was always supposed to be one piece of a larger whole. It enables the freezing stuff because VACUUM now "gets credit" for proactive freezing in a way that it didn't before. The motivating examples wiki page shows examples of this [1]. Once the later patches are in place, the 15/16 work on VACUUM will be complete, and I can walk away from working on VACUUM having delivered a very useful improvement to performance stability -- a good outcome for everybody. If you and Robert can find a way to accommodate that, then in all likelihood we won't need to have any more heated and protracted arguments like the one from early in 2022. I will be quite happy to get back to working on B-Tree, likely the skip scan work. [1] https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples -- Peter Geoghegan
Re: pgsql: Delay commit status checks until freezing executes.
On Wed, Jan 4, 2023 at 7:03 AM Robert Haas wrote: > But that having been said, I'm kind of astonished that you didn't know > about this already. The freezing behavior is in general extremely hard > to get right, and I guess I feel if you don't understand how the > underlying functions work, including things like performance > considerations I was the one that reported the issue with CLOG lookups in the first place. > and which functions return fully reliable results, I do > not think you should be committing your own patches in this area. My mistake here had nothing to do with my own goals. I was trying to be diligent by hardening an existing check in passing, and it backfired. > There is probably a lot of potential benefit in improving the way this > stuff works, but there is also a heck of a lot of danger of creating > subtle data corrupting bugs that could easily take years to find. It's currently possible for VACUUM to set the all-frozen bit while unsetting the all-visible bit, due to a race condition [1]. This is your long standing bug. So apparently nobody is qualified to commit patches in this area. About a year ago, there was a massive argument over some earlier work in the same general area, by me. Being the subject of a pile-on on this mailing list is something that I find deeply upsetting and demoralizing. I just cannot take much more of it. At the same time, I've made quite an investment in the pending patches, and think that it's something that I have to see through. If I am allowed to finish what I've started, then I will stop all new work on VACUUM. I'll go back to working on B-Tree indexing. Nobody is asking me to focus on VACUUM, and there are plenty of other things that I could be doing that don't seem to lead to these situations. [1] https://postgr.es/m/cah2-wznungszf8v6osgjac5aysb3cz6hw6mlm30x0d65cms...@mail.gmail.com -- Peter Geoghegan
Re: pgsql: Delay commit status checks until freezing executes.
On Tue, Jan 3, 2023 at 4:54 PM Andres Freund wrote: > There's some changes from TransactionIdDidCommit() to !TransactionIdDidAbort() > that don't look right to me. If the server crashed while xid X was > in-progress, TransactionIdDidCommit(X) will return false, but so will > TransactionIdDidAbort(X). So besides moving when the check happens you also > changed what's being checked in a more substantial way. I did point this out on the thread. I made this change with the intention of making the check more robust. Apparently this was misguided. Where is the behavior that you describe documented, if anywhere? > Also, why did you change when MarkBufferDirty() happens? Previously it > happened before we modify the page contents, now after. That's probably fine > (it's the order suggested in transam/README), but seems like a mighty subtle > thing to change at the same time as something unrelated, particularly without > even mentioning it? I changed it because the new order is idiomatic. I didn't think that this was particularly worth mentioning, or even subtle. The logic from heap_execute_freeze_tuple() only performs simple in-place modifications. -- Peter Geoghegan
Re: pgsql: Delay commit status checks until freezing executes.
On Tue, Jan 3, 2023 at 10:47 PM Andres Freund wrote: > IMO the comment at the top mentioning why the TransactionIdIsInProgress() > calls are crucial / need to be done first would be considerably more likely to > be found in transam.c than heapam_visibility.c. Yeah, but they're duplicated anyway. For example in the transam README. Plus we have references to these same comments from other files, such as heapam.c, which mentions heapam_visibility.c by name as where you go to learn more about this issue. > And it'd make sense to have > the explanation of why TransactionIdDidAbort() isn't the same as > !TransactionIdDidCommit(), even for !TransactionIdIsInProgress() xacts, near > the explanation for doing TransactionIdIsInProgress() first. I think that we should definitely have a comment directly over TransactionIdDidAbort(). Though I wouldn't mind reorganizing these other comments, or making the comment over TransactionIdDidAbort() mostly just point to the other comments. -- Peter Geoghegan
Re: pgsql: Delay commit status checks until freezing executes.
On Tue, Jan 3, 2023 at 10:33 PM Andres Freund wrote: > I'd say a comment above TransactionIdDidAbort() referencing an overview > comment at the top of the file? I think it might be worth moving the comment > from heapam_visibility.c to transam.c? What comments in heapam_visibility.c should we be referencing here? I don't see anything about it there. I have long been aware that those routines deduce that a transaction must have aborted, but surely that's not nearly enough. That's merely not being broken, without any explanation given as to why. > > I find this astonishing. Why isn't there a prominent comment that > > advertises that TransactionIdDidAbort() just doesn't work reliably? > > Arguably it works reliably, just more narrowly than one might think. Treating > "crashed transactions" as a distinct state from explicit aborts. That's quite a stretch. There are numerous comments that pretty much imply that TransactionIdDidCommit/TransactionIdDidAbort are very similar, for example any discussion of how you need to call TransactionIdIsInProgress first before calling either of the other two. -- Peter Geoghegan
Re: pgsql: Delay commit status checks until freezing executes.
On Tue, Jan 3, 2023 at 8:29 PM Peter Geoghegan wrote: > I find this astonishing. Why isn't there a prominent comment that > advertises that TransactionIdDidAbort() just doesn't work reliably? I pushed a fix for this now. We should add a comment about this issue to TransactionIdDidAbort() header comments, but I didn't do that part yet. Thanks for the report. -- Peter Geoghegan
Re: pgsql: Delay commit status checks until freezing executes.
On Tue, Jan 3, 2023 at 7:56 PM Andres Freund wrote: > I still think these moderation rules are deeply unhelpful... Yes, it is rather annoying. > I don't know - I think there's an explicit comment somewhere, but I couldn't > find it immediately. There's a bunch of indirect references to it in > heapam_visibility.c, with comments like "it must have aborted or > crashed". I think that that's a far cry from any kind of documentation... > The reason for the behaviour is that we do not have any mechanism for going > through the clog and aborting all in-progress-during-crash transactions. So > we'll end up with the clog for all in-progress-during-crash transactions being > zero / TRANSACTION_STATUS_IN_PROGRESS. I find this astonishing. Why isn't there a prominent comment that advertises that TransactionIdDidAbort() just doesn't work reliably? > IMO it's almost always wrong to use TransactionIdDidAbort(). I didn't think that there was any general guarantee about TransactionIdDidAbort() working after a crash. But this is an on-disk XID, taken from some tuple's xmax, which must have a value < OldestXmin. > I think changes in how WAL logging is done are just about always worth > mentioning in a commit message... I agree with that as a general statement, but I never imagined that this was a case that such a statement could apply to. I will try to remember to put something about similar changes in any future commit messages, in the unlikely event that I ever end up moving MarkBufferDirty() around in some existing critical section in the future. -- Peter Geoghegan
Re: pgsql: Delay commit status checks until freezing executes.
(Pruning -committers from the list, since cross-posting to -hackers resulted in this being held up for moderation.) On Tue, Jan 3, 2023 at 5:15 PM Peter Geoghegan wrote: > > On Tue, Jan 3, 2023 at 4:54 PM Andres Freund wrote: > > There's some changes from TransactionIdDidCommit() to > > !TransactionIdDidAbort() > > that don't look right to me. If the server crashed while xid X was > > in-progress, TransactionIdDidCommit(X) will return false, but so will > > TransactionIdDidAbort(X). So besides moving when the check happens you also > > changed what's being checked in a more substantial way. > > I did point this out on the thread. I made this change with the > intention of making the check more robust. Apparently this was > misguided. > > Where is the behavior that you describe documented, if anywhere? When the server crashes, and we have a problem case, what does TransactionLogFetch()/TransactionIdGetStatus() (which are the guts of both TransactionIdDidCommit and TransactionIdDidAbort) report about the XID? > > Also, why did you change when MarkBufferDirty() happens? Previously it > > happened before we modify the page contents, now after. That's probably fine > > (it's the order suggested in transam/README), but seems like a mighty subtle > > thing to change at the same time as something unrelated, particularly > > without > > even mentioning it? > > I changed it because the new order is idiomatic. I didn't think that > this was particularly worth mentioning, or even subtle. The logic from > heap_execute_freeze_tuple() only performs simple in-place > modifications. I'm including this here because presumably -hackers will have missed it due to the moderation hold-up issue. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Sat, Dec 31, 2022 at 12:45 PM Peter Geoghegan wrote: > On Sat, Dec 31, 2022 at 11:46 AM Jeff Davis wrote: > > "We have no freeze plans to execute, so there's no cost to following > > the freeze path. This is important in the case where the page is > > entirely frozen already, so that the page will be marked as such in the > > VM." > > I'm happy to use your wording instead -- I'll come up with a patch for that. What do you think of the wording adjustments in the attached patch? It's based on your suggested wording. -- Peter Geoghegan 0001-Tweak-page-level-freezing-comments.patch Description: Binary data
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Sat, Dec 31, 2022 at 4:53 PM Peter Geoghegan wrote: > The first patch makes sure that the snapshotConflictHorizon cutoff > (XID cutoff for recovery conflicts) is never a special XID, unless > that XID is InvalidTransactionId, which is interpreted as a > snapshotConflictHorizon value that will never need a recovery conflict > (per the general convention for snapshotConflictHorizon values > explained above ResolveRecoveryConflictWithSnapshot). Pushed this just now. Attached is another very simple refactoring patch for vacuumlazy.c. It makes vacuumlazy.c save the result of get_database_name() in vacrel, which matches what we already do with things like get_namespace_name(). Would be helpful if I could get a +1 on v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patch, which is somewhat more substantial than the others. -- Peter Geoghegan 0001-Save-get_database_name-in-vacrel.patch Description: Binary data
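The vacuumlazy.c refactor mentioned above amounts to a one-liner along these lines (the "dbname" field name is illustrative; this is a sketch of the idea rather than the attached patch):

    #include "postgres.h"
    #include "commands/dbcommands.h"
    #include "miscadmin.h"

    /* ... while filling in vacrel, near the existing get_namespace_name() call ... */
    vacrel->dbname = get_database_name(MyDatabaseId);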
Fixing a couple of buglets in how VACUUM sets visibility map bits
Attached are two patches, which together fix two historical buglets around VACUUM's approach to setting bits in the visibility map. (Whether this is actually refactoring work or hardening work is debatable, I suppose.) The first patch makes sure that the snapshotConflictHorizon cutoff (XID cutoff for recovery conflicts) is never a special XID, unless that XID is InvalidTransactionId, which is interpreted as a snapshotConflictHorizon value that will never need a recovery conflict (per the general convention for snapshotConflictHorizon values explained above ResolveRecoveryConflictWithSnapshot). This patch establishes a hard rule that snapshotConflictHorizon values can never be a special XID value, unless it's InvalidTransactionId. An assertion enforces the rule for us in REDO routines (at the point that they call ResolveRecoveryConflictWithSnapshot with the WAL record's snapshotConflictHorizon XID value). The second patch makes sure that VACUUM can never set a page all-frozen in the visibility map without also setting the same page all-visible in the same call to visibilitymap_set() -- regardless of what we think we know about the current state of the all-visible bit in the VM. The second patch adjusts one of the visibilitymap_set() calls in vacuumlazy.c that would previously sometimes set a page's all-frozen bit without also setting its all-visible bit. This could allow VACUUM to leave a page all-frozen but not all-visible in the visibility map (since the value of all_visible_according_to_vm can go stale). I think that this should be treated as a basic visibility map invariant: an all-frozen page must also be all-visible, by definition, so why should it be physically possible for the VM to give a contradictory picture of the all_visible/all_frozen status of any one page? Assertions are added that more or less make this rule into an invariant. amcheck/pg_visibility coverage might make sense too, but I haven't done that here. The second patch also adjusts a later visibilitymap_set() call site (the one used just after heap vacuuming runs in the final heap pass) in roughly the same way. It no longer reads from the visibility map to see what bits need to be changed. The existing approach here seems rather odd. The whole point of calling lazy_vacuum_heap_page() is to set LP_DEAD items referenced by VACUUM's dead_items array to LP_UNUSED -- there has to have been at least one LP_DEAD item on the page for us to end up here (which a Postgres 14 era assertion verifies for us). So we already know perfectly well that the visibility map shouldn't indicate that the page is all-visible yet -- why bother asking the VM? And besides, any call to visibilitymap_set() will only modify the VM when it directly observes that the bits have changed -- so why even attempt to duplicate that on the caller side? It seems to me that the visibilitymap_get_status() call inside lazy_vacuum_heap_page() is actually abused to work as a substitute for visibilitymap_pin(). Why not use top-level visibilitymap_pin() calls instead, just like we do in the first heap pass? That's how it's done in the second patch; it adds a visibilitymap_pin() call in lazy_vacuum_heap_rel()'s blkno-wise loop. That gives us parity between the first and second heap pass, which seems like a clear maintainability win -- everybody can pass the already-pinned/already-setup vmbuffer by value.
-- Peter Geoghegan v1-0001-Avoid-special-XIDs-in-snapshotConflictHorizon-val.patch Description: Binary data v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patch Description: Binary data
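To state the proposed invariant in code form, a minimal sketch (the wrapper name is invented, and passing InvalidTransactionId for the conflict cutoff is a simplification; this only illustrates the rule the second patch enforces):

    #include "postgres.h"
    #include "access/transam.h"
    #include "access/visibilitymap.h"
    #include "access/xlogdefs.h"

    /*
     * Sketch only: whenever the all-frozen bit is set, the all-visible bit
     * must be passed in the same visibilitymap_set() call, so the VM can
     * never claim "all-frozen but not all-visible" for a page.
     */
    static void
    set_vm_bits(Relation rel, BlockNumber blkno, Buffer heapbuf,
                Buffer vmbuffer, uint8 flags)
    {
        Assert((flags & VISIBILITYMAP_ALL_FROZEN) == 0 ||
               (flags & VISIBILITYMAP_ALL_VISIBLE) != 0);

        visibilitymap_set(rel, blkno, heapbuf, InvalidXLogRecPtr, vmbuffer,
                          InvalidTransactionId, flags);
    }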
Re: New strategies for freezing, advancing relfrozenxid early
On Sat, Dec 31, 2022 at 11:46 AM Jeff Davis wrote: > On Fri, 2022-12-30 at 16:58 -0800, Peter Geoghegan wrote: > > Following the path of freezing a page is *always* valid, by > > definition. Including when there are zero freeze plans to execute, or > > even zero tuples to examine in the first place -- we'll at least be > > able to perform nominal freezing, no matter what. > > This is a much clearer description, in my opinion. Do you think this is > already reflected in the comments (and I missed it)? I am arguably the person least qualified to answer this question. :-) > Perhaps the comment in the "if (tuples_frozen == 0)" branch could be > something more like: > > "We have no freeze plans to execute, so there's no cost to following > the freeze path. This is important in the case where the page is > entirely frozen already, so that the page will be marked as such in the > VM." I'm happy to use your wording instead -- I'll come up with a patch for that. In my mind it's just a restatement of what's there already. I assume that you're right about it being clearer this way. > Of course, I'm sure there are some nuances that I'm still missing. I don't think that there is, actually. I now believe that you totally understand the mechanics involved here. I'm glad that I was able to ascertain that that's all it was. It's worth going to the trouble of getting something like this exactly right. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Fri, Dec 30, 2022 at 1:12 PM Peter Geoghegan wrote: > > "Nominal freezing" is happening when there are no freeze plans at all. > > I get that it's to manage control flow so that the right thing happens > > later. But I think it should be defined in terms of what state the page > > is in so that we know that following a given path is valid. Defining > > "nominal freezing" as a case where there are no freeze plans is just > > confusing to me. > > What would you prefer? The state that the page is in is not something > that I want to draw much attention to, because it's confusing in a way > that mostly isn't worth talking about. I probably should have addressed what you said more directly. Here goes: Following the path of freezing a page is *always* valid, by definition. Including when there are zero freeze plans to execute, or even zero tuples to examine in the first place -- we'll at least be able to perform nominal freezing, no matter what. OTOH, following the "no freeze" path is permissible whenever the freeze_required flag hasn't been set during any call to heap_prepare_freeze_tuple(). It is never actually mandatory for lazy_scan_prune() to *not* freeze. It's a bit like how a simple point can be understood as a degenerate circle of radius 0. It's an abstract definition, which is just a tool for describing things precisely -- hopefully a useful tool. I welcome the opportunity to be able to describe things in a way that is clearer or more useful, in whatever way. But it's not like I haven't already put in significant effort to this exact question of what "freezing the page" really means to lazy_scan_prune(). Naming things is hard. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Fri, Dec 30, 2022 at 12:43 PM Jeff Davis wrote: > I always understood "freezing" to mean that a concrete action was > taken, and associated WAL generated. "When I use a word… it means just what I choose it to mean -- neither more nor less". I have also always understood freezing that way too. In fact, I still do understand it that way -- I don't think that it has been undermined by any of this. I've just invented this esoteric concept of nominal freezing that can be ignored approximately all the time, to solve one narrow problem that needed to be solved, that isn't that interesting anywhere else. > "Nominal freezing" is happening when there are no freeze plans at all. > I get that it's to manage control flow so that the right thing happens > later. But I think it should be defined in terms of what state the page > is in so that we know that following a given path is valid. Defining > "nominal freezing" as a case where there are no freeze plans is just > confusing to me. What would you prefer? The state that the page is in is not something that I want to draw much attention to, because it's confusing in a way that mostly isn't worth talking about. When we do nominal freezing, we don't necessarily go on to set the page all-frozen. In fact, it's not particularly likely that that will end up happening! Bear in mind that the exact definition of "freeze the page" is somewhat creative, even without bringing nominal freezing into it. It just has to be in order to support the requirements we have for MultiXacts (in particular for FRM_NOOP processing). The new concepts don't quite map directly on to the old ones. At the same time, it really is very often the case that "freezing the page" will perform maximally aggressive freezing, in the sense that it does precisely what a VACUUM FREEZE would do given the same page (in any Postgres version). -- Peter Geoghegan
Re: Avoiding unnecessary clog lookups while freezing
On Thu, Dec 29, 2022 at 12:20 PM Peter Geoghegan wrote: > On Thu, Dec 29, 2022 at 12:00 PM Andres Freund wrote: > > > I could just move the same tests from heap_prepare_freeze_tuple() to > > > heap_freeze_execute_prepared(), without changing any of the details. > > > > That might work, yes. > > Attached patch shows how that could work. My plan is to push something like this next week, barring objections. Note that I've inverted the xmax "!TransactionIdDidCommit()" test -- it is now "TransactionIdDidAbort()" instead. I believe that this makes the test more likely to catch problems, since we really should expect relevant xmax XIDs to have aborted, specifically -- since the xmax XIDs in question are always < OldestXmin. (There is a need to use a "!TransactionIdDidCommit()" test specifically in nearby code in FreezeMultiXactId(), because that code has to also deal with in-progress XIDs that are multi members, but that's not the case here.) I'm also going to create a CF entry for the other patch posted to this thread -- the enhancement to FreezeMultiXactId() that aims to get the most out of any expensive calls to GetMultiXactIdMembers(). That approach seems quite promising, and relatively simple. I'm not particularly worried about wasting a call to GetMultiXactIdMembers() during VACUUM, though. I'm more concerned about the actual impact of not doing our best to outright remove Multis during VACUUM. The application itself can experience big performance cliffs from SLRU buffer misses, which VACUUM should do its utmost to prevent. That is an occasional source of big problems [1]. I'm particularly concerned about the possibility that having an updater XID will totally change the characteristics of how multis are processed inside FreezeMultiXactId(). That seems like it might be a really sharp edge. I believe that the page-level freezing patch has already ameliorated the problem, since it made us much less reliant on the case where GetMultiXactIdMembers() returns "nmembers <= 0" for a Multi that happens to be HEAP_XMAX_IS_LOCKED_ONLY(). But we can and should go further than that. [1] https://buttondown.email/nelhage/archive/notes-on-some-postgresql-implementation-details/ -- Peter Geoghegan
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Fri, Nov 25, 2022 at 2:47 PM Peter Geoghegan wrote: > Attached WIP patch invents the idea of a regular autovacuum that is > tasked with advancing relfrozenxid -- which is really just another > trigger criteria, reported on in the server log in its autovacuum > reports. Attached is v2, which is just to fix bitrot. Well, mostly. I did make one functional change in v2: the autovacuum server log reports now separately report on table XID age and table MultiXactId age, each as its own distinct triggering condition. I've heard informal reports that the difference between antiwraparound autovacuums triggered by table XID age versus table MXID age can matter a great deal. It isn't difficult to break out that detail anyway, so even if the distinction isn't interesting all that often we might as well surface it to users. I still haven't made a start on the docs for this. I'm still not sure how much work I should do on the docs in the scope of this project versus my project that deals with related issues in VACUUM itself. The existing material from the "Routine Vacuuming" docs has lots of problems, and figuring out how to approach fixing those problems seems kind of daunting right now. -- Peter Geoghegan From 24cc2dca479f5cf6f7f918db3237032bbb2c2ce4 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Fri, 25 Nov 2022 11:23:20 -0800 Subject: [PATCH v2] Add "table age" trigger concept to autovacuum. Teach autovacuum.c to launch "table age" autovacuums at the same point that it previously triggered antiwraparound autovacuums. Antiwraparound autovacuums are retained, but are only used as a true option of last resort, when regular autovacuum has presumably tried and failed to advance relfrozenxid (likely because the auto-cancel behavior kept cancelling regular autovacuums triggered based on table age). The special auto-cancellation behavior applied by antiwraparound autovacuums is known to cause problems in the field, so it makes sense to avoid it, at least until the point where it starts to look like a proportionate response. Besides, the risk of the system eventually triggering xidStopLimit because of cancellations is a lot lower than it was back when the current auto-cancellation behavior was added by commit acac68b2. For example, there was no visibility map, so restarting antiwraparound autovacuum meant that the next autovacuum would get very little benefit from the work performed by earlier cancelled autovacuums. Also add new instrumentation that lists a triggering condition in the server log whenever an autovacuum is logged. This reports "table age" as the triggering criteria when regular (not antiwraparound) autovacuum runs because we need to advance the table's age. In other cases it will report an autovacuum was launched due to the tuple insert thresholds or the dead tuple thresholds. Note that pg_stat_activity doesn't have any special instrumentation for regular autovacuums that happen to have been triggered based on table age (since it really is just another form of autovacuum, and follows exactly the same rules in vacuumlazy.c and in proc.c). 
Author: Peter Geoghegan Reviewed-By: Jeff Davis Discussion: https://postgr.es/m/CAH2-Wz=S-R_2rO49Hm94Nuvhu9_twRGbTm6uwDRmRu-Sqn_t3w@mail.gmail.com --- src/include/commands/vacuum.h | 16 ++- src/include/storage/proc.h | 2 +- src/backend/access/heap/vacuumlazy.c| 14 +++ src/backend/access/heap/visibilitymap.c | 5 +- src/backend/access/transam/multixact.c | 4 +- src/backend/commands/vacuum.c | 18 ++-- src/backend/postmaster/autovacuum.c | 129 ++-- src/backend/storage/lmgr/proc.c | 4 +- 8 files changed, 145 insertions(+), 47 deletions(-) diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h index 2f274f2be..53e4e8ffb 100644 --- a/src/include/commands/vacuum.h +++ b/src/include/commands/vacuum.h @@ -189,6 +189,19 @@ typedef struct VacAttrStats #define VACOPT_PROCESS_TOAST 0x40 /* process the TOAST table, if any */ #define VACOPT_DISABLE_PAGE_SKIPPING 0x80 /* don't skip any pages */ +/* + * Values used by autovacuum.c to tell vacuumlazy.c about the specific + * threshold type that triggered an autovacuum worker + */ +typedef enum AutoVacType +{ + AUTOVACUUM_NONE = 0, + AUTOVACUUM_TABLE_XID_AGE, + AUTOVACUUM_TABLE_MXID_AGE, + AUTOVACUUM_DEAD_TUPLES, + AUTOVACUUM_INSERTED_TUPLES, +} AutoVacType; + /* * Values used by index_cleanup and truncate params. * @@ -220,7 +233,8 @@ typedef struct VacuumParams * use default */ int multixact_freeze_table_age; /* multixact age at which to scan * whole table */ - bool is_wraparound; /* force a for-wraparound vacuum */ + bool is_wraparound; /* antiwraparound autovacuum */ + AutoVacType trigger; /* autovacuum launched to advance table age */ int log_min_duration
Re: Avoiding unnecessary clog lookups while freezing
On Thu, Dec 29, 2022 at 12:50 PM Peter Geoghegan wrote: > On Thu, Dec 29, 2022 at 12:20 PM Peter Geoghegan wrote: > > > It seems somewhat wrong that we discard all the work that > > > heap_prepare_freeze_tuple() did. Yes, we force freezing to actually > > > happen in > > > a bunch of important cases (e.g. creating a new multixact), but even so, > > > e.g. GetMultiXactIdMembers() is awfully expensive to do for nought. Nor is > > > just creating the freeze plan free. > Here's an idea that seems like it could ameliorate the issue: > > When we're looping through members from GetMultiXactIdMembers(), and > seeing if we can get away with !need_replace/FRM_NOOP processing, why > not also check if there are any XIDs >= OldestXmin among the members? > If not (if they're all < OldestXmin), then we should prefer to go > further with processing the Multi now -- FRM_NOOP processing isn't > actually cheaper. Attached patch shows what I mean here. I think that there is a tendency for OldestMxact to be held back by a disproportionate amount (relative to OldestXmin) in the presence of long running transactions and concurrent updates from short-ish transactions. The way that we maintain state in shared memory to compute OldestMxact (OldestMemberMXactId/OldestVisibleMXactId) seems to be vulnerable to that kind of thing. I'm thinking of scenarios where MultiXactIdSetOldestVisible() gets called from a long-running xact, at the first point that it examines any multi, just to read something. That effectively makes the long-running xact hold back OldestMxact, even when it doesn't hold back OldestXmin. A read-only transaction that runs in READ COMMITTED could do this -- it'll call OldestVisibleMXactId() and "lock in" the oldest visible Multi that it needs to continue to see as running, without clearing that until much later (until AtEOXact_MultiXact() is called at xact commit/abort). And without doing anything to hold back OldestXmin by the same amount, or for the same duration. That's the kind of scenario that the patch might make a difference in -- because it exploits the fact that OldestXmin can be a lot less vulnerable than OldestMxact is to being held back by long running xacts. Does that seem plausible to you? -- Peter Geoghegan 0001-Prefer-FRM_INVALIDATE_XMAX-over-FRM_NOOP-when-cheape.patch Description: Binary data
Re: Avoiding unnecessary clog lookups while freezing
On Thu, Dec 29, 2022 at 12:20 PM Peter Geoghegan wrote: > > It seems somewhat wrong that we discard all the work that > > heap_prepare_freeze_tuple() did. Yes, we force freezing to actually happen > > in > > a bunch of important cases (e.g. creating a new multixact), but even so, > > e.g. GetMultiXactIdMembers() is awfully expensive to do for nought. Nor is > > just creating the freeze plan free. > > I'm not sure what you mean by that. I believe that the page-level > freezing changes do not allow FreezeMultiXactId() to call > GetMultiXactIdMembers() any more often than before. Are you concerned > about a regression, or something more general than that? Here's an idea that seems like it could ameliorate the issue: When we're looping through members from GetMultiXactIdMembers(), and seeing if we can get away with !need_replace/FRM_NOOP processing, why not also check if there are any XIDs >= OldestXmin among the members? If not (if they're all < OldestXmin), then we should prefer to go further with processing the Multi now -- FRM_NOOP processing isn't actually cheaper. We'll already know that a second pass over the multi really isn't expensive. And that it will actually result in FRM_INVALIDATE_XMAX processing, which is ideal. Avoiding a second pass is really just about avoiding FRM_RETURN_IS_MULTI (and avoiding FRM_RETURN_IS_XID, perhaps, though to a much lesser degree). -- Peter Geoghegan
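A standalone sketch of the check being proposed (hypothetical helper; in a real patch this would be folded into the member loop that FreezeMultiXactId() already runs, rather than calling GetMultiXactIdMembers() a second time):

    #include "postgres.h"
    #include "access/multixact.h"
    #include "access/transam.h"

    /*
     * Sketch only: if every member XID of the multi is < OldestXmin, a
     * second pass over the multi is cheap and ends in FRM_INVALIDATE_XMAX,
     * so there's no point in preferring FRM_NOOP processing.
     */
    static bool
    prefer_invalidate_over_noop(MultiXactId multi, TransactionId OldestXmin)
    {
        MultiXactMember *members;
        int         nmembers;
        bool        any_member_still_needed = false;

        nmembers = GetMultiXactIdMembers(multi, &members, false, false);
        for (int i = 0; i < nmembers; i++)
        {
            if (!TransactionIdPrecedes(members[i].xid, OldestXmin))
            {
                any_member_still_needed = true;
                break;
            }
        }
        if (nmembers > 0)
            pfree(members);

        return !any_member_still_needed;
    }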
Re: Avoiding unnecessary clog lookups while freezing
On Thu, Dec 29, 2022 at 12:00 PM Andres Freund wrote: > > I could just move the same tests from heap_prepare_freeze_tuple() to > > heap_freeze_execute_prepared(), without changing any of the details. > > That might work, yes. Attached patch shows how that could work. > It seems somewhat wrong that we discard all the work that > heap_prepare_freeze_tuple() did. Yes, we force freezing to actually happen in > a bunch of important cases (e.g. creating a new multixact), but even so, > e.g. GetMultiXactIdMembers() is awfully expensive to do for nought. Nor is > just creating the freeze plan free. I'm not sure what you mean by that. I believe that the page-level freezing changes do not allow FreezeMultiXactId() to call GetMultiXactIdMembers() any more often than before. Are you concerned about a regression, or something more general than that? The only case that we *don't* force xmax freezing in FreezeMultiXactId() is the FRM_NOOP case. Note in particular that we will reliably force freezing for any Multi < OldestMxact (not < MultiXactCutoff). > I wonder how often it'd be worthwhile to also do opportunistic freezing during > lazy_vacuum_heap_page(), given that we already will WAL log (and often issue > an FPI). Yeah, we don't actually need a cleanup lock for that. It might also make sense to teach lazy_scan_prune() to anticipate what will happen later on, in lazy_vacuum_heap_page(), so that it can freeze based on the same observation about the cost. (It already does a certain amount of this kind of thing, in fact.) -- Peter Geoghegan v2-0001-Check-xmin-xmax-commit-status-when-freezing-execu.patch Description: Binary data
Re: Avoiding unnecessary clog lookups while freezing
On Thu, Dec 29, 2022 at 9:21 AM Andres Freund wrote: > I do think we wanted to avoid reviving actually-dead tuples (present due to > the multixact and related bugs). And I'm worried about giving that checking > up, I've seen it hit too many times. Both in the real world and during > development. I could just move the same tests from heap_prepare_freeze_tuple() to heap_freeze_execute_prepared(), without changing any of the details. That would mean that the TransactionIdDidCommit() calls would only take place with tuples that actually get frozen, which is more or less how it worked before now. heap_prepare_freeze_tuple() will now often prepare freeze plans that just get discarded by lazy_scan_prune(). My concern is the impact on tables/pages that almost always discard prepared freeze plans, and so require many TransactionIdDidCommit() calls that really aren't necessary. > Somewhat of a tangent: I've previously wondered if we should have a small > hash-table based clog cache. The current one-element cache doesn't suffice in > a lot of scenarios, but it wouldn't take a huge cache to end up filtering most > clog accesses. I imagine that the one-element cache works alright in some scenarios, but then suddenly doesn't work so well, even though not very much has changed. Behavior like that makes the problems difficult to analyze, and easy to miss. I'm suspicious of that. -- Peter Geoghegan
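A sketch of the kind of cache Andres describes, simplified here to a direct-mapped array rather than a real hash table (entirely hypothetical -- nothing like this exists in the tree, and all names are invented). Caching only XIDs whose status is final means entries never need to be invalidated:

    #include "postgres.h"
    #include "access/clog.h"
    #include "access/transam.h"

    #define XID_STATUS_CACHE_SIZE   256     /* small power of two */

    typedef struct XidStatusCacheEntry
    {
        TransactionId xid;          /* InvalidTransactionId when unused */
        XidStatus   status;
    } XidStatusCacheEntry;

    static XidStatusCacheEntry xid_status_cache[XID_STATUS_CACHE_SIZE];

    /* Returns true (and sets *status) on a cache hit */
    static bool
    xid_status_cache_lookup(TransactionId xid, XidStatus *status)
    {
        XidStatusCacheEntry *entry;

        entry = &xid_status_cache[xid % XID_STATUS_CACHE_SIZE];
        if (TransactionIdIsValid(entry->xid) && entry->xid == xid)
        {
            *status = entry->status;
            return true;
        }
        return false;
    }

    /* Remember a clog lookup result, but only if it can no longer change */
    static void
    xid_status_cache_remember(TransactionId xid, XidStatus status)
    {
        if (status == TRANSACTION_STATUS_COMMITTED ||
            status == TRANSACTION_STATUS_ABORTED)
        {
            XidStatusCacheEntry *entry;

            entry = &xid_status_cache[xid % XID_STATUS_CACHE_SIZE];
            entry->xid = xid;
            entry->status = status;
        }
    }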
Re: Avoiding unnecessary clog lookups while freezing
On Wed, Dec 28, 2022 at 4:43 PM Andres Freund wrote: > > > Hm. I dimply recall that we had repeated cases where the hint bits were > > > set > > > wrongly due to some of the multixact related bugs. I think I was trying > > > to be > > > paranoid about not freezing stuff in those situations, since it can lead > > > to > > > reviving dead tuples, which obviously is bad. > > > > I think that it's a reasonable check, and I'm totally in favor of > > keeping it (or something very close, at least). > > I don't quite follow - one paragraph up you say we should fix something, and > then here you seem to say we should continue not to rely on the hint bits? I didn't mean that we should continue to not rely on the hint bits. Is that really all that the test is for? I think of it as a general sanity check. The important call to avoid with page-level freezing is the xmin call to TransactionIdDidCommit(), not the xmax call. The xmax call only occurs when VACUUM prepares to freeze a tuple that was updated by an updater (not a locker) that aborted. While the xmin calls will now take place with most unfrozen tuples. -- Peter Geoghegan
Re: Avoiding unnecessary clog lookups while freezing
On Wed, Dec 28, 2022 at 4:20 PM Andres Freund wrote: > > Theoretically this is an old issue that dates back to commit > > 699bf7d05c, as opposed to an issue in the page-level freezing patch. > > But that's not really true in a real practical sense. In practice the > > calls to TransactionIdDidCommit() will happen far more frequently > > following today's commit 1de58df4fe (since we're using OldestXmin as > > the cutoff that gates performing freeze_xmin/freeze_xmax processing -- > > not FreezeLimit). > > Hm. But we still only do the check when we actually freeze, rather than just > during the pre-check in heap_tuple_should_freeze(). So we'll only incur the > increased overhead when we also do more WAL logging etc. Correct? Yes, that's how it worked up until today's commit 1de58df4fe. I don't have strong feelings about back patching a fix, but this does seem like something that I should fix now, on HEAD. > Hm. I dimly recall that we had repeated cases where the hint bits were set > wrongly due to some of the multixact related bugs. I think I was trying to be > paranoid about not freezing stuff in those situations, since it can lead to > reviving dead tuples, which obviously is bad. I think that it's a reasonable check, and I'm totally in favor of keeping it (or something very close, at least). > There's practically no tuple-level concurrent activity in a normal regression > test. So I'm a bit doubtful that is an interesting metric. It'd be a tad more > interesting to run tests, including isolationtester and pgbench, against a > running cluster. Even on HEAD, with page-level freezing in place, we're only going to test XIDs that are < OldestXmin, that appear on pages that VACUUM actually scans in the first place. Just checking tuple-level hint bits should be effective. But even if it isn't (for whatever reason) then it's similar to cases where our second heap pass has to do clog lookups in heap_page_is_all_visible() just because hint bits couldn't be set earlier on, back when lazy_scan_prune() processed the same page during VACUUM's initial heap pass. -- Peter Geoghegan
Avoiding unnecessary clog lookups while freezing
I took another look at the code coverage situation around freezing following pushing the page-level freezing patch earlier today. I spotted an issue that I'd missed up until now: certain sanity checks in heap_prepare_freeze_tuple() call TransactionIdDidCommit() more often than really seems necessary. Theoretically this is an old issue that dates back to commit 699bf7d05c, as opposed to an issue in the page-level freezing patch. But that's not really true in a real practical sense. In practice the calls to TransactionIdDidCommit() will happen far more frequently following today's commit 1de58df4fe (since we're using OldestXmin as the cutoff that gates performing freeze_xmin/freeze_xmax processing -- not FreezeLimit). No regressions related to clog lookups by VACUUM were apparent from my performance validation of the page-level freezing work, but I suspect that the increase in TransactionIdDidCommit() calls will cause noticeable regressions with the right/wrong workload and/or configuration. The page-level freezing work is expected to reduce clog lookups in VACUUM in general, which is bound to have been a confounding factor. I see no reason why we can't just condition the XID sanity check calls to TransactionIdDidCommit() (for both the freeze_xmin and the freeze_xmax callsites) on a cheap tuple hint bit precheck not being enough. We only need an expensive call to TransactionIdDidCommit() when the precheck doesn't immediately indicate that the tuple xmin looks committed when that's what the sanity check expects to see (or that the tuple's xmax looks aborted, in the case of the callsite where that's what we expect to see). Attached patch shows what I mean. A quick run of the standard regression tests (with custom instrumentation) shows that this patch eliminates 100% of all relevant calls to TransactionIdDidCommit(), for both the freeze_xmin and the freeze_xmax callsites. -- Peter Geoghegan v1-0001-Avoid-unnecessary-clog-lookups-when-freezing.patch Description: Binary data
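The precheck described above boils down to something like the following sketch for the xmin case (illustration only, not the attached patch; the function name and error message are invented, and the xmax case would consult the corresponding "expected to have aborted" hint bits instead):

    #include "postgres.h"
    #include "access/htup_details.h"
    #include "access/transam.h"

    /*
     * Sketch only: skip the expensive clog lookup whenever the cheap
     * HEAP_XMIN_COMMITTED hint bit already tells us what the sanity check
     * wants to know.
     */
    static void
    sanity_check_xmin_committed(HeapTupleHeader tuple, TransactionId xmin)
    {
        if (!HeapTupleHeaderXminCommitted(tuple) &&
            !TransactionIdDidCommit(xmin))
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg_internal("uncommitted xmin %u encountered when freezing",
                                     xmin)));
    }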
Re: New strategies for freezing, advancing relfrozenxid early
On Mon, Dec 26, 2022 at 10:57 PM Hayato Kuroda (Fujitsu) wrote: > I guessed that this assertion failure seemed to be caused by the commit > 4ce3af[2], > because the Assert() seemed to be added by the commit. I agree that the problem is with this assertion, which is on the master branch (not in recent versions of the patch series itself) following commit 4ce3af:

    else
    {
        /*
         * Freeze plan for tuple "freezes xmax" in the strictest sense:
         * it'll leave nothing in xmax (neither an Xid nor a MultiXactId).
         */
        Assert(MultiXactIdPrecedes(xid, cutoffs->OldestMxact));
        ...
    }

The problem is that FRM_INVALIDATE_XMAX multi processing can occur both in Multis from before OldestMxact and Multis >= OldestMxact. The latter case (the >= case) is far less common, but still quite possible. Not sure how I missed that. Anyway, this assertion is wrong, and simply needs to be removed. Thanks for the report -- Peter Geoghegan
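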
Re: Call lazy_check_wraparound_failsafe earlier for parallel vacuum
On Tue, Dec 20, 2022 at 9:44 AM Imseih (AWS), Sami wrote: > Attached is a patch to check scanned pages rather > than blockno. Pushed, thanks! I adjusted the FAILSAFE_EVERY_PAGES comments, which now point out that FAILSAFE_EVERY_PAGES is a power of two. The implication is that the compiler is all but guaranteed to be able to reduce the modulo division into a simple mask (bitwise AND) in the lazy_scan_heap loop, at the point of the failsafe check. I doubt that it would really matter if the compiler had to generate a DIV instruction, but it seems like a good idea to avoid it on general principle, at least in performance-sensitive code. -- Peter Geoghegan
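To spell out the power-of-two point, here is a small standalone sketch (the constant is illustrative -- 4GB worth of 8KB blocks -- and is not quoted from vacuumlazy.c):

    #include <assert.h>
    #include <stdint.h>

    /* Illustrative interval: a power of two, in the spirit of FAILSAFE_EVERY_PAGES */
    #define EVERY_PAGES ((uint32_t) ((4ULL * 1024 * 1024 * 1024) / 8192))

    /*
     * Because EVERY_PAGES is a power of two, the compiler can turn the
     * modulo into a bitwise AND against (EVERY_PAGES - 1), rather than
     * emitting a DIV instruction.
     */
    static inline int
    failsafe_check_due(uint32_t scanned_pages)
    {
        assert((EVERY_PAGES & (EVERY_PAGES - 1)) == 0); /* power of two */
        return (scanned_pages % EVERY_PAGES) == 0;
    }

The caller-side guards (such as skipping the check before any pages have been scanned) aren't shown; the point is only that the modulo costs a single AND here.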
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Dec 21, 2022 at 4:30 PM Jeff Davis wrote: > The confusing thing to me is perhaps just the name -- to me, > "freeze_required" suggests that if it were set to true, it would cause > freezing to happen. But as far as I can tell, it does not cause > freezing to happen, it causes some other things to happen that are > necessary when freezing happens (updating and using the right > trackers). freeze_required is about what's required, which tells us nothing about what will happen when it's not required (could go either way, depending on how lazy_scan_prune feels about it). Setting freeze_required=true implies that heap_prepare_freeze_tuple has stopped doing maintenance of the "no freeze" trackers. When it sets freeze_required=true, it really *does* force freezing to happen, in every practical sense. This happens because lazy_scan_prune does what it's told to do when it's told that freezing is required. Because of course it does, why wouldn't it? So...I still don't get what you mean. Why would lazy_scan_prune ever break its contract with heap_prepare_freeze_tuple? And in what sense would you say that heap_prepare_freeze_tuple's setting freeze_required=true doesn't quite amount to "forcing freezing"? Are you worried about the possibility that lazy_scan_prune will decide to rebel at some point, and fail to honor its contract with heap_prepare_freeze_tuple? :-) > A minor point, no need to take action here. Perhaps rename the > variable. Andres was the one that suggested this name, actually. I initially just called it "freeze", but I think that Andres had it right. > I think 0001+0002 are about ready. Great. I plan on committing 0001 in the next few days. Committing 0002 might take a bit longer. Thanks -- Peter Geoghegan
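A minimal sketch of the callee side of that contract (illustrative only, not the real heap_prepare_freeze_tuple code; "pagefrz" and "cutoffs" stand in for the arguments discussed in this thread):

    if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
        pagefrz->freeze_required = true;    /* caller now *must* freeze the page */

    /*
     * Maintaining the "no freeze" trackers only makes sense while not
     * freezing remains a legal choice for the caller.  Once
     * freeze_required is set, that maintenance (and any multixact member
     * lookups it might need) can simply be skipped.
     */
    if (!pagefrz->freeze_required &&
        TransactionIdPrecedes(xid, pagefrz->NoFreezePageRelfrozenXid))
        pagefrz->NoFreezePageRelfrozenXid = xid;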
Re: New strategies for freezing, advancing relfrozenxid early
On Tue, Dec 20, 2022 at 7:15 PM Peter Geoghegan wrote: > On Tue, Dec 20, 2022 at 5:44 PM Jeff Davis wrote: > > Next, the 'freeze_required' field suggests that it's more involved in > > the control flow that causes freezing than it actually is. All it does > > is communicate how the trackers need to be adjusted. The return value > > of heap_prepare_freeze_tuple() (and underneath, the flags set by > > FreezeMultiXactId()) are what actually control what happens. It would > > be nice to make this more clear somehow. > > I'm not sure what you mean. Page-level freezing *doesn't* have to go > ahead when freeze_required is not ever set to true for any tuple on > the page (which is most of the time, in practice). lazy_scan_prune > gets to make a choice about freezing the page, when the choice is > available. Oh wait, I think I see the point of confusion now. When freeze_required is set to true, that means that lazy_scan_prune literally has no choice -- it simply must freeze the page as instructed by heap_prepare_freeze_tuple/FreezeMultiXactId. It's not just a strong suggestion -- it's crucial that lazy_scan_prune freezes the page as instructed. The "no freeze" trackers (HeapPageFreeze.NoFreezePageRelfrozenXid and HeapPageFreeze.NoFreezePageRelminMxid) won't have been maintained properly when freeze_required was set, so lazy_scan_prune can't expect to use them -- doing so would lead to VACUUM setting incorrect values in pg_class later on. Avoiding the work of maintaining those "no freeze" trackers isn't just a nice-to-have microoptimization -- it is sometimes very important. We kind of rely on this to be able to avoid getting too many MultiXact member SLRU buffer misses inside FreezeMultiXactId. There is a comment above FreezeMultiXactId that advises its caller that it had better not call heap_tuple_should_freeze when freeze_required is set to true, because that could easily lead to multixact member SLRU buffer misses -- misses that FreezeMultiXactId set out to avoid itself. It could actually be cheaper to freeze than to not freeze, in the case of a Multi -- member space misses can sometimes be really expensive. And so FreezeMultiXactId sometimes freezes a Multi even though it's not strictly required to do so. Note also that this isn't a new behavior -- it's actually an old one, for the most part. It kinda doesn't look that way, because we haven't passed down separate FreezeLimit/OldestXmin cutoffs (and separate OldestMxact/MultiXactCutoff cutoffs) until now. But we often don't need that granular information to be able to process Multis before the multi value is < MultiXactCutoff. If you look at how FreezeMultiXactId works, in detail, you'll see that even on Postgres HEAD it can (say) set a tuple's xmax to InvalidTransactionId long before the multi value is < MultiXactCutoff. It just needs to detect that the multi is not still running, and notice that it's HEAP_XMAX_IS_LOCKED_ONLY(). Stuff like that happens quite a bit. So for the most part "eager processing of Multis as a special case" is an old behavior that has only been enhanced a little bit (the really important new change in FreezeMultiXactId is how the FRM_NOOP case works with FreezeLimit, even though OldestXmin is used nearby -- this is extra confusing because 0002 doesn't change that particular use of FreezeLimit, while it changes every other nearby use of FreezeLimit into OldestXmin). -- Peter Geoghegan
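A sketch of the caller side of the freeze_required contract (again illustrative, not the real lazy_scan_prune code; the FreezePageRelminMxid/NoFreezePageRelminMxid names are assumed by symmetry with the fields named above, and "opportunistic_freeze" is a made-up placeholder for lazy_scan_prune's own choice to freeze when it isn't required):

    if (pagefrz.freeze_required || opportunistic_freeze)
    {
        /* ... execute the prepared freeze plans for the page ... */

        /* Only the "freeze" trackers are valid for the new pg_class values */
        NewRelfrozenXid = pagefrz.FreezePageRelfrozenXid;
        NewRelminMxid = pagefrz.FreezePageRelminMxid;
    }
    else
    {
        /* Page left unfrozen; the "no freeze" trackers are the valid pair */
        NewRelfrozenXid = pagefrz.NoFreezePageRelfrozenXid;
        NewRelminMxid = pagefrz.NoFreezePageRelminMxid;
    }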
Re: New strategies for freezing, advancing relfrozenxid early
But a cleanup-locked page does *not* need to get rid of all XIDs < OldestXmin, nor MXIDs < OldestMxact. This flexibility is mostly useful because it allows lazy_scan_prune to just decide to not freeze. But, to a much lesser degree, it's useful because of the edge case with multis -- in general we might just need the same leeway when lazy_scan_prune "freezes the page". > That being said, it does push harder against the limits on both sides. > If I understand correctly, that means pages with wider distributions of > xids are going to persist longer, which could expose pre-existing bugs > in new and interesting ways. I don't think it's fundamentally different to what we're already doing in lazy_scan_noprune. It's just more complicated, because you have to tease apart slightly different definitions of freezing to understand code around FreezeMultiXactId(). This is more or less needed to provide maximum flexibility, where we delay decisions about what to do until the very last moment. > Next, the 'freeze_required' field suggests that it's more involved in > the control flow that causes freezing than it actually is. All it does > is communicate how the trackers need to be adjusted. The return value > of heap_prepare_freeze_tuple() (and underneath, the flags set by > FreezeMultiXactId()) are what actually control what happens. It would > be nice to make this more clear somehow. I'm not sure what you mean. Page-level freezing *doesn't* have to go ahead when freeze_required is not ever set to true for any tuple on the page (which is most of the time, in practice). lazy_scan_prune gets to make a choice about freezing the page, when the choice is available. Note also that the FRM_NOOP case happens when a call to FreezeMultiXactId() takes place that won't leave behind a freeze plan for the tuple (unless its xmin happens to necessitate a freeze plan for the same tuple). And yet, it will do useful work, needed iff the "freeze the page" path is ultimately taken by lazy_scan_prune -- FreezeMultiXactId() itself will ratchet back FreezePageRelfrozenXid/NewRelfrozenXid as needed to make everything safe. > The comment: > > /* > * If we freeze xmax, make absolutely sure that it's not an XID that > * is important. (Note, a lock-only xmax can be removed independent > * of committedness, since a committed lock holder has released the > * lock). > */ > > caused me to go down a rabbit hole looking for edge cases where we > might want to freeze an xmax but not an xmin; e.g. tup.xmax < OldestXmin < tup.xmin or the related case where tup.xmax < RecentXmin < tup.xmin. > I didn't find a problem, so that's good news. This is an example of what I meant about the heapam.c code using a cutoff that actually comes from FreezeLimit, when it would be more sensible to use OldestXmin instead. > I also tried some pgbench activity along with concurrent vacuums (and > vacuum freezes) along with periodic verify_heapam(). No problems there. > > Did you already describe the testing you've done for 0001+0002 > specifically? It's not radically new logic, but it would be good to try > to catch minor state-handling errors. Lots of stuff with contrib/amcheck, which, as you must already know, will notice when an XID/MXID is contained in a table whose relfrozenxid and/or relminmxid indicates that it shouldn't be there. (Though VACUUM itself does the same thing, albeit not as effectively.) Obviously the invariants haven't changed here. In many ways it's a very small set of changes. But in one or two ways it's a significant shift. 
It depends on how you think about it. -- Peter Geoghegan
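The "ratchet back as needed" point above can be sketched like this (illustrative only, not the real FreezeMultiXactId code): any XID that the FRM_NOOP path leaves behind on the page must still be covered by the tracker that the "freeze the page" path will eventually report to pg_class.

    static void
    ratchet_back_for_remaining_xid(TransactionId remaining_xid,
                                   HeapPageFreeze *pagefrz)
    {
        /* Never let the would-be relfrozenxid advance past an XID left behind */
        if (TransactionIdPrecedes(remaining_xid, pagefrz->FreezePageRelfrozenXid))
            pagefrz->FreezePageRelfrozenXid = remaining_xid;
    }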