Re: run pgindent on a regular basis / scripted manner

2023-01-21 Thread Peter Geoghegan
On Sat, Jan 21, 2023 at 12:59 PM Tom Lane  wrote:
> Hmm, that could be a deal-breaker.  It's not going to be acceptable
> to have to pgindent different parts of the system on different platforms
> ... at least not unless we can segregate them on the file level, and
> even that would have a large PITA factor.

It's probably something that could be worked around. My remarks are
based on some dim memories of dealing with the tool before I arrived
at a configuration that works well enough for me. Importantly,
clang-format doesn't require you to futz around with Makefiles or
objdump or anything like that -- that's a huge plus. It doesn't seem
to impose any requirements on how I build Postgres at all (I generally
use gcc, not clang).

Even if these kinds of issues proved to be a problem for the person
tasked with running clang-format against the whole tree periodically,
they still likely wouldn't affect most of us. It's quite convenient to
use clang-format from an editor -- it can be invoked very
incrementally, against a small range of lines at a time. It's pretty
much something that I can treat like the built-in indent for my
editor. It's vastly different to the typical pgindent workflow.

> Still, we won't know unless someone makes a serious experiment with it.

There is one thing about clang-format that I find mildly infuriating:
it can indent function declarations in the way that I want it to, and
it can indent variable declarations in the way that I want it to. It
just can't do both at the same time, because they're both controlled
by AlignConsecutiveDeclarations.
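
To illustrate the variable declaration half of that, the style I'm
trying to reproduce is more or less what pgindent does for local
declarations -- names pushed out to a common column. This is only a
hand-written approximation, not verbatim pgindent output:

static int
count_xs(const char *str)
{
    const char *cursor = str;
    int         nfound = 0;
    char        ch = '\0';

    while ((ch = *cursor++) != '\0')
    {
        if (ch == 'x')
            nfound++;
    }
    return nfound;
}

AlignConsecutiveDeclarations is the single knob that turns this kind of
alignment on or off, and it also governs how runs of function
declarations come out -- hence the frustration.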

Of course the way that I want to do things is (almost by definition)
the pgindent way, at least right now -- it's not necessarily about my
fixed preferences (though it can be hard to tell!). It's really not
surprising that clang-format cannot quite perfectly simulate pgindent.
How flexible can we be about stuff like that? Obviously there is no
clear answer right now.

-- 
Peter Geoghegan




Re: run pgindent on a regular basis / scripted manner

2023-01-21 Thread Peter Geoghegan
On Sat, Jan 21, 2023 at 9:30 AM Bruce Momjian  wrote:
> I don't see uncrustify or clang-format supporting typedef lists so maybe
> they implemented this feedback loop.  It would be good to see if we can
> get either of these tools to match our formatting.

I personally use clang-format for Postgres development work, since it
integrates nicely with my text editor, and can be configured to
produce approximately the same result as pgindent (certainly good
enough when working on a patch that's still far from a committable
state). I'm fairly sure that clang-format has access to a full AST
from the clang compiler, which is the ideal approach - at least in
theory.

In practice this approach tends to run into problems when the relevant
AST isn't available. For example, if there's code that only builds on
Windows, maybe it won't work at all (at least on my Linux system).
This doesn't really bother me currently, since I only rely on
clang-format as a first pass sort of thing. Maybe I could figure out a
better way to deal with such issues, but right now I don't have much
incentive to.

Another advantage of clang-format is that it's a known quantity. For
example there is direct support for it built into meson, with bells
and whistles such as CI support:

https://mesonbuild.com/Code-formatting.html

My guess is that moving to clang-format would require giving up some
flexibility, but getting much better integration with text editors and
tools like meson in return. It would probably make it practical to
have much stronger rules about how committed code must be indented --
rules that are practical, and can actually be enforced. That trade-off
seems likely to be worth it in my view, though it's not something that
I feel too strongly about.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-20 Thread Peter Geoghegan
On Fri, Jan 20, 2023 at 12:57 PM Robert Haas  wrote:
> It doesn't seem that way to me. What am I missing? In that case, the
> problem was a DROP TRIGGER command waiting behind autovacuum's lock
> and thus causing all new table locks to wait behind DROP TRIGGER's
> lock request. But it does not sound like that was a one-off event.

It's true that I cannot categorically state that it would have made
the crucial difference in this particular case. It comes down to two
factors:

1. How many attempts would any given amount of additional XID space
headroom have bought them in practice? We can be all but certain that
the smallest possible number is 1, which is something.

2. Would that have been enough for relfrozenxid to be advanced in practice?

I think that it's likely that the answer to 2 is yes, since there was
no mention of bloat as a relevant factor at any point in the
postmortem. It was all about locking characteristics of antiwraparound
autovacuuming in particular, and its interaction with their
application. I think that they were perfectly okay with the autovacuum
cancellation behavior most of the time. In fact, I don't think that
there was any bloat in the table at all -- it was a really huge table
(likely an events table), and those tend to be append-only.

Even if I'm wrong about this specific case (we'll never know for
sure), the patch as written would be virtually guaranteed to make the
crucial difference in cases that I have seen up close -- for example, a
case involving TRUNCATE.

> It sounds like they used DROP TRIGGER pretty regularly. So I think this
> sounds like exactly the kind of case I was talking about, where
> autovacuums keep getting cancelled until we decide to stop cancelling
> them.

I don't know how you can reach that conclusion. The chances of a
non-aggressive VACUUM advancing relfrozenxid right now are virtually
zero, at least for a big table like this one. It seems quite likely
that plenty of non-aggressive autovacuums completed, or would have,
had the insert-driven autovacuum feature been available.

The whole article was about how this DROP TRIGGER pattern worked just
fine most of the time, because most of the time autovacuum was just
autocancelled. They say this at one point:

"The normal autovacuum mechanism is skipped when locks are held in
order to minimize service disruption. However, because transaction
wraparound is such a severe problem, if the system gets too close to
wraparound, an autovacuum is launched that does not back off under
lock contention."

At another point:

"When the outage was resolved, we still had a number of questions: is
a wraparound autovacuum always so disruptive? Given that it was
blocking all table operations, why does it throttle itself?"

ISTM that it was a combination of aggressive vacuuming taking far
longer than usual (especially likely because this was pre freeze map),
and the no-auto-cancel behavior. Aggressive/antiwraparound VACUUMs are
naturally much more likely to coincide with periodic DDL, just because
they take so much longer. That is a dangerous combination.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-20 Thread Peter Geoghegan
On Fri, Jan 20, 2023 at 5:47 AM Robert Haas  wrote:
> Yeah, this is a major reason why I'm very leery about changes in this
> area. A lot of autovacuum behavior is emergent, in the sense that it
> wasn't directly intended by whoever wrote the code. It's just a
> consequence of other decisions that probably seemed very reasonable at
> the time they were made but turned out to have surprising and
> unpleasant consequences.

I certainly agree with your general description of the ways things
are. To a large degree we're living in a world where DBAs have already
compensated for some of the autovacuum shortcomings discussed on this
thread. For example, by setting autovacuum_vacuum_scale_factor (and
even autovacuum_vacuum_insert_scale_factor) to very low values, to
compensate for the issues with random sampling of dead tuples by
analyze, and to compensate for the way that VACUUM doesn't reason
correctly about how the number of dead tuples changes as VACUUM runs.
They might not have thought of it that way -- it could have happened
as a byproduct of tuning a production system through trial and error
-- but it still counts as compensating for a defect in autovacuum
scheduling IMV.
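
For context, the trigger that people are working around is very
simple. Roughly this shape (a simplified sketch, not the literal
autovacuum.c code; the helper and argument names here are mine, only
the two GUC names are real):

#include <stdbool.h>

/*
 * Launch a dead tuple autovacuum once dead tuples exceed a base
 * threshold plus a fraction of reltuples.  With the default scale
 * factor of 0.2, a table believed to hold 100 million rows needs
 * roughly 20 million dead tuples first -- which is exactly why people
 * crank the setting way down.
 */
static bool
dead_tuple_vacuum_needed(double reltuples, double dead_tuples,
                         int base_thresh,       /* autovacuum_vacuum_threshold */
                         double scale_factor)   /* autovacuum_vacuum_scale_factor */
{
    double      vacthresh = base_thresh + scale_factor * reltuples;

    return dead_tuples > vacthresh;
}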

It's actually quite likely that even a strict improvement to (say)
autovacuum scheduling will cause some number of regressions, since now
what were effectively mitigations become unnecessary. This is somewhat
similar to the dynamic with optimizer improvements, where (say) a
selectivity estimate function that's better by every available metric
can still easily cause regressions that really cannot be blamed on the
improvement itself. I personally believe that it's a price worth
paying when it comes to the issues with autovacuum statistics,
particularly the dead tuple count issues, since much of the behavior
that we sometimes see is just absurdly bad. We have both watertight
theoretical arguments and practical experience pointing in that
direction.

> In this particular case, I think that there is a large risk that
> postponing auto-cancellation will make things significantly worse,
> possibly drastically worse, for a certain class of users -
> specifically, those whose vacuums often get auto-cancelled.

I agree that that's a real concern for the autocancellation side of
things. That seems quite different to the dead tuples accounting
issues, in that nobody would claim that the existing behavior is
flagrantly wrong (just that it sometimes causes problems instead of
preventing them).

> That's why I really liked your idea of decoupling auto-cancellation
> from XID age. Such an approach can still avoid disabling
> auto-cancellation just because autovacuum_freeze_max_age has been hit,
> but it can also disable it much earlier when it detects that doing so
> is necessary to make progress.

To be clear, I didn't think that that's what Andres was proposing, and
my recent v5 doesn't actually do that. Even in v5, it's still
fundamentally impossible for autovacuums that are triggered by the
tuples inserted or dead tuples thresholds to not be autocancellable.

ISTM that it doesn't make sense to condition the autocancellation
behavior on table XID age in the context of dead tuple VACUUMs. It
could either be way too early or way too late at that point. I was
rather hoping to not have to build the infrastructure required for
fully decoupling the autocancellation behavior from the triggering
condition (dead tuples vs table XID age) in the scope of this
thread/patch, though I can see the appeal of that.

The only reason why I'm using table age at all is because that's how
it works already, rightly or wrongly. If nothing else, it's pretty
clear that there is no theoretical or practical reason why it has to
be exactly the same table age as the one for launching autovacuums to
advance relfrozenxid/relminmxid. In v5 of the patch, the default is to
use 1.8x of the threshold that initially makes autovacuum.c want to
launch autovacuums to deal with table age. That's based on a
suggestion from Andres, but I'd be almost as happy with a default as
low as 1.1x or even 1.05x. That's going to make very little difference
to those users that really rely on the no-auto-cancellation behavior,
while at the same time making things a lot safer for scenarios like
the Joyent/Manta "DROP TRIGGER" outage (not perfectly safe, by any
means, but meaningfully safer).
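
To make the numbers concrete, the computation amounts to something
like this (a sketch only, not the patch; the function and parameter
names are made up, and the cap at the failsafe age is per Andres'
suggestion upthread):

#include <stdint.h>

/*
 * The table age at which autovacuum stops allowing itself to be
 * auto-cancelled, expressed as a multiple of the table age that
 * triggers table age autovacuums in the first place, and never past
 * the failsafe cutoff.  With stock defaults (200 million and 1.6
 * billion), a 1.8x multiplier gives 360 million, while 1.05x gives
 * 210 million.
 */
static uint32_t
no_auto_cancel_table_age(uint32_t freeze_max_age,   /* autovacuum_freeze_max_age */
                         double multiplier,         /* e.g. 1.8 */
                         uint32_t failsafe_age)     /* vacuum_failsafe_age */
{
    double      cutoff = freeze_max_age * multiplier;

    if (cutoff > failsafe_age)
        cutoff = failsafe_age;
    return (uint32_t) cutoff;
}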

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-19 Thread Peter Geoghegan
On Thu, Jan 19, 2023 at 3:38 PM Andres Freund  wrote:
> Another version of this could be to integrate analyze.c's scan more closely
> with vacuum all the time. It's a bit bonkers that we often sequentially read
> blocks, evict them from shared buffers if we read them, just to then
> afterwards do random IO for blocks we've already read. That's imo what we
> eventually should do, but clearly it's not a small project.

Very often, what we're really missing in VACUUM is high level context.
That's true of what you say here, about analyze.c, as well as
complaints like your vac_estimate_reltuples complaint. The problem
scenarios involving vac_estimate_reltuples all involve repeating the
same silly action again and again, never realizing that that's what's
going on. I've found it very useful to think of one VACUUM as picking
up where the last one left off for my work on freezing.

This seems related to pre-autovacuum historical details. VACUUM
shouldn't really be a command in the same way that CREATE INDEX is a
command. I do think that we need to retain a VACUUM command in some
form, but it should be something pretty close to a command that just
enqueues off-schedule autovacuums. That can do things like coalesce
duplicate requests into one.

Anyway, I am generally in favor of a design that makes VACUUM and
ANALYZE things that are more or less owned by autovacuum. It should be
less and less of a problem to blur the distinction between VACUUM and
ANALYZE under this model, in each successive release. These
distinctions are quite unhelpful, overall, because they make it hard
for autovacuum scheduling to work at the whole-system level.

> > This wouldn't have to happen every time, but it would happen fairly often.
>
> Do you have a mechanism for that in mind? Just something vacuum_count % 10 ==
> 0 like? Or remember scanned_pages in pgstats and re-computing

I was thinking of something very simple like that, yes.

> I think it'd be fine to just use analyze.c and pass in an option to not
> compute column and inheritance stats.

That could be fine. Just as long as it's not duplicative in an obvious way.

> > Presumably you'll want to add the same I/O prefetching logic to this
> > cut-down version, just for example. Since without that there will be
> > no competition between it and ANALYZE proper. Besides which, isn't it
> > kinda wasteful to not just do a full ANALYZE? Sure, you can avoid
> > detoasting overhead that way. But even still.
>
> It's not just that analyze is expensive, I think it'll also be confusing if
> the column stats change after a manual VACUUM without ANALYZE.

Possibly, but it doesn't have to happen there. It's not like the rules
aren't a bit different compared to autovacuum already. For example,
the way TOAST tables are handled by the VACUUM command versus
autovacuum.

Even if it's valuable to maintain this kind of VACUUM/autovacuum
parity (which I tend to doubt), doesn't the same argument work almost
as well with whatever stripped down version you come up with? It's
also confusing that a manual VACUUM command will be doing an
ANALYZE-like thing. Especially in cases where it's really expensive
relative to the work of VACUUM, because VACUUM scanned so few pages.
You just have to make some kind of trade-off.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-19 Thread Peter Geoghegan
On Thu, Jan 19, 2023 at 2:54 PM Andres Freund  wrote:
> Yea. Hence my musing about potentially addressing this by choosing to visit
> additional blocks during the heap scan using vacuum's block sampling logic.

I'd rather just invent a way for vacuumlazy.c to tell the top-level
vacuum.c caller "I didn't update reltuples, but you ought to go
ANALYZE the table now that I'm done, even if you weren't already
planning to do so". This wouldn't have to happen every time, but it
would happen fairly often.

> IME most of the time in analyze isn't spent doing IO for the sample blocks
> themselves, but CPU time and IO for toasted columns. A trimmed down version
> that just computes relallvisible should be a good bit faster.

I worry about that from a code maintainability point of view. I'm
concerned that it won't be very cut down at all, in the end.

Presumably you'll want to add the same I/O prefetching logic to this
cut-down version, just for example. Since without that there will be
no competition between it and ANALYZE proper. Besides which, isn't it
kinda wasteful to not just do a full ANALYZE? Sure, you can avoid
detoasting overhead that way. But even still.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-19 Thread Peter Geoghegan
On Thu, Jan 19, 2023 at 12:58 PM Andres Freund  wrote:
> There's absolutely no guarantee that autoanalyze is triggered
> there. Particularly with repeated vacuums triggered due to an relfrozenxid age
> that can't be advanced that very well might not happen within days on a large
> relation.

Arguments like that work far better as arguments in favor of the
vac_estimate_reltuples heuristics.

That doesn't mean that the heuristics are good in any absolute sense,
of course. They were just a band aid intended to ameliorate some of
the negative impact that came from treating scanned_pages as a random
sample. I think that we both agree that the real problem is that
scanned_pages just isn't a random sample, at least not as far as
reltuples/live tuples is concerned (for dead tuples it kinda isn't a
sample, but is rather something close to an exact count).

I now understand that you're in favor of addressing the root problem
directly. I am also in favor of that approach. I'd be more than happy
to get rid of the band aid as part of that whole effort.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-19 Thread Peter Geoghegan
On Thu, Jan 19, 2023 at 12:56 PM Andres Freund  wrote:
> > In other words, ANALYZE sometimes (but not always) produces wrong answers.
>
> For dead tuples, but not live tuples.

> > In other words, VACUUM sometimes (but not always) produces wrong answers.
>
> For live tuples, but not badly so for dead tuples.

Agreed. More generally, there is a need to think about the whole table
in some cases (like for regular optimizer statistics including
reltuples/live tuples), and the subset of pages that will be scanned
by VACUUM in other cases (for dead tuples accounting). Random sampling
at the table level is appropriate and works well enough for the
former, though not for the latter.

> We are, but perhaps not too badly so, because we can choose to believe analyze
> more for live tuples, and vacuum for dead tuples. Analyze doesn't compute
> reltuples incrementally and vacuum doesn't compute deadtuples incrementally.

Good points.

> But in contrast to dead_tuples, where I think we can just stop analyze from
> updating it unless we crashed recently, I do think we need to update reltuples
> in vacuum. So computing an accurate value seems like the least unreasonable
> thing I can see.

I agree, but there is no reasonable basis for treating scanned_pages
as a random sample, especially if it's only a small fraction of all of
rel_pages -- treating it as a random sample is completely
unjustifiable. And so it seems to me that the only thing that can be
done is to either make VACUUM behave somewhat like ANALYZE in at least
some cases, or to have it invoke ANALYZE directly (or indirectly) in
those same cases.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 7:04 PM Andres Freund  wrote:
> > You seem to be saying that it's a problem if we don't update reltuples
> > -- an estimate -- when less than 2% of the table is scanned by VACUUM.
> > But why? Why can't we just do nothing sometimes? I mean in general,
> > leaving aside the heuristics I came up with for a moment?
>
> The problem isn't that we might apply the heuristic once, that'd be fine. But
> that there's nothing preventing it from applying until there basically are no
> tuples left, as long as the vacuum is frequent enough.
>
> As a demo: The attached sql script ends up with a table containing 10k rows,
> but reltuples being set to 1 million.

I saw that too. But then I checked again a few seconds later, and
autoanalyze had run, so reltuples was 10k. Just like it would have if
there were no VACUUM statements in your script.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 6:10 PM Andres Freund  wrote:
> > This creates an awkward but logical question, though: what if
> > dead_tuples doesn't go down at all? What if VACUUM actually has to
> > increase it, because VACUUM runs so slowly relative to the workload?
>
> Sure, that can happen - but it's not made better by having wrong stats :)

Maybe it's that simple. It's reasonable to wonder how far we want to
go with letting dead tuples grow and grow, even when VACUUM is running
constantly. It's not a question that has an immediate and obvious
answer IMV.

Maybe the real question is: is this an opportunity to signal to the
user (say via a LOG message) that VACUUM cannot keep up? That might be
very useful, in a general sort of way (not just to avoid new
problems).

> We have reasonably sophisticated accounting in pgstats what newly live/dead
> rows a transaction "creates". So an obvious (and wrong) idea is just decrement
> reltuples by the number of tuples removed by autovacuum.

Did you mean dead_tuples?

> But we can't do that,
> because inserted/deleted tuples reported by backends can be removed by
> on-access pruning and vacuumlazy doesn't know about all changes made by its
> call to heap_page_prune().

I'm not following here. Perhaps this is a good sign that I should stop
working for the day.  :-)

> But I think that if we add a
>   pgstat_count_heap_prune(nredirected, ndead, nunused)
> around heap_page_prune() and a
>   pgstat_count_heap_vacuum(nunused)
> in lazy_vacuum_heap_page(), we'd likely end up with a better approximation
> than what vac_estimate_reltuples() does, in the "partially scanned" case.

What does vac_estimate_reltuples() have to do with dead tuples?

--
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 5:49 PM Andres Freund  wrote:
> On 2023-01-18 15:28:19 -0800, Peter Geoghegan wrote:
> > Perhaps we should make vac_estimate_reltuples focus on the pages that
> > VACUUM newly set all-visible each time (not including all-visible
> > pages that got scanned despite being all-visible) -- only that subset
> > of scanned_pages seems to be truly relevant. That way you wouldn't be
> > able to do stuff like this. We'd have to start explicitly tracking the
> > number of pages that were newly set in the VM in vacuumlazy.c to be
> > able to do that, but that seems like a good idea anyway.
>
> Can you explain a bit more what you mean with "focus on the pages" means?

We don't say anything about pages we didn't scan right now -- only
scanned_pages have new information, so we just extrapolate. Why not go
even further than that, by only saying something about the pages that
were both scanned and newly set all-visible?

Under this scheme, the pages that were scanned but couldn't be newly
set all-visible are treated just like the pages that weren't scanned
at all -- they get a generic estimate from the existing reltuples.
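
In pseudo-C, the idea is roughly this (a sketch only -- the "newly
all-visible" counters don't exist in vacuumlazy.c today, and all of
the names are invented):

#include <math.h>

/*
 * Only the pages VACUUM newly set all-visible contribute fresh
 * information.  Everything else -- unscanned pages, plus scanned pages
 * that stayed not-all-visible -- just keeps the old tuple density.
 */
static double
estimate_reltuples_sketch(double old_density,   /* old reltuples / relpages */
                          double total_pages,
                          double newly_allvisible_pages,
                          double newly_allvisible_tuples)
{
    double      other_pages = total_pages - newly_allvisible_pages;

    return floor(newly_allvisible_tuples + old_density * other_pages + 0.5);
}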

> I don't think it's hard to see this causing problems. Set
> autovacuum_vacuum_scale_factor to something smaller than 2% or somewhat
> frequently vacuum manually. Incrementally delete old data.  Unless analyze
> saves you - which might not be run or might have a different scale factor or
> not be run manually - reltuples will stay exactly the same, despite data
> changing substantially.

You seem to be saying that it's a problem if we don't update reltuples
-- an estimate -- when less than 2% of the table is scanned by VACUUM.
But why? Why can't we just do nothing sometimes? I mean in general,
leaving aside the heuristics I came up with for a moment?

It will cause problems if we remove the heuristics -- much less
theoretical problems. What about those?

How often does VACUUM scan so few pages, anyway? We've been talking
about how ineffective autovacuum_vacuum_scale_factor is, at great
length, but now you're saying that it *can* meaningfully trigger not
just one VACUUM, but many VACUUMs, where no more than 2% of rel_pages
are not all-visible (pages, not tuples)? Not just once, mind you, but
many times? And in the presence of some kind of highly variable tuple
size, where it actually could matter to the planner at some point?

I would be willing to just avoid even these theoretical problems if
there was some way to do so, that didn't also create new problems. I
have my doubts that that is possible, within the constraints of
updating pg_class. Or the present constraints, at least. I am not a
miracle worker -- I can only do so much with the information that's
available to vac_update_relstats (and/or the information that can
easily be made available).

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Fri, Jan 13, 2023 at 9:55 PM Andres Freund  wrote:
> How about a float autovacuum_no_auto_cancel_age where positive values are
> treated as absolute values, and negative values are a multiple of
> autovacuum_freeze_max_age? And where the "computed" age is capped at
> vacuum_failsafe_age? A "failsafe" autovacuum clearly shouldn't be cancelled.
>
> And maybe a default setting of -1.8 or so?

Attached is a new revision, v5. I'm not happy with this, but thought
it would be useful to show you where I am with it.

It's a bit awkward that we have a GUC (autovacuum_no_auto_cancel_age)
that can sometimes work as a cutoff that works similarly to both
freeze_max_age and multixact_freeze_max_age, but usually works as a
multiplier. It's both an XID age value, an MXID age value, and a
multiplier on XID/MXID age values.

What if it was just a simple multiplier on
freeze_max_age/multixact_freeze_max_age, without changing any other
detail?

-- 
Peter Geoghegan


v5-0002-Add-table-age-trigger-concept-to-autovacuum.patch
Description: Binary data


v5-0001-Add-autovacuum-trigger-instrumentation.patch
Description: Binary data


Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 4:37 PM Andres Freund  wrote:
> I can, it should be just about trivial code-wise. A bit queasy about trying to
> forsee the potential consequences.

That's always going to be true, though.

> A somewhat related issue is that pgstat_report_vacuum() sets dead_tuples to
> what VACUUM itself observed, ignoring any concurrently reported dead
> tuples. As far as I can tell, when vacuum takes a long time, that can lead to
> severely under-accounting dead tuples.

Did I not mention that one? There are so many that it can be hard to
keep track! That's why I catalog them.

As you point out, it's the dead tuples equivalent of my
ins_since_vacuum complaint. The problem is exactly analogous to my
recent complaint about insert-driven autovacuums.

> I wonder if we ought to remember the dead_tuples value at the start of the
> heap scan and use that to adjust the final dead_tuples value. I'd lean towards
> over-counting rather than under-counting and thus probably would go for
> something like
>
>   tabentry->dead_tuples = livetuples + Min(0, tabentry->dead_tuples - deadtuples_at_start);
>
> i.e. assuming we might have missed all concurrently reported dead tuples.

This is exactly what I was thinking of doing for both issues (the
ins_since_vacuum one and this similar dead tuples one). It's
completely logical.

This creates an awkward but logical question, though: what if
dead_tuples doesn't go down at all? What if VACUUM actually has to
increase it, because VACUUM runs so slowly relative to the workload?
Of course the whole point is to make it more likely that VACUUM will
keep up with the workload. I'm just not quite sure that the
consequences of doing it that way are strictly a good thing. Bearing
in mind that we don't differentiate between recently dead and dead
here.

Fun fact: autovacuum can spin with pgbench because of recently dead
tuples, even absent an old snapshot/long running xact, if you set
things aggressively enough:

https://postgr.es/m/CAH2-Wz=sJm3tm+FpXbyBhEhX5tbz1trQrhG6eOhYk4-+5uL=w...@mail.gmail.com

I think that we probably need to do something like always make sure
that dead_items goes down by a small amount at the end of each VACUUM,
even when that's a lie. Maybe we also have a log message about
autovacuum not keeping up, so as to not feel too guilty about it. You
know, to give the user a chance to reconfigure autovacuum so that it
stops happening.
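
Something with roughly this shape, say (a sketch only; the names and
the 5% figure are invented, and the message would really be an ereport
driven by autovacuum rather than stderr):

#include <stdio.h>

/*
 * Never let the reported dead tuple count end the VACUUM at or above
 * where it started, even when that's a lie -- and complain when the
 * honest number wouldn't have gone down at all.
 */
static double
clamp_reported_dead_tuples(double dead_tuples_at_start, double honest_value)
{
    if (honest_value >= dead_tuples_at_start)
    {
        fprintf(stderr, "LOG: autovacuum may not be keeping up with this table\n");
        return dead_tuples_at_start * 0.95;     /* arbitrary small decrease */
    }
    return honest_value;
}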

> Of course we could instead move to something like ins_since_vacuum and reset
> it at the *start* of the vacuum. But that'd make the error case harder,
> without giving us more accuracy, I think?

It would. It seems illogical to me.

> I do think this is an argument for splitting up dead_tuples into separate
> "components" that we track differently. I.e. tracking the number of dead
> items, not-yet-removable rows, and the number of dead tuples reported from DML
> statements via pgstats.

Is it? Why?

I'm all in favor of doing that, of course. I just don't particularly
think that it's related to this other problem. One problem is that we
count dead tuples incorrectly because we don't account for the fact
that things change while VACUUM runs. The other problem is that the
thing that is counted isn't broken down into distinct subcategories of
things -- things are bunched together that shouldn't be.

Oh wait, you were thinking of what I said before -- my "awkward but
logical question". Is that it?

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 4:02 PM Andres Freund  wrote:
> vacuum-no    reltuples/n_live_tup    n_dead_tup
> 1            476                     500
> 2            2500077                 500
> 3            1250184                 500
> 4            625266                  500
> 5            312821                  500
> 10           10165                   500
>
> Each vacuum halves reltuples.  That's going to screw badly with all kinds of
> things. Planner costs completely out of whack etc.

I get that that could be a big problem, even relative to the more
immediate problem of VACUUM just spinning like it does in your test
case. What do you think we should do about it? What do you think about
my idea of focussing on the subset of pages newly set all-visible in
the VM?

> I wonder if this is part of the reason for the distortion you addressed with
> 74388a1a / 3097bde7dd1d. I am somewhat doubtful they're right as is. For a
> large relation 2% of blocks is a significant number of rows, and simply never
> adjusting reltuples seems quite problematic. At the very least we ought to
> account for dead tids we removed or such, instead of just freezing reltuples.

As I mentioned, it only kicks in when relpages is *precisely* the same
as last time (not one block more or one block less), *and* we only
scanned less than 2% of rel_pages. It's quite possible that that's
insufficient, but I can't imagine it causing any new problems.
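
Paraphrased, the heuristic is just this (the shape of it, not the
literal code from 74388a1a):

#include <stdbool.h>

/*
 * Keep the old reltuples, rather than extrapolating, when the table
 * hasn't grown or shrunk by even one block and less than 2% of it was
 * scanned.
 */
static bool
keep_old_reltuples(double old_rel_pages, double total_pages,
                   double scanned_pages)
{
    return old_rel_pages == total_pages &&
        scanned_pages < total_pages * 0.02;
}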

I think that we need to be realistic about what's possible while
storing a small, fixed amount of information. There is always going to
be some distortion of this kind. We can do something about the
obviously pathological cases, where errors can grow without bound. But
you have to draw the line somewhere, unless you're willing to replace
the whole approach with something that stores historic metadata.

What kind of tradeoff do you want to make here? I think that you have
to make one.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 3:28 PM Peter Geoghegan  wrote:
> The problems in this area tend to be that vac_estimate_reltuples()
> behaves as if it sees a random sample, when in fact it's far from
> random -- it's the same scanned_pages as last time, and the ten other
> times before that. That's a feature of your test case, and other
> similar vac_estimate_reltuples test cases that I came up with in the
> past. Both scenarios involved skipping using the visibility map in
> multiple successive VACUUM operations.

FWIW, the problem in your test case just goes away if you just change this line:

DELETE FROM foo WHERE i < (1000 * 0.1)

To this:

DELETE FROM foo WHERE i < (1000 * 0.065)

Which is not a huge difference, overall. This effect is a consequence
of the heuristics I added in commit 74388a1a, so it's only present on
Postgres 15+.

Whether or not this is sufficient protection is of course open to
interpretation. One problem with those heuristics (as far as your test
case is concerned) is that they either work, or they don't work. For
example, they're conditioned on "old_rel_pages == total_pages".

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 2:37 PM Peter Geoghegan  wrote:
> Maybe you're right to be concerned to the degree that you're concerned
> -- I'm not sure. I'm just adding what I see as important context.

The problems in this area tend to be that vac_estimate_reltuples()
behaves as if it sees a random sample, when in fact it's far from
random -- it's the same scanned_pages as last time, and the ten other
times before that. That's a feature of your test case, and other
similar vac_estimate_reltuples test cases that I came up with in the
past. Both scenarios involved skipping using the visibility map in
multiple successive VACUUM operations.

Perhaps we should make vac_estimate_reltuples focus on the pages that
VACUUM newly set all-visible each time (not including all-visible
pages that got scanned despite being all-visible) -- only that subset
of scanned_pages seems to be truly relevant. That way you wouldn't be
able to do stuff like this. We'd have to start explicitly tracking the
number of pages that were newly set in the VM in vacuumlazy.c to be
able to do that, but that seems like a good idea anyway.

This probably has consequences elsewhere, but maybe that's okay. We
know when the existing pg_class has no information, since that is
explicitly encoded by a reltuples of -1. Obviously we'd need to be
careful about stuff like that. Overall, the danger from assuming that
"unsettled" pages (pages that couldn't be newly set all-visible by
VACUUM) have a generic tuple density seems less than the danger of
assuming that they're representative. We know that we're bound to scan
these same pages in the next VACUUM anyway, so they'll have another
chance to impact our view of the table's tuple density at that time.

--
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 2:22 PM Andres Freund  wrote:
> The problem with the change is here:
>
> /*
>  * Okay, we've covered the corner cases.  The normal calculation is to
>  * convert the old measurement to a density (tuples per page), then
>  * estimate the number of tuples in the unscanned pages using that figure,
>  * and finally add on the number of tuples in the scanned pages.
>  */
> old_density = old_rel_tuples / old_rel_pages;
> unscanned_pages = (double) total_pages - (double) scanned_pages;
> total_tuples = old_density * unscanned_pages + scanned_tuples;
> return floor(total_tuples + 0.5);

My assumption has always been that vac_estimate_reltuples() is prone
to issues like this because it just doesn't have access to very much
information each time it runs. It can only see the delta between what
VACUUM just saw, and what the last VACUUM (or possibly the last
ANALYZE) saw according to pg_class. You're always going to find
weaknesses in such a model if you go looking for them. You're always
going to find a way to salami slice your way from good information to
total nonsense, if you pick the right/wrong test case, which runs
VACUUM in a way that allows whatever bias there may be to accumulate.
It's sort of like the way floating point values can become very
inaccurate through a process that allows many small inaccuracies to
accumulate over time.

Maybe you're right to be concerned to the degree that you're concerned
-- I'm not sure. I'm just adding what I see as important context.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 1:08 PM Andres Freund  wrote:
> I suggested nearby to only have ANALYZE dead_tuples it if there's been no
> [auto]vacuum since the stats entry was created. That allows recovering from
> stats resets, be it via crashes or explicitly. What do you think?

I like that idea. It's far simpler than the kind of stuff I was
thinking about, and probably just as effective. Even if it introduces
some unforeseen problem (which seems unlikely), we can still rely on
pgstat_report_vacuum() to set things straight before too long.

Are you planning on writing a patch for this? I'd be very interested
in seeing this through. Could definitely review it.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 1:02 PM Peter Geoghegan  wrote:
> Some of what I'm proposing arguably amounts to deliberately adding a
> bias. But that's not an unreasonable thing in itself. I think of it as
> related to the bias-variance tradeoff, which is a concept that comes
> up a lot in machine learning and statistical inference.

To be clear, I was thinking of unreservedly trusting what
pgstat_report_analyze() says about dead_tuples in the event of its
estimate increasing our dead_tuples estimate, while being skeptical
(to a varying degree) when it's the other way around.

But now I need to go think about what Andres just brought up...

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 12:44 PM Robert Haas  wrote:
> I don't know enough about the specifics of how this works to have an
> intelligent opinion about how likely these particular ideas are to
> work out. However, I think it's risky to look at estimates and try to
> infer whether they are reliable. It's too easy to be wrong. What we
> really want to do is anchor our estimates to some data source that we
> know we can trust absolutely. If you trust possibly-bad data less, it
> screws up your estimates more slowly, but it still screws them up.

Some of what I'm proposing arguably amounts to deliberately adding a
bias. But that's not an unreasonable thing in itself. I think of it as
related to the bias-variance tradeoff, which is a concept that comes
up a lot in machine learning and statistical inference.

We can afford to be quite imprecise at times, especially if we choose
a bias that we know has much less potential to do us harm -- some
mistakes hurt much more than others. We cannot afford to ever be
dramatically wrong, though -- especially in the direction of vacuuming
less often.

Besides, there is something that we *can* place a relatively high
degree of trust in that will still be in the loop here: VACUUM itself.
If VACUUM runs then it'll call pgstat_report_vacuum(), which will set
the record straight in the event of overestimating dead tuples. To
some degree the problem of overestimating dead tuples is
self-limiting.

> If Andres is correct that what really matter is the number of pages
> we're going to have to dirty, we could abandon counting dead tuples
> altogether and just count not-all-visible pages in the VM map.

That's what matters most from a cost point of view IMV. So it's a big
part of the overall picture, but not everything. It tells us
relatively little about the benefits, except perhaps when most pages
are all-visible.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
On Wed, Jan 18, 2023 at 11:02 AM Robert Haas  wrote:
> On Wed, Jan 18, 2023 at 1:31 PM Peter Geoghegan  wrote:
> > pgstat_report_analyze() will totally override the
> > tabentry->dead_tuples information that drives autovacuum.c, based on
> > an estimate derived from a random sample -- which seems to me to be an
> > approach that just doesn't have any sound theoretical basis.
>
> Yikes. I think we don't have a choice but to have a method to correct
> the information somehow, because AFAIK the statistics system is not
> crash-safe. But that approach does seem to carry significant risk of
> overwriting correct information with wrong information.

This situation is really quite awful, so maybe we should do something
about it soon, in the scope of the Postgres 16 work on autovacuum that
is already underway. In fact I think that the problem here is so bad
that even something slightly less naive would be far more effective.

You're right to point out that pgstat_report_analyze needs to update
the stats in case there is a hard crash, of course. But there is
plenty of context with which to make better decisions close at hand.
For example, I bet that pgstat_report_analyze already does a pretty
good job of estimating live_tuples -- my spiel about statistics mostly
doesn't apply to live_tuples. Suppose that we notice that its new
estimate for live_tuples approximately matches what the stats
subsystem already thought about live_tuples, while dead_tuples is far
far lower. We shouldn't be so credulous as to believe the new
dead_tuples estimate at that point.

Perhaps we can change nothing about dead_tuples at all when this
happens. Or perhaps we can set dead_tuples to a value that is scaled
from the old estimate. The new dead_tuples value could be derived by
taking the ratio of the old live_tuples to the old dead_tuples, and
then using that to scale from the new live_tuples. This is just a
first pass, to see what you and others think. Even very simple
heuristics seem like they could make things much better.
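
In other words, something like this (purely a sketch, with invented
names):

/*
 * Rather than taking ANALYZE's sampled dead_tuples at face value,
 * rescale the old dead/live ratio by ANALYZE's (far more trustworthy)
 * new live_tuples estimate.
 */
static double
rescale_dead_tuples(double old_live, double old_dead, double new_live)
{
    if (old_live <= 0)
        return old_dead;        /* nothing to scale from */
    return new_live * (old_dead / old_live);
}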

Another angle of attack is the PD_ALL_VISIBLE page-level bit, which
acquire_sample_rows() could pay attention to -- especially in larger
tables, where the difference between all pages and just the
all-visible subset of pages is most likely to matter. The more sampled
pages that had PD_ALL_VISIBLE set, the less credible the new
dead_tuples estimate will be (relative to existing information), and
so pgstat_report_analyze() should prefer the new estimate over the old
one only in proportion to that credibility.
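
Again just as a sketch (invented names), the weighting could look like
this:

/*
 * Blend the old and new dead_tuples estimates, trusting the new one
 * less as more of ANALYZE's sampled pages turn out to be
 * PD_ALL_VISIBLE.
 */
static double
blend_dead_tuples(double old_dead, double analyze_dead,
                  int sampled_pages, int sampled_allvisible)
{
    double      trust_new = 1.0;

    if (sampled_pages > 0)
        trust_new = 1.0 - ((double) sampled_allvisible / sampled_pages);

    return old_dead * (1.0 - trust_new) + analyze_dead * trust_new;
}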

We probably shouldn't change anything about pgstat_report_vacuum as
part of this effort to make pgstat_report_analyze less terrible in the
near term. It certainly has its problems (what is true for pages that
VACUUM scanned at the end of VACUUM is far from representative for new
pages!), it's probably much less of a contributor to issues like those
that Andres reports seeing.

BTW, one of the nice things about the insert-driven autovacuum stats
is that pgstat_report_analyze doesn't have an opinion about how many
tuples were inserted since the last VACUUM ran. It does have other
problems, but they seem less serious to me.

> Yep. I think what we should try to evaluate is which number is
> furthest from the truth. My guess is that the threshold is so high
> relative to what a reasonable value would be that you can't get any
> benefit out of making the dead tuple count more accurate. Like, if the
> threshold is 100x too high, or something, then who cares how accurate
> the dead tuples number is?

Right. Or if we don't make any reasonable distinction between LP_DEAD
items and dead heap-only tuples, then the total number of both things
together may matter very little. Better to be approximately correct
than exactly wrong. Deliberately introducing a bias to lower the
variance is a perfectly valid approach.

> Maybe that's even true in some scenarios, but I bet that
> it's never the issue when people have really big tables. The fact that
> I'm OK with 10MB of bloat in my 100MB table doesn't mean I'm OK with
> 1TB of bloat in my 10TB table. Among other problems, I can't even
> vacuum away that much bloat in one index pass, because autovacuum
> can't use enough work memory for that. Also, the absolute space
> wastage matters.

I certainly agree with all that.

FWIW, part of my mental model with VACUUM is that the rules kind of
change in the case of a big table. We're far more vulnerable to issues
such as (say) waiting for cleanup locks because the overall cadence
used by autovacuum is so infrequent relative to everything else.
There are more opportunities for things to go wrong, worse
consequences when they do go wrong, and greater potential for the
problems to compound.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-18 Thread Peter Geoghegan
Autovacuum shouldn't ever be a sucker. It
should pay attention to disconfirmatory signals. The information that
drives its decision making process should be treated as provisional.

Even if the information was correct at one point, the contents of the
table are constantly changing in a way that could matter enormously.
So we should be paying attention to where the table is going -- and
even where it might be going -- not just where it is, or was.

> True, although it can be overdone. An extra vacuum on a big table with
> some large indexes that end up getting scanned can be very expensive
> even if the table itself is almost entirely all-visible. We can't
> afford to make too many mistakes in the direction of vacuuming early
> in such cases.

No, but we can afford to make some -- and can detect when it happened
after the fact. I would rather err on the side of over-vacuuming,
especially if the system is smart enough to self-correct when that
turns out to be the wrong approach. One of the advantages of running
VACUUM sooner is that it provides us with relatively reliable
information about the needs of the table.

We can also cheat, sort of. If we find another justification for
autovacuuming (e.g., it's a quiet time for the system as a whole), and
it works out to help with this other problem, it may be just as good
for users.

--
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Peter Geoghegan
On Tue, Jan 17, 2023 at 2:11 PM Robert Haas  wrote:
> On Tue, Jan 17, 2023 at 3:08 PM Peter Geoghegan  wrote:
> > If you assume that there is chronic undercounting of dead tuples
> > (which I think is very common), ...
>
> Why do you think that?

For the reasons I gave about statistics, random sampling, the central
limit theorem. All that stuff. This matches the experience of Andres.
And it is obviously the only explanation behind the reliance on
antiwraparound autovacuums for cleaning up bloat in larger OLTP
databases. It just fits: the dead tuples approach can sometimes be so
completely wrong that even an alternative triggering condition based
on something that is virtually unrelated to the thing we actually care
about can do much better in practice. Consistently, reliably, for a
given table/workload.

> > How many dead heap-only tuples are equivalent to one LP_DEAD item?
> > What about page-level concentrations, and the implication for
> > line-pointer bloat? I don't have a good answer to any of these
> > questions myself.
>
> Seems a bit pessimistic. If we had unlimited resources and all
> operations were infinitely fast, the optimal strategy would be to
> vacuum after every insert, update, or delete. But in reality, that
> would be prohibitively expensive, so we're making a trade-off.

To a large degree, that's my point. I don't know how to apply this
information, so having detailed information doesn't seem like the main
problem.

> If we had an oracle that could provide us with perfect information,
> we'd ask it, among other things, how much work will be required to
> vacuum right now, and how much benefit would we get out of doing so.

And then what would we do? What about costs?

Even if we were omniscient, we still wouldn't be omnipotent. We're
still subject to the laws of physics. VACUUM would still be something
that more or less works at the level of the whole table, or not at
all. So being omniscient seems kinda overrated to me. Adding more
information does not in general lead to better outcomes.

> The dead tuple count is related to the first question. It's not a
> direct, linear relationship, but it's not completely unrelated,
> either. Maybe we could refine the estimates by gathering more or
> different statistics than we do now, but ultimately it's always going
> to be a trade-off between getting the work done sooner (and thus maybe
> preventing table growth or a wraparound shutdown) and being able to do
> more work at once (and thus being more efficient). The current system
> set of counters predates HOT and the visibility map, so it's not
> surprising if it needs updating, but if you're arguing that the whole
> concept is just garbage, I think that's an overreaction.

What I'm arguing is that principally relying on any one thing is
garbage. If you have only one thing that creates pressure to VACUUM
then there can be a big impact whenever it turns out to be completely
wrong. Whereas if VACUUM can run because of (say) 3 moderate signals
taken together, then it's much less likely that we'll be completely
wrong. In general my emphasis is on avoiding disaster in all its
forms. Vacuuming somewhat early more often is perhaps suboptimal, but
far from a disaster. It's the kind of thing that we can manage.

By all means, let's make the dead tuples/dead items stuff less naive
(e.g. make it distinguish between LP_DEAD items and dead heap-only
tuples). But even then, we shouldn't continue to completely rely on it
in the way that we do right now. In other words, I'm fine with adding
more information that is more accurate as long as we don't continue to
make the mistake of not treating it kinda suspect, and certainly not
something to completely rely on if at all possible. In particular, we
need to think about both costs and benefits at all times.

-- 
Peter Geoghegan




Re: Rework of collation code, extensibility

2023-01-17 Thread Peter Geoghegan
On Sat, Jan 14, 2023 at 12:03 PM Jeff Davis  wrote:
> The first goal I had was simply that the code was really hard to
> understand and work on, and refactoring was justified to improve the
> situation.
>
> The second goal, which is somewhat dependent on the first goal, is that
> we really need an ability to support multiple ICU libraries, and I
> wanted to do some common groundwork that would be needed for any
> approach we choose there, and provide some hooks to get us there. You
> are right that this goal influenced the first goal.

I don't disagree that it was somewhat independent of the first goal. I
just think that it makes sense to "round up to fully dependent".
Basically it's not independent enough to be worth talking about as an
independent thing, just as a practical matter - it's confusing at the
level of things like the commit message. There is a clear direction
that you're going in here from the start, and your intentions in 0001
do matter to somebody that's just looking at 0001 in isolation. That
is my opinion, at least.

The second goal is a perfectly good enough goal on its own, and one
that I am totally supportive of. Making the code clearer is icing on
the cake.

> ucol_strcollUTF8() accepts -1 to mean "nul-terminated". I did some
> basic testing and it doesn't seem like it's slower than using the
> length. If passing the length is faster for some reason, it would
> complicate the API because we'd need an entry point that's expecting
> nul-termination and lengths, which is awkward (as Peter E. pointed
> out).

That's good. I'm happy to leave it at that. I was only enquiring.
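
(For the archives, the calling convention in question is simply this --
a minimal illustration, not Postgres code, and error checking of
"status" is omitted:)

#include <unicode/ucol.h>

static UCollationResult
compare_c_strings(UCollator *coll, const char *a, const char *b)
{
    UErrorCode  status = U_ZERO_ERROR;

    /* -1 means "NUL-terminated", so no explicit lengths are needed */
    return ucol_strcollUTF8(coll, a, -1, b, -1, &status);
}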

> I felt it was a little clearer amongst the other code, to a casual
> reader, but I suppose it's a style thing. I will change it if you
> insist.

I certainly won't insist.

> I'd have to expose the pg_locale_t struct, which didn't seem desirable
> to me. Do you think it's enough of a performance concern to be worth
> some ugliness there?

I don't know. Quite possibly not. It would be nice to have some data
on that, though.

-- 
Peter Geoghegan




Re: Update comments in multixact.c

2023-01-17 Thread Peter Geoghegan
On Tue, Jan 17, 2023 at 1:33 AM shiy.f...@fujitsu.com wrote:
> I noticed that commit 5212d447fa updated some comments in multixact.c because
> SLRU truncation for multixacts is performed during VACUUM, instead of
> checkpoint. Should the following comments which mentioned checkpointer be
> changed, too?

Yes, I think so.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Peter Geoghegan
> Personally I think something like autovacuum_no_auto_cancel_age
> would be an improvement, but I also don't quite feel satisfied with it.

I don't either; but it should be strictly less unsatisfactory.

> Tracking the number of autovac failures seems uncontroversial and quite
> beneficial, even if the rest doesn't make it in. It'd at least let users
> monitor for tables where autovac is likely to swoop in in anti-wraparound
> mode.

I'll see what I can come up with.

> Perhaps its worth to separately track the number of times a backend would have
> liked to cancel autovac, but couldn't due to anti-wrap? If changes to the
> no-auto-cancel behaviour don't make it in, it'd at least allow us to collect
> more data about the prevalence of the problem and in what situations it
> occurs?  Even just adding some logging for that case seems like it'd be an
> improvement.

Hmm, maybe.

> I think with a bit of polish "Add autovacuum trigger instrumentation." ought
> to be quickly mergeable.

Yeah, I'll try to get that part out of the way quickly.

--
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-16 Thread Peter Geoghegan
On Mon, Jan 16, 2023 at 8:13 PM Dilip Kumar  wrote:
> I think that it makes sense to keep 'vacuum_freeze_strategy_threshold'
> strictly for freezing.  But the point is that the eager scanning
> strategy is driven by the freezing needs of the table (tableagefrac),
> which makes sense. But if we have selected eager scanning based on the
> table age and its freezing needs, then why don't we force eager
> freezing as well? After all, the eager scanning was selected to
> satisfy the freezing needs.

Don't think of eager scanning as the new name for aggressive mode --
it's a fairly different concept, because we care about costs now.
Eager scanning can be chosen just because it's very cheap relative to
the alternative of lazy scanning, even when relfrozenxid is still very
recent. (This kind of behavior isn't really new [1], but the exact
implementation from the patch is new.)

Tables such as pgbench_branches and pgbench_tellers will reliably use
eager scanning strategy, no matter how any GUC has been set -- just
because the added cost is always zero (relative to lazy scanning). It
really doesn't matter how far along tableagefrac here, ever. These
same tables will never use eager freezing strategy, unless the
vacuum_freeze_strategy_threshold GUC is misconfigured. (This is
another example of how scanning strategy and freezing strategy may
differ for the same table.)

You do have a good point, though. I think that I know what you mean.
Note that antiwraparound autovacuums (or VACUUMs of tables very near
to that point) *will* always use both the eager freezing strategy and
the eager scanning strategy -- which is probably close to what you
meant.

The important point is that there can be more than one reason to
prefer one strategy to another -- and the reasons can be rather
different. Occasionally it'll be a combination of two factors together
that push things in favor of one strategy over the other -- even
though either factor on its own would not have resulted in the same
choice.

[1] https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Constantly_updated_tables_.28usually_smaller_tables.29

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-16 Thread Peter Geoghegan
On Mon, Jan 16, 2023 at 8:25 AM Robert Haas  wrote:
> I really dislike formulas like Min(freeze_max_age * 2, 1 billion).
> That looks completely magical from a user perspective. Some users
> aren't going to understand autovacuum behavior at all. Some will, and
> will be able to compare age(relfrozenxid) against
> autovacuum_freeze_max_age. Very few people are going to think to
> compare age(relfrozenxid) against some formula based on
> autovacuum_freeze_max_age. I guess if we document it, maybe they will.

What do you think of Andres' autovacuum_no_auto_cancel_age proposal?

As I've said several times already, I am by no means attached to the
current formula.

> I do like the idea of driving the auto-cancel behavior off of the
> results of previous attempts to vacuum the table. That could be done
> independently of the XID age of the table.

Even when the XID age of the table has already significantly surpassed
autovacuum_freeze_max_age, say due to autovacuum worker starvation?

> If we've failed to vacuum
> the table, say, 10 times, because we kept auto-cancelling, it's
> probably appropriate to force the issue.

I suggested 1000 times upthread. 10 times seems very low, at least if
"number of times cancelled" is the sole criterion, without any
attention paid to relfrozenxid age or some other tiebreaker.

> It doesn't really matter
> whether the autovacuum triggered because of bloat or because of XID
> age. Letting either of those things get out of control is bad.

While inventing a new no-auto-cancel behavior that prevents bloat from
getting completely out of control may well have merit, I don't see why
it needs to be attached to this other effort.

I think that the vast majority of individual tables have autovacuums
cancelled approximately never, and so my immediate concern is
ameliorating cases where not being able to auto-cancel once in a blue
moon causes an outage. Sure, the opposite problem also exists, and I
think that it would be really bad if it was made significantly worse
as an unintended consequence of a patch that addressed just the first
problem. But that doesn't mean we have to solve both problems together
at the same time.

> But at that point a lot of harm has already
> been done. In a frequently updated table, waiting 300 million XIDs to
> stop cancelling the vacuum is basically condemning the user to have to
> run VACUUM FULL. The table can easily be ten or a hundred times bigger
> than it should be by that point.

The rate at which relfrozenxid ages is just about useless as a proxy
for how much wall clock time has passed with a given workload --
workloads are usually very bursty. It's much worse still as a proxy
for what has changed in the table; completely static tables have their
relfrozenxid age at exactly the same rate as the most frequently
updated table in the same database (the table that "consumes the most
XIDs"). So while antiwraparound autovacuum no-auto-cancel behavior may
indeed save the user from problems with serious bloat, it will happen
pretty much by mistake. Not that it doesn't happen all the same -- of
course it does.

That factor (the mistake factor) doesn't mean I take the point any
less seriously. What I don't take seriously is the idea that the
precise XID age was ever crucially important.

More generally, I just don't accept that this leaves us with no room for
something along the lines of what I've proposed, such as Andres'
autovacuum_no_auto_cancel_age concept. As I've said already, there will
usually be a very asymmetric quality to the problem in cases like the
Joyent outage. Even a modest amount of additional XID-space-headroom
will very likely be all that will be needed at the critical juncture.
It may not be perfect, but it still has every potential to make things
safer for some users, without making things any less safe for other
users.

--
Peter Geoghegan




Re: Show various offset arrays for heap WAL records

2023-01-16 Thread Peter Geoghegan
On Wed, Jan 11, 2023 at 3:11 PM Peter Geoghegan  wrote:
> On Wed, Jan 11, 2023 at 3:00 PM Andres Freund  wrote:
> > What are your thoughts about the place for the helper functions? You're ok
> > with rmgrdesc_utils.[ch]?
>
> Yeah, that seems okay.

BTW, while playing around with this patch today, I noticed that it
won't display the number of elements in each offset array directly.
Perhaps it's worth including that, too?

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-16 Thread Peter Geoghegan
On Mon, Jan 16, 2023 at 10:00 AM Peter Geoghegan  wrote:
> Now we treat the scanning and freezing strategies as two independent
> choices. Of course they're not independent in any practical sense, but
> I think it's slightly simpler and more elegant that way -- it makes
> the GUC vacuum_freeze_strategy_threshold strictly about freezing
> strategy, while still leading to VACUUM advancing relfrozenxid in a
> way that makes sense. It just happens as a second order effect. Why
> add a special case?

This might be a better way to explain it:

The main page-level freezing commit (commit 1de58df4) already added an
optimization that triggers page-level freezing "early" (early relative
to vacuum_freeze_min_age). This happens whenever a page already needs
to have an FPI logged inside lazy_scan_prune -- even when we're using
the lazy freezing strategy. The optimization isn't configurable, and
gets applied regardless of freezing strategy (technically there is no
such thing as freezing strategies on HEAD just yet, though HEAD still
has this optimization).

There will be workloads where the FPI optimization will result in
freezing many more pages -- especially when data checksums are in use
(since then we could easily need to log an FPI just so pruning can set
a hint bit). As a result, certain VACUUMs that use the lazy freezing
strategy will freeze in almost the same way as an equivalent VACUUM
using the eager freezing strategy. Such a "nominally lazy but actually
quite eager" VACUUM operation should get the same benefit in terms of
relfrozenxid advancement as it would if it really had used the eager
freezing strategy instead. It's fairly obvious that we'll get the same
benefit in relfrozenxid advancement (comparable relfrozenxid results
for comparable freezing work), since the way that VACUUM decides on
its scanning strategy is not conditioned on freezing strategy (whether
by the ongoing VACUUM or any other VACUUM against the same table).

All that matters is the conditions in the table (in particular the
added cost of opting for eager scanning over lazy scanning) as
indicated by the visibility map at the start of each VACUUM -- how
those conditions came about really isn't interesting at that point.
And so lazy_scan_strategy doesn't care about them when it chooses
VACUUM's scanning strategy.
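
If it helps, here is the shape of the decision I have in mind, as a
loose sketch (illustrative names and thresholds only, not the patch
itself):

    /*
     * Loose sketch of lazy_scan_strategy's scanning-strategy choice.  The
     * "added cost" of eager scanning is the all-visible-but-not-all-frozen
     * pages that lazy scanning would have skipped.  For tables like
     * pgbench_tellers that cost is ~0, so eager scanning always wins,
     * regardless of tableagefrac.
     */
    BlockNumber extra_scanned_pages = all_visible_pages - all_frozen_pages;

    if (extra_scanned_pages <= rel_pages * EAGER_SCAN_CHEAP_FRAC ||
        tableagefrac >= TABLEAGEFRAC_HIGHPOINT)
        scanstrat = VACSTRAT_EAGER_SCAN;    /* try to advance relfrozenxid */
    else
        scanstrat = VACSTRAT_LAZY_SCAN;     /* skip all-visible pages */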

There are even tables/workloads where relfrozenxid will be able to
jump forward by a huge amount whenever VACUUM chooses the eager
scanning strategy, despite the fact that VACUUM generally does little
or no freezing to make that possible:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_3

--
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-16 Thread Peter Geoghegan
On Mon, Jan 16, 2023 at 10:10 AM Peter Geoghegan  wrote:
> Attached is v16, which incorporates some of Matthias' feedback.

0001 (the freezing strategies patch) is now committable IMV. Or at
least will be once I polish the docs a bit more. I plan on committing
0001 some time next week, barring any objections.

I should point out that 0001 is far shorter and simpler than the
page-level freezing commit that already went in (commit 1de58df4). The
only thing in 0001 that seems like it might be a bit controversial
(when considered on its own) is the addition of the
vacuum_freeze_strategy_threshold GUC/reloption. Note in particular
that vacuum_freeze_strategy_threshold doesn't look like any other
existing GUC; it gets applied as a threshold on the size of the rel's
main fork at the beginning of vacuumlazy.c processing. As far as I
know there are no objections to that approach at this time, but it
does still seem worth drawing attention to now.

0001 also makes unlogged tables and temp tables always use eager
freezing strategy, no matter how the GUC/reloption are set. This seems
*very* easy to justify, since the potential downside of such a policy
is obviously extremely low, even when we make very pessimistic
assumptions. The usual cost we need to worry about when it comes to
freezing is the added WAL overhead -- that clearly won't apply when
we're vacuuming non-permanent tables. That really just leaves the cost
of dirtying extra pages, which in general could have a noticeable
system-level impact in the case of unlogged tables.

Dirtying extra pages when vacuuming an unlogged table is also a
non-issue. Even the eager freezing strategy only freezes "extra" pages
("extra" relative to the lazy strategy behavior) given a page that
will be set all-visible in any case [1]. Such a page will need to have
its page-level PD_ALL_VISIBLE bit set in any case -- which is already
enough to dirty the page. And so there can never be any additional
pages dirtied as a result of the special policy 0001 adds for
non-permanent relations.

[1] 
https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2
--
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-16 Thread Peter Geoghegan
to more cheaply get rid of the
> dead tids. Obviously we'd also set them when vacuum decides / was told not to
> do index cleanup. Yes, it'd obviously be less effective than lots of the
> things we discussed in this area (needing to re-collect the dead tids on the
> indicated), but it'd have the advantage of not needing a lot of new
> infrastructure.

I wonder if it would be possible to split up the work of VACUUM into
multiple phases that can be processed independently. The dead_items
array could be serialized and stored in a temp file. That's not the
same as some of the more complicated stuff we talked about in the last
couple of years, such as a dedicated fork for Dead TIDs. It's more
like an extremely flexible version of the same basic design for
VACUUM, with the ability to slow down and speed up based on a system
level view of things (e.g., checkpointing information). And with index
vacuuming happening on a highly deferred timeline in some cases.
Possibly we could make each slice of work processed by any available
autovacuum worker. Autovacuum workers could become "headless".

You would need some kind of state machine to make sure that critical
dependencies were respected (e.g., always do the heap vacuuming step
after all indexes are vacuumed), but that possibly isn't that hard,
and still gives you a lot.
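
Nothing like a concrete design yet, but the state machine could start
out as simple as an enum plus one gating rule -- a purely hypothetical
sketch:

    /* Purely hypothetical sketch -- nothing like this exists today */
    typedef enum VacuumPhase
    {
        VACPHASE_COLLECT_DEAD_ITEMS,    /* scan heap, spill dead TIDs to a temp file */
        VACPHASE_VACUUM_INDEXES,        /* one unit of work per index, any worker */
        VACPHASE_VACUUM_HEAP,           /* mark LP_DEAD items LP_UNUSED */
        VACPHASE_DONE
    } VacuumPhase;

    static bool
    phase_can_advance(VacuumPhase phase, int indexes_done, int nindexes)
    {
        /* the critical dependency: heap vacuuming waits for every index */
        if (phase == VACPHASE_VACUUM_INDEXES)
            return indexes_done == nindexes;

        return true;
    }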

As for this patch of mine: do you think that it would be acceptable to
pursue a version based on your autovacuum_no_auto_cancel_age design
for 16? Perhaps this can include something like
pgstat_report_autovac_failure(). It's not even the work of
implementing pgstat_report_autovac_failure() that creates risk that
it'll miss the 16 feature freeze deadline. I'm more concerned that
introducing a more complicated design will lead to the patch being
bikeshedded to death.

--
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-16 Thread Peter Geoghegan
On Sun, Jan 15, 2023 at 9:13 PM Dilip Kumar  wrote:
> I have looked into the patch set, I think 0001 looks good to me about
> 0002 I have a few questions, 0003 I haven't yet looked at

Thanks for taking a look.

> I think '(nextXID - cutoffs->relfrozenxid) / freeze_table_age' should
> be the actual fraction right?  What is the point of adding 0.5 to the
> divisor?  If there is a logical reason, maybe we can explain in the
> comments.

It's just a way of avoiding division by zero.
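
In other words, the calculation is roughly this (a paraphrase, not the
exact code from the patch):

    /*
     * tableagefrac is the fraction of freeze_table_age "used up" so far.
     * Adding 0.5 to the divisor only guards against freeze_table_age
     * being 0; it skews the result by a negligible amount.
     */
    tableagefrac = (double) (nextXID - cutoffs->relfrozenxid) /
        ((double) freeze_table_age + 0.5);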

> While looking into the logic of 'lazy_scan_strategy', I think the idea
> looks very good but the only thing is that
> we have kept eager freeze and eager scan completely independent.
> Don't you think that if a table is chosen for an eager scan
> then we should force the eager freezing as well?

Earlier versions of the patch kind of worked that way.
lazy_scan_strategy would actually use twice the GUC setting to
determine scanning strategy. That approach could make our "transition
from lazy to eager strategies" involve an excessive amount of
"catch-up freezing" in the VACUUM operation that advanced relfrozenxid
for the first time, which you see an example of here:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch

Now we treat the scanning and freezing strategies as two independent
choices. Of course they're not independent in any practical sense, but
I think it's slightly simpler and more elegant that way -- it makes
the GUC vacuum_freeze_strategy_threshold strictly about freezing
strategy, while still leading to VACUUM advancing relfrozenxid in a
way that makes sense. It just happens as a second order effect. Why
add a special case?

In principle the break-even point for eager scanning strategy (i.e.
advancing relfrozenxid) is based on the added cost only under this
scheme. There is no reason for lazy_scan_strategy to care about what
happened in the past to make the eager scanning strategy look like a
good idea. Similarly, there isn't any practical reason why
lazy_scan_strategy needs to anticipate what will happen in the near
future with freezing.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-13 Thread Peter Geoghegan
d behave that way. This phenomenon is sometimes called
alarm fatigue. It can be quite dangerous to warn people about
non-issues "out of an abundance of caution".

> 3) Autovacuums triggered by tuple thresholds persistently getting cancelled
>also regularly causes outages, and they make it more likely that an
>eventual age-based vacuum will take forever.

Really? Outages? I imagine that you'd have to be constantly hammering
the table with DDL before it could happen. That's possible, but it
seems relatively obvious that doing that is asking for trouble.
Whereas the Manta postmortem (and every similar case that I've
personally seen) involved very nasty interactions that happened due to
the way components interacted with a workload that wasn't like that.

Running DDL from a cron job or from the application may not be a great
idea, but it's also quite common.

> Aspect 1) is addressed to a good degree by the proposed split of anti-wrap
> into an age and anti-cancel triggers. And could be improved by reporting
> failsafe autovacuums in pg_stat_activity.

What you call aspect 2 (the issue with disastrous HW lock traffic jams
involving TRUNCATE being run from a cron job, etc) is a big goal of
mine for this patch series. You seem unsure of how effective my
approach (or an equally simple approach based on table age heuristics)
will be. Is that the case?

> Perhaps 2) could be improved a bit by emitting a WARNING message when we
> didn't cancel AV because it was anti-wraparound?  But eventually I think we
> somehow need to signal the "intent" of the lock drop down into ProcSleep() or
> wherever it'd be.

That's doable, but definitely seems like separate work.

> I have two ideas around 3):
>
> First, we could introduce thresholds for the tuple thresholds, after which
> autovacuum isn't cancellable anymore.

Do you think that's a good idea? I just don't trust those statistics,
at all. As in, I think they're often complete garbage.

> Second, we could track the number of cancellations since the last [auto]vacuum
> in pgstat, and only trigger the anti-cancel behaviour when autovacuum has been
> cancelled a number of times.

In theory that would be better than an approach along the lines I've
proposed, because it is directly based on a less aggressive approach
being tried a few times, and failing a few times. That part I like.

However, I also don't like several things about this approach. First
of all it relies on relatively complicated infrastructure, for
something that can be critical. Second, it will be hard to test.
Third, perhaps it would make sense to give the earlier/less aggressive
approach (a table age av that is still autocancellable) quite a large
number of attempts before giving up. If the table age isn't really
growing too fast, why not continue to be patient, possibly for quite a
long time?

Perhaps a hybrid strategy could be useful? Something like what I came
up with already, *plus* a mechanism that gives up after (say) 1000
cancellations, and escalates to no-auto-cancel, regardless of table
age. It seems sensible to assume that a less aggressive approach is
just hopeless relatively quickly (in wall clock time and logical XID
time) once we see sufficiently many cancellations against the same
table.
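
Expressed as a sketch (the names and the 1000 constant are
placeholders, not a worked-out proposal):

    /*
     * Placeholder sketch of the hybrid: table age alone can escalate to
     * the no-auto-cancel behavior (as in the current patch), but so can a
     * sufficiently long run of cancellations, regardless of table age.
     */
    #define AUTOVAC_CANCEL_GIVEUP   1000

    static bool
    av_prevent_auto_cancel(uint32 relfrozenxid_age,
                           uint32 cancels_since_last_vacuum)
    {
        if (relfrozenxid_age > prevent_auto_cancel_age)
            return true;        /* table age rule, per the existing patch */
        if (cancels_since_last_vacuum >= AUTOVAC_CANCEL_GIVEUP)
            return true;        /* give up after many cancellations */

        return false;
    }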

--
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-13 Thread Peter Geoghegan
ze_max_age * 2, 1 billion)
>   prevent_auto_cancel = relfrozenxid < recentXid - prevent_auto_cancel_age
>
> Is that right?

That summary looks accurate, but I'm a bit confused about why you're
asking the question this way. I thought that it was obvious that the
patch doesn't change most of these things.

The only mechanism that the patch changes is related to "prevent
auto-cancellation" behaviors -- which is now what the term
"antiwraparound" refers to. It does change the name of "autovacuum
based on age", though -- the name is now "table age autovacuum" (the
old name was antiwraparound autovacuum, of course). As I pointed out
to you already, it's mechanically impossible for any autovacuum to be
antiwraparound unless it's an XID table age/MXID table age autovacuum.

The naming convention I propose here makes it a little confusing for
us to discuss, but it seems like the best thing for users. Users'
basic intuitions about antiwraparound autovacuums (that they're scary
things needed because wraparound is starting to become a real concern)
don't need to change. If anything they become more accurate, because
antiwraparound autovacuums become non-routine -- which is really how
it should have been when autovacuum was first added IMV. Users have
rather good reasons to find antiwraparound autovacuums scary, even
though that's kind of wrong (it's really our fault for making it so
confusing for them, not their fault for being wrong).

> One thing I just noticed: Isn't it completely bonkers that we compute
> recentXid/recentMulti once at the start of a worker in
> relation_needs_vacanalyze()?  That's fine for the calls in do_autovacuum()'s
> initial loops over all tables. But seems completely wrong for the later calls
> via table_recheck_autovac() -> recheck_relation_needs_vacanalyze() ->
> relation_needs_vacanalyze()?
>
> These variables really shouldn't be globals. It makes sense to cache them
> locally in do_autovacuum(), but reusing them
> recheck_relation_needs_vacanalyze() and sharing it between launcher and worker
> is bad.

I am not sure. I do hope that there isn't some subtle way in which the
design relies on that. It seems obviously weird, and so I have to
wonder if there is a reason behind it that isn't immediately apparent.

--
Peter Geoghegan


v4-0002-Add-table-age-trigger-concept-to-autovacuum.patch
Description: Binary data


v4-0001-Add-autovacuum-trigger-instrumentation.patch
Description: Binary data


Re: Rework of collation code, extensibility

2023-01-13 Thread Peter Geoghegan
On Wed, Jan 11, 2023 at 3:44 PM Jeff Davis  wrote:
> Attached trivial rebase as v6.

Some review comments for this v6.

Comments on 0001-*:

* I think that 0002-* can be squashed with 0001-*, since there isn't
any functional reason why you'd want to commit the strcoll() and
strxfrm() changes separately.

Sometimes it can be useful to break things up, despite the fact that
it couldn't possibly make sense to commit just one of the resulting
patches on its own. However, I don't think that that's appropriate
here. There is no apparent conceptual boundary that you're
highlighting by splitting things up like this. strxfrm() and strcoll()
are defined in terms of each other -- they're siblings, joined at the
hip -- so this seems kinda jarring.

* Your commit message for 0001 (and other patches) don't set things up
by describing what the point is, and what the work anticipates. I
think that they should do that.

You're adding a layer of indirection that's going to set things up for
later patches that add a layer of indirection for version ICU
libraries (and even libc itself), and some of the details only make
sense in that context. This isn't just refactoring work that could
just as easily have happened in some quite different context.

* I'm not sure that pg_strcoll() should be breaking ties itself. We
break ties using strcmp() for historical reasons, but must not do that
for nondeterministic ICU collations -- a distinction that may be obscured here.

That means that pg_strcoll()'s relationship to pg_strxfrm()'s isn't
the same as the well known strcoll()/strxfrm() relationship. That kind
of makes pg_strcoll() somewhat more than a strcoll() shim, which is
inconsistent. Another concern is that the deterministic collation
handling isn't handled in any one layer, which would have been nice.

Do we need to do things this way? What's it adding?

* varstrfastcmp_locale() is no longer capable of calling
ucol_strcollUTF8() through the shim interface, meaning that it has to
determine string length based on NUL-termination, when in principle it
could just use the known length of the string.

Presumably this might have performance implications. Have you thought
about that?

Some comments on 0002-*:

* I don't see much point in this new varstr_abbrev_convert() variable:

+   const size_t max_prefix_bytes = sizeof(Datum);

varstr_abbrev_convert() is concerned with packing abbreviated key
bytes into Datums, so it's perfectly reasonable to deal with
Datums/sizeof(Datum) directly.

* Having a separate pg_strxfrm_prefix_libc() function just to throw an
error doesn't really add much IMV.

Comments on 0003-*:

I suggest that some of the very trivial functions you have here (such
as pg_locale_deterministic()) be made inline functions.
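
For instance, assuming pg_locale_deterministic() stays the trivial
accessor it is in the patch, it could live in the header as:

    /* sketch: move the trivial accessor into the header as a static inline */
    static inline bool
    pg_locale_deterministic(pg_locale_t locale)
    {
        /* default locale is always deterministic */
        if (locale == NULL)
            return true;

        return locale->deterministic;
    }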

Comments on 0006-*:

* get_builtin_libc_library() could be indented in a way that would
make it easier to understand.

--
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-12 Thread Peter Geoghegan
make sure that you stay well hydrated -- if the problem
doesn't go away after a few days, then call back, reassess. Perhaps it
really will be a brain tumor, but there is nothing to gain and
everything to lose by taking such drastic action at the first sign of
trouble.

> If we cancel ourselves too, we're just postponing resolution of the
> problem to some future point when we decide to stop cancelling
> ourselves. That's not a win.

It's also only a very minor loss, relative to what would have happened
without any of this. This is something that we can be relatively sure
of (unlike anything about final outcomes). It's clear that we have a
lot to gain. What do we have to lose, really?

> > I think that users will really appreciate having only one kind of
> > VACUUM/autovacuum (since the other patch gets rid of discrete
> > aggressive mode VACUUMs). I want "table age autovacuuming" (as I
> > propose to call it) come to be seen as not any different to any other
> > autovacuum, such as an "insert tuples" autovacuum or a "dead tuples"
> > autovacuum. The difference is only in how autovacuum.c triggers the
> > VACUUM, not in any runtime behavior. That's an important goal here.
>
> I don't agree with that goal. I think that having different kinds of
> autovacuums with different, identifiable names and corresponding,
> easily-identifiable behaviors is really important for troubleshooting.

You need to distinguish between different types of autovacuums and
different types of VACUUMs here. Sure, it's valuable to have
information about why autovacuum launched a VACUUM, and the patch
greatly improves that. But runtime behavior is another story.

It's not really generic behavior -- more like generic policies that
produce different behavior under different runtime conditions. VACUUM
has always had generic policies about how to do things, at least up
until the introduction of the visibility map, which added
scan_all/aggressive VACUUMs, and the vacuum_freeze_table_age GUC. The
policy should be the same in every VACUUM; the behavior itself emerges
from that policy.

> Trying to remove those distinctions and make everything look the same
> will not keep autovacuum from getting itself into trouble. It will
> just make it harder to understand what's happening when it does.

The point isn't to have every VACUUM behave in the same way. The point
is to make decisions dynamically, based on the observed conditions in
the table. And to delay committing to things until there really is no
alternative, to maximize our opportunities to avoid disaster. In
short: loose, springy behavior.

Imposing absolute obligations on VACUUM has the potential to create
lots of problems. It is sometimes necessary, but can easily be
overused, making a bad situation much worse.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-12 Thread Peter Geoghegan
On Thu, Jan 12, 2023 at 9:12 AM Robert Haas  wrote:
> I do agree that it's good to slowly increase the aggressiveness of
> VACUUM as we get further behind, rather than having big behavior
> changes all at once, but I think that should happen by smoothly
> varying various parameters rather than by making discrete behavior
> changes at a whole bunch of different times.

In general I tend to agree, but, as you go on to acknowledge yourself,
this particular behavior is inherently discrete. Either the
PROC_VACUUM_FOR_WRAPAROUND behavior is in effect, or it isn't.

In many important cases the only kind of autovacuum that ever runs
against a certain big table is antiwraparound autovacuum. And
therefore every autovacuum that runs against the table must
necessarily not be auto cancellable. These are the cases where we see
disastrous interactions with automated DDL, such as a TRUNCATE run by
a cron job (to stop those annoying antiwraparound autovacuums) -- a
heavyweight lock traffic jam that causes the application to lock up.

All that I really want to do here is give an autovacuum that *can* be
auto cancelled *some* non-zero chance to succeed with these kinds of
tables. TRUNCATE completes immediately, so the AEL is no big deal.
Except when it's blocked behind an antiwraparound autovacuum. That
kind of interaction is occasionally just disastrous. Even just the
tiniest bit of wiggle room could avoid it in most cases, possibly even
almost all cases.

> Maybe that's not the right idea, I don't know, and a naive
> implementation might be worse than nothing, but I think it has some
> chance of being worth consideration.

It's a question of priorities. The failsafe isn't supposed to be used
(when it is it is a kind of a failure), and so presumably only kicks
in on very rare occasions, where nobody was paying attention anyway.
So far I've heard no complaints about this, but I've heard lots of
complaints about the antiwrap autocancellation behavior.

> The behavior already changes when you hit
> vacuum_freeze_min_age and then again when you hit
> vacuum_freeze_table_age and then there's also
> autoovacuum_freeze_max_age and xidWarnLimit and xidStopLimit and a few
> others, and these setting all interact in pretty complex ways. The
> more conditional logic we add to that, the harder it becomes to
> understand what's actually happening.

In general I strongly agree. In fact that's a big part of what
motivates my ongoing work on VACUUM. The user experience is central.

As Andres pointed out, presenting antiwraparound autovacuums as kind
of an emergency thing but also somehow a routine thing is just
horribly confusing. I want to make them into an emergency thing in
every sense -- something that you as a user can reasonably expect to
never see (like the failsafe). But if you do see one, then that's a
useful signal of an underlying problem with contention, say from
automated DDL that pathologically cancels autovacuums again and again.

> Now, you might reply to the above by saying, well, some behaviors
> can't vary continuously. vacuum_cost_limit can perhaps be phased out
> gradually, but autocancellation seems like something that you must
> either do, or not do. I would agree with that. But what I'm saying is
> that we ought to favor having those kinds of behaviors all engage at
> the same point rather than at different times.

Right now aggressive VACUUMs do just about all freezing at the same
time, to the extent that many users seem to think that it's a totally
different thing with totally different responsibilities to regular
VACUUM. Doing everything at the same time like that causes huge
practical problems, and is very confusing.

I think that users will really appreciate having only one kind of
VACUUM/autovacuum (since the other patch gets rid of discrete
aggressive mode VACUUMs). I want "table age autovacuuming" (as I
propose to call it) to come to be seen as not any different to any other
autovacuum, such as an "insert tuples" autovacuum or a "dead tuples"
autovacuum. The difference is only in how autovacuum.c triggers the
VACUUM, not in any runtime behavior. That's an important goal here.

> I did take a look at the post-mortem to which you linked, but I am not
> quite sure how that bears on the behavior change under discussion.

The post-mortem involved a single "DROP TRIGGER" that caused chaos
when it interacted with the auto cancellation behavior. It would
usually complete instantly, so the AEL wasn't actually disruptive,
but one day antiwraparound autovacuum made the cron job effectively
block all reads and writes for hours.

The similar outages I was called in to help with personally had either
an automated TRUNCATE or an automated CREATE INDEX. Had autovacuum
only been willing to yield once or twice, then it probably would have
been fine -- the situation probably would have worked itself out
naturally. That's the best outcome you can hope for.

--
Peter Geoghegan




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-11 Thread Peter Geoghegan
On Mon, Jan 9, 2023 at 12:58 PM Peter Geoghegan  wrote:
> On Mon, Jan 9, 2023 at 11:57 AM Andres Freund  wrote:
> > Afaict we'll need to backpatch this all the way?
>
> I thought that we probably wouldn't need to, at first. But I now think
> that we really have to.

Attached is v4. This is almost the same as v3. The only notable change
is in how the issue is explained in comments, and in the commit
message.

I have revised my opinion on this question once more. In light of what
has come to light about the issue from recent testing, I lean towards
a HEAD-only commit once again. What do you think?

I still hope to be able to commit this on my original timeline (on
Monday or so), without the issue taking up too much more of
everybody's time.

Thanks
-- 
Peter Geoghegan


v4-0001-Tighten-up-VACUUM-s-approach-to-setting-VM-bits.patch
Description: Binary data


Re: Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans

2023-01-11 Thread Peter Geoghegan
On Wed, Jan 11, 2023 at 4:44 PM Andres Freund  wrote:
> > Attached patch fixes up these issues. It's almost totally mechanical.
>
> Looks better, thanks!

Pushed that just now.

Thanks
-- 
Peter Geoghegan




Re: Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans

2023-01-11 Thread Peter Geoghegan
On Mon, Jan 9, 2023 at 2:18 PM Peter Geoghegan  wrote:
> I'll try to get back to it this week.

Attached patch fixes up these issues. It's almost totally mechanical.

(Ended up using "git diff --color-moved=dimmed-zebra
--color-moved-ws=ignore-all-space" with this, per your recent tip,
which did help.)

--
Peter Geoghegan


v1-0001-Rename-freeze-plan-dedup-routines.patch
Description: Binary data


Re: pgsql: Delay commit status checks until freezing executes.

2023-01-11 Thread Peter Geoghegan
On Wed, Jan 11, 2023 at 3:06 PM Andres Freund  wrote:
> > + * We can't use TransactionIdDidAbort here because it won't treat transactions
> > + * that were in progress during a crash as aborted by now.  We determine that
> > + * transactions aborted/crashed through process of elimination instead.
>
> s/by now//?

Did it that way in the commit I pushed just now.

Thanks
-- 
Peter Geoghegan




Re: Show various offset arrays for heap WAL records

2023-01-11 Thread Peter Geoghegan
On Wed, Jan 11, 2023 at 3:00 PM Andres Freund  wrote:
> What are your thoughts about the place for the helper functions? You're ok
> with rmgrdesc_utils.[ch]?

Yeah, that seems okay.

We may well need to put more stuff in that file. We're overdue a big
overhaul of the rmgr output, so that everybody uses the same format
for everything. We made some progress on that for 16 already, by
standardizing on the name snapshotConflictHorizon, but a lot of
annoying inconsistencies still remain. Like the punctuation issue you
mentioned.

Ideally we'd be able to make the output more easy to manipulate via
the SQL interface from pg_walinspect, or perhaps via scripting. That
would require some rules that are imposed top-down, so that consumers
of the data can make certain general assumptions. But that's fairly
natural. It's not like there is just inherently a great deal of
diversity that needs to be considered. For example, the WAL records
used by each individual index access method are all very similar. In
fact the most important index AM WAL records used by each index AM
(e.g. insert, delete, vacuum) have virtually the same format as each
other already.

-- 
Peter Geoghegan




Re: Show various offset arrays for heap WAL records

2023-01-11 Thread Peter Geoghegan
On Tue, Jan 10, 2023 at 11:35 AM Andres Freund  wrote:
> Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
> void(*rm_desc) (StringInfo buf, XLogReaderState *record);
>
> so we'd need to patch all of them. That might be worth doing at some point,
> but I don't want to tackle it right now.

Okay. Let's just get the basics in soon, then.

I would like to have a similar capability for index access methods,
but mostly just for investigating performance. Whenever we've really
needed something like this for debugging it seems to have been a
heapam thing, just because there's a lot more that can go wrong with
pruning, which is spread across many different places.

-- 
Peter Geoghegan




Re: pgsql: Delay commit status checks until freezing executes.

2023-01-11 Thread Peter Geoghegan
On Sat, Jan 7, 2023 at 7:25 PM Andres Freund  wrote:
> Probably a good idea, although it doesn't neatly fit right now.

I'll leave it for now.

Attached is v2, which changes things based on your feedback. Would
like to get this out of the way soon.

> > Also, does amcheck's get_xid_status() need a reference to these rules?
>
> Don't think so? Whad made you ask?

Just the fact that it seems to more or less follow the protocol
described at the top of heapam_visibility.c. Not very important,
though.

-- 
Peter Geoghegan


v2-0001-Improve-TransactionIdDidAbort-comments.patch
Description: Binary data


Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-01-11 Thread Peter Geoghegan
On Wed, Jan 11, 2023 at 11:18 AM Andres Freund  wrote:
> I don't like that - it's also quite useful to disable use of ringbuffers when
> you actually need to clean up indexes. Especially when we have a lot of dead
> tuples we'll rescan indexes over and over...

That's a fair point.

My vote goes to "REUSE_BUFFERS", then.

-- 
Peter Geoghegan




Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-01-11 Thread Peter Geoghegan
On Wed, Jan 11, 2023 at 10:58 AM Andres Freund  wrote:
> Any idea about the name? The obvious thing is to reference ring buffers in the
> option name, but that's more of an implementation detail...

What are the chances that anybody using this feature via a manual
VACUUM command will also use INDEX_CLEANUP off? It's not really
supposed to be used routinely, at all. Right? It's just for
emergencies.

Perhaps it can be tied to INDEX_CLEANUP=off? That makes it hard to get
just the behavior you want when testing VACUUM, but maybe that doesn't
matter.

Realistically, most of the value here comes from changing the failsafe
behavior, which doesn't require the user to know anything about
VACUUM. I know that AWS has reduced the vacuum_failsafe_age default on
RDS to 1.2 billion (a decision made before I joined Amazon), so it is
already something AWS leans on quite a bit where available.

-- 
Peter Geoghegan




Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-01-11 Thread Peter Geoghegan
On Wed, Jan 11, 2023 at 10:27 AM Andres Freund  wrote:
> Therefore I'd like to add an option to the VACUUM command to use to disable
> the use of the ringbuffer. Not sure about the name yet.

Sounds like a good idea.

> I think we should auto-enable that mode once we're using the failsafe mode,
> similar to [auto]vacuum cost delays getting disabled
> (c.f. lazy_check_wraparound_failsafe()). If things are bad enough that we're
> soon going to shut down, we want to be aggressive.

+1

-- 
Peter Geoghegan




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Peter Geoghegan
On Tue, Jan 10, 2023 at 4:39 PM Peter Geoghegan  wrote:
> * Run VACUUM FREEZE. We need FREEZE in order to be able to hit the
> relevant visibilitymap_set() call site (the one that passes
> VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
> VISIBILITYMAP_ALL_VISIBLE).
>
> Now all_visible_according_to_vm is set to true, but we don't have a
> lock/pin on the same heap page just yet.
>
> * Acquire several non-conflicting row locks on a row on the block in
> question, so that a new multi is allocated.

Forgot to mention that there needs to be a HOT update mixed in with
these SELECT ... FOR SHARE row lockers, too, which must abort once its
XID has been added to a multi. Obviously heap_lock_tuple() won't ever
unset VISIBILITYMAP_ALL_VISIBLE or PD_ALL_VISIBLE (it only ever clears
VISIBILITYMAP_ALL_FROZEN) -- so we need a heap_update() to clear all
of these status bits.

This enables the assertion to fail because:

* Pruning can get rid of the aborted successor heap-only tuple right
away, so it is not going to block us from setting the page all_visible
(that just leaves the original tuple to consider).

* The original tuple's xmax is a Multi, so it won't automatically be
ineligible for freezing because it's > OldestXmin in this scenario.

* FreezeMultiXactId() processing will completely remove xmax, without
caring too much about cutoffs like OldestXmin -- it only cares about
whether each individual XID needs to be kept or not.

(Granted, FreezeMultiXactId() will only remove xmax like this because
I deliberately removed its FRM_NOOP handling, but that is a very
delicate thing to rely on, especially from such a great distance. I
can't imagine that the fact that it doesn't fail on HEAD comes down to
anything beyond sheer luck.)

--
Peter Geoghegan




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Peter Geoghegan
On Tue, Jan 10, 2023 at 12:08 PM Peter Geoghegan  wrote:
> Actually, FreezeMultiXactId() can fully remove an xmax that has some
> member XIDs >= OldestXmin, provided FRM_NOOP processing isn't
> possible, at least when no individual member is still running. Doesn't
> have to involve transaction aborts at all.
>
> Let me go try to break it that way...

Attached patch shows how this could break.

It adds an assertion that checks that the expected
PD_ALL_VISIBLE/VM_ALL_VISIBLE() conditions hold at the right point. It
also comments out FreezeMultiXactId()'s FRM_NOOP handling.

The FRM_NOOP case is really just an optimization, and shouldn't be
needed for correctness. This is amply demonstrated by running "meson
test" with the patch applied, which will pass without incident.

I can get the PD_ALL_VISIBLE assertion to fail by following this
procedure with the patch applied:

* Run a plain VACUUM to set all the pages from a table all-visible,
but not all-frozen.

* Set a breakpoint that will hit after all_visible_according_to_vm is
set to true, for an interesting blkno.

* Run VACUUM FREEZE. We need FREEZE in order to be able to hit the
relevant visibilitymap_set() call site (the one that passes
VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
VISIBILITYMAP_ALL_VISIBLE).

Now all_visible_according_to_vm is set to true, but we don't have a
lock/pin on the same heap page just yet.

* Acquire several non-conflicting row locks on a row on the block in
question, so that a new multi is allocated.

* End every session whose XID is stored in our multi (commit/abort).

* Within GDB, continue from before -- observe assertion failure.

Obviously this scenario doesn't demonstrate the presence of a bug --
not quite. But it does prove that we rely on FRM_NOOP to not allow the
VM to become corrupt, which just doesn't make any sense, and can't
have been intended. At a minimum, it strongly suggests that the
current approach is very fragile.

-- 
Peter Geoghegan


0001-Debug-freeze-map-issue.patch
Description: Binary data


Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Peter Geoghegan
On Tue, Jan 10, 2023 at 12:19 PM Robert Haas  wrote:
> I don't understand what distinction you're making. It seems like
> hair-splitting to me. We should be able to reproduce problems like
> this reliably, at least with the aid of a debugger and some
> breakpoints, before we go changing the code.

So we can *never* change something defensively, on the basis of a
suspected or theoretical hazard, either in backbranches or just on
HEAD? Not under any circumstances, ever?

> The risk of being wrong
> is quite high because the code is subtle, and the consequences of
> being wrong are potentially very bad because the code is critical to
> data integrity. If the reproducer doesn't require a debugger or other
> extreme contortions, then we should consider reducing it to a TAP test
> that can be committed. If you agree with that, then I'm not sure what
> your last email was complaining about.

I was complaining about your prescribing conditions on proceeding with
a commit, based on an understanding of things that you yourself
acknowledged as incomplete. I cannot imagine how you read that as an
unwillingness to test the issue, especially given that I agreed to
work on that before you chimed in.

> > I have been unable to reproduce the problem, and think it's possible
> > that the issue cannot be triggered in practice. Though only through
> > sheer luck. Here's why that is:

> I guess I'm not very sure that this is sheer luck.

That's just my characterization. Other people can make up their own minds.

> For the purposes of clarifying my understanding, is this the code
> you're principally worried about?

> visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
>   vmbuffer, InvalidTransactionId,
>   VISIBILITYMAP_ALL_FROZEN);

Obviously I meant this call site, since it's the only one that passes
VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
VISIBILITYMAP_ALL_VISIBLE -- in vacuumlazy.c, and in general.

The other visibilitymap_set() callsite that you quoted is from the
second heap pass, where LP_DEAD items are vacuumed and become
LP_UNUSED items. That isn't buggy, but it is a silly approach, in that
it cares about what the visibility map says about the page being
all-visible, as if it might take a dissenting view that needs to be
taken into consideration (obviously we know what's going on with the
page because we just scanned it ourselves, and determined that it was
at least all-visible).

-- 
Peter Geoghegan




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Peter Geoghegan
On Tue, Jan 10, 2023 at 11:47 AM Peter Geoghegan  wrote:
> In summary, I think that there is currently no way that we can have
> the VM (or the PD_ALL_VISIBLE flag) concurrently unset, while leaving
> the page all_frozen. It can happen and leave the page all_visible, but
> not all_frozen, due to these very fine details. (Assuming I haven't
> missed another path to the problem with aborted Multis or something,
> but looks like I haven't.)

Actually, FreezeMultiXactId() can fully remove an xmax that has some
member XIDs >= OldestXmin, provided FRM_NOOP processing isn't
possible, at least when no individual member is still running. Doesn't
have to involve transaction aborts at all.

Let me go try to break it that way...

-- 
Peter Geoghegan




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-10 Thread Peter Geoghegan
On Tue, Jan 10, 2023 at 10:50 AM Robert Haas  wrote:
> Look, I don't want to spend time arguing about what seem to me to be
> basic principles of good software engineering. When I don't put test
> cases into my patches, people complain at me and tell me that I'm a
> bad software engineer because I didn't include test cases. Your
> argument here seems to be that you're such a good software engineer
> that you don't need any test cases to know what the bug is or that
> you've fixed it correctly.

That's not what I said. This is a straw man.

What I actually said was that there is no reason to declare up front
that the only circumstances under which a fix could be committed is
when a clean repro is available. I never said that a test case has
little or no value, and I certainly didn't assert that we definitely
don't need a test case to proceed with a commit -- since I am not in
the habit of presumptuously attaching conditions to such things well
in advance.

> I don't particularly appreciate the implication that I either lack
> relevant or expertise or don't actually take time, either.

The implication was only that you didn't take the time. Clearly you
have the expertise. Obviously you're very far from stupid.

I have been unable to reproduce the problem, and think it's possible
that the issue cannot be triggered in practice. Though only through
sheer luck. Here's why that is:

While pruning will remove aborted dead tuples, freezing will not
remove the xmax of an aborted update unless the XID happens to be <
OldestXmin. With my problem scenario, the page will be all_visible in
prunestate, but not all_frozen -- so it dodges the relevant
visibilitymap_set() call site.

That just leaves inserts that abort, I think. An aborted insert will
be totally undone by pruning, but that does still leave behind an
LP_DEAD item that needs to be vacuumed in the second heap pass. This
means that we can only set the page all-visible/all-frozen in the VM
in the second heap pass, which also dodges the relevant
visibilitymap_set() call site.

In summary, I think that there is currently no way that we can have
the VM (or the PD_ALL_VISIBLE flag) concurrently unset, while leaving
the page all_frozen. It can happen and leave the page all_visible, but
not all_frozen, due to these very fine details. (Assuming I haven't
missed another path to the problem with aborted Multis or something,
but looks like I haven't.)

-- 
Peter Geoghegan




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-09 Thread Peter Geoghegan
On Mon, Jan 9, 2023 at 12:58 PM Peter Geoghegan  wrote:
> I didn't realize that affected visibilitymap_set() calls could
> generate useless set-VM WAL records until you pointed it out. That's
> far more likely to happen than the race condition that I described --
> it has nothing at all to do with concurrency. That's what clinches it
> for me.

I didn't spend as much time on this as I'd like to so far, but I think
that this concern about visibilitymap_set() actually turns out to not
apply. The visibilitymap_set() call in question is gated by a
"!VM_ALL_FROZEN()", which is enough to avoid the problem with writing
useless VM set records.

That doesn't make me doubt my original concern about races where the
all-frozen bit can be set, without setting the all-visible bit, and
without accounting for the fact that it changed underneath us. That
scenario will have !VM_ALL_FROZEN(), so that won't save us. (And we
won't test VM_ALL_VISIBLE() or PD_ALL_VISIBLE in a way that is
sufficient to realize that all_visible_according_to_vm is stale.
prunestate.all_visible being set doesn't reliably indicate that it's not stale,
but lazy_scan_heap incorrectly believes that it really does work that way.)
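
To put the same thing in code terms, the call site at issue is shaped
like this (paraphrased from lazy_scan_heap, details trimmed):

    /*
     * Paraphrased sketch of the hazard.  all_visible_according_to_vm was
     * read before the heap page was pinned/locked, so it can be stale by
     * the time control reaches here.  The !VM_ALL_FROZEN() gate avoids
     * useless WAL records, but it can't detect that the all-visible bit
     * was concurrently cleared.
     */
    if (all_visible_according_to_vm && prunestate.all_visible &&
        prunestate.all_frozen &&
        !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
    {
        /*
         * Sets only the all-frozen bit.  If the all-visible bit (and
         * PD_ALL_VISIBLE) were cleared in the meantime, the VM is left in
         * the invalid "all-frozen but not all-visible" state.
         */
        visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
                          vmbuffer, InvalidTransactionId,
                          VISIBILITYMAP_ALL_FROZEN);
    }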


--
Peter Geoghegan




Re: Show various offset arrays for heap WAL records

2023-01-09 Thread Peter Geoghegan
On Mon, Jan 9, 2023 at 1:58 PM Andres Freund  wrote:
> A couple times when investigating data corruption issues, the last time just
> yesterday in [1], I needed to see the offsets affected by PRUNE and VACUUM
> records. As that's probably not just me, I think we should make that change
> in-tree.

I remember how useful this was when we were investigating that early
bug in 14, that turned out to be in parallel VACUUM. So I'm all in
favor of it.

> The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM,
> XLOG_HEAP2_FREEZE_PAGE.

I'm bound to end up doing the same in index access methods. Might make
sense for the utility routines to live somewhere more centralized, at
least when code reuse is likely. Practically every index AM has WAL
records that include a sorted page offset number array, just like
these ones. It's a very standard thing, obviously.

> I chose to include infomask[2] for the different freeze plans mainly because
> it looks odd to see different plans without a visible reason. But I'm not sure
> that's the right choice.

I don't think that it is particularly necessary to do so in order for
the output to make sense -- pg_waldump is inherently a tool for
experts. What it comes down to for me is whether or not this
information is sufficiently useful to display, and/or can be (or needs
to be) controlled via some kind of verbosity knob.

I think that it easily could be useful, and I also think that it
easily could be a bit annoying. How hard would it be to invent a
general mechanism to control the verbosity of what we'll show for each
WAL record?

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-09 Thread Peter Geoghegan
On Mon, Jan 9, 2023 at 5:22 PM Andres Freund  wrote:
> I've also seen the inverse, with recent versions of postgres: Autovacuum can
> only ever make progress if it's an anti-wraparound vacuum, because it'll
> always get cancelled otherwise.  I'm worried that substantially increasing the
> time until an anti-wraparound autovacuum happens will lead to more users
> running into out-of-xid shutdowns.
>
> I don't think it's safe to just increase the time at which anti-wrap vacuums
> happen to a hardcoded 1 billion.

That's not what the patch does. It doubles the time that the anti-wrap
no-autocancellation behaviors kick in, up to a maximum of 1 billion
XIDs/MXIDs. So it goes from autovacuum_freeze_max_age to
autovacuum_freeze_max_age x 2, without changing the basic fact that we
initially launch autovacuums that advance relfrozenxid/relminmxid when
the autovacuum_freeze_max_age threshold is first crossed.

These heuristics are totally negotiable -- and likely should be
thought out in more detail. It's likely that most of the benefit of
the patch comes from simply trying to advance relfrozenxid without the
special auto-cancellation behavior one single time. The main problem
right now is that the threshold that launches most antiwraparound
autovacuums is exactly the same as the threshold that activates the
auto-cancellation protections. Even doing the latter very slightly
later than the former could easily make things much better, while
adding essentially no risk of the kind you're concerned about.

> I'm also doubtful that it's ok to just make all autovacuums on relations with
> an age > 1 billion anti-wraparound ones. For people that use a large
> autovacuum_freeze_max_age that will be a rude awakening.

Actually, users that have autovacuum_freeze_max_age set to over 1
billion will get exactly the same behavior as before (except that the
instrumentation of autovacuum will be better). It'll be identical.

If you set autovacuum_freeze_max_age to 2 billion, and a "standard"
autovacuum is launched on a table whose relfrozenxid age is 1.5
billion, it'll just be a regular dead tuples/inserted tuples
autovacuum, with the same old familiar locking characteristics as
today.

-- 
Peter Geoghegan




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-09 Thread Peter Geoghegan
On Mon, Jan 9, 2023 at 12:51 PM Robert Haas  wrote:
> I feel that you should at least have a reproducer for these problems
> posted to the thread, and ideally a regression test, before committing
> things. I think it's very hard to understand what the problems are
> right now.

Hard to understand relative to what, exactly? We're talking about a
very subtle race condition here.

I'll try to come up with a reproducer, but I *utterly* reject your
assertion that it's a hard requirement, sight unseen. Why should those
be the parameters of the discussion?

For one thing I'm quite confident that I'm right, with or without a
reproducer. And my argument isn't all that hard to follow, if you have
relevant expertise, and actually take the time. But even this is
unlikely to matter much. Even if I somehow turn out to have been
completely wrong about the race condition, it is still self-evident
that the problem of uselessly WAL logging non-changes to the VM
exists. That doesn't require any concurrent access at all. It's a
natural consequence of calling visibilitymap_set() with
VISIBILITYMAP_ALL_FROZEN-only flags. You need only look at the code
for 2 minutes to see it.

> I don't particularly have a problem with the idea of 0001, because if
> we use InvalidTransactionId to mean that there cannot be any
> conflicts, we do not need FrozenTransactionId to mean the same thing.
> Picking one or the other makes sense.

We've already picked one, many years ago -- InvalidTransactionId. This
is a long established convention, common to all REDO routines that are
capable of creating granular conflicts.

I already committed 0001 over a week ago. We were calling
ResolveRecoveryConflictWithSnapshot with FrozenTransactionId arguments
before now, which was 100% guaranteed to be a waste of cycles. I saw
no need to wait more than a few days for a +1, given that this
particular issue was so completely clear cut.

--
Peter Geoghegan




Re: Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans

2023-01-09 Thread Peter Geoghegan
On Mon, Jan 9, 2023 at 1:43 PM Andres Freund  wrote:
> ISTM that some of the page level freezing functions are misnamed. In heapam.c
> the heap_xlog* routines are for replay, afaict. However
> heap_xlog_freeze_plan() is used to WAL log the freeze
> plan. heap_xlog_freeze_page() is used to replay that WAL record. Probably your
> brain is too used to nbtree/ :).

Sometimes I wonder why other people stubbornly insist on not starting
every function name with an underscore.   :-)

> I think s/heap_xlog_freeze/heap_log_freeze/ would mostly do the trick, except
> that heap_xlog_new_freeze_plan() doesn't quite fit in the scheme.

> The routines then also should be moved a bit up, because right now they're
> inbetween other routines doing WAL replay, adding to the confusion.

I believe that I used this scheme because of the fact that the new
functions were conceptually related to REDO routines, even though they
run during original execution. I'm quite happy to revise the code
based on your suggestions, though.

> The memcpy in heap_xlog_freeze_page() seems a tad odd. I assume that the
> alignment guarantees for xl_heap_freeze_plan are too weak?

They're not too weak. I'm not sure why the memcpy() was used. I see
your point; it makes you wonder if it must be necessary, which then
seems to call into question why it's okay to access the main array as
an array. I can change this detail, too.

I'll try to get back to it this week.

--
Peter Geoghegan




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-09 Thread Peter Geoghegan
On Mon, Jan 9, 2023 at 11:57 AM Andres Freund  wrote:
> Afaict we'll need to backpatch this all the way?

I thought that we probably wouldn't need to, at first. But I now think
that we really have to.

I didn't realize that affected visibilitymap_set() calls could
generate useless set-VM WAL records until you pointed it out. That's
far more likely to happen than the race condition that I described --
it has nothing at all to do with concurrency. That's what clinches it
for me.

> > One of the calls to visibilitymap_set() during VACUUM's initial heap
> > pass could unset a page's all-visible bit during the process of setting
> > the same page's all-frozen bit.
>
> As just mentioned upthread, this just seems wrong.

I don't know why this sentence ever made sense to me. Anyway, it's not
important now.

> Do you have a reproducer for this?

No, but I'm quite certain that the race can happen.

If it's important to have a reproducer then I can probably come up
with one. I could likely figure out a way to write an isolation test
that reliably triggers the issue. It would have to work by playing
games with cleanup lock/buffer pin waits, since that's the only thing
that the test can hook into to make things happen in just the
right/wrong order.

> > elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
> >  vacrel->relname, blkno);
>
> Hm. The message gets a bit less accurate with the change. Perhaps OK? OTOH, it
> might be useful to know what bit was wrong when debugging problems.

Theoretically it might change again, if we call
visibilitymap_get_status() again. Maybe I should just broaden the
error message a bit instead?

> I still think such changes are inappropriate for a bugfix, particularly one
> that needs to be backpatched.

I'll remove the changes that are inessential in the next revision. I
wouldn't have done it if I'd fully understood the seriousness of the
issue from the start.

If you're really concerned about diff size then I should point out
that the changes to lazy_vacuum_heap_rel() aren't strictly necessary,
and probably shouldn't be backpatched. I deemed that in scope because
it's part of the same overall problem of updating the visibility map
based on potentially stale information. It makes zero sense to check
with the visibility map before updating it when we already know that
the page is all-visible. I mean, are we trying to avoid the work of
needlessly updating the visibility map in cases where its state was
corrupt, but then became uncorrupt (relative to the heap page) by
mistake?

--
Peter Geoghegan




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-09 Thread Peter Geoghegan
On Mon, Jan 9, 2023 at 11:44 AM Andres Freund  wrote:
> I think that's just an imprecise formulation though - the problem is that we
> can call visibilitymap_set() with just VISIBILITYMAP_ALL_FROZEN, even though
> VISIBILITYMAP_ALL_VISIBLE was concurrently unset.

That's correct.

You're right that my description of the problem from the commit
message was confusing. But we're on the same page about the problem
now.

> ISTM that we ought to update all_visible_according_to_vm from
> PageIsAllVisible() once we've locked the page. Even if we avoid this specific
> case, it seems a recipe for future bugs to have a potentially outdated
> variable around.

I basically agree, but some of the details are tricky.

As I mentioned already, my work on visibility map snapshots just gets
rid of all_visible_according_to_vm, which is my preferred long term
approach. In the meantime, though, we will very likely need to keep
all_visible_according_to_vm as a cache, for performance reasons, for as
long as we have SKIP_PAGES_THRESHOLD.

Can we just update all_visible_according_to_vm using
PageIsAllVisible(), without making all_visible_according_to_vm
significantly less useful as a cache? Maybe. Not sure offhand.
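
Concretely, I think the suggestion amounts to something like this
(sketch only -- the interaction with the SKIP_PAGES_THRESHOLD logic
needs more thought):

    /* once the buffer lock is acquired, the flag can no longer be stale */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    all_visible_according_to_vm = PageIsAllVisible(BufferGetPage(buf));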

-- 
Peter Geoghegan




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-09 Thread Peter Geoghegan
On Sun, Jan 8, 2023 at 6:43 PM Peter Geoghegan  wrote:
> On further reflection even v2 won't repair the page-level
> PD_ALL_VISIBLE flag in passing in this scenario. ISTM that on HEAD we
> might actually leave the all-frozen bit set in the VM, while both the
> all-visible bit and the page-level PD_ALL_VISIBLE bit remain unset.
> Again, all due to the approach we take with
> all_visible_according_to_vm, which can go stale independently of both
> the VM bit being unset and the PD_ALL_VISIBLE bit being unset (in my
> example problem scenario).

Attached is v3, which explicitly checks the need to set the
PD_ALL_VISIBLE flag at the relevant visibilitymap_set() call site. It
also has improved comments.

-- 
Peter Geoghegan


v3-0001-Don-t-accidentally-unset-all-visible-bit-in-VM.patch
Description: Binary data


Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-08 Thread Peter Geoghegan
On Sun, Jan 8, 2023 at 4:27 PM Peter Geoghegan  wrote:
> We're vulnerable to allowing "all-frozen but not all-visible"
> inconsistencies because of two factors: this business with not passing
> VISIBILITYMAP_ALL_VISIBLE along with VISIBILITYMAP_ALL_FROZEN to
> visibilitymap_set(), *and* the use of all_visible_according_to_vm to
> set the VM (a local variable that can go stale). We sort of assume
> that all_visible_according_to_vm cannot go stale here without our
> detecting it. That's almost always the case, but it's not quite
> guaranteed.

On further reflection even v2 won't repair the page-level
PD_ALL_VISIBLE flag in passing in this scenario. ISTM that on HEAD we
might actually leave the all-frozen bit set in the VM, while both the
all-visible bit and the page-level PD_ALL_VISIBLE bit remain unset.
Again, all due to the approach we take with
all_visible_according_to_vm, which can go stale independently of both
the VM bit being unset and the PD_ALL_VISIBLE bit being unset (in my
example problem scenario).

FWIW I don't have this remaining problem in my VACUUM
freezing/scanning strategies patchset. It just gets rid of
all_visible_according_to_vm altogether, which makes things a lot
simpler at the point that we set VM bits at the end of lazy_scan_heap
-- there is nothing left that can become stale. Quite a lot of the
code is just removed; there is exactly one call to visibilitymap_set()
at the end of lazy_scan_heap with the patchset, that does everything
we need.

The patchset also has logic for setting PD_ALL_VISIBLE when it needs
to be set, which isn't (and shouldn't be) conditioned on whether we're
doing an "all-visible -> all-frozen" transition or a "neither ->
all-visible" transition. What it actually needs to be conditioned on is
whether the flag is unset now, and so needs to be set in passing, as
part of setting one or both VM bits -- simple as that.
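
In other words, something like this (a simplified sketch of the
patchset's approach, not the actual code):

    /*
     * Set PD_ALL_VISIBLE in passing whenever it isn't set already, no
     * matter which VM bits we're about to set
     */
    if (!PageIsAllVisible(page))
    {
        PageSetAllVisible(page);
        MarkBufferDirty(buf);
    }
    visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
                      vmbuffer, visibility_cutoff_xid, flags);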

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-08 Thread Peter Geoghegan
On Thu, Dec 29, 2022 at 7:01 PM Peter Geoghegan  wrote:
> Attached is v2, which is just to fix bitrot.

Attached is v3. We no longer apply vacuum_failsafe_age when
determining the cutoff for antiwraparound autovacuuming -- the new
approach is a bit simpler.

This is a fairly small change overall. Now any "table age driven"
autovacuum will also be antiwraparound when its
relfrozenxid/relminmxid attains an age that's either double the
relevant setting (either autovacuum_freeze_max_age or
effective_multixact_freeze_max_age), or 1 billion XIDs/MXIDs --
whichever is less.

That makes it completely impossible to disable antiwraparound
protections (the special antiwrap autocancellation behavior) for
table-age-driven autovacuums once table age exceeds 1 billion
XIDs/MXIDs. It's still possible to increase autovacuum_freeze_max_age
to well over a billion, of course. It just won't be possible to do
that while also avoiding the no-auto-cancellation behavior for those
autovacuums that are triggered due to table age crossing the
autovacuum_freeze_max_age/effective_multixact_freeze_max_age
threshold.
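
Spelled out as code, the escalation rule works out to roughly this
(variable name is illustrative, not taken from the patch):

    int     table_age_escalation_point;

    /* antiwraparound kicks in at double the setting, capped at 1 billion */
    if (autovacuum_freeze_max_age <= 500000000)
        table_age_escalation_point = autovacuum_freeze_max_age * 2;
    else
        table_age_escalation_point = 1000000000;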

--
Peter Geoghegan
From 6d78e62226608683d9882b4507d37442164267a1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Fri, 25 Nov 2022 11:23:20 -0800
Subject: [PATCH v3] Add "table age" trigger concept to autovacuum.

Teach autovacuum.c to launch "table age" autovacuums at the same point
that it previously triggered antiwraparound autovacuums.  Antiwraparound
autovacuums are retained, but are only used as a true option of last
resort, when regular autovacuum has presumably tried and failed to
advance relfrozenxid (likely because the auto-cancel behavior kept
cancelling regular autovacuums triggered based on table age).

The special auto-cancellation behavior applied by antiwraparound
autovacuums is known to cause problems in the field, so it makes sense
to avoid it, at least until the point where it starts to look like a
proportionate response.  Besides, the risk of the system eventually
triggering xidStopLimit because of cancellations is a lot lower than it
was back when the current auto-cancellation behavior was added by commit
acac68b2.  For example, there was no visibility map, so restarting
antiwraparound autovacuum meant that the next autovacuum would get very
little benefit from the work performed by earlier cancelled autovacuums.

Also add new instrumentation that lists a triggering condition in the
server log whenever an autovacuum is logged.  This reports "table age"
as the triggering criteria when regular (not antiwraparound) autovacuum
runs because we need to advance the table's age.  In other cases it will
report an autovacuum was launched due to the tuple insert thresholds or
the dead tuple thresholds.  Note that pg_stat_activity doesn't have any
special instrumentation for regular autovacuums that happen to have been
triggered based on table age (since it really is just another form of
autovacuum, and follows exactly the same rules in vacuumlazy.c and in
proc.c).

Author: Peter Geoghegan 
Reviewed-By: Jeff Davis 
Discussion: https://postgr.es/m/CAH2-Wz=S-R_2rO49Hm94Nuvhu9_twRGbTm6uwDRmRu-Sqn_t3w@mail.gmail.com
---
 src/include/commands/vacuum.h   |  18 ++-
 src/include/storage/proc.h  |   2 +-
 src/backend/access/heap/vacuumlazy.c|  14 ++
 src/backend/access/heap/visibilitymap.c |   5 +-
 src/backend/access/transam/multixact.c  |   4 +-
 src/backend/commands/vacuum.c   |  18 ++-
 src/backend/postmaster/autovacuum.c | 207 +---
 src/backend/storage/lmgr/proc.c |   4 +-
 8 files changed, 197 insertions(+), 75 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb770..b70f69fd9 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -191,6 +191,21 @@ typedef struct VacAttrStats
 #define VACOPT_SKIP_DATABASE_STATS 0x100	/* skip vac_update_datfrozenxid() */
 #define VACOPT_ONLY_DATABASE_STATS 0x200	/* only vac_update_datfrozenxid() */
 
+/*
+ * Values used by autovacuum.c to tell vacuumlazy.c about the specific
+ * threshold type that triggered an autovacuum worker.
+ *
+ * AUTOVACUUM_NONE is used when VACUUM isn't running in an autovacuum worker.
+ */
+typedef enum AutoVacType
+{
+	AUTOVACUUM_NONE = 0,
+	AUTOVACUUM_TABLE_XID_AGE,
+	AUTOVACUUM_TABLE_MXID_AGE,
+	AUTOVACUUM_DEAD_TUPLES,
+	AUTOVACUUM_INSERTED_TUPLES,
+} AutoVacType;
+
 /*
  * Values used by index_cleanup and truncate params.
  *
@@ -222,7 +237,8 @@ typedef struct VacuumParams
 			 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 			 * whole table */
-	bool		is_wraparound;	/* force a for-wraparound vacuum */
+	bool		is_wraparound;	/* antiwraparound autovacuum? */
+	AutoVacType	trigger;		/* Autovacuum trigger condition, if any */
 	int			log_min_duration;	/* minimum execut

Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-08 Thread Peter Geoghegan
On Sun, Jan 8, 2023 at 3:53 PM Andres Freund  wrote:
> How?

See the commit message for the scenario I have in mind, which involves
a concurrent HOT update that aborts.

We're vulnerable to allowing "all-frozen but not all-visible"
inconsistencies because of two factors: this business with not passing
VISIBILITYMAP_ALL_VISIBLE along with VISIBILITYMAP_ALL_FROZEN to
visibilitymap_set(), *and* the use of all_visible_according_to_vm to
set the VM (a local variable that can go stale). We sort of assume
that all_visible_according_to_vm cannot go stale here without our
detecting it. That's almost always the case, but it's not quite
guaranteed.

> visibilitymap_set() just adds flags, it doesn't remove any already
> existing bits:

I know. The concrete scenario I have in mind is very subtle (if the
problem was this obvious I'm sure somebody would have noticed it by
now, since we do hit this visibilitymap_set() call site reasonably
often). A concurrent HOT update will certainly clear all the bits for
us, which is enough.

> You have plenty of changes like this, which are afaict entirely unrelated to
> the issue the commit is fixing, in here. It just makes it hard to review the
> patch.

I didn't think that it was that big of a deal to tweak the style of
one or two details in and around lazy_vacuum_heap_rel() in passing,
for consistency with lazy_scan_heap(), since the patch already needs
to do some of that. I do take your point, though.

-- 
Peter Geoghegan




Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-08 Thread Peter Geoghegan
On Mon, Jan 2, 2023 at 10:31 AM Peter Geoghegan  wrote:
> Would be helpful if I could get a +1 on
> v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patch, which is
> somewhat more substantial than the others.

There has been no response on this thread for over a full week at this
point. I'm CC'ing Robert now, since the bug is from his commit
a892234f83.

Attached revision of the "don't unset all-visible bit while unsetting
all-frozen bit" patch adds some assertions that verify that
visibility_cutoff_xid is InvalidTransactionId as expected when we go
to set any page all-frozen in the VM. It also broadens an existing
nearby test for corruption, which gives us some chance of detecting
and repairing corruption of this sort that might have slipped in in
the field.

My current plan is to commit something like this in another week or
so, barring any objections.

-- 
Peter Geoghegan


v2-0001-Don-t-accidentally-unset-all-visible-bit-in-VM.patch
Description: Binary data


Re: pgsql: Delay commit status checks until freezing executes.

2023-01-07 Thread Peter Geoghegan
On Sat, Jan 7, 2023 at 1:47 PM Andres Freund  wrote:
> > What do you think of the attached patch, which revises comments over
> > TransactionIdDidAbort, and adds something about it to the top of
> > heapam_visibility.c?
>
> Mostly looks good to me.  I think it'd be good to add a reference to the
> heapam_visibility.c comment to the top of transam.c (or move it).

Makes sense.

> I think it's currently very likely to be true, but I'd weaken the "never" a
> bit nonetheless. I think it'd also be good to point to what to do instead. How
> about:
>   Note that TransactionIdDidAbort() returns true only for explicitly aborted
>   transactions, as transactions implicitly aborted due to a crash will
>   commonly still appear to be in-progress in the clog. Most of the time
>   TransactionIdDidCommit(), with a preceding TransactionIdIsInProgress()
>   check, should be used instead of TransactionIdDidAbort().

That does seem better.
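
So callers should generally follow the familiar pattern (a sketch, not
taken from any one caller verbatim):

    if (TransactionIdIsInProgress(xid))
    {
        /* still running */
    }
    else if (TransactionIdDidCommit(xid))
    {
        /* committed */
    }
    else
    {
        /*
         * Not in progress and not committed: either explicitly aborted,
         * or implicitly aborted by a crash (in which case the clog may
         * still say "in progress")
         */
    }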

Do we need to do anything about this to the "pg_xact and pg_subtrans"
section of the transam README? Also, does amcheck's get_xid_status()
need a reference to these rules?

FWIW, I found an existing comment about this rule in the call to
TransactionIdAbortTree() from RecordTransactionAbort() -- which took
me quite a while to find. So you might have been remembering that
comment before.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-06 Thread Peter Geoghegan
> [...] 32-bit transaction IDs and
> that we update an epoch when this XID wraps around does not
> automatically make the user aware of the issues that surface around
> XID wraparound. Retaining the explainer for XID wraparound in the docs
> seems like a decent idea - it may be moved, but please don't delete
> it.

We do need to stop telling users to enter single user mode. It's quite
simply obsolete, bad advice, and has been since Postgres 14. It's the
worst thing that you could do, in fact.

--
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-06 Thread Peter Geoghegan
 
> > pg_class.relfrozenxid
> > + now take place more proactively, in every
> > + VACUUM operation.
>
> This claim that it happens more proactively in "every" VACUUM
> operation is false, so I think the removal of "every" would be better.

Good catch. Will fix.

> > [PATCH v14 3/3] Finish removing aggressive mode VACUUM.

> I don't quite enjoy the refactoring+rewriting of the docs section;
> it's difficult to determine what changed when so many things changed
> line lengths and were moved around. Tomorrow I'll take a closer look,
> but a separation of changes vs moved would be useful for review.

I think that I should break out the doc changes some more. The docs
are likely the least worked out thing at this point.

> > +cutoffs->MinXid = nextXID - (freeze_table_age * 0.95);
> > [...]
> > +cutoffs->MinMulti = nextMXID - (multixact_freeze_table_age * 0.95);
>
> Why are these values adjusted down (up?) by 5%? If I configure this
> GUC, I'd expect this to be used effectively verbatim; not adjusted by
> an arbitrary factor.

It is kind of arbitrary, but not in the way that you suggest. This
isn't documented in the user docs, and shouldn't really need to be. It
should have very little if any noticeable impact on our final
relfrozenxid/relminmxid in practice. If it does have any noticeable
impact, I strongly suspect it'll be a useful, positive impact.

MinXid/MinMulti control the behavior around whether or not
lazy_scan_noprune is willing to wait the hard way for a cleanup lock,
no matter how long it takes. We do still need something like that, but
it can be far looser than it is right now. The problem with aggressive
mode is that it absolutely insists on a certain outcome, no matter the
cost, and regardless of whether or not a slightly inferior outcome is
acceptable. It's extremely rigid. Rigid things tend to break. Loose,
springy things much less so.

I think that it's an extremely bad idea to wait indefinitely for a
cleanup lock. Sure, it'll work out the vast majority of the time --
it's *very* likely to work. But when it doesn't work right away, there
is no telling how long the wait will be -- all bets are off. Could be
a day, a week, a month -- who knows? The application itself is the
crucial factor here, and in general the application can do whatever it
wants to do -- that is the reality. So we should be willing to kick
the can down the road in almost all cases -- that is actually the
responsible thing to do under the circumstances. We need to get on
with freezing every other page in the table!

There just cannot be very many pages that can't be cleanup locked at
any given time, so waiting indefinitely is a very drastic measure in
response to a problem that is quite likely to go away on its own. A
problem that waiting doesn't really solve anyway. Maybe the only thing
that will work is waiting for a very long time, but we have nothing to
lose (and everything to gain) by waiting to wait.

--
Peter Geoghegan




Re: pgsql: Delay commit status checks until freezing executes.

2023-01-06 Thread Peter Geoghegan
On Tue, Jan 3, 2023 at 10:52 PM Peter Geoghegan  wrote:
> > And it'd make sense to have
> > the explanation of why TransactionIdDidAbort() isn't the same as
> > !TransactionIdDidCommit(), even for !TransactionIdIsInProgress() xacts, near
> > the explanation for doing TransactionIdIsInProgress() first.
>
> I think that we should definitely have a comment directly over
> TransactionIdDidAbort(). Though I wouldn't mind reorganizing these
> other comments, or making the comment over TransactionIdDidAbort()
> mostly just point to the other comments.

What do you think of the attached patch, which revises comments over
TransactionIdDidAbort, and adds something about it to the top of
heapam_visibility.c?

-- 
Peter Geoghegan


v1-0001-Document-TransactionIdDidAbort-hard-crash-behavio.patch
Description: Binary data


Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2023-01-05 Thread Peter Geoghegan
On Thu, Jan 5, 2023 at 3:27 PM Peter Geoghegan  wrote:
> This particular error message is from the hardening added to Postgres
> 15 in commit e7428a99. So it's not surprising that Michail didn't see
> the same error on 14.

Reproduced this on HEAD locally (no docker), without any difficulty.

FWIW, I find that the "Assert(ItemIdIsNormal(lp));" at the top of
heap_lock_tuple() is the first thing that fails on my assert-enabled
build.

-- 
Peter Geoghegan




Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2023-01-05 Thread Peter Geoghegan
On Thu, Jan 5, 2023 at 1:49 PM Matthias van de Meent
 wrote:
> client 29 script 0 aborted in command 2 query 0: ERROR:  heap tid from index 
> tuple (111,1) points past end of heap page line pointer array at offset 262 
> of block 1 in index "something_is_wrong_here_pkey"

This particular error message is from the hardening added to Postgres
15 in commit e7428a99. So it's not surprising that Michail didn't see
the same error on 14.

-- 
Peter Geoghegan




Re: pgsql: Delay commit status checks until freezing executes.

2023-01-05 Thread Peter Geoghegan
On Wed, Jan 4, 2023 at 10:59 PM Amit Kapila  wrote:
> You are an extremely valuable person for this project and I wish that
> you continue working with the same enthusiasm.

Thank you, Amit. Knowing that my efforts are appreciated by colleagues
does make it easier to persevere.

--
Peter Geoghegan




Re: pgsql: Delay commit status checks until freezing executes.

2023-01-04 Thread Peter Geoghegan
On Wed, Jan 4, 2023 at 10:41 AM Andres Freund  wrote:
> > It's currently possible for VACUUM to set the all-frozen bit while
> > unsetting the all-visible bit, due to a race condition [1]. This is
> > your long standing bug. So apparently nobody is qualified to commit
> > patches in this area.
>
> That's a non-sequitur. Bugs are a fact of programming.

I agree.

> > About a year ago, there was a massive argument over some earlier work
> > in the same general area, by me. Being the subject of a pile-on on
> > this mailing list is something that I find deeply upsetting and
> > demoralizing. I just cannot take much more of it. At the same time,
> > I've made quite an investment in the pending patches, and think that
> > it's something that I have to see through.
>
> I'm, genuinely!, sorry that you feel piled on. That wasn't, and isn't, my
> goal.

Apology accepted.

I am making a simple, practical point here, too: I'm much too selfish
a person to continue to put myself in this position. I have nothing to
prove, and have little to gain over what I'd get out of working in
various other areas. I wasn't hired by my current employer to work on
VACUUM in particular. In the recent past I have found ways to be very
productive in other areas, without any apparent risk of protracted,
stressful fights -- which is something that I plan on getting back to
soon. I just don't have the stomach for this. It just isn't worth it
to me.

> I think the area of code desperately needs work. I complained because I
> didn't like the process and was afraid of the consequences and the perceived
> need on my part to do post-commit reviews.

The work that I did in 15 (in particular commit 0b018fab, the "oldest
extant XID" commit) really isn't very useful without the other patches
in place -- it was always supposed to be one piece of a larger whole.
It enables the freezing stuff because VACUUM now "gets credit" for
proactive freezing in a way that it didn't before. The motivating
examples wiki page shows examples of this [1].

Once the later patches are in place, the 15/16 work on VACUUM will be
complete, and I can walk away from working on VACUUM having delivered
a very useful improvement to performance stability -- a good outcome
for everybody. If you and Robert can find a way to accommodate that,
then in all likelihood we won't need to have any more heated and
protracted arguments like the one from early in 2022. I will be quite
happy to get back to working on B-Tree, likely the skip scan work.

 [1] 
https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples
--
Peter Geoghegan




Re: pgsql: Delay commit status checks until freezing executes.

2023-01-04 Thread Peter Geoghegan
On Wed, Jan 4, 2023 at 7:03 AM Robert Haas  wrote:
> But that having been said, I'm kind of astonished that you didn't know
> about this already. The freezing behavior is in general extremely hard
> to get right, and I guess I feel if you don't understand how the
> underlying functions work, including things like performance
> considerations

I was the one that reported the issue with CLOG lookups in the first place.

> and which functions return fully reliable results, I do
> not think you should be committing your own patches in this area.

My mistake here had nothing to do with my own goals. I was trying to
be diligent by hardening an existing check in passing, and it
backfired.

> There is probably a lot of potential benefit in improving the way this
> stuff works, but there is also a heck of a lot of danger of creating
> subtle data corrupting bugs that could easily take years to find.

It's currently possible for VACUUM to set the all-frozen bit while
unsetting the all-visible bit, due to a race condition [1]. This is
your long standing bug. So apparently nobody is qualified to commit
patches in this area.

About a year ago, there was a massive argument over some earlier work
in the same general area, by me. Being the subject of a pile-on on
this mailing list is something that I find deeply upsetting and
demoralizing. I just cannot take much more of it. At the same time,
I've made quite an investment in the pending patches, and think that
it's something that I have to see through.

If I am allowed to finish what I've started, then I will stop all new
work on VACUUM. I'll go back to working on B-Tree indexing. Nobody is
asking me to focus on VACUUM, and there are plenty of other things
that I could be doing that don't seem to lead to these situations.

[1] 
https://postgr.es/m/cah2-wznungszf8v6osgjac5aysb3cz6hw6mlm30x0d65cms...@mail.gmail.com
--
Peter Geoghegan




Re: pgsql: Delay commit status checks until freezing executes.

2023-01-04 Thread Peter Geoghegan
On Tue, Jan 3, 2023 at 4:54 PM Andres Freund  wrote:
> There's some changes from TransactionIdDidCommit() to !TransactionIdDidAbort()
> that don't look right to me. If the server crashed while xid X was
> in-progress, TransactionIdDidCommit(X) will return false, but so will
> TransactionIdDidAbort(X). So besides moving when the check happens you also
> changed what's being checked in a more substantial way.

I did point this out on the thread. I made this change with the
intention of making the check more robust. Apparently this was
misguided.

Where is the behavior that you describe documented, if anywhere?

> Also, why did you change when MarkBufferDirty() happens? Previously it
> happened before we modify the page contents, now after. That's probably fine
> (it's the order suggested in transam/README), but seems like a mighty subtle
> thing to change at the same time as something unrelated, particularly without
> even mentioning it?

I changed it because the new order is idiomatic. I didn't think that
this was particularly worth mentioning, or even subtle. The logic from
heap_execute_freeze_tuple() only performs simple in-place
modifications.

-- 
Peter Geoghegan




Re: pgsql: Delay commit status checks until freezing executes.

2023-01-03 Thread Peter Geoghegan
On Tue, Jan 3, 2023 at 10:47 PM Andres Freund  wrote:
> IMO the comment at the top mentioning why the TransactionIdIsInProgress()
> calls are crucial / need to be done first would be considerably more likely to
> be found in transam.c than heapam_visibility.c.

Yeah, but they're duplicated anyway. For example in the transam
README. Plus we have references to these same comments from other
files, such as heapam.c, which mentions heapam_visibility.c by name as
where you go to learn more about this issue.

> And it'd make sense to have
> the explanation of why TransactionIdDidAbort() isn't the same as
> !TransactionIdDidCommit(), even for !TransactionIdIsInProgress() xacts, near
> the explanation for doing TransactionIdIsInProgress() first.

I think that we should definitely have a comment directly over
TransactionIdDidAbort(). Though I wouldn't mind reorganizing these
other comments, or making the comment over TransactionIdDidAbort()
mostly just point to the other comments.

-- 
Peter Geoghegan




Re: pgsql: Delay commit status checks until freezing executes.

2023-01-03 Thread Peter Geoghegan
On Tue, Jan 3, 2023 at 10:33 PM Andres Freund  wrote:
> I'd say a comment above TransactionIdDidAbort() referencing an overview
> comment at the top of the file? I think it might be worth moving the comment
> from heapam_visibility.c to transam.c?

What comments in heapam_visibility.c should we be referencing here? I
don't see anything about it there. I have long been aware that those
routines deduce that a transaction must have aborted, but surely
that's not nearly enough. That's merely not being broken, without any
explanation given as to why.

> > I find this astonishing. Why isn't there a prominent comment that
> > advertises that TransactionIdDidAbort() just doesn't work reliably?
>
> Arguably it works reliably, just more narrowly than one might think. Treating
> "crashed transactions" as a distinct state from explicit aborts.

That's quite a stretch. There are numerous comments that pretty much
imply that TransactionIdDidCommit/TransactionIdDidAbort are very
similar, for example any discussion of how you need to call
TransactionIdIsInProgress first before calling either of the other
two.


-- 
Peter Geoghegan




Re: pgsql: Delay commit status checks until freezing executes.

2023-01-03 Thread Peter Geoghegan
On Tue, Jan 3, 2023 at 8:29 PM Peter Geoghegan  wrote:
> I find this astonishing. Why isn't there a prominent comment that
> advertises that TransactionIdDidAbort() just doesn't work reliably?

I pushed a fix for this now.

We should add a comment about this issue to TransactionIdDidAbort()
header comments, but I didn't do that part yet.

Thanks for the report.
-- 
Peter Geoghegan




Re: pgsql: Delay commit status checks until freezing executes.

2023-01-03 Thread Peter Geoghegan
On Tue, Jan 3, 2023 at 7:56 PM Andres Freund  wrote:
> I still think these moderation rules are deeply unhelpful...

Yes, it is rather annoying.

> I don't know - I think there's an explicit comment somewhere, but I couldn't
> find it immediately. There's a bunch of indirect references to it in
> heapam_visibility.c, with comments like "it must have aborted or
> crashed".

I think that that's a far cry from any kind of documentation...

> The reason for the behaviour is that we do not have any mechanism for going
> through the clog and aborting all in-progress-during-crash transactions. So
> we'll end up with the clog for all in-progress-during-crash transaction being
> zero / TRANSACTION_STATUS_IN_PROGRESS.

I find this astonishing. Why isn't there a prominent comment that
advertises that TransactionIdDidAbort() just doesn't work reliably?

> IMO it's almost always wrong to use TransactionIdDidAbort().

I didn't think that there was any general guarantee about
TransactionIdDidAbort() working after a crash. But this is an on-disk
XID, taken from some tuple's xmax, which must have a value <
OldestXmin.

> I think changes in how WAL logging is done are just about always worth
> mentioning in a commit message...

I agree with that as a general statement, but I never imagined that
this was a case that such a statement could apply to.

I will try to remember to put something about similar changes in any
future commit messages, in the unlikely event that I ever end up
moving MarkBufferDirty() around in some existing critical section in
the future.

-- 
Peter Geoghegan




Re: pgsql: Delay commit status checks until freezing executes.

2023-01-03 Thread Peter Geoghegan
(Pruning -committers from the list, since cross-posting to -hackers
resulted in this being held up for moderation.)

On Tue, Jan 3, 2023 at 5:15 PM Peter Geoghegan  wrote:
>
> On Tue, Jan 3, 2023 at 4:54 PM Andres Freund  wrote:
> > There's some changes from TransactionIdDidCommit() to 
> > !TransactionIdDidAbort()
> > that don't look right to me. If the server crashed while xid X was
> > in-progress, TransactionIdDidCommit(X) will return false, but so will
> > TransactionIdDidAbort(X). So besides moving when the check happens you also
> > changed what's being checked in a more substantial way.
>
> I did point this out on the thread. I made this change with the
> intention of making the check more robust. Apparently this was
> misguided.
>
> Where is the behavior that you describe documented, if anywhere?

When the server crashes, and we have a problem case, what does
TransactionLogFetch()/TransactionIdGetStatus() (which are the guts of
both TransactionIdDidCommit and TransactionIdDidAbort) report about
the XID?

> > Also, why did you change when MarkBufferDirty() happens? Previously it
> > happened before we modify the page contents, now after. That's probably fine
> > (it's the order suggested in transam/README), but seems like a mighty subtle
> > thing to change at the same time as something unrelated, particularly 
> > without
> > even mentioning it?
>
> I changed it because the new order is idiomatic. I didn't think that
> this was particularly worth mentioning, or even subtle. The logic from
> heap_execute_freeze_tuple() only performs simple in-place
> modifications.

I'm including this here because presumably -hackers will have missed
it due to the moderation hold-up issue.


--
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-02 Thread Peter Geoghegan
On Sat, Dec 31, 2022 at 12:45 PM Peter Geoghegan  wrote:
> On Sat, Dec 31, 2022 at 11:46 AM Jeff Davis  wrote:
> > "We have no freeze plans to execute, so there's no cost to following
> > the freeze path. This is important in the case where the page is
> > entirely frozen already, so that the page will be marked as such in the
> > VM."
>
> I'm happy to use your wording instead -- I'll come up with a patch for that.

What do you think of the wording adjustments in the attached patch?
It's based on your suggested wording.

--
Peter Geoghegan


0001-Tweak-page-level-freezing-comments.patch
Description: Binary data


Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-02 Thread Peter Geoghegan
On Sat, Dec 31, 2022 at 4:53 PM Peter Geoghegan  wrote:
> The first patch makes sure that the snapshotConflictHorizon cutoff
> (XID cutoff for recovery conflicts) is never a special XID, unless
> that XID is InvalidTransactionId, which is interpreted as a
> snapshotConflictHorizon value that will never need a recovery conflict
> (per the general convention for snapshotConflictHorizon values
> explained above ResolveRecoveryConflictWithSnapshot).

Pushed this just now.

Attached is another very simple refactoring patch for vacuumlazy.c. It
makes vacuumlazy.c save the result of get_database_name() in vacrel,
which matches what we already do with things like
get_namespace_name().

Would be helpful if I could get a +1 on
v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patch, which is
somewhat more substantial than the others.

-- 
Peter Geoghegan


0001-Save-get_database_name-in-vacrel.patch
Description: Binary data


Fixing a couple of buglets in how VACUUM sets visibility map bits

2022-12-31 Thread Peter Geoghegan
Attached are two patches, each of which fixes two historical buglets
around VACUUM's approach to setting bits in the visibility map.
(Whether or not this is actually refactoring work or hardening work is
debatable, I suppose.)

The first patch makes sure that the snapshotConflictHorizon cutoff
(XID cutoff for recovery conflicts) is never a special XID, unless
that XID is InvalidTransactionId, which is interpreted as a
snapshotConflictHorizon value that will never need a recovery conflict
(per the general convention for snapshotConflictHorizon values
explained above ResolveRecoveryConflictWithSnapshot). This patch
establishes a hard rule that snapshotConflictHorizon values can never
be a special XID value, unless it's InvalidTransactionId. An assertion
enforces the rule for us in REDO routines (at the point that they call
ResolveRecoveryConflictWithSnapshot with the WAL record's
snapshotConflictHorizon XID value).
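
The substance of the assertion is roughly this (simplified sketch):

    /* a conflict horizon is either a normal XID or InvalidTransactionId */
    Assert(TransactionIdIsNormal(snapshotConflictHorizon) ||
           !TransactionIdIsValid(snapshotConflictHorizon));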

The second patch makes sure that VACUUM can never set a page
all-frozen in the visibility map without also setting the same page
all-visible in the same call to visibilitymap_set() -- regardless of
what we think we know about the current state of the all-visible bit
in the VM.

The second patch adjusts one of the visibilitymap_set() calls in
vacuumlazy.c that would previously sometimes set a page's all-frozen
bit without also setting its all-visible bit. This could allow VACUUM
to leave a page all-frozen but not all-visible in the visibility map
(since the value of all_visible_according_to_vm can go stale). I think
that this should be treated as a basic visibility map invariant: an
all-frozen page must also be all-visible, by definition, so why should
it be physically possible for the VM to give a contradictory picture
of the all_visible/all_frozen status of any one page? Assertions are
added that more or less make this rule into an invariant.
amcheck/pg_visibility coverage might make sense too, but I haven't
done that here.
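
The intended invariant is easy to state in code (simplified sketch of
what the adjusted call site does, not the patch verbatim):

    /* never pass ALL_FROZEN without also passing ALL_VISIBLE */
    flags = VISIBILITYMAP_ALL_VISIBLE;
    if (all_frozen)
        flags |= VISIBILITYMAP_ALL_FROZEN;
    visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
                      vmbuffer, visibility_cutoff_xid, flags);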

The second patch also adjusts a later visibilitymap_set() call site
(the one used just after heap vacuuming runs in the final heap pass)
in roughly the same way. It no longer reads from the visibility map to
see what bits need to be changed. The existing approach here seems
rather odd. The whole point of calling lazy_vacuum_heap_page() is to
set LP_DEAD items referenced by VACUUM's dead_items array to LP_UNUSED
-- there has to have been at least one LP_DEAD item on the page for us
to end up here (which a Postgres 14 era assertion verifies for us). So
we already know perfectly well that the visibility map shouldn't
indicate that the page is all-visible yet -- why bother asking the VM?
And besides, any call to visibilitymap_set() will only modify the VM
when it directly observes that the bits have changed -- so why even
attempt to duplicate that on the caller side?

It seems to me that the visibilitymap_get_status() call inside
lazy_vacuum_heap_page() is actually abused to work as a substitute for
visibilitymap_pin(). Why not use top-level visibilitymap_pin() calls
instead, just like we do it in the first heap pass? That's how it's
done in the second patch; it adds a visibilitymap_pin() call in
lazy_vacuum_heap_rel()'s blkno-wise loop. That gives us parity between
the first and second heap pass, which seems like a clear
maintainability win -- everybody can pass the
already-pinned/already-setup vmbuffer by value.
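
Structurally, the second heap pass then ends up looking something like
this (a heavily simplified sketch of the patch, not the patch itself):

    Buffer      vmbuffer = InvalidBuffer;
    int         index = 0;

    while (index < vacrel->dead_items->num_items)
    {
        BlockNumber blkno;

        blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);

        /* pin the right VM page up front, as in the first heap pass */
        visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);

        /* ... read and lock the heap page, then call lazy_vacuum_heap_page()
         * with the already-pinned vmbuffer passed by value ... */
    }

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);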

-- 
Peter Geoghegan


v1-0001-Avoid-special-XIDs-in-snapshotConflictHorizon-val.patch
Description: Binary data


v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patch
Description: Binary data


Re: New strategies for freezing, advancing relfrozenxid early

2022-12-31 Thread Peter Geoghegan
On Sat, Dec 31, 2022 at 11:46 AM Jeff Davis  wrote:
> On Fri, 2022-12-30 at 16:58 -0800, Peter Geoghegan wrote:
> > Following the path of freezing a page is *always* valid, by
> > definition. Including when there are zero freeze plans to execute, or
> > even zero tuples to examine in the first place -- we'll at least be
> > able to perform nominal freezing, no matter what.
>
> This is a much clearer description, in my opinion. Do you think this is
> already reflected in the comments (and I missed it)?

I am arguably the person least qualified to answer this question.   :-)

> Perhaps the comment in the "if (tuples_frozen == 0)" branch could be
> something more like:
>
> "We have no freeze plans to execute, so there's no cost to following
> the freeze path. This is important in the case where the page is
> entirely frozen already, so that the page will be marked as such in the
> VM."

I'm happy to use your wording instead -- I'll come up with a patch for that.

In my mind it's just a restatement of what's there already. I assume
that you're right about it being clearer this way.

> Of course, I'm sure there are some nuances that I'm still missing.

I don't think that there is, actually. I now believe that you totally
understand the mechanics involved here. I'm glad that I was able to
ascertain that that's all it was. It's worth going to the trouble of
getting something like this exactly right.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-30 Thread Peter Geoghegan
On Fri, Dec 30, 2022 at 1:12 PM Peter Geoghegan  wrote:
> > "Nominal freezing" is happening when there are no freeze plans at all.
> > I get that it's to manage control flow so that the right thing happens
> > later. But I think it should be defined in terms of what state the page
> > is in so that we know that following a given path is valid. Defining
> > "nominal freezing" as a case where there are no freeze plans is just
> > confusing to me.
>
> What would you prefer? The state that the page is in is not something
> that I want to draw much attention to, because it's confusing in a way
> that mostly isn't worth talking about.

I probably should have addressed what you said more directly. Here goes:

Following the path of freezing a page is *always* valid, by
definition. Including when there are zero freeze plans to execute, or
even zero tuples to examine in the first place -- we'll at least be
able to perform nominal freezing, no matter what. OTOH, following the
"no freeze" path is permissible whenever the freeze_required flag
hasn't been set during any call to heap_prepare_freeze_tuple(). It is
never actually mandatory for lazy_scan_prune() to *not* freeze.

It's a bit like how a simple point can be understood as a degenerate
circle of radius 0. It's an abstract definition, which is just a tool
for describing things precisely -- hopefully a useful tool. I welcome
the opportunity to be able to describe things in a way that is clearer
or more useful, in whatever way. But it's not like I haven't already
put in significant effort to this exact question of what "freezing the
page" really means to lazy_scan_prune(). Naming things is hard.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-30 Thread Peter Geoghegan
On Fri, Dec 30, 2022 at 12:43 PM Jeff Davis  wrote:
> I always understood "freezing" to mean that a concrete action was
> taken, and associated WAL generated.

"When I use a word… it means just what I choose it to mean -- neither
more nor less".

I have also always understood freezing that way too. In fact, I still
do understand it that way -- I don't think that it has been undermined
by any of this. I've just invented this esoteric concept of nominal
freezing that can be ignored approximately all the time, to solve one
narrow problem that needed to be solved, that isn't that interesting
anywhere else.

> "Nominal freezing" is happening when there are no freeze plans at all.
> I get that it's to manage control flow so that the right thing happens
> later. But I think it should be defined in terms of what state the page
> is in so that we know that following a given path is valid. Defining
> "nominal freezing" as a case where there are no freeze plans is just
> confusing to me.

What would you prefer? The state that the page is in is not something
that I want to draw much attention to, because it's confusing in a way
that mostly isn't worth talking about. When we do nominal freezing, we
don't necessarily go on to set the page all-frozen. In fact, it's not
particularly likely that that will end up happening!

Bear in mind that the exact definition of "freeze the page" is
somewhat creative, even without bringing nominal freezing into it. It
just has to be in order to support the requirements we have for
MultiXacts (in particular for FRM_NOOP processing). The new concepts
don't quite map directly on to the old ones. At the same time, it
really is very often the case that "freezing the page" will perform
maximally aggressive freezing, in the sense that it does precisely
what a VACUUM FREEZE would do given the same page (in any Postgres
version).

-- 
Peter Geoghegan




Re: Avoiding unnecessary clog lookups while freezing

2022-12-30 Thread Peter Geoghegan
On Thu, Dec 29, 2022 at 12:20 PM Peter Geoghegan  wrote:
> On Thu, Dec 29, 2022 at 12:00 PM Andres Freund  wrote:
> > > I could just move the same tests from heap_prepare_freeze_tuple() to
> > > heap_freeze_execute_prepared(), without changing any of the details.
> >
> > That might work, yes.
>
> Attached patch shows how that could work.

My plan is to push something like this next week, barring objections.
Note that I've inverted the xmax "!TransactionIdDidCommit()" test --
it is now "TransactionIdDidAbort()" instead. I believe that this makes
the test more likely to catch problems, since we really should expect
relevant xmax XIDs to have aborted, specifically -- since the xmax
XIDs in question are always < OldestXmin. (There is a need to use a
"!TransactionIdDidCommit()" test specifically in nearby code in
FreezeMultiXactId(), because that code has to also deal with
in-progress XIDs that are multi members, but that's not the case
here.)

I'm also going to create a CF entry for the other patch posted to this
thread -- the enhancement to FreezeMultiXactId() that aims to get the
most out of any expensive calls to GetMultiXactIdMembers(). That
approach seems quite promising, and relatively simple. I'm not
particularly worried about wasting a call to GetMultiXactIdMembers()
during VACUUM, though. I'm more concerned about the actual impact of
not doing our best to outright remove Multis during VACUUM. The
application itself can experience big performance cliffs from SLRU
buffer misses, which VACUUM should do its utmost to prevent. That is
an occasional source of big problems [1].

I'm particularly concerned about the possibility that having an
updater XID will totally change the characteristics of how multis are
processed inside FreezeMultiXactId(). That seems like it might be a
really sharp edge. I believe that the page-level freezing patch has
already ameliorated the problem, since it made us much less reliant on
the case where GetMultiXactIdMembers() returns "nmembers <= 0" for a
Multi that happens to be HEAP_XMAX_IS_LOCKED_ONLY(). But we can and
should go further than that.

[1] 
https://buttondown.email/nelhage/archive/notes-on-some-postgresql-implementation-details/
--
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2022-12-29 Thread Peter Geoghegan
On Fri, Nov 25, 2022 at 2:47 PM Peter Geoghegan  wrote:
> Attached WIP patch invents the idea of a regular autovacuum that is
> tasked with advancing relfrozenxid -- which is really just another
> trigger criteria, reported on in the server log in its autovacuum
> reports.

Attached is v2, which is just to fix bitrot. Well, mostly. I did make
one functional change in v2: the autovacuum server log reports now
separately report on table XID age and table MultiXactId age, each as
its own distinct triggering condition.

I've heard informal reports that the difference between antiwraparound
autovacuums triggered by table XID age versus table MXID age can
matter a great deal. It isn't difficult to break out that detail
anyway, so even if the distinction isn't interesting all that often we
might as well surface it to users.

I still haven't made a start on the docs for this. I'm still not sure
how much work I should do on the docs in the scope of this project
versus my project that deals with related issues in VACUUM itself. The
existing material from the "Routine Vacuuming" docs has lots of
problems, and figuring out how to approach fixing those problems seems
kind of daunting right now.

--
Peter Geoghegan
From 24cc2dca479f5cf6f7f918db3237032bbb2c2ce4 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Fri, 25 Nov 2022 11:23:20 -0800
Subject: [PATCH v2] Add "table age" trigger concept to autovacuum.

Teach autovacuum.c to launch "table age" autovacuums at the same point
that it previously triggered antiwraparound autovacuums.  Antiwraparound
autovacuums are retained, but are only used as a true option of last
resort, when regular autovacuum has presumably tried and failed to
advance relfrozenxid (likely because the auto-cancel behavior kept
cancelling regular autovacuums triggered based on table age).

The special auto-cancellation behavior applied by antiwraparound
autovacuums is known to cause problems in the field, so it makes sense
to avoid it, at least until the point where it starts to look like a
proportionate response.  Besides, the risk of the system eventually
triggering xidStopLimit because of cancellations is a lot lower than it
was back when the current auto-cancellation behavior was added by commit
acac68b2.  For example, there was no visibility map, so restarting
antiwraparound autovacuum meant that the next autovacuum would get very
little benefit from the work performed by earlier cancelled autovacuums.

Also add new instrumentation that lists a triggering condition in the
server log whenever an autovacuum is logged.  This reports "table age"
as the triggering criteria when regular (not antiwraparound) autovacuum
runs because we need to advance the table's age.  In other cases it will
report an autovacuum was launched due to the tuple insert thresholds or
the dead tuple thresholds.  Note that pg_stat_activity doesn't have any
special instrumentation for regular autovacuums that happen to have been
triggered based on table age (since it really is just another form of
autovacuum, and follows exactly the same rules in vacuumlazy.c and in
proc.c).

Author: Peter Geoghegan 
Reviewed-By: Jeff Davis 
Discussion: https://postgr.es/m/CAH2-Wz=S-R_2rO49Hm94Nuvhu9_twRGbTm6uwDRmRu-Sqn_t3w@mail.gmail.com
---
 src/include/commands/vacuum.h   |  16 ++-
 src/include/storage/proc.h  |   2 +-
 src/backend/access/heap/vacuumlazy.c|  14 +++
 src/backend/access/heap/visibilitymap.c |   5 +-
 src/backend/access/transam/multixact.c  |   4 +-
 src/backend/commands/vacuum.c   |  18 ++--
 src/backend/postmaster/autovacuum.c | 129 ++--
 src/backend/storage/lmgr/proc.c |   4 +-
 8 files changed, 145 insertions(+), 47 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 2f274f2be..53e4e8ffb 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -189,6 +189,19 @@ typedef struct VacAttrStats
 #define VACOPT_PROCESS_TOAST 0x40	/* process the TOAST table, if any */
 #define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip any pages */
 
+/*
+ * Values used by autovacuum.c to tell vacuumlazy.c about the specific
+ * threshold type that triggered an autovacuum worker
+ */
+typedef enum AutoVacType
+{
+	AUTOVACUUM_NONE = 0,
+	AUTOVACUUM_TABLE_XID_AGE,
+	AUTOVACUUM_TABLE_MXID_AGE,
+	AUTOVACUUM_DEAD_TUPLES,
+	AUTOVACUUM_INSERTED_TUPLES,
+} AutoVacType;
+
 /*
  * Values used by index_cleanup and truncate params.
  *
@@ -220,7 +233,8 @@ typedef struct VacuumParams
 			 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 			 * whole table */
-	bool		is_wraparound;	/* force a for-wraparound vacuum */
+	bool		is_wraparound;	/* antiwraparound autovacuum */
+	AutoVacType	trigger;		/* autovacuum launched to advance table age */
 	int			log_min_duration

Re: Avoiding unnecessary clog lookups while freezing

2022-12-29 Thread Peter Geoghegan
On Thu, Dec 29, 2022 at 12:50 PM Peter Geoghegan  wrote:
> On Thu, Dec 29, 2022 at 12:20 PM Peter Geoghegan  wrote:
> > > It seems somewhat wrong that we discard all the work that
> > > heap_prepare_freeze_tuple() did. Yes, we force freezing to actually 
> > > happen in
> > > a bunch of important cases (e.g. creating a new multixact), but even so,
> > > e.g. GetMultiXactIdMembers() is awfully expensive to do for nought. Nor is
> > > just creating the freeze plan free.

> Here's an idea that seems like it could ameliorate the issue:
>
> When we're looping through members from GetMultiXactIdMembers(), and
> seeing if we can get away with !need_replace/FRM_NOOP processing, why
> not also check if there are any XIDs >= OldestXmin among the members?
> If not (if they're all < OldestXmin), then we should prefer to go
> further with processing the Multi now -- FRM_NOOP processing isn't
> actually cheaper.

Attached patch shows what I mean here.

I think that there is a tendency for OldestMxact to be held back by a
disproportionate amount (relative to OldestXmin) in the presence of
long running transactions and concurrent updates from short-ish
transactions. The way that we maintain state in shared memory to
compute OldestMxact (OldestMemberMXactId/OldestVisibleMXactId) seems
to be vulnerable to that kind of thing. I'm thinking of scenarios
where MultiXactIdSetOldestVisible() gets called from a long-running
xact, at the first point that it examines any multi, just to read
something. That effectively makes the long-running xact hold back
OldestMxact, even when it doesn't hold back OldestXmin. A read-only
transaction that runs in READ COMMITTED could do this -- it'll call
OldestVisibleMXactId() and "lock in" the oldest visible Multi that it
needs to continue to see as running, without clearing that until much
later (until AtEOXact_MultiXact() is called at xact commit/abort). And
without doing anything to hold back OldestXmin by the same amount, or
for the same duration.

That's the kind of scenario that the patch might make a difference in
-- because it exploits the fact that OldestXmin can be a lot less
vulnerable than OldestMxact is to being held back by long running
xacts. Does that seem plausible to you?

-- 
Peter Geoghegan


0001-Prefer-FRM_INVALIDATE_XMAX-over-FRM_NOOP-when-cheape.patch
Description: Binary data


Re: Avoiding unnecessary clog lookups while freezing

2022-12-29 Thread Peter Geoghegan
On Thu, Dec 29, 2022 at 12:20 PM Peter Geoghegan  wrote:
> > It seems somewhat wrong that we discard all the work that
> > heap_prepare_freeze_tuple() did. Yes, we force freezing to actually happen 
> > in
> > a bunch of important cases (e.g. creating a new multixact), but even so,
> > e.g. GetMultiXactIdMembers() is awfully expensive to do for nought. Nor is
> > just creating the freeze plan free.
>
> I'm not sure what you mean by that. I believe that the page-level
> freezing changes do not allow FreezeMultiXactId() to call
> GetMultiXactIdMembers() any more often than before. Are you concerned
> about a regression, or something more general than that?

Here's an idea that seems like it could ameliorate the issue:

When we're looping through members from GetMultiXactIdMembers(), and
seeing if we can get away with !need_replace/FRM_NOOP processing, why
not also check if there are any XIDs >= OldestXmin among the members?
If not (if they're all < OldestXmin), then we should prefer to go
further with processing the Multi now -- FRM_NOOP processing isn't
actually cheaper.

We'll already know that a second pass over the multi really isn't
expensive. And that it will actually result in FRM_INVALIDATE_XMAX
processing, which is ideal. Avoiding a second pass is really just
about avoiding FRM_RETURN_IS_MULTI (and avoiding FRM_RETURN_IS_XID,
perhaps, though to a much lesser degree).
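
In code form, the check I'm describing is roughly this (a sketch; names
are illustrative rather than taken from the attached patch):

    bool    all_members_removable = true;

    for (int i = 0; i < nmembers; i++)
    {
        if (TransactionIdFollowsOrEquals(members[i].xid, OldestXmin))
        {
            all_members_removable = false;
            break;
        }
    }

    if (all_members_removable)
    {
        /*
         * Every member XID is < OldestXmin, so a second pass over the
         * multi can remove xmax outright -- prefer FRM_INVALIDATE_XMAX
         * over settling for FRM_NOOP
         */
    }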

-- 
Peter Geoghegan




Re: Avoiding unnecessary clog lookups while freezing

2022-12-29 Thread Peter Geoghegan
On Thu, Dec 29, 2022 at 12:00 PM Andres Freund  wrote:
> > I could just move the same tests from heap_prepare_freeze_tuple() to
> > heap_freeze_execute_prepared(), without changing any of the details.
>
> That might work, yes.

Attached patch shows how that could work.

> It seems somewhat wrong that we discard all the work that
> heap_prepare_freeze_tuple() did. Yes, we force freezing to actually happen in
> a bunch of important cases (e.g. creating a new multixact), but even so,
> e.g. GetMultiXactIdMembers() is awfully expensive to do for nought. Nor is
> just creating the freeze plan free.

I'm not sure what you mean by that. I believe that the page-level
freezing changes do not allow FreezeMultiXactId() to call
GetMultiXactIdMembers() any more often than before. Are you concerned
about a regression, or something more general than that?

The only case that we *don't* force xmax freezing in
FreezeMultiXactId() is the FRM_NOOP case. Note in particular that we
will reliably force freezing for any Multi < OldestMxact (not <
MultiXactCutoff).

> I wonder how often it'd be worthwhile to also do opportunistic freezing during
> lazy_vacuum_heap_page(), given that we already will WAL log (and often issue
> an FPI).

Yeah, we don't actually need a cleanup lock for that. It might also
make sense to teach lazy_scan_prune() to anticipate what will happen
later on, in lazy_vacuum_heap_page(), so that it can freeze based on
the same observation about the cost. (It already does a certain amount
of this kind of thing, in fact.)

-- 
Peter Geoghegan


v2-0001-Check-xmin-xmax-commit-status-when-freezing-execu.patch
Description: Binary data


Re: Avoiding unnecessary clog lookups while freezing

2022-12-29 Thread Peter Geoghegan
On Thu, Dec 29, 2022 at 9:21 AM Andres Freund  wrote:
> I do think we wanted to avoid reviving actually-dead tuples (present due to
> the multixact and related bugs). And I'm worried about giving that checking
> up, I've seen it hit too many times. Both in the real world and during
> development.

I could just move the same tests from heap_prepare_freeze_tuple() to
heap_freeze_execute_prepared(), without changing any of the details.
That would mean that the TransactionIdDidCommit() calls would only
take place with tuples that actually get frozen, which is more or less
how it worked before now.

heap_prepare_freeze_tuple() will now often prepare freeze plans that
just get discarded by lazy_scan_prune(). My concern is the impact on
tables/pages that almost always discard prepared freeze plans, and so
require many TransactionIdDidCommit() calls that really aren't
necessary.

> Somewhat of a tangent: I've previously wondered if we should have a small
> hash-table based clog cache. The current one-element cache doesn't suffice in
> a lot of scenarios, but it wouldn't take a huge cache to end up filtering most
> clog accesses.

I imagine that the one-element cache works alright in some scenarios,
but then suddenly doesn't work so well, even though not very much has
changed. Behavior like that makes the problems difficult to analyze,
and easy to miss. I'm suspicious of that.
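
For what it's worth, a cache along those lines can be tiny and still
filter the vast majority of lookups. The sketch below is only an
illustration of the idea in plain C, with a made-up
clog_lookup_committed() standing in for the real clog read; it is not a
proposal for transam.c:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    #define CLOG_CACHE_SIZE 64          /* must be a power of two */

    typedef struct ClogCacheEntry
    {
        TransactionId xid;              /* 0 means "empty slot" */
        bool          committed;
    } ClogCacheEntry;

    static ClogCacheEntry clog_cache[CLOG_CACHE_SIZE];

    /* stand-in for an actual (expensive) clog page lookup */
    extern bool clog_lookup_committed(TransactionId xid);

    static bool
    cached_did_commit(TransactionId xid)
    {
        ClogCacheEntry *e = &clog_cache[xid & (CLOG_CACHE_SIZE - 1)];

        if (e->xid != xid)
        {
            /* miss: do the real lookup and remember the answer */
            e->xid = xid;
            e->committed = clog_lookup_committed(xid);
        }
        return e->committed;
    }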

-- 
Peter Geoghegan




Re: Avoiding unnecessary clog lookups while freezing

2022-12-28 Thread Peter Geoghegan
On Wed, Dec 28, 2022 at 4:43 PM Andres Freund  wrote:
> > > Hm. I dimly recall that we had repeated cases where the hint bits were set
> > > wrongly due to some of the multixact related bugs. I think I was trying to be
> > > paranoid about not freezing stuff in those situations, since it can lead to
> > > reviving dead tuples, which obviously is bad.
> >
> > I think that it's a reasonable check, and I'm totally in favor of
> > keeping it (or something very close, at least).
>
> I don't quite follow - one paragraph up you say we should fix something, and
> then here you seem to say we should continue not to rely on the hint bits?

I didn't mean that we should continue to not rely on the hint bits. Is
that really all that the test is for? I think of it as a general sanity check.

The important call to avoid with page-level freezing is the xmin call to
TransactionIdDidCommit(), not the xmax call. The xmax call only occurs
when VACUUM prepares to freeze a tuple that was updated by an updater
(not a locker) that aborted. While the xmin calls will now take place with most
unfrozen tuples.

--
Peter Geoghegan




Re: Avoiding unnecessary clog lookups while freezing

2022-12-28 Thread Peter Geoghegan
On Wed, Dec 28, 2022 at 4:20 PM Andres Freund  wrote:
> > Theoretically this is an old issue that dates back to commit
> > 699bf7d05c, as opposed to an issue in the page-level freezing patch.
> > But that's not really true in a real practical sense. In practice the
> > calls to TransactionIdDidCommit() will happen far more frequently
> > following today's commit 1de58df4fe (since we're using OldestXmin as
> > the cutoff that gates performing freeze_xmin/freeze_xmax processing --
> > not FreezeLimit).
>
> Hm. But we still only do the check when we actually freeze, rather than just
> during the pre-check in heap_tuple_should_freeze(). So we'll only incur the
> increased overhead when we also do more WAL logging etc. Correct?

Yes, that's how it worked up until today's commit 1de58df4fe.

I don't have strong feelings about back patching a fix, but this does
seem like something that I should fix now, on HEAD.

> Hm. I dimly recall that we had repeated cases where the hint bits were set
> wrongly due to some of the multixact related bugs. I think I was trying to be
> paranoid about not freezing stuff in those situations, since it can lead to
> reviving dead tuples, which obviously is bad.

I think that it's a reasonable check, and I'm totally in favor of
keeping it (or something very close, at least).

> There's practically no tuple-level concurrent activity in a normal regression
> test. So I'm a bit doubtful that is an interesting metric. It'd be a tad more
> interesting to run tests, including isolationtester and pgbench, against a
> running cluster.

Even on HEAD, with page-level freezing in place, we're only going to
test XIDs that are < OldestXmin, that appear on pages that VACUUM
actually scans in the first place. Just checking tuple-level hint bits
should be effective. But even if it isn't (for whatever reason) then
it's similar to cases where our second heap pass has to do clog
lookups in heap_page_is_all_visible() just because hint bits couldn't
be set earlier on, back when lazy_scan_prune() processed the same page
during VACUUM's initial heap pass.

-- 
Peter Geoghegan




Avoiding unnecessary clog lookups while freezing

2022-12-28 Thread Peter Geoghegan
I took another look at the code coverage situation around freezing
following pushing the page-level freezing patch earlier today. I
spotted an issue that I'd missed up until now: certain sanity checks
in heap_prepare_freeze_tuple() call TransactionIdDidCommit() more
often than really seems necessary.

Theoretically this is an old issue that dates back to commit
699bf7d05c, as opposed to an issue in the page-level freezing patch.
But that's not really true in a real practical sense. In practice the
calls to TransactionIdDidCommit() will happen far more frequently
following today's commit 1de58df4fe (since we're using OldestXmin as
the cutoff that gates performing freeze_xmin/freeze_xmax processing --
not FreezeLimit).

No regressions related to clog lookups by VACUUM were apparent from my
performance validation of the page-level freezing work, but I suspect
that the increase in TransactionIdDidCommit() calls will cause
noticeable regressions with the right/wrong workload and/or
configuration. The page-level freezing work is expected to reduce clog
lookups in VACUUM in general, which is bound to have been a
confounding factor.

I see no reason why we can't just condition the XID sanity check calls
to TransactionIdDidCommit() (for both the freeze_xmin and the
freeze_xmax callsites) on a cheap tuple hint bit precheck not being
enough. We only need an expensive call to TransactionIdDidCommit()
when the precheck doesn't immediately indicate that the tuple xmin
looks committed when that's what the sanity check expects to see (or
that the tuple's xmax looks aborted, in the case of the callsite where
that's what we expect to see).
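
Schematically, the shape of the xmin-side check is something like the
following (stand-in names and a stand-in hint-bit test; not the actual
heapam.c code, which raises an error rather than returning false):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* stand-ins for the tuple hint-bit test and the clog lookup */
    extern bool xmin_committed_hint_set(const void *tuple);
    extern bool TransactionIdDidCommit(TransactionId xid);

    /*
     * Sanity-check xmin before freezing it: only pay for an expensive
     * clog lookup when the cheap hint-bit precheck can't already prove
     * that xmin committed.
     */
    static bool
    xmin_sane_to_freeze(const void *tuple, TransactionId xmin)
    {
        if (xmin_committed_hint_set(tuple))
            return true;                /* hint bit says committed */
        return TransactionIdDidCommit(xmin);    /* fall back to clog */
    }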

Attached patch shows what I mean. A quick run of the standard
regression tests (with custom instrumentation) shows that this patch
eliminates 100% of all relevant calls to TransactionIdDidCommit(), for
both the freeze_xmin and the freeze_xmax callsites.

-- 
Peter Geoghegan


v1-0001-Avoid-unnecessary-clog-lookups-when-freezing.patch
Description: Binary data


Re: New strategies for freezing, advancing relfrozenxid early

2022-12-26 Thread Peter Geoghegan
On Mon, Dec 26, 2022 at 10:57 PM Hayato Kuroda (Fujitsu)  wrote:
> I guessed that this assertion failure seemed to be caused by the commit 4ce3af[2],
> because the Assert() seemed to be added by the commit.

I agree that the problem is with this assertion, which is on the
master branch (not in recent versions of the patch series itself)
following commit 4ce3af:

    else
    {
        /*
         * Freeze plan for tuple "freezes xmax" in the strictest sense:
         * it'll leave nothing in xmax (neither an Xid nor a MultiXactId).
         */
        Assert(MultiXactIdPrecedes(xid, cutoffs->OldestMxact));
        ...
    }

The problem is that FRM_INVALIDATE_XMAX multi processing can occur
both in Multis from before OldestMxact and Multis >= OldestMxact. The
latter case (the >= case) is far less common, but still quite
possible. Not sure how I missed that.

Anyway, this assertion is wrong, and simply needs to be removed.
Thanks for the report.
-- 
Peter Geoghegan




Re: Call lazy_check_wraparound_failsafe earlier for parallel vacuum

2022-12-22 Thread Peter Geoghegan
On Tue, Dec 20, 2022 at 9:44 AM Imseih (AWS), Sami  wrote:
> Attached is a patch to check scanned pages rather
> than blockno.

Pushed, thanks!

I adjusted the FAILSAFE_EVERY_PAGES comments, which now point out that
FAILSAFE_EVERY_PAGES is a power-of-two. The implication is that the
compiler is all but guaranteed to be able to reduce the modulo
division into a shift in the lazy_scan_heap loop, at the point of the
failsafe check. I doubt that it would really matter if the compiler
had to generate a DIV instruction, but it seems like a good idea to
avoid it on general principle, at least in performance sensitive code.
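
To illustrate the compiler point with a standalone example (the
constant below is made up; the real FAILSAFE_EVERY_PAGES definition
lives in vacuumlazy.c):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t BlockNumber;

    /* made-up power-of-two value standing in for FAILSAFE_EVERY_PAGES */
    #define EVERY_PAGES ((BlockNumber) 4096)

    int
    main(void)
    {
        for (BlockNumber scanned_pages = 1; scanned_pages <= 10000;
             scanned_pages++)
        {
            /*
             * Because EVERY_PAGES is a power of two, the compiler can
             * turn this modulo into a mask test, i.e.
             * "(scanned_pages & (EVERY_PAGES - 1)) == 0", avoiding a
             * DIV instruction.
             */
            if (scanned_pages % EVERY_PAGES == 0)
                printf("failsafe check at %u scanned pages\n",
                       (unsigned) scanned_pages);
        }
        return 0;
    }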

--
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-21 Thread Peter Geoghegan
On Wed, Dec 21, 2022 at 4:30 PM Jeff Davis  wrote:
> The confusing thing to me is perhaps just the name -- to me,
> "freeze_required" suggests that if it were set to true, it would cause
> freezing to happen. But as far as I can tell, it does not cause
> freezing to happen, it causes some other things to happen that are
> necessary when freezing happens (updating and using the right
> trackers).

freeze_required is about what's required, which tells us nothing about
what will happen when it's not required (could go either way,
depending on how lazy_scan_prune feels about it).

Setting freeze_required=true implies that heap_prepare_freeze_tuple
has stopped doing maintenance of the "no freeze" trackers. When it
sets freeze_required=true, it really *does* force freezing to happen,
in every practical sense. This happens because lazy_scan_prune does
what it's told to do when it's told that freezing is required. Because
of course it does, why wouldn't it?

So...I still don't get what you mean. Why would lazy_scan_prune ever
break its contract with heap_prepare_freeze_tuple? And in what sense
would you say that heap_prepare_freeze_tuple's setting
freeze_required=true doesn't quite amount to "forcing freezing"? Are
you worried about the possibility that lazy_scan_prune will decide to
rebel at some point, and fail to honor its contract with
heap_prepare_freeze_tuple?  :-)

> A minor point, no need to take action here. Perhaps rename the
> variable.

Andres was the one that suggested this name, actually. I initially
just called it "freeze", but I think that Andres had it right.

> I think 0001+0002 are about ready.

Great. I plan on committing 0001 in the next few days. Committing 0002
might take a bit longer.

Thanks
-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-20 Thread Peter Geoghegan
On Tue, Dec 20, 2022 at 7:15 PM Peter Geoghegan  wrote:
> On Tue, Dec 20, 2022 at 5:44 PM Jeff Davis  wrote:
> > Next, the 'freeze_required' field suggests that it's more involved in
> > the control flow that causes freezing than it actually is. All it does
> > is communicate how the trackers need to be adjusted. The return value
> > of heap_prepare_freeze_tuple() (and underneath, the flags set by
> > FreezeMultiXactId()) are what actually control what happens. It would
> > be nice to make this more clear somehow.
>
> I'm not sure what you mean. Page-level freezing *doesn't* have to go
> ahead when freeze_required is not ever set to true for any tuple on
> the page (which is most of the time, in practice). lazy_scan_prune
> gets to make a choice about freezing the page, when the choice is
> available.

Oh wait, I think I see the point of confusion now.

When freeze_required is set to true, that means that lazy_scan_prune
literally has no choice -- it simply must freeze the page as
instructed by heap_prepare_freeze_tuple/FreezeMultiXactId. It's not
just a strong suggestion -- it's crucial that lazy_scan_prune freezes
the page as instructed.

The "no freeze" trackers (HeapPageFreeze.NoFreezePageRelfrozenXid and
HeapPageFreeze.NoFreezePageRelminMxid) won't have been maintained
properly when freeze_required was set, so lazy_scan_prune can't expect
to use them -- doing so would lead to VACUUM setting incorrect values
in pg_class later on.
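
Schematically, the way lazy_scan_prune consumes the two sets of
trackers is something like this (the struct field names follow the
patch, but the helper function and control flow here are a simplified,
compilable sketch rather than the actual code):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;
    typedef uint32_t MultiXactId;

    typedef struct HeapPageFreeze
    {
        bool          freeze_required;
        TransactionId FreezePageRelfrozenXid;
        MultiXactId   FreezePageRelminMxid;
        TransactionId NoFreezePageRelfrozenXid;
        MultiXactId   NoFreezePageRelminMxid;
    } HeapPageFreeze;

    /*
     * When the page is frozen (because it must be, or because the
     * caller chose to), only the "freeze" trackers are valid; when it
     * is left unfrozen, only the "no freeze" trackers are valid.
     */
    static void
    choose_trackers(const HeapPageFreeze *pagefrz, bool want_to_freeze,
                    TransactionId *NewRelfrozenXid,
                    MultiXactId *NewRelminMxid)
    {
        if (pagefrz->freeze_required || want_to_freeze)
        {
            *NewRelfrozenXid = pagefrz->FreezePageRelfrozenXid;
            *NewRelminMxid = pagefrz->FreezePageRelminMxid;
        }
        else
        {
            *NewRelfrozenXid = pagefrz->NoFreezePageRelfrozenXid;
            *NewRelminMxid = pagefrz->NoFreezePageRelminMxid;
        }
    }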

Avoiding the work of maintaining those "no freeze" trackers isn't just
a nice-to-have microoptimization -- it is sometimes very important. We
kind of rely on this to be able to avoid getting too many MultiXact
member SLRU buffer misses inside FreezeMultiXactId. There is a comment
above FreezeMultiXactId that advises its caller that it had better not
call heap_tuple_should_freeze when freeze_required is set to true,
because that could easily lead to multixact member SLRU buffer misses
-- misses that FreezeMultiXactId set out to avoid itself.

It could actually be cheaper to freeze than to not freeze, in the case
of a Multi -- member space misses can sometimes be really expensive.
And so FreezeMultiXactId sometimes freezes a Multi even though it's
not strictly required to do so.

Note also that this isn't a new behavior -- it's actually an old one,
for the most part. It kinda doesn't look that way, because we haven't
passed down separate FreezeLimit/OldestXmin cutoffs (and separate
OldestMxact/MultiXactCutoff cutoffs) until now. But we often don't
need that granular information to be able to process Multis eagerly,
even before the multi value is < MultiXactCutoff.

If you look at how FreezeMultiXactId works, in detail, you'll see that
even on Postgres HEAD it can (say) set a tuple's xmax to
InvalidTransactionId long before the multi value is < MultiXactCutoff.
It just needs to detect that the multi is not still running, and
notice that it's HEAP_XMAX_IS_LOCKED_ONLY(). Stuff like that happens
quite a bit. So for the most part "eager processing of Multis as a
special case" is an old behavior, that has only been enhanced a little
bit (the really important, new change in FreezeMultiXactId is how the
FRM_NOOP case works with FreezeLimit, even though OldestXmin is used
nearby -- this is extra confusing because 0002 doesn't change how we
use FreezeLimit -- it actually changes every other use of FreezeLimit
nearby, making it OldestXmin).
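
As a simplified illustration of that pre-existing behavior (just a
fragment for clarity, not the actual FreezeMultiXactId() code, though
HEAP_XMAX_IS_LOCKED_ONLY(), MultiXactIdIsRunning() and
FRM_INVALIDATE_XMAX are the real names):

    /*
     * A multi whose members only ever locked the tuple, and which is no
     * longer running, can have xmax removed outright -- regardless of
     * how the multi compares against MultiXactCutoff.
     */
    if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask) &&
        !MultiXactIdIsRunning(multi, true))
    {
        *flags |= FRM_INVALIDATE_XMAX;  /* xmax becomes InvalidTransactionId */
    }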

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-20 Thread Peter Geoghegan
 But a cleanup-locked page does *not* need to get rid of all
XIDs < OldestXmin, nor MXIDs < OldestMxact.

This flexibility is mostly useful because it allows lazy_scan_prune to
just decide to not freeze. But, to a much lesser degree, it's useful
because of the edge case with multis -- in general we might just need
the same leeway when lazy_scan_prune "freezes the page".

> That being said, it does push harder against the limits on both sides.
> If I understand correctly, that means pages with wider distributions of
> xids are going to persist longer, which could expose pre-existing bugs
> in new and interesting ways.

I don't think it's fundamentally different to what we're already doing
in lazy_scan_noprune. It's just more complicated, because you have to
tease apart slightly different definitions of freezing to understand
code around FreezeMultiXactId(). This is more or less needed to
provide maximum flexibility, where we delay decisions about what to do
until the very last moment.

> Next, the 'freeze_required' field suggests that it's more involved in
> the control flow that causes freezing than it actually is. All it does
> is communicate how the trackers need to be adjusted. The return value
> of heap_prepare_freeze_tuple() (and underneath, the flags set by
> FreezeMultiXactId()) are what actually control what happens. It would
> be nice to make this more clear somehow.

I'm not sure what you mean. Page-level freezing *doesn't* have to go
ahead when freeze_required is not ever set to true for any tuple on
the page (which is most of the time, in practice). lazy_scan_prune
gets to make a choice about freezing the page, when the choice is
available.

Note also that the FRM_NOOP case happens when a call to
FreezeMultiXactId() takes place that won't leave behind a freeze plan
for the tuple (unless its xmin happens to necessitate a freeze plan
for the same tuple). And yet, it will do useful work, needed iff the
"freeze the page" path is ultimately taken by lazy_scan_prune --
FreezeMultiXactId() itself will ratchet back
FreezePageRelfrozenXid/NewRelfrozenXid as needed to make everything
safe.

> The comment:
>
>   /*
>    * If we freeze xmax, make absolutely sure that it's not an XID that
>    * is important.  (Note, a lock-only xmax can be removed independent
>    * of committedness, since a committed lock holder has released the
>    * lock).
>    */
>
> caused me to go down a rabbit hole looking for edge cases where we
> might want to freeze an xmax but not an xmin; e.g. tup.xmax <
> OldestXmin < tup.xmin or the related case where tup.xmax < RecentXmin <
> tup.xmin. I didn't find a problem, so that's good news.

This is an example of what I meant about the heapam.c code using a
cutoff that actually comes from FreezeLimit, when it would be more
sensible to use OldestXmin instead.

> I also tried some pgbench activity along with concurrent vacuums (and
> vacuum freezes) along with periodic verify_heapam(). No problems there.
>
> Did you already describe the testing you've done for 0001+0002
> specifically? It's not radically new logic, but it would be good to try
> to catch minor state-handling errors.

Lots of stuff with contrib/amcheck, which, as you must already know,
will notice when an XID/MXID is contained in a table whose
relfrozenxid and/or relminmxid indicates that it shouldn't be there.
(Though VACUUM itself does the same thing, albeit not as effectively.)

Obviously the invariants haven't changed here. In many ways it's a
very small set of changes. But in one or two ways it's a significant
shift. It depends on how you think about it.

--
Peter Geoghegan



