Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-04-14 Thread Peter Geoghegan
On Thu, Apr 14, 2022 at 4:19 PM Jim Nasby  wrote:
> > - percentage of not-yet-removable vs removable tuples
>
> This'd give you an idea how bad your long-running-transaction problem is.

VACUUM fundamentally works by removing those tuples that are
considered dead according to an XID-based cutoff established when the
operation begins. And so, many very long running VACUUM operations will
see dead-but-not-removable tuples even when there are absolutely no
long running transactions (nor any other VACUUM operations). The only
long running thing involved might be our own long running VACUUM
operation.

I would like to reduce the number of non-removable dead tuples
encountered by VACUUM by "locking in" heap pages that we'd like to
scan up front. This would work by having VACUUM create its own local
in-memory copy of the visibility map before it even starts scanning
heap pages. That way VACUUM won't end up visiting heap pages just
because they were concurrently modified half way through our VACUUM
(by some other transactions). We don't really need to scan these pages
at all -- they have dead tuples, but not tuples that are "dead to
VACUUM".

The key idea here is to remove a big unnatural downside to slowing
VACUUM down. The cutoff would almost work like an MVCC snapshot that
describes precisely the work that VACUUM needs to do (which pages to
scan) up front. Once that's locked in, the amount of work we're
required to do cannot go up as we're doing it (or it'll be less of an
issue, at least).

It would also help if VACUUM didn't scan pages that it already knows
don't have any dead tuples. The current SKIP_PAGES_THRESHOLD rule
could easily be improved. That's almost the same problem.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-04-05 Thread Peter Geoghegan
On Mon, Apr 4, 2022 at 8:25 PM Peter Geoghegan  wrote:
> Right. The reason I used WARNINGs was because it matches vaguely
> related WARNINGs in vac_update_relstats()'s sibling function,
> vacuum_set_xid_limits().

Okay, pushed the relfrozenxid warning patch.

Thanks
-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-04-04 Thread Peter Geoghegan
On Mon, Apr 4, 2022 at 8:18 PM Andres Freund  wrote:
> The remaining patch is the warnings in vac_update_relstats(), correct?  I
> guess one could argue they should be LOG rather than WARNING, but I find the
> project stance on that pretty impractical. So warning's ok with me.

Right. The reason I used WARNINGs was because it matches vaguely
related WARNINGs in vac_update_relstats()'s sibling function,
vacuum_set_xid_limits().

> Not sure why you used errmsg_internal()?

The usual reason for using errmsg_internal(), I suppose. I tend to do
that with corruption related messages on the grounds that they're
usually highly obscure issues that are (by definition) never supposed
to happen. The only thing that a user can be expected to do with the
information from the message is to report it to -bugs, or find some
other similar report.
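
To make that concrete: the main practical effect of errmsg_internal()
is that the message string isn't marked for translation. Purely as an
illustration (this is not the exact message text from the patch; the
variable names just follow the vac_update_relstats() snippet quoted
elsewhere in the thread), the warning ends up looking something like:

ereport(WARNING,
		(errcode(ERRCODE_DATA_CORRUPTED),
		 errmsg_internal("relfrozenxid %u for table \"%s\" appears to be in the future, overwriting with %u",
						 pgcform->relfrozenxid,
						 NameStr(pgcform->relname),
						 frozenxid)));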

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-04 19:32:13 -0700, Peter Geoghegan wrote:
> On Fri, Apr 1, 2022 at 10:54 AM Peter Geoghegan  wrote:
> > I also refined the WARNING patch in v15. It now actually issues
> > WARNINGs (rather than PANICs, which were just a temporary debugging
> > measure in v14).
> 
> Going to commit this remaining patch tomorrow, barring objections.

The remaining patch is the warnings in vac_update_relstats(), correct?  I
guess one could argue they should be LOG rather than WARNING, but I find the
project stance on that pretty impractical. So warning's ok with me.

Not sure why you used errmsg_internal()?

Otherwise LGTM.

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-04-04 Thread Peter Geoghegan
On Fri, Apr 1, 2022 at 10:54 AM Peter Geoghegan  wrote:
> I also refined the WARNING patch in v15. It now actually issues
> WARNINGs (rather than PANICs, which were just a temporary debugging
> measure in v14).

Going to commit this remaining patch tomorrow, barring objections.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-04-03 Thread Peter Geoghegan
On Sun, Apr 3, 2022 at 12:05 PM Andres Freund  wrote:
> Just saw that you committed: Whee! I think this will be a substantial
> improvement for our users.

I hope so! I think that it's much more useful as the basis for future
work than as a standalone thing. Users of Postgres 15 might not notice
a huge difference. But it opens up a lot of new directions to take
VACUUM in.

I would like to get rid of anti-wraparound VACUUMs and aggressive
VACUUMs in Postgres 16. This isn't as radical as it sounds. It seems
quite possible to find a way for *every* VACUUM to become aggressive
progressively and dynamically. We'll still need to have autovacuum.c
know about wraparound, but it should be just another threshold,
not fundamentally different from the other thresholds (except that it's
still used when autovacuum is nominally disabled).

The behavior around autovacuum cancellations is probably still going
to be necessary when age(relfrozenxid) gets too high, but it shouldn't
be conditioned on what age(relfrozenxid) *used to be*, when the
autovacuum started. That could have been a long time ago. It should be
based on what's happening *right now*.

> While I was writing the above, I again realized that it'd be awfully nice to
> have some accumulated stats about (auto-)vacuum's effectiveness, both so that
> we can get feedback about improvements more easily and so that users know
> what aspects they need to tune.

Strongly agree. And I'm excited about the potential of the shared
memory stats patch to enable more thorough instrumentation, which
allows us to improve things with feedback that we just can't get right
now.

VACUUM is still too complicated -- that makes this kind of analysis
much harder, even for experts. You need more continuous behavior to
get value from this kind of analysis. There are too many things that
might end up mattering, that really shouldn't ever matter. Too much
potential for strange illogical discontinuities in performance over
time.

Having only one type of VACUUM (excluding VACUUM FULL) will be much
easier for users to reason about. But I also think that it'll be much
easier for us to reason about. For example, better autovacuum
scheduling will be made much easier if autovacuum.c can just assume
that every VACUUM operation will do the same amount of work. (Another
problem with the scheduling is that it uses ANALYZE statistics
(sampling) in a way that just doesn't make any sense for something
like VACUUM, which is an inherently dynamic and cyclic process.)

None of this stuff has to rely on my patch for freezing. We don't
necessarily have to make every VACUUM advance relfrozenxid to do all
this. The important point is that we definitely shouldn't be putting
off *all* freezing of all-visible pages in non-aggressive VACUUMs (or
in VACUUMs that are not expected to advance relfrozenxid). Even a very
conservative implementation could achieve all this; we need only
spread out the burden of freezing all-visible pages over time, across
multiple VACUUM operations. Make the behavior continuous.

> Knowing how many times a table was vacuumed doesn't really tell that much, and
> having to enable log_autovacuum_min_duration and then aggregate those
> results is pretty painful (and version dependent).

Yeah. Ideally we could avoid making the output of
log_autovacuum_min_duration into an API, by having a real API instead.
The output probably needs to evolve some more. A lot of very basic
information wasn't there until recently.

> If we just collected something like:
> - number of heap passes
> - time spent heap vacuuming
> - number of index scans
> - time spent index vacuuming
> - time spent delaying

You forgot FPIs.

> - percentage of not-yet-removable vs removable tuples

I think that we should address this directly too, by "taking a
snapshot of the visibility map", so that we at least don't scan/vacuum
heap pages that don't really need it. This is also valuable because it
makes slowing down VACUUM (maybe slowing it down a lot) have fewer
downsides. At least we'll have "locked in" our scanned_pages, which we
can figure out in full before we really scan even one page.

> it'd start to be a heck of a lot easier to judge how well autovacuum is
> coping.

What about the potential of the shared memory stats stuff to totally
replace the use of ANALYZE stats in autovacuum.c? Possibly with help
from vacuumlazy.c, and the visibility map?

I see a lot of potential for exploiting the visibility map more, both
within vacuumlazy.c itself, and for autovacuum.c scheduling [1]. I'd
probably start with the scheduling stuff, and only then work out how
to show users more actionable information.

[1] 
https://postgr.es/m/cah2-wzkt9ey9nnm7q9nsaw5jdbjvsaq3yvb4ut4m93uajvd...@mail.gmail.com
--
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-04-03 Thread Andres Freund
Hi,

On 2022-04-01 10:54:14 -0700, Peter Geoghegan wrote:
> On Thu, Mar 31, 2022 at 11:19 AM Peter Geoghegan  wrote:
> > The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)".
>
> Attached is v15. I plan to commit the first two patches (the most
> substantial two patches by far) in the next couple of days, barring
> objections.

Just saw that you committed: Whee! I think this will be a substantial
improvement for our users.


While I was writing the above, I again realized that it'd be awfully nice to
have some accumulated stats about (auto-)vacuum's effectiveness, both so that
we can get feedback about improvements more easily and so that users know
what aspects they need to tune.

Knowing how many times a table was vacuumed doesn't really tell that much, and
having to enable log_autovacuum_min_duration and then aggregate those
results is pretty painful (and version dependent).

If we just collected something like:
- number of heap passes
- time spent heap vacuuming
- number of index scans
- time spent index vacuuming
- time spent delaying
- percentage of not-yet-removable vs removable tuples

it'd start to be a heck of a lot easier to judge how well autovacuum is
coping.

If we tracked the related pieces above in the index stats (or perhaps
additionally there), it'd also make it easier to judge the cost of different
indexes.
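
Just to sketch the shape of it (every field name here is invented on
the spot, purely for illustration -- this isn't from any patch):

/* Hypothetical per-table counters for accumulated vacuum activity */
typedef struct PgStat_VacuumActivity
{
	int64		heap_passes;			/* number of heap passes */
	int64		heap_vacuum_time_us;	/* time spent heap vacuuming */
	int64		index_scans;			/* number of index scans */
	int64		index_vacuum_time_us;	/* time spent index vacuuming */
	int64		delay_time_us;			/* time spent in cost-based delays */
	int64		removable_tuples;		/* removable dead tuples seen */
	int64		not_yet_removable_tuples;	/* dead but not yet removable */
} PgStat_VacuumActivity;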

- Andres




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-04-01 Thread Peter Geoghegan
On Thu, Mar 31, 2022 at 11:19 AM Peter Geoghegan  wrote:
> The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)".

Attached is v15. I plan to commit the first two patches (the most
substantial two patches by far) in the next couple of days, barring
objections.

v15 removes this "Assert(diff > 0)" assertion from 0001. It's not
adding any value, now that the underlying issue that it accidentally
brought to light is well understood (there are still more robust
assertions covering the relfrozenxid/relminmxid invariants). "Assert(diff >
0)" is liable to fail until the underlying bug on HEAD is fixed, which
can be treated as separate work.

I also refined the WARNING patch in v15. It now actually issues
WARNINGs (rather than PANICs, which were just a temporary debugging
measure in v14). Also fixed a compiler warning in this patch, based on
a complaint from CFBot's CompilerWarnings task. I can delay committing
this WARNING patch until right before feature freeze. Seems best to
give others more opportunity for comments.

-- 
Peter Geoghegan


v15-0003-Have-VACUUM-warn-on-relfrozenxid-from-the-future.patch
Description: Binary data


v15-0004-vacuumlazy.c-Move-resource-allocation-to-heap_va.patch
Description: Binary data


v15-0002-Generalize-how-VACUUM-skips-all-frozen-pages.patch
Description: Binary data


v15-0001-Set-relfrozenxid-to-oldest-extant-XID-seen-by-VA.patch
Description: Binary data


Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-31 Thread Peter Geoghegan
On Thu, Mar 31, 2022 at 10:50 AM Andres Freund  wrote:
> > So, to be clear: vac_update_relstats() never actually considered the
> > new relfrozenxid value from its vacuumlazy.c caller to be "in the
> > future"?
>
> No, I added separate debug messages for those, and also applied your patch,
> and it didn't trigger.

The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)". Plus
the other related assert I mentioned did not trigger. So when this
"diff" assert did trigger, the value of "diff" must have been 0 (not a
negative value). While this state does technically indicate that the
"existing" relfrozenxid value (actually a stale version) appears to be
"in the future" (because the OldestXmin XID might still never have
been allocated), it won't ever be in the future according to
vac_update_relstats() (even if it used that version).

I suppose that I might be wrong about that, somehow -- anything is
possible. The important point is that there is currently no evidence
that this bug (or any very recent bug) could ever allow
vac_update_relstats() to actually believe that it needs to update
relfrozenxid/relminmxid, purely because the existing value is in the
future.

The fact that vac_update_relstats() doesn't log/warn when this happens
is very unfortunate, but there is nevertheless no evidence that that
would have informed us of any bug on HEAD, even including the actual
bug here, which is a bug in vacuumlazy.c (not in vac_update_relstats).

> I do think we should apply a version of the warnings you have (with a WARNING
> instead of PANIC obviously). I think it's bordering on insanity that we have
> so many paths to just silently fix stuff up around vacuum. It's like we want
> things to be undebuggable, and to give users no warnings about something being
> up.

Yeah, it's just totally self defeating to not at least log it. I mean
this is a code path that is only hit once per VACUUM, so there is
practically no risk of that causing any new problems.

> Can you repro the issue with my recipe? FWIW, adding log_min_messages=debug5
> and fsync=off made the crash trigger more quickly.

I'll try to do that today. I'm not feeling the most energetic right
now, to be honest.

--
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-31 Thread Andres Freund
Hi,

On 2022-03-31 10:12:49 -0700, Peter Geoghegan wrote:
> On Wed, Mar 30, 2022 at 9:59 PM Andres Freund  wrote:
> > I'm not sure there's a proper bug on HEAD here. I think at worst it can
> > delay the horizon increasing a bunch, by falsely not using an aggressive
> > vacuum when we should have - might even be limited to a single autovacuum
> > cycle.
> 
> So, to be clear: vac_update_relstats() never actually considered the
> new relfrozenxid value from its vacuumlazy.c caller to be "in the
> future"?

No, I added separate debug messages for those, and also applied your patch,
and it didn't trigger.

I don't immediately see how we could end up computing a frozenxid value that
would be problematic? The pgcform->relfrozenxid value will always be the
"local" value, which afaics can be behind the other database's value (and thus
behind the value from the relcache init file). But it can't be ahead, we have
the proper invalidations for that (I think).


I do think we should apply a version of the warnings you have (with a WARNING
instead of PANIC obviously). I think it's bordering on insanity that we have
so many paths to just silently fix stuff up around vacuum. It's like we want
things to be undebuggable, and to give users no warnings about something being
up.


> It just looked that way to the failing assertion in
> vacuumlazy.c, because its own version of the original relfrozenxid was
> stale from the beginning? And so the worst problem is probably just
> that we don't use aggressive VACUUM when we really should in rare
> cases?

Yes, I think that's right.

Can you repro the issue with my recipe? FWIW, adding log_min_messages=debug5
and fsync=off made the crash trigger more quickly.

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-31 Thread Peter Geoghegan
On Thu, Mar 31, 2022 at 10:11 AM Andres Freund  wrote:
> I don't think we should weaken defenses against xids from before relfrozenxid
> in vacuum / amcheck /  If anything we should strengthen them.
>
> Isn't it also just plainly required for correctness? We'd not necessarily
> trigger a vacuum in time to remove the xid before approaching wraparound if we
> put in an xid before relfrozenxid? That happening in prune_xid is obviously
> less bad than on actual data, but still.

Yeah, you're right. Ambiguity about stuff like this should be avoided
on general principle.

> ISTM we should just use our own xid. Yes, it might delay cleanup a bit
> longer. But unless there's already crud on the page (with prune_xid already
> set), the abort of the speculative insertion isn't likely to make the
> difference?

Speculative insertion abort is pretty rare in the real world, I bet.
The speculative insertion precheck is very likely to work almost
always with real workloads.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-31 Thread Peter Geoghegan
On Wed, Mar 30, 2022 at 9:59 PM Andres Freund  wrote:
> I'm not sure there's a proper bug on HEAD here. I think at worst it can delay
> the horizon increasing a bunch, by falsely not using an aggressive vacuum when
> we should have - might even be limited to a single autovacuum cycle.

So, to be clear: vac_update_relstats() never actually considered the
new relfrozenxid value from its vacuumlazy.c caller to be "in the
future"? It just looked that way to the failing assertion in
vacuumlazy.c, because its own version of the original relfrozenxid was
stale from the beginning? And so the worst problem is probably just
that we don't use aggressive VACUUM when we really should in rare
cases?

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-31 Thread Andres Freund
Hi,

On 2022-03-31 09:58:18 -0700, Peter Geoghegan wrote:
> On Thu, Mar 31, 2022 at 9:37 AM Andres Freund  wrote:
> > The only place that appears to access rd_rel->relfrozenxid outside of DDL is
> > heap_abort_speculative().
> 
> I wonder how necessary that really is. Even if the XID is before
> relfrozenxid, does that in itself really make it "in the future"?
> Obviously it's often necessary to make the assumption that allowing
> wraparound amounts to allowing XIDs "from the future" to exist, which
> is dangerous. But why here? Won't pruning by VACUUM eventually correct
> the issue anyway?

I don't think we should weaken defenses against xids from before relfrozenxid
in vacuum / amcheck /  If anything we should strengthen them.

Isn't it also just plainly required for correctness? We'd not necessarily
trigger a vacuum in time to remove the xid before approaching wraparound if we
put in an xid before relfrozenxid? That happening in prune_xid is obviously
less bad than on actual data, but still.


ISTM we should just use our own xid. Yes, it might delay cleanup a bit
longer. But unless there's already crud on the page (with prune_xid already
set), the abort of the speculative insertion isn't likely to make the
difference?
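
Something like this, as a sketch only (assuming the call site in
heap_abort_speculative(), which already computes its own XID near the
top of the function):

	/*
	 * Sketch: mark the page prunable with our own transaction's XID
	 * rather than the relation's relfrozenxid, so that we never stamp
	 * the page with an XID that precedes relfrozenxid.
	 */
	TransactionId xid = GetCurrentTransactionId();

	PageSetPrunable(page, xid);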

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-31 Thread Peter Geoghegan
On Thu, Mar 31, 2022 at 9:37 AM Andres Freund  wrote:
> Perhaps we should explicitly mask out parts of relcache entries in the shared
> init file that we know to be unreliable. I.e. set relfrozenxid, relminmxid to
> Invalid* or such.

That has the advantage of being more honest. If you're going to break
the abstraction, then it seems best to break it in an obvious way
that leaves no doubt about what you're supposed to be relying on.

This bug doesn't seem like the kind of thing that should be left
as-is, if only because it makes it hard to add something like a
WARNING when we make relfrozenxid go backwards (on the basis of the
existing value apparently being in the future), which we really should
have been doing all along.

The whole reason why we overwrite pg_class.relfrozenxid values from
the future is to ameliorate the effects of more serious bugs like the
pg_upgrade/pg_resetwal one fixed in commit 74cf7d46 not so long ago
(mid last year). We had essentially the same pg_upgrade "from the
future" bug twice (once for relminmxid in the MultiXact bug era,
another more recent version affecting relfrozenxid).

> The only place that appears to access rd_rel->relfrozenxid outside of DDL is
> heap_abort_speculative().

I wonder how necessary that really is. Even if the XID is before
relfrozenxid, does that in itself really make it "in the future"?
Obviously it's often necessary to make the assumption that allowing
wraparound amounts to allowing XIDs "from the future" to exist, which
is dangerous. But why here? Won't pruning by VACUUM eventually correct
the issue anyway?

--
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-31 Thread Andres Freund
Hi,

On 2022-03-30 21:59:15 -0700, Andres Freund wrote:
> On 2022-03-30 21:29:16 -0700, Peter Geoghegan wrote:
> > On Wed, Mar 30, 2022 at 9:20 PM Andres Freund  wrote:
> > > Perhaps we should just fetch the horizons from the "local" catalog for 
> > > shared
> > > rels?
> > 
> > Not sure what you mean.
> 
> Basically, instead of relying on the relcache, which for shared relations is
> vulnerable to seeing "too new" horizons due to the shared relcache init file,
> explicitly load relfrozenxid / relminmxid from the catalog / syscache.
> 
> I.e. fetch the relevant pg_class row in heap_vacuum_rel() (using
> SearchSysCache[Copy1](RELID)). And use that to set vacrel->relfrozenxid
> etc. Whereas right now we only fetch the pg_class row in
> vac_update_relstats(), but use the relcache before.

Perhaps we should explicitly mask out parts of relcache entries in the shared
init file that we know to be unreliable. I.e. set relfrozenxid, relminmxid to
Invalid* or such.

I even wonder if we should just generally move those out of the fields we have
in the relcache, not just for shared rels loaded from the shared init
file. Presumably by just moving them into the CATALOG_VARLEN ifdef.

The only place that appears to access rd_rel->relfrozenxid outside of DDL is
heap_abort_speculative().

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-30 Thread Andres Freund
Hi,

On 2022-03-30 21:29:16 -0700, Peter Geoghegan wrote:
> On Wed, Mar 30, 2022 at 9:20 PM Andres Freund  wrote:
> > But the debug elog reports that
> >
> > relfrozenxid updated 714 -> 717
> > relminmxid updated 1 -> 6
> >
> > The problem is that the crashing backend reads the relfrozenxid/relminmxid
> > from the shared relcache init file written by another backend:
> 
> We should have added logging of relfrozenxid and relminmxid a long time ago.

At least at DEBUG1 or such.


> > This is basically the inverse of a54e1f15 - we read a *newer* horizon. 
> > That's
> > normally fairly harmless - I think.
> 
> Is this one pretty old?

What do you mean by "this one"? The cause of the assert failure?

I'm not sure there's a proper bug on HEAD here. I think at worst it can delay
the horizon increasing a bunch, by falsely not using an aggressive vacuum when
we should have - might even be limited to a single autovacuum cycle.



> > Perhaps we should just fetch the horizons from the "local" catalog for 
> > shared
> > rels?
> 
> Not sure what you mean.

Basically, instead of relying on the relcache, which for shared relations is
vulnerable to seeing "too new" horizons due to the shared relcache init file,
explicitly load relfrozenxid / relminmxid from the catalog / syscache.

I.e. fetch the relevant pg_class row in heap_vacuum_rel() (using
SearchSysCache[Copy1](RELID)). And use that to set vacrel->relfrozenxid
etc. Whereas right now we only fetch the pg_class row in
vac_update_relstats(), but use the relcache before.
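
Roughly like this, just as a sketch of the idea (untested, error
handling elided; uses the RELOID syscache):

	/*
	 * Sketch: read relfrozenxid/relminmxid straight from pg_class via
	 * the syscache in heap_vacuum_rel(), rather than trusting the
	 * relcache entry (which, for shared rels, may have come from the
	 * shared relcache init file).
	 */
	HeapTuple	ctup;
	Form_pg_class pgcform;

	ctup = SearchSysCache1(RELOID, ObjectIdGetDatum(RelationGetRelid(rel)));
	if (!HeapTupleIsValid(ctup))
		elog(ERROR, "pg_class entry for relid %u vanished during VACUUM",
			 RelationGetRelid(rel));
	pgcform = (Form_pg_class) GETSTRUCT(ctup);

	vacrel->relfrozenxid = pgcform->relfrozenxid;
	vacrel->relminmxid = pgcform->relminmxid;

	ReleaseSysCache(ctup);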

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-30 Thread Peter Geoghegan
On Wed, Mar 30, 2022 at 9:29 PM Peter Geoghegan  wrote:
> > Perhaps we should just fetch the horizons from the "local" catalog for 
> > shared
> > rels?
>
> Not sure what you mean.

Wait, you mean use vacrel->relfrozenxid directly? Seems kind of ugly...

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-30 Thread Peter Geoghegan
On Wed, Mar 30, 2022 at 9:20 PM Andres Freund  wrote:
> But the debug elog reports that
>
> relfrozenxid updated 714 -> 717
> relminmxid updated 1 -> 6
>
> The problem is that the crashing backend reads the relfrozenxid/relminmxid
> from the shared relcache init file written by another backend:

We should have added logging of relfrozenxid and relminmxid a long time ago.

> This is basically the inverse of a54e1f15 - we read a *newer* horizon. That's
> normally fairly harmless - I think.

Is this one pretty old?

> Perhaps we should just fetch the horizons from the "local" catalog for shared
> rels?

Not sure what you mean.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-30 Thread Andres Freund
Hi,

On 2022-03-30 21:11:48 -0700, Peter Geoghegan wrote:
> On Wed, Mar 30, 2022 at 9:04 PM Andres Freund  wrote:
> > (gdb) p vacrel->NewRelfrozenXid
> > $3 = 717
> > (gdb) p vacrel->relfrozenxid
> > $4 = 717
> > (gdb) p OldestXmin
> > $5 = 5112
> > (gdb) p aggressive
> > $6 = false
>
> Does this OldestXmin seem reasonable at this point in execution, based
> on context? Does it look too high? Something else?

Reasonable:
(gdb) p *ShmemVariableCache
$1 = {nextOid = 78969, oidCount = 2951, nextXid = {value = 21411}, oldestXid = 714,
  xidVacLimit = 20714, xidWarnLimit = 2107484361, xidStopLimit = 2144484361,
  xidWrapLimit = 2147484361, oldestXidDB = 1, oldestCommitTsXid = 0,
  newestCommitTsXid = 0, latestCompletedXid = {value = 21408},
  xactCompletionCount = 1635, oldestClogXid = 714}

I think the explanation I just sent explains the problem, without "in-memory"
confusion about what's running and what's not.

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-30 Thread Andres Freund
Hi,

On 2022-03-30 21:04:07 -0700, Andres Freund wrote:
> On 2022-03-30 20:35:25 -0700, Peter Geoghegan wrote:
> > On Wed, Mar 30, 2022 at 8:28 PM Andres Freund  wrote:
> > > I triggered twice now, but it took a while longer the second time.
> >
> > Great.
> >
> > I wonder if you can get an RR recording...
>
> Started it, but looks like it's too slow.
>
> (gdb) p MyProcPid
> $1 = 2172500
>
> (gdb) p vacrel->NewRelfrozenXid
> $3 = 717
> (gdb) p vacrel->relfrozenxid
> $4 = 717
> (gdb) p OldestXmin
> $5 = 5112
> (gdb) p aggressive
> $6 = false

I added a bunch of debug elogs to see what sets *frozenxid_updated to true.

(gdb) p *vacrel
$1 = {rel = 0x7fe24f3e0148, indrels = 0x7fe255c17ef8, nindexes = 2,
  aggressive = false, skipwithvm = true, failsafe_active = false,
  consider_bypass_optimization = true, do_index_vacuuming = true,
  do_index_cleanup = true, do_rel_truncate = true, bstrategy = 0x7fe255bb0e28,
  pvs = 0x0, relfrozenxid = 717, relminmxid = 6, old_live_tuples = 42,
  OldestXmin = 20751, vistest = 0x7fe255058970 , FreezeLimit = 4244988047,
  MultiXactCutoff = 4289967302, NewRelfrozenXid = 717, NewRelminMxid = 6,
  skippedallvis = false, relnamespace = 0x7fe255c17bf8 "pg_catalog",
  relname = 0x7fe255c17cb8 "pg_database", indname = 0x0, blkno = 4294967295,
  offnum = 0, phase = VACUUM_ERRCB_PHASE_SCAN_HEAP, verbose = false,
  dead_items = 0x7fe255c131d0, rel_pages = 8, scanned_pages = 8,
  removed_pages = 0, lpdead_item_pages = 0, missed_dead_pages = 0,
  nonempty_pages = 8, new_rel_tuples = 124, new_live_tuples = 42,
  indstats = 0x7fe255c18320, num_index_scans = 0, tuples_deleted = 0,
  lpdead_items = 0, live_tuples = 42, recently_dead_tuples = 82,
  missed_dead_tuples = 0}

But the debug elog reports that

relfrozenxid updated 714 -> 717
relminmxid updated 1 -> 6

The problem is that the crashing backend reads the relfrozenxid/relminmxid
from the shared relcache init file written by another backend:

2022-03-30 21:10:47.626 PDT [2625038][autovacuum worker][6/433:0][] LOG:  automatic vacuum of table "contrib_regression_postgres_fdw.pg_catalog.pg_database": index scans: 1
pages: 0 removed, 8 remain, 8 scanned (100.00% of total)
tuples: 4 removed, 114 remain, 72 are dead but not yet removable
removable cutoff: 20751, older by 596 xids when operation ended
new relfrozenxid: 717, which is 3 xids ahead of previous value
new relminmxid: 6, which is 5 mxids ahead of previous value
index scan needed: 3 pages from table (37.50% of total) had 8 dead item identifiers removed
index "pg_database_datname_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_database_oid_index": pages: 6 in total, 0 newly deleted, 2 currently deleted, 2 reusable
I/O timings: read: 0.050 ms, write: 0.102 ms
avg read rate: 209.860 MB/s, avg write rate: 76.313 MB/s
buffer usage: 42 hits, 22 misses, 8 dirtied
WAL usage: 13 records, 5 full page images, 33950 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
...
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][:0][] DEBUG:  InitPostgres
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/0:0][] DEBUG:  my backend ID is 6
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/0:0][] LOG:  reading shared init file
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/443:0][] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/sub>
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/443:0][] LOG:  reading non-shared init file

This is basically the inverse of a54e1f15 - we read a *newer* horizon. That's
normally fairly harmless - I think.

Perhaps we should just fetch the horizons from the "local" catalog for shared
rels?

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-30 Thread Peter Geoghegan
On Wed, Mar 30, 2022 at 9:04 PM Andres Freund  wrote:
> (gdb) p vacrel->NewRelfrozenXid
> $3 = 717
> (gdb) p vacrel->relfrozenxid
> $4 = 717
> (gdb) p OldestXmin
> $5 = 5112
> (gdb) p aggressive
> $6 = false

Does this OldestXmin seem reasonable at this point in execution, based
on context? Does it look too high? Something else?

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-30 Thread Andres Freund
Hi,

On 2022-03-30 20:35:25 -0700, Peter Geoghegan wrote:
> On Wed, Mar 30, 2022 at 8:28 PM Andres Freund  wrote:
> > I triggered twice now, but it took a while longer the second time.
>
> Great.
>
> I wonder if you can get an RR recording...

Started it, but looks like it's too slow.

(gdb) p MyProcPid
$1 = 2172500

(gdb) p vacrel->NewRelfrozenXid
$3 = 717
(gdb) p vacrel->relfrozenxid
$4 = 717
(gdb) p OldestXmin
$5 = 5112
(gdb) p aggressive
$6 = false

There was another autovacuum of pg_database 10s before:

2022-03-30 20:35:17.622 PDT [2165344][autovacuum worker][5/3:0][] LOG:  automatic vacuum of table "postgres.pg_catalog.pg_database": index scans: 1
pages: 0 removed, 3 remain, 3 scanned (100.00% of total)
tuples: 61 removed, 4 remain, 1 are dead but not yet removable
removable cutoff: 1921, older by 3 xids when operation ended
new relfrozenxid: 717, which is 3 xids ahead of previous value
index scan needed: 3 pages from table (100.00% of total) had 599 dead item identifiers removed
index "pg_database_datname_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_database_oid_index": pages: 4 in total, 0 newly deleted, 0 currently deleted, 0 reusable
I/O timings: read: 0.029 ms, write: 0.034 ms
avg read rate: 134.120 MB/s, avg write rate: 89.413 MB/s
buffer usage: 35 hits, 12 misses, 8 dirtied
WAL usage: 12 records, 5 full page images, 27218 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

The dying backend:
2022-03-30 20:35:27.668 PDT [2172500][autovacuum worker][7/0:0][] DEBUG:  autovacuum: processing database "contrib_regression_hstore"
...
2022-03-30 20:35:27.690 PDT [2172500][autovacuum worker][7/674:0][] CONTEXT:  while cleaning up index "pg_database_oid_index" of relation "pg_catalog.pg_database"


Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-30 Thread Andres Freund
Hi,

On 2022-03-30 20:28:44 -0700, Andres Freund wrote:
> I was able to trigger the crash.
> 
> cat ~/tmp/pgbench-createdb.sql
> CREATE DATABASE pgb_:client_id;
> DROP DATABASE pgb_:client_id;
> 
> pgbench -n -P1 -c 10 -j10 -T100 -f ~/tmp/pgbench-createdb.sql
> 
> while I was also running
> 
> for i in $(seq 1 100); do echo iteration $i; make -Otarget -C contrib/ -s installcheck -j48 -s prove_installcheck=true USE_MODULE_DB=1 > /tmp/ci-$i.log 2>&1; done
> 
> I triggered twice now, but it took a while longer the second time.

Forgot to say how postgres was started. Via my usual devenv script, which
results in:

+ /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres -c 
hba_file=/home/andres/tmp/pgdev/pg_hba.conf -D /srv/dev/pgdev-dev/ -p 5440 -c 
shared_buffers=2GB -c wal_level=hot_standby -c max_wal_senders=10 -c 
track_io_timing=on -c restart_after_crash=false -c max_prepared_transactions=20 
-c log_checkpoints=on -c min_wal_size=48MB -c max_wal_size=150GB -c 
'cluster_name=dev assert' -c 
ssl_cert_file=/home/andres/tmp/pgdev/ssl-cert-snakeoil.pem -c 
ssl_key_file=/home/andres/tmp/pgdev/ssl-cert-snakeoil.key -c 
'log_line_prefix=%m [%p][%b][%v:%x][%a] ' -c shared_buffers=16MB -c 
log_min_messages=debug1 -c log_connections=on -c allow_in_place_tablespaces=1 
-c log_autovacuum_min_duration=0 -c log_lock_waits=true -c 
autovacuum_naptime=10s -c fsync=off

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-30 Thread Peter Geoghegan
On Wed, Mar 30, 2022 at 8:28 PM Andres Freund  wrote:
> I triggered twice now, but it took a while longer the second time.

Great.

I wonder if you can get an RR recording...
-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-30 Thread Andres Freund
Hi,

I was able to trigger the crash.

cat ~/tmp/pgbench-createdb.sql
CREATE DATABASE pgb_:client_id;
DROP DATABASE pgb_:client_id;

pgbench -n -P1 -c 10 -j10 -T100 -f ~/tmp/pgbench-createdb.sql

while I was also running

for i in $(seq 1 100); do echo iteration $i; make -Otarget -C contrib/ -s installcheck -j48 -s prove_installcheck=true USE_MODULE_DB=1 > /tmp/ci-$i.log 2>&1; done

I triggered twice now, but it took a while longer the second time.

(gdb) bt full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
set = {__val = {4194304, 0, 0, 0, 0, 0, 216172782113783808, 2, 
2377909399344644096, 18446497967838863616, 0, 0, 0, 0, 0, 0}}
pid = 
tid = 
ret = 
#1  0x7fe49a2db546 in __GI_abort () at abort.c:79
save_stage = 1
act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, 
sa_mask = {__val = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}},
  sa_flags = 0, sa_restorer = 0x107e0}
sigs = {__val = {32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}}
#2  0x7fe49b9706f1 in ExceptionalCondition (conditionName=0x7fe49ba0618d 
"diff > 0", errorType=0x7fe49ba05bd1 "FailedAssertion",
fileName=0x7fe49ba05b90 
"/home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c", 
lineNumber=724)
at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69
No locals.
#3  0x7fe49b2fc739 in heap_vacuum_rel (rel=0x7fe497a8d148, 
params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10)
at /home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:724
buf = {
  data = 0x7fe49c17e238 "automatic vacuum of table 
\"contrib_regression_dict_int.pg_catalog.pg_database\": index scans: 1\npages: 
0 removed, 3 remain, 3 scanned (100.00% of total)\ntuples: 49 removed, 53 
remain, 9 are dead but no"..., len = 279, maxlen = 1024, cursor = 0}
msgfmt = 0x7fe49ba06038 "automatic vacuum of table \"%s.%s.%s\": index 
scans: %d\n"
diff = 0
endtime = 702011687982080
vacrel = 0x7fe49c19b5b8
verbose = false
instrument = true
ru0 = {tv = {tv_sec = 1648696487, tv_usec = 975963}, ru = {ru_utime = 
{tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 3086}, {
  ru_maxrss = 10824, __ru_maxrss_word = 10824}, {ru_ixrss = 0, 
__ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, 
__ru_isrss_word = 0}, {ru_minflt = 449, __ru_minflt_word = 449}, {ru_majflt = 
0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, 
__ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word = 0}, {ru_msgsnd = 
0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 
0, __ru_nsignals_word = 0}, {ru_nvcsw = 2, __ru_nvcsw_word = 2}, {ru_nivcsw = 
0, __ru_nivcsw_word = 0}}}
starttime = 702011687975964
walusage_start = {wal_records = 0, wal_fpi = 0, wal_bytes = 0}
walusage = {wal_records = 11, wal_fpi = 7, wal_bytes = 30847}
secs = 0
usecs = 6116
read_rate = 16.606033355134073
write_rate = 7.6643230869849575
aggressive = false
skipwithvm = true
frozenxid_updated = true
minmulti_updated = true
orig_rel_pages = 3
new_rel_pages = 3
new_rel_allvisible = 0
indnames = 0x7fe49c19bb28
errcallback = {previous = 0x0, callback = 0x7fe49b3012fd 
, arg = 0x7fe49c19b5b8}
startreadtime = 180
startwritetime = 0
OldestXmin = 67552
FreezeLimit = 4245034848
OldestMxact = 224
MultiXactCutoff = 4289967520
__func__ = "heap_vacuum_rel"
#4  0x7fe49b523d92 in table_relation_vacuum (rel=0x7fe497a8d148, 
params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10) at 
/home/andres/src/postgresql/src/include/access/tableam.h:1680
No locals.
#5  0x7fe49b527032 in vacuum_rel (relid=1262, relation=0x7fe49c1ae360, 
params=0x7fe49c130d7c) at 
/home/andres/src/postgresql/src/backend/commands/vacuum.c:2065
lmode = 4
rel = 0x7fe497a8d148
lockrelid = {relId = 1262, dbId = 0}
toast_relid = 0
save_userid = 10
save_sec_context = 0
save_nestlevel = 2
__func__ = "vacuum_rel"
#6  0x7fe49b524c3b in vacuum (relations=0x7fe49c1b03a8, 
params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10, isTopLevel=true) at 
/home/andres/src/postgresql/src/backend/commands/vacuum.c:482
vrel = 0x7fe49c1ae3b8
cur__state = {l = 0x7fe49c1b03a8, i = 0}
cur = 0x7fe49c1b03c0
_save_exception_stack = 0x7fff97e35a10
_save_context_stack = 0x0
_local_sigjmp_buf = {{__jmpbuf = {140735741652128, 6126579318940970843, 
9223372036854775747, 0, 0, 0, 6126579318957748059, 6139499258682879835}, 
__mask_was_saved = 0, __saved_mask = {__val = {32, 140619848279000, 8590910454, 
140619848278592, 32, 140619848278944, 7784, 

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-30 Thread Peter Geoghegan
On Wed, Mar 30, 2022 at 7:37 PM Peter Geoghegan  wrote:
> Yeah, a WARNING would be good here. I can write a new version of my
> patch series with a separate patch for that this evening. Actually,
> better make it a PANIC for now...

Attached is v14, which includes a new patch that PANICs like that in
vac_update_relstats() --- 0003.

This approach also covers manual VACUUMs, which isn't the case with
the failing assertion, which is in instrumentation code (actually
VACUUM VERBOSE might hit it).

I definitely think that something like this should be committed.
Silently ignoring system catalog corruption isn't okay.

-- 
Peter Geoghegan


v14-0003-PANIC-on-relfrozenxid-from-the-future.patch
Description: Binary data


v14-0001-Set-relfrozenxid-to-oldest-extant-XID-seen-by-VA.patch
Description: Binary data


v14-0002-Generalize-how-VACUUM-skips-all-frozen-pages.patch
Description: Binary data


v14-0004-vacuumlazy.c-Move-resource-allocation-to-heap_va.patch
Description: Binary data


Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-30 Thread Peter Geoghegan
On Wed, Mar 30, 2022 at 7:00 PM Andres Freund  wrote:
> Something vaguely like EXPECT_EQ_U32 in regress.c. Maybe
> AssertCmp(type, a, op, b),
>
> Then the assertion could have been something like
>AssertCmp(int32, diff, >, 0)

I'd definitely use them if they were there.

> Does the line number in the failed run actually correspond to the xid, rather
> than the mxid case? I didn't check.

Yes, I verified -- definitely relfrozenxid.

> You can do something similar locally on linux with
> make -Otarget -C contrib/ -j48 -s USE_MODULE_DB=1 installcheck prove_installcheck=true
> (the prove_installcheck=true to prevent tap tests from running, we don't seem
> to have another way for that)
>
> I don't think windows uses USE_MODULE_DB=1, but it allows causing a lot more
> load concurrently than running tests serially...

Can't get it to fail locally with that recipe.

> > Assert(vacrel->NewRelfrozenXid == OldestXmin ||
> >        TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
> >                                      vacrel->NewRelfrozenXid));
>
> The comment in your patch says "is either older or newer than FreezeLimit" - I
> assume that's some rephrasing damage?

Both the comment and the assertion are correct. I see what you mean, though.

> Perhaps it's worth committing improved assertions on master? If this is indeed
> a pre-existing bug, and we're just missing it due to slightly less stringent
> asserts, we could rectify that separately.

I don't think there's much chance of the assertion actually hitting
without the rest of the patch series. The new relfrozenxid value is
always going to be OldestXmin - vacuum_freeze_min_age on HEAD, while
with the patch it's sometimes close to OldestXmin. Especially when you
have lots of dead tuples that you churn through constantly (like
pgbench_tellers, or like these system catalogs on the CI test
machine).

> Hm. This triggers some vague memories. There are some oddities around shared
> relations being vacuumed separately in all the databases and thus having
> separate horizons.

That's what I was thinking of, obviously.

> After "remembering" that, I looked in the cirrus log for the failed run, and
> the worker was processing a shared relation last:
>
> 2022-03-30 03:48:30.238 GMT [5984][autovacuum worker] LOG:  automatic analyze of table "contrib_regression.pg_catalog.pg_authid"

I noticed the same thing myself. Should have said sooner.

> Perhaps this ought to be an elog() instead of an Assert()? Something has gone
> pear shaped if we get here... It's a bit annoying though, because it'd have to
> be a PANIC to be visible on the bf / CI :(.

Yeah, a WARNING would be good here. I can write a new version of my
patch series with a separate patch for that this evening. Actually,
better make it a PANIC for now...

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-30 Thread Andres Freund
Hi,

On 2022-03-30 17:50:42 -0700, Peter Geoghegan wrote:
> I tried several times to recreate this issue on CI. No luck with that,
> though -- can't get it to fail again after 4 attempts.

It's really annoying that we don't have Assert variants that show the compared
values; that might make it easier to interpret what's going on.

Something vaguely like EXPECT_EQ_U32 in regress.c. Maybe
AssertCmp(type, a, op, b),

Then the assertion could have been something like
   AssertCmp(int32, diff, >, 0)
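
A rough sketch of what such a macro could look like (not from any
patch, and glossing over the fact that it would need to compile away
in non-assert builds):

#define AssertCmp(type, a, op, b) \
	do { \
		type		a_ = (a); \
		type		b_ = (b); \
		if (!(a_ op b_)) \
			elog(PANIC, "assertion %s %s %s failed: values %lld vs %lld", \
				 #a, #op, #b, (long long) a_, (long long) b_); \
	} while (0)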


Does the line number in the failed run actually correspond to the xid, rather
than the mxid case? I didn't check.


You could try to increase the likelihood of reproducing the failure by
duplicating the invocation that led to the crash a few times in the
.cirrus.yml file in your dev branch. That might allow hitting the problem more
quickly.

Maybe reduce autovacuum_naptime in src/tools/ci/pg_ci_base.conf?

Or locally - one thing that windows CI does differently from the other platforms
is that it runs isolation, contrib and a bunch of other tests using the same
cluster. Which of course increases the likelihood of autovacuum having stuff
to do, *particularly* on shared relations - normally there's probably not
enough changes for that.

You can do something similar locally on linux with
make -Otarget -C contrib/ -j48 -s USE_MODULE_DB=1 installcheck prove_installcheck=true
(the prove_installcheck=true to prevent tap tests from running, we don't seem
to have another way for that)

I don't think windows uses USE_MODULE_DB=1, but it allows causing a lot more
load concurrently than running tests serially...


> We know that this particular assertion did not fail during the same VACUUM:
> 
> Assert(vacrel->NewRelfrozenXid == OldestXmin ||
>        TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
>                                      vacrel->NewRelfrozenXid));

The comment in your patch says "is either older or newer than FreezeLimit" - I
assume that's some rephrasing damage?



> So it's hard to see how this could be a bug in the patch -- the final
> new relfrozenxid is presumably equal to VACUUM's OldestXmin in the
> problem scenario seen on the CI Windows instance yesterday (that's why
> this earlier assertion didn't fail).

Perhaps it's worth committing improved assertions on master? If this is indeed
a pre-existing bug, and we're just missing it due to slightly less stringent
asserts, we could rectify that separately.


> The surprising part of the CI failure must have taken place just after
> this assertion, when VACUUM's call to vac_update_relstats() actually
> updates pg_class.relfrozenxid with vacrel->NewRelfrozenXid --
> presumably because the existing relfrozenxid appeared to be "in the
> future" when we examine it in pg_class again. We see evidence that
> this must have happened afterwards, when the closely related assertion
> (used only in instrumentation code) fails:

Hm. This triggers some vague memories. There are some oddities around shared
relations being vacuumed separately in all the databases and thus having
separate horizons.


After "remembering" that, I looked in the cirrus log for the failed run, and
the worker was processing a shared relation last:

2022-03-30 03:48:30.238 GMT [5984][autovacuum worker] LOG:  automatic analyze of table "contrib_regression.pg_catalog.pg_authid"

Obviously that's not a guarantee that the next table processed also is a
shared catalog, but ...

Oh, the relid is actually in the stack trace. 0x4ee = 1262 =
pg_database. Which makes sense, the test ends up with a high percentage of
dead rows in pg_database, due to all the different contrib tests
creating/dropping a database.



> From my patch:
> 
> > if (frozenxid_updated)
> > {
> > -   diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
> > +   diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
> > +   Assert(diff > 0);
> > appendStringInfo(,
> >  _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
> > -FreezeLimit, diff);
> > +vacrel->NewRelfrozenXid, diff);
> > }

Perhaps this ought to be an elog() instead of an Assert()? Something has gone
pear shaped if we get here... It's a bit annoying though, because it'd have to
be a PANIC to be visible on the bf / CI :(.

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-30 Thread Peter Geoghegan
On Wed, Mar 30, 2022 at 12:01 AM Peter Geoghegan  wrote:
> Perhaps something is amiss inside vac_update_relstats(), where the
> boolean flag that indicates that pg_class.relfrozenxid was advanced is
> set:
>
> if (frozenxid_updated)
>     *frozenxid_updated = false;
> if (TransactionIdIsNormal(frozenxid) &&
>     pgcform->relfrozenxid != frozenxid &&
>     (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
>      TransactionIdPrecedes(ReadNextTransactionId(),
>                            pgcform->relfrozenxid)))
> {
>     if (frozenxid_updated)
>         *frozenxid_updated = true;
>     pgcform->relfrozenxid = frozenxid;
>     dirty = true;
> }
>
> Maybe the "existing relfrozenxid is in the future, silently update
> relfrozenxid" part of the condition (which involves
> ReadNextTransactionId()) somehow does the wrong thing here. But how?

I tried several times to recreate this issue on CI. No luck with that,
though -- can't get it to fail again after 4 attempts.

This was a VACUUM of pg_database, run from an autovacuum worker. I am
vaguely reminded of the two bugs fixed by Andres in commit a54e1f15.
Both were issues with the shared relcache init file affecting shared
and nailed catalog relations. Those bugs had symptoms like " ERROR:
found xmin ... from before relfrozenxid ..." for various system
catalogs.

We know that this particular assertion did not fail during the same VACUUM:

Assert(vacrel->NewRelfrozenXid == OldestXmin ||
       TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
                                     vacrel->NewRelfrozenXid));

So it's hard to see how this could be a bug in the patch -- the final
new relfrozenxid is presumably equal to VACUUM's OldestXmin in the
problem scenario seen on the CI Windows instance yesterday (that's why
this earlier assertion didn't fail).  The assertion I'm showing here
needs the "vacrel->NewRelfrozenXid == OldestXmin" part of the
condition to account for the fact that
OldestXmin/GetOldestNonRemovableTransactionId() is known to "go
backwards". Without that the regression tests will fail quite easily.

The surprising part of the CI failure must have taken place just after
this assertion, when VACUUM's call to vac_update_relstats() actually
updates pg_class.relfrozenxid with vacrel->NewRelfrozenXid --
presumably because the existing relfrozenxid appeared to be "in the
future" when we examine it in pg_class again. We see evidence that
this must have happened afterwards, when the closely related assertion
(used only in instrumentation code) fails:

From my patch:

> if (frozenxid_updated)
> {
> -   diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
> +   diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
> +   Assert(diff > 0);
> appendStringInfo(,
>  _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
> -FreezeLimit, diff);
> +vacrel->NewRelfrozenXid, diff);
> }

Does anybody have any ideas about what might be going on here?

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-30 Thread Peter Geoghegan
On Tue, Mar 29, 2022 at 11:10 PM Justin Pryzby  wrote:
>
> +   diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
> +   Assert(diff > 0);
>
> Did you see that this crashed on windows cfbot?
>
> https://api.cirrus-ci.com/v1/artifact/task/4592929254670336/log/tmp_check/postmaster.log
> TRAP: FailedAssertion("diff > 0", File: "c:\cirrus\src\backend\access\heap\vacuumlazy.c", Line: 724, PID: 5984)

That's weird. There are very similar assertions a little earlier, that
must have *not* failed here, before the call to vac_update_relstats().
I was actually thinking of removing this assertion for that reason --
I thought that it was redundant.

Perhaps something is amiss inside vac_update_relstats(), where the
boolean flag that indicates that pg_class.relfrozenxid was advanced is
set:

if (frozenxid_updated)
    *frozenxid_updated = false;
if (TransactionIdIsNormal(frozenxid) &&
    pgcform->relfrozenxid != frozenxid &&
    (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
     TransactionIdPrecedes(ReadNextTransactionId(),
                           pgcform->relfrozenxid)))
{
    if (frozenxid_updated)
        *frozenxid_updated = true;
    pgcform->relfrozenxid = frozenxid;
    dirty = true;
}

Maybe the "existing relfrozenxid is in the future, silently update
relfrozenxid" part of the condition (which involves
ReadNextTransactionId()) somehow does the wrong thing here. But how?

The other assertions take into account the fact that OldestXmin can
itself "go backwards" across VACUUM operations against the same table:

Assert(!aggressive || vacrel->NewRelfrozenXid == OldestXmin ||
       TransactionIdPrecedesOrEquals(FreezeLimit,
                                     vacrel->NewRelfrozenXid));

Note the "vacrel->NewRelfrozenXid == OldestXmin", without which the
assertion will fail pretty easily when the regression tests are run.
Perhaps I need to do something like that with the other assertion as
well (or more likely just get rid of it). Will figure it out tomorrow.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-30 Thread Justin Pryzby
+   diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
+   Assert(diff > 0);

Did you see that this crashed on windows cfbot?

https://api.cirrus-ci.com/v1/artifact/task/4592929254670336/log/tmp_check/postmaster.log
TRAP: FailedAssertion("diff > 0", File: "c:\cirrus\src\backend\access\heap\vacuumlazy.c", Line: 724, PID: 5984)
abort() has been called
2022-03-30 03:48:30.267 GMT [5316][client backend] [pg_regress/tablefunc][3/15389:0] ERROR:  infinite recursion detected
2022-03-30 03:48:38.031 GMT [5592][postmaster] LOG:  server process (PID 5984) was terminated by exception 0xC354
2022-03-30 03:48:38.031 GMT [5592][postmaster] DETAIL:  Failed process was running: autovacuum: VACUUM ANALYZE pg_catalog.pg_database
2022-03-30 03:48:38.031 GMT [5592][postmaster] HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.

https://cirrus-ci.com/task/4592929254670336

`007ff130 0001`400b4ef8 postgres!ExceptionalCondition(
char * conditionName = 0x0001`40a915d8 "diff > 0", 
char * errorType = 0x0001`40a915c8 
"FailedAssertion", 
char * fileName = 0x0001`40a91598 
"c:\cirrus\src\backend\access\heap\vacuumlazy.c", 
int lineNumber = 0n724)+0x8d 
[c:\cirrus\src\backend\utils\error\assert.c @ 70]
`007ff170 0001`402a0914 postgres!heap_vacuum_rel(
struct RelationData * rel = 0x`00a51088, 
struct VacuumParams * params = 0x`00a8420c, 
struct BufferAccessStrategyData * bstrategy = 
0x`00a842a0)+0x1038 [c:\cirrus\src\backend\access\heap\vacuumlazy.c @ 
724]
`007ff350 0001`402a4686 postgres!table_relation_vacuum(
struct RelationData * rel = 0x`00a51088, 
struct VacuumParams * params = 0x`00a8420c, 
struct BufferAccessStrategyData * bstrategy = 
0x`00a842a0)+0x34 [c:\cirrus\src\include\access\tableam.h @ 1681]
`007ff380 0001`402a1a2d postgres!vacuum_rel(
unsigned int relid = 0x4ee, 
struct RangeVar * relation = 0x`01799ae0, 
struct VacuumParams * params = 
0x`00a8420c)+0x5a6 [c:\cirrus\src\backend\commands\vacuum.c @ 2068]
`007ff400 0001`4050f1ef postgres!vacuum(
struct List * relations = 0x`0179df58, 
struct VacuumParams * params = 0x`00a8420c, 
struct BufferAccessStrategyData * bstrategy = 
0x`00a842a0, 
bool isTopLevel = true)+0x69d 
[c:\cirrus\src\backend\commands\vacuum.c @ 482]
`007ff5f0 0001`4050dc95 postgres!autovacuum_do_vac_analyze(
struct autovac_table * tab = 0x`00a84208, 
struct BufferAccessStrategyData * bstrategy = 
0x`00a842a0)+0x8f [c:\cirrus\src\backend\postmaster\autovacuum.c @ 3248]
`007ff640 0001`4050b4e3 postgres!do_autovacuum(void)+0xef5 
[c:\cirrus\src\backend\postmaster\autovacuum.c @ 2503]

It seems like there should be even more logs, especially since it says:
[03:48:43.119] Uploading 3 artifacts for c:\cirrus\**\*.diffs
[03:48:43.122] Uploaded c:\cirrus\contrib\tsm_system_rows\regression.diffs
[03:48:43.125] Uploaded c:\cirrus\contrib\tsm_system_time\regression.diffs




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-29 Thread Peter Geoghegan
On Tue, Mar 29, 2022 at 11:58 AM Peter Geoghegan  wrote:
> > I think I understand what the first paragraph of the header comment
> > for heap_tuple_needs_freeze() is trying to say, but the second one is
> > quite confusing. I think this is again because it veers into talking
> > about what the caller should do rather than explaining what the
> > function itself does.
>
> I wouldn't have done it that way if the function wasn't called
> heap_tuple_needs_freeze().
>
> I would be okay with removing this paragraph if the function was
> renamed to reflect the fact it now tells the caller something about
> the tuple having an old XID/MXID relative to the caller's own XID/MXID
> cutoffs. Maybe the function name should be heap_tuple_would_freeze(),
> making it clear that the function merely tells caller what
> heap_prepare_freeze_tuple() *would* do, without presuming to tell the
> vacuumlazy.c caller what it *should* do about any of the information
> it is provided.

Attached is v13, which does it that way. This does seem like a real
increase in clarity, albeit one that comes at the cost of renaming
heap_tuple_needs_freeze().

v13 also addresses all of the other items from Robert's most recent
round of feedback.

I would like to commit something close to v13 on Friday or Saturday.

Thanks
-- 
Peter Geoghegan


v13-0003-vacuumlazy.c-Move-resource-allocation-to-heap_va.patch
Description: Binary data


v13-0002-Generalize-how-VACUUM-skips-all-frozen-pages.patch
Description: Binary data


v13-0001-Set-relfrozenxid-to-oldest-extant-XID-seen-by-VA.patch
Description: Binary data


Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-29 Thread Peter Geoghegan
On Tue, Mar 29, 2022 at 10:03 AM Robert Haas  wrote:
> I can understand this version of the commit message. Woohoo! I like
> understanding things.

That's good news.

> I think the header comments for FreezeMultiXactId() focus way too much
> on what the caller is supposed to do and not nearly enough on what
> FreezeMultiXactId() itself does. I think to some extent this also
> applies to the comments within the function body.

To some extent this is a legitimate difference in style. I myself
don't think that it's intrinsically good to have these sorts of
comments. I just think that it can be the least worst thing when a
function is intrinsically written with one caller and one very
specific set of requirements in mind. That is pretty much a matter of
taste, though.

> I think I understand what the first paragraph of the header comment
> for heap_tuple_needs_freeze() is trying to say, but the second one is
> quite confusing. I think this is again because it veers into talking
> about what the caller should do rather than explaining what the
> function itself does.

I wouldn't have done it that way if the function wasn't called
heap_tuple_needs_freeze().

I would be okay with removing this paragraph if the function was
renamed to reflect the fact it now tells the caller something about
the tuple having an old XID/MXID relative to the caller's own XID/MXID
cutoffs. Maybe the function name should be heap_tuple_would_freeze(),
making it clear that the function merely tells caller what
heap_prepare_freeze_tuple() *would* do, without presuming to tell the
vacuumlazy.c caller what it *should* do about any of the information
it is provided.

Then it becomes natural to see the boolean return value and the
changes the function makes to caller's relfrozenxid/relminmxid tracker
variables as independent.
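To illustrate that independence, here is a minimal standalone sketch
(not the actual heapam.c code -- the types and the freezing test are
deliberately simplified, MultiXact handling is elided, and XID
wraparound is ignored):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;   /* simplified; real XIDs wrap around */

/*
 * Report whether the caller's cutoff would force freezing of this
 * tuple, while *always* ratcheting the caller's relfrozenxid tracker
 * down to the oldest extant XID seen.  The two outputs are independent.
 */
static bool
tuple_would_freeze(TransactionId xmin, TransactionId xmax,
                   TransactionId cutoff_xid,
                   TransactionId *relfrozenxid_out)
{
    bool would_freeze = false;

    /* Track the oldest extant XID, whether or not we'd freeze it */
    if (xmin < *relfrozenxid_out)
        *relfrozenxid_out = xmin;
    if (xmax != 0 && xmax < *relfrozenxid_out)
        *relfrozenxid_out = xmax;

    /* Independently: would heap_prepare_freeze_tuple have work to do? */
    if (xmin < cutoff_xid || (xmax != 0 && xmax < cutoff_xid))
        would_freeze = true;

    return would_freeze;
}

int
main(void)
{
    /* Caller's tracker starts at its OldestXmin */
    TransactionId tracker = 1000;
    bool          would;

    /* xmin 900 is newer than the FreezeLimit-style cutoff of 800 ... */
    would = tuple_would_freeze(900, 0, 800, &tracker);

    /* ... so nothing would be frozen, but the tracker still moved to 900 */
    printf("would_freeze=%d, tracker=%u\n", would, (unsigned) tracker);
    return 0;
}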

> I don't like the statement-free else block in lazy_scan_noprune(). I
> think you could delete the else{} and just put that same comment there
> with one less level of indentation. There's a clear "return false"
> just above so it shouldn't be confusing what's happening.

Okay, will fix.

> The comment hunk at the end of lazy_scan_noprune() would probably be
> better if it said something more specific than "caller can tolerate
> reduced processing." My guess is that it would be something like
> "caller does not need to do something or other."

I meant "caller can tolerate not pruning or freezing this particular
page". Will fix.

> I have my doubts about whether the overwrite-a-future-relfrozenxid
> behavior is any good, but that's a topic for another day. I suggest
> keeping the words "it seems best to", though, because they convey a
> level of tentativeness, which seems appropriate.

I agree that it's best to keep a tentative tone here. That code was
written following a very specific bug in pg_upgrade several years
back. There was a very recent bug fixed only last year, by commit
74cf7d46.

FWIW I tend to think that we'd have a much better chance of catching
that sort of thing if we'd had better relfrozenxid instrumentation
before now. Now you'd see a negative value in the "new relfrozenxid:
%u, which is %d xids ahead of previous value" part of the autovacuum
log message in the event of such a bug. That's weird enough that I bet
somebody would notice and report it.
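(For instance, assuming the "%d xids ahead" figure is just a signed
difference of the two unsigned XIDs -- a simplified standalone sketch,
not the actual autovacuum logging code -- a relfrozenxid that somehow
moved backwards shows up unmistakably:)

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

int
main(void)
{
    TransactionId old_relfrozenxid = 5000;
    TransactionId new_relfrozenxid = 4200;   /* moved backwards: a bug */

    /* Same shape as "new relfrozenxid: %u, which is %d xids ahead ..." */
    int32_t diff = (int32_t) (new_relfrozenxid - old_relfrozenxid);

    printf("new relfrozenxid: %u, which is %d xids ahead of previous value\n",
           (unsigned) new_relfrozenxid, (int) diff);
    /* prints "... -800 xids ahead ..." -- weird enough to get reported */
    return 0;
}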

> I am surprised to see you write in maintenance.sgml that the VACUUM
> which most recently advanced relfrozenxid will typically be the most
> recent aggressive VACUUM. I would have expected something like "(often
> the most recent VACUUM)".

That's always been true, and will only be slightly less true in
Postgres 15 -- the fact is that we only need to skip one all-visible
page to lose out, and that's not unlikely for any table that isn't
quite small, even with all the patches from v12 applied (we're still
much too naive). The work that I'll get into Postgres 15 on VACUUM is very
valuable as a basis for future improvements, but not all that valuable
to users (improved instrumentation might be the biggest benefit in 15,
or maybe relminmxid advancement for certain types of applications).

I still think that we need to do more proactive page-level freezing to
make relfrozenxid advancement happen in almost every VACUUM, but even
that won't quite be enough. There are still cases where we need to
make a choice about giving up on relfrozenxid advancement in a
non-aggressive VACUUM -- all-visible pages won't completely go away
with page-level freezing. At a minimum we'll still have edge cases
like the case where heap_lock_tuple() unsets the all-frozen bit. And
pg_upgrade'd databases, too.

0002 structures the logic for skipping using the VM in a way that will
make the choice to skip or not skip all-visible pages in
non-aggressive VACUUMs quite natural. I suspect that
SKIP_PAGES_THRESHOLD was always mostly just about relfrozenxid
advancement in non-aggressive VACUUM, all along. We can do much better
than SKIP_PAGES_THRESHOLD, especially if we preprocess the entire

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-29 Thread Robert Haas
On Sun, Mar 27, 2022 at 11:24 PM Peter Geoghegan  wrote:
> Attached is v12. My current goal is to commit all 3 patches before
> feature freeze. Note that this does not include the more complicated
> patch including with previous revisions of the patch series (the
> page-level freezing work that appeared in versions before v11).

Reviewing 0001, focusing on the words in the patch file much more than the code:

I can understand this version of the commit message. Woohoo! I like
understanding things.

I think the header comments for FreezeMultiXactId() focus way too much
on what the caller is supposed to do and not nearly enough on what
FreezeMultiXactId() itself does. I think to some extent this also
applies to the comments within the function body.

On the other hand, the header comments for heap_prepare_freeze_tuple()
seem good to me. If I were thinking of calling this function, I would
know how to use the new arguments. If I were looking for bugs in it, I
could compare the logic in the function to what these comments say it
should be doing. Yay.

I think I understand what the first paragraph of the header comment
for heap_tuple_needs_freeze() is trying to say, but the second one is
quite confusing. I think this is again because it veers into talking
about what the caller should do rather than explaining what the
function itself does.

I don't like the statement-free else block in lazy_scan_noprune(). I
think you could delete the else{} and just put that same comment there
with one less level of indentation. There's a clear "return false"
just above so it shouldn't be confusing what's happening.

The comment hunk at the end of lazy_scan_noprune() would probably be
better if it said something more specific than "caller can tolerate
reduced processing." My guess is that it would be something like
"caller does not need to do something or other."

I have my doubts about whether the overwrite-a-future-relfrozenxid
behavior is any good, but that's a topic for another day. I suggest
keeping the words "it seems best to", though, because they convey a
level of tentativeness, which seems appropriate.

I am surprised to see you write in maintenance.sgml that the VACUUM
which most recently advanced relfrozenxid will typically be the most
recent aggressive VACUUM. I would have expected something like "(often
the most recent VACUUM)".

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-27 Thread Peter Geoghegan
On Thu, Mar 24, 2022 at 2:40 PM Peter Geoghegan  wrote:
> > > This is absolutely mandatory in the aggressive case, because otherwise
> > > relfrozenxid advancement might be seen as unsafe. My observation is:
> > > Why should we accept the same race in the non-aggressive case? Why not
> > > do essentially the same thing in every VACUUM?
> >
> > Sure, that seems like a good idea. I think I basically agree with the
> > goals of the patch.
>
> Great.

Attached is v12. My current goal is to commit all 3 patches before
feature freeze. Note that this does not include the more complicated
patch including with previous revisions of the patch series (the
page-level freezing work that appeared in versions before v11).

Changes that appear in this new revision, v12:

* Reworking of the commit messages based on feedback from Robert.

* General cleanup of the changes to heapam.c from 0001 (the changes to
heap_prepare_freeze_tuple and related functions).  New and existing
code now fits together a bit better. I also added a couple of new
documenting assertions, to make the flow a bit easier to understand.

* Added new assertions that document
OldestXmin/FreezeLimit/relfrozenxid invariants, right at the point we
update pg_class within vacuumlazy.c.

These assertions would have a decent chance of failing if there were
any bugs in the code.

* Removed patch that made DISABLE_PAGE_SKIPPING not force aggressive
VACUUM, limiting the underlying mechanism to forcing scanning of all
pages in lazy_scan_heap (v11 was the first and last revision that
included this patch).

* Adds a new small patch 0003. This just moves the last piece of
resource allocation that still took place at the top of
lazy_scan_heap() back into its caller, heap_vacuum_rel().

The work in 0003 probably should have happened as part of the patch
that became commit 73f6ec3d -- same idea. It's totally mechanical
stuff. With 0002 and 0003, there is hardly any lazy_scan_heap code
before the main loop that iterates through blocks in rel_pages (and
the code that's still there is obviously related to the loop in a
direct and obvious way). This seems like a big overall improvement in
maintainability.

Didn't see a way to split up 0002, per Robert's suggestion 3 days ago.
As I said at the time, it's possible to split it up, but not in a way
that highlights the underlying issue (since the issue 0002 fixes was
always limited to non-aggressive VACUUMs). The commit message may have
to suffice.

--
Peter Geoghegan


v12-0001-Set-relfrozenxid-to-oldest-extant-XID-seen-by-VA.patch
Description: Binary data


v12-0003-vacuumlazy.c-Move-resource-allocation-to-heap_va.patch
Description: Binary data


v12-0002-Generalize-how-VACUUM-skips-all-frozen-pages.patch
Description: Binary data


Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-24 Thread Peter Geoghegan
On Thu, Mar 24, 2022 at 1:21 PM Robert Haas  wrote:
> > How about: "Set relfrozenxid to oldest extant XID seen by VACUUM"
>
> Sure, that sounds nice.

Cool.

> > What you're saying here boils down to this: it doesn't matter what the
> > visibility map would say right this microsecond (in the aggressive
> > case) were we to call VM_ALL_FROZEN(): we know for sure that the VM
> > said that this page was all-frozen *in the recent past*. That's good
> > enough; we will never fail to scan a page that might have an XID <
> > OldestXmin (ditto for XMIDs) this way, which is all that really
> > matters.
>
> Makes sense. So maybe the commit message should try to emphasize this
> point e.g. "If a page is all-frozen at the time we check whether it
> can be skipped, don't allow it to affect the relfrozenxmin and
> relminmxid which we set for the relation. This was previously true for
> aggressive vacuums, but not for non-aggressive vacuums, which was
> inconsistent. (The reason this is a safe thing to do is that any new
> XIDs or MXIDs that appear on the page after we initially observe it to
> be frozen must be newer than any relfrozenxid or relminmxid the
> current vacuum could possibly consider storing into pg_class.)"

Okay, I'll add something more like that.

Almost every aspect of relfrozenxid advancement by VACUUM seems
simpler when thought about in these terms IMV. Every VACUUM now scans
all pages that might have XIDs < OldestXmin, and so every VACUUM can
advance relfrozenxid to the oldest extant XID (barring non-aggressive
VACUUMs that *choose* to skip some all-visible pages).

There are a lot more important details, of course. My "Every
VACUUM..." statement works well as an axiom because all of those other
details don't create any awkward exceptions.

> > This is absolutely mandatory in the aggressive case, because otherwise
> > relfrozenxid advancement might be seen as unsafe. My observation is:
> > Why should we accept the same race in the non-aggressive case? Why not
> > do essentially the same thing in every VACUUM?
>
> Sure, that seems like a good idea. I think I basically agree with the
> goals of the patch.

Great.

> My concern is just about making the changes
> understandable to future readers. This area is notoriously subtle, and
> people are going to introduce more bugs even if the comments and code
> organization are fantastic.

Makes sense.

> > And so you could almost say that there is no behavioral change at
> > all.
>
> I vigorously object to this part, though. We should always err on the
> side of saying that commits *do* have behavioral changes.

I think that you've taken my words too literally here. I would never
conceal the intent of a piece of work like that. I thought that it
would clarify matters to point out that I could in theory "get away
with it if I wanted to" in this instance. This was only a means of
conveying a subtle point about the behavioral changes from 0002 --
since you couldn't initially see them yourself (even with my commit
message).

Kind of like Tom Lane's 2011 talk on the query planner. The one where
he lied to the audience several times.

> > It seems kinda tricky to split up 0002 like that. It's possible, but
> > I'm not sure if it's possible to split it in a way that highlights the
> > issue that I just described. Because we already avoided the race in
> > the aggressive case.
>
> I do see that there are some difficulties there. I'm not sure what to
> do about that. I think a sufficiently clear commit message could
> possibly be enough, rather than trying to split the patch. But I also
> think splitting the patch should be considered, if that can reasonably
> be done.

I'll see if I can come up with something. It's hard to be sure about
that kind of thing when you're this close to the code.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-24 Thread Robert Haas
On Thu, Mar 24, 2022 at 3:28 PM Peter Geoghegan  wrote:
> But non-aggressive VACUUMs have always been able to do that.
>
> How about: "Set relfrozenxid to oldest extant XID seen by VACUUM"

Sure, that sounds nice.

> Believe it or not, I avoided functional changes in 0002 -- at least in
> one important sense. That's why you had difficulty spotting any. This
> must sound peculiar, since the commit message very clearly says that
> the commit avoids a problem seen only in the non-aggressive case. It's
> really quite subtle.

Well, I think the goal in revising the code is to be as un-subtle as
possible. Commits that people can't easily understand breed future
bugs.

> What you're saying here boils down to this: it doesn't matter what the
> visibility map would say right this microsecond (in the aggressive
> case) were we to call VM_ALL_FROZEN(): we know for sure that the VM
> said that this page was all-frozen *in the recent past*. That's good
> enough; we will never fail to scan a page that might have an XID <
> OldestXmin (ditto for XMIDs) this way, which is all that really
> matters.

Makes sense. So maybe the commit message should try to emphasize this
point e.g. "If a page is all-frozen at the time we check whether it
can be skipped, don't allow it to affect the relfrozenxmin and
relminmxid which we set for the relation. This was previously true for
aggressive vacuums, but not for non-aggressive vacuums, which was
inconsistent. (The reason this is a safe thing to do is that any new
XIDs or MXIDs that appear on the page after we initially observe it to
be frozen must be newer than any relfrozenxid or relminmxid the
current vacuum could possibly consider storing into pg_class.)"

> This is absolutely mandatory in the aggressive case, because otherwise
> relfrozenxid advancement might be seen as unsafe. My observation is:
> Why should we accept the same race in the non-aggressive case? Why not
> do essentially the same thing in every VACUUM?

Sure, that seems like a good idea. I think I basically agree with the
goals of the patch. My concern is just about making the changes
understandable to future readers. This area is notoriously subtle, and
people are going to introduce more bugs even if the comments and code
organization are fantastic.

> And so you could almost say that there is no behavioral change at
> all.

I vigorously object to this part, though. We should always err on the
side of saying that commits *do* have behavioral changes. We should go
out of our way to call out in the commit message any possible way that
someone might notice the difference between the post-commit situation
and the pre-commit situation. It is fine, even good, to also be clear
about how we're maintaining continuity and why we don't think it's a
problem, but the only commits that should be described as not having
any behavioral change are ones that do mechanical code movement, or
are just changing comments, or something like that.

> It seems kinda tricky to split up 0002 like that. It's possible, but
> I'm not sure if it's possible to split it in a way that highlights the
> issue that I just described. Because we already avoided the race in
> the aggressive case.

I do see that there are some difficulties there. I'm not sure what to
do about that. I think a sufficiently clear commit message could
possibly be enough, rather than trying to split the patch. But I also
think splitting the patch should be considered, if that can reasonably
be done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-24 Thread Peter Geoghegan
On Thu, Mar 24, 2022 at 10:21 AM Robert Haas  wrote:
> You're probably not going to love hearing this, but I think you're
> still explaining things here in ways that are too baroque and hard to
> follow. I do think it's probably better.

There are a lot of dimensions to this work. It's hard to know which to
emphasize here.

> But, for example, in the
> commit message for 0001, I think you could change the subject line to
> "Allow non-aggressive vacuums to advance relfrozenxid" and it would be
> clearer.

But non-aggressive VACUUMs have always been able to do that.

How about: "Set relfrozenxid to oldest extant XID seen by VACUUM"

> And then I think you could eliminate about half of the first
> paragraph, starting with "There is no fixed relationship", and all of
> the third paragraph (which starts with "Later work..."), and I think
> removing all that material would make it strictly more clear than it
> is currently. I don't think it's the place of a commit message to
> speculate too much on future directions or to wax eloquent on
> theoretical points. If that belongs anywhere, it's in a mailing list
> discussion.

Okay, I'll do that.

> It seems to me that 0002 mixes code movement with functional changes.

Believe it or not, I avoided functional changes in 0002 -- at least in
one important sense. That's why you had difficulty spotting any. This
must sound peculiar, since the commit message very clearly says that
the commit avoids a problem seen only in the non-aggressive case. It's
really quite subtle.

You wrote this comment and code block (which I propose to remove in
0002), so clearly you already understand the race condition that I'm
concerned with here:

-   if (skipping_blocks && blkno < rel_pages - 1)
-   {
-   /*
-* Tricky, tricky.  If this is in aggressive vacuum, the page
-* must have been all-frozen at the time we checked whether it
-* was skippable, but it might not be any more.  We must be
-* careful to count it as a skipped all-frozen page in that
-* case, or else we'll think we can't update relfrozenxid and
-* relminmxid.  If it's not an aggressive vacuum, we don't
-* know whether it was initially all-frozen, so we have to
-* recheck.
-*/
-   if (vacrel->aggressive ||
-   VM_ALL_FROZEN(vacrel->rel, blkno, ))
-   vacrel->frozenskipped_pages++;
-   continue;
-   }

What you're saying here boils down to this: it doesn't matter what the
visibility map would say right this microsecond (in the aggressive
case) were we to call VM_ALL_FROZEN(): we know for sure that the VM
said that this page was all-frozen *in the recent past*. That's good
enough; we will never fail to scan a page that might have an XID <
OldestXmin (ditto for XMIDs) this way, which is all that really
matters.

This is absolutely mandatory in the aggressive case, because otherwise
relfrozenxid advancement might be seen as unsafe. My observation is:
Why should we accept the same race in the non-aggressive case? Why not
do essentially the same thing in every VACUUM?

In 0002 we now track if each range that we actually chose to skip had
any all-visible (not all-frozen) pages -- if that happens then
relfrozenxid advancement becomes unsafe. The existing code uses
"vacrel->aggressive" as a proxy for the same condition -- the existing
code reasons based on what the visibility map must have said about the
page in the recent past. Which makes sense, but only works in the
aggressive case. The approach taken in 0002 also makes the code
simpler, which is what enabled putting the VM skipping code into its
own helper function, but that was just a bonus.

And so you could almost say that there is no behavioral change at
all. We're skipping pages in the same way, based on the same
information (from the visibility map) as before. We're just being a
bit more careful than before about how that information is tracked, to
avoid this race. A race that we always avoided in the aggressive case
is now consistently avoided.
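A minimal standalone sketch of that bookkeeping (illustrative only --
the real patch consults the visibility map and an LVRelState, and
deals in skippable ranges rather than single pages):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

/* Stand-ins for the two visibility map bits of each heap page */
static const bool all_visible[8] = {true, true, true, true, false, true, true, true};
static const bool all_frozen[8]  = {true, true, false, true, false, true, true, true};

int
main(void)
{
    BlockNumber rel_pages = 8;
    bool        skippedallvis = false;  /* skipped an all-visible, not all-frozen page? */

    for (BlockNumber blkno = 0; blkno < rel_pages; blkno++)
    {
        if (all_frozen[blkno])
            continue;               /* safe to skip: no XIDs < OldestXmin possible */

        if (all_visible[blkno])
        {
            /*
             * Choosing to skip an all-visible (but not all-frozen) page
             * means it may still hold unfrozen XIDs.  Remember that
             * directly, instead of using "aggressive" as a proxy for what
             * the VM said when the page was checked.
             */
            skippedallvis = true;
            continue;
        }

        /* otherwise: scan the page (pruning/freezing happens here) */
    }

    printf("can advance relfrozenxid: %s\n", skippedallvis ? "no" : "yes");
    return 0;
}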

> I'm completely on board with moving the code that decides how much to
> skip into a function. That seems like a great idea, and probably
> overdue. But it is not easy for me to see what has changed
> functionally between the old and new code organization, and I bet it
> would be possible to split this into two patches, one of which creates
> a function, and the other of which fixes the problem, and I think that
> would be a useful service to future readers of the code.

It seems kinda tricky to split up 0002 like that. It's possible, but
I'm not sure if it's possible to split it in a way that highlights the
issue that I just described. Because we already avoided the race in
the aggressive case.

> I also think that the commit message for 0002 is probably longer and
> more complex than is really 

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-24 Thread Robert Haas
On Wed, Mar 23, 2022 at 6:28 PM Peter Geoghegan  wrote:
> It would be great if you could take a look v11-0002-*, Robert. Does it
> make sense to you?

You're probably not going to love hearing this, but I think you're
still explaining things here in ways that are too baroque and hard to
follow. I do think it's probably better. But, for example, in the
commit message for 0001, I think you could change the subject line to
"Allow non-aggressive vacuums to advance relfrozenxid" and it would be
clearer. And then I think you could eliminate about half of the first
paragraph, starting with "There is no fixed relationship", and all of
the third paragraph (which starts with "Later work..."), and I think
removing all that material would make it strictly more clear than it
is currently. I don't think it's the place of a commit message to
speculate too much on future directions or to wax eloquent on
theoretical points. If that belongs anywhere, it's in a mailing list
discussion.

It seems to me that 0002 mixes code movement with functional changes.
I'm completely on board with moving the code that decides how much to
skip into a function. That seems like a great idea, and probably
overdue. But it is not easy for me to see what has changed
functionally between the old and new code organization, and I bet it
would be possible to split this into two patches, one of which creates
a function, and the other of which fixes the problem, and I think that
would be a useful service to future readers of the code. I have a hard
time believing that if someone in the future bisects a problem back to
this commit, they're going to have an easy time finding the behavior
change in here. In fact I can't see it myself. I think the actual
functional change is to fix what is described in the second paragraph
of the commit message, but I haven't been able to figure out where the
logic is actually changing to address that. Note that I would be happy
with the behavior change happening either before or after the code
reorganization.

I also think that the commit message for 0002 is probably longer and
more complex than is really helpful, and that the subject line is too
vague, but since I don't yet understand exactly what's happening here,
I cannot comment on how I think it should be revised at this point,
except to say that the second paragraph of that commit message looks
like the most useful part.

I would also like to mention a few things that I do like about 0002.
One is that it seems to collapse two different pieces of logic for
page skipping into one. That seems good. As mentioned, it's especially
good because that logic is abstracted into a function. Also, it looks
like it is making a pretty localized change to one (1) aspect of what
VACUUM does -- and I definitely prefer patches that change only one
thing at a time.

Hope that's helpful.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-23 Thread Peter Geoghegan
On Wed, Mar 23, 2022 at 2:03 PM Thomas Munro  wrote:
> Yeah, I found it confusing that DISABLE_PAGE_SKIPPING doesn't disable
> all page skipping, so 3414099c turned out to be not enough.

The proposed change to DISABLE_PAGE_SKIPPING is partly driven by that,
and partly driven by a similar concern about aggressive VACUUM.

It seems worth emphasizing the idea that an aggressive VACUUM is now
just the same as any other VACUUM except for one detail: we're
guaranteed to advance relfrozenxid to a value >= FreezeLimit at the
end. The non-aggressive case has the choice to do things that make
that impossible. But there are only two places where this can happen now:

1. Non-aggressive VACUUMs might decide to skip some all-visible pages in
the new lazy_scan_skip() helper routine for skipping with the VM (see
v11-0002-*).

2. A non-aggressive VACUUM can *always* decide to ratchet back its
target relfrozenxid in lazy_scan_noprune, to avoid waiting for a
cleanup lock -- a final value from before FreezeLimit is usually still
pretty good.

The first scenario is the only one where it becomes impossible for
non-aggressive VACUUM to be able to advance relfrozenxid (with
v11-0001-* in place) by any amount. Even that's a choice, made by
weighing costs against benefits.

There is no behavioral change in v11-0002-* (we're still using the
old SKIP_PAGES_THRESHOLD strategy), but the lazy_scan_skip()
helper routine could fairly easily be taught a lot more about the
downside of skipping all-visible pages (namely how that makes it
impossible to advance relfrozenxid).

Maybe it's worth skipping all-visible pages (there are lots of them
and age(relfrozenxid) is still low), and maybe it isn't worth it. We
should get to decide, without implementation details making
relfrozenxid advancement unsafe.

It would be great if you could take a look v11-0002-*, Robert. Does it
make sense to you?

Thanks
--
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-23 Thread Thomas Munro
On Thu, Mar 24, 2022 at 9:59 AM Peter Geoghegan  wrote:
> On Wed, Mar 23, 2022 at 1:53 PM Robert Haas  wrote:
> > And therefore I favor defining it to mean that we don't
> > skip any work at all.
>
> But even today DISABLE_PAGE_SKIPPING won't do pruning when we cannot
> acquire a cleanup lock on a page, unless it happens to have XIDs from
> before FreezeLimit (which is probably 50 million XIDs behind
> OldestXmin, the vacuum_freeze_min_age default). I don't see much
> difference.

Yeah, I found it confusing that DISABLE_PAGE_SKIPPING doesn't disable
all page skipping, so 3414099c turned out to be not enough.




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-23 Thread Peter Geoghegan
On Wed, Mar 23, 2022 at 1:53 PM Robert Haas  wrote:
> I see what you mean about it depending on how you define "skipping".
> But I think that DISABLE_PAGE_SKIPPING is intended as a sort of
> emergency safeguard when you really, really don't want to leave
> anything out.

I agree.

> And therefore I favor defining it to mean that we don't
> skip any work at all.

But even today DISABLE_PAGE_SKIPPING won't do pruning when we cannot
acquire a cleanup lock on a page, unless it happens to have XIDs from
before FreezeLimit (which is probably 50 million XIDs behind
OldestXmin, the vacuum_freeze_min_age default). I don't see much
difference.

Anyway, this isn't important. I'll just drop the third patch.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-23 Thread Robert Haas
On Wed, Mar 23, 2022 at 4:49 PM Peter Geoghegan  wrote:
> On Wed, Mar 23, 2022 at 1:41 PM Robert Haas  wrote:
> > It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely
> > disable skipping pages, we have a problem.
>
> It depends on how you define skipping. DISABLE_PAGE_SKIPPING was
> created at a time when a broader definition of skipping made a lot
> more sense.
>
> > The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named
> > DISABLE_PAGE_SKIPPING.
>
> VACUUM(DISABLE_PAGE_SKIPPING, VERBOSE) will still consistently show
> that 100% of all of the pages from rel_pages are scanned. A page that
> is "skipped" by lazy_scan_noprune isn't pruned, and won't have any of
> its tuples frozen. But every other aspect of processing the page
> happens in just the same way as it would in the cleanup
> lock/lazy_scan_prune path.

I see what you mean about it depending on how you define "skipping".
But I think that DISABLE_PAGE_SKIPPING is intended as a sort of
emergency safeguard when you really, really don't want to leave
anything out. And therefore I favor defining it to mean that we don't
skip any work at all.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-23 Thread Peter Geoghegan
On Wed, Mar 23, 2022 at 1:41 PM Robert Haas  wrote:
> It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely
> disable skipping pages, we have a problem.

It depends on how you define skipping. DISABLE_PAGE_SKIPPING was
created at a time when a broader definition of skipping made a lot
more sense.

> The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named
> DISABLE_PAGE_SKIPPING.

VACUUM(DISABLE_PAGE_SKIPPING, VERBOSE) will still consistently show
that 100% of all of the pages from rel_pages are scanned. A page that
is "skipped" by lazy_scan_noprune isn't pruned, and won't have any of
its tuples frozen. But every other aspect of processing the page
happens in just the same way as it would in the cleanup
lock/lazy_scan_prune path.

We'll even still VACUUM the page if it happens to have some existing
LP_DEAD items left behind by opportunistic pruning. We don't need a
cleanup lock in lazy_scan_noprune (a share lock is all we need), nor
do we even need one in lazy_vacuum_heap_page (a regular exclusive lock
is all we need).

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-23 Thread Robert Haas
On Wed, Mar 23, 2022 at 3:59 PM Peter Geoghegan  wrote:
> In other words, since DISABLE_PAGE_SKIPPING doesn't *consistently*
> force lazy_scan_noprune to refuse to process a page on HEAD (it all
> depends on FreezeLimit/vacuum_freeze_min_age), it is logical for
> DISABLE_PAGE_SKIPPING to totally get out of the business of caring
> about that -- better to limit it to caring only about the visibility
> map (by no longer making it force aggressiveness).

It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely
disable skipping pages, we have a problem.

The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named
DISABLE_PAGE_SKIPPING.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-23 Thread Peter Geoghegan
On Sun, Mar 13, 2022 at 9:05 PM Peter Geoghegan  wrote:
> Attached is v10. While this does still include the freezing patch,
> it's not in scope for Postgres 15. As I've said, I still think that it
> makes sense to maintain the patch series with the freezing stuff,
> since it's structurally related.

Attached is v11. Changes:

* No longer includes the patch that adds page-level freezing. It was
making it harder to assess code coverage for the patches that I'm
targeting Postgres 15 with. And so including it with each new revision
no longer seems useful. I'll pick it up for Postgres 16.

* Extensive isolation tests added to v11-0001-*, exercising a lot of
hard-to-hit code paths that are reached when VACUUM is unable to
immediately acquire a cleanup lock on some heap page. In particular,
we now have test coverage for the code in heapam.c that handles
tracking the oldest extant XID and MXID in the presence of MultiXacts
(on a no-cleanup-lock heap page).

* v11-0002-* (which is the patch that avoids missing out on advancing
relfrozenxid in non-aggressive VACUUMs due to a race condition on
HEAD) now moves even more of the logic for deciding how VACUUM will
skip using the visibility map into its own helper routine. Now
lazy_scan_heap just does what the state returned by the helper routine
tells it about the current skippable range -- it doesn't make any
decisions itself anymore. This is far simpler than what we do
currently, on HEAD.

There are no behavioral changes here, but this approach could be
pushed further to improve performance. We could easily determine
*every* page that we're going to scan (not skip) up-front in even the
largest tables, very early, before we've even scanned one page. This
could enable things like I/O prefetching, or capping the size of the
dead_items array based on our final scanned_pages (not on rel_pages).
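(A rough standalone sketch of this up-front idea appears just after
this list of changes.)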

* A new patch (v11-0003-*) alters the behavior of VACUUM's
DISABLE_PAGE_SKIPPING option. DISABLE_PAGE_SKIPPING no longer forces
aggressive VACUUM -- now it only forces the use of the visibility map,
since that behavior is totally independent of aggressiveness.
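Here is the rough up-front-pass sketch referred to above (standalone
and hypothetical -- the VM lookup is a fake stand-in, and no claim is
made about how the real patch would structure this):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint32_t BlockNumber;

/* Fake stand-in for "is this page all-frozen according to the VM?" */
static bool
vm_all_frozen(BlockNumber blkno)
{
    return (blkno % 4) != 0;    /* pretend every 4th page needs scanning */
}

int
main(void)
{
    BlockNumber  rel_pages = 1000;
    BlockNumber *scan_list = malloc(sizeof(BlockNumber) * rel_pages);
    BlockNumber  scanned_pages = 0;

    /* One up-front pass over the VM decides everything */
    for (BlockNumber blkno = 0; blkno < rel_pages; blkno++)
    {
        if (!vm_all_frozen(blkno))
            scan_list[scanned_pages++] = blkno;
    }

    /*
     * Knowing scanned_pages (and the exact block list) before touching a
     * single heap page is what would make I/O prefetching possible, and
     * would let dead_items be sized on scanned_pages instead of rel_pages.
     */
    printf("will scan %u of %u pages\n",
           (unsigned) scanned_pages, (unsigned) rel_pages);

    free(scan_list);
    return 0;
}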

I don't feel too strongly about the DISABLE_PAGE_SKIPPING change. It
just seems logical to decouple no-vm-skipping from aggressiveness --
it might actually be helpful in testing the work from the patch series
in the future. Any page counted in scanned_pages has essentially been
processed by VACUUM with this work in place -- that was the idea
behind the lazy_scan_noprune stuff from commit 44fa8488. Bear in mind
that the relfrozenxid tracking stuff from v11-0001-* makes it almost
certain that a DISABLE_PAGE_SKIPPING-without-aggressiveness VACUUM
will still manage to advance relfrozenxid -- usually by the same
amount as an equivalent aggressive VACUUM would anyway. (Failing to
acquire a cleanup lock on some heap page might result in the final
relfrozenxid being appreciably older, but probably not, and we'd
still almost certainly manage to advance relfrozenxid by *some* small
amount.)

Of course, anybody that wants both an aggressive VACUUM and a VACUUM
that never skips even all-frozen pages in the visibility map will
still be able to get that behavior quite easily. For example,
VACUUM(DISABLE_PAGE_SKIPPING, FREEZE) will do that. Several of our
existing tests must already use both of these options together,
because the tests require an effective vacuum_freeze_min_age of 0 (and
vacuum_multixact_freeze_min_age of 0) -- DISABLE_PAGE_SKIPPING alone
won't do that on HEAD, which seems to confuse the issue (see commit
b700f96c for an example of that).

In other words, since DISABLE_PAGE_SKIPPING doesn't *consistently*
force lazy_scan_noprune to refuse to process a page on HEAD (it all
depends on FreezeLimit/vacuum_freeze_min_age), it is logical for
DISABLE_PAGE_SKIPPING to totally get out of the business of caring
about that -- better to limit it to caring only about the visibility
map (by no longer making it force aggressiveness).

-- 
Peter Geoghegan


v11-0003-Don-t-force-aggressive-mode-for-DISABLE_PAGE_SKI.patch
Description: Binary data


v11-0002-Generalize-how-VACUUM-skips-all-frozen-pages.patch
Description: Binary data


v11-0001-Loosen-coupling-between-relfrozenxid-and-freezin.patch
Description: Binary data


Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-13 Thread Peter Geoghegan
On Fri, Feb 25, 2022 at 5:52 PM Peter Geoghegan  wrote:
> There is an important practical way in which it makes sense to treat
> 0001 as separate to 0002. It is true that 0001 is independently quite
> useful. In practical terms, I'd be quite happy to just get 0001 into
> Postgres 15, without 0002. I think that that's what you meant here, in
> concrete terms, and we can agree on that now.

Attached is v10. While this does still include the freezing patch,
it's not in scope for Postgres 15. As I've said, I still think that it
makes sense to maintain the patch series with the freezing stuff,
since it's structurally related. So, to be clear, the first two
patches from the patch series are in scope for Postgres 15. But not
the third.

Highlights:

* Changes to terminology and commit messages along the lines suggested
by Andres.

* Bug fixes to heap_tuple_needs_freeze()'s MultiXact handling. My
testing strategy here still needs work.

* Expanded refactoring by v10-0002 patch.

The v10-0002 patch (which appeared for the first time in v9) was
originally all about fixing a case where non-aggressive VACUUMs were
at a gratuitous disadvantage (relative to aggressive VACUUMs) around
advancing relfrozenxid -- very much like the lazy_scan_noprune work
from commit 44fa8488. And that is still its main purpose. But the
refactoring now seems related to Andres' idea of making non-aggressive
VACUUMs decide to scan a few extra all-visible pages in order to be
able to advance relfrozenxid.

The code that sets up skipping the visibility map is made a lot
clearer by v10-0002. That patch moves a significant amount of code
from lazy_scan_heap() into a new helper routine (so it continues the
trend started by the Postgres 14 work that added lazy_scan_prune()).
Now skipping a range of visibility map pages is fundamentally based on
setting up the range up front, and then using the same saved details
about the range thereafter -- we don't have anymore ad-hoc
VM_ALL_VISIBLE()/VM_ALL_FROZEN() calls for pages from a range that we
already decided to skip (so no calls to those routines from
lazy_scan_heap(), at least not until after we finish processing in
lazy_scan_prune()).

This is more or less what we were doing all along for one special
case: aggressive VACUUMs. We had to make sure to either increment
frozenskipped_pages or increment scanned_pages for every page from
rel_pages -- this issue is described by lazy_scan_heap() comments on
HEAD that begin with "Tricky, tricky." (these date back to the freeze
map work from 2016). Anyway, there is no reason to not go further with
that: we should make whole ranges the basic unit that we deal with
when skipping. It's a lot simpler to think in terms of entire ranges
(not individual pages) that are determined to be all-visible or
all-frozen up-front, without needing to recheck anything (regardless
of whether it's an aggressive VACUUM).

We don't need to track frozenskipped_pages this way. And it's much
more obvious that it's safe for more complicated cases, in particular
for aggressive VACUUMs.

This kind of approach seems necessary to make non-aggressive VACUUMs
do a little more work opportunistically, when they realize that they
can advance relfrozenxid relatively easily that way (which I believe
Andres favors as part of overhauling freezing). That becomes a lot
more natural when you have a clear and unambiguous separation between
deciding what range of blocks to skip, and then actually skipping. I
can imagine the new helper function added by v10-0002 (which I've
called lazy_scan_skip_range()) eventually being taught to do these
kinds of tricks.

In general I think that all of the details of what to skip need to be
decided up front. The loop in lazy_scan_heap() should execute skipping
based on the instructions it receives from the new helper function, in
the simplest way possible. The helper function can become more
intelligent about the costs and benefits of skipping in the future,
without that impacting lazy_scan_heap().
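As a very rough illustration of that separation (standalone and
entirely hypothetical -- the struct and helper here are stand-ins, not
the real lazy_scan_skip_range(), and the real code consults the
visibility map): all skipping policy lives in the helper, and the
caller's loop just obeys the instructions it is handed.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

/* Fake VM: every 5th page is not all-frozen and must be scanned */
static bool
vm_all_frozen(BlockNumber blkno)
{
    return (blkno % 5) != 0;
}

/* "Instructions" handed back to the caller, decided entirely up front */
typedef struct SkipRange
{
    BlockNumber next_unskippable;   /* first block the caller must scan */
} SkipRange;

/* The skipping policy lives here; the caller never second-guesses it */
static SkipRange
scan_skip_range(BlockNumber next_block, BlockNumber rel_pages)
{
    SkipRange range;

    range.next_unskippable = next_block;
    while (range.next_unskippable < rel_pages &&
           vm_all_frozen(range.next_unskippable))
        range.next_unskippable++;

    return range;
}

int
main(void)
{
    BlockNumber rel_pages = 20;
    BlockNumber blkno = 0;

    while (blkno < rel_pages)
    {
        SkipRange range = scan_skip_range(blkno, rel_pages);

        blkno = range.next_unskippable;
        if (blkno >= rel_pages)
            break;

        printf("scanning block %u\n", (unsigned) blkno);  /* prune/freeze here */
        blkno++;
    }
    return 0;
}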

--
Peter Geoghegan
From 43ab00609392ed7ad31be491834bdac348e13653 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Fri, 11 Mar 2022 19:16:02 -0800
Subject: [PATCH v10 3/3] Make page-level characteristics drive freezing.

Teach VACUUM to freeze all of the tuples on a page whenever it notices
that it would otherwise mark the page all-visible, without also marking
it all-frozen.  VACUUM typically won't freeze _any_ tuples on the page
unless _all_ tuples (that remain after pruning) are all-visible.  This
makes the overhead of vacuuming much more predictable over time.  We
avoid the need for large balloon payments during aggressive VACUUMs
(typically anti-wraparound autovacuums).  Freezing is proactive, so
we're much less likely to get into "freezing debt".

The new approach to freezing also enables relfrozenxid advancement in
non-aggressive VACUUMs, which might be enough to avoid aggressive
VACUUMs altogether (with many individual tables/workloads).  While the
non-aggressive 

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-01 Thread Peter Geoghegan
On Tue, Mar 1, 2022 at 1:46 PM Robert Haas  wrote:
> I think that this is not really a description of an algorithm -- and I
> think that it is far from clear that the third "in-between" category
> does not need to exist.

But I already described the algorithm. It is very simple
mechanistically -- though that in itself means very little. As I have
said multiple times now, the hard part is assessing what the
implications are. And the even harder part is making a judgement about
whether or not those implications are what we generally want.

> I think findings like this are very unconvincing.

TPC-C may be unrealistic in certain ways, but it is nevertheless
vastly more realistic than pgbench. pgbench is really more of a stress
test than a benchmark.

The main reasons why TPC-C is interesting here are *very* simple, and
would likely be equally true with TPC-E (just for example) -- even
though TPC-E is a very different benchmark kind of OLTP workload
overall. TPC-C (like TPC-E) features a diversity of transaction types,
some of which are more complicated than others -- which is strictly
more realistic than having only one highly synthetic OLTP transaction
type. Each transaction type doesn't necessarily modify the same tables
in the same way. This leads to natural diversity among tables and
among transactions, including:

* The typical or average number of distinct XIDs per heap page varies
significantly among each table. There are way fewer distinct XIDs per
"order line" table heap page than there are per "order" table heap
page, for the obvious reason.

* Roughly speaking, there are various different ways that free space
management ought to work in a system like Postgres. For example it is
necessary to make a "fragmentations vs space utilization" trade-off
with the new orders table.

* There are joins in some of the transactions!

Maybe TPC-C is a crude approximation of reality, but it nevertheless
exercises relevant parts of the system to a significant degree. What
else would you expect me to use, for a project like this? To a
significant degree the relfrozenxid tracking stuff is interesting
because tables tend to have natural differences like the ones I have
highlighted on this thread. How could that not be the case? Why
wouldn't we want to take advantage of that?

There might be some danger in over-optimizing for this particular
benchmark, but right now that is so far from being the main problem
that the idea seems strange to me. pgbench doesn't need the FSM, at
all. In fact pgbench doesn't even really need VACUUM (except for
antiwraparound), once heap fillfactor is lowered to 95 or so. pgbench
simply isn't relevant, *at all*, except perhaps as a way of measuring
regressions in certain synthetic cases that don't benefit.

> TPC-C (or any
> benchmark really) is so simple as to be a terrible proxy for what
> vacuuming is going to look like on real-world systems.

Doesn't that amount to "no amount of any kind of testing or
benchmarking will convince me of anything, ever"?

There is more than one type of real-world system. I think that TPC-C
is representative of some real world systems in some regards. But even
that's not the important point for me. I find TPC-C generally
interesting for one reason: I can clearly see that Postgres does
things in a way that just doesn't make much sense, which isn't
particularly fundamental to how VACUUM works.

My only long term goal is to teach Postgres to *avoid* various
pathological cases exhibited by TPC-C (e.g., the B-Tree "split after
new tuple" mechanism from commit f21668f328 *avoids* a pathological
case from TPC-C). We don't necessarily have to agree on how important
each individual case is "in the real world" (which is impossible to
know anyway). We only have to agree that what we see is a pathological
case (because some reasonable expectation is dramatically violated),
and then work out a fix.

I don't want to teach Postgres to be clever -- I want to teach it to
avoid being stupid in cases where it exhibits behavior that really
cannot be described any other way. You seem to talk about some of this
work as if it was just as likely to have a detrimental effect
elsewhere, for some equally plausible workload, which will have a
downside that is roughly as bad as the advertised upside. I consider
that very unlikely, though. Sure, regressions are quite possible, and
a real concern -- but regressions *like that* are unlikely. Avoiding
doing what is clearly the wrong thing just seems to work out that way,
in general.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-01 Thread Robert Haas
On Sun, Feb 20, 2022 at 3:27 PM Peter Geoghegan  wrote:
> > I think that the idea has potential, but I don't think that I
> > understand yet what the *exact* algorithm is.
>
> The algorithm seems to exploit a natural tendency that Andres once
> described in a blog post about his snapshot scalability work [1]. To a
> surprising extent, we can usefully bucket all tuples/pages into two
> simple categories:
>
> 1. Very, very old ("infinitely old" for all practical purposes).
>
> 2. Very very new.
>
> There doesn't seem to be much need for a third "in-between" category
> in practice. This seems to be at least approximately true all of the
> time.
>
> Perhaps Andres wouldn't agree with this very general statement -- he
> actually said something more specific. I for one believe that the
> point he made generalizes surprisingly well, though. I have my own
> theories about why this appears to be true. (Executive summary: power
> laws are weird, and it seems as if the sparsity-of-effects principle
> makes it easy to bucket things at the highest level, in a way that
> generalizes well across disparate workloads.)

I think that this is not really a description of an algorithm -- and I
think that it is far from clear that the third "in-between" category
does not need to exist.

> Remember when I got excited about how my big TPC-C benchmark run
> showed a predictable, tick/tock style pattern across VACUUM operations
> against the order and order lines table [2]? It seemed very
> significant to me that the OldestXmin of VACUUM operation n
> consistently went on to become the new relfrozenxid for the same table
> in VACUUM operation n + 1. It wasn't exactly the same XID, but very
> close to it (within the range of noise). This pattern was clearly
> present, even though VACUUM operation n + 1 might happen as long as 4
> or 5 hours after VACUUM operation n (this was a big table).

I think findings like this are very unconvincing. TPC-C (or any
benchmark really) is so simple as to be a terrible proxy for what
vacuuming is going to look like on real-world systems. Like, it's nice
that it works, and it shows that something's working, but it doesn't
demonstrate that the patch is making the right trade-offs overall.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-25 Thread Peter Geoghegan
On Fri, Feb 25, 2022 at 3:26 PM Andres Freund  wrote:
> freeze_required_limit, freeze_desired_limit? Or s/limit/cutoff/? Or
> s/limit/below/? I kind of like below because that answers < vs <= which I find
> hard to remember around freezing.

I like freeze_required_limit the most.

> That may be true, but I think working more incrementally is better in this
> area. I'd rather have a smaller improvement for a release, collect some data,
> get another improvement in the next, than see a bunch of reports of large
> wins and large regressions.

I agree.

There is an important practical way in which it makes sense to treat
0001 as separate to 0002. It is true that 0001 is independently quite
useful. In practical terms, I'd be quite happy to just get 0001 into
Postgres 15, without 0002. I think that that's what you meant here, in
concrete terms, and we can agree on that now.

However, it is *also* true that there is an important practical sense
in which they *are* related. I don't want to ignore that either -- it
does matter. Most of the value to be had here comes from the synergy
between 0001 and 0002 -- or what I've been calling a "virtuous cycle",
the thing that makes it possible to advance relfrozenxid/relminmxid in
almost every VACUUM. Having both 0001 and 0002 together (or something
along the same lines) is way more valuable than having just one.

Perhaps we can even agree on this second point. I am encouraged by the
fact that you at least recognize the general validity of the key ideas
from 0002. If I am going to commit 0001 (and not 0002) ahead of
feature freeze for 15, I better be pretty sure that I have at least
roughly the right idea with 0002, too -- since that's the direction
that 0001 is going in. It almost seems dishonest to pretend that I
wasn't thinking of 0002 when I wrote 0001.

I'm glad that you seem to agree that this business of accumulating
freezing debt without any natural limit is just not okay. That is
really fundamental to me. I mean, vacuum_freeze_min_age kind of
doesn't work as designed. This is a huge problem for us.

> > Under these conditions, we will have many more opportunities to
> > advance relminmxid for most of the tables (including the larger
> > tables) all the way up to current-oldestMxact with the patch series.
> > Without needing to freeze *any* MultiXacts early (just freezing some
> > XIDs early) to get that benefit. The patch series is not just about
> > spreading the burden of freezing, so that non-aggressive VACUUMs
> > freeze more -- it's also making relfrozenxid and relminmxid more
> > recent and therefore *reliable* indicators of which tables any
> > wraparound problems *really* are.
>
> My concern was explicitly about the case where we have to create new
> multixacts...

It was a mistake on my part to counter your point about that with this
other point about eager relminmxid advancement. As I said in the last
email, while that is very valuable, it's not something that needs to
be brought into this.

> > Does that make sense to you?
>
> Yes.

Okay, great. The fact that you recognize the value in that comes as a relief.

> > You mean to change the signature of heap_tuple_needs_freeze, so it
> > doesn't return a bool anymore? It just has two bool pointers as
> > arguments, can_freeze and need_freeze?
>
> Something like that. Or return true if there's anything to do, and then rely
> on can_freeze and need_freeze for finer details. But it doesn't matter that 
> much.

Got it.

> > The problem that all of these heuristics have is that they will tend
> > to make it impossible for future non-aggressive VACUUMs to be able to
> > advance relfrozenxid. All that it takes is one single all-visible page
> > to make that impossible. As I said upthread, I think that being able
> > to advance relfrozenxid (and especially relminmxid) by *some* amount
> > in every VACUUM has non-obvious value.
>
> I think that's a laudable goal. But I don't think we should go there unless we
> are quite confident we've mitigated the potential downsides.

True. But that works both ways. We also shouldn't err in the direction
of adding these kinds of heuristics (which have real downsides) until
the idea of mostly swallowing the cost of freezing whole pages (while
making it possible to disable) has lost, fairly. Overall, it looks
like the cost is acceptable in most cases.

I think that users will find it very reassuring to regularly and
reliably see confirmation that wraparound is being kept at bay, by
every VACUUM operation, with details that they can relate to their
workload. That has real value IMV -- even when it's theoretically
unnecessary for us to be so eager with advancing relfrozenxid.

I really don't like the idea of falling behind on freezing
systematically. You always run the "risk" of freezing being wasted.
But that way of looking at it can be penny wise, pound foolish --
maybe we should just accept that trying to predict what will happen in
the future (whether or not freezing will be 

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-25 Thread Peter Geoghegan
On Fri, Feb 25, 2022 at 3:48 PM Andres Freund  wrote:
> I don't see why it matters that OldestXmin and OldestMxact are computed at the
> same time?  It's a question of the workload, not vacuum algorithm.

I think it's both.

> OldestMxact inherently lags OldestXmin. OldestMxact can only advance after all
> members are older than OldestXmin (not quite true, but that's the bound), and
> they always have more than one member.
>
>
> > How many of these "skewed MultiXacts" can we really expect?
>
> I don't think they're skewed in any way. It's a fundamental aspect of
> multixacts.

Having this happen to some degree is fundamental to MultiXacts, sure.
But the approach of using FreezeLimit and
MultiXactCutoff in the way that we do right now seems like it might
make the problem a lot worse. Because they're completely meaningless
cutoffs. They are magic numbers that have no relationship whatsoever
to each other.

There are problems with assuming that OldestXmin and OldestMxact
"align" -- no question. But at least it's approximately true -- which
is a start. They are at least not arbitrarily, unpredictably
different, like FreezeLimit and MultiXactCutoff are, and always will
be. I think that that's a meaningful and useful distinction.

I am okay with making the most pessimistic possible assumptions about
how any changes to how we freeze might cause FreezeMultiXactId() to
allocate more MultiXacts than before. And I accept that the patch
series shouldn't "get credit" for "offsetting" any problem like that
by making relminmxid advancement occur much more frequently (even
though that does seem very valuable). All I'm really saying is this:
in general, there are probably quite a few opportunities for
FreezeMultiXactId() to avoid allocating new XMIDs (just to freeze
XIDs) by having the full context. And maybe by making the dialog
between lazy_scan_prune and heap_prepare_freeze_tuple a bit more
nuanced.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 15:28:17 -0800, Peter Geoghegan wrote:
> But why should we ever get to the FreezeMultiXactId() loop with the
> stuff from 0002 in place? The whole purpose of the loop is to handle
> cases where we have to remove *some* (not all) XIDs from before
> cutoff_xid that appear in a MultiXact, which requires careful checking
> of each XID (this is only possible when the MultiXactId is <
> cutoff_multi to begin with, which is OldestMxact in the patch, which
> is presumably very recent).
> 
> It's not impossible that we'll get some number of "skewed MultiXacts"
> with the patch -- cases that really do necessitate allocating a new
> MultiXact, just to "freeze some XIDs from a MultiXact". That is, there
> will sometimes be some number of XIDs that are < OldestXmin, but
> nevertheless appear in some MultiXactIds >= OldestMxact. This seems
> likely to be rare with the patch, though, since VACUUM calculates its
> OldestXmin and OldestMxact (which are what cutoff_xid and cutoff_multi
> really are in the patch) at the same point in time. Which was the
> point I made in my email yesterday.

I don't see why it matters that OldestXmin and OldestMxact are computed at the
same time?  It's a question of the workload, not vacuum algorithm.

OldestMxact inherently lags OldestXmin. OldestMxact can only advance after all
members are older than OldestXmin (not quite true, but that's the bound), and
they always have more than one member.
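
For anyone following along, here's a minimal way to produce such a
MultiXact by hand (a sketch only -- "t" is a hypothetical table with a
row where id = 1):

-- session 1:
BEGIN;
SELECT * FROM t WHERE id = 1 FOR SHARE;

-- session 2, while session 1's transaction is still open:
BEGIN;
SELECT * FROM t WHERE id = 1 FOR SHARE;

-- The row's xmax is now a MultiXactId whose members are both sessions'
-- XIDs; it can't go away while either member transaction is running.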


> How many of these "skewed MultiXacts" can we really expect?

I don't think they're skewed in any way. It's a fundamental aspect of
multixacts.

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-25 Thread Peter Geoghegan
On Fri, Feb 25, 2022 at 2:00 PM Peter Geoghegan  wrote:
> > Hm. I guess I'll have to look at the code for it. It doesn't immediately
> > "feel" quite right.
>
> I kinda think it might be. Please let me know if you see a problem
> with what I've said.

Oh, wait. I have a better idea of what you meant now. The loop towards
the end of FreezeMultiXactId() will indeed "Determine whether to keep
this member or ignore it." when we need a new MultiXactId. The loop is
exact in the sense that it will only include those XIDs that are truly
needed -- those that are still running.

But why should we ever get to the FreezeMultiXactId() loop with the
stuff from 0002 in place? The whole purpose of the loop is to handle
cases where we have to remove *some* (not all) XIDs from before
cutoff_xid that appear in a MultiXact, which requires careful checking
of each XID (this is only possible when the MultiXactId is <
cutoff_multi to begin with, which is OldestMxact in the patch, which
is presumably very recent).

It's not impossible that we'll get some number of "skewed MultiXacts"
with the patch -- cases that really do necessitate allocating a new
MultiXact, just to "freeze some XIDs from a MultiXact". That is, there
will sometimes be some number of XIDs that are < OldestXmin, but
nevertheless appear in some MultiXactIds >= OldestMxact. This seems
likely to be rare with the patch, though, since VACUUM calculates its
OldestXmin and OldestMxact (which are what cutoff_xid and cutoff_multi
really are in the patch) at the same point in time. Which was the
point I made in my email yesterday.

How many of these "skewed MultiXacts" can we really expect? Seems like
there might be very few in practice. But I'm really not sure about
that.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-25 Thread Andres Freund
Hi,

On 2022-02-25 14:00:12 -0800, Peter Geoghegan wrote:
> On Thu, Feb 24, 2022 at 11:14 PM Andres Freund  wrote:
> > I am not a fan of the backstop terminology. It's still the reason we need to
> > do freezing for correctness reasons.
> 
> Thanks for the review!
> 
> I'm not wedded to that particular terminology, but I think that we
> need something like it. Open to suggestions.
>
> How about limit-based? Something like that?

freeze_required_limit, freeze_desired_limit? Or s/limit/cutoff/? Or
s/limit/below/? I kind of like below because that answers < vs <= which I find
hard to remember around freezing.


> > I'm a tad concerned about replacing mxids that have some members that are
> > older than OldestXmin but not older than FreezeLimit. It's not too hard to
> > imagine that accelerating mxid consumption considerably.  But we can 
> > probably,
> > if not already done, special case that.
> 
> Let's assume for a moment that this is a real problem. I'm not sure if
> it is or not myself (it's complicated), but let's say that it is. The
> problem may be more than offset by the positive impact on relminmxid
> advancement. I have placed a large emphasis on enabling
> relfrozenxid/relminmxid advancement in every non-aggressive VACUUM,
> for a number of reasons -- this is one of the reasons. Finding a way
> for every VACUUM operation to be "vacrel->scanned_pages +
> vacrel->frozenskipped_pages == orig_rel_pages" (i.e. making *some*
> amount of relfrozenxid/relminmxid advancement possible in every
> VACUUM) has a great deal of value.

That may be true, but I think working more incrementally is better in this
area. I'd rather have a smaller improvement for a release, collect some data,
and get another improvement in the next, than see a bunch of reports of large
wins and large regressions.


> As I said recently on the "do only critical work during single-user
> vacuum?" thread, why should the largest tables in databases that
> consume too many MXIDs do so evenly, across all their tables? There
> are usually one or two large tables, and many more smaller tables. I
> think it's much more likely that the largest tables consume
> approximately zero MultiXactIds in these databases -- actual
> MultiXactId consumption is probably concentrated in just one or two
> smaller tables (even when we burn through MultiXacts very quickly).
> But we don't recognize these kinds of distinctions at all right now.

Recognizing those distinctions seems independent of freezing multixacts with
live members. I am happy with freezing them more aggressively if they don't
have live members. It's freezing mxids with live members that has me
concerned.  The limits you're proposing are quite aggressive and can advance
quickly.

I've seen large tables with plenty of multixacts, typically concentrated over
a value range (often changing over time).


> Under these conditions, we will have many more opportunities to
> advance relminmxid for most of the tables (including the larger
> tables) all the way up to current-oldestMxact with the patch series.
> Without needing to freeze *any* MultiXacts early (just freezing some
> XIDs early) to get that benefit. The patch series is not just about
> spreading the burden of freezing, so that non-aggressive VACUUMs
> freeze more -- it's also making relfrozenxid and relminmxid more
> recent and therefore *reliable* indicators of where any wraparound
> problems *really* are.

My concern was explicitly about the case where we have to create new
multixacts...


> Does that make sense to you?

Yes.


> > > On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM 
> > > operation's
> > > OldestXmin at all (it actually just gets FreezeLimit passed as its
> > > cutoff_xid argument). It cannot possibly recognize any of this for itself.
> >
> > It does recognize something like OldestXmin in a more precise and expensive
> > way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId().
> 
> It doesn't look that way to me.
> 
> While it's true that FreezeMultiXactId() will call MultiXactIdIsRunning(),
> that's only a cross-check.

> This cross-check is made at a point where we've already determined that the
> MultiXact in question is < cutoff_multi. In other words, it catches cases
> where a "MultiXactId < cutoff_multi" Multi contains an XID *that's still
> running* -- a correctness issue. Nothing to do with being smart about
> avoiding allocating new MultiXacts during freezing, or exploiting the fact
> that "FreezeLimit < OldestXmin" (which is almost always true, very true).

If there's <= 1 live member in a mxact, we replace it with a plain xid
iff the xid also would get frozen. With the current freezing logic I don't see
what passing down OldestXmin would change. Or how it differs to a meaningful
degree from heap_prepare_freeze_tuple()'s logic.  I don't see how it'd avoid a
single new mxact from being allocated.



> > > Caller should make temp
> > > + * copies of global tracking variables 

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-25 Thread Peter Geoghegan
On Thu, Feb 24, 2022 at 11:14 PM Andres Freund  wrote:
> I am not a fan of the backstop terminology. It's still the reason we need to
> do freezing for correctness reasons.

Thanks for the review!

I'm not wedded to that particular terminology, but I think that we
need something like it. Open to suggestions.

How about limit-based? Something like that?

> It'd make more sense to me to turn it
> around and call the "non-backstop" freezing opportunistic freezing or such.

The problem with that scheme is that it leads to a world where
"standard freezing" is incredibly rare (it often literally never
happens), whereas "opportunistic freezing" is incredibly common. That
doesn't make much sense to me.

We tend to think of 50 million XIDs (the vacuum_freeze_min_age
default) as being not that many. But I think that it can be a huge
number, too. Even then, it's unpredictable -- I suspect that it can
change without very much changing in the application, from the point
of view of users. That's a big part of the problem I'm trying to
address -- freezing outside of aggressive VACUUMs is way too rare (it
might barely happen at all). FreezeLimit/vacuum_freeze_min_age was
designed at a time when there was no visibility map at all, when it
made somewhat more sense as the thing that drives freezing.

Incidentally, this is part of the problem with anti-wraparound vacuums
and freezing debt -- the fact that some quite busy databases take
weeks or months to go through 50 million XIDs (or 200 million)
increases the pain of the eventual aggressive VACUUM. It's not
completely unbounded -- autovacuum_freeze_max_age is not 100% useless
here. But the extent to which that stuff bounds the debt can vary
enormously, for not-very-good reasons.
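
For reference (nothing patch-specific here), the cutoffs in play are all
GUC-driven, and can be tightened per table via storage parameters:

SHOW vacuum_freeze_min_age;            -- 50 million by default
SHOW vacuum_multixact_freeze_min_age;  -- 5 million by default
SHOW autovacuum_freeze_max_age;        -- 200 million by default

-- hypothetical per-table override, making anti-wraparound autovacuums
-- kick in earlier for one large table:
ALTER TABLE orders SET (autovacuum_freeze_max_age = 100000000);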

> > Whenever we opt to
> > "freeze a page", the new page-level algorithm *always* uses the most
> > recent possible XID and MXID values (OldestXmin and oldestMxact) to
> > decide what XIDs/XMIDs need to be replaced. That might sound like it'd
> > be too much, but it only applies to those pages that we actually
> > decide to freeze (since page-level characteristics drive everything
> > now). FreezeLimit is only one way of triggering that now (and one of
> > the least interesting and rarest).
>
> That largely makes sense to me and doesn't seem weird.

I'm very pleased that the main intuition behind 0002 makes sense to
you. That's a start, at least.

> I'm a tad concerned about replacing mxids that have some members that are
> older than OldestXmin but not older than FreezeLimit. It's not too hard to
> imagine that accelerating mxid consumption considerably.  But we can probably,
> if not already done, special case that.

Let's assume for a moment that this is a real problem. I'm not sure if
it is or not myself (it's complicated), but let's say that it is. The
problem may be more than offset by the positive impact on relminmxid
advancement. I have placed a large emphasis on enabling
relfrozenxid/relminmxid advancement in every non-aggressive VACUUM,
for a number of reasons -- this is one of the reasons. Finding a way
for every VACUUM operation to be "vacrel->scanned_pages +
vacrel->frozenskipped_pages == orig_rel_pages" (i.e. making *some*
amount of relfrozenxid/relminmxid advancement possible in every
VACUUM) has a great deal of value.

As I said recently on the "do only critical work during single-user
vacuum?" thread, why should the largest tables in databases that
consume too many MXIDs do so evenly, across all their tables? There
are usually one or two large tables, and many more smaller tables. I
think it's much more likely that the largest tables consume
approximately zero MultiXactIds in these databases -- actual
MultiXactId consumption is probably concentrated in just one or two
smaller tables (even when we burn through MultiXacts very quickly).
But we don't recognize these kinds of distinctions at all right now.

Under these conditions, we will have many more opportunities to
advance relminmxid for most of the tables (including the larger
tables) all the way up to current-oldestMxact with the patch series.
Without needing to freeze *any* MultiXacts early (just freezing some
XIDs early) to get that benefit. The patch series is not just about
spreading the burden of freezing, so that non-aggressive VACUUMs
freeze more -- it's also making relfrozenxid and relminmxid more
recent and therefore *reliable* indicators of where any wraparound
problems *really* are.
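
To make the payoff concrete (a sketch of the kind of check this enables,
not anything in the patch itself): once relfrozenxid/relminmxid track the
true oldest remaining XID/MXID, simply ranking pg_class by age points at
the tables that really hold the database's horizons back:

SELECT relname,
       age(relfrozenxid) AS xid_age,
       mxid_age(relminmxid) AS mxid_age
FROM pg_class
WHERE relkind IN ('r', 'm', 't')   -- tables, matviews, TOAST tables
ORDER BY age(relfrozenxid) DESC
LIMIT 10;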

Does that make sense to you? This kind of "virtuous cycle" seems
really important to me. It's a subtle point, so I have to ask.

> > It seems that heap_prepare_freeze_tuple allocates new MXIDs (when
> > freezing old ones) in large part so it can NOT freeze XIDs that it
> > would have been useful (and much cheaper) to remove anyway.
>
> Well, we may have to allocate a new mxid because some members are older than
> FreezeLimit but others are still running. When do we not remove xids that
> would have been cheaper to remove once we 

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-24 Thread Andres Freund
Hi,

On 2022-02-24 20:53:08 -0800, Peter Geoghegan wrote:
> 0002 makes page-level freezing a first class thing.
> heap_prepare_freeze_tuple now has some (limited) knowledge of how this
> works. heap_prepare_freeze_tuple's cutoff_xid argument is now always
> the VACUUM caller's OldestXmin (not its FreezeLimit, as before). We
> still have to pass FreezeLimit to heap_prepare_freeze_tuple, which
> helps us to respect FreezeLimit as a backstop, and so now it's passed
> via the new backstop_cutoff_xid argument instead.

I am not a fan of the backstop terminology. It's still the reason we need to
do freezing for correctness reasons. It'd make more sense to me to turn it
around and call the "non-backstop" freezing opportunistic freezing or such.


> Whenever we opt to
> "freeze a page", the new page-level algorithm *always* uses the most
> recent possible XID and MXID values (OldestXmin and oldestMxact) to
> decide what XIDs/XMIDs need to be replaced. That might sound like it'd
> be too much, but it only applies to those pages that we actually
> decide to freeze (since page-level characteristics drive everything
> now). FreezeLimit is only one way of triggering that now (and one of
> the least interesting and rarest).

That largely makes sense to me and doesn't seem weird.

I'm a tad concerned about replacing mxids that have some members that are
older than OldestXmin but not older than FreezeLimit. It's not too hard to
imagine that accelerating mxid consumption considerably.  But we can probably,
if not already done, special case that.


> It seems that heap_prepare_freeze_tuple allocates new MXIDs (when
> freezing old ones) in large part so it can NOT freeze XIDs that it
> would have been useful (and much cheaper) to remove anyway.

Well, we may have to allocate a new mxid because some members are older than
FreezeLimit but others are still running. When do we not remove xids that
would have been cheaper to remove once we decide to actually do work?


> On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's
> OldestXmin at all (it actually just gets FreezeLimit passed as its
> cutoff_xid argument). It cannot possibly recognize any of this for itself.

It does recognize something like OldestXmin in a more precise and expensive
way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId().


> Does that theory about MultiXacts sound plausible? I'm not claiming
> that the patch makes it impossible that FreezeMultiXactId() will have
> to allocate a new MultiXact to freeze during VACUUM -- the
> freeze-the-dead isolation tests already show that that's not true. I
> just think that page-level freezing based on page characteristics with
> oldestXmin and oldestMxact (not FreezeLimit and MultiXactCutoff)
> cutoffs might make it a lot less likely in practice.

Hm. I guess I'll have to look at the code for it. It doesn't immediately
"feel" quite right.


> oldestXmin and oldestMxact map to the same wall clock time, more or less --
> that seems like it might be an important distinction, independent of
> everything else.

Hm. Multis can be kept alive by fairly "young" member xids. So it may not be
removable (without creating a newer multi) until much later than its creation
time. So I don't think that's really true.



> From 483bc8df203f9df058fcb53e7972e3912e223b30 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan 
> Date: Mon, 22 Nov 2021 10:02:30 -0800
> Subject: [PATCH v9 1/4] Loosen coupling between relfrozenxid and freezing.
>
> When VACUUM set relfrozenxid before now, it set it to whatever value was
> used to determine which tuples to freeze -- the FreezeLimit cutoff.
> This approach was very naive: the relfrozenxid invariant only requires
> that new relfrozenxid values be <= the oldest extant XID remaining in
> the table (at the point that the VACUUM operation ends), which in
> general might be much more recent than FreezeLimit.  There is no fixed
> relationship between the amount of physical work performed by VACUUM to
> make it safe to advance relfrozenxid (freezing and pruning), and the
> actual number of XIDs that relfrozenxid can be advanced by (at least in
> principle) as a result.  VACUUM might have to freeze all of the tuples
> from a hundred million heap pages just to enable relfrozenxid to be
> advanced by no more than one or two XIDs.  On the other hand, VACUUM
> might end up doing little or no work, and yet still be capable of
> advancing relfrozenxid by hundreds of millions of XIDs as a result.
>
> VACUUM now sets relfrozenxid (and relminmxid) using the exact oldest
> extant XID (and oldest extant MultiXactId) from the table, including
> XIDs from the table's remaining/unfrozen MultiXacts.  This requires that
> VACUUM carefully track the oldest unfrozen XID/MultiXactId as it goes.
> This optimization doesn't require any changes to the definition of
> relfrozenxid, nor does it require changes to the core design of
> freezing.


> Final relfrozenxid values must still be >= 

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-24 Thread Peter Geoghegan
On Sun, Feb 20, 2022 at 12:27 PM Peter Geoghegan  wrote:
> You've given me a lot of high quality feedback on all of this, which
> I'll work through soon. It's hard to get the balance right here, but
> it's made much easier by this kind of feedback.

Attached is v9. Lots of changes. Highlights:

* Much improved 0001 ("loosen coupling" dynamic relfrozenxid tracking
patch). Some of the improvements are due to recent feedback from
Robert.

* Much improved 0002 ("Make page-level characteristics drive freezing"
patch). Whole new approach to the implementation, though the same
algorithm as before.

* No more FSM patch -- that was totally separate work, that I
shouldn't have attached to this project.

* There are 2 new patches (these are now 0003 and 0004), both of which
are concerned with allowing non-aggressive VACUUM to consistently
advance relfrozenxid. I think that 0003 makes sense on general
principle, but I'm much less sure about 0004. These aren't too
important.

While working on the new approach to freezing taken by v9-0002, I had
some insight about the issues that Robert raised around 0001, too. I
wasn't expecting that to happen.

0002 makes page-level freezing a first class thing.
heap_prepare_freeze_tuple now has some (limited) knowledge of how this
works. heap_prepare_freeze_tuple's cutoff_xid argument is now always
the VACUUM caller's OldestXmin (not its FreezeLimit, as before). We
still have to pass FreezeLimit to heap_prepare_freeze_tuple, which
helps us to respect FreezeLimit as a backstop, and so now it's passed
via the new backstop_cutoff_xid argument instead. Whenever we opt to
"freeze a page", the new page-level algorithm *always* uses the most
recent possible XID and MXID values (OldestXmin and oldestMxact) to
decide what XIDs/XMIDs need to be replaced. That might sound like it'd
be too much, but it only applies to those pages that we actually
decide to freeze (since page-level characteristics drive everything
now). FreezeLimit is only one way of triggering that now (and one of
the least interesting and rarest).

0002 also adds an alternative set of relfrozenxid/relminmxid tracker
variables, to make the "don't freeze the page" path within
lazy_scan_prune simpler (if you don't want to freeze the page, then
use the set of tracker variables that go with that choice, which
heap_prepare_freeze_tuple knows about and helps with). With page-level
freezing, lazy_scan_prune wants to make a decision about the page as a
whole, at the last minute, after all heap_prepare_freeze_tuple calls
have already been made. So I think that heap_prepare_freeze_tuple
needs to know about that aspect of lazy_scan_prune's behavior.

When we *don't* want to freeze the page, we more or less need
everything related to freezing inside lazy_scan_prune to behave like
lazy_scan_noprune, which never freezes the page (that's mostly the
point of lazy_scan_noprune). And that's almost what we actually do --
heap_prepare_freeze_tuple now outsources maintenance of this
alternative set of "don't freeze the page" relfrozenxid/relminmxid
tracker variables to its sibling function, heap_tuple_needs_freeze.
That is the same function that lazy_scan_noprune itself actually
calls.

Now back to Robert's feedback on 0001, which had very complicated
comments in the last version. This approach seems to make the "being
versus becoming" or "going to freeze versus not going to freeze"
distinctions much clearer. This is less true if you assume that 0002
won't be committed but 0001 will be. Even if that happens with
Postgres 15, I have to imagine that adding something like 0002 must be
the real goal, long term. Without 0002, the value from 0001 is far
more limited. You need both together to get the virtuous cycle I've
described.

The approach with always using OldestXmin as cutoff_xid and
oldestMxact as our cutoff_multi makes a lot of sense to me, in part
because I think that it might well cut down on the tendency of VACUUM
to allocate new MultiXacts in order to be able to freeze old ones.
AFAICT the only reason that heap_prepare_freeze_tuple does that is
because it has no flexibility on FreezeLimit and MultiXactCutoff.
These are derived from vacuum_freeze_min_age and
vacuum_multixact_freeze_min_age, respectively, and so they're two
independent though fairly meaningless cutoffs. On the other hand,
OldestXmin and OldestMxact are not independent in the same way. We get
both of them at the same time and the same place, in
vacuum_set_xid_limits. OldestMxact really is very close to OldestXmin
-- only the units differ.

It seems that heap_prepare_freeze_tuple allocates new MXIDs (when
freezing old ones) in large part so it can NOT freeze XIDs that it
would have been useful (and much cheaper) to remove anyway. On HEAD,
FreezeMultiXactId() doesn't get passed down the VACUUM operation's
OldestXmin at all (it actually just gets FreezeLimit passed as its
cutoff_xid argument). It cannot possibly recognize any of this for
itself.

Does that theory about MultiXacts 

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-20 Thread Peter Geoghegan

On Sun, Feb 20, 2022 at 7:30 AM Robert Haas  wrote:
> Right, so we at least need to add a similar comment to what I proposed
> for MXIDs, and maybe other changes are needed, too.

Agreed.

> > Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed
> > here, to make all of this less confusing. I only now fully realized
> > how confusing all of this stuff is -- very.
>
> Right. I think I understand all of this, or at least most of it -- but
> not from the comment. The question is how the comment can be more
> clear. My general suggestion is that function header comments should
> have more to do with the behavior of the function than how it fits
> into the bigger picture. If it's clear to the reader what conditions
> must hold before calling the function and which must hold on return,
> it helps a lot. IMHO, it's the job of the comments in the calling
> function to clarify why we then choose to call that function at the
> place and in the way that we do.

You've given me a lot of high quality feedback on all of this, which
I'll work through soon. It's hard to get the balance right here, but
it's made much easier by this kind of feedback.

> I think that the idea has potential, but I don't think that I
> understand yet what the *exact* algorithm is.

The algorithm seems to exploit a natural tendency that Andres once
described in a blog post about his snapshot scalability work [1]. To a
surprising extent, we can usefully bucket all tuples/pages into two
simple categories:

1. Very, very old ("infinitely old" for all practical purposes).

2. Very, very new.

There doesn't seem to be much need for a third "in-between" category
in practice. This seems to be at least approximately true all of the
time.

Perhaps Andres wouldn't agree with this very general statement -- he
actually said something more specific. I for one believe that the
point he made generalizes surprisingly well, though. I have my own
theories about why this appears to be true. (Executive summary: power
laws are weird, and it seems as if the sparsity-of-effects principle
makes it easy to bucket things at the highest level, in a way that
generalizes well across disparate workloads.)

> Maybe I need to read the
> code, when I have some time for that. I can't form an intelligent
> opinion at this stage about whether this is likely to be a net
> positive.

The code in the v8-0002 patch is a bit sloppy right now. I didn't
quite get around to cleaning it up -- I was focussed on performance
validation of the algorithm itself. So bear that in mind if you do
look at v8-0002 (might want to wait for v9-0002 before looking).

I believe that the only essential thing about the algorithm itself is
that it freezes all the tuples on a page when it anticipates setting
the page all-visible, or (barring edge cases) freezes none at all.
(Note that setting the page all-visible/all-frozen may be happen just
after lazy_scan_prune returns, or in the second pass over the heap,
after LP_DEAD items are set to LP_UNUSED -- lazy_scan_prune doesn't
care which way it will happen.)

There are one or two other design choices that we need to make, like
what exact tuples we freeze in the edge case where FreezeLimit/XID age
forces us to freeze in lazy_scan_prune. These other design choices
don't seem relevant to the issue of central importance, which is
whether or not we come out ahead overall with this new algorithm.
FreezeLimit will seldom affect our choice to freeze or not freeze now,
and so AFAICT the exact way that FreezeLimit affects which precise
freezing-eligible tuples we freeze doesn't complicate performance
validation.

Remember when I got excited about how my big TPC-C benchmark run
showed a predictable, tick/tock style pattern across VACUUM operations
against the order and order lines table [2]? It seemed very
significant to me that the OldestXmin of VACUUM operation n
consistently went on to become the new relfrozenxid for the same table
in VACUUM operation n + 1. It wasn't exactly the same XID, but very
close to it (within the range of noise). This pattern was clearly
present, even though VACUUM operation n + 1 might happen as long as 4
or 5 hours after VACUUM operation n (this was a big table).

This pattern was encouraging to me because it showed (at least for the
workload and tables in question) that the amount of unnecessary extra
freezing can't have been too bad -- the fact that we can always
advance relfrozenxid in the same way is evidence of that. Note that
the vacuum_freeze_min_age setting can't have affected our choice of
what to freeze (given what we see in the logs), and yet there is a
clear pattern where the pages (it's really pages, not tuples) that the
new algorithm doesn't freeze in VACUUM operation n will reliably get
frozen in VACUUM operation n + 1 instead.
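
A rough way to see that same pattern on a test table (a sketch only --
"orders" is a hypothetical table name, and pg_current_xact_id() is just
a crude stand-in for the VACUUM's OldestXmin on an otherwise idle
system):

SELECT pg_current_xact_id();   -- note this, roughly VACUUM n's OldestXmin
VACUUM orders;                 -- VACUUM operation n
-- ... let the workload run for a while ...
VACUUM orders;                 -- VACUUM operation n + 1
SELECT relfrozenxid FROM pg_class WHERE relname = 'orders';
-- relfrozenxid now tends to land very close to the XID noted before
-- VACUUM operation n -- the tick/tock pattern described above.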

And so this pattern seems to lend support to the general idea of
letting the workload itself be the primary driver of what pages we
freeze (not FreezeLimit, and not anything based 

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-20 Thread Robert Haas
On Fri, Feb 18, 2022 at 7:12 PM Peter Geoghegan  wrote:
> We have to worry about XIDs from MultiXacts (and xmax values more
> generally). And we have to worry about the case where we start out
> with only xmin frozen (by an earlier VACUUM), and then have to freeze
> xmax too. I believe that we have to generally consider xmin and xmax
> independently. For example, we cannot ignore xmax, just because we
> looked at xmin, since in general xmin alone might have already been
> frozen.

Right, so we at least need to add a similar comment to what I proposed
for MXIDs, and maybe other changes are needed, too.

> The difference between the cleanup lock path (in
> lazy_scan_prune/heap_prepare_freeze_tuple) and the share lock path (in
> lazy_scan_noprune/heap_tuple_needs_freeze) is what is at issue in both
> of these confusing comment blocks, really. Note that cutoff_xid is the
> name that both heap_prepare_freeze_tuple and heap_tuple_needs_freeze
> have for FreezeLimit (maybe we should rename every occurence of
> cutoff_xid in heapam.c to FreezeLimit).
>
> At a high level, we aren't changing the fundamental definition of an
> aggressive VACUUM in any of the patches -- we still need to advance
> relfrozenxid up to FreezeLimit in an aggressive VACUUM, just like on
> HEAD, today (we may be able to advance it *past* FreezeLimit, but
> that's just a bonus). But in a non-aggressive VACUUM, where there is
> still no strict requirement to advance relfrozenxid (by any amount),
> the code added by 0001 can set relfrozenxid to any known safe value,
> which could either be from before FreezeLimit, or after FreezeLimit --
> almost anything is possible (provided we respect the relfrozenxid
> invariant, and provided we see that we didn't skip any
> all-visible-not-all-frozen pages).
>
> Since we still need to "respect FreezeLimit" in an aggressive VACUUM,
> the aggressive case might need to wait for a full cleanup lock the
> hard way, having tried and failed to do it the easy way within
> lazy_scan_noprune (lazy_scan_noprune will still return false when any
> call to heap_tuple_needs_freeze for any tuple returns false) -- same
> as on HEAD, today.
>
> And so the difference at issue here is: FreezeLimit/cutoff_xid only
> needs to affect the new NewRelfrozenxid value we use for relfrozenxid in
> heap_prepare_freeze_tuple, which is involved in real freezing -- not
> in heap_tuple_needs_freeze, whose main purpose is still to help us
> avoid freezing where a cleanup lock isn't immediately available. While
> the purpose of FreezeLimit/cutoff_xid within heap_tuple_needs_freeze
> is to determine its bool return value, which will only be of interest
> to the aggressive case (which might have to get a cleanup lock and do
> it the hard way), not the non-aggressive case (where ratcheting back
> NewRelfrozenxid is generally possible, and generally leaves us with
> almost as good of a value).
>
> In other words: the calls to heap_tuple_needs_freeze made from
> lazy_scan_noprune are simply concerned with the page as it actually
> is, whereas the similar/corresponding calls to
> heap_prepare_freeze_tuple from lazy_scan_prune are concerned with
> *what the page will actually become*, after freezing finishes, and
> after lazy_scan_prune is done with the page entirely (ultimately
> the final NewRelfrozenxid value set in pg_class.relfrozenxid only has
> to be <= the oldest extant XID *at the time the VACUUM operation is
> just about to end*, not some earlier time, so "being versus becoming"
> is an interesting distinction for us).
>
> Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed
> here, to make all of this less confusing. I only now fully realized
> how confusing all of this stuff is -- very.

Right. I think I understand all of this, or at least most of it -- but
not from the comment. The question is how the comment can be more
clear. My general suggestion is that function header comments should
have more to do with the behavior of the function than how it fits
into the bigger picture. If it's clear to the reader what conditions
must hold before calling the function and which must hold on return,
it helps a lot. IMHO, it's the job of the comments in the calling
function to clarify why we then choose to call that function at the
place and in the way that we do.

> As a general rule, we try to freeze all of the remaining live tuples
> on a page (following pruning) together, as a group, or none at all.
> Most of the time this is triggered by our noticing that the page is
> about to be set all-visible (but not all-frozen), and doing work
> sufficient to mark it fully all-frozen instead. Occasionally there is
> FreezeLimit to consider, which is now more of a backstop thing, used
> to make sure that we never get too far behind in terms of unfrozen
> XIDs. This is useful in part because it avoids any future
> non-aggressive VACUUM that is fundamentally unable to advance
> relfrozenxid (you can't skip all-visible pages if there are 

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-20 Thread Robert Haas
On Sat, Feb 19, 2022 at 8:54 PM Andres Freund  wrote:
> > Leaving behind disconnected/orphaned heap-only tuples is pretty much
> > pointless anyway, since they'll never be accessible by index scans.
> > Even after a REINDEX, since there is no root item from the heap page
> > to go in the index. (A dump and restore might work better, though.)
>
> Given that heap_surgery's raison d'etre is correcting corruption etc, I think
> it makes sense for it to do as minimal work as possible. Iterating through a
> HOT chain would be a problem if you e.g. tried to repair a page with HOT
> corruption.

Yeah, I agree. I don't have time to respond to all of these emails
thoroughly right now, but I think it's really important that
pg_surgery do the exact surgery the user requested, and not any other
work. I don't think that page defragmentation should EVER be REQUIRED
as a condition of other work. If other code is relying on that, I'd
say it's busted. I'm a little more uncertain about the case where we
kill the root tuple of a HOT chain, because I can see that this might
leave the page in a state where sequential scans see the tuple at the end
of the chain and index scans don't. I'm not sure whether that should
be the responsibility of pg_surgery itself to avoid, or whether that's
your problem as a user of it -- although I lean mildly toward the
latter view, at the moment. But in any case surely the pruning code
can't just decide to go into an infinite loop if that happens. Code
that manipulates the states of data pages needs to be as robust
against arbitrary on-disk states as we can reasonably make it, because
pages get garbled on disk all the time.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-19 Thread Peter Geoghegan
On Sat, Feb 19, 2022 at 8:21 PM Andres Freund  wrote:
> Why does anything need to happen to it from vacuum's POV?  It'll not be a 
> problem for freezing etc. Until it's deleted vacuum doesn't need to care.
>
> Probably worth a WARNING, and amcheck definitely needs to detect it, but 
> otherwise I think it's fine to just continue.

Maybe that's true, but it's just really weird to imagine not having an
LP_REDIRECT that points to the LIVE item here, without throwing an
error. Seems kind of iffy, to say the least.

> >I guess it doesn't actually matter if we leave an aborted DEAD tuple
> >behind, that we could have pruned away, but didn't. The important
> >thing is to be consistent at the level of the page.
>
> That's not ok, because it opens up dangers of being interpreted differently 
> after wraparound etc.
>
> But I don't see any cases where it would happen with the new pruning logic in 
> your patch and sharing the HTSV status array?

Right. Fundamentally, there isn't any reason why it should matter that
VACUUM reached the heap page just before (rather than concurrent with
or just after) some xact that inserted or updated on the page aborts.
Just as long as we have a consistent idea about what's going on at the
level of the whole page (or maybe the level of each HOT chain, but the
whole page level seems simpler to me).

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-19 Thread Andres Freund
Hi, 

On February 19, 2022 7:56:53 PM PST, Peter Geoghegan  wrote:
>On Sat, Feb 19, 2022 at 7:47 PM Andres Freund  wrote:
>> Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). 
>> The
>> problem here is a DEAD orphaned HOT tuples, and those we should be able to
>> delete with the new page pruning logic, right?
>
>Right. But what good does that really do? The problematic page had a
>third tuple (at offnum 3) that was LIVE. If we could have done
>something about the problematic tuple at offnum 2 (which is where we
>got stuck), then we'd still be left with a very unpleasant choice
>about what happens to the third tuple.

Why does anything need to happen to it from vacuum's POV?  It'll not be a 
problem for freezing etc. Until it's deleted vacuum doesn't need to care.

Probably worth a WARNING, and amcheck definitely needs to detect it, but 
otherwise I think it's fine to just continue.


>> I think it might be worth getting rid of the need for the retry approach by
>> reusing the same HTSV status array between heap_prune_page and
>> lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in
>> lazy_scan_prune() would be some form of corruption.  And it'd be a pretty
>> decent performance boost, HTSV ain't cheap.
>
>I guess it doesn't actually matter if we leave an aborted DEAD tuple
>behind, that we could have pruned away, but didn't. The important
>thing is to be consistent at the level of the page.

That's not ok, because it opens up dangers of being interpreted differently 
after wraparound etc.

But I don't see any cases where it would happen with the new pruning logic in 
your patch and sharing the HTSV status array?

Andres


-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-19 Thread Peter Geoghegan
On Sat, Feb 19, 2022 at 7:47 PM Andres Freund  wrote:
> I'm not that sure those are that different... Imagine some corruption leading
> to two hot chains ending in the same tid, which our fancy new secure pruning
> algorithm might detect.

I suppose that's possible, but it doesn't seem all that likely to ever
happen, what with the xmin -> xmax cross-tuple matching stuff.

> Either way, I'm a bit surprised about the logic to not allow killing redirect
> items? What if you have a redirect pointing to an unused item?

Again, I simply think it boils down to having to treat HOT chains as a
whole unit when killing TIDs.

> > Let's assume that we don't want to make VACUUM/pruning just treat
> > orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise
> > status
>
> I don't think that'd ever be a good idea. Those tuples are visible to a
> seqscan after all.

I agree (I don't hate it completely, but it seems mostly bad). This is
what leads me to the conclusion that pg_surgery has to be able to do
this instead. Surely it's not okay to have something that makes VACUUM
always end in error, that cannot even be fixed by pg_surgery.

> > -- let's say that we want to err in the direction of doing
> > nothing at all with the page. Now we have to have a weird error in
> > VACUUM instead (not great, but better than just spinning between
> > lazy_scan_prune and heap_prune_page).
>
> Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). The
> problem here is a DEAD orphaned HOT tuples, and those we should be able to
> delete with the new page pruning logic, right?

Right. But what good does that really do? The problematic page had a
third tuple (at offnum 3) that was LIVE. If we could have done
something about the problematic tuple at offnum 2 (which is where we
got stuck), then we'd still be left with a very unpleasant choice
about what happens to the third tuple.

> I think it might be worth getting rid of the need for the retry approach by
> reusing the same HTSV status array between heap_prune_page and
> lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in
> lazy_scan_prune() would be some form of corruption.  And it'd be a pretty
> decent performance boost, HTSV ain't cheap.

I guess it doesn't actually matter if we leave an aborted DEAD tuple
behind, that we could have pruned away, but didn't. The important
thing is to be consistent at the level of the page.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-19 Thread Andres Freund
Hi,

On 2022-02-19 19:31:21 -0800, Peter Geoghegan wrote:
> On Sat, Feb 19, 2022 at 6:16 PM Peter Geoghegan  wrote:
> > > Given that heap_surgery's raison d'etre is correcting corruption etc, I 
> > > think
> > > it makes sense for it to do as minimal work as possible. Iterating 
> > > through a
> > > HOT chain would be a problem if you e.g. tried to repair a page with HOT
> > > corruption.
> >
> > I guess that's also true. There is at least a legitimate argument to
> > be made for not leaving behind any orphaned heap-only tuples. The
> > interface is a TID, and so the user may already believe that they're
> > killing the heap-only, not just the root item (since ctid suggests
> > that the TID of a heap-only tuple is the TID of the root item, which
> > is kind of misleading).
> 
> Actually, I would say that heap_surgery's raison d'etre is making
> weird errors related to corruption of this or that TID go away, so
> that the user can cut their losses. That's how it's advertised.

I'm not that sure those are that different... Imagine some corruption leading
to two hot chains ending in the same tid, which our fancy new secure pruning
algorithm might detect.

Either way, I'm a bit surprised about the logic to not allow killing redirect
items? What if you have a redirect pointing to an unused item?


> Let's assume that we don't want to make VACUUM/pruning just treat
> orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise
> status

I don't think that'd ever be a good idea. Those tuples are visible to a
seqscan after all.


> -- let's say that we want to err in the direction of doing
> nothing at all with the page. Now we have to have a weird error in
> VACUUM instead (not great, but better than just spinning between
> lazy_scan_prune and heap_prune_page).

Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). The
problem here is a DEAD orphaned HOT tuples, and those we should be able to
delete with the new page pruning logic, right?


I think it might be worth getting rid of the need for the retry approach by
reusing the same HTSV status array between heap_prune_page and
lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in
lazy_scan_prune() would be some form of corruption.  And it'd be a pretty
decent performance boost, HTSV ain't cheap.

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-19 Thread Peter Geoghegan
On Sat, Feb 19, 2022 at 7:28 PM Andres Freund  wrote:
> If the vacuum can get the cleanup lock due to the adversarial patch, the
> heap_force_kill() doesn't do anything, because the first item is a
> redirect. However if it *can't* get a cleanup lock, heap_force_kill() instead
> targets the root item. Triggering the endless loop.

But it shouldn't matter if the root item is an LP_REDIRECT or a normal
(not heap-only) tuple with storage. Either way it's the root of a HOT
chain.

The fact that pg_surgery treats LP_REDIRECT items differently from the
other kind of root items is just arbitrary. It seems to have more to
do with freezing tuples than killing tuples.


--
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-19 Thread Peter Geoghegan
On Sat, Feb 19, 2022 at 6:16 PM Peter Geoghegan  wrote:
> > Given that heap_surgery's raison d'etre is correcting corruption etc, I 
> > think
> > it makes sense for it to do as minimal work as possible. Iterating through a
> > HOT chain would be a problem if you e.g. tried to repair a page with HOT
> > corruption.
>
> I guess that's also true. There is at least a legitimate argument to
> be made for not leaving behind any orphaned heap-only tuples. The
> interface is a TID, and so the user may already believe that they're
> killing the heap-only, not just the root item (since ctid suggests
> that the TID of a heap-only tuple is the TID of the root item, which
> is kind of misleading).

Actually, I would say that heap_surgery's raison d'etre is making
weird errors related to corruption of this or that TID go away, so
that the user can cut their losses. That's how it's advertised.

Let's assume that we don't want to make VACUUM/pruning just treat
orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise
status -- let's say that we want to err in the direction of doing
nothing at all with the page. Now we have to have a weird error in
VACUUM instead (not great, but better than just spinning between
lazy_scan_prune and heap_prune_page). And we've just created natural
demand for heap_surgery to deal with the problem by deleting whole HOT
chains (not just root items).

If we allow VACUUM to treat orphaned heap-only tuples as DEAD right
away, then we might as well do the same thing in heap_surgery, since
there is little chance that the user will get to the heap-only tuples
before VACUUM does (not something to rely on, at any rate).

Either way, I think we probably end up needing to teach heap_surgery
to kill entire HOT chains as a group, given a TID.
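
In the meantime, one could in principle approximate that by passing every
TID in the chain explicitly. Purely illustrative TIDs below; in practice
the chain would first have to be identified with pageinspect:

SELECT heap_force_kill('htab2'::regclass,
                       ARRAY['(0,1)', '(0,2)', '(0,3)']::tid[]);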

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-19 Thread Andres Freund
Hi,

On 2022-02-19 19:07:39 -0800, Peter Geoghegan wrote:
> On Sat, Feb 19, 2022 at 7:01 PM Andres Freund  wrote:
> > > We can either do that, or we can throw an error concerning corruption
> > > when heap_page_prune notices orphaned tuples. Neither seems
> > > particularly appealing. But it definitely makes no sense to allow
> > > lazy_scan_prune to spin in a futile attempt to reach agreement with
> > > heap_page_prune about a DEAD tuple really being DEAD.
> >
> > Yea, this sucks. I think we should go for the rewrite of the
> > heap_prune_chain() logic. The current approach is just never going to be
> > robust.
> 
> No, it just isn't robust enough. But it's not that hard to fix. My
> patch really wasn't invasive.

I think we're in agreement there. We might think at some point about
backpatching too, but I'd rather have it stew in HEAD for a bit first.


> I confirmed that HeapTupleSatisfiesVacuum() and
> heap_prune_satisfies_vacuum() agree that the heap-only tuple at offnum
> 2 is HEAPTUPLE_DEAD -- they are in agreement, as expected (so no
> reason to think that there is a new bug involved). The problem here is
> indeed just that heap_prune_chain() can't "get to" the tuple, given
> its current design.

Right.

The reason that the "adversarial" patch makes a difference is solely that it
changes the heap_surgery test to actually kill an item, which it doesn't
intend:

create temp table htab2(a int);
insert into htab2 values (100);
update htab2 set a = 200;
vacuum htab2;

-- redirected TIDs should be skipped
select heap_force_kill('htab2'::regclass, ARRAY['(0, 1)']::tid[]);


If the vacuum can get the cleanup lock despite the adversarial patch,
heap_force_kill() doesn't do anything, because the first item is a
redirect. However, if it *can't* get a cleanup lock, heap_force_kill()
instead targets the root item, triggering the endless loop.


Hm. I think this might be a mild regression in 14. In < 14 we'd just skip the
tuple in lazy_scan_heap(), but now we have an uninterruptible endless
loop.


We'd do completely bogus stuff later in < 14 though, I think we'd just leave
it in place despite being older than relfrozenxid, which obviously has its own
set of issues.

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-19 Thread Peter Geoghegan
On Sat, Feb 19, 2022 at 7:01 PM Andres Freund  wrote:
> It's kind of surprising that this needs this
> 0001-Add-adversarial-ConditionalLockBuff to break. I suspect it's a question
> of hint bits changing due to lazy_scan_noprune(), which then makes
> HeapTupleHeaderIsHotUpdated() have a different return value, preventing the
> "If the tuple is DEAD and doesn't chain to anything else"
> path from being taken.

That makes sense as an explanation. Goes to show just how fragile the
"DEAD and doesn't chain to anything else" logic at the top of
heap_prune_chain really is.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-19 Thread Peter Geoghegan
On Sat, Feb 19, 2022 at 7:01 PM Andres Freund  wrote:
> > We can either do that, or we can throw an error concerning corruption
> > when heap_page_prune notices orphaned tuples. Neither seems
> > particularly appealing. But it definitely makes no sense to allow
> > lazy_scan_prune to spin in a futile attempt to reach agreement with
> > heap_page_prune about a DEAD tuple really being DEAD.
>
> Yea, this sucks. I think we should go for the rewrite of the
> heap_prune_chain() logic. The current approach is just never going to be
> robust.

No, it just isn't robust enough. But it's not that hard to fix. My
patch really wasn't invasive.

I confirmed that HeapTupleSatisfiesVacuum() and
heap_prune_satisfies_vacuum() agree that the heap-only tuple at offnum
2 is HEAPTUPLE_DEAD -- they are in agreement, as expected (so no
reason to think that there is a new bug involved). The problem here is
indeed just that heap_prune_chain() can't "get to" the tuple, given
its current design.

For anybody else that doesn't follow what we're talking about:

The "doesn't chain to anything else" code at the start of
heap_prune_chain() won't get to the heap-only tuple at offnum 2, since
the tuple is itself HeapTupleHeaderIsHotUpdated() -- the expectation
is that it'll be processed later on, once we locate the HOT chain's
root item. Since, of course, the "root item" was already LP_DEAD
before we even reached heap_page_prune() (on account of the pg_surgery
corruption), there is no possible way that that can happen later on.
And so we cannot find the same heap-only tuple and mark it LP_UNUSED
(which is how we always deal with HEAPTUPLE_DEAD heap-only tuples)
during pruning.
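
For anyone that wants to inspect a page in this state themselves, the
layout described above is visible with pageinspect (a sketch; block 0 of
the htab2 table from the test, where lp_flags 1/2/3 mean
LP_NORMAL/LP_REDIRECT/LP_DEAD and 32768 is the HEAP_ONLY_TUPLE
infomask2 bit):

CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT lp, lp_flags, t_xmin, t_xmax, t_ctid,
       (t_infomask2 & 32768) <> 0 AS heap_only
FROM heap_page_items(get_raw_page('htab2', 0));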

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-19 Thread Andres Freund
Hi,

On 2022-02-19 18:16:54 -0800, Peter Geoghegan wrote:
> On Sat, Feb 19, 2022 at 5:54 PM Andres Freund  wrote:
> > How does that cause the endless loop?
> 
> Attached is the page image itself, dumped via gdb (and gzip'd). This
> was on recent HEAD (commit 8f388f6f, actually), plus
> 0001-Add-adversarial-ConditionalLockBuff[...]. No other changes. No
> defragmenting in pg_surgery, nothing like that.

> > It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for
> > me. So something needs to have changed with your patch?
> 
> It doesn't always happen -- only about half the time on my machine.
> Maybe it's timing sensitive?

Ah, I'd only run the tests three times or so, without it happening. Trying a
few more times repro'd it.


It's kind of surprising that this needs this
0001-Add-adversarial-ConditionalLockBuff to break. I suspect it's a question
of hint bits changing due to lazy_scan_noprune(), which then makes
HeapTupleHeaderIsHotUpdated() have a different return value, preventing the
"If the tuple is DEAD and doesn't chain to anything else"
path from being taken.


> We hit the "goto retry" on offnum 2, which is the first tuple with
> storage (you can see "the ghost" of the tuple from the LP_DEAD item at
> offnum 1, since the page isn't defragmented in pg_surgery). I think
> that this happens because the heap-only tuple at offnum 2 is fully
> DEAD to lazy_scan_prune, but hasn't been recognized as such by
> heap_page_prune. There is no way that they'll ever "agree" on the
> tuple being DEAD right now, because pruning still doesn't assume that
> an orphaned heap-only tuple is fully DEAD.

> We can either do that, or we can throw an error concerning corruption
> when heap_page_prune notices orphaned tuples. Neither seems
> particularly appealing. But it definitely makes no sense to allow
> lazy_scan_prune to spin in a futile attempt to reach agreement with
> heap_page_prune about a DEAD tuple really being DEAD.

Yea, this sucks. I think we should go for the rewrite of the
heap_prune_chain() logic. The current approach is just never going to be
robust.

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-19 Thread Peter Geoghegan
On Sat, Feb 19, 2022 at 5:54 PM Andres Freund  wrote:
> How does that cause the endless loop?

Attached is the page image itself, dumped via gdb (and gzip'd). This
was on recent HEAD (commit 8f388f6f, actually), plus
0001-Add-adversarial-ConditionalLockBuff[...]. No other changes. No
defragmenting in pg_surgery, nothing like that.

> It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for
> me. So something needs to have changed with your patch?

It doesn't always happen -- only about half the time on my machine.
Maybe it's timing sensitive?

We hit the "goto retry" on offnum 2, which is the first tuple with
storage (you can see "the ghost" of the tuple from the LP_DEAD item at
offnum 1, since the page isn't defragmented in pg_surgery). I think
that this happens because the heap-only tuple at offnum 2 is fully
DEAD to lazy_scan_prune, but hasn't been recognized as such by
heap_page_prune. There is no way that they'll ever "agree" on the
tuple being DEAD right now, because pruning still doesn't assume that
an orphaned heap-only tuple is fully DEAD.

We can either do that, or we can throw an error concerning corruption
when heap_page_prune notices orphaned tuples. Neither seems
particularly appealing. But it definitely makes no sense to allow
lazy_scan_prune to spin in a futile attempt to reach agreement with
heap_page_prune about a DEAD tuple really being DEAD.

> Given that heap_surgery's raison d'etre is correcting corruption etc, I think
> it makes sense for it to do as minimal work as possible. Iterating through a
> HOT chain would be a problem if you e.g. tried to repair a page with HOT
> corruption.

I guess that's also true. There is at least a legitimate argument to
be made for not leaving behind any orphaned heap-only tuples. The
interface is a TID, and so the user may already believe that they're
killing the heap-only, not just the root item (since ctid suggests
that the TID of a heap-only tuple is the TID of the root item, which
is kind of misleading).

Anyway, we can decide on what to do in heap_surgery later, once the
main issue is under control. My point was mostly just that orphaned
heap-only tuples are definitely not okay, in general. They are the
least worst option when corruption has already happened, maybe -- but
maybe not.

-- 
Peter Geoghegan


corrupt-hot-chain.page.gz
Description: GNU Zip compressed data


Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-19 Thread Andres Freund
Hi,

On 2022-02-19 17:22:33 -0800, Peter Geoghegan wrote:
> Looks like pg_surgery isn't processing HOT chains as whole units,
> which it really should (at least in the context of killing items via
> the heap_force_kill() function). Killing a root item in a HOT chain is
> just hazardous -- disconnected/orphaned heap-only tuples are liable to
> cause chaos, and should be avoided everywhere (including during
> pruning, and within pg_surgery).

How does that cause the endless loop?

It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for
me. So something must have changed with your patch?


> It's likely that the hardening I already planned on adding to pruning
> [1] (as follow-up work to recent bugfix commit 18b87b201f) will
> prevent lazy_scan_prune from getting stuck like this, whatever the
> cause happens to be.

Yea, we should pick that up again. Not just for robustness or
performance. Also because it's just a lot easier to understand.


> Leaving behind disconnected/orphaned heap-only tuples is pretty much
> pointless anyway, since they'll never be accessible by index scans.
> Even after a REINDEX, since there is no root item from the heap page
> to go in the index. (A dump and restore might work better, though.)

Given that heap_surgery's raison d'etre is correcting corruption etc, I think
it makes sense for it to do as little work as possible. Iterating through a
HOT chain would be a problem if you e.g. tried to repair a page with HOT
corruption.

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-19 Thread Peter Geoghegan
On Sat, Feb 19, 2022 at 4:22 PM Peter Geoghegan  wrote:
> This very much looks like a bug in pg_surgery itself now -- attached
> is a draft fix.

Wait, that's not it either. I jumped the gun -- this isn't sufficient
(though the patch I posted might not be a bad idea anyway).

Looks like pg_surgery isn't processing HOT chains as whole units,
which it really should (at least in the context of killing items via
the heap_force_kill() function). Killing a root item in a HOT chain is
just hazardous -- disconnected/orphaned heap-only tuples are liable to
cause chaos, and should be avoided everywhere (including during
pruning, and within pg_surgery).

It's likely that the hardening I already planned on adding to pruning
[1] (as follow-up work to recent bugfix commit 18b87b201f) will
prevent lazy_scan_prune from getting stuck like this, whatever the
cause happens to be. The actual page image I see lazy_scan_prune choke
on (i.e. exhibit the same infinite loop unpleasantness we've seen
before) is not in a consistent state at all (it consists of
tuples from a single HOT chain, and the HOT chain is totally
inconsistent on account of having an LP_DEAD line pointer root item).
pg_surgery could in principle do the right thing here by always
treating HOT chains as whole units.

Leaving behind disconnected/orphaned heap-only tuples is pretty much
pointless anyway, since they'll never be accessible by index scans.
Even after a REINDEX, since there is no root item from the heap page
to go in the index. (A dump and restore might work better, though.)

[1] 
https://postgr.es/m/cah2-wzmnk6v6tqzuuabxoxm8hjrawu6h12toas-bqycliht...@mail.gmail.com
-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-19 Thread Peter Geoghegan
On Sat, Feb 19, 2022 at 3:08 PM Peter Geoghegan  wrote:
> It's quite possible that this is nothing more than a bug in my
> adversarial gizmo patch -- since I don't think that
> ConditionalLockBufferForCleanup() can ever fail with a temp buffer
> (though even that's not completely clear right now). Even if the
> behavior that I saw does not indicate a bug on HEAD, it still seems
> informative.

This very much looks like a bug in pg_surgery itself now -- attached
is a draft fix.

The temp table thing was a red herring. I found I could get exactly
the same kind of failure when htab2 was a permanent table (which was
how it originally appeared, before commit 0811f766fd made it into a
temp table due to test flappiness issues). The relevant "vacuum freeze
htab2" happens at a point after the test has already deliberately
corrupted one of its tuples using heap_force_kill().  It's not that we
aren't careful enough about the corruption at some point in
vacuumlazy.c, which was my second theory. But I quickly discarded that
idea, and came up with a third theory: the relevant heap_surgery.c
path does the relevant ItemIdSetDead() to kill items, without also
defragmenting the page to remove the tuples with storage, which is
wrong.

This meant that we depended on pruning happening (in this case during
VACUUM) and defragmenting the page in passing. But there is no reason
to not defragment the page within pg_surgery (at least no obvious
reason), since we have a cleanup lock anyway.

Theoretically you could blame this on lazy_scan_noprune instead, since
it thinks it can collect LP_DEAD items while assuming that they have
no storage, but that doesn't make much sense to me. There has never
been any way of setting a heap item to LP_DEAD without also
defragmenting the page, since that's exactly what it means to prune a
heap page. (Actually, the same used to be true about heap vacuuming,
which worked more like heap pruning before Postgres 14, but that
doesn't seem important.)

-- 
Peter Geoghegan
From 81f01ca623b115647ee78a1b09bbb4458fb35dab Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sat, 19 Feb 2022 16:13:48 -0800
Subject: [PATCH 2/2] Fix for pg_surgery's heap_force_kill() function.

---
 contrib/pg_surgery/heap_surgery.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_surgery/heap_surgery.c 
b/contrib/pg_surgery/heap_surgery.c
index 3e641aa64..a3a193ba5 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -311,7 +311,8 @@ heap_force_common(FunctionCallInfo fcinfo, 
HeapTupleForceOption heap_force_opt)
 */
if (did_modify_page)
{
-   /* Mark buffer dirty before we write WAL. */
+   /* Defragment and mark buffer dirty before we write 
WAL. */
+   PageRepairFragmentation(page);
MarkBufferDirty(buf);
 
/* XLOG stuff */
-- 
2.30.2



Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-19 Thread Andres Freund
Hi,

(On phone, so crappy formatting and no source access)

On February 19, 2022 3:08:41 PM PST, Peter Geoghegan  wrote:
>On Fri, Feb 18, 2022 at 5:00 PM Peter Geoghegan  wrote:
>> Another testing strategy occurs to me: we could stress-test the
>> implementation by simulating an environment where the no-cleanup-lock
>> path is hit an unusually large number of times, possibly a fixed
>> percentage of the time (like 1%, 5%), say by making vacuumlazy.c's
>> ConditionalLockBufferForCleanup() call return false randomly. Now that
>> we have lazy_scan_noprune for the no-cleanup-lock path (which is as
>> similar to the regular lazy_scan_prune path as possible), I wouldn't
>> expect this ConditionalLockBufferForCleanup() testing gizmo to be too
>> disruptive.
>
>I tried this out, using the attached patch. It was quite interesting,
>even when run against HEAD. I think that I might have found a bug on
>HEAD, though I'm not really sure.
>
>If you modify the patch to simulate conditions under which
>ConditionalLockBufferForCleanup() fails about 2% of the time, you get
>much better coverage of lazy_scan_noprune/heap_tuple_needs_freeze,
>without it being so aggressive as to make "make check-world" fail --
>which is exactly what I expected. If you are much more aggressive
>about it, and make it 50% instead (which you can get just by using the
>patch as written), then some tests will fail, mostly for reasons that
>aren't surprising or interesting (e.g. plan changes). This is also
>what I'd have guessed would happen.
>
>However, it gets more interesting. One thing that I did not expect to
>happen at all also happened (with the current 50% rate of simulated
>ConditionalLockBufferForCleanup() failure from the patch): if I run
>"make check" from the pg_surgery directory, then the Postgres backend
>gets stuck in an infinite loop inside lazy_scan_prune, which has been
>a symptom of several tricky bugs in the past year (not every time, but
>usually). Specifically, the VACUUM statement launched by the SQL
>command "vacuum freeze htab2;" from the file
>contrib/pg_surgery/sql/heap_surgery.sql, at line 54 leads to this
>misbehavior.


>This is a temp table, which is a choice made by the tests specifically
>because they need to "use a temp table so that vacuum behavior doesn't
>depend on global xmin". This is convenient way of avoiding spurious
>regression tests failures (e.g. from autoanalyze), and relies on the
>GlobalVisTempRels behavior established by Andres' 2020 bugfix commit
>94bc27b5.

We don't have a blocking path for cleanup locks of temporary buffers IIRC 
(normally not reachable). So I wouldn't be surprised if a cleanup lock failing 
would cause some odd behavior.

>It's quite possible that this is nothing more than a bug in my
>adversarial gizmo patch -- since I don't think that
>ConditionalLockBufferForCleanup() can ever fail with a temp buffer
>(though even that's not completely clear right now). Even if the
>behavior that I saw does not indicate a bug on HEAD, it still seems
>informative. At the very least, it wouldn't hurt to Assert() that the
>target table isn't a temp table inside lazy_scan_noprune, documenting
>our assumptions around temp tables and
>ConditionalLockBufferForCleanup().

Definitely worth looking into more.


This reminds me of a recent thing I noticed in the aio patch. Spgist can end up 
busy looping when buffers are locked, instead of blocking. Not actually 
related, of course.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-19 Thread Peter Geoghegan
On Fri, Feb 18, 2022 at 5:00 PM Peter Geoghegan  wrote:
> Another testing strategy occurs to me: we could stress-test the
> implementation by simulating an environment where the no-cleanup-lock
> path is hit an unusually large number of times, possibly a fixed
> percentage of the time (like 1%, 5%), say by making vacuumlazy.c's
> ConditionalLockBufferForCleanup() call return false randomly. Now that
> we have lazy_scan_noprune for the no-cleanup-lock path (which is as
> similar to the regular lazy_scan_prune path as possible), I wouldn't
> expect this ConditionalLockBufferForCleanup() testing gizmo to be too
> disruptive.

I tried this out, using the attached patch. It was quite interesting,
even when run against HEAD. I think that I might have found a bug on
HEAD, though I'm not really sure.

If you modify the patch to simulate conditions under which
ConditionalLockBufferForCleanup() fails about 2% of the time, you get
much better coverage of lazy_scan_noprune/heap_tuple_needs_freeze,
without it being so aggressive as to make "make check-world" fail --
which is exactly what I expected. If you are much more aggressive
about it, and make it 50% instead (which you can get just by using the
patch as written), then some tests will fail, mostly for reasons that
aren't surprising or interesting (e.g. plan changes). This is also
what I'd have guessed would happen.

However, it gets more interesting. One thing that I did not expect to
happen at all also happened (with the current 50% rate of simulated
ConditionalLockBufferForCleanup() failure from the patch): if I run
"make check" from the pg_surgery directory, then the Postgres backend
gets stuck in an infinite loop inside lazy_scan_prune, which has been
a symptom of several tricky bugs in the past year (not every time, but
usually). Specifically, the VACUUM statement launched by the SQL
command "vacuum freeze htab2;" from the file
contrib/pg_surgery/sql/heap_surgery.sql, at line 54 leads to this
misbehavior.

This is a temp table, which is a choice made by the tests specifically
because they need to "use a temp table so that vacuum behavior doesn't
depend on global xmin". This is convenient way of avoiding spurious
regression tests failures (e.g. from autoanalyze), and relies on the
GlobalVisTempRels behavior established by Andres' 2020 bugfix commit
94bc27b5.

It's quite possible that this is nothing more than a bug in my
adversarial gizmo patch -- since I don't think that
ConditionalLockBufferForCleanup() can ever fail with a temp buffer
(though even that's not completely clear right now). Even if the
behavior that I saw does not indicate a bug on HEAD, it still seems
informative. At the very least, it wouldn't hurt to Assert() that the
target table isn't a temp table inside lazy_scan_noprune, documenting
our assumptions around temp tables and
ConditionalLockBufferForCleanup().

I haven't actually tried to debug the issue just yet, so take all this
with a grain of salt.

-- 
Peter Geoghegan
From 3f01281af3ba81b35777cb7d717f76e001fd3e10 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sat, 19 Feb 2022 14:07:35 -0800
Subject: [PATCH] Add adversarial ConditionalLockBufferForCleanup() gizmo to
 vacuumlazy.c.

---
 src/backend/access/heap/vacuumlazy.c | 36 +++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/vacuumlazy.c 
b/src/backend/access/heap/vacuumlazy.c
index 242511a23..31c6b360e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -50,6 +50,7 @@
 #include "commands/dbcommands.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "optimizer/paths.h"
@@ -748,6 +749,39 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
 }
 
+/*
+ * Adversarial gizmo, simulates excessive failure to get cleanup locks
+ */
+static inline bool
+lazy_conditionallockbufferforcleanup(Buffer buffer)
+{
+   /*
+* Artificially fail to get a cleanup lock 50% of the time.
+*
+* XXX: What about temp tables?  We simulate not getting a cleanup lock
+* there, but is that choice actually reasonable?
+*/
+   if (pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 2))
+   return false;
+
+#if 0
+   /*
+* 50% is very very aggressive, while 2% - 5% is still basically
+* adversarial but in many ways less annoying.
+*
+* This version (which injects a failure to get a cleanup lock 2% of the
+* time) seems to pass the regression tests, even with my parallel make
+* check-world recipe.  Expected query plans don't seem to shift on
+* account of unexpected index bloat (nor are there any problems of a
+* similar nature) with this variant of the gizmo.
+*/
+   if (pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 50))
+ 

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-18 Thread Peter Geoghegan
On Fri, Feb 18, 2022 at 2:11 PM Andres Freund  wrote:
> I think it'd be good to add a few isolationtest cases for the
> can't-get-cleanup-lock paths. I think it shouldn't be hard using cursors. The
> slightly harder part is verifying that VACUUM did something reasonable, but
> that still should be doable?

We could even just extend existing, related tests from vacuum-reltuples.spec.

Another testing strategy occurs to me: we could stress-test the
implementation by simulating an environment where the no-cleanup-lock
path is hit an unusually large number of times, possibly a fixed
percentage of the time (like 1%, 5%), say by making vacuumlazy.c's
ConditionalLockBufferForCleanup() call return false randomly. Now that
we have lazy_scan_noprune for the no-cleanup-lock path (which is as
similar to the regular lazy_scan_prune path as possible), I wouldn't
expect this ConditionalLockBufferForCleanup() testing gizmo to be too
disruptive.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-18 Thread Peter Geoghegan
On Fri, Feb 18, 2022 at 1:56 PM Robert Haas  wrote:
> + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
> + * target relfrozenxid and relminmxid for the relation.  Assumption is that
>
> "maintains" is fuzzy. I think you should be saying something much more
> explicit, and the thing you are saying should make it clear that these
> arguments are input-output arguments: i.e. the caller must set them
> correctly before calling this function, and they will be updated by
> the function.

Makes sense.

> I don't think you have to spell all of that out in every
> place where this comes up in the patch, but it needs to be clear from
> what you do say. For example, I would be happier with a comment that
> said something like "Every call to this function will either set
> HEAP_XMIN_FROZEN in the xl_heap_freeze_tuple struct passed as an
> argument, or else reduce *NewRelfrozenxid to the xmin of the tuple if
> it is currently newer than that. Thus, after a series of calls to this
> function, *NewRelfrozenxid represents a lower bound on unfrozen xmin
> values in the tuples examined. Before calling this function, caller
> should initialize *NewRelfrozenxid to ."

We have to worry about XIDs from MultiXacts (and xmax values more
generally). And we have to worry about the case where we start out
with only xmin frozen (by an earlier VACUUM), and then have to freeze
xmax too. I believe that we have to generally consider xmin and xmax
independently. For example, we cannot ignore xmax, just because we
looked at xmin, since in general xmin alone might have already been
frozen.

> + * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
> + * target relfrozenxid and relminmxid for the relation.  Assumption is that
> + * caller will never freeze any of the XIDs from the tuple, even when we say
> + * that they should.  If caller opts to go with our recommendation to freeze,
> + * then it must account for the fact that it shouldn't trust how we've set
> + * NewRelfrozenxid/NewRelminmxid.  (In practice aggressive VACUUMs always take
> + * our recommendation because they must, and non-aggressive VACUUMs always opt
> + * to not freeze, preferring to ratchet back NewRelfrozenxid instead).
>
> I don't understand this one.
>
> +* (Actually, we maintain NewRelminmxid differently here, because we
> +* assume that XIDs that should be frozen according to cutoff_xid won't
> +* be, whereas heap_prepare_freeze_tuple makes the opposite assumption.)
>
> This one either.

The difference between the cleanup lock path (in
lazy_scan_prune/heap_prepare_freeze_tuple) and the share lock path (in
lazy_scan_noprune/heap_tuple_needs_freeze) is what is at issue in both
of these confusing comment blocks, really. Note that cutoff_xid is the
name that both heap_prepare_freeze_tuple and heap_tuple_needs_freeze
have for FreezeLimit (maybe we should rename every occurrence of
cutoff_xid in heapam.c to FreezeLimit).

At a high level, we aren't changing the fundamental definition of an
aggressive VACUUM in any of the patches -- we still need to advance
relfrozenxid up to FreezeLimit in an aggressive VACUUM, just like on
HEAD, today (we may be able to advance it *past* FreezeLimit, but
that's just a bonus). But in a non-aggressive VACUUM, where there is
still no strict requirement to advance relfrozenxid (by any amount),
the code added by 0001 can set relfrozenxid to any known safe value,
which could either be from before FreezeLimit, or after FreezeLimit --
almost anything is possible (provided we respect the relfrozenxid
invariant, and provided we see that we didn't skip any
all-visible-not-all-frozen pages).

Since we still need to "respect FreezeLimit" in an aggressive VACUUM,
the aggressive case might need to wait for a full cleanup lock the
hard way, having tried and failed to do it the easy way within
lazy_scan_noprune (lazy_scan_noprune will still return false when any
call to heap_tuple_needs_freeze for any tuple returns true) -- same
as on HEAD, today.
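
To make that control flow concrete, here is a simplified sketch -- not
the actual vacuumlazy.c code; the function name and parameters are
invented for illustration:

#include <stdbool.h>

/*
 * Simplified sketch of the no-cleanup-lock path described above.  In an
 * aggressive VACUUM, finding even one tuple that still needs freezing
 * makes this path give up, so the caller must wait for a cleanup lock
 * and do the freezing via lazy_scan_prune instead.
 */
static bool
scan_page_noprune(bool aggressive, int ntuples,
                  bool (*tuple_needs_freeze)(int offnum))
{
    for (int offnum = 0; offnum < ntuples; offnum++)
    {
        if (tuple_needs_freeze(offnum) && aggressive)
            return false;       /* caller must get a cleanup lock instead */

        /* otherwise: just ratchet back NewRelfrozenxid and carry on */
    }

    return true;                /* page handled without a cleanup lock */
}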

And so the difference at issue here is: FreezeLimit/cutoff_xid only
needs to affect the new NewRelfrozenxid value we use for relfrozenxid in
heap_prepare_freeze_tuple, which is involved in real freezing -- not
in heap_tuple_needs_freeze, whose main purpose is still to help us
avoid freezing where a cleanup lock isn't immediately available. The
purpose of FreezeLimit/cutoff_xid within heap_tuple_needs_freeze, by
contrast, is to determine its bool return value, which will only be of
interest to the aggressive case (which might have to get a cleanup lock
and do it the hard way), not the non-aggressive case (where ratcheting
back NewRelfrozenxid is generally possible, and generally leaves us with
almost as good a value).

In other words: the calls to heap_tuple_needs_freeze made from
lazy_scan_noprune are simply concerned with the page as it actually
is, whereas the similar/corresponding calls to
heap_prepare_freeze_tuple from 

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-18 Thread Andres Freund
Hi,

On 2022-02-18 15:54:19 -0500, Robert Haas wrote:
> > Are there any objections to this plan?
> 
> I really like the idea of reducing the scope of what is being changed
> here, and I agree that eagerly advancing relfrozenxid carries much
> less risk than the other changes.

Sounds good to me too!

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-18 Thread Andres Freund
Hi,

On 2022-02-18 13:09:45 -0800, Peter Geoghegan wrote:
> 0001 is tricky in the sense that there are a lot of fine details, and
> if you get any one of them wrong the result might be a subtle bug. For
> example, the heap_tuple_needs_freeze() code path is only used when we
> cannot get a cleanup lock, which is rare -- and some of the branches
> within the function are relatively rare themselves. The obvious
> concern is: What if some detail of how we track the new relfrozenxid
> value (and new relminmxid value) in this seldom-hit codepath is just
> wrong, in whatever way we didn't think of?

I think it'd be good to add a few isolationtest cases for the
can't-get-cleanup-lock paths. I think it shouldn't be hard using cursors. The
slightly harder part is verifying that VACUUM did something reasonable, but
that still should be doable?

Greetings,

Andres Freund




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-18 Thread Robert Haas
On Fri, Feb 18, 2022 at 4:10 PM Peter Geoghegan  wrote:
> It does not change any other behavior. It's totally mechanical.
>
> 0001 is tricky in the sense that there are a lot of fine details, and
> if you get any one of them wrong the result might be a subtle bug. For
> example, the heap_tuple_needs_freeze() code path is only used when we
> cannot get a cleanup lock, which is rare -- and some of the branches
> within the function are relatively rare themselves. The obvious
> concern is: What if some detail of how we track the new relfrozenxid
> value (and new relminmxid value) in this seldom-hit codepath is just
> wrong, in whatever way we didn't think of?

Right. I think we have no choice but to accept such risks if we want
to make any progress here, and every patch carries them to some
degree. I hope that someone else will review this patch in more depth
than I have just now, but what I notice reading through it is that
some of the comments seem pretty opaque. For instance:

+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation.  Assumption is that

"maintains" is fuzzy. I think you should be saying something much more
explicit, and the thing you are saying should make it clear that these
arguments are input-output arguments: i.e. the caller must set them
correctly before calling this function, and they will be updated by
the function. I don't think you have to spell all of that out in every
place where this comes up in the patch, but it needs to be clear from
what you do say. For example, I would be happier with a comment that
said something like "Every call to this function will either set
HEAP_XMIN_FROZEN in the xl_heap_freeze_tuple struct passed as an
argument, or else reduce *NewRelfrozenxid to the xmin of the tuple if
it is currently newer than that. Thus, after a series of calls to this
function, *NewRelfrozenxid represents a lower bound on unfrozen xmin
values in the tuples examined. Before calling this function, caller
should initialize *NewRelfrozenxid to ."
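
Concretely, the pattern I have in mind is something like this minimal
sketch (not the patch itself; the function name is invented, and the
real code would use TransactionIdPrecedes() rather than a raw compare):

typedef unsigned int TransactionId;     /* stand-in for the real typedef */

/*
 * Caller initializes *NewRelfrozenxid to a known-safe upper bound
 * (e.g. OldestXmin); each examined XID can only pull it backwards, so
 * that afterwards it is a lower bound on the unfrozen XIDs seen.
 */
static void
ratchet_back(TransactionId tuple_xid, TransactionId *NewRelfrozenxid)
{
    if (tuple_xid < *NewRelfrozenxid)
        *NewRelfrozenxid = tuple_xid;
}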

+* Changing nothing, so might have to ratchet back NewRelminmxid,
+* NewRelfrozenxid, or both together

This comment I like.

+* New multixact might have remaining XID older than
+* NewRelfrozenxid

This one's good, too.

+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation.  Assumption is that
+ * caller will never freeze any of the XIDs from the tuple, even when we say
+ * that they should.  If caller opts to go with our recommendation to freeze,
+ * then it must account for the fact that it shouldn't trust how we've set
+ * NewRelfrozenxid/NewRelminmxid.  (In practice aggressive VACUUMs always take
+ * our recommendation because they must, and non-aggressive VACUUMs always opt
+ * to not freeze, preferring to ratchet back NewRelfrozenxid instead).

I don't understand this one.

+* (Actually, we maintain NewRelminmxid differently here, because we
+* assume that XIDs that should be frozen according to cutoff_xid won't
+* be, whereas heap_prepare_freeze_tuple makes the opposite assumption.)

This one either.

I haven't really grokked exactly what is happening in
heap_tuple_needs_freeze yet, and may not have time to study it further
in the near future. Not saying it's wrong, although improving the
comments above would likely help me out.

> > If there's a way you can make the precise contents of 0002 and 0003
> > more clear, I would like that, too.
>
> The really big one is 0002 -- even 0003 (the FSM PageIsAllVisible()
> thing) wasn't on the table before now. 0002 is the patch that changes
> the basic criteria for freezing, making it block-based rather than
> based on the FreezeLimit cutoff (barring edge cases that are important
> for correctness, but shouldn't noticeably affect freezing overhead).
>
> The single biggest practical improvement from 0002 is that it
> eliminates what I've called the freeze cliff, which is where many old
> tuples (much older than FreezeLimit/vacuum_freeze_min_age) must be
> frozen all at once, in a balloon payment during an eventual aggressive
> VACUUM. Although it's easy to see that that could be useful, it is
> harder to justify (much harder) than anything else. Because we're
> freezing more eagerly overall, we're also bound to do more freezing
> without benefit in certain cases. Although I think that this can be
> justified as the cost of doing business, that's a hard argument to
> make.

You've used the term "freezing cliff" repeatedly in earlier emails,
and this is the first time I've been able to understand what you
meant. I'm glad I do, now.

But can you describe the algorithm that 0002 uses to accomplish this
improvement? Like "if it sees that the page meets criteria X, then it
freezes all tuples on the page, else if it sees 

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-18 Thread Peter Geoghegan
On Fri, Feb 18, 2022 at 12:54 PM Robert Haas  wrote:
> I'd like to have a clearer idea of exactly what is in each of the
> remaining patches before forming a final opinion.

Great.

> What's tricky about 0001? Does it change any other behavior, either as
> a necessary component of advancing relfrozenxid more eagerly, or
> otherwise?

It does not change any other behavior. It's totally mechanical.

0001 is tricky in the sense that there are a lot of fine details, and
if you get any one of them wrong the result might be a subtle bug. For
example, the heap_tuple_needs_freeze() code path is only used when we
cannot get a cleanup lock, which is rare -- and some of the branches
within the function are relatively rare themselves. The obvious
concern is: What if some detail of how we track the new relfrozenxid
value (and new relminmxid value) in this seldom-hit codepath is just
wrong, in whatever way we didn't think of?

On the other hand, we must already be precise in almost the same way
within heap_tuple_needs_freeze() today -- it's not all that different
(we currently need to avoid leaving any XIDs < FreezeLimit behind,
which isn't made much less complicated by the fact that it's a static
XID cutoff). Plus, we have experience with bugs like this. There was
hardening added to catch stuff like this back in 2017, following the
"freeze the dead" bug.

> If there's a way you can make the precise contents of 0002 and 0003
> more clear, I would like that, too.

The really big one is 0002 -- even 0003 (the FSM PageIsAllVisible()
thing) wasn't on the table before now. 0002 is the patch that changes
the basic criteria for freezing, making it block-based rather than
based on the FreezeLimit cutoff (barring edge cases that are important
for correctness, but shouldn't noticeably affect freezing overhead).
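
Roughly speaking, the per-page trigger amounts to the following toy
sketch (not the patch itself; the function name and parameters are
invented for illustration):

#include <stdbool.h>

/*
 * Freeze everything on the page iff the page would otherwise become
 * all-visible but not all-frozen; FreezeLimit remains only as a
 * backstop for especially old XIDs.
 */
static bool
should_freeze_whole_page(bool all_visible_after_prune,
                         bool all_frozen_after_prune,
                         bool has_xid_before_freeze_limit)
{
    if (has_xid_before_freeze_limit)
        return true;            /* FreezeLimit backstop */

    return all_visible_after_prune && !all_frozen_after_prune;
}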

The single biggest practical improvement from 0002 is that it
eliminates what I've called the freeze cliff, which is where many old
tuples (much older than FreezeLimit/vacuum_freeze_min_age) must be
frozen all at once, in a balloon payment during an eventual aggressive
VACUUM. Although it's easy to see that that could be useful, it is
harder to justify (much harder) than anything else. Because we're
freezing more eagerly overall, we're also bound to do more freezing
without benefit in certain cases. Although I think that this can be
justified as the cost of doing business, that's a hard argument to
make.

In short, 0001 is mechanically tricky, but easy to understand at a
high level. Whereas 0002 is mechanically simple, but tricky to
understand at a high level (and therefore far trickier than 0001
overall).

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-18 Thread Robert Haas
On Fri, Feb 18, 2022 at 3:41 PM Peter Geoghegan  wrote:
> Concerns about my general approach to this project (and even the
> Postgres 14 VACUUM work) were expressed by Robert and Andres over on
> the "Nonrandom scanned_pages distorts pg_class.reltuples set by
> VACUUM" thread. Some of what was said honestly shocked me. It now
> seems unwise to pursue this project on my original timeline. I even
> thought about shelving it indefinitely (which is still on the table).
>
> I propose the following compromise: the least contentious patch alone
> will be in scope for Postgres 15, while the other patches will not be.
> I'm referring to the first patch from v8, which adds dynamic tracking
> of the oldest extant XID in each heap table, in order to be able to
> use it as our new relfrozenxid. I can't imagine that I'll have
> difficulty convincing Andres of the merits of this idea, for one,
> since it was his idea in the first place. It makes a lot of sense,
> independent of any change to how and when we freeze.
>
> The first patch is tricky, but at least it won't require elaborate
> performance validation. It doesn't change any of the basic performance
> characteristics of VACUUM. It sometimes allows us to advance
> relfrozenxid to a value beyond FreezeLimit (typically only possible in
> an aggressive VACUUM), which is an intrinsic good. If it isn't
> effective then the overhead seems very unlikely to be noticeable. It's
> pretty much a strictly additive improvement.
>
> Are there any objections to this plan?

I really like the idea of reducing the scope of what is being changed
here, and I agree that eagerly advancing relfrozenxid carries much
less risk than the other changes.

I'd like to have a clearer idea of exactly what is in each of the
remaining patches before forming a final opinion.

What's tricky about 0001? Does it change any other behavior, either as
a necessary component of advancing relfrozenxid more eagerly, or
otherwise?

If there's a way you can make the precise contents of 0002 and 0003
more clear, I would like that, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-18 Thread Peter Geoghegan
On Fri, Feb 11, 2022 at 8:30 PM Peter Geoghegan  wrote:
> Attached is v8. No real changes -- just a rebased version.

Concerns about my general approach to this project (and even the
Postgres 14 VACUUM work) were expressed by Robert and Andres over on
the "Nonrandom scanned_pages distorts pg_class.reltuples set by
VACUUM" thread. Some of what was said honestly shocked me. It now
seems unwise to pursue this project on my original timeline. I even
thought about shelving it indefinitely (which is still on the table).

I propose the following compromise: the least contentious patch alone
will be in scope for Postgres 15, while the other patches will not be.
I'm referring to the first patch from v8, which adds dynamic tracking
of the oldest extant XID in each heap table, in order to be able to
use it as our new relfrozenxid. I can't imagine that I'll have
difficulty convincing Andres of the merits of this idea, for one,
since it was his idea in the first place. It makes a lot of sense,
independent of any change to how and when we freeze.

The first patch is tricky, but at least it won't require elaborate
performance validation. It doesn't change any of the basic performance
characteristics of VACUUM. It sometimes allows us to advance
relfrozenxid to a value beyond FreezeLimit (typically only possible in
an aggressive VACUUM), which is an intrinsic good. If it isn't
effective then the overhead seems very unlikely to be noticeable. It's
pretty much a strictly additive improvement.

Are there any objections to this plan?

--
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-11 Thread Peter Geoghegan
On Sat, Jan 29, 2022 at 8:42 PM Peter Geoghegan  wrote:
> Attached is v7, a revision that overhauls the algorithm that decides
> what to freeze. I'm now calling it block-driven freezing in the commit
> message. Also included is a new patch, that makes VACUUM record zero
> free space in the FSM for an all-visible page, unless the total amount
> of free space happens to be greater than one half of BLCKSZ.

I pushed the earlier refactoring and instrumentation patches today.

Attached is v8. No real changes -- just a rebased version.

It will be easier to benchmark and test the page-driven freezing stuff
now, since the master/baseline case will now output instrumentation
showing how relfrozenxid has been advanced (if at all) -- whether (and
to what extent) each VACUUM operation advances relfrozenxid can now be
directly compared, just by monitoring the log_autovacuum_min_duration
output for a given table over time.

--
Peter Geoghegan
From 41136d2a8af434a095ce3e6dfdfbe4b48b9ec338 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sun, 23 Jan 2022 21:10:38 -0800
Subject: [PATCH v8 3/3] Add all-visible FSM heuristic.

When recording free space for an all-visible page, record that the page has
zero free space when it has less than half BLCKSZ worth of free space,
according to the traditional definition.  Otherwise record free space as
usual.

Making all-visible pages resistant to change like this can be thought of
as a form of hysteresis.  The page is given an opportunity to "settle"
and permanently stay in the same state when the tuples on the page will
never be updated or deleted.  But when they are updated or deleted, the
page can once again be used to store any tuple.  Over time, most pages
tend to settle permanently in many workloads, perhaps only on the second
or third attempt.
---
 src/backend/access/heap/vacuumlazy.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index ea4b75189..95049ed25 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1231,6 +1231,13 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
  */
 freespace = PageGetHeapFreeSpace(page);
 
+/*
+ * An all-visible page should not have its free space
+ * available from FSM unless it's more than half empty
+ */
+if (PageIsAllVisible(page) && freespace < BLCKSZ / 2)
+	freespace = 0;
+
 UnlockReleaseBuffer(buf);
 RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
 continue;
@@ -1368,6 +1375,13 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
 		{
 			Size		freespace = PageGetHeapFreeSpace(page);
 
+			/*
+			 * An all-visible page should not have its free space available
+			 * from FSM unless it's more than half empty
+			 */
+			if (PageIsAllVisible(page) && freespace < BLCKSZ / 2)
+freespace = 0;
+
 			UnlockReleaseBuffer(buf);
 			RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
 		}
@@ -2537,6 +2551,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 		page = BufferGetPage(buf);
 		freespace = PageGetHeapFreeSpace(page);
 
+		/*
+		 * An all-visible page should not have its free space available from
+		 * FSM unless it's more than half empty
+		 */
+		if (PageIsAllVisible(page) && freespace < BLCKSZ / 2)
+			freespace = 0;
+
 		UnlockReleaseBuffer(buf);
 		RecordPageWithFreeSpace(vacrel->rel, tblk, freespace);
 		vacuumed_pages++;
-- 
2.30.2

From 4838bd1f11b748d2082caedfe4da506b8fe3f67a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Mon, 13 Dec 2021 15:00:49 -0800
Subject: [PATCH v8 2/3] Make block-level characteristics drive freezing.

Teach VACUUM to freeze all of the tuples on a page whenever it notices
that it would otherwise mark the page all-visible, without also marking
it all-frozen.  VACUUM won't freeze _any_ tuples on the page unless
_all_ tuples (that remain after pruning) are all-visible.  It may
occasionally be necessary to freeze the page due to the presence of a
particularly old XID, from before VACUUM's FreezeLimit cutoff.  But the
FreezeLimit mechanism will seldom have any impact on which pages are
frozen anymore -- it is just a backstop now.

Freezing can now informally be thought of as something that takes place
at the level of an entire page, or not at all -- differences in XIDs
among tuples on the same page are not interesting, barring extreme
cases.  Freezing a page is now practically synonymous with setting the
page to all-visible in the visibility map, at least to users.

The main upside of the new approach to freezing is that it makes the
overhead of vacuuming much more predictable over time.  We avoid the
need for large balloon payments, since the system no longer accumulates
"freezing debt" that can only be paid off by anti-wraparound vacuuming.
This seems to have been particularly troublesome with append-only
tables, especially in the common case where XIDs from pages that are
marked all-visible for the first 

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-07 Thread Peter Geoghegan
On Mon, Feb 7, 2022 at 12:21 PM Robert Haas  wrote:
>
> On Mon, Feb 7, 2022 at 11:43 AM Peter Geoghegan  wrote:
> > > That's because, if VACUUM is only ever getting triggered by XID
> > > age advancement and not by bloat, there's no opportunity for your
> > > patch set to advance relfrozenxid any sooner than we're doing now.
> >
> > We must distinguish between:
> >
> > 1. "VACUUM is fundamentally never going to need to run unless it is
> > forced to, just to advance relfrozenxid" -- this applies to tables
> > like the stock and customers tables from the benchmark.
> >
> > and:
> >
> > 2. "VACUUM must sometimes run to mark newly appended heap pages
> > all-visible, and maybe to also remove dead tuples, but not that often
> > -- and yet we currently only get expensive and inconveniently timed
> > anti-wraparound VACUUMs, no matter what" -- this applies to all the
> > other big tables in the benchmark, in particular to the orders and
> > order lines tables, but also to simpler cases like pgbench_history.
>
> It's not really very understandable for me when you refer to the way
> table X behaves in Y benchmark, because I haven't studied that in
> enough detail to know. If you say things like insert-only table, or a
> continuous-random-updates table, or whatever the case is, it's a lot
> easier to wrap my head around it.

What I've called category 2 tables are the vast majority of big tables
in practice. They include pure append-only tables, but also tables
that grow steadily from inserts while also receiving some updates. The
point of the TPC-C order + order lines examples was to show how broad
the category really is, and how mixtures of inserts and bloat from
updates on a single table confuse the implementation in general.

> > Does that make sense? It's pretty subtle, admittedly, and you no doubt
> > have (very reasonable) concerns about the extremes, even if you accept
> > all that. I just want to get the general idea across here, as a
> > starting point for further discussion.
>
> Not really. I think you *might* be saying tables which currently get
> only wraparound vacuums will end up getting other kinds of vacuums
> with your patch because things will improve enough for other tables in
> the system that they will be able to get more attention than they do
> currently.

Yes, I am.

> But I'm not sure I am understanding you correctly, and even
> if I am I don't understand why that would be so, and even if it is I
> think it doesn't help if essentially all the tables in the system are
> suffering from the problem.

When I say "relfrozenxid advancement has been qualitatively improved
by the patch", what I mean is that we are much closer to a rate of
relfrozenxid advancement that is far closer to the theoretically
optimal rate for our current design, with freezing and with 32-bit
XIDs, and with the invariants for freezing.

Consider the extreme case, and generalize. In the simple append-only
table case, it is most obvious. The final relfrozenxid is very close
to OldestXmin (only tiny noise level differences appear), regardless
of XID consumption by the system in general, and even within the
append-only table in particular. Other cases are somewhat trickier,
but have roughly the same quality, to a surprising degree. Lots of
things that never really should have affected relfrozenxid to begin
with do not, for the first time.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-07 Thread Robert Haas
On Mon, Feb 7, 2022 at 11:43 AM Peter Geoghegan  wrote:
> > That's because, if VACUUM is only ever getting triggered by XID
> > age advancement and not by bloat, there's no opportunity for your
> > patch set to advance relfrozenxid any sooner than we're doing now.
>
> We must distinguish between:
>
> 1. "VACUUM is fundamentally never going to need to run unless it is
> forced to, just to advance relfrozenxid" -- this applies to tables
> like the stock and customers tables from the benchmark.
>
> and:
>
> 2. "VACUUM must sometimes run to mark newly appended heap pages
> all-visible, and maybe to also remove dead tuples, but not that often
> -- and yet we currently only get expensive and inconveniently timed
> anti-wraparound VACUUMs, no matter what" -- this applies to all the
> other big tables in the benchmark, in particular to the orders and
> order lines tables, but also to simpler cases like pgbench_history.

It's not really very understandable for me when you refer to the way
table X behaves in Y benchmark, because I haven't studied that in
enough detail to know. If you say things like insert-only table, or a
continuous-random-updates table, or whatever the case is, it's a lot
easier to wrap my head around it.

> Does that make sense? It's pretty subtle, admittedly, and you no doubt
> have (very reasonable) concerns about the extremes, even if you accept
> all that. I just want to get the general idea across here, as a
> starting point for further discussion.

Not really. I think you *might* be saying tables which currently get
only wraparound vacuums will end up getting other kinds of vacuums
with your patch because things will improve enough for other tables in
the system that they will be able to get more attention than they do
currently. But I'm not sure I am understanding you correctly, and even
if I am I don't understand why that would be so, and even if it is I
think it doesn't help if essentially all the tables in the system are
suffering from the problem.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-07 Thread Peter Geoghegan
On Mon, Feb 7, 2022 at 10:08 AM Robert Haas  wrote:
> But ... if I'm not mistaken, in the kind of case that Greg is
> describing, relfrozenxid will be advanced exactly as often as it is
> today.

But what happens today in a scenario like Greg's is pathological,
despite being fairly common (at least in large DBs). It doesn't seem
informative to extrapolate too much from current experience for that
reason.

> That's because, if VACUUM is only ever getting triggered by XID
> age advancement and not by bloat, there's no opportunity for your
> patch set to advance relfrozenxid any sooner than we're doing now.

We must distinguish between:

1. "VACUUM is fundamentally never going to need to run unless it is
forced to, just to advance relfrozenxid" -- this applies to tables
like the stock and customers tables from the benchmark.

and:

2. "VACUUM must sometimes run to mark newly appended heap pages
all-visible, and maybe to also remove dead tuples, but not that often
-- and yet we currently only get expensive and inconveniently timed
anti-wraparound VACUUMs, no matter what" -- this applies to all the
other big tables in the benchmark, in particular to the orders and
order lines tables, but also to simpler cases like pgbench_history.

As I've said a few times now, the patch doesn't change anything for
category 1. But Greg's problem tables very much sound like they're from
category 2. And what we see with the master branch for such tables is
that they always get anti-wraparound VACUUMs past a certain size
(depending on things like the exact XID rate, the VACUUM settings, and
the insert-driven autovacuum scheduling). The patch, by contrast, never
reaches that point in practice during my testing -- it doesn't even come
close.

It is true that in theory, as the size of one of these "category 2"
tables tends to infinity, the patch ends up behaving the same as
master anyway. But I'm pretty sure that that usually doesn't matter at
all, or matters less than you'd think. As I emphasized when presenting
the recent v7 TPC-C benchmark, neither of the two "TPC-C big problem
tables" (which are particularly interesting/tricky examples of tables
from category 2) comes close to getting an anti-wraparound VACUUM
(plus, as I said in the same email, it wouldn't matter if they did).

> So I think that people in this kind of situation will potentially be
> helped or hurt by other things the patch set does, but the eager
> relfrozenxid stuff won't make any difference for them.

To be clear, I think it would if everything was in place, including
the basic relfrozenxid advancement thing, plus the new freezing stuff
(though you wouldn't need the experimental FSM thing to get this
benefit).

Here is a thought experiment that may make the general idea a bit clearer:

Imagine I reran the same benchmark as before, with the same settings,
and the expectation that everything would be the same as the first time
around for the patch series. But to make things more interesting, this
time I add an adversarial element: I add an adversarial gizmo that
burns XIDs steadily, without doing any useful work. This gizmo doubles
the rate of XID consumption for the database as a whole, perhaps by
calling "SELECT txid_current()" in a loop, followed by a timed sleep
(with a delay chosen with the goal of doubling XID consumption). I
imagine that this would also burn CPU cycles, but probably not enough
to make more than a noise level impact -- so we're severely stressing
the implementation by adding this gizmo, but the stress is precisely
targeted at XID consumption and related implementation details. It's a
pretty clean experiment. What happens now?
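
For concreteness, the gizmo could be as simple as this hypothetical
libpq loop (the connection string and sleep interval are made up, and
would need to be tuned to actually double the XID consumption rate):

#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn *conn = PQconnectdb("dbname=postgres");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    for (;;)
    {
        /* each statement runs in its own transaction, burning one XID */
        PGresult *res = PQexec(conn, "SELECT txid_current()");

        PQclear(res);
        usleep(10 * 1000);      /* timed sleep, sets the XID burn rate */
    }

    PQfinish(conn);             /* not reached */
    return 0;
}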

I believe (though haven't checked for myself) that nothing important
would change. We'd still see the same VACUUM operations occur at
approximately the same times (relative to the start of the benchmark)
that we saw with the original benchmark, and each VACUUM operation
would do approximately the same amount of physical work on each
occasion. Of course, the autovacuum log output would show that the
OldestXmin for each individual VACUUM operation had larger values than
first time around for this newly initdb'd TPC-C database (purely as a
consequence of the XID burning gizmo), but it would *also* show
*concomitant* increases for our newly set relfrozenxid. The system
should therefore hardly behave differently at all compared to the
original benchmark run, despite this adversarial gizmo.

It's fair to wonder: okay, but what if it was 4x, 8x, 16x? What then?
That does get a bit more complicated, and we should get into why that
is. But for now I'll just say that I think that even that kind of
extreme would make much less difference than you might think -- since
relfrozenxid advancement has been qualitatively improved by the patch
series. It is especially likely that nothing would change if you were
willing to increase autovacuum_freeze_max_age to get a bit more
breathing room -- room to allow the autovacuums to run at their
"natural" times. You 

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-07 Thread Robert Haas
On Fri, Feb 4, 2022 at 10:45 PM Peter Geoghegan  wrote:
> > While I've seen all the above cases triggering anti-wraparound cases
> > by far the majority of the cases are not of these pathological forms.
>
> Right - it's practically inevitable that you'll need an
> anti-wraparound VACUUM to advance relfrozenxid right now. Technically
> it's possible to advance relfrozenxid in any VACUUM, but in practice
> it just never happens on a large table. You only need to get unlucky
> with one heap page, either by failing to get a cleanup lock, or (more
> likely) by setting even one single page all-visible but not all-frozen
> just once (once in any VACUUM that takes place between anti-wraparound
> VACUUMs).

But ... if I'm not mistaken, in the kind of case that Greg is
describing, relfrozenxid will be advanced exactly as often as it is
today. That's because, if VACUUM is only ever getting triggered by XID
age advancement and not by bloat, there's no opportunity for your
patch set to advance relfrozenxid any sooner than we're doing now. So
I think that people in this kind of situation will potentially be
helped or hurt by other things the patch set does, but the eager
relfrozenxid stuff won't make any difference for them.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-07 Thread Robert Haas
On Fri, Feb 4, 2022 at 10:21 PM Greg Stark  wrote:
> By far the majority of anti-wraparound vacuums are triggered by tables
> that are very large and so don't trigger regular vacuums for "long
> periods" of time and consistently hit the anti-wraparound threshold
> first.

That's interesting, because my experience is different. Most of the
time when I get asked to look at a system, it turns out that there is
a prepared transaction or a forgotten replication slot and nobody
noticed until the system hit the wraparound threshold. Or occasionally
a long-running transaction or a failing/stuck vacuum that has the same
effect.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread Peter Geoghegan
On Fri, Feb 4, 2022 at 10:44 PM Peter Geoghegan  wrote:
> Right - it's practically inevitable that you'll need an
> anti-wraparound VACUUM to advance relfrozenxid right now. Technically
> it's possible to advance relfrozenxid in any VACUUM, but in practice
> it just never happens on a large table. You only need to get unlucky
> with one heap page, either by failing to get a cleanup lock, or (more
> likely) by setting even one single page all-visible but not all-frozen
> just once (once in any VACUUM that takes place between anti-wraparound
> VACUUMs).

Minor correction: That's a slight exaggeration, since we won't skip
groups of all-visible pages that don't exceed SKIP_PAGES_THRESHOLD
blocks (32 blocks).
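
For reference, the rule being described is roughly the following toy
version (invented function name; the real logic lives in lazy_scan_heap):

#include <stdbool.h>

typedef unsigned int BlockNumber;       /* stand-in for the real typedef */

#define SKIP_PAGES_THRESHOLD    ((BlockNumber) 32)

/*
 * A run of all-visible blocks is only skipped when it is longer than
 * SKIP_PAGES_THRESHOLD blocks.  Shorter runs get scanned anyway, which
 * is why a lone all-visible (but not all-frozen) page doesn't always
 * block relfrozenxid advancement.
 */
static bool
skip_allvisible_run(BlockNumber run_start, BlockNumber next_unskippable)
{
    return (next_unskippable - run_start) > SKIP_PAGES_THRESHOLD;
}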

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread Peter Geoghegan
On Fri, Feb 4, 2022 at 10:21 PM Greg Stark  wrote:
> On Wed, 15 Dec 2021 at 15:30, Peter Geoghegan  wrote:
> > My emphasis here has been on making non-aggressive VACUUMs *always*
> > advance relfrozenxid, outside of certain obvious edge cases. And so
> > with all the patches applied, up to and including the opportunistic
> > freezing patch, every autovacuum of every table manages to advance
> > relfrozenxid during benchmarking -- usually to a fairly recent value.
> > I've focussed on making aggressive VACUUMs (especially anti-wraparound
> > autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
> > user keeps canceling autovacuums, maybe due to automated script that
> > performs DDL). That has taken priority over other goals, for now.
>
> While I've seen all the above cases triggering anti-wraparound cases
> by far the majority of the cases are not of these pathological forms.

Right - it's practically inevitable that you'll need an
anti-wraparound VACUUM to advance relfrozenxid right now. Technically
it's possible to advance relfrozenxid in any VACUUM, but in practice
it just never happens on a large table. You only need to get unlucky
with one heap page, either by failing to get a cleanup lock, or (more
likely) by setting even one single page all-visible but not all-frozen
just once (once in any VACUUM that takes place between anti-wraparound
VACUUMs).

> By far the majority of anti-wraparound vacuums are triggered by tables
> that are very large and so don't trigger regular vacuums for "long
> periods" of time and consistently hit the anti-wraparound threshold
> first.

autovacuum_vacuum_insert_scale_factor can help with this on 13 and 14,
but only if you tune autovacuum_freeze_min_age with that goal in mind.
Which probably doesn't happen very often.

> There's nothing limiting how long "long periods" is and nothing tying
> it to the rate of xid consumption. It's quite common to have some
> *very* large mostly static tables in databases that have other tables
> that are *very* busy.
>
> The worst I've seen is a table that took 36 hours to vacuum in a
> database that consumed about a billion transactions per day... That's
> extreme but these days it's quite common to see tables that get
> anti-wraparound vacuums every week or so despite having < 1% modified
> tuples. And databases are only getting bigger and transaction rates
> faster...

Sounds very much like what I've been calling the freezing cliff. An
anti-wraparound VACUUM throws things off by suddenly dirtying many
more pages than the expected amount for a VACUUM against the table,
despite there being no change in workload characteristics. If you just
had to remove the dead tuples in such a table, then it probably
wouldn't matter if it happened earlier than expected.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread Greg Stark
On Wed, 15 Dec 2021 at 15:30, Peter Geoghegan  wrote:
>
> My emphasis here has been on making non-aggressive VACUUMs *always*
> advance relfrozenxid, outside of certain obvious edge cases. And so
> with all the patches applied, up to and including the opportunistic
> freezing patch, every autovacuum of every table manages to advance
> relfrozenxid during benchmarking -- usually to a fairly recent value.
> I've focussed on making aggressive VACUUMs (especially anti-wraparound
> autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
> user keeps canceling autovacuums, maybe due to automated script that
> performs DDL). That has taken priority over other goals, for now.

While I've seen all of the above cases trigger anti-wraparound vacuums,
by far the majority of cases are not of these pathological forms.

By far the majority of anti-wraparound vacuums are triggered by tables
that are very large and so don't trigger regular vacuums for "long
periods" of time and consistently hit the anti-wraparound threshold
first.
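
To spell out why size alone gets you there: the forcing condition is
driven purely by XID age, along the lines of this simplified standalone
sketch (not the actual autovacuum.c code):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Simplified model: an anti-wraparound (aggressive) autovacuum is forced
 * once relfrozenxid falls more than autovacuum_freeze_max_age XIDs behind
 * the next XID, no matter how few tuples in the table were modified.
 */
static bool
antiwraparound_needed(TransactionId next_xid, TransactionId relfrozenxid,
                      uint32_t autovacuum_freeze_max_age)
{
    return (TransactionId) (next_xid - relfrozenxid) > autovacuum_freeze_max_age;
}

Nothing in that check cares how many tuples changed, so a mostly-static
table that never crosses the dead-tuple scale factor still gets an
aggressive whole-table vacuum on a schedule set entirely by everyone
else's XID consumption.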

There's nothing limiting how long "long periods" is and nothing tying
it to the rate of xid consumption. It's quite common to have some
*very* large mostly static tables in databases that have other tables
that are *very* busy.

The worst I've seen is a table that took 36 hours to vacuum in a
database that consumed about a billion transactions per day... That's
extreme but these days it's quite common to see tables that get
anti-wraparound vacuums every week or so despite having < 1% modified
tuples. And databases are only getting bigger and transaction rates
faster...


-- 
greg




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread Peter Geoghegan
On Fri, Feb 4, 2022 at 4:18 PM Robert Haas  wrote:
> On Fri, Feb 4, 2022 at 3:31 PM Peter Geoghegan  wrote:
> > Application B will already block pruning by VACUUM operations against
> > application A's table, and so effectively blocks recording of the
> > resultant free space in the FSM in your scenario. And so application A
> > and application B should be considered the same application already.
> > That's just how VACUUM works.
>
> Sure ... but that also sucks. If we consider application A and
> application B to be the same application, then we're basing our
> decision about what to do on information that is inaccurate.

I agree that it sucks, but I don't think that it's particularly
relevant to the FSM prototype patch that I included with v7 of the
patch series. A heap page cannot be considered "closed" (either in the
specific sense from the patch, or in any informal sense) when it has
recently dead tuples.
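
For anyone skimming, the prototype's rule amounts to something like the
following standalone sketch (my own simplification, not the patch
itself):

#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192             /* default Postgres block size */

/*
 * Simplified model of the v7 FSM heuristic: once VACUUM leaves a page
 * all-visible with no recently dead tuples, stop advertising its
 * remaining free space (the page is "closed"), unless more than half of
 * the page is still free -- in which case closing it would clearly
 * waste space.
 */
static size_t
freespace_to_record(bool all_visible, bool has_recently_dead,
                    size_t actual_freespace)
{
    if (all_visible && !has_recently_dead && actual_freespace <= BLCKSZ / 2)
        return 0;               /* closed: steer new inserts elsewhere */
    return actual_freespace;    /* open: advertise free space as usual */
}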

At some point we should invent a fallback path for pruning that
migrates recently dead tuples to some other subsidiary structure,
retaining only forwarding information in the heap page. But even that
won't change what I just said about closed pages (it'll just make it
easier to return and fix things up later on).

> I don't see it that way. There are cases where avoiding writes is
> better, and cases where trying to cram everything into the fewest
> possible pages is better. With the right test case you can make either
> strategy look superior.

The cost of reads is effectively much lower than writes with modern
SSDs, in TCO terms. Plus when a FSM strategy like the one from the
patch does badly according to a naive measure such as total table
size, that in itself doesn't mean that we do worse with reads. In
fact, it's quite the opposite.

The benchmark showed that v7 of the patch did very slightly worse on
overall space utilization, but far, far better on reads. In fact, the
benefits for reads were far in excess of any efficiency gains for
writes/WAL.
modern hardware [1]. It follows that keeping logically related data
grouped together is crucial. Far more important than potentially using
very slightly more space.

The story I wanted to tell with the FSM patch was about open and
closed pages being the right long term direction. More generally, we
should emphasize managing page-level costs, and deemphasize managing
tuple-level costs, which are much less meaningful.

> What I think your test case has going for it
> is that it is similar to something that a lot of people, really a ton
> of people, actually do with PostgreSQL. However, it's not going to be
> an accurate model of what everybody does, and therein lies some
> element of danger.

No question -- agreed.

> > Then what could you have confidence in?
>
> Real-world experience. Which is hard to get if we don't ever commit
> any patches, but a good argument for (a) having them tested by
> multiple different hackers who invent test cases independently and (b)
> some configurability where we can reasonably include it, so that if
> anyone does experience problems they have an escape.

I agree.

[1] https://dl.acm.org/doi/10.1145/1022594.1022596
-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread Robert Haas
On Fri, Feb 4, 2022 at 3:31 PM Peter Geoghegan  wrote:
> Application B will already block pruning by VACUUM operations against
> application A's table, and so effectively blocks recording of the
> resultant free space in the FSM in your scenario. And so application A
> and application B should be considered the same application already.
> That's just how VACUUM works.

Sure ... but that also sucks. If we consider application A and
application B to be the same application, then we're basing our
decision about what to do on information that is inaccurate.

> 5% larger seems like a lot more than would be typical, based on what
> I've seen. I don't think that the regression in this scenario can be
> characterized as "infinitely worse", or anything like it. On a long
> enough timeline, the potential upside of something like this is nearly
> unlimited -- it could avoid a huge amount of write amplification. But
> the potential downside seems to be small and fixed -- which is the
> point (bounding the downside). The mere possibility of getting that
> big benefit (avoiding the costs from heap fragmentation) is itself a
> benefit, even when it turns out not to pay off in your particular
> case. It can be seen as insurance.

I don't see it that way. There are cases where avoiding writes is
better, and cases where trying to cram everything into the fewest
possible pages is better. With the right test case you can make either
strategy look superior. What I think your test case has going for it
is that it is similar to something that a lot of people, really a ton
of people, actually do with PostgreSQL. However, it's not going to be
an accurate model of what everybody does, and therein lies some
element of danger.

> Then what could you have confidence in?

Real-world experience. Which is hard to get if we don't ever commit
any patches, but a good argument for (a) having them tested by
multiple different hackers who invent test cases independently and (b)
some configurability where we can reasonably include it, so that if
anyone does experience problems they have an escape.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread Peter Geoghegan
On Fri, Feb 4, 2022 at 2:45 PM Robert Haas  wrote:
> While I agree that there's some case to be made for leaving settled
> pages well enough alone, your criterion for settled seems pretty much
> accidental.

I fully admit that I came up with the FSM heuristic with TPC-C in
mind. But you have to start somewhere.

Fortunately, the main benefit of this patch series (avoiding the
freeze cliff during anti-wraparound VACUUMs, often avoiding
anti-wraparound VACUUMs altogether) don't depend on the experimental
FSM patch at all. I chose to post that now because it seemed to help
with my more general point about qualitatively different pages, and
freezing at the page level.

> Imagine a system where there are two applications running,
> A and B. Application A runs all the time and all the transactions
> which it performs are short. Therefore, when a certain page is not
> modified by application A for a short period of time, the page will
> become all-visible and will be considered settled. Application B runs
> once a month and performs various transactions all of which are long,
> perhaps on a completely separate set of tables. While application B is
> running, pages take longer to settle not only for application B but
> also for application A. It doesn't make sense to say that the
> application is in control of the behavior when, in reality, it may be
> some completely separate application that is controlling the behavior.

Application B will already block pruning by VACUUM operations against
application A's table, and so effectively blocks recording of the
resultant free space in the FSM in your scenario. And so application A
and application B should be considered the same application already.
That's just how VACUUM works.
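
To make that concrete, here is a minimal standalone model of the
removal horizon (wraparound and the details of the real procarray
machinery are ignored; the names are mine):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * The cutoff VACUUM uses to decide whether a dead tuple is removable is,
 * roughly, the oldest xmin of any running transaction.  One long-running
 * transaction -- from any application, touching any table -- holds that
 * cutoff back for every table VACUUM processes in the meantime.
 */
static TransactionId
removal_cutoff(const TransactionId *running_xmins, int n,
               TransactionId next_xid)
{
    TransactionId oldest = next_xid;    /* no running xacts: use next XID */

    for (int i = 0; i < n; i++)
    {
        if (running_xmins[i] < oldest)
            oldest = running_xmins[i];
    }
    return oldest;
}

/* A deleted tuple is removable only if its deleter precedes the cutoff. */
static bool
tuple_removable(TransactionId deleter_xid, TransactionId cutoff)
{
    return deleter_xid < cutoff;
}

In this model application B's month-long transaction shows up as one
very old entry in running_xmins, and from that moment on every tuple in
application A's tables that was deleted after B started is "dead but
not yet removable".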

VACUUM isn't a passive observer of the system -- it's another
participant. It both influences and is influenced by almost everything
else in the system.

> I can see that this could have significant advantages under some
> circumstances. But I think it could easily be far worse under other
> circumstances. I mean, you can have workloads where you do some amount
> of read-write work on a table and then go read only and sequential
> scan it an infinite number of times. An algorithm that causes the
> table to be smaller at the point where we switch to read-only
> operations, even by a modest amount, wins infinitely over anything
> else. But even if you have no change in the access pattern, is it a
> good idea to allow the table to be, say, 5% larger if it means that
> correlated data is colocated? In general, probably yes. If that means
> that the table fails to fit in shared_buffers instead of fitting, no.
> If that means that the table fails to fit in the OS cache instead of
> fitting, definitely no.

5% larger seems like a lot more than would be typical, based on what
I've seen. I don't think that the regression in this scenario can be
characterized as "infinitely worse", or anything like it. On a long
enough timeline, the potential upside of something like this is nearly
unlimited -- it could avoid a huge amount of write amplification. But
the potential downside seems to be small and fixed -- which is the
point (bounding the downside). The mere possibility of getting that
big benefit (avoiding the costs from heap fragmentation) is itself a
benefit, even when it turns out not to pay off in your particular
case. It can be seen as insurance.

> And to me, that kind of effect is why it's hard to gain much
> confidence in regards to stuff like this via laboratory testing. I
> mean, I'm glad you're doing such tests. But in a laboratory test, you
> tend not to have things like a sudden and complete change in the
> workload, or a random other application sometimes sharing the machine,
> or only being on the edge of running out of memory. I think in general
> people tend to avoid such things in benchmarking scenarios, but even
> if you include stuff like this, it's hard to know what to include that
> would be representative of real life, because just about anything
> *could* happen in real life.

Then what could you have confidence in?

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread Robert Haas
On Sat, Jan 29, 2022 at 11:43 PM Peter Geoghegan  wrote:
> When VACUUM sees that all remaining/unpruned tuples on a page are
> all-visible, it isn't just important because of cost control
> considerations. It's deeper than that. It's also treated as a
> tentative signal from the application itself, about the data itself.
> Which is: this page looks "settled" -- it may never be updated again,
> but if there is an update it likely won't change too much about the
> whole page.

While I agree that there's some case to be made for leaving settled
pages well enough alone, your criterion for settled seems pretty much
accidental. Imagine a system where there are two applications running,
A and B. Application A runs all the time and all the transactions
which it performs are short. Therefore, when a certain page is not
modified by application A for a short period of time, the page will
become all-visible and will be considered settled. Application B runs
once a month and performs various transactions all of which are long,
perhaps on a completely separate set of tables. While application B is
running, pages take longer to settle not only for application B but
also for application A. It doesn't make sense to say that the
application is in control of the behavior when, in reality, it may be
some completely separate application that is controlling the behavior.

> The application is in charge, really -- not VACUUM. This is already
> the case, whether we like it or not. VACUUM needs to learn to live in
> that reality, rather than fighting it. When VACUUM considers a page
> settled, and the physical page still has a relatively large amount of
> free space (say 45% of BLCKSZ, a borderline case in the new FSM
> patch), "losing" so much free space certainly is unappealing. We set
> the free space to 0 in the free space map all the same, because we're
> cutting our losses at that point. While the exact threshold I've
> proposed is tentative, the underlying theory seems pretty sound to me.
> The BLCKSZ/2 cutoff (and the way that it extends the general rules for
> whole-page freezing) is intended to catch pages that are qualitatively
> different, as well as quantitatively different. It is a balancing act,
> between not wasting space, and the risk of systemic problems involving
> excessive amounts of non-HOT updates that must move a successor
> version to another page.

I can see that this could have significant advantages under some
circumstances. But I think it could easily be far worse under other
circumstances. I mean, you can have workloads where you do some amount
of read-write work on a table and then go read only and sequential
scan it an infinite number of times. An algorithm that causes the
table to be smaller at the point where we switch to read-only
operations, even by a modest amount, wins infinitely over anything
else. But even if you have no change in the access pattern, is it a
good idea to allow the table to be, say, 5% larger if it means that
correlated data is colocated? In general, probably yes. If that means
that the table fails to fit in shared_buffers instead of fitting, no.
If that means that the table fails to fit in the OS cache instead of
fitting, definitely no.

And to me, that kind of effect is why it's hard to gain much
confidence in regards to stuff like this via laboratory testing. I
mean, I'm glad you're doing such tests. But in a laboratory test, you
tend not to have things like a sudden and complete change in the
workload, or a random other application sometimes sharing the machine,
or only being on the edge of running out of memory. I think in general
people tend to avoid such things in benchmarking scenarios, but even
if you include stuff like this, it's hard to know what to include that
would be representative of real life, because just about anything
*could* happen in real life.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread Peter Geoghegan
On Fri, Feb 4, 2022 at 2:00 PM John Naylor  wrote:
> Without having looked at the latest patches, there was something in
> the back of my mind while following the discussion upthread -- the
> proposed opportunistic freezing made a lot more sense if the
> earlier-proposed open/closed pages concept was already available.

Yeah, sorry about that. The open/closed pages concept is still
something I plan on working on. My prototype (which I never posted to
the list) will be rebased, and I'll try to target Postgres 16.

> > Freezing whole pages
> > 
>
> > It's possible that a higher cutoff (for example a cutoff of 80% of
> > BLCKSZ, not 50%) will actually lead to *worse* space utilization, in
> > addition to the downsides from fragmentation -- it's far from a simple
> > trade-off. (Not that you should believe that 50% is special, it's just
> > a starting point for me.)
>
> How was the space utilization with the 50% cutoff in the TPC-C test?

The picture was mixed. To get the raw numbers, compare
pg-relation-sizes-after-patch-2.out and
pg-relation-sizes-after-master-2.out files from the drive link I
provided (to repeat, get them from
https://drive.google.com/drive/u/1/folders/1A1g0YGLzluaIpv-d_4o4thgmWbVx3LuR)

Highlights: the largest table (the bmsql_order_line table) had a total
size of x1.006 relative to master, meaning that we did slightly worse
there. However, the index on the same table was slightly smaller
instead, probably because reducing heap fragmentation tends to make
the index deletion stuff work a bit better than before.

Certain small tables (bmsql_district and bmsql_warehouse) were
actually significantly smaller (less than half their size on master),
probably just because the patch can reliably remove LP_DEAD items from
heap pages, even when a cleanup lock isn't available.

The bmsql_new_order table was quite a bit larger, but it's not that
large anyway (1250 MB on master at the very end, versus 1433 MB with
the patch). This is a clear trade-off, since we get much less
fragmentation in the same table (as evidenced by the VACUUM output,
where there are fewer pages with any LP_DEAD items per VACUUM with the
patch). The workload for that table is characterized by inserting new
orders together, and deleting the same orders as a group later on. So
we're bound to pay a cost in space utilization to lower the
fragmentation.

> > blks_hit | 174,551,067,731
> > tup_fetched  | 124,797,772,450
>
> > Here is the same pg_stat_database info for master:
>
> > blks_hit | 283,015,966,386
> > tup_fetched  | 237,052,965,901
>
> That's impressive.

Thanks!

It's still possible to get a big improvement like that with something
like TPC-C because there are certain behaviors that are clearly
suboptimal -- once you look at the details of the workload, and
compare an imaginary ideal to the actual behavior of the system. In
particular, there is really only one way that the free space
management can work for the two big tables that will perform
acceptably -- the orders have to be stored in the same place to begin
with, and stay in the same place forever (at least to the extent that
that's possible).

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread John Naylor
On Sat, Jan 29, 2022 at 11:43 PM Peter Geoghegan  wrote:
>
> On Thu, Jan 20, 2022 at 2:00 PM Peter Geoghegan  wrote:
> > I do see some value in that, too. Though it's not going to be a way of
> > turning off the early freezing stuff, which seems unnecessary (though
> > I do still have work to do on getting the overhead for that down).
>
> Attached is v7, a revision that overhauls the algorithm that decides
> what to freeze. I'm now calling it block-driven freezing in the commit
> message. Also included is a new patch, that makes VACUUM record zero
> free space in the FSM for an all-visible page, unless the total amount
> of free space happens to be greater than one half of BLCKSZ.
>
> The fact that I am now including this new FSM patch (v7-0006-*patch)
> may seem like a case of expanding the scope of something that could
> well do without it. But hear me out! It's true that the new FSM patch
> isn't essential. I'm including it now because it seems relevant to the
> approach taken with block-driven freezing -- it may even make my
> general approach easier to understand.

Without having looked at the latest patches, there was something in
the back of my mind while following the discussion upthread -- the
proposed opportunistic freezing made a lot more sense if the
earlier-proposed open/closed pages concept was already available.

> Freezing whole pages
> 

> It's possible that a higher cutoff (for example a cutoff of 80% of
> BLCKSZ, not 50%) will actually lead to *worse* space utilization, in
> addition to the downsides from fragmentation -- it's far from a simple
> trade-off. (Not that you should believe that 50% is special, it's just
> a starting point for me.)

How was the space utilization with the 50% cutoff in the TPC-C test?

> TPC-C raw numbers
> =
>
> The single most important number for the patch might be the decrease
> in both buffer misses and buffer hits, which I believe is caused by
> the patch being able to use index-only scans much more effectively
> (with modifications to BenchmarkSQL to improve the indexing strategy
> [1]). This is quite clear from pg_stat_database state at the end.
>
> Patch:

> blks_hit | 174,551,067,731
> tup_fetched  | 124,797,772,450

> Here is the same pg_stat_database info for master:

> blks_hit | 283,015,966,386
> tup_fetched  | 237,052,965,901

That's impressive.

--
John Naylor
EDB: http://www.enterprisedb.com




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-01-21 Thread Peter Geoghegan
On Fri, Jan 21, 2022 at 12:07 PM Greg Stark  wrote:
> This confuses me. "Transactions per second" is a headline database
> metric that lots of users actually focus on quite heavily -- rather
> too heavily imho.

But transactions per second is for the whole database, not for
individual tables. It's also really a benchmarking thing, where the
size and variety of transactions are fixed. With something like pgbench,
TPS and XID consumption actually are exactly the same thing, but such a
workload is not at all realistic. Even BenchmarkSQL/TPC-C isn't like that, despite the
fact that it is a fairly synthetic workload (it's just not super
synthetic).

> Ok, XID consumption is only a subset of transactions
> that are not read-only but that's a detail that's pretty easy to
> explain and users get pretty quickly.

My point was mostly this: the number of distinct extant unfrozen tuple
headers (and the range of the relevant XIDs) is generally highly
unpredictable today. And the number of tuples we'll have to freeze to
be able to advance relfrozenxid by a good amount is quite variable, in
general.

For example, if we bulk extend a relation as part of an ETL process,
then the number of distinct XIDs could be as low as 1, even though we
can expect a great deal of "freeze debt" that will have to be paid off
at some point (with the current design, in the common case where the
user doesn't account for this effect because they're not already an
expert). There are other common cases that are not quite as extreme as
that, that still have the same effect -- even an expert will find it
hard or impossible to tune autovacuum_freeze_min_age for that.

Another case of interest (that illustrates the general principle) is
something like pgbench_tellers. We'll never have an aggressive VACUUM
of the table with the patch, and we shouldn't ever need to freeze any
tuples. But, owing to workload characteristics, we'll constantly be
able to keep its relfrozenxid very current, because (even if we
introduce skew) each individual row cannot go very long without being
updated, allowing old XIDs to age out that way.

There is also an interesting middle ground, where you get a mixture of
both tendencies due to skew. The tuple that's most likely to be
updated next is the one that was just updated. How are you as a DBA ever
supposed to tune autovacuum_freeze_min_age if tuples happen to be
qualitatively different in this way?

> What I find confuses people much more is the concept of the
> oldestxmin. I think most of the autovacuum problems I've seen come
> from cases where autovacuum is happily kicking off useless vacuums
> because the oldestxmin hasn't actually advanced enough for them to do
> any useful work.

As it happens, the proposed log output won't use the term oldestxmin
anymore -- I think that it makes sense to rename it to "removable
cutoff". Here's an example:

LOG:  automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 317308 remain, 250258 skipped using visibility map (78.87% of total)
tuples: 70 removed, 34105925 remain (6830471 newly frozen), 2528 are dead but not yet removable
removable cutoff: 37574752, which is 230115 xids behind next
new relfrozenxid: 35221275, which is 5219310 xids ahead of previous value
index scan needed: 55540 pages from table (17.50% of total) had 3339809 dead item identifiers removed
index "bmsql_oorder_pkey": pages: 144257 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "bmsql_oorder_idx2": pages: 330083 in total, 0 newly deleted, 0 currently deleted, 0 reusable
I/O timings: read: 7928.207 ms, write: 1386.662 ms
avg read rate: 33.107 MB/s, avg write rate: 26.218 MB/s
buffer usage: 220825 hits, 443331 misses, 351084 dirtied
WAL usage: 576110 records, 364797 full page images, 2046767817 bytes
system usage: CPU: user: 10.62 s, system: 7.56 s, elapsed: 104.61 s

Note also that I deliberately made the "new relfrozenxid" line that
immediately follows it (information that we haven't shown before now)
similar in form to the removable cutoff line, to highlight that they're
now closely related concepts. Now
if you VACUUM a table that is either empty or has only frozen tuples,
VACUUM will set relfrozenxid to oldestxmin/removable cutoff.
Internally, oldestxmin is the "starting point" for our final/target
relfrozenxid for the table. We ratchet it back dynamically, whenever
we see an older-than-current-target XID that cannot be immediately
frozen (e.g., when we can't easily get a cleanup lock on the page).
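
As a standalone sketch (not the actual vacuumlazy.c bookkeeping, and
with wraparound ignored), the ratchet amounts to:

#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct
{
    TransactionId NewRelfrozenXid;      /* current target for relfrozenxid */
} VacuumState;

/* Start optimistic: if everything older gets frozen, use OldestXmin. */
static void
vacuum_begin(VacuumState *vs, TransactionId OldestXmin)
{
    vs->NewRelfrozenXid = OldestXmin;
}

/*
 * Called whenever an unfrozen XID is left behind in the table, for
 * example on a page whose cleanup lock we couldn't get.
 */
static void
vacuum_observe_unfrozen_xid(VacuumState *vs, TransactionId xid)
{
    if (xid < vs->NewRelfrozenXid)
        vs->NewRelfrozenXid = xid;      /* ratchet the target back */
}

Whatever value survives to the end of the scan is what the log output
above reports as the new relfrozenxid.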

-- 
Peter Geoghegan



