Re: Making C function declaration parameter names consistent with corresponding definition names

2022-09-16 Thread Peter Geoghegan
On Fri, Sep 16, 2022 at 4:19 PM Tom Lane  wrote:
> I agree, this has always been a pet peeve of mine as well.  I would
> have guessed there were fewer examples than you found, because I've
> generally fixed any such cases I happened to notice.

If you actually go through them all one by one you'll see that the
vast majority of individual cases involve an inconsistency that
follows some kind of recognizable pattern. For example, a Relation
parameter might be spelled "relation" in one place and "rel" in
another. I find these more common cases much less noticeable --
perhaps that's why there are more than you thought there'd be?

It's possible to configure the clang-tidy tooling to tolerate various
inconsistencies, below some kind of threshold -- it is totally
customizable. But I think that a strict, simple rule is the way to go
here. (Though without creating busy work for committers that don't
want to use clang-tidy all the time.)
-- 
Peter Geoghegan




Re: Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans

2022-09-16 Thread Peter Geoghegan
On Fri, Sep 16, 2022 at 12:30 AM Masahiko Sawada  wrote:
> After a quick benchmark, I've confirmed that the amount of WAL records
> for freezing 1 million tuples reduced to about one-fifth (1.2GB vs
> 250MB). Great.

I think that the really interesting thing about the patch is how this
changes the way we should think about freezing costs. It makes
page-level batching seem very natural.

The minimum possible size of a Heap2/FREEZE_PAGE record is 64 bytes,
once alignment and so on are taken into account (without the patch).
Once we already know that we have to freeze *some* tuples on a given
heap page, it becomes very reasonable to freeze as many as possible,
in batch, just because we know that it'll be much cheaper to do it now
rather than later on. Even if this extra freezing ends up "going to
waste" due to updates against the same tuples that happen a little
later on, the *added* cost of freezing the "extra" tuples will have
been so small that it's unlikely to matter. On the other hand, if it's
not wasted we'll be *much* better off.

It's very hard to predict the future, but predicting the future is
essentially what the current FreezeLimit-based approach to freezing
tries to do. It's actually quite easy to understand the cost of
freezing now versus freezing later, though. At a high level, it makes
sense for VACUUM to focus on freezing costs (including the risk that
comes with *not* freezing in larger tables), and not to worry so much
about making accurate predictions. Making accurate predictions about
freezing/workload characteristics is overrated.

> True. I've not looked at the patch in depth yet but I think we need
> regression tests for this.

What did you have in mind?

I think that the best way to test something like this is with
wal_consistency_checking. That mostly works fine. However, note that
heap_mask() won't always be able to preserve the state of a tuple's
xmax when modified by freezing. We sometimes need "hint bits" to
actually reliably be set in REDO, when replaying the records for
freezing. At other times they really are just hints. We have to
conservatively assume that it's just a hint when masking. Not sure if
I can do much about that.

Note that this optimization is one level below lazy_scan_prune(), and
one level above heap_execute_freeze_tuple(). Neither function really
changes at all. This seems useful because there are rare
pg_upgrade-only paths where xvac fields need to be frozen. That's not
tested either.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-09-15 Thread Peter Geoghegan
On Thu, Sep 15, 2022 at 12:09 AM John Naylor wrote:
> On Wed, Sep 14, 2022 at 11:33 PM Peter Geoghegan  wrote:
> > The final number of TIDs doesn't seem like the most interesting
> > information that VM snapshots could provide us when it comes to
> > building the dead_items TID data structure -- the *distribution* of
> > TIDs across heap pages seems much more interesting. The "shape" can be
> > known ahead of time, at least to some degree. It can help with
> > compression, which will reduce cache misses.
>
> My point here was simply that spilling to disk is an admission of
> failure to utilize memory efficiently and thus shouldn't be a selling
> point of VM snapshots. Other selling points could still be valid.

I was trying to explain the goals of this work in a way that was as
accessible as possible. It's not easy to get the high level ideas
across, let alone all of the details.

It's true that, up until now, I have largely ignored the question of
how VM snapshots will need to spill to disk. There are several reasons
for this, most of which you could probably guess. FWIW it wouldn't be
at all difficult to add *some* reasonable spilling behavior very soon;
the underlying access patterns are highly sequential and predictable,
in the obvious way.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-09-14 Thread Peter Geoghegan
On Wed, Sep 14, 2022 at 3:18 AM John Naylor wrote:
> On Wed, Sep 14, 2022 at 12:53 AM Peter Geoghegan  wrote:
> > This is still only scratching the surface of what is possible with
> > dead_items. The visibility map snapshot concept can enable a far more
> > sophisticated approach to resource management in vacuumlazy.c.

> I don't quite see how it helps "enable" that.

I have already written a simple throwaway patch that uses the current
VM snapshot data structure (which is just a local copy of the VM's
pages) to do a cheap precheck ahead of actually doing a binary search
in dead_items. If a TID's heap page is all-visible or all-frozen
(depending on the type of VACUUM), then we're guaranteed not to visit
it, and so it's guaranteed not to have any dead_items (it could have
LP_DEAD items by the time the index scan happens, but they won't be in
our dead_items array in any case). Since we're working off of an
immutable source, this optimization is simple to implement already.
Very simple.
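
Roughly what I mean, in sketch form -- VMSnapshot, vmsnap_is_all_visible(),
and the dead_items layout shown here are illustrative names/assumptions,
not what the throwaway patch literally does:

#include "postgres.h"
#include "storage/itemptr.h"

typedef struct VMSnapshot VMSnapshot;   /* local copy of the VM's pages */
extern bool vmsnap_is_all_visible(VMSnapshot *vmsnap, BlockNumber blkno);

static int
cmp_itemptr(const void *left, const void *right)
{
    return ItemPointerCompare((ItemPointer) left, (ItemPointer) right);
}

static bool
tid_reaped_with_precheck(ItemPointer itemptr, VMSnapshot *vmsnap,
                         ItemPointerData *dead_items, int num_items)
{
    /*
     * Pages that the snapshot says are all-visible (or all-frozen, in an
     * aggressive VACUUM) were never scanned, so they cannot have
     * contributed any TIDs to dead_items -- skip the bsearch entirely.
     */
    if (vmsnap_is_all_visible(vmsnap, ItemPointerGetBlockNumber(itemptr)))
        return false;

    return bsearch(itemptr, dead_items, num_items,
                   sizeof(ItemPointerData), cmp_itemptr) != NULL;
}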

I haven't even bothered to benchmark this throwaway patch (I literally
wrote it in 5 minutes to show Masahiko what I meant). I can't see why
even that throwaway prototype wouldn't significantly improve
performance, though. After all, the VM snapshot data structure is far
denser than dead_items, and the largest tables often have most heap
pages skipped via the VM.

I'm not really interested in pursuing this simple approach because it
conflicts with Masahiko's work on the data structure, and there are
other good reasons to expect that to help. Plus I'm already very busy
with what I have here.

> It'd be more logical to
> me to say the VM snapshot *requires* you to think harder about
> resource management, since a palloc'd snapshot should surely be
> counted as part of the configured memory cap that admins control.

That's clearly true -- it creates a new problem for resource
management that will need to be solved. But that doesn't mean that it
can't ultimately make resource management better and easier.

Remember, we don't randomly visit some skippable pages for no good
reason in the patch, since the SKIP_PAGES_THRESHOLD stuff is
completely gone. The VM snapshot isn't just a data structure that
vacuumlazy.c uses as it sees fit -- it's actually more like a set of
instructions on which pages to scan, that vacuumlazy.c *must* follow.
There is no way that vacuumlazy.c can accidentally pick up a few extra
dead_items here and there due to concurrent activity that unsets VM
pages. We don't need to leave that to chance -- it is locked in from
the start.

> I do remember your foreshadowing in the radix tree thread a while
> back, and I do think it's an intriguing idea to combine pages-to-scan
> and dead TIDs in the same data structure. The devil is in the details,
> of course. It's worth looking into.

Of course.

> Looking at the count of index scans, it's pretty much always
> "1", so even if the current approach could scale above 1GB, "no" it
> wouldn't help to raise that limit.

I agree that multiple index scans are rare. But I also think that
they're disproportionately involved in really problematic cases for
VACUUM. That said, I agree that simply making lookups to dead_items as
fast as possible is the single most important way to improve VACUUM by
improving dead_items.

> Furthermore, it doesn't have to anticipate the maximum size, so there
> is no up front calculation assuming max-tuples-per-page, so it
> automatically uses less memory for less demanding tables.

The final number of TIDs doesn't seem like the most interesting
information that VM snapshots could provide us when it comes to
building the dead_items TID data structure -- the *distribution* of
TIDs across heap pages seems much more interesting. The "shape" can be
known ahead of time, at least to some degree. It can help with
compression, which will reduce cache misses.

Andres made remarks about memory usage with sparse dead TID patterns
at this point on the "Improve dead tuple storage for lazy vacuum"
thread:

https://postgr.es/m/20210710025543.37sizjvgybemk...@alap3.anarazel.de

I haven't studied the radix tree stuff in great detail, so I am
uncertain of how much the VM snapshot concept could help, and where
exactly it would help. I'm just saying that it seems promising,
especially as a way of addressing concerns like this.

--
Peter Geoghegan




Re: wrong shell trap

2022-09-13 Thread Peter Geoghegan
On Tue, Sep 13, 2022 at 2:01 PM Tom Lane  wrote:
> > AFAICT almost all of our shell scripts contain the same mistake.  I
> > propose to fix them all as in the attached demo patch, which makes
> > headerscheck exit properly (no silly noise) when interrupted.
>
> Sounds like a good idea.

Might not be a bad idea to run shellcheck against the scripts, to see
if that highlights anything.

I've found that shellcheck makes working with shell scripts less
terrible, especially when portability is a concern. It can be used to
enforce consistent coding standards that seem pretty well thought out.
It will sometimes produce dubious warnings, of course, but it mostly
has the right idea.

-- 
Peter Geoghegan




Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans

2022-09-12 Thread Peter Geoghegan
My ongoing project to make VACUUM more predictable over time by
proactive freezing [1] will increase the overall number of tuples
frozen by VACUUM significantly (at least in larger tables). It's
important that we avoid any new user-visible impact from extra
freezing, though. I recently spent a lot of time on adding high-level
techniques that aim to avoid extra freezing (e.g. by being lazy about
freezing) when that makes sense. Low level techniques aimed at making
the mechanical process of freezing cheaper might also help. (In any
case it's well worth optimizing.)

I'd like to talk about one such technique on this thread. The attached
WIP patch reduces the size of xl_heap_freeze_page records by applying
a simple deduplication process. This can be treated as independent
work (I think it can, at least). The patch doesn't change anything
about the conceptual model used by VACUUM's lazy_scan_prune function
to build xl_heap_freeze_page records for a page, and yet still manages
to make the WAL records for freezing over 5x smaller in many
important cases. They'll be ~4x-5x smaller with *most* workloads,
even.

Each individual tuple entry (each xl_heap_freeze_tuple) adds a full 12
bytes to the WAL record right now -- no matter what. So the existing
approach is rather space inefficient by any standard (perhaps because
it was developed under time pressure while fixing bugs in Postgres
9.3). More importantly, there is a lot of natural redundancy among
the xl_heap_freeze_tuple entries -- each tuple's xl_heap_freeze_tuple
details tend to match. We can usually get away with storing each
unique combination of values from xl_heap_freeze_tuple once per
xl_heap_freeze_page record, while storing associated page offset
numbers in a separate area, grouped by their canonical freeze plan
(which is a normalized version of the information currently stored in
xl_heap_freeze_tuple).
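
As a rough illustration of that layout (the field names here are my
approximations, not necessarily what the attached WIP patch uses):

#include "postgres.h"

/*
 * Sketch of a deduplicated freeze record: each distinct "freeze plan"
 * is stored once per xl_heap_freeze_page record, and the page offset
 * numbers of all tuples sharing that plan follow in a separate array,
 * grouped by plan.
 */
typedef struct xl_heap_freeze_plan
{
    TransactionId   xmax;           /* normalized xmax to set, if any */
    uint16          t_infomask2;
    uint16          t_infomask;
    uint8           frzflags;
    uint16          ntuples;        /* offsets that share this plan */
} xl_heap_freeze_plan;

/*
 * Record body: nplans xl_heap_freeze_plan entries, then one OffsetNumber
 * array grouped by plan.  Each additional tuple that shares an existing
 * plan costs ~2 bytes (one OffsetNumber) instead of a flat 12 bytes.
 */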

In practice most individual tuples that undergo any kind of freezing
only need to have their xmin field frozen. And when xmax is affected
at all, it'll usually just get set to InvalidTransactionId. And so the
actual low-level processing steps for xmax have a high chance of being
shared by other tuples on the page, even in ostensibly tricky cases.
While there are quite a few paths that lead to VACUUM setting a
tuple's xmax to InvalidTransactionId, they all end up with the same
instructional state in the final xl_heap_freeze_tuple entry.

Note that there is a small chance that the patch will be less space
efficient by up to 2 bytes per tuple frozen per page in cases where
we're allocating new Multis during VACUUM. I think that this should be
acceptable on its own -- even in rare bad cases we'll usually still
come out ahead -- what are the chances that we won't make up the
difference on the same page? Or at least within the same VACUUM? And
that's before we talk about a future world in which freezing will
batch tuples together at the page level (you don't have to bring the
other VACUUM work into this discussion, I think, but it's not
*completely* unrelated either).

[1] 
https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3j0aad6fxk93u-xq...@mail.gmail.com
-- 
Peter Geoghegan


v1-0001-Shrink-freeze-WAL-records-via-deduplication.patch
Description: Binary data


Re: pg15b4: FailedAssertion("TransactionIdIsValid(xmax)

2022-09-11 Thread Peter Geoghegan
On Sun, Sep 11, 2022 at 6:42 PM Justin Pryzby  wrote:
> I think you're saying is that this can be explained by the
> io_concurrency bug in recovery_prefetch, if run under 15b3.
>
> But yesterday I started from initdb and restored this cluster from backup, and
> started up sqlsmith, and sent some kill -9, and now got more corruption.
> Looks like it took ~10 induced crashes before this happened.

Have you tested fsync on the system?

The symptoms here are all over the place. This assertion failure seems
like a pretty good sign that the problems happen during recovery, or
because basic guarantees needed for crash safety aren't met:

> #2  0x00962c5c in ExceptionalCondition (conditionName=conditionName@entry=0x9ce238 "P_ISLEAF(opaque) && !P_ISDELETED(opaque)", errorType=errorType@entry=0x9bad97 "FailedAssertion", fileName=fileName@entry=0x9cdcd1 "nbtpage.c", lineNumber=lineNumber@entry=1778) at assert.c:69
> #3  0x00507e34 in _bt_rightsib_halfdeadflag (rel=rel@entry=0x7f4138a238a8, leafrightsib=leafrightsib@entry=53) at nbtpage.c:1778
> #4  0x00507fba in _bt_mark_page_halfdead (rel=rel@entry=0x7f4138a238a8, leafbuf=leafbuf@entry=13637, stack=stack@entry=0x144ca20) at nbtpage.c:2121

This shows that the basic rules for page deletion have seemingly been
violated. It's as if a page deletion went ahead, but didn't work as an
atomic operation -- there were lost writes for some but not all of the
pages involved. Actually, it looks like a mix of states from before and
after both the first and the second phases of page deletion -- so not
just one atomic operation.

-- 
Peter Geoghegan




Re: Index ordering after IS NULL

2022-09-10 Thread Peter Geoghegan
On Sat, Sep 10, 2022 at 2:28 PM Jeff Janes  wrote:
> explain analyze select * from j where b is null order by c limit 10;
> explain analyze select * from j where b =8 order by c limit 10;
>
> The first uses a sort despite it being disabled.

The first/is null query seems to give the result and plan you're
looking for if the query is rewritten to order by "b, c", and not just
"c".

That in itself doesn't make your complaint any less valid, of course.
You don't have to do this with the second query, so why should you
have to do it with the first?

-- 
Peter Geoghegan




Re: Making autovacuum logs indicate if insert-based threshold was the triggering condition

2022-09-09 Thread Peter Geoghegan
On Fri, Sep 9, 2022 at 11:11 AM Justin Pryzby  wrote:
> > It might make sense to *always* show how close we were to hitting each
> > of the thresholds, including the ones that we didn't end up hitting
> > (we may come pretty close quite often, which seems like it might
> > matter). But showing multiple conditions together just because the
> > planets aligned (we hit multiple thresholds together) emphasizes the
> > low-level mechanism, which is pretty far removed from anything that
> > matters. You might as well pick either threshold at random once this
> > happens -- even an expert won't be able to tell the difference.
>
> I don't have strong feelings about it; I'm just pointing out that the
> two of the conditions aren't actually exclusive.

Fair enough. I'm just pointing out that the cutoffs are continuous for
all practical purposes, even if there are cases where they seem kind
of discrete, due only to implementation details (e.g.
autovacuum_naptime stuff). Displaying only one reason for triggering
an autovacuum is consistent with the idea that the cutoffs are
continuous. It's not literally true that they're continuous, but it
might as well be.

I think of it as similar to how it's not literally true that a coin
toss is always either heads or tails, though it might as well be true.
Sure, even a standard fair coin toss might result in the coin landing
on its side. That'll probably never happen even once, but if it does:
just flip the coin again! The physical coin toss was never the
important part.

> It seems like it could be less confusing to show both.  Consider someone who is
> trying to *reduce* how often autovacuum runs, or give priority to some tables
> by raising the thresholds for other tables.

My objection to that sort of approach is that it suggests a difference
in what each VACUUM actually does -- as if autovacuum.c actually
decided on a particular runtime behavior for the VACUUM up front,
based on its own considerations that come from statistics about the
workload. I definitely want to avoid creating that false impression in
the minds of users.

-- 
Peter Geoghegan




Re: Instrumented pages/tuples frozen in autovacuum's server log output (and VACUUM VERBOSE)

2022-09-08 Thread Peter Geoghegan
On Mon, Sep 5, 2022 at 12:43 PM Peter Geoghegan  wrote:
> Barring any objections I will commit this patch within the next few days.

Pushed this just now.

Thanks
-- 
Peter Geoghegan




Re: Instrumented pages/tuples frozen in autovacuum's server log output (and VACUUM VERBOSE)

2022-09-05 Thread Peter Geoghegan
On Wed, Aug 31, 2022 at 7:49 PM Jeff Janes  wrote:
> I think "frozen:" would be a more suitable line prefix.

Attached revision does it that way.

Barring any objections I will commit this patch within the next few days.

Thanks
-- 
Peter Geoghegan


v2-0001-Instrument-freezing-in-autovacuum-log-reports.patch
Description: Binary data


Re: Backpatching nbtree VACUUM (page deletion) hardening

2022-09-05 Thread Peter Geoghegan
On Fri, Sep 2, 2022 at 6:51 PM Peter Geoghegan  wrote:
> Yes -- nbtree VACUUM generally can cope quite well, even when the
> index is corrupt. It should mostly manage to do what is expected here,
> even with a misbehaving opclass, because it relies as little as
> possible on user-defined opclass code.

I just backpatched the hardening commit from 14 to every supported branch.

-- 
Peter Geoghegan




Re: Backpatching nbtree VACUUM (page deletion) hardening

2022-09-02 Thread Peter Geoghegan
On Fri, Sep 2, 2022 at 6:14 PM Michael Paquier  wrote:
> This has been a problem for years, and still for years to come with
> libc updates.  I am not much into this stuff, but does running VACUUM
> in this case help with the state of the index that used a past,
> now-invalid, collation (be it libc or ICU) to get a bit cleaned up?

Yes -- nbtree VACUUM generally can cope quite well, even when the
index is corrupt. It should mostly manage to do what is expected here,
even with a misbehaving opclass, because it relies as little as
possible on user-defined opclass code.

Even without the hardening in place, nbtree VACUUM will still do a
*surprisingly* good job of recovering when the opclass is broken in
some way: VACUUM just needs the insertion scankey operator class code
to initially determine roughly where to look for the to-be-deleted
page's downlink, one level up in the tree. Even when an operator class
is wildly broken (e.g. the comparator gives a result that it
determines at random), we still won't see problems in nbtree VACUUM
most of the time -- because even being roughly correct is good enough
in practice!

You have to be quite unlucky to hit this, even when the opclass is
wildly broken (which is probably much less common than "moderately
broken").

> When written like that, this surely sounds extremely bad and this
> would need more complex chirurgy (or just running with a build that
> includes this patch?).

The patch will fix the case in question, which I have seen internal
AWS reports about -- though the initial fix that went into 14 wasn't
driven by any complaint from any user. I just happened to notice that
we were throwing an ERROR in nbtree VACUUM for no good reason, which
is something that should be avoided on general principle.

In theory there could be other ways in which you'd run into the same
basic problem (in any index AM). The important point is that we're
better off not throwing any errors in the first place, but if we must
then they had better not be errors that will be repeated again and
again, without any chance of the problem going away naturally. (Not
that it never makes sense to just throw an error; there are meaningful
gradations of "totally unacceptable problem".)

-- 
Peter Geoghegan




Backpatching nbtree VACUUM (page deletion) hardening

2022-09-02 Thread Peter Geoghegan
Postgres 14 commit 5b861baa55 added hardening to nbtree page deletion.
This had the effect of making nbtree VACUUM robust against misbehaving
operator classes -- we just LOG the problem and move on, without
throwing an error. In practice a "misbehaving operator class" is often
a problem with collation versioning.

I think that this should be backpatched now, to protect users from
particularly nasty problems that hitting the error eventually leads
to.

An error ends the whole VACUUM operation. If VACUUM cannot delete the
page the first time, there is no reason to think that it'll be any
different on the second or the tenth attempt. The eventual result
(absent user/DBA intervention) is that no antiwraparound autovacuum
will ever complete, leading to an outage when the system hits
xidStopLimit. (Actually this scenario won't result in the system
hitting xidStopLimit where the failsafe is available, but the failsafe
is another thing that is only in 14, so it's no help on the earlier
branches.)

This seems low risk. The commit in question is very simple. It just
downgrades an old 9.4-era ereport() from ERROR to LOG, and adds a
"return false;" immediately after that. The function in question is
fundamentally structured in a way that allows it to back out of page
deletion because of problems that are far removed from where the
caller starts. When and why we back out of page deletion is
already opaque to the caller, so it's very hard to imagine a new
problem caused by backpatching. Besides all this, 14 has been out for
a while now.

-- 
Peter Geoghegan




Re: Instrumented pages/tuples frozen in autovacuum's server log output (and VACUUM VERBOSE)

2022-08-31 Thread Peter Geoghegan
On Wed, Aug 31, 2022 at 7:49 PM Jeff Janes  wrote:
> I like this addition, but I would also like to see how many pages got newly 
> set to all frozen by the vacuum.

I'd say that that's independent work. Though I'm happy to discuss it now.

It would be fairly straightforward to show something about the VM
itself, but it's not entirely clear what aspects of the VM should be
emphasized. Are we reporting on the state of the table, or work
performed by VACUUM? You said you were interested in the latter
already, but why prefer that over a summary of the contents of the VM
at the end of the VACUUM? Are you concerned about the cost of setting
pages all-visible? Do you have an interest in how VACUUM manages to
set VM pages over time? Something else?

We already call visibilitymap_count() at the end of every VACUUM,
which scans the authoritative VM to produce a more-or-less consistent
summary of the VM at that point in time. This information is then used
to update pg_class.relallvisible (we don't do anything with the
all-frozen number at all). Why not show that information in
VERBOSE/autovacuum's log message? Does it really matter *when* a page
became all-visible/all-frozen/unset?
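
For reference, that existing end-of-VACUUM step boils down to something
like this simplified sketch (the wrapper and variable names are
illustrative, not the real code):

#include "postgres.h"
#include "access/visibilitymap.h"
#include "utils/relcache.h"

static BlockNumber
count_allvisible_for_relallvisible(Relation rel)
{
    BlockNumber new_rel_allvisible = 0;

    /* Scan the authoritative VM; don't ask for the all-frozen total */
    visibilitymap_count(rel, &new_rel_allvisible, NULL);

    /* Caller feeds this into pg_class.relallvisible via vac_update_relstats() */
    return new_rel_allvisible;
}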

> Also, isn't all of vacuuming about XID processing?  I think "frozen:" would 
> be a more suitable line prefix.

That also works for me. I have no strong feelings here.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-08-31 Thread Peter Geoghegan
On Thu, Aug 25, 2022 at 2:21 PM Peter Geoghegan  wrote:
> Attached patch series is a completely overhauled version of earlier
> work on freezing. Related work from the Postgres 15 cycle became
> commits 0b018fab, f3c15cbe, and 44fa8488.

Attached is v2.

This is just to keep CFTester happy, since v1 now has conflicts when
applied against HEAD. There are no notable changes in this v2 compared
to v1.

-- 
Peter Geoghegan


v2-0001-Add-page-level-freezing-to-VACUUM.patch
Description: Binary data


v2-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch
Description: Binary data


v2-0003-Add-eager-freezing-strategy-to-VACUUM.patch
Description: Binary data


v2-0004-Avoid-allocating-MultiXacts-during-VACUUM.patch
Description: Binary data


Re: effective_multixact_freeze_max_age issue

2022-08-31 Thread Peter Geoghegan
On Tue, Aug 30, 2022 at 8:56 PM Nathan Bossart  wrote:
> LGTM

Pushed, thanks.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-08-31 Thread Peter Geoghegan
On Tue, Aug 30, 2022 at 11:28 PM Jeff Davis  wrote:
> That clarifies your point. It's still a challenge for me to reason
> about which of these potential new problems really need to be solved in
> v1, though.

I don't claim to understand it that well myself -- not just yet.
I feel like I have the right general idea, but the specifics
aren't all there (which is very often the case for me at this
point in the cycle). That seems like a good basis for further
discussion.

It's going to be quite a few months before some version of this
patchset is committed, at the very earliest. Obviously these are
questions that need answers, but the process of getting to those
answers is a significant part of the work itself IMV.

> > Why stop at a couple of dozens of lines of code? Why not just change
> > the default of vacuum_freeze_min_age and
> > vacuum_multixact_freeze_min_age to 0?
>
> I don't think that would actually solve the unbounded buildup of
> unfrozen pages. It would still be possible for pages to be marked all
> visible before being frozen, and then end up being skipped until an
> aggressive vacuum is forced, right?

With the 15 work in place, and with the insert-driven autovacuum
behavior from 13, it is likely that this would be enough to avoid all
antiwraparound vacuums in an append-only table. There is still one
case where we can throw away the opportunity to advance relfrozenxid
during non-aggressive VACUUMs for no good reason -- I haven't fixed
that one just yet. But the remaining case (which is in
lazy_scan_skip()) is very narrow.

With vacuum_freeze_min_age = 0 and vacuum_multixact_freeze_min_age =
0, any page that is eligible to be set all-visible is also eligible to
have its tuples frozen and be set all-frozen instead, immediately.
When it isn't then we'll scan it in the next VACUUM anyway.

Actually I'm also ignoring some subtleties with Multis that could make
this not quite happen, but again, that's only a super obscure corner case.
The idea that just setting vacuum_freeze_min_age = 0 and
vacuum_multixact_freeze_min_age = 0 will be enough is definitely true
in spirit. You don't need to touch vacuum_freeze_table_age (if you did
then you'd get aggressive VACUUMs, and one goal here is to avoid
those whenever possible -- especially aggressive antiwraparound
autovacuums).

--
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-08-30 Thread Peter Geoghegan
On Tue, Aug 30, 2022 at 9:37 PM Jeff Davis  wrote:
> On Tue, 2022-08-30 at 18:50 -0700, Peter Geoghegan wrote:
> > Since VM snapshots are immutable, it should be relatively
> > easy to have the implementation make its final decision on skipping
> > only *after* lazy_scan_heap() returns.
>
> I like this idea.

I was hoping that you would. I imagine that this idea (with minor
variations) could enable an approach that's much closer to what you
were thinking of: one that mostly focuses on controlling the number of
unfrozen pages, and not so much on advancing relfrozenxid early, just
because we can and we might not get another chance for a long time. In
other words your idea of a design that can freeze more during a
non-aggressive VACUUM, while still potentially skipping all-visible
pages.

I said earlier on that we ought to at least have a strong bias in the
direction of advancing relfrozenxid in larger tables, especially when
we decide to freeze whole pages more eagerly -- we only get one chance
to advance relfrozenxid per VACUUM, and those opportunities will
naturally be few and far between. We cannot really justify all this
extra freezing if it doesn't prevent antiwraparound autovacuums. That
was more or less my objection to going in that direction.

But if we can more or less double the number of opportunities to at
least ask the question "is now a good time to advance relfrozenxid?"
without really paying much for keeping this option open (and I think
that we can), my concern about relfrozenxid advancement becomes far
less important. Just being able to ask that question is significantly
less rare and precious. Plus we'll probably be able to make
significantly better decisions about relfrozenxid overall with the
"second phase because I changed my mind about skipping" concept in
place.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-08-30 Thread Peter Geoghegan
On Tue, Aug 30, 2022 at 1:45 PM Peter Geoghegan  wrote:
> > d. Can you help give me a sense of scale of the problems solved by
> > visibilitymap snapshots and the cost model? Do those need to be in v1?
>
> I'm not sure. I think that having certainty that we'll be able to scan
> only so many pages up-front is very broadly useful, though. Plus it
> removes the SKIP_PAGES_THRESHOLD stuff, which was intended to enable
> relfrozenxid advancement in non-aggressive VACUUMs, but does so in a
> way that results in scanning many more pages needlessly. See commit
> bf136cf6, which added the SKIP_PAGES_THRESHOLD stuff back in 2009,
> shortly after the visibility map first appeared.

Here is a better example:

Right now the second patch adds both VM snapshots and the skipping
strategy stuff. The VM snapshot is used in the second patch, as a
source of reliable information about how we need to process the table,
in terms of the total number of scanned_pages -- which drives our
choice of strategy. Importantly, we can assess the question of which
skipping strategy to take (in non-aggressive VACUUM) based on 100%
accurate information about how many *extra* pages we'll have to scan
in the event of being eager (i.e. in the event that we prioritize
early relfrozenxid advancement over skipping some pages). Importantly,
that cannot change later on, since VM snapshots are immutable --
everything is locked in. That already seems quite valuable to me.

This general concept could be pushed a lot further without great
difficulty. Since VM snapshots are immutable, it should be relatively
easy to have the implementation make its final decision on skipping
only *after* lazy_scan_heap() returns. We could allow VACUUM to
"change its mind about skipping" in cases where it initially thought
that skipping was the best strategy, only to discover much later on
that that was the wrong choice after all.

A huge amount of new, reliable information will come to light from
scanning the heap rel. In particular, the current value of
vacrel->NewRelfrozenXid seems like it would be particularly
interesting when the time came to consider if a second scan made sense
-- if NewRelfrozenXid is a recent-ish value already, then that argues
for finishing off the all-visible pages in a second heap pass, with
the aim of setting relfrozenxid to a similarly recent value when it
happens to be cheap to do so.

The actual process of scanning precisely those all-visible pages that
were skipped the first time around during a second call to
lazy_scan_heap() can be implemented in the obvious way: by teaching
the VM snapshot infrastructure/lazy_scan_skip() to treat pages that
were skipped the first time around to get scanned during the second
pass over the heap instead. Also, those pages that were scanned the
first time around can/must be skipped on our second pass (excluding
all-frozen pages, which won't be scanned in either heap pass).

I've used the term "second heap pass" here, but that term is slightly
misleading. The final outcome of this whole process is that every heap
page that the vmsnap says VACUUM will need to scan in order for it to
be able to safely advance relfrozenxid will be scanned, precisely
once. The overall order that the heap pages are scanned in will of
course differ from the simple case, but I don't think that it makes
very much difference. In reality there will have only been one heap
pass, consisting of two distinct phases. No individual heap page will
ever be considered for pruning/freezing more than once, no matter
what. This is just a case of *reordering* work. Immutability makes
reordering work easy in general.

--
Peter Geoghegan




Re: effective_multixact_freeze_max_age issue

2022-08-30 Thread Peter Geoghegan
On Mon, Aug 29, 2022 at 3:40 PM Nathan Bossart  wrote:
> Agreed.

Attached is v2, which cleans up the structure of
vacuum_set_xid_limits() a bit more. The idea was to improve
the overall flow/readability of the function by moving the WARNINGs
into their own "code stanza", just after the logic for establishing
freeze cutoffs and just before the logic for deciding on
aggressiveness. That is now the more logical approach (group the
stanzas by functionality), since we can't sensibly group the code
based on whether it deals with XIDs or with Multis anymore (not since
the function was taught to deal with the question of whether caller's
VACUUM will be aggressive).

Going to push this in the next day or so.

In v2 I also removed some local variables that seemed to make the
function a lot harder to follow. Consider code like this:

-   freezemin = freeze_min_age;
-   if (freezemin < 0)
-   freezemin = vacuum_freeze_min_age;
-   freezemin = Min(freezemin, autovacuum_freeze_max_age / 2);
-   Assert(freezemin >= 0);
+   if (freeze_min_age < 0)
+   freeze_min_age = vacuum_freeze_min_age;
+   freeze_min_age = Min(freeze_min_age, autovacuum_freeze_max_age / 2);
+   Assert(freeze_min_age >= 0);

Why have this freezemin temp variable? Why not just use the
freeze_min_age function parameter directly instead? That is a better
representation of what's going on at the conceptual level. We now
assign vacuum_freeze_min_age to the freeze_min_age arg (not to a
separate freezemin variable) when our VACUUM caller passes us a value
of -1 for that arg. -1 effectively means "whatever the value of the
vacuum_freeze_min_age GUC is", which is clearer without the superfluous
freezemin variable.

-- 
Peter Geoghegan


v2-0001-Derive-freeze-cutoff-from-nextXID-not-OldestXmin.patch
Description: Binary data


Re: New strategies for freezing, advancing relfrozenxid early

2022-08-30 Thread Peter Geoghegan
> d. Can you help give me a sense of scale of the problems solved by
> visibilitymap snapshots and the cost model? Do those need to be in v1?

I'm not sure. I think that having certainty that we'll be able to scan
only so many pages up-front is very broadly useful, though. Plus it
removes the SKIP_PAGES_THRESHOLD stuff, which was intended to enable
relfrozenxid advancement in non-aggressive VACUUMs, but does so in a
way that results in scanning many more pages needlessly. See commit
bf136cf6, which added the SKIP_PAGES_THRESHOLD stuff back in 2009,
shortly after the visibility map first appeared.

Since relfrozenxid advancement fundamentally works at the table level,
it seems natural to make it a top-down, VACUUM-level thing -- even
within non-aggressive VACUUMs (I guess it already meets that
description in aggressive VACUUMs). And since we really want to
advance relfrozenxid when we do extra freezing (for the reasons I just
went into), it seems natural to me to view it as one problem. I accept
that it's not clear cut, though.

[1] 
https://docs.google.com/presentation/d/1WgP-SlKay5AnSoVDSvOIzmu7edMmtYhdywoa0oAR4JQ/edit?usp=sharing
[2] https://disc-projects.bu.edu/compactionary/research.html
-- 
Peter Geoghegan




Re: effective_multixact_freeze_max_age issue

2022-08-29 Thread Peter Geoghegan
On Mon, Aug 29, 2022 at 2:20 AM Matthias van de Meent wrote:
> Apart from the message that this behaviour is changing, I'd prefer
> some more description in the commit message as to why this needs
> changing.

I usually only write a full commit message before posting a patch when
it's a full patch series, where it can be helpful to be very explicit
about how the parts fit together. The single line commit message is
just a placeholder -- I'll definitely write a better one before
commit.

> Then, on to the patch itself:
>
> > + * XXX We don't do push back oldestMxact here, which is not 
> > ideal
>
> Do you intend to commit this marker, or is this leftover from the
> development process?

Ordinarily I would never commit an XXX comment, and probably wouldn't
even leave one in early revisions of patches that I post to the list.
This is a special case, though -- it involves the "snapshot too old"
feature, which has many similar XXX/FIXME/TODO comments. I think I
might leave it like that when committing.

The background here is that the snapshot too old code still has lots
of problems -- there is a FIXME comment that gives an overview of this
in TransactionIdLimitedForOldSnapshots(). We're going to have to live
with the fact that that feature isn't in good shape for the
foreseeable future. I can only really work around it.

> > +if (*multiXactCutoff < FirstMultiXactId)
> [...]
> > +if (safeOldestMxact < FirstMultiXactId)
> [...]
> > +if (aggressiveMXIDCutoff < FirstMultiXactId)
>
> I prefer !TransactionId/MultiXactIdIsValid() over '< First
> [MultiXact/Transaction]Id', even though it is the same in
> functionality, because it clarifies the problem we're trying to solve.
> I understand that the use of < is pre-existing, but since we're
> touching this code shouldn't we try to get this new code up to current
> standards?

I agree in principle, but there are already 40+ other places that use
the same idiom in places like multixact.c. Perhaps you can propose a
patch to change all of them at once, together?

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-08-29 Thread Peter Geoghegan
efinitely no point in scanning only a
subset of the table's all-visible pages, as far as relfrozenxid
advancement is concerned (and skipping strategy is fundamentally a
choice about relfrozenxid advancement vs work avoidance, eagerness vs
laziness).

Maybe you're right that there is room for additional freezing
strategies, besides the two added by v1-0003-*patch. Definitely seems
possible. The freezing strategy concept should be usable as a
framework for adding additional strategies, including (just for
example) a strategy that decides ahead of time to freeze only so many
pages, though not others (without regard for the fact that the pages
that we are freezing may not be very different to those we won't be
freezing in the current VACUUM).

I'm definitely open to that. It's just a matter of characterizing the
set of workload characteristics this third strategy would address, how
users might opt in or opt out, etc. Both the eager and the lazy
freezing strategies are based on some notion of what's important for
the table, based on its known characteristics, and based on what seems
likely to happen to the table in the future (the next VACUUM, at least).
I'm not completely sure how many strategies we'll end up needing.
Though it seems like the eager/lazy trade-off is a really important
part of how these strategies will need to work, in general.

(Thinks some more) I guess that such an alternative freezing strategy
would probably have to affect the skipping strategy too. It's tricky
to tease apart because it breaks the idea that skipping strategy and
freezing strategy are basically distinct questions. That is a factor
that makes it a bit more complicated to discuss. In any case, as I
said, I have an open mind about alternative freezing strategies beyond
the 2 basic lazy/eager freezing strategies from the patch.

> What if we thought about this more like a "background freezer". It
> would keep track of the total number of unfrozen pages in the system,
> and freeze them at some kind of controlled/adaptive rate.

I like the idea of storing metadata in shared memory. And scheduling
and deprioritizing running autovacuums. Being able to slow down or
even totally halt a given autovacuum worker without much consequence
is enabled by the VM snapshot concept.

That said, this seems like future work to me. Worth discussing, but
trying to keep out of scope in the first version of this that is
committed.

> Regular autovacuum's job would be to keep advancing relfrozenxid for
> all tables and to do other cleanup, and the background freezer's job
> would be to keep the absolute number of unfrozen pages under some
> limit. Conceptually those two jobs seem different to me.

The problem with making it such a sharp distinction is that it can be
very useful to manage costs by making it the job of VACUUM to do both
-- we can avoid dirtying the same page multiple times.

I think that we can accomplish the same thing by giving VACUUM more
freedom to do either more or less work, based on the observed
characteristics of the table, and some sense of how costs will tend to
work over time, across multiple distinct VACUUM operations. In
practice that might end up looking very similar to what you describe.

It seems undesirable for VACUUM to ever be too sure of itself -- the
information that triggers autovacuum may not be particularly reliable,
which can be solved to some degree by making as many decisions as
possible at runtime, dynamically, based on the most authoritative and
recent information. Delaying committing to one particular course of
action isn't always possible, but when it is possible (and not too
expensive) we should do it that way on general principle.

> Also, regarding patch v1-0001-Add-page-level-freezing, do you think
> that narrows the conceptual gap between an all-visible page and an all-
> frozen page?

Yes, definitely. However, I don't think that we can just get rid of
the distinction completely -- though I did think about it for a while.
For one thing we need to be able to handle cases like the case where
heap_lock_tuple() modifies an all-frozen page, and makes it
all-visible without making it completely unskippable to every VACUUM
operation.

-- 
Peter Geoghegan




Re: effective_multixact_freeze_max_age issue

2022-08-29 Thread Peter Geoghegan
On Sun, Aug 28, 2022 at 4:14 PM Nathan Bossart  wrote:
> The idea seems sound to me, and IMO your patch simplifies things nicely,
> which might be reason enough to proceed with it.

It is primarily a case of making things simpler. Why would it ever
make sense to interpret age differently in the presence of a long
running transaction, though only for the FreezeLimit/MultiXactCutoff
cutoff calculation? And not for the closely related
freeze_table_age/multixact_freeze_table_age calculation? It's hard to
imagine that that was ever a deliberate choice.

vacuum_set_xid_limits() didn't contain the logic for determining if
its caller's VACUUM should be an aggressive VACUUM until quite
recently. Postgres 15 commit efa4a9462a put the logic for determining
aggressiveness right next to the logic for determining FreezeLimit,
which made the inconsistency much more noticeable. It is easy to
believe that this was really just an oversight, all along.

> However, I'm struggling
> to understand when this change would help much in practice.  IIUC it will
> cause vacuums to freeze a bit more, but outside of extreme cases (maybe
> when vacuum_freeze_min_age is set very high and there are long-running
> transactions), ISTM it might not have tremendously much impact.  Is the
> intent to create some sort of long-term behavior change for autovacuum, or
> is this mostly aimed towards consistency among the cutoff calculations?

I agree that this will have only a negligible impact on the majority
(perhaps even the vast majority) of applications. The primary
justification for this patch (simplification) seems sufficient, all
things considered. Even still, it's possible that it will help in
extreme cases. Cases with pathological performance issues,
particularly those involving MultiXacts.

We already set FreezeLimit to the most aggressive possible value of
OldestXmin when OldestXmin has itself already crossed a quasi
arbitrary XID-age threshold of autovacuum_freeze_max_age XIDs (i.e.
when OldestXmin < safeLimit), with analogous rules for
MultiXactCutoff/OldestMxact. Consequently, in pathological cases where
(say) there is a leaked replication slot, there will be a sharp
discontinuity in how FreezeLimit is set, within and across tables. And
for what?

Initially, these pathological cases will end up using exactly the same
FreezeLimit for every VACUUM against every table (assuming that we're
using a system-wide min_freeze_age setting) -- every VACUUM operation
will use a FreezeLimit of `OldestXmin - vacuum_freeze_min_age`. At a
certain point that'll suddenly flip -- now every VACUUM operation will
use a FreezeLimit of `OldestXmin`. OTOH with the patch they'd all have
a FreezeLimit that is tied to the time that each VACUUM started --
which is exactly the FreezeLimit behavior that we'd get if there was
no leaked replication slot (at least until OldestXmin finally attains
an age of vacuum_freeze_min_age, when it finally becomes unavoidable,
even with the patch).

There is something to be said for preserving the "natural diversity"
of the relfrozenxid values among tables, too. The FreezeLimit we use
is (at least for now) almost always going to be very close to (if not
exactly) the same value as the NewRelfrozenXid value that we set
relfrozenxid to at the end of VACUUM (at least in larger tables).
Without the patch, a once-off problem with a leaked replication slot
can accidentally result in lasting problems where all of the largest
tables get their antiwraparound autovacuums at exactly the same time.
The current behavior increases the risk that we'll accidentally
"synchronize" the relfrozenxid values for large tables that had an
antiwraparound vacuum during the time when OldestXmin was held back.

Needlessly using the same FreezeLimit across each VACUUM operation
risks disrupting the natural ebb and flow of things. It's hard to say
how much of a problem that is in the real world. But why take any
chances?

--
Peter Geoghegan




Re: effective_multixact_freeze_max_age issue

2022-08-28 Thread Peter Geoghegan
On Tue, Aug 2, 2022 at 4:12 PM Peter Geoghegan  wrote:
> That is, we only need to make sure that the "multiXactCutoff must
> always be <= oldestMxact" invariant holds once, by checking for it
> directly. The same thing happens with OldestXmin/FreezeLimit. That
> seems like a simpler foundation. It's also a lot more logical. Why
> should the cutoff for freezing be held back by a long running
> transaction, except to the extent that it is strictly necessary to do
> so to avoid wrong answers (wrong answers seen by the long running
> transaction)?

Anybody have any input on this? I'm hoping that this can be committed soon.

ISTM that the way that we currently derive FreezeLimit (by starting
with OldestXmin rather than starting with the same
ReadNextTransactionId() value that gets used for the aggressiveness
cutoffs) is just an accident of history. The "Routine vacuuming" docs
already describe this behavior in terms that sound closer to the
behavior with the patch than the actual current behavior:

"When VACUUM scans every page in the table that is not already
all-frozen, it should set age(relfrozenxid) to a value just a little
more than the vacuum_freeze_min_age setting that was used (more by the
number of transactions started since the VACUUM started)"

Besides, why should there be an idiosyncratic definition of "age" that
is only used with
vacuum_freeze_min_age/vacuum_multixact_freeze_min_age? Why would
anyone want to do less freezing in the presence of a long running
transaction? It simply makes no sense (unless we're forced to do so to
preserve basic guarantees needed for correctness, such as the
"FreezeLimit <= OldestXmin" invariant).

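Stated as a compressed sketch (XID wraparound handling and the
FirstNormalTransactionId clamps are deliberately omitted -- this is the
shape of the two derivations, not the patch's actual code):

#include "postgres.h"
#include "access/transam.h"

static TransactionId
freeze_limit_sketch(TransactionId OldestXmin, int freeze_min_age,
                    bool proposed)
{
    TransactionId limit;

    if (!proposed)
    {
        /* Current behavior: anchored to OldestXmin, so a long-running
         * transaction drags FreezeLimit backwards along with it */
        limit = OldestXmin - freeze_min_age;
    }
    else
    {
        /* Proposed behavior: anchored to the XID counter at the start
         * of VACUUM, then clamped to keep FreezeLimit <= OldestXmin */
        limit = ReadNextTransactionId() - freeze_min_age;
        if (TransactionIdPrecedes(OldestXmin, limit))
            limit = OldestXmin;
    }

    return limit;
}
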
-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-08-25 Thread Peter Geoghegan
On Thu, Aug 25, 2022 at 4:23 PM Peter Geoghegan  wrote:
> As a general rule VACUUM will tend to do more eager freezing with the
> patch set compared to HEAD, though it should never do less eager
> freezing. Not even in corner cases -- never.

Come to think of it, I don't think that that's quite true. Though the
fourth patch isn't particularly related to the problem.

It *is* true that VACUUM will do at least as much freezing of
XID-based tuple header fields as before. That just leaves MXIDs. It's
even true that we will do just as much freezing as before if you go
purely by MultiXact age. But I'm the one that likes to point out that age is
altogether the wrong approach for stuff like this -- so that won't cut
it.

More concretely, I think that the patch series will fail to do certain
inexpensive eager processing of tuple xmax that happens today,
regardless of what the user has set vacuum_freeze_min_age or
vacuum_multixact_freeze_min_age to. Although we currently only care
about XID age when processing simple XIDs, we already sometimes make
trade-offs similar to the trade-off I propose to make in the fourth
patch for Multis.

In other words, on HEAD, we promise to process any MXID >=
MultiXactCutoff inside FreezeMultiXactId(). But we also manage to do
"eager processing of xmax" when it's cheap and easy to do so, without
caring about MultiXactCutoff at all -- this is likely the common case,
even. This preexisting eager processing of Multis is likely important
in many applications.

The problem that I think I've created is that page-level freezing as
implemented in lazy_scan_prune by the third patch doesn't know that it
might be a good idea to execute a subset of freeze plans, in order to
remove a multi from a page right away. It mostly has the right idea by
holding off on freezing until it looks like a good idea at the level
of the whole page, but I think that this is a plausible exception.
That's just because we're much more sensitive to leaving behind a Multi, and
right now the only code path that can remove a Multi that isn't needed
anymore is FreezeMultiXactId().

If xmax was an updater that aborted instead of a multi then we could
rely on hint bits being set by pruning to avoid clog lookups.
Technically nobody has violated their contract here, I think, but it
still seems like it could easily be unacceptable.

I need to come up with my own microbenchmark suite for Multis -- that
was on my TODO list already. I still believe that the fourth patch
addresses Andres' complaint about allocating new Multis during VACUUM.
But it seems like I need to think about the nuances of Multis some
more. In particular, what the performance impact might be of making a
decision on freezing at the page level, in light of the special
performance considerations for Multis.

Maybe it needs to be more granular than that, more often. Or maybe we
can comprehensively solve the problem in some other way entirely.
Maybe pruning should do this instead, in general. Something like that
might put this right, and be independently useful.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-08-25 Thread Peter Geoghegan
On Thu, Aug 25, 2022 at 3:35 PM Jeremy Schneider  wrote:
> We should be careful here. IIUC, the current autovac behavior helps
> bound the "spread" or range of active multixact IDs in the system, which
> directly determines the number of distinct pages that contain those
> multixacts. If the proposed change herein causes the spread/range of
> MXIDs to significantly increase, then it will increase the number of
> blocks and increase the probability of thrashing on the SLRUs for these
> data structures.

As a general rule VACUUM will tend to do more eager freezing with the
patch set compared to HEAD, though it should never do less eager
freezing. Not even in corner cases -- never.

With the patch, VACUUM pretty much uses the most aggressive possible
XID-wise/MXID-wise cutoffs in almost all cases (though only when we
actually decide to freeze a page at all, which is now a separate
question). The fourth patch in the patch series introduces a very
limited exception, where we use the same cutoffs that we'll always use
on HEAD (FreezeLimit + MultiXactCutoff) instead of the aggressive
variants (OldestXmin and OldestMxact). This isn't just *any* xmax
containing a MultiXact: it's a Multi that contains *some* XIDs that
*need* to go away during the ongoing VACUUM, and others that *cannot*
go away. Oh, and there usually has to be a need to keep two or more
XIDs for this to happen -- if there is only one XID then we can
usually swap xmax with that XID without any fuss.

> PS. see also
> https://www.postgresql.org/message-id/247e3ce4-ae81-d6ad-f54d-7d3e0409a...@ardentperf.com

I think that the problem you describe here is very real, though I
suspect that it needs to be addressed by making opportunistic cleanup
of Multis happen more reliably. Running VACUUM more often just isn't
practical once a table reaches a certain size. In general, any kind of
processing that is time sensitive probably shouldn't be happening
solely during VACUUM -- it's just too risky. VACUUM might take a
relatively long time to get to the affected page. It might not even be
that long in wall clock time or whatever -- just too long to reliably
avoid the problem.

-- 
Peter Geoghegan




New strategies for freezing, advancing relfrozenxid early

2022-08-25 Thread Peter Geoghegan
an the alternative).

Reducing the WAL space overhead of freezing
===========================================

Not included in this new v1 are other patches that control the
overhead of added freezing -- my focus since joining AWS has been to
get these more strategic patches in shape, and telling the right story
about what I'm trying to do here. I'm going to say a little on the
patches that I have in the pipeline here, though. Getting the
low-level/mechanical overhead of freezing under control will probably
require a few complementary techniques, not just high-level strategies
(though the strategy stuff is the most important piece).

The really interesting omitted-in-v1 patch adds deduplication of
xl_heap_freeze_page WAL records. This reduces the space overhead of
WAL records used to freeze by ~5x in most cases. It works in the
obvious way: we just store the 12 byte freeze plans that appear in
each xl_heap_freeze_page record only once, and then store an array of
item offset numbers for each entry (rather than naively storing a full
12 bytes per tuple frozen per page-level WAL record). This means that
we only need an "extra" ~2 bytes of WAL space per "extra" tuple frozen
(2 bytes for an OffsetNumber) once we decide to freeze something on
the same page. The *marginal* cost can be much lower than it is today,
which makes page-based batching of freezing much more compelling IMV.

Thoughts?
--
Peter Geoghegan


v1-0001-Add-page-level-freezing-to-VACUUM.patch
Description: Binary data


v1-0004-Avoid-allocating-MultiXacts-during-VACUUM.patch
Description: Binary data


v1-0003-Add-eager-freezing-strategy-to-VACUUM.patch
Description: Binary data


v1-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch
Description: Binary data


Instrumented pages/tuples frozen in autovacuum's server log out (and VACUUM VERBOSE)

2022-08-20 Thread Peter Geoghegan
Attached is another autovacuum (and VACUUM VERBOSE) instrumentation
patch. This one adds instrumentation about freezing to the report
autovacuum makes to the server log. Specifically, it makes the output
look like this:

regression=#  vacuum (freeze,verbose) foo;
INFO:  aggressively vacuuming "regression.public.foo"
INFO:  finished vacuuming "regression.public.foo": index scans: 0
pages: 0 removed, 45 remain, 45 scanned (100.00% of total)
tuples: 0 removed, 1 remain, 0 are dead but not yet removable
removable cutoff: 751, which was 0 XIDs old when operation ended
new relfrozenxid: 751, which is 2 XIDs ahead of previous value
XIDs processed: 45 pages from table (100.00% of total) had 1 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
I/O timings: read: 0.023 ms, write: 0.000 ms
avg read rate: 2.829 MB/s, avg write rate: 5.658 MB/s
buffer usage: 95 hits, 2 misses, 4 dirtied
WAL usage: 91 records, 1 full page images, 133380 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM

Notice the new line about freezing, which we always output -- it's the
line that begins with "XIDs processed", that appears about half way
down. The new line is deliberately placed after the existing "new
relfrozenxid" line and before the existing line about dead item
identifiers. This placement of the new instrumentation seems logical
to me; freezing is related to relfrozenxid (obviously), but doesn't
need to be shoehorned into the prominent early line that reports on
tuples removed/remain[ing].

Like its neighboring "dead item identifier" line, this new line shows
the number of items/tuples affected, and the number of heap pages
affected -- with heap pages shown both as an absolute number and as a
percentage of rel_pages (in parentheses). The main cost associated
with freezing is the WAL overhead, so emphasizing pages here seems
like the way to go -- pages are more interesting than tuples. This
format also makes it relatively easy to get a sense of the *relative*
costs of the overhead of each distinct class/category of maintenance
performed.

-- 
Peter Geoghegan


v1-0001-Add-freezing-instrumentation-to-VACUUM-VERBOSE.patch
Description: Binary data


Re: pg15b3: crash in paralell vacuum

2022-08-18 Thread Peter Geoghegan
On Thu, Aug 18, 2022 at 7:25 AM Masahiko Sawada  wrote:
> Justin, if it's reproducible in your environment, could you please try
> it again with the attached patch?

Pushed, thanks.

I wonder how this issue could have been caught earlier, or even
avoided in the first place. Would the bug have been caught if Valgrind
had known to mark dynamic shared memory with VALGRIND_MAKE_MEM_UNDEFINED()
when it is first allocated? ISTM that we should do something that is
analogous to aset.c's Valgrind handling for palloc() requests.

Similar work on buffers in shared memory led to us catching a tricky
bug involving unsafe access to a buffer, a little while ago -- see
bugfix commit 7b7ed046. The bug in question would probably have taken
much longer to catch without the instrumentation. In fact, it seems
like a good idea to use Valgrind for *anything* where it *might* catch
bugs, just in case.

Valgrind can work well for shared memory without any extra work. The
backend's own idea of the memory (the memory mapping used by the
process) is all that Valgrind cares about. You don't have to worry
about Valgrind instrumentation in one backend causing confusion in
another backend. It's very practical, and very general purpose. I
think that most of the protection comes from a basic understanding of
"this memory is unsafe to access, this memory contains uninitialized
data that cannot be assumed to have any particular value, this memory
is initialized and safe".

--
Peter Geoghegan




Re: Remaining case where reltuples can become distorted across multiple VACUUM operations

2022-08-18 Thread Peter Geoghegan
On Thu, Aug 11, 2022 at 1:48 AM Matthias van de Meent
 wrote:
> I think I understand your reasoning, but I don't agree with the
> conclusion. The attached patch 0002 does fix that skew too, at what I
> consider negligible cost. 0001 is your patch with a new version
> number.

Your patch added allowSystemTableMods to one of the tests. I guess
that this was an oversight?

> I'm fine with your patch as is, but would appreciate it if known
> estimate mistakes would also be fixed.

Why do you think that this particular scenario/example deserves
special attention? As I've acknowledged already, it is true that your
scenario is one in which we provably give a less accurate estimate,
based on already-available information. But other than that, I don't
see any underlying principle that would be violated by my original
patch (any kind of principle, held by anybody). reltuples is just an
estimate.

I was thinking of going your way on this, purely because it didn't
seem like there'd be much harm in it (why not just handle your case
and be done with it?). But I don't think that it's a good idea now.
reltuples is usually derived by ANALYZE using a random sample, so the
idea that tuple density can be derived accurately enough from a random
sample is pretty baked in. You're talking about a case where ignoring
just one page ("sampling" all but one of the pages) *isn't* good
enough. It just doesn't seem like something that needs to be addressed
-- it's quite awkward to do so.

Barring any further objections, I plan on committing the original
version tomorrow.

> An alternative solution could be doing double-vetting, where we ignore
> tuples_scanned if <2% of pages AND <2% of previous estimated tuples
> was scanned.

I'm not sure that I've understood you, but I think that you're talking
about remembering more information (in pg_class), which is surely out
of scope for a bug fix.

-- 
Peter Geoghegan




Re: Remaining case where reltuples can become distorted across multiple VACUUM operations

2022-08-08 Thread Peter Geoghegan
On Mon, Aug 8, 2022 at 9:17 AM Matthias van de Meent
 wrote:
> Because if a subset of the pages of a relation contains more tuples
> than your current total expected tuples in the table, you should
> update your expectations regardless of which blocks or which number of
> blocks you've scanned - the previous stored value is a strictly worse
> estimation than your last measurement.

The previous stored value could be -1, which represents the idea that
we don't know the tuple density yet. So it doesn't necessarily follow
that the new estimate is strictly better, even in this exact scenario.

> A 33-block relation with first 32 1-tuple pages is still enough to
> have a last page with 250 tuples, which would be ignored in that
> scheme and have a total tuple count of 33 or so.

The simple fact is that there is only so much we can do with the
limited information/context that we have. Heuristics are not usually
free of all bias. Often the bias is the whole point -- the goal can be
to make sure that we have the bias that we know we can live with, and
not the opposite bias, which is much worse. Details of which are
usually very domain specific.

I presented my patch with a very simple test case -- a very clear
problem. Can you do the same for this scenario?

I accept that it is possible that we'll keep an old reltuples which is
provably less accurate than doing something with the latest
information from vacuumlazy.c. But the conditions under which this can
happen are *very* narrow. I am not inclined to do anything about it
for that reason.

-- 
Peter Geoghegan




Re: Remaining case where reltuples can become distorted across multiple VACUUM operations

2022-08-08 Thread Peter Geoghegan
On Mon, Aug 8, 2022 at 8:33 AM Matthias van de Meent
 wrote:
> For example, if currently the measured 2% of the pages contains more
> than 100% of the previous count of tuples, or with your patch the last
> page contains more than 100% of the previous count of the tuples, that
> new count is ignored, which seems silly considering that the vacuum
> count is supposed to be authorative.

The 2% thing is conditioned on the new relpages value precisely
matching the existing relpages from pg_class -- which makes it very
targeted. I don't see why scanned_tuples greatly exceeding the
existing reltuples from pg_class is interesting (any more interesting
than the other way around).

We'll always accept scanned_tuples as authoritative when VACUUM
actually scans all pages, no matter what. Currently it isn't possible
for VACUUM to skip pages in a table that is 32 pages or less in size.
So even the new "single page" thing from the patch cannot matter
there.


--
Peter Geoghegan




Re: Remaining case where reltuples can become distorted across multiple VACUUM operations

2022-08-08 Thread Peter Geoghegan
On Mon, Aug 8, 2022 at 8:14 AM Matthias van de Meent
 wrote:
> I do not have intimate knowledge of this code, but shouldn't we also
> add some sefety guarantees like the following in these blocks? Right
> now, we'll keep underestimating the table size even when we know that
> the count is incorrect.
>
> if (scanned_tuples > old_rel_tuples)
> return some_weighted_scanned_tuples;

Not sure what you mean -- we do something very much like that already.

We take the existing tuple density, and assume that that hasn't
changed for any unscanned pages -- that is used to build a total
number of tuples for the unscanned pages. Then we add the number of
live tuples/scanned_tuples that the vacuumlazy.c caller just
encountered on scanned_pages. That's often where the final reltuples
value comes from.
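
To illustrate with made-up numbers: suppose pg_class currently has
relpages = 1,000 and reltuples = 100,000 (a density of 100 tuples per
page), and the latest VACUUM scanned 10 of those pages, counting 1,200
live tuples. The 990 unscanned pages are assumed to still hold 100
tuples each, so the final estimate comes out to roughly
990 * 100 + 1,200 = 100,200.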

-- 
Peter Geoghegan




Re: Remaining case where reltuples can become distorted across multiple VACUUM operations

2022-08-08 Thread Peter Geoghegan
On Fri, Aug 5, 2022 at 5:39 PM Peter Geoghegan  wrote:
> Attached patch closes the remaining gap. With the patch applied,
> the second and subsequent vacuum verbose operations from the test case
> will show that reltuples is still 1 (it won't ever change). The
> patch just extends an old behavior that was applied when scanned_pages
> == 0 to cases where scanned_pages <= 1 (unless we happened to scan all
> of the relation's pages, of course).

My plan is to commit this later in the week, barring objections. Maybe
on Thursday.

-- 
Peter Geoghegan




Re: Making autovacuum logs indicate if insert-based threshold was the triggering condition

2022-08-06 Thread Peter Geoghegan
On Sat, Aug 6, 2022 at 3:51 PM Justin Pryzby  wrote:
> > Well, autovacuum.c should have (and/or kind of already has) 3
> > different triggering conditions. These are mutually exclusive
> > conditions -- technically autovacuum.c always launches an autovacuum
> > against a table because exactly 1 of the 3 thresholds were crossed.
>
> The issue being that both thresholds can be crossed:
>
> >> 2022-08-06 16:47:47.674 CDT autovacuum worker[12707] DEBUG:  t: VAC: 9 
> >> (THRESHOLD 50), INS: 9 (THRESHOLD 1000), anl: 18 (threshold 50)

What are the chances that both thresholds will be crossed at *exactly*
(not approximately) the same time in a real world case, where the
table isn't tiny (tiny relative to the "autovacuum_naptime quantum")?
This is a very narrow case.

Besides, the same can already be said about how autovacuum.c crosses
the XID-based antiwraparound threshold. Yet we still arbitrarily
report that it's antiwraparound in the logs, which (at least right
now) is generally assumed to mostly be about advancing relfrozenxid.
(Or maybe it's the other way around; it doesn't matter.)

It might make sense to *always* show how close we were to hitting each
of the thresholds, including the ones that we didn't end up hitting
(we may come pretty close quite often, which seems like it might
matter). But showing multiple conditions together just because the
planets aligned (we hit multiple thresholds together) emphasizes the
low-level mechanism, which is pretty far removed from anything that
matters. You might as well pick either threshold at random once this
happens -- even an expert won't be able to tell the difference.

-- 
Peter Geoghegan




Re: Making autovacuum logs indicate if insert-based threshold was the triggering condition

2022-08-06 Thread Peter Geoghegan
On Sat, Aug 6, 2022 at 2:50 PM Justin Pryzby  wrote:
> This sounded familiar, and it seems like I anticipated that it might be an
> issue.  Here, I was advocating for the new insert-based GUCs to default to -1,
> to have insert-based autovacuum fall back to the thresholds specified by the
> pre-existing GUCs (20% + 50), which would (in my proposal) remain the normal
> way to tune any type of vacuum.
>
> https://www.postgresql.org/message-id/20200317233218.gd26...@telsasoft.com
>
> I haven't heard of anyone who had trouble setting the necessary GUC, but I'm
> not surprised if most postgres installations are running versions before 13.

ISTM that having insert-based triggering conditions is definitely a
good idea, but what we have right now still has problems. It currently
won't work very well unless the user goes out of their way to tune
freezing to do the right thing. Typically we miss out on the
opportunity to freeze early, because without sophisticated
intervention from the user there is only a slim chance of *any*
freezing taking place outside of the inevitable antiwraparound
autovacuum.

> > Note that a VACUUM that is an "automatic vacuum for inserted tuples" cannot
> > [...] also be a "regular" autovacuum/VACUUM
>
> Why not ?

Well, autovacuum.c should have (and/or kind of already has) 3
different triggering conditions. These are mutually exclusive
conditions -- technically autovacuum.c always launches an autovacuum
against a table because exactly 1 of the 3 thresholds was crossed. My
patch makes sure that it always gives exactly one reason why
autovacuum.c decided to VACUUM, so by definition there is only one
relevant piece of information for vacuumlazy.c to report in the log.
That's fairly simple and high level, and presumably something that
users won't have much trouble understanding.

Right now antiwraparound autovacuum "implies aggressive", in that it
almost always makes vacuumlazy.c use aggressive mode, but this seems
totally arbitrary to me -- they don't have to be virtually synonymous.
I think that antiwraparound autovacuum could even be rebranded as "an
autovacuum that takes place because the table hasn't had one in a long
time". This is much less scary, and makes it clearer that autovacuum.c
shouldn't be expected to really understand what will turn out to be
important "at runtime". That's the time to make important decisions
about what work to do -- when we actually have accurate information.

My antiwraparound example is just that: an example. There is a broader
idea: we shouldn't be too confident that the exact triggering
condition autovacuum.c applied to launch an autovacuum worker turns
out to be the best reason to VACUUM, or even a good reason --
vacuumlazy.c should be able to cope with that. The user is kept in the
loop about both, by reporting the triggering condition and the details
of what really happened at runtime. Maybe vacuumlazy.c can be taught
to speed up and slow down based on the conditions it observes as it
scans the heap -- there are many possibilities.

This broader idea is pretty much what you were getting at with your
example, I think.

-- 
Peter Geoghegan




Making autovacuum logs indicate if insert-based threshold was the triggering condition

2022-08-06 Thread Peter Geoghegan
It's quite possible (and probably very common) for certain tables to
have autovacuum scheduling trigger autovacuums based on both the
"regular" bloat-orientated thresholds, and the newer insert-based
thresholds. It may be far from obvious which triggering condition
autovacuum.c must have applied to trigger any given autovacuum, since
that information isn't currently passed down to vacuumlazy.c. This
seems like a problem to me; how are users supposed to tune
autovacuum's thresholds without even basic feedback about how the
thresholds get applied?
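
For what it's worth, users can approximate the two calculations from
SQL today. This is just a sketch that only looks at the global GUCs
(it ignores per-table reloptions entirely):

SELECT c.relname,
       s.n_dead_tup,
       current_setting('autovacuum_vacuum_threshold')::bigint
         + current_setting('autovacuum_vacuum_scale_factor')::float8
           * c.reltuples AS dead_tuple_threshold,
       s.n_ins_since_vacuum,
       current_setting('autovacuum_vacuum_insert_threshold')::bigint
         + current_setting('autovacuum_vacuum_insert_scale_factor')::float8
           * c.reltuples AS insert_threshold
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid;

But that only tells you which thresholds a table is close to right now
-- it says nothing about which condition actually fired for a given
autovacuum, which is the feedback that's missing.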

Attached patch teaches autovacuum.c to pass the information down to
vacuumlazy.c, which includes the information in the autovacuum log.
The approach I've taken is very similar to the existing approach with
anti-wraparound autovacuum. It's pretty straightforward. Note that a
VACUUM that is an "automatic vacuum for inserted tuples" cannot also
be an antiwraparound autovacuum, nor can it also be a "regular"
autovacuum/VACUUM -- there are now 3 distinct "triggering conditions"
for autovacuum.

Although this patch doesn't touch antiwraparound autovacuums at all, I
will note in passing that I think that anti-wraparound autovacuums
should become just another triggering condition for autovacuum -- IMV
they shouldn't be special in *any* way. We'd still need to keep
antiwraparound's "cannot automatically cancel autovacuum worker"
behavior in some form, but that would become dynamic, a little like
the failsafe is today, and would trigger on its own timeline (probably
*much* later than we first trigger antiwraparound autovacuum). We'd
also need to decouple "aggressiveness" (the behaviors that we
associate with aggressive mode in vacuumlazy.c) from the condition of
the table/system when VACUUM first began -- those could all become
dynamic, too.

-- 
Peter Geoghegan


v1-0001-Show-when-autovacuum-is-insert-driven-in-log.patch
Description: Binary data


Remaining case where reltuples can become distorted across multiple VACUUM operations

2022-08-05 Thread Peter Geoghegan
My bugfix commit 74388a1a (which was pushed back in February) added
heuristics to VACUUM's reltuples calculation/estimate. This prevented
VACUUM from distorting our estimate of reltuples over time, across
successive VACUUM operations run against the same table. The problem
was that VACUUM could scan the same single heap page again and again,
while believing it saw a random sample each time. This eventually
leads to a pg_class.reltuples value that is based on the assumption
that every single heap page in the table is just like the heap page
that gets "sampled" again and again. This was always the last heap
page (due to implementation details related to the work done by commit
e8429082), which in practice tends to be a particularly poor
representation of the table's overall tuple density.

I have discovered a gap in these heuristics: there are remaining cases
where the percentage threshold doesn't prevent reltuples distortion as
intended. It can still happen with tables that are small enough that a
cutoff of 2% of rel_pages is less than a single page, yet still large
enough that vacuumlazy.c will consider it worth its while to skip some
pages using the visibility map. It will typically skip all but the
final heap page from the relation (same as the first time around).

Here is a test case that shows how this can still happen on HEAD (and
in Postgres 15):

regression=# create table foo(bar int);insert into foo select i from
generate_series(1, 1) i;
CREATE TABLE
INSERT 0 1

Now run vacuum verbose against the table several times:

regression=# vacuum verbose foo;
*** SNIP ***
regression=# vacuum verbose foo;

The first vacuum shows "tuples: 0 removed, 1 remain...", which is
correct. However, each subsequent vacuum verbose revises the estimate
downwards, eventually making pg_class.reltuples significantly
underestimate tuple density (same as the first time around).

Attached patch closes the remaining gap. With the patch applied,
the second and subsequent vacuum verbose operations from the test case
will show that reltuples is still 1 (it won't ever change). The
patch just extends an old behavior that was applied when scanned_pages
== 0 to cases where scanned_pages <= 1 (unless we happened to scan all
of the relation's pages, of course). It doesn't remove the original
test from commit 74388a1a, which still seems like a good idea to me.

-- 
Peter Geoghegan


v1-0001-Avoid-reltuples-distortion-in-very-small-tables.patch
Description: Binary data


Re: BTMaxItemSize seems to be subtly incorrect

2022-08-05 Thread Peter Geoghegan
On Fri, Aug 5, 2022 at 10:13 AM Peter Geoghegan  wrote:
> Update: I discovered that I can get the regression tests to fail (even
> on mainstream 64-bit platforms) by MAXALIGN()'ing the expression that
> we assign to state->maxpostingsize at the top of _bt_dedup_pass().

Looks like this was nothing more than a silly oversight with how the
macro was defined. As written, it would evaluate to the wrong thing at
the same point in nbtdedup.c, just because it was used in an
expression.

Pushed a fix for that just now.
-- 
Peter Geoghegan




Re: BTMaxItemSize seems to be subtly incorrect

2022-08-05 Thread Peter Geoghegan
On Thu, Aug 4, 2022 at 10:40 PM Peter Geoghegan  wrote:
> This very likely has something to do with the way nbtdedup.c uses
> BTMaxItemSize(), which apparently won't work on these 32-bit systems
> now.

Update: I discovered that I can get the regression tests to fail (even
on mainstream 64-bit platforms) by MAXALIGN()'ing the expression that
we assign to state->maxpostingsize at the top of _bt_dedup_pass().
This is surprising; it contradicts existing comments that explain that
the existing max is 1/6 of a page by choice, to get better space
utilization than the more natural cap of 1/3 of a page. It now looks
like that might have actually been strictly necessary, all along.

-- 
Peter Geoghegan




Re: BTMaxItemSize seems to be subtly incorrect

2022-08-04 Thread Peter Geoghegan
On Thu, Aug 4, 2022 at 10:25 PM Thomas Munro  wrote:
> FYI florican and lapwing showed:
>
> 2022-08-05 01:04:29.903 EDT [34485:5] FATAL:  deduplication failed to
> add heap tid to pending posting list
> 2022-08-05 01:04:29.903 EDT [34485:6] CONTEXT:  WAL redo at 0/49708D8
> for Btree/DEDUP: nintervals 4; blkref #0: rel 1663/16384/2674, blk 1

This very likely has something to do with the way nbtdedup.c uses
BTMaxItemSize(), which apparently won't work on these 32-bit systems
now.

I'll fix this tomorrow morning.

-- 
Peter Geoghegan




Re: BTMaxItemSize seems to be subtly incorrect

2022-08-04 Thread Peter Geoghegan
On Thu, Jun 9, 2022 at 11:20 AM Robert Haas  wrote:
> I think I'd feel more comfortable if you wrote the patch, if that's possible.

Okay, pushed a fix just now.

Thanks
-- 
Peter Geoghegan




Re: annoyance with .git-blame-ignore-revs

2022-08-04 Thread Peter Geoghegan
On Mon, Jul 11, 2022 at 12:30 PM Alvaro Herrera  wrote:
> $ git blame configure
> fatal: could not open object name list: .git-blame-ignore-revs
>
> My first workaround was to add empty .git-blame-ignore-revs in all
> checkouts.  This was moderately ok (shrug), until after a recent `tig`
> upgrade the empty file started to show up in the history as an untracked
> file.

Ping? Would be nice to get this done soon. I don't think that it
requires a great deal of care. If I was doing this myself, I would
probably make sure that the backbranch copies of the file won't
reference commits from later releases. But even that probably doesn't
matter; just backpatching the file from HEAD as-is wouldn't break
anybody's workflow.

Again, to reiterate: I see no reason to do anything on the
backbranches here more than once.

I mentioned already that somebody proposed a patch that fixes the
problem at the git level, which seems to have stalled. Here is the
discussion:

https://public-inbox.org/git/xmqq5ywehb69.fsf@gitster.g/T/

ISTM that we're working around what is actually a usability problem
with git (imagine that!). I think that that's fine. Just thought that
it was worth acknowledging it as such. We're certainly not the first
people to run into this exact annoyance.

-- 
Peter Geoghegan




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Peter Geoghegan
On Thu, Aug 4, 2022 at 12:31 PM Robert Haas  wrote:
> > What about autoanalyze?
>
> What about it?

It has a tendency to consume an XID, here or there, quite
unpredictably. I've noticed that this often involves an analyze of
pg_statistic. Have you accounted for that?

You said upthread that you don't like "fuzzy" tests, because it's too
easy for things to look like they're working when they really aren't.
I suppose that there may be some truth to that, but ISTM that there is
also a lot to be said for a test that can catch failures that weren't
specifically anticipated. Users won't be running pg_upgrade with
autovacuum disabled. And so ISTM that just testing that relfrozenxid
has been carried forward is more precise about one particular detail
(more precise than alternative approaches to testing), but less
precise about the thing that we actually care about.

-- 
Peter Geoghegan




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Peter Geoghegan
On Thu, Aug 4, 2022 at 12:15 PM Robert Haas  wrote:
> Given that the old cluster is suffering no new write
> transactions, there's probably exactly two values that are legal: one
> being the value from the old cluster, which we know, and the other
> being whatever a vacuum of that table would produce, which we don't
> know, although we do know that it's somewhere in that range.

What about autoanalyze?

-- 
Peter Geoghegan




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Peter Geoghegan
On Thu, Aug 4, 2022 at 11:07 AM Tom Lane  wrote:
> How much will that add to the test's runtime?  I could get behind this
> idea if it's not exorbitantly expensive.

I'm not sure offhand, but I suspect it wouldn't be too bad.

-- 
Peter Geoghegan




Re: pg15b2: large objects lost on upgrade

2022-08-04 Thread Peter Geoghegan
On Thu, Aug 4, 2022 at 9:44 AM Robert Haas  wrote:
> But if you don't want to do that, and you also don't want to have
> random failures, the only alternatives are weakening the check and
> removing the test. It's kind of hard to say which is better, but I'm
> inclined to think that if we just weaken the test we're going to think
> we've got coverage for this kind of problem when we really don't.

Perhaps amcheck's verify_heapam() function can be used here. What
could be better than exhaustively verifying that the relfrozenxid (and
relminmxid) invariants hold for every single tuple in the table? Those
are the exact conditions that we care about, as far as
relfrozenxid/relminmxid goes.

My sense is that that has a much better chance of detecting a real bug
at some point. This approach is arguably an example of property-based
testing.
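
Something along these lines (just a sketch -- pg_largeobject stands in
for whichever table the test cares about) could run at the end of the
test:

CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT * FROM verify_heapam('pg_largeobject');

verify_heapam() returns one row per problem it finds, so an empty
result set means that no tuple it examined violated the
relfrozenxid/relminmxid invariants (among its other checks).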

-- 
Peter Geoghegan




Re: pg15b2: large objects lost on upgrade

2022-08-03 Thread Peter Geoghegan
On Wed, Aug 3, 2022 at 1:47 PM Tom Lane  wrote:
> Again, this seems to me to be breaking the test's real-world applicability
> for a (false?) sense of stability.

I agree.

A lot of the VACUUM test flappiness issues we've had to deal with in
the past now seem like problems with VACUUM itself, the test's design,
or both. For example, why should we get a totally different
pg_class.reltuples because we couldn't get a cleanup lock on some
page? Why not just make sure to give the same answer either way,
which happens to be the most useful behavior to the user? That way
the test isn't just targeting implementation details.

--
Peter Geoghegan




Re: pg15b2: large objects lost on upgrade

2022-08-03 Thread Peter Geoghegan
On Wed, Aug 3, 2022 at 1:34 PM Tom Lane  wrote:
> That doesn't seem like it'd be all that thorough: we expect VACUUM
> to skip pages whenever possible.  I'm also a bit concerned about
> the expense, though admittedly this test is ridiculously expensive
> already.

I bet the SKIP_PAGES_THRESHOLD stuff will be enough to make VACUUM
visit every heap page in practice for a test case like this. That is
all it takes to be able to safely advance relfrozenxid to whatever the
oldest extant XID happened to be. However, I'm no fan of the
SKIP_PAGES_THRESHOLD behavior, and already have plans to get rid of it
-- so I wouldn't rely on that continuing to be true forever.

It's probably not really necessary to have that kind of coverage in
this particular test case. VACUUM will complain about weird
relfrozenxid values in a large variety of contexts, even without
assertions enabled. Mostly I was just saying: if we really do need
test coverage of relfrozenxid in this context, then VACUUM is probably
the way to go.

-- 
Peter Geoghegan




Re: pg15b2: large objects lost on upgrade

2022-08-03 Thread Peter Geoghegan
On Wed, Aug 3, 2022 at 1:20 PM Andres Freund  wrote:
> > I don't really like this approach. Imagine that the code got broken in
> > such a way that relfrozenxid and relminmxid were set to a value chosen
> > at random - say, the contents of 4 bytes of unallocated memory that
> > contained random garbage. Well, right now, the chances that this would
> > cause a test failure are nearly 100%. With this change, they'd be
> > nearly 0%.
>
> Can't that pretty easily be addressed by subsequently querying txid_current(),
> and checking that the value isn't newer than that?

It couldn't hurt to do that as well, in passing (at the same time as
testing that newrelfrozenxid >= oldrelfrozenxid directly). But
deliberately running VACUUM afterwards seems like a good idea. We
really ought to expect VACUUM to catch cases where
relfrozenxid/relminmxid is faulty, at least in cases where it can be
proven wrong by noticing some kind of inconsistency.
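
One way to write that kind of sanity check in SQL (a sketch; it leans
on age() and mxid_age() rather than comparing against txid_current()
directly), run against the new cluster right after the upgrade:

SELECT relname, relfrozenxid, relminmxid
FROM pg_class
WHERE relkind IN ('r', 'm', 't')
  AND (age(relfrozenxid) < 0 OR mxid_age(relminmxid) < 0);

age() and mxid_age() are measured relative to the next XID/MXID, so any
row returned here would indicate a relfrozenxid or relminmxid "from the
future" that was carried forward incorrectly.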

-- 
Peter Geoghegan




Re: pg15b2: large objects lost on upgrade

2022-08-03 Thread Peter Geoghegan
On Wed, Aug 3, 2022 at 6:59 AM Robert Haas  wrote:
> I don't really like this approach. Imagine that the code got broken in
> such a way that relfrozenxid and relminmxid were set to a value chosen
> at random - say, the contents of 4 bytes of unallocated memory that
> contained random garbage. Well, right now, the chances that this would
> cause a test failure are nearly 100%. With this change, they'd be
> nearly 0%.

If that kind of speculative bug existed, and somehow triggered before
the concurrent autovacuum ran (which seems very likely to be the
source of the test flappiness), then it would still be caught, most
likely. VACUUM itself has the following defenses:

* The defensive "can't happen" errors added to
heap_prepare_freeze_tuple() and related freezing routines by commit
699bf7d0 in 2017, as hardening following the "freeze the dead" bug.
That'll catch XIDs that are before the relfrozenxid at the start of
the VACUUM (ditto for MXIDs/relminmxid).

* The assertion added in my recent commit 0b018fab, which verifies
that we're about to set relfrozenxid to something sane.

* VACUUM now warns when it sees a *previous* relfrozenxid that's
apparently "in the future", following recent commit e83ebfe6. This
problem scenario is associated with several historic bugs in
pg_upgrade, where for one reason or another it failed to carry forward
correct relfrozenxid and/or relminmxid values for a table (see the
commit message for references to those old pg_upgrade bugs).

It might make sense to run a manual VACUUM right at the end of the
test, so that you reliably get this kind of coverage, even without
autovacuum.

-- 
Peter Geoghegan




Re: pg15b2: large objects lost on upgrade

2022-08-03 Thread Peter Geoghegan
On Tue, Aug 2, 2022 at 12:32 PM Tom Lane  wrote:
> Hmmm ... now that you mention it, I see nothing in 002_pg_upgrade.pl
> that attempts to turn off autovacuum on either the source server or
> the destination.  So one plausible theory is that autovac moved the
> numbers since we checked.

It's very easy to believe that my work in commit 0b018fab could make
that happen, which is only a few months old. It's now completely
routine for non-aggressive autovacuums to advance relfrozenxid by at
least a small amount.

For example, when autovacuum runs against either the tellers table or
the branches table during a pgbench run, it will now advance
relfrozenxid, every single time. And to a very recent value. This will
happen in spite of the fact that no freezing ever takes place -- it's
just a consequence of the oldest extant XID consistently being quite
young each time, due to workload characteristics.

-- 
Peter Geoghegan




effective_multixact_freeze_max_age issue

2022-08-02 Thread Peter Geoghegan
The effective_multixact_freeze_max_age mechanism added by commit
53bb309d2d forces aggressive VACUUMs to take place earlier, as
protection against wraparound of the MultiXact member space SLRU.
There was also a follow-up bugfix several years later -- commit
6bda2af039 -- which made sure that the MXID-wise cutoff used to
determine which MXIDs to freeze in vacuumlazy.c could never exceed
oldestMxact (VACUUM generally cannot freeze MultiXacts that are still
seen as running by somebody according to oldestMxact).

I would like to talk about making the
effective_multixact_freeze_max_age stuff more robust, particularly in
the presence of a long held snapshot that holds things up even as SLRU
space for MultiXact members dwindles. I have to admit that I always
found this part of vacuum_set_xid_limits() confusing. I suspect that
the problem has something to do with how we calculate vacuumlazy.c's
multiXactCutoff (as well as FreezeLimit): vacuum_set_xid_limits() just
subtracts a freezemin value from GetOldestMultiXactId(). This is
confusingly similar (though different in important ways) to the
handling for other related cutoffs that happens nearby. In particular,
we start from ReadNextMultiXactId() (not from GetOldestMultiXactId())
for the cutoff that determines if the VACUUM is going to be
aggressive. I think that this can be fixed -- see the attached patch.

Of course, it wouldn't be safe to allow vacuum_set_xid_limits() to
hand off a multiXactCutoff to vacuumlazy.c that is (for whatever
reason) less than GetOldestMultiXactId()/oldestMxact (the bug fixed by
6bda2af039 involved just such a scenario). But that doesn't seem like
much of a problem to me. We can just handle it directly, as needed.
The attached patch handles it as follows:

    /* Compute multiXactCutoff, being careful to generate a valid value */
    *multiXactCutoff = nextMXID - mxid_freezemin;
    if (*multiXactCutoff < FirstMultiXactId)
        *multiXactCutoff = FirstMultiXactId;
    /* multiXactCutoff must always be <= oldestMxact */
    if (MultiXactIdPrecedes(*oldestMxact, *multiXactCutoff))
        *multiXactCutoff = *oldestMxact;

That is, we only need to make sure that the "multiXactCutoff must
always be <= oldestMxact" invariant holds once, by checking for it
directly. The same thing happens with OldestXmin/FreezeLimit. That
seems like a simpler foundation. It's also a lot more logical. Why
should the cutoff for freezing be held back by a long running
transaction, except to the extent that it is strictly necessary to do
so to avoid wrong answers (wrong answers seen by the long running
transaction)?

This allows us to simplify the code that issues a WARNING about
oldestMxact/OldestXmin inside vacuum_set_xid_limits(). Why not
actually test oldestMxact/OldestXmin directly, without worrying about
the limits (multiXactCutoff/FreezeLimit)? That also seems more
logical; there is more to be concerned about than freezing being
blocked when OldestXmin gets very old. Though we still rely on the
autovacuum_freeze_max_age GUC to represent "a wildly unreasonable
number of XIDs for OldestXmin to be held back by", just because that's
still convenient.

-- 
Peter Geoghegan


v1-0001-Derive-VACUUM-s-FreezeLimit-from-next-transaction.patch
Description: Binary data


Re: Maximize page freezing

2022-07-29 Thread Peter Geoghegan
On Fri, Jul 29, 2022 at 5:55 AM Simon Riggs
 wrote:
> > I should be able to post something in a couple of weeks.
>
> How do you see that affecting this thread?

Well, it's clearly duplicative, at least in part. That in itself
doesn't mean much, but there are some general questions (that apply to
any variant of proactive/batched freezing), particularly around the
added overhead, and the question of whether or not we get to advance
relfrozenxid substantially in return for that cost. Those parts are
quite tricky.

I have every intention of addressing these thorny questions in my
upcoming patch set, which actually does far more than change the rules
about when and how we freeze -- changing the mechanism itself is very
much the easy part. I'm taking a holistic approach that involves
making an up-front decision about freezing strategy based on the
observed characteristics of the table, driven by what we see in the
visibility map at the start.

Similar questions will also apply to this patch, even though it isn't
as aggressive (your patch doesn't trigger freezing when a page is
about to be set all-visible in order to make sure that it can be set
all-frozen instead). You still want to give the user a clear benefit
for any added overhead. It needs a great deal of performance
validation, too.

-- 
Peter Geoghegan




Re: Maximize page freezing

2022-07-28 Thread Peter Geoghegan
On Thu, Jul 28, 2022 at 6:56 AM Matthias van de Meent
 wrote:
> Great idea, yet this patch seems to only freeze those tuples that are
> located after the first to-be-frozen tuple. It should probably
> re-visit earlier live tuples to potentially freeze those as well.

I have a big patch set pending that does this (which I dubbed
"page-level freezing"), plus a bunch of other things that control the
overhead. Although the basic idea of freezing all of the tuples on a
page together appears in earlier patching that were posted. These were
things that didn't make it into Postgres 15.

I should be able to post something in a couple of weeks.

-- 
Peter Geoghegan




Re: PANIC: wrong buffer passed to visibilitymap_clear

2022-07-22 Thread Peter Geoghegan
On Fri, Jul 22, 2022 at 1:22 AM 王伟(学弈)  wrote:
> I recently find this problem while testing PG14 with sysbench.

The line numbers from your stack trace don't match up with
REL_14_STABLE. Is this actually a fork of Postgres 14? (Oh, looks like
it's an old beta release.)

> Then I look through the emails from pgsql-hackers and find a previous 
> similary bug which is 
> https://www.postgresql.org/message-id/flat/2247102.1618008027%40sss.pgh.pa.us.
>  But the bugfix commit(34f581c39e97e2ea237255cf75cccebccc02d477) is already 
> patched to PG14.

It does seem possible that there is another similar bug somewhere --
another case where we were protected by the fact that VACUUM acquired
a full cleanup lock (not just an exclusive buffer lock) during its
second heap pass. That changed in Postgres 14 (commit 8523492d4e). But
I really don't know -- almost anything is possible.

> I'm wondering whether there's another code path to lead this problem 
> happened. Since, I take a deep dig via gdb which turns out that newbuffer is 
> not euqal to buffer. In other words, the function RelationGetBufferForTuple 
> must have been called just now.
> Besides, why didn't we re-check the flag after RelationGetBufferForTuple was 
> called?

Recheck what flag? And at what point? It's not easy to figure this out
from your stack trace, because of the line number issues.

It would also be helpful if you told us about the specific table
involved. Though the important thing (the essential thing) is to test
today's REL_14_STABLE. There have been *lots* of bug fixes since
Postgres 14 beta2 was current.

-- 
Peter Geoghegan




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-18 Thread Peter Geoghegan
On Mon, Jul 18, 2022 at 9:10 PM John Naylor
 wrote:
> On Tue, Jul 19, 2022 at 9:24 AM Andres Freund  wrote:
> > FWIW, I think the best path forward would be to do something similar to the
> > simplehash.h approach, so it can be customized to the specific user.
>
> I figured that would come up at some point. It may be worth doing in the 
> future, but I think it's way too much to ask for the first use case.

I have a prototype patch that creates a read-only snapshot of the
visibility map, and has vacuumlazy.c work off of that when determining
which pages to skip. The patch also gets rid of the
SKIP_PAGES_THRESHOLD stuff. This is very effective with TPC-C,
principally because it really cuts down on the number of scanned_pages
that are scanned only because the VM bit is unset concurrently by DML.
The window for this is very large when the table is large (and
naturally takes a long time to scan), resulting in many more "dead but
not yet removable" tuples being encountered than necessary. Which
itself causes bogus information in the FSM -- information about the
space that VACUUM could free from the page, which is often highly
misleading.

There are remaining questions about how to do this properly. Right now
I'm just copying pages from the VM into local memory, right after
OldestXmin is first acquired -- we "lock in" a snapshot of the VM at
the earliest opportunity, which is what lazy_scan_skip() actually
works off now. There needs to be some consideration given to the
resource management aspects of this -- it needs to use memory
sensibly, which the current prototype patch doesn't do at all. I'm
probably going to seriously pursue this as a project soon, and will
probably need some kind of data structure for the local copy. The raw
pages are usually quite space inefficient, considering we only need an
immutable snapshot of the VM.

I wonder if it makes sense to use this as part of this project. It
will be possible to know the exact heap pages that will become
scanned_pages before scanning even one page with this design (perhaps
with caveats about low memory conditions). It could also be very
effective as a way of speeding up TID lookups in the reasonably common
case where most scanned_pages don't have any LP_DEAD items -- just
look it up in our local/materialized copy of the VM first. But even
when LP_DEAD items are spread fairly evenly, it could still give us
reliable information about the distribution of LP_DEAD items very
early on.

Maybe the two data structures could even be combined in some way? You
can use more memory for the local copy of the VM if you know that you
won't need the memory for dead_items. It's kinda the same problem, in
a way.

-- 
Peter Geoghegan




Re: Rename some rel truncation related constants at the top of vacuumlazy.c

2022-07-18 Thread Peter Geoghegan
On Mon, Jul 18, 2022 at 8:55 PM Tom Lane  wrote:
> Um ... you seem to have removed some useful comments?

I don't think that the stuff about making them into a GUC is useful myself.

> Personally I wouldn't do this, as I don't think the renaming
> brings much benefit, and it will create a hazard for back-patching
> any fixes that might be needed in that code.  I'm not hugely upset
> about it, but that's the way I'd vote if asked.

In that case I withdraw the patch.

FWIW I wrote the patch during the course of work on new feature
development -- a patch that added a couple of similar constants a bit
further down. Seemed neater this way, but it's certainly not worth
arguing over.

-- 
Peter Geoghegan




Rename some rel truncation related constants at the top of vacuumlazy.c

2022-07-18 Thread Peter Geoghegan
I propose to rename some of the rel truncation related constants at
the top of vacuumlazy.c, per the attached patch. The patch
consolidates related constants into a single block/grouping, and
imposes a uniform naming scheme.

-- 
Peter Geoghegan


0001-vacuumlazy.c-rename-rel-truncation-constants.patch
Description: Binary data


Re: [PATCH v1] remove redundant check of item pointer

2022-07-14 Thread Peter Geoghegan
On Thu, Jul 14, 2022 at 3:59 PM Tom Lane  wrote:
> Even in an assert-enabled build, wouldn't you expect the compiler to
> optimize away the second assertion as unreachable code?

I think that it probably would, even at -O0 (GCC doesn't really allow
you to opt out of all optimizations). I did think of that myself, but
it seemed rather beside the point.

There have been individual cases where individual assertions were
deemed a bit too heavyweight. But those have been few and far between.
I myself tend to use *lots* of technically-redundant assertions like
this for preconditions and postconditions. At worst they're code
comments that are all but guaranteed to stay current.

-- 
Peter Geoghegan




Re: [PATCH v1] remove redundant check of item pointer

2022-07-14 Thread Peter Geoghegan
On Thu, Jul 14, 2022 at 3:31 PM Bruce Momjian  wrote:
> On Wed, Apr 27, 2022 at 08:04:00PM +0800, Junwang Zhao wrote:
> for non-Assert builds, ItemPointerGetOffsetNumberNoCheck() and
> ItemPointerGetOffsetNumber() are the same, so I don't see the point to
> making this change.  Frankly, I don't know why we even have two
> functions for this.  I am guessing ItemPointerGetOffsetNumberNoCheck is
> for cases where you have an Assert build and do not want the check.

Sometimes we use ItemPointerData for things that aren't actually TIDs.
For example, both GIN and B-Tree type-pun the ItemPointerData field
from the IndexTuple struct. Plus we do something like that with
UPDATEs that affect a partitioning key in a partitioned table.

The proposal doesn't seem like an improvement. Technically the
assertion cannot possibly fail here because the earlier assertion
would always fail instead, so strictly speaking it is redundant -- at
least right now. That is true. But it seems much more important to be
consistent about which variant to use. Especially because there is
obviously no overhead in builds without assertions enabled.

-- 
Peter Geoghegan




Re: annoyance with .git-blame-ignore-revs

2022-07-11 Thread Peter Geoghegan
On Mon, Jul 11, 2022 at 12:36 PM Tom Lane  wrote:
> Only if we had to update all those copies all the time.  But
> I'm guessing we wouldn't need a branch's copy to be newer than
> the last pgindent run affecting that branch?

I wouldn't care very much if the file itself was empty in the
backbranches, and remained that way -- that would at least suppress
annoying error messages on those branches (from my text editor's git
blame feature).

You might as well have the relevant commits when you backpatch, but
that's kind of not the point. At least to me. In any case I don't see
a need to maintain the file on the backbranches.

-- 
Peter Geoghegan




Re: annoyance with .git-blame-ignore-revs

2022-07-11 Thread Peter Geoghegan
On Mon, Jul 11, 2022 at 12:30 PM Alvaro Herrera  wrote:
> Anybody has any idea how to handle this better?
>
> A viable option would be to backpatch the addition of
> .git-blame-ignore-revs to all live branches.  Would that bother anyone?

+1. I was thinking of suggesting the same thing myself, for the same reasons.

This solution is a good start, but it does leave one remaining
problem: commits from before the introduction of
.git-blame-ignore-revs still won't have the file. There was actually a
patch for git that tried to address the problem directly, but it
didn't go anywhere. Maybe just backpatching the file is good enough.

-- 
Peter Geoghegan




Re: Add last_vacuum_index_scans in pg_stat_all_tables

2022-07-08 Thread Peter Geoghegan
On Fri, Jul 8, 2022 at 10:47 AM Alvaro Herrera  wrote:
> Saving some sort of history would be much more useful, but of course a
> lot more work.

I think that storing a certain amount of history would be very useful,
for lots of reasons. Not just for instrumentation purposes; I envisage
a design where VACUUM itself makes certain decisions based on the
history of each VACUUM operation against the table. The direction that
things have taken suggests a certain amount about the direction that
things are going in, which we should try to influence.

The simplest and best example of how this could help is probably
freezing, and freeze debt. Currently, the visibility map interacts
with vacuum_freeze_min_age in a way that allows unfrozen all-visible
pages to accumulate. These pages won't be frozen until the next
aggressive VACUUM. But there is no fixed relationship between the
number of XIDs consumed by the system (per unit of wallclock time) and
the number of unfrozen all-visible pages (over the same duration). So
we might end up having to freeze an absolutely enormous number of
pages in the eventual aggressive vacuum. We also might not -- it's
really hard to predict, for reasons that just don't make much sense.
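
For example, the size of that backlog can be observed directly with the
pg_visibility extension (a sketch; pgbench_accounts is just a stand-in
table name):

CREATE EXTENSION IF NOT EXISTS pg_visibility;
SELECT count(*) AS all_visible_not_frozen
FROM pg_visibility_map('pgbench_accounts')
WHERE all_visible AND NOT all_frozen;

Watching that count grow across successive VACUUM operations is exactly
the kind of history that could inform a decision to freeze some of
those pages proactively.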

There are a few things we could do here, but having a sense of history
seems like the important part. If (say) the table exceeds a certain
size, and the number of all-visible pages grows and grows (without any
freezing taking place), then we should "proactively" freeze at least
some of the unfrozen all-visible pages in earlier VACUUM operations.
In other words, we should (at the very least) spread out the burden of
freezing those pages over time, while being careful to not pay too
much more than we would with the old approach if and when the workload
characteristics change again.

More generally, I think that we should blur the distinction between
aggressive and non-aggressive autovacuum. Sure, we'd still need VACUUM
to "behave aggressively" in some sense, but that could all happen
dynamically, without committing to a particular course of action until
the last moment -- being able to change our minds at the last minute
can be very valuable, even though we probably won't change our minds
too often.

-- 
Peter Geoghegan




Re: explain analyze rows=%.0f

2022-07-07 Thread Peter Geoghegan
On Thu, Jul 7, 2022 at 1:21 PM Robert Haas  wrote:
> Nested Loop (actual time=TIME FOR THIS AND ALL CHILDREN rows=THE REAL
> ROW COUNT loops=1)
> -> Seq Scan on something (actual time=THE TIME IT REALLY TOOK rows=THE
> REAL ROW COUNT loops=1)
> -> Index Scan using someidx on somethingelse (actual time=NOT REALLY
> HOW LONG IT TOOK rows=NOT REALLY HOW MANY ROWS WE GOT loops=HUGE
> NUMBER)
>
> If I'm looking at this plan and trying to find out what's gone wrong,
> I want to know how much time got spent in the nested loop, how much
> time got spent in the Seq Scan, and how much time got spent in the
> Index Scan. It's easy to figure out how much time got spent in the Seq
> Scan, but to find out how much time got spent in the Index Scan, I
> have to multiply the time by the loop count.

I agree that this general state of affairs is very confusing, and
seems like something that we should still improve. Just because it's a
very old way of presenting the information doesn't mean that it's the
best one, or even a particularly good one. Plus you could probably
make some kind of concession in the direction of maintaining
compatibility with the current approach if you had to. Right?
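
For instance (made-up numbers): an index scan line that reads "actual
time=0.025..0.050 rows=1 loops=10000" really accounts for something
like 0.050 ms * 10000 = 500 ms of its parent's runtime, and about
10,000 rows in total -- none of which is visible without doing the
multiplication by hand.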

-- 
Peter Geoghegan




Re: [UNVERIFIED SENDER] Re: pg_upgrade can result in early wraparound on databases with high transaction load

2022-07-05 Thread Peter Geoghegan
On Tue, Jul 5, 2022 at 12:41 PM Peter Geoghegan  wrote:
> Actually, commit 74cf7d46 was where pg_resetxlog/pg_resetwal's -u
> argument was first added, for use by pg_upgrade. That commit is only
> about a year old, and was only backpatched to 9.6.

I just realized that this thread was where that work was first
discussed. That explains why it took a year to discover that we broke
8.4!

On further reflection I think that breaking pg_upgrade for 8.4 might
have been a good thing. The issue was fairly visible and obvious if
you actually ran into it, which is vastly preferable to what would
have happened before commit 74cf7d46.

-- 
Peter Geoghegan




Re: [UNVERIFIED SENDER] Re: pg_upgrade can result in early wraparound on databases with high transaction load

2022-07-05 Thread Peter Geoghegan
On Tue, Jul 5, 2022 at 11:53 AM Andrew Dunstan  wrote:
> > Sure enough, 8.4's pg_controldata doesn't print anything about
> > oldestXID, because that info wasn't there then.
> >
> > Given the lack of field complaints, it's probably not worth trying
> > to do anything to restore that capability.  But we really ought to
> > update pg_upgrade's code and docs in pre-v15 branches to say that
> > the minimum supported source version is 9.0.
>
>
> So it's taken us a year to discover the issue :-(

I'm not surprised at all, given the history here. There were at least
a couple of bugs affecting how pg_upgrade carries forward information
about these cutoffs. See commits 74cf7d46 and a61daa14.

Actually, commit 74cf7d46 was where pg_resetxlog/pg_resetwal's -u
argument was first added, for use by pg_upgrade. That commit is only
about a year old, and was only backpatched to 9.6. Unfortunately the
previous approach to carrying forward oldestXID was an accident that
usually worked. So...yeah, things are bad here. At least we now have
the ability to detect any downstream problems that this might cause by
using pg_amcheck.


--
Peter Geoghegan




Re: First draft of the PG 15 release notes

2022-07-05 Thread Peter Geoghegan
On Tue, Jul 5, 2022 at 11:09 AM Bruce Momjian  wrote:
>
> Actually, I was wrong.  I thought that we only mentioned that we
> computed a more aggressive xid, but now see I was mentioning the _frozen_
> xid.  Reading the commit, we do compute the multi-xid and store that too
> so I have updated the PG 15 release notes with the attached patch.

It might be worth using the "symbol names" directly, since they appear
in the documentation already (under "Routine Vacuuming"). These are
relfrozenxid and relminmxid. These are implementation
details, but they're documented in detail (though admittedly the
documentation has *lots* of problems).

Here is what I would like this item to hint at, to advanced users with
tricky requirements: The new approach to setting relminmxid will
improve the behavior of VACUUM in databases that already happen to use
lots of MultiXacts. These users will notice that autovacuum now works
off of relminmxid values that actually tell us something about each
table's consumption of MultiXacts over time. Most individual tables
naturally consume *zero* MultiXacts, even in databases that consume
many MultiXacts -- due to naturally occuring workload characteristics.
The old approach failed to recognize this, leading to very uniform
relminmxid values across tables that were in fact very different,
MultiXact-wise.
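
Once a table has been vacuumed on Postgres 15, that difference is easy
to see with a query along these lines (just a sketch):

SELECT relname, relminmxid, mxid_age(relminmxid)
FROM pg_class
WHERE relkind = 'r'
ORDER BY mxid_age(relminmxid) DESC;

Tables that consume few or no MultiXacts should now show a relminmxid
that is close to the MultiXact counter as of their last VACUUM, instead
of the near-uniform, lagging values that the old approach tended to
produce.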

The way that we handle relfrozenxid is probably much less likely to
make life much easier for any database, at least on its own, in
Postgres 15. So from the point of view of a user considering
upgrading, the impact on relminmxid is likely to be far more
important.

Admittedly the most likely scenario by far is that the whole feature
just isn't interesting, but a small minority of advanced users (users
with painful MultiXact problems) will find the relminmxid thing very
compelling.

-- 
Peter Geoghegan




Re: First draft of the PG 15 release notes

2022-07-02 Thread Peter Geoghegan
>
> Okay, thanks Bruce.
-- 
Peter Geoghegan


Re: AIX support - alignment issues

2022-07-02 Thread Peter Geoghegan
On Sat, Jul 2, 2022 at 12:22 PM Tom Lane  wrote:
> Now, it's certainly possible that AIX is the only surviving platform
> that hasn't adopted bug-compatible-with-glibc interpretations of
> POSIX.  But I think the standard is the standard, and we ought to
> stay within it.  So I find value in these fixes.

I imagine that there is strong evolutionary pressure pushing minority
platforms in the direction of bug-compatible-with-glibc. There is
definitely a similar trend around things like endianness and alignment
pickiness. But it wasn't always so.

It seems fair to wonder if AIX bucks the glibc-compatible trend
because it is already on the verge of extinction. If it wasn't just
about dead already then somebody would have gone to the trouble of
making it bug-compatible-with-glibc by now. (To be clear, I'm not
arguing that this is a good thing.)

Maybe it is still worth hanging on to AIX support for the time being,
but it would be nice to have some idea of where we *will* finally draw
the line. If the complaints from Andres aren't a good enough reason
now, then what other hypothetical reasons might be good enough in the
future? It seems fairly likely that Postgres desupporting AIX will
happen (say) at some time in the next decade, no matter what happens
today.

-- 
Peter Geoghegan




Re: AIX support - alignment issues

2022-07-02 Thread Peter Geoghegan
On Sat, Jul 2, 2022 at 11:34 AM Andres Freund  wrote:
> Personally I think we should just drop AIX. The amount of effort to keep it
> working is substantial due to being quite different from other unices ([2]),
> the is very outdated, the whole ecosystem is barely on lifesupport ([3]).
> And all of that for very little real world use.

I tend to agree about dropping AIX. But I wonder if there is an
argument against that proposal that doesn't rely on AIX being relevant
to at least one user. Has supporting AIX ever led to the discovery of
a bug that didn't just affect AIX? In other words, are AIX systems
peculiar in some particular way that clearly makes them more likely to
flush out a certain class of bugs? What is the best argument *against*
desupporting AIX that you know of?

Desupporting AIX doesn't mean that any AIX users will be left in the
lurch immediately. Obviously these users will be able to use a
supported version of Postgres for several more years.

-- 
Peter Geoghegan




Re: First draft of the PG 15 release notes

2022-07-01 Thread Peter Geoghegan
On Fri, Jul 1, 2022 at 9:41 AM Bruce Momjian  wrote:
> > It might be worth explaining the shift directly in the release notes.
> > The new approach is simpler and makes a lot more sense -- why should
> > the relfrozenxid be closely tied to freezing? We don't necessarily
>
> I don't think this is an appropriate detail for the release notes.

Okay. What about saying something about relminmxid advancement where
the database consumes lots of multixacts?

-- 
Peter Geoghegan




Re: doc: Clarify what "excluded" represents for INSERT ON CONFLICT

2022-07-01 Thread Peter Geoghegan
On Fri, Jul 1, 2022 at 6:01 AM Robert Haas  wrote:
> What would probably help more is adding something like this to the
> error message:
>
> HINT: column "b" could refer to any of these relations: "foo", "excluded"
>
> That could also help people who encounter this error in other
> situations. I'm not 100% sure this is a good idea, but I feel like it
> would have a much better chance of helping someone in this situation
> than the proposed doc patch.

I agree with everything you've said here, though I am 100% sure that
something like your proposed HINT would be a real usability win.

The "Perhaps you meant to reference the column" HINT displayed when
the user misspells a column is a surprisingly popular feature. I'm
aware of quite a few people singing its praises on Twitter, for
example. That hardly ever happens, even with things that we think of
as high-impact major features. So evidently users really value these
kinds of quality of life improvements.


--
Peter Geoghegan




Re: doc: Clarify what "excluded" represents for INSERT ON CONFLICT

2022-07-01 Thread Peter Geoghegan
On Fri, Jul 1, 2022 at 6:40 AM Tom Lane  wrote:
> Robert Haas  writes:
> > What would probably help more is adding something like this to the
> > error message:
> > HINT: column "b" could refer to any of these relations: "foo", "excluded"
>
> +1, that seems like it could be handy across the board.

The user *will* get a similar HINT if they happen to *also* spell the
would-be ambiguous column name slightly incorrectly:

ERROR:  column "barr" does not exist
LINE 1: ...lict (bar) do update set bar = excluded.bar where barr != 5;
 ^
HINT:  Perhaps you meant to reference the column "foo.bar" or the
column "excluded.bar".

-- 
Peter Geoghegan




Re: doc: Clarify what "excluded" represents for INSERT ON CONFLICT

2022-06-30 Thread Peter Geoghegan
On Thu, Jun 30, 2022 at 3:07 PM David G. Johnston
 wrote:
> Yes, and based on a single encounter I agree this doesn't seem like a broadly 
> encountered issue.  My takeaway from that eventually led to this proposal.  
> The "Other Person" who is complaining about the docs is one of the mentors on 
> the Discord server and works for one of the corporate contributors to the 
> community. (I suppose Discord is considered public so maybe this redaction is 
> unnecessary...)

My impression from reading this transcript is that the user was
confused as to why they needed to qualify the target table name in the
ON CONFLICT DO UPDATE's WHERE clause -- they didn't have to qualify it
in the targetlist that appears in "SET ... ", so why the need to do it
in the WHERE clause? This isn't something that upsert statements need
to do all that often, just because adding additional conditions to the
WHERE clause isn't usually necessary. That much makes sense to me -- I
*can* imagine how that could cause confusion.

If that interpretation is correct, then it's not clear what it should
mean for how the INSERT documentation describes EXCLUDED. EXCLUDED is
involved here, since EXCLUDED is the thing that creates the ambiguity,
but that seems almost incidental to me. This user would probably not
have been confused if they didn't need to use a WHERE clause (very
much the common case), even when expression evaluation involving
EXCLUDED in the SET was still used (also common).
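
To make that concrete, here's a minimal sketch of the sort of statement
I have in mind -- hypothetical table foo, with a unique column bar and
a second column val, nothing taken from the transcript itself:

-- the assignment target "val" in SET needs no qualification; it can
-- only ever mean a column of foo
insert into foo values (1, 'x')
  on conflict (bar) do update set val = excluded.val;

-- but in the WHERE clause an unqualified "val" would draw a
-- "column reference is ambiguous" error, since it could mean either
-- foo.val or excluded.val -- hence the need to write foo.val here
insert into foo values (1, 'x')
  on conflict (bar) do update set val = excluded.val
  where foo.val is distinct from excluded.val;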

--
Peter Geoghegan




Re: doc: Clarify what "excluded" represents for INSERT ON CONFLICT

2022-06-30 Thread Peter Geoghegan
On Thu, Jun 30, 2022 at 2:07 PM David G. Johnston
 wrote:
> Current:
> "The SET and WHERE clauses in ON CONFLICT DO UPDATE have access to the
> existing row using the table's name (or an alias), and to [rows] proposed
> for insertion using the special excluded table."
>
> The word table in that sentence is wrong and not a useful way to think of the 
> thing which we've named excluded.  It is a single value of a composite type 
> having the structure of the named table.

I think that your reasoning is correct, but I don't agree with your
conclusion. The term "special excluded table" is a fudge, but that
isn't necessarily a bad thing. Sure, we could add something about the
UPDATE being similar to an UPDATE with a self-join, as I said
upthread. But I think that that would make the concept harder to
grasp.

> I'll agree that most people will mentally paper over the difference and go 
> merrily on their way.  At least one person recently did not do that, which 
> prompted an email to the community

Can you provide a reference for this? Didn't see anything like that in
the reference you gave upthread.

I have a hard time imagining a user that reads the INSERT docs and
imagines that "excluded" is a relation that is visible to the query in
ways that are not limited to expression evaluation for the UPDATE's
WHERE/SET. The way that it works (and doesn't work) follows naturally
from what a user would want to do in order to upsert. MySQL's INSERT
... ON DUPLICATE KEY UPDATE feature has a magical UPSERT-only
expression instead of "excluded".

-- 
Peter Geoghegan




Re: doc: Clarify what "excluded" represents for INSERT ON CONFLICT

2022-06-30 Thread Peter Geoghegan
On Thu, Jun 30, 2022 at 1:43 PM Robert Haas  wrote:
> rhaas=# insert into foo values (1, 'frob') on conflict (a) do update
> set b = (select b || 'nitz' from excluded);
> ERROR:  relation "excluded" does not exist
> LINE 1: ...ct (a) do update set b = (select b || 'nitz' from excluded);
>
> I do find that a bit of a curious error message, because that relation
> clearly DOES exist in the range table.

Let's say that we supported this syntax. I see some problems with that
(you may have thought of these already). Thinking of "excluded" as a
separate table creates some very thorny questions, such as:

How would the user be expected to handle the case where there was more
than a single "row" in "excluded"? How could the implementation know
what the contents of the "excluded table" were ahead of time in the
multi-row-insert case? We'd have to know about *all* of the conflicts
for all rows proposed for insertion to do this, which seems impossible
in the general case -- because some of those conflicts won't have
happened yet, and cannot be predicted.

The "excluded" pseudo-table is conceptually similar to an from_item
alias used within an UPDATE  FROM ... where the target table is
also the from_item table (i.e. there is a self-join). There is only
one table involved.
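
To sketch the analogy (hypothetical names -- foo with a unique column
bar and a column val -- and purely to show the two-names-one-table
shape, not to claim the statements are equivalent):

-- an ordinary (and deliberately trivial) UPDATE with a self-join:
-- one table, two names in scope
update foo
   set val = f2.val
  from foo as f2
 where foo.bar = f2.bar;

-- ON CONFLICT DO UPDATE is similar in that one respect: one table, two
-- names -- "foo" for the existing row, "excluded" for the row proposed
-- for insertion
insert into foo values (1, 'x')
  on conflict (bar) do update set val = excluded.val;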

-- 
Peter Geoghegan




Re: vacuum verbose no longer reveals anything about pins

2022-06-30 Thread Peter Geoghegan
On Thu, Jun 30, 2022 at 8:43 AM Robert Haas  wrote:
> Ah, I missed that. I think that in the test case I was using, there
> was a conflicting pin but there were no dead tuples, so that line
> wasn't present in the output.

Even if there was a DEAD tuple, your test would still have to account
for opportunistic pruning before the cursor acquires its blocking pin
(I'm guessing that you're using a cursor for this).  The earlier
pruning could turn the DEAD tuple into either an LP_UNUSED item (in
the case of a heap-only tuple) or an LP_DEAD stub line pointer.

As I said, any existing LP_DEAD items will get put in the dead_items
array by lazy_scan_noprune, very much like it would in the
cleanup-lock-acquired/lazy_scan_prune case.

> Maybe nothing. I just thought you'd completely removed all reporting
> on this, and I'm glad that's not so.

The important idea behind all this is that VACUUM now uses a slightly
more abstract definition of scanned_pages. This is far easier to work
with at a high level, especially in the instrumentation code used by
VACUUM VERBOSE.

You may have also noticed that VACUUM VERBOSE/autovacuum logging
actually shows the number of scanned pages in Postgres 15, for the
first time -- even though that's very basic information. The revised
definition of scanned_pages enabled that enhancement. We are no longer
encumbered by the existence of a no-cleanup-lock-page special case.

-- 
Peter Geoghegan




Re: vacuum verbose no longer reveals anything about pins

2022-06-30 Thread Peter Geoghegan
On Thu, Jun 30, 2022 at 5:57 AM Robert Haas  wrote:
> I was dismayed to learn that VACUUM VERBOSE on a table no longer tells
> you anything about whether any pages were skipped due to pins.

VACUUM VERBOSE will show a dedicated line that reports on the number
of pages that we couldn't get a cleanup lock on, if and only if we
couldn't do useful work as a result. In practice this means pages that
had one or more fully DEAD tuples that couldn't be removed due to our
inability to prune. In my view this is strictly better than reporting
on the number of "skipped due to pins" pages.

In the case of any pages that we couldn't get a cleanup lock on that
didn't have any DEAD tuples (pages that are not reported on at all),
VACUUM hasn't missed any work whatsoever. It even does heap vacuuming,
which doesn't require a cleanup lock in either the first or the second
heap pass. What's the problem with not reporting on that?

-- 
Peter Geoghegan




Re: First draft of the PG 15 release notes

2022-06-28 Thread Peter Geoghegan
On Tue, Jun 28, 2022 at 1:35 PM Bruce Momjian  wrote:
> Okay, text updated, thanks.  Applied patch attached.

I have some notes on these items:

1. "Allow vacuum to be more aggressive in setting the oldest frozenxid
(Peter Geoghegan)"

2. "Add additional information to VACUUM VERBOSE and autovacuum
logging messages (Peter Geoghegan)"

The main enhancement to VACUUM for Postgres 15 was item 1, which
taught VACUUM to dynamically track the oldest remaining XID (and the
oldest remaining MXID) that will remain in the table at the end of the
same VACUUM operation. These final/oldest XID/MXID values are what we
now use to set relfrozenxid and relminmxid in pg_class. Previously we
just set relfrozenxid/relminmxid to whatever XID/MXID value was used
to determine which XIDs/MXIDs needed to be frozen. These values often
indicated more about VACUUM implementation details (like the
vacuum_freeze_min_age GUC's value) than the actual contents of the
table at the end of the most recent VACUUM.

It might be worth explaining the shift directly in the release notes.
The new approach is simpler and makes a lot more sense -- why should
the relfrozenxid be closely tied to freezing? We don't necessarily
have to freeze any tuple to advance relfrozenxid right up to the
removable cutoff/OldestXmin used by VACUUM. For example,
anti-wraparound VACUUMs that run against static tables now set
relfrozenxid/relminmxid to VACUUM's removable cutoff/OldestXmin
directly, without freezing anything (after the first time). Same with
tables that happen to have every row deleted -- only the actual
unfrozen XIDs/MXIDs left in the table matter, and if there happen to
be none at all then we can use the same relfrozenxid as we would for
a CREATE TABLE. All depends on what the workload allows.

There will also be a real practical benefit for users that allocate a
lot of MultiXactIds: We'll now have pg_class.relminmxid values that
are much more reliable indicators of what is really going on in the
table, MultiXactId-wise. I expect that this will make it much less
likely that anti-wraparound VACUUMs will run needlessly against the
largest tables, where there probably wasn't ever one single
MultiXactId. In other words, the implementation will have more
accurate information at the level of each table, and won't have to
treat every table as if it consumed MultiXacts at the same rate.

I think that very uneven consumption of MultiXactIds at the table
level is probably common in real databases. Plus VACUUM can usually
remove a non-running MultiXact from a tuple's xmax, regardless of
whether or not the mxid happens to be before the
vacuum_multixact_freeze_min_age-based MXID cutoff -- VACUUM has
always just set xmax to InvalidTransactionId in passing when it's possible to do so
easily. MultiXacts are inherently pretty short-lived information about
row lockers at a point in time. We don't really need to keep them
around for very long. We may now be able to truncate the two MultiXact
related SLRUs much more frequently with some workloads.

Finally, note that the new VACUUM VERBOSE output (which is now pretty
much the same as the autovacuum log output) shows when and how
relfrozenxid/relminmxid have advanced. This should make it relatively
easy to observe these effects where they exist.
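
For anybody that wants to watch this from SQL, a query along these
lines (plain catalog columns, nothing new in 15) shows how far each
table's relfrozenxid/relminmxid lag behind:

select relname,
       age(relfrozenxid)    as xid_age,
       mxid_age(relminmxid) as mxid_age
  from pg_class
 where relkind in ('r', 'm', 't')  -- tables, matviews, TOAST tables
 order by age(relfrozenxid) desc
 limit 10;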

Thanks

--
Peter Geoghegan




Re: Separate the attribute physical order from logical order

2022-06-28 Thread Peter Geoghegan
On Tue, Jun 28, 2022 at 11:47 AM Tom Lane  wrote:
> Now you do need something that will make the three meanings different
> in order to test that step.  But I'd suggest some bit of throwaway code
> that just assigns randomly different logical and physical orders.

That seems like a good idea. Might also make sense to make the
behavior configurable via a developer-only GUC, to enable exhaustive
tests that use every possible permutation of physical/logical mappings
for a given table.

Perhaps the random behavior itself should work by selecting a value
for the GUC at various key points via a PRNG. During CREATE TABLE, for
example. This approach could make it easier to reproduce failures on the
buildfarm.

--
Peter Geoghegan




Re: do only critical work during single-user vacuum?

2022-06-27 Thread Peter Geoghegan
On Mon, Jun 27, 2022 at 12:36 PM Justin Pryzby  wrote:
> By chance, I came across this prior thread which advocated the same
> thing initially (rather than indirectly as in this year's thread).

Revisiting this topic reminded me that PostgreSQL 14 (the first
version that had the wraparound failsafe mechanism controlled by
vacuum_failsafe_age) has been a stable release for 9 months now. As of
today I am still not aware of even one user that ran into the failsafe
mechanism in production. It might well have happened by now, of
course, but I am not aware of any specific case. Perhaps this will
change soon enough -- maybe somebody else will read this and enlighten
me.

To me the fact that the failsafe seems to seldom kick in in practice
suggests something about workload characteristics in general: that it
isn't all that common for users to try to get away with putting off
freezing until a table attains an age that is significantly above 1
billion XIDs.
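
(The check itself is trivial, for what it's worth -- something like the
following, with the per-database ages compared against
vacuum_failsafe_age, which defaults to 1.6 billion:)

select datname,
       age(datfrozenxid)    as xid_age,
       mxid_age(datminmxid) as mxid_age
  from pg_database
 order by age(datfrozenxid) desc;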

When people talk about things like 64-bit XIDs, I tend to wonder: if 2
billion XIDs wasn't enough, why should 4 billion or 8 billion be
enough? *Maybe* the system can do better by getting even further into
debt than it can today, but you can't expect to avoid freezing
altogether (without significant work elsewhere). My general sense is
that freezing isn't a particularly good thing to try to do lazily --
even if we ignore the risk of an eventual wraparound failure.

-- 
Peter Geoghegan




Re: some aspects of our qsort might not be ideal

2022-06-23 Thread Peter Geoghegan
On Thu, Jun 23, 2022 at 7:51 AM Matthias van de Meent
 wrote:
> I think that mostly has to do with reliability / stability of ORDER BY
> in combination with LIMIT and OFFSET, even though right now we cannot
> fully guarantee such stability due to unstable results from underlying
> plan nodes.

The current qsort isn't stable. While quicksort is generally not
stable, our particular implementation has fairly significant
optimizations that strongly rely on not using a stable sort -- in
particular, the Bentley & McIlroy optimizations for duplicate elements.

These optimizations make things like datum tuplesorts for
'SELECT COUNT(DISTINCT mycol) ...' style queries on low cardinality columns
extremely fast. We're not really sorting so much as bucketing. This is
based on Dijkstra's Dutch national flag problem.

-- 
Peter Geoghegan




Re: Devel docs on website reloading

2022-06-22 Thread Peter Geoghegan
On Wed, Jun 22, 2022 at 8:45 AM Bruce Momjian  wrote:
> That is a big help for committers who want to email a URL of the new
> doc commit output.

Yes, it's a nice "quality of life" improvement.

Thanks for finally getting this done!
-- 
Peter Geoghegan




Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-06-16 Thread Peter Geoghegan
On Thu, Jun 16, 2022 at 7:15 PM David Rowley  wrote:
> So, there appears to be no performance regression due to the extra
> indirection. There's maybe even some gains due to the smaller step
> size.

Have you tried this with the insert benchmark [1]?

I've run it myself in the past (when working on B-Tree deduplication).
It's quite straightforward to set up and run.

[1] http://smalldatum.blogspot.com/2017/06/the-insert-benchmark.html
-- 
Peter Geoghegan




Re: better page-level checksums

2022-06-15 Thread Peter Geoghegan
On Wed, Jun 15, 2022 at 1:27 PM Robert Haas  wrote:
> I think what will happen, depending on
> the encryption mode, is probably that either (a) the page will decrypt
> to complete garbage or (b) the page will fail some kind of
> verification and you won't be able to decrypt it at all. Either way,
> you won't be able to infer anything about what caused the problem. All
> you'll know is that something is wrong. That sucks - a lot - and I
> don't have a lot of good ideas as to what can be done about it. The
> idea that an encrypted page is unintelligible and that small changes
> to either the encrypted or unencrypted data should result in large
> changes to the other is intrinsic to the nature of encryption. It's
> more or less un-debuggable by design.

It's pretty clear that there must be a lot of truth to that. But that
doesn't mean that there aren't meaningful gradations beyond that.

I think that it's worth doing the following exercise (humor me): Why
wouldn't it be okay to just encrypt the tuple space and the line
pointer array, leaving both the page header and page special area
unencrypted? What kind of user would find that trade-off to be
unacceptable, and why? What's the nuance of it?

For all I know you're right (about encrypting the whole page, metadata
and all). I just want to know why that is. I understand that this
whole area is one where in general we may have to live with a certain
amount of uncertainty about what really matters.

> With extended checksums, I don't think the issues are anywhere near as
> bad. I'm not deeply opposed to setting a page-level flag but I expect
> nominal benefits.

I also expect only a small benefit. But that isn't a particularly
important factor in my mind.

Let's suppose that it turns out to be significantly more useful than
we originally expected, for whatever reason. Assuming all that, what
else can be said about it now? Isn't it now *relatively* likely that
including that status bit metadata will be *extremely* valuable, and
not merely somewhat more valuable?

I guess it doesn't matter much now (since you have all but conceded
that using a bit for this makes sense), but FWIW that's the main
reason why I almost took it for granted that we'd need to use a status
bit (or bits) for this.

-- 
Peter Geoghegan




Re: better page-level checksums

2022-06-14 Thread Peter Geoghegan
On Tue, Jun 14, 2022 at 7:21 PM Robert Haas  wrote:
> On Tue, Jun 14, 2022 at 9:56 PM Peter Geoghegan  wrote:
> > Technically we don't already do that today, with the 16-bit checksums
> > that are stored in PageHeaderData.pd_checksum. But we do something
> > equivalent: low-level tools can still infer that checksums must not be
> > enabled on the page (really the cluster) indirectly in the event of a
> > 0 checksum. A 0 value can reasonably be interpreted as a page from a
> > cluster without checksums (barring page corruption). This is basically
> > reasonable because our implementation of checksums is guaranteed to
> > not generate 0 as a valid checksum value.
>
> I don't think that 'pg_checksums -d' zeroes the checksum values on the
> pages in the cluster.

Obviously there are limitations on when and how we can infer something
about the whole cluster based on one single page image -- it all
depends on the context. I'm only arguing that we ought to make this
kind of analysis as easy as we reasonably can. I just don't see any
downside to having a status bit per checksum or encryption algorithm
at the page level, and plenty of upside (especially in the event of
bugs).

This seems like the absolute bare minimum to me, and I'm genuinely
surprised that there is even a question about whether or not we should
do that much.

-- 
Peter Geoghegan




Re: better page-level checksums

2022-06-14 Thread Peter Geoghegan
On Tue, Jun 14, 2022 at 7:17 PM Robert Haas  wrote:
> But it seems
> absolutely clear that our goal ought to be to leak as little
> information as possible.

But at what cost?

Basically I think that this is giving up rather a lot. For example,
isn't it possible that we'd have corruption that could be caused by a
bug in either the checksum code or in recovery?

I'd feel a lot better about it if there was some sense of both the
costs and the benefits.

> > Let's assume for now that we don't leave pd_flags unencrypted, as you
> > have suggested. We're still discussing new approaches to checksumming
> > in the scope of this work, which of course includes many individual
> > cases that don't involve any encryption. Plus even with encryption
> > there are things like defensive assertions that can be added by using
> > a flag bit for this.
>
> True. I don't think we should be too profligate with those bits just
> in case somebody needs a bunch of them for something important in the
> future, but it's probably fine to use up one or two.

Sure, but how many could possibly be needed for this? I can't see it
being more than 2 or 3. Which seems absolutely fine. They *definitely*
have no value if nobody ever uses them for anything.

-- 
Peter Geoghegan




Re: better page-level checksums

2022-06-14 Thread Peter Geoghegan
On Tue, Jun 14, 2022 at 1:32 PM Peter Geoghegan  wrote:
> On Tue, Jun 14, 2022 at 1:22 PM Robert Haas  wrote:
> > I still am not clear on precisely what you are proposing here. I do
> > agree that there is significant bit space available in pd_flags and
> > that consuming some of it wouldn't be stupid, but that doesn't add up
> > to a proposal. Maybe the proposal is: figure out how many different
> > configurations there are for this new kind of page space, let's say N,
> > and then reserve ceil(log2(N)) bits from pd_flags to indicate which
> > one we've got.
>
> I'm just making a general point. Why wouldn't we start out with the
> assumption that we use some pd_flags bit space for this stuff?

Technically we don't already do that today, with the 16-bit checksums
that are stored in PageHeaderData.pd_checksum. But we do something
equivalent: low-level tools can still infer that checksums must not be
enabled on the page (really the cluster) indirectly in the event of a
0 checksum. A 0 value can reasonably be interpreted as a page from a
cluster without checksums (barring page corruption). This is basically
reasonable because our implementation of checksums is guaranteed to
not generate 0 as a valid checksum value.

While pg_filedump does not rely on the 0 checksum convention
currently, it doesn't really need to. When the user uses the -k option
to verify checksums in passing, pg_filedump can assume that checksums
must be enabled ("the user said they must be so expect it" is a
reasonable assumption at that point). This also depends on there being
only one approach to checksums.

-- 
Peter Geoghegan




Re: better page-level checksums

2022-06-14 Thread Peter Geoghegan
On Tue, Jun 14, 2022 at 1:22 PM Robert Haas  wrote:
> I still am not clear on precisely what you are proposing here. I do
> agree that there is significant bit space available in pd_flags and
> that consuming some of it wouldn't be stupid, but that doesn't add up
> to a proposal. Maybe the proposal is: figure out how many different
> configurations there are for this new kind of page space, let's say N,
> and then reserve ceil(log2(N)) bits from pd_flags to indicate which
> one we've got.

I'm just making a general point. Why wouldn't we start out with the
assumption that we use some pd_flags bit space for this stuff?

> One possible problem with this is that, if the page is actually
> encrypted, we might want pd_flags to also be encrypted. The existing
> contents of pd_flags disclose some information about the tuples that
> are on the page, so having them exposed to prying eyes does not seem
> appealing.

I'm skeptical of the idea that we want to avoid leaving any metadata
unencrypted. But I'm not an expert on TDE, and don't want to say too
much about it without having done some more research. I would like to
see some justification for just encrypting everything on the page
without concern for the loss of debuggability, though. What is the
underlying theory behind that particular decision? Are there any
examples that we can draw from, from other systems or published
designs?

Let's assume for now that we don't leave pd_flags unencrypted, as you
have suggested. We're still discussing new approaches to checksumming
in the scope of this work, which of course includes many individual
cases that don't involve any encryption. Plus even with encryption
there are things like defensive assertions that can be added by using
a flag bit for this.

-- 
Peter Geoghegan




Re: better page-level checksums

2022-06-14 Thread Peter Geoghegan
On Tue, Jun 14, 2022 at 12:13 PM Robert Haas  wrote:
> Peter, unless I have missed something, this email is the very first
> one where you or anyone else have said anything at all about a PD_*
> bit. Even here, it's not very clear exactly what you are proposing.
> Therefore I have neither said anything bad about it in the past, nor
> can I now answer the question as to what is "so bad about it." If you
> want to make a concrete proposal, I will be happy to tell you what I
> think about it.

I am proposing that we not commit ourselves to relying on implicit
information about what must be true for every page in the cluster.
Just having a little additional page-header metadata (in pd_flags)
would accomplish that much, and wouldn't in itself impose any real
overhead on TDE.

It's not like the PageHeaderData.pd_flags bits are already a precious
commodity, in the same way as the heap tuple infomask status bits are.
We can afford to use some of them for this purpose, and then some.

Why wouldn't we do it that way, just on general principle?

You may still find it useful to rely on high level context at the
level of code that runs on the server, perhaps for performance reasons
(though it's unclear how much it matters). In which case the status
bit is technically redundant information as far as the code is
concerned. That may well be fine.

--
Peter Geoghegan




Re: better page-level checksums

2022-06-14 Thread Peter Geoghegan
On Tue, Jun 14, 2022 at 11:52 AM Robert Haas  wrote:
> > Even within TDE, it might be okay to assume that it's a feature that
> > the user must commit to using for a whole cluster at initdb time. What
> > isn't okay is committing to that assumption now and forever, by
> > leaving the door open to a world in which that assumption no longer
> > holds. Like when you do finally get around to making TDE something
> > that can work at the relation level, for example. Even if there is
> > only a small chance of that ever happening, why wouldn't we be
> > prepared for it, just on general principle?
>
> To the extent that we can leave ourselves room to do new things in the
> future without incurring unreasonable costs in the present, I'm in
> favor of that, as I believe anyone would be. But as you say, a lot
> depends on the specifics. Theoretical flexibility that can only be
> used in practice by really slow code doesn't help anybody.

A tool like pg_filedump or a backup tool can easily afford this
overhead. The only cost that TDE has to pay for this added flexibility
is that it has to set one of the PD_* bits in a code path that is
already bound to be very expensive. What's so bad about that?

Honestly, I'm a bit surprised that you're pushing back on this
particular point. A nonce for TDE is just something that code in
places like bufpage.h ought to know about. It has to be negotiated at
that level, because it will in fact affect a lot of callers to the
bufpage.h functions.

-- 
Peter Geoghegan




Re: better page-level checksums

2022-06-14 Thread Peter Geoghegan
On Tue, Jun 14, 2022 at 11:14 AM Robert Haas  wrote:
> We can have anything we want here, but we can't have everything we
> want at the same time. There are irreducible engineering trade-offs
> here. If all pages in a given cluster are the same, backends can
> compute the values of things that are currently compile-time constants
> upon startup and continue to use them for the lifetime of the backend.
> If pages can vary, some encrypted or checksummed and others not, then
> you have to recompute those values for every page. That's bound to
> have some cost. It is also more flexible.

Maybe not -- it depends on the particulars of the code. For example,
it might be okay for the B-Tree code to assume that B-Tree pages have
a special area at a known fixed offset, determined at compile time. At
the same time, it might very well not be okay for a backup tool to
make any such assumption, because it doesn't have the same context.

Even within TDE, it might be okay to assume that it's a feature that
the user must commit to using for a whole cluster at initdb time. What
isn't okay is committing to that assumption now and forever, by
leaving the door open to a world in which that assumption no longer
holds. Like when you do finally get around to making TDE something
that can work at the relation level, for example. Even if there is
only a small chance of that ever happening, why wouldn't we be
prepared for it, just on general principle?

-- 
Peter Geoghegan




Re: better page-level checksums

2022-06-14 Thread Peter Geoghegan
On Tue, Jun 14, 2022 at 10:43 AM Robert Haas  wrote:
> Hmm, but on the other hand, if you imagine a scenario in which the
> "storage system extra blob" is actually a nonce for TDE, you need to
> be able to find it before you've decrypted the rest of the page. If
> pd_checksum gives you the offset of that data, you need to exclude it
> from what gets encrypted, which means that you need encrypt three
> separate non-contiguous areas of the page whose combined size is
> unlikely to be a multiple of the encryption algorithm's block size.
> That kind of sucks (and putting it at the end of the page makes it way
> better).

I don't have a great understanding of how that cost will be felt in
detail right now, because I don't know enough about the project and
the requirements for TDE in general.

> That said, I certainly agree that finding the special space needs to
> be fast. The question in my mind is HOW fast it needs to be, and what
> techniques we might be able to use to dodge the problem. For instance,
> suppose that, during the startup sequence, we look at the control
> file, figure out the size of the 'storage system extra blob', and
> based on that each AM figures out the byte-offset of its special space
> and caches that in a global variable. Then, instead of
> PageGetSpecialSpace(page) it does PageGetBtreeSpecialSpace(page) or
> whatever, where the implementation is ((char*) page) +
> the_afformentioned_global_variable. Is that going to be too slow?

Who knows? For now the important point is that there is a tension
between the requirements of TDE, and the requirements of access
methods (especially index access methods). It's possible that this
will turn out not to be much of a problem. But the burden of proof is
yours. Making a big change to the on-disk format like this (a change
that affects every access method) should be held to an exceptionally
high standard.

There are bound to be tacit or even explicit assumptions made by
access methods that you risk breaking here. The reality is that all of
the access method code evolved in an environment where the special
space size was constant and generic for a given BLCKSZ. I don't have
much sympathy for any suggestion that code written 20 years ago should
have known not to make these assumptions. I have a lot more sympathy
for the idea that it's a general problem with our infrastructure
(particularly code in bufpage.c and the delicate assumptions made by
its callers) -- a problem that is worth addressing with a broad
solution that enables lots of different work.

We don't necessarily get another shot at this if we get it wrong now.

> There's going to have to be some compromise here. On the one hand
> you're going to have people who want to be able to do run-time
> conversions between page formats even at the cost of extra runtime
> overhead on top of what the basic feature necessarily implies. On the
> other hand you're going to have people who don't think any overhead at
> all is acceptable, even if it's purely nominal and only visible on a
> microbenchmark. Such arguments can easily become holy wars.

How many times has a change to the on-disk format of this magnitude
taken place, post-pg_upgrade? I would argue that this would
be the first, since it is the moral equivalent of extending the size
of the generic page header.

For all I know the overhead will be perfectly fine, and everybody
wins. I just want to be adamant that we're making the right
trade-offs, and maximizing the benefit from any new cost imposed on
access method code.

-- 
Peter Geoghegan




Re: better page-level checksums

2022-06-14 Thread Peter Geoghegan
On Tue, Jun 14, 2022 at 9:26 AM Tom Lane  wrote:
> It's been some years since I had much to do with pg_filedump, but
> my recollection is that the size of the special space is only one
> part of its heuristics, because there already *are* collisions.

Right, there are collisions even today. The heuristics are kludgey,
but they work perfectly in practice. That's not just due to luck --
it's due to people making sure that they continued to work over time.

> Moreover, there already are per-AM magic numbers in there that
> it uses to resolve those cases.  They're not at the front though.
> Nobody has ever wanted to break on-disk compatibility just to make
> pg_filedump's page-type identification less klugy, so I find it
> hard to believe that the above suggestion isn't a non-starter.

There is no doubt that it's not worth breaking on-disk compatibility
just for pg_filedump. The important principle here is that
high-context page formats are bad, and should be avoided whenever
possible.

Why isn't it possible to avoid it here? We have all the bits we need
for it in the page header, and then some. Why should we assume that
it'll never be useful to apply encryption selectively, perhaps at the
relation level?

-- 
Peter Geoghegan




Re: better page-level checksums

2022-06-14 Thread Peter Geoghegan
On Tue, Jun 14, 2022 at 8:48 AM Robert Haas  wrote:
> On Mon, Jun 13, 2022 at 6:26 PM Peter Geoghegan  wrote:
> > Anyway, I can see how it would be useful to be able to know the offset
> > of a nonce or of a hash digest on any given page, without access to a
> > running server. But why shouldn't that be possible with other designs,
> > including designs closer to what I've outlined?
>
> I don't know what you mean by this. As far as I'm aware, the only
> design you've outlined is one where the space wasn't at the same
> offset on every page.

I am skeptical of that particular aspect, yes. Though I would define
it the other way around (now the true special area struct isn't
necessarily at the same offset for a given AM, at least across data
directories).

My main concern is maintaining the ability to interpret much about the
contents of a page without context, and to not make it any harder to
grow the special area dynamically -- which is a broader concern.
Your patch isn't going to be the last one that wants to do something
with the special area. This needs to be carefully considered.

I see a huge amount of potential for adding new optimizations that use
subsidiary space on the page, presumably implemented via a special
area that can grow dynamically. For example, an ad-hoc compression
technique for heap pages that temporarily "absorbs" some extra
versions in the event of opportunistic pruning running and failing to
free enough space. Such a design would operate on similar principles
to deduplication in unique indexes, where the goal is to buy time
rather than buy space. When we fail to keep the contents of a heap
page together today, we often barely fail, so I expect something like
this to have an outsized impact on some workloads.

> In general, I was imagining that you'd need to look at the control
> file to understand how much space had been reserved per page in this
> particular cluster. I agree that's a bit awkward, especially for
> pg_filedump. However, pg_filedump and I think also some code internal
> to PostgreSQL try to figure out what kind of page we've got by looking
> at the *size* of the special space. It's only good luck that we
> haven't had a collision there yet, and continuing to rely on that
> seems like a dead end. Perhaps we should start including a per-AM
> magic number at the beginning of the special space.

It's true that that approach is just a hack -- we probably can do
better. I don't think that it's okay to break it, though. At least not
without providing a comparable alternative, that doesn't rely on
context from the control file.

--
Peter Geoghegan




Re: Tightening behaviour for non-immutable behaviour in immutable functions

2022-06-13 Thread Peter Geoghegan
On Mon, Jun 13, 2022 at 1:51 PM Greg Stark  wrote:
> By "relatively common" I think we're talking "nigh universal". Afaics
> there are no warnings in the docs about worrying about search_path on
> IMMUTABLE functions. There is for SECURITY DEFINER but I have to admit
> I wasn't aware myself of all the gotchas described there.

I didn't realize that it was that bad. Even if it's only 10% as bad as
you say, it would still be very valuable to do something about it
(ideally with an approach that is non-invasive).

-- 
Peter Geoghegan




Re: 2022-06-16 release announcement draft

2022-06-13 Thread Peter Geoghegan
On Mon, Jun 13, 2022 at 6:15 PM Jonathan S. Katz  wrote:
> > Perhaps it is also worth mentioning that you can use REINDEX without
> > CONCURRENTLY, even before upgrading.
>
> I'm hesitant on giving too many options. We did put out the "warning"
> announcement providing this as an option. I do think that folks who are
> running CIC/RIC are sensitive to locking, and a plain old "REINDEX" may
> not be viable except in an emergency.

The locking implications for plain REINDEX are surprising IMV -- and
so I suggest sticking with what you have here.

In many cases using plain REINDEX is not meaningfully different to
taking a full AccessExclusiveLock on the table (we only acquire an AEL
on the index, but in practice that can be a distinction without a
difference). We at least went some way towards making the situation
with REINDEX locking clearer in a doc patch that recently became
commit 8ac700ac.
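
(For anybody skimming the archives later, the two forms being compared
are simply the following -- "some_index" is of course just a
placeholder:)

-- takes an ACCESS EXCLUSIVE lock on the index and blocks writes to the
-- table for the duration, which can amount to much the same thing
REINDEX INDEX some_index;

-- avoids long-lived blocking locks, at the cost of doing more total work
REINDEX INDEX CONCURRENTLY some_index;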

--
Peter Geoghegan



