subject:"Re\: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin"

On Mon, Jul 22, 2024 at 6:36 PM Tom Lane  wrote:
>
> Melanie Plageman  writes:
> > We've only run tests with this commit on some of the back branches for
> > some of these animals. Of those, I don't see any failures so far. So,
> > it seems the test instability is just related to trying to get
> > multiple passes of index vacuuming reliably with TIDStore.
>
> > AFAICT, all the 32bit machine failures are timeouts waiting for the
> > standby to catch up (mamba, gull, merswine). Unfortunately, the
> > failures on copperhead (a 64 bit machine) are because we don't
> > actually succeed in triggering a second vacuum pass. This would not be
> > fixed by a longer timeout.
>
> Ouch.  This seems to me to raise the importance of getting a better
> way to test multiple-index-vacuum-passes.  Peter argued upthread
> that we don't need a better way, but I don't see how that argument
> holds water if copperhead was not reaching it despite being 64-bit.
> (Did you figure out exactly why it doesn't reach the code?)

I wasn't able to reproduce the failure (failing to do > 1 index vacuum
pass) on my local machine (which is 64 bit) without decreasing the
number of tuples inserted. The copperhead failure confuses me because
the speed of the machine should *not* affect how much space the dead
item TIDStore takes up. I would have bet money that the same number
and offsets of dead tuples per page in a relation would take up the
same amount of space in a TIDStore on any 64-bit system -- regardless
of how slowly it runs vacuum.

Here is some background on how I came up with the DDL and tuple count
for the test: TIDStore uses 32 BITS_PER_BITMAPWORD on 32 bit systems
and 64 on 64 bit systems. So, if you only have one bitmapword's worth
of dead items per page, it was easy to figure out that you would need
double the number of pages with dead items to take up the same amount
of TIDStore space on a 32 bit system than on a 64 bit system.

I wanted to figure out how to take up double the amount of TIDStore
space *without* doubling the number of tuples. This is not
straightforward. You can't just delete twice as many dead tuples per
page. For starters, you can compactly represent many dead tuples in a
single bitmapword. Outside of this, there seems to be some effect on
the amount of space the adaptive radix tree takes up if the dead items
on the pages are at the same offsets on all the pages. I thought this
might have to do with being able to use the same chunk (in ART terms)?
I spent some time trying to figure it out, but I gave up once I got
confused enough to try and read the adaptive radix tree paper.

I found myself wishing there was some way to visualize the TIDStore. I
don't have good ideas how to represent this, but if we found one, we
could add a function to the test_tidstore module.

I also think it would be useful to have peak TIDStore usage in bytes
in the vacuum verbose output. I had it on my list to propose something
like this after I hacked together a version myself while trying to
debug the test locally.

- Melanie

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-22 Thread Masahiko Sawada

On Mon, Jul 22, 2024 at 2:04 PM Melanie Plageman
 wrote:
>
> On Mon, Jul 22, 2024 at 2:17 PM Melanie Plageman
>  wrote:
> >
> > So, I've just gone through all the test failures on master and 17 for
> > mamba, gull, mereswine, and copperhead. I wanted to confirm that the
> > test was always failing for the same reason and also if it had any
> > failures pre-TIDStore.
> >
> > We've only run tests with this commit on some of the back branches for
> > some of these animals. Of those, I don't see any failures so far. So,
> > it seems the test instability is just related to trying to get
> > multiple passes of index vacuuming reliably with TIDStore.
> >
> > AFAICT, all the 32bit machine failures are timeouts waiting for the
> > standby to catch up (mamba, gull, merswine). Unfortunately, the
> > failures on copperhead (a 64 bit machine) are because we don't
> > actually succeed in triggering a second vacuum pass. This would not be
> > fixed by a longer timeout.
> >
> > Because of this, I'm inclined to revert the test on 17 and master to
> > avoid distracting folks committing other work and seeing those animals
> > go red.
>
> Okay, I reverted this for now on 17 and master. Adding Sawada-san to
> the thread to see if he has any ideas for a smaller two-round index
> vacuum example.
>

+   CREATE TABLE ${table1}(col1 int)
+   WITH (autovacuum_enabled=false, fillfactor=10);
+   INSERT INTO $table1 VALUES(7);
+   INSERT INTO $table1 SELECT generate_series(1, $nrows) % 3;
+   CREATE INDEX on ${table1}(col1);
+   UPDATE $table1 SET col1 = 3 WHERE col1 = 0;
+   INSERT INTO $table1 VALUES(7);

These queries make sense to me; these make the radix tree wide and use
more nodes, instead of fattening lead nodes (i.e. the offset bitmap).
The $table1 has 18182 blocks and the statistics of radix tree shows:

max_val = 65535
num_keys = 18182
height = 1, n4 = 0, n16 = 1, n32 = 0, n64 = 0, n256 = 72, leaves = 18182

Which means that the height of the tree is 2 and we use the maximum
size node for all nodes except for 1 node.

I don't have any great idea to substantially reduce the total number
of tuples in the $table1. Probably we can use DELETE instead of UPDATE
to make garbage tuples (although I'm not sure it's okay for this
test). Which reduces the amount of WAL records from 11MB to 4MB and
would reduce the time to catch up. But I'm not sure how much it would
help. There might be ideas to trigger a two-round index vacuum with
fewer tuples but if the tests are too optimized for the current
TidStore, we will have to re-adjust them if the TidStore changes in
the future. So I think it's better and reliable to allow
maintenance_work_mem to be a lower value or use injection points
somehow.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

Michael Paquier  writes:
> Sawada-san and John are the two ones in the best position to answer
> that.  I'm not sure either how to force a second index pass, either.

Yeah, I think we've established that having some way to force that,
without using a huge test case, would be really desirable.  Maybe
just provide a way to put an artificial limit on how many tuples
processed per pass?

(And no, I wasn't trying to rag on Melanie.  My point here is that
we've failed to design-in easy testability of this code path, and
that's surely not her fault.)

regards, tom lane

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-22 Thread Michael Paquier

On Mon, Jul 22, 2024 at 10:37:17AM -0700, Andres Freund wrote:
> On 2024-07-22 13:16:49 -0400, Robert Haas wrote:
>> Laziness would have been not bothering to develop a TAP test for this
>> at all. Going to the trouble of creating one and not being able to
>> make it as fast or as stable as everybody would like is just being
>> human.
> 
> Yea, I think calling weeks of effort by Melanie lazy is, uhm, not kind.

FWIW, I'm really impressed by what she has achieved here by Melanie,
fixing a hard bug while hacking a crazily-complicated test to make it
reproducible.  This has an incredible amount of value in the long-run.

> It's not like somebody else had a great suggestion for how to do this in a
> better way either.

Sawada-san and John are the two ones in the best position to answer
that.  I'm not sure either how to force a second index pass, either.
Hmm.  An idea would be to manipulate the TIDStore stack under the
injection points switch?  This is run by the CI and some buildfarm
members already.
--
Michael


signature.asc
Description: PGP signature

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

Melanie Plageman  writes:
> We've only run tests with this commit on some of the back branches for
> some of these animals. Of those, I don't see any failures so far. So,
> it seems the test instability is just related to trying to get
> multiple passes of index vacuuming reliably with TIDStore.

> AFAICT, all the 32bit machine failures are timeouts waiting for the
> standby to catch up (mamba, gull, merswine). Unfortunately, the
> failures on copperhead (a 64 bit machine) are because we don't
> actually succeed in triggering a second vacuum pass. This would not be
> fixed by a longer timeout.

Ouch.  This seems to me to raise the importance of getting a better
way to test multiple-index-vacuum-passes.  Peter argued upthread
that we don't need a better way, but I don't see how that argument
holds water if copperhead was not reaching it despite being 64-bit.
(Did you figure out exactly why it doesn't reach the code?)

> Because of this, I'm inclined to revert the test on 17 and master to
> avoid distracting folks committing other work and seeing those animals
> go red.

Agreed as a short-term measure.

regards, tom lane

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jul 22, 2024 at 2:17 PM Melanie Plageman
 wrote:
>
> So, I've just gone through all the test failures on master and 17 for
> mamba, gull, mereswine, and copperhead. I wanted to confirm that the
> test was always failing for the same reason and also if it had any
> failures pre-TIDStore.
>
> We've only run tests with this commit on some of the back branches for
> some of these animals. Of those, I don't see any failures so far. So,
> it seems the test instability is just related to trying to get
> multiple passes of index vacuuming reliably with TIDStore.
>
> AFAICT, all the 32bit machine failures are timeouts waiting for the
> standby to catch up (mamba, gull, merswine). Unfortunately, the
> failures on copperhead (a 64 bit machine) are because we don't
> actually succeed in triggering a second vacuum pass. This would not be
> fixed by a longer timeout.
>
> Because of this, I'm inclined to revert the test on 17 and master to
> avoid distracting folks committing other work and seeing those animals
> go red.

Okay, I reverted this for now on 17 and master. Adding Sawada-san to
the thread to see if he has any ideas for a smaller two-round index
vacuum example.

- Melanie

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-22 Thread Peter Geoghegan

On Mon, Jul 22, 2024 at 2:13 PM Andres Freund  wrote:
> It's hard by now (i.e. 17+) because you need substantial amounts of rows to be
> able to trigger it which makes it a hard fight to introduce.

I didn't think that it was particularly hard when I tested the test
that Melanie committed.

> And the cost of
> setting the GUC limit lower is essentially zero.

Apparently you know more about TID Store than me.

If it really is trivial to lower the limit, then I have no objections
to doing so. That would make it easy to fix the test flappiness issues
by just using the much lower limit.

-- 
Peter Geoghegan

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jul 22, 2024 at 1:17 PM Robert Haas  wrote:
>
> On Mon, Jul 22, 2024 at 11:48 AM Tom Lane  wrote:
> > I'm a little suspicious
> > of using it for tests that merely take an unreasonable amount of
> > time --- to me, that indicates laziness on the part of the test
> > author.
>
> Laziness would have been not bothering to develop a TAP test for this
> at all. Going to the trouble of creating one and not being able to
> make it as fast or as stable as everybody would like is just being
> human.
>
> I never quite know what to do about TAP testing for issues like this.
> Ideally, we want a test case that runs quickly, is highly stable, is
> perfectly sensitive to the bug being fixed, and has a reasonable
> likelihood of being sensitive to future bugs of the same ilk. But such
> a test case need not exist, and even if it does, it need not be the
> case that any of us are able to find it. Or maybe finding it is
> possible but will take an unreasonable amount of time: if it took a
> committer six months to come up with such a test case for this bug,
> would that be worth it, or just overkill? I'd say overkill: I'd rather
> have that committer working on other stuff than spending six months
> trying to craft the perfect test case for a bug that's already fixed.
>
> Also, this particular bug seems to require a very specific combination
> of circumstances in order to trigger it. So the test gets complicated.
> As mentioned, that makes it harder to get the test case fast and
> stable, but it also reduces the chances that the test case will ever
> find anything. I don't think that this will be the last time we make a
> mistake around VACUUM's xmin handling, but the next mistake may well
> require an equally baroque but *different* setup to cause a problem. I
> hate to come to the conclusion that we just shouldn't test for this,
> but I don't think it's fair to send Melanie off on a wild goose chase
> looking for a perfect test case that may not realistically exist,
> either.

So, I've just gone through all the test failures on master and 17 for
mamba, gull, mereswine, and copperhead. I wanted to confirm that the
test was always failing for the same reason and also if it had any
failures pre-TIDStore.

We've only run tests with this commit on some of the back branches for
some of these animals. Of those, I don't see any failures so far. So,
it seems the test instability is just related to trying to get
multiple passes of index vacuuming reliably with TIDStore.

AFAICT, all the 32bit machine failures are timeouts waiting for the
standby to catch up (mamba, gull, merswine). Unfortunately, the
failures on copperhead (a 64 bit machine) are because we don't
actually succeed in triggering a second vacuum pass. This would not be
fixed by a longer timeout.

Because of this, I'm inclined to revert the test on 17 and master to
avoid distracting folks committing other work and seeing those animals
go red.

I wonder if Sawada-san or John have a test case minimally reproducing
a case needing multiple index vacuuming rounds. You can't do it with
my example and just more dead rows per page. If you just increase the
number of dead tuples, it doesn't increase the size of the TIDStore
unless those dead tuples are at different offsets. And I couldn't find
DDL which would cause the TIDStore to be > 1MB without using a low
fill-factor and many rows. Additionally, the fact that the same number
of rows does not trigger the multiple passes on two different 64bit
machines worries me and makes me think that we will struggle to
trigger these conditions without overshooting the minimum by quite a
bit.

- Melanie

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-22 Thread Andres Freund

On 2024-07-22 12:00:51 -0400, Peter Geoghegan wrote:
> On Mon, Jul 22, 2024 at 11:49 AM Tom Lane  wrote:
> > >> Andres has suggested in the past that we allow maintenance_work_mem be
> > >> set to a lower value or introduce some kind of development GUC so that
> > >> we can more easily test multiple pass index vacuuming. Do you think
> > >> this would be worth it?
> >
> > > No, I don't.
> >
> > I don't see why that's not a good idea.
> 
> I don't think that it's worth going to that trouble. Testing multiple
> passes isn't hard -- not in any real practical sense.

It's hard by now (i.e. 17+) because you need substantial amounts of rows to be
able to trigger it which makes it a hard fight to introduce. And the cost of
setting the GUC limit lower is essentially zero.

What's the point of having such a high lower limit?

Greetings,

Andres Freund

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-22 Thread Andres Freund

On 2024-07-22 13:16:49 -0400, Robert Haas wrote:
> On Mon, Jul 22, 2024 at 11:48 AM Tom Lane  wrote:
> > I'm a little suspicious
> > of using it for tests that merely take an unreasonable amount of
> > time --- to me, that indicates laziness on the part of the test
> > author.
> 
> Laziness would have been not bothering to develop a TAP test for this
> at all. Going to the trouble of creating one and not being able to
> make it as fast or as stable as everybody would like is just being
> human.

Yea, I think calling weeks of effort by Melanie lazy is, uhm, not kind.

It's not like somebody else had a great suggestion for how to do this in a
better way either.

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-22 Thread Robert Haas

On Mon, Jul 22, 2024 at 11:48 AM Tom Lane  wrote:
> I'm a little suspicious
> of using it for tests that merely take an unreasonable amount of
> time --- to me, that indicates laziness on the part of the test
> author.

Laziness would have been not bothering to develop a TAP test for this
at all. Going to the trouble of creating one and not being able to
make it as fast or as stable as everybody would like is just being
human.

I never quite know what to do about TAP testing for issues like this.
Ideally, we want a test case that runs quickly, is highly stable, is
perfectly sensitive to the bug being fixed, and has a reasonable
likelihood of being sensitive to future bugs of the same ilk. But such
a test case need not exist, and even if it does, it need not be the
case that any of us are able to find it. Or maybe finding it is
possible but will take an unreasonable amount of time: if it took a
committer six months to come up with such a test case for this bug,
would that be worth it, or just overkill? I'd say overkill: I'd rather
have that committer working on other stuff than spending six months
trying to craft the perfect test case for a bug that's already fixed.

Also, this particular bug seems to require a very specific combination
of circumstances in order to trigger it. So the test gets complicated.
As mentioned, that makes it harder to get the test case fast and
stable, but it also reduces the chances that the test case will ever
find anything. I don't think that this will be the last time we make a
mistake around VACUUM's xmin handling, but the next mistake may well
require an equally baroque but *different* setup to cause a problem. I
hate to come to the conclusion that we just shouldn't test for this,
but I don't think it's fair to send Melanie off on a wild goose chase
looking for a perfect test case that may not realistically exist,
either.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-22 Thread Andres Freund

Hi,

On 2024-07-21 12:51:51 -0400, Tom Lane wrote:
> Melanie Plageman  writes:
> > When I run it on my machine with some added logging, the space taken
> > by dead items is about 330 kB more than maintenance_work_mem (which is
> > set to 1 MB). I could roughly double the excess by increasing the
> > number of inserted tuples from 40 to 60. I'll do this.

> mamba, gull, and mereswine are 32-bit machines, which aside from
> being old and slow suffer an immediate 2x size-of-test penalty:

I think what we ought to do here is to lower the lower limit for memory usage
for vacuum. With the new state in 17+ it basically has become impossible to
test multi-pass vacuums in a way that won't get your test thrown out - that's
bad.

> I do not think the answer to this is to nag the respective animal owners to
> raise PG_TEST_TIMEOUT_DEFAULT.  IMV this test is simply not worth the cycles
> it takes, at least not for these machines.

This specific area of the code has a *long* history of bugs, I'd be very loath
to give up testing.

Greetings,

Andres Freund

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-22 Thread Peter Geoghegan

On Mon, Jul 22, 2024 at 11:49 AM Tom Lane  wrote:
> >> Andres has suggested in the past that we allow maintenance_work_mem be
> >> set to a lower value or introduce some kind of development GUC so that
> >> we can more easily test multiple pass index vacuuming. Do you think
> >> this would be worth it?
>
> > No, I don't.
>
> I don't see why that's not a good idea.

I don't think that it's worth going to that trouble. Testing multiple
passes isn't hard -- not in any real practical sense.

I accept that there needs to be some solution to the problem of the
tests timing out on slow running buildfarm animals. Your
PG_TEST_SKIP_SLOW proposal seems like a good approach.

-- 
Peter Geoghegan

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

Alvaro Herrera  writes:
> I think it's a bad idea to require buildfarm owners to edit their config
> files as we add tests that depend on PG_TEST_EXTRA.  AFAIR we invented
> that setting so that tests that had security implications could be made
> opt-in instead of opt-out; I think this was a sensible thing to do, to
> avoid possibly compromising the machines in some way.  But I think these
> new tests have a different problem, so we shouldn't use the same
> mechanism.

That's my feeling also.

> What about some brainstorming to improve this?

> For example: have something in the tree that lets committers opt some
> tests out from specific BF machines without having to poke at the BF
> machines.  I imagine two files: one that carries tags for buildfarm
> members, something like the /etc/groups file,

I'd turn it around, and provide some way for buildfarm owners to
say "this machine is slow".  Maybe make the tests respond to the
presence of an environment variable PG_TEST_SKIP_SLOW, or some
such thing.  That particular solution would require no new
infrastructure (such as a new buildfarm client release); it'd
only require editing the config files of affected animals.

regards, tom lane

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

Peter Geoghegan  writes:
> On Mon, Jul 22, 2024 at 9:32 AM Melanie Plageman
>  wrote:
>> Andres has suggested in the past that we allow maintenance_work_mem be
>> set to a lower value or introduce some kind of development GUC so that
>> we can more easily test multiple pass index vacuuming. Do you think
>> this would be worth it?

> No, I don't.

I don't see why that's not a good idea.

regards, tom lane

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

Melanie Plageman  writes:
> On Sun, Jul 21, 2024 at 5:04 PM Tom Lane  wrote:
>> I note also that the PG_TEST_EXTRA approach has caused xid_wraparound
>> to get next-to-zero buildfarm coverage.  If that test is actually
>> capable of revealing problems, we're unlikely to find out under the
>> status quo.

> What is the argument for PG_TEST_EXTRA if it is not running on almost
> any buildfarm animals? Are some of those tests valuable for other
> reasons than being consistently automatically run (e.g. developer
> understanding of how a particular part of code works)?

The point of PG_TEST_EXTRA is to make some of the tests be opt-in.
Originally it was just used for tests that might have security
implications (e.g. the kerberos tests, which involve running a
not-terribly-locked-down kerberos server).  I'm a little suspicious
of using it for tests that merely take an unreasonable amount of
time --- to me, that indicates laziness on the part of the test
author.  It'd be better to get the test runtime down to the point
where it's reasonable to expect all the buildfarm animals to run it.
As an example, we're not getting any Valgrind coverage on
xid_wraparound, and we won't ever get it with the current approach,
which I find bad.

> I don't know if you mean that PG_TEST_EXTRA tests are never
> run or just seldom run. If they are never run on the buildfarm, then
> they could end up silently breaking too.

They are opt-in, meaning that both buildfarm owners and regular
developers have to take extra action (i.e. set the PG_TEST_EXTRA
environment variable) to run them.  There's a reasonable number
of animals opting into ssl, kerberos, etc, but I see only two
that are opting into xid_wraparound.  If we change this new test
to be conditional on PG_TEST_EXTRA, it won't get run unless you
successfully nag some buildfarm owners to run it.

regards, tom lane

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-22 Thread Peter Geoghegan

On Mon, Jul 22, 2024 at 9:32 AM Melanie Plageman
 wrote:
> Andres has suggested in the past that we allow maintenance_work_mem be
> set to a lower value or introduce some kind of development GUC so that
> we can more easily test multiple pass index vacuuming. Do you think
> this would be worth it?

No, I don't.

-- 
Peter Geoghegan

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-22 Thread Alvaro Herrera

On 2024-Jul-22, Melanie Plageman wrote:

> On Sun, Jul 21, 2024 at 5:04 PM Tom Lane  wrote:

> > I note also that the PG_TEST_EXTRA approach has caused xid_wraparound
> > to get next-to-zero buildfarm coverage.  If that test is actually
> > capable of revealing problems, we're unlikely to find out under the
> > status quo.
> 
> What is the argument for PG_TEST_EXTRA if it is not running on almost
> any buildfarm animals? Are some of those tests valuable for other
> reasons than being consistently automatically run (e.g. developer
> understanding of how a particular part of code works)?

I think it's a bad idea to require buildfarm owners to edit their config
files as we add tests that depend on PG_TEST_EXTRA.  AFAIR we invented
that setting so that tests that had security implications could be made
opt-in instead of opt-out; I think this was a sensible thing to do, to
avoid possibly compromising the machines in some way.  But I think these
new tests have a different problem, so we shouldn't use the same
mechanism.

What about some brainstorming to improve this?

For example: have something in the tree that lets committers opt some
tests out from specific BF machines without having to poke at the BF
machines.  I imagine two files: one that carries tags for buildfarm
members, something like the /etc/groups file,

src/test/tags.lst
  slow: gull,mamba,mereswine,copperhead

and another file that lists tests to skip on members that have certain
tags,

src/tools/buildfarm/do_not_run.lst
  slow:src/test/modules/xid_wraparound
  slow:src/test/recovery/t/043_vacuum_horizon_floor.pl

so that run_build.pl know that if the current member has tag slow, then
these two tests are to be skipped.

Then we can have xid_wraparound enabled generally (without requiring
PG_TEST_EXTRA), and the BF client knows not to run it in the particular
cases where it's not wanted.

This proposal has a number of problems (a glaring one being the
maintenance of the list of members per tag), but maybe it inspires
better ideas.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Sun, Jul 21, 2024 at 4:29 PM Peter Geoghegan  wrote:
>
> On Sun, Jul 21, 2024 at 12:51 PM Tom Lane  wrote:
> > I do not think the answer to this is to nag the respective animal
> > owners to raise PG_TEST_TIMEOUT_DEFAULT.  IMV this test is simply
> > not worth the cycles it takes, at least not for these machines.
>
> Can't we just move it to PG_TEST_EXTRA? Alongside the existing
> "xid_wraparound" test?
>
> We didn't even have basic coverage of multi-pass VACUUMs before now.
> This new test added that coverage. I think that it will pull its
> weight.

Andres has suggested in the past that we allow maintenance_work_mem be
set to a lower value or introduce some kind of development GUC so that
we can more easily test multiple pass index vacuuming. Do you think
this would be worth it?

- Melanie

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Sun, Jul 21, 2024 at 5:04 PM Tom Lane  wrote:
>
> Peter Geoghegan  writes:
> > On Sun, Jul 21, 2024 at 12:51 PM Tom Lane  wrote:
> >> I do not think the answer to this is to nag the respective animal
> >> owners to raise PG_TEST_TIMEOUT_DEFAULT.  IMV this test is simply
> >> not worth the cycles it takes, at least not for these machines.
>
> > Can't we just move it to PG_TEST_EXTRA? Alongside the existing
> > "xid_wraparound" test?
>
> Perhaps.  xid_wraparound seems entirely too slow for what it's
> testing as well, if you ask me, and there's a concurrent thread
> about that test causing problems too.
>
> > There will always be a small number of extremely slow buildfarm
> > animals. Optimizing for things like Raspberry pi animals with SD cards
> > just doesn't seem like a good use of developer time. I really care
> > about keeping the tests fast, but only on platforms that hackers
> > actually use for their development work.
>
> I find this argument completely disingenuous.  If a test is slow
> enough to cause timeout failures on slower machines, then it's also
> eating a disproportionate number of cycles in every other check-world
> run --- many of which have humans waiting for them to finish.  Caring
> about the runtime of test cases is good for future-you not just
> obsolete buildfarm animals.
>
> I note also that the PG_TEST_EXTRA approach has caused xid_wraparound
> to get next-to-zero buildfarm coverage.  If that test is actually
> capable of revealing problems, we're unlikely to find out under the
> status quo.

What is the argument for PG_TEST_EXTRA if it is not running on almost
any buildfarm animals? Are some of those tests valuable for other
reasons than being consistently automatically run (e.g. developer
understanding of how a particular part of code works)?

If they aren't being run, how do we know that they still work (as test
infrastructure changes)? The recovery conflict test is skipped on 15,
which means that backported perl test changes may break it without us
knowing. I don't know if you mean that PG_TEST_EXTRA tests are never
run or just seldom run. If they are never run on the buildfarm, then
they could end up silently breaking too.

- Melanie

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Sun, Jul 21, 2024 at 12:51 PM Tom Lane  wrote:
>
> Melanie Plageman  writes:
> > When I run it on my machine with some added logging, the space taken
> > by dead items is about 330 kB more than maintenance_work_mem (which is
> > set to 1 MB). I could roughly double the excess by increasing the
> > number of inserted tuples from 40 to 60. I'll do this.
>
> So, after about two days in the buildfarm, we have failure reports
> from this test on gull, mamba, mereswine, and copperhead.  mamba
> is mine, and I was able to reproduce the failure in a manual run.
> The problem seems to be that the test simply takes too long and
> we hit the default 180-second timeout on one step or another.
> I was able to make it pass by dint of
>
> $ export PG_TEST_TIMEOUT_DEFAULT=1800
>
> However, the test then took 908 seconds:

Thanks for taking the time to do this. If the test failures can be
fixed by increasing timeout, that means that at least multiple index
vacuums are reliably triggered with that number of rows. Obviously we
can't have a super slow, flakey test, but I was worried the test might
fail on different platforms because somehow the row count was
insufficient to cause multiple index vacuums on some platforms for
some reason (due to adaptive radix tree size being dependent on many
factors).

> $ time make installcheck PROVE_TESTS=t/043_vacuum_horizon_floor.pl
> ...
> # +++ tap install-check in src/test/recovery +++
> t/043_vacuum_horizon_floor.pl .. ok
> All tests successful.
> Files=1, Tests=3, 908 wallclock secs ( 0.17 usr  0.01 sys + 21.42 cusr 35.03 
> csys = 56.63 CPU)
> Result: PASS
>   909.26 real22.10 user35.21 sys
>
> This is even slower than the 027_stream_regress.pl test, which
> currently takes around 847 seconds on that machine.
>
> mamba, gull, and mereswine are 32-bit machines, which aside from
> being old and slow suffer an immediate 2x size-of-test penalty:
>
> >> # The TIDStore vacuum uses to store dead items is optimized for its target
> >> # system. On a 32-bit system, our example requires twice as many pages with
> >> # the same number of dead items per page to fill the TIDStore and trigger a
> >> # second round of index vacuuming.
> >> my $is_64bit = $node_primary->safe_psql($test_db,
> >>  qq[SELECT typbyval FROM pg_type WHERE typname = 'int8';]);
> >>
> >> my $nrows = $is_64bit eq 't' ? 40 : 80;
>
> copperhead is 64-bit but is nonetheless even slower than the
> other three, so the fact that it's also timing out isn't
> that surprising.
>
> I do not think the answer to this is to nag the respective animal
> owners to raise PG_TEST_TIMEOUT_DEFAULT.  IMV this test is simply
> not worth the cycles it takes, at least not for these machines.
> I'm not sure whether to propose reverting it entirely or just
> disabling it on 32-bit hardware.  I don't think we'd lose anything
> meaningful in test coverage if we did the latter; but that won't be
> enough to make copperhead happy.  I am also suspicious that we'll
> get bad news from other very slow animals such as dikkop.

I am happy to do what Peter suggests and move it to PG_TEST_EXTRA, to
disable for 32-bit, or to revert it.

> I wonder if there is a less expensive way to trigger the test
> situation than brute-forcing things with a large index.
> Maybe the injection point infrastructure could help?

The issue with an injection point is that we need more than for the
vacuuming backend to pause at a specific point, we need a refresh of
GlobalVisState to be forced at that point. Even if the horizon moves
backward on the primary, this backend won't notice unless it has to
update its GlobalVisState -- which often happens due to taking a new
snapshot but this also happens at the end of index vacuuming
explicitly.

- Melanie

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-21 Thread Peter Geoghegan

On Sun, Jul 21, 2024 at 5:04 PM Tom Lane  wrote:
> > There will always be a small number of extremely slow buildfarm
> > animals. Optimizing for things like Raspberry pi animals with SD cards
> > just doesn't seem like a good use of developer time. I really care
> > about keeping the tests fast, but only on platforms that hackers
> > actually use for their development work.
>
> I find this argument completely disingenuous.

Disingenuous? Really?

> If a test is slow
> enough to cause timeout failures on slower machines, then it's also
> eating a disproportionate number of cycles in every other check-world
> run --- many of which have humans waiting for them to finish.  Caring
> about the runtime of test cases is good for future-you not just
> obsolete buildfarm animals.

That's not necessarily true, though.

I actually benchmarked this new test. I found that its runtime was a
little on the long side as these recovery TAP tests go, but not to an
excessive degree. It wasn't the slowest by any means.

It's entirely possible that the new test would in fact be far slower
than other comparable tests, were I to run a similar benchmark on
something like a Raspberry pi with an SD card -- that would explain
the apparent inconsistency here. Obviously Raspberry pi type hardware
is expected to be much slower than the machine I use day to day, but
that isn't the only thing that matters. A Raspberry pi can also have
completely different performance characteristics to high quality
workstation hardware. The CPU might be tolerably fast, while I/O is a
huge bottleneck.

> I note also that the PG_TEST_EXTRA approach has caused xid_wraparound
> to get next-to-zero buildfarm coverage.  If that test is actually
> capable of revealing problems, we're unlikely to find out under the
> status quo.

I saw that.

I think that there is significant value in providing a way for
individual developers to test wraparound. Both by providing a TAP
test, and providing the associated SQL callable C test functions.
There is less value in testing it on every conceivable platform.

-- 
Peter Geoghegan

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-21 Thread Tom Lane

Peter Geoghegan  writes:
> On Sun, Jul 21, 2024 at 12:51 PM Tom Lane  wrote:
>> I do not think the answer to this is to nag the respective animal
>> owners to raise PG_TEST_TIMEOUT_DEFAULT.  IMV this test is simply
>> not worth the cycles it takes, at least not for these machines.

> Can't we just move it to PG_TEST_EXTRA? Alongside the existing
> "xid_wraparound" test?

Perhaps.  xid_wraparound seems entirely too slow for what it's
testing as well, if you ask me, and there's a concurrent thread
about that test causing problems too.

> There will always be a small number of extremely slow buildfarm
> animals. Optimizing for things like Raspberry pi animals with SD cards
> just doesn't seem like a good use of developer time. I really care
> about keeping the tests fast, but only on platforms that hackers
> actually use for their development work.

I find this argument completely disingenuous.  If a test is slow
enough to cause timeout failures on slower machines, then it's also
eating a disproportionate number of cycles in every other check-world
run --- many of which have humans waiting for them to finish.  Caring
about the runtime of test cases is good for future-you not just
obsolete buildfarm animals.

I note also that the PG_TEST_EXTRA approach has caused xid_wraparound
to get next-to-zero buildfarm coverage.  If that test is actually
capable of revealing problems, we're unlikely to find out under the
status quo.

regards, tom lane

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-21 Thread Peter Geoghegan

On Sun, Jul 21, 2024 at 12:51 PM Tom Lane  wrote:
> I do not think the answer to this is to nag the respective animal
> owners to raise PG_TEST_TIMEOUT_DEFAULT.  IMV this test is simply
> not worth the cycles it takes, at least not for these machines.

Can't we just move it to PG_TEST_EXTRA? Alongside the existing
"xid_wraparound" test?

We didn't even have basic coverage of multi-pass VACUUMs before now.
This new test added that coverage. I think that it will pull its
weight.

There will always be a small number of extremely slow buildfarm
animals. Optimizing for things like Raspberry pi animals with SD cards
just doesn't seem like a good use of developer time. I really care
about keeping the tests fast, but only on platforms that hackers
actually use for their development work.

--
Peter Geoghegan

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-21 Thread Tom Lane

Melanie Plageman  writes:
> When I run it on my machine with some added logging, the space taken
> by dead items is about 330 kB more than maintenance_work_mem (which is
> set to 1 MB). I could roughly double the excess by increasing the
> number of inserted tuples from 40 to 60. I'll do this.

So, after about two days in the buildfarm, we have failure reports
from this test on gull, mamba, mereswine, and copperhead.  mamba
is mine, and I was able to reproduce the failure in a manual run.
The problem seems to be that the test simply takes too long and
we hit the default 180-second timeout on one step or another.
I was able to make it pass by dint of

$ export PG_TEST_TIMEOUT_DEFAULT=1800

However, the test then took 908 seconds:

$ time make installcheck PROVE_TESTS=t/043_vacuum_horizon_floor.pl
...
# +++ tap install-check in src/test/recovery +++
t/043_vacuum_horizon_floor.pl .. ok   
All tests successful.
Files=1, Tests=3, 908 wallclock secs ( 0.17 usr  0.01 sys + 21.42 cusr 35.03 
csys = 56.63 CPU)
Result: PASS
  909.26 real22.10 user35.21 sys

This is even slower than the 027_stream_regress.pl test, which
currently takes around 847 seconds on that machine.

mamba, gull, and mereswine are 32-bit machines, which aside from
being old and slow suffer an immediate 2x size-of-test penalty:

>> # The TIDStore vacuum uses to store dead items is optimized for its target
>> # system. On a 32-bit system, our example requires twice as many pages with
>> # the same number of dead items per page to fill the TIDStore and trigger a
>> # second round of index vacuuming.
>> my $is_64bit = $node_primary->safe_psql($test_db,
>>  qq[SELECT typbyval FROM pg_type WHERE typname = 'int8';]);
>> 
>> my $nrows = $is_64bit eq 't' ? 40 : 80;

copperhead is 64-bit but is nonetheless even slower than the
other three, so the fact that it's also timing out isn't
that surprising.

I do not think the answer to this is to nag the respective animal
owners to raise PG_TEST_TIMEOUT_DEFAULT.  IMV this test is simply
not worth the cycles it takes, at least not for these machines.
I'm not sure whether to propose reverting it entirely or just
disabling it on 32-bit hardware.  I don't think we'd lose anything
meaningful in test coverage if we did the latter; but that won't be
enough to make copperhead happy.  I am also suspicious that we'll
get bad news from other very slow animals such as dikkop.

I wonder if there is a less expensive way to trigger the test
situation than brute-forcing things with a large index.
Maybe the injection point infrastructure could help?

regards, tom lane

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-17 Thread Melanie Plageman

On Wed, Jul 17, 2024 at 12:11 PM Peter Geoghegan  wrote:
>
> On Wed, Jul 17, 2024 at 12:07 PM Melanie Plageman
>  wrote:
> > We didn't end up doing two index vacuum passes. Because it doesn't
> > repro locally for me, I can only assume that the conditions for
> > forcing two index vacuuming passes in master just weren't met in this
> > case. I'm unsurprised, as it is much harder since 17 to force two
> > passes of index vacuuming. It seems like this might be as unstable as
> > I feared. I could add more dead data. Or, I could just commit the test
> > to the back branches before 17. What do you think?
>
> How much margin of error do you have, in terms of total number of
> dead_items? That is, have you whittled it down to the minimum possible
> threshold for 2 passes?

When I run it on my machine with some added logging, the space taken
by dead items is about 330 kB more than maintenance_work_mem (which is
set to 1 MB). I could roughly double the excess by increasing the
number of inserted tuples from 40 to 60. I'll do this.

> Some logging with VACUUM VERBOSE (run on the ci instance) might be 
> illuminating.

Vacuum verbose only will tell us the number of dead tuples and dead
item identifiers but not how much space they take up -- which is how
we decide whether or not to do index vacuuming.

- Melanie

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-17 Thread Peter Geoghegan

On Wed, Jul 17, 2024 at 12:07 PM Melanie Plageman
 wrote:
> We didn't end up doing two index vacuum passes. Because it doesn't
> repro locally for me, I can only assume that the conditions for
> forcing two index vacuuming passes in master just weren't met in this
> case. I'm unsurprised, as it is much harder since 17 to force two
> passes of index vacuuming. It seems like this might be as unstable as
> I feared. I could add more dead data. Or, I could just commit the test
> to the back branches before 17. What do you think?

How much margin of error do you have, in terms of total number of
dead_items? That is, have you whittled it down to the minimum possible
threshold for 2 passes?

Some logging with VACUUM VERBOSE (run on the ci instance) might be illuminating.

-- 
Peter Geoghegan

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-17 Thread Melanie Plageman

On Wed, Jul 17, 2024 at 11:07 AM Melanie Plageman
 wrote:
>
> On Mon, Jul 15, 2024 at 6:02 PM Peter Geoghegan  wrote:
> >
> > I think that there is some chance that this test will break the build
> > farm in whatever way, since there is a long history of VACUUM not
> > quite behaving as expected with these sorts of tests. I think that you
> > should commit the test case separately, first thing in the morning,
> > and then keep an eye on the build farm for the rest of the day. I
> > don't think that it's sensible to bend over backwards, just to avoid
> > breaking the build farm in this way.
>
> Sounds good.

Hmm. So, I was just running all the versions through CI again, and I
noticed that the test failed on master on CI on Linux - Debian
Bookworm with Meson. (This passes locally for me and has passed on
previous CI runs).

[15:43:41.547] stderr:
[15:43:41.547] # poll_query_until timed out executing this query:
[15:43:41.547] #
[15:43:41.547] # SELECT index_vacuum_count > 0
[15:43:41.547] # FROM pg_stat_progress_vacuum
[15:43:41.547] # WHERE datname='test_db' AND relid::regclass =
'vac_horizon_floor_table'::regclass;
[15:43:41.547] #
[15:43:41.547] # expecting this output:
[15:43:41.547] # t
[15:43:41.547] # last actual query output:
[15:43:41.547] # f

We didn't end up doing two index vacuum passes. Because it doesn't
repro locally for me, I can only assume that the conditions for
forcing two index vacuuming passes in master just weren't met in this
case. I'm unsurprised, as it is much harder since 17 to force two
passes of index vacuuming. It seems like this might be as unstable as
I feared. I could add more dead data. Or, I could just commit the test
to the back branches before 17. What do you think?

- Melanie

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-17 Thread Melanie Plageman

On Mon, Jul 15, 2024 at 6:02 PM Peter Geoghegan  wrote:
>
> On Mon, Jul 8, 2024 at 2:25 PM Melanie Plageman
>  wrote:
> > I could still use another pair of eyes on the test (looking out for
> > stability enhancing measures I could take).
>
> First, the basics: I found that your test failed reliably without your
> fix, and passed reliably with your fix.

Thanks for the review.

> Minor nitpicking about the comments in your TAP test:
>
> * It is necessary but not sufficient for your test to "skewer"
> maybe_needed, relative to OldestXmin. Obviously, it is not sufficient
> because the test can only fail when VACUUM prunes a heap page after
> the backend's horizons have been "skewered" in this sense.
>
> Pruning is when we get stuck, and if there's no more pruning then
> there's no opportunity for VACUUM to get stuck. Perhaps this point
> should be noted directly in the comments. You could add a sentence
> immediately after the existing sentence "Then vacuum's first pass will
> continue and pruning...". This new sentence would then add commentary
> such as "Finally, vacuum's second pass over the heap...".

I've added a description to the top of the test of the scenario
required and then reworked the comment you are describing to try and
make this more clear.

> * Perhaps you should point out that you're using VACUUM FREEZE for
> this because it'll force the backend to always get a cleanup lock.
> This is something you rely on to make the repro reliable, but that's
> it.
>
> In other words, point out to the reader that this bug has nothing to
> do with freezing; it just so happens to be convenient to use VACUUM
> FREEZE this way, due to implementation details.

I've mentioned this in a comment.

> * The sentence "VACUUM proceeds with pruning and does a visibility
> check on each tuple..." describes the bug in terms of the current
> state of things on Postgres 17, but Postgres 17 hasn't been released
> just yet. Does that really make sense?

In the patch targeted at master, I think it makes sense to describe
the code as it is. In the backpatch versions, I reworked this comment
to be correct for those versions.

> If you want to describe the invariant that caused
> heap_pre_freeze_checks() to error-out on HEAD/Postgres 17, then the
> commit message of your fix seems like the right place for that. You
> could reference these errors in passing. The errors seem fairly
> incidental to the real problem, at least to me.

The errors are mentioned in the fix commit message.

> I think that there is some chance that this test will break the build
> farm in whatever way, since there is a long history of VACUUM not
> quite behaving as expected with these sorts of tests. I think that you
> should commit the test case separately, first thing in the morning,
> and then keep an eye on the build farm for the rest of the day. I
> don't think that it's sensible to bend over backwards, just to avoid
> breaking the build farm in this way.

Sounds good.

- Melanie
From dab4725c8e5dc7103607790d915bb47e2139d0ce Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Wed, 17 Jul 2024 11:01:00 -0400
Subject: [PATCH v4 1/2] Ensure vacuum removes all visibly dead tuples older
 than OldestXmin

If vacuum fails to remove a tuple with xmax older than
VacuumCutoffs->OldestXmin and younger than GlobalVisState->maybe_needed,
it may attempt to freeze the tuple's xmax and then ERROR out in
pre-freeze checks with "cannot freeze committed xmax".

Fix this by having vacuum always remove tuples older than OldestXmin.

It is possible for GlobalVisState->maybe_needed to precede OldestXmin if
maybe_needed is forced to go backward while vacuum is running. This can
happen if a disconnected standby with a running transaction older than
VacuumCutoffs->OldestXmin reconnects to the primary after vacuum
initially calculates GlobalVisState and OldestXmin.

In back branches starting with 14, the first version using
GlobalVisState, failing to remove tuples older than OldestXmin during
pruning caused vacuum to infinitely loop in lazy_scan_prune(), as
investigated on this [1] thread. After 1ccc1e05ae removed the retry loop
in lazy_scan_prune() and stopped comparing tuples to OldestXmin, the
hang could no longer happen, but we could still attempt to freeze dead
tuples with xmax older than OldestXmin -- resulting in an ERROR.

Fix this by always removing dead tuples with xmax older than
VacuumCutoffs->OldestXmin. This is okay because the standby won't replay
the tuple removal until the tuple is removable. Thus, the worst that can
happen is a recovery conflict.

[1] https://postgr.es/m/20240415173913.4zyyrwaftujxthf2%40awork3.anarazel.de#1b216b7768b5bd577a3d3d51bd5aadee

Back-patch through 14

Author: Melanie Plageman
Reviewed-by: Peter Geoghegan, Robert Haas, Andres Freund, Heikki Linnakangas, and Noah Misch
Discussion: https://postgr.es/m/CAAKRu_bDD7oq9ZwB2OJqub5BovMG6UjEYsoK2LVttadjEqyRGg%40mail.gmail.com
---
 src/backend/access/heap/pruneheap.c  | 23 +++

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-15 Thread Peter Geoghegan

On Mon, Jul 8, 2024 at 2:25 PM Melanie Plageman
 wrote:
> Attached v3 has one important additional component in the test -- I
> use pg_stat_progress_vacuum to confirm that we actually do more than
> one pass of index vacuuming. Otherwise, it would have been trivial for
> the test to incorrectly pass.

That's a good idea.

> I could still use another pair of eyes on the test (looking out for
> stability enhancing measures I could take).

First, the basics: I found that your test failed reliably without your
fix, and passed reliably with your fix.

Next, performance: the total test runtime (as indicated by "time meson
test -q recovery/043_vacuum_horizon_floor") was comparable to other
recovery/* TAP tests. The new vacuum_horizon_floor test is a little on
the long running side, as these tests go, but I think that's fine.

Minor nitpicking about the comments in your TAP test:

* It is necessary but not sufficient for your test to "skewer"
maybe_needed, relative to OldestXmin. Obviously, it is not sufficient
because the test can only fail when VACUUM prunes a heap page after
the backend's horizons have been "skewered" in this sense.

Pruning is when we get stuck, and if there's no more pruning then
there's no opportunity for VACUUM to get stuck. Perhaps this point
should be noted directly in the comments. You could add a sentence
immediately after the existing sentence "Then vacuum's first pass will
continue and pruning...". This new sentence would then add commentary
such as "Finally, vacuum's second pass over the heap...".

* Perhaps you should point out that you're using VACUUM FREEZE for
this because it'll force the backend to always get a cleanup lock.
This is something you rely on to make the repro reliable, but that's
it.

In other words, point out to the reader that this bug has nothing to
do with freezing; it just so happens to be convenient to use VACUUM
FREEZE this way, due to implementation details.

* The sentence "VACUUM proceeds with pruning and does a visibility
check on each tuple..." describes the bug in terms of the current
state of things on Postgres 17, but Postgres 17 hasn't been released
just yet. Does that really make sense?

If you want to describe the invariant that caused
heap_pre_freeze_checks() to error-out on HEAD/Postgres 17, then the
commit message of your fix seems like the right place for that. You
could reference these errors in passing. The errors seem fairly
incidental to the real problem, at least to me.

I think that there is some chance that this test will break the build
farm in whatever way, since there is a long history of VACUUM not
quite behaving as expected with these sorts of tests. I think that you
should commit the test case separately, first thing in the morning,
and then keep an eye on the build farm for the rest of the day. I
don't think that it's sensible to bend over backwards, just to avoid
breaking the build farm in this way.

Thanks for working on this
-- 
Peter Geoghegan

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-08 Thread Melanie Plageman

On Tue, Jul 2, 2024 at 7:07 PM Melanie Plageman
 wrote:
>
> On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas  wrote:
> >
> > Would it be possible to make it robust so that we could always run it
> > with "make check"? This seems like an important corner case to
> > regression test.
>
> Okay, I've attached a new version of the patch and a new version of
> the repro that may be fast and stable enough to commit. It is more
> minimal than the previous version. I made the table as small as I
> could to still trigger two rounds of index vacuuming. I tried to make
> it as stable as possible. I also removed the cursor on the standby
> that could trigger recovery conflicts. It would be super helpful if
> someone could take a look at the test and point out any ways I could
> make it even more likely to be stable.

Attached v3 has one important additional component in the test -- I
use pg_stat_progress_vacuum to confirm that we actually do more than
one pass of index vacuuming. Otherwise, it would have been trivial for
the test to incorrectly pass.

I could still use another pair of eyes on the test (looking out for
stability enhancing measures I could take). I also would be happy if
someone looked at the commit message and/or comments to let me know if
they make sense.

I'll finish with versions of the patch and test targeted at v14-16 and
propose those before committing this.

- Melanie
From b83a5e4ab84cb4b4de104fee05daf0bf88fe70ff Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 20 Jun 2024 16:38:30 -0400
Subject: [PATCH v3 1/2] Test that vacuum removes tuples older than OldestXmin

If vacuum fails to prune a tuple killed before OldestXmin, it will
decide to freeze its xmax and later error out in pre-freeze checks.
---
 src/test/recovery/meson.build |   1 +
 .../recovery/t/043_vacuum_horizon_floor.pl| 236 ++
 2 files changed, 237 insertions(+)
 create mode 100644 src/test/recovery/t/043_vacuum_horizon_floor.pl

diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..1d55d6bf560 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
   't/040_standby_failover_slots_sync.pl',
   't/041_checkpoint_at_promote.pl',
   't/042_low_level_backup.pl',
+  't/043_vacuum_horizon_floor.pl',
 ],
   },
 }
diff --git a/src/test/recovery/t/043_vacuum_horizon_floor.pl b/src/test/recovery/t/043_vacuum_horizon_floor.pl
new file mode 100644
index 000..1b82d935be5
--- /dev/null
+++ b/src/test/recovery/t/043_vacuum_horizon_floor.pl
@@ -0,0 +1,236 @@
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+# Test that vacuum prunes away all dead tuples killed before OldestXmin
+
+# Set up nodes
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 'physical');
+
+$node_primary->append_conf(
+	'postgresql.conf', qq[
+hot_standby_feedback = on
+log_recovery_conflict_waits = true
+autovacuum = off
+maintenance_work_mem = 1024
+]);
+$node_primary->start;
+
+my $node_replica = PostgreSQL::Test::Cluster->new('standby');
+
+$node_primary->backup('my_backup');
+$node_replica->init_from_backup($node_primary, 'my_backup',
+	has_streaming => 1);
+
+$node_replica->start;
+
+my $test_db = "test_db";
+$node_primary->safe_psql('postgres', "CREATE DATABASE $test_db");
+
+# Save the original connection info for later use
+my $orig_conninfo = $node_primary->connstr();
+
+my $table1 = "vac_horizon_floor_table";
+
+# Long-running Primary Session A
+my $psql_primaryA =
+  $node_primary->background_psql($test_db, on_error_stop => 1);
+
+# Long-running Primary Session B
+my $psql_primaryB  =
+  $node_primary->background_psql($test_db, on_error_stop => 1);
+
+# Insert and update enough rows that we force at least one round of index
+# vacuuming before getting to a dead tuple which was killed after the standby
+# is disconnected.
+#
+# We need multiple index vacuuming passes to repro because after the standby
+# reconnects to the primary, our backend's GlobalVisStates will not have been
+# updated with the new horizon until something forces them to be updated.
+#
+# _bt_pendingfsm_finalize() calls GetOldestNonRemovableTransactionId() at the
+# end of a round of index vacuuming, updating the backend's GlobalVisState
+# and, in our case, moving maybe_needed backwards.
+#
+# Then vacuum's first pass will continue and pruning will find our later
+# inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
+# maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.
+$node_primary->safe_psql($test_db, qq[
+	CREATE TABLE ${table1}(col1 int) with (autovacuum_enabled=false, fillfactor=10);
+	INSERT INTO $table1 VALUES(7);
+	INSERT INTO $table1 SELECT generate_series(1, 40) % 3;
+	CREATE INDEX on ${table1}(col1);
+	UPDATE $table1 SET col1 = 3 WHERE col1 = 0;
+	INSERT INTO $table1 VALUES(7);
+]);
+
+# We will later move

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-02 Thread Melanie Plageman

On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas  wrote:
>
> Would it be possible to make it robust so that we could always run it
> with "make check"? This seems like an important corner case to
> regression test.

Okay, I've attached a new version of the patch and a new version of
the repro that may be fast and stable enough to commit. It is more
minimal than the previous version. I made the table as small as I
could to still trigger two rounds of index vacuuming. I tried to make
it as stable as possible. I also removed the cursor on the standby
that could trigger recovery conflicts. It would be super helpful if
someone could take a look at the test and point out any ways I could
make it even more likely to be stable.

- Melanie
From 7ecc180c4762c258dab01c7929c90a9c8e48c95b Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 20 Jun 2024 17:21:11 -0400
Subject: [PATCH v2 2/2] Ensure vacuum removes all visibly dead tuples older
 than OldestXmin

If vacuum fails to remove a tuple with xmax older than
VacuumCutoffs->OldestXmin and younger than GlobalVisState->maybe_needed,
it may attempt to freeze the tuple's xmax and then ERROR out in
pre-freeze checks with "cannot freeze committed xmax".

Fix this by having vacuum always remove tuples older than OldestXmin.

It is possible for GlobalVisState->maybe_needed to precede OldestXmin if
maybe_needed is forced to go backward while vacuum is running. This can
happen if a disconnected standby with a running transaction older than
VacuumCutoffs->OldestXmin reconnects to the primary after vacuum
initially calculates GlobalVisState and OldestXmin.

In back branches starting with 14, failing to remove tuples older than
OldestXmin during pruning caused vacuum to infinitely loop in
lazy_scan_prune(), as investigated on this [1] thread. After 1ccc1e05ae
removed the retry loop in lazy_scan_prune() and stopped comparing tuples
to OldestXmin, the hang could no longer happen, but we could still
attempt to freeze dead tuples with xmax older than OldestXmin --
resulting in an ERROR.

Fix this by always removing dead tuples with xmax older than
VacuumCutoffs->OldestXmin. This is okay because the standby won't replay
the tuple removal until the tuple is removable. Thus, the worst that can
happen is a recovery conflict.

[1] https://postgr.es/m/20240415173913.4zyyrwaftujxthf2%40awork3.anarazel.de#1b216b7768b5bd577a3d3d51bd5aadee

Back-patch through 14

Author: Melanie Plageman
Reviewed-by: Peter Geoghegan, Robert Haas, Andres Freund, Heikki Linnakangas, and Noah Misch
Discussion: https://postgr.es/m/CAAKRu_bDD7oq9ZwB2OJqub5BovMG6UjEYsoK2LVttadjEqyRGg%40mail.gmail.com
---
 src/backend/access/heap/pruneheap.c  | 23 ++-
 src/backend/access/heap/vacuumlazy.c | 14 +++---
 2 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 3cdfc5b7f1b..869d82ad667 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -325,6 +325,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  *
  * cutoffs contains the freeze cutoffs, established by VACUUM at the beginning
  * of vacuuming the relation.  Required if HEAP_PRUNE_FREEZE option is set.
+ * cutoffs->OldestXmin is also used to determine if dead tuples are
+ * HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD.
  *
  * presult contains output parameters needed by callers, such as the number of
  * tuples removed and the offsets of dead items on the page after pruning.
@@ -922,8 +924,27 @@ heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
 	if (res != HEAPTUPLE_RECENTLY_DEAD)
 		return res;

+	/*
+	 * For VACUUM, we must be sure to prune tuples with xmax older than
+	 * OldestXmin -- a visibility cutoff determined at the beginning of
+	 * vacuuming the relation. OldestXmin is used for freezing determination
+	 * and we cannot freeze dead tuples' xmaxes.
+	 */
+	if (prstate->cutoffs &&
+		TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
+		NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
+		return HEAPTUPLE_DEAD;
+
+	/*
+	 * Determine whether or not the tuple is considered dead when compared
+	 * with the provided GlobalVisState. On-access pruning does not provide
+	 * VacuumCutoffs. And for vacuum, even if the tuple's xmax is not older
+	 * than OldestXmin, GlobalVisTestIsRemovableXid() could find the row dead
+	 * if the GlobalVisState has been updated since the beginning of vacuuming
+	 * the relation.
+	 */
 	if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
-		res = HEAPTUPLE_DEAD;
+		return HEAPTUPLE_DEAD;

 	return res;
 }
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3f88cf1e8ef..abb47ae5960 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -438,13 +438,13 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * as an

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Tue, Jun 25, 2024 at 4:41 PM Andres Freund  wrote:
> I doubt that's doable in the back branches. And even on HEAD, I don't think
> it's a particularly attractive option - there's just a global vistest for each
> of the types of objects with a specific horizon (they need to be updated
> occasionally, e.g. when taking snapshots). So there's not really a spot to put
> an associated OldestXmin. We could put it there and remove it at the end of
> vacuum / in an exception handler, but that seems substantially worse.

Oh, right: I forgot that the visibility test objects were just
pointers to global variables.

Well, I don't know. I guess that doesn't leave any real options but to
fix it as Melanie proposed. But I still don't like it very much. I
feel like having to test against two different thresholds in the
pruning code is surely a sign that we're doing something wrong.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On 2024-06-25 14:35:00 -0400, Robert Haas wrote:
> Is there any way that we could instead tweak things so that we adjust
> the visibility test object itself? Like can have a GlobalVisTest API
> where we can supply the OldestXmin from the VacuumCutoffs and have it
> ... do something useful with that?

I doubt that's doable in the back branches. And even on HEAD, I don't think
it's a particularly attractive option - there's just a global vistest for each
of the types of objects with a specific horizon (they need to be updated
occasionally, e.g. when taking snapshots). So there's not really a spot to put
an associated OldestXmin. We could put it there and remove it at the end of
vacuum / in an exception handler, but that seems substantially worse.

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Tue, Jun 25, 2024 at 1:10 PM Andres Freund  wrote:
> That said, obviously there will be plenty setups where this won't cause an
> issue. I don't really have a handle on how often it'd be a problem.

Fair enough. Even if it's not super-common, it doesn't seem like a
great idea to regress such scenarios in the back-branches.

Is there any way that we could instead tweak things so that we adjust
the visibility test object itself? Like can have a GlobalVisTest API
where we can supply the OldestXmin from the VacuumCutoffs and have it
... do something useful with that?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

Hi,

On 2024-06-25 12:31:11 -0400, Robert Haas wrote:
> On Tue, Jun 25, 2024 at 11:39 AM Melanie Plageman
>  wrote:
> > On Tue, Jun 25, 2024 at 10:31 AM Robert Haas  wrote:
> > > On Tue, Jun 25, 2024 at 9:07 AM Andres Freund  wrote:
> > > > It's not hard - but it has downsides. It'll mean that - outside of 
> > > > vacuum -
> > > > we'll much more often not react to horizons going backwards due to
> > > > hot_standby_feedback. Which means that hot_standby_feedback, when used 
> > > > without
> > > > slots, will prevent fewer conflicts.
> > >
> > > Can you explain this in more detail?
> >
> > If we prevent GlobalVisState from moving backward, then we would less
> > frequently be pushing the horizon backward on the primary in response
> > to hot standby feedback. Then, the primary would do more things that
> > would not be safely replayable on the standby -- so the standby could
> > end up encountering more recovery conflicts.
> 
> I don't get it. hot_standby_feedback only moves horizons backward on
> the primary, AFAIK, when it first connects, or when it reconnects.
> Which I guess could be frequent for some users with flaky networks,
> but does that really rise to the level of "much more often"?

Well, the thing is that with the "prevent it from going backwards" approach,
once the horizon is set to something recent in a backend, it's "sticky".  If a
replica is a bit behind or if there's a long-lived snapshot and disconnects,
the vistest state will advance beyond where the replica needs it to be.  Even
if the standby later reconnects, the vistest in long-lived sessions will still
have the more advanced state. So all future pruning these backends do runs
into the risk of performing pruning that removes rows the standby still deems
visible and thus causes recovery conflicts.

I.e. you don't even need frequent disconnects, you just need one disconnect
and sessions that aren't shortlived.

That said, obviously there will be plenty setups where this won't cause an
issue. I don't really have a handle on how often it'd be a problem.

Greetings,

Andres Freund

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Tue, Jun 25, 2024 at 11:39 AM Melanie Plageman
 wrote:
> On Tue, Jun 25, 2024 at 10:31 AM Robert Haas  wrote:
> > On Tue, Jun 25, 2024 at 9:07 AM Andres Freund  wrote:
> > > It's not hard - but it has downsides. It'll mean that - outside of vacuum 
> > > -
> > > we'll much more often not react to horizons going backwards due to
> > > hot_standby_feedback. Which means that hot_standby_feedback, when used 
> > > without
> > > slots, will prevent fewer conflicts.
> >
> > Can you explain this in more detail?
>
> If we prevent GlobalVisState from moving backward, then we would less
> frequently be pushing the horizon backward on the primary in response
> to hot standby feedback. Then, the primary would do more things that
> would not be safely replayable on the standby -- so the standby could
> end up encountering more recovery conflicts.

I don't get it. hot_standby_feedback only moves horizons backward on
the primary, AFAIK, when it first connects, or when it reconnects.
Which I guess could be frequent for some users with flaky networks,
but does that really rise to the level of "much more often"?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-25 Thread Melanie Plageman

On Tue, Jun 25, 2024 at 10:31 AM Robert Haas  wrote:
>
> On Tue, Jun 25, 2024 at 9:07 AM Andres Freund  wrote:
> > It's not hard - but it has downsides. It'll mean that - outside of vacuum -
> > we'll much more often not react to horizons going backwards due to
> > hot_standby_feedback. Which means that hot_standby_feedback, when used 
> > without
> > slots, will prevent fewer conflicts.
>
> Can you explain this in more detail?

If we prevent GlobalVisState from moving backward, then we would less
frequently be pushing the horizon backward on the primary in response
to hot standby feedback. Then, the primary would do more things that
would not be safely replayable on the standby -- so the standby could
end up encountering more recovery conflicts.

- Melanie

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-25 Thread Alena Rybakina

On 24.06.2024 17:37, Melanie Plageman wrote:

On Mon, Jun 24, 2024 at 4:10 AM Alena Rybakina wrote:

We can fix this by always removing tuples considered dead before
VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
has a transaction that sees that tuple as alive, because it will
simply wait to replay the removal until it would be correct to do so
or recovery conflict handling will cancel the transaction that sees
the tuple as alive and allow replay to continue.

This is an interesting and difficult case) I noticed that when initializing the
cluster, in my opinion, we provide excessive freezing. Initialization takes a
long time, which can lead, for example, to longer test execution. I got rid of
this by adding the OldestMxact checkbox is not FirstMultiXactId, and it works
fine.

if (prstate->cutoffs &&
TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
prstate->cutoffs->OldestMxact != FirstMultiXactId &&
NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
return HEAPTUPLE_DEAD;

Can I keep it?

This looks like an addition to the new criteria I added to
heap_prune_satisfies_vacuum(). Is that what you are suggesting? If so,
it looks like it would only return HEAPTUPLE_DEAD (and thus only
remove) a subset of the tuples my original criteria would remove. When
vacuum calculates OldestMxact as FirstMultiXactId, it would not remove
those tuples deleted before OldestXmin. It seems like OldestMxact will
equal FirstMultiXactID sometimes right after initdb and after
transaction ID wraparound. I'm not sure I totally understand the
criteria.

One thing I find confusing about this is that this would actually
remove less tuples than with my criteria -- which could lead to more
freezing. When vacuum calculates OldestMxact == FirstMultiXactID, we
would not remove tuples deleted before OldestXmin and thus return
HEAPTUPLE_RECENTLY_DEAD for those tuples. Then we would consider
freezing them. So, it seems like we would do more freezing by adding
this criteria.

Could you explain more about how the criteria you are suggesting
works? Are you saying it does less freezing than master or less
freezing than with my patch?

At first, Inoticedthatwiththispatch, vacuumfoulsthenodes more
oftenthanbefore,anditseemedto me thatmoretimewasspentinitializingthe
clusterwiththispatchthanbefore,soIsuggestedconsideringthiscondition.Aftercheckingagain,
Ifoundthatthe problemwaswithmylaptop.So,sorryforthe noise.

Attached is the suggested fix for master plus a repro. I wrote it as a
recovery suite TAP test, but I am _not_ proposing we add it to the
ongoing test suite. It is, amongst other things, definitely prone to
flaking. I also had to use loads of data to force two index vacuuming
passes now that we have TIDStore, so it is a slow test.

-- snip --

I have a modified version of this that repros the infinite loop on
14-16 with substantially less data. See it here [2]. Also, the repro
attached to this mail won't work on 14 and 15 because of changes to
background_psql.

I couldn't understand why the replica is necessary here. Now I am digging why I
got the similar behavior without replica when I have only one instance. I'm
still checking this in my test, but I believe this patch fixes the original
problem because the symptoms were the same.

Did you get similar behavior on master or on back branches? Was the
behavior you observed the infinite loop or the error during
heap_prepare_freeze_tuple()?

In my examples, the replica is needed because something has to move
the horizon on the primary backwards. When a standby reconnects with
an older oldest running transaction ID than any of the running
transactions on the primary and the vacuuming backend recomputes its
RecentXmin, the horizon may move backwards when compared to the
horizon calculated at the beginning of the vacuum. Vacuum does not
recompute cutoffs->OldestXmin during vacuuming a relation but it may
recompute the values in the GlobalVisState it uses for pruning.

We knew of only one other way that the horizon could move backwards
which Matthias describes here [1]. However, this is thought to be its
own concurrency-related bug in the commit-abort path that should be
fixed -- as opposed to the standby reconnecting with an older oldest
running transaction ID which can be expected.

Do you know if you were seeing the effects of the scenario Matthias describes?

[1]https://www.postgresql.org/message-id/CAEze2WjMTh4KS0%3DQEQB-Jq%2BtDLPR%2B0%2BzVBMfVwSPK5A%3DWZa95Q%40mail.gmail.com
I'm sorry, I need a little more time to figure this out. I will answer
this question later.

--
Regards,
Alena Rybakina
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Tue, Jun 25, 2024 at 9:07 AM Andres Freund  wrote:
> It's not hard - but it has downsides. It'll mean that - outside of vacuum -
> we'll much more often not react to horizons going backwards due to
> hot_standby_feedback. Which means that hot_standby_feedback, when used without
> slots, will prevent fewer conflicts.

Can you explain this in more detail?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On 2024-06-25 08:42:02 -0400, Robert Haas wrote:
> On Tue, Jun 25, 2024 at 8:03 AM Andres Freund  wrote:
> > I think that's going in the wrong direction. We *want* to prune more
> > aggressively if we can (*), the necessary state is represented by the
> > vistest. That's a different thing than *having* to prune tuples beyond a
> > certain xmin (the cutoff determined by vacuum.c/vacuumlazy.c). The problem
> > we're having here is that the two states can get out of sync due to the
> > vistest "moving backwards", because of hot_standby_feedback (and perhaps 
> > also
> > an issue around aborts).
> 
> I agree that we want to prune more aggressively if we can. I think
> that fixing this by preventing vistest from going backward is
> reasonable, and I like it better than what Melanie proposed, although
> I like what Melanie proposed much better than not fixing it! I'm not
> sure how to do that cleanly, but one of you may have an idea.

It's not hard - but it has downsides. It'll mean that - outside of vacuum -
we'll much more often not react to horizons going backwards due to
hot_standby_feedback. Which means that hot_standby_feedback, when used without
slots, will prevent fewer conflicts.


> I do think that having a bunch of different XID values that function
> as horizons and a vistest object that holds some more XID horizons
> floating around in vacuum makes the code hard to understand. The
> relationships between the various values are not well-documented. For
> instance, the vistest has to be after vacrel->cutoffs.OldestXmin for
> correctness, but I don't think there's a single comment anywhere
> saying that;

It is somewhat documented:

 * Note: the approximate horizons (see definition of GlobalVisState) are
 * updated by the computations done here. That's currently required for
 * correctness and a small optimization. Without doing so it's possible that
 * heap vacuum's call to heap_page_prune_and_freeze() uses a more conservative
 * horizon than later when deciding which tuples can be removed - which the
 * code doesn't expect (breaking HOT).


> And more generally, it seems like a fairly big problem to me that
> LVRelState directly stores NewRelfrozenXid; contains a VacuumCutoffs
> object that stores relfrozenxid, OldestXmin, and FreezeLimit; and also
> points to a GlobalVisState object that contains definitely_needed and
> maybe_needed. That is six different XID cutoffs for one vacuum
> operation. That's a lot. I can't describe how they're all different
> from each other or what the necessary relationships between them are
> off-hand, and I bet nobody else could either, at least until recently,
> else we might not have this bug. I feel like if it were possible to
> have fewer of them and still have things work, we'd be better off. I'm
> not sure that's doable. But six seems like a lot.

Agreed. I don't think you can just unify things though, they actually are all
different for good, or at least decent, reasons.  I think improving the naming
alone could help a good bit though.

Greetings,

Andres Freund

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Tue, Jun 25, 2024 at 8:03 AM Andres Freund  wrote:
> I think that's going in the wrong direction. We *want* to prune more
> aggressively if we can (*), the necessary state is represented by the
> vistest. That's a different thing than *having* to prune tuples beyond a
> certain xmin (the cutoff determined by vacuum.c/vacuumlazy.c). The problem
> we're having here is that the two states can get out of sync due to the
> vistest "moving backwards", because of hot_standby_feedback (and perhaps also
> an issue around aborts).

I agree that we want to prune more aggressively if we can. I think
that fixing this by preventing vistest from going backward is
reasonable, and I like it better than what Melanie proposed, although
I like what Melanie proposed much better than not fixing it! I'm not
sure how to do that cleanly, but one of you may have an idea.

I do think that having a bunch of different XID values that function
as horizons and a vistest object that holds some more XID horizons
floating around in vacuum makes the code hard to understand. The
relationships between the various values are not well-documented. For
instance, the vistest has to be after vacrel->cutoffs.OldestXmin for
correctness, but I don't think there's a single comment anywhere
saying that; meanwhile, the comments for VacuumCutoffs say "OldestXmin
is the Xid below which tuples deleted by any xact (that committed)
should be considered DEAD, not just RECENTLY_DEAD." Surely the reader
can be forgiven for thinking that this is the cutoff that will
actually be used by pruning, but it isn't.

And more generally, it seems like a fairly big problem to me that
LVRelState directly stores NewRelfrozenXid; contains a VacuumCutoffs
object that stores relfrozenxid, OldestXmin, and FreezeLimit; and also
points to a GlobalVisState object that contains definitely_needed and
maybe_needed. That is six different XID cutoffs for one vacuum
operation. That's a lot. I can't describe how they're all different
from each other or what the necessary relationships between them are
off-hand, and I bet nobody else could either, at least until recently,
else we might not have this bug. I feel like if it were possible to
have fewer of them and still have things work, we'd be better off. I'm
not sure that's doable. But six seems like a lot.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

Hi,

On 2024-06-24 16:35:50 -0400, Robert Haas wrote:
> On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman
>  wrote:
> > Are you more concerned about having a single horizon for pruning or
> > about having a horizon that does not move backwards after being
> > established at the beginning of vacuuming the relation?
> 
> I'm not sure I understand. The most important thing here is fixing the
> bug. But if we have a choice of how to fix the bug, I'd prefer to do
> it by having the pruning code test one horizon that is always correct,
> rather than (as I think the patch does) having it test against two
> horizons because as a way of covering possible discrepancies between
> those values.

I think that's going in the wrong direction. We *want* to prune more
aggressively if we can (*), the necessary state is represented by the
vistest. That's a different thing than *having* to prune tuples beyond a
certain xmin (the cutoff determined by vacuum.c/vacuumlazy.c). The problem
we're having here is that the two states can get out of sync due to the
vistest "moving backwards", because of hot_standby_feedback (and perhaps also
an issue around aborts).

To prevent that we can
a) make sure that we always take the hard cutoff into account
b) prevent vistest from going backwards

(*) we really ought to become more aggressive, by removing intermediary row
versions when they're not visible to anyone, but not yet old enough to be
below the horizon. But that realistically will only be possible in *some*
cases, e.g. when the predecessor row version is on the same page.

> > I had always thought it was because the vacuuming backend's
> > GlobalVisState will get updated periodically throughout vacuum and so,
> > assuming the oldest running transaction changes, our horizon for
> > vacuum would change. But, in writing this repro, it is actually quite
> > hard to get GlobalVisState to update. Our backend's RecentXmin needs
> > to have changed. And there aren't very many places where we take a new
> > snapshot after starting to vacuum a relation. One of those is at the
> > end of index vacuuming, but that can only affect the pruning horizon
> > if we have to do multiple rounds of index vacuuming. Is that really
> > the case we are thinking of when we say we want the pruning horizon to
> > move forward during vacuum?
> 
> I thought the idea was that the GlobalVisTest stuff would force a
> recalculation now and then, but maybe it doesn't actually do that?

It forces an accurate horizon to be determined the first time it would require
it to determine visibility. The "first time" is determined by RecentXmin not
having changed.

The main goal of the vistest stuff was to not have to determine an accurate
horizon in GetSnapshotData(). Determining an accurate horizon requires
accessing each backends ->xmin, which causes things to scale badly, as ->xmin
changes so frequently.

The cost of determining the accurate horizon is irrelevant for vacuums, but
it's not at all irrelevant for on-access pruning.

> Suppose process A begins a transaction, acquires an XID, and then goes
> idle. Process B now begins a giant vacuum. At some point in the middle
> of the vacuum, A ends the transaction. Are you saying that B's
> GlobalVisTest never really notices that this has happened?

Not at the moment, but we should add heuristics like that.

Greetings,

Andres Freund

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Noah Misch

On Mon, Jun 24, 2024 at 09:49:53PM -0400, Peter Geoghegan wrote:
> On Mon, Jun 24, 2024 at 9:30 PM Noah Misch  wrote:
> > On Mon, Jun 24, 2024 at 03:23:39PM -0400, Melanie Plageman wrote:
> > > Right now, in master, we do use a single horizon when determining what
> > > is pruned -- that from GlobalVisState. OldestXmin is only used for
> > > freezing and full page visibility determinations. Using a different
> > > horizon for pruning by vacuum than freezing is what is causing the
> > > error on master.
> >
> > Agreed, and I think using different sources for pruning and freezing is a
> > recipe for future bugs.  Fundamentally, both are about answering "is
> > snapshot_considers_xid_in_progress(snapshot, xid) false for every snapshot?"
> > That's not to say this thread shall unify the two, but I suspect that's the
> > right long-term direction.
> 
> What does it really mean to unify the two, though?
> 
> If the OldestXmin field was located in struct GlobalVisState (next to
> definitely_needed and maybe_needed), but everything worked in
> essentially the same way as it will with Melanie's patch in place,
> would that count as unifying the two? Why or why not?

To me, no, unification would mean removing the data redundancy.  Relocating
the redundant field and/or code that updates the redundant field certainly
could reduce the risk of bugs, so I'm not opposing everything short of
removing the data redundancy.  I'm just agreeing with the "prefer" from
https://postgr.es/m/CA+TgmoYzS_bkt_MrNxr5QrXDKfedmh4tStn8UBTTBXqv=3j...@mail.gmail.com

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 9:30 PM Noah Misch  wrote:
> On Mon, Jun 24, 2024 at 03:23:39PM -0400, Melanie Plageman wrote:
> > Right now, in master, we do use a single horizon when determining what
> > is pruned -- that from GlobalVisState. OldestXmin is only used for
> > freezing and full page visibility determinations. Using a different
> > horizon for pruning by vacuum than freezing is what is causing the
> > error on master.
>
> Agreed, and I think using different sources for pruning and freezing is a
> recipe for future bugs.  Fundamentally, both are about answering "is
> snapshot_considers_xid_in_progress(snapshot, xid) false for every snapshot?"
> That's not to say this thread shall unify the two, but I suspect that's the
> right long-term direction.

What does it really mean to unify the two, though?

If the OldestXmin field was located in struct GlobalVisState (next to
definitely_needed and maybe_needed), but everything worked in
essentially the same way as it will with Melanie's patch in place,
would that count as unifying the two? Why or why not?

-- 
Peter Geoghegan

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Noah Misch

On Mon, Jun 24, 2024 at 04:51:24PM -0400, Peter Geoghegan wrote:
> On Mon, Jun 24, 2024 at 4:36 PM Robert Haas  wrote:
> > I'm not sure I understand. The most important thing here is fixing the
> > bug. But if we have a choice of how to fix the bug, I'd prefer to do
> > it by having the pruning code test one horizon that is always correct,
> > rather than (as I think the patch does) having it test against two
> > horizons because as a way of covering possible discrepancies between
> > those values.
> 
> Your characterizing of OldestXmin + vistest as two horizons seems
> pretty arbitrary to me. I know what you mean, of course, but it seems
> like a distinction without a difference.

"Two horizons" matches how I model it.  If the two were _always_ indicating
the same notion of visibility, we wouldn't have this thread.

On Mon, Jun 24, 2024 at 03:23:39PM -0400, Melanie Plageman wrote:
> Right now, in master, we do use a single horizon when determining what
> is pruned -- that from GlobalVisState. OldestXmin is only used for
> freezing and full page visibility determinations. Using a different
> horizon for pruning by vacuum than freezing is what is causing the
> error on master.

Agreed, and I think using different sources for pruning and freezing is a
recipe for future bugs.  Fundamentally, both are about answering "is
snapshot_considers_xid_in_progress(snapshot, xid) false for every snapshot?"
That's not to say this thread shall unify the two, but I suspect that's the
right long-term direction.

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 4:51 PM Peter Geoghegan  wrote:
>
> On Mon, Jun 24, 2024 at 4:36 PM Robert Haas  wrote:
> > I thought the idea was that the GlobalVisTest stuff would force a
> > recalculation now and then, but maybe it doesn't actually do that?
>
> It definitely can do that. Just not in a way that meaningfully
> increases the number of heap tuples that we can recognize as DEAD and
> remove. At least not currently.
>
> > Suppose process A begins a transaction, acquires an XID, and then goes
> > idle. Process B now begins a giant vacuum. At some point in the middle
> > of the vacuum, A ends the transaction. Are you saying that B's
> > GlobalVisTest never really notices that this has happened?
>
> That's my understanding, yes. That is, vistest is approximately the
> same thing as OldestXmin anyway. At least for now.

Exactly. Something has to cause this backend to update its view of the
horizon. At the end of index vacuuming,
GetOldestNonRemovableTransactionId() will explicitly
ComputeXidHorizons() which will update our backend's GlobalVisStates.
Otherwise, if our backend's RecentXmin is updated, by taking a new
snapshot, then we may update our GlobalVisStates. See
GlobalVisTestShouldUpdate() for the conditions under which we would
update our GlobalVisStates during the normal visibility checks
happening during pruning.

Vacuum used to open indexes after calculating horizons before starting
its first pass. This led to a recomputation of the horizon. But, in
master, there aren't many obvious places where such a thing would be
happening.

- Melanie

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 4:36 PM Robert Haas  wrote:
> I'm not sure I understand. The most important thing here is fixing the
> bug. But if we have a choice of how to fix the bug, I'd prefer to do
> it by having the pruning code test one horizon that is always correct,
> rather than (as I think the patch does) having it test against two
> horizons because as a way of covering possible discrepancies between
> those values.

Your characterizing of OldestXmin + vistest as two horizons seems
pretty arbitrary to me. I know what you mean, of course, but it seems
like a distinction without a difference.

> I thought the idea was that the GlobalVisTest stuff would force a
> recalculation now and then, but maybe it doesn't actually do that?

It definitely can do that. Just not in a way that meaningfully
increases the number of heap tuples that we can recognize as DEAD and
remove. At least not currently.

> Suppose process A begins a transaction, acquires an XID, and then goes
> idle. Process B now begins a giant vacuum. At some point in the middle
> of the vacuum, A ends the transaction. Are you saying that B's
> GlobalVisTest never really notices that this has happened?

That's my understanding, yes. That is, vistest is approximately the
same thing as OldestXmin anyway. At least for now.

-- 
Peter Geoghegan

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 4:42 PM Peter Geoghegan  wrote:
>
> On Mon, Jun 24, 2024 at 3:36 PM Melanie Plageman
>  wrote:
> > One thing I don't understand is why it is okay to freeze the xmax of a
> > dead tuple just because it is from an aborted update.
>
> We don't do that with XID-based xmaxs. Though perhaps we should, since
> we'll already prune-away the successor tuple, and so might as well go
> one tiny step further and clear the xmax for the original tuple via
> freezing/setting it InvalidTransactionId. Instead we just leave the
> original tuple largely undisturbed, with its original xmax.

I thought that was the case too, but we call
heap_prepare_freeze_tuple() on HEAPTUPLE_RECENTLY_DEAD tuples and then

else if (TransactionIdIsNormal(xid))
{
/* Raw xmax is normal XID */
freeze_xmax = TransactionIdPrecedes(xid, cutoffs->OldestXmin);
}

And then later we

if (freeze_xmax)
frz->xmax = InvalidTransactionId;

and then when we execute freezing the tuple in heap_execute_freeze_tuple()

HeapTupleHeaderSetXmax(tuple, frz->xmax);

Which sets the xmax to InvalidTransactionId. Or am I missing something?

> > The only case in which we freeze dead tuples
> > with a non-multi xmax is if the xmax is from before OldestXmin and is
> > also not committed (so from an aborted update).
>
> Perhaps I misunderstand, but: we simply don't freeze DEAD (not
> RECENTLY_DEAD) tuples in the first place, because we don't have to
> (pruning removes them instead). It doesn't matter if they're DEAD due
> to being from aborted transactions or DEAD due to being
> deleted/updated by a transaction that committed (committed and <
> OldestXmin).

Right, I'm talking about HEAPTUPLE_RECENTLY_DEAD tuples.
HEAPTUPLE_DEAD tuples are pruned away. But we can't replace the xmax
of a tuple that has been deleted or updated by a transaction that
committed with InvalidTransactionId. And it seems like the code does
that? Why even call heap_prepare_freeze_tuple() on
HEAPTUPLE_RECENTLY_DEAD tuples? Is it mainly to handle MultiXact
freezing?

> The freezing related code paths in heapam.c don't particularly care
> whether a tuple xmax is RECENTLY_DEAD or LIVE to HTSV + OldestXmin.
> Just as long as it's not fully DEAD (then it should have been pruned).

But it just seems like we shouldn't freeze RECENTLY_DEAD either.

- Melanie

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 3:36 PM Melanie Plageman
 wrote:
> One thing I don't understand is why it is okay to freeze the xmax of a
> dead tuple just because it is from an aborted update.

We don't do that with XID-based xmaxs. Though perhaps we should, since
we'll already prune-away the successor tuple, and so might as well go
one tiny step further and clear the xmax for the original tuple via
freezing/setting it InvalidTransactionId. Instead we just leave the
original tuple largely undisturbed, with its original xmax.

We do something like that with Multi-based xmax fields, though not
with the specific goal of cleaning up after aborts in mind (we can
also remove lockers that are no longer running, regardless of where
they are relative to OldestXmin, stuff like that). The actual goal
with that is to enforce MultiXactCutoff, independently of whether or
not their member XIDs are < FreezeLimit yet.

> The only case in which we freeze dead tuples
> with a non-multi xmax is if the xmax is from before OldestXmin and is
> also not committed (so from an aborted update).

Perhaps I misunderstand, but: we simply don't freeze DEAD (not
RECENTLY_DEAD) tuples in the first place, because we don't have to
(pruning removes them instead). It doesn't matter if they're DEAD due
to being from aborted transactions or DEAD due to being
deleted/updated by a transaction that committed (committed and <
OldestXmin).

The freezing related code paths in heapam.c don't particularly care
whether a tuple xmax is RECENTLY_DEAD or LIVE to HTSV + OldestXmin.
Just as long as it's not fully DEAD (then it should have been pruned).

--
Peter Geoghegan

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Robert Haas

On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman
 wrote:
> Are you more concerned about having a single horizon for pruning or
> about having a horizon that does not move backwards after being
> established at the beginning of vacuuming the relation?

I'm not sure I understand. The most important thing here is fixing the
bug. But if we have a choice of how to fix the bug, I'd prefer to do
it by having the pruning code test one horizon that is always correct,
rather than (as I think the patch does) having it test against two
horizons because as a way of covering possible discrepancies between
those values.

> Right now, in master, we do use a single horizon when determining what
> is pruned -- that from GlobalVisState. OldestXmin is only used for
> freezing and full page visibility determinations. Using a different
> horizon for pruning by vacuum than freezing is what is causing the
> error on master.

Agreed.

> I had always thought it was because the vacuuming backend's
> GlobalVisState will get updated periodically throughout vacuum and so,
> assuming the oldest running transaction changes, our horizon for
> vacuum would change. But, in writing this repro, it is actually quite
> hard to get GlobalVisState to update. Our backend's RecentXmin needs
> to have changed. And there aren't very many places where we take a new
> snapshot after starting to vacuum a relation. One of those is at the
> end of index vacuuming, but that can only affect the pruning horizon
> if we have to do multiple rounds of index vacuuming. Is that really
> the case we are thinking of when we say we want the pruning horizon to
> move forward during vacuum?

I thought the idea was that the GlobalVisTest stuff would force a
recalculation now and then, but maybe it doesn't actually do that?

Suppose process A begins a transaction, acquires an XID, and then goes
idle. Process B now begins a giant vacuum. At some point in the middle
of the vacuum, A ends the transaction. Are you saying that B's
GlobalVisTest never really notices that this has happened?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
 wrote:
>
> If vacuum fails to remove a tuple with xmax older than
> VacuumCutoffs->OldestXmin and younger than
> GlobalVisState->maybe_needed, it will ERROR out when determining
> whether or not to freeze the tuple with "cannot freeze committed
> xmax".

One thing I don't understand is why it is okay to freeze the xmax of a
dead tuple just because it is from an aborted update.
heap_prepare_freeze_tuple() is called on HEAPTUPLE_RECENTLY_DEAD
tuples with normal xmaxes (non-multis) so that it can freeze tuples
from aborted updates. The only case in which we freeze dead tuples
with a non-multi xmax is if the xmax is from before OldestXmin and is
also not committed (so from an aborted update). Freezing dead tuples
replaces their xmax with InvalidTransactionId -- which would make them
look alive. So, it makes sense we don't do this for dead tuples in the
common case. But why is it 1) okay and 2) desirable to freeze xmaxes
of tuples from aborted updates? Won't it make them look alive again?

- Melanie

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman
 wrote:
> I had always thought it was because the vacuuming backend's
> GlobalVisState will get updated periodically throughout vacuum and so,
> assuming the oldest running transaction changes, our horizon for
> vacuum would change.

I believe that it's more of an aspirational thing at this point. That
it is currently aspirational (it happens to some extent but isn't ever
particularly useful) shouldn't change the analysis about how to fix
this bug.

> One of those is at the
> end of index vacuuming, but that can only affect the pruning horizon
> if we have to do multiple rounds of index vacuuming. Is that really
> the case we are thinking of when we say we want the pruning horizon to
> move forward during vacuum?

No, that's definitely not what we were thinking of. It's just an
accident that it's almost the only thing that'll do that.

-- 
Peter Geoghegan

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 1:05 PM Robert Haas  wrote:
>
> On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan  wrote:
> > The problem here is that OldestXmin is supposed to be more
> > conservative than vistest, which it almost always is, except in this
> > one edge case. I don't think that plugging that hole changes the basic
> > fact that there is one source of truth about what *needs* to be
> > pruned. There is such a source of truth: OldestXmin.
>
> Well, another approach could be to make it so that OldestXmin actually
> is always more conservative than vistest rather than almost always.

For the purposes of pruning, we are effectively always using the more
conservative of the two with this patch.

Are you more concerned about having a single horizon for pruning or
about having a horizon that does not move backwards after being
established at the beginning of vacuuming the relation?

Right now, in master, we do use a single horizon when determining what
is pruned -- that from GlobalVisState. OldestXmin is only used for
freezing and full page visibility determinations. Using a different
horizon for pruning by vacuum than freezing is what is causing the
error on master.

> I agree with you that letting the pruning horizon move forward during
> vacuum is desirable. I'm just wondering if having the vacuum code need
> to know a second horizon is really the best way to address that.

I was thinking about this some more and I realized I don't really get
why we think using GlobalVisState for pruning will let us remove more
tuples in the common case.

I had always thought it was because the vacuuming backend's
GlobalVisState will get updated periodically throughout vacuum and so,
assuming the oldest running transaction changes, our horizon for
vacuum would change. But, in writing this repro, it is actually quite
hard to get GlobalVisState to update. Our backend's RecentXmin needs
to have changed. And there aren't very many places where we take a new
snapshot after starting to vacuum a relation. One of those is at the
end of index vacuuming, but that can only affect the pruning horizon
if we have to do multiple rounds of index vacuuming. Is that really
the case we are thinking of when we say we want the pruning horizon to
move forward during vacuum?

- Melanie

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 1:05 PM Robert Haas  wrote:
> On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan  wrote:
> > The problem here is that OldestXmin is supposed to be more
> > conservative than vistest, which it almost always is, except in this
> > one edge case. I don't think that plugging that hole changes the basic
> > fact that there is one source of truth about what *needs* to be
> > pruned. There is such a source of truth: OldestXmin.
>
> Well, another approach could be to make it so that OldestXmin actually
> is always more conservative than vistest rather than almost always.

If we did things like that then it would still be necessary to write a
patch like the one Melanie came up with, on the grounds that we'd
really need to be paranoid about having missed some subtlety. We might
as well just rely on the mechanism directly. I just don't think that
it makes much difference.

-- 
Peter Geoghegan

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Robert Haas

On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan  wrote:
> The problem here is that OldestXmin is supposed to be more
> conservative than vistest, which it almost always is, except in this
> one edge case. I don't think that plugging that hole changes the basic
> fact that there is one source of truth about what *needs* to be
> pruned. There is such a source of truth: OldestXmin.

Well, another approach could be to make it so that OldestXmin actually
is always more conservative than vistest rather than almost always.

I agree with you that letting the pruning horizon move forward during
vacuum is desirable. I'm just wondering if having the vacuum code need
to know a second horizon is really the best way to address that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 11:44 AM Robert Haas  wrote:
> I don't have a great feeling about this fix. It's not that I think
> it's wrong. It's just that the underlying problem here is that we have
> heap_page_prune_and_freeze() getting both GlobalVisState *vistest and
> struct VacuumCutoffs *cutoffs, and the vistest wants to be in charge
> of deciding what gets pruned, but that doesn't actually work, because
> as I pointed out in
> http://postgr.es/m/ca+tgmob1btwcp6r5-tovhb5wqhasptsr2tjkcdcutmzauyb...@mail.gmail.com
> it's not properly synchronized with vacrel->cutoffs.OldestXmin. Your
> fix is to consider both variables, which again may be totally correct,
> but wouldn't it be a lot better if we didn't have two variables
> fighting for control of the same behavior?

Why would it be better? It's to our advantage to have vistest prune
away extra tuples when possible. Andres placed a lot of emphasis on
that during the snapshot scalability work -- vistest can be updated on
the fly.

The problem here is that OldestXmin is supposed to be more
conservative than vistest, which it almost always is, except in this
one edge case. I don't think that plugging that hole changes the basic
fact that there is one source of truth about what *needs* to be
pruned. There is such a source of truth: OldestXmin.

-- 
Peter Geoghegan

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Robert Haas

On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
 wrote:
> We can fix this by always removing tuples considered dead before
> VacuumCutoffs->OldestXmin.

I don't have a great feeling about this fix. It's not that I think
it's wrong. It's just that the underlying problem here is that we have
heap_page_prune_and_freeze() getting both GlobalVisState *vistest and
struct VacuumCutoffs *cutoffs, and the vistest wants to be in charge
of deciding what gets pruned, but that doesn't actually work, because
as I pointed out in
http://postgr.es/m/ca+tgmob1btwcp6r5-tovhb5wqhasptsr2tjkcdcutmzauyb...@mail.gmail.com
it's not properly synchronized with vacrel->cutoffs.OldestXmin. Your
fix is to consider both variables, which again may be totally correct,
but wouldn't it be a lot better if we didn't have two variables
fighting for control of the same behavior?

(I'm not trying to be a nuisance here -- I think it's great that
you've done the work to pin this down and perhaps there is no better
fix than what you've proposed.)

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Tomas Vondra





On 6/24/24 16:53, Melanie Plageman wrote:
> On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas  wrote:
>>
>> On 21/06/2024 03:02, Peter Geoghegan wrote:
>>> On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
>>>  wrote:
>>>
 The repro forces a round of index vacuuming after the standby
 reconnects and before pruning a dead tuple whose xmax is older than
 OldestXmin.

 At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
 calls GetOldestNonRemovableTransactionId(), thereby updating the
 backend's GlobalVisState and moving maybe_needed backwards.
>>>
>>> Right. I saw details exactly consistent with this when I used GDB
>>> against a production instance.
>>>
>>> I'm glad that you were able to come up with a repro that involves
>>> exactly the same basic elements, including index page deletion.
>>
>> Would it be possible to make it robust so that we could always run it
>> with "make check"? This seems like an important corner case to
>> regression test.
> 
> I'd have to look into how to ensure I can stabilize some of the parts
> that seem prone to flaking. I can probably stabilize the vacuum bit
> with a query of pg_stat_activity making sure it is waiting to acquire
> the cleanup lock.
> 
> I don't, however, see a good way around the large amount of data
> required to trigger more than one round of index vacuuming. I could
> generate the data more efficiently than I am doing here
> (generate_series() in the from clause). Perhaps with a copy? I know it
> is too slow now to go in an ongoing test, but I don't have an
> intuition around how fast it would have to be to be acceptable. Is
> there a set of additional tests that are slower that we don't always
> run? I didn't follow how the wraparound test ended up, but that seems
> like one that would have been slow.
> 

I think it depends on what is the impact on the 'make check' duration.
If it could be added to one of the existing test groups, then it depends
on how long the slowest test in that group is. If the new test needs to
be in a separate group, it probably needs to be very fast.

But I was wondering how much time are we talking about, so I tried

creating a table, filling it with 300k rows => 250ms
creating an index => 180ms
delete 90% data => 200ms
vacuum t => 130ms

which with m_w_m=1MB does two rounds of index cleanup. That's ~760ms.
I'm not sure how much more stuff does the test need to do, but this
would be pretty reasonable, if we could add it to an existing group.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas  wrote:
>
> On 21/06/2024 03:02, Peter Geoghegan wrote:
> > On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
> >  wrote:
> >
> >> The repro forces a round of index vacuuming after the standby
> >> reconnects and before pruning a dead tuple whose xmax is older than
> >> OldestXmin.
> >>
> >> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
> >> calls GetOldestNonRemovableTransactionId(), thereby updating the
> >> backend's GlobalVisState and moving maybe_needed backwards.
> >
> > Right. I saw details exactly consistent with this when I used GDB
> > against a production instance.
> >
> > I'm glad that you were able to come up with a repro that involves
> > exactly the same basic elements, including index page deletion.
>
> Would it be possible to make it robust so that we could always run it
> with "make check"? This seems like an important corner case to
> regression test.

I'd have to look into how to ensure I can stabilize some of the parts
that seem prone to flaking. I can probably stabilize the vacuum bit
with a query of pg_stat_activity making sure it is waiting to acquire
the cleanup lock.

I don't, however, see a good way around the large amount of data
required to trigger more than one round of index vacuuming. I could
generate the data more efficiently than I am doing here
(generate_series() in the from clause). Perhaps with a copy? I know it
is too slow now to go in an ongoing test, but I don't have an
intuition around how fast it would have to be to be acceptable. Is
there a set of additional tests that are slower that we don't always
run? I didn't follow how the wraparound test ended up, but that seems
like one that would have been slow.

- Melanie

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin