Re: New IndexAM API controlling index vacuum strategies
On Thu, Apr 15, 2021 at 5:12 PM Andres Freund wrote: > > https://thebuild.com/blog/2019/02/08/do-not-change-autovacuum-age-settings/ > > Not at all convinced. The issue of needing to modify a lot of > all-visible pages again to freeze them is big enough to let it be a > problem even after the freeze map. Yes, there's workloads where it's > much less of a problem, but not all the time. Not convinced of what? I only claimed that it was much less common. Many users live in fear of the extreme worst case of the database no longer being able to accept writes. That is a very powerful fear. > > As I said, we handle the case where autovacuum_freeze_max_age is set > > to something larger than vacuum_failsafe_age in a straightforward and > > pretty sensible way. I am curious, though: what > > autovacuum_freeze_max_age setting is "much higher" than 1.6 billion, > > but somehow also not extremely ill-advised and dangerous? What number > > is that, precisely? Apparently this is common, but I must confess that > > it's the first I've heard about it. > > I didn't intend to say that the autovacuum_freeze_max_age would be set > much higher than 1.6 billion, just that the headroom would be much > less. I've set it, and seen it set, to 1.5-1.8bio without problems, > while reducing overhead substantially. Okay, that makes way more sense. (Though I still think that an autovacuum_freeze_max_age beyond 1 billion is highly dubious.) Let's say you set autovacuum_freeze_max_age to 1.8 billion (and you really know what you're doing). This puts you in a pretty select group of Postgres users -- the kind of select group that might be expected to pay very close attention to the compatibility section of the release notes. In any case it makes the failsafe kick in when relfrozenxid age is 1.89 billion. Is that so bad? In fact, isn't this feature actually pretty great for this select cohort of Postgres users that live dangerously? Now it's far safer to live on the edge (perhaps with some additional tuning that ought to be easy for this elite group of users). -- Peter Geoghegan
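A minimal standalone sketch of the arithmetic being discussed here (not the actual PostgreSQL source, just the interpretation described in the thread): the failsafe cutoff is taken to be vacuum_failsafe_age, but never less than 105% of autovacuum_freeze_max_age, which is how 1.8 billion becomes 1.89 billion.

    #include <stdio.h>
    #include <stdint.h>

    /* Effective failsafe cutoff as described in the thread */
    static uint64_t
    effective_failsafe_age(uint64_t vacuum_failsafe_age,
                           uint64_t autovacuum_freeze_max_age)
    {
        uint64_t floor_age = (autovacuum_freeze_max_age / 100) * 105;

        return floor_age > vacuum_failsafe_age ? floor_age : vacuum_failsafe_age;
    }

    int
    main(void)
    {
        /* Defaults: vacuum_failsafe_age 1.6 billion, autovacuum_freeze_max_age 200 million */
        printf("%llu\n", (unsigned long long)
               effective_failsafe_age(1600000000ULL, 200000000ULL));   /* 1600000000 */

        /* autovacuum_freeze_max_age raised to 1.8 billion */
        printf("%llu\n", (unsigned long long)
               effective_failsafe_age(1600000000ULL, 1800000000ULL));  /* 1890000000 */
        return 0;
    }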
Re: New IndexAM API controlling index vacuum strategies
Hi, On 2021-04-14 21:30:29 -0700, Peter Geoghegan wrote: > I think that this was once true, but is now much less common, mostly > due to the freeze map stuff in 9.6. And due to a general recognition that > the *risk* of increasing them is just too great (a risk that we can > hope was diminished by the failsafe, incidentally). As an example of > this, Christophe Pettus had a Damascene conversion when it came to > increasing autovacuum_freeze_max_age aggressively, which he explains > here: > > https://thebuild.com/blog/2019/02/08/do-not-change-autovacuum-age-settings/ Not at all convinced. The issue of needing to modify a lot of all-visible pages again to freeze them is big enough to let it be a problem even after the freeze map. Yes, there's workloads where it's much less of a problem, but not all the time. > As I said, we handle the case where autovacuum_freeze_max_age is set > to something larger than vacuum_failsafe_age in a straightforward and > pretty sensible way. I am curious, though: what > autovacuum_freeze_max_age setting is "much higher" than 1.6 billion, > but somehow also not extremely ill-advised and dangerous? What number > is that, precisely? Apparently this is common, but I must confess that > it's the first I've heard about it. I didn't intend to say that the autovacuum_freeze_max_age would be set much higher than 1.6 billion, just that the headroom would be much less. I've set it, and seen it set, to 1.5-1.8bio without problems, while reducing overhead substantially. Greetings, Andres Freund
Re: New IndexAM API controlling index vacuum strategies
On Wed, Apr 14, 2021 at 8:38 PM Andres Freund wrote: > The reason I didn't do further reviews for things in this thread was > that I was trying really hard to get the shared memory stats patch into > a committable shape - there were just not enough hours in the day. I > think it's to be expected that, during the final CF, there aren't a lot > of resources for reviewing patches that are substantially new. Why > should these new patches have gotten priority over a much older patch > set that also addresses significant operational issues? We're all doing our best. > It's very common for larger / busier databases to *substantially* > increase autovacuum_freeze_max_age, so there won't be 1.6 billion XIDs > of headroom, but a few hundred million. The cost of doing unnecessary > anti-wraparound vacuums is just too great. And databases on the busier & > larger side of things are precisely the ones that are more likely to hit > wraparound issues (otherwise you're just not that likely to burn through > that many xids). I think that this was once true, but is now much less common, mostly due to the freeze map stuff in 9.6. And due to a general recognition that the *risk* of increasing them is just too great (a risk that we can hope was diminished by the failsafe, incidentally). As an example of this, Christophe Pettus had a Damascene conversion when it came to increasing autovacuum_freeze_max_age aggressively, which he explains here: https://thebuild.com/blog/2019/02/08/do-not-change-autovacuum-age-settings/ In short, he went from regularly advising clients to increase autovacuum_freeze_max_age to specifically advising them to never touch it. Even if we assume that I'm 100% wrong about autovacuum_freeze_max_age, it's still true that the vacuum_failsafe_age GUC is interpreted with reference to autovacuum_freeze_max_age -- it will always be interpreted as if it was set to 105% of whatever the current value of autovacuum_freeze_max_age happens to be (so it's symmetric with the freeze_table_age GUC and its 95% behavior). So it's never completely unreasonable in the sense of directly clashing with an existing autovacuum_freeze_max_age setting from before the upgrade. Of course this doesn't mean that there couldn't possibly be any problems with the new mechanism clashing with autovacuum_freeze_max_age in some unforeseen way. But, the worst that can happen is that a user that is sophisticated enough to very aggressively increase autovacuum_freeze_max_age upgrades to Postgres 14, and then finds that index vacuuming is sometimes skipped. Which they'll see lots of annoying and scary messages about if they ever look in the logs. I think that that's an acceptable price to pay to protect the majority of less sophisticated users. > And my concern isn't really that vacuum would have finished without a > problem if cost limiting hadn't been disabled, but that having multiple > autovacuum workers going all out will cause problems. Like the system > slowing down so much that it's hard to fix the actual root cause of the > wraparound - I've seen systems where a bunch of unthrottled autovacuum workers > overwhelmed the IO subsystem so much that simply opening a connection to > fix the issue took 10+ minutes. Especially on systems with provisioned > IO (i.e. just about all cloud storage) that's not too hard to hit. I don't think that it's reasonable to expect an intervention like this to perfectly eliminate all risk, while at the same time never introducing any new theoretical risks. 
(Especially while also being simple and obviously correct.) > > If it's intrinsically impossible to advance relfrozenxid, then surely > > all bets are off. But even in this scenario it's very unlikely that we > > wouldn't at least do index vacuuming for those index tuples that are > > dead and safe to delete according to the OldestXmin cutoff. You still > > have 1.6 billion XIDs before the failsafe first kicks in, regardless > > of the issue of the OldestXmin/FreezeLimit being excessively far in > > the past. > > As I said above, I don't think the "1.6 billion XIDs" argument has > merit, because it's so reasonable (and common) to set > autovacuum_freeze_max_age to something much larger. No merit? Really? Not even a teeny, tiny, microscopic little bit of merit? You're sure? As I said, we handle the case where autovacuum_freeze_max_age is set to something larger than vacuum_failsafe_age in a straightforward and pretty sensible way. I am curious, though: what autovacuum_freeze_max_age setting is "much higher" than 1.6 billion, but somehow also not extremely ill-advised and dangerous? What number is that, precisely? Apparently this is common, but I must confess that it's the first I've heard about it. -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
Hi, On 2021-04-14 19:53:29 -0700, Peter Geoghegan wrote: > > Or at least > > tests for it should be added (pg_resetwal + autovacuum_naptime=1s or > > such should make that doable, or even just running a small test with > > lower thresholds). > > You know what else doesn't have test coverage? Any kind of aggressive > VACUUM. There is a problem with our culture around testing. I would > like to address that in the scope of this project, but you know how it > is. Can I take it that I'll have your support with adding those tests? Sure! > > I think there are good arguments for having logic for an "emergency > > vacuum" mode (and also some good ones against). I'm not convinced that > > the current set of things that are [not] skipped in failsafe mode is the > > "obviously right set of things"™ but am convinced that there wasn't > > enough consensus building on what that set of things is. This all also > > would be different if it were the start of the development window, > > rather than the end. > > I all but begged you to review the patches. Same with Robert. While > the earlier patches (where almost all of the complexity is) did get > review from both you and Robert (which I was grateful to receive), for > whatever reason neither of you looked at the later patches in detail. Based on a quick scan of the thread, the first version of a patch that kind of resembles what got committed around the topic at hand seems to be https://postgr.es/m/CAH2-Wzm7Y%3D_g3FjVHv7a85AfUbuSYdggDnEqN1hodVeOctL%2BOw%40mail.gmail.com posted 2021-03-15. That's well into the last CF. The reason I didn't do further reviews for things in this thread was that I was trying really hard to get the shared memory stats patch into a committable shape - there were just not enough hours in the day. I think it's to be expected that, during the final CF, there aren't a lot of resources for reviewing patches that are substantially new. Why should these new patches have gotten priority over a much older patch set that also addresses significant operational issues? > > I think there's also a clear danger in having "cliffs" where the > > behaviour changes abruptly once a certain threshold is reached. It's > > not unlikely for systems to fall over entirely when > > > > a) autovacuum cost limiting is disabled. E.g. reaching your disk > > iops/throughput quota and barely being able to log into postgres > > anymore to kill the stuck connection causing the wraparound issue. > > Let me get this straight: You're concerned that hurrying up vacuuming > when we have 500 million XIDs left to burn will overwhelm the system, > which would presumably have finished in time otherwise? > Even though it would have to do way more work in absolute terms in the > absence of the failsafe? And even though the 1.6 billion XID age that > we got to before the failsafe kicked in was clearly not enough? You'd > want to "play it safe", and stick with the original plan at that > point? It's very common for larger / busier databases to *substantially* increase autovacuum_freeze_max_age, so there won't be 1.6 billion XIDs of headroom, but a few hundred million. The cost of doing unnecessary anti-wraparound vacuums is just too great. And databases on the busier & larger side of things are precisely the ones that are more likely to hit wraparound issues (otherwise you're just not that likely to burn through that many xids). 
And my concern isn't really that vacuum would have finished without a problem if cost limiting hadn't been disabled, but that having multiple autovacuum workers going all out will cause problems. Like the system slowing down so much that it's hard to fix the actual root cause of the wraparound - I've seen systems where a bunch of unthrottled autovacuum workers overwhelmed the IO subsystem so much that simply opening a connection to fix the issue took 10+ minutes. Especially on systems with provisioned IO (i.e. just about all cloud storage) that's not too hard to hit. > > b) No index cleanup happens anymore. E.g. a workload with a lot of > > bitmap index scans (which do not support killtuples) could end up a lot worse because index pointers to dead tuples aren't being > > cleaned up. In cases where an old transaction or leftover replication > > slot is causing the problem (together, a significant percentage of > > wraparound situations) this situation will persist across repeated > > (explicit or automatic) vacuums for a table, because relfrozenxid > > won't actually be advanced. And this in turn might actually end up > > slowing resolution of the wraparound issue more than doing all the > > index scans. > > If it's intrinsically impossible to advance relfrozenxid, then surely > all bets are off. But even in this scenario it's very unlikely that we > wouldn't at least do index vacuuming for those index tuples that are > dead and safe to delete according to the OldestXmin cutoff. You still > have 1.6 billion XIDs before the
Re: New IndexAM API controlling index vacuum strategies
On Wed, Apr 14, 2021 at 6:53 PM Andres Freund wrote: > > To a large degree the failsafe is something that is written in the > > hope that it will never be needed. This is unlike most other things, > > and has its own unique risks. > > Among them that the code is not covered by tests and is unlikely to be > meaningfully exercised within the beta timeframe due to the timeframes > for hitting it (hard to actually hit below a 1/2 day running extreme > workloads, weeks for more realistic ones). Which means that this code > has to be extra vigorously reviewed, not the opposite. There is test coverage for the optimization to bypass index vacuuming with very few dead tuples. Plus we can expect it to kick in very frequently. That's not as good as testing this other mechanism directly, which I agree ought to happen too. But the difference between those two cases is pretty much entirely around when and how they kick in. I have normalized the idea that index vacuuming is optional in principle, so in an important sense it is tested all the time. > Or at least > tests for it should be added (pg_resetwal + autovacuum_naptime=1s or > such should make that doable, or even just running a small test with > lower thresholds). You know what else doesn't have test coverage? Any kind of aggressive VACUUM. There is a problem with our culture around testing. I would like to address that in the scope of this project, but you know how it is. Can I take it that I'll have your support with adding those tests? > This line of argumentation scares me. Unexplained arguments, about > running in conditions that we otherwise don't run in, when in > exceptional circumstances. This code has a history of super subtle > interactions, with quite a few data loss causing bugs due to us not > foreseeing some combination of circumstances. I'll say it again: I was wrong to not make that clearer prior to committing the fixup. I regret that error, which probably had a lot to do with being fatigued. > I think there are good arguments for having logic for an "emergency > vacuum" mode (and also some good ones against). I'm not convinced that > the current set of things that are [not] skipped in failsafe mode is the > "obviously right set of things"™ but am convinced that there wasn't > enough consensus building on what that set of things is. This all also > would be different if it were the start of the development window, > rather than the end. I all but begged you to review the patches. Same with Robert. While the earlier patches (where almost all of the complexity is) did get review from both you and Robert (which I was grateful to receive), for whatever reason neither of you looked at the later patches in detail. (Robert said that the failsafe ought to cover single-pass/no-indexes VACUUM at one point, which did influence the design of the failsafe, but for the most part his input on the later stuff was minimal and expressed in general terms.) Of course, nothing stops us from improving the mechanism in the future. Though I maintain that the fundamental approach of finishing as quickly as possible is basically sound (short of fixing the problem directly, for example by obviating the need for freezing). 
> In my experience the big problem with vacuums in a wraparound situation > isn't actually things like truncation or even the index scans (although > they certainly can cause bad problems), but that VACUUM modifies > (prune/vacuum and WAL log or just setting hint bits) a crapton of pages > that don't actually need to be modified just to be able to get out of > the wraparound situation. Things like UUID indexes are very popular, and are likely to have an outsized impact on dirtying pages (which I agree is the real problem). Plus some people just have a ridiculous number of indexes (e.g., the Discourse table that they pointed out as a particularly good target for deduplication had a total of 13 indexes). There is an excellent chance that stuff like that is involved in installations that actually have huge problems. The visibility map works pretty well these days -- but not for indexes. > And that the overhead of writing out all those > dirty pages + WAL logging causes the VACUUM to take unacceptably > long. E.g. because your storage is cloud storage with a few ms of > latency, and the ringbuffer + wal_buffer sizes cause so many synchronous > writes that you end up with < 10MB/s of data being processed. This is a false dichotomy. There probably is an argument for making the failsafe not do pruning that isn't strictly necessary (or something like that) in a future release. I don't see what particular significance that has for the failsafe mechanism now. The sooner we can advance relfrozenxid when it's dangerously far in the past, the better. It's true that the mechanism doesn't exploit every possible opportunity to do so. But it mostly manages to do that. > I think there's also a clear danger in having "cliffs" where the > behaviour changes apprup
Re: New IndexAM API controlling index vacuum strategies
Hi, On 2021-04-14 14:55:36 -0700, Peter Geoghegan wrote: > On Wed, Apr 14, 2021 at 12:33 PM Andres Freund wrote: > > I'm getting a bit bothered by the speed at which you're pushing fairly > > substantial behavioural changes for vacuum. In this case without even a warning > > that you're about to do so. > > To a large degree the failsafe is something that is written in the > hope that it will never be needed. This is unlike most other things, > and has its own unique risks. Among them that the code is not covered by tests and is unlikely to be meaningfully exercised within the beta timeframe due to the timeframes for hitting it (hard to actually hit below a 1/2 day running extreme workloads, weeks for more realistic ones). Which means that this code has to be extra vigorously reviewed, not the opposite. Or at least tests for it should be added (pg_resetwal + autovacuum_naptime=1s or such should make that doable, or even just running a small test with lower thresholds). > I just went one step further than that in this recent commit. I didn't > point these details out before now because (to me) this is beside the > point. Which is that the failsafe is just that -- a failsafe. Anything > that adds unnecessary unpredictable delay in reaching the point of > advancing relfrozenxid should be avoided. (Besides, the design of > should_attempt_truncation() and lazy_truncate_heap() is very far from > guaranteeing that truncation will take place at the best of times.) This line of argumentation scares me. Unexplained arguments, about running in conditions that we otherwise don't run in, when in exceptional circumstances. This code has a history of super subtle interactions, with quite a few data loss causing bugs due to us not foreseeing some combination of circumstances. I think there are good arguments for having logic for an "emergency vacuum" mode (and also some good ones against). I'm not convinced that the current set of things that are [not] skipped in failsafe mode is the "obviously right set of things"™ but am convinced that there wasn't enough consensus building on what that set of things is. This all also would be different if it were the start of the development window, rather than the end. In my experience the big problem with vacuums in a wraparound situation isn't actually things like truncation or even the index scans (although they certainly can cause bad problems), but that VACUUM modifies (prune/vacuum and WAL log or just setting hint bits) a crapton of pages that don't actually need to be modified just to be able to get out of the wraparound situation. And that the overhead of writing out all those dirty pages + WAL logging causes the VACUUM to take unacceptably long. E.g. because your storage is cloud storage with a few ms of latency, and the ringbuffer + wal_buffer sizes cause so many synchronous writes that you end up with < 10MB/s of data being processed. I think there's also a clear danger in having "cliffs" where the behaviour changes abruptly once a certain threshold is reached. It's not unlikely for systems to fall over entirely when a) autovacuum cost limiting is disabled. E.g. reaching your disk iops/throughput quota and barely being able to log into postgres anymore to kill the stuck connection causing the wraparound issue. b) No index cleanup happens anymore. E.g. a workload with a lot of bitmap index scans (which do not support killtuples) could end up a lot worse because index pointers to dead tuples aren't being cleaned up. 
In cases where an old transaction or leftover replication slot is causing the problem (together, a significant percentage of wraparound situations) this situation will persist across repeated (explicit or automatic) vacuums for a table, because relfrozenxid won't actually be advanced. And this in turn might actually end up slowing resolution of the wraparound issue more than doing all the index scans. Because this is a hard cliff rather than something phasing in, it's not really possible for a user to see this slowly getting worse and address the issue. Especially for a) this could be addressed by not turning off cost limiting at once, but instead having it decrease the closer you get to some limit. Greetings, Andres Freund
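One way to picture the "phase in rather than cliff" idea for point a): scale the configured cost delay down as age(relfrozenxid) approaches a hard limit, instead of switching it off all at once. This is only an illustrative sketch with hypothetical names, not something from the patch series.

    #include <stdint.h>

    /*
     * Hypothetical helper: returns the cost delay to use, ramping linearly
     * from the configured value down to zero between start_age and hard_age.
     */
    static double
    scaled_cost_delay(double base_delay_ms, uint64_t table_age,
                      uint64_t start_age, uint64_t hard_age)
    {
        if (table_age <= start_age)
            return base_delay_ms;       /* normal throttling */
        if (table_age >= hard_age)
            return 0.0;                 /* fully unthrottled */

        /* Linear ramp between the two thresholds */
        return base_delay_ms *
               ((double) (hard_age - table_age) / (double) (hard_age - start_age));
    }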
Re: New IndexAM API controlling index vacuum strategies
Hi, On 2021-04-14 20:08:10 -0400, Robert Haas wrote: > On Wed, Apr 14, 2021 at 5:55 PM Peter Geoghegan wrote: > > On Wed, Apr 14, 2021 at 12:33 PM Andres Freund wrote: > > > I'm getting a bit bothered by the speed at which you're pushing fairly > > > substantial behavioural changes for vacuum. In this case without even a warning > > > that you're about to do so. > > > > To a large degree the failsafe is something that is written in the > > hope that it will never be needed. This is unlike most other things, > > and has its own unique risks. > > > > I think that the proper thing to do is to accept a certain amount of > > risk in this area. The previous status quo was *appalling*, and so it > > seems very unlikely that the failsafe hasn't mostly eliminated a lot > > of risk for users. That factor is not everything, but it should count > > for a lot. The only way that we're going to have total confidence in > > anything like this is through the experience of it mostly working over > > several releases. > > I think this is largely missing the point Andres was making, which is > that you made a significant behavior change after feature freeze > without any real opportunity for discussion. More generally, you've > changed a bunch of other stuff relatively quickly based on input from > a relatively limited number of people. Now, it's fair to say that it's > often hard to get input on things, and sometimes you have to just take > your best shot and hope you're right. But in this particular case, you > didn't even try to get broader participation or buy-in. That's not > good. Yep, that was what I was trying to get at. - Andres
Re: New IndexAM API controlling index vacuum strategies
On Wed, Apr 14, 2021 at 5:08 PM Robert Haas wrote: > I think this is largely missing the point Andres was making, which is > that you made a significant behavior change after feature freeze > without any real opportunity for discussion. I don't believe that it was a significant behavior change, for the reason I gave: the fact of the matter is that it's practically impossible for us to truncate the heap anyway, provided we have already decided to not vacuum (as opposed to prune) heap pages that almost certainly have some LP_DEAD items in them. Note that later heap pages are the most likely to still have some LP_DEAD items once the failsafe triggers, which are precisely the ones that will affect whether or not we can truncate the whole heap. I accept that I could have done better with the messaging. I'll try to avoid repeating that mistake in the future. > More generally, you've > changed a bunch of other stuff relatively quickly based on input from > a relatively limited number of people. Now, it's fair to say that it's > often hard to get input on things, and sometimes you have to just take > your best shot and hope you're right. I agree in general, and I agree that that's what I've done in this instance. It goes without saying, but I'll say it anyway: I accept full responsibility. > But in this particular case, you > didn't even try to get broader participation or buy-in. That's not > good. I will admit to being somewhat burned out by this project. That might have been a factor. -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Wed, Apr 14, 2021 at 5:55 PM Peter Geoghegan wrote: > On Wed, Apr 14, 2021 at 12:33 PM Andres Freund wrote: > > I'm getting a bit bothered by the speed at which you're pushing fairly > > substantial behavioural changes for vacuum. In this case without even a warning > > that you're about to do so. > > To a large degree the failsafe is something that is written in the > hope that it will never be needed. This is unlike most other things, > and has its own unique risks. > > I think that the proper thing to do is to accept a certain amount of > risk in this area. The previous status quo was *appalling*, and so it > seems very unlikely that the failsafe hasn't mostly eliminated a lot > of risk for users. That factor is not everything, but it should count > for a lot. The only way that we're going to have total confidence in > anything like this is through the experience of it mostly working over > several releases. I think this is largely missing the point Andres was making, which is that you made a significant behavior change after feature freeze without any real opportunity for discussion. More generally, you've changed a bunch of other stuff relatively quickly based on input from a relatively limited number of people. Now, it's fair to say that it's often hard to get input on things, and sometimes you have to just take your best shot and hope you're right. But in this particular case, you didn't even try to get broader participation or buy-in. That's not good. -- Robert Haas EDB: http://www.enterprisedb.com
Re: New IndexAM API controlling index vacuum strategies
On Wed, Apr 14, 2021 at 12:33 PM Andres Freund wrote: > I'm getting a bit bothered by the speed at which you're pushing fairly > substantial behavioural changes for vacuum. In this case without even a warning > that you're about to do so. To a large degree the failsafe is something that is written in the hope that it will never be needed. This is unlike most other things, and has its own unique risks. I think that the proper thing to do is to accept a certain amount of risk in this area. The previous status quo was *appalling*, and so it seems very unlikely that the failsafe hasn't mostly eliminated a lot of risk for users. That factor is not everything, but it should count for a lot. The only way that we're going to have total confidence in anything like this is through the experience of it mostly working over several releases. > I don't think it's that blindingly obvious that skipping truncation is > the right thing to do that it doesn't need review. Consider e.g. the > case that you're close to wraparound because you ran out of space for > the amount of WAL VACUUM produces, previously leading to autovacuums > being aborted / the server restarted. The user might then stop regular > activity and try to VACUUM. Skipping the truncation might now make it > harder to actually vacuum all the tables without running out of space. Note that the criterion for whether or not "hastup=false" for a page is slightly different in lazy_scan_prune() -- I added a comment that points this out directly (the fact that it works that way is not new, and might have originally been a happy mistake). Unlike count_nondeletable_pages(), which is used by heap truncation, lazy_scan_prune() is concerned about whether or not it's *likely to be possible* to truncate away the page by the time lazy_truncate_heap() is reached (if it is reached at all). And so it's optimistic about LP_DEAD items that it observes being removed by lazy_vacuum_heap_page() before we get to lazy_truncate_heap(). It's inherently race-prone anyway, so it might as well assume that LP_DEAD items will eventually become LP_UNUSED items later on. It follows that the chances of lazy_truncate_heap() failing to truncate anything when the failsafe has already triggered are exceptionally high -- all the LP_DEAD items are still there, and cannot be safely removed during truncation (for the usual reasons). I just went one step further than that in this recent commit. I didn't point these details out before now because (to me) this is beside the point. Which is that the failsafe is just that -- a failsafe. Anything that adds unnecessary unpredictable delay in reaching the point of advancing relfrozenxid should be avoided. (Besides, the design of should_attempt_truncation() and lazy_truncate_heap() is very far from guaranteeing that truncation will take place at the best of times.) FWIW, my intention is to try to get as much feedback about the failsafe as I possibly can -- it's hard to reason about exceptional events. I'm also happy to further discuss the specifics with you now. -- Peter Geoghegan
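A rough standalone model of the two notions of "hastup" being contrasted here (simplified; the real checks live in lazy_scan_prune() and count_nondeletable_pages()): pruning is optimistic and ignores LP_DEAD items on the assumption that they will have become LP_UNUSED by the time lazy_truncate_heap() runs, while the truncation-time check treats any used line pointer, LP_DEAD included, as blocking truncation.

    #include <stdbool.h>

    /* Simplified line pointer states, mirroring itemid.h */
    typedef enum { LP_UNUSED, LP_NORMAL, LP_REDIRECT, LP_DEAD } LpState;

    /* Pruning-time view: will this page *probably* be truncatable later? */
    static bool
    prune_time_hastup(LpState lp)
    {
        return lp == LP_NORMAL || lp == LP_REDIRECT;    /* LP_DEAD ignored */
    }

    /* Truncation-time view: can the page be thrown away right now? */
    static bool
    truncate_time_hastup(LpState lp)
    {
        return lp != LP_UNUSED;     /* LP_DEAD still counts as "in use" */
    }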
Re: New IndexAM API controlling index vacuum strategies
Hi, On 2021-04-13 12:59:03 -0700, Peter Geoghegan wrote: > I agree. Bypassing heap truncation is exactly the kind of thing that > risks adding significant, unpredictable delay at a time when we need > to advance relfrozenxid as quickly as possible. > > I pushed a trivial commit that makes the failsafe bypass heap > truncation as well just now. I'm getting a bit bothered by the speed at which you're pushing fairly substantial behavioural changes for vacuum. In this case without even a warning that you're about to do so. I don't think it's that blindingly obvious that skipping truncation is the right thing to do that it doesn't need review. Consider e.g. the case that you're close to wraparound because you ran out of space for the amount of WAL VACUUM produces, previously leading to autovacuums being aborted / the server restarted. The user might then stop regular activity and try to VACUUM. Skipping the truncation might now make it harder to actually vacuum all the tables without running out of space. FWIW, I also don't like that substantial behaviour changes to how vacuum works were discussed only in a thread titled "New IndexAM API controlling index vacuum strategies". Greetings, Andres Freund
Re: New IndexAM API controlling index vacuum strategies
On Wed, Apr 14, 2021 at 4:59 AM Peter Geoghegan wrote: > > On Mon, Apr 12, 2021 at 11:05 PM Masahiko Sawada > wrote: > > I realized that when the failsafe is triggered, we don't bypass heap > > truncation that is performed before updating relfrozenxid. I think > > it's better to bypass it too. What do you think? > > I agree. Bypassing heap truncation is exactly the kind of thing that > risks adding significant, unpredictable delay at a time when we need > to advance relfrozenxid as quickly as possible. > > I pushed a trivial commit that makes the failsafe bypass heap > truncation as well just now. Great, thanks! Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: New IndexAM API controlling index vacuum strategies
On Mon, Apr 12, 2021 at 11:05 PM Masahiko Sawada wrote: > I realized that when the failsafe is triggered, we don't bypass heap > truncation that is performed before updating relfrozenxid. I think > it's better to bypass it too. What do you think? I agree. Bypassing heap truncation is exactly the kind of thing that risks adding significant, unpredictable delay at a time when we need to advance relfrozenxid as quickly as possible. I pushed a trivial commit that makes the failsafe bypass heap truncation as well just now. Thanks -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Thu, Apr 8, 2021 at 11:42 AM Masahiko Sawada wrote: > > On Thu, Apr 8, 2021 at 8:41 AM Peter Geoghegan wrote: > > > > On Wed, Apr 7, 2021 at 8:52 AM Peter Geoghegan wrote: > > > All of the changes from your fixup patch are clear improvements, and > > > so I'll include them in the final commit. Thanks! I realized that when the failsafe is triggered, we don't bypass heap truncation that is performed before updating relfrozenxid. I think it's better to bypass it too. What do you think? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: New IndexAM API controlling index vacuum strategies
On Thu, Apr 8, 2021 at 8:41 AM Peter Geoghegan wrote: > > On Wed, Apr 7, 2021 at 8:52 AM Peter Geoghegan wrote: > > All of the changes from your fixup patch are clear improvements, and > > so I'll include them in the final commit. Thanks! > > I did change the defaults of the GUCs to 1.6 billion, though. Okay. > > All patches in the patch series have been pushed. Hopefully I will not > be the next person to break the buildfarm today. > > Thanks Masahiko, and everybody else involved! Thank you, too! Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: New IndexAM API controlling index vacuum strategies
On Wed, Apr 7, 2021 at 8:52 AM Peter Geoghegan wrote: > All of the changes from your fixup patch are clear improvements, and > so I'll include them in the final commit. Thanks! I did change the defaults of the GUCs to 1.6 billion, though. All patches in the patch series have been pushed. Hopefully I will not be the next person to break the buildfarm today. Thanks Masahiko, and everybody else involved! -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Wed, Apr 7, 2021 at 12:50 AM Masahiko Sawada wrote: > Thank you for updating the patches! > > I've done the final round of review: All of the changes from your fixup patch are clear improvements, and so I'll include them in the final commit. Thanks! > For the 0001 patch, we call PageTruncateLinePointerArray() only in the > second pass over the heap. I think we should note that the second pass is > called only when we found/made LP_DEAD items on the page. That is, if all > dead tuples have been marked as LP_UNUSED by HOT pruning, the page > would not be processed by the second pass, resulting in not removing > LP_UNUSED items at the end of the line pointer array. So I think we can call it in > this case, i.e., when lpdead_items is 0 and tuples_deleted > 0 in > lazy_scan_prune(). Maybe it would be beneficial to do that, but I haven't done it in the version of the patch that I just pushed. We have run out of time to consider calling PageTruncateLinePointerArray() in more places. I think that the most important thing is that we have *some* protection against line pointer bloat. -- Peter Geoghegan
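For illustration, a standalone model of what truncating the line pointer array means (the real PageTruncateLinePointerArray() operates on an actual heap page and roughly shrinks pd_lower; this sketch only models the idea): trailing LP_UNUSED entries are dropped so the array gets shorter and the space becomes reusable.

    #include <stdio.h>

    enum { LP_UNUSED = 0, LP_NORMAL = 1, LP_REDIRECT = 2, LP_DEAD = 3 };

    /* Drop trailing unused line pointers; return the new item count */
    static int
    truncate_line_pointer_array(const int *lp_state, int nitems)
    {
        while (nitems > 0 && lp_state[nitems - 1] == LP_UNUSED)
            nitems--;
        return nitems;
    }

    int
    main(void)
    {
        /* After the second pass has set trailing LP_DEAD items to LP_UNUSED ... */
        int lps[] = {LP_NORMAL, LP_REDIRECT, LP_UNUSED, LP_UNUSED, LP_UNUSED};

        printf("%d line pointers remain\n",
               truncate_line_pointer_array(lps, 5));    /* prints 2 */
        return 0;
    }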
Re: New IndexAM API controlling index vacuum strategies
On Tue, Apr 6, 2021 at 5:49 AM Peter Geoghegan wrote: > > On Mon, Apr 5, 2021 at 5:09 PM Peter Geoghegan wrote: > > Oh yeah. "static BufferAccessStrategy vac_strategy" is guaranteed to > > be initialized to 0, simply because it's static and global. That > > explains it. > > So do we need to allocate a strategy in workers now, or leave things > as they are/were? > > I'm going to go ahead with pushing my commit to do that now, just to > get the buildfarm green. It's still a bug in Postgres 13, albeit a > less serious one than I first suspected. > I have started a separate thread [1] to fix this in PG-13. [1] - https://www.postgresql.org/message-id/CAA4eK1KbmJgRV2W3BbzRnKUSrukN7SbqBBriC4RDB5KBhopkGQ%40mail.gmail.com -- With Regards, Amit Kapila.
Re: New IndexAM API controlling index vacuum strategies
On Wed, Apr 7, 2021 at 12:16 PM Peter Geoghegan wrote: > > On Tue, Apr 6, 2021 at 7:05 AM Matthias van de Meent > wrote: > > If you have updated patches, I'll try to check them this evening (CEST). > > Here is v11, which is not too different from v10 as far as the > truncation stuff goes. > > Masahiko should take a look at the last patch again. I renamed the > GUCs to reflect the fact that we do everything possible to advance > relfrozenxid in the case where the fail safe mechanism kicks in -- not > just skipping index vacuuming. It also incorporates your most recent > round of feedback. Thank you for updating the patches! I've done the final round of review:

+   /*
+    * Before beginning heap scan, check if it's already necessary to apply
+    * fail safe speedup
+    */
+   should_speedup_failsafe(vacrel);

Maybe we can call it at an earlier point, for example before lazy_space_alloc()? That way, we will not need to enable parallelism if we know it's already an emergency situation.

---

+   msgfmt = _(" %u pages from table (%.2f%% of total) had %lld dead item identifiers removed\n");
+
+   if (vacrel->nindexes == 0 || (vacrel->do_index_vacuuming &&
+                                 vacrel->num_index_scans == 0))
+       appendStringInfo(&buf, _("index scan not needed:"));
+   else if (vacrel->do_index_vacuuming && vacrel->num_index_scans > 0)
+       appendStringInfo(&buf, _("index scan needed:"));
+   else
+   {
+       msgfmt = _(" %u pages from table (%.2f%% of total) have %lld dead item identifiers\n");
+
+       if (!vacrel->do_failsafe_speedup)
+           appendStringInfo(&buf, _("index scan bypassed:"));
+       else
+           appendStringInfo(&buf, _("index scan bypassed due to emergency:"));
+   }
+   appendStringInfo(&buf, msgfmt,
+                    vacrel->lpdead_item_pages,
+                    100.0 * vacrel->lpdead_item_pages / vacrel->rel_pages,
+                    (long long) vacrel->lpdead_items);

I think we can make it cleaner if we check vacrel->do_index_vacuuming first. I've attached the patch that proposes the change for the above points and can be applied on top of the 0002 patch. Please feel free to adopt or reject it.

For the 0001 patch, we call PageTruncateLinePointerArray() only in the second pass over the heap. I think we should note that the second pass is called only when we found/made LP_DEAD items on the page. That is, if all dead tuples have been marked as LP_UNUSED by HOT pruning, the page would not be processed by the second pass, resulting in not removing LP_UNUSED items at the end of the line pointer array. So I think we can call it in this case, i.e., when lpdead_items is 0 and tuples_deleted > 0 in lazy_scan_prune().

Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/ fix_proposal.patch Description: Binary data
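Roughly the shape of that last suggestion (a fragment only, using the variable names from the review; "page" is assumed to be the page being pruned, and the committed patch did not adopt this):

    /*
     * Hypothetical addition to lazy_scan_prune(): if pruning left no
     * LP_DEAD items but did remove tuples, the second heap pass will
     * never visit this page, so truncate the line pointer array here.
     */
    if (lpdead_items == 0 && tuples_deleted > 0)
        PageTruncateLinePointerArray(page);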
Re: New IndexAM API controlling index vacuum strategies
On Tue, Apr 6, 2021 at 7:05 AM Matthias van de Meent wrote: > If you have updated patches, I'll try to check them this evening (CEST). Here is v11, which is not too different from v10 as far as the truncation stuff goes. Masahiko should take a look at the last patch again. I renamed the GUCs to reflect the fact that we do everything possible to advance relfrozenxid in the case where the fail safe mechanism kicks in -- not just skipping index vacuuming. It also incorporates your most recent round of feedback. Thanks -- Peter Geoghegan v11-0001-Truncate-line-pointer-array-during-VACUUM.patch Description: Binary data v11-0002-Bypass-index-vacuuming-in-some-cases.patch Description: Binary data
Re: New IndexAM API controlling index vacuum strategies
On Tue, 6 Apr 2021 at 05:13, Peter Geoghegan wrote: > > On Mon, Apr 5, 2021 at 2:44 PM Matthias van de Meent > wrote: > > This assertion is false, and should be a guarding if-statement. HOT > > redirect pointers are not updated if the tuple they're pointing to is > > vacuumed (i.e. when it was never committed) so this nextoffnum might > > in a correctly working system point past maxoff. > > I will need to go through this in detail soon. > > > > Line pointer truncation doesn't happen during pruning, as it did in > > > Matthias' original patch. In this revised version, line pointer > > > truncation occurs during the second phase of VACUUM. There are several > > > reasons to prefer this approach. It seems both safer and more useful > > > that way (compared to the approach of doing line pointer truncation > > > during pruning). It also makes intuitive sense to do it this way, at > > > least to me -- the second pass over the heap is supposed to be for > > > "freeing" LP_DEAD line pointers. > > > > Good catch for running a line pointer truncating pass at the second > > pass over the heap in VACUUM, but I believe that it is also very > > useful for pruning. Line pointer bloat due to excessive HOT chains > > cannot be undone until the 2nd run of VACUUM happens with this patch, > > which is a missed chance for all non-vacuum pruning. > > Maybe - I have my doubts about it having much value outside of the > more extreme cases. But let's assume that I'm wrong about that, for > the sake of argument. > > The current plan is to no longer require a super-exclusive lock inside > lazy_vacuum_heap_page(), which means that we can no longer safely call > PageRepairFragmentation() at that point. This will mean that > PageRepairFragmentation() is 100% owned by pruning. And so the > question of whether or not line pointer truncation should also happen > in PageRepairFragmentation() to cover pruning is (or will be) a > totally separate question to the question of how > lazy_vacuum_heap_page() does it. Nothing stops you from independently > pursuing that as a project for Postgres 15. Ah, then I misunderstood your intentions when you mentioned including a modified version of my patch. In which case, I agree that improving HOT pruning is indeed out of scope. > > What difference is there between opportunistically pruned HOT line > > pointers, and VACUUMed line pointers? > > The fact that they are physically identical to each other isn't > everything. The "life cycle" of an affected page is crucially > important. > > I find that there is a lot of value in thinking about how things look > at the page level moment to moment, and even over hours and days. > Usually with a sample workload and table in mind. I already mentioned > the new_order table from TPC-C, which is characterized by continual > churn from more-or-less even amounts of range deletes and bulk inserts > over time. That seems to be the kind of workload where you see big > problems with line pointer bloat. Because there is constant churn of > unrelated logical rows (it's not a bunch of UPDATEs). > > It's possible for very small effects to aggregate into large and > significant effects -- I know this from my experience with indexing. > Plus the FSM is probably not very smart about fragmentation, which > makes it even more complicated. And so it's easy to be wrong if you > predict that some seemingly insignificant extra intervention couldn't > possibly help. For that reason, I don't want to predict that you're > wrong now. 
It's just a question of time, and of priorities. > > > Truncating during pruning has > > the benefit of keeping the LP array short where possible, and seeing > > that truncating the LP array allows for more applied > > PD_HAS_FREE_LINES-optimization, I fail to see why you wouldn't want to > > truncate the LP array whenever clearing up space. > > Truncating the line pointer array is not an intrinsic good. I hesitate > to do it during pruning in the absence of clear evidence that it's > independently useful. Pruning is a very performance sensitive > operation. Much more so than VACUUM's second heap pass. > > > Other than those questions, some comments on the other patches: > > > > 0002: > > > +appendStringInfo(&buf, _("There were %lld dead item identifiers.\n"), > > > + (long long) vacrel->lpdead_item_pages); > > > > I presume this should use vacrel->lpdead_items?. > > It should have been, but as it happens I have decided to not do this > at all in 0002-*. Better to not report on LP_UNUSED *or* LP_DEAD items > at this point of VACUUM VERBOSE output. > > > 0003: > > > + * ... Aborted transactions > > > + * have tuples that we can treat as DEAD without caring about where there > > > + * tuple header XIDs ... > > > > This should be '... where their tuple header XIDs...' > > Will fix. > > > > +retry: > > > + > > > ... > > > +res = HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf); > > > + > > > +if (unlikely(res
Re: New IndexAM API controlling index vacuum strategies
On Mon, Apr 5, 2021 at 2:44 PM Matthias van de Meent wrote: > This assertion is false, and should be a guarding if-statement. HOT > redirect pointers are not updated if the tuple they're pointing to is > vacuumed (i.e. when it was never committed) so this nextoffnum might > in a correctly working system point past maxoff. I will need to go through this in detail soon. > > Line pointer truncation doesn't happen during pruning, as it did in > > Matthias' original patch. In this revised version, line pointer > > truncation occurs during the second phase of VACUUM. There are several > > reasons to prefer this approach. It seems both safer and more useful > > that way (compared to the approach of doing line pointer truncation > > during pruning). It also makes intuitive sense to do it this way, at > > least to me -- the second pass over the heap is supposed to be for > > "freeing" LP_DEAD line pointers. > > Good catch for running a line pointer truncating pass at the second > pass over the heap in VACUUM, but I believe that it is also very > useful for pruning. Line pointer bloat due to excessive HOT chains > cannot be undone until the 2nd run of VACUUM happens with this patch, > which is a missed chance for all non-vacuum pruning. Maybe - I have my doubts about it having much value outside of the more extreme cases. But let's assume that I'm wrong about that, for the sake of argument. The current plan is to no longer require a super-exclusive lock inside lazy_vacuum_heap_page(), which means that we can no longer safely call PageRepairFragmentation() at that point. This will mean that PageRepairFragmentation() is 100% owned by pruning. And so the question of whether or not line pointer truncation should also happen in PageRepairFragmentation() to cover pruning is (or will be) a totally separate question to the question of how lazy_vacuum_heap_page() does it. Nothing stops you from independently pursuing that as a project for Postgres 15. > What difference is there between opportunistically pruned HOT line > pointers, and VACUUMed line pointers? The fact that they are physically identical to each other isn't everything. The "life cycle" of an affected page is crucially important. I find that there is a lot of value in thinking about how things look at the page level moment to moment, and even over hours and days. Usually with a sample workload and table in mind. I already mentioned the new_order table from TPC-C, which is characterized by continual churn from more-or-less even amounts of range deletes and bulk inserts over time. That seems to be the kind of workload where you see big problems with line pointer bloat. Because there is constant churn of unrelated logical rows (it's not a bunch of UPDATEs). It's possible for very small effects to aggregate into large and significant effects -- I know this from my experience with indexing. Plus the FSM is probably not very smart about fragmentation, which makes it even more complicated. And so it's easy to be wrong if you predict that some seemingly insignificant extra intervention couldn't possibly help. For that reason, I don't want to predict that you're wrong now. It's just a question of time, and of priorities. > Truncating during pruning has > the benefit of keeping the LP array short where possible, and seeing > that truncating the LP array allows for more applied > PD_HAS_FREE_LINES-optimization, I fail to see why you wouldn't want to > truncate the LP array whenever clearing up space. 
Truncating the line pointer array is not an intrinsic good. I hesitate to do it during pruning in the absence of clear evidence that it's independently useful. Pruning is a very performance sensitive operation. Much more so than VACUUM's second heap pass. > Other than those questions, some comments on the other patches: > > 0002: > > +appendStringInfo(&buf, _("There were %lld dead item identifiers.\n"), > > + (long long) vacrel->lpdead_item_pages); > > I presume this should use vacrel->lpdead_items?. It should have been, but as it happens I have decided to not do this at all in 0002-*. Better to not report on LP_UNUSED *or* LP_DEAD items at this point of VACUUM VERBOSE output. > 0003: > > + * ... Aborted transactions > > + * have tuples that we can treat as DEAD without caring about where there > > + * tuple header XIDs ... > > This should be '... where their tuple header XIDs...' Will fix. > > +retry: > > + > > ... > > +res = HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf); > > + > > +if (unlikely(res == HEAPTUPLE_DEAD)) > > +goto retry; > > In this unlikely case, you reset the tuples_deleted value that was > received earlier from heap_page_prune. This results in inaccurate > statistics, as repeated calls to heap_page_prune on the same page will > not count tuples that were deleted in a previous call. I don't think that it matters. The "tupgone=true" case has no test coverage (see coverage.postgre
Re: New IndexAM API controlling index vacuum strategies
Hi, On 2021-04-05 17:18:37 -0700, Peter Geoghegan wrote: > On Mon, Apr 5, 2021 at 5:09 PM Peter Geoghegan wrote: > > Oh yeah. "static BufferAccessStrategy vac_strategy" is guaranteed to > > be initialized to 0, simply because it's static and global. That > > explains it. > > So do we need to allocate a strategy in workers now, or leave things > as they are/were? > I'm going to go ahead with pushing my commit to do that now, just to > get the buildfarm green. It's still a bug in Postgres 13, albeit a > less serious one than I first suspected. Feels like a v13 bug to me, one that should be fixed. Greetings, Andres Freund
Re: New IndexAM API controlling index vacuum strategies
On Mon, Apr 5, 2021 at 5:09 PM Peter Geoghegan wrote: > Oh yeah. "static BufferAccessStrategy vac_strategy" is guaranteed to > be initialized to 0, simply because it's static and global. That > explains it. So do we need to allocate a strategy in workers now, or leave things as they are/were? I'm going to go ahead with pushing my commit to do that now, just to get the buildfarm green. It's still a bug in Postgres 13, albeit a less serious one than I first suspected. -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Mon, Apr 5, 2021 at 5:00 PM Andres Freund wrote: > What do you mean with "appear to work"? Isn't, in 13, the only > consequence of vac_strategy not being "propagated" that we'll not use a > strategy in parallel workers? Presumably that was hard to notice > because most people don't run manual VACUUM with cost limits turned > on. And autovacuum doesn't use parallelism. Oh yeah. "static BufferAccessStrategy vac_strategy" is guaranteed to be initialized to 0, simply because it's static and global. That explains it. -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
Hi, On 2021-04-05 16:53:58 -0700, Peter Geoghegan wrote: > REL_13_STABLE will need to be considered separately. I still haven't > figured out how this ever appeared to work for this long. The > vac_strategy/bstrategy state simply wasn't propagated at all. What do you mean with "appear to work"? Isn't, in 13, the only consequence of vac_strategy not being "propagated" that we'll not use a strategy in parallel workers? Presumably that was hard to notice because most people don't run manual VACUUM with cost limits turned on. And autovacuum doesn't use parallelism. Greetings, Andres Freund
Re: New IndexAM API controlling index vacuum strategies
On Mon, Apr 5, 2021 at 4:29 PM Tom Lane wrote: > Peter Geoghegan writes: > > Do you think that it's okay that we rely on the propagation of global > > state to parallel workers on Postgres 13? Don't we need something like > > my fixup commit 49f49def on Postgres 13 as well? At least for the > > EXEC_BACKEND case, I think. > > Uh ... *what* propagation of global state to parallel workers? Workers > fork off from the postmaster, not from their leader process. > > (I note that morepork is still failing. The other ones didn't report > in yet.) Evidently my fixup commit 49f49def was written in way too much of a panic. I'm going to push a new fix shortly. This will make workers do their own GetAccessStrategy(BAS_VACUUM), just to get the buildfarm green. REL_13_STABLE will need to be considered separately. I still haven't figured out how this ever appeared to work for this long. The vac_strategy/bstrategy state simply wasn't propagated at all. -- Peter Geoghegan
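The rough shape of that fix (a sketch against PostgreSQL's headers, not the actual commit, and the initialization function name is hypothetical): each parallel worker allocates its own BAS_VACUUM ring instead of expecting the leader's file-level static to be carried over.

    #include "postgres.h"
    #include "storage/bufmgr.h"

    static BufferAccessStrategy vac_strategy;   /* NULL in a freshly started worker */

    /* Hypothetical worker-side initialization */
    static void
    vacuum_worker_init(void)
    {
        /* Workers fork from the postmaster, so the leader's value never arrives */
        if (vac_strategy == NULL)
            vac_strategy = GetAccessStrategy(BAS_VACUUM);
    }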
Re: New IndexAM API controlling index vacuum strategies
On Tue, Apr 6, 2021 at 8:29 AM Tom Lane wrote: > > Peter Geoghegan writes: > > Do you think that it's okay that we rely on the propagation of global > > state to parallel workers on Postgres 13? Don't we need something like > > my fixup commit 49f49def on Postgres 13 as well? At least for the > > EXEC_BACKEND case, I think. > > Uh ... *what* propagation of global state to parallel workers? Workers > fork off from the postmaster, not from their leader process. Right. I think we should apply that fix on PG13 as well. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: New IndexAM API controlling index vacuum strategies
Peter Geoghegan writes: > Do you think that it's okay that we rely on the propagation of global > state to parallel workers on Postgres 13? Don't we need something like > my fixup commit 49f49def on Postgres 13 as well? At least for the > EXEC_BACKEND case, I think. Uh ... *what* propagation of global state to parallel workers? Workers fork off from the postmaster, not from their leader process. (I note that morepork is still failing. The other ones didn't report in yet.) regards, tom lane
Re: New IndexAM API controlling index vacuum strategies
On Mon, Apr 5, 2021 at 4:30 AM Masahiko Sawada wrote: > Did you try the change around parallel_process_one_index() that I > suggested in the previous reply[1]? If we don't change the logic, we > need to update the above comment. Previously, we update stats[idx] in > vacuum_one_index() (renamed to parallel_process_one_index()) but with > your patch, where we update it is its caller. I don't know how I missed it the first time. I agree that it is a lot better that way. I did it that way in the version of the patch that I pushed just now. Thanks! Do you think that it's okay that we rely on the propagation of global state to parallel workers on Postgres 13? Don't we need something like my fixup commit 49f49def on Postgres 13 as well? At least for the EXEC_BACKEND case, I think. > We removed two Assert(!IsParallelWorker()) at two places. It seems to > me that those assertions are still valid. Do we really need to remove > them? I have restored the assertions in what became the final version. > 0004 patch: > > src/backend/access/heap/heapam.c:638: trailing whitespace. Will fix. > --- > 0005 patch: > > + * Caller is expected to call here before and after vacuuming each index in > + * the case of two-pass VACUUM, or every BYPASS_EMERGENCY_MIN_PAGES blocks in > + * the case of no-indexes/one-pass VACUUM. > > I think it should be "every VACUUM_FSM_EVERY_PAGES blocks" instead of > "every BYPASS_EMERGENCY_MIN_PAGES blocks". Will fix. > +#define BYPASS_EMERGENCY_MIN_PAGES \ > + ((BlockNumber) (((uint64) 4 * 1024 * 1024 * 1024) / BLCKSZ)) > + > > I think we need a description for BYPASS_EMERGENCY_MIN_PAGES. I agree - will fix. > allindexes can be false even if we process all indexes, which is fine > with me because setting allindexes = false disables the subsequent > heap vacuuming. I think it's appropriate behavior in emergency cases. > In that sense, can we do should_speedup_failsafe() check also after > parallel index vacuuming? And we can also check it at the beginning of > lazy vacuum. Those both seem like good ideas. Especially the one about checking right at the start. Now that the patch makes the emergency mechanism not apply a delay (not just skip index vacuuming), having a precheck at the very start makes a lot of sense. This also makes VACUUM hurry in the case where there was a dangerously slow VACUUM that happened to not be aggressive. Such a VACUUM will use the emergency mechanism but won't advance relfrozenxid, because we have to rely on the autovacuum launcher launching an anti-wraparound/aggressive autovacuum immediately afterwards. We want that second anti-wraparound VACUUM to hurry from the very start of lazy_scan_heap(). -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Sun, 4 Apr 2021 at 04:00, Peter Geoghegan wrote: > > On Thu, Apr 1, 2021 at 8:25 PM Peter Geoghegan wrote: > > I've also found a way to further simplify the table-without-indexes > > case: make it behave like a regular two-pass/has-indexes VACUUM with > > regard to visibility map stuff when the page doesn't need a call to > > lazy_vacuum_heap() (because there are no LP_DEAD items to set > > LP_UNUSED on the page following pruning). But when it does call > > lazy_vacuum_heap(), the call takes care of everything for > > lazy_scan_heap(), which just continues to the next page due to > > considering prunestate to have been "invalidated" by the call to > > lazy_vacuum_heap(). So there is absolutely minimal special case code > > for the table-without-indexes case now. > > Attached is v10, which simplifies the one-pass/table-without-indexes > VACUUM as described. Great! > Other changes (some of which are directly related to the > one-pass/table-without-indexes refactoring): > > * The second patch no longer breaks up lazy_scan_heap() into multiple > functions -- we only retain the lazy_scan_prune() function, which is > the one that I find very compelling. > > This addresses Robert's concern about the functions -- I think that > it's much better this way, now that I see it. > > * No more diff churn in the first patch. This was another concern held > by Robert, as well as by Masahiko. > > In general both the first and second patches are much easier to follow now. > > * The emergency mechanism is now able to kick in when we happen to be > doing a one-pass/table-without-indexes VACUUM -- no special > cases/"weasel words" are needed. > > * Renamed "onerel" to "rel" in the first patch, per Robert's suggestion. > > * Fixed various specific issues raised by Masahiko's review, > particularly in the first patch and last patch in the series. > > Finally, there is a new patch added to the series in v10: > > * I now include a modified version of Matthias van de Meent's line > pointer truncation patch [1]. Thanks for notifying. I've noticed that you've based this on v3 of that patch, and consequently has at least one significant bug that I fixed in v5 of that patchset: 0004: > @@ -962,6 +962,7 @@ heap_get_root_tuples(Page page, OffsetNumber > *root_offsets) > */ > for (;;) > { > +Assert(OffsetNumberIsValid(nextoffnum) && nextoffnum <= maxoff); > lp = PageGetItemId(page, nextoffnum); > > /* Check for broken chains */ This assertion is false, and should be a guarding if-statement. HOT redirect pointers are not updated if the tuple they're pointing to is vacuumed (i.e. when it was never committed) so this nextoffnum might in a correctly working system point past maxoff. > Line pointer truncation doesn't happen during pruning, as it did in > Matthias' original patch. In this revised version, line pointer > truncation occurs during the second phase of VACUUM. There are several > reasons to prefer this approach. It seems both safer and more useful > that way (compared to the approach of doing line pointer truncation > during pruning). It also makes intuitive sense to do it this way, at > least to me -- the second pass over the heap is supposed to be for > "freeing" LP_DEAD line pointers. Good catch for running a line pointer truncating pass at the second pass over the heap in VACUUM, but I believe that it is also very useful for pruning. Line pointer bloat due to excessive HOT chains cannot be undone until the 2nd run of VACUUM happens with this patch, which is a missed chance for all non-vacuum pruning. 
> Many workloads rely heavily on opportunistic pruning. With a workload > that benefits a lot from HOT (e.g. pgbench with heap fillfactor > reduced to 90), there are many LP_UNUSED line pointers, even though we > may never have a VACUUM that actually performs a second heap pass > (because LP_DEAD items cannot accumulate in heap pages). Prior to the > HOT commit in 2007, LP_UNUSED line pointers were strictly something > that VACUUM created from dead tuples. It seems to me that we should > only target the latter "category" of LP_UNUSED line pointers when > considering truncating the array -- we ought to leave pruning > (especially opportunistic pruning that takes place outside of VACUUM) > alone. What difference is there between opportunistically pruned HOT line pointers, and VACUUMed line pointers? Truncating during pruning has the benefit of keeping the LP array short where possible, and seeing that truncating the LP array allows for more applied PD_HAS_FREE_LINES-optimization, I fail to see why you wouldn't want to truncate the LP array whenever clearing up space. Other than those questions, some comments on the other patches: 0002: > +appendStringInfo(&buf, _("There were %lld dead item identifiers.\n"), > + (long long) vacrel->lpdead_item_pages); I presume this should use vacrel->lpdead_items? 0003: > + * ... Aborted trans
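The guarding if-statement suggested above for heap_get_root_tuples() might look roughly like the following; the loop context is abbreviated, so treat it as a sketch of the shape of the fix rather than a drop-in replacement:

    for (;;)
    {
        ItemId      lp;

        /*
         * A redirect can legitimately point past maxoff once the tuple it
         * pointed to (never committed) has been pruned away, so bail out
         * rather than asserting.
         */
        if (!OffsetNumberIsValid(nextoffnum) || nextoffnum > maxoff)
            break;

        lp = PageGetItemId(page, nextoffnum);

        /* Check for broken chains */
        if (!ItemIdIsNormal(lp))
            break;

        /* ... follow the HOT chain as before ... */
    }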
Re: New IndexAM API controlling index vacuum strategies
On Sun, Apr 4, 2021 at 11:00 AM Peter Geoghegan wrote: > > On Thu, Apr 1, 2021 at 8:25 PM Peter Geoghegan wrote: > > I've also found a way to further simplify the table-without-indexes > > case: make it behave like a regular two-pass/has-indexes VACUUM with > > regard to visibility map stuff when the page doesn't need a call to > > lazy_vacuum_heap() (because there are no LP_DEAD items to set > > LP_UNUSED on the page following pruning). But when it does call > > lazy_vacuum_heap(), the call takes care of everything for > > lazy_scan_heap(), which just continues to the next page due to > > considering prunestate to have been "invalidated" by the call to > > lazy_vacuum_heap(). So there is absolutely minimal special case code > > for the table-without-indexes case now. > > Attached is v10, which simplifies the one-pass/table-without-indexes > VACUUM as described. > Thank you for updating the patch. > > * I now include a modified version of Matthias van de Meent's line > pointer truncation patch [1]. > > Matthias' patch seems very much in scope here. The broader patch > series establishes the principle that we can leave LP_DEAD line > pointers in an unreclaimed state indefinitely, without consequence > (beyond the obvious). We had better avoid line pointer bloat that > cannot be reversed when VACUUM does eventually get around to doing a > second pass over the heap. This is another case where it seems prudent > to keep the costs understandable/linear -- page-level line pointer > bloat seems like a cost that increases in a non-linear fashion, which > undermines the whole idea of modelling when it's okay to skip > index/heap vacuuming. (Also, line pointer bloat sucks.) > > Line pointer truncation doesn't happen during pruning, as it did in > Matthias' original patch. In this revised version, line pointer > truncation occurs during the second phase of VACUUM. There are several > reasons to prefer this approach. It seems both safer and more useful > that way (compared to the approach of doing line pointer truncation > during pruning). It also makes intuitive sense to do it this way, at > least to me -- the second pass over the heap is supposed to be for > "freeing" LP_DEAD line pointers. +1 0002, 0003, and 0004 patches look good to me. 0001 and 0005 also look good to me but I have some trivial review comments on them. 0001 patch: /* -* Now that stats[idx] points to the DSM segment, we don't need the -* locally allocated results. +* Now that top-level indstats[idx] points to the DSM segment, we +* don't need the locally allocated results. */ - pfree(*stats); - *stats = bulkdelete_res; + pfree(istat); + istat = bulkdelete_res; Did you try the change around parallel_process_one_index() that I suggested in the previous reply[1]? If we don't change the logic, we need to update the above comment. Previously, we update stats[idx] in vacuum_one_index() (renamed to parallel_process_one_index()) but with your patch, where we update it is its caller. --- +lazy_vacuum_all_indexes(LVRelState *vacrel) { - Assert(!IsParallelWorker()); - Assert(nindexes > 0); + Assert(vacrel->nindexes > 0); + Assert(TransactionIdIsNormal(vacrel->relfrozenxid)); + Assert(MultiXactIdIsValid(vacrel->relminmxid)); and - Assert(!IsParallelWorker()); - Assert(nindexes > 0); + Assert(vacrel->nindexes > 0); We removed two Assert(!IsParallelWorker()) at two places. It seems to me that those assertions are still valid. Do we really need to remove them? --- 0004 patch: src/backend/access/heap/heapam.c:638: trailing whitespace. 
+ /* I found a whitespace issue. --- 0005 patch: + * Caller is expected to call here before and after vacuuming each index in + * the case of two-pass VACUUM, or every BYPASS_EMERGENCY_MIN_PAGES blocks in + * the case of no-indexes/one-pass VACUUM. I think it should be "every VACUUM_FSM_EVERY_PAGES blocks" instead of "every BYPASS_EMERGENCY_MIN_PAGES blocks". --- +/* + * Threshold that controls whether we bypass index vacuuming and heap + * vacuuming. When we're under the threshold they're deemed unnecessary. + * BYPASS_THRESHOLD_PAGES is applied as a multiplier on the table's rel_pages + * for those pages known to contain one or more LP_DEAD items. + */ +#define BYPASS_THRESHOLD_PAGES 0.02/* i.e. 2% of rel_pages */ + +#define BYPASS_EMERGENCY_MIN_PAGES \ + ((BlockNumber) (((uint64) 4 * 1024 * 1024 * 1024) / BLCKSZ)) + I think we need a description for BYPASS_EMERGENCY_MIN_PAGES. --- for (int idx = 0; idx < vacrel->nindexes; idx++) { Relationindrel = vacrel->indrels[idx]; IndexBulkDeleteResult *istat = vacrel->indstats[idx]; vacrel->indstats[idx] = lazy_vacuum_one_index(indrel, istat, vacrel->old_live_tuples,
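For a feel for the two thresholds quoted above: with the default 8kB block size, BYPASS_EMERGENCY_MIN_PAGES works out to 4GB worth of heap pages (524288 blocks), and BYPASS_THRESHOLD_PAGES to 2% of rel_pages. A small self-contained sketch of the arithmetic follows; the bypass decision shown is only an illustration of the idea, not the patch's exact test:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BLCKSZ 8192                     /* default PostgreSQL block size */
    #define BYPASS_THRESHOLD_PAGES 0.02     /* i.e. 2% of rel_pages */
    #define BYPASS_EMERGENCY_MIN_PAGES \
        ((uint32_t) (((uint64_t) 4 * 1024 * 1024 * 1024) / BLCKSZ))

    int
    main(void)
    {
        uint32_t    rel_pages = 1000000;        /* example: ~7.6GB table */
        uint32_t    lpdead_item_pages = 15000;  /* pages with >= 1 LP_DEAD item */
        bool        bypass;

        /* 4GB / 8kB = 524288 blocks */
        printf("BYPASS_EMERGENCY_MIN_PAGES: %u blocks\n", BYPASS_EMERGENCY_MIN_PAGES);

        /* Skip index/heap vacuuming when very few pages have LP_DEAD items */
        bypass = lpdead_item_pages < (uint32_t) (rel_pages * BYPASS_THRESHOLD_PAGES);
        printf("bypass index vacuuming: %s\n", bypass ? "yes" : "no");

        return 0;
    }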
Re: New IndexAM API controlling index vacuum strategies
On Thu, Apr 1, 2021 at 6:14 AM Robert Haas wrote: > Without offering an opinion on this particular implementation choice, > +1 for the idea of trying to make the table-with-indexes and the > table-without-indexes cases work in ways that will feel similar to the > user. Tables without indexes are probably rare in practice, but if > some behaviors are implemented for one case and not the other, it will > probably be confusing. One thought here is that it might help to try > to write documentation for whatever behavior you choose. If it's hard > to document without weasel-words, maybe it's not the best approach. I have found a way to do this that isn't too painful, a little like the VACUUM_FSM_EVERY_PAGES thing. I've also found a way to further simplify the table-without-indexes case: make it behave like a regular two-pass/has-indexes VACUUM with regard to visibility map stuff when the page doesn't need a call to lazy_vacuum_heap() (because there are no LP_DEAD items to set LP_UNUSED on the page following pruning). But when it does call lazy_vacuum_heap(), the call takes care of everything for lazy_scan_heap(), which just continues to the next page due to considering prunestate to have been "invalidated" by the call to lazy_vacuum_heap(). So there is absolutely minimal special case code for the table-without-indexes case now. BTW I removed all of the lazy_scan_heap() utility functions from the second patch in my working copy of the patch series. You were right about that -- they weren't useful. We should just have the pruning wrapper function I've called lazy_scan_prune(), not any of the others. We only need one local variable in lazy_scan_heap() that isn't either the prune state set/returned by lazy_scan_prune(), or generic stuff like a Buffer variable. -- Peter Geoghegan
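A rough sketch of the per-page flow described here for the no-indexes case; the helper names and prunestate fields follow the discussion above, but the details are assumptions rather than the final code:

    /* Inside lazy_scan_heap()'s per-page loop, just after lazy_scan_prune() */
    if (vacrel->nindexes == 0 && prunestate.has_lpdead_items)
    {
        /*
         * One-pass (table has no indexes): set the page's LP_DEAD items to
         * LP_UNUSED right away.  This "invalidates" prunestate, so simply
         * record the free space and move on to the next page.
         */
        lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
        UnlockReleaseBuffer(buf);
        RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
        continue;
    }

    /*
     * Otherwise (two-pass VACUUM, or a no-indexes page with no LP_DEAD
     * items), handle the visibility map exactly as in the regular case,
     * driven by what lazy_scan_prune() reported.
     */
    if (prunestate.all_visible)
        visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
                          vmbuffer, prunestate.visibility_cutoff_xid,
                          prunestate.all_frozen
                          ? VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN
                          : VISIBILITYMAP_ALL_VISIBLE);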
Re: New IndexAM API controlling index vacuum strategies
On Wed, Mar 31, 2021 at 9:44 PM Masahiko Sawada wrote: > > But let me think about it...I suppose we could do it when one-pass > > VACUUM considers vacuuming a range of FSM pages every > > VACUUM_FSM_EVERY_PAGES. That's kind of similar to index vacuuming, in > > a way -- it wouldn't be too bad to check for emergencies in the same > > way there. > > Yeah, I also thought that would be a good place to check for > emergencies. That sounds reasonable. Without offering an opinion on this particular implementation choice, +1 for the idea of trying to make the table-with-indexes and the table-without-indexes cases work in ways that will feel similar to the user. Tables without indexes are probably rare in practice, but if some behaviors are implemented for one case and not the other, it will probably be confusing. One thought here is that it might help to try to write documentation for whatever behavior you choose. If it's hard to document without weasel-words, maybe it's not the best approach. -- Robert Haas EDB: http://www.enterprisedb.com
Re: New IndexAM API controlling index vacuum strategies
On Thu, Apr 1, 2021 at 9:58 AM Peter Geoghegan wrote: > > On Wed, Mar 31, 2021 at 4:45 AM Masahiko Sawada wrote: > > Both 0001 and 0002 patch refactors the whole lazy vacuum code. Can we > > merge them? I basically agree with the refactoring made by 0001 patch > > but I'm concerned a bit that having such a large refactoring at very > > close to feature freeze could be risky. We would need more eyes to > > review during stabilization. > > I think that Robert makes some related points about how we might cut > scope here. So I'll definitely do some of that, maybe all of it. > > > I think it's more clear to use this macro. The macro can be like this: > > > > ParallelVacuumIsActive(vacrel) (((LVRelState) vacrel)->lps != NULL) > > Yes, that might be better. I'll consider it when I get back to the > patch tomorrow. > > > + * LVDeadTuples stores TIDs that are gathered during pruning/the initial > > heap > > + * scan. These get deleted from indexes during index vacuuming. They're > > then > > + * removed from the heap during a second heap pass that performs heap > > + * vacuuming. > > */ > > > > The second sentence of the removed lines still seems to be useful > > information for readers? > > I don't think that the stuff about shared memory was useful, really. > If we say something like this then it should be about the LVRelState > pointer, not the struct. Understood. > > --- > > + /* Stop applying cost limits from this point on */ > > + VacuumCostActive = false; > > + VacuumCostBalance = 0; > > + } > > > > I agree with the idea of disabling vacuum delay in emergency cases. > > But why do we do that only in the case of the table with indexes? I > > think this optimization is helpful even in the table with no indexes. > > We can check the XID wraparound emergency by calling > > vacuum_xid_limit_emergency() at some point to disable vacuum delay? > > Hmm. I see your point, but at the same time I think that the risk is > lower on a table that has no indexes. It may be true that index > vacuuming doesn't necessarily take the majority of all of the work in > lots of cases. But I think that it is true that it does when things > get very bad -- one-pass/no indexes VACUUM does not care about > maintenance_work_mem, etc. Agreed. > > But let me think about it...I suppose we could do it when one-pass > VACUUM considers vacuuming a range of FSM pages every > VACUUM_FSM_EVERY_PAGES. That's kind of similar to index vacuuming, in > a way -- it wouldn't be too bad to check for emergencies in the same > way there. Yeah, I also thought that would be a good place to check for emergencies. That sounds reasonable. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: New IndexAM API controlling index vacuum strategies
On Wed, Mar 31, 2021 at 4:45 AM Masahiko Sawada wrote: > Both 0001 and 0002 patch refactors the whole lazy vacuum code. Can we > merge them? I basically agree with the refactoring made by 0001 patch > but I'm concerned a bit that having such a large refactoring at very > close to feature freeze could be risky. We would need more eyes to > review during stabilization. I think that Robert makes some related points about how we might cut scope here. So I'll definitely do some of that, maybe all of it. > I think it's more clear to use this macro. The macro can be like this: > > ParallelVacuumIsActive(vacrel) (((LVRelState) vacrel)->lps != NULL) Yes, that might be better. I'll consider it when I get back to the patch tomorrow. > + * LVDeadTuples stores TIDs that are gathered during pruning/the initial heap > + * scan. These get deleted from indexes during index vacuuming. They're > then > + * removed from the heap during a second heap pass that performs heap > + * vacuuming. > */ > > The second sentence of the removed lines still seems to be useful > information for readers? I don't think that the stuff about shared memory was useful, really. If we say something like this then it should be about the LVRelState pointer, not the struct. > -* We do not process them because it's > a very rare condition, > -* and the next vacuum will process them > anyway. > > Maybe the above comments should not be removed by 0001 patch. Right. > Looking at the comments, I thought that this function also frees > palloc'd dead tuple space but it doesn't. It seems to more clear that > doing pfree(vacrel->dead_tuples) here or not creating > lazy_space_free(). I'll need to think about this some more. > --- > + if (shared_istat) > + { > + /* Get the space for IndexBulkDeleteResult */ > + bulkdelete_res = &(shared_istat->istat); > + > + /* > +* Update the pointer to the corresponding > bulk-deletion result if > +* someone has already updated it. > +*/ > + if (shared_istat->updated && istat == NULL) > + istat = bulkdelete_res; > + } > > (snip) > > + if (shared_istat && !shared_istat->updated && istat != NULL) > + { > + memcpy(bulkdelete_res, istat, sizeof(IndexBulkDeleteResult)); > + shared_istat->updated = true; > + > + /* > +* Now that top-level indstats[idx] points to the DSM > segment, we > +* don't need the locally allocated results. > +*/ > + pfree(istat); > + istat = bulkdelete_res; > + } > + > + return istat; > > If we have parallel_process_one_index() return the address of > IndexBulkDeleteResult, we can simplify the first part above. Also, it > seems better to use a separate variable from istat to store the > result. How about the following structure? I'll try it that way and see how it goes. > + /* This won't have changed: */ > + Assert(savefreespace && freespace == PageGetHeapFreeSpace(page)); > > This assertion can be false because freespace can be 0 if the page's > PD_HAS_FREE_LINES hint can wrong. Since lazy_vacuum_heap_page() fixes > it, PageGetHeapFreeSpace(page) in the assertion returns non-zero > value. Good catch, I'll fix it. > The first vacrel->relname should be vacrel->relnamespace. Will fix. > I think we can use errmsg_plural() for "X index scans" part. Yeah, I think that that would be more consistent. > We should use vacrel->lpdead_item_pages instead of vacrel->rel_pages Will fix. I was mostly focussed on the log_autovacuum version, which is why it looks nice already. 
> --- > + /* Stop applying cost limits from this point on */ > + VacuumCostActive = false; > + VacuumCostBalance = 0; > + } > > I agree with the idea of disabling vacuum delay in emergency cases. > But why do we do that only in the case of the table with indexes? I > think this optimization is helpful even in the table with no indexes. > We can check the XID wraparound emergency by calling > vacuum_xid_limit_emergency() at some point to disable vacuum delay? Hmm. I see your point, but at the same time I think that the risk is lower on a table that has no indexes. It may be true that index vacuuming doesn't necessarily take the majority of all of the work in lots of cases. But I think that it is true that it does when things get very bad -- one-pass/no indexes VACUUM does not care about maintenance_work_mem, etc. But let me think about it...I suppose we could do it when one-pass VACUUM considers vacuuming a range of FSM pages every VACUUM_FSM_EVERY_PAGES. That's kind of similar to index vacuuming, in a way -- it wouldn't be too bad to check for emergencies in the same way there. > Both vacrel->do_index_vacuuming
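To show the quoted hunk in context, here is roughly what triggering the emergency mechanism might do. The check function name vacuum_xid_limit_emergency() is the one used in this discussion; the LVRelState field names other than do_index_vacuuming are assumptions:

    /*
     * Called at the points discussed above: at the start of the heap scan,
     * around index vacuuming, and periodically in the one-pass case.
     */
    static bool
    check_wraparound_emergency(LVRelState *vacrel)
    {
        if (!vacuum_xid_limit_emergency(vacrel->relfrozenxid,
                                        vacrel->relminmxid))
            return false;

        /* Emergency: skip any further index vacuuming and heap vacuuming */
        vacrel->do_index_vacuuming = false;
        vacrel->do_index_cleanup = false;

        /* Stop applying cost limits from this point on */
        VacuumCostActive = false;
        VacuumCostBalance = 0;

        return true;
    }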
Re: New IndexAM API controlling index vacuum strategies
On Wed, Mar 31, 2021 at 9:29 AM Robert Haas wrote: > I can't effectively review 0001 because it both changes the code for > individual functions significantly and reorders them within the file. > I think it needs to be separated into two patches, one of which makes > the changes and the other of which reorders stuff. I would probably > vote for just dropping the second one, since I'm not sure there's > really enough value there to justify the code churn, but if we're > going to do it, I think it should definitely be done separately. Thanks for the review! I'll split it up that way. I think that I need to see it both ways before deciding if I should push back on that. I will admit that I was a bit zealous in rearranging things because it seems long overdue. But I might well have gone too far with rearranging code. > * "onerel" is a stupid naming convention that I'd rather not propagate > further. It makes sense in the context of a function whose job it is > to iterate over a list of relations and do something for each one. But > once you're down into the code that only knows about one relation in > the first place, calling that relation "onerel" rather than "rel" or > "vacrel" or even "heaprel" is just confusing. Ditto for "onerelid". I agree, and can change it. Though at the cost of more diff churn. > * Moving stuff from static variables into LVRelState seems like a > great idea. Renaming it from LVRelStats seems like a good idea, too. The static variables were bad, but nowhere near as bad as the variables that are local to lazy_scan_heap(). They are currently a gigantic mess. Not that LVRelStats was much better. We have the latestRemovedXid field in LVRelStats, and dead_tuples, but *don't* have a bunch of things that really are stats (it seems to be quite random). Calling the struct LVRelStats was always dubious. > * Setting vacrel->lps = NULL "for now" when we already did palloc0 at > allocation time seems counterproductive. Okay, will fix. > * The code associated with the comment block that says "Initialize > state for a parallel vacuum" has been moved inside lazy_space_alloc(). > That doesn't seem like an especially good choice, because no casual > reader is going to expect a function called lazy_space_alloc() to be > entering parallel mode and so forth as a side effect. Also, the call > to lazy_space_alloc() still has a comment that says "Allocate the > space for dead tuples in case parallel vacuum is not initialized." > even though the ParallelVacuumIsActive() check has been removed and > the function now does a lot more than allocating space. Will fix. > * lazy_scan_heap() removes the comment which begins "Note that > vacrelstats->dead_tuples could have tuples which became dead after > HOT-pruning but are not marked dead yet." But IIUC that special case > is removed by a later patch, not 0001, in which case it is that patch > that should be touching this comment. Will fix. > Regarding 0002: > > * It took me a while to understand why lazy_scan_new_page() and > lazy_scan_empty_page() are named the way they are. I'm not sure > exactly what would be better, so I am not necessarily saying I think > you have to change anything, but for the record I think this naming > sucks. I agree -- it's dreadful. > The reason we have "lazy" in here, AFAIU, is because originally > we only had old-style VACUUM FULL, and that was the good hard-working > VACUUM, and what we now think of as VACUUM was the "lazy" version that > didn't really do the whole job. 
Then we decided it was the > hard-working version that actually sucked and we always wanted to be > lazy (or else rewrite the table). So now we have all of these > functions named "lazy" which are really just functions to do "vacuum". FWIW I always thought that the terminology was lifted from the world of garbage collection. There is a thing called a lazy sweep algorithm. Isn't vacuuming very much like sweeping? There are also mark-sweep garbage collection algorithms that take two passes, one phase variants, etc. In general the world of garbage collection has some ideas that might be worth pilfering for ideas. It's not all that relevant to our world, and a lot of it is totally irrelevant, but there is enough overlap for it to interest me. Though GC is such a vast and complicated world that it's difficult to know where to begin. I own a copy of the book "Garbage Collection: Algorithms for Automatic Dynamic Memory Management". Most of it goes over my head, but I have a feeling that I'll get my $20 worth at some point. > If, for example, we could agree that the whole thing is vacuum and the first > time we touch the heap page that's the strawberry phase and then the > second time we touch it that's the rhubarb phase, then we could have > vacuum_strawberry_page(), vacuum_strawberry_new_page(), > vacuum_rhubarb_phase(), etc. and everything would be a lot clearer, > assuming that you replaced the words "strawberry" and "rhubarb" with > something actually
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 29, 2021 at 12:16 AM Peter Geoghegan wrote: > And now here's v8, which has the following additional cleanup: I can't effectively review 0001 because it both changes the code for individual functions significantly and reorders them within the file. I think it needs to be separated into two patches, one of which makes the changes and the other of which reorders stuff. I would probably vote for just dropping the second one, since I'm not sure there's really enough value there to justify the code churn, but if we're going to do it, I think it should definitely be done separately. Here are a few comments on the parts I was able to understand: * "onerel" is a stupid naming convention that I'd rather not propagate further. It makes sense in the context of a function whose job it is to iterate over a list of relations and do something for each one. But once you're down into the code that only knows about one relation in the first place, calling that relation "onerel" rather than "rel" or "vacrel" or even "heaprel" is just confusing. Ditto for "onerelid". * Moving stuff from static variables into LVRelState seems like a great idea. Renaming it from LVRelStats seems like a good idea, too. * Setting vacrel->lps = NULL "for now" when we already did palloc0 at allocation time seems counterproductive. * The code associated with the comment block that says "Initialize state for a parallel vacuum" has been moved inside lazy_space_alloc(). That doesn't seem like an especially good choice, because no casual reader is going to expect a function called lazy_space_alloc() to be entering parallel mode and so forth as a side effect. Also, the call to lazy_space_alloc() still has a comment that says "Allocate the space for dead tuples in case parallel vacuum is not initialized." even though the ParallelVacuumIsActive() check has been removed and the function now does a lot more than allocating space. * lazy_scan_heap() removes the comment which begins "Note that vacrelstats->dead_tuples could have tuples which became dead after HOT-pruning but are not marked dead yet." But IIUC that special case is removed by a later patch, not 0001, in which case it is that patch that should be touching this comment. Regarding 0002: * It took me a while to understand why lazy_scan_new_page() and lazy_scan_empty_page() are named the way they are. I'm not sure exactly what would be better, so I am not necessarily saying I think you have to change anything, but for the record I think this naming sucks. The reason we have "lazy" in here, AFAIU, is because originally we only had old-style VACUUM FULL, and that was the good hard-working VACUUM, and what we now think of as VACUUM was the "lazy" version that didn't really do the whole job. Then we decided it was the hard-working version that actually sucked and we always wanted to be lazy (or else rewrite the table). So now we have all of these functions named "lazy" which are really just functions to do "vacuum". But, if we just did s/lazy/vacuum/g we'd be in trouble, because we use "vacuum" to mean "part of vacuum." That's actually a pretty insane thing to do, but we like terminological confusion so much that we decided to use the word vacuum not just to refer to one part of vacuum but to two different parts of vacuum. During heap vacuuming, which is the relevant thing here, we call the first part a "scan" and the second part "vacuum," hence lazy_scan_page() and lazy_vacuum_page(). 
For indexes, we can decide to vacuum indexes or cleanup indexes, either of which is part of our overall strategy of trying to do a VACUUM. We need some words here that are not so overloaded. If, for example, we could agree that the whole thing is vacuum and the first time we touch the heap page that's the strawberry phase and then the second time we touch it that's the rhubarb phase, then we could have vacuum_strawberry_page(), vacuum_strawberry_new_page(), vacuum_rhubarb_phase(), etc. and everything would be a lot clearer, assuming that you replaced the words "strawberry" and "rhubarb" with something actually meaningful. But that seems hard. I thought about suggesting that the word for strawberry should be "prune", but it does more than that. I thought about suggesting that either the word for strawberry or the word for rhubarb should be "cleanup," but that's another word that is already confusingly overloaded. So I don't know. * But all that having been said, it's easy to get confused and think that lazy_scan_new_page() is scanning a new page for lazy vacuum, but in fact it's the new-page handler for the scan phase of lazy vacuum, and it doesn't scan anything at all. If there's a way to avoid that kind of confusion, +1 from me. * One possibility is that maybe it's not such a great idea to put this logic in its own function. I'm rather suspicious on principle of functions that are called with a locked or pinned buffer and release the lock or pin before returning. It suggests that the abstraction is
Re: New IndexAM API controlling index vacuum strategies
On Wed, Mar 31, 2021 at 12:01 PM Peter Geoghegan wrote: > > On Sun, Mar 28, 2021 at 9:16 PM Peter Geoghegan wrote: > > And now here's v8, which has the following additional cleanup: > > And here's v9, which has improved commit messages for the first 2 > patches, and many small tweaks within all 4 patches. > > The most interesting change is that lazy_scan_heap() now has a fairly > elaborate assertion that verifies that its idea about whether or not > the page is all_visible and all_frozen is shared by > heap_page_is_all_visible() -- this is a stripped down version of the > logic that now lives in lazy_scan_heap(). It exists so that the second > pass over the heap can set visibility map bits. Thank you for updating the patches. Both 0001 and 0002 patch refactors the whole lazy vacuum code. Can we merge them? I basically agree with the refactoring made by 0001 patch but I'm concerned a bit that having such a large refactoring at very close to feature freeze could be risky. We would need more eyes to review during stabilization. Here are some comments on 0001 patch: -/* - * Macro to check if we are in a parallel vacuum. If true, we are in the - * parallel mode and the DSM segment is initialized. - */ -#define ParallelVacuumIsActive(lps) PointerIsValid(lps) - I think it's more clear to use this macro. The macro can be like this: ParallelVacuumIsActive(vacrel) (((LVRelState) vacrel)->lps != NULL) --- /* - * LVDeadTuples stores the dead tuple TIDs collected during the heap scan. - * This is allocated in the DSM segment in parallel mode and in local memory - * in non-parallel mode. + * LVDeadTuples stores TIDs that are gathered during pruning/the initial heap + * scan. These get deleted from indexes during index vacuuming. They're then + * removed from the heap during a second heap pass that performs heap + * vacuuming. */ The second sentence of the removed lines still seems to be useful information for readers? --- -* -* Note that vacrelstats->dead_tuples could have tuples which -* became dead after HOT-pruning but are not marked dead yet. -* We do not process them because it's a very rare condition, -* and the next vacuum will process them anyway. Maybe the above comments should not be removed by 0001 patch. --- + /* Free resources managed by lazy_space_alloc() */ + lazy_space_free(vacrel); and +/* Free space for dead tuples */ +static void +lazy_space_free(LVRelState *vacrel) +{ + if (!vacrel->lps) + return; + + /* +* End parallel mode before updating index statistics as we cannot write +* during parallel mode. +*/ + end_parallel_vacuum(vacrel); Looking at the comments, I thought that this function also frees palloc'd dead tuple space but it doesn't. It seems to more clear that doing pfree(vacrel->dead_tuples) here or not creating lazy_space_free(). Also, the comment for end_paralle_vacuum() looks not relevant with this function. Maybe we can update to: /* Exit parallel mode and free the parallel context */ --- + if (shared_istat) + { + /* Get the space for IndexBulkDeleteResult */ + bulkdelete_res = &(shared_istat->istat); + + /* +* Update the pointer to the corresponding bulk-deletion result if +* someone has already updated it. +*/ + if (shared_istat->updated && istat == NULL) + istat = bulkdelete_res; + } (snip) + if (shared_istat && !shared_istat->updated && istat != NULL) + { + memcpy(bulkdelete_res, istat, sizeof(IndexBulkDeleteResult)); + shared_istat->updated = true; + + /* +* Now that top-level indstats[idx] points to the DSM segment, we +* don't need the locally allocated results. 
+*/ + pfree(istat); + istat = bulkdelete_res; + } + + return istat; If we have parallel_process_one_index() return the address of IndexBulkDeleteResult, we can simplify the first part above. Also, it seems better to use a separate variable from istat to store the result. How about the following structure? IndexBulkDeleteResult *istat_res; /* * Update the pointer of the corresponding bulk-deletion result if * someone has already updated it. */ if (shared_istat && shared_istat->updated && istat == NULL) istat = shared_istat->istat; /* Do vacuum or cleanup of the index */ if (lvshared->for_cleanup) istat_res = lazy_cleanup_one_index(indrel, istat, ...); else istat_res = lazy_vacuum_one_index(indrel, istat, ...); /* * (snip) */ if (shared_istat && !shared_istat->updated && istat_res != NULL) { memcpy(
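Two side notes on the quoted suggestions. First, as written the ParallelVacuumIsActive() macro casts its argument to the struct type rather than to a pointer; if callers always pass an LVRelState *, no cast is needed at all. Second, having parallel_process_one_index() return the result pointer keeps the DSM bookkeeping in one place. A sketch of both, spelling out the reviewer's proposal rather than the committed code:

    /* Parallel vacuum is active when the LVRelState carries parallel state */
    #define ParallelVacuumIsActive(vacrel) ((vacrel)->lps != NULL)

    static IndexBulkDeleteResult *
    parallel_process_one_index(Relation indrel,
                               IndexBulkDeleteResult *istat,
                               LVShared *lvshared,
                               LVSharedIndStats *shared_istat,
                               LVRelState *vacrel)
    {
        IndexBulkDeleteResult *istat_res;

        /*
         * Point at the bulk-deletion result in the DSM segment if someone
         * has already filled it in.
         */
        if (shared_istat && shared_istat->updated && istat == NULL)
            istat = &shared_istat->istat;

        /* Do vacuum or cleanup of the index */
        if (lvshared->for_cleanup)
            istat_res = lazy_cleanup_one_index(indrel, istat, lvshared->reltuples,
                                               lvshared->estimated_count, vacrel);
        else
            istat_res = lazy_vacuum_one_index(indrel, istat, lvshared->reltuples,
                                              vacrel);

        /*
         * Copy the result into the DSM segment the first time around, and
         * free the locally allocated copy in favor of the shared one.
         */
        if (shared_istat && !shared_istat->updated && istat_res != NULL)
        {
            memcpy(&shared_istat->istat, istat_res, sizeof(IndexBulkDeleteResult));
            shared_istat->updated = true;
            pfree(istat_res);
            istat_res = &shared_istat->istat;
        }

        return istat_res;
    }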
Re: New IndexAM API controlling index vacuum strategies
On Wed, Mar 24, 2021 at 11:44 AM Peter Geoghegan wrote: > > On Tue, Mar 23, 2021 at 4:02 AM Masahiko Sawada wrote: > > Here are review comments on 0003 patch: > > Attached is a new revision, v5. It fixes bit rot caused by recent > changes (your index autovacuum logging stuff). It has also been > cleaned up in response to your recent review comments -- both from > this email, and the other review email that I responded to separately > today. > > > +* If we skip vacuum, we just ignore the collected dead tuples. Note > > that > > +* vacrelstats->dead_tuples could have tuples which became dead after > > +* HOT-pruning but are not marked dead yet. We do not process them > > +* because it's a very rare condition, and the next vacuum will process > > +* them anyway. > > +*/ > > > > The second paragraph is no longer true after removing the 'tupgone' case. > > Fixed. > > > Maybe we can use vacrelstats->num_index_scans instead of > > calledtwopass? When calling to two_pass_strategy() at the end of > > lazy_scan_heap(), if vacrelstats->num_index_scans is 0 it means this > > is the first time call, which is equivalent to calledtwopass = false. > > It's true that when "vacrelstats->num_index_scans > 0" it definitely > can't have been the first call. But how can we distinguish between 1.) > the case where we're being called for the first time, and 2.) the case > where it's the second call, but the first call actually skipped index > vacuuming? When we skip index vacuuming we won't increment > num_index_scans (which seems appropriate to me). In (2) case, I think we skipped index vacuuming in the first call because index_cleanup was disabled (if index_cleanup was not disabled, we didn't skip it because two_pass_strategy() is called with onecall = false). So in the second call, we skip index vacuuming for the same reason. Even with the 0004 patch (skipping index vacuuming in emergency cases), the check of XID wraparound emergency should be done before the !onecall check in two_pass_strategy() since we should skip index vacuuming in an emergency case even in the case where maintenance_work_mem runs out. Therefore, similarly, we will skip index vacuuming also in the second call. That being said, I agree that using ‘calledtwopass’ is much more readable. So I’ll keep it as is. > > For now I have added an assertion that "vacrelstats->num_index_scan == > 0" at the point where we apply skipping indexes as an optimization > (i.e. the point where the patch 0003- mechanism is applied). > > > Perhaps we can make INDEX_CLEANUP option a four-value option: on, off, > > auto, and default? A problem with the above change would be that if > > the user wants to do "auto" mode, they might need to reset > > vacuum_index_cleanup reloption before executing VACUUM command. In > > other words, there is no way in VACUUM command to force "auto" mode. > > So I think we can add "auto" value to INDEX_CLEANUP option and ignore > > the vacuum_index_cleanup reloption if that value is specified. > > I agree that this aspect definitely needs more work. I'll leave it to you to > do this in a separate revision of this new 0003 patch (so no changes here > from me for v5). > > > Are you updating also the 0003 patch? If you're focusing on 0001 and > > 0002 patch, I'll update the 0003 patch along with the fourth patch > > (skipping index vacuum in emergency cases). > > I suggest that you start integrating it with the wraparound emergency > mechanism, which can become patch 0004- of the patch series. You can > manage 0003- and 0004- now. 
You can post revisions of each of those > two independently of my revisions. What do you think? I have included > 0003- for now because you had review comments on it that I worked > through, but you should own that, I think. > > I suppose that you should include the versions of 0001- and 0002- you > worked off of, just for the convenience of others/to keep the CF > tester happy. I don't think that I'm going to make many changes that > will break your patch, except for obvious bit rot that can be fixed > through fairly mechanical rebasing. Agreed. I was just about to post my 0004 patch based on v4 patch series. I'll update 0003 and 0004 patches based on v5 patch series you just posted, and post them including 0001 and 0002 patches. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 22, 2021 at 6:40 AM Masahiko Sawada wrote: > I've looked at this 0001 patch and here are some review comments: > + * Since we might have to prune a second time here, the code is structured to > + * use a local per-page copy of the counters that caller accumulates. We add > + * our per-page counters to the per-VACUUM totals from caller last of all, to > + * avoid double counting. > > Those comments should be a part of 0002 patch? Right -- will fix. > pc.num_tuples is incremented twice. ps->hastup = true is also duplicated. Must have been a mistake when splitting the patch up -- will fix. > --- > In step 7, with the patch, we save the freespace of the page and do > lazy_vacuum_page(). But should it be done in reverse? > How about renaming to vacuum_two_pass_strategy() or something to clear > this function is used to vacuum? Okay. I will rename it to lazy_vacuum_pruned_items(). > vacrelstats->dead_tuples->num_tuples))); > > It seems that the comment needs to be updated. Will fix. > I’ll review the other two patches tomorrow. And I'll respond to your remarks on those (which are already posted now) separately. > I didn't look at the 0002 patch in-depth but the main difference > between those two WAL records is that XLOG_HEAP2_PRUNE has the offset > numbers of unused, redirected, and dead whereas XLOG_HEAP2_VACUUM has > only the offset numbers of unused? That's one difference. Another difference is that there is no latestRemovedXid field. And there is a third difference: we no longer need a super-exclusive lock for heap page vacuuming (not pruning) with this design -- which also means that we cannot defragment the page during heap vacuuming (that's unsafe with only an exclusive lock because it physically relocates tuples with storage that somebody else may have a C pointer to that they expect to stay sane). These differences during original execution of heap page vacuum necessitate inventing a new REDO routine that does things in exactly the same way. To put it another way, heap vacuuming is now very similar to index vacuuming (both are dissimilar to heap pruning). They're very simple, and 100% a matter of freeing space in physical data structures. Clearly that's always something that we can put off if it makes sense to do so. That high level simplicity seems important to me. I always disliked the way the WAL records for vacuumlazy.c worked. Especially the XLOG_HEAP2_CLEANUP_INFO record -- that one is terrible. -- Peter Geoghegan
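For reference, the shape of the two records being contrasted is roughly the following; the layout is paraphrased from the description above, so treat it as a sketch rather than the final heapam_xlog.h definitions:

    /* Pruning: first heap pass (and opportunistic pruning) */
    typedef struct xl_heap_prune
    {
        TransactionId latestRemovedXid;   /* for recovery conflicts */
        uint16        nredirected;
        uint16        ndead;
        /* OFFSET NUMBERS of redirected/dead/unused items follow */
    } xl_heap_prune;

    /*
     * Heap vacuuming: second heap pass only.  No latestRemovedXid, just the
     * LP_DEAD items now being set LP_UNUSED, and its REDO routine -- like
     * original execution -- takes only an exclusive lock and never
     * defragments the page.
     */
    typedef struct xl_heap_vacuum
    {
        uint16        nunused;
        /* OFFSET NUMBERS of the now-unused items follow */
    } xl_heap_vacuum;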
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 22, 2021 at 10:39 PM Masahiko Sawada wrote: > > On Sat, Mar 20, 2021 at 11:05 AM Peter Geoghegan wrote: > > > > On Wed, Mar 17, 2021 at 7:55 PM Peter Geoghegan wrote: > > > > 2. A patch to remove the tupgone case. > > > > Several new and interesting changes here -- see below. > > > > 3. The patch to optimize VACUUM by teaching it to skip index and heap > > vacuuming in certain cases where we only expect a very small benefit. > > I’ll review the other two patches tomorrow. Here are review comments on 0003 patch: + /* +* Check whether or not to do index vacuum and heap vacuum. +* +* We do both index vacuum and heap vacuum if more than +* SKIP_VACUUM_PAGES_RATIO of all heap pages have at least one LP_DEAD +* line pointer. This is normally a case where dead tuples on the heap +* are highly concentrated in relatively few heap blocks, where the +* index's enhanced deletion mechanism that is clever about heap block +* dead tuple concentrations including btree's bottom-up index deletion +* works well. Also, since we can clean only a few heap blocks, it would +* be a less negative impact in terms of visibility map update. +* +* If we skip vacuum, we just ignore the collected dead tuples. Note that +* vacrelstats->dead_tuples could have tuples which became dead after +* HOT-pruning but are not marked dead yet. We do not process them +* because it's a very rare condition, and the next vacuum will process +* them anyway. +*/ The second paragraph is no longer true after removing the 'tupgone' case. --- if (dead_tuples->num_tuples > 0) two_pass_strategy(onerel, vacrelstats, Irel, indstats, nindexes, - lps, params->index_cleanup); + lps, params->index_cleanup, + has_dead_items_pages, !calledtwopass); Maybe we can use vacrelstats->num_index_scans instead of calledtwopass? When calling to two_pass_strategy() at the end of lazy_scan_heap(), if vacrelstats->num_index_scans is 0 it means this is the first time call, which is equivalent to calledtwopass = false. --- - params.index_cleanup = get_vacopt_ternary_value(opt); + { + if (opt->arg == NULL || strcmp(defGetString(opt), "auto") == 0) + params.index_cleanup = VACOPT_CLEANUP_AUTO; + else if (defGetBoolean(opt)) + params.index_cleanup = VACOPT_CLEANUP_ENABLED; + else + params.index_cleanup = VACOPT_CLEANUP_DISABLED; + } + /* +* Set index cleanup option based on reloptions if not set to either ON or +* OFF. Note that an VACUUM(INDEX_CLEANUP=AUTO) command is interpreted as +* "prefer reloption, but if it's not set dynamically determine if index +* vacuuming and cleanup" takes place in vacuumlazy.c. Note also that the +* reloption might be explicitly set to AUTO. +* +* XXX: Do we really want that? +*/ + if (params->index_cleanup == VACOPT_CLEANUP_AUTO && + onerel->rd_options != NULL) + params->index_cleanup = + ((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup; Perhaps we can make INDEX_CLEANUP option a four-value option: on, off, auto, and default? A problem with the above change would be that if the user wants to do "auto" mode, they might need to reset vacuum_index_cleanup reloption before executing VACUUM command. In other words, there is no way in VACUUM command to force "auto" mode. So I think we can add "auto" value to INDEX_CLEANUP option and ignore the vacuum_index_cleanup reloption if that value is specified. Are you updating also the 0003 patch? If you're focusing on 0001 and 0002 patch, I'll update the 0003 patch along with the fourth patch (skipping index vacuum in emergency cases). 
Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
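A sketch of the four-value INDEX_CLEANUP idea floated above, extending the VACOPT_CLEANUP_* values quoted from the patch; the extra "auto" keyword handling is the reviewer's suggestion spelled out, not settled code:

    typedef enum VacOptIndexCleanupValue
    {
        VACOPT_CLEANUP_DEFAULT,     /* option not given: defer to reloption */
        VACOPT_CLEANUP_AUTO,        /* explicit AUTO: ignore the reloption */
        VACOPT_CLEANUP_DISABLED,
        VACOPT_CLEANUP_ENABLED
    } VacOptIndexCleanupValue;

    /* In ExecVacuum()'s option-parsing loop */
    else if (strcmp(opt->defname, "index_cleanup") == 0)
    {
        if (pg_strcasecmp(defGetString(opt), "auto") == 0)
            params.index_cleanup = VACOPT_CLEANUP_AUTO;
        else if (defGetBoolean(opt))
            params.index_cleanup = VACOPT_CLEANUP_ENABLED;
        else
            params.index_cleanup = VACOPT_CLEANUP_DISABLED;
    }

    /* Later, only the DEFAULT case consults the vacuum_index_cleanup reloption */
    if (params->index_cleanup == VACOPT_CLEANUP_DEFAULT &&
        onerel->rd_options != NULL)
        params->index_cleanup =
            ((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup;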
Re: New IndexAM API controlling index vacuum strategies
On Tue, Mar 23, 2021 at 12:41 PM Peter Geoghegan wrote: > > On Mon, Mar 22, 2021 at 8:33 PM Peter Geoghegan wrote: > > More concretely, maybe the new GUC is forced to be 1.05 of > > vacuum_freeze_table_age. Of course that scheme is a bit arbitrary -- > > but so is the existing 0.95 scheme. > > I meant to write 1.05 of autovacuum_freeze_max_age. So just as > vacuum_freeze_table_age cannot really be greater than 0.95 * > autovacuum_freeze_max_age, your new GUC cannot really be less than > 1.05 * autovacuum_freeze_max_age. That makes sense to me. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: New IndexAM API controlling index vacuum strategies
On Tue, Mar 23, 2021 at 12:28 PM Peter Geoghegan wrote: > > On Mon, Mar 22, 2021 at 6:41 PM Masahiko Sawada wrote: > > But we're not sure when the next anti-wraparound vacuum will take > > place. Since the table is already vacuumed by a non-aggressive vacuum > > with disabling index cleanup, an autovacuum will process the table > > when the table gets modified enough or the table's relfrozenxid gets > > older than autovacuum_freeze_max_age. If the new threshold, probably a > > new GUC, is much lower than autovacuum_freeze_max_age and > > vacuum_freeze_table_age, the table is continuously vacuumed without > > advancing relfrozenxid, leading to unnecessary index bloat. Given > > the new threshold is for emergency purposes (i.e., advancing > > relfrozenxid faster), I think it might be better to use > > vacuum_freeze_table_age as the lower bound of the new threshold. What > > do you think? > > As you know, when the user sets vacuum_freeze_table_age to a value > that is greater than the value of autovacuum_freeze_max_age, the two > GUCs have values that are contradictory. This contradiction is > resolved inside vacuum_set_xid_limits(), which knows that it should > "interpret" the value of vacuum_freeze_table_age as > (autovacuum_freeze_max_age * 0.95) to paper-over the user's error. > This 0.95 behavior is documented in the user docs, though it happens > silently. > > You seem to be concerned about a similar contradiction. In fact it's > *very* similar contradiction, because this new GUC is naturally a > "sibling GUC" of both vacuum_freeze_table_age and > autovacuum_freeze_max_age (the "units" are the same, though the > behavior that each GUC triggers is different -- but > vacuum_freeze_table_age and autovacuum_freeze_max_age are both already > *similar and different* in the same way). So perhaps the solution > should be similar -- silently interpret the setting of the new GUC to > resolve the contradiction. Yeah, that's exactly what I also thought. > > (Maybe I should say "these two new GUCs"? MultiXact variant might be > needed...) Yes, I think we should also have one for MultiXact. > > This approach has the following advantages: > > * It follows precedent. > > * It establishes that the new GUC is a logical extension of the > existing vacuum_freeze_table_age and autovacuum_freeze_max_age GUCs. > > * The default value for the new GUC will be so much higher (say 1.8 > billion XIDs) than even the default of autovacuum_freeze_max_age that > it won't disrupt anybody's existing postgresql.conf setup. > > * For the same reason (the big space between autovacuum_freeze_max_age > and the new GUC with default settings), you can almost set the new GUC > without needing to know about autovacuum_freeze_max_age. > > * The overall behavior isn't actually restrictive/paternalistic. That > is, if you know what you're doing (say you're testing the feature) you > can reduce all 3 sibling GUCs to 0 and get the testing behavior that > you desire. > > What do you think? Totally agreed. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 22, 2021 at 8:33 PM Peter Geoghegan wrote: > More concretely, maybe the new GUC is forced to be 1.05 of > vacuum_freeze_table_age. Of course that scheme is a bit arbitrary -- > but so is the existing 0.95 scheme. I meant to write 1.05 of autovacuum_freeze_max_age. So just as vacuum_freeze_table_age cannot really be greater than 0.95 * autovacuum_freeze_max_age, your new GUC cannot really be less than 1.05 * autovacuum_freeze_max_age. -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 22, 2021 at 8:28 PM Peter Geoghegan wrote: > You seem to be concerned about a similar contradiction. In fact it's > *very* similar contradiction, because this new GUC is naturally a > "sibling GUC" of both vacuum_freeze_table_age and > autovacuum_freeze_max_age (the "units" are the same, though the > behavior that each GUC triggers is different -- but > vacuum_freeze_table_age and autovacuum_freeze_max_age are both already > *similar and different* in the same way). So perhaps the solution > should be similar -- silently interpret the setting of the new GUC to > resolve the contradiction. More concretely, maybe the new GUC is forced to be 1.05 of vacuum_freeze_table_age. Of course that scheme is a bit arbitrary -- but so is the existing 0.95 scheme. There may be some value in picking a scheme that "advertises" that all three GUCs are symmetrical, or at least related -- all three divide up the table's XID space. -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 22, 2021 at 6:41 PM Masahiko Sawada wrote: > But we're not sure when the next anti-wraparound vacuum will take > place. Since the table is already vacuumed by a non-aggressive vacuum > with disabling index cleanup, an autovacuum will process the table > when the table gets modified enough or the table's relfrozenxid gets > older than autovacuum_freeze_max_age. If the new threshold, probably a > new GUC, is much lower than autovacuum_freeze_max_age and > vacuum_freeze_table_age, the table is continuously vacuumed without > advancing relfrozenxid, leading to unnecessary index bloat. Given > the new threshold is for emergency purposes (i.e., advancing > relfrozenxid faster), I think it might be better to use > vacuum_freeze_table_age as the lower bound of the new threshold. What > do you think? As you know, when the user sets vacuum_freeze_table_age to a value that is greater than the value of autovacuum_freeze_max_age, the two GUCs have values that are contradictory. This contradiction is resolved inside vacuum_set_xid_limits(), which knows that it should "interpret" the value of vacuum_freeze_table_age as (autovacuum_freeze_max_age * 0.95) to paper-over the user's error. This 0.95 behavior is documented in the user docs, though it happens silently. You seem to be concerned about a similar contradiction. In fact it's *very* similar contradiction, because this new GUC is naturally a "sibling GUC" of both vacuum_freeze_table_age and autovacuum_freeze_max_age (the "units" are the same, though the behavior that each GUC triggers is different -- but vacuum_freeze_table_age and autovacuum_freeze_max_age are both already *similar and different* in the same way). So perhaps the solution should be similar -- silently interpret the setting of the new GUC to resolve the contradiction. (Maybe I should say "these two new GUCs"? MultiXact variant might be needed...) This approach has the following advantages: * It follows precedent. * It establishes that the new GUC is a logical extension of the existing vacuum_freeze_table_age and autovacuum_freeze_max_age GUCs. * The default value for the new GUC will be so much higher (say 1.8 billion XIDs) than even the default of autovacuum_freeze_max_age that it won't disrupt anybody's existing postgresql.conf setup. * For the same reason (the big space between autovacuum_freeze_max_age and the new GUC with default settings), you can almost set the new GUC without needing to know about autovacuum_freeze_max_age. * The overall behavior isn't actually restrictive/paternalistic. That is, if you know what you're doing (say you're testing the feature) you can reduce all 3 sibling GUCs to 0 and get the testing behavior that you desire. What do you think? -- Peter Geoghegan
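The clamping precedent and the proposed analog are easy to make concrete. In this small self-contained example the new GUC is called vacuum_failsafe_age purely for illustration, and the 1.05 treatment is the proposal under discussion, not settled behavior; with autovacuum_freeze_max_age set to 1.8 billion it puts the effective failsafe threshold at roughly 1.89 billion XIDs:

    #include <stdio.h>

    int
    main(void)
    {
        int     autovacuum_freeze_max_age = 1800000000; /* aggressively tuned */
        int     vacuum_freeze_table_age = 2000000000;   /* contradicts the above */
        int     vacuum_failsafe_age = 1000000000;       /* set too low by the user */

        /* Existing precedent (vacuum_set_xid_limits): silently clamp down */
        if (vacuum_freeze_table_age > autovacuum_freeze_max_age * 0.95)
            vacuum_freeze_table_age = autovacuum_freeze_max_age * 0.95;

        /* Proposed analog: the failsafe never kicks in before 1.05x */
        if (vacuum_failsafe_age < autovacuum_freeze_max_age * 1.05)
            vacuum_failsafe_age = autovacuum_freeze_max_age * 1.05;

        printf("effective vacuum_freeze_table_age = %d\n", vacuum_freeze_table_age);
        printf("effective failsafe age            = %d\n", vacuum_failsafe_age);
        return 0;
    }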
Re: New IndexAM API controlling index vacuum strategies
On Fri, Mar 19, 2021 at 3:36 AM Peter Geoghegan wrote: > > On Thu, Mar 18, 2021 at 3:32 AM Masahiko Sawada wrote: > > If we have the constant threshold of 1 billion transactions, a vacuum > > operation might not be an anti-wraparound vacuum and even not be an > > aggressive vacuum, depending on autovacuum_freeze_max_age value. Given > > the purpose of skipping index vacuuming in this case, I think it > > doesn't make sense to have non-aggressive vacuum skip index vacuuming > > since it might not be able to advance relfrozenxid. If we have a > > constant threshold, 2 billion transactions, maximum value of > > autovacuum_freeze_max_age, seems to work. > > I like the idea of not making the behavior a special thing that only > happens with a certain variety of VACUUM operation (non-aggressive or > anti-wraparound VACUUMs). Just having a very high threshold should be > enough. > > Even if we're not going to be able to advance relfrozenxid, we'll > still finish much earlier and let a new anti-wraparound vacuum take > place that will do that -- and will be able to reuse much of the work > of the original VACUUM. Of course this anti-wraparound vacuum will > also skip index vacuuming from the start (whereas the first VACUUM may > well have done some index vacuuming before deciding to end index > vacuuming to hurry with finishing). But we're not sure when the next anti-wraparound vacuum will take place. Since the table is already vacuumed by a non-aggressive vacuum with disabling index cleanup, an autovacuum will process the table when the table gets modified enough or the table's relfrozenxid gets older than autovacuum_freeze_max_age. If the new threshold, probably a new GUC, is much lower than autovacuum_freeze_max_age and vacuum_freeze_table_age, the table is continuously vacuumed without advancing relfrozenxid, leading to unnecessary index bloat. Given the new threshold is for emergency purposes (i.e., advancing relfrozenxid faster), I think it might be better to use vacuum_freeze_table_age as the lower bound of the new threshold. What do you think? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 22, 2021 at 7:05 AM Robert Haas wrote: > I agree. I was having trouble before understanding exactly what you > are proposing, but this makes sense to me and I agree it's a good > idea. Our ambition is for this to be one big multi-release umbrella project, with several individual enhancements that each deliver a user-visible benefit on their own. The fact that we're talking about a few things at once is confusing, but I think that you need a "grand bargain" kind of discussion for this. I believe that it actually makes sense to do it that way, difficult though it may be. Sometimes the goal is to improve performance, other times the goal is to improve robustness. Although the distinction gets blurry at the margins. If VACUUM was infinitely fast (say because of sorcery), then performance would be *unbeatable* -- plus we'd never have to worry about anti-wraparound vacuums not completing in time! > I'm not 100% sure whether we need a new GUC for this or not. I think > that if by default this triggers at the 90% of the hard-shutdown > limit, it would be unlikely, and perhaps unreasonable, for users to > want to raise the limit. However, I wonder whether some users will > want to lower the limit. Would it be reasonable for someone to want to > trigger this at 50% or 70% of XID exhaustion rather than waiting until > things get really bad? I wanted to avoid inventing a GUC for the mechanism in the third patch (not the emergency mechanism we're discussing right now, which was posted separately by Masahiko). I think that a GUC to control skipping index vacuuming purely because there are very few index tuples to delete in indexes will become a burden before long. In particular, we should eventually be able to vacuum some indexes but not others (on the same table) based on the local needs of each index. As I keep pointing out, bottom-up index deletion has created a situation where there can be dramatically different needs among indexes on the same table -- it can literally prevent 100% of all page splits from version churn in those indexes that are never subject to logical changes from non-HOT updates. Whereas bottom-up index deletion does nothing for any index that is logically updated, for the obvious reason -- there is now frequently a sharp qualitative difference among indexes that vacuumlazy.c currently imagines have basically the same needs. Vacuuming these remaining indexes is a cost that users will actually understand and accept, too. But that has nothing to do with the emergency mechanism we're talking about right now. I actually like your idea of making the emergency mechanism a GUC. It's equivalent to index_cleanup, except that it is continuous and dynamic (not discrete and static). The fact that this GUC expresses what VACUUM should do in terms of the age of the target table's current relfrozenxid (and nothing else) seems like exactly the right thing. As I said before: What else could possibly matter? So I don't see any risk of such a GUC becoming a burden. I also think that it's a useful knob to be able to tune. It's also going to help a lot with testing the feature. So let's have a GUC for that. > Also, one thing that I dislike about the current system is that, from > a user perspective, when something goes wrong, nothing happens for a > while and then the whole system goes bananas. It seems desirable to me > to find ways of gradually ratcheting up the pressure, like cranking up > the effective cost limit if we can somehow figure out that we're not > keeping up. 
If, with your mechanism, there's an abrupt point when we > switch from never doing this for any table to always doing this for > every table, that might not be as good as something which does this > "sometimes" and then, if that isn't enough to avoid disaster, does it > "more," and eventually ramps up to doing it always, if trouble > continues. I don't know whether that's possible here, or what it would > look like, or even whether it's appropriate at all in this particular > case, so I just offer it as food for thought. That is exactly the kind of thing that some future highly evolved version of the broader incremental/dynamic VACUUM design might do. Your thoughts about the effective delay/throttling lessening as conditions change are in line with the stuff that we're thinking of doing. Though I don't believe Masahiko and I have talked about the delay stuff specifically in any of our private discussions about it. I am a big believer in the idea that we should have a variety of strategies that are applied incrementally and dynamically, in response to an immediate local need (say at the index level). VACUUM should be able to organically figure out the best strategy (or combination of strategies) itself, over time. Sometimes it will be very important to recognize that most indexes have been able to use techniques like bottom-up index deletion, and so really don't need to be vacuumed at all. Other times the cost delay stuff
Re: New IndexAM API controlling index vacuum strategies
On Thu, Mar 18, 2021 at 9:42 PM Peter Geoghegan wrote: > The fact that we can *continually* reevaluate if an ongoing VACUUM is > at risk of taking too long is entirely the point here. We can in > principle end index vacuuming dynamically, whenever we feel like it > and for whatever reasons occur to us (hopefully these are good reasons > -- the point is that we get to pick and choose). We can afford to be > pretty aggressive about not giving up, while still having the benefit > of doing that when it *proves* necessary. Because: what are the > chances of the emergency mechanism ending index vacuuming being the > wrong thing to do if we only do that when the system clearly and > measurably has no more than about 10% of the possible XID space to go > before the system becomes unavailable for writes? I agree. I was having trouble before understanding exactly what you are proposing, but this makes sense to me and I agree it's a good idea. > > But ... should the thresholds for triggering these kinds of mechanisms > > really be hard-coded with no possibility of being configured in the > > field? What if we find out after the release is shipped that the > > mechanism works better if you make it kick in sooner, or later, or if > > it depends on other things about the system, which I think it almost > > certainly does? Thresholds that can't be changed without a recompile > > are bad news. That's why we have GUCs. > > I'm fine with a GUC, though only for the emergency mechanism. The > default really matters, though -- it shouldn't be necessary to tune > (since we're trying to address a problem that many people don't know > they have until it's too late). I still like 1.8 billion XIDs as the > value -- I propose that that be made the default. I'm not 100% sure whether we need a new GUC for this or not. I think that if by default this triggers at the 90% of the hard-shutdown limit, it would be unlikely, and perhaps unreasonable, for users to want to raise the limit. However, I wonder whether some users will want to lower the limit. Would it be reasonable for someone to want to trigger this at 50% or 70% of XID exhaustion rather than waiting until things get really bad? Also, one thing that I dislike about the current system is that, from a user perspective, when something goes wrong, nothing happens for a while and then the whole system goes bananas. It seems desirable to me to find ways of gradually ratcheting up the pressure, like cranking up the effective cost limit if we can somehow figure out that we're not keeping up. If, with your mechanism, there's an abrupt point when we switch from never doing this for any table to always doing this for every table, that might not be as good as something which does this "sometimes" and then, if that isn't enough to avoid disaster, does it "more," and eventually ramps up to doing it always, if trouble continues. I don't know whether that's possible here, or what it would look like, or even whether it's appropriate at all in this particular case, so I just offer it as food for thought. > > On another note, I cannot say enough bad things about the function > > name two_pass_strategy(). I sincerely hope that you're not planning to > > create a function which is a major point of control for VACUUM whose > > name gives no hint that it has anything to do with vacuum. > > You always hate my names for things. But that's fine by me -- I'm > usually not very attached to them. I'm happy to change it to whatever > you prefer. 
My taste in names may be debatable, but including the subsystem name in the function name seems like a pretty bare-minimum requirement, especially when the other words in the function name could refer to just about anything. -- Robert Haas EDB: http://www.enterprisedb.com
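To make the numbers in this exchange concrete, here is a small, self-contained sketch of a trigger point expressed as a fraction of the usable XID space. Nothing here comes from the posted patches -- the names and the round 2 billion figure are purely illustrative:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Round figure used in this thread for the usable XID space */
#define XID_SPACE   2000000000U

/* Would the emergency mechanism fire for a table of this relfrozenxid age? */
static bool
failsafe_triggered(uint32_t relfrozenxid_age, double cutoff_fraction)
{
    return (double) relfrozenxid_age >= cutoff_fraction * XID_SPACE;
}

int
main(void)
{
    uint32_t age = 1500000000U;     /* table's relfrozenxid is 1.5 billion XIDs old */

    /* The "90% of the hard-shutdown limit" default being discussed */
    printf("cutoff 90%%: %s\n",
           failsafe_triggered(age, 0.90) ? "trigger" : "keep vacuuming indexes");

    /* A more cautious user lowering the cutoff to 70% of XID exhaustion */
    printf("cutoff 70%%: %s\n",
           failsafe_triggered(age, 0.70) ? "trigger" : "keep vacuuming indexes");
    return 0;
}

With these illustrative numbers, a 90% cutoff does not fire for a table that is 1.5 billion XIDs behind, while a 70% cutoff does -- which is exactly the kind of difference a user lowering such a GUC would be after.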
Re: New IndexAM API controlling index vacuum strategies
On Sat, Mar 20, 2021 at 11:05 AM Peter Geoghegan wrote: > > On Wed, Mar 17, 2021 at 7:55 PM Peter Geoghegan wrote: > > Attached patch series splits everything up. There is now a large patch > > that removes the tupgone special case, and a second patch that > > actually adds code that dynamically decides to not do index vacuuming > > in cases where (for whatever reason) it doesn't seem useful. > > Attached is v4. This revision of the patch series is split up into > smaller pieces for easier review. There are now 3 patches in the > series: Thank you for the patches! > > 1. A refactoring patch that takes code from lazy_scan_heap() and > breaks it into several new functions. > > Not too many changes compared to the last revision here (mostly took > things out and put them in the second patch). I've looked at this 0001 patch and here are some review comments: +/* + * scan_prune_page() -- lazy_scan_heap() pruning and freezing. + * + * Caller must hold pin and buffer cleanup lock on the buffer. + * + * Prior to PostgreSQL 14 there were very rare cases where lazy_scan_heap() + * treated tuples that still had storage after pruning as DEAD. That happened + * when heap_page_prune() could not prune tuples that were nevertheless deemed + * DEAD by its own HeapTupleSatisfiesVacuum() call. This created rare hard to + * test cases. It meant that there was no very sharp distinction between DEAD + * tuples and tuples that are to be kept and be considered for freezing inside + * heap_prepare_freeze_tuple(). It also meant that lazy_vacuum_page() had to + * be prepared to remove items with storage (tuples with tuple headers) that + * didn't get pruned, which created a special case to handle recovery + * conflicts. + * + * The approach we take here now (to eliminate all of this complexity) is to + * simply restart pruning in these very rare cases -- cases where a concurrent + * abort of an xact makes our HeapTupleSatisfiesVacuum() call disagrees with + * what heap_page_prune() thought about the tuple only microseconds earlier. + * + * Since we might have to prune a second time here, the code is structured to + * use a local per-page copy of the counters that caller accumulates. We add + * our per-page counters to the per-VACUUM totals from caller last of all, to + * avoid double counting. Those comments should be a part of 0002 patch? --- + pc.num_tuples += 1; + ps->hastup = true; + + /* +* Each non-removable tuple must be checked to see if it needs +* freezing +*/ + if (heap_prepare_freeze_tuple(tuple.t_data, + RelFrozenXid, RelMinMxid, + FreezeLimit, MultiXactCutoff, + &frozen[nfrozen], + &tuple_totally_frozen)) + frozen[nfrozen++].offset = offnum; + + pc.num_tuples += 1; + ps->hastup = true; pc.num_tuples is incremented twice. ps->hastup = true is also duplicated. --- In step 7, with the patch, we save the freespace of the page and do lazy_vacuum_page(). But should it be done in reverse? --- +static void +two_pass_strategy(Relation onerel, LVRelStats *vacrelstats, + Relation *Irel, IndexBulkDeleteResult **indstats, int nindexes, + LVParallelState *lps, VacOptTernaryValue index_cleanup) How about renaming to vacuum_two_pass_strategy() or something to clear this function is used to vacuum? --- + /* +* skipped index vacuuming. Make log report that lazy_vacuum_heap +* would've made. +* +* Don't report tups_vacuumed here because it will be zero here in +* common case where there are no newly pruned LP_DEAD items for this +* VACUUM. 
This is roughly consistent with lazy_vacuum_heap(), and +* the similar !useindex ereport() at the end of lazy_scan_heap(). +* Note, however, that has_dead_items_pages is # of heap pages with +* one or more LP_DEAD items (could be from us or from another +* VACUUM), not # blocks scanned. +*/ + ereport(elevel, + (errmsg("\"%s\": INDEX_CLEANUP off forced VACUUM to not totally remove %d pruned items", + vacrelstats->relname, + vacrelstats->dead_tuples->num_tuples))); It seems that the comment needs to be updated. > > 2. A patch to remove the tupgone case. > > Severa new and interesting changes here -- see below. > > 3. The patch to optimize VACUUM by teaching it to skip index and heap > vacuuming in certain cases where we only expect a very small benefit. I’ll review the other two patches tomorrow. > > We now go further with removing unnecessary stuff in WAL records in
Re: New IndexAM API controlling index vacuum strategies
On Sun, Mar 21, 2021 at 1:24 AM Greg Stark wrote: > What I've seen is an application that regularly ran ANALYZE on a > table. This worked fine as long as vacuums took less than the interval > between analyzes (in this case 1h) but once vacuum started taking > longer than that interval autovacuum would cancel it every time due to > the conflicting lock. > > That would have just continued until the wraparound vacuum which > wouldn't self-cancel except that there was also a demon running which > would look for sessions stuck on a lock and kill the blocker -- which > included killing the wraparound vacuum. That's a new one! Though clearly it's an example of what I described. I do agree that sometimes the primary cause is the special rules for cancellations with anti-wraparound autovacuums. > And yes, this demon is obviously a terrible idea but of course it was > meant for killing buggy user queries. It wasn't expecting to find > autovacuum jobs blocking things. The real surprise for that user was > that VACUUM could be blocked by things that someone would reasonably > want to run regularly like ANALYZE. The infrastructure from my patch to eliminate the tupgone special case (the patch that fully decouples index and heap vacuuming from pruning and freezing) ought to enable smarter autovacuum cancellations. It should be possible to make "canceling" an autovacuum worker actually instruct the worker to consider the possibility of finishing off the VACUUM operation very quickly, by simply ending index vacuuming (and heap vacuuming). It should only be necessary to cancel when that strategy won't work out, because we haven't finished all required pruning and freezing yet -- which are the only truly essential tasks of any "successful" VACUUM operation. Maybe it would only be appropriate to do something like that for anti-wraparound VACUUMs, which, as you say, don't get cancelled when they block the acquisition of a lock (which is a sensible design, though only because of the specific risk of not managing to advance relfrozenxid). There wouldn't be a question of canceling an anti-wraparound VACUUM in the conventional sense with this mechanism. It would simply instruct the anti-wraparound VACUUM to finish as quickly as possible by skipping the indexes. Naturally the implementation wouldn't really need to consider whether that meant the anti-wraparound VACUUM could end almost immediately, or some time later -- the point is that it completes ASAP. -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Thu, 18 Mar 2021 at 14:37, Peter Geoghegan wrote: > They usually involve some *combination* of Postgres problems, > application code problems, and DBA error. Not any one thing. I've seen > problems with application code that runs DDL at scheduled intervals, > which interacts badly with vacuum -- but only really on the rare > occasions when freezing is required! What I've seen is an application that regularly ran ANALYZE on a table. This worked fine as long as vacuums took less than the interval between analyzes (in this case 1h) but once vacuum started taking longer than that interval autovacuum would cancel it every time due to the conflicting lock. That would have just continued until the wraparound vacuum which wouldn't self-cancel except that there was also a demon running which would look for sessions stuck on a lock and kill the blocker -- which included killing the wraparound vacuum. And yes, this demon is obviously a terrible idea but of course it was meant for killing buggy user queries. It wasn't expecting to find autovacuum jobs blocking things. The real surprise for that user was that VACUUM could be blocked by things that someone would reasonably want to run regularly like ANALYZE. -- greg
Re: New IndexAM API controlling index vacuum strategies
On Thu, Mar 18, 2021 at 2:05 PM Robert Haas wrote: > On Wed, Mar 17, 2021 at 11:23 PM Peter Geoghegan wrote: > > Most anti-wraparound VACUUMs are really not emergencies, though. > > That's true, but it's equally true that most of the time it's not > necessary to wear a seatbelt to avoid personal injury. The difficulty > is that it's hard to predict on which occasions it is necessary, and > therefore it is advisable to do it all the time. Just to be clear: This was pretty much the point I was making here -- although I guess you're making the broader point about autovacuum and freezing in general. The fact that we can *continually* reevaluate if an ongoing VACUUM is at risk of taking too long is entirely the point here. We can in principle end index vacuuming dynamically, whenever we feel like it and for whatever reasons occur to us (hopefully these are good reasons -- the point is that we get to pick and choose). We can afford to be pretty aggressive about not giving up, while still having the benefit of doing that when it *proves* necessary. Because: what are the chances of the emergency mechanism ending index vacuuming being the wrong thing to do if we only do that when the system clearly and measurably has no more than about 10% of the possible XID space to go before the system becomes unavailable for writes? What could possibly matter more than that? By making the decision dynamic, the chances of our threshold/heuristics causing the wrong behavior become negligible -- even though we're making the decision based on a tiny amount of (current, authoritative) information. The only novel risk I can think about is that somebody comes to rely on the mechanism saving the day, over and over again, rather than fixing a fixable problem. > autovacuum decides > whether an emergency exists, in the first instance, by comparing > age(relfrozenxid) to autovacuum_freeze_max_age, but that's problematic > for at least two reasons. First, what matters is not when the vacuum > starts, but when the vacuum finishes. To be fair the vacuum_set_xid_limits() mechanism that you refer to makes perfect sense. It's just totally insufficient for the reasons you say. > A user who has no tables larger > than 100MB can set autovacuum_freeze_max_age a lot closer to the high > limit without risk of hitting it than a user who has a 10TB table. The > time to run vacuum is dependent on both the size of the table and the > applicable cost delay settings, none of which autovacuum knows > anything about. It also knows nothing about the XID consumption rate. > It's relying on the user to set autovacuum_freeze_max_age low enough > that all the anti-wraparound vacuums will finish before the system > crashes into a wall. Literally nobody on earth knows what their XID burn rate is when it really matters. It might be totally out of control that one day of your life where it truly matters (e.g., due to a recent buggy code deployment, which I've seen up close). That's how emergencies work. A dynamic approach is not merely preferable. It seems essential. No top-down plan is going to be smart enough to predict that it'll take a really long time to get that one super-exclusive lock on relatively few pages. > Second, what happens to one table affects what > happens to other tables. 
Even if you have perfect knowledge of your > XID consumption rate and the speed at which vacuum will complete, you > can't just configure autovacuum_freeze_max_age to allow exactly enough > time for the vacuum to complete once it hits the threshold, unless you > have one autovacuum worker per table so that the work for that table > never has to wait for work on any other tables. And even then, as you > mention, you have to worry about the possibility that a vacuum was > already in progress on that table itself. Here again, we rely on the > user to know empirically how high they can set > autovacuum_freeze_max_age without cutting it too close. But the VM is a lot more useful when you effectively eliminate index vacuuming from the picture. And VACUUM has a pretty good understanding of how that works. Index vacuuming remains the achilles' heel, and I think that avoiding it in some cases has tremendous value. It has outsized importance now because we've significantly ameliorated the problems in the heap, by having the visibility map. What other factor can make VACUUM take 10x longer than usual on occasion? Autovacuum scheduling is essentially a top-down model of the needs of the system -- and one with a lot of flaws. IMV we can make the model's simplistic view of reality better by making the reality better (i.e. simpler, more tolerant of stressors) instead of making the model better. > Now, that's not actually a good thing, because most users aren't smart > enough to do that, and will either leave a gigantic safety margin that > they don't need, or will leave an inadequate safety margin and take > the system down. However, it means we need to be very, very careful > abo
Re: New IndexAM API controlling index vacuum strategies
On Wed, Mar 17, 2021 at 11:23 PM Peter Geoghegan wrote: > Most anti-wraparound VACUUMs are really not emergencies, though. That's true, but it's equally true that most of the time it's not necessary to wear a seatbelt to avoid personal injury. The difficulty is that it's hard to predict on which occasions it is necessary, and therefore it is advisable to do it all the time. autovacuum decides whether an emergency exists, in the first instance, by comparing age(relfrozenxid) to autovacuum_freeze_max_age, but that's problematic for at least two reasons. First, what matters is not when the vacuum starts, but when the vacuum finishes. A user who has no tables larger than 100MB can set autovacuum_freeze_max_age a lot closer to the high limit without risk of hitting it than a user who has a 10TB table. The time to run vacuum is dependent on both the size of the table and the applicable cost delay settings, none of which autovacuum knows anything about. It also knows nothing about the XID consumption rate. It's relying on the user to set autovacuum_freeze_max_age low enough that all the anti-wraparound vacuums will finish before the system crashes into a wall. Second, what happens to one table affects what happens to other tables. Even if you have perfect knowledge of your XID consumption rate and the speed at which vacuum will complete, you can't just configure autovacuum_freeze_max_age to allow exactly enough time for the vacuum to complete once it hits the threshold, unless you have one autovacuum worker per table so that the work for that table never has to wait for work on any other tables. And even then, as you mention, you have to worry about the possibility that a vacuum was already in progress on that table itself. Here again, we rely on the user to know empirically how high they can set autovacuum_freeze_max_age without cutting it too close. Now, that's not actually a good thing, because most users aren't smart enough to do that, and will either leave a gigantic safety margin that they don't need, or will leave an inadequate safety margin and take the system down. However, it means we need to be very, very careful about hard-coded thresholds like 90% of the available XID space. I do think that there is a case for triggering emergency extra safety measures when things are looking scary. One that I think would help a tremendous amount is to start ignoring the vacuum cost delay when wraparound danger (and maybe even bloat danger) starts to loom. Perhaps skipping index vacuuming is another such measure, though I suspect it would help fewer people, because in most of the cases I see, the system is throttled to use a tiny percentage of its actual hardware capability. If you're running at 1/5 of the speed of which the hardware is capable, you can only do better by skipping index cleanup if that skips more than 80% of page accesses, which could be true but probably isn't. In reality, I think we probably want both mechanisms, because they complement each other. If one can save 3X and the other 4X, the combination is a 12X improvement, which is a big deal. We might want other things, too. But ... should the thresholds for triggering these kinds of mechanisms really be hard-coded with no possibility of being configured in the field? What if we find out after the release is shipped that the mechanism works better if you make it kick in sooner, or later, or if it depends on other things about the system, which I think it almost certainly does? 
Thresholds that can't be changed without a recompile are bad news. That's why we have GUCs. On another note, I cannot say enough bad things about the function name two_pass_strategy(). I sincerely hope that you're not planning to create a function which is a major point of control for VACUUM whose name gives no hint that it has anything to do with vacuum. -- Robert Haas EDB: http://www.enterprisedb.com
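The 80% figure follows from straightforward arithmetic. The toy program below just works it out, using the same assumption as the email -- a cost delay that leaves VACUUM running at 1/5 of hardware speed -- and nothing from any patch:

#include <stdio.h>

int
main(void)
{
    double throttle = 0.2;      /* VACUUM limited to 1/5 of hardware speed */

    for (int pct = 50; pct <= 90; pct += 10)
    {
        double f = pct / 100.0;     /* fraction of page accesses skipped */

        /* Time while still throttled, relative to unthrottled time of 1.0 */
        printf("skip %d%% of accesses, stay throttled: %.2fx the unthrottled time\n",
               pct, (1.0 - f) / throttle);
    }

    /* Break-even: (1 - f) / throttle = 1  =>  f = 1 - throttle = 80% */
    printf("break-even skip fraction: %.0f%%\n", (1.0 - throttle) * 100.0);
    return 0;
}

Skipping 50% of page accesses still takes 2.5x as long as simply removing the throttle; only past the 80% mark does skipping index cleanup win on its own.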
Re: New IndexAM API controlling index vacuum strategies
On Thu, Mar 18, 2021 at 3:32 AM Masahiko Sawada wrote: > If we have the constant threshold of 1 billion transactions, a vacuum > operation might not be an anti-wraparound vacuum and even not be an > aggressive vacuum, depending on autovacuum_freeze_max_age value. Given > the purpose of skipping index vacuuming in this case, I think it > doesn't make sense to have non-aggressive vacuum skip index vacuuming > since it might not be able to advance relfrozenxid. If we have a > constant threshold, 2 billion transactions, maximum value of > autovacuum_freeze_max_age, seems to work. I like the idea of not making the behavior a special thing that only happens with a certain variety of VACUUM operation (non-aggressive or anti-wraparound VACUUMs). Just having a very high threshold should be enough. Even if we're not going to be able to advance relfrozenxid, we'll still finish much earlier and let a new anti-wraparound vacuum take place that will do that -- and will be able to reuse much of the work of the original VACUUM. Of course this anti-wraparound vacuum will also skip index vacuuming from the start (whereas the first VACUUM may well have done some index vacuuming before deciding to end index vacuuming to hurry with finishing). There is a risk in having the limit be too high, though. We need to give VACUUM time to reach two_pass_strategy() to notice the problem and act (maybe each call to lazy_vacuum_all_indexes() takes a long time). Also, while it's possible (and perhaps even likely) that cases that use this emergency mechanism will be able to end the VACUUM immediately (because there is enough maintenance_work_mem to make the first call to two_pass_strategy() also the last call), that won't always be how it works. Even deciding to stop index vacuuming (and heap vacuuming) may not be enough to avert disaster if left too late -- because we may still have to do a lot of table pruning. In cases where there is not nearly enough maintenance_work_mem we will get through the table a lot faster once we decide to skip indexes, but there is some risk that even this will not be fast enough. How about 1.8 billion XIDs? That's the maximum value of autovacuum_freeze_max_age (2 billion) minus the default value (200 million). That is high enough that it seems almost impossible for this emergency mechanism to hurt rather than help. At the same time it is not so high that there isn't some remaining time to finish off work which is truly required. > > This seems like a good idea because we should try to avoid changing > > back to index vacuuming having decided to skip it once. > > Once decided to skip index vacuuming due to too old relfrozenxid > stuff, the decision never be changed within the same vacuum operation, > right? Because the relfrozenxid is advanced at the end of vacuum. I see no reason why it would be fundamentally incorrect to teach two_pass_strategy() to make new and independent decisions about doing index vacuuming on each call. I just don't think that it makes any sense to do so, practically speaking. Why would we even *want* to decide to not do index vacuuming, and then change our mind about it again (resume index vacuuming again, for later heap blocks)? That sounds a bit too much like me! There is another reason to never go back to index vacuuming: we should have an ereport() at the point that we decide to not do index vacuuming (or not do additional index vacuuming) inside two_pass_strategy(). This should deliver an unpleasant message to the DBA.
The message is (in my own informal language): An emergency failsafe mechanism made VACUUM skip index vacuuming, just to avoid likely XID wraparound failure. This is not supposed to happen. Consider tuning autovacuum settings, especially if you see this message regularly. Obviously the reason to delay the decision is that we cannot easily predict how long any given VACUUM will take (or just to reach two_pass_strategy()). Nor can we really hope to understand how many XIDs will be consumed in that time. So rather than trying to understand all that, we can instead just wait until we have reliable information. It is true that the risk of waiting until it's too late to avert disaster exists (which is why 1.8 billion XIDs seems like a good threshold to me), but there is only so much we can do about that. We don't need it to be perfect, just much better. In my experience, anti-wraparound VACUUM scenarios all have an "accident chain", which is a concept from the world of aviation and safety-critical systems: https://en.wikipedia.org/wiki/Chain_of_events_(accident_analysis) They usually involve some *combination* of Postgres problems, application code problems, and DBA error. Not any one thing. I've seen problems with application code that runs DDL at scheduled intervals, which interacts badly with vacuum -- but only really on the rare occasions when freezing is required! I've also seen a buggy hand-written upsert function that artificially burned through XIDs
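For concreteness, a message along those lines could be raised with an ordinary ereport(WARNING, ...) call, roughly like the sketch below. The wording and the variables shown (vacrelstats->relname and relfrozenxid_age in particular) are illustrative only, not taken from any posted patch:

    ereport(WARNING,
            (errmsg("abandoning index vacuuming of table \"%s\" as a failsafe against transaction ID wraparound",
                    vacrelstats->relname),
             errdetail("The table's relfrozenxid is %u transactions old.",
                       relfrozenxid_age),
             errhint("This is not supposed to happen. Consider tuning autovacuum settings, especially if you see this message regularly.")));

WARNING (rather than LOG) seems like the right severity for something that is explicitly "not supposed to happen" but that the DBA can act on.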
Re: New IndexAM API controlling index vacuum strategies
On Thu, Mar 18, 2021 at 3:41 PM Peter Geoghegan wrote: > > On Wed, Mar 17, 2021 at 11:23 PM Masahiko Sawada > wrote: > > Attached the updated patch that can be applied on top of your v3 patches. > > Some feedback on this: > > * I think that we can afford to be very aggressive here, because we're > checking dynamically. And we're concerned about extremes only. So an > age of as high as 1 billion transactions seems like a better approach. > What do you think? If we have the constant threshold of 1 billion transactions, a vacuum operation might not be an anti-wraparound vacuum and even not be an aggressive vacuum, depending on autovacuum_freeze_max_age value. Given the purpose of skipping index vacuuming in this case, I think it doesn't make sense to have non-aggressive vacuum skip index vacuuming since it might not be able to advance relfrozenxid. If we have a constant threshold, 2 billion transactions, maximum value of autovacuum_freeze_max_age, seems to work. > > * I think that you need to remember that we have decided not to do any > more index vacuuming, rather than calling > check_index_cleanup_xid_limit() each time -- maybe store that > information in a state variable. > > This seems like a good idea because we should try to avoid changing > back to index vacuuming having decided to skip it once. Once decided to skip index vacuuming due to too old relfrozenxid stuff, the decision never be changed within the same vacuum operation, right? Because the relfrozenxid is advanced at the end of vacuum. > Also, we need > to refer to this in lazy_scan_heap(), so that we avoid index cleanup > having also avoided index vacuuming. This is like the INDEX_CLEANUP = > off case, which is also only for emergencies. It is not like the > SKIP_VACUUM_PAGES_RATIO case, which is just an optimization. Agreed with this point. I'll fix it in the next version patch. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: New IndexAM API controlling index vacuum strategies
On Wed, Mar 17, 2021 at 11:23 PM Masahiko Sawada wrote: > Attached the updated patch that can be applied on top of your v3 patches. Some feedback on this: * I think that we can afford to be very aggressive here, because we're checking dynamically. And we're concerned about extremes only. So an age of as high as 1 billion transactions seems like a better approach. What do you think? * I think that you need to remember that we have decided not to do any more index vacuuming, rather than calling check_index_cleanup_xid_limit() each time -- maybe store that information in a state variable. This seems like a good idea because we should try to avoid changing back to index vacuuming having decided to skip it once. Also, we need to refer to this in lazy_scan_heap(), so that we avoid index cleanup having also avoided index vacuuming. This is like the INDEX_CLEANUP = off case, which is also only for emergencies. It is not like the SKIP_VACUUM_PAGES_RATIO case, which is just an optimization. Thanks -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Thu, Mar 18, 2021 at 12:23 PM Peter Geoghegan wrote: > > On Wed, Mar 17, 2021 at 7:16 PM Masahiko Sawada wrote: > > Since I was thinking that always skipping index vacuuming on > > anti-wraparound autovacuum is legitimate, skipping index vacuuming > > only when we're really close to the point of going into read-only mode > > seems a bit conservative, but maybe a good start. I've attached a PoC > > patch to disable index vacuuming if the table's relfrozenxid is too > > older than autovacuum_freeze_max_age (older than 1.5x of > > autovacuum_freeze_max_age). > > Most anti-wraparound VACUUMs are really not emergencies, though. So > treating them as special simply because they're anti-wraparound > vacuums doesn't seem like the right thing to do. I think that we > should dynamically decide to do this when (antiwraparound) VACUUM has > already been running for some time. We need to delay the decision > until it is almost certainly true that we really have an emergency. That's a good idea to delay the decision until two_pass_strategy(). > > Can you take what you have here, and make the decision dynamic? Delay > it until we're done with the first heap scan? This will require > rebasing on top of the patch I posted. And then adding a third patch, > a little like the second patch -- but not too much like it. Attached the updated patch that can be applied on top of your v3 patches. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/ v3-0003-Skip-index-vacuuming-when-there-is-a-risk-of-wrap.patch Description: Binary data
Re: New IndexAM API controlling index vacuum strategies
On Wed, Mar 17, 2021 at 7:16 PM Masahiko Sawada wrote: > Since I was thinking that always skipping index vacuuming on > anti-wraparound autovacuum is legitimate, skipping index vacuuming > only when we're really close to the point of going into read-only mode > seems a bit conservative, but maybe a good start. I've attached a PoC > patch to disable index vacuuming if the table's relfrozenxid is too > older than autovacuum_freeze_max_age (older than 1.5x of > autovacuum_freeze_max_age). Most anti-wraparound VACUUMs are really not emergencies, though. So treating them as special simply because they're anti-wraparound vacuums doesn't seem like the right thing to do. I think that we should dynamically decide to do this when (antiwraparound) VACUUM has already been running for some time. We need to delay the decision until it is almost certainly true that we really have an emergency. Can you take what you have here, and make the decision dynamic? Delay it until we're done with the first heap scan? This will require rebasing on top of the patch I posted. And then adding a third patch, a little like the second patch -- but not too much like it. In the second/SKIP_VACUUM_PAGES_RATIO patch I posted today, the function two_pass_strategy() (my new name for the main entry point for calling lazy_vacuum_all_indexes() and lazy_vacuum_heap()) is only willing to perform the "skip index vacuuming" optimization when the call to two_pass_strategy() is the first call and the last call for that entire VACUUM (plus we test the number of heap blocks with LP_DEAD items using SKIP_VACUUM_PAGES_RATIO, of course). It works this way purely because I don't think that we should be aggressive when we've already run out of maintenance_work_mem. That's a bad time to apply a performance optimization. But what you're talking about now isn't a performance optimization (the mechanism is similar or the same, but the underlying reasons are totally different) -- it's a safety/availability thing. I don't think that you need to be concerned about running out of maintenance_work_mem in two_pass_strategy() when applying logic that is concerned about keeping the database online by avoiding XID wraparound. You just need to have high confidence that it is a true emergency. I think that we can be ~99% sure that we're in a real emergency by using dynamic information about how old relfrozenxid is *now*, and by rechecking a few times during VACUUM. Probably by rechecking every time we call two_pass_strategy(). I now believe that there is no fundamental correctness issue with teaching two_pass_strategy() to skip index vacuuming when we're low on memory -- it is 100% a matter of costs and benefits. The core skip-index-vacuuming mechanism is very flexible. If we can be sure that it's a real emergency, I think that we can justify behaving very aggressively (letting indexes get bloated is after all very aggressive). We just need to be 99%+ sure that continuing with vacuuming will be worse than ending vacuuming. Which seems possible by making the decision dynamic (and revisiting it at least a few times during a very long VACUUM, in case things change). -- Peter Geoghegan
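A minimal sketch of that control flow -- re-check the danger every time two_pass_strategy() is reached, and never resume index vacuuming once it has been abandoned -- might look like the following. The flag and helper names (skipping_index_vacuuming, xid_wraparound_emergency()) are invented for illustration, and the calls into the existing vacuumlazy.c functions are shown with their arguments omitted; this is not the posted patch:

    static void
    two_pass_strategy(Relation onerel, LVRelStats *vacrelstats /* , ... */)
    {
        /*
         * Re-check the wraparound danger every time we get here, which can
         * be several times in a VACUUM that overruns maintenance_work_mem.
         * Once set, the flag is never cleared for the rest of this VACUUM.
         */
        if (!vacrelstats->skipping_index_vacuuming &&
            xid_wraparound_emergency(onerel))
            vacrelstats->skipping_index_vacuuming = true;

        if (!vacrelstats->skipping_index_vacuuming)
        {
            lazy_vacuum_all_indexes(/* ... */);
            lazy_vacuum_heap(/* ... */);
        }

        /*
         * Forget the collected TIDs either way: they have either been removed
         * from the indexes and marked LP_UNUSED in the heap, or deliberately
         * left behind as LP_DEAD for some future VACUUM to deal with.
         */
        vacrelstats->dead_tuples->num_tuples = 0;
    }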
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 15, 2021 at 4:11 PM Andres Freund wrote: > I kinda wonder whether this case should just be handled by just gotoing > back to the start of the blkno loop, and redoing the pruning. The only > thing that makes that a bit more complicatd is that we've already > incremented vacrelstats->{scanned_pages,vacrelstats->tupcount_pages}. > > We really should put the per-page work (i.e. the blkno loop body) of > lazy_scan_heap() into a separate function, same with the > too-many-dead-tuples branch. Attached patch series splits everything up. There is now a large patch that removes the tupgone special case, and a second patch that actually adds code that dynamically decides to not do index vacuuming in cases where (for whatever reason) it doesn't seem useful. Here are some key points about the first patch: * Eliminates the "tupgone = true" special case by putting pruning, the HTSV() call, as well as tuple freezing into a new, dedicated function -- the function is prepared to restart pruning in those rare cases where the vacuumlazy.c HTSV() call indicates that a tuple is dead. Restarting pruning prunes the page again, rendering the DEAD tuple with storage an LP_DEAD line pointer stub. The restart thing is based on Andres' suggestion. This patch enables incremental VACUUM (the second patch, and likely other variations) by allowing us to make a uniform assumption that it is never strictly necessary to reach lazy_vacuum_all_indexes() or lazy_vacuum_heap(). It is now possible to "end VACUUM early" while still advancing relfrozenxid. Provided we've finished the first scan of the heap, that should be safe. * In principle we could visit and revisit the question of whether or not vacuuming should continue or end early many times, as new information comes to light. For example, maybe Masahiko's patch from today could be taught to monitor how old relfrozenxid is again and again, before finally giving up early when the risk of wraparound becomes very severe -- but not before then. * I've added three structs that replace a blizzard of local variables used in lazy_scan_heap() with just three (one variable for each of the three structs). I've also moved several chunks of logic to other new functions (in addition to one that does pruning and freezing). I think that I have the data structures roughly right here -- but I would appreciate some feedback on that. Does this seem like the right direction? -- Peter Geoghegan v3-0002-Skip-index-vacuuming-dynamically.patch Description: Binary data v3-0001-Remove-tupgone-special-case-from-vacuumlazy.c.patch Description: Binary data
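In code terms, the "restart pruning" idea amounts to a retry loop around the per-page work, along the lines of the fragment below. HeapTupleSatisfiesVacuum() and HEAPTUPLE_DEAD are the real names; everything else here is a simplified stand-in rather than the patch itself:

    /* Simplified illustration of the retry structure (not the actual patch) */
retry:
    /* prune first; this stands in for the real heap_page_prune() call */
    prune_page(buf);

    for (offnum = FirstOffsetNumber;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        /* set up "tuple" from the line pointer at offnum (details omitted) */

        switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
        {
            case HEAPTUPLE_DEAD:

                /*
                 * A transaction aborted after pruning looked at this tuple,
                 * so pruning left storage behind that we now see as DEAD.
                 * Rather than carrying a "tupgone" special case through the
                 * rest of VACUUM, just prune the page again: the second pass
                 * leaves an ordinary LP_DEAD stub, and the tuple needs no
                 * further special treatment.
                 */
                goto retry;

            default:
                /* count the tuple, consider it for freezing, etc. */
                break;
        }
    }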
Re: New IndexAM API controlling index vacuum strategies
On Wed, Mar 17, 2021 at 7:21 AM Peter Geoghegan wrote: > > On Tue, Mar 16, 2021 at 6:40 AM Masahiko Sawada wrote: > > > Note that I've merged multiple existing functions in vacuumlazy.c into > > > one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap() > > > into a single function named vacuum_indexes_mark_unused() > > > I agree to create a function like vacuum_indexes_mark_unused() that > > makes a decision and does index and heap vacumming accordingly. But > > what is the point of removing both lazy_vacuum_all_indexes() and > > lazy_vacuum_heap()? I think we can simply have > > vacuum_indexes_mark_unused() call those functions. I'm concerned that > > if we add additional criteria to vacuum_indexes_mark_unused() in the > > future the function will become very large. > > I agree now. I became overly excited about advertising the fact that > these two functions are logically one thing. This is important, but it > isn't necessary to go as far as actually making everything into one > function. Adding some comments would also make that point clear, but > without making vacuumlazy.c even more spaghetti-like. I'll fix it. > > > > I wonder if we can add some kind of emergency anti-wraparound vacuum > > > logic to what I have here, for Postgres 14. > > > +1 > > > > I think we can set VACOPT_TERNARY_DISABLED to > > tab->at_params.index_cleanup in table_recheck_autovac() or increase > > the thresholds used to not skipping index vacuuming. > > I was worried about the "tupgone = true" special case causing problems > when we decide to do some index vacuuming and some heap > vacuuming/LP_UNUSED-marking but then later decide to end the VACUUM. > But I now think that getting rid of "tupgone = true" gives us total > freedom to > choose what to do, including the freedom to start with index vacuuming > and then give up on it later -- even after we do some amount of > LP_UNUSED-marking (during a VACUUM with multiple index passes, perhaps > due to a low maintenance_work_mem setting). That isn't actually > special at all, because everything will be 100% decoupled. > > Whether or not it's a good idea to either not start index vacuuming or > to end it early (e.g. due to XID wraparound pressure) is a complicated > question, and the right approach will be debatable in each case/when > thinking about each issue. However, deciding on the best behavior to > address these problems should have nothing to do with implementation > details and everything to do with the costs and benefits to users. > Which now seems possible. > > A sophisticated version of the "XID wraparound pressure" > implementation could increase reliability without ever being > conservative, just by evaluating the situation regularly and being > prepared to change course when necessary -- until it is truly a matter > of shutting down new XID allocations/the server. It should be possible > to decide to end VACUUM early and advance relfrozenxid (provided we > have reached the point of finishing all required pruning and freezing, > of course). Highly agile behavior seems quite possible, even if it > takes a while to agree on a good design. Since I was thinking that always skipping index vacuuming on anti-wraparound autovacuum is legitimate, skipping index vacuuming only when we're really close to the point of going into read-only mode seems a bit conservative, but maybe a good start. 
I've attached a PoC patch to disable index vacuuming if the table's relfrozenxid is too older than autovacuum_freeze_max_age (older than 1.5x of autovacuum_freeze_max_age). > > > Aside from whether it's good or bad, strictly speaking, it could > > change the index AM API contract. The documentation of > > amvacuumcleanup() says: > > > > --- > > stats is whatever the last ambulkdelete call returned, or NULL if > > ambulkdelete was not called because no tuples needed to be deleted. > > --- > > > > With this change, we could pass NULL to amvacuumcleanup even though > > the index might have tuples needed to be deleted. > > I think that we should add a "Note" box to the documentation that > notes the difference here. Though FWIW my interpretation of the words > "no tuples needed to be deleted" was always "no tuples needed to be > deleted because vacuumlazy.c didn't call ambulkdelete()". After all, > VACUUM can add to tups_vacuumed through pruning inside > heap_prune_chain(). It is already possible (though only barely) to not > call ambulkdelete() for indexes (to only call amvacuumcleanup() during > cleanup) despite the fact that heap vacuuming did "delete tuples". Agreed. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/ poc_skip_index_cleanup_at_anti_wraparound.patch Description: Binary data
Re: New IndexAM API controlling index vacuum strategies
On Tue, Mar 16, 2021 at 6:40 AM Masahiko Sawada wrote: > > Note that I've merged multiple existing functions in vacuumlazy.c into > > one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap() > > into a single function named vacuum_indexes_mark_unused() > I agree to create a function like vacuum_indexes_mark_unused() that > makes a decision and does index and heap vacumming accordingly. But > what is the point of removing both lazy_vacuum_all_indexes() and > lazy_vacuum_heap()? I think we can simply have > vacuum_indexes_mark_unused() call those functions. I'm concerned that > if we add additional criteria to vacuum_indexes_mark_unused() in the > future the function will become very large. I agree now. I became overly excited about advertising the fact that these two functions are logically one thing. This is important, but it isn't necessary to go as far as actually making everything into one function. Adding some comments would also make that point clear, but without making vacuumlazy.c even more spaghetti-like. I'll fix it. > > I wonder if we can add some kind of emergency anti-wraparound vacuum > > logic to what I have here, for Postgres 14. > +1 > > I think we can set VACOPT_TERNARY_DISABLED to > tab->at_params.index_cleanup in table_recheck_autovac() or increase > the thresholds used to not skipping index vacuuming. I was worried about the "tupgone = true" special case causing problems when we decide to do some index vacuuming and some heap vacuuming/LP_UNUSED-marking but then later decide to end the VACUUM. But I now think that getting rid of "tupgone = true" gives us total freedom to choose what to do, including the freedom to start with index vacuuming and then give up on it later -- even after we do some amount of LP_UNUSED-marking (during a VACUUM with multiple index passes, perhaps due to a low maintenance_work_mem setting). That isn't actually special at all, because everything will be 100% decoupled. Whether or not it's a good idea to either not start index vacuuming or to end it early (e.g. due to XID wraparound pressure) is a complicated question, and the right approach will be debatable in each case/when thinking about each issue. However, deciding on the best behavior to address these problems should have nothing to do with implementation details and everything to do with the costs and benefits to users. Which now seems possible. A sophisticated version of the "XID wraparound pressure" implementation could increase reliability without ever being conservative, just by evaluating the situation regularly and being prepared to change course when necessary -- until it is truly a matter of shutting down new XID allocations/the server. It should be possible to decide to end VACUUM early and advance relfrozenxid (provided we have reached the point of finishing all required pruning and freezing, of course). Highly agile behavior seems quite possible, even if it takes a while to agree on a good design. > Aside from whether it's good or bad, strictly speaking, it could > change the index AM API contract. The documentation of > amvacuumcleanup() says: > > --- > stats is whatever the last ambulkdelete call returned, or NULL if > ambulkdelete was not called because no tuples needed to be deleted. > --- > > With this change, we could pass NULL to amvacuumcleanup even though > the index might have tuples needed to be deleted. I think that we should add a "Note" box to the documentation that notes the difference here. 
Though FWIW my interpretation of the words "no tuples needed to be deleted" was always "no tuples needed to be deleted because vacuumlazy.c didn't call ambulkdelete()". After all, VACUUM can add to tups_vacuumed through pruning inside heap_prune_chain(). It is already possible (though only barely) to not call ambulkdelete() for indexes (to only call amvacuumcleanup() during cleanup) despite the fact that heap vacuuming did "delete tuples". It's not that important, but my point is that the design has always been top-down -- an index AM "needs to delete" whatever it is told it needs to delete. It has no direct understanding of any higher-level issues. > As I mentioned in a recent reply, I'm concerned about a case where we > ran out maintenance_work_mem and decided not to skip index vacuuming > but collected only a few dead tuples in the second index vacuuming > (i.g., the total amount of dead tuples is slightly larger than > maintenance_work_mem). In this case, I think we can skip the second > (i.g., final) index vacuuming at least in terms of > maintenance_work_mem. Maybe the same is true in terms of LP_DEAD > accumulation. I remember that. That now seems very doable, but time grows short... I have already prototyped Andres' idea, which was to eliminate the HEAPTUPLE_DEAD case inside lazy_scan_heap() by restarting pruning for the same page. I've also moved the pruning into its own function called lazy_scan_heap_page(), because handling the restart requires that we b
Re: New IndexAM API controlling index vacuum strategies
On Tue, Mar 16, 2021 at 10:39 PM Masahiko Sawada wrote: > > On Mon, Mar 15, 2021 at 11:04 AM Peter Geoghegan wrote: > > > > One consequence of my approach is that we now call > > lazy_cleanup_all_indexes(), even when we've skipped index vacuuming > > itself. We should at least "check-in" with the indexes IMV. To an > > index AM, this will be indistinguishable from a VACUUM that never had > > tuples for it to delete, and so never called ambulkdelete() before > > calling amvacuumcleanup(). This seems logical to me: why should there > > be any significant behavioral divergence between the case where there > > are 0 tuples to delete and the case where there is 1 tuple to delete? > > The extra work that we perform in amvacuumcleanup() (if any) should > > almost always be a no-op in nbtree following my recent refactoring > > work. More generally, if an index AM is doing too much during cleanup, > > and this becomes a bottleneck, then IMV that's a problem that needs to > > be fixed in the index AM. > > Aside from whether it's good or bad, strictly speaking, it could > change the index AM API contract. The documentation of > amvacuumcleanup() says: > > --- > stats is whatever the last ambulkdelete call returned, or NULL if > ambulkdelete was not called because no tuples needed to be deleted. > --- > > With this change, we could pass NULL to amvacuumcleanup even though > the index might have tuples needed to be deleted. It seems there is no problem with that change at least in built-in index AMs. So +1 for this change. We would need to slightly update the doc accordingly. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
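The reason passing NULL is harmless for the built-in AMs is that every amvacuumcleanup implementation already has to cope with stats being NULL (the "no tuples needed to be deleted" case today). A typical implementation begins roughly like the sketch below -- the function name is made up, but IndexVacuumInfo, IndexBulkDeleteResult, palloc0() and the overall signature are the real index AM API:

    #include "postgres.h"
    #include "access/genam.h"

    IndexBulkDeleteResult *
    examplevacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
    {
        /* Do nothing during an ANALYZE-only call, per the index AM contract */
        if (info->analyze_only)
            return stats;

        /*
         * If ambulkdelete was never called, stats is NULL and we allocate our
         * own result struct.  This is exactly what makes it safe for
         * vacuumlazy.c to skip ambulkdelete and still call amvacuumcleanup.
         */
        if (stats == NULL)
            stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));

        /* ... gather pages_deleted, pages_free, etc. for this index ... */

        return stats;
    }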
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 15, 2021 at 11:04 AM Peter Geoghegan wrote: > > On Thu, Mar 11, 2021 at 8:31 AM Robert Haas wrote: > > But even if not, I'm not sure this > > helps much with the situation you're concerned about, which involves > > non-HOT tuples. > > Attached is a POC-quality revision of Masahiko's > skip_index_vacuum.patch [1]. There is an improved design for skipping > index vacuuming (improved over the INDEX_CLEANUP stuff from Postgres > 12). I'm particularly interested in your perspective on this > refactoring stuff, Robert, because you ran into the same issues after > initial commit of the INDEX_CLEANUP reloption feature [2] -- you ran > into issues with the "tupgone = true" special case. This is the case > where VACUUM considers a tuple dead that was not marked LP_DEAD by > pruning, and so needs to be killed in the second heap scan in > lazy_vacuum_heap() instead. You'll recall that these issues were fixed > by your commit dd695979888 from May 2019. I think that we need to go > further than you did in dd695979888 for this -- we ought to get rid of > the special case entirely. Thank you for the patch! > > This patch makes the "VACUUM (INDEX_CLEANUP OFF)" mechanism no longer > get invoked as if it was like the "no indexes on table so do it all in > one heap pass" optimization. This seems a lot clearer -- INDEX_CLEANUP > OFF isn't able to call lazy_vacuum_page() at all (for the obvious > reason), so any similarity between the two cases was always > superficial -- skipping index vacuuming should not be confused with > doing a one-pass VACUUM/having no indexes at all. The original > INDEX_CLEANUP structure (from commits a96c41fe and dd695979) always > seemed confusing to me for this reason, FWIW. Agreed. > > Note that I've merged multiple existing functions in vacuumlazy.c into > one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap() > into a single function named vacuum_indexes_mark_unused() (note also > that lazy_vacuum_page() has been renamed to mark_unused_page() to > reflect the fact that it is now strictly concerned with making LP_DEAD > line pointers LP_UNUSED). The big idea is that there is one choke > point that decides whether index vacuuming is needed at all at one > point in time, dynamically. vacuum_indexes_mark_unused() decides this > for us at the last moment. This can only happen during a VACUUM that > has enough memory to fit all TIDs -- otherwise we won't skip anything > dynamically. > > We may in the future add additional criteria for skipping index > vacuuming. That can now just be added to the beginning of this new > vacuum_indexes_mark_unused() function. We may even teach > vacuum_indexes_mark_unused() to skip some indexes but not others in a > future release, a possibility that was already discussed at length > earlier in this thread. This new structure has all the context it > needs to do all of these things. I agree to create a function like vacuum_indexes_mark_unused() that makes a decision and does index and heap vacumming accordingly. But what is the point of removing both lazy_vacuum_all_indexes() and lazy_vacuum_heap()? I think we can simply have vacuum_indexes_mark_unused() call those functions. I'm concerned that if we add additional criteria to vacuum_indexes_mark_unused() in the future the function will become very large. > > I wonder if we can add some kind of emergency anti-wraparound vacuum > logic to what I have here, for Postgres 14. 
Can we come up with logic > that has us skip index vacuuming because XID wraparound is on the > verge of causing an outage? That seems like a strategically important > thing for Postgres, so perhaps we should try to get something like > that in. Practically every post mortem blog post involving Postgres > also involves anti-wraparound vacuum. +1 I think we can set VACOPT_TERNARY_DISABLED to tab->at_params.index_cleanup in table_recheck_autovac() or increase the thresholds used to not skipping index vacuuming. > > One consequence of my approach is that we now call > lazy_cleanup_all_indexes(), even when we've skipped index vacuuming > itself. We should at least "check-in" with the indexes IMV. To an > index AM, this will be indistinguishable from a VACUUM that never had > tuples for it to delete, and so never called ambulkdelete() before > calling amvacuumcleanup(). This seems logical to me: why should there > be any significant behavioral divergence between the case where there > are 0 tuples to delete and the case where there is 1 tuple to delete? > The extra work that we perform in amvacuumcleanup() (if any) should > almost always be a no-op in nbtree following my recent refactoring > work. More generally, if an index AM is doing too much during cleanup, > and this becomes a bottleneck, then IMV that's a problem that needs to > be fixed in the index AM. Aside from whether it's good or bad, strictly speaking, it could change the index AM API contract. The documentation of amvacuumcleanup() says: --- stats is wh
Re: New IndexAM API controlling index vacuum strategies
On Sun, Mar 14, 2021 at 12:23 PM Peter Geoghegan wrote: > > On Fri, Mar 12, 2021 at 9:34 PM Masahiko Sawada wrote: > > I agreed that when we're close to overrunning the > > maintnenance_work_mem space, the situation changes. If we skip it in > > even that case, the next vacuum will be likely to use up > > maintenance_work_mem, leading to a second index scan. Which is > > bad. > > > > If this threshold is aimed to avoid a second index scan due to > > overrunning the maintenance_work_mem, using a ratio of > > maintenance_work_mem would be a good idea. On the other hand, if it's > > to avoid accumulating debt affecting the cost of index vacuuming, > > using a ratio of the total heap tuples seems better. > > It's both, together. These are two *independent* > considerations/thresholds. At least in the code that decides whether > or not we skip. Either threshold can force a full index scan (index > vacuuming). > > What I'm really worried about is falling behind (in terms of the > amount of memory available for TIDs to delete in indexes) without any > natural limit. Suppose we just have the SKIP_VACUUM_PAGES_RATIO > threshold (i.e. no maintenance_work_mem threshold thing). With just > SKIP_VACUUM_PAGES_RATIO there will be lots of tables where index > vacuuming is almost always avoided, which is good. But > SKIP_VACUUM_PAGES_RATIO might be a bit *too* effective. If we have to > do 2 or even 3 scans of the index when we finally get to index > vacuuming then that's not great, it's inefficient -- but it's at least > *survivable*. But what if there are 10, 100, even 1000 bulk delete > calls for each index when it finally happens? That's completely > intolerable. > > In other words, I am not worried about debt, exactly. Debt is normal > in moderation. Healthy, even. I am worried about bankruptcy, perhaps > following a rare and extreme event. It's okay to be imprecise, but all > of the problems must be survivable. The important thing to me for a > maintenance_work_mem threshold is that there is *some* limit. At the > same time, it may totally be worth accepting 2 or 3 index scans during > some eventual VACUUM operation if there are many more VACUUM > operations that don't even touch the index -- that's a good deal! > Also, it may actually be inherently necessary to accept a small risk > of having a future VACUUM operation that does multiple scans of each > index -- that is probably a necessary part of skipping index vacuuming > each time. > > Think about the cost of index vacuuming (the amount of I/O and the > duration of index vacuuming) as less as less memory is available for > TIDs. It's non-linear. The cost explodes once we're past a certain > point. The truly important thing is to "never get killed by the > explosion". Agreed. Maybe it's a good idea to add the ratio of dead tuples to the total heap tuples as a threshold? I think that there are two risks in case where we collect many dead tuples: maintenance_work_mem overrun and LP_DEAD accumulation, even if those are concentrated in less than 1% heap pages. The former risk is dealt with by the maintenance_work_mem threshold as we discussed. But that threshold might not be enough to deal with the latter risk. For example, a very larege table could have many dead tuples in less than 1% heap pages and we may set maintenance_work_mem to a high value. In that case, it might be ok in terms of index vacuuming but might not be ok in terms of heap. So I think we want not to skip index vacuuming. It’s an extream case. 
But we also should note that the absolute number of tuples corresponding to 70% of maintenance_work_mem will tend to increase if we improve memory efficiency for storing TIDs. So I think adding a “dead tuples must be less than 50% of total heap tuples” threshold to skip index vacuuming would be a good safeguard against such an extreme case. This threshold is applied only at the last lazy_vacuum_table_and_indexes() call, so we know the total heap tuples at that point. If we run out of maintenance_work_mem in the middle of the heap scan, I think we should do index vacuuming regardless of the number of dead tuples and the number of pages having at least one LP_DEAD. > > > The situation where we need to deal with here is a very large table > > that has a lot of dead tuples but those fit in fewer heap pages (less > > than 1% of all heap blocks). In this case, it's likely that the number > > of dead tuples also is relatively small compared to the total heap > > tuples, as you mentioned. If dead tuples fitted in fewer pages but > > accounted for most of all heap tuples in the heap, it would be a more > > serious situation, there would definitely already be other problems. > > So considering those conditions, I agreed to use a ratio of > > maintenance_work_mem as a threshold. Maybe we can increase the > > constant to 70, 80, or so. > > You mean 70% of maintenance_work_mem? That seems fine to me. Yes. > See my > "Why does lazy_vacuum_table_and_indexes() not make one decision for > the enti
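To make the combination of thresholds under discussion concrete, here is a rough sketch of the skip decision as described above. Apart from SKIP_VACUUM_PAGES_RATIO (which the patch does use, at 1% of heap blocks), every name and constant below is illustrative only, not the actual patch code:

    #include <stdbool.h>
    #include <stdint.h>

    #define SKIP_VACUUM_PAGES_RATIO      0.01   /* <= 1% of heap blocks may have LP_DEAD items */
    #define SKIP_VACUUM_MEM_RATIO        0.70   /* <= 70% of the TID array may be in use */
    #define SKIP_VACUUM_DEAD_TUPLE_RATIO 0.50   /* dead tuples <= 50% of total heap tuples */

    static bool
    should_skip_index_vacuum(int64_t rel_pages,         /* heap blocks in the table */
                             int64_t lpdead_pages,      /* blocks with >= 1 LP_DEAD item */
                             int64_t dead_tuples,       /* TIDs collected by this VACUUM */
                             int64_t max_dead_tuples,   /* TID array capacity (from maintenance_work_mem) */
                             int64_t total_heap_tuples) /* approximate total heap tuples */
    {
        /* Too many heap pages need vacuuming: do index vacuuming now. */
        if (lpdead_pages > (int64_t) (rel_pages * SKIP_VACUUM_PAGES_RATIO))
            return false;

        /* Close to overrunning maintenance_work_mem: don't accumulate more debt. */
        if (dead_tuples > (int64_t) (max_dead_tuples * SKIP_VACUUM_MEM_RATIO))
            return false;

        /* Safeguard for tables whose dead tuples are a large share of all tuples. */
        if (dead_tuples > (int64_t) (total_heap_tuples * SKIP_VACUUM_DEAD_TUPLE_RATIO))
            return false;

        return true;
    }

Any one of the three checks can force index vacuuming; skipping only happens when all of them pass, which matches the "start by assuming we skip, then work backwards" framing used later in the thread.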
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 15, 2021 at 4:11 PM Andres Freund wrote: > > > I'm not comfortable with this change without adding more safety > > > checks. If there's ever a case in which the HEAPTUPLE_DEAD case is hit > > > and the xid needs to be frozen, we'll either cause errors or > > > corruption. Yes, that's already the case with params->index_cleanup == > > > DISABLED, but that's not that widely used. > > > > I noticed that Noah's similar 2013 patch [1] added a defensive > > heap_tuple_needs_freeze() + elog(ERROR) to the HEAPTUPLE_DEAD case. I > > suppose that that's roughly what you have in mind here? > > I'm not sure that's sufficient. If the case is legitimately reachable > (I'm maybe 60% sure it's not, after staring at it for a long time, but ...), > then we can't just error out when we didn't so far. If you're only 60% sure that the heap_tuple_needs_freeze() error thing doesn't break anything that should work by now, then it seems unlikely that you'll ever get past 90% sure. I think that we should make a conservative assumption that the defensive elog(ERROR) won't be sufficient, and proceed on that basis. > I kinda wonder whether this case should just be handled by just gotoing > back to the start of the blkno loop, and redoing the pruning. The only > thing that makes that a bit more complicated is that we've already > incremented vacrelstats->{scanned_pages,vacrelstats->tupcount_pages}. That seems like a good solution to me -- this is a very seldom-hit path, so we can be a bit inefficient without it mattering. It might make sense to *also* check some things (maybe using heap_tuple_needs_freeze()) in passing, just for debugging purposes. > We really should put the per-page work (i.e. the blkno loop body) of > lazy_scan_heap() into a separate function, same with the > too-many-dead-tuples branch. +1. BTW I've noticed that the code (and code like it) tends to conflate work performed by the current VACUUM with work performed by *some* VACUUM (which may or may not be the current one). This refactoring might be a good opportunity to think about that as well. > > * It is assumed that the caller has checked the tuple with > > * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD > > * (else we should be removing the tuple, not freezing it). > > > > Does that need work too? > > I'm pretty scared of the index-cleanup-disabled path, for that reason. I > think the hot path is more likely to be unproblematic, but I'd not bet > my (nonexistent) farm on it. Well, if we can solve the problem by simply doing pruning once again in the event of a HEAPTUPLE_DEAD return value from the lazy_scan_heap() HTSV call, then the comment becomes 100% true (which is not the case even today). -- Peter Geoghegan
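The control flow being agreed on here -- retry pruning rather than erroring out when HeapTupleSatisfiesVacuum() unexpectedly reports a dead tuple -- could look roughly like the sketch below. The helper names are stand-ins for the real backend machinery (heap_page_prune(), HeapTupleSatisfiesVacuum(), heap_prepare_freeze_tuple()); this is a shape sketch under those assumptions, not the eventual patch:

    #include <stdbool.h>

    /* Stand-ins for the real backend machinery; illustrative only. */
    typedef enum { ITEM_LIVE, ITEM_RECENTLY_DEAD, ITEM_DEAD } ItemStatus;

    extern void prune_page(int blkno);                  /* think: heap_page_prune() */
    extern int  items_on_page(int blkno);
    extern ItemStatus item_status(int blkno, int off);  /* think: HeapTupleSatisfiesVacuum() */
    extern void maybe_freeze(int blkno, int off);       /* think: heap_prepare_freeze_tuple() */

    static int scanned_pages = 0;

    static void
    scan_heap_sketch(int nblocks)
    {
        for (int blkno = 0; blkno < nblocks; blkno++)
        {
            bool counted = false;

    retry:
            prune_page(blkno);

            /*
             * Count the page only once, even if we redo the pruning step --
             * this is the bookkeeping complication Andres mentions above.
             */
            if (!counted)
            {
                scanned_pages++;
                counted = true;
            }

            for (int off = 0; off < items_on_page(blkno); off++)
            {
                if (item_status(blkno, off) == ITEM_DEAD)
                {
                    /*
                     * A concurrent abort can make a tuple DEAD only after
                     * pruning already ran.  Instead of freezing (or
                     * "tupgone"-ing) it, prune the page again so the normal
                     * pruning path deals with it.
                     */
                    goto retry;
                }
                maybe_freeze(blkno, off);
            }
        }
    }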
Re: New IndexAM API controlling index vacuum strategies
Hi, On 2021-03-15 13:58:02 -0700, Peter Geoghegan wrote: > On Mon, Mar 15, 2021 at 12:58 PM Peter Geoghegan wrote: > > > I'm not comfortable with this change without adding more safety > > > checks. If there's ever a case in which the HEAPTUPLE_DEAD case is hit > > > and the xid needs to be frozen, we'll either cause errors or > > > corruption. Yes, that's already the case with params->index_cleanup == > > > DISABLED, but that's not that widely used. > > > > I noticed that Noah's similar 2013 patch [1] added a defensive > > heap_tuple_needs_freeze() + elog(ERROR) to the HEAPTUPLE_DEAD case. I > > suppose that that's roughly what you have in mind here? > > I'm not sure if you're arguing that there might be (either now or in > the future) a legitimate case (a case not involving data corruption) > where we hit HEAPTUPLE_DEAD, and find we have an XID in the tuple that > needs freezing. You seem to be suggesting that even throwing an error > might not be acceptable, but what better alternative is there? Did you > just mean that we should throw a *better*, more specific error right > there, when we handle HEAPTUPLE_DEAD? (As opposed to relying on > heap_prepare_freeze_tuple() to error out instead, which is what would > happen today.) Right now (outside of the index-cleanup-disabled path), we very well may just actually successfully and correctly do the deletion? So there clearly is another option? See my email from a few minutes ago for a somewhat crude idea for how to tackle the issue differently... Greetings, Andres Freund
Re: New IndexAM API controlling index vacuum strategies
Hi, On 2021-03-15 12:58:33 -0700, Peter Geoghegan wrote: > On Mon, Mar 15, 2021 at 12:21 PM Andres Freund wrote: > > It's evil sorcery. Fragile sorcery. I think Robert, Tom and me all run > > afoul of edge cases around it in the last few years. > > Right, which is why I thought that I might be missing something; why > put up with that at all for so long? > > > > But removing the awful "tupgone = true" special case seems to buy us a > > > lot -- it makes unifying everything relatively straightforward. In > > > particular, it makes it possible to delay the decision to vacuum > > > indexes until the last moment, which seems essential to making index > > > vacuuming optional. > > > > You haven't really justified, in the patch or this email, why it's OK to > > remove the whole logic around HEAPTUPLE_DEAD part of the logic. > > I don't follow. > > > VACUUM can take a long time, and not removing space for all the > > transactions that aborted while it wa > > I guess that you trailed off here. My understanding is that removing > the special case results in practically no loss of dead tuples removed > in practice -- so there are no practical performance considerations > here. > > Have I missed something? Forget what I said above - I had intended to remove it after dislodging something stuck in my brain... But apparently didn't :(. Sorry. > > I'm not comfortable with this change without adding more safety > > checks. If there's ever a case in which the HEAPTUPLE_DEAD case is hit > > and the xid needs to be frozen, we'll either cause errors or > > corruption. Yes, that's already the case with params->index_cleanup == > > DISABLED, but that's not that widely used. > > I noticed that Noah's similar 2013 patch [1] added a defensive > heap_tuple_needs_freeze() + elog(ERROR) to the HEAPTUPLE_DEAD case. I > suppose that that's roughly what you have in mind here? I'm not sure that's sufficient. If the case is legitimately reachable (I'm maybe 60% sure it's not, after staring at it for a long time, but ...), then we can't just error out when we didn't so far. I kinda wonder whether this case should just be handled by just gotoing back to the start of the blkno loop, and redoing the pruning. The only thing that makes that a bit more complicated is that we've already incremented vacrelstats->{scanned_pages,vacrelstats->tupcount_pages}. We really should put the per-page work (i.e. the blkno loop body) of lazy_scan_heap() into a separate function, same with the too-many-dead-tuples branch. > Comments above heap_prepare_freeze_tuple() say something about making > sure that HTSV did not return HEAPTUPLE_DEAD...but that's already > possible today: > > * It is assumed that the caller has checked the tuple with > * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD > * (else we should be removing the tuple, not freezing it). > > Does that need work too? I'm pretty scared of the index-cleanup-disabled path, for that reason. I think the hot path is more likely to be unproblematic, but I'd not bet my (nonexistent) farm on it. Greetings, Andres Freund
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 15, 2021 at 12:58 PM Peter Geoghegan wrote: > > I'm not comfortable with this change without adding more safety > > checks. If there's ever a case in which the HEAPTUPLE_DEAD case is hit > > and the xid needs to be frozen, we'll either cause errors or > > corruption. Yes, that's already the case with params->index_cleanup == > > DISABLED, but that's not that widely used. > > I noticed that Noah's similar 2013 patch [1] added a defensive > heap_tuple_needs_freeze() + elog(ERROR) to the HEAPTUPLE_DEAD case. I > suppose that that's roughly what you have in mind here? I'm not sure if you're arguing that there might be (either now or in the future) a legitimate case (a case not involving data corruption) where we hit HEAPTUPLE_DEAD, and find we have an XID in the tuple that needs freezing. You seem to be suggesting that even throwing an error might not be acceptable, but what better alternative is there? Did you just mean that we should throw a *better*, more specific error right there, when we handle HEAPTUPLE_DEAD? (As opposed to relying on heap_prepare_freeze_tuple() to error out instead, which is what would happen today.) That seems like the most reasonable interpretation of your words to me. That is, I think that you're saying (based in part on remarks on that other thread [1]) that you believe that fully eliminating the "tupgone = true" special case is okay in principle, but that more hardening is needed -- if it ever breaks we want to hear about it. And, while it would be better to do a much broader refactor to unite heap_prune_chain() and lazy_scan_heap(), it is not essential (because the issue is not really new, even without VACUUM (INDEX_CLEANUP OFF)/"params->index_cleanup == DISABLED"). Which is it? [1] https://www.postgresql.org/message-id/20200724165514.dnu5hr4vvgkssf5p%40alap3.anarazel.de -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 15, 2021 at 12:21 PM Andres Freund wrote: > It's evil sorcery. Fragile sorcery. I think Robert, Tom and me all run > afoul of edge cases around it in the last few years. Right, which is why I thought that I might be missing something; why put up with that at all for so long? > > But removing the awful "tupgone = true" special case seems to buy us a > > lot -- it makes unifying everything relatively straightforward. In > > particular, it makes it possible to delay the decision to vacuum > > indexes until the last moment, which seems essential to making index > > vacuuming optional. > > You haven't really justified, in the patch or this email, why it's OK to > remove the whole logic around HEAPTUPLE_DEAD part of the logic. I don't follow. > VACUUM can take a long time, and not removing space for all the > transactions that aborted while it wa I guess that you trailed off here. My understanding is that removing the special case results in practically no loss of dead tuples removed in practice -- so there are no practical performance considerations here. Have I missed something? > > Note that I've merged multiple existing functions in vacuumlazy.c into > > one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap() > > into a single function named vacuum_indexes_mark_unused() (note also > > that lazy_vacuum_page() has been renamed to mark_unused_page() to > > reflect the fact that it is now strictly concerned with making LP_DEAD > > line pointers LP_UNUSED). > > It doesn't really seem to be *just* doing that - doing the > PageRepairFragmentation() and all-visible marking is relevant too? I wrote it in a day, just to show what I had in mind. The renaming stuff is a part of unifying those functions, which can be discussed after the "tupgone = true" special case is removed. It's not like I'm set on the details that you see in the patch. > For me the patch does way too many things at once, making it harder than > necessary to review, test (including later bisection). I'd much rather > see the tupgone thing addressed on its own, without further changes, and > then the rest done in separate commits subsequently. I agree that it should be broken up for review. > I'm not comfortable with this change without adding more safety > checks. If there's ever a case in which the HEAPTUPLE_DEAD case is hit > and the xid needs to be frozen, we'll either cause errors or > corruption. Yes, that's already the case with params->index_cleanup == > DISABLED, but that's not that widely used. I noticed that Noah's similar 2013 patch [1] added a defensive heap_tuple_needs_freeze() + elog(ERROR) to the HEAPTUPLE_DEAD case. I suppose that that's roughly what you have in mind here? I suppose that that was pre-9.3-MultiXacts, and so now it's more complicated. Comments above heap_prepare_freeze_tuple() say something about making sure that HTSV did not return HEAPTUPLE_DEAD...but that's already possible today: * It is assumed that the caller has checked the tuple with * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD * (else we should be removing the tuple, not freezing it). Does that need work too? > See > https://postgr.es/m/20200724165514.dnu5hr4vvgkssf5p%40alap3.anarazel.de > for some discussion around the fragility. That's a good reference, thanks. [1] https://www.postgresql.org/message-id/20130130020456.GE3524%40tornado.leadboat.com -- Peter Geoghegan
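For readers unfamiliar with the 2013 approach being referenced: the defensive check would sit in lazy_scan_heap()'s HTSV switch and look roughly like the fragment below. This is an illustrative paraphrase of the idea, not the actual 2013 diff; heap_tuple_needs_freeze() and elog() are the real backend facilities being referred to, while the variable names are whatever is in scope in lazy_scan_heap():

    case HEAPTUPLE_DEAD:
        /*
         * Ordinarily pruning removes these before we get here.  If one
         * survives and it also contains an XID/MXID old enough to require
         * freezing, refuse to continue rather than "freeze" a dead tuple.
         */
        if (heap_tuple_needs_freeze(tuple.t_data, FreezeLimit,
                                    MultiXactCutoff, buf))
            elog(ERROR, "dead tuple with old xmin/xmax survived pruning in block %u of relation \"%s\"",
                 blkno, relname);
        break;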
Re: New IndexAM API controlling index vacuum strategies
Hi, On 2021-03-14 19:04:34 -0700, Peter Geoghegan wrote: > Attached is a POC-quality revision of Masahiko's > skip_index_vacuum.patch [1]. There is an improved design for skipping > index vacuuming (improved over the INDEX_CLEANUP stuff from Postgres > 12). I'm particularly interested in your perspective on this > refactoring stuff, Robert, because you ran into the same issues after > initial commit of the INDEX_CLEANUP reloption feature [2] -- you ran > into issues with the "tupgone = true" special case. This is the case > where VACUUM considers a tuple dead that was not marked LP_DEAD by > pruning, and so needs to be killed in the second heap scan in > lazy_vacuum_heap() instead. It's evil sorcery. Fragile sorcery. I think Robert, Tom and me all run afoul of edge cases around it in the last few years. > But removing the awful "tupgone = true" special case seems to buy us a > lot -- it makes unifying everything relatively straightforward. In > particular, it makes it possible to delay the decision to vacuum > indexes until the last moment, which seems essential to making index > vacuuming optional. You haven't really justified, in the patch or this email, why it's OK to remove the whole logic around HEAPTUPLE_DEAD part of the logic. VACUUM can take a long time, and not removing space for all the transactions that aborted while it wa > Note that I've merged multiple existing functions in vacuumlazy.c into > one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap() > into a single function named vacuum_indexes_mark_unused() (note also > that lazy_vacuum_page() has been renamed to mark_unused_page() to > reflect the fact that it is now strictly concerned with making LP_DEAD > line pointers LP_UNUSED). It doesn't really seem to be *just* doing that - doing the PageRepairFragmentation() and all-visible marking is relevant too? For me the patch does way too many things at once, making it harder than necessary to review, test (including later bisection). I'd much rather see the tupgone thing addressed on its own, without further changes, and then the rest done in separate commits subsequently. I don't like vacuum_indexes_mark_unused() as a name. That sounds like the index is marked unused, not index entries pointing to tuples. Don't really like mark_unused_page() either for similar reasons - but it's not quite as confusing.

> -            if (tupgone)
> -            {
> -                lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
> -                HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
> -                                                       &vacrelstats->latestRemovedXid);
> -                tups_vacuumed += 1;
> -                has_dead_items = true;
> -            }
> -            else
> -            {
> -                bool        tuple_totally_frozen;
> +            num_tuples += 1;
> +            hastup = true;
>
> -                num_tuples += 1;
> -                hastup = true;
> +            /*
> +             * Each non-removable tuple must be checked to see if it needs
> +             * freezing. Note we already have exclusive buffer lock.
> +             */
> +            if (heap_prepare_freeze_tuple(tuple.t_data,
> +                                          relfrozenxid, relminmxid,
> +                                          FreezeLimit, MultiXactCutoff,
> +                                          &frozen[nfrozen],
> +                                          &tuple_totally_frozen))
> +                frozen[nfrozen++].offset = offnum;

I'm not comfortable with this change without adding more safety checks. If there's ever a case in which the HEAPTUPLE_DEAD case is hit and the xid needs to be frozen, we'll either cause errors or corruption. Yes, that's already the case with params->index_cleanup == DISABLED, but that's not that widely used. See https://postgr.es/m/20200724165514.dnu5hr4vvgkssf5p%40alap3.anarazel.de for some discussion around the fragility. Greetings, Andres Freund
Re: New IndexAM API controlling index vacuum strategies
On Thu, Mar 11, 2021 at 8:31 AM Robert Haas wrote: > But even if not, I'm not sure this > helps much with the situation you're concerned about, which involves > non-HOT tuples. Attached is a POC-quality revision of Masahiko's skip_index_vacuum.patch [1]. There is an improved design for skipping index vacuuming (improved over the INDEX_CLEANUP stuff from Postgres 12). I'm particularly interested in your perspective on this refactoring stuff, Robert, because you ran into the same issues after initial commit of the INDEX_CLEANUP reloption feature [2] -- you ran into issues with the "tupgone = true" special case. This is the case where VACUUM considers a tuple dead that was not marked LP_DEAD by pruning, and so needs to be killed in the second heap scan in lazy_vacuum_heap() instead. You'll recall that these issues were fixed by your commit dd695979888 from May 2019. I think that we need to go further than you did in dd695979888 for this -- we ought to get rid of the special case entirely. ISTM that any new code that skips index vacuuming really ought to be structured as a dynamic version of the "VACUUM (INDEX_CLEANUP OFF)" mechanism. Or vice versa. The important thing is to recognize that they're essentially the same thing, and to structure the code such that they become exactly the same mechanism internally. That's not trivial right now. But removing the awful "tupgone = true" special case seems to buy us a lot -- it makes unifying everything relatively straightforward. In particular, it makes it possible to delay the decision to vacuum indexes until the last moment, which seems essential to making index vacuuming optional. And so I have removed the tupgone/XLOG_HEAP2_CLEANUP_INFO crud in the patch -- that's what all of the changes relate to. This results in a net negative line count, which is a nice bonus! I've CC'd Noah, because my additions to this revision (of Masahiko's patch) are loosely based on an abandoned 2013 patch from Noah [3] -- Noah didn't care for the "tupgone = true" special case either. I think that it's fair to say that Tom doesn't much care for it either [4], or at least was distressed by its lack of test coverage as of a couple of years ago -- which is a problem that still exists today. Honestly, I'm surprised that somebody else hasn't removed the code in question already, long ago -- what possible argument can be made for it now? This patch makes the "VACUUM (INDEX_CLEANUP OFF)" mechanism no longer get invoked as if it was like the "no indexes on table so do it all in one heap pass" optimization. This seems a lot clearer -- INDEX_CLEANUP OFF isn't able to call lazy_vacuum_page() at all (for the obvious reason), so any similarity between the two cases was always superficial -- skipping index vacuuming should not be confused with doing a one-pass VACUUM/having no indexes at all. The original INDEX_CLEANUP structure (from commits a96c41fe and dd695979) always seemed confusing to me for this reason, FWIW. Note that I've merged multiple existing functions in vacuumlazy.c into one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap() into a single function named vacuum_indexes_mark_unused() (note also that lazy_vacuum_page() has been renamed to mark_unused_page() to reflect the fact that it is now strictly concerned with making LP_DEAD line pointers LP_UNUSED). The big idea is that there is one choke point that decides whether index vacuuming is needed at all at one point in time, dynamically. vacuum_indexes_mark_unused() decides this for us at the last moment. 
This can only happen during a VACUUM that has enough memory to fit all TIDs -- otherwise we won't skip anything dynamically. We may in the future add additional criteria for skipping index vacuuming. That can now just be added to the beginning of this new vacuum_indexes_mark_unused() function. We may even teach vacuum_indexes_mark_unused() to skip some indexes but not others in a future release, a possibility that was already discussed at length earlier in this thread. This new structure has all the context it needs to do all of these things. I wonder if we can add some kind of emergency anti-wraparound vacuum logic to what I have here, for Postgres 14. Can we come up with logic that has us skip index vacuuming because XID wraparound is on the verge of causing an outage? That seems like a strategically important thing for Postgres, so perhaps we should try to get something like that in. Practically every post mortem blog post involving Postgres also involves anti-wraparound vacuum. One consequence of my approach is that we now call lazy_cleanup_all_indexes(), even when we've skipped index vacuuming itself. We should at least "check-in" with the indexes IMV. To an index AM, this will be indistinguishable from a VACUUM that never had tuples for it to delete, and so never called ambulkdelete() before calling amvacuumcleanup(). This seems logical to me: why should there be any significant behavioral divergence betwee
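As a sketch of the kind of "emergency" check wondered about above -- every name and the specific 90% cutoff here are invented for illustration, not a committed design:

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Decide whether this VACUUM is an anti-wraparound emergency: if the
     * table's relfrozenxid has aged past (say) 90% of the hard wraparound
     * limit, finishing quickly and advancing relfrozenxid matters far more
     * than index cleanup, so index vacuuming and heap vacuuming could be
     * skipped entirely for this run.
     */
    static bool
    wraparound_emergency(uint64_t relfrozenxid_age, uint64_t wraparound_limit)
    {
        return relfrozenxid_age > (wraparound_limit / 10) * 9;
    }

Such a check would naturally live at the start of the new vacuum_indexes_mark_unused() choke point described above, alongside the other criteria for skipping.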
Re: New IndexAM API controlling index vacuum strategies
On Sat, Mar 13, 2021 at 7:23 PM Peter Geoghegan wrote: > In other words, I am not worried about debt, exactly. Debt is normal > in moderation. Healthy, even. I am worried about bankruptcy, perhaps > following a rare and extreme event. It's okay to be imprecise, but all > of the problems must be survivable. The important thing to me for a > maintenance_work_mem threshold is that there is *some* limit. At the > same time, it may totally be worth accepting 2 or 3 index scans during > some eventual VACUUM operation if there are many more VACUUM > operations that don't even touch the index -- that's a good deal! > Also, it may actually be inherently necessary to accept a small risk > of having a future VACUUM operation that does multiple scans of each > index -- that is probably a necessary part of skipping index vacuuming > each time. > > Think about the cost of index vacuuming (the amount of I/O and the > duration of index vacuuming) as less as less memory is available for > TIDs. It's non-linear. The cost explodes once we're past a certain > point. The truly important thing is to "never get killed by the > explosion". I just remembered this blog post, which gives a nice high level summary of my mental model for things like this: https://jessitron.com/2021/01/18/when-costs-are-nonlinear-keep-it-small/ This patch should eliminate inefficient index vacuuming involving very small "batch sizes" (i.e. a small number of TIDs/index tuples to delete from indexes). At the same time, it should not allow the batch size to get too large because that's also inefficient. Perhaps larger batch sizes are not exactly inefficient -- maybe they're risky. Though risky is actually kind of the same thing as inefficient, at least to me. So IMV what we want to do here is to recognize cases where "batch size" is so small that index vacuuming couldn't possibly be efficient. We don't need to truly understand how that might change over time in each case -- this is relatively easy. There is some margin for error here, even with this reduced-scope version that just does the SKIP_VACUUM_PAGES_RATIO thing. The patch can afford to make suboptimal decisions about the scheduling of index vacuuming over time (relative to the current approach), provided the additional cost is at least *tolerable* -- that way we are still very likely to win in the aggregate, over time. However, the patch cannot be allowed to create a new risk of significantly worse performance for any one VACUUM operation. -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Fri, Mar 12, 2021 at 9:34 PM Masahiko Sawada wrote: > I agreed that when we're close to overrunning the > maintnenance_work_mem space, the situation changes. If we skip it in > even that case, the next vacuum will be likely to use up > maintenance_work_mem, leading to a second index scan. Which is > bad. > > If this threshold is aimed to avoid a second index scan due to > overrunning the maintenance_work_mem, using a ratio of > maintenance_work_mem would be a good idea. On the other hand, if it's > to avoid accumulating debt affecting the cost of index vacuuming, > using a ratio of the total heap tuples seems better. It's both, together. These are two *independent* considerations/thresholds. At least in the code that decides whether or not we skip. Either threshold can force a full index scan (index vacuuming). What I'm really worried about is falling behind (in terms of the amount of memory available for TIDs to delete in indexes) without any natural limit. Suppose we just have the SKIP_VACUUM_PAGES_RATIO threshold (i.e. no maintenance_work_mem threshold thing). With just SKIP_VACUUM_PAGES_RATIO there will be lots of tables where index vacuuming is almost always avoided, which is good. But SKIP_VACUUM_PAGES_RATIO might be a bit *too* effective. If we have to do 2 or even 3 scans of the index when we finally get to index vacuuming then that's not great, it's inefficient -- but it's at least *survivable*. But what if there are 10, 100, even 1000 bulk delete calls for each index when it finally happens? That's completely intolerable. In other words, I am not worried about debt, exactly. Debt is normal in moderation. Healthy, even. I am worried about bankruptcy, perhaps following a rare and extreme event. It's okay to be imprecise, but all of the problems must be survivable. The important thing to me for a maintenance_work_mem threshold is that there is *some* limit. At the same time, it may totally be worth accepting 2 or 3 index scans during some eventual VACUUM operation if there are many more VACUUM operations that don't even touch the index -- that's a good deal! Also, it may actually be inherently necessary to accept a small risk of having a future VACUUM operation that does multiple scans of each index -- that is probably a necessary part of skipping index vacuuming each time. Think about the cost of index vacuuming (the amount of I/O and the duration of index vacuuming) as less as less memory is available for TIDs. It's non-linear. The cost explodes once we're past a certain point. The truly important thing is to "never get killed by the explosion". > The situation where we need to deal with here is a very large table > that has a lot of dead tuples but those fit in fewer heap pages (less > than 1% of all heap blocks). In this case, it's likely that the number > of dead tuples also is relatively small compared to the total heap > tuples, as you mentioned. If dead tuples fitted in fewer pages but > accounted for most of all heap tuples in the heap, it would be a more > serious situation, there would definitely already be other problems. > So considering those conditions, I agreed to use a ratio of > maintenance_work_mem as a threshold. Maybe we can increase the > constant to 70, 80, or so. You mean 70% of maintenance_work_mem? That seems fine to me. See my "Why does lazy_vacuum_table_and_indexes() not make one decision for the entire VACUUM on the first call, and then stick to its decision?" 
remarks at the end of this email, though -- maybe it should not be an explicit threshold at all. High level philosophical point: In general I think that the algorithm for deciding whether or not to perform index vacuuming should *not* be clever. It should also not focus on getting the benefit of skipping index vacuuming. I think that a truly robust design will be one that always starts with the assumption that index vacuuming will be skipped, and then "works backwards" by considering thresholds/reasons to *not* skip. For example, the SKIP_VACUUM_PAGES_RATIO thing. The risk of "explosions" or "bankruptcy" can be thought of as a cost here, too. We should simply focus on the costs directly, without even trying to understand the relationship between each of the costs, and without really trying to understand the benefit to the user from skipping index vacuuming. > > Have you thought more about how the index vacuuming skipping can be > > configured by users? Maybe a new storage param, that works like the > > current SKIP_VACUUM_PAGES_RATIO constant? > > Since it’s unclear to me yet that adding a new storage parameter or > GUC parameter for this feature would be useful even for future > improvements in this area, I haven't thought yet about having users > control skipping index vacuuming. I’m okay with a constant value for > the threshold for now. I agree -- a GUC will become obsolete in only a year or two anyway. And it will be too hard to tune. Question about your patch: lazy_vacuum_table_and_indexes() ca
Re: New IndexAM API controlling index vacuum strategies
On Tue, Mar 9, 2021 at 2:22 PM Peter Geoghegan wrote: > > On Tue, Mar 2, 2021 at 8:49 PM Masahiko Sawada wrote: > > On Tue, Mar 2, 2021 at 2:34 PM Peter Geoghegan wrote: > > > lazy_vacuum_table_and_indexes() should probably not skip index > > > vacuuming when we're close to exceeding the space allocated for the > > > LVDeadTuples array. Maybe we should not skip when > > > vacrelstats->dead_tuples->num_tuples is greater than 50% of > > > dead_tuples->max_tuples? Of course, this would only need to be > > > considered when lazy_vacuum_table_and_indexes() is only called once > > > for the entire VACUUM operation (otherwise we have far too little > > > maintenance_work_mem/dead_tuples->max_tuples anyway). > > > > Doesn't it actually mean we consider how many dead *tuples* we > > collected during a vacuum? I’m not sure how important the fact we’re > > close to exceeding the maintenance_work_mem space. Suppose > > maintenance_work_mem is 64MB, we will not skip both index vacuum and > > heap vacuum if the number of dead tuples exceeds 5592404 (we can > > collect 11184809 tuples with 64MB memory). But those tuples could be > > concentrated in a small number of blocks, for example in a very large > > table case. It seems to contradict the current strategy that we want > > to skip vacuum if relatively few blocks are modified. No? > > There are competing considerations. I think that we need to be > sensitive to accumulating "debt" here. The cost of index vacuuming > grows in a non-linear fashion as the index grows (or as > maintenance_work_mem is lowered). This is the kind of thing that we > should try to avoid, I think. I suspect that cases where we can skip > index vacuuming and heap vacuuming are likely to involve very few dead > tuples in most cases anyway. > > We should not be sensitive to the absolute number of dead tuples when > it doesn't matter (say because they're concentrated in relatively few > heap pages). But when we overrun the maintenance_work_mem space, then > the situation changes; the number of dead tuples clearly matters just > because we run out of space for the TID array. The heap page level > skew is not really important once that happens. > > That said, maybe there is a better algorithm. 50% was a pretty arbitrary > number. I agreed that when we're close to overrunning the maintnenance_work_mem space, the situation changes. If we skip it in even that case, the next vacuum will be likely to use up maintenance_work_mem, leading to a second index scan. Which is bad. If this threshold is aimed to avoid a second index scan due to overrunning the maintenance_work_mem, using a ratio of maintenance_work_mem would be a good idea. On the other hand, if it's to avoid accumulating debt affecting the cost of index vacuuming, using a ratio of the total heap tuples seems better. The situation where we need to deal with here is a very large table that has a lot of dead tuples but those fit in fewer heap pages (less than 1% of all heap blocks). In this case, it's likely that the number of dead tuples also is relatively small compared to the total heap tuples, as you mentioned. If dead tuples fitted in fewer pages but accounted for most of all heap tuples in the heap, it would be a more serious situation, there would definitely already be other problems. So considering those conditions, I agreed to use a ratio of maintenance_work_mem as a threshold. Maybe we can increase the constant to 70, 80, or so. > > Have you thought more about how the index vacuuming skipping can be > configured by users? 
Maybe a new storage param, that works like the > current SKIP_VACUUM_PAGES_RATIO constant? Since it’s unclear to me yet that adding a new storage parameter or GUC parameter for this feature would be useful even for future improvements in this area, I haven't thought yet about having users control skipping index vacuuming. I’m okay with a constant value for the threshold for now. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: New IndexAM API controlling index vacuum strategies
On Thu, Mar 11, 2021 at 8:31 AM Robert Haas wrote: > I agree, but all you need is one long-lived tuple toward the end of > the array and you're stuck never being able to truncate it. It seems > like a worthwhile improvement, but whether it actually helps will be > workload-dependant. When it comes to improving VACUUM I think that most of the really interesting scenarios are workload dependent in one way or another. In fact even that concept becomes a little meaningless much of the time. For example with workloads that really benefit from bottom-up deletion, the vast majority of individual leaf pages have quite a bit of spare capacity at any given time. Again, "rare" events can have outsized importance in the aggregate -- most of the time every leaf page taken individually is a-okay! It's certainly not just indexing stuff. We have a tendency to imagine that HOT updates occur when indexes are not logically modified, except perhaps in the presence of some kind of stressor, like a long-running transaction. I guess that I do the same, informally. But let's not forget that the reality is that very few tables *consistently* get HOT updates, regardless of the shape of indexes and UPDATE statements. So in the long run practically all tables in many ways consist of pages that resemble those from a table that "only gets non-HOT updates" in the simplest sense. I suspect that the general preference for using lower-offset LP_UNUSED items first (inside PageAddItemExtended()) will tend to make this problem of "one high tuple that isn't dead" not so bad in many cases. In any case Matthias' patch makes the situation strictly better, and we can only fix one problem at a time. We have to start by eliminating individual low-level behaviors that *don't make sense*. Jan Wieck told me that he had to set heap fill factor to the ludicrously conservative setting of 50 just to get the TPC-C/BenchmarkSQL OORDER and ORDER_LINE tables to be stable over time [1] -- on-disk size stability is absolutely expected here. And these are the biggest tables! It takes hours if not days or even weeks for the situation to really get out of hand with a normal FF setting. I am almost certain that this is due to second order effects (even third order effects) that start from things like line pointer bloat and FSM inefficiencies. I suspect that it doesn't matter too much if you make heap fill factor 70 or 90 with these tables because the effect is non-linear -- for whatever reason 50 was found to be the magic number, through trial and error. "Incremental VACUUM" (the broad concept, not just this one patch) is likely to rely on our being able to make the performance characteristics more linear, at least in future iterations. Of course it's true that we should eliminate line pointer bloat and any kind of irreversible bloat because the overall effect is non-linear, unstable behavior, which is highly undesirable on its face. But it's also true that these improvements leave us with more linear behavior at a high-level, which is itself much easier to understand and model in a top-down fashion. It then becomes possible to build a cost model that makes VACUUM sensitive to the needs of the app, and how to make on-disk sizes *stable* in a variety of conditions. So in that sense I'd say that Matthias' patch is totally relevant. I know that I sound hippy-dippy here. But the fact is that bottom-up index deletion has *already* made the performance characteristics much simpler and therefore much easier to model. I hope to do more of that. 
[1] https://github.com/wieck/benchmarksql/blob/29b62435dc5c9eaf178983b43818fcbba82d4286/run/sql.postgres/extraCommandsBeforeLoad.sql#L1 -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Thu, 11 Mar 2021 at 17:31, Robert Haas wrote: > > On Tue, Mar 9, 2021 at 3:35 PM Peter Geoghegan wrote: > > Speaking of line pointer bloat (and "irreversible" bloat), I came > > across something relevant today. I believe that this recent patch from > > Matthias van de Meent is a relatively easy way to improve the > > situation: > > > > https://www.postgresql.org/message-id/flat/CAEze2WjgaQc55Y5f5CQd3L%3DeS5CZcff2Obxp%3DO6pto8-f0hC4w%40mail.gmail.com > > I agree, but all you need is one long-lived tuple toward the end of > the array and you're stuck never being able to truncate it. It seems > like a worthwhile improvement, but whether it actually helps will be > workload-dependant. > > Maybe it'd be OK to allow a much longer array with offsets > some > constant being usable only for HOT. HOT tuples are not indexed, so it > might be easier to rearrange things to allow compaction of the array > if it does happen to get fragmented. But I'm not sure it's OK to > relocate even a HOT tuple to a different TID. I'm currently trying to work out how to shuffle HOT tuples around as an extension on top of my heap->pd_lower patch, and part of that will be determining when and how HOT tuples are exposed internally. I'm probably going to need to change how they are referenced to get that working (current concept: HOT root TID + transaction identifier for the places that need more than 1 item in HOT chains), but it's a very bare-bones prototype currently only generating the data record necessary to shuffle the item pointers. In that, I've noticed that moving HOT items takes a lot of memory (~ 3 OffsetNumbers per increment of MaxHeapTuplesPerPage, plus some marking bits) to implement it in O(n); which means it would probably warrant its own loop in heap_page_prune separate from the current mark-and-sweep, triggered based on new measurements included in the current mark-and-sweep of the prune loop. Another idea I'm considering (no real implementation ideas) to add to this extension patch is moving HOT tuples to make space for incoming tuples, to guarantee that non-movable items are placed early on the page. This increases the chances for PageRepairFragmentation to eventually reclaim space from the item pointer array. I have nothing much worth showing yet for these additional patches, though, and all of it might not be worth the additional CPU cycles (it's 'only' 4 bytes per line pointer cleared, so it might be considered too expensive when also taking WAL into account). > Can someone, perhaps > even just the user, still have a reference to the old one and care > about us invalidating it? Maybe. But even if not, I'm not sure this > helps much with the situation you're concerned about, which involves > non-HOT tuples. Users having references to TIDs of HOT tuples should in my opinion be considered unknown behaviour. It might currently work, but the only access to a HOT tuple that is guaranteed to work should be through the chain's root. Breaking the current guarantee of HOT tuples not moving might be worth it if we can get enough savings in storage (which is also becoming more likely if MaxHeapTuplesPerPage is changed to larger values). As to who actually uses / stores these references, I think that the only place they are stored with some expectation of persistence is in sequential heap scans, and that can be changed. With regards, Matthias van de Meent
Re: New IndexAM API controlling index vacuum strategies
On Tue, Mar 9, 2021 at 3:35 PM Peter Geoghegan wrote: > Speaking of line pointer bloat (and "irreversible" bloat), I came > across something relevant today. I believe that this recent patch from > Matthias van de Meent is a relatively easy way to improve the > situation: > > https://www.postgresql.org/message-id/flat/CAEze2WjgaQc55Y5f5CQd3L%3DeS5CZcff2Obxp%3DO6pto8-f0hC4w%40mail.gmail.com I agree, but all you need is one long-lived tuple toward the end of the array and you're stuck never being able to truncate it. It seems like a worthwhile improvement, but whether it actually helps will be workload-dependent. Maybe it'd be OK to allow a much longer array with offsets > some constant being usable only for HOT. HOT tuples are not indexed, so it might be easier to rearrange things to allow compaction of the array if it does happen to get fragmented. But I'm not sure it's OK to relocate even a HOT tuple to a different TID. Can someone, perhaps even just the user, still have a reference to the old one and care about us invalidating it? Maybe. But even if not, I'm not sure this helps much with the situation you're concerned about, which involves non-HOT tuples. -- Robert Haas EDB: http://www.enterprisedb.com
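The "one long-lived tuple toward the end" limitation follows directly from how such truncation has to work: only a contiguous run of unused entries at the end of the line pointer array can be given back, because giving back space means pulling pd_lower backwards. A minimal illustration of that constraint (not Matthias' patch):

    typedef enum { SLOT_UNUSED, SLOT_IN_USE } SlotState;

    /*
     * Return the array length after giving back trailing unused entries;
     * shrinking pd_lower on a heap page amounts to the same thing.
     */
    static int
    truncate_line_pointer_array(const SlotState *slots, int nslots)
    {
        while (nslots > 0 && slots[nslots - 1] == SLOT_UNUSED)
            nslots--;
        return nslots;
    }

A single in-use slot at offset N pins the array length at N no matter how many earlier slots are unused, which is exactly the scenario Robert describes.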
Re: New IndexAM API controlling index vacuum strategies
On Wed, Mar 3, 2021 at 12:40 PM Peter Geoghegan wrote: > > On Tue, Mar 2, 2021 at 6:44 PM Masahiko Sawada wrote: > > A scale type parameter seems good to me but I wonder if how users can > > tune that parameter. We already have tuple-based parameters such as > > autovacuum_vacuum_scale_factor/threshold and I think that users > > basically don't pay attention to that table updates result in how many > > blocks. > > Fair. The scale thing was just a random suggestion, nothing to take > too seriously. > > > The third idea is a VACUUM command option like DISABLE_PAGE_SKIPPING > > to disable such skipping behavior. I imagine that the > > user-controllable-option to enforce both heap vacuum and index vacuum > > would be required also in the future when we have the vacuum strategy > > feature (i.g., incremental vacuum). > > Yeah, I'm worried about conflicting requirements here -- this patch > and the next patch (that pushes the same ideas further) might have > different requirements. > > I think that this patch will mostly be useful in cases where there are > very few LP_DEAD-containing heap pages, but consistently more than > zero. So it's probably not easy to tune. > > What we might want is an on/off switch. But why? DISABLE_PAGE_SKIPPING > was added because the freeze map work in 9.6 was considered high risk > at the time, and we needed to have a tool to manage that risk. But > this patch doesn't seem nearly as tricky. No? I think the motivation behind an on/off switch is similar. I was concerned about a case where there is a bug or something that makes us mistakenly skip vacuuming the heap and indexes. But this feature would not be as complicated as the freeze map work, and it only skips the step of changing LP_DEAD to LP_UNUSED, so I agree it seems not to be essential. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 8, 2021 at 7:34 PM Peter Geoghegan wrote: > > One possible > > consequence that I'm concerned about is sequential scan performance. > > For an index scan, you just jump to the line pointer you want and then > > go get the tuple, but a sequential scan has to loop over all the line > > pointers on the page, and skipping a lot of dead ones can't be > > completely free. A small increase in MaxHeapTuplesPerPage probably > > wouldn't matter, but the proposed increase of almost 10x (291 -> 2042) > > is a bit scary. > > I agree. Maybe the real problem here is that MaxHeapTuplesPerPage is a > generic constant. Perhaps it should be something that can vary by > table, according to practical table-level considerations such as > projected tuple width given the "shape" of tuples for that table, etc. Speaking of line pointer bloat (and "irreversible" bloat), I came across something relevant today. I believe that this recent patch from Matthias van de Meent is a relatively easy way to improve the situation: https://www.postgresql.org/message-id/flat/CAEze2WjgaQc55Y5f5CQd3L%3DeS5CZcff2Obxp%3DO6pto8-f0hC4w%40mail.gmail.com -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Tue, Mar 2, 2021 at 8:49 PM Masahiko Sawada wrote: > On Tue, Mar 2, 2021 at 2:34 PM Peter Geoghegan wrote: > > lazy_vacuum_table_and_indexes() should probably not skip index > > vacuuming when we're close to exceeding the space allocated for the > > LVDeadTuples array. Maybe we should not skip when > > vacrelstats->dead_tuples->num_tuples is greater than 50% of > > dead_tuples->max_tuples? Of course, this would only need to be > > considered when lazy_vacuum_table_and_indexes() is only called once > > for the entire VACUUM operation (otherwise we have far too little > > maintenance_work_mem/dead_tuples->max_tuples anyway). > > Doesn't it actually mean we consider how many dead *tuples* we > collected during a vacuum? I’m not sure how important the fact we’re > close to exceeding the maintenance_work_mem space. Suppose > maintenance_work_mem is 64MB, we will not skip both index vacuum and > heap vacuum if the number of dead tuples exceeds 5592404 (we can > collect 11184809 tuples with 64MB memory). But those tuples could be > concentrated in a small number of blocks, for example in a very large > table case. It seems to contradict the current strategy that we want > to skip vacuum if relatively few blocks are modified. No? There are competing considerations. I think that we need to be sensitive to accumulating "debt" here. The cost of index vacuuming grows in a non-linear fashion as the index grows (or as maintenance_work_mem is lowered). This is the kind of thing that we should try to avoid, I think. I suspect that cases where we can skip index vacuuming and heap vacuuming are likely to involve very few dead tuples in most cases anyway. We should not be sensitive to the absolute number of dead tuples when it doesn't matter (say because they're concentrated in relatively few heap pages). But when we overrun the maintenance_work_mem space, then the situation changes; the number of dead tuples clearly matters just because we run out of space for the TID array. The heap page level skew is not really important once that happens. That said, maybe there is a better algorithm. 50% was a pretty arbitrary number. Have you thought more about how the index vacuuming skipping can be configured by users? Maybe a new storage param, that works like the current SKIP_VACUUM_PAGES_RATIO constant? -- Peter Geoghegan
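A rough way to see the non-linearity described above: every time the dead-TID array fills, each index must be scanned in full by ambulkdelete(), so total index I/O is roughly (number of passes) x (total index size). A toy calculation with made-up numbers:

    #include <stdio.h>

    int
    main(void)
    {
        long dead_tids = 50L * 1000 * 1000;     /* TIDs to delete, for example */
        long index_pages = 1L * 1000 * 1000;    /* total pages across all indexes */

        for (long capacity = 40L * 1000 * 1000; capacity >= 5L * 1000 * 1000;
             capacity /= 2)
        {
            long passes = (dead_tids + capacity - 1) / capacity;   /* ceil */
            printf("capacity %ld TIDs -> %ld passes -> %ld index pages read\n",
                   capacity, passes, passes * index_pages);
        }
        return 0;
    }

Halving the memory available for TIDs does not add a constant amount of work; it roughly doubles the number of full index scans, which is the "explosion" being talked about.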
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 8, 2021 at 10:57 AM Robert Haas wrote: > Yes, I agree that it's good to postpone this to a future release, and > that thinking through the consequences is not so easy. The current plan is to commit something like Masahiko's skip_index_vacuum.patch for Postgres 14. The latest version of that patch (a reduced-scope version of Masahiko's patch without any changes to MaxHeapTuplesPerPage) is available from: https://postgr.es/m/CAD21AoAtZb4+HJT_8RoOXvu4HM-Zd4HKS3YSMCH6+-W=bdy...@mail.gmail.com The idea is to "unify the vacuum_cleanup_index_scale_factor feature from Postgres 11 with the INDEX_CLEANUP feature from Postgres 12". This is the broader plan to make that "unification" happen for Postgres 14: https://postgr.es/m/CAH2-WzkYaDdbWOEwSSmC65FzF_jRLq-cxrYtt-2+ASoA156X=w...@mail.gmail.com So, as I said, any change to MaxHeapTuplesPerPage is now out of scope for Postgres 14. > One possible > consequence that I'm concerned about is sequential scan performance. > For an index scan, you just jump to the line pointer you want and then > go get the tuple, but a sequential scan has to loop over all the line > pointers on the page, and skipping a lot of dead ones can't be > completely free. A small increase in MaxHeapTuplesPerPage probably > wouldn't matter, but the proposed increase of almost 10x (291 -> 2042) > is a bit scary. I agree. Maybe the real problem here is that MaxHeapTuplesPerPage is a generic constant. Perhaps it should be something that can vary by table, according to practical table-level considerations such as projected tuple width given the "shape" of tuples for that table, etc. Certain DB systems that use bitmap indexes extensively allow this to be configured per-table. If you need to encode a bunch of TIDs as bitmaps, you first need some trivial mapping from TIDs to integers (before you even build the bitmap, much less compress it). So even without VACUUM there is a trade-off to be made. It is *roughly* comparable to the trade-off you make when deciding on a page size. What I really want to do for Postgres 14 is to establish the principle that index vacuuming is theoretically optional -- in all cases. There will be immediate practical benefits, too. I think it's important to remove the artificial behavioral differences between cases where there are 0 dead tuples and cases where there is only 1. My guess is that 99%+ append-only tables are far more common than 100% append-only tables in practice. > It's also a little hard to believe that letting almost > 50% of the total space on the page get chewed up by the line pointer > array is going to be optimal. If that happens to every page while the > amount of data stays the same, the table must almost double in size. > That's got to be bad. I think that we should be prepared for a large diversity of conditions within a given table. It follows that we should try to be adaptive. The reduced-scope patch currently tracks LP_DEAD line pointers at the heap page level, and then applies a count of heap blocks with one or more LP_DEAD line pointers (could be existing or just pruned by this VACUUM) to determine a count of heap pages. This is used to determine a threshold at which index vacuuming should be forced. Currently we have a multiplier constant called SKIP_VACUUM_PAGES_RATIO, which is 0.01 -- 1% of heap blocks. 
Of course, it's possible that LP_DEAD line pointers will be very concentrated, in which case we're more aggressive about skipping index vacuuming (aggressive in terms of dead TIDs rather than heap blocks, that is). The other extreme exists too: LP_DEAD line pointers may instead be spread diffusively across all heap pages, in which case we are unlikely to ever skip index vacuuming outside of cases like anti-wraparound vacuum or insert-driven vacuum to set VM bits. The next iteration of the high-level "granular vacuum" project (which will presumably target Postgres 15) should probably involve more complicated, qualitative judgements about LP_DEAD line pointers in the heap. Likewise it should care about individual needs of indexes, which is something that Masahiko experimented with in earlier drafts of the patch on this thread. The needs of each index can be quite different with bottom-up index deletion. We may in fact end up adding a new, moderately complicated cost model -- it may have to be modelled as an optimization problem. In short, I think that thinking more about the logical state of the database during VACUUM is likely to pay off ("heap blocks vs dead tuples" is one part of that). VACUUM should be a little more qualitative, and a little less quantitative. The fact that we currently don't do stuff like that (unless bottom-up index deletion counts) is not an inherent limitation of the design of VACUUM. I'm not entirely sure how far it can be pushed, but it seems quite promising. > The whole thing would be more appealing if there > were some way to exert exponentially increasing back-pressure on th
Re: New IndexAM API controlling index vacuum strategies
On Mon, Feb 1, 2021 at 10:17 PM Peter Geoghegan wrote: > * No need to change MaxHeapTuplesPerPage for now, since that only > really makes sense in cases that heavily involve bottom-up deletion, > where we care about the *concentration* of LP_DEAD line pointers in > heap pages (and not just the absolute number in the entire table), > which is qualitative, not quantitative (somewhat like bottom-up > deletion). > > The change to MaxHeapTuplesPerPage that Masahiko has proposed does > make sense -- there are good reasons to increase it. Of course there > are also good reasons to not do so. I'm concerned that we won't have > time to think through all the possible consequences. Yes, I agree that it's good to postpone this to a future release, and that thinking through the consequences is not so easy. One possible consequence that I'm concerned about is sequential scan performance. For an index scan, you just jump to the line pointer you want and then go get the tuple, but a sequential scan has to loop over all the line pointers on the page, and skipping a lot of dead ones can't be completely free. A small increase in MaxHeapTuplesPerPage probably wouldn't matter, but the proposed increase of almost 10x (291 -> 2042) is a bit scary. It's also a little hard to believe that letting almost 50% of the total space on the page get chewed up by the line pointer array is going to be optimal. If that happens to every page while the amount of data stays the same, the table must almost double in size. That's got to be bad. The whole thing would be more appealing if there were some way to exert exponentially increasing back-pressure on the length of the line pointer array - that is, make it so that the longer the array is already, the less willing we are to extend it further. But I don't really see how to do that. Also, at the risk of going on and on, line pointer array bloat is very hard to eliminate once it happens. We never even try to shrink the line pointer array, and if the last TID in the array is still in use, it wouldn't be possible anyway, assuming the table has at least one non-BRIN index. Index page splits are likewise irreversible, but creating a new index and dropping the old one is still less awful than having to rewrite the table. Another thing to consider is that MaxHeapTuplesPerPage is used to size some stack-allocated arrays, especially the stack-allocated PruneState. I thought for a while about this and I can't really see why it would be a big problem, even with a large increase in MaxHeapTuplesPerPage, so I'm just mentioning this in case it makes somebody else think of something I've missed. -- Robert Haas EDB: http://www.enterprisedb.com
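For readers following along, the 291 and 2042 figures come from the page geometry (8kB pages, 8-byte MAXALIGN). The arithmetic below mirrors the MaxHeapTuplesPerPage definition in access/htup_details.h, with the constants written out as plain numbers for illustration:

    #include <stdio.h>

    int
    main(void)
    {
        int blcksz = 8192;
        int page_header = 24;     /* SizeOfPageHeaderData */
        int tuple_header = 24;    /* MAXALIGN(SizeofHeapTupleHeader), i.e. MAXALIGN(23) */
        int line_pointer = 4;     /* sizeof(ItemIdData) */

        /* Current limit: every item is assumed to have a header-sized tuple attached. */
        printf("MaxHeapTuplesPerPage today: %d\n",
               (blcksz - page_header) / (tuple_header + line_pointer));   /* 291 */

        /* Proposed limit: allow the whole data area to be line pointers. */
        printf("Proposed limit: %d\n",
               (blcksz - page_header) / line_pointer);                    /* 2042 */

        return 0;
    }

The second figure is what allows "almost 50% of the total space on the page" to be consumed by line pointers, which is the scenario Robert is worried about.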
Re: New IndexAM API controlling index vacuum strategies
On Tue, Mar 2, 2021 at 2:34 PM Peter Geoghegan wrote: > > On Mon, Mar 1, 2021 at 7:00 PM Peter Geoghegan wrote: > > I think that you're right. However, in practice it isn't harmful > > because has_dead_tuples is only used when "all_visible = true", and > > only to detect corruption (which should never happen). I think that it > > should be fixed as part of this work, though. > > Currently the first callsite that calls the new > lazy_vacuum_table_and_indexes() function in the patch > ("skip_index_vacuum.patch") skips index vacuuming in exactly the same > way as the second and final lazy_vacuum_table_and_indexes() call site. > Don't we need to account for maintenance_work_mem in some way? > > lazy_vacuum_table_and_indexes() should probably not skip index > vacuuming when we're close to exceeding the space allocated for the > LVDeadTuples array. Maybe we should not skip when > vacrelstats->dead_tuples->num_tuples is greater than 50% of > dead_tuples->max_tuples? Of course, this would only need to be > considered when lazy_vacuum_table_and_indexes() is only called once > for the entire VACUUM operation (otherwise we have far too little > maintenance_work_mem/dead_tuples->max_tuples anyway). Doesn't it actually mean we consider how many dead *tuples* we collected during a vacuum? I’m not sure how important the fact we’re close to exceeding the maintenance_work_mem space. Suppose maintenance_work_mem is 64MB, we will not skip both index vacuum and heap vacuum if the number of dead tuples exceeds 5592404 (we can collect 11184809 tuples with 64MB memory). But those tuples could be concentrated in a small number of blocks, for example in a very large table case. It seems to contradict the current strategy that we want to skip vacuum if relatively few blocks are modified. No? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
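Where the 5592404 and 11184809 figures above come from: each collected TID is a 6-byte ItemPointerData, and the dead-tuples array is assumed here to carry a small fixed header (8 bytes, for its two counters) before the TIDs themselves:

    #include <stdio.h>

    int
    main(void)
    {
        long maintenance_work_mem = 64L * 1024 * 1024;   /* 64MB in bytes */
        long header = 8;                                  /* max_tuples + num_tuples counters */
        long itemptr = 6;                                 /* sizeof(ItemPointerData) */

        long max_tuples = (maintenance_work_mem - header) / itemptr;

        printf("TIDs that fit in 64MB: %ld\n", max_tuples);      /* 11184809 */
        printf("50%% threshold:        %ld\n", max_tuples / 2);   /* 5592404 */
        return 0;
    }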
Re: New IndexAM API controlling index vacuum strategies
On Tue, Mar 2, 2021 at 6:44 PM Masahiko Sawada wrote: > A scale type parameter seems good to me but I wonder if how users can > tune that parameter. We already have tuple-based parameters such as > autovacuum_vacuum_scale_factor/threshold and I think that users > basically don't pay attention to that table updates result in how many > blocks. Fair. The scale thing was just a random suggestion, nothing to take too seriously. > The third idea is a VACUUM command option like DISABLE_PAGE_SKIPPING > to disable such skipping behavior. I imagine that the > user-controllable-option to enforce both heap vacuum and index vacuum > would be required also in the future when we have the vacuum strategy > feature (i.g., incremental vacuum). Yeah, I'm worried about conflicting requirements here -- this patch and the next patch (that pushes the same ideas further) might have different requirements. I think that this patch will mostly be useful in cases where there are very few LP_DEAD-containing heap pages, but consistently more than zero. So it's probably not easy to tune. What we might want is an on/off switch. But why? DISABLE_PAGE_SKIPPING was added because the freeze map work in 9.6 was considered high risk at the time, and we needed to have a tool to manage that risk. But this patch doesn't seem nearly as tricky. No? > > Lots of stuff in this area is kind of weird already. Sometimes this is > > directly exposed to users, even. This came up recently, when I was > > working on VACUUM VERBOSE stuff. > That's true. I didn't know that. It occurs to me that "tups_vacuumed vs. total LP_DEAD Items in heap after VACUUM finishes" is similar to "pages_newly_deleted vs. pages_deleted" for indexes. An easy mistake to make! > > https://www.postgresql.org/message-id/flat/20130108024957.GA4751%40tornado.leadboat.com > > > > Of course, this effort to eliminate the "tupgone = > > true"/XLOG_HEAP2_CLEANUP_INFO special case didn't go anywhere at the > > time. > > I'll look at that thread. I'm not sure if it's super valuable to look at the thread. But it is reassuring to see that Noah shared the intuition that the "tupgone = true" case was kind of bad, even back in 2013. It's one part of my "mental map" of VACUUM. -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Tue, Mar 2, 2021 at 12:00 PM Peter Geoghegan wrote: > > On Sun, Feb 21, 2021 at 10:28 PM Masahiko Sawada > wrote: > > Sorry for the late response. > > Me too! No problem, thank you for your comment! > > > 1. Whereas skipping index vacuum and heap vacuum is a very attractive > > improvement, if we skip that by default I wonder if we need a way to > > disable it. Vacuum plays a role in cleaning and diagnosing tables in > > practice. So in a case where the table is bad state and the user wants > > to clean all heap pages, it would be good to have a way to disable > > this skipping behavior. One solution would be that index_cleanup > > option has three different behaviors: on, auto (or smart), and off. We > > enable this skipping behavior by default in ‘auto’ mode, but > > specifying "INDEX_CLEANUP true” means to enforce index vacuum and > > therefore disabling it. > > Sounds reasonable to me. Maybe users should express the skipping > behavior that they desire in terms of the *proportion* of all heap > blocks with LP_DEAD line pointers that we're willing to have while > still skipping index vacuuming + lazy_vacuum_heap() heap scan. In > other words, it can be a "scale" type GUC/param (though based on heap > blocks *at the end* of the first heap scan, not tuples at the point > the av launcher considers launching AV workers). A scale type parameter seems good to me but I wonder if how users can tune that parameter. We already have tuple-based parameters such as autovacuum_vacuum_scale_factor/threshold and I think that users basically don't pay attention to that table updates result in how many blocks. And I'm concerned that my above idea could confuse users since what we want to control is both heap vacuum and index vacuum but it looks like controlling only index vacuum. The third idea is a VACUUM command option like DISABLE_PAGE_SKIPPING to disable such skipping behavior. I imagine that the user-controllable-option to enforce both heap vacuum and index vacuum would be required also in the future when we have the vacuum strategy feature (i.g., incremental vacuum). > > > @@ -1299,6 +1303,7 @@ lazy_scan_heap(Relation onerel, VacuumParams > > *params, LVRelStats *vacrelstats, > > { > > lazy_record_dead_tuple(dead_tuples, &(tuple.t_self)); > > all_visible = false; > > + has_dead_tuples = true; > > continue; > > } > > > > I added the above change in the patch to count the number of heap > > pages having at least one LP_DEAD line pointer. But it's weird to me > > that we have never set has_dead_tuples true when we found an LP_DEAD > > line pointer. > > I think that you're right. However, in practice it isn't harmful > because has_dead_tuples is only used when "all_visible = true", and > only to detect corruption (which should never happen). I think that it > should be fixed as part of this work, though. Agreed. > > Lots of stuff in this area is kind of weird already. Sometimes this is > directly exposed to users, even. This came up recently, when I was > working on VACUUM VERBOSE stuff. (This touched the precise piece of > code you've patched in the quoted diff snippet, so perhaps you know > some of the story I will tell you now already.) > > I recently noticed that VACUUM VERBOSE can report a very low > tups_vacuumed/"removable heap tuples" when run against tables where > most pruning is opportunistic pruning rather than VACUUM pruning > (which is very common), provided there are no HOT updates (which is > common but not very common). 
This can be very confusing, because > VACUUM VERBOSE will report a "tuples_deleted" for the heap relation > that is far far less than the "tuples_removed" it reports for indexes > on the same table -- even though both fields have values that are > technically accurate (e.g., not very affected by concurrent activity > during VACUUM, nothing like that). > > This came to my attention when I was running BenchmarkSQL for the > 64-bit XID deleted pages patch. One of the BenchmarkSQL tables (though > only one -- the table whose UPDATEs are not HOT safe, which is unique > among BenchmarkSQL/TPC-C tables). I pushed a commit with comment > changes [1] to make that aspect of VACUUM VERBOSE a little less > confusing. (I was actually running a quick-and-dirty hack that made > log_autovacuum show VACUUM VERBOSE index stuff -- I would probably > have missed the weird difference between heap tups_vacuumed and index > tuples_removed without this custom log_autovacuum hack.) That's true. I didn't know that. > > Just to be clear (I think you agree already): we should base any > triggering logic for skipping index vacuuming/lazy_vacuum_heap() on > logic that does not care *when* heap pages first contained LP_DEAD > line pointers (could be that they were counted in tups_vacuumed due to > being pruned during this VACUUM operation, could be from an earlier > opportunistic pruning, etc). Agreed. We should bas
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 1, 2021 at 7:00 PM Peter Geoghegan wrote: > I think that you're right. However, in practice it isn't harmful > because has_dead_tuples is only used when "all_visible = true", and > only to detect corruption (which should never happen). I think that it > should be fixed as part of this work, though. Currently the first callsite that calls the new lazy_vacuum_table_and_indexes() function in the patch ("skip_index_vacuum.patch") skips index vacuuming in exactly the same way as the second and final lazy_vacuum_table_and_indexes() call site. Don't we need to account for maintenance_work_mem in some way? lazy_vacuum_table_and_indexes() should probably not skip index vacuuming when we're close to exceeding the space allocated for the LVDeadTuples array. Maybe we should not skip when vacrelstats->dead_tuples->num_tuples is greater than 50% of dead_tuples->max_tuples? Of course, this would only need to be considered when lazy_vacuum_table_and_indexes() is only called once for the entire VACUUM operation (otherwise we have far too little maintenance_work_mem/dead_tuples->max_tuples anyway). -- Peter Geoghegan
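A minimal sketch of the guard being proposed here, with invented names (this is not code from the patch): skipping is only considered while the dead-TID array still has plenty of headroom, and even then only when few heap pages contain LP_DEAD items. The 1% page figure is simply the placeholder value discussed elsewhere in the thread.

    #include <stdbool.h>

    /* Sketch only: decide whether index vacuuming (and the second heap pass) may be skipped */
    static bool
    consider_skipping_index_vacuuming(long num_dead_tuples, long max_dead_tuples,
                                      long lpdead_pages, long rel_pages)
    {
        /* too close to the maintenance_work_mem limit: never skip */
        if (num_dead_tuples > max_dead_tuples / 2)
            return false;

        /* otherwise skip only when under 1% of heap pages have LP_DEAD items */
        return lpdead_pages < rel_pages / 100;
    }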
Re: New IndexAM API controlling index vacuum strategies
On Sun, Feb 21, 2021 at 10:28 PM Masahiko Sawada wrote: > Sorry for the late response. Me too! > 1. Whereas skipping index vacuum and heap vacuum is a very attractive > improvement, if we skip that by default I wonder if we need a way to > disable it. Vacuum plays a role in cleaning and diagnosing tables in > practice. So in a case where the table is bad state and the user wants > to clean all heap pages, it would be good to have a way to disable > this skipping behavior. One solution would be that index_cleanup > option has three different behaviors: on, auto (or smart), and off. We > enable this skipping behavior by default in ‘auto’ mode, but > specifying "INDEX_CLEANUP true” means to enforce index vacuum and > therefore disabling it. Sounds reasonable to me. Maybe users should express the skipping behavior that they desire in terms of the *proportion* of all heap blocks with LP_DEAD line pointers that we're willing to have while still skipping index vacuuming + lazy_vacuum_heap() heap scan. In other words, it can be a "scale" type GUC/param (though based on heap blocks *at the end* of the first heap scan, not tuples at the point the av launcher considers launching AV workers). > @@ -1299,6 +1303,7 @@ lazy_scan_heap(Relation onerel, VacuumParams > *params, LVRelStats *vacrelstats, > { > lazy_record_dead_tuple(dead_tuples, &(tuple.t_self)); > all_visible = false; > + has_dead_tuples = true; > continue; > } > > I added the above change in the patch to count the number of heap > pages having at least one LP_DEAD line pointer. But it's weird to me > that we have never set has_dead_tuples true when we found an LP_DEAD > line pointer. I think that you're right. However, in practice it isn't harmful because has_dead_tuples is only used when "all_visible = true", and only to detect corruption (which should never happen). I think that it should be fixed as part of this work, though. Lots of stuff in this area is kind of weird already. Sometimes this is directly exposed to users, even. This came up recently, when I was working on VACUUM VERBOSE stuff. (This touched the precise piece of code you've patched in the quoted diff snippet, so perhaps you know some of the story I will tell you now already.) I recently noticed that VACUUM VERBOSE can report a very low tups_vacuumed/"removable heap tuples" when run against tables where most pruning is opportunistic pruning rather than VACUUM pruning (which is very common), provided there are no HOT updates (which is common but not very common). This can be very confusing, because VACUUM VERBOSE will report a "tuples_deleted" for the heap relation that is far far less than the "tuples_removed" it reports for indexes on the same table -- even though both fields have values that are technically accurate (e.g., not very affected by concurrent activity during VACUUM, nothing like that). This came to my attention when I was running BenchmarkSQL for the 64-bit XID deleted pages patch. One of the BenchmarkSQL tables (though only one -- the table whose UPDATEs are not HOT safe, which is unique among BenchmarkSQL/TPC-C tables). I pushed a commit with comment changes [1] to make that aspect of VACUUM VERBOSE a little less confusing. (I was actually running a quick-and-dirty hack that made log_autovacuum show VACUUM VERBOSE index stuff -- I would probably have missed the weird difference between heap tups_vacuumed and index tuples_removed without this custom log_autovacuum hack.) 
Just to be clear (I think you agree already): we should base any triggering logic for skipping index vacuuming/lazy_vacuum_heap() on logic that does not care *when* heap pages first contained LP_DEAD line pointers (could be that they were counted in tups_vacuumed due to being pruned during this VACUUM operation, could be from an earlier opportunistic pruning, etc). > Currently, we set it to false true in 'tupgone' case but > it seems to me that we should do that in this case as well since we > use this flag in the following check: > >else if (PageIsAllVisible(page) && has_dead_tuples) >{ >elog(WARNING, "page containing dead tuples is marked as > all-visible in relation \"%s\" page %u", > vacrelstats->relname, blkno); >PageClearAllVisible(page); >MarkBufferDirty(buf); >visibilitymap_clear(onerel, blkno, vmbuffer, >VISIBILITYMAP_VALID_BITS); >} The "tupgone = true"/HEAPTUPLE_DEAD-race case is *extremely* weird. It has zero test coverage according to coverage.postgresql.org [2], despite being very complicated. 3 points on the "tupgone = true" weirdness (I'm writing this as a record for myself, almost): 1. It is the reason why lazy_vacuum_heap() must be prepared to set tuples LP_UNUSED that are not already LP_DEAD. So when lazy_vacuum_page() says "the first dead tuple for this page", that doesn't
Re: New IndexAM API controlling index vacuum strategies
On Wed, Feb 10, 2021 at 4:12 PM Peter Geoghegan wrote: > > On Tue, Feb 9, 2021 at 6:14 PM Masahiko Sawada wrote: > > Thanks. I think that's very good if we resolve this recycling stuff > > first then try the new approach to skip index vacuum in more cases. > > That way, even if the vacuum strategy stuff took a very long time to > > get committed over several major versions, we would not be affected by > > deleted nbtree page recycling problem (at least for built-in index > > AMs). Also, the approach of 6655a7299d8 itself is a good improvement > > and seems straightforward to me. > > I'm glad that you emphasized this issue, because I came up with a > solution that turns out to not be very invasive. At the same time it > has unexpected advantages, liking improving amcheck coverage for > deleted pages. Sorry for the late response. I've attached the patch that adds a check whether or not to do index vacuum (and heap vacuum) if 1% of all heap pages have LP_DEAD line pointer. While developing this feature, I realized the following two things: 1. Whereas skipping index vacuum and heap vacuum is a very attractive improvement, if we skip that by default I wonder if we need a way to disable it. Vacuum plays a role in cleaning and diagnosing tables in practice. So in a case where the table is bad state and the user wants to clean all heap pages, it would be good to have a way to disable this skipping behavior. One solution would be that index_cleanup option has three different behaviors: on, auto (or smart), and off. We enable this skipping behavior by default in ‘auto’ mode, but specifying "INDEX_CLEANUP true” means to enforce index vacuum and therefore disabling it. --- 2. @@ -1299,6 +1303,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, { lazy_record_dead_tuple(dead_tuples, &(tuple.t_self)); all_visible = false; + has_dead_tuples = true; continue; } I added the above change in the patch to count the number of heap pages having at least one LP_DEAD line pointer. But it's weird to me that we have never set has_dead_tuples true when we found an LP_DEAD line pointer. Currently, we set it to false true in 'tupgone' case but it seems to me that we should do that in this case as well since we use this flag in the following check: else if (PageIsAllVisible(page) && has_dead_tuples) { elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u", vacrelstats->relname, blkno); PageClearAllVisible(page); MarkBufferDirty(buf); visibilitymap_clear(onerel, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS); } Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/ skip_index_vacuum.patch Description: Binary data
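The on/auto/off idea in point 1 could look something like the following sketch. The identifiers are invented here for illustration, not taken from the patch: AUTO is the default and lets VACUUM apply the skipping heuristic, an explicit ON forces index vacuuming, and OFF keeps today's INDEX_CLEANUP false behavior.

    #include <stdbool.h>

    typedef enum VacIndexCleanupMode
    {
        INDEX_CLEANUP_AUTO,     /* default: may skip when few pages have LP_DEAD items */
        INDEX_CLEANUP_ON,       /* user asked for index vacuuming: never skip */
        INDEX_CLEANUP_OFF       /* never vacuum indexes, like INDEX_CLEANUP false today */
    } VacIndexCleanupMode;

    static bool
    should_vacuum_indexes(VacIndexCleanupMode mode, bool few_lpdead_pages)
    {
        switch (mode)
        {
            case INDEX_CLEANUP_ON:
                return true;
            case INDEX_CLEANUP_OFF:
                return false;
            case INDEX_CLEANUP_AUTO:
            default:
                return !few_lpdead_pages;
        }
    }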
Re: New IndexAM API controlling index vacuum strategies
On Tue, Feb 9, 2021 at 6:14 PM Masahiko Sawada wrote: > Thanks. I think that's very good if we resolve this recycling stuff > first and then try the new approach to skip index vacuum in more cases. > That way, even if the vacuum strategy stuff took a very long time to > get committed over several major versions, we would not be affected by > the deleted nbtree page recycling problem (at least for built-in index > AMs). Also, the approach of 6655a7299d8 itself is a good improvement > and seems straightforward to me.

I'm glad that you emphasized this issue, because I came up with a solution that turns out to not be very invasive. At the same time it has unexpected advantages, like improving amcheck coverage for deleted pages. -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Sat, Feb 6, 2021 at 5:02 AM Peter Geoghegan wrote: > > On Wed, Feb 3, 2021 at 8:18 PM Masahiko Sawada wrote: > > > I'm starting to think that the right short term goal should not > > > directly involve bottom-up index deletion. We should instead return to > > > the idea of "unifying" the vacuum_cleanup_index_scale_factor feature > > > with the INDEX_CLEANUP feature, which is kind of where this whole idea > > > started out at. This short term goal is much more than mere > > > refactoring. It is still a whole new user-visible feature. The patch > > > would teach VACUUM to skip doing any real index work within both > > > ambulkdelete() and amvacuumcleanup() in many important cases. > > > > > > > I agree to cut the scope. I've also been thinking about the impact of > > this patch on users. > > It's probably also true that on balance users care more about the > "~99.9% append-only table" case (as well as the HOT updates workload I > brought up in response to Victor on February 2) than making VACUUM > very sensitive to how well bottom-up deletion is working. Yes, it's > annoying that VACUUM still wastes effort on indexes where bottom-up > deletion alone can do all required garbage collection. But that's not > going to be a huge surprise to users. Whereas the "~99.9% append-only > table" case causes huge surprises to users -- users hate this kind of > thing. Agreed. > > > If vacuum skips both ambulkdelete and amvacuumcleanup in that case, > > I'm concerned that this could increase users who are affected by the > > known issue of leaking deleted pages. Currently, users who are > > affected by that is only users who use INDEX_CLEANUP off. But if we > > enable this feature by default, all users potentially are affected by > > that issue. > > FWIW I think that it's unfair to blame INDEX_CLEANUP for any problems > in this area. The truth is that the design of the > deleted-page-recycling stuff has always caused leaked pages, even with > workloads that should not be challenging to the implementation in any > way. See my later remarks. > > > To improve index tuple deletion in that case, skipping bulkdelete is > > also a good idea whereas the retail index deletion is also a good > > solution. I have thought the retail index deletion would be > > appropriate to this case but since some index AM cannot support it > > skipping index scans is a good solution anyway. > > The big problem with retail index tuple deletion is that it is not > possible once heap pruning takes place (opportunistic pruning, or > pruning performed by VACUUM). Pruning will destroy the information > that retail deletion needs to find the index tuple (the column > values). Right. > > I think that we probably will end up using retail index tuple > deletion, but it will only be one strategy among several. We'll never > be able to rely on it, even within nbtree. My personal opinion is that > completely eliminating VACUUM is not a useful long term goal. Totally agreed. We are not able to rely on it. It would be a good way to delete small amount index garbage tuples but the usage is limited. > > > > Here is how the triggering criteria could work: maybe skipping > > > accessing all indexes during VACUUM happens when less than 1% or > > > 10,000 of the items from the table are to be removed by VACUUM -- > > > whichever is greater. Of course this is just the first thing I thought > > > of. It's a starting point for further discussion. > > > > I also prefer your second idea :) > > Cool. 
Yeah, I always like it when the potential downside of a design > is obviously low, and the potential upside is obviously very high. I > am much less concerned about any uncertainty around when and how users > will get the big upside. I like it when my problems are "luxury > problems". :-) > > > As I mentioned above, I'm still concerned that the extent of users who > > are affected by the issue of leaking deleted pages could get expanded. > > Currently, we don't have a way to detect how many index pages are > > leaked. If there are potential cases where many deleted pages are > > leaked this feature would make things worse. > > The existing problems here were significant even before you added > INDEX_CLEANUP. For example, let's say VACUUM deletes a page, and then > later recycles it in the normal/correct way -- this is the simplest > possible case for page deletion. The page is now in the FSM, ready to > be recycled during the next page split. Or is it? Even in this case > there are no guarantees! This is because _bt_getbuf() does not fully > trust the FSM to give it a 100% recycle-safe page for its page split > caller -- _bt_getbuf() checks the page using _bt_page_recyclable() > (which is the same check that VACUUM does to decide a deleted page is > now recyclable). Obviously this means that the FSM can "leak" a page, > just because there happened to be no page splits before wraparound > occurred (and so now _bt_page_recyclable() thinks the very old page is > very new
Re: New IndexAM API controlling index vacuum strategies
On Wed, Feb 3, 2021 at 8:18 PM Masahiko Sawada wrote: > > I'm starting to think that the right short term goal should not > > directly involve bottom-up index deletion. We should instead return to > > the idea of "unifying" the vacuum_cleanup_index_scale_factor feature > > with the INDEX_CLEANUP feature, which is kind of where this whole idea > > started out at. This short term goal is much more than mere > > refactoring. It is still a whole new user-visible feature. The patch > > would teach VACUUM to skip doing any real index work within both > > ambulkdelete() and amvacuumcleanup() in many important cases. > > > > I agree to cut the scope. I've also been thinking about the impact of > this patch on users. It's probably also true that on balance users care more about the "~99.9% append-only table" case (as well as the HOT updates workload I brought up in response to Victor on February 2) than making VACUUM very sensitive to how well bottom-up deletion is working. Yes, it's annoying that VACUUM still wastes effort on indexes where bottom-up deletion alone can do all required garbage collection. But that's not going to be a huge surprise to users. Whereas the "~99.9% append-only table" case causes huge surprises to users -- users hate this kind of thing. > If vacuum skips both ambulkdelete and amvacuumcleanup in that case, > I'm concerned that this could increase users who are affected by the > known issue of leaking deleted pages. Currently, users who are > affected by that is only users who use INDEX_CLEANUP off. But if we > enable this feature by default, all users potentially are affected by > that issue. FWIW I think that it's unfair to blame INDEX_CLEANUP for any problems in this area. The truth is that the design of the deleted-page-recycling stuff has always caused leaked pages, even with workloads that should not be challenging to the implementation in any way. See my later remarks. > To improve index tuple deletion in that case, skipping bulkdelete is > also a good idea whereas the retail index deletion is also a good > solution. I have thought the retail index deletion would be > appropriate to this case but since some index AM cannot support it > skipping index scans is a good solution anyway. The big problem with retail index tuple deletion is that it is not possible once heap pruning takes place (opportunistic pruning, or pruning performed by VACUUM). Pruning will destroy the information that retail deletion needs to find the index tuple (the column values). I think that we probably will end up using retail index tuple deletion, but it will only be one strategy among several. We'll never be able to rely on it, even within nbtree. My personal opinion is that completely eliminating VACUUM is not a useful long term goal. > > Here is how the triggering criteria could work: maybe skipping > > accessing all indexes during VACUUM happens when less than 1% or > > 10,000 of the items from the table are to be removed by VACUUM -- > > whichever is greater. Of course this is just the first thing I thought > > of. It's a starting point for further discussion. > > I also prefer your second idea :) Cool. Yeah, I always like it when the potential downside of a design is obviously low, and the potential upside is obviously very high. I am much less concerned about any uncertainty around when and how users will get the big upside. I like it when my problems are "luxury problems". 
:-) > As I mentioned above, I'm still concerned that the extent of users who > are affected by the issue of leaking deleted pages could get expanded. > Currently, we don't have a way to detect how many index pages are > leaked. If there are potential cases where many deleted pages are > leaked this feature would make things worse. The existing problems here were significant even before you added INDEX_CLEANUP. For example, let's say VACUUM deletes a page, and then later recycles it in the normal/correct way -- this is the simplest possible case for page deletion. The page is now in the FSM, ready to be recycled during the next page split. Or is it? Even in this case there are no guarantees! This is because _bt_getbuf() does not fully trust the FSM to give it a 100% recycle-safe page for its page split caller -- _bt_getbuf() checks the page using _bt_page_recyclable() (which is the same check that VACUUM does to decide a deleted page is now recyclable). Obviously this means that the FSM can "leak" a page, just because there happened to be no page splits before wraparound occurred (and so now _bt_page_recyclable() thinks the very old page is very new/in the future). In general the recycling stuff feels ridiculously over engineered to me. It is true that page deletion is intrinsically complicated, and is worth having -- that makes sense to me. But the complexity of the recycling stuff seems ridiculous. There is only one sensible solution: follow the example of commit 6655a7299d8 in nbtree. This commit fully fixed exactly th
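The wraparound hazard described above comes down to how 32-bit XIDs are compared. The following standalone illustration is not nbtree code; the helper merely mimics the circular comparison that _bt_page_recyclable ultimately relies on. It shows how a deleted page's remembered XID can stop looking recyclable once roughly 2^31 transactions have passed without the page being reused:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;

    /* Circular "a precedes b" test, same idea as PostgreSQL's TransactionIdPrecedes() */
    static bool
    xid_precedes(TransactionId a, TransactionId b)
    {
        return (int32_t) (a - b) < 0;
    }

    int
    main(void)
    {
        TransactionId deletion_xid = 1000000;   /* XID remembered when the page was deleted */

        TransactionId horizon_soon = 5000000;   /* horizon shortly afterwards */
        TransactionId horizon_later = deletion_xid + (UINT32_C(1) << 31) + 1000;

        printf("recyclable soon after deletion: %d\n",
               xid_precedes(deletion_xid, horizon_soon));    /* 1: looks recyclable */
        printf("recyclable ~2^31 XIDs later:    %d\n",
               xid_precedes(deletion_xid, horizon_later));   /* 0: looks "in the future" */
        return 0;
    }

This is why a deleted page that sits in the FSM across wraparound can be "leaked": the recyclability check stops returning true even though the page is, in reality, perfectly safe to reuse.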
Re: New IndexAM API controlling index vacuum strategies
On Tue, Feb 2, 2021 at 12:17 PM Peter Geoghegan wrote: > > On Fri, Jan 29, 2021 at 5:26 PM Peter Geoghegan wrote: > > It'll be essential to have good instrumentation as we do more > > benchmarking. We're probably going to have to make subjective > > assessments of benchmark results, based on multiple factors. That will > > probably be the only practical way to assess how much better (or > > worse) the patch is compared to master. This patch is more about > > efficiency and predictability than performance per se. Which is good, > > because that's where most of the real world problems actually are. > > I've been thinking about how to get this patch committed for > PostgreSQL 14. This will probably require cutting scope, so that the > initial commit is not so ambitious. I think that "incremental VACUUM" > could easily take up a lot of my time for Postgres 15, and maybe even > Postgres 16. > > I'm starting to think that the right short term goal should not > directly involve bottom-up index deletion. We should instead return to > the idea of "unifying" the vacuum_cleanup_index_scale_factor feature > with the INDEX_CLEANUP feature, which is kind of where this whole idea > started out at. This short term goal is much more than mere > refactoring. It is still a whole new user-visible feature. The patch > would teach VACUUM to skip doing any real index work within both > ambulkdelete() and amvacuumcleanup() in many important cases. > I agree to cut the scope. I've also been thinking about the impact of this patch on users. I also think we still have a lot of things to consider. For example, we need to consider and evaluate how incremental vacuum works for larger tuples or larger fillfactor, etc, and need to discuss more on the concept of leaving LP_DEAD in the space left by fillfactor is a good idea or not. Also, we need to discuss the changes in this patch to nbtree. Since the bottom-up index deletion is a new code for PG14, in a case where there is a problem in that, this feature could make things worse since this feature uses it. Perhaps we would need some safeguard and it also needs time. From that point of view, I think it’s a good idea to introduce these features to a different major version. Given the current situation, I agreed that 2 months is too short to do all things. > Here is a more detailed explanation: > > Today we can skip all significant work in ambulkdelete() and > amvacuumcleanup() when there are zero dead tuples in the table. But > why is the threshold *precisely* zero? If we could treat cases that > have "practically zero" dead tuples in the same way (or almost the > same way) as cases with *exactly* zero dead tuple, that's still a big > improvement. And it still sets an important precedent that is crucial > for the wider "incremental VACUUM" project: the criteria for > triggering index vacuuming becomes truly "fuzzy" for the first time. > It is "fuzzy" in the sense that index vacuuming might not happen > during VACUUM at all now, even when the user didn't explicitly use > VACUUUM's INDEX_CLEANUP option, and even when more than *precisely* > zero dead index tuples are involved (though not *much* more than zero, > can't be too aggressive). That really is a big change. 
> > A recap on vacuum_cleanup_index_scale_factor, just to avoid confusion: > > The reader should note that this is very different to Masahiko's > vacuum_cleanup_index_scale_factor project, which skips *cleanup* in > VACUUM (not bulk delete), a question which only comes up when there > are definitely zero dead index tuples. The unifying work I'm talking > about now implies that we completely avoid scanning indexes during > vacuum, even when they are known to have at least a few dead index > tuples, and even when VACUUM's INDEX_CLEANUP emergency option is not > in use. Which, as I just said, is a big change. If vacuum skips both ambulkdelete and amvacuumcleanup in that case, I'm concerned that this could increase users who are affected by the known issue of leaking deleted pages. Currently, users who are affected by that is only users who use INDEX_CLEANUP off. But if we enable this feature by default, all users potentially are affected by that issue. > > Thoughts on triggering criteria for new "unified" design, ~99.9% > append-only tables: > > Actually, in *one* sense the difference between "precisely zero" and > "practically zero" here *is* small. But it's still probably going to > result in skipping reading indexes during VACUUM in many important > cases. Like when you must VACUUM a table that is ~99.9% append-only. > In the real world things are rarely in exact discrete categories, even > when we imagine that they are. It's too easy to be wrong about one > tiny detail -- like one tiny UPDATE from 4 weeks ago, perhaps. Having > a tiny amount of "forgiveness" here is actually a *huge* improvement > on having precisely zero forgiveness. Small and big. > > This should help cases that get big surprising spikes due to > anti-w
Re: New IndexAM API controlling index vacuum strategies
On Tue, Feb 2, 2021 at 6:28 AM Victor Yegorov wrote: > I really like this idea!

Cool!

> It resembles the approach used in bottom-up index deletion, block-based > accounting provides a better estimate for the usefulness of the operation.

It does resemble bottom-up index deletion, in one important general sense: it is somewhat qualitative (though *also* somewhat quantitative). This is new for vacuumlazy.c. But the idea now is to deemphasize bottom-up-index-deletion-heavy workloads in the first version of this patch -- just to cut scope. The design I described yesterday centers around "~99.9% append-only table" workloads, where anti-wraparound vacuums that scan indexes are a big source of unnecessary work (in practice it is always anti-wraparound vacuums, simply because there will never be enough garbage to trigger a regular autovacuum run). But it now occurs to me that there is another very important case that it will also help, without making the triggering condition for index vacuuming any more complicated: it will help cases where HOT updates are expected (because none of the updates modify indexed columns).

It's practically impossible for HOT updates to occur 100% of the time, even with workloads whose updates never modify indexed columns. You can clearly see this by looking at the stats from pg_stat_user_tables with a standard pgbench workload. It does get better with lower heap fill factor, but I think that HOT is never 100% effective (i.e. 100% of updates are HOT updates) in the real world -- unless maybe you set heap fillfactor as low as 50, which is very rare. HOT might well be 95% effective, or 99% effective, but it's never truly 100% effective. And so this is another important workload where the difference between "practically zero dead tuples" and "precisely zero dead tuples" *really* matters when deciding if a VACUUM operation needs to go ahead. Once again, a small difference, but also a big difference.

Forgive me for repeating myself so much, but: paying attention to cost/benefit asymmetries like this one sometimes allows us to recognize an optimization that is an "excellent deal". We saw this with bottom-up index deletion. Seems good to keep an eye out for that.

> I suppose that 1% threshold should be configurable as a cluster-wide GUC > and also as a table storage parameter?

Possibly. I'm concerned about making any user-visible interface (say a GUC) compatible with an improved version that is smarter about bottom-up index deletion (in particular, one that can vacuum only a subset of the indexes on a table, which now seems too ambitious for Postgres 14). -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Tue, Feb 2, 2021 at 5:27 AM Peter Geoghegan wrote: > And now here is the second thing I thought of, which is much better: > > Sometimes 1% of the dead tuples in a heap relation will be spread > across 90%+ of the pages. With other workloads 1% of dead tuples might > be highly concentrated, and appear in no more than 1% of all heap > pages. Obviously the distinction between these two cases/workloads > matters a lot. And so the triggering criteria must be quantitative > *and* qualitative. It should not be based on counting dead tuples, > since that alone won't differentiate these two extreme cases - both of > which are probably quite common (in the real world extremes are > actually the normal and common case IME). > > I like the idea of basing it on counting *heap blocks*, not dead > tuples. We can count heap blocks that have *at least* one dead tuple > (of course it doesn't matter how they're dead, whether it was this > VACUUM operation or some earlier opportunistic pruning). Note in > particular that it should not matter if it's a heap block that has > only one LP_DEAD line pointer or a heap page that is near the > MaxHeapTuplesPerPage limit for the page -- we count either type of > page towards the heap-page based limit used to decide if index > vacuuming goes ahead for all indexes during VACUUM.

I really like this idea! It resembles the approach used in bottom-up index deletion: block-based accounting provides a better estimate of the usefulness of the operation. I suppose that 1% threshold should be configurable as a cluster-wide GUC and also as a table storage parameter? -- Victor Yegorov
Re: New IndexAM API controlling index vacuum strategies
On Mon, Feb 1, 2021 at 7:17 PM Peter Geoghegan wrote: > Here is how the triggering criteria could work: maybe skipping > accessing all indexes during VACUUM happens when less than 1% or > 10,000 of the items from the table are to be removed by VACUUM -- > whichever is greater. Of course this is just the first thing I thought > of. It's a starting point for further discussion. And now here is the second thing I thought of, which is much better: Sometimes 1% of the dead tuples in a heap relation will be spread across 90%+ of the pages. With other workloads 1% of dead tuples might be highly concentrated, and appear in no more than 1% of all heap pages. Obviously the distinction between these two cases/workloads matters a lot. And so the triggering criteria must be quantitative *and* qualitative. It should not be based on counting dead tuples, since that alone won't differentiate these two extreme cases - both of which are probably quite common (in the real world extremes are actually the normal and common case IME). I like the idea of basing it on counting *heap blocks*, not dead tuples. We can count heap blocks that have *at least* one dead tuple (of course it doesn't matter how they're dead, whether it was this VACUUM operation or some earlier opportunistic pruning). Note in particular that it should not matter if it's a heap block that has only one LP_DEAD line pointer or a heap page that is near the MaxHeapTuplesPerPage limit for the page -- we count either type of page towards the heap-page based limit used to decide if index vacuuming goes ahead for all indexes during VACUUM. This removes almost all of the issues with not setting visibility map bits reliably (another concern of mine that I forget to mention earlier), but it is still very likely to do the right thing in the "~99.9% append-only table" case I mentioned in the last email. We can be relatively aggressive (say by triggering index skipping when less than 1% of all heap pages have dead tuples). Plus the new nbtree index tuple deletion stuff is naturally very good at deleting index tuples in the event of dead tuples becoming concentrated in relatively few table blocks -- that helps too. This is true of the enhanced simple deletion mechanism (which has been taught to be clever about table block dead tuple concentrations/table layout), as well as bottom-up index deletion. -- Peter Geoghegan
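A minimal sketch of the per-page accounting described above, with invented names (not from any patch): during the first heap pass, a page counts once if it has any LP_DEAD items at all, regardless of whether it has one or hundreds.

    /*
     * Sketch only: called once per heap page after pruning.  The quantitative
     * detail -- how many LP_DEAD items the page has -- is deliberately ignored;
     * only the fact that the page has at least one matters for the
     * skip-index-vacuuming decision.
     */
    static void
    count_lpdead_page(long *lpdead_pages, int lpdead_items_on_page)
    {
        if (lpdead_items_on_page > 0)
            (*lpdead_pages)++;
    }

At the end of the heap scan, the resulting counter would be compared against a small fraction of the table's total pages (the 1% placeholder mentioned above) to decide whether index vacuuming goes ahead.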
Re: New IndexAM API controlling index vacuum strategies
On Fri, Jan 29, 2021 at 5:26 PM Peter Geoghegan wrote: > It'll be essential to have good instrumentation as we do more > benchmarking. We're probably going to have to make subjective > assessments of benchmark results, based on multiple factors. That will > probably be the only practical way to assess how much better (or > worse) the patch is compared to master. This patch is more about > efficiency and predictability than performance per se. Which is good, > because that's where most of the real world problems actually are. I've been thinking about how to get this patch committed for PostgreSQL 14. This will probably require cutting scope, so that the initial commit is not so ambitious. I think that "incremental VACUUM" could easily take up a lot of my time for Postgres 15, and maybe even Postgres 16. I'm starting to think that the right short term goal should not directly involve bottom-up index deletion. We should instead return to the idea of "unifying" the vacuum_cleanup_index_scale_factor feature with the INDEX_CLEANUP feature, which is kind of where this whole idea started out at. This short term goal is much more than mere refactoring. It is still a whole new user-visible feature. The patch would teach VACUUM to skip doing any real index work within both ambulkdelete() and amvacuumcleanup() in many important cases. Here is a more detailed explanation: Today we can skip all significant work in ambulkdelete() and amvacuumcleanup() when there are zero dead tuples in the table. But why is the threshold *precisely* zero? If we could treat cases that have "practically zero" dead tuples in the same way (or almost the same way) as cases with *exactly* zero dead tuple, that's still a big improvement. And it still sets an important precedent that is crucial for the wider "incremental VACUUM" project: the criteria for triggering index vacuuming becomes truly "fuzzy" for the first time. It is "fuzzy" in the sense that index vacuuming might not happen during VACUUM at all now, even when the user didn't explicitly use VACUUUM's INDEX_CLEANUP option, and even when more than *precisely* zero dead index tuples are involved (though not *much* more than zero, can't be too aggressive). That really is a big change. A recap on vacuum_cleanup_index_scale_factor, just to avoid confusion: The reader should note that this is very different to Masahiko's vacuum_cleanup_index_scale_factor project, which skips *cleanup* in VACUUM (not bulk delete), a question which only comes up when there are definitely zero dead index tuples. The unifying work I'm talking about now implies that we completely avoid scanning indexes during vacuum, even when they are known to have at least a few dead index tuples, and even when VACUUM's INDEX_CLEANUP emergency option is not in use. Which, as I just said, is a big change. Thoughts on triggering criteria for new "unified" design, ~99.9% append-only tables: Actually, in *one* sense the difference between "precisely zero" and "practically zero" here *is* small. But it's still probably going to result in skipping reading indexes during VACUUM in many important cases. Like when you must VACUUM a table that is ~99.9% append-only. In the real world things are rarely in exact discrete categories, even when we imagine that they are. It's too easy to be wrong about one tiny detail -- like one tiny UPDATE from 4 weeks ago, perhaps. Having a tiny amount of "forgiveness" here is actually a *huge* improvement on having precisely zero forgiveness. Small and big. 
This should help cases that get big surprising spikes due to anti-wraparound vacuums that must vacuum indexes for the first time in ages -- indexes may be vacuumed despite only having a tiny absolute number of dead tuples. I don't think that it's necessary to treat anti-wraparound vacuums as special at all (not in Postgres 14 and probably not ever), because simply considering cases where the table has "practically zero" dead tuples alone should be enough. Vacuuming a 10GB index to delete only 10 tuples simply makes no sense. It doesn't necessarily matter how we end up there, it just shouldn't happen. The ~99.9% append-only table case is likely to be important and common in the real world. We should start there for Postgres 14 because it's easier, that's all. It's not fundamentally different to what happens in workloads involving lots of bottom-up deletion -- it's just simpler, and easier to reason about. Bottom-up deletion is an important piece of the big puzzle here, but some variant of "incremental VACUUM" really would still make sense in a world where bottom-up index deletion does not exist. (In fact, I started thinking about "incremental VACUUM" before bottom-up index deletion, and didn't make any connection between them until work on bottom-up deletion had already progressed significantly.) Here is how the triggering criteria could work: maybe skipping accessing all indexes during VACUUM happens when less than 1% or 10,000 of the items
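For completeness, here is a sketch of the tuple-count criterion proposed in this message (the idea that the follow-up message earlier in this archive refines into the block-based version). Names and types are invented for illustration only: skip index vacuuming when the number of dead items is below 1% of the table's items or below 10,000, whichever threshold is greater.

    #include <stdbool.h>

    /* Sketch only: purely illustrative restatement of the "1% or 10,000" criterion */
    static bool
    few_enough_dead_items(double table_items, double dead_items)
    {
        double  threshold = table_items * 0.01;

        if (threshold < 10000.0)
            threshold = 10000.0;

        return dead_items < threshold;
    }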
Re: New IndexAM API controlling index vacuum strategies
On Tue, Jan 26, 2021 at 10:59 PM Masahiko Sawada wrote: > What value is set to fillfactor? 90, same as before. > That's very good. I'm happy that this patch efficiently utilizes > bottom-up index deletion feature. Me too! > Looking at the relation size growth, there is almost no difference > between master and patched in spite of skipping some vacuums in the > patched test, which is also good. Right. Stability is everything here. Actually I think that most performance problems in Postgres are mostly about stability if you really look into it. > > I did notice a problem, though. I now think that the criteria for > > skipping an index vacuum in the third patch from the series is too > > conservative, and that this led to an excessive number of index > > vacuums with the patch. > > Maybe that's why there are 5 autovacuum runs on pgbench_accounts in > the master branch whereas there are 7 runs in the patched? Probably, but it might also be due to some other contributing factor. There is still very little growth in the size of the indexes, and the PK still has zero growth. The workload consists of 32 hours of a 10ktps workload, so I imagine that there is opportunity for some extremely rare event to happen a few times. Too tired to think about it in detail right now. It might also be related to the simple fact that only one VACUUM process may run against a table at any given time! With a big enough table, and several indexes, and reasonably aggressive av settings, it's probably almost impossible for autovacuum to "keep up" (in the exact sense that the user asks for by having certain av settings). This must be taken into account in some general way -- It's a bit tricky to interpret results here, generally speaking, because there are probably a few things like that. To me, the most important thing is that the new behavior "makes sense" in some kind of general way, that applies across a variety of workloads. It may not be possible to directly compare master and patch like this and arrive at one simple number that is fair. If we really wanted one simple benchmark number, maybe we'd have to tune the patch and master separately -- which doesn't *seem* fair. > Also makes sense to me. The patch I recently submitted doesn't include > it but I'll do that in the next version patch. Great! > Maybe the same is true for heap? I mean that skipping heap vacuum on a > too-small table will not bring the benefit but bloat. I think we could > proceed with heap vacuum if a table is smaller than a threshold, even > if one of the indexes wanted to skip. I think that you're probably right about that. It isn't a problem for v2 in practice because the bloat will reliably cause LP_DEAD line pointers to accumulate in heap pages, so you VACUUM anyway -- this is certainly what you *always* see in the small pgbench tables with the default workload. But even then -- why not be careful? I agree that there should be some kind of limit on table size that applies here -- a size at which we'll never apply any of these optimizations, no matter what. > Yeah, I think the following information would also be helpful: > > * did vacuum heap? or skipped it? > * how many indexes did/didn't bulk-deletion? > * time spent for each vacuum phase. That list looks good -- in general I don't like that log_autovacuum cannot ever have the VACUUM VERBOSE per-index output -- maybe that could be revisited soon? I remember reading your e-mail about this on a recent thread, and I imagine that you already saw the connection yourself. 
It'll be essential to have good instrumentation as we do more benchmarking. We're probably going to have to make subjective assessments of benchmark results, based on multiple factors. That will probably be the only practical way to assess how much better (or worse) the patch is compared to master. This patch is more about efficiency and predictability than performance per se. Which is good, because that's where most of the real world problems actually are. -- Peter Geoghegan
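As an illustration of the kind of instrumentation being discussed here (all field names are invented for this sketch, nothing is taken from an actual patch), log_autovacuum output could be driven by a small per-VACUUM summary along these lines:

    #include <stdbool.h>

    typedef struct VacuumSkipStats
    {
        bool        did_heap_vacuum;        /* was the second heap pass performed? */
        int         indexes_bulkdeleted;    /* indexes that got ambulkdelete() */
        int         indexes_skipped;        /* indexes skipped this VACUUM */
        double      heap_scan_secs;         /* first heap pass (pruning) */
        double      index_vacuum_secs;      /* ambulkdelete()/amvacuumcleanup() work */
        double      heap_vacuum_secs;       /* second heap pass (marking items LP_UNUSED) */
    } VacuumSkipStats;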
Re: New IndexAM API controlling index vacuum strategies
On Fri, Jan 22, 2021 at 2:00 PM Peter Geoghegan wrote: > > On Tue, Jan 19, 2021 at 2:57 PM Peter Geoghegan wrote: > > Looks good. I'll give this version a review now. I will do a lot more > > soon. I need to come up with a good benchmark for this, that I can > > return to again and again during review as needed. > > I performed another benchmark, similar to the last one but with the > latest version (v2), and over a much longer period. Attached is a > summary of the whole benchmark, and log_autovacuum output from the > logs of both the master branch and the patch. Thank you for performing the benchmark! > > This was pgbench scale 2000, 4 indexes on pgbench_accounts, and a > transaction with one update and two selects. Each run was 4 hours, and > we alternate between patch and master for each run, and alternate > between 16 and 32 clients. There were 8 4 hour runs in total, meaning > the entire set of runs took 8 * 4 hours = 32 hours (not including > initial load time and a few other small things like that). I used a > 10k TPS rate limit, so TPS isn't interesting here. Latency is > interesting -- we see a nice improvement in latency (i.e. a reduction) > for the patch (see all.summary.out). What value is set to fillfactor? > > The benefits of the patch are clearly visible when I drill down and > look at the details. Each pgbench_accounts autovacuum VACUUM operation > can finish faster with the patch because they can often skip at least > some indexes (usually the PK, sometimes 3 out of 4 indexes total). But > it's more subtle than some might assume. We're skipping indexes that > VACUUM actually would have deleted *some* index tuples from, which is > very good. Bottom-up index deletion is usually lazy, and only > occasionally very eager, so you still have plenty of "floating > garbage" index tuples in most pages. And now we see VACUUM behave a > little more like bottom-up index deletion -- it is lazy when that is > appropriate (with indexes that really only have floating garbage that > is spread diffusely throughout the index structure), and eager when > that is appropriate (with indexes that get much more garbage). That's very good. I'm happy that this patch efficiently utilizes bottom-up index deletion feature. Looking at the relation size growth, there is almost no difference between master and patched in spite of skipping some vacuums in the patched test, which is also good. > > The benefit is not really that we're avoiding doing I/O for index > vacuuming (though that is one of the smaller benefits here). The real > benefit is that VACUUM is not dirtying pages, since it skips indexes > when it would be "premature" to vacuum them from an efficiency point > of view. This is important because we know that Postgres throughput is > very often limited by page cleaning. Also, the "economics" of this new > behavior make perfect sense -- obviously it's more efficient to delay > garbage cleanup until the point when the same page will be modified by > a backend anyway -- in the case of this benchmark via bottom-up index > deletion (which deletes all garbage tuples in the leaf page at the > point that it runs for a subset of pointed-to heap pages -- it's not > using an oldestXmin cutoff from 30 minutes ago). So whenever we dirty > a page, we now get more value per additional-page-dirtied. 
> > I believe that controlling the number of pages dirtied by VACUUM is > usually much more important than reducing the amount of read I/O from > VACUUM, for reasons I go into on the recent "vacuum_cost_page_miss > default value and modern hardware" thread. As a further consequence of > all this, VACUUM can go faster safely and sustainably (since the cost > limit is not affected so much by vacuum_cost_page_miss), which has its > own benefits (e.g. oldestXmin cutoff doesn't get so old towards the > end). > > Another closely related huge improvement that we see here is that the > number of FPIs generated by VACUUM can be significantly reduced. This > cost is closely related to the cost of dirtying pages, but it's worth > mentioning separately. You'll see some of that in the log_autovacuum > log output I attached. Makes sense. > > There is an archive with much more detailed information, including > dumps from most pg_stat_* views at key intervals. This has way more > information than anybody is likely to want: > > https://drive.google.com/file/d/1OTiErELKRZmYnuJuczO2Tfcm1-cBYITd/view?usp=sharing > > I did notice a problem, though. I now think that the criteria for > skipping an index vacuum in the third patch from the series is too > conservative, and that this led to an excessive number of index > vacuums with the patch. Maybe that's why there are 5 autovacuum runs on pgbench_accounts in the master branch whereas there are 7 runs in the patched? > This is probably because there was a tiny > number of page splits in some of the indexes that were not really > supposed to grow. I believe that this is caused by ANALYZE runni