Re: New strategies for freezing, advancing relfrozenxid early

2023-01-27 Thread Peter Geoghegan
On Fri, Jan 27, 2023 at 12:52 AM Andres Freund  wrote:
> I agree with bringing high-level context into the decision about whether to
> freeze aggressively - my problem with the eager freezing strategy patch isn't
> that it did that too much, it's that it didn't do it enough.
>
>
> But I also don't think what I describe above is really comparable to "table
> level" eager freezing though - the potential worst case overhead is a small
> fraction of the WAL volume, and there's zero increase in data write volume.

All I meant was that I initially thought that you were trying to
replace the FPI thing with something at the same level of ambition,
that could work in a low context way. But I now see that you're
actually talking about something quite a bit more ambitious for
Postgres 16, which is structurally similar to a freezing strategy,
from a code point of view -- it relies on high-level context for the
VACUUM/table as a whole. I wasn't equating it with the eager freezing
strategy in any other way.

It might also be true that this other thing happens to render the FPI
mechanism redundant. I'm actually not completely sure that it will
just yet. Let me verify my understanding of your proposal:

You mean that we'd take the page LSN before doing anything with the
page, right at the top of lazy_scan_prune, at the same point that
"fpi_before" is initialized currently. Then, if we subsequently
dirtied the page (as determined by its LSN, so as to focus on "dirtied
via WAL logged operation") during pruning, *and* if the "lsn_before"
of the page was from before our cutoff (derived via "lsn_threshold =
insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1" or similar),
*and* if the page is eligible to become all-frozen, then we'd freeze
the page.

That's it, right? It's about pages that *we* (VACUUM) dirtied, and
wrote records and/or FPIs for already?
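
In rough pseudocode, my understanding is something like this (a sketch only,
not actual code -- "lsn_of_last_vacuum" and the all-frozen eligibility check
are assumed to be available from somewhere):

    XLogRecPtr  lsn_before = PageGetLSN(page);  /* top of lazy_scan_prune */
    XLogRecPtr  insert_lsn = GetXLogInsertRecPtr();
    XLogRecPtr  lsn_threshold = insert_lsn -
        (XLogRecPtr) ((insert_lsn - lsn_of_last_vacuum) * 0.1);

    /* ... pruning happens here ... */

    if (PageGetLSN(page) != lsn_before &&   /* we dirtied it via WAL-logged op */
        lsn_before <= lsn_threshold &&      /* page hadn't been touched recently */
        page_will_become_all_frozen)        /* freezing sets it all-frozen */
    {
        /* freeze the page */
    }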

> I suspect the absolute worst case of "always freeze dirty pages" is when a
> single tuple on the page gets updated immediately after every time we freeze
> the page - a single tuple is where the freeze record is the least space
> efficient. The smallest update is about the same size as the smallest freeze
> record.  For that to amount to a large WAL increase you'd need a crazy rate of such
> updates interspersed with vacuums. In slightly more realistic cases (i.e. not
> column-less tuples that constantly get updated and freezing happening all the
> time) you end up with a reasonably small WAL rate overhead.

The other thing is that we'd be doing this in situations where we already
know that a VISIBLE record is required, which is comparable in size to
a FREEZE_PAGE record with one tuple/plan (around 64 bytes). The
smallest WAL records are mostly just generic WAL record header
overhead.

> Obviously that's a pointless workload, but I do think that
> analyzing the "outer boundaries" of the regression that something can cause can be
> helpful.

I agree about the "outer boundaries" being a useful guide.

> I think one way forward with the eager strategy approach would be to have a
> very narrow gating condition for now, and then incrementally expand it in
> later releases.
>
> One use-case where the eager strategy is particularly useful is
> [nearly-]append-only tables - and it's also the one workload that's reasonably
> easy to detect using stats. Maybe something like
> (dead_tuples_since_last_vacuum / inserts_since_last_vacuum) < 0.05
> or so.
>
> That'll definitely leave out loads of workloads where eager freezing would be
> useful - but are there semi-reasonable workloads where it'll hurt badly? I
> don't *think* so.

I have no further plans to work on eager freezing strategy, or
anything of the sort, in light of recent developments. My goal at this
point is very unambitious: to get the basic page-level freezing work
into a form that makes sense as a standalone thing for Postgres 16. To
put things on a good footing, so that I can permanently bow out of all
work on VACUUM having left everything in good order. That's all.

Now, that might still mean that I'd facilitate future work of this
sort, by getting the right basic structure in place. But my
involvement in any work on freezing or anything of the sort ends here,
both as a patch author and a committer of anybody else's work. I'm
proud of the work I've done on VACUUM, but I'm keen to move on from
it.

> > What about unlogged/temporary tables? The obvious thing to do there is
> > what I did in the patch that was reverted (freeze whenever the page
> > will thereby become all-frozen), and forget about LSNs. But you have
> > already objected to that part, specifically.
>
> My main concern about that is the data write amplification it could cause when
> the page is clean when we start freezing.  But I can't see a large potential
> downside to always freezing unlogged/temp tables when the page is already
> dirty.

But we have to dirty the page anyway, just to set PD_ALL_VISIBLE. That
was always a gating condition. Actually, that may have depen

Re: New strategies for freezing, advancing relfrozenxid early

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 12:53:58 -0500, Robert Haas wrote:
> On Fri, Jan 27, 2023 at 12:58 AM Andres Freund  wrote:
> > Essentially the "any fpi" logic is a very coarse grained way of using the page
> > LSN as a measurement. As I said, I don't think "has a checkpoint occurred
> > since the last write" is a good metric to avoid unnecessary freezing - it's
> > too coarse. But I think using the LSN is the right thought. What about
> > something like
> >
> >   lsn_threshold =  insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
> >   if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
> >  FreezeMe();
> >
> > I probably got some details wrong, what I am going for with lsn_threshold is
> > that we'd freeze an already dirty page if it's not been updated within 10% of
> > the LSN distance to the last VACUUM.
>
> I think this might not be quite the right idea for a couple of reasons.

It's definitely not perfect.  If we had an approximate LSN->time map as
general infrastructure, we could do a lot better. I think it'd be reasonably
easy to maintain that in the autovacuum launcher, for example.
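
To be concrete, a toy sketch of what I mean - entirely hypothetical, just a
fixed-size ring of (LSN, timestamp) samples that some background process
appends to periodically:

    #define LSN_TIME_MAP_ENTRIES 128

    typedef struct LsnTimeEntry
    {
        XLogRecPtr  lsn;
        TimestampTz time;
    } LsnTimeEntry;

    static LsnTimeEntry lsn_time_map[LSN_TIME_MAP_ENTRIES];
    static int  lsn_time_map_next = 0;

    /* called periodically, e.g. from the autovacuum launcher's main loop */
    static void
    lsn_time_map_sample(void)
    {
        lsn_time_map[lsn_time_map_next].lsn = GetXLogInsertRecPtr();
        lsn_time_map[lsn_time_map_next].time = GetCurrentTimestamp();
        lsn_time_map_next = (lsn_time_map_next + 1) % LSN_TIME_MAP_ENTRIES;
    }

    /* approximate the time at which a given LSN was the insert position */
    static TimestampTz
    lsn_time_map_lookup(XLogRecPtr lsn)
    {
        XLogRecPtr  best_lsn = InvalidXLogRecPtr;
        TimestampTz best_time = 0;

        for (int i = 0; i < LSN_TIME_MAP_ENTRIES; i++)
        {
            if (lsn_time_map[i].lsn != InvalidXLogRecPtr &&
                lsn_time_map[i].lsn <= lsn &&
                lsn_time_map[i].lsn >= best_lsn)
            {
                best_lsn = lsn_time_map[i].lsn;
                best_time = lsn_time_map[i].time;
            }
        }
        return best_time;   /* 0 means we have no sample that old */
    }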


One thing worth calling out here, because it's not obvious from the code
quoted above in isolation, is that what I was trying to refine here was the
decision when to perform opportunistic freezing *of already dirty pages that
do not require an FPI*.

So all that we need to prevent here is freezing very hotly updated data, where
the WAL overhead of the freeze records would be noticeable, because we
constantly VACUUM, due to the high turnover.


> First, suppose that the table is being processed just by autovacuum
> (no manual VACUUM operations) and that the rate of WAL generation is
> pretty even, so that LSN age is a good proxy for time. If autovacuum
> processes the table once per hour, this will freeze if it hasn't been
> updated in the last six minutes. That sounds good. But if autovacuum
> processes the table once per day, then this will freeze if it hasn't
> been updated in 2.4 hours. That might be OK, but it sounds a little on
> the long side.

You're right. I was thinking of the "lsn_since_last_vacuum" because I was
postulating it being useful elsewhere in the thread (but for eager strategy
logic) - but here that's really not very relevant.

Given that we're dealing with already dirty pages not requiring an FPI, I
think a much better "reference LSN" would be the LSN of the last checkpoint
(LSN of the last checkpoint record, not the current REDO pointer).


> Second, and more seriously, I think this would, in some circumstances,
> lead to tremendously unstable behavior. Suppose somebody does a bunch
> of work on a table and then they're like "oh, we should clean up,
> VACUUM" and it completes quickly because it's been a while since the
> last vacuum and so it doesn't freeze much. Then, for whatever reason,
> they decide to run it one more time, and it goes bananas and starts
> freezing all kinds of stuff because the LSN distance since the last
> vacuum is basically zero. Or equally, you run a manual VACUUM, and you
> get completely different behavior depending on how long it's been
> since the last autovacuum ran.

I don't think this quite applies to the scenario at hand, because it's
restricted to already dirty pages. And the max increased overhead is also
small due to that - so occasionally getting it wrong isn't that impactful.

Greetings,

Andres Freund




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-27 Thread Robert Haas
On Fri, Jan 27, 2023 at 12:58 AM Andres Freund  wrote:
> Essentially the "any fpi" logic is a very coarse grained way of using the page
> LSN as a measurement. As I said, I don't think "has a checkpoint occurred
> since the last write" is a good metric to avoid unnecessary freezing - it's
> too coarse. But I think using the LSN is the right thought. What about
> something like
>
>   lsn_threshold =  insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
>   if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
>  FreezeMe();
>
> I probably got some details wrong, what I am going for with lsn_threshold is
> that we'd freeze an already dirty page if it's not been updated within 10% of
> the LSN distance to the last VACUUM.

I think this might not be quite the right idea for a couple of reasons.

First, suppose that the table is being processed just by autovacuum
(no manual VACUUM operations) and that the rate of WAL generation is
pretty even, so that LSN age is a good proxy for time. If autovacuum
processes the table once per hour, this will freeze if it hasn't been
updated in the last six minutes. That sounds good. But if autovacuum
processes the table once per day, then this will freeze if it hasn't
been updated in 2.4 hours. That might be OK, but it sounds a little on
the long side. If autovacuum processes the table once per week, then
this will freeze if it hasn't been updated in 16.8 hours. That sounds
too conservative. Conversely, if autovacuum processes the table every
3 minutes, then this will freeze the data if it hasn't been updated in
the last 18 seconds, which sounds awfully aggressive. Maybe I'm wrong
here, but I feel like the absolute amount of wall-clock time we're
talking about here probably matters to some degree. I'm not sure
whether a strict time-based threshold like, say, 10 minutes would be a
good idea, leaving aside the difficulties of implementation. It might
be right to think that if the table is being vacuumed a lot, freezing
more aggressively is smart, and if it's being vacuumed infrequently,
freezing less aggressively is smart, because if the table has enough
activity that it's being vacuumed frequently, that might also be a
sign that we need to freeze more aggressively in order to avoid having
things go sideways. However, I'm not completely sure about that, and I
think it's possible that we need some guardrails to avoid going too
far in either direction.

Second, and more seriously, I think this would, in some circumstances,
lead to tremendously unstable behavior. Suppose somebody does a bunch
of work on a table and then they're like "oh, we should clean up,
VACUUM" and it completes quickly because it's been a while since the
last vacuum and so it doesn't freeze much. Then, for whatever reason,
they decide to run it one more time, and it goes bananas and starts
freezing all kinds of stuff because the LSN distance since the last
vacuum is basically zero. Or equally, you run a manual VACUUM, and you
get completely different behavior depending on how long it's been
since the last autovacuum ran.

In some ways, I think this proposal has many of the same problems as
vacuum_freeze_min_age. In both cases, the instinct is that we should
use something on the page to let us know how long it's been since the
page was modified, and proceed on the theory that if the page has not
been modified recently, it probably isn't about to be modified again.
That's a reasonable instinct, but the rate of XID advancement and the
rate of LSN advancement are both highly variable, even on a system
that's always under some load.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-27 Thread Peter Geoghegan
On Fri, Jan 27, 2023 at 6:48 AM Robert Haas  wrote:
> > One of the key strengths of systems like Postgres is the ability to
> > inexpensively store a relatively large amount of data that has just
> > about zero chance of being read, let alone modified. While at the same
> > time having decent OLTP performance for the hot data. Not nearly as
> > good as an in-memory system, mind you -- and yet in-memory systems
> > remain largely a niche thing.
>
> I think it's interesting that TPC-C suffers from the kind of problem
> that your patch was intended to address. I hadn't considered that. But
> I do not think it detracts from the basic point I was making, which is
> that you need to think about the downsides of your patch, not just the
> upsides.
>
> If you want to argue that there is *no* OLTP workload that will be
> harmed by freezing as aggressively as possible, then that would be a
> good argument in favor of your patch, because it would be arguing that
> the downside simply doesn't exist, at least for OLTP workloads. The
> fact that you can think of *one particular* OLTP workload that can
> benefit from the patch is just doubling down on the "my patch has an
> upside" argument, which literally no one is disputing.

You've treated me to another multi-paragraph talking-down, as if I was
still clinging to my original position, which is of course not the
case. I've literally said I'm done with VACUUM for good, and that I
just want to put a line under this. Yet you still persist in doing
this sort of thing. I'm not fighting you, I'm not fighting Andres.

I was making a point about the need to do something in this area in
general. That's all.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-27 Thread Robert Haas
On Thu, Jan 26, 2023 at 6:37 PM Peter Geoghegan  wrote:
> > I don't see what your reference to checkpoint timeout is about here?
> >
> > Also, as I mentioned before, the problem isn't specific to checkpoint_timeout
> > = 1min. It just makes it cheaper to reproduce.
>
> That's flagrantly intellectually dishonest.

This kind of ad hominem attack has no place on this mailing list, or
anywhere in the PostgreSQL community.

If you think there's a problem with Andres's test case, or his
analysis of it, you can talk about those problems without accusing him
of intellectual dishonesty.

I don't see anything to indicate that he was being intentionally
dishonest, either. At most he was mistaken. More than likely, not even
that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-27 Thread Robert Haas
On Thu, Jan 26, 2023 at 4:51 PM Peter Geoghegan  wrote:
> This is the kind of remark that makes me think that you don't get it.
>
> The most influential OLTP benchmark of all time is TPC-C, which has
> exactly this problem. In spades -- it's enormously disruptive. Which
> is one reason why I used it as a showcase for a lot of this work. Plus
> practical experience (like the Heroku database in the blog post I
> linked to) fully agrees with that benchmark, as far as this stuff goes
> -- that was also a busy OLTP database.
>
> Online transaction processing involves transactions. Right? There is presumably
> some kind of ledger, some kind of orders table. Naturally these have
> entries that age out fairly predictably. After a while, almost all the
> data is cold data. It is usually about that simple.
>
> One of the key strengths of systems like Postgres is the ability to
> inexpensively store a relatively large amount of data that has just
> about zero chance of being read, let alone modified. While at the same
> time having decent OLTP performance for the hot data. Not nearly as
> good as an in-memory system, mind you -- and yet in-memory systems
> remain largely a niche thing.

I think it's interesting that TPC-C suffers from the kind of problem
that your patch was intended to address. I hadn't considered that. But
I do not think it detracts from the basic point I was making, which is
that you need to think about the downsides of your patch, not just the
upsides.

If you want to argue that there is *no* OLTP workload that will be
harmed by freezing as aggressively as possible, then that would be a
good argument in favor of your patch, because it would be arguing that
the downside simply doesn't exist, at least for OLTP workloads. The
fact that you can think of *one particular* OLTP workload that can
benefit from the patch is just doubling down on the "my patch has an
upside" argument, which literally no one is disputing.

I don't think you can make such an argument stick, though. OLTP
workloads come in all shapes and sizes. It's pretty common to have
tables where the application inserts a bunch of data, updates it over
and over again, truncates the table, and starts over. In such a
case, aggressive freezing has to be a loss, because no freezing is
ever needed. It's also surprisingly common to have tables where a
bunch of data is inserted and then, after a bit of processing, a bunch
of rows are updated exactly once, after which the data is not modified
any further. In those kinds of cases, aggressive freezing is a great
idea if it happens after that round of updates but a poor idea if it
happens before that round of updates.

It's also pretty common to have cases where portions of the table
become very hot, get a lot of updates for a while, and then that part
of the table becomes cool and some other part of the table becomes
very hot for a while. I think it's possible that aggressive freezing
might do OK in such environments, actually. It will be a negative if
we aggressively freeze the part of the table that's currently hot, but
I think typically tables that have this access pattern are quite big,
so VACUUM isn't going to sweep through the table all that often. It
will probably freeze a lot more data-that-was-hot-a-bit-ago than it
will freeze data-that-is-hot-this-very-minute. Then again, maybe that
would happen without the patch, too. Maybe this kind of case is a wash
for your patch? I don't know.

Whatever you think of these examples, I don't see how it can be right
to suppose that *in general* freezing very aggressively has no
downsides. If that were true, then we probably wouldn't have
vacuum_freeze_min_age at all. We would always just freeze everything
ASAP. I mean, you could theorize that whoever invented that GUC is an
idiot and that they had absolutely no good reason for introducing it,
but that seems pretty ridiculous. Someone put guards against
overly-aggressive freezing into the system *for a reason* and if you
just go rip them all out, you're going to reintroduce the problems
against which they were intended to guard.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 00:51:59 -0800, Andres Freund wrote:
> One use-case where the eager strategy is particularly useful is
> [nearly-]append-only tables - and it's also the one workload that's reasonably
> easy to detect using stats. Maybe something like
> (dead_tuples_since_last_vacuum / inserts_since_last_vacuum) < 0.05
> or so.
> 
> That'll definitely leave out loads of workloads where eager freezing would be
> useful - but are there semi-reasonable workloads where it'll hurt badly? I
> don't *think* so.

That 0.05 could be a GUC + relopt combo, which'd allow users to opt tables
with a known usage pattern into always using eager freezing.
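
As a sketch, the gating condition could look something like this (none of
these stats fields or the GUC/relopt exist today; the names are made up):

    /* hypothetical: decide whether this VACUUM uses the eager freezing strategy */
    static bool
    should_use_eager_freezing(PgStat_StatTabEntry *tabentry)
    {
        double  inserts = Max((double) tabentry->inserts_since_last_vacuum, 1.0);
        double  dead = (double) tabentry->dead_tuples_since_last_vacuum;

        /* reloption if set, else the GUC; something like 0.05 by default */
        return (dead / inserts) < vacuum_eager_freeze_insert_ratio;
    }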

Greetings,

Andres Freund




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-26 23:11:41 -0800, Peter Geoghegan wrote:
> > Essentially the "any fpi" logic is a very coarse grained way of using the page
> > LSN as a measurement. As I said, I don't think "has a checkpoint occurred
> > since the last write" is a good metric to avoid unnecessary freezing - it's
> > too coarse. But I think using the LSN is the right thought. What about
> > something like
> >
> >   lsn_threshold =  insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
> >   if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
> >  FreezeMe();
> >
> > I probably got some details wrong, what I am going for with lsn_threshold is
> > that we'd freeze an already dirty page if it's not been updated within 10% of
> > the LSN distance to the last VACUUM.
> 
> It seems to me that you're reinventing something akin to eager
> freezing strategy here. At least that's how I define it, since now
> you're bringing the high level context into it; what happens with the
> table, with VACUUM operations, and so on. Obviously this requires
> tracking the metadata that you suppose will be available in some way
> or other, in particular things like lsn_of_last_vacuum.

I agree with bringing high-level context into the decision about whether to
freeze aggressively - my problem with the eager freezing strategy patch isn't
that it did that too much, it's that it didn't do it enough.


But I also don't think what I describe above is really comparable to "table
level" eager freezing though - the potential worst case overhead is a small
fraction of the WAL volume, and there's zero increase in data write volume. I
suspect the absolute worst case of "always freeze dirty pages" is when a
single tuple on the page gets updated immediately after every time we freeze
the page - a single tuple is where the freeze record is the least space
efficient. The smallest update is about the same size as the smallest freeze
record.  For that to amount to a large WAL increase you'd need a crazy rate of such
updates interspersed with vacuums. In slightly more realistic cases (i.e. not
column-less tuples that constantly get updated and freezing happening all the
time) you end up with a reasonably small WAL rate overhead.

That worst case of "freeze dirty" is bad enough to spend some brain and
compute cycles to prevent. But if we don't always get it right in some
workload, it's not *awful*.


The worst case of the "eager freeze strategy" is a lot larger - it's probably
something like updating one narrow tuple on every page, once per checkpoint, so
that each freeze generates an FPI. I think that results in a max overhead of
2x for data writes, and about 150x for WAL volume (the ratio of one plain
update record to one carrying an FPI). Obviously that's a pointless workload,
but I do think that analyzing the "outer boundaries" of the regression that
something can cause can be helpful.


I think one way forward with the eager strategy approach would be to have a
very narrow gating condition for now, and then incrementally expand it in
later releases.

One use-case where the eager strategy is particularly useful is
[nearly-]append-only tables - and it's also the one workload that's reasonably
easy to detect using stats. Maybe something like
(dead_tuples_since_last_vacuum / inserts_since_last_vacuum) < 0.05
or so.

That'll definitely leave out loads of workloads where eager freezing would be
useful - but are there semi-reasonable workloads where it'll hurt badly? I
don't *think* so.


> What about unlogged/temporary tables? The obvious thing to do there is
> what I did in the patch that was reverted (freeze whenever the page
> will thereby become all-frozen), and forget about LSNs. But you have
> already objected to that part, specifically.

My main concern about that is the data write amplification it could cause when
the page is clean when we start freezing.  But I can't see a large potential
downside to always freezing unlogged/temp tables when the page is already
dirty.


> BTW, you still haven't changed the fact that you get rather different
> behavior with checksums/wal_log_hints. I think that that's good, but
> you didn't seem to.

I think that, if we had something like the recency test I was talking about,
we could afford to always freeze when the page is already dirty and not very
recently modified. I.e. not even insist on a WAL record having been generated
during pruning/HTSV.  But I need to think through the dangers of that more.

Greetings,

Andres Freund




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 9:58 PM Andres Freund  wrote:
> It doesn't seem like a great proxy to me. ISTM that this means that how
> aggressive vacuum is about opportunistically freezing pages depends on config
> variables like checkpoint_timeout & max_wal_size (less common opportunistic
> freezing), full_page_writes & use of unlogged tables (no opportunistic
> freezing), and the largely random scheduling of autovac workers.

The FPI thing was originally supposed to complement the freezing
strategies stuff, and possibly other rules that live in
lazy_scan_prune. Obviously you can freeze a page by following any rule
that you care to invent -- you can decide by calling random(). Two
rules can coexist during the same VACUUM (actually, they do already).

> Essentially the "any fpi" logic is a very coarse grained way of using the page
> LSN as a measurement. As I said, I don't think "has a checkpoint occurred
> since the last write" is a good metric to avoid unnecessary freezing - it's
> too coarse. But I think using the LSN is the right thought. What about
> something like
>
>   lsn_threshold =  insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
>   if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
>  FreezeMe();
>
> I probably got some details wrong, what I am going for with lsn_threshold is
> that we'd freeze an already dirty page if it's not been updated within 10% of
> the LSN distance to the last VACUUM.

It seems to me that you're reinventing something akin to eager
freezing strategy here. At least that's how I define it, since now
you're bringing the high level context into it; what happens with the
table, with VACUUM operations, and so on. Obviously this requires
tracking the metadata that you suppose will be available in some way
or other, in particular things like lsn_of_last_vacuum.

What about unlogged/temporary tables? The obvious thing to do there is
what I did in the patch that was reverted (freeze whenever the page
will thereby become all-frozen), and forget about LSNs. But you have
already objected to that part, specifically.

BTW, you still haven't changed the fact that you get rather different
behavior with checksums/wal_log_hints. I think that that's good, but
you didn't seem to.

> I don't think the speculation is that fundamentally different - a heavily
> updated table with a bit of a historic, non-changing portion, makes
> vacuum_freeze_strategy_threshold freeze way more aggressively than either "any
> record" or "any fpi".

That's true. The point I was making is that both this proposal and
eager freezing are based on some kind of high level picture of the
needs of the table, based on high level metadata. To me that's the
defining characteristic.

> > And even when we lose, you generally still won't have been completely
> > wrong -- even then there generally will indeed be a second FPI later
> > on for the same page, to go with everything else. This makes the
> > wasted freezing even less significant, on a comparative basis!
>
> This is precisely why I think that we can afford to be quite aggressive about
> freezing already dirty pages...

I'm beginning to warm to this idea, now that I understand it a little better.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 19:01:03 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 6:37 PM Andres Freund  wrote:
> > I also don't really see how that is responsive to anything else in my
> > email. That's just as true for the current gating condition (the issuance of
> > an FPI during heap_page_prune() / HTSV()).
> >
> > What I was wondering about is whether we should replace the
> >   fpi_before != pgWalUsage.wal_fpi
> > with
> >   records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)
>
> I understand that. What I'm saying is that that's going to create a
> huge problem of its own, unless you separately account for that
> problem.

> The simplest and obvious example is something like a pgbench_tellers
> table. VACUUM will generally run fast enough relative to the workload
> that it will set some of those pages all-visible. Now it's going to
> freeze them, too. Arguably it shouldn't even be setting the pages
> all-visible, but now you make that existing problem much worse.

So the benefit of the FPI condition is that it indicates that the page hasn't
been updated all that recently, because, after all, a checkpoint has happened
since?  If that's the intention, it needs a huge honking comment - at least I
can't read that out of:

  Also freeze when pruning generated an FPI, if doing so means that we set the
  page all-frozen afterwards (might not happen until final heap pass).


It doesn't seem like a great proxy to me. ISTM that this means that how
aggressive vacuum is about opportunistically freezing pages depends on config
variables like checkpoint_timeout & max_wal_size (less common opportunistic
freezing), full_page_writes & use of unlogged tables (no opportunistic
freezing), and the largely random scheduling of autovac workers.


I can see it making a difference for pgbench_tellers, but it's a pretty small
difference in overall WAL volume. I can think of more adverse workloads though
- but even there the difference seems not huge, and not predictably
reached. Due to the freeze plan stuff you added, the amount of WAL for
freezing a page is pretty darn small compared to the amount of WAL needed to
fill a page with non-frozen tuples.

That's not to say we shouldn't reduce the risk - I agree that both the "any
fpi" and the "any record" condition can have adverse effects!


However, an already dirty page getting frozen is also the one case where
freezing won't have a meaningful write amplification effect. So I think it's
worth spending effort figuring out how we can make freezing in that situation
have unlikely and small downsides.


The cases with downsides are tables that are very heavily updated throughout,
where the page is going to be defrosted again almost immediately. As you say,
the all-visible marking has a similar problem.


Essentially the "any fpi" logic is a very coarse grained way of using the page
LSN as a measurement. As I said, I don't think "has a checkpoint occurred
since the last write" is a good metric to avoid unnecessary freezing - it's
too coarse. But I think using the LSN is the right thought. What about
something like

  lsn_threshold =  insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
  if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
 FreezeMe();

I probably got some details wrong, what I am going for with lsn_threshold is
that we'd freeze an already dirty page if it's not been updated within 10% of
the LSN distance to the last VACUUM.



> The important point is that there doesn't seem to be any good way
> around thinking about the table as a whole if you're going to freeze
> speculatively. This is not the same dynamic as we see with the FPI
> thing IMV -- that's not nearly so speculative as what you're talking
> about, since it is speculative in roughly the same sense that eager
> freezing was speculative (hence the suggestion that something like
> vacuum_freeze_strategy_threshold could have a role to play).

I don't think the speculation is that fundamentally different - a heavily
updated table with a bit of a historic, non-changing portion, makes
vacuum_freeze_strategy_threshold freeze way more aggressively than either "any
record" or "any fpi".


> The FPI thing is mostly about the cost now versus the cost later on.
> You're gambling that you won't get another FPI later on if you freeze
> now. But the cost of a second FPI later on is so much higher than the
> added cost of freezing now that that's a very favorable bet, that we
> can afford to "lose" many times while still coming out ahead overall.

Agreed. And not just avoiding FPIs, avoiding another dirtying of the page! The
latter part is especially huge IMO. Depending on s_b size it can also avoid
another *read* of the page...


> And even when we lose, you generally still won't have been completely
> wrong -- even then there generally will indeed be a second FPI later
> on for the same page, to go with everything else. This makes the
> wasted freezing even less significant, on a comparative basis!

Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 6:37 PM Andres Freund  wrote:
> I also don't really see how that is responsive to anything else in my
> email. That's just as true for the current gating condition (the issuance of
> an FPI during heap_page_prune() / HTSV()).
>
> What I was wondering about is whether we should replace the
>   fpi_before != pgWalUsage.wal_fpi
> with
>   records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)

I understand that. What I'm saying is that that's going to create a
huge problem of its own, unless you separately account for that
problem.

The simplest and obvious example is something like a pgbench_tellers
table. VACUUM will generally run fast enough relative to the workload
that it will set some of those pages all-visible. Now it's going to
freeze them, too. Arguably it shouldn't even be setting the pages
all-visible, but now you make that existing problem much worse.

The important point is that there doesn't seem to be any good way
around thinking about the table as a whole if you're going to freeze
speculatively. This is not the same dynamic as we see with the FPI
thing IMV -- that's not nearly so speculative as what you're talking
about, since it is speculative in roughly the same sense that eager
freezing was speculative (hence the suggestion that something like
vacuum_freeze_strategy_threshold could have a role to play).

The FPI thing is mostly about the cost now versus the cost later on.
You're gambling that you won't get another FPI later on if you freeze
now. But the cost of a second FPI later on is so much higher than the
added cost of freezing now that that's a very favorable bet, that we
can afford to "lose" many times while still coming out ahead overall.
And even when we lose, you generally still won't have been completely
wrong -- even then there generally will indeed be a second FPI later
on for the same page, to go with everything else. This makes the
wasted freezing even less significant, on a comparative basis!

It's also likely true that an FPI in lazy_scan_prune is a much
stronger signal, but I think that the important dynamic is that we're
reasoning about "costs now vs costs later on". The asymmetry is really
important.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 15:36:52 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 12:45 PM Andres Freund  wrote:
> > > Most of the overhead of FREEZE WAL records (with freeze plan
> > > deduplication and page-level freezing in) is generic WAL record header
> > > overhead. Your recent adversarial test case is going to choke on that,
> > > too. At least if you set checkpoint_timeout to 1 minute again.
> >
> > I don't quite follow. What do you mean with "record header overhead"? Unless
> > that includes FPIs, I don't think that's that commonly true?
>
> Even if there are no directly observable FPIs, there is still extra
> WAL, which can cause FPIs indirectly, just by making checkpoints more
> frequent. I feel ridiculous even having to explain this to you.

What does that have to do with "generic WAL record overhead"?


I also don't really see how that is responsive to anything else in my
email. That's just as true for the current gating condition (the issuance of
an FPI during heap_page_prune() / HTSV()).

What I was wondering about is whether we should replace the
  fpi_before != pgWalUsage.wal_fpi
with
  records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)
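
I.e., very roughly (just a sketch; WouldIssueFpi() doesn't exist as such, but
something along the lines of XLogCheckBufferNeedsBackup() could provide it):

    /* captured before heap_page_prune(), alongside the existing fpi_before */
    int64       records_before = pgWalUsage.wal_records;

    /* ... heap_page_prune() runs here ... */

    /*
     * Freeze opportunistically if pruning already emitted some WAL record for
     * this page and freezing it now would not force a new FPI.
     */
    if (pgWalUsage.wal_records != records_before &&
        !XLogCheckBufferNeedsBackup(buf))
        freeze_opportunistically = true;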


> > The problematic case I am talking about is when we do *not* emit a WAL record
> > during pruning (because there's nothing to prune), but want to freeze the
> > table. If you don't log an FPI, the remaining big overhead is that increasing
> > the LSN on the page will often cause an XLogFlush() when writing out the
> > buffer.
> >
> > I don't see what your reference to checkpoint timeout is about here?
> >
> > Also, as I mentioned before, the problem isn't specific to checkpoint_timeout
> > = 1min. It just makes it cheaper to reproduce.
>
> That's flagrantly intellectually dishonest. Sure, it made it easier to
> reproduce. But that's not all it did!
>
> You had *lots* of specific numbers and technical details in your first
> email, such as "Time for vacuuming goes up to ~5x. WAL volume to
> ~9x.". But you did not feel that it was worth bothering with details
> like having set checkpoint_timeout to 1 minute, which is a setting
> that nobody uses, and obviously had a multiplicative effect. That
> detail was unimportant. I had to drag it out of you!

The multiples were for checkpoint_timeout=5min, with
WHERE ts < now() - '250s' instead of WHERE ts < now() - '120s'

I started out with checkpoint_timeout=1min, as I wanted to quickly test my
theory. Then I increased checkpoint_timeout back to 5min, adjusted the query
to some randomly guessed value. Happened to get nearly the same results.

I then experimented more with '1min', because it's less annoying to have to
wait for 120s until deletions start, than to wait for 250s. Because it's
quicker to run I thought I'd share the less resource intensive version. A
mistake as I now realize.


This wasn't intended as a carefully designed benchmark, or anything. It was a
quick proof for a problem that I found obvious. And it's not something worth
testing carefully - e.g. the constants in the test are actually quite hardware
specific, because the insert/seconds rate is very machine specific, and it's
completely unnecessarily hardware intensive due to the use of single-row
inserts, instead of batched operations.  It's just a POC.



> You basically found a way to add WAL overhead to a system/workload
> that is already in a write amplification vicious cycle, with latent
> tipping point type behavior.
>
> There is a practical point here, that is equally obvious, and yet
> somehow still needs to be said: benchmarks like that one are basically
> completely free of useful information. If we can't agree on how to
> assess such things in general, then what can we agree on when it comes
> to what should be done about it, what trade-off to make, when it comes
> to any similar question?

It's not at all free of useful information. It reproduces a problem I
predicted repeatedly, that others in the discussion also wondered about, that
you refused to acknowledge or address.

It's not a good benchmark - I completely agree with that much. It was not
designed to carefully benchmark different settings or such. It was designed to
show a problem. And it does that.



> > You're right, it makes sense to consider whether we'll emit a
> > XLOG_HEAP2_VISIBLE anyway.
>
> As written the page-level freezing FPI mechanism probably doesn't
> really stand to benefit much from doing that. Either checksums are
> disabled and it's just a hint, or they're enabled and there is a very
> high chance that we'll get an FPI inside lazy_scan_prune rather than
> right after it is called, when PD_ALL_VISIBLE is set.

I think it might be useful with logged hint bits; consider cases where all the
tuples on the page were already fully hinted. That's not uncommon, I think?


> > > > A less aggressive version would be to check if any WAL records were emitted
> > > > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI

Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 14:27:53 -0500, Robert Haas wrote:
> One idea that I've had about how to solve this problem is to try to
> make vacuum try to aggressively freeze some portion of the table on
> each pass, and to behave less aggressively on the rest of the table so
> that, hopefully, no single vacuum does too much work.

I agree that this rough direction is worthwhile to purse.


> Unfortunately, I don't really know how to do that effectively. If we knew
> that the table was going to see 10 vacuums before we hit
> autovacuum_freeze_max_age, we could try to have each one do 10% of the
> amount of freezing that was going to need to be done rather than letting any
> single vacuum do all of it, but we don't have that sort of information.

I think, quite fundamentally, it's not possible to bound the amount of work an
anti-wraparound vacuum has to do if we don't have an age-based autovacuum
trigger kicking in before autovacuum_freeze_max_age. After all, there might be
no autovacuum at all before autovacuum_freeze_max_age is reached.

But there's just no reason to not have a trigger below
autovacuum_freeze_max_age. That's why I think Peter's patch to split age and
anti-"auto-cancel" autovacuums is an strictly necessary change if we want to
make autovacuum fundamentally suck less. There's a few boring details to
figure out how to set/compute those limits, but I don't think there's anything
fundamentally hard.


I think we also need the number of all-frozen pages in pg_class if we want to
make better scheduling decisions. As we already compute the number of
all-visible pages at the end of vacuuming, we can compute the number of
all-frozen pages as well. The space for another integer in pg_class doesn't
bother me one bit.


Let's say we had an autovacuum_vacuum_age trigger of 100m, and
autovacuum_freeze_max_age=500m. We know that we're roughly going to be
vacuuming 5 times before reaching autovacuum_freeze_max_age (very slow
autovacuums are an issue, but if one autovacuum takes 100m+ xids long, there's
not much we can do).

With that we could determine the eager percentage along the lines of:
  frozen_target = Min(age(relfrozenxid), autovacuum_freeze_max_age) / autovacuum_freeze_max_age
  eager_percentage = Min(0, frozen_target * relpages - pg_class.relallfrozen * relpages)
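
Written out, one reading of that calculation would be something like the
following sketch (relallfrozen being the hypothetical new pg_class column; I
may well have the exact arithmetic wrong):

    static double
    eager_freeze_fraction(uint32 relfrozenxid_age,  /* age(relfrozenxid) */
                          BlockNumber relpages,
                          BlockNumber relallfrozen)
    {
        double      frozen_target;
        double      pages_to_freeze;

        /* fraction of the way to autovacuum_freeze_max_age, capped at 1.0 */
        frozen_target = Min((double) relfrozenxid_age,
                            (double) autovacuum_freeze_max_age) /
                        (double) autovacuum_freeze_max_age;

        /* pages that "should" already be frozen, minus those that actually are */
        pages_to_freeze = Max(0.0, frozen_target * relpages - relallfrozen);

        /* as a fraction of the table, to drive this VACUUM's eagerness */
        return relpages > 0 ? pages_to_freeze / relpages : 0.0;
    }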

One thing I don't know fully how to handle is how to ensure that we try to
freeze a different part of the table each vacuum. I guess we could store a
page number in pgstats?


This would help address the "cliff" issue of reaching
autovacuum_freeze_max_age. What it would *not* address, on its own, is the
number of times we rewrite pages.

I can guess at a few ways to heuristically identify when tables are "append
mostly" from vacuum's view (a table can be update heavy, but very localized to
recent rows, and still be append mostly from vacuum's view).  There's obvious
cases, e.g. when there are way more inserts than dead rows.  But other cases
are harder.



> Also, even if we did have that sort of information, the idea only works if
> the pages that we freeze sooner are ones that we're not about to update or
> delete again, and we don't have any idea what is likely there.

Perhaps we could use something like
  (age(relfrozenxid) - age(newest_xid_on_page)) / age(relfrozenxid)
as a heuristic?
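
I.e. something like this sketch, with newest_xid_on_page being whatever we can
cheaply derive while scanning the page (and ignoring wraparound niceties):

    TransactionId nextxid = ReadNextTransactionId();
    uint32      rel_age = nextxid - relfrozenxid;
    uint32      page_age = nextxid - newest_xid_on_page;
    double      cold_fraction = rel_age > 0 ?
        (double) (rel_age - page_age) / rel_age : 0.0;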

I have a gut feeling that we should somehow collect/use statistics about the
number of frozen pages, marked as such by the last (or recent?) vacuum, that
had to be "defrosted" by backends. But I don't quite know how to yet.  I think
we could collect statistics about that by storing the LSN of the last vacuum
in the shared stats, and incrementing that counter when defrosting.

A lot of things like that would work a whole lot better if we had statistics
that take older data into account, but weigh it less than more recent
data. But that's hard/expensive to collect.


Greetings,

Andres Freund




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 12:45 PM Andres Freund  wrote:
> > Most of the overhead of FREEZE WAL records (with freeze plan
> > deduplication and page-level freezing in) is generic WAL record header
> > overhead. Your recent adversarial test case is going to choke on that,
> > too. At least if you set checkpoint_timeout to 1 minute again.
>
> I don't quite follow. What do you mean with "record header overhead"? Unless
> that includes FPIs, I don't think that's that commonly true?

Even if there are no directly observable FPIs, there is still extra
WAL, which can cause FPIs indirectly, just by making checkpoints more
frequent. I feel ridiculous even having to explain this to you.

> The problematic case I am talking about is when we do *not* emit a WAL record
> during pruning (because there's nothing to prune), but want to freeze the
> table. If you don't log an FPI, the remaining big overhead is that increasing
> the LSN on the page will often cause an XLogFlush() when writing out the
> buffer.
>
> I don't see what your reference to checkpoint timeout is about here?
>
> Also, as I mentioned before, the problem isn't specific to checkpoint_timeout
> = 1min. It just makes it cheaper to reproduce.

That's flagrantly intellectually dishonest. Sure, it made it easier to
reproduce. But that's not all it did!

You had *lots* of specific numbers and technical details in your first
email, such as "Time for vacuuming goes up to ~5x. WAL volume to
~9x.". But you did not feel that it was worth bothering with details
like having set checkpoint_timeout to 1 minute, which is a setting
that nobody uses, and obviously had a multiplicative effect. That
detail was unimportant. I had to drag it out of you!

You basically found a way to add WAL overhead to a system/workload
that is already in a write amplification vicious cycle, with latent
tipping point type behavior.

There is a practical point here, that is equally obvious, and yet
somehow still needs to be said: benchmarks like that one are basically
completely free of useful information. If we can't agree on how to
assess such things in general, then what can we agree on when it comes
to what should be done about it, what trade-off to make, when it comes
to any similar question?

> > In many cases we'll have to dirty the page anyway, just to set
> > PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether
> > triggered by an FPI or triggered by my now-reverted GUC) on being able
> > to set the whole page all-frozen in the VM.
>
> IIRC setting PD_ALL_VISIBLE doesn't trigger an FPI unless we need to log hint
> bits. But freezing does trigger one even without wal_log_hint_bits.

That is correct.

> You're right, it makes sense to consider whether we'll emit a
> XLOG_HEAP2_VISIBLE anyway.

As written the page-level freezing FPI mechanism probably doesn't
really stand to benefit much from doing that. Either checksums are
disabled and it's just a hint, or they're enabled and there is a very
high chance that we'll get an FPI inside lazy_scan_prune rather than
right after it is called, when PD_ALL_VISIBLE is set.

That's not perfect, of course, but it doesn't have to be. Perhaps it
should still be improved, just on general principle.

> > > A less aggressive version would be to check if any WAL records were emitted
> > > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
> > > modified the page again. Similar to what we do now, except not requiring an
> > > FPI to have been emitted.
> >
> > Also way more aggressive. Not nearly enough on its own.
>
> In which cases will it be problematically more aggressive?
>
> If we emitted a WAL record during pruning we've already set the LSN of the
> page to a very recent LSN. We know the page is dirty. Thus we'll already
> trigger an XLogFlush() during ringbuffer replacement. We won't emit an FPI.

You seem to be talking about this as if the only thing that could
matter is the immediate FPI -- the first order effects -- and not any
second order effects. You certainly didn't get to 9x extra WAL
overhead by controlling for that before. Should I take it that you've
decided to assess these things more sensibly now? Out of curiosity:
why the change of heart?

> > > But to me it seems a bit odd that VACUUM now is more aggressive if checksums /
> > > wal_log_hint bits is on, than without them. Which I think is how using either
> > > of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?
> >
> > Which part is the odd part? Is it odd that page-level freezing works
> > that way, or is it odd that page-level checksums work that way?
>
> That page-level freezing works that way.

I think that it will probably cause a little confusion, and should be
specifically documented. But other than that, it seems reasonable
enough to me. I mean, should I not do something that's going to be of
significant help to users with checksums, just because it'll be
somewhat confusing to a small minority?

Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 1:22 PM Robert Haas  wrote:
> On Thu, Jan 26, 2023 at 4:06 PM Peter Geoghegan  wrote:
> > There is very good reason to believe that the large majority of all
> > data that people store in a system like Postgres is extremely cold
> > data:
>
> The systems where I end up troubleshooting problems seem to be, most
> typically, busy OLTP systems. I'm not in a position to say whether
> that's more or less common than systems with extremely cold data, but
> I am in a position to say that my employer will have a lot fewer happy
> customers if we regress that use case. Naturally I'm keen to avoid
> that.

This is the kind of remark that makes me think that you don't get it.

The most influential OLTP benchmark of all time is TPC-C, which has
exactly this problem. In spades -- it's enormously disruptive. Which
is one reason why I used it as a showcase for a lot of this work. Plus
practical experience (like the Heroku database in the blog post I
linked to) fully agrees with that benchmark, as far as this stuff goes
-- that was also a busy OLTP database.

Online transaction processing involves transactions. Right? There is presumably
some kind of ledger, some kind of orders table. Naturally these have
entries that age out fairly predictably. After a while, almost all the
data is cold data. It is usually about that simple.

One of the key strengths of systems like Postgres is the ability to
inexpensively store a relatively large amount of data that has just
about zero chance of being read, let alone modified. While at the same
time having decent OLTP performance for the hot data. Not nearly as
good as an in-memory system, mind you -- and yet in-memory systems
remain largely a niche thing.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Robert Haas
On Thu, Jan 26, 2023 at 4:06 PM Peter Geoghegan  wrote:
> There is very good reason to believe that the large majority of all
> data that people store in a system like Postgres is extremely cold
> data:

The systems where I end up troubleshooting problems seem to be, most
typically, busy OLTP systems. I'm not in a position to say whether
that's more or less common than systems with extremely cold data, but
I am in a position to say that my employer will have a lot fewer happy
customers if we regress that use case. Naturally I'm keen to avoid
that.

> Having a separate aggressive step that rewrites an entire large table,
> apparently at random, is just a huge burden to users. You've said that
> you agree that it sucks, but somehow I still can't shake the feeling
> that you don't fully understand just how much it sucks.

Ha!

Well, that's possible. But maybe you don't understand how much your
patch makes other things suck.

I don't think we can really get anywhere here by postulating that the
problem is the other person's lack of understanding, even if such a
postulate should happen to be correct.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 12:54 PM Robert Haas  wrote:
> > The overwhelming cost is usually FPIs in any case. If you're not
> > mostly focussing on that, you're focussing on the wrong thing. At
> > least with larger tables. You just have to focus on the picture over
> > time, across multiple VACUUM operations.
>
> I think that's all mostly true, but the cases where being more
> aggressive can cause *extra* FPIs are worthy of just as much attention
> as the cases where we can reduce them.

It's a question of our exposure to real problems, in no small part.
What can we afford to be wrong about? What problem can be fixed by the
user more or less as it emerges, and what problem doesn't have that
quality?

There is very good reason to believe that the large majority of all
data that people store in a system like Postgres is extremely cold
data:

https://www.microsoft.com/en-us/research/video/cost-performance-in-modern-data-stores-how-data-cashing-systems-succeed/
https://brandur.org/fragments/events

Having a separate aggressive step that rewrites an entire large table,
apparently at random, is just a huge burden to users. You've said that
you agree that it sucks, but somehow I still can't shake the feeling
that you don't fully understand just how much it sucks.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 20:26:00 +0100, Matthias van de Meent wrote:
> Could someone explain to me why we don't currently (optionally)
> include the functionality of page freezing in the PRUNE records?

I think we definitely should (and have argued for it a couple times). It's not
just about reducing WAL overhead, it's also about reducing redundant
visibility checks - which are where a very significant portion of the CPU time
for VACUUMing goes to.

Besides performance considerations, it's also just plain weird that
lazy_scan_prune() can end up with a different visibility than
heap_page_prune() (mostly due to concurrent aborts).


The number of WAL records we often end up emitting for processing a single
page in vacuum is just plain absurd:
- PRUNE
- FREEZE_PAGE
- VISIBLE

There's afaict no justification whatsoever for these to be separate records.
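
Purely for illustration (not a proposal for the actual on-disk format), a
combined record could be shaped roughly like this:

    typedef struct xl_heap_prune_freeze       /* hypothetical */
    {
        TransactionId snapshotConflictHorizon;
        uint16      nredirected;
        uint16      ndead;
        uint16      nplans;     /* number of freeze plans; 0 = no freezing */
        uint8       flags;      /* e.g. "also set all-visible/all-frozen in VM" */
        /*
         * followed by the redirect/dead/unused offset arrays, then the freeze
         * plans and their offset arrays, all in one record
         */
    } xl_heap_prune_freeze;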


> I think they're quite closely related (in that they both execute in VACUUM
> and are required for long-term system stability), and are even more related
> now that we have opportunistic page-level freezing. I think adding a "freeze
> this page as well"-flag in PRUNE records would go a long way to reducing the
> WAL overhead of aggressive and more opportunistic freezing.

Yep.

I think we should also seriously consider setting all-visible during on-access
pruning, and freezing rows during on-access pruning.

Greetings,

Andres Freund




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Robert Haas
On Thu, Jan 26, 2023 at 2:57 PM Peter Geoghegan  wrote:
> Relatively difficult for Andres, or for somebody else? What are the
> real parameters here? Obviously there are no clear answers available.

Andres is certainly smarter than the average guy, but practically any
scenario that someone can create in a few lines of SQL is something to
which code will be exposed to on some real-world system. If Andres
came along and said, hey, well I found a way to make this patch suck,
and proceeded to describe a scenario that involved a complex set of
tables and multiple workloads running simultaneously and using a
debugger to trigger some race condition and whatever, I'd be like "OK,
but is that really going to happen?". The actual scenario he came up
with is three lines of SQL, and it's nothing remotely obscure. That
kind of thing is going to happen *all the time*.

> The overwhelming cost is usually FPIs in any case. If you're not
> mostly focussing on that, you're focussing on the wrong thing. At
> least with larger tables. You just have to focus on the picture over
> time, across multiple VACUUM operations.

I think that's all mostly true, but the cases where being more
aggressive can cause *extra* FPIs are worthy of just as much attention
as the cases where we can reduce them.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 10:44:45 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 9:53 AM Andres Freund  wrote:
> > > That's going to be very significantly more aggressive. For example
> > > it'll impact small tables very differently.
> >
> > Maybe it would be too aggressive, not sure. The cost of a freeze WAL record
> > is relatively small, with one important exception below, if we are 99.99%
> > sure that it's not going to require an FPI and isn't going to dirty the
> > page.
> >
> > The exception is that a newer LSN on the page can cause the ringbuffer
> > replacement to trigger more aggressive WAL flushing. No meaningful
> > difference if we modified the page during pruning, or if the page was
> > already in s_b (since it likely won't be written out via the ringbuffer in
> > that case), but if checksums are off and we just hint-dirtied the page, it
> > could be a significant issue.
> 
> Most of the overhead of FREEZE WAL records (with freeze plan
> deduplication and page-level freezing in) is generic WAL record header
> overhead. Your recent adversarial test case is going to choke on that,
> too. At least if you set checkpoint_timeout to 1 minute again.

I don't quite follow. What do you mean by "record header overhead"? Unless
that includes FPIs, I don't think that's that commonly true?

The problematic case I am talking about is when we do *not* emit a WAL record
during pruning (because there's nothing to prune), but want to freeze the
table. If you don't log an FPI, the remaining big overhead is that increasing
the LSN on the page will often cause an XLogFlush() when writing out the
buffer.
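
In rough, standalone C terms - purely an illustrative model, not the actual
bufmgr.c code - the cost being described looks like this: before a dirty
buffer can be reused during ring-buffer replacement, WAL has to be flushed up
to that page's LSN, so a freeze record that advances the page LSN past what
is already flushed turns the eviction into a synchronous flush.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t XLogRecPtr;

    static XLogRecPtr flushed_upto = 1000; /* WAL already durably flushed */

    /* mirrors the XLogFlush(BufferGetLSN(buf)) step before the buffer write */
    static bool
    eviction_needs_wal_flush(XLogRecPtr page_lsn)
    {
        return page_lsn > flushed_upto;
    }

    int
    main(void)
    {
        XLogRecPtr lsn_untouched = 900;     /* page not modified by this VACUUM */
        XLogRecPtr lsn_after_freeze = 1500; /* freeze record advanced the LSN */

        printf("evict untouched page: flush needed = %d\n",
               (int) eviction_needs_wal_flush(lsn_untouched));
        printf("evict page we froze:  flush needed = %d\n",
               (int) eviction_needs_wal_flush(lsn_after_freeze));
        return 0;
    }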

I don't see what your reference to checkpoint timeout is about here?

Also, as I mentioned before, the problem isn't specific to checkpoint_timeout
= 1min. It just makes it cheaper to reproduce.


> > Thus a modification of the above logic could be to opportunistically freeze
> > if
> > a) it won't cause an FPI, and either
> > b1) the page was already dirty before pruning, as we'll not do a ringbuffer
> > replacement in that case
> > or
> > b2) we wrote a WAL record during pruning, as the difference in flush
> > position is marginal
> >
> > An even more aggressive version would be to replace b1) with logic that'd
> > allow newly dirtying the page if it wasn't read through the ringbuffer. But
> > newly dirtying the page feels like it'd be more dangerous.
> 
> In many cases we'll have to dirty the page anyway, just to set
> PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether
> triggered by an FPI or triggered by my now-reverted GUC) on being able
> to set the whole page all-frozen in the VM.

IIRC setting PD_ALL_VISIBLE doesn't trigger an FPI unless we need to log hint
bits. But freezing does trigger one even without wal_log_hints.

You're right, it makes sense to consider whether we'll emit an
XLOG_HEAP2_VISIBLE anyway.


> > A less aggressive version would be to check if any WAL records were emitted
> > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if
> > we modified the page again. Similar to what we do now, except not requiring
> > an FPI to have been emitted.
> 
> Also way more aggressive. Not nearly enough on its own.

In which cases will it be problematically more aggressive?

If we emitted a WAL record during pruning we've already set the LSN of the
page to a very recent LSN. We know the page is dirty. Thus we'll already
trigger an XLogFlush() during ringbuffer replacement. We won't emit an FPI.



> > But to me it seems a bit odd that VACUUM now is more aggressive if
> > checksums / wal_log_hints are on, than without them. Which I think is how
> > using either of pgWalUsage.wal_fpi or pgWalUsage.wal_records ends up
> > working?
> 
> Which part is the odd part? Is it odd that page-level freezing works
> that way, or is it odd that page-level checksums work that way?

That page-level freezing works that way.


> In any case this seems like an odd thing for you to say, having
> eviscerated a patch that really just made the same behavior trigger
> independently of FPIs in some tables, controlled via a GUC.


That behaviour I criticized was causing a torrent of FPIs and additional
dirtying of pages. My proposed replacement for the current FPI check doesn't,
because a) it only triggers when we wrote a WAL record, and b) it doesn't
trigger if we would write an FPI.

Greetings,

Andres Freund




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 11:26 AM Matthias van de Meent
 wrote:
> Could someone explain to me why we don't currently (optionally)
> include the functionality of page freezing in the PRUNE records? I
> think they're quite closely related (in that they both execute in
> VACUUM and are required for long-term system stability), and are even
> more related now that we have opportunistic page-level freezing. I
> think adding a "freeze this page as well"-flag in PRUNE records would
> go a long way to reducing the WAL overhead of aggressive and more
> opportunistic freezing.

Yeah, we've talked about doing that in the past year. It's quite
possible. It would make quite a lot of sense, because the actual
overhead of the WAL record for freezing tends to come from the generic
WAL record header stuff itself. If there was only one record for both,
then you'd only need to include the relfilenode and block number (and
so on) once.

It would be tricky to handle Multis, so what you'd probably do is just
freeze xmin, and possibly aborted and locker XIDs in xmax. So you
wouldn't completely get rid of the main freeze record, but would be
able to avoid it in many important cases.
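
To make the record-header point concrete, here is a purely hypothetical
layout for a combined prune+freeze record - field names are illustrative and
not taken from heapam_xlog.h - showing that the block reference and generic
record header would be paid once rather than once per record:

    #include <stdint.h>

    typedef uint32_t TransactionId;

    typedef struct xl_heap_prune_freeze_sketch
    {
        TransactionId snapshot_conflict_horizon;
        uint16_t      nredirected;
        uint16_t      ndead;
        uint16_t      nunused;
        uint16_t      nfrozen;  /* 0 means no freezing piggybacked */
        uint8_t       flags;    /* e.g. a hypothetical "has freeze plans" bit */

        /*
         * Followed by the redirect/dead/unused offset arrays (as in PRUNE
         * today) and, only when nfrozen > 0, per-tuple freeze plans covering
         * xmin and the simple xmax cases; full multixact handling would still
         * fall back to a separate FREEZE_PAGE record, per the caveat above.
         */
    } xl_heap_prune_freeze_sketch;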

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 11:28 AM Robert Haas  wrote:
> I think it's pretty much impossible to freeze more aggressively
> without losing in some scenario or other. If waiting longer to freeze
> would have resulted in the data getting updated again or deleted
> before we froze it, then waiting longer reduces the total amount of
> freezing work that ever has to be done. Freezing more aggressively
> inevitably gives up some amount of that potential benefit in order to
> try to secure some other benefit. It's a trade-off.

There is no question about that.

> I think that the goal of a patch that makes vacuum more (or less)
> aggressive should be to make the cases where we lose as obscure as
> possible, and the cases where we win as broad as possible. I think
> that, in order to be a good patch, it needs to be relatively difficult
> to find cases where we incur a big loss. If it's easy to find a big
> loss, then I think it's better to stick with the current behavior,
> even if it's also easy to find a big gain.

Again, this seems totally uncontroversial. It's just incredibly vague,
and not at all actionable.

Relatively difficult for Andres, or for somebody else? What are the
real parameters here? Obviously there are no clear answers available.

> However, I'm also not
> prepared to go all the way to the other end of the spectrum and say
> that all of your ideas and everything in this patch are great. I don't
> think either of those things, either.

It doesn't matter. I'm done with it. This is not a negotiation about
what gets in and what doesn't get in.

All that I aim to do now is to draw some kind of line under the basic
page-level freezing work, since of course I'm still responsible for
that. And perhaps to defend my personal reputation.

> I certainly think that freezing more aggressively in some scenarios
> could be a great idea, but it seems like the patch's theory is to be
> very nearly maximally aggressive in every vacuum run if the table size
> is greater than some threshold, and I don't think that's right at all.

We'll systematically avoid accumulating debt past a certain point --
that's its purpose. That is, we'll avoid accumulating all-visible
pages that eventually need to be frozen.

> I'm not exactly sure what information we should use to decide how
> aggressive to be, but I am pretty sure that the size of the table is
> not it.  It's true that, for a small table, the cost of having to
> eventually vacuum the whole table at once isn't going to be very high,
> whereas for a large table, it will be. That line of reasoning makes a
> size threshold sound reasonable. However, the amount of extra work
> that we can potentially do by vacuuming more aggressively *also*
> increases with the table size, which to me means using that as a
> criterion actually isn't sensible at all.

The overwhelming cost is usually FPIs in any case. If you're not
mostly focussing on that, you're focussing on the wrong thing. At
least with larger tables. You just have to focus on the picture over
time, across multiple VACUUM operations.

> One idea that I've had about how to solve this problem is to try to
> make vacuum try to aggressively freeze some portion of the table on
> each pass, and to behave less aggressively on the rest of the table so
> that, hopefully, no single vacuum does too much work. Unfortunately, I
> don't really know how to do that effectively.

That has been proposed a couple of times in the context of this
thread. It won't work, because the way autovacuum works in general
(and likely always will work) doesn't allow it. With an append-only
table, each VACUUM will naturally have to scan significantly more
pages than the last one, forever (barring antiwraparound vacuums). Why
wouldn't it continue that way? I mean it might not (the table might
stop growing altogether), but then it doesn't matter much what we do.

If you're not behaving very proactively at the level of each VACUUM
operation, then the picture over time is that you're *already* falling
behind. At least with an append-only table. You have to think of the
sequence of operations, not just one.

> In theory we could have some system that tracks how
> recently each page range in a table has been modified, and direct our
> freezing activity toward the ones less-recently modified on the theory
> that they're not so likely to be modified again in the near future,
> but in reality we have no such system. So I don't really feel like I
> know what the right answer is here, yet.

So we need to come up with a way of getting reliable information from
the future, about an application that we have no particular
understanding of. As opposed to just eating the cost to some degree,
and making it configurable.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Robert Haas
On Thu, Jan 26, 2023 at 11:35 AM Peter Geoghegan  wrote:
> You complained about the descriptions being theoretical. But there's
> nothing theoretical about the fact that we more or less do *all*
> freezing in an eventual aggressive VACUUM in many important cases,
> including very simple cases like pgbench_history -- the simplest
> possible append-only table case. We'll merrily rewrite the entire
> table, all at once, for no good reason at all. Consistently, reliably.
> It's so incredibly obvious that this makes zero sense! And yet I don't
> think you've ever engaged with such basic points as that one.

I'm aware that that's a problem, and I agree that it sucks. I think
that what this patch does is make vacuum more aggressive, and I
expect that would help this problem. I haven't said much about that
because I don't think it's controversial. However, the patch also has
a cost, and that's what I think is controversial.

I think it's pretty much impossible to freeze more aggressively
without losing in some scenario or other. If waiting longer to freeze
would have resulted in the data getting updated again or deleted
before we froze it, then waiting longer reduces the total amount of
freezing work that ever has to be done. Freezing more aggressively
inevitably gives up some amount of that potential benefit in order to
try to secure some other benefit. It's a trade-off.

I think that the goal of a patch that makes vacuum more (or less)
aggressive should be to make the cases where we lose as obscure as
possible, and the cases where we win as broad as possible. I think
that, in order to be a good patch, it needs to be relatively difficult
to find cases where we incur a big loss. If it's easy to find a big
loss, then I think it's better to stick with the current behavior,
even if it's also easy to find a big gain. There's nothing wonderful
about the current behavior, but (to paraphrase what I think Andres has
already said several times) it's better to keep shipping code with the
same bad behavior than to put out a new major release with behaviors
that are just as bad, but different.

I feel like your emails sometimes seem to suppose that I think that
you're a bad person, or a bad developer, or that you have no good
ideas, or that you have no good ideas about this topic, or that this
topic is not important, or that we don't need to do better than we are
currently doing. I think none of those things. However, I'm also not
prepared to go all the way to the other end of the spectrum and say
that all of your ideas and everything in this patch are great. I don't
think either of those things, either.

I certainly think that freezing more aggressively in some scenarios
could be a great idea, but it seems like the patch's theory is to be
very nearly maximally aggressive in every vacuum run if the table size
is greater than some threshold, and I don't think that's right at all.
I'm not exactly sure what information we should use to decide how
aggressive to be, but I am pretty sure that the size of the table is
not it.  It's true that, for a small table, the cost of having to
eventually vacuum the whole table at once isn't going to be very high,
whereas for a large table, it will be. That line of reasoning makes a
size threshold sound reasonable. However, the amount of extra work
that we can potentially do by vacuuming more aggressively *also*
increases with the table size, which to me means using that as a
criterion actually isn't sensible at all.

One idea that I've had about how to solve this problem is to try to
make vacuum try to aggressively freeze some portion of the table on
each pass, and to behave less aggressively on the rest of the table so
that, hopefully, no single vacuum does too much work. Unfortunately, I
don't really know how to do that effectively. If we knew that the
table was going to see 10 vacuums before we hit
autovacuum_freeze_max_age, we could try to have each one do 10% of the
amount of freezing that was going to need to be done rather than
letting any single vacuum do all of it, but we don't have that sort of
information. Also, even if we did have that sort of information, the
idea only works if the pages that we freeze sooner are ones that we're
not about to update or delete again, and we don't have any idea what
is likely there. In theory we could have some system that tracks how
recently each page range in a table has been modified, and direct our
freezing activity toward the ones less-recently modified on the theory
that they're not so likely to be modified again in the near future,
but in reality we have no such system. So I don't really feel like I
know what the right answer is here, yet.
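
The arithmetic being gestured at could look like the sketch below. It assumes
information VACUUM does not actually have today - roughly how many more
vacuums will run before autovacuum_freeze_max_age is reached - so it is a
model of the idea, nothing more:

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t
    per_vacuum_freeze_budget(uint64_t unfrozen_allvisible_pages,
                             uint32_t expected_vacuums_before_limit)
    {
        if (expected_vacuums_before_limit == 0)
            return unfrozen_allvisible_pages;   /* no runway left: do it all */

        /* round up so the backlog is fully paid down across those vacuums */
        return (unfrozen_allvisible_pages + expected_vacuums_before_limit - 1) /
               expected_vacuums_before_limit;
    }

    int
    main(void)
    {
        /* 1,000,000 unfrozen all-visible pages, ~10 vacuums of runway */
        printf("freeze about %llu pages per vacuum\n",
               (unsigned long long) per_vacuum_freeze_budget(1000000, 10));
        return 0;
    }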

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Matthias van de Meent
On Thu, 26 Jan 2023 at 19:45, Peter Geoghegan  wrote:
>
> On Thu, Jan 26, 2023 at 9:53 AM Andres Freund  wrote:
> > I assume the case you're thinking of is that pruning did *not* do any
> > changes, but in the process of figuring out that nothing needed to be
> > pruned, we did a MarkBufferDirtyHint(), and as part of that emitted an FPI?
>
> Yes.
>
> > > That's going to be very significantly more aggressive. For example
> > > it'll impact small tables very differently.
> >
> > Maybe it would be too aggressive, not sure. The cost of a freeze WAL record
> > is relatively small, with one important exception below, if we are 99.99%
> > sure that it's not going to require an FPI and isn't going to dirty the
> > page.
> >
> > The exception is that a newer LSN on the page can cause the ringbuffer
> > replacement to trigger more aggressive WAL flushing. No meaningful
> > difference if we modified the page during pruning, or if the page was
> > already in s_b (since it likely won't be written out via the ringbuffer in
> > that case), but if checksums are off and we just hint-dirtied the page, it
> > could be a significant issue.
>
> Most of the overhead of FREEZE WAL records (with freeze plan
> deduplication and page-level freezing in) is generic WAL record header
> overhead. Your recent adversarial test case is going to choke on that,
> too. At least if you set checkpoint_timeout to 1 minute again.

Could someone explain to me why we don't currently (optionally)
include the functionality of page freezing in the PRUNE records? I
think they're quite closely related (in that they both execute in
VACUUM and are required for long-term system stability), and are even
more related now that we have opportunistic page-level freezing. I
think adding a "freeze this page as well"-flag in PRUNE records would
go a long way to reducing the WAL overhead of aggressive and more
opportunistic freezing.

-Matthias




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 9:53 AM Andres Freund  wrote:
> I assume the case you're thinking of is that pruning did *not* do any changes,
> but in the process of figuring out that nothing needed to be pruned, we did a
> MarkBufferDirtyHint(), and as part of that emitted an FPI?

Yes.

> > That's going to be very significantly more aggressive. For example
> > it'll impact small tables very differently.
>
> Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
> relatively small, with one important exception below, if we are 99.99% sure
> that it's not going to require an FPI and isn't going to dirty the page.
>
> The exception is that a newer LSN on the page can cause the ringbuffer
> replacement to trigger more aggressive WAL flushing. No meaningful
> difference if we modified the page during pruning, or if the page was already
> in s_b (since it likely won't be written out via the ringbuffer in that case),
> but if checksums are off and we just hint-dirtied the page, it could be a
> significant issue.

Most of the overhead of FREEZE WAL records (with freeze plan
deduplication and page-level freezing in) is generic WAL record header
overhead. Your recent adversarial test case is going to choke on that,
too. At least if you set checkpoint_timeout to 1 minute again.

> Thus a modification of the above logic could be to opportunistically freeze if
> a) it won't cause an FPI and either
> b1) the page was already dirty before pruning, as we'll not do a ringbuffer
> replacement in that case
> or
> b2) We wrote a WAL record during pruning, as the difference in flush position
> is marginal
>
> An even more aggressive version would be to replace b1) with logic that'd
> allow newly dirtying the page if it wasn't read through the ringbuffer. But
> newly dirtying the page feels like it'd be more dangerous.

In many cases we'll have to dirty the page anyway, just to set
PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether
triggered by an FPI or triggered by my now-reverted GUC) on being able
to set the whole page all-frozen in the VM.

> A less aggressive version would be to check if any WAL records were emitted
> during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
> modified the page again. Similar to what we do now, except not requiring an
> FPI to have been emitted.

Also way more aggressive. Not nearly enough on its own.

> But to me it seems a bit odd that VACUUM now is more aggressive if checksums /
> wal_log_hints are on, than without them. Which I think is how using either
> of pgWalUsage.wal_fpi or pgWalUsage.wal_records ends up working?

Which part is the odd part? Is it odd that page-level freezing works
that way, or is it odd that page-level checksums work that way?

In any case this seems like an odd thing for you to say, having
eviscerated a patch that really just made the same behavior trigger
independently of FPIs in some tables, controlled via a GUC.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 08:54:55 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 8:35 AM Andres Freund  wrote:
> > I think it's probably ok, but perhaps deserves a bit more thought about when
> > to "opportunistically" freeze. Perhaps to make it *more* aggressive than it
> > is now.
> >
> > With "opportunistic freezing" I mean freezing the page, even though we don't
> > *have* to freeze any of the tuples.
> >
> > The overall condition gating freezing is:
> > if (pagefrz.freeze_required || tuples_frozen == 0 ||
> > (prunestate->all_visible && prunestate->all_frozen &&
> >  fpi_before != pgWalUsage.wal_fpi))
> >
> > fpi_before is set before the heap_page_prune() call.
> 
> Have you considered page-level checksums, and how the impact on hint
> bits needs to be accounted for here?
> 
> All RDS customers use page-level checksums. And I've noticed that it's
> very common for the number of FPIs to only be very slightly less than
> the number of pages dirtied. Much of which is just hint bits. The
> "fpi_before != pgWalUsage.wal_fpi" test catches that.

I assume the case you're thinking of is that pruning did *not* do any changes,
but in the process of figuring out that nothing needed to be pruned, we did a
MarkBufferDirtyHint(), and as part of that emitted an FPI?


> > To me a condition that checked if the buffer is already dirty and if another
> > XLogInsert() would be likely to generate an FPI would make more sense. The
> > rare race case of a checkpoint starting concurrently doesn't matter IMO.
> 
> That's going to be very significantly more aggressive. For example
> it'll impact small tables very differently.

Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
relatively small, with one important exception below, if we are 99.99% sure
that it's not going to require an FPI and isn't going to dirty the page.

The exception is that a newer LSN on the page can cause the ringbuffer
replacement to trigger more aggressive WAL flushing. No meaningful
difference if we modified the page during pruning, or if the page was already
in s_b (since it likely won't be written out via the ringbuffer in that case),
but if checksums are off and we just hint-dirtied the page, it could be a
significant issue.

Thus a modification of the above logic could be to opportunistically freeze if
a) it won't cause an FPI, and either
b1) the page was already dirty before pruning, as we'll not do a ringbuffer
    replacement in that case
or
b2) we wrote a WAL record during pruning, as the difference in flush position
    is marginal
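
As a standalone sketch (inputs are stand-ins for state lazy_scan_prune()
already has, not actual PostgreSQL code), the gate described above is:

    #include <stdbool.h>

    static bool
    should_freeze_opportunistically(bool freeze_would_emit_fpi,
                                    bool page_dirty_before_prune,
                                    bool prune_emitted_wal_record)
    {
        if (freeze_would_emit_fpi)
            return false;                   /* condition a) */

        return page_dirty_before_prune      /* condition b1) */
            || prune_emitted_wal_record;    /* condition b2) */
    }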

An even more aggressive version would be to replace b1) with logic that'd
allow newly dirtying the page if it wasn't read through the ringbuffer. But
newly dirtying the page feels like it'd be more dangerous.


A less aggressive version would be to check if any WAL records were emitted
during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
modified the page again. Similar to what we do now, except not requiring an
FPI to have been emitted.

But to me it seems a bit odd that VACUUM now is more aggressive if checksums /
wal_log_hints are on, than without them. Which I think is how using either
of pgWalUsage.wal_fpi or pgWalUsage.wal_records ends up working?

Greetings,

Andres Freund




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 8:35 AM Andres Freund  wrote:
> I think it's probably ok, but perhaps deserves a bit more thought about when
> to "opportunistically" freeze. Perhaps to make it *more* aggressive than it's
> now.
>
> With "opportunistic freezing" I mean freezing the page, even though we don't
> *have* to freeze any of the tuples.
>
> The overall condition gating freezing is:
> if (pagefrz.freeze_required || tuples_frozen == 0 ||
> (prunestate->all_visible && prunestate->all_frozen &&
>  fpi_before != pgWalUsage.wal_fpi))
>
> fpi_before is set before the heap_page_prune() call.

Have you considered page-level checksums, and how the impact on hint
bits needs to be accounted for here?

All RDS customers use page-level checksums. And I've noticed that it's
very common for the number of FPIs to only be very slightly less than
the number of pages dirtied. Much of which is just hint bits. The
"fpi_before != pgWalUsage.wal_fpi" test catches that.

> To me a condition that checked if the buffer is already dirty and if another
> XLogInsert() would be likely to generate an FPI would make more sense. The
> rare race case of a checkpoint starting concurrently doesn't matter IMO.

That's going to be very significantly more aggressive. For example
it'll impact small tables very differently.

--
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-26 09:20:57 -0500, Robert Haas wrote:
> On Wed, Jan 25, 2023 at 10:56 PM Andres Freund  wrote:
> > but that's only true because page level freezing neutered
> > vacuum_freeze_min_age. Compared to <16, it's a *huge* change.
> 
> Do you think that page-level freezing
> (1de58df4fec7325d91f5a8345757314be7ac05da) was improvidently
> committed?

I think it's probably ok, but perhaps deserves a bit more thought about when
to "opportunistically" freeze. Perhaps to make it *more* aggressive than it's
now.

With "opportunistic freezing" I mean freezing the page, even though we don't
*have* to freeze any of the tuples.

The overall condition gating freezing is:
    if (pagefrz.freeze_required || tuples_frozen == 0 ||
        (prunestate->all_visible && prunestate->all_frozen &&
         fpi_before != pgWalUsage.wal_fpi))

fpi_before is set before the heap_page_prune() call.

To me the
  fpi_before != pgWalUsage.wal_fpi
part doesn't make a whole lot of sense. For one, it won't work at all if
full_page_writes=off. But more importantly, it also means we'll not freeze
when VACUUMing a recently modified page, even if pruning already emitted a WAL
record and we'd not emit an FPI if we froze the page now.


To me a condition that checked if the buffer is already dirty and if another
XLogInsert() would be likely to generate an FPI would make more sense. The
rare race case of a checkpoint starting concurrently doesn't matter IMO.


A minor complaint I have about the code is that the "tuples_frozen == 0" path
imo is confusing. We go into the "freeze" path, which then inside has another
if for the tuples_frozen == 0 part. I get that this deduplicates the
NewRelFrozenXid handling, but it still looks odd.


> I have always been a bit skeptical of vacuum_freeze_min_age as a
> mechanism. It's certainly true that it is a waste of energy to freeze
> tuples that will soon be removed anyway, but on the other hand,
> repeatedly dirtying the same page for various different freezing and
> visibility related reasons *really sucks*, and even repeatedly reading
> the page because we kept deciding not to do anything yet isn't great.
> It seems possible that the page-level freezing mechanism could help
> with that quite a bit, and I think that the heuristic that patch
> proposes is basically reasonable: if there's at least one tuple on the
> page that is old enough to justify freezing, it doesn't seem like a
> bad bet to freeze all the others that can be frozen at the same time,
> at least if it means that we can mark the page all-visible or
> all-frozen. If it doesn't, then I'm not so sure; maybe we're best off
> deferring as much work as possible to a time when we *can* mark the
> page all-visible or all-frozen.

Agreed. Freezing everything if we need to freeze some things seems quite safe
to me.


> In short, I think that neutering vacuum_freeze_min_age at least to
> some degree might be a good thing, but that's not to say that I'm
> altogether confident in that patch, either.

I am not too worried about the neutering in the page level freezing patch.

The combination of the page level work with the eager strategy is where the
sensibly-more-aggressive freeze_min_age got turbocharged to an imo dangerous
degree.

Greetings,

Andres Freund




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 5:41 AM Robert Haas  wrote:
> On Wed, Jan 25, 2023 at 11:25 PM Peter Geoghegan  wrote:
> > On Wed, Jan 25, 2023 at 7:41 PM Robert Haas  wrote:
> > > Both Andres and I have repeatedly expressed concern about how much is
> > > being changed in the behavior of vacuum, and how quickly, and IMHO on
> > > the basis of very limited evidence that the changes are improvements.
> > > The fact that Andres was very quickly able to find cases where the
> > > patch produces large regression is just more evidence of that. It's
> > > also hard to even understand what has been changed, because the
> > > descriptions are so theoretical.
> >
> > Did you actually read the motivating examples Wiki page?
>
> I don't know. I've read a lot of stuff that you've written on this
> topic, which has taken a significant amount of time, and I still don't
> understand a lot of what you're changing, and I don't agree with all
> of the things that I do understand.

You complained about the descriptions being theoretical. But there's
nothing theoretical about the fact that we more or less do *all*
freezing in an eventual aggressive VACUUM in many important cases,
including very simple cases like pgbench_history -- the simplest
possible append-only table case. We'll merrily rewrite the entire
table, all at once, for no good reason at all. Consistently, reliably.
It's so incredibly obvious that this makes zero sense! And yet I don't
think you've ever engaged with such basic points as that one.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Peter Geoghegan
On Wed, Jan 25, 2023 at 7:56 PM Andres Freund  wrote:
> > https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2
> >
> > The difference between this and VACUUM FREEZE is described here:
> >
> > "Note how we freeze most pages, but still leave a significant number
> > unfrozen each time, despite using an eager approach to freezing
> > (2981204 scanned - 2355230 frozen = 625974 pages scanned but left
> > unfrozen). Again, this is because we don't freeze pages unless they're
> > already eligible to be set all-visible.
>
> The only reason there is a substantial difference is because of pgbench's
> uniform access pattern. Most real-world applications don't have that.

It's not pgbench! It's TPC-C. It's actually an adversarial case for
the patch series.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Robert Haas
On Wed, Jan 25, 2023 at 10:56 PM Andres Freund  wrote:
> but that's only true because page level freezing neutered
> vacuum_freeze_min_age. Compared to <16, it's a *huge* change.

Do you think that page-level freezing
(1de58df4fec7325d91f5a8345757314be7ac05da) was improvidently
committed?

I have always been a bit skeptical of vacuum_freeze_min_age as a
mechanism. It's certainly true that it is a waste of energy to freeze
tuples that will soon be removed anyway, but on the other hand,
repeatedly dirtying the same page for various different freezing and
visibility related reasons *really sucks*, and even repeatedly reading
the page because we kept deciding not to do anything yet isn't great.
It seems possible that the page-level freezing mechanism could help
with that quite a bit, and I think that the heuristic that patch
proposes is basically reasonable: if there's at least one tuple on the
page that is old enough to justify freezing, it doesn't seem like a
bad bet to freeze all the others that can be frozen at the same time,
at least if it means that we can mark the page all-visible or
all-frozen. If it doesn't, then I'm not so sure; maybe we're best off
deferring as much work as possible to a time when we *can* mark the
page all-visible or all-frozen.
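
A sketch of that heuristic, with stand-in inputs rather than the actual
lazy_scan_prune() state (illustrative only):

    #include <stdbool.h>

    static bool
    freeze_whole_page(bool some_tuple_requires_freezing,
                      bool page_would_become_all_frozen)
    {
        /* nothing forces us to WAL-log freezes on this page */
        if (!some_tuple_requires_freezing)
            return false;

        /*
         * The "not so sure" case from above: only clearly a good bet when
         * freezing everything lets the page be marked all-visible/all-frozen.
         */
        return page_would_become_all_frozen;
    }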

In short, I think that neutering vacuum_freeze_min_age at least to
some degree might be a good thing, but that's not to say that I'm
altogether confident in that patch, either.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-26 Thread Robert Haas
On Wed, Jan 25, 2023 at 11:25 PM Peter Geoghegan  wrote:
> On Wed, Jan 25, 2023 at 7:41 PM Robert Haas  wrote:
> > Both Andres and I have repeatedly expressed concern about how much is
> > being changed in the behavior of vacuum, and how quickly, and IMHO on
> > the basis of very limited evidence that the changes are improvements.
> > The fact that Andres was very quickly able to find cases where the
> > patch produces large regression is just more evidence of that. It's
> > also hard to even understand what has been changed, because the
> > descriptions are so theoretical.
>
> Did you actually read the motivating examples Wiki page?

I don't know. I've read a lot of stuff that you've written on this
topic, which has taken a significant amount of time, and I still don't
understand a lot of what you're changing, and I don't agree with all
of the things that I do understand. I can't state with confidence that
the motivating examples wiki page was or was not among the things that
I read. But, you know, when people start running PostgreSQL 16, and
have some problem, they're not going to read the motivating examples
wiki page. They're going to read the documentation. If they can't find
the answer there, they (or some hacker that they contact) will
probably read the code comments and the relevant commit messages.
Those either clearly explain what was changed in a way that somebody
can understand, or they don't. If they don't, *the commits are not
good enough*, regardless of what other information may exist in any
other place.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Peter Geoghegan
On Wed, Jan 25, 2023 at 8:24 PM Peter Geoghegan  wrote:
> > I think we're on a very dangerous path here. I want VACUUM to be
> > better as the next person, but I really don't believe that's the
> > direction we're headed. I think if we release like this, we're going
> > to experience more VACUUM pain, not less. And worse still, I don't
> > think anyone other than Peter and Andres is going to understand why
> > it's happening.
>
> I think that the only sensible course of action at this point is for
> me to revert the page-level freezing commit from today, and abandon
> all outstanding work on VACUUM. I will still stand by the basic
> page-level freezing work, at least to the extent that I am able to.

I have now reverted today's commit. I have also withdrawn all
remaining work from the patch series as a whole, which is reflected in
the CF app. Separately, I have withdrawn 2 other VACUUM related
patches of mine via the CF app: the antiwraparound autovacuum patch
series, plus a patch that did some further work on freezing
MultiXacts.

I have no intention of picking any of these patches back up again. I
also intend to completely avoid new work on both VACUUM and
autovacuum, not including ambulkdelete() code run by index access
methods. I will continue to do maintenance and bugfix work when it
happens to involve VACUUM, though.

For the record, in case it matters: I certainly have no objection to
anybody else picking up any of this unfinished work for themselves, in
part or in full.

--
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Peter Geoghegan
On Wed, Jan 25, 2023 at 8:12 PM John Naylor
 wrote:
> That was followed by several paragraphs that never got around to explaining 
> why table size should drive freezing strategy.

You were talking about the system level view of freeze debt, and how
the table view might not be a sufficient proxy for that. What does
that have to do with anything that we've discussed on this thread
recently?

> Review is a feedback mechanism alerting the patch author to possible 
> problems. Listening to feedback is like vacuum, in a way: If it hurts, you're 
> not doing it enough.

An elegant analogy.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Peter Geoghegan
On Wed, Jan 25, 2023 at 7:41 PM Robert Haas  wrote:
> Both Andres and I have repeatedly expressed concern about how much is
> being changed in the behavior of vacuum, and how quickly, and IMHO on
> the basis of very limited evidence that the changes are improvements.
> The fact that Andres was very quickly able to find cases where the
> patch produces large regression is just more evidence of that. It's
> also hard to even understand what has been changed, because the
> descriptions are so theoretical.

Did you actually read the motivating examples Wiki page?

> I think we're on a very dangerous path here. I want VACUUM to be
> better as much as the next person, but I really don't believe that's the
> direction we're headed. I think if we release like this, we're going
> to experience more VACUUM pain, not less. And worse still, I don't
> think anyone other than Peter and Andres is going to understand why
> it's happening.

I think that the only sensible course of action at this point is for
me to revert the page-level freezing commit from today, and abandon
all outstanding work on VACUUM. I will still stand by the basic
page-level freezing work, at least to the extent that I am able to.
Honestly, just typing that makes me feel a big sense of relief.

I am a proud, stubborn man. While the experience of working on the
earlier related stuff for Postgres 15 was itself enough to make me
seriously reassess my choice to work on VACUUM in general, I still
wanted to finish off what I'd started. I don't see how that'll be
possible now -- I'm just not in a position to be in the center of
another controversy, and I just don't seem to be able to avoid them
here, as a practical matter. I will resolve to be a less stubborn
person. I don't have the constitution for it anymore.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread John Naylor
On Thu, Jan 26, 2023 at 10:11 AM Andres Freund  wrote:

> I am. Just not every tradeoff. I just don't see any useful tradeoffs purely
> based on the relation size.

I expressed reservations about relation size six weeks ago:

On Wed, Dec 14, 2022 at 12:16 AM Peter Geoghegan  wrote:
>
> On Tue, Dec 13, 2022 at 12:29 AM John Naylor
>  wrote:
> > If the number of unfrozen heap pages is the thing we care about, perhaps
> > that, and not the total size of the table, should be the parameter that
> > drives freezing strategy?
>
> That's not the only thing we care about, though.

That was followed by several paragraphs that never got around to explaining
why table size should drive freezing strategy. Review is a feedback
mechanism alerting the patch author to possible problems. Listening to
feedback is like vacuum, in a way: If it hurts, you're not doing it enough.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Andres Freund
Hi,

On 2023-01-25 18:43:10 -0800, Peter Geoghegan wrote:
> On Wed, Jan 25, 2023 at 6:33 PM Andres Freund  wrote:
> > As far as I can tell, with the eager strategy, the only thing
> > vacuum_freeze_min_age really influences is whether we'll block waiting for a
> > cleanup lock.  IOW, VACUUM on a table > vacuum_freeze_strategy_threshold is
> > now a slightly less-blocking version of VACUUM FREEZE.
>
> That's simply not true, at all. I'm very surprised that you think
> that. The commit message very clearly addresses this.

It says something like that, but it's not really true:

Looking at the results of
  DROP TABLE IF EXISTS frak;
  -- autovac disabled so we see just the result of the vacuum below
  CREATE TABLE frak WITH (autovacuum_enabled=0) AS
    SELECT generate_series(1, 10000000);
  VACUUM frak;
  SELECT pg_relation_size('frak') / 8192 AS relsize_pages,
         SUM(all_visible::int) AS all_vis_pages,
         SUM(all_frozen::int) AS all_frozen_pages
  FROM pg_visibility('frak');

across releases.

In < 16 you'll get:
┌───┬───┬──┐
│ relsize_pages │ all_vis_pages │ all_frozen_pages │
├───┼───┼──┤
│ 44248 │ 44248 │0 │
└───┴───┴──┘

You simply can't freeze these rows, because they're not vacuum_freeze_min_age
xids old.

With 16 and the default vacuum_freeze_strategy_threshold you'll get the same
(even though we wouldn't actually trigger an FPW).

With 16 and vacuum_freeze_strategy_threshold=0, you'll get:
┌───┬───┬──┐
│ relsize_pages │ all_vis_pages │ all_frozen_pages │
├───┼───┼──┤
│ 44248 │ 44248 │44248 │
└───┴───┴──┘

IOW, basically what you get with VACUUM FREEZE.


That's actually what I was complaining about. The commit message in a way is
right that:

    Settings like vacuum_freeze_min_age still get applied in just the same
    way in every VACUUM, independent of the strategy in use.  The only
    mechanical difference between eager and lazy freezing strategies is that
    only the former applies its own additional criteria to trigger freezing
    pages.

but that's only true because page level freezing neutered
vacuum_freeze_min_age. Compared to <16, it's a *huge* change.


Yes, it's true that VACUUM still is less aggressive than VACUUM FREEZE, even
disregarding cleanup locks, because it won't freeze if there are non-removable
rows on the page. But more often than not that's a pretty small difference.



> Once again I'll refer you to my Wiki page on this:
>
> https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2
>
> The difference between this and VACUUM FREEZE is described here:
>
> "Note how we freeze most pages, but still leave a significant number
> unfrozen each time, despite using an eager approach to freezing
> (2981204 scanned - 2355230 frozen = 625974 pages scanned but left
> unfrozen). Again, this is because we don't freeze pages unless they're
> already eligible to be set all-visible.

The only reason there is a substantial difference is because of pgbench's
uniform access pattern. Most real-world applications don't have that.

Greetings,

Andres Freund




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Peter Geoghegan
On Wed, Jan 25, 2023 at 7:11 PM Andres Freund  wrote:
> > > I switched between vacuum_freeze_strategy_threshold=0 and
> > > vacuum_freeze_strategy_threshold=too-high, because it's quicker/takes less
> > > warmup to set up something with smaller tables.
> >
> > This makes no sense to me, at all.
>
> It's quicker to run the workload with a table that initially is below 4GB, but
> still be able to test the eager strategy. It wouldn't change anything
> fundamental to just make the rows a bit wider, or to have a static portion of
> the table.

What does that actually mean? Wouldn't change anything fundamental?

What it would do is significantly reduce the write amplification
effect that you encountered. You came up with numbers of up to 7x, a
number that you used without any mention of checkpoint_timeout being
lowered to only 1 minute (I had to push to get that information). Had
you done things differently (larger table, larger setting) then that
would have made the regression far smaller. So yeah, "nothing
fundamental".

> > How, in general, can we detect what kind of 1TB table it will be, in the
> > absence of user input?
>
> I suspect we'll need some form of heuristics to differentiate between tables
> that are more append heavy and tables that are changing more heavily.

The TPC-C tables are actually a perfect adversarial cases for this,
because it's both, together. What then?

> I think
> it might be preferable to not have a hard cliff but a gradual changeover -
> hard cliffs tend to lead to issues one can't see coming.

As soon as you change your behavior you have to account for the fact
that you behaved lazily up until all prior VACUUMs. I think that
you're better off just being eager with new pages and modified pages,
while not specifically going

> I IIRC previously was handwaving at keeping track of the average age of tuples
> on all-visible pages. That could extend the prior heuristic. A heavily
> changing table will have a relatively young average, a more append only table
> will have an increasing average age.
>
>
> It might also make sense to look at the age of relfrozenxid - there's really
> no point in being overly eager if the relation is quite young.

I don't think that's true. What about bulk loading? It's a totally
valid and common requirement.

--
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Robert Haas
On Wed, Jan 25, 2023 at 8:49 PM Andres Freund  wrote:
> The concrete setting of vacuum_freeze_strategy_threshold doesn't matter.
> Table size simply isn't a usable proxy for whether eager freezing is a good
> idea or not.

I strongly agree. I can't imagine how a size-based threshold can make
any sense at all.

Both Andres and I have repeatedly expressed concern about how much is
being changed in the behavior of vacuum, and how quickly, and IMHO on
the basis of very limited evidence that the changes are improvements.
The fact that Andres was very quickly able to find cases where the
patch produces large regressions is just more evidence of that. It's
also hard to even understand what has been changed, because the
descriptions are so theoretical.

I think we're on a very dangerous path here. I want VACUUM to be
better as much as the next person, but I really don't believe that's the
direction we're headed. I think if we release like this, we're going
to experience more VACUUM pain, not less. And worse still, I don't
think anyone other than Peter and Andres is going to understand why
it's happening.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Andres Freund
Hi,

On 2023-01-25 18:31:16 -0800, Peter Geoghegan wrote:
> On Wed, Jan 25, 2023 at 5:49 PM Andres Freund  wrote:
> > Sure. But significantly regressing plausible if not common workloads is
> > different than knowing that there'll be some edge case where we'll do
> > something worse.
> 
> That's very vague. Significant to whom, for what purpose?

Sure it's vague. But you can't tell me that it's uncommon to use postgres to
store rows that aren't retained for > 50 million xids.



> > I reproduced both with checkpoint_timeout=5min and 1min. 1min is easier for
> > impatient me.
> 
> You said "Autovacuum on average generates between 1.5x-7x as much WAL
> as before". Why stop there, though? There's a *big* multiplicative
> effect in play here from FPIs, obviously, so the sky's the limit. Why
> not set checkpoint_timeout to 30s?

The amount of WAL increases substantially even with 5min, though the degree of
the increase varies more. But that largely vanishes if you increase the time
after which rows are deleted a bit. I just am not patient enough to wait for
that.


> > I switched between vacuum_freeze_strategy_threshold=0 and
> > vacuum_freeze_strategy_threshold=too-high, because it's quicker/takes less
> > warmup to set up something with smaller tables.
> 
> This makes no sense to me, at all.

It's quicker to run the workload with a table that initially is below 4GB
while still being able to test the eager strategy. It wouldn't change anything
fundamental to just make the rows a bit wider, or to have a static portion of
the table.

And changing between vacuum_freeze_strategy_threshold=0/very-large (or I
assume -1, didn't check) while the workload is running avoids having to wait
until the 120s delay before deletions start has passed.


> > The concrete setting of vacuum_freeze_strategy_threshold doesn't matter.
> > Table size simply isn't a usable proxy for whether eager freezing is a good
> > idea or not.
> 
> It's not supposed to be - you have it backwards. It's intended to work
> as a proxy for whether lazy freezing is a bad idea, particularly in
> the worst case.

That's a distinction without a difference.


> There is also an effect that likely would have been protective with
> your test case had you used a larger table with the same test case
> (and had you not lowered vacuum_freeze_strategy_threshold from its
> already low default).

Again, you just need a less heavily changing portion of the table or a
slightly larger "deletion delay" and you end up with a table well over
4GB. Even as stated I end up with > 4GB after a bit of running.

It's just a shortcut to make testing this easier.



> > You can have a 1TB table full of transient data, or you can have a 1TB table
> > where part of the data is transient and only settles after a time. In 
> > neither
> > case eager freezing is ok.
> 
> It sounds like you're not willing to accept any kind of trade-off.

I am. Just not every tradeoff. I just don't see any useful tradeoffs purely
based on the relation size.


> How, in general, can we detect what kind of 1TB table it will be, in the
> absence of user input?

I suspect we'll need some form of heuristics to differentiate between tables
that are more append heavy and tables that are changing more heavily. I think
it might be preferable to not have a hard cliff but a gradual changeover -
hard cliffs tend to lead to issues one can't see coming.

I think several of the heuristics below become easier once we introduce "xid
age" vacuums.


One idea is to start tracking the number of all-frozen pages in pg_class. If
there's a significant percentage of all-visible but not all-frozen pages,
vacuum should be more eager. If only a small portion of the table is not
frozen, there's no need to be eager. If only a small portion of the table is
all-visible, there similarly is no need to freeze eagerly.


IIRC I previously was handwaving at keeping track of the average age of tuples
on all-visible pages. That could extend the prior heuristic. A heavily
changing table will have a relatively young average, a more append-only table
will have an increasing average age.


It might also make sense to look at the age of relfrozenxid - there's really
no point in being overly eager if the relation is quite young. And a very
heavily changing table will tend to be younger. But likely the approach of
tracking the age of all-visible pages will be more accurate.
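
Putting those pieces together as a sketch - this assumes a hypothetical
all-frozen page counter kept alongside relallvisible in pg_class, and the
0.1/0.5 cutoffs are placeholders, so none of it is existing PostgreSQL code:

    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* returns 0.0 (stay lazy) .. 1.0 (freeze eagerly), ramping up gradually */
    static double
    freeze_eagerness(double relpages,
                     double pages_all_visible,
                     double pages_all_frozen,       /* hypothetical counter */
                     TransactionId relfrozenxid_age,
                     TransactionId freeze_max_age)
    {
        double visible_frac = relpages > 0 ? pages_all_visible / relpages : 0.0;
        double unfrozen_backlog = pages_all_visible > 0 ?
            1.0 - (pages_all_frozen / pages_all_visible) : 0.0;
        double age_frac = freeze_max_age > 0 ?
            (double) relfrozenxid_age / freeze_max_age : 1.0;

        if (age_frac > 1.0)
            age_frac = 1.0;

        /* little is all-visible, or little of that is unfrozen: stay lazy */
        if (visible_frac < 0.1 || unfrozen_backlog < 0.1)
            return 0.0;

        /* young relations get little eagerness; ramp up with age, no cliff */
        return unfrozen_backlog * (0.5 + 0.5 * age_frac);
    }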



The heuristics don't have to be perfect. If we get progressively more eager,
an occasional somewhat eager vacuum isn't a huge issue, as long as it then
leads to the next few vacuums to become less eager.



> And in the absence of user input, why would we prefer to default to a
> behavior that is highly destabilizing when we get it wrong?

Users know the current behaviour. Introducing significant issues that didn't
previously exist will cause new issues and new frustrations.

Greetings,

Andres Freund




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Peter Geoghegan
On Wed, Jan 25, 2023 at 6:33 PM Andres Freund  wrote:
> My point was the other way round. That vacuum_freeze_min_age *prevented* us
> from freezing rows "too soon" - obviously a very blunt instrument.

Yes, not freezing at all until aggressive vacuum is definitely good
when you don't really need to freeze at all.

> Since page level freezing, it only partially does that, because we'll freeze
> even newer rows, if pruning triggered an FPI (I don't think that's quite the
> right check, but that's a separate discussion).

But the added cost is very low, and it might well make all the difference.

> As far as I can tell, with the eager strategy, the only thing
> vacuum_freeze_min_age really influences is whether we'll block waiting for a
> cleanup lock.  IOW, VACUUM on a table > vacuum_freeze_strategy_threshold is
> now a slightly less-blocking version of VACUUM FREEZE.

That's simply not true, at all. I'm very surprised that you think
that. The commit message very clearly addresses this. You know, the
part that you specifically quoted to complain about today!

Once again I'll refer you to my Wiki page on this:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2

The difference between this and VACUUM FREEZE is described here:

"Note how we freeze most pages, but still leave a significant number
unfrozen each time, despite using an eager approach to freezing
(2981204 scanned - 2355230 frozen = 625974 pages scanned but left
unfrozen). Again, this is because we don't freeze pages unless they're
already eligible to be set all-visible. We saw the same effect with
the first pgbench_history example, but it was hardly noticeable at all
there. Whereas here we see that even eager freezing opts to hold off
on freezing relatively many individual heap pages, due to the observed
conditions on those particular heap pages."

If it were true that the eager freezing strategy behaved just the same as
VACUUM FREEZE (at least as far as freezing is concerned) then
scenarios like this one would show that VACUUM froze practically all
of the pages it scanned -- maybe fully 100% of all scanned pages would
be frozen. This effect is absent from small tables, and I suspect that
it's absent from your test case in part because you used a table that
was too small.

Obviously the way that eager freezing strategy avoids freezing
concurrently modified pages isn't perfect. It's one approach to
limiting the downside from eager freezing, in tables (or even
individual pages) where it's inappropriate. Of course that isn't
perfect, but it's a significant factor.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Andres Freund
Hi,

On 2023-01-25 17:28:48 -0800, Peter Geoghegan wrote:
> On Wed, Jan 25, 2023 at 5:15 PM Andres Freund  wrote:
> > However, it significantly increases the overall work when rows have a
> > somewhat limited lifetime. The documented reason why vacuum_freeze_min_age
> > exists - although I think it doesn't really achieve its documented goal
> > anymore, after the recent page-level freezing changes.
> 
> Huh? vacuum_freeze_min_age hasn't done that, at all. At least not
> since the visibility map went in back in 8.4:

My point was the other way round. That vacuum_freeze_min_age *prevented* us
from freezing rows "too soon" - obviously a very blunt instrument.

Since page level freezing, it only partially does that, because we'll freeze
even newer rows, if pruning triggered an FPI (I don't think that's quite the
right check, but that's a separate discussion).

As far as I can tell, with the eager strategy, the only thing
vacuum_freeze_min_age really influences is whether we'll block waiting for a
cleanup lock.  IOW, VACUUM on a table > vacuum_freeze_strategy_threshold is
now a slightly less-blocking version of VACUUM FREEZE.


The paragraph I was referencing:

    One disadvantage of decreasing vacuum_freeze_min_age is that it might
    cause VACUUM to do useless work: freezing a row version is a waste of
    time if the row is modified soon thereafter (causing it to acquire a new
    XID).  So the setting should be large enough that rows are not frozen
    until they are unlikely to change any more.

But now vacuum_freeze_min_age doesn't reliably influence whether we'll freeze
rows anymore.

Am I missing something here?



> > > VACUUM determines its freezing strategy based on the value of the new
> > > vacuum_freeze_strategy_threshold GUC (or reloption) with logged tables;
> > > tables that exceed the size threshold use the eager freezing strategy.
> >
> > I think that's not a sufficient guard at all. The size of a table doesn't 
> > say
> > much about how a table is used.
> 
> Sufficient for what purpose?

To not regress a substantial portion of our userbase.

Greetings,

Andres Freund




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Peter Geoghegan
On Wed, Jan 25, 2023 at 5:49 PM Andres Freund  wrote:
> Sure. But significantly regressing plausible if not common workloads is
> different than knowing that there'll be some edge case where we'll do
> something worse.

That's very vague. Significant to whom, for what purpose?

> prep:
> CREATE TABLE pgbench_time_data(client_id int8 NOT NULL, ts timestamptz NOT 
> NULL, filla int8 NOT NULL, fillb int8 not null, fillc int8 not null);
> CREATE INDEX ON pgbench_time_data(ts);
> ALTER SYSTEM SET autovacuum_naptime = '10s';
> ALTER SYSTEM SET autovacuum_vacuum_cost_delay TO -1;
> ALTER SYSTEM SET synchronous_commit = off; -- otherwise more clients are needed
>
> pgbench script, with 15 clients:
> INSERT INTO pgbench_time_data(client_id, ts, filla, fillb, fillc) VALUES 
> (:client_id, now(), 0, 0, 0);
>
> psql session deleting old data:
> EXPLAIN ANALYZE DELETE FROM pgbench_time_data WHERE ts < now() - 
> '120s'::interval \watch 1
>
> Realistically the time should be longer, but I didn't want to wait that long
> for the deletions to actually start.

I'll review this tomorrow.

> I reproduced both with checkpoint_timeout=5min and 1min. 1min is easier for
> impatient me.

You said "Autovacuum on average generates between 1.5x-7x as much WAL
as before". Why stop there, though? There's a *big* multiplicative
effect in play here from FPIs, obviously, so the sky's the limit. Why
not set checkpoint_timeout to 30s?

> I switched between vacuum_freeze_strategy_threshold=0 and
> vacuum_freeze_strategy_threshold=too-high, because it's quicker/takes less
> warmup to set up something with smaller tables.

This makes no sense to me, at all.

> The concrete setting of vacuum_freeze_strategy_threshold doesn't matter.
> Table size simply isn't a usable proxy for whether eager freezing is a good
> idea or not.

It's not supposed to be - you have it backwards. It's intended to work
as a proxy for whether lazy freezing is a bad idea, particularly in
the worst case.

There is also an effect that likely would have been protective had you
used a larger table with the same test case (and had you not lowered
vacuum_freeze_strategy_threshold from its already low default). In
general there'd be a much better chance of
concurrent reuse of space by new inserts discouraging page-level
freezing, since VACUUM would take much longer relative to everything
else, as compared to a small table.

> You can have a 1TB table full of transient data, or you can have a 1TB table
> where part of the data is transient and only settles after a time. In neither
> case eager freezing is ok.

It sounds like you're not willing to accept any kind of trade-off.
How, in general, can we detect what kind of 1TB table it will be, in
the absence of user input? And in the absence of user input, why would
we prefer to default to a behavior that is highly destabilizing when
we get it wrong?

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Andres Freund
Hi,

On 2023-01-25 17:37:17 -0800, Peter Geoghegan wrote:
> On Wed, Jan 25, 2023 at 5:26 PM Andres Freund  wrote:
> > Another bad scenario: Some longrunning / hung transaction caused us to get
> > close to the xid wraparound. Problem was resolved, autovacuum runs. 
> > Previously
> > we wouldn't have frozen the portion of the table that was actively changing,
> > now we will. Consequence: We get closer to the "no write" limit / the outage
> > lasts longer.
> 
> Obviously it isn't difficult to just invent a new rule that gets
> applied by lazy_scan_strategy. For example, it would take me less than
> 5 minutes to write a patch that disables eager freezing when the
> failsafe is in effect.

Sure. I'm not saying that these issues cannot be addressed. Of course no patch
of a meaningful size is perfect and we all can't predict the future. But this
is a very significant behavioural change to vacuum, and there are pretty
simple scenarios in which it causes significant regressions. And at least some
of the issues have been pointed out before.

Greetings,

Andres Freund




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Andres Freund
Hi,

On 2023-01-25 17:22:32 -0800, Peter Geoghegan wrote:
> On Wed, Jan 25, 2023 at 4:43 PM Andres Freund  wrote:
> > I unfortunately haven't been able to keep up with the thread and saw this
> > just now. But I've expressed the concern below several times before, so it
> > shouldn't come as a surprise.
> 
> You missed the announcement 9 days ago, and the similar clear
> signalling of a commit from yesterday. I guess I'll need to start
> personally reaching out to you any time I commit anything in this area
> in the future. I almost considered doing that here, in fact.

There's just too much email on -hackers to keep up with, if I ever want to do
any development of my own. I raised this concern before though, so it's not
like it's a surprise.


> > The most common problematic scenario I see are tables full of rows with
> > limited lifetime. E.g. because rows get aggregated up after a while. Before
> > those rows practically never got frozen - but now we'll freeze them all the
> > time.
> 
> Fundamentally, the choice to freeze or not freeze is driven by
> speculation about the needs of the table, with some guidance from the
> user. That isn't new. It seems to me that it will always be possible
> for you to come up with an adversarial case that makes any given
> approach look bad, no matter how good it is. Of course that doesn't
> mean that this particular complaint has no validity; but it does mean
> that you need to be willing to draw the line somewhere.

Sure. But significantly regressing plausible if not common workloads is
different than knowing that there'll be some edge case where we'll do
something worse.


> > I whipped up a quick test: 15 pgbench threads insert rows, 1 psql \while
> > loop deletes older rows.
> 
> Can you post the script? And what setting did you use?

prep:
CREATE TABLE pgbench_time_data(client_id int8 NOT NULL, ts timestamptz NOT 
NULL, filla int8 NOT NULL, fillb int8 not null, fillc int8 not null);
CREATE INDEX ON pgbench_time_data(ts);
ALTER SYSTEM SET autovacuum_naptime = '10s';
ALTER SYSTEM SET autovacuum_vacuum_cost_delay TO -1;
ALTER SYSTEM SET synchronous_commit = off; -- otherwise more clients are needed

pgbench script, with 15 clients:
INSERT INTO pgbench_time_data(client_id, ts, filla, fillb, fillc) VALUES 
(:client_id, now(), 0, 0, 0);

psql session deleting old data:
EXPLAIN ANALYZE DELETE FROM pgbench_time_data WHERE ts < now() - 
'120s'::interval \watch 1

Realistically the time should be longer, but I didn't want to wait that long
for the deletions to actually start.


I reproduced both with checkpoint_timeout=5min and 1min. 1min is easier for
impatient me.


I switched between vacuum_freeze_strategy_threshold=0 and
vacuum_freeze_strategy_threshold=too-high, because it's quicker/takes less
warmup to set up something with smaller tables.

shared_buffers=32GB for the fits-in-s_b case, 1GB otherwise.

max_wal_size=150GB, log_autovacuum_min_duration=0, and a bunch of logging
settings.


> > Workload fits in s_b:
> >
> > Autovacuum on average generates between 1.5x-7x as much WAL as before,
> > depending on how things interact with checkpoints. And not just that, each
> > autovac cycle also takes substantially longer than before - the average time
> > for an autovacuum roughly doubled.  Which of course increases the amount of
> > bloat.
> 
> Anything that causes an autovacuum to take longer will effectively
> make autovacuum think that it has removed more bloat than it really
> has, which will then make autovacuum less aggressive when it really
> should be more aggressive. That's a preexisting issue, that needs to
> be accounted for in the context of this discussion.

That's not the problem here - on my system autovac starts again very
quickly. The problem is that we accumulate bloat while autovacuum is
running. Wasting time/WAL volume on freezing pages that don't need to be
frozen is an issue.



> In particular, it would be very useful to know what the parameters of
> the discussion are. Obviously I cannot come up with an algorithm that
> can literally predict the future. But I may be able to handle specific
> cases of concern better, or to better help users cope in whatever way.

> > This is significantly worse than I predicted. This was my first attempt at
> > coming up with a problematic workload. There'll likely be way worse in
> > production.
> 
> As I said in the commit message, the current default for
> vacuum_freeze_strategy_threshold is considered low, and was always
> intended to be provisional. Something that I explicitly noted would be
> reviewed after the beta period is over, once we gained more experience
> with the setting.

> I think that a far higher setting could be almost as effective. 32GB,
> or even 64GB could work quite well, since you'll still have the FPI
> optimization.

The concrete setting of vacuum_freeze_strategy_threshold doesn't matter.
Table size simply isn't a usable proxy for whether eager freezing is a good
idea or not.

You can have a 1TB table full of transient data, or you can have a 1TB table
where part of the data is transient and only settles after a time. In neither
case eager freezing is ok.



Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Peter Geoghegan
On Wed, Jan 25, 2023 at 5:26 PM Andres Freund  wrote:
> Another bad scenario: Some longrunning / hung transaction caused us to get
> close to the xid wraparound. Problem was resolved, autovacuum runs. Previously
> we wouldn't have frozen the portion of the table that was actively changing,
> now we will. Consequence: We get closer to the "no write" limit / the outage
> lasts longer.

Obviously it isn't difficult to just invent a new rule that gets
applied by lazy_scan_strategy. For example, it would take me less than
5 minutes to write a patch that disables eager freezing when the
failsafe is in effect.

> I don't see an alternative to reverting this for now.

I want to see your test case before acting.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Peter Geoghegan
On Wed, Jan 25, 2023 at 5:15 PM Andres Freund  wrote:
> However, it significantly increases the overall work when rows have a somewhat
> limited lifetime. The documented reason why vacuum_freeze_min_age exists -
> although I think it doesn't really achieve its documented goal anymore, after
> the recent page-level freezing changes.

Huh? vacuum_freeze_min_age hasn't done that, at all. At least not
since the visibility map went in back in 8.4:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Today.2C_on_Postgres_HEAD_2

That's why we literally do ~100% of all freezing in aggressive mode
VACUUM with append-only or append-mostly tables.

> > VACUUM determines its freezing strategy based on the value of the new
> > vacuum_freeze_strategy_threshold GUC (or reloption) with logged tables;
> > tables that exceed the size threshold use the eager freezing strategy.
>
> I think that's not a sufficient guard at all. The size of a table doesn't say
> much about how a table is used.

Sufficient for what purpose?

> > Eager freezing is strictly more aggressive than lazy freezing.  Settings
> > like vacuum_freeze_min_age still get applied in just the same way in
> > every VACUUM, independent of the strategy in use.  The only mechanical
> > difference between eager and lazy freezing strategies is that only the
> > former applies its own additional criteria to trigger freezing pages.
>
> That's only true because vacuum_freeze_min_age has been fairly radically
> redefined recently.

So? This part of the commit message is a simple statement of fact.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Andres Freund
Hi,

On 2023-01-25 16:43:47 -0800, Andres Freund wrote:
> I think, as committed, this will cause serious issues for some reasonably
> common workloads, due to substantially increased WAL traffic.
> 
> 
> The most common problematic scenario I see are tables full of rows with
> limited lifetime. E.g. because rows get aggregated up after a while. Before
> those rows practically never got frozen - but now we'll freeze them all the
> time.

Another bad scenario: Some longrunning / hung transaction caused us to get
close to the xid wraparound. Problem was resolved, autovacuum runs. Previously
we wouldn't have frozen the portion of the table that was actively changing,
now we will. Consequence: We get closer to the "no write" limit / the outage
lasts longer.

I don't see an alternative to reverting this for now.

Greetings,

Andres Freund




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Peter Geoghegan
On Wed, Jan 25, 2023 at 4:43 PM Andres Freund  wrote:
> I unfortunately haven't been able to keep up with the thread and saw this just
> now. But I've expressed the concern below several times before, so it
> shouldn't come as a surprise.

You missed the announcement 9 days ago, and the similar clear
signalling of a commit from yesterday. I guess I'll need to start
personally reaching out to you any time I commit anything in this area
in the future. I almost considered doing that here, in fact.

> The most common problematic scenario I see are tables full of rows with
> limited lifetime. E.g. because rows get aggregated up after a while. Before
> those rows practically never got frozen - but now we'll freeze them all the
> time.

Fundamentally, the choice to freeze or not freeze is driven by
speculation about the needs of the table, with some guidance from the
user. That isn't new. It seems to me that it will always be possible
for you to come up with an adversarial case that makes any given
approach look bad, no matter how good it is. Of course that doesn't
mean that this particular complaint has no validity; but it does mean
that you need to be willing to draw the line somewhere.

In particular, it would be very useful to know what the parameters of
the discussion are. Obviously I cannot come up with an algorithm that
can literally predict the future. But I may be able to handle specific
cases of concern better, or to better help users cope in whatever way.

> I whipped up a quick test: 15 pgbench threads insert rows, 1 psql \while loop
> deletes older rows.

Can you post the script? And what setting did you use?

> Workload fits in s_b:
>
> Autovacuum on average generates between 1.5x-7x as much WAL as before,
> depending on how things interact with checkpoints. And not just that, each
> autovac cycle also takes substantially longer than before - the average time
> for an autovacuum roughly doubled.  Which of course increases the amount of
> bloat.

Anything that causes an autovacuum to take longer will effectively
make autovacuum think that it has removed more bloat than it really
has, which will then make autovacuum less aggressive when it really
should be more aggressive. That's a preexisting issue, that needs to
be accounted for in the context of this discussion.

> This is significantly worse than I predicted. This was my first attempt at
> coming up with a problematic workload. There'll likely be way worse in
> production.

As I said in the commit message, the current default for
vacuum_freeze_strategy_threshold is considered low, and was always
intended to be provisional. Something that I explicitly noted would be
reviewed after the beta period is over, once we gained more experience
with the setting.

I think that a far higher setting could be almost as effective. 32GB,
or even 64GB could work quite well, since you'll still have the FPI
optimization.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Andres Freund
Hi,

On 2023-01-24 14:49:38 -0800, Peter Geoghegan wrote:
> From e41d3f45fcd6f639b768c22139006ad11422575f Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan 
> Date: Thu, 24 Nov 2022 18:20:36 -0800
> Subject: [PATCH v17 1/3] Add eager and lazy freezing strategies to VACUUM.
> 
> Eager freezing strategy avoids large build-ups of all-visible pages.  It
> makes VACUUM trigger page-level freezing whenever doing so will enable
> the page to become all-frozen in the visibility map.  This is useful for
> tables that experience continual growth, particularly strict append-only
> tables such as pgbench's history table.  Eager freezing significantly
> improves performance stability by spreading out the cost of freezing
> over time, rather than doing most freezing during aggressive VACUUMs.
> It complements the insert autovacuum mechanism added by commit b07642db.

However, it significantly increases the overall work when rows have a somewhat
limited lifetime. The documented reason why vacuum_freeze_min_age exists -
although I think it doesn't really achieve its documented goal anymore, after
the recent page-level freezing changes.


> VACUUM determines its freezing strategy based on the value of the new
> vacuum_freeze_strategy_threshold GUC (or reloption) with logged tables;
> tables that exceed the size threshold use the eager freezing strategy.

I think that's not a sufficient guard at all. The size of a table doesn't say
much about how a table is used.


> Unlogged tables and temp tables will always use eager freezing strategy,
> since there is essentially no downside.

I somewhat doubt that that is true, but certainly the cost is lower.


> Eager freezing is strictly more aggressive than lazy freezing.  Settings
> like vacuum_freeze_min_age still get applied in just the same way in
> every VACUUM, independent of the strategy in use.  The only mechanical
> difference between eager and lazy freezing strategies is that only the
> former applies its own additional criteria to trigger freezing pages.

That's only true because vacuum_freeze_min_age has been fairly radically
redefined recently.

Greetings,

Andres Freund




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Andres Freund
Hi,

On 2023-01-24 14:49:38 -0800, Peter Geoghegan wrote:
> On Mon, Jan 16, 2023 at 5:55 PM Peter Geoghegan  wrote:
> > 0001 (the freezing strategies patch) is now committable IMV. Or at
> > least will be once I polish the docs a bit more. I plan on committing
> > 0001 some time next week, barring any objections.
>
> I plan on committing 0001 (the freezing strategies commit) tomorrow
> morning, US Pacific time.

I unfortunately haven't been able to keep up with the thread and saw this just
now. But I've expressed the concern below several times before, so it
shouldn't come as a surprise.

I think, as committed, this will cause serious issues for some reasonably
common workloads, due to substantially increased WAL traffic.


The most common problematic scenario I see are tables full of rows with
limited lifetime. E.g. because rows get aggregated up after a while. Before
those rows practically never got frozen - but now we'll freeze them all the
time.


I whipped up a quick test: 15 pgbench threads insert rows, 1 psql \while loop
deletes older rows.

Workload fits in s_b:

Autovacuum on average generates between 1.5x-7x as much WAL as before,
depending on how things interact with checkpoints. And not just that, each
autovac cycle also takes substantially longer than before - the average time
for an autovacuum roughly doubled.  Which of course increases the amount of
bloat.


When workload doesn't fit in s_b:

Time for vacuuming goes up to ~5x. WAL volume to ~9x. Autovacuum can't keep up
with bloat, every vacuum takes longer than the prior one:
65s->78s->139s->176s
And that's with autovac cost limits removed! Relation size nearly doubles due
to bloat.


After I disabled the new strategy autovac started to catch up again:
124s->101s->103->46s->20s->28s->24s


This is significantly worse than I predicted. This was my first attempt at
coming up with a problematic workload. There'll likely be way worse in
production.



I think as-is this logic will cause massive issues.

Andres




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-25 Thread Matthias van de Meent
On Tue, 24 Jan 2023 at 23:50, Peter Geoghegan  wrote:
>
> On Mon, Jan 16, 2023 at 5:55 PM Peter Geoghegan  wrote:
> > 0001 (the freezing strategies patch) is now committable IMV. Or at
> > least will be once I polish the docs a bit more. I plan on committing
> > 0001 some time next week, barring any objections.
>
> I plan on committing 0001 (the freezing strategies commit) tomorrow
> morning, US Pacific time.
>
> Attached is v17. There are no significant differences compared to v16.
> I decided to post a new version now, ahead of commit, to show how I've
> cleaned up the docs in 0001 -- docs describing the new GUC, freeze
> strategies, and so on.

LGTM, +1 on 0001

Some more comments on 0002:

> +lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
> scanned_pages_lazy & scanned_pages_eager

We have not yet scanned the pages, so I suggest plan/scan_pages_eager
and *_lazy as variable names instead, to minimize confusion about the
naming.

I'll await the next iteration of 0002 in which you've completed more
TODOs before I'll dig deeper into that patch.


Kind regards,

Matthias van de Meent




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-23 Thread Peter Geoghegan
On Mon, Jan 23, 2023 at 3:17 AM Dilip Kumar  wrote:
> My final set of comments for 0002

Thanks for the review!

> I do not understand much use of maintaining these two
> 'scanned_pages_lazy' and 'scanned_pages_eager' variables.  I think
> just maintaining 'scanned_pages' should be sufficient.  I do not see
> in patches also they are really used.

I agree that the visibility map snapshot struct could stand to be
cleaned up -- some of that state may not be needed, and it wouldn't be
that hard to use memory a little more economically, particularly with
very small tables. It's on my TODO list already.

> +#define MAX_PAGES_YOUNG_TABLEAGE    0.05    /* 5% of rel_pages */
> +#define MAX_PAGES_OLD_TABLEAGE      0.70    /* 70% of rel_pages */
>
> What is the logic behind 5% and 70%? Are those based on some
> experiments?  Should those be tuning parameters so that with real
> world use cases if we realise that it would be good if the eager scan
> is getting selected more frequently or less frequently then we can
> tune those parameters?

The specific multiplier constants chosen (for
MAX_PAGES_YOUNG_TABLEAGE and MAX_PAGES_OLD_TABLEAGE) were based on
both experiments and intuition. The precise values could be somewhat
different without it really mattering, though. For example, with a
table like pgbench_history (which is a really important case for the
patch in general), there won't be any all-visible pages at all (at
least after a short while), so it won't matter what these constants
are -- eager scanning will always be chosen.

I don't think that they should be parameters. The useful parameter for
users remains vacuum_freeze_table_age/autovacuum_freeze_max_age (note
that vacuum_freeze_table_age usually gets its value from
autovacuum_freeze_max_age due to changes in 0002). Like today,
vacuum_freeze_table_age forces VACUUM to scan all not-all-frozen pages
so that relfrozenxid can be advanced. Unlike today, it forces eager
scanning (not aggressive mode). But even long before eager scanning is
*forced*, pressure to use eager scanning gradually builds. That
pressure will usually cause some VACUUM to use eager scanning before
it's strictly necessary. Overall,
vacuum_freeze_table_age/autovacuum_freeze_max_age now provide loose
guidance.

It really has to be loose in this sense in order for
lazy_scan_strategy() to have the freedom to do the right thing based
on the characteristics of the table as a whole, according to its
visibility map snapshot. This allows lazy_scan_strategy() to stumble
upon once-off opportunities to advance relfrozenxid inexpensively,
including cases where it could never happen with the current model.
These opportunities are side-effects of workload characteristics that
can be hard to predict [1][2].
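
Roughly, the rule being described can be sketched like this (a sketch only;
the constants are the ones quoted above, but the bucketing, young_cutoff,
and the variable names are assumptions rather than the patch's exact logic):

    BlockNumber extra_pages = scanned_pages_eager - scanned_pages_lazy;
    BlockNumber max_extra;
    bool        use_eager_scan;

    if (tableagefrac >= 1.0)
        max_extra = rel_pages;  /* table age alone forces eager scanning */
    else if (tableagefrac < young_cutoff)
        max_extra = (BlockNumber) (rel_pages * MAX_PAGES_YOUNG_TABLEAGE);   /* ~5% */
    else
        max_extra = (BlockNumber) (rel_pages * MAX_PAGES_OLD_TABLEAGE);     /* up to ~70% */

    use_eager_scan = (extra_pages <= max_extra);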

> I think this should be moved as first if case, I mean why to do all
> the calculations based on the 'tableagefrac' and
> 'TABLEAGEFRAC_XXPOINT' if we are forced to scan them all.  I agree the
> extra computation we are doing might not really matter compared to the
> vacuum work we are going to perform but still seems logical to me to
> do the simple check first.

This is only needed for DISABLE_PAGE_SKIPPING, which is an escape
hatch option that is never supposed to be needed. I don't think that
it's worth going to the trouble of indenting the code more just so
this is avoided -- it really is an afterthought. Besides, the compiler
might well be doing this for us.

> 4. Should we move prefetching as a separate patch, instead of merging
> with the scanning strategy?

I don't think that breaking that out would be an improvement. A lot of
the prefetching stuff informs how the visibility map code is
structured.

[1] 
https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_3
[2] 
https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Opportunistically_advancing_relfrozenxid_with_bursty.2C_real-world_workloads
--
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-23 Thread Dilip Kumar
On Wed, Jan 18, 2023 at 1:47 PM Dilip Kumar  wrote:
>
> On Tue, Jan 17, 2023 at 10:05 AM Peter Geoghegan  wrote:

My final set of comments for 0002

1.
+struct vmsnapshot
+{
+    /* Target heap rel */
+    Relation    rel;
+    /* Scanning strategy used by VACUUM operation */
+    vmstrategy  strat;
+    /* Per-strategy final scanned_pages */
+    BlockNumber rel_pages;
+    BlockNumber scanned_pages_lazy;
+    BlockNumber scanned_pages_eager;

I do not understand much use of maintaining these two
'scanned_pages_lazy' and 'scanned_pages_eager' variables.  I think
just maintaining 'scanned_pages' should be sufficient.  I do not see
in patches also they are really used.  lazy_scan_strategy() is using
these variables but this is getting values of these out parameters
from visibilitymap_snap_acquire().  And visibilitymap_snap_strategy()
is also using this, but it seems there we just need the final result
of 'scanned_pages' instead of these two variables.

2.

+#define MAX_PAGES_YOUNG_TABLEAGE    0.05    /* 5% of rel_pages */
+#define MAX_PAGES_OLD_TABLEAGE      0.70    /* 70% of rel_pages */

What is the logic behind 5% and 70%? Are those based on some
experiments?  Should those be tuning parameters so that with real
world use cases if we realise that it would be good if the eager scan
is getting selected more frequently or less frequently then we can
tune those parameters?

3.
+/*
+ * VACUUM's DISABLE_PAGE_SKIPPING option overrides our decision by forcing
+ * VACUUM to scan every page (VACUUM effectively distrusts rel's VM)
+ */
+    if (force_scan_all)
+        vacrel->vmstrat = VMSNAP_SCAN_ALL;

I think this should be moved as first if case, I mean why to do all
the calculations based on the 'tableagefrac' and
'TABLEAGEFRAC_XXPOINT' if we are forced to scan them all.  I agree the
extra computation we are doing might not really matter compared to the
vacuum work we are going to perform but still seems logical to me to
do the simple check first.

4. Should we move prefetching as a separate patch, instead of merging
with the scanning strategy?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-18 Thread Dilip Kumar
On Tue, Jan 17, 2023 at 10:05 AM Peter Geoghegan  wrote:
>
> On Mon, Jan 16, 2023 at 8:13 PM Dilip Kumar  wrote:
> > I think that it makes sense to keep 'vacuum_freeze_strategy_threshold'
> > strictly for freezing.  But the point is that the eager scanning
> > strategy is driven by table freezing needs of the table (tableagefrac)
> > that make sense, but if we have selected the eager freezing based on
> > the table age and its freezing need then why don't we force the eager
> > freezing as well if we have selected eager scanning, after all the
> > eager scanning is selected for satisfying the freezing need.
>
> Don't think of eager scanning as the new name for aggressive mode --
> it's a fairly different concept, because we care about costs now.
> Eager scanning can be chosen just because it's very cheap relative to
> the alternative of lazy scanning, even when relfrozenxid is still very
> recent. (This kind of behavior isn't really new [1], but the exact
> implementation from the patch is new.)
>
> Tables such as pgbench_branches and pgbench_tellers will reliably use
> eager scanning strategy, no matter how any GUC has been set -- just
> because the added cost is always zero (relative to lazy scanning). It
> really doesn't matter how far along tableagefrac is here, ever. These
> same tables will never use eager freezing strategy, unless the
> vacuum_freeze_strategy_threshold GUC is misconfigured. (This is
> another example of how scanning strategy and freezing strategy may
> differ for the same table.)

Yes, I agree with that.  Thanks for explaining in detail.

> You do have a good point, though. I think that I know what you mean.
> Note that antiwraparound autovacuums (or VACUUMs of tables very near
> to that point) *will* always use both the eager freezing strategy and
> the eager scanning strategy -- which is probably close to what you
> meant.

Right

> The important point is that there can be more than one reason to
> prefer one strategy to another -- and the reasons can be rather
> different. Occasionally it'll be a combination of two factors together
> that push things in favor of one strategy over the other -- even
> though either factor on its own would not have resulted in the same
> choice.

Yes, that makes sense to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-16 Thread Peter Geoghegan
On Mon, Jan 16, 2023 at 8:13 PM Dilip Kumar  wrote:
> I think that it makes sense to keep 'vacuum_freeze_strategy_threshold'
> strictly for freezing.  But the point is that the eager scanning
> strategy is driven by table freezing needs of the table (tableagefrac)
> that make sense, but if we have selected the eager freezing based on
> the table age and its freezing need then why don't we force the eager
> freezing as well if we have selected eager scanning, after all the
> eager scanning is selected for satisfying the freezing need.

Don't think of eager scanning as the new name for aggressive mode --
it's a fairly different concept, because we care about costs now.
Eager scanning can be chosen just because it's very cheap relative to
the alternative of lazy scanning, even when relfrozenxid is still very
recent. (This kind of behavior isn't really new [1], but the exact
implementation from the patch is new.)

Tables such as pgbench_branches and pgbench_tellers will reliably use
eager scanning strategy, no matter how any GUC has been set -- just
because the added cost is always zero (relative to lazy scanning). It
really doesn't matter how far along tableagefrac is here, ever. These
same tables will never use eager freezing strategy, unless the
vacuum_freeze_strategy_threshold GUC is misconfigured. (This is
another example of how scanning strategy and freezing strategy may
differ for the same table.)

You do have a good point, though. I think that I know what you mean.
Note that antiwraparound autovacuums (or VACUUMs of tables very near
to that point) *will* always use both the eager freezing strategy and
the eager scanning strategy -- which is probably close to what you
meant.

The important point is that there can be more than one reason to
prefer one strategy to another -- and the reasons can be rather
different. Occasionally it'll be a combination of two factors together
that push things in favor of one strategy over the other -- even
though either factor on its own would not have resulted in the same
choice.

[1] 
https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Constantly_updated_tables_.28usually_smaller_tables.29
-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-16 Thread Dilip Kumar
On Mon, Jan 16, 2023 at 11:31 PM Peter Geoghegan  wrote:
>
> > I think '(nextXID - cutoffs->relfrozenxid) / freeze_table_age' should
> > be the actual fraction right?  What is the point of adding 0.5 to the
> > divisor?  If there is a logical reason, maybe we can explain in the
> > comments.
>
> It's just a way of avoiding division by zero.

oh, correct :)

> > While looking into the logic of 'lazy_scan_strategy', I think the idea
> > looks very good but the only thing is that
> > we have kept eager freeze and eager scan completely independent.
> > Don't you think that if a table is chosen for an eager scan
> > then we should force the eager freezing as well?
>
> Earlier versions of the patch kind of worked that way.
> lazy_scan_strategy would actually use twice the GUC setting to
> determine scanning strategy. That approach could make our "transition
> from lazy to eager strategies" involve an excessive amount of
> "catch-up freezing" in the VACUUM operation that advanced relfrozenxid
> for the first time, which you see an example of here:
>
> https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch
>
> Now we treat the scanning and freezing strategies as two independent
> choices. Of course they're not independent in any practical sense, but
> I think it's slightly simpler and more elegant that way -- it makes
> the GUC vacuum_freeze_strategy_threshold strictly about freezing
> strategy, while still leading to VACUUM advancing relfrozenxid in a
> way that makes sense. It just happens as a second order effect. Why
> add a special case?

I think that it makes sense to keep 'vacuum_freeze_strategy_threshold'
strictly for freezing.  But the point is that the eager scanning
strategy is driven by table freezing needs of the table (tableagefrac)
that make sense, but if we have selected the eager freezing based on
the table age and its freezing need then why don't we force the eager
freezing as well if we have selected eager scanning, after all the
eager scanning is selected for satisfying the freezing need.  But
OTOH, the eager scanning might get selected if it appears that we
might not have to scan too many extra pages compared to lazy scan so
in those cases forcing the eager freezing might not be wise.  So maybe
it is a good idea to keep them the way you have in your patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-16 Thread Peter Geoghegan
On Mon, Jan 16, 2023 at 10:00 AM Peter Geoghegan  wrote:
> Now we treat the scanning and freezing strategies as two independent
> choices. Of course they're not independent in any practical sense, but
> I think it's slightly simpler and more elegant that way -- it makes
> the GUC vacuum_freeze_strategy_threshold strictly about freezing
> strategy, while still leading to VACUUM advancing relfrozenxid in a
> way that makes sense. It just happens as a second order effect. Why
> add a special case?

This might be a better way to explain it:

The main page-level freezing commit (commit 1de58df4) already added an
optimization that triggers page-level freezing "early" (early relative
to vacuum_freeze_min_age). This happens whenever a page already needs
to have an FPI logged inside lazy_scan_prune -- even when we're using
the lazy freezing strategy. The optimization isn't configurable, and
gets applied regardless of freezing strategy (technically there is no
such thing as freezing strategies on HEAD just yet, though HEAD still
has this optimization).
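
For reference, the triggering condition being referred to has roughly this
shape (a sketch of the idea only; pgWalUsage.wal_fpi is the existing WAL
instrumentation counter, but the surrounding names are simplified
assumptions):

    int64       fpi_before = pgWalUsage.wal_fpi;   /* FPIs logged before pruning this page */

    /* ... pruning runs here; with checksums it may log an FPI just to set a hint bit ... */

    if (prunestate.all_visible && prunestate.all_frozen &&
        pgWalUsage.wal_fpi > fpi_before)
        freeze_page = true;     /* an FPI was already paid for, so freeze "early" */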

There will be workloads where the FPI optimization will result in
freezing many more pages -- especially when data checksums are in use
(since then we could easily need to log an FPI just so pruning can set
a hint bit). As a result, certain VACUUMs that use the lazy freezing
strategy will freeze in almost the same way as an equivalent VACUUM
using the eager freezing strategy. Such a "nominally lazy but actually
quite eager" VACUUM operation should get the same benefit in terms of
relfrozenxid advancement as it would if it really had used the eager
freezing strategy instead. It's fairly obvious that we'll get the same
benefit in relfrozenxid advancement (comparable relfrozenxid results
for comparable freezing work), since the way that VACUUM decides on
its scanning strategy is not conditioned on freezing strategy (whether
by the ongoing VACUUM or any other VACUUM against the same table).

All that matters is the conditions in the table (in particular the
added cost of opting for eager scanning over lazy scanning) as
indicated by the visibility map at the start of each VACUUM -- how
those conditions came about really isn't interesting at that point.
And so lazy_scan_strategy doesn't care about them when it chooses
VACUUM's scanning strategy.

There are even tables/workloads where relfrozenxid will be able to
jump forward by a huge amount whenever VACUUM chooses the eager
scanning strategy, despite the fact that VACUUM generally does little
or no freezing to make that possible:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_3

--
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-16 Thread Peter Geoghegan
On Mon, Jan 16, 2023 at 10:10 AM Peter Geoghegan  wrote:
> Attached is v16, which incorporates some of Matthias' feedback.

0001 (the freezing strategies patch) is now committable IMV. Or at
least will be once I polish the docs a bit more. I plan on committing
0001 some time next week, barring any objections.

I should point out that 0001 is far shorter and simpler than the
page-level freezing commit that already went in (commit 1de58df4). The
only thing in 0001 that seems like it might be a bit controversial
(when considered on its own) is the addition of the
vacuum_freeze_strategy_threshold GUC/reloption. Note in particular
that vacuum_freeze_strategy_threshold doesn't look like any other
existing GUC; it gets applied as a threshold on the size of the rel's
main fork at the beginning of vacuumlazy.c processing. As far as I
know there are no objections to that approach at this time, but it
does still seem worth drawing attention to now.

0001 also makes unlogged tables and temp tables always use eager
freezing strategy, no matter how the GUC/reloption are set. This seems
*very* easy to justify, since the potential downside of such a policy
is obviously extremely low, even when we make very pessimistic
assumptions. The usual cost we need to worry about when it comes to
freezing is the added WAL overhead -- that clearly won't apply when
we're vacuuming non-permanent tables. That really just leaves the cost
of dirtying extra pages, which in general could have a noticeable
system-level impact in the case of unlogged tables.

Dirtying extra pages when vacuuming an unlogged table is also a
non-issue. Even the eager freezing strategy only freezes "extra" pages
("extra" relative to the lazy strategy behavior) given a page that
will be set all-visible in any case [1]. Such a page will need to have
its page-level PD_ALL_VISIBLE bit set in any case -- which is already
enough to dirty the page. And so there can never be any additional
pages dirtied as a result of the special policy 0001 adds for
non-permanent relations.

[1] 
https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2
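
As a rough illustration of the policy described here (a sketch under assumed
names; the MB-to-pages conversion and the vacrel field name are assumptions,
and the real decision is made in vacuumlazy.c):

    /* threshold GUC is in MB; convert to heap pages for the comparison */
    BlockNumber threshold_pages =
        (BlockNumber) (((uint64) vacuum_freeze_strategy_threshold * 1024 * 1024) / BLCKSZ);

    if (!RelationIsPermanent(rel) || rel_pages > threshold_pages)
        vacrel->eager_freeze_strategy = true;   /* unlogged/temp, or table over threshold */
    else
        vacrel->eager_freeze_strategy = false;  /* small permanent table: lazy freezing */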
--
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-16 Thread Peter Geoghegan
On Sun, Jan 15, 2023 at 9:13 PM Dilip Kumar  wrote:
> I have looked into the patch set, I think 0001 looks good to me about
> 0002 I have a few questions, 0003 I haven't yet looked at

Thanks for taking a look.

> I think '(nextXID - cutoffs->relfrozenxid) / freeze_table_age' should
> be the actual fraction right?  What is the point of adding 0.5 to the
> divisor?  If there is a logical reason, maybe we can explain in the
> comments.

It's just a way of avoiding division by zero.
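
A tiny illustration of the point (not the patch's full logic, which also
clamps the result and considers MultiXact age):

    /* freeze_table_age can legitimately be 0; +0.5 keeps the divisor nonzero */
    XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
        ((double) freeze_table_age + 0.5);
    /* e.g. an XID age of 50 million with freeze_table_age = 150 million gives ~0.33 */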

> While looking into the logic of 'lazy_scan_strategy', I think the idea
> looks very good but the only thing is that
> we have kept eager freeze and eager scan completely independent.
> Don't you think that if a table is chosen for an eager scan
> then we should force the eager freezing as well?

Earlier versions of the patch kind of worked that way.
lazy_scan_strategy would actually use twice the GUC setting to
determine scanning strategy. That approach could make our "transition
from lazy to eager strategies" involve an excessive amount of
"catch-up freezing" in the VACUUM operation that advanced relfrozenxid
for the first time, which you see an example of here:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch

Now we treat the scanning and freezing strategies as two independent
choices. Of course they're not independent in any practical sense, but
I think it's slightly simpler and more elegant that way -- it makes
the GUC vacuum_freeze_strategy_threshold strictly about freezing
strategy, while still leading to VACUUM advancing relfrozenxid in a
way that makes sense. It just happens as a second order effect. Why
add a special case?

In principle the break-even point for eager scanning strategy (i.e.
advancing relfrozenxid) is based on the added cost only under this
scheme. There is no reason for lazy_scan_strategy to care about what
happened in the past to make the eager scanning strategy look like a
good idea. Similarly, there isn't any practical reason why
lazy_scan_strategy  needs to anticipate what will happen in the near
future with freezing.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-15 Thread Dilip Kumar
On Mon, Jan 9, 2023 at 7:16 AM Peter Geoghegan  wrote:
>
> On Tue, Jan 3, 2023 at 12:30 PM Peter Geoghegan  wrote:
> > Attached is v14.
>
> This has stopped applying due to conflicts with nearby work on VACUUM
> from Tom. So I attached a new revision, v15, just to make CFTester
> green again.
>
> I didn't have time to incorporate any of the feedback from Matthias
> just yet. That will have to wait until v16.
>
I have looked into the patch set, I think 0001 looks good to me about
0002 I have a few questions, 0003 I haven't yet looked at

1.
+/*
+ * Finally, set tableagefrac for VACUUM.  This can come from either XID or
+ * MXID table age (whichever is greater currently).
+ */
+XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
+    ((double) freeze_table_age + 0.5);

I think '(nextXID - cutoffs->relfrozenxid) / freeze_table_age' should
be the actual fraction right?  What is the point of adding 0.5 to the
divisor?  If there is a logical reason, maybe we can explain in the
comments.

2.
While looking into the logic of 'lazy_scan_strategy', I think the idea
looks very good but the only thing is that
we have kept eager freeze and eager scan completely independent.
Don't you think that if a table is chosen for an eager scan
then we should force the eager freezing as well?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-06 Thread Peter Geoghegan
On Thu, Jan 5, 2023 at 10:19 AM Matthias van de Meent
 wrote:
> Could this use something like autovacuum_cost_delay? I don't quite
> like the use of arbitrary hardcoded millisecond delays

It's not unlike (say) the way that there can sometimes be hardcoded
waits inside GetMultiXactIdMembers(), which does run during VACUUM.

It's not supposed to be noticeable at all. If it is noticeable in any
practical sense, then the design is flawed, and should be fixed.

> it can slow a
> system down by a significant fraction, especially on high-contention
> systems, and this potential of 60ms delay per scanned page can limit
> the throughput of this new vacuum strategy to < 17 pages/second
> (<136kB/sec) for highly contended sections, which is not great.

We're only willing to wait the full 60ms when smaller waits don't work
out. And when 60ms doesn't do it, we'll then accept an older final
NewRelfrozenXid value. Our willingness to wait at all is conditioned
on the existing NewRelfrozenXid tracker being affected at all by
whether or not we accept reduced lazy_scan_noprune processing for the
page. So the waits are naturally self-limiting.
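
For reference, the backoff being discussed has roughly this shape (a sketch
of the retry loop; only the pg_usleep() pattern comes from the patch itself,
the rest of the surrounding code is assumed):

    for (int i = 1; i <= 3; i++)
    {
        /* wait 10ms, then 20ms, then 30ms, then give up */
        pg_usleep(1000L * 10L * i);

        if (ConditionalLockBufferForCleanup(buf))
            return true;    /* got the cleanup lock after a short wait */
    }

    /* give up: fall back to lazy_scan_noprune, possibly accepting an older NewRelfrozenXid */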

You may be right that I need to do more about the possibility of
something like that happening -- it's a legitimate concern. But I
think that this may be enough on its own. I've never seen a workload
where more than a small fraction of all pages couldn't be cleanup
locked right away. But I *have* seen workloads where VACUUM vainly
waited forever for a cleanup lock on one single heap page.

> It is also not unlikely that in the time it was waiting, the page
> contents were updated significantly (concurrent prune, DELETEs
> committed), which could result in improved bounds. I think we should
> redo the dead items check if we waited, but failed to get a lock - any
> tuples removed now reduce work we'll have to do later.

I don't think that it matters very much. That's always true. It seems
very unlikely that we'll get better bounds here, unless it happens by
getting a full cleanup lock and then doing full lazy_scan_prune
processing after all.

Sure, it's possible that a concurrent opportunistic prune could make
the crucial difference, even though we ourselves couldn't get a
cleanup lock despite going to considerable trouble. I just don't think
that it's worth doing anything about.

> > +++ b/doc/src/sgml/ref/vacuum.sgml
> > [...] Pages where
> > +  all tuples are known to be frozen are always skipped.
>
> "...are always skipped, unless the >DISABLE_PAGE_SKIPPING< option is used."

I'll look into changing this.

> > +++ b/doc/src/sgml/maintenance.sgml
>
> There are a lot of details being lost from the previous version of
> that document. Some of the details are obsolete (mentions of
> aggressive VACUUM and freezing behavior), but others are not
> (FrozenTransactionId in rows from a pre-9.4 system, the need for
> vacuum for prevention of issues surrounding XID wraparound).

I will admit that I really hate the "Routine Vacuuming" docs, and
think that they explain things in just about the worst possible way.

I also think that this needs to be broken up into pieces. As I said
recently, the docs are the part of the patch series that is the least
worked out.

> I also am not sure this is the best place to store most of these
> mentions, but I can't find a different place where these details on
> certain interesting parts of the system are documented, and plain
> removal of the information does not sit right with me.

I'm usually the person that argues for describing more implementation
details in the docs. But starting with low-level details here is
deeply confusing. At most these are things that should be discussed in
the context of internals, as part of some completely different
chapter.

I'll see about moving details of things like FrozenTransactionId somewhere else.

> Specifically, I don't like the removal of the following information
> from our documentation:
>
> - Size of pg_xact and pg_commit_ts data in relation to autovacuum_freeze_max_age
>    Although it is less likely with the new behaviour that we'll hit
> these limits due to more eager freezing of transactions, it is still
> important for users to have easy access to this information, and
> tuning this for storage size is not useless information.

That is a fair point. Though note that these things have weaker
relationships with settings like autovacuum_freeze_max_age now. Mostly
this is a positive improvement (in the sense that we can truncate
SLRUs much more aggressively on average), but not always.

> - The reason why VACUUM is essential to the long-term consistency of
> Postgres' MVCC system
> Informing the user about our use of 32-bit transaction IDs and
> that we update an epoch when this XID wraps around does not
> automatically make the user aware of the issues that surface around
> XID wraparound. Retaining the explainer for XID wraparound in the docs
> seems like a decent idea - it may be moved, but please don't delete it.

Re: New strategies for freezing, advancing relfrozenxid early

2023-01-06 Thread Peter Geoghegan
On Wed, Jan 4, 2023 at 5:21 PM Matthias van de Meent
 wrote:
> Some reviews (untested; only code review so far) on these versions of
> the patches:

Thanks for the review!

> > [PATCH v14 1/3] Add eager and lazy freezing strategies to VACUUM.

> I don't think the mention of 'cutoff point' is necessary when it has
> 'Threshold'.

Fair. Will fix.

> > +int         freeze_strategy_threshold;  /* threshold to use eager
> > [...]
> > +BlockNumber freeze_strategy_threshold;
>
> Is there a way to disable the 'eager' freezing strategy? `int` cannot
> hold the maximum BlockNumber...

I'm going to fix this by switching over to making the GUC (and the
reloption) GUC_UNIT_MB, while keeping it in ConfigureNamesInt[]. That
approach is a little bit more cumbersome, but not by much. That'll
solve this problem.
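
Concretely, that means a ConfigureNamesInt[] entry along these lines (a
sketch only; the group, description text, and default/min/max shown here are
placeholders rather than the patch's actual values):

    {
        {"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
            gettext_noop("Table size above which VACUUM uses its eager freezing strategy."),
            NULL,
            GUC_UNIT_MB
        },
        &vacuum_freeze_strategy_threshold,
        4096, 0, MAX_KILOBYTES,
        NULL, NULL, NULL
    },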

> > +lazy_scan_strategy(vacrel);
> >  if (verbose)
>
> I'm slightly surprised you didn't update the message for verbose vacuum
> to indicate whether we used the eager strategy: there are several GUCs
> for tuning this behaviour, so you'd expect to want direct confirmation
> that the configuration is effective.

Perhaps that would be worth doing, but I don't think that it's all
that useful in the grand scheme of things. I wouldn't mind including
it, but I think that it shouldn't be given much prominence. It's
certainly far less important than "aggressive vs non-aggressive" is
right now.

Eagerness is not just a synonym of aggressiveness. For example, every
VACUUM of a table like pgbench_tellers or pgbench_branches will use
eager scanning strategy. More generally, you have to bear in mind that
the actual state of the table is just as important as the GUCs
themselves. We try to avoid obligations that could be very hard or
even impossible for vacuumlazy.c to fulfill.

There are far weaker constraints on things like the final relfrozenxid
value we'll set in pg_class (more on this below, when I talk about
MinXid/MinMulti). It will advance far more frequently and by many more
XIDs than it would today, on average. But occasionally it will allow a
far earlier relfrozenxid than aggressive mode would ever allow, since
making some small amount of progress now is almost always much better
than making no progress at all.

> (looks at further patches) I see that the message for verbose vacuum
> sees significant changes in patch 2 instead.

It just works out to be slightly simpler that way. I want to add the
scanned_pages stuff to VERBOSE in the vmsnap/scanning strategies
commit, so I need to make significant changes to the initial VERBOSE
message in that commit. There is little point in preserving
information about aggressive mode if it's removed in the very next
commit anyway.

> > [PATCH v14 2/3] Add eager and lazy VM strategies to VACUUM.

> Right now, we don't use syncscan to determine a startpoint. I can't
> find the reason why in the available documentation: [0] discusses the
> issue but without clearly describing an issue why it wouldn't be
> interesting from a 'nothing lost' perspective.

That's not something I've given much thought to. It's a separate issue, I think.

Though I will say that one reason why I think that the vm snapshot
concept will become important is that working off an immutable
structure makes various things much easier, in fairly obvious ways. It
makes it straightforward to reorder work. So things like parallel heap
vacuuming are a lot more straightforward.

I also think that it would be useful to teach VACUUM to speculatively
scan a random sample of pages, just like a normal VACUUM. We start out
doing a normal VACUUM that just processes scanned_pages in a random
order. At some point we look at the state of pages so far. If it looks
like the table really doesn't urgently need to be vacuumed, then we
can give up before paying much of a cost. If it looks like the table
really needs to be VACUUM'd, we can press on almost like any other
VACUUM would.

This is related to the problem of bad statistics that drive
autovacuum. Deciding as much as possible at runtime, dynamically,
seems promising to me.

> In addition, I noticed that progress reporting of blocks scanned
> ("heap_blocks_scanned", duh) includes skipped pages. Now that we have
> a solid grasp of how many blocks we're planning to scan, we can update
> the reported stats to how many blocks we're planning to scan (and have
> scanned), increasing the user value of that progress view.

Yeah, that's definitely a natural direction to go with this. Knowing
scanned_pages from the start is a basis for much more useful progress
reporting.

> > +double      tableagefrac;
>
> I think this can use some extra info on the field itself, that it is
> the fraction of how "old" the relfrozenxid and relminmxid fields are,
> as a fraction between 0 (latest values; nextXID and nextMXID), and 1
> (values that are old by at least freeze_table_age and
> multixact_freeze_table_age (multi)transaction ids, respectively).

Agreed that that needs more t

Re: New strategies for freezing, advancing relfrozenxid early

2023-01-05 Thread Matthias van de Meent
On Thu, 5 Jan 2023 at 02:21, I wrote:
>
> On Tue, 3 Jan 2023 at 21:30, Peter Geoghegan  wrote:
> >
> > Attached is v14.
> > [PATCH v14 3/3] Finish removing aggressive mode VACUUM.
>
> I've not completed a review for this patch - I'll continue on that
> tomorrow:

This is that.

> @@ -2152,10 +2109,98 @@ lazy_scan_noprune(LVRelState *vacrel,
> [...]
> +/* wait 10ms, then 20ms, then 30ms, then give up */
> [...]
> +pg_usleep(1000L * 10L * i);

Could this use something like autovacuum_cost_delay? I don't quite
like the use of arbitrary hardcoded millisecond delays - it can slow a
system down by a significant fraction, especially on high-contention
systems, and this potential of 60ms delay per scanned page can limit
the throughput of this new vacuum strategy to < 17 pages/second
(<136kB/sec) for highly contended sections, which is not great.

It is also not unlikely that in the time it was waiting, the page
contents were updated significantly (concurrent prune, DELETEs
committed), which could result in improved bounds. I think we should
redo the dead items check if we waited, but failed to get a lock - any
tuples removed now reduce work we'll have to do later.

> +++ b/doc/src/sgml/ref/vacuum.sgml
> [...] Pages where
> +  all tuples are known to be frozen are always skipped.

"...are always skipped, unless the >DISABLE_PAGE_SKIPPING< option is used."

> +++ b/doc/src/sgml/maintenance.sgml

There are a lot of details being lost from the previous version of
that document. Some of the details are obsolete (mentions of
aggressive VACUUM and freezing behavior), but others are not
(FrozenTransactionId in rows from a pre-9.4 system, the need for
vacuum for prevention of issues surrounding XID wraparound).

I also am not sure this is the best place to store most of these
mentions, but I can't find a different place where these details on
certain interesting parts of the system are documented, and plain
removal of the information does not sit right with me.

Specifically, I don't like the removal of the following information
from our documentation:

- Size of pg_xact and pg_commit_ts data in relation to autovacuum_freeze_max_age
   Although it is less likely with the new behaviour that we'll hit
these limits due to more eager freezing of transactions, it is still
important for users to have easy access to this information, and
tuning this for storage size is not useless information.

- The reason why VACUUM is essential to the long-term consistency of
Postgres' MVCC system
Informing the user about our use of 32-bit transaction IDs and
that we update an epoch when this XID wraps around does not
automatically make the user aware of the issues that surface around
XID wraparound. Retaining the explainer for XID wraparound in the docs
seems like a decent idea - it may be moved, but please don't delete
it.

- Special transaction IDs, their meaning and where they can occur
   I can't seem to find any other information in the docs section, and
it is useful to have users understand that certain values are
considered special: FrozenTransactionId and BootstrapTransactionId.


Kind regards,

Matthias van de Meent




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-04 Thread Matthias van de Meent
On Tue, 3 Jan 2023 at 21:30, Peter Geoghegan  wrote:
>
> Attached is v14.

Some reviews (untested; only code review so far) on these versions of
the patches:

> [PATCH v14 1/3] Add eager and lazy freezing strategies to VACUUM.

> +/*
> + * Threshold cutoff point (expressed in # of physical heap rel blocks in
> + * rel's main fork) that triggers VACUUM's eager freezing strategy
> + */

I don't think the mention of 'cutoff point' is necessary when it has
'Threshold'.

> + int freeze_strategy_threshold; /* threshold to use eager
> [...]
> + BlockNumber freeze_strategy_threshold;

Is there a way to disable the 'eager' freezing strategy? `int` cannot
hold the maximum BlockNumber...
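(For context, and assuming the threshold is expressed in heap blocks with
the default 8kB block size: a signed 32-bit int tops out at 2^31 - 1 blocks,
roughly 16TB, while BlockNumber ranges up to 0xFFFFFFFE, roughly 32TB - so
no int value can push the threshold beyond every possible table size.)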

> + lazy_scan_strategy(vacrel);
>   if (verbose)

I'm slightly surprised you didn't update the message for verbose vacuum
to indicate whether we used the eager strategy: there are several GUCs
for tuning this behaviour, so you'd expect to want direct confirmation
that the configuration is effective.
(looks at further patches) I see that the message for verbose vacuum
sees significant changes in patch 2 instead.

---

> [PATCH v14 2/3] Add eager and lazy VM strategies to VACUUM.

General comments:

I don't see anything regarding scan synchronization in the vmsnap scan
system. I understand that VACUUM is a load that is significantly
different from normal SEQSCANs, but are there good reasons to _not_
synchronize the start of VACUUM?

Right now, we don't use syncscan to determine a startpoint. I can't
find the reason why in the available documentation: [0] discusses the
issue, but without clearly explaining why it wouldn't be interesting
from a 'nothing lost' perspective.

In addition, I noticed that progress reporting of blocks scanned
("heap_blocks_scanned", duh) includes skipped pages. Now that we have
a solid grasp of how many blocks we're planning to scan, we can report
that planned total (and the number actually scanned so far), increasing
the user value of that progress view.

[0] 
https://www.postgresql.org/message-id/flat/19398.1212328662%40sss.pgh.pa.us#17b2feb0fde6a41779290632d8c70ef1

> + double tableagefrac;

I think this can use some extra info on the field itself, that it is
the fraction of how "old" the relfrozenxid and relminmxid fields are,
as a fraction between 0 (latest values; nextXID and nextMXID), and 1
(values that are old by at least freeze_table_age and
multixact_freeze_table_age (multi)transaction ids, respectively).
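To make the suggested wording concrete, something along these lines is how
I picture the value being derived (a rough sketch only - the variable names
and the clamping are assumptions, and XID wraparound handling is ignored):

    /* 0.0 when relfrozenxid is at nextXID, 1.0 once it lags by freeze_table_age */
    tableagefrac = (double) (nextXID - relfrozenxid) / freeze_table_age;
    tableagefrac = Max(0.0, Min(1.0, tableagefrac));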


> -#define VACOPT_DISABLE_PAGE_SKIPPING 0x80 /* don't skip any pages */
> +#define VACOPT_DISABLE_PAGE_SKIPPING 0x80 /* don't skip using VM */

I'm not super happy with this change. I don't think we should touch
the VM using snapshots _at all_ when disable_page_skipping is set:

> + * Decide vmsnap scanning strategy.
>   *
> - * This test also enables more frequent relfrozenxid advancement during
> - * non-aggressive VACUUMs.  If the range has any all-visible pages then
> - * skipping makes updating relfrozenxid unsafe, which is a real downside.
> + * First acquire a visibility map snapshot, which determines the number of
> + * pages that each vmsnap scanning strategy is required to scan for us in
> + * passing.

I think we should not take disk-backed vm snapshots when
force_scan_all is set. We need VACUUM to be able to run on very
resource-constrained environments, and this does not do that - it adds
a disk space requirement for the VM snapshot for all but the smallest
relation sizes, which is bad when you realize that we need VACUUM when
we want to clean up things like CLOG.

Additionally, it took me several reads of the code and comments to
understand what the decision-making process for lazy vs eager is, and
why. The comments are interspersed with the code, with no single place
that describes it from a bird's eyes' view. I think something like the
following would be appreciated by other readers of the code:

+ We determine whether we choose the eager or lazy scanning strategy
based on how many extra pages the eager strategy would take over the
lazy strategy, and how "old" the table is (as determined in
tableagefrac):
+ When a table is still "young" (tableagefrac <
TABLEAGEFRAC_MIDPOINT), the eager strategy is accepted if we need to
scan 5% (MAX_PAGES_YOUNG_TABLEAGE) more of the table.
+ As the table gets "older" (tableagefrac between MIDPOINT and
HIGHPOINT), the threshold for eager scanning is relaxed linearly from
this 5% to 70% (MAX_PAGES_OLD_TABLEAGE) of the table being scanned
extra (over what would be scanned by the lazy strategy).
+ Once the tableagefrac passes HIGHPOINT, we stop considering the lazy
strategy, and always eagerly scan the table.
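As a rough illustration of the decision rule described above (purely a
sketch: the 0.5 and 0.9 midpoint/highpoint values are made-up placeholders
since the actual constants aren't shown here, and only the 5% and 70%
figures come from the description):

    static bool
    eager_scan_acceptable(double tableagefrac, double extra_scan_frac)
    {
        const double midpoint = 0.5;    /* stand-in for TABLEAGEFRAC_MIDPOINT */
        const double highpoint = 0.9;   /* stand-in for TABLEAGEFRAC_HIGHPOINT */
        double      threshold;

        if (tableagefrac >= highpoint)
            return true;                /* stop considering the lazy strategy */

        if (tableagefrac < midpoint)
            threshold = 0.05;           /* MAX_PAGES_YOUNG_TABLEAGE */
        else
            threshold = 0.05 + (0.70 - 0.05) *
                (tableagefrac - midpoint) / (highpoint - midpoint);

        /* eager is acceptable while the extra scanned fraction stays under threshold */
        return extra_scan_frac <= threshold;
    }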

> @@ -1885,6 +1902,30 @@ retry:
> tuples_frozen = 0; /* avoid miscounts in instrumentation */
>  }
>
> /*
> + * There should never be dead or deleted tuples when PD_ALL_VISIBLE is
> + * set.  Che

Re: New strategies for freezing, advancing relfrozenxid early

2023-01-02 Thread Jeff Davis
On Mon, 2023-01-02 at 11:45 -0800, Peter Geoghegan wrote:
> What do you think of the wording adjustments in the attached patch?
> It's based on your suggested wording.

Great, thank you.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: New strategies for freezing, advancing relfrozenxid early

2023-01-02 Thread Peter Geoghegan
On Sat, Dec 31, 2022 at 12:45 PM Peter Geoghegan  wrote:
> On Sat, Dec 31, 2022 at 11:46 AM Jeff Davis  wrote:
> > "We have no freeze plans to execute, so there's no cost to following
> > the freeze path. This is important in the case where the page is
> > entirely frozen already, so that the page will be marked as such in the
> > VM."
>
> I'm happy to use your wording instead -- I'll come up with a patch for that.

What do you think of the wording adjustments in the attached patch?
It's based on your suggested wording.

--
Peter Geoghegan


0001-Tweak-page-level-freezing-comments.patch
Description: Binary data


Re: New strategies for freezing, advancing relfrozenxid early

2022-12-31 Thread Peter Geoghegan
On Sat, Dec 31, 2022 at 11:46 AM Jeff Davis  wrote:
> On Fri, 2022-12-30 at 16:58 -0800, Peter Geoghegan wrote:
> > Following the path of freezing a page is *always* valid, by
> > definition. Including when there are zero freeze plans to execute, or
> > even zero tuples to examine in the first place -- we'll at least be
> > able to perform nominal freezing, no matter what.
>
> This is a much clearer description, in my opinion. Do you think this is
> already reflected in the comments (and I missed it)?

I am arguably the person least qualified to answer this question.   :-)

> Perhaps the comment in the "if (tuples_frozen == 0)" branch could be
> something more like:
>
> "We have no freeze plans to execute, so there's no cost to following
> the freeze path. This is important in the case where the page is
> entirely frozen already, so that the page will be marked as such in the
> VM."

I'm happy to use your wording instead -- I'll come up with a patch for that.

In my mind it's just a restatement of what's there already. I assume
that you're right about it being clearer this way.

> Of course, I'm sure there are some nuances that I'm still missing.

I don't think that there is, actually. I now believe that you totally
understand the mechanics involved here. I'm glad that I was able to
ascertain that that's all it was. It's worth going to the trouble of
getting something like this exactly right.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-31 Thread Jeff Davis
On Fri, 2022-12-30 at 16:58 -0800, Peter Geoghegan wrote:
> Following the path of freezing a page is *always* valid, by
> definition. Including when there are zero freeze plans to execute, or
> even zero tuples to examine in the first place -- we'll at least be
> able to perform nominal freezing, no matter what.

This is a much clearer description, in my opinion. Do you think this is
already reflected in the comments (and I missed it)?

Perhaps the comment in the "if (tuples_frozen == 0)" branch could be
something more like:

"We have no freeze plans to execute, so there's no cost to following
the freeze path. This is important in the case where the page is
entirely frozen already, so that the page will be marked as such in the
VM."

I'm not even sure we really want a new concept of "nominal freezing". I
think you are right to just call it a degenerate case where it can be
interpreted as either freezing zero things or not freezing; and the
former is convenient for us because we want to follow that code path.
That would be another good way of writing the comment, in my opinion.

Of course, I'm sure there are some nuances that I'm still missing.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: New strategies for freezing, advancing relfrozenxid early

2022-12-30 Thread Peter Geoghegan
On Fri, Dec 30, 2022 at 1:12 PM Peter Geoghegan  wrote:
> > "Nominal freezing" is happening when there are no freeze plans at all.
> > I get that it's to manage control flow so that the right thing happens
> > later. But I think it should be defined in terms of what state the page
> > is in so that we know that following a given path is valid. Defining
> > "nominal freezing" as a case where there are no freeze plans is just
> > confusing to me.
>
> What would you prefer? The state that the page is in is not something
> that I want to draw much attention to, because it's confusing in a way
> that mostly isn't worth talking about.

I probably should have addressed what you said more directly. Here goes:

Following the path of freezing a page is *always* valid, by
definition. Including when there are zero freeze plans to execute, or
even zero tuples to examine in the first place -- we'll at least be
able to perform nominal freezing, no matter what. OTOH, following the
"no freeze" path is permissible whenever the freeze_required flag
hasn't been set during any call to heap_prepare_freeze_tuple(). It is
never actually mandatory for lazy_scan_prune() to *not* freeze.

It's a bit like how a simple point can be understood as a degenerate
circle of radius 0. It's an abstract definition, which is just a tool
for describing things precisely -- hopefully a useful tool. I welcome
the opportunity to be able to describe things in a way that is clearer
or more useful, in whatever way. But it's not like I haven't already
put in significant effort to this exact question of what "freezing the
page" really means to lazy_scan_prune(). Naming things is hard.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-30 Thread Peter Geoghegan
On Fri, Dec 30, 2022 at 12:43 PM Jeff Davis  wrote:
> I always understood "freezing" to mean that a concrete action was
> taken, and associated WAL generated.

"When I use a word… it means just what I choose it to mean -- neither
more nor less".

I have also always understood freezing that way too. In fact, I still
do understand it that way -- I don't think that it has been undermined
by any of this. I've just invented this esoteric concept of nominal
freezing that can be ignored approximately all the time, to solve one
narrow problem that needed to be solved, that isn't that interesting
anywhere else.

> "Nominal freezing" is happening when there are no freeze plans at all.
> I get that it's to manage control flow so that the right thing happens
> later. But I think it should be defined in terms of what state the page
> is in so that we know that following a given path is valid. Defining
> "nominal freezing" as a case where there are no freeze plans is just
> confusing to me.

What would you prefer? The state that the page is in is not something
that I want to draw much attention to, because it's confusing in a way
that mostly isn't worth talking about. When we do nominal freezing, we
don't necessarily go on to set the page all-frozen. In fact, it's not
particularly likely that that will end up happening!

Bear in mind that the exact definition of "freeze the page" is
somewhat creative, even without bringing nominal freezing into it. It
just has to be in order to support the requirements we have for
MultiXacts (in particular for FRM_NOOP processing). The new concepts
don't quite map directly on to the old ones. At the same time, it
really is very often the case that "freezing the page" will perform
maximally aggressive freezing, in the sense that it does precisely
what a VACUUM FREEZE would do given the same page (in any Postgres
version).

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-30 Thread Jeff Davis
On Mon, 2022-12-26 at 12:53 -0800, Peter Geoghegan wrote:
> * v12 merges together the code for the "freeze the page"
> lazy_scan_prune path with the block that actually calls
> heap_freeze_execute_prepared().
> 
> This should make it clear that pagefrz.freeze_required really does
> mean that freezing is required. Hopefully that addresses Jeff's recent
> concern. It's certainly an improvement, in any case.

Better, thank you.

> * On a related note, comments around the same point in lazy_scan_prune
> as well as comments above the HeapPageFreeze struct now explain a
> concept I decided to call "nominal freezing". This is the case where
> we "freeze a page" without having any freeze plans to execute.
> 
> "nominal freezing" is the new name for a concept I invented many
> months ago, which helps to resolve subtle problems with the way that
> heap_prepare_freeze_tuple is tasked with doing two different things
> for its lazy_scan_prune caller: 1. telling lazy_scan_prune how it
> would freeze each tuple (were it to freeze the page), and 2. helping
> lazy_scan_prune to determine if the page should become all-frozen in
> the VM. The latter is always conditioned on page-level freezing
> actually going ahead, since everything else in
> heap_prepare_freeze_tuple has to work that way.
> 
> We always freeze a page with zero freeze plans (or "nominally freeze"
> the page) in lazy_scan_prune (which is nothing new in itself). We
> thereby avoid breaking heap_prepare_freeze_tuple's working assumption
> that all it needs to focus on what the page will look like after
> freezing executes, while also avoiding senselessly throwing away the
> ability to set a page all-frozen in the VM in lazy_scan_prune when
> it'll cost us nothing extra. That is, by always freezing in the event
> of zero freeze plans, we won't senselessly miss out on setting a page
> all-frozen in cases where we don't actually have to execute any freeze
> plans to make that safe, while the "freeze the page path versus don't
> freeze the page path" dichotomy still works as a high level conceptual
> abstraction.

I always understood "freezing" to mean that a concrete action was
taken, and associated WAL generated.

"Nominal freezing" is happening when there are no freeze plans at all.
I get that it's to manage control flow so that the right thing happens
later. But I think it should be defined in terms of what state the page
is in so that we know that following a given path is valid. Defining
"nominal freezing" as a case where there are no freeze plans is just
confusing to me.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






RE: New strategies for freezing, advancing relfrozenxid early

2022-12-26 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

> Anyway, this assertion is wrong, and simply needs to be removed.
> Thanks for the report

Thanks for modifying it so quickly! I found your commit in the remote repository.
I will watch and report again if there is another issue.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: New strategies for freezing, advancing relfrozenxid early

2022-12-26 Thread Peter Geoghegan
On Mon, Dec 26, 2022 at 10:57 PM Hayato Kuroda (Fujitsu) wrote:
> I guessed that this assertion failure seemed to be caused by the commit 
> 4ce3af[2],
> because the Assert() seemed to be added by the commit.

I agree that the problem is with this assertion, which is on the
master branch (not in recent versions of the patch series itself)
following commit 4ce3af:

else
{
/*
* Freeze plan for tuple "freezes xmax" in the strictest sense:
* it'll leave nothing in xmax (neither an Xid nor a MultiXactId).
*/

Assert(MultiXactIdPrecedes(xid, cutoffs->OldestMxact));
...
}

The problem is that FRM_INVALIDATE_XMAX multi processing can occur
both in Multis from before OldestMxact and Multis >= OldestMxact. The
latter case (the >= case) is far less common, but still quite
possible. Not sure how I missed that.

Anyway, this assertion is wrong, and simply needs to be removed.
Thanks for the report
-- 
Peter Geoghegan




RE: New strategies for freezing, advancing relfrozenxid early

2022-12-26 Thread Hayato Kuroda (Fujitsu)
Dear Peter, Jeff,

While reviewing other patches, I found that cfbot raised an ERROR during
VACUUM FREEZE [1] on a FreeBSD instance.
It seems that the same error has occurred in other threads.

```
2022-12-23 08:50:20.175 UTC [34653][postmaster] LOG:  server process (PID 37171) was terminated by signal 6: Abort trap
2022-12-23 08:50:20.175 UTC [34653][postmaster] DETAIL:  Failed process was running: VACUUM FREEZE tab_freeze;
2022-12-23 08:50:20.175 UTC [34653][postmaster] LOG:  terminating any other active server processes
```

I guessed that this assertion failure seemed to be caused by the commit 
4ce3af[2],
because the Assert() seemed to be added by the commit.

```
[08:51:31.189] #3  0x009b88d7 in ExceptionalCondition 
(conditionName=, fileName=0x2fd9df 
"../src/backend/access/heap/heapam.c", lineNumber=lineNumber@entry=6618) at 
../src/backend/utils/error/assert.c:66
[08:51:31.189] No locals.
[08:51:31.189] #4  0x00564205 in heap_prepare_freeze_tuple 
(tuple=0x8070f0bb0, cutoffs=cutoffs@entry=0x80222e768, frz=0x7fffb2d0, 
totally_frozen=totally_frozen@entry=0x7fffc478, relfrozenxid_out=, relfrozenxid_out@entry=0x7fffc4a8, relminmxid_out=, 
relminmxid_out@entry=0x7fffc474) at ../src/backend/access/heap/heapam.c:6618
```

Sorry for the noise if you already knew about this, or if it is not related to this thread.

[1]: https://cirrus-ci.com/task/4580705867399168
[2]: 
https://github.com/postgres/postgres/commit/4ce3afb82ecfbf64d4f6247e725004e1da30f47c

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: New strategies for freezing, advancing relfrozenxid early

2022-12-21 Thread Peter Geoghegan
On Wed, Dec 21, 2022 at 4:30 PM Jeff Davis  wrote:
> The confusing thing to me is perhaps just the name -- to me,
> "freeze_required" suggests that if it were set to true, it would cause
> freezing to happen. But as far as I can tell, it does not cause
> freezing to happen, it causes some other things to happen that are
> necessary when freezing happens (updating and using the right
> trackers).

freeze_required is about what's required, which tells us nothing about
what will happen when it's not required (could go either way,
depending on how lazy_scan_prune feels about it).

Setting freeze_required=true implies that heap_prepare_freeze_tuple
has stopped doing maintenance of the "no freeze" trackers. When it
sets freeze_required=true, it really *does* force freezing to happen,
in every practical sense. This happens because lazy_scan_prune does
what it's told to do when it's told that freezing is required. Because
of course it does, why wouldn't it?

So...I still don't get what you mean. Why would lazy_scan_prune ever
break its contract with heap_prepare_freeze_tuple? And in what sense
would you say that heap_prepare_freeze_tuple's setting
freeze_required=true doesn't quite amount to "forcing freezing"? Are
you worried about the possibility that lazy_scan_prune will decide to
rebel at some point, and fail to honor its contract with
heap_prepare_freeze_tuple?  :-)

> A minor point, no need to take action here. Perhaps rename the
> variable.

Andres was the one that suggested this name, actually. I initially
just called it "freeze", but I think that Andres had it right.

> I think 0001+0002 are about ready.

Great. I plan on committing 0001 in the next few days. Committing 0002
might take a bit longer.

Thanks
-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-21 Thread Jeff Davis
On Tue, 2022-12-20 at 21:26 -0800, Peter Geoghegan wrote:
> When freeze_required is set to true, that means that lazy_scan_prune
> literally has no choice -- it simply must freeze the page as
> instructed by heap_prepare_freeze_tuple/FreezeMultiXactId. It's not
> just a strong suggestion -- it's crucial that lazy_scan_prune freezes
> the page as instructed.

The confusing thing to me is perhaps just the name -- to me,
"freeze_required" suggests that if it were set to true, it would cause
freezing to happen. But as far as I can tell, it does not cause
freezing to happen, it causes some other things to happen that are
necessary when freezing happens (updating and using the right
trackers).

A minor point, no need to take action here. Perhaps rename the
variable.

I think 0001+0002 are about ready.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: New strategies for freezing, advancing relfrozenxid early

2022-12-20 Thread Peter Geoghegan
On Tue, Dec 20, 2022 at 7:15 PM Peter Geoghegan  wrote:
> On Tue, Dec 20, 2022 at 5:44 PM Jeff Davis  wrote:
> > Next, the 'freeze_required' field suggests that it's more involved in
> > the control flow that causes freezing than it actually is. All it does
> > is communicate how the trackers need to be adjusted. The return value
> > of heap_prepare_freeze_tuple() (and underneath, the flags set by
> > FreezeMultiXactId()) are what actually control what happens. It would
> > be nice to make this more clear somehow.
>
> I'm not sure what you mean. Page-level freezing *doesn't* have to go
> ahead when freeze_required is not ever set to true for any tuple on
> the page (which is most of the time, in practice). lazy_scan_prune
> gets to make a choice about freezing the page, when the choice is
> available.

Oh wait, I think I see the point of confusion now.

When freeze_required is set to true, that means that lazy_scan_prune
literally has no choice -- it simply must freeze the page as
instructed by heap_prepare_freeze_tuple/FreezeMultiXactId. It's not
just a strong suggestion -- it's crucial that lazy_scan_prune freezes
the page as instructed.

The "no freeze" trackers (HeapPageFreeze.NoFreezePageRelfrozenXid and
HeapPageFreeze.NoFreezePageRelminMxid) won't have been maintained
properly when freeze_required was set, so lazy_scan_prune can't expect
to use them -- doing so would lead to VACUUM setting incorrect values
in pg_class later on.
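Put in caller terms, the contract amounts to something like the following
(a simplified sketch of what I understand lazy_scan_prune to do, not the
actual code; the "decided_to_freeze_anyway" flag and the names of the
"freeze side" trackers are assumed here by analogy with the NoFreeze ones):

    if (pagefrz.freeze_required || decided_to_freeze_anyway)
    {
        /* freeze path: only the "freeze" trackers were kept valid */
        vacrel->NewRelfrozenXid = pagefrz.FreezePageRelfrozenXid;
        vacrel->NewRelminMxid = pagefrz.FreezePageRelminMxid;
    }
    else
    {
        /* only safe because freeze_required was never set for this page */
        vacrel->NewRelfrozenXid = pagefrz.NoFreezePageRelfrozenXid;
        vacrel->NewRelminMxid = pagefrz.NoFreezePageRelminMxid;
    }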

Avoiding the work of maintaining those "no freeze" trackers isn't just
a nice-to-have microoptimization -- it is sometimes very important. We
kind of rely on this to be able to avoid getting too many MultiXact
member SLRU buffer misses inside FreezeMultiXactId. There is a comment
above FreezeMultiXactId that advises its caller that it had better not
call heap_tuple_should_freeze when freeze_required is set to true,
because that could easily lead to multixact member SLRU buffer misses
-- misses that FreezeMultiXactId set out to avoid itself.

It could actually be cheaper to freeze than to not freeze, in the case
of a Multi -- member space misses can sometimes be really expensive.
And so FreezeMultiXactId sometimes freezes a Multi even though it's
not strictly required to do so.

Note also that this isn't a new behavior -- it's actually an old one,
for the most part. It kinda doesn't look that way, because we haven't
passed down separate FreezeLimit/OldestXmin cutoffs (and separate
OldestMxact/MultiXactCutoff cutoffs) until now. But we often don't
need that granular information to be able to process Multis before the
multi value is < MultiXactCutoff.

If you look at how FreezeMultiXactId works, in detail, you'll see that
even on Postgres HEAD it can (say) set a tuple's xmax to
InvalidTransactionId long before the multi value is < MultiXactCutoff.
It just needs to detect that the multi is not still running, and
notice that it's HEAP_XMAX_IS_LOCKED_ONLY(). Stuff like that happens
quite a bit. So for the most part "eager processing of Multis as a
special case" is an old behavior, that has only been enhanced a little
bit (the really important, new change in FreezeMultiXactId is how the
FRM_NOOP case works with FreezeLimit, even though OldestXmin is used
nearby -- this is extra confusing because 0002 doesn't change how we
use FreezeLimit -- it actually changes every other use of FreezeLimit
nearby, making it OldestXmin).

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-20 Thread Peter Geoghegan
On Tue, Dec 20, 2022 at 5:44 PM Jeff Davis  wrote:
> Comments on 0002:
>
> Can you explain the following portion of the diff:
>
>
>   - else if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
>   + else if (MultiXactIdPrecedes(multi, cutoffs->OldestMxact))
>
>   ...
>
>   + /* Can't violate the MultiXactCutoff invariant, either */
>   + if (!need_replace)
>   + need_replace = MultiXactIdPrecedes(
>   +multi, cutoffs->MultiXactCutoff);

Don't forget the historic context: before Postgres 15's commit
0b018fab, VACUUM's final relfrozenxid always came from FreezeLimit.
Almost all of this code predates that work. So the general idea that
you can make a "should I freeze or should I ratchet back my
relfrozenxid tracker instead?" trade-off at the level of individual
tuples and pages is still a very new one. Right now it's only applied
within lazy_scan_noprune(), but 0002 leverages the same principles
here.

Before now, these heapam.c freezing routines had cutoffs called
cutoff_xid and cutoff_multi. These had values that actually came from
vacuumlazy.c's FreezeLimit and MultiXactCutoff cutoffs (which was
rather unclear). But cutoff_xid and cutoff_multi were *also* used as
inexact proxies for OldestXmin and OldestMxact (also kind of unclear,
but true). For example, there are some sanity checks in heapam.c that
kind of pretend that cutoff_xid is OldestXmin, even though it usually
isn't the same value (it can be, but only during VACUUM FREEZE, or
when the min freeze age is 0 in some other way).

So 0002 teaches the same heapam.c code about everything -- about all
of the different cutoffs, and about the true requirements of VACUUM
around relfrozenxid advancement. In fact, 0002 makes vacuumlazy.c cede
a lot of control of "XID stuff" to the same heapam.c code, freeing it
up to think about freezing as something that works at the level of
physical pages. This is key to allowing vacuumlazy.c to reason about
freezing at the level of the whole table. It thinks about physical
blocks, leaving logical XIDs up to heapam.c code.

This business that you asked about in FreezeMultiXactId() is needed so
that we can allow vacuumlazy.c to "think in terms of physical pages",
while at the same time avoiding allocating new Multis in VACUUM --
which requires "thinking about individual xmax fields" instead -- a
somewhat conflicting goal. We're really trying to have it both ways
(we get page-level freezing, with a little tuple level freezing on the
side, sufficient to avoid allocating new Multis during VACUUMs in
roughly the same way as we do right now).

In most cases "freezing a page" removes all XIDs < OldestXmin, and all
MXIDs < OldestMxact. It doesn't quite work that way in certain rare
cases involving MultiXacts, though. It is convenient to define "freeze
the page" in a way that gives heapam.c's FreezeMultiXactId() the
leeway to put off the work of processing an individual tuple's xmax,
whenever it happens to be a MultiXactId that would require an
expensive second pass to process aggressively (allocating a new Multi
during VACUUM is especially worth avoiding here).

Our definition of "freeze the page" is a bit creative, at least if
you're used to thinking about it in terms of strict XID-wise cutoffs
like OldestXmin/FreezeLimit. But even if you do think of it in terms
of XIDs, the practical difference is extremely small in practice.

FreezeMultiXactId() effectively makes a decision on how to proceed
with processing at the level of each individual xmax field.  Its no-op
multi processing "freezes" an xmax in the event of a costly-to-process
xmax on a page when (for whatever reason) page-level freezing is
triggered. If, on the other hand, page-level freezing isn't triggered
for the page, then page-level no-op processing takes care of the multi
for us instead. Either way, the remaining Multi will ratchet back
VACUUM's relfrozenxid and/or relminmxid trackers as required, and we
won't need an expensive second pass over the multi (unless we really
have no choice, for example during a VACUUM FREEZE, where
OldestXmin==FreezeLimit).

> Regarding correctness, it seems like the basic structure and invariants
> are the same, and it builds on the changes already in 9e5405993c. Patch
> 0002 seems *mostly* about making choices within the existing framework.
> That gives me more confidence.

You're right that it's the same basic invariants as before, of course.
Turns out that those invariants can be pushed quite far.

Though note that I kind of invented a new invariant (not really, sort
of). Well, it's a postcondition, which is a sort of invariant: any
scanned heap page that can be cleanup locked must never have any
remaining XIDs < FreezeLimit, nor can any MXIDs < MultiXactCutoff
remain. But a cleanup-locked page does *not* need to get rid of all
XIDs < OldestXmin, nor MXIDs < OldestMxact.
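Expressed as a sketch of that postcondition (illustrative only -
oldest_remaining_xid/mxid are hypothetical values standing in for the
oldest XID/MXID still present on the cleanup-locked page after processing):

    /* no remaining XID may precede FreezeLimit ... */
    Assert(!TransactionIdPrecedes(oldest_remaining_xid, FreezeLimit));
    /* ... and no remaining MXID may precede MultiXactCutoff ... */
    Assert(!MultiXactIdPrecedes(oldest_remaining_mxid, MultiXactCutoff));
    /* ... but XIDs in [FreezeLimit, OldestXmin) may legitimately remain */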

This flexibility is mostly useful because it allows lazy_scan_prune to
just decide to not freeze. But, to a much lesser degree, it's useful
bec

Re: New strategies for freezing, advancing relfrozenxid early

2022-12-20 Thread Jeff Davis
On Sun, 2022-12-18 at 14:20 -0800, Peter Geoghegan wrote:
> On Thu, Dec 15, 2022 at 10:53 AM Peter Geoghegan  wrote:
> > I agree that the burden of catch-up freezing is excessive here (in
> > fact I already wrote something to that effect on the wiki page). The
> > likely solution can be simple enough.
> 
> Attached is v10, which fixes this issue, but using a different
> approach to the one I sketched here.

Comments on 0002:

Can you explain the following portion of the diff:


  - else if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
  + else if (MultiXactIdPrecedes(multi, cutoffs->OldestMxact))

  ...

  + /* Can't violate the MultiXactCutoff invariant, either */
  + if (!need_replace)
  + need_replace = MultiXactIdPrecedes(
  +multi, cutoffs->MultiXactCutoff);

Regarding correctness, it seems like the basic structure and invariants
are the same, and it builds on the changes already in 9e5405993c. Patch
0002 seems *mostly* about making choices within the existing framework.
That gives me more confidence.

That being said, it does push harder against the limits on both sides.
If I understand correctly, that means pages with wider distributions of
xids are going to persist longer, which could expose pre-existing bugs
in new and interesting ways.

Next, the 'freeze_required' field suggests that it's more involved in
the control flow that causes freezing than it actually is. All it does
is communicate how the trackers need to be adjusted. The return value
of heap_prepare_freeze_tuple() (and underneath, the flags set by
FreezeMultiXactId()) are what actually control what happens. It would
be nice to make this more clear somehow.

The comment:

  /*  
   * If we freeze xmax, make absolutely sure that it's not an XID that
   * is important.  (Note, a lock-only xmax can be removed independent
   * of committedness, since a committed lock holder has released the 
   * lock).   
   */

caused me to go down a rabbit hole looking for edge cases where we
might want to freeze an xmax but not an xmin; e.g. tup.xmax <
OldestXmin < tup.xmin or the related case where tup.xmax < RecentXmin <
tup.xmin. I didn't find a problem, so that's good news.

I also tried some pgbench activity along with concurrent vacuums (and
vacuum freezes) along with periodic verify_heapam(). No problems there.
 
Did you already describe the testing you've done for 0001+0002
specfiically? It's not radically new logic, but it would be good to try
to catch minor state-handling errors.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: New strategies for freezing, advancing relfrozenxid early

2022-12-20 Thread Nikita Malakhov
Hi!

I'll try to apply this patch onto my branch with Pluggable TOAST to test
these mechanics with the new TOAST, and will reply with the result. It could
be difficult though, because both have a lot of changes that affect
the same code.

>I'm not sure how much this would help with bloat. I suspect that it
>could make a big difference with the right workload. If you always
>need frequent autovacuums, just to deal with bloat, then there is
>never a good time to run an aggressive antiwraparound autovacuum. An
>aggressive AV will probably end up taking much longer than the typical
>autovacuum that deals with bloat. While the aggressive AV will remove
>as much bloat as any other AV, in theory, that might not help much. If
>the aggressive AV takes as long as (say) 5 regular autovacuums would
>have taken, and if you really needed those 5 separate autovacuums to
>run, just to deal with the bloat, then that's a real problem.  The
>aggressive AV effectively causes bloat with such a workload.



On Tue, Dec 20, 2022 at 12:01 PM Jeff Davis  wrote:

> On Sun, 2022-12-18 at 14:20 -0800, Peter Geoghegan wrote:
> > Attached is v10, which fixes this issue, but using a different
> > approach to the one I sketched here.
>
> In 0001, it's fairly straightforward rearrangement and looks like an
> improvement to me. I have a few complaints, but they are about pre-
> existing code that you moved around, and I like that you didn't
> editorialize too much while just moving code around. +1 from me.
>
>
> --
> Jeff Davis
> PostgreSQL Contributor Team - AWS
>
>
>
>
>

-- 
Regards,

--
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/


Re: New strategies for freezing, advancing relfrozenxid early

2022-12-20 Thread Jeff Davis
On Sun, 2022-12-18 at 14:20 -0800, Peter Geoghegan wrote:
> Attached is v10, which fixes this issue, but using a different
> approach to the one I sketched here.

In 0001, it's fairly straightforward rearrangement and looks like an
improvement to me. I have a few complaints, but they are about pre-
existing code that you moved around, and I like that you didn't
editorialize too much while just moving code around. +1 from me.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: New strategies for freezing, advancing relfrozenxid early

2022-12-16 Thread Peter Geoghegan
On Thu, Dec 15, 2022 at 11:59 PM Nikita Malakhov  wrote:
> I've found this discussion very interesting, in view of vacuuming
> TOAST tables is always a problem because these tables tend to
> bloat very quickly with dead data - just to remind, all TOAST-able
> columns of the relation use the same TOAST table which is one
> for the relation, and TOASTed data are not updated - there are
> only insert and delete operations.

I don't think that it would be any different to any other table that
happened to have lots of inserts and deletes, such as the table
described here:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Mixed_inserts_and_deletes

In the real world, a table like this would probably consist of some
completely static data, combined with other data that is constantly
deleted and re-inserted -- probably only a small fraction of the table
at any one time. I would expect such a table to work quite well,
because the static pages would all become frozen (at least after a
while), leaving behind only the tuples that are deleted quickly, most
of the time. VACUUM would have a decent chance of noticing that it
will be cheap to advance relfrozenxid in earlier VACUUM operations, as
bloat is cleaned up -- even a VACUUM that happens long before the
point that autovacuum.c will launch an antiwraparound autovacuum has a
decent chance of it. That's not a new idea, really; the
pgbench_branches example from the Wiki page looks like that already,
and even works on Postgres 15.

Here is the part that's new: the pressure to advance relfrozenxid
grows gradually, as table age grows. If table age is still very young,
then we'll only do it if the number of "extra" scanned pages is < 5%
of rel_pages -- only when the added cost is very low (again, like the
pgbench_branches example, mostly). Once table age gets about halfway
towards the point that antiwraparound autovacuuming is required,
VACUUM then starts caring less about costs. It gradually worries less
about the costs, and more about the need to advance it. Ideally it
will happen before antiwraparound autovacuum is actually required.

I'm not sure how much this would help with bloat. I suspect that it
could make a big difference with the right workload. If you always
need frequent autovacuums, just to deal with bloat, then there is
never a good time to run an aggressive antiwraparound autovacuum. An
aggressive AV will probably end up taking much longer than the typical
autovacuum that deals with bloat. While the aggressive AV will remove
as much bloat as any other AV, in theory, that might not help much. If
the aggressive AV takes as long as (say) 5 regular autovacuums would
have taken, and if you really needed those 5 separate autovacuums to
run, just to deal with the bloat, then that's a real problem.  The
aggressive AV effectively causes bloat with such a workload.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-16 Thread Peter Geoghegan
On Thu, Dec 15, 2022 at 11:48 PM John Naylor wrote:
> Thanks for this. This is the kind of concrete, data-based evidence that I 
> find much more convincing, or at least easy to reason about.

I'm glad to hear that it helped. It's always difficult to judge where
other people are coming from, especially when it's not clear how much
context is shared. Face time would have helped here, too.

> One motivating example mentioned is the append-only table. If we detected 
> that case, which I assume we can because autovacuum_vacuum_insert_* GUCs 
> exist, we could use that information as one way to drive eager freezing 
> independently of size. At least in theory -- it's very possible size will be 
> a necessary part of the decision, but it's less clear that it's as useful as 
> a user-tunable knob.

I am not strongly opposed to that idea, though I have my doubts about
it. I have thought about it already, and it wouldn't be hard to get
the information to vacuumlazy.c (I plan on doing it as part of related
work on antiwraparound autovacuum, in fact [1]). I'm skeptical of the
general idea that autovacuum.c has enough reliable information to give
detailed recommendations as to how vacuumlazy.c should process the
table.

I have pointed out several major flaws with the autovacuum.c dead
tuple accounting in the past [2][3], but I also think that there are
significant problems with the tuples inserted accounting. Basically, I
think that there are effects which are arguably an example of the
inspection paradox [4]. Insert-based autovacuums occur on a timeline
determined by the "inserted since last autovacuum" statistics. These
statistics are (in part) maintained by autovacuum/VACUUM itself. Which
has no specific understanding of how it might end up chasing its own
tail.

Let me be more concrete about what I mean about autovacuum chasing its
own tail. The autovacuum_vacuum_insert_threshold mechanism works by
triggering an autovacuum whenever the number of tuples inserted since
the last autovacuum/VACUUM reaches a certain threshold -- usually some
fixed proportion of pg_class.reltuples. But the
tuples-inserted-since-last-VACUUM counter gets reset at the end of
VACUUM, not at the start. Whereas VACUUM itself processes only the
subset of pages that needed to be vacuumed at the start of the VACUUM.
There is no attempt to compensate for that disparity. This *isn't*
really a measure of "unvacuumed tuples" (you'd need to compensate to
get that).
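For reference, the trigger condition being described works out to roughly
the following (a simplified sketch modelled on the
autovacuum_vacuum_insert_threshold and autovacuum_vacuum_insert_scale_factor
GUCs; not the actual autovacuum.c code):

    static bool
    insert_autovacuum_triggered(double reltuples, double inserts_since_vacuum,
                                int base_threshold, double scale_factor)
    {
        /* threshold is a fixed base plus a proportion of pg_class.reltuples */
        return inserts_since_vacuum > base_threshold + scale_factor * reltuples;
    }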

This "at the start vs at the end" difference won't matter at all with
smaller tables. And even in larger tables we might hope that the
effect would kind of average out. But what about cases where one
particular VACUUM operation takes an unusually long time, out of a
sequence of successive VACUUMs that run against the same table? For
example, the sequence that you see on the Wiki page, when Postgres
HEAD autovacuum does an aggressive VACUUM on one occasion, which takes
dramatically longer [5].

Notice that the sequence in [5] shows that the patch does one more
autovacuum operation in total, compared to HEAD/master. That's a lot
more -- we're talking about VACUUMs that each take 40+ minutes. That
can be explained by the fact that VACUUM (quite naturally) resets the
"tuples inserted since last VACUUM" at the end of that unusually long
running aggressive autovacuum -- just like any other VACUUM would.
That seems very weird to me. If (say) we happened to have a much
higher vacuum_freeze_table_age setting, then we wouldn't have had an
aggressive VACUUM until much later on (or never, because the benchmark
would just end). And the VACUUM that was aggressive would have been a
regular VACUUM instead, and would therefore have completed far sooner,
and would therefore have had a *totally* different cadence, compared
to what we actually saw -- it becomes distorted in a way that outlasts
the aggressive VACUUM.

With a far higher vacuum_freeze_table_age, we might have even managed
to do two regular autovacuums in the same period that it took a single
aggressive VACUUM to run in (that's not too far from what actually
happened with the patch). The *second* regular autovacuum would then
end up resetting the "inserted since last VACUUM" counter to 0 at the
same time as the long running aggressive VACUUM actually did so (same
wall clock time, same time since the start of the benchmark). Notice
that we'll have done much less useful work (on cleaning up bloat and
setting newer pages all-visible) with the "one long aggressive mode
VACUUM" setup/scenario -- we'll be way behind -- but the statistics
will nevertheless look about the same as they do in the "two fast
autovacuums instead of one slow autovacuum" counterfactual scenario.

In short, autovacuum.c fails to appreciate that a lot of stuff about
the table changes when VACUUM runs. Time hasn't stood still -- the
table was modified and extended throughout. So autovacuum.c hasn't
compensated for how VACUUM actually performed, and, in effect, forgets
how far i

Re: New strategies for freezing, advancing relfrozenxid early

2022-12-16 Thread Nikita Malakhov
Hi!

I've found this discussion very interesting, in view of vacuuming
TOAST tables is always a problem because these tables tend to
bloat very quickly with dead data - just to remind, all TOAST-able
columns of the relation use the same TOAST table which is one
for the relation, and TOASTed data are not updated - there are
only insert and delete operations.

Have you tested it with large and constantly used TOAST tables?
How would it work with the current TOAST implementation?

We propose a different approach to the TOAST mechanics [1],
and a new vacuum would be very promising.

Thank you!

[1] https://commitfest.postgresql.org/41/3490/

On Fri, Dec 16, 2022 at 10:48 AM John Naylor wrote:

>
> On Wed, Dec 14, 2022 at 6:07 AM Peter Geoghegan  wrote:
> >
> > At the suggestion of Jeff, I wrote a Wiki page that shows motivating
> > examples for the patch series:
> >
> >
> https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples
> >
> > These are all cases where VACUUM currently doesn't do the right thing
> > around freezing, in a way that is greatly ameliorated by the patch.
> > Perhaps this will help other hackers to understand the motivation
> > behind some of these mechanisms. There are plenty of details that only
> > make sense in the context of a certain kind of table, with certain
> > performance characteristics that the design is sensitive to, and seeks
> > to take advantage of in one way or another.
>
> Thanks for this. This is the kind of concrete, data-based evidence that I
> find much more convincing, or at least easy to reason about. I'd actually
> recommend in the future to open discussion with this kind of analysis --
> even before coding, it's possible to indicate what a design is *intended*
> to achieve. And reviewers can likewise bring up cases of their own in a
> concrete fashion.
>
> On Wed, Dec 14, 2022 at 12:16 AM Peter Geoghegan  wrote:
>
> > At the very least, a given VACUUM operation has to choose its freezing
> > strategy based on how it expects the table will look when it's done
> > vacuuming the table, and how that will impact the next VACUUM against
> > the same table. Without that, then vacuuming an append-only table will
> > fall into a pattern of setting pages all-visible in one vacuum, and
> > then freezing those same pages all-frozen in the very next vacuum
> > because there are too many. Which makes little sense; we're far better
> > off freezing the pages at the earliest opportunity instead.
>
> That makes sense, but I wonder if we can actually be more specific: One
> motivating example mentioned is the append-only table. If we detected that
> case, which I assume we can because autovacuum_vacuum_insert_* GUCs exist,
> we could use that information as one way to drive eager freezing
> independently of size. At least in theory -- it's very possible size will
> be a necessary part of the decision, but it's less clear that it's as
> useful as a user-tunable knob.
>
> If we then ignored the append-only case when evaluating a freezing policy,
> maybe other ideas will fall out. I don't have a well-thought out idea about
> policy or knobs, but it's worth thinking about.
>
> Aside from that, I've only given the patches a brief reading. Having seen
> the VM snapshot in practice (under "Scanned pages, visibility map snapshot"
> in the wiki page), it's neat to see fewer pages being scanned. Prefetching
> not only seems superior to SKIP_PAGES_THRESHOLD, but anticipates
> asynchronous IO. Keeping only one VM snapshot page in memory makes perfect
> sense.
>
> I do have a cosmetic, but broad-reaching, nitpick about terms regarding
> "skipping strategy". That's phrased as a kind of negative -- what we're
> *not* doing. Many times I had to pause and compute in my head what we're
> *doing*, i.e. the "scanning strategy". For example, I wonder if the VM
> strategies would be easier to read as:
>
> VMSNAP_SKIP_ALL_VISIBLE -> VMSNAP_SCAN_LAZY
> VMSNAP_SKIP_ALL_FROZEN -> VMSNAP_SCAN_EAGER
> VMSNAP_SKIP_NONE -> VMSNAP_SCAN_ALL
>
> Notice here they're listed in order of increasing eagerness.
>
> --
> John Naylor
> EDB: http://www.enterprisedb.com
>


-- 
Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/


Re: New strategies for freezing, advancing relfrozenxid early

2022-12-15 Thread John Naylor
On Wed, Dec 14, 2022 at 6:07 AM Peter Geoghegan  wrote:
>
> At the suggestion of Jeff, I wrote a Wiki page that shows motivating
> examples for the patch series:
>
>
https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples
>
> These are all cases where VACUUM currently doesn't do the right thing
> around freezing, in a way that is greatly ameliorated by the patch.
> Perhaps this will help other hackers to understand the motivation
> behind some of these mechanisms. There are plenty of details that only
> make sense in the context of a certain kind of table, with certain
> performance characteristics that the design is sensitive to, and seeks
> to take advantage of in one way or another.

Thanks for this. This is the kind of concrete, data-based evidence that I
find much more convincing, or at least easy to reason about. I'd actually
recommend in the future to open discussion with this kind of analysis --
even before coding, it's possible to indicate what a design is *intended*
to achieve. And reviewers can likewise bring up cases of their own in a
concrete fashion.

On Wed, Dec 14, 2022 at 12:16 AM Peter Geoghegan  wrote:

> At the very least, a given VACUUM operation has to choose its freezing
> strategy based on how it expects the table will look when it's done
> vacuuming the table, and how that will impact the next VACUUM against
> the same table. Without that, then vacuuming an append-only table will
> fall into a pattern of setting pages all-visible in one vacuum, and
> then freezing those same pages all-frozen in the very next vacuum
> because there are too many. Which makes little sense; we're far better
> off freezing the pages at the earliest opportunity instead.

That makes sense, but I wonder if we can actually be more specific: One
motivating example mentioned is the append-only table. If we detected that
case, which I assume we can because autovacuum_vacuum_insert_* GUCs exist,
we could use that information as one way to drive eager freezing
independently of size. At least in theory -- it's very possible size will
be a necessary part of the decision, but it's less clear that it's as
useful as a user-tunable knob.

If we then ignored the append-only case when evaluating a freezing policy,
maybe other ideas will fall out. I don't have a well-thought out idea about
policy or knobs, but it's worth thinking about.

Aside from that, I've only given the patches a brief reading. Having seen
the VM snapshot in practice (under "Scanned pages, visibility map snapshot"
in the wiki page), it's neat to see fewer pages being scanned. Prefetching
not only seems superior to SKIP_PAGES_THRESHOLD, but anticipates
asynchronous IO. Keeping only one VM snapshot page in memory makes perfect
sense.

I do have a cosmetic, but broad-reaching, nitpick about terms regarding
"skipping strategy". That's phrased as a kind of negative -- what we're
*not* doing. Many times I had to pause and compute in my head what we're
*doing*, i.e. the "scanning strategy". For example, I wonder if the VM
strategies would be easier to read as:

VMSNAP_SKIP_ALL_VISIBLE -> VMSNAP_SCAN_LAZY
VMSNAP_SKIP_ALL_FROZEN -> VMSNAP_SCAN_EAGER
VMSNAP_SKIP_NONE -> VMSNAP_SCAN_ALL

Notice here they're listed in order of increasing eagerness.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: New strategies for freezing, advancing relfrozenxid early

2022-12-15 Thread Peter Geoghegan
On Thu, Dec 15, 2022 at 11:11 AM Justin Pryzby  wrote:
> The patches (003 and 005) are missing a word
> should use to decide whether to its eager freezing strategy.

I mangled this during rebasing for v9, which reordered the commits.
Will be fixed in v10.

> On the wiki, missing a word:
> builds on related added

Fixed.

Thanks
-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-15 Thread Justin Pryzby
The patches (003 and 005) are missing a word
should use to decide whether to its eager freezing strategy.

On the wiki, missing a word:
builds on related added




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-15 Thread Peter Geoghegan
On Thu, Dec 15, 2022 at 6:50 AM Matthias van de Meent wrote:
> This first run of (auto)vacuum after the 8GB threshold seems
> to appear as a significant IO event (both in WAL and relation
> read/write traffic) with 50% of the table updated and WAL-logged. I
> think this should be limited to some degree, such as only freeze
> all_visible blocks up to 10% of the table's blocks in eager vacuum, so
> that the load is spread across a larger time frame and more VACUUM
> runs.

I agree that the burden of catch-up freezing is excessive here (in
fact I already wrote something to that effect on the wiki page). The
likely solution can be simple enough.

In v9 of the patch, we switch over to eager freezing when table size
crosses 4GB (since that is the value of the
vacuum_freeze_strategy_threshold GUC). The catch up freezing that you
draw attention to here occurs when table size exceeds 8GB, which is a
separate physical table size threshold that forces eager relfrozenxid
advancement. The second threshold is hard-coded to 2x the first one.

I think that this issue can be addressed by making the second
threshold 4x or even 8x vacuum_freeze_strategy_threshold, not just 2x.
That would mean that we'd have to freeze just as many pages whenever
we did the catch-up freezing -- so no change in the added *absolute*
cost of freezing. But, the *relative* cost would be much lower, simply
because catch-up freezing would take place when the table was much
larger. So it would be a lot less noticeable.
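(For concreteness, assuming the default 8kB block size: a 4GB
vacuum_freeze_strategy_threshold corresponds to 524,288 heap blocks, so the
current hard-coded 2x threshold for forced eager relfrozenxid advancement
sits at 8GB / 1,048,576 blocks; a 4x or 8x multiplier would move that second
threshold to 16GB or 32GB respectively.)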

Note that we might never reach the second table size threshold before
we must advance relfrozenxid, in any case. The catch-up freezing might
actually take place because table age created pressure to advance
relfrozenxid. It's useful to have a purely physical/table-size
threshold like this, especially in bulk loading scenarios. But it's
not like table age doesn't have any influence at all, anymore. The
cost model weighs physical units/costs as well as table age, and in
general the most likely trigger for advancing relfrozenxid is usually
some combination of the two, not any one factor on its own [1].

[1] 
https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Opportunistically_advancing_relfrozenxid_with_bursty.2C_real-world_workloads
-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-15 Thread Matthias van de Meent
On Wed, 14 Dec 2022 at 00:07, Peter Geoghegan  wrote:
>
> On Tue, Dec 13, 2022 at 9:16 AM Peter Geoghegan  wrote:
> > That's not the only thing we care about, though. And to the extent we
> > care about it, we mostly care about the consequences of either
> > freezing or not freezing eagerly. Concentration of unfrozen pages in
> > one particular table is a lot more of a concern than the same number
> > of heap pages being spread out across multiple tables. Those tables
> > can all be independently vacuumed, and come with their own
> > relfrozenxid, that can be advanced independently, and are very likely
> > to be frozen as part of a vacuum that needed to happen anyway.
>
> At the suggestion of Jeff, I wrote a Wiki page that shows motivating
> examples for the patch series:
>
> https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples
>
> These are all cases where VACUUM currently doesn't do the right thing
> around freezing, in a way that is greatly ameliorated by the patch.
> Perhaps this will help other hackers to understand the motivation
> behind some of these mechanisms. There are plenty of details that only
> make sense in the context of a certain kind of table, with certain
> performance characteristics that the design is sensitive to, and seeks
> to take advantage of in one way or another.

In this mentioned wiki page, section "Simple append-only", the
following is written:

> Our "transition from lazy to eager strategies" concludes with an autovacuum 
> that actually advanced relfrozenxid eagerly:
>> automatic vacuum of table "regression.public.pgbench_history": index scans: 0
>> pages: 0 removed, 1078444 remain, 561143 scanned (52.03% of total)
>> [...]
>> frozen: 560841 pages from table (52.00% of total) had 88051825 tuples frozen
>> [...]
>> WAL usage: 1121683 records, 557662 full page images, 4632208091 bytes

I think that this 'transition from lazy to eager' could benefit from a
limit on how many all_visible blocks each autovacuum iteration can
freeze. This first run of (auto)vacuum after the 8GB threshold seems
to appear as a significant IO event (both in WAL and relation
read/write traffic) with 50% of the table updated and WAL-logged. I
think this should be limited to some degree, such as only freeze
all_visible blocks up to 10% of the table's blocks in eager vacuum, so
that the load is spread across a larger time frame and more VACUUM
runs.
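(As a sketch of the kind of cap I have in mind - the names and the exact
accounting are illustrative only:)

    /* freeze at most 10% of the table's blocks per eager VACUUM run */
    BlockNumber max_catchup_freeze = rel_pages / 10;

    if (catchup_frozen_pages >= max_catchup_freeze)
        stop_catchup_freezing = true;   /* leave the rest for later runs */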


Kind regards,

Matthias van de Meent.




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-13 Thread Peter Geoghegan
On Tue, Dec 13, 2022 at 9:16 AM Peter Geoghegan  wrote:
> That's not the only thing we care about, though. And to the extent we
> care about it, we mostly care about the consequences of either
> freezing or not freezing eagerly. Concentration of unfrozen pages in
> one particular table is a lot more of a concern than the same number
> of heap pages being spread out across multiple tables. Those tables
> can all be independently vacuumed, and come with their own
> relfrozenxid, that can be advanced independently, and are very likely
> to be frozen as part of a vacuum that needed to happen anyway.

At the suggestion of Jeff, I wrote a Wiki page that shows motivating
examples for the patch series:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples

These are all cases where VACUUM currently doesn't do the right thing
around freezing, in a way that is greatly ameliorated by the patch.
Perhaps this will help other hackers to understand the motivation
behind some of these mechanisms. There are plenty of details that only
make sense in the context of a certain kind of table, with certain
performance characteristics that the design is sensitive to, and seeks
to take advantage of in one way or another.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-13 Thread Peter Geoghegan
On Tue, Dec 13, 2022 at 12:29 AM John Naylor wrote:
> If the number of unfrozen heap pages is the thing we care about, perhaps 
> that, and not the total size of the table, should be the parameter that 
> drives freezing strategy?

That's not the only thing we care about, though. And to the extent we
care about it, we mostly care about the consequences of either
freezing or not freezing eagerly. Concentration of unfrozen pages in
one particular table is a lot more of a concern than the same number
of heap pages being spread out across multiple tables. Those tables
can all be independently vacuumed, and come with their own
relfrozenxid, that can be advanced independently, and are very likely
to be frozen as part of a vacuum that needed to happen anyway.

Pages become frozen pages because VACUUM freezes those pages. Same
with all-visible pages, which could in principle have been made
all-frozen instead, had VACUUM opted to do it that way back when it
processed the page. So VACUUM is not a passive, neutral observer here.
What happens over time and across multiple VACUUM operations is very
relevant. VACUUM needs to pick up where it left off last time, at
least with larger tables, where the time between VACUUMs is naturally
very high, and where each individual VACUUM has to process a huge
number of individual pages. It's not really practical to take a "wait
and see" approach with big tables.

At the very least, a given VACUUM operation has to choose its freezing
strategy based on how it expects the table will look when it's done
vacuuming the table, and how that will impact the next VACUUM against
the same table. Without that, vacuuming an append-only table will fall
into a pattern of setting pages all-visible in one vacuum, and then
freezing those same pages all-frozen in the very next vacuum because by
then there are too many of them. That makes little sense; we're far better
off freezing the pages at the earliest opportunity instead.

We're going to have to write a WAL record for the visibility map
anyway, so doing everything at the same time has a lot to recommend
it. Even if it turns out to be quite wrong, we may still come out
ahead in terms of absolute volume of WAL written, and especially in
terms of performance stability. To a limited extent we need to reason
about what will happen in the near future. But we also need to reason
about which kinds of mispredictions we cannot afford to make, and
which kinds are okay. Some mistakes hurt a lot more than others.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-13 Thread John Naylor
On Tue, Dec 13, 2022 at 8:00 AM Peter Geoghegan  wrote:
>
> On Mon, Dec 12, 2022 at 3:47 PM Jeff Davis  wrote:
> > But the heuristic also seems off to me. What if you have lots of
> > partitions
> > in an append-only range-partitioned table? That would tend to use the
> > lazy freezing strategy (because each partition is small), but that's
> > not what you want. I understand heuristics aren't perfect, but it feels
> > like we could do something better.
>
> It is at least vastly superior to vacuum_freeze_min_age in cases like
> this. Not that that's hard -- vacuum_freeze_min_age just doesn't ever
> trigger freezing in any autovacuum given a table like pgbench_history
> (barring during aggressive mode), due to how it interacts with the
> visibility map. So we're practically guaranteed to do literally all
> freezing for an append-only table in an aggressive mode VACUUM.
>
> Worst of all, that happens on a timeline that has nothing to do with
> the physical characteristics of the table itself (like the number of
> unfrozen heap pages or something).

If the number of unfrozen heap pages is the thing we care about, perhaps
that, and not the total size of the table, should be the parameter that
drives freezing strategy?

> That said, I agree that the system-level picture of debt (the system
> level view of the number of unfrozen heap pages) is relevant, and that
> it isn't directly considered by the patch. I think that that can be
> treated as work for a future release. In fact, I think that there is a
> great deal that we could teach autovacuum.c about the system level
> view of things -- this is only one.

It seems an easier path to considering system-level debt (as measured by
unfrozen heap pages) would be to start with considering table-level debt
measured the same way.
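
A hypothetical sketch of that idea (the names here are illustrative, not
taken from the patch):

/*
 * Choose the eager freezing strategy based on how many unfrozen heap
 * pages the table has accumulated, rather than on total table size.
 */
static bool
use_eager_freezing_strategy(BlockNumber rel_pages,
                            BlockNumber all_frozen_pages,   /* per the VM */
                            BlockNumber unfrozen_page_threshold)
{
    BlockNumber unfrozen_pages = rel_pages - all_frozen_pages;

    return unfrozen_pages >= unfrozen_page_threshold;
}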

--
John Naylor
EDB: http://www.enterprisedb.com


Re: New strategies for freezing, advancing relfrozenxid early

2022-12-12 Thread Peter Geoghegan
On Mon, Dec 12, 2022 at 3:47 PM Jeff Davis  wrote:
> Freezing is driven by a need to keep the age of the oldest
> transaction ID in a table to less than ~2B; and also the need to
> truncate the clog (and reduce lookups of really old xids). It's fine to
> give a brief explanation about why we can't track very old xids, but
> it's more of an internal detail and not the main point.

I agree that that's the conventional definition. What I am proposing
is that we revise that definition a little. We should start the
discussion of freezing in the user level docs by pointing out that
freezing also plays a role at the level of individual pages. An
all-frozen page is self-contained, now and forever (or until it gets
dirtied again, at least). Even on a standby we will reliably avoid
having to do clog lookups for a page that happens to have all of its
tuples frozen.

I don't want to push back too much here. I just don't think that it
makes terribly much sense for the docs to start the conversation about
freezing by talking about the worst consequences of not freezing for
an extended period of time. That's relevant, and it's probably going
to end up as the aspect of freezing that we spend most time on, but it
still doesn't seem like a useful starting point to me.

To me this seems related to the fallacy that relfrozenxid age is any
kind of indicator about how far behind we are on freezing. I think
that there is value in talking about freezing as a maintenance task
for physical heap pages, and only then talking about relfrozenxid and
the circular XID space. The 64-bit XID patch doesn't get rid of
freezing at all, because it is still needed to break the dependency of
tuples stored in heap pages on the pg_xact, and other SLRUs -- which
suggests that you can talk about freezing and advancing relfrozenxid
as different (though still closely related) concepts.

> * I'm still having a hard time with vacuum_freeze_strategy_threshold.
> Part of it is the name, which doesn't seem to convey the meaning.

I chose the name long ago, and never gave it terribly much thought.
I'm happy to go with whatever name you prefer.

> But the heuristic also seems off to me. What if you have lots of partitions
> in an append-only range-partitioned table? That would tend to use the
> lazy freezing strategy (because each partition is small), but that's
> not what you want. I understand heuristics aren't perfect, but it feels
> like we could do something better.

It is at least vastly superior to vacuum_freeze_min_age in cases like
this. Not that that's hard -- vacuum_freeze_min_age just doesn't ever
trigger freezing in any autovacuum given a table like pgbench_history
(barring during aggressive mode), due to how it interacts with the
visibility map. So we're practically guaranteed to do literally all
freezing for an append-only table in an aggressive mode VACUUM.

Worst of all, that happens on a timeline that has nothing to do with
the physical characteristics of the table itself (like the number of
unfrozen heap pages or something). In fact, it doesn't even have
anything to do with how many distinct XIDs modified that particular
table -- XID age works at the system level.

By working at the heap rel level (which means the partition level if
it's a partitioned table), and by being based on physical units (table
size), vacuum_freeze_strategy_threshold at least manages to limit the
accumulation of unfrozen heap pages in each individual relation. This
is the fundamental unit at which VACUUM operates. So even if you get
very unlucky and accumulate many unfrozen heap pages that happen to be
distributed across many different tables, you can at least vacuum each
table independently, and in parallel. The really big problems all seem
to involve concentration of unfrozen pages in one particular table
(usually the events table, the largest table in the system by a couple
of orders of magnitude).
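
For illustration, a hypothetical sketch of a table-size-based strategy
choice of this kind (not the patch's actual code; the threshold is
assumed here to already be converted into heap blocks):

/*
 * Each heap relation -- each partition, for a partitioned table -- gets
 * its own strategy decision, based on its own physical size.
 */
static bool
table_uses_eager_freezing(BlockNumber rel_pages,
                          BlockNumber freeze_strategy_threshold_blocks)
{
    return rel_pages >= freeze_strategy_threshold_blocks;
}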

That said, I agree that the system-level picture of debt (the system
level view of the number of unfrozen heap pages) is relevant, and that
it isn't directly considered by the patch. I think that that can be
treated as work for a future release. In fact, I think that there is a
great deal that we could teach autovacuum.c about the system level
view of things -- this is only one.

> Also, another purpose of this seems
> to be to achieve v15 behavior (if v16 behavior causes a problem for
> some workload), which seems like a good idea, but perhaps we should
> have a more direct setting for that?

Why, though? I think that it happens to make sense to do both with one
setting. Not because it's better to have 1 setting than 2 (though it
is) -- just because it makes sense here, given these specifics.

> * The comment above lazy_scan_strategy() is phrased in terms of the
> "traditional approach". It would be more clear if you described the
> current strategies and how they're chosen. The pre-16 behavior was as
> lazy as possible, so that's easy enough to describe without referring
> to history.

Re: New strategies for freezing, advancing relfrozenxid early

2022-12-12 Thread Jeff Davis
On Sat, 2022-12-10 at 18:11 -0800, Peter Geoghegan wrote:
> On Tue, Dec 6, 2022 at 1:45 PM Peter Geoghegan  wrote:
> > v9 will also address some of the concerns you raised in your review
> > that weren't covered by v8, especially about the VM snapshotting
> > infrastructure. But also your concerns about the transition from
> > lazy
> > strategies to eager strategies.
> 
> Attached is v9. Highlights:

Comments:

* The documentation shouldn't have a heading like "Managing the 32-bit
Transaction ID address space". We already have a concept of "age"
documented, and I think that's all that's needed in the relevant
section. Freezing is driven by a need to keep the age of the oldest
transaction ID in a table to less than ~2B; and also the need to
truncate the clog (and reduce lookups of really old xids). It's fine to
give a brief explanation about why we can't track very old xids, but
it's more of an internal detail and not the main point.

* I'm still having a hard time with vacuum_freeze_strategy_threshold.
Part of it is the name, which doesn't seem to convey the meaning. But
the heuristic also seems off to me. What if you have lots of partitions
in an append-only range-partitioned table? That would tend to use the
lazy freezing strategy (because each partition is small), but that's
not what you want. I understand heuristics aren't perfect, but it feels
like we could do something better. Also, another purpose of this seems
to be to achieve v15 behavior (if v16 behavior causes a problem for
some workload), which seems like a good idea, but perhaps we should
have a more direct setting for that?

* The comment above lazy_scan_strategy() is phrased in terms of the
"traditional approach". It would be more clear if you described the
current strategies and how they're chosen. The pre-16 behavior was as
lazy as possible, so that's easy enough to describe without referring
to history.

* "eager skipping behavior" seems like a weird phrasing because it's
not immediately clear if that means "skip more pages" (eager to skip
pages and lazy to process them) or "skip fewer pages" (lazy to skip the
pages and eager to process the pages).

* The skipping behavior for all-visible pages is binary: skip them
all, or skip none. That makes sense in the context of relfrozenxid
advancement. But how does that avoid IO spikes? It would seem perfectly
reasonable to me, if relfrozenxid advancement is not a pressing
problem, to process some fraction of the all-visible pages (or perhaps
process enough of them to freeze some fraction). That would ensure that
each VACUUM makes a payment on the deferred costs of freezing. I think
this has already been discussed but it keeps reappearing in my mind, so
maybe we can settle this with a comment (and/or docs)?

* I'm wondering whether vacuum_freeze_min_age makes sense anymore. It
doesn't take effect unless the page is not skipped, which is confusing
from a usability standpoint, and we have better heuristics to decide if
the whole page should be frozen or not anyway (i.e. if an FPI was
already taken then freezing is cheaper).
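
A hypothetical sketch of that FPI heuristic (illustrative names; not the
patch's code) -- compare the backend's full-page-image counter from
before and after pruning the page:

static bool
fpi_already_taken(long wal_fpi_before_prune, long wal_fpi_after_prune)
{
    /* Pruning already emitted an FPI, so freezing now is comparatively cheap */
    return wal_fpi_after_prune > wal_fpi_before_prune;
}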


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: New strategies for freezing, advancing relfrozenxid early

2022-12-06 Thread Peter Geoghegan
On Tue, Dec 6, 2022 at 10:42 AM Andres Freund  wrote:
> The docs don't build:
> https://cirrus-ci.com/task/5456939761532928
> [20:00:58.203] postgres.sgml:52: element link: validity error : IDREF 
> attribute linkend references an unknown ID "vacuum-for-wraparound"

Thanks for pointing this out. FWIW it is a result of Bruce's recent
addition of the transaction processing chapter to the docs.

My intention is to post v9 later in the week, which will fix the doc
build, and a lot more besides that. If you are planning on doing
another round of review, I'd suggest that you hold off until then. v9
will have structural improvements that will likely make it easier to
understand all the steps leading up to removing aggressive mode
completely. It'll be easier to relate each local step/patch to the
bigger picture for VACUUM.

v9 will also address some of the concerns you raised in your review
that weren't covered by v8, especially about the VM snapshotting
infrastructure. But also your concerns about the transition from lazy
strategies to eager strategies. The "catch up freezing" performed by
the first VACUUM operation run against a table that just exceeded the
GUC-controlled table size threshold will have far more limited impact,
because the burden of freezing will be spread out across multiple
VACUUM operations. The big idea behind the patch series is to relieve
users from having to think about a special type of VACUUM that has to
do much more freezing than other VACUUMs that ran against the same
table in the recent past, of course, so it is important to avoid
accidentally allowing any behavior that looks kind of like the ghost
of aggressive VACUUM.

-- 
Peter Geoghegan




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-06 Thread Andres Freund
Hi,

On 2022-11-23 15:06:52 -0800, Peter Geoghegan wrote:
> Attached is v8.

The docs don't build:
https://cirrus-ci.com/task/5456939761532928
[20:00:58.203] postgres.sgml:52: element link: validity error : IDREF attribute 
linkend references an unknown ID "vacuum-for-wraparound"

Greetings,

Andres Freund




Re: New strategies for freezing, advancing relfrozenxid early

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-15 19:02:12 -0800, Peter Geoghegan wrote:
> From 352867c5027fae6194ab1c6480cd326963e201b1 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan 
> Date: Sun, 12 Jun 2022 15:46:08 -0700
> Subject: [PATCH v6 1/6] Add page-level freezing to VACUUM.
> 
> Teach VACUUM to decide on whether or not to trigger freezing at the
> level of whole heap pages, not individual tuple fields.  OldestXmin is
> now treated as the cutoff for freezing eligibility in all cases, while
> FreezeLimit is used to trigger freezing at the level of each page (we
> now freeze all eligible XIDs on a page when freezing is triggered for
> the page).
> 
> This approach decouples the question of _how_ VACUUM could/will freeze a
> given heap page (which of its XIDs are eligible to be frozen) from the
> question of whether it actually makes sense to do so right now.
> 
> Just adding page-level freezing does not change all that much on its
> own: VACUUM will still typically freeze very lazily, since we're only
> forcing freezing of all of a page's eligible tuples when we decide to
> freeze at least one (on the basis of XID age and FreezeLimit).  For now
> VACUUM still freezes everything almost as lazily as it always has.
> Later work will teach VACUUM to apply an alternative eager freezing
> strategy that triggers page-level freezing earlier, based on additional
> criteria.
> ---
>  src/include/access/heapam.h  |  42 +-
>  src/backend/access/heap/heapam.c | 199 +--
>  src/backend/access/heap/vacuumlazy.c |  95 -
>  3 files changed, 222 insertions(+), 114 deletions(-)
> 
> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index ebe723abb..ea709bf1b 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -112,6 +112,38 @@ typedef struct HeapTupleFreeze
>   OffsetNumber offset;
>  } HeapTupleFreeze;
>  
> +/*
> + * State used by VACUUM to track what the oldest extant XID/MXID will become
> + * when determining whether and how to freeze a page's heap tuples via calls to
> + * heap_prepare_freeze_tuple.

Perhaps this could say something like "what the oldest extant XID/MXID
currently is and what it would be if we decide to freeze the page" or such?


> + * The relfrozenxid_out and relminmxid_out fields are the current target
> + * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all

"VACUUM caller's heap rel." could stand to be rephrased.


> + * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
> + * _must_ have values >= the final relfrozenxid/relminmxid values in 
> pg_class.
> + * This includes XIDs that remain as MultiXact members from any tuple's xmax.
> + * Each heap_prepare_freeze_tuple call pushes back relfrozenxid_out and/or
> + * relminmxid_out as needed to avoid unsafe values in rel's authoritative
> + * pg_class tuple.
> + *
> + * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
> + * relminmxid_nofreeze_out must also be maintained for !freeze pages.
> + */

relfrozenxid_nofreeze_out isn't really a "no freeze variant" :)

I think it might be better to just always maintain the nofreeze state.


> +typedef struct HeapPageFreeze
> +{
> + /* Is heap_prepare_freeze_tuple caller required to freeze page? */
> + boolfreeze;

s/freeze/freeze_required/?


> + /* Values used when page is to be frozen based on freeze plans */
> + TransactionId relfrozenxid_out;
> + MultiXactId relminmxid_out;
> +
> + /* Used by caller for '!freeze' pages */
> + TransactionId relfrozenxid_nofreeze_out;
> + MultiXactId relminmxid_nofreeze_out;
> +
> +} HeapPageFreeze;
> +

Given the number of parameters to heap_prepare_freeze_tuple, why don't we
pass more of them in via HeapPageFreeze?
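
Putting the two earlier suggestions together (s/freeze/freeze_required/,
and always maintaining the "no freeze" values), a hypothetical revision
of the struct could look like this -- a sketch only, not the patch's
actual code:

typedef struct HeapPageFreeze
{
    /* Is heap_prepare_freeze_tuple caller required to freeze page? */
    bool            freeze_required;

    /* Values for the rel's pg_class tuple if the page's freeze plans are executed */
    TransactionId   relfrozenxid_out;
    MultiXactId     relminmxid_out;

    /*
     * Values for the rel's pg_class tuple if the page is left unfrozen.
     * Always maintained, not just for pages that won't be frozen.
     */
    TransactionId   relfrozenxid_nofreeze_out;
    MultiXactId     relminmxid_nofreeze_out;
} HeapPageFreeze;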


>  /* 
>   *   function prototypes for heap access method
>   *
> @@ -180,17 +212,17 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
>  extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
>                                        TransactionId relfrozenxid, TransactionId relminmxid,
>                                        TransactionId cutoff_xid, TransactionId cutoff_multi,
> +                                      TransactionId limit_xid, MultiXactId limit_multi,
>                                        HeapTupleFreeze *frz, bool *totally_frozen,
> -                                      TransactionId *relfrozenxid_out,
> -                                      MultiXactId *relminmxid_out);
> +                                      HeapPageFreeze *xtrack);

What does 'xtrack' stand for? Xid Tracking?


>   * VACUUM caller must assemble HeapFreezeTuple entries for 

Re: New strategies for freezing, advancing relfrozenxid early

2022-11-10 Thread Justin Pryzby
Note that this fails under -fsanitize=align

Subject: [PATCH v5 2/6] Teach VACUUM to use visibility map snapshot.

performing post-bootstrap initialization ...
../src/backend/access/heap/visibilitymap.c:482:38: runtime error: load of 
misaligned address 0x5559e1352424 for type 'uint64', which requires 8 byte 
alignment

> *all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
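
One conventional way to avoid a misaligned uint64 load of this kind is to
memcpy the word into an aligned local before operating on it -- a hedged
sketch only (the names follow the quoted snippet; this is not necessarily
the fix the patch adopted):

static uint64
popcount_visible_word(const char *umap_bytes, size_t word_index)
{
    uint64      word;

    /* memcpy avoids dereferencing a possibly misaligned uint64 pointer */
    memcpy(&word, umap_bytes + word_index * sizeof(uint64), sizeof(uint64));

    return pg_popcount64(word & VISIBLE_MASK64);
}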




Re: New strategies for freezing, advancing relfrozenxid early

2022-10-04 Thread Peter Geoghegan
On Tue, Oct 4, 2022 at 7:59 PM Jeff Davis  wrote:
> I am fine with that, but I'd like us all to understand what the
> downsides are.

Although I'm sure that there must be one case that loses measurably,
it's not particularly obvious where to start looking for one. I mean
it's easy to imagine individual pages that we lose on, but a practical
test case where most of the pages are like that reliably is harder to
imagine.

> If I understand correctly:
>
> 1. Eager freezing (meaning to freeze at the same time as setting all-
> visible) causes a modest amount of WAL traffic, hopefully before the
> next checkpoint so we can avoid FPIs. Lazy freezing (meaning set all-
> visible but don't freeze) defers the work, and it might never need to
> be done; but if it does, it can cause spikes at unfortunate times and
> is more likely to generate more FPIs.

Lazy freezing means to freeze every eligible tuple (every XID <
OldestXmin) when one or more XIDs are before FreezeLimit. Eager
freezing means freezing every eligible tuple when the page is about to
be set all-visible, or whenever lazy freezing would trigger freezing.
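
Expressed as a hypothetical page-level trigger (illustrative names; not
the patch's actual code):

static bool
page_freeze_triggered(TransactionId oldest_eligible_xid_on_page,
                      TransactionId FreezeLimit,
                      bool will_set_all_visible,
                      bool eager_strategy)
{
    /* Lazy trigger: at least one eligible XID on the page precedes FreezeLimit */
    bool        lazy_trigger =
        TransactionIdPrecedes(oldest_eligible_xid_on_page, FreezeLimit);

    if (eager_strategy)
        return lazy_trigger || will_set_all_visible;

    return lazy_trigger;
}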

Eager freezing tends to avoid big spikes in larger tables, which is
very important. It can sometimes be cheaper and better in every way
than lazy freezing. Though lazy freezing sometimes retains an
advantage, by altogether avoiding freezing that was never going to be
needed -- typically only in small tables.

Lazy freezing is fairly similar to what we do on HEAD now -- though
it's not identical. It's still "page level freezing". It has lazy
criteria for triggering page freezing.

> 2. You're trying to mitigate the downsides of eager freezing by:
>   a. when freezing a tuple, eagerly freeze other tuples on that page
>   b. optimize WAL freeze records

Sort of.

Both of these techniques apply to eager freezing too, in fact. It's
just that eager freezing is likely to do the bulk of all freezing that
actually goes ahead. It'll disproportionately be helped by these
techniques because it'll do most actual freezing that goes ahead (even
when most VACUUM operations use the lazy freezing strategy, which is
probably the common case -- just because lazy freezing freezes
lazily).

> 3. You're trying to capture the trade-off in #1 by using the table size
> as a proxy. Deferred work is only really a problem for big tables, so
> that's where you use eager freezing.

Right.

> But maybe we can just always use
> eager freezing?:

That doesn't seem like a bad idea, though it might be tricky to put
into practice. It might be possible to totally unite the concept of
all-visible and all-frozen pages in the scope of this work. But there
are surprisingly many tricky details involved. I'm not surprised that
you're suggesting this -- it basically makes sense to me. It's just
the practicalities that I worry about here.

>   a. You're mitigating the WAL work for freezing.

I don't see why this would be true. Lazy vs Eager are exactly the same
for a given page at the point that freezing is triggered. We'll freeze
all eligible tuples (often though not always every tuple), or none at
all.

Lazy vs Eager describe the policy for deciding to freeze a page, but
do not affect the actual execution steps taken once we decide to
freeze.

>   b. A lot of people run with checksums on, meaning that setting the
> all-visible bit requires WAL work anyway, and often FPIs.

The idea of rolling the WAL records into one does seem appealing, but
we'd still need the original WAL record to set a page all-visible in
VACUUM's second heap pass (only setting a page all-visible in the
first heap pass could be optimized by making the FREEZE_PAGE WAL
record mark the page all-visible too). Or maybe we'd roll that into
the VACUUM WAL record at the same time.

In any case the second heap pass would have to have a totally
different WAL logging strategy to the first heap pass. Not
insurmountable, but not exactly an easy thing to do in passing either.

>   c. All-visible is conceptually similar to freezing, but less
> important, and it feels more and more like the design concept of all-
> visible isn't carrying its weight.

Well, not quite -- at least not on the VM side itself.

There are cases where heap_lock_tuple() will update a tuple's xmax,
replacing it with a new Multi. This will necessitate clearly the
page's all-frozen bit in the VM -- but the all-visible bit will stay
set. This is why it's possible for small numbers of all-visible pages
to appear even in large tables that have been eagerly frozen.

>   d. (tangent) I had an old patch[1] that actually removed
> PD_ALL_VISIBLE (the page bit, not the VM bit), which was rejected, but
> perhaps its time has come?

I remember that pgCon developer meeting well.  :-)

If anything your original argument for getting rid of PD_ALL_VISIBLE
is weakened by the proposal to merge together the WAL records for
freezing and for setting a heap page all visible. You'd know for sure
that the page will be dirtied when such a WAL reco
