Re: Eager page freeze criteria clarification
On Thu, Dec 21, 2023 at 3:58 PM Melanie Plageman wrote:
>
> On Wed, Dec 13, 2023 at 12:24 PM Robert Haas wrote:
> > On Sat, Dec 9, 2023 at 5:12 AM Melanie Plageman wrote:
> > > The goal is to keep pages frozen for at least target_freeze_duration. target_freeze_duration is in seconds and pages only have a last modification LSN, so target_freeze_duration must be converted to LSNs. To accomplish this, I've added an LSNTimeline data structure, containing XLogRecPtr, TimestampTz pairs stored with decreasing precision as they age. When we need to translate the guc value to LSNs, we linearly interpolate it on this timeline. For the time being, the global LSNTimeline is in PgStat_WalStats and is only updated by vacuum. There is no reason it can't be updated with some other cadence and/or by some other process (nothing about it is inherently tied to vacuum). The cached translated value of target_freeze_duration is stored in each table's stats. This is arbitrary as it is not a table-level stat. However, it needs to be located somewhere that is accessible on update/delete. We may want to recalculate it more often than once per table vacuum, especially in case of long-running vacuums.
> >
> > This part sounds like it isn't quite baked yet. The idea of the data structure seems fine, but updating it once per vacuum sounds fairly unprincipled to me? Don't we want the updates to happen on a somewhat regular wall clock cadence?
>
> Yes, this part was not fully baked. I actually discussed this with Andres at PGConf EU last week and he suggested that the background writer update the LSNTimeline. He also suggested I propose the LSNTimeline in a new thread. I could add a pageinspect function returning the estimated time of last page modification given the page LSN (so it is proposed with a user).
I've rebased this over top of the revised LSNTimeline functionality proposed separately in [1]. It is also registered for the current commitfest. I plan to add the decay logic and benchmark it this week.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_Z7tR7D1=DR=xwqdefyk_nu_gxgw88x0htxn6ask-8...@mail.gmail.com

From 6fe9f26ebe6bd773e631bcdf0b7ea928a4ca8e27 Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Wed, 27 Dec 2023 16:40:27 -0500
Subject: [PATCH v3 02/12] Add LSNTimeline for converting LSN <-> time

Add a new structure, LSNTimeline, consisting of LSNTimes -- each an
LSN, time pair. Each LSNTime can represent multiple logical LSN, time
pairs, referred to as members. LSN <-> time conversions can be done
using linear interpolation with two LSNTimes on the LSNTimeline.

This commit does not add a global instance of LSNTimeline. It adds the
structures and functions needed to maintain and access such a timeline.
---
 src/backend/utils/activity/pgstat_wal.c | 199
 src/include/pgstat.h                    |  34
 src/tools/pgindent/typedefs.list        |   2 +
 3 files changed, 235 insertions(+)

diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 6a81b781357..ba40aad258a 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -17,8 +17,11 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "utils/pgstat_internal.h"
 #include "executor/instrument.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
 
 PgStat_PendingWalStats PendingWalStats = {0};
 
@@ -32,6 +35,12 @@ PgStat_PendingWalStats PendingWalStats = {0};
 
 static WalUsage prevWalUsage;
 
+static void lsntime_absorb(LSNTime *a, const LSNTime *b);
+void lsntime_insert(LSNTimeline *timeline, TimestampTz time, XLogRecPtr lsn);
+
+XLogRecPtr estimate_lsn_at_time(const LSNTimeline *timeline, TimestampTz time);
+TimestampTz estimate_time_at_lsn(const LSNTimeline *timeline, XLogRecPtr lsn);
+
 /*
  * Calculate how much WAL usage counters have increased and update
  * shared WAL and IO statistics.
@@ -184,3 +193,193 @@ pgstat_wal_snapshot_cb(void)
 		sizeof(pgStatLocal.snapshot.wal));
 	LWLockRelease(&stats_shmem->lock);
 }
+
+/*
+ * Set *a to be the earlier of *a or *b.
+ */
+static void
+lsntime_absorb(LSNTime *a, const LSNTime *b)
+{
+	LSNTime		result;
+	int			new_members = a->members + b->members;
+
+	if (a->time < b->time)
+		result = *a;
+	else if (b->time < a->time)
+		result = *b;
+	else if (a->lsn < b->lsn)
+		result = *a;
+	else if (b->lsn < a->lsn)
+		result = *b;
+	else
+		result = *a;
+
+	*a = result;
+	a->members = new_members;
+}
+
+/*
+ * Insert a new LSNTime into the LSNTimeline in the first element with spare
+ * capacity.
+ */
+void
+lsntime_insert(LSNTimeline *timeline, TimestampTz time, XLogRecPtr lsn)
+{
+	LSNTime		temp;
+	LSNTime		carry = {.lsn = lsn, .time = time, .members = 1};
+
+	for (int i = 0; i < timeline->length; i++)
+	{
+		bool		full;
+		LSNTime    *cur = &timeline->data[i];
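For readers skimming the thread, the LSN <-> time conversion the patch's commit message describes -- linear interpolation between two (LSN, time) pairs on the timeline -- can be pictured with the following standalone sketch. The types and function names here are simplified stand-ins of my own, not the patch's actual API:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the patch's LSNTime: one (LSN, timestamp) pair. */
typedef struct SketchLSNTime
{
	uint64_t	lsn;			/* WAL position, in bytes */
	int64_t		time;			/* timestamp, e.g. in microseconds */
} SketchLSNTime;

/*
 * Estimate the LSN in effect at time t by linear interpolation between two
 * timeline entries.  Assumes a->time <= t <= b->time and a->time < b->time.
 */
uint64_t
sketch_estimate_lsn_at_time(const SketchLSNTime *a, const SketchLSNTime *b,
							int64_t t)
{
	double		frac = (double) (t - a->time) / (double) (b->time - a->time);

	return a->lsn + (uint64_t) (frac * (double) (b->lsn - a->lsn));
}

/*
 * The inverse: estimate the time at which the WAL reached a given LSN.
 */
int64_t
sketch_estimate_time_at_lsn(const SketchLSNTime *a, const SketchLSNTime *b,
							uint64_t lsn)
{
	double		frac = (double) (lsn - a->lsn) / (double) (b->lsn - a->lsn);

	return a->time + (int64_t) (frac * (double) (b->time - a->time));
}
```

Storing entries with decreasing precision as they age (the "members" counter in the patch) only changes which two points bracket the lookup; the interpolation itself stays this simple.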
Re: Eager page freeze criteria clarification
On Thu, Dec 21, 2023 at 10:56 AM Melanie Plageman wrote:
> Agreed. I plan to test with another distribution. Though, the exercise of determining which ones are useful is probably more challenging. I imagine we will have to choose one distribution (as opposed to supporting different distributions and choosing based on data access patterns for a table). Though, even with a normal distribution, I think it should be an improvement.

Our current algorithm isn't adaptive at all, so I like our chances of coming out ahead. It won't surprise me if somebody finds a case where there is a regression, but if we handle some common and important cases correctly (e.g. append-only, update-everything-nonstop) then I think we're probably ahead even if there are some cases where we do worse. It does depend on how much worse they are, and how realistic they are, but we don't want to be too fearful here: we know what we're doing right now isn't too great.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
On Wed, Dec 13, 2023 at 12:24 PM Robert Haas wrote:
>
> Great results.

Thanks!

> On Sat, Dec 9, 2023 at 5:12 AM Melanie Plageman wrote:
> > Values can be "removed" from the accumulator by simply decrementing its cardinality and decreasing the sum and sum squared by a value that will not change the mean and standard deviation of the overall distribution. To adapt to a table's changing access patterns, we'll need to remove values from this accumulator over time, but this patch doesn't yet decide when to do this. A simple solution may be to cap the cardinality of the accumulator to the greater of 1% of the table size, or some fixed number of values (perhaps 200?). Even without such removal of values, the distribution recorded in the accumulator will eventually skew toward more recent data, albeit at a slower rate.
>
> I think we're going to need something here. Otherwise, after 6 months of use, changing a table's perceived access pattern will be quite difficult.
>
> I think one challenge here is to find something that doesn't decay too often and end up with cases where it basically removes all the data.
>
> As long as you avoid that, I suspect that the algorithm might not be terribly sensitive to other kinds of changes. If you decay after 200 values or 2000 or 20,000, it will only affect how fast we can change our notion of the access pattern, and my guess would be that any of those values would produce broadly acceptable results, with some differences in the details. If you decay after 200,000,000 values or not at all, then I think there will be problems.

I'll add the decay logic and devise a benchmark that will exercise it. I can test at least one or two of these ideas.

> > The goal is to keep pages frozen for at least target_freeze_duration. target_freeze_duration is in seconds and pages only have a last modification LSN, so target_freeze_duration must be converted to LSNs. To accomplish this, I've added an LSNTimeline data structure, containing XLogRecPtr, TimestampTz pairs stored with decreasing precision as they age. When we need to translate the guc value to LSNs, we linearly interpolate it on this timeline. For the time being, the global LSNTimeline is in PgStat_WalStats and is only updated by vacuum. There is no reason it can't be updated with some other cadence and/or by some other process (nothing about it is inherently tied to vacuum). The cached translated value of target_freeze_duration is stored in each table's stats. This is arbitrary as it is not a table-level stat. However, it needs to be located somewhere that is accessible on update/delete. We may want to recalculate it more often than once per table vacuum, especially in case of long-running vacuums.
>
> This part sounds like it isn't quite baked yet. The idea of the data structure seems fine, but updating it once per vacuum sounds fairly unprincipled to me? Don't we want the updates to happen on a somewhat regular wall clock cadence?

Yes, this part was not fully baked. I actually discussed this with Andres at PGConf EU last week and he suggested that the background writer update the LSNTimeline. He also suggested I propose the LSNTimeline in a new thread. I could add a pageinspect function returning the estimated time of last page modification given the page LSN (so it is proposed with a user).

- Melanie
Re: Eager page freeze criteria clarification
On 12/21/23 10:56, Melanie Plageman wrote:
> On Sat, Dec 9, 2023 at 9:24 AM Joe Conway wrote:
> > However, even if we assume a more-or-less normal distribution, we should consider using subgroups in a way similar to Statistical Process Control[1]. The reasoning is explained in this quote:
> >
> >     The Math Behind Subgroup Size
> >
> >     The Central Limit Theorem (CLT) plays a pivotal role here. According to CLT, as the subgroup size (n) increases, the distribution of the sample means will approximate a normal distribution, regardless of the shape of the population distribution. Therefore, as your subgroup size increases, your control chart limits will narrow, making the chart more sensitive to special cause variation and more prone to false alarms.
>
> I haven't read anything about statistical process control until you mentioned this. I read the link you sent and also googled around a bit. I was under the impression that the more samples we have, the better. But, it seems like this may not be the assumption in statistical process control? It may help us to get more specific. I'm not sure what the relationship between "unsets" in my code and subgroup members would be. The article you linked suggests that each subgroup should be of size 5 or smaller. Translating that to my code, were you imagining subgroups of "unsets" (each time we modify a page that was previously all-visible)?

Basically, yes. It might not make sense, but I think we could test the theory by plotting a histogram of the raw data, and then also plot a histogram based on sub-grouping every 5 sequential values in your accumulator. If the former does not look very normal (I would guess most workloads it will be skewed with a long tail) and the latter looks to be more normal, then it would say we were on the right track. There are statistical tests for "normalness" that could be applied too (e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6350423/#sec2-13title) which would be a more rigorous approach, but the quick look at histograms might be sufficiently convincing.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
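Joe's subgrouping suggestion could be tried offline on collected "unset" durations with something like the sketch below (my own illustration, not code from the patchset; subgroup size 5 follows the article's suggestion). Each run of five sequential values is collapsed to its mean, as in an X-bar control chart, and the resulting means can then be histogrammed next to the raw values:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Collapse raw values into subgroup means of size `group`, as in an X-bar
 * control chart.  Writes one mean per complete subgroup to `out` (which
 * must have room for n / group entries) and returns how many were written.
 * A trailing incomplete subgroup is dropped.
 */
size_t
sketch_subgroup_means(const double *raw, size_t n, size_t group, double *out)
{
	size_t		n_groups = n / group;

	for (size_t g = 0; g < n_groups; g++)
	{
		double		total = 0.0;

		for (size_t i = 0; i < group; i++)
			total += raw[g * group + i];
		out[g] = total / group;
	}
	return n_groups;
}
```

Per the CLT argument quoted above, the subgroup means should look closer to normal than the raw (likely long-tailed) durations, which is the comparison Joe proposes making with the two histograms.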
Re: Eager page freeze criteria clarification
On Sat, Dec 9, 2023 at 9:24 AM Joe Conway wrote:
>
> On 12/8/23 23:11, Melanie Plageman wrote:
> >
> > I'd be delighted to receive any feedback, ideas, questions, or review.
>
> This is well thought out, well described, and a fantastic improvement in my view -- well done!

Thanks, Joe! That means a lot! I see work done by hackers on the mailing list a lot that makes me think, "hey, that's cool/clever/awesome!" but I don't give that feedback. I appreciate you doing that!

> I do think we will need to consider distributions other than normal, but I don't know offhand what they will be.

Agreed. I plan to test with another distribution. Though, the exercise of determining which ones are useful is probably more challenging. I imagine we will have to choose one distribution (as opposed to supporting different distributions and choosing based on data access patterns for a table). Though, even with a normal distribution, I think it should be an improvement.

> However, even if we assume a more-or-less normal distribution, we should consider using subgroups in a way similar to Statistical Process Control[1]. The reasoning is explained in this quote:
>
>     The Math Behind Subgroup Size
>
>     The Central Limit Theorem (CLT) plays a pivotal role here. According to CLT, as the subgroup size (n) increases, the distribution of the sample means will approximate a normal distribution, regardless of the shape of the population distribution. Therefore, as your subgroup size increases, your control chart limits will narrow, making the chart more sensitive to special cause variation and more prone to false alarms.

I haven't read anything about statistical process control until you mentioned this. I read the link you sent and also googled around a bit. I was under the impression that the more samples we have, the better. But, it seems like this may not be the assumption in statistical process control? It may help us to get more specific. I'm not sure what the relationship between "unsets" in my code and subgroup members would be. The article you linked suggests that each subgroup should be of size 5 or smaller. Translating that to my code, were you imagining subgroups of "unsets" (each time we modify a page that was previously all-visible)?

Thanks for the feedback!

- Melanie
Re: Eager page freeze criteria clarification
Great results.

On Sat, Dec 9, 2023 at 5:12 AM Melanie Plageman wrote:
> Values can be "removed" from the accumulator by simply decrementing its cardinality and decreasing the sum and sum squared by a value that will not change the mean and standard deviation of the overall distribution. To adapt to a table's changing access patterns, we'll need to remove values from this accumulator over time, but this patch doesn't yet decide when to do this. A simple solution may be to cap the cardinality of the accumulator to the greater of 1% of the table size, or some fixed number of values (perhaps 200?). Even without such removal of values, the distribution recorded in the accumulator will eventually skew toward more recent data, albeit at a slower rate.

I think we're going to need something here. Otherwise, after 6 months of use, changing a table's perceived access pattern will be quite difficult.

I think one challenge here is to find something that doesn't decay too often and end up with cases where it basically removes all the data.

As long as you avoid that, I suspect that the algorithm might not be terribly sensitive to other kinds of changes. If you decay after 200 values or 2000 or 20,000, it will only affect how fast we can change our notion of the access pattern, and my guess would be that any of those values would produce broadly acceptable results, with some differences in the details. If you decay after 200,000,000 values or not at all, then I think there will be problems.

> This algorithm is able to predict when pages will be modified before target_freeze_threshold as long as their modification pattern fits a normal distribution -- this accommodates access patterns where, after some period of time, pages become less likely to be modified the older they are. As an example, however, access patterns where modification times are bimodal aren't handled well by this model (imagine that half your modifications are to very new data and the other half are to data that is much older but still younger than target_freeze_duration). If this is a common access pattern, the normal distribution could easily be swapped out for another distribution. The current accumulator is capable of tracking a distribution's first two moments of central tendency (the mean and standard deviation). We could track more if we wanted to use a fancier distribution.

I think it might be fine to ignore this problem. What we're doing right now is way worse.

> We also must consider what to do when we have few unsets, e.g. with an insert-only workload. When there are very few unsets (I chose 30 which the internet says is the approximate minimum number required for the central limit theorem to hold), we can choose to always freeze freezable pages; above this limit, the calculated page threshold is used. However, we may not want to make predictions based on 31 values either. To avoid this "cliff", we could modify the algorithm a bit to skew the mean and standard deviation of the distribution using a confidence interval based on the number of values we've collected.

I think some kind of smooth transition might be a good idea, but I also don't know how much it matters. I think the important thing is that we freeze aggressively unless there's a clear signal telling us not to do that. Absence of evidence should cause us to tend toward aggressive freezing. Otherwise, we get insert-only tables wrong.

> The goal is to keep pages frozen for at least target_freeze_duration. target_freeze_duration is in seconds and pages only have a last modification LSN, so target_freeze_duration must be converted to LSNs. To accomplish this, I've added an LSNTimeline data structure, containing XLogRecPtr, TimestampTz pairs stored with decreasing precision as they age. When we need to translate the guc value to LSNs, we linearly interpolate it on this timeline. For the time being, the global LSNTimeline is in PgStat_WalStats and is only updated by vacuum. There is no reason it can't be updated with some other cadence and/or by some other process (nothing about it is inherently tied to vacuum). The cached translated value of target_freeze_duration is stored in each table's stats. This is arbitrary as it is not a table-level stat. However, it needs to be located somewhere that is accessible on update/delete. We may want to recalculate it more often than once per table vacuum, especially in case of long-running vacuums.

This part sounds like it isn't quite baked yet. The idea of the data structure seems fine, but updating it once per vacuum sounds fairly unprincipled to me? Don't we want the updates to happen on a somewhat regular wall clock cadence?

> To benchmark this new heuristic (I'm calling it algo 6 since it is the 6th I've proposed on this thread), I've used a modified subset of my original workloads:
>
> [ everything worked perfectly ]

Hard to beat that.

--
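For concreteness, the "< 5% likelihood" rule quoted in this exchange corresponds, under the normality assumption, to a one-sided bound roughly 1.645 standard deviations from the mean of the early-unset ages. The sketch below is my reading of that rule, not the patch's implementation; the constant and the direction of the comparison are assumptions:

```c
#include <assert.h>
#include <stdbool.h>

/* One-sided 95th percentile z-score of the standard normal distribution. */
#define SKETCH_Z_95 1.645

/*
 * Given the mean and standard deviation of observed "early unset" page
 * ages, decide whether a page that has already gone `page_age` without
 * modification is past the point where fewer than 5% of early unsets
 * would still be expected to occur -- i.e. it is probably safe to freeze.
 */
bool
sketch_page_is_freeze_candidate(double mean_age, double stddev_age,
								double page_age)
{
	double		threshold = mean_age + SKETCH_Z_95 * stddev_age;

	return page_age >= threshold;
}
```

In the patch's terms, this age threshold would then be translated through the LSNTimeline into a page LSN cutoff, so the check at freeze time is a simple LSN comparison.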
Re: Eager page freeze criteria clarification
On 12/8/23 23:11, Melanie Plageman wrote:
> On Wed, Nov 8, 2023 at 9:23 PM Melanie Plageman wrote:
> > The next step is to devise different heuristics and measure their efficacy. IMO, the goal of the algorithm is to freeze pages in a relation such that we drive early unfreezes/freezes -> 0 and pages frozen/number of pages of a certain age -> 1.
>
> Attached is a patchset with an adaptive freezing algorithm that works well. It keeps track of the pages' unmodified duration after having been set all-visible and uses that to predict how likely a page is to be modified given its age.
>
> Each time an all-visible page is modified by an update, delete, or tuple lock, if the page was all-visible for less than target_freeze_duration (a new guc that specifies the minimum amount of time you would like data to stay frozen), it is counted as an "early unset" and the duration that it was unmodified is added into an accumulator. Before each relation is vacuumed, we calculate the mean and standard deviation of these durations using the accumulator. This is used to calculate a page LSN threshold demarcating pages with a < 5% likelihood of being modified before target_freeze_duration. We don't freeze pages younger than this threshold.
>
> I've done away with the ring buffer of vacuums and the idea of attributing an "unset" to the vacuum that set it all visible. Instead, using an "accumulator", I keep a running sum of the page ages, along with the cardinality of the accumulated set and the sum squared of page ages. This data structure allows us to extract the mean and standard deviation, at any time, from an arbitrary number of values in constant space; and is used to model the pattern of these unsets as a normal distribution that we can use to try and predict whether or not a page will be modified.
>
> Values can be "removed" from the accumulator by simply decrementing its cardinality and decreasing the sum and sum squared by a value that will not change the mean and standard deviation of the overall distribution. To adapt to a table's changing access patterns, we'll need to remove values from this accumulator over time, but this patch doesn't yet decide when to do this. A simple solution may be to cap the cardinality of the accumulator to the greater of 1% of the table size, or some fixed number of values (perhaps 200?). Even without such removal of values, the distribution recorded in the accumulator will eventually skew toward more recent data, albeit at a slower rate.
>
> This algorithm is able to predict when pages will be modified before target_freeze_threshold as long as their modification pattern fits a normal distribution -- this accommodates access patterns where, after some period of time, pages become less likely to be modified the older they are. As an example, however, access patterns where modification times are bimodal aren't handled well by this model (imagine that half your modifications are to very new data and the other half are to data that is much older but still younger than target_freeze_duration). If this is a common access pattern, the normal distribution could easily be swapped out for another distribution. The current accumulator is capable of tracking a distribution's first two moments of central tendency (the mean and standard deviation). We could track more if we wanted to use a fancier distribution.
>
> We also must consider what to do when we have few unsets, e.g. with an insert-only workload. When there are very few unsets (I chose 30 which the internet says is the approximate minimum number required for the central limit theorem to hold), we can choose to always freeze freezable pages; above this limit, the calculated page threshold is used. However, we may not want to make predictions based on 31 values either. To avoid this "cliff", we could modify the algorithm a bit to skew the mean and standard deviation of the distribution using a confidence interval based on the number of values we've collected.
>
> The goal is to keep pages frozen for at least target_freeze_duration. target_freeze_duration is in seconds and pages only have a last modification LSN, so target_freeze_duration must be converted to LSNs. To accomplish this, I've added an LSNTimeline data structure, containing XLogRecPtr, TimestampTz pairs stored with decreasing precision as they age. When we need to translate the guc value to LSNs, we linearly interpolate it on this timeline. For the time being, the global LSNTimeline is in PgStat_WalStats and is only updated by vacuum. There is no reason it can't be updated with some other cadence and/or by some other process (nothing about it is inherently tied to vacuum). The cached translated value of target_freeze_duration is stored in each table's stats. This is arbitrary as it is not a table-level stat. However, it needs to be located somewhere that is accessible on update/delete. We may want to recalculate it more often than once per table vacuum, especially in case of long-running vacuums.
>
> To be
Re: Eager page freeze criteria clarification
On Wed, Nov 8, 2023 at 9:23 PM Melanie Plageman wrote:
> The next step is to devise different heuristics and measure their efficacy. IMO, the goal of the algorithm is to freeze pages in a relation such that we drive early unfreezes/freezes -> 0 and pages frozen/number of pages of a certain age -> 1.

Attached is a patchset with an adaptive freezing algorithm that works well. It keeps track of the pages' unmodified duration after having been set all-visible and uses that to predict how likely a page is to be modified given its age.

Each time an all-visible page is modified by an update, delete, or tuple lock, if the page was all-visible for less than target_freeze_duration (a new guc that specifies the minimum amount of time you would like data to stay frozen), it is counted as an "early unset" and the duration that it was unmodified is added into an accumulator. Before each relation is vacuumed, we calculate the mean and standard deviation of these durations using the accumulator. This is used to calculate a page LSN threshold demarcating pages with a < 5% likelihood of being modified before target_freeze_duration. We don't freeze pages younger than this threshold.

I've done away with the ring buffer of vacuums and the idea of attributing an "unset" to the vacuum that set it all visible. Instead, using an "accumulator", I keep a running sum of the page ages, along with the cardinality of the accumulated set and the sum squared of page ages. This data structure allows us to extract the mean and standard deviation, at any time, from an arbitrary number of values in constant space; and is used to model the pattern of these unsets as a normal distribution that we can use to try and predict whether or not a page will be modified.

Values can be "removed" from the accumulator by simply decrementing its cardinality and decreasing the sum and sum squared by a value that will not change the mean and standard deviation of the overall distribution. To adapt to a table's changing access patterns, we'll need to remove values from this accumulator over time, but this patch doesn't yet decide when to do this. A simple solution may be to cap the cardinality of the accumulator to the greater of 1% of the table size, or some fixed number of values (perhaps 200?). Even without such removal of values, the distribution recorded in the accumulator will eventually skew toward more recent data, albeit at a slower rate.

This algorithm is able to predict when pages will be modified before target_freeze_threshold as long as their modification pattern fits a normal distribution -- this accommodates access patterns where, after some period of time, pages become less likely to be modified the older they are. As an example, however, access patterns where modification times are bimodal aren't handled well by this model (imagine that half your modifications are to very new data and the other half are to data that is much older but still younger than target_freeze_duration). If this is a common access pattern, the normal distribution could easily be swapped out for another distribution. The current accumulator is capable of tracking a distribution's first two moments of central tendency (the mean and standard deviation). We could track more if we wanted to use a fancier distribution.

We also must consider what to do when we have few unsets, e.g. with an insert-only workload. When there are very few unsets (I chose 30 which the internet says is the approximate minimum number required for the central limit theorem to hold), we can choose to always freeze freezable pages; above this limit, the calculated page threshold is used. However, we may not want to make predictions based on 31 values either. To avoid this "cliff", we could modify the algorithm a bit to skew the mean and standard deviation of the distribution using a confidence interval based on the number of values we've collected.

The goal is to keep pages frozen for at least target_freeze_duration. target_freeze_duration is in seconds and pages only have a last modification LSN, so target_freeze_duration must be converted to LSNs. To accomplish this, I've added an LSNTimeline data structure, containing XLogRecPtr, TimestampTz pairs stored with decreasing precision as they age. When we need to translate the guc value to LSNs, we linearly interpolate it on this timeline. For the time being, the global LSNTimeline is in PgStat_WalStats and is only updated by vacuum. There is no reason it can't be updated with some other cadence and/or by some other process (nothing about it is inherently tied to vacuum). The cached translated value of target_freeze_duration is stored in each table's stats. This is arbitrary as it is not a table-level stat. However, it needs to be located somewhere that is accessible on update/delete. We may want to recalculate it more often than once per table vacuum, especially in case of long-running vacuums.

To benchmark this new heuristic (I'm callin
Re: Eager page freeze criteria clarification
On Wed, Oct 11, 2023 at 8:43 PM Andres Freund wrote: > > A rough sketch of a freezing heuristic: > > - We concluded that to intelligently control opportunistic freezing we need > statistics about the number of freezes and unfreezes > > - We should track page freezes / unfreezes in shared memory stats on a > per-relation basis > > - To use such statistics to control heuristics, we need to turn them into > rates. For that we need to keep snapshots of absolute values at certain > times (when vacuuming), allowing us to compute a rate. > > - If we snapshot some stats, we need to limit the amount of data that > occupies > > - evict based on wall clock time (we don't care about unfreezing pages > frozen a month ago) > > - "thin out" data when exceeding limited amount of stats per relation > using random sampling or such > > - need a smarter approach than just keeping N last vacuums, as there are > situations where a table is (auto-) vacuumed at a high frequency > > > - only looking at recent-ish table stats is fine, because we > - a) don't want to look at too old data, as we need to deal with changing >workloads > > - b) if there aren't recent vacuums, falsely freezing is of bounded cost > > - shared memory stats being lost on crash-restart/failover might be a > problem > > - we certainly don't want to immediate store these stats in a table, due > to the xid consumption that'd imply > > > - Attributing "unfreezes" to specific vacuums would be powerful: > > - "Number of pages frozen during vacuum" and "Number of pages unfrozen that > were frozen during the same vacuum" provides numerator / denominator for > an "error rate" > > - We can perform this attribution by comparing the page LSN with recorded > start/end LSNs of recent vacuums > > - If the freezing error rate of recent vacuums is low, freeze more > aggressively. This is important to deal with insert mostly workloads. 
> > - If old data is "unfrozen", that's fine, we can ignore such unfreezes when > controlling "freezing aggressiveness" > > - Ignoring unfreezing of old pages is important to e.g. deal with > workloads that delete old data > > - This approach could provide "goals" for opportunistic freezing in a > somewhat understandable way. E.g. aiming to rarely unfreeze data that has > been frozen within 1h/1d/... I have taken a stab at implementing the freeze stats tracking infrastructure we would need to drive a heuristic based on error rate. Attached is a series of patches which adds a ring buffer of vacuum freeze statistics to each table's stats (PgStat_StatTabEntry). It also introduces a guc: target_page_freeze_duration. This is the time in seconds which we would like pages to stay frozen for. The aim is to adjust opportunistic freeze behavior such that pages stay frozen for at least target_page_freeze_duration seconds. When a table is vacuumed the next entry is initialized in the ring buffer (PgStat_StatTabEntry->frz_buckets[frz_current]). If all of the buckets are in use, the two oldest buckets both ending either before or after now - target_page_freeze_duration are combined. When vacuum freezes a page, we increment the freeze counter in the current PgStat_Frz entry in the ring buffer and update the counters used to calculate the max, min, and average page age. When a frozen page is modified, we increment "unfreezes" in the bucket spanning the page freeze LSN. We also update the counters used to calculate the min, max, and average frozen duration. Because we have both the LSNs and time elapsed during the vacuum which froze the page, we can derive the approximate time at which the page was frozen (using linear interpolation) and use that to speculate how long it remained frozen. If we are unfreezing it sooner than target_page_freeze_duration, this is counted as an "early unfreeze". Using early unfreezes and page freezes, we can calculate an error rate. 
The GUC, opp_freeze_algo, remains to allow us to test different freeze heuristics during development. I included a dummy algorithm (algo 2) to demonstrate what we could do with the data. If the error rate is above a hard-coded threshold and the page is older than the average page age, it is frozen. This is missing an "aggressiveness" knob.

There are two workloads I think we can focus on. The first is a variant of our former workload I. The second workload, J, is the pgbench built-in simple-update workload.

I2. Work queue
  WL 1: 32 clients inserting a single row, updating an indexed column in another row twice, then deleting the last updated row

  pgbench scale 100
  WL 2: 2 clients running the TPC-B-like built-in workload, rate-limited to 10,000 TPS

J. simple-update
  pgbench scale 450
  WL 1: 16 clients running the built-in simple-update workload

The simple-update workload works well because it only updates pgbench_accounts and inserts into pgbench_
Re: Eager page freeze criteria clarification
On Thu, Oct 12, 2023 at 11:50 AM Melanie Plageman wrote:
> I was under the impression that we decided we still had to consider the number of clean pages dirtied as well as the number of pages unfrozen. The number of pages frozen and unfrozen over a time period gives us some idea of whether we are freezing the wrong pages -- but it doesn't tell us if we are freezing the right pages. A riff on an earlier example by Robert:
>
> While vacuuming a relation, we freeze 100 pages. During the same time period, we modify 1,000,000 previously clean pages. Of these 1,000,000 pages modified, 90 were frozen. So we unfroze 90% of the pages frozen during this time. Does this mean we should back off of trying to freeze any pages in the relation?

I didn't think we decided the thing your first sentence says you thought we decided ... but I also didn't think of this example. That said, it might be fine to back off freezing in this case because we weren't doing enough of it to make any real difference in the first place. Maybe there's a more moderate example where it feels like a bigger problem?

> > - "Number of pages frozen during vacuum" and "Number of pages unfrozen that were frozen during the same vacuum" provides numerator / denominator for an "error rate"
> >
> > - We can perform this attribution by comparing the page LSN with recorded start/end LSNs of recent vacuums
>
> While implementing a rough sketch of this, I realized I had a question about this.
>
> vacuum 1 starts at lsn 10 and ends at lsn 200. It froze 100 pages. vacuum 2 then starts at lsn 600. 5 frozen pages with page lsn > 10 and < 200 were updated. We count those in vacuum 1's stats. 3 frozen pages with page lsn > 200 and < 600 were updated. Do we count those somewhere?

How did those pages get frozen when no VACUUM was running? The LSN of the frozen page just prior to unfreezing it should be from the operation that froze it, which should be some VACUUM. 
I think the case you're talking about could happen if we did on-access freezing, but today I believe we don't. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
On Wed, Oct 11, 2023 at 8:43 PM Andres Freund wrote:
> Robert, Melanie and I spent an evening discussing this topic around pgconf.nyc. Here are, mildly revised, notes from that:

Thanks for taking notes!

> The main thing we are worried about is repeated freezing / unfreezing of pages within a relatively short time period.
>
> - Computing an average "modification distance" as I (Andres) proposed for each page is complicated / "fuzzy"
>
>   The main problem is that it's not clear how to come up with a good number for workloads that have many more inserts into new pages than modifications of existing pages.
>
>   It's also hard to use an average for this kind of thing, e.g. in cases where new pages are frequently updated, but also some old data is updated, it's easy for the updates to the old data to completely skew the average, even though that shouldn't prevent us from freezing.
>
> - We also discussed an idea by Robert to track the number of times we need to dirty a page when unfreezing and to compare that to the number of pages dirtied overall (IIRC), but I don't think we really came to a conclusion around that - and I didn't write down anything so this is purely from memory.

I was under the impression that we decided we still had to consider the number of clean pages dirtied as well as the number of pages unfrozen. The number of pages frozen and unfrozen over a time period gives us some idea of whether we are freezing the wrong pages -- but it doesn't tell us if we are freezing the right pages. A riff on an earlier example by Robert:

While vacuuming a relation, we freeze 100 pages. During the same time period, we modify 1,000,000 previously clean pages. Of these 1,000,000 pages modified, 90 were frozen. So we unfroze 90% of the pages frozen during this time. Does this mean we should back off of trying to freeze any pages in the relation?

> A rough sketch of a freezing heuristic: ... 
> - Attributing "unfreezes" to specific vacuums would be powerful: > > - "Number of pages frozen during vacuum" and "Number of pages unfrozen that > were frozen during the same vacuum" provides numerator / denominator for > an "error rate" > > - We can perform this attribution by comparing the page LSN with recorded > start/end LSNs of recent vacuums While implementing a rough sketch of this, I realized I had a question about this. vacuum 1 starts at lsn 10 and ends at lsn 200. It froze 100 pages. vacuum 2 then starts at lsn 600. 5 frozen pages with page lsn > 10 and < 200 were updated. We count those in vacuum 1's stats. 3 frozen pages with page lsn > 200 and < 600 were updated. Do we count those somewhere? > - This approach could provide "goals" for opportunistic freezing in a > somewhat understandable way. E.g. aiming to rarely unfreeze data that has > been frozen within 1h/1d/... Similar to the above question, if we are tracking pages frozen and unfrozen during a time period, if there are many vacuums in quick succession, we might care if a page was frozen by one vacuum and then unfrozen during a subsequent vacuum if not too much time has passed. - Melanie
Re: Eager page freeze criteria clarification
Thanks for these notes.

On Wed, Oct 11, 2023 at 8:43 PM Andres Freund wrote:
> - We also discussed an idea by Robert to track the number of times we need to dirty a page when unfreezing and to compare that to the number of pages dirtied overall (IIRC), but I don't think we really came to a conclusion around that - and I didn't write down anything so this is purely from memory.

See http://postgr.es/m/ca+tgmoycwisxlbl-pxu13oevthloxm20ojqjnrztkkhxsy9...@mail.gmail.com for my on-list discussion of this.

> - Attributing "unfreezes" to specific vacuums would be powerful:
>
>   - "Number of pages frozen during vacuum" and "Number of pages unfrozen that were frozen during the same vacuum" provides numerator / denominator for an "error rate"

I want to highlight the denominator issue here. I think we all have the intuition that if we count the number of times that a recently-frozen page gets unfrozen, and that's a big number, that's bad, and a sign that we need to freeze less aggressively. But a lot of the struggle has been around answering the question "big compared to what". A lot of the obvious candidates fail to behave nicely in corner cases, as discussed in the above email. I think this is one of the better candidates so far proposed, possibly the best.

> - This approach could provide "goals" for opportunistic freezing in a somewhat understandable way. E.g. aiming to rarely unfreeze data that has been frozen within 1h/1d/...

This strikes me as another important point. Making the behavior understandable to users is going to be important, because sometimes whatever system we might craft will misbehave, and then people are going to need to be able to understand why it's misbehaving and how to tune/fix it so it works.

> Around this point my laptop unfortunately ran out of battery. Possibly the attendees of this mini summit also ran out of steam (and tea).

When the tea is gone, there's little point in continuing. 
> I likely mangled this substantially, both when taking notes during the lively > discussion, and when revising them to make them a bit more readable. I think it's quite a good summary, actually. Thanks! -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
Hi,

Robert, Melanie and I spent an evening discussing this topic around pgconf.nyc. Here are, mildly revised, notes from that:

First a few random points that didn't fit with the sketch of an approach below:

- Are unlogged tables a problem for using LSN based heuristics for freezing?

  We concluded, no, not a problem, because aggressively freezing does not increase overhead meaningfully, as we would already dirty both the heap and VM page to set the all-visible flag.

- "Unfreezing" pages that were frozen hours / days ago isn't too bad and can be desirable. The main thing we are worried about is repeated freezing / unfreezing of pages within a relatively short time period.

- Computing an average "modification distance" as I (Andres) proposed for each page is complicated / "fuzzy"

  The main problem is that it's not clear how to come up with a good number for workloads that have many more inserts into new pages than modifications of existing pages.

  It's also hard to use an average for this kind of thing, e.g. in cases where new pages are frequently updated, but also some old data is updated, it's easy for the updates to the old data to completely skew the average, even though that shouldn't prevent us from freezing.

- We also discussed an idea by Robert to track the number of times we need to dirty a page when unfreezing and to compare that to the number of pages dirtied overall (IIRC), but I don't think we really came to a conclusion around that - and I didn't write down anything so this is purely from memory.

A rough sketch of a freezing heuristic:

- We concluded that to intelligently control opportunistic freezing we need statistics about the number of freezes and unfreezes

- We should track page freezes / unfreezes in shared memory stats on a per-relation basis

- To use such statistics to control heuristics, we need to turn them into rates. For that we need to keep snapshots of absolute values at certain times (when vacuuming), allowing us to compute a rate.
- If we snapshot some stats, we need to limit the amount of data that they occupy

  - evict based on wall clock time (we don't care about unfreezing pages frozen a month ago)

  - "thin out" data when exceeding the limited amount of stats per relation, using random sampling or such

  - need a smarter approach than just keeping the N last vacuums, as there are situations where a table is (auto-) vacuumed at a high frequency

- only looking at recent-ish table stats is fine, because we

  - a) don't want to look at too old data, as we need to deal with changing workloads

  - b) if there aren't recent vacuums, falsely freezing is of bounded cost

- shared memory stats being lost on crash-restart/failover might be a problem

  - we certainly don't want to immediately store these stats in a table, due to the xid consumption that'd imply

- Attributing "unfreezes" to specific vacuums would be powerful:

  - "Number of pages frozen during vacuum" and "Number of pages unfrozen that were frozen during the same vacuum" provides a numerator / denominator for an "error rate"

  - We can perform this attribution by comparing the page LSN with recorded start/end LSNs of recent vacuums

  - If the freezing error rate of recent vacuums is low, freeze more aggressively. This is important to deal with insert-mostly workloads.

- If old data is "unfrozen", that's fine, we can ignore such unfreezes when controlling "freezing aggressiveness"

  - Ignoring unfreezing of old pages is important to e.g. deal with workloads that delete old data

- This approach could provide "goals" for opportunistic freezing in a somewhat understandable way. E.g. aiming to rarely unfreeze data that has been frozen within 1h/1d/...

Around this point my laptop unfortunately ran out of battery. Possibly the attendees of this mini summit also ran out of steam (and tea).

We had a few "disagreements" or "unresolved issues":

- How aggressive should we be when we have no stats?
- Should the freezing heuristic take into account whether freezing would require an FPI? Or whether the page was not in s_b, or ...

I likely mangled this substantially, both when taking notes during the lively discussion, and when revising them to make them a bit more readable.

Greetings,

Andres Freund
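As a rough illustration of the numerator/denominator idea in the notes above, here is a sketch (thresholds and names are invented; nothing here is from an actual patch) of turning per-vacuum freeze/unfreeze counts into an error rate that gates freezing aggressiveness. Per the notes, unfreezes of long-frozen pages are assumed to have been filtered out by the caller, so only "early" unfreezes count against us.

```python
def freeze_error_rate(pages_frozen, early_unfreezes):
    """Fraction of frozen pages that were unfrozen "too soon".  With no
    recent freezes there is nothing to count against, so return 0."""
    if pages_frozen == 0:
        return 0.0
    return early_unfreezes / pages_frozen


def freeze_more_aggressively(recent_vacuums, threshold=0.1):
    """recent_vacuums: list of (pages_frozen, early_unfreezes) pairs, one
    per recent vacuum.  Freeze more aggressively while the aggregate error
    rate stays below an (arbitrary, illustrative) threshold."""
    total_frozen = sum(f for f, _ in recent_vacuums)
    total_early = sum(e for _, e in recent_vacuums)
    return freeze_error_rate(total_frozen, total_early) < threshold


# 1000 pages frozen across two recent vacuums, 50 unfrozen early: a 5%
# error rate, so keep freezing eagerly.
freeze_more_aggressively([(600, 20), (400, 30)])
```

Note this deliberately makes the no-stats case (zero freezes) look like a zero error rate, i.e. maximally aggressive, which is exactly the unresolved "how aggressive should we be when we have no stats?" question.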
Re: Eager page freeze criteria clarification
On Mon, Oct 2, 2023 at 4:25 PM Robert Haas wrote: > On Mon, Oct 2, 2023 at 11:37 AM Peter Geoghegan wrote: > > If no vacuuming against pgbench_accounts is strictly better than some > > vacuuming (unless it's just to advance relfrozenxid, which can't be > > avoided), then to what extent is Melanie's freezing patch obligated to > > not make the situation worse? I'm not saying that it doesn't matter at > > all; just that there needs to be a sense of proportion. > > I agree. I think Melanie's previous test results showed no measurable > performance regression but a significant regression in WAL volume. I > don't remember how much of a regression it was, but it was enough to > make me think that the extra WAL volume could probably be turned into > a performance loss by adjusting the test scenario (a bit less > hardware, or a bandwidth-constrained standby connection, etc.). I > think I'd be OK to accept a small amount of additional WAL, though. Do > you see it differently? That's pretty much how I see it too. > I wonder how much useless work we're doing in this area today. I mean, > most pruning in that workload gets done by HOT rather than by vacuum, > because vacuum simply can't keep up, but I don't think it's worthless > work: if it wasn't done in the background by VACUUM, it would happen > in the foreground anyway very soon after. Yeah, but it probably wouldn't be as efficient, since only VACUUM uses a somewhat older OldestXmin/prune cutoff. As much as anything else, I'm arguing that we should treat the general problem of useless work as relevant in the context of the performance validation work/benchmarks. > I don't have a good feeling > for how much index cleanup we end up doing in a pgbench workload, but > it seems to me that if we don't do index cleanup at least now and > then, we'll lose the ability to recycle line pointers in the wake of > non-HOT updates, and that seems bad. 
Maybe that's rare in pgbench > specifically, or maybe it isn't, but it seems like you'd only have to > change the workload a tiny bit for that to become a real problem. There's no question that pruning that's required to go ahead in order for index cleanup to take place isn't really ever something that we'd want to skip. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Mon, Oct 2, 2023 at 11:37 AM Peter Geoghegan wrote: > If no vacuuming against pgbench_accounts is strictly better than some > vacuuming (unless it's just to advance relfrozenxid, which can't be > avoided), then to what extent is Melanie's freezing patch obligated to > not make the situation worse? I'm not saying that it doesn't matter at > all; just that there needs to be a sense of proportion. I agree. I think Melanie's previous test results showed no measurable performance regression but a significant regression in WAL volume. I don't remember how much of a regression it was, but it was enough to make me think that the extra WAL volume could probably be turned into a performance loss by adjusting the test scenario (a bit less hardware, or a bandwidth-constrained standby connection, etc.). I think I'd be OK to accept a small amount of additional WAL, though. Do you see it differently? > Whatever your priorities might be, workload-wise, it could still be > useful to recognize useless freezing as part of a broader problem of > useless (or at least dubious) work performed by VACUUM -- especially > with pgbench-like workloads. The utility of freezing is a lot easier > to measure than the utility of pruning, but why should we assume that > pruning isn't already just as much of a problem? (Maybe that's not a > problem that particularly interests you right now; I'm bringing it up > because it seems possible that putting it in scope could somehow > clarify what to do about freezing.) I wonder how much useless work we're doing in this area today. I mean, most pruning in that workload gets done by HOT rather than by vacuum, because vacuum simply can't keep up, but I don't think it's worthless work: if it wasn't done in the background by VACUUM, it would happen in the foreground anyway very soon after. 
I don't have a good feeling for how much index cleanup we end up doing in a pgbench workload, but it seems to me that if we don't do index cleanup at least now and then, we'll lose the ability to recycle line pointers in the wake of non-HOT updates, and that seems bad. Maybe that's rare in pgbench specifically, or maybe it isn't, but it seems like you'd only have to change the workload a tiny bit for that to become a real problem. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
On Mon, Oct 2, 2023 at 6:26 AM Robert Haas wrote: > I think it's true that the fact that pgbench does what pgbench does > makes us think more about that workload than about some other, equally > plausible workload. It's the test we have, so we end up running it a > lot. If we had some other test, we'd run that one. I find pgbench very useful as a stress-test. It usually represents the most extreme possible adversarial case, which puts certain patches in context, helping to paint a complete picture. > As for non-uniform access, that is real and does matter, but there are > certainly installations with tables where no rows survive long enough > to need freezing, either because the table is regularly emptied, or > just because the update load is high enough to hit all the rows fairly > quickly. That's true, but I think that they're all queue-like tables. Not large tables where there just isn't any chance of any row/page not being updating after a fairly short period of time. > Maybe I'm misunderstanding your point here, in which case all of the > above may be irrelevant. But my feeling is that we can't simply ignore > cases where all/many rows are short-lived and say, well, those are > unrealistic, so let's just freeze everything super-aggressively and > that should be fine. I don't think that's OK. We can (and I think we > should) treat that situation as a special case rather than as the > typical case, but I think it would be a bad idea to dismiss it as a > case that we don't need to worry about at all. I think that my point about pgbench being generally unrepresentative might have overshadowed the more important point about the VACUUM requirements of pgbench_accounts in particular. Those are particularly unrealistic. If no vacuuming against pgbench_accounts is strictly better than some vacuuming (unless it's just to advance relfrozenxid, which can't be avoided), then to what extent is Melanie's freezing patch obligated to not make the situation worse? 
I'm not saying that it doesn't matter at all; just that there needs to be a sense of proportion. If even pruning (by VACUUM) is kinda, sorta, useless, then it seems like we might need to revisit basic assumptions. (Or maybe the problem itself can just be avoided for the most part; whatever works.) I have a patch that provides a simple way of tracking whether each PRUNE record was generated by VACUUM or by opportunistic pruning (also something Melanie is involved with): https://commitfest.postgresql.org/44/4288/ Although it's really hard (maybe impossible) to track right now, I've noticed that the vast majority of all PRUNE records are from opportunistic vacuuming with just about every write-heavy benchmark, including pgbench and TPC-C. And that's just going on the raw count of each type of record, not the utility provided by each pruning operation. Admittedly "utility" is a very hard thing to measure here, but it might still matter. VACUUM might be doing some pruning, but only pruning targeting pages that never particularly "need to be pruned" by VACUUM itself (since pruning could happen whenever via opportunistic pruning anyway). Whatever your priorities might be, workload-wise, it could still be useful to recognize useless freezing as part of a broader problem of useless (or at least dubious) work performed by VACUUM -- especially with pgbench-like workloads. The utility of freezing is a lot easier to measure than the utility of pruning, but why should we assume that pruning isn't already just as much of a problem? (Maybe that's not a problem that particularly interests you right now; I'm bringing it up because it seems possible that putting it in scope could somehow clarify what to do about freezing.) -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Fri, Sep 29, 2023 at 8:50 PM Peter Geoghegan wrote: > While pgbench makes a fine stress-test, for the most part its workload > is highly unrealistic. And yet we seem to think that it's just about > the most important benchmark of all. If we're not willing to get over > even small regressions in pgbench, I fear we'll never make significant > progress in this area. It's at least partly a cultural problem IMV. I think it's true that the fact that pgbench does what pgbench does makes us think more about that workload than about some other, equally plausible workload. It's the test we have, so we end up running it a lot. If we had some other test, we'd run that one. But I don't think I buy that it's "highly unrealistic." It is true that pgbench with default options is limited only by the speed of the database, while real update-heavy workloads are typically limited by something external, like the number of users hitting the web site that the database is backing. It's also true that real workloads tend to involve some level of non-uniform access. But I'm not sure that either of those things really matter that much in the end. The problem I have with the external rate-limiting argument is that it ignores hardware selection, architectural decision-making, and workload growth. Sure, people are unlikely to stand up a database that can do 10,000 transactions per second and hit it with a workload that requires doing 20,000 transactions per second, because they're going to find out in testing that it doesn't work. Then they will either buy more hardware or rearchitect the system to reduce the required number of transactions per second or give up on using PostgreSQL. So when they do put it into production, it's unlikely to be overloaded on day one. But that's just because all of the systems that would have been overloaded on day one never make it to day one. They get killed off or changed before they get there. 
So it isn't like higher maximum throughput wouldn't have been beneficial. And over time, systems that initially started out not running maxed out can and, in my experience, fairly often do end up maxed out because once you've got the thing in production, it's hard to change anything, and load often does grow over time. As for non-uniform access, that is real and does matter, but there are certainly installations with tables where no rows survive long enough to need freezing, either because the table is regularly emptied, or just because the update load is high enough to hit all the rows fairly quickly. Maybe I'm misunderstanding your point here, in which case all of the above may be irrelevant. But my feeling is that we can't simply ignore cases where all/many rows are short-lived and say, well, those are unrealistic, so let's just freeze everything super-aggressively and that should be fine. I don't think that's OK. We can (and I think we should) treat that situation as a special case rather than as the typical case, but I think it would be a bad idea to dismiss it as a case that we don't need to worry about at all. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
On Fri, Sep 29, 2023 at 11:27 AM Robert Haas wrote:
> I think that's true. For me, the issue is what a user is practically likely to notice and care about. I submit that on a not-particularly-busy system, it would probably be fine to freeze aggressively in almost every situation, because you're only incurring costs you can afford to pay. On a busy system, it's more important to be right, or at least not too badly wrong. But even on a busy system, I think that when the time between data being written and being frozen is more than a few tens of minutes, it's very doubtful that anyone is going to notice the contribution that freezing makes to the overall workload. They're much more likely to notice an annoying autovacuum than they are to notice a bit of excess freezing that ends up getting reversed. But when you start cranking the time between writing data and freezing it down into the single-digit numbers of minutes, and even more if you push down to tens of seconds or less, now I think people are going to care more about useless freezing work than about long-term autovacuum risks. Because now their database is really busy so they care a lot about performance, and seemingly most of the data involved is ephemeral anyway.

I think that that's all true. As I believe I pointed out at least once in the past, my thinking about the practical requirements from users was influenced by this paper/talk:

https://www.microsoft.com/en-us/research/video/cost-performance-in-modern-data-stores-how-data-cashing-systems-succeed/

For the types of applications that Postgres is a natural fit for, most of the data is cold -- very cold ("freezing" cold, even). If you have a 50GB pgbench_accounts table, Postgres will always perform suboptimally. But so will SQL Server, Oracle, and MySQL. While pgbench makes a fine stress-test, for the most part its workload is highly unrealistic. 
And yet we seem to think that it's just about the most important benchmark of all. If we're not willing to get over even small regressions in pgbench, I fear we'll never make significant progress in this area. It's at least partly a cultural problem IMV.

The optimal freezing strategy for pgbench_accounts is to never freeze, even once. It doesn't matter if you vary the distributions of the accounts table updates -- it's still inevitable that each and every row will get updated before too long. In fact, it's not just freezing that should always be avoided here -- the same applies to pruning by VACUUM.

As I suggested in my last email, the lesson offered by the pgbench_accounts table seems to be "never VACUUM at all, except perhaps to advance relfrozenxid" (which shouldn't actually require freezing even one page). If you haven't tuned heap fill factor, then you might want to VACUUM a bit, at first. But, overall, vacuuming is bad. That is the logical though absurd conclusion. It completely flies in the face of practical experience.

-- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Fri, Sep 29, 2023 at 11:27 AM Robert Haas wrote: > > Even if you're willing to assume that vacuum_freeze_min_age isn't just > > an arbitrary threshold, this still seems wrong. vacuum_freeze_min_age > > is applied by VACUUM, at the point that it scans pages. If VACUUM were > > infinitely fast, and new VACUUMs were launched constantly, then > > vacuum_freeze_min_age (and this bucketing scheme) might make more > > sense. But, you know, they're not. So whether or not VACUUM (with > > Andres' algorithm) deems a page that it has frozen to have been > > opportunistically frozen or not is greatly influenced by factors that > > couldn't possibly be relevant. > > I'm not totally sure that I'm understanding what you're concerned > about here, but I *think* that the issue you're worried about here is: > if we have various rules that can cause freezing, let's say X Y and Z, > and we adjust the aggressiveness of rule X based on the performance of > rule Y, that would be stupid and might suck. Your summary is pretty close. There are a couple of specific nuances to it, though: 1. Anything that uses XID age or even LSN age necessarily depends on when VACUUM shows up, which itself depends on many other random things. With small to medium sized tables that don't really grow, it's perhaps reasonable to expect this to not matter. But with tables like the TPC-C order/order lines table, or even pgbench_history, the next VACUUM operation will reliably be significantly longer and more expensive than the last one, forever (ignoring the influence of aggressive mode, and assuming typical autovacuum settings). So VACUUMs get bigger and less frequent as the table grows. As the table continues to grow, at some point we reach a stage where many XIDs encountered by VACUUM will be significantly older than vacuum_freeze_min_age, while others will be significantly younger. 
And so whether we apply the vacuum_freeze_min_age rule (or some other age based rule) is increasingly a matter of random happenstance (i.e. is more and more due to when VACUUM happens to show up), and has less and less to do with what the workload signals we should do. This is a moving target, but (if I'm not mistaken) under the scheme described by Andres we're not even trying to compensate for that. Separately, I have a practical concern: 2. It'll be very hard to independently track the effectiveness of rules X, Y, and Z as a practical matter, because the application of each rule quite naturally influences the application of every other rule over time. They simply aren't independent things in any practical sense. Even if this wasn't an issue, I can't think of a reasonable cost model. Is it good or bad if "opportunistic freezing" results in unfreezing 50% of the time? AFAICT that's an *extremely* complicated question. You cannot just interpolate from the 0% case (definitely good) and the 100% case (definitely bad) and expect to get a sensible answer. You can't split the difference -- even if we allow ourselves to ignore tricky value judgement type questions. > Assuming that the previous sentence is a correct framing, let's take X > to be "freezing based on the page LSN age" and Y to be "freezing based > on vacuum_freeze_min_age". I think the problem scenario here would be > if it turns out that, under some set of circumstances, Y freezes more > aggressively than X. For example, suppose the user runs VACUUM FREEZE, > effectively setting vacuum_freeze_min_age=0 for that operation. If the > table is being modified at all, it's likely to suffer a bunch of > unfreezing right afterward, which could cause us to decide to make > future vacuums freeze less aggressively. That's not necessarily what > we want, because evidently the user, at least at that moment in time, > thought that previous freezing hadn't been aggressive enough. 
They > might be surprised to find that flash-freezing the table inhibited > future automatic freezing. I didn't think of that one myself, but it's a great example. > Or suppose that they just have a very high XID consumption rate > compared to the rate of modifications to this particular table, such > that criteria related to vacuum_freeze_min_age tend to be satisfied a > lot, and thus vacuums tend to freeze a lot no matter what the page LSN > age is. This scenario actually doesn't seem like a problem, though. In > this case the freezing criterion based on page LSN age is already not > getting used, so it doesn't really matter whether we tune it up or > down or whatever. It would have to be a smaller table, which I'm relatively unconcerned about. > > Okay then. I guess it's more accurate to say that we'll have a strong > > bias in the direction of freezing when an FPI won't result, though not > > an infinitely strong bias. We'll at least have something that can be > > thought of as an improved version of the FPI thing for 17, I think -- > > which is definitely significant progress. > > I do kind of wonder whether we're going to care about
Re: Eager page freeze criteria clarification
On Fri, Sep 29, 2023 at 11:57 AM Peter Geoghegan wrote: > Assuming your concern is more or less limited to those cases where the > same page could be frozen an unbounded number of times (once or almost > once per VACUUM), then I think we fully agree. We ought to converge on > the right behavior over time, but it's particularly important that we > never converge on the wrong behavior instead. I think that more or less matches my current thinking on the subject. A caveat might be: If it were once per two vacuums rather than once per vacuum, that might still be an issue. But I agree with the idea that the case that matters is *repeated* wasteful freezing. I don't think freezing is expensive enough that individual instances of mistaken freezing are worth getting too stressed about, but as you say, the overall pattern does matter. > The TPC-C scenario is partly interesting because it isn't actually > obvious what the most desirable behavior is, even assuming that you > had perfect information, and were not subject to practical > considerations about the complexity of your algorithm. There doesn't > seem to be perfect clarity on what the goal should actually be in such > scenarios -- it's not like the problem is just that we can't agree on > the best way to accomplish those goals with this specific workload. > > If performance/efficiency and performance stability are directly in > tension (as they sometimes are), how much do you want to prioritize > one or the other? It's not an easy question to answer. It's a value > judgement as much as anything else. I think that's true. For me, the issue is what a user is practically likely to notice and care about. I submit that on a not-particularly-busy system, it would probably be fine to freeze aggressively in almost every situation, because you're only incurring costs you can afford to pay. On a busy system, it's more important to be right, or at least not too badly wrong. 
But even on a busy system, I think that when the time between data being written and being frozen is more than a few tens of minutes, it's very doubtful that anyone is going to notice the contribution that freezing makes to the overall workload. They're much more likely to notice an annoying autovacuum than they are to notice a bit of excess freezing that ends up getting reversed. But when you start cranking the time between writing data and freezing it down into the single-digit numbers of minutes, and even more if you push down to tens of seconds or less, now I think people are going to care more about useless freezing work than about long-term autovacuum risks. Because now their database is really busy so they care a lot about performance, and seemingly most of the data involved is ephemeral anyway. > Even if you're willing to assume that vacuum_freeze_min_age isn't just > an arbitrary threshold, this still seems wrong. vacuum_freeze_min_age > is applied by VACUUM, at the point that it scans pages. If VACUUM were > infinitely fast, and new VACUUMs were launched constantly, then > vacuum_freeze_min_age (and this bucketing scheme) might make more > sense. But, you know, they're not. So whether or not VACUUM (with > Andres' algorithm) deems a page that it has frozen to have been > opportunistically frozen or not is greatly influenced by factors that > couldn't possibly be relevant. I'm not totally sure that I'm understanding what you're concerned about here, but I *think* that the issue you're worried about here is: if we have various rules that can cause freezing, let's say X, Y, and Z, and we adjust the aggressiveness of rule X based on the performance of rule Y, that would be stupid and might suck. Assuming that the previous sentence is a correct framing, let's take X to be "freezing based on the page LSN age" and Y to be "freezing based on vacuum_freeze_min_age". 
I think the problem scenario here would be if it turns out that, under some set of circumstances, Y freezes more aggressively than X. For example, suppose the user runs VACUUM FREEZE, effectively setting vacuum_freeze_min_age=0 for that operation. If the table is being modified at all, it's likely to suffer a bunch of unfreezing right afterward, which could cause us to decide to make future vacuums freeze less aggressively. That's not necessarily what we want, because evidently the user, at least at that moment in time, thought that previous freezing hadn't been aggressive enough. They might be surprised to find that flash-freezing the table inhibited future automatic freezing. Or suppose that they just have a very high XID consumption rate compared to the rate of modifications to this particular table, such that criteria related to vacuum_freeze_min_age tend to be satisfied a lot, and thus vacuums tend to freeze a lot no matter what the page LSN age is. This scenario actually doesn't seem like a problem, though. In this case the freezing criterion based on page LSN age is already not getting used, so it doesn't really matter wheth
Re: Eager page freeze criteria clarification
On Fri, Sep 29, 2023 at 7:55 AM Robert Haas wrote: > On Thu, Sep 28, 2023 at 12:03 AM Peter Geoghegan wrote: > > But isn't the main problem *not* freezing when we could and > > should have? (Of course the cost of freezing is very relevant, but > > it's still secondary.) > > Perhaps this is all in how you look at it, but I don't see it this > way. It's easy to see how to solve the "not freezing" problem: just > freeze everything as often as possible. If that were cheap enough that > we could just do it, then we'd just do it and be done here. The > problem is that, at least in my opinion, that seems too expensive in > some cases. I'm starting to believe that those cases are narrower than > I once thought, but I think they do exist. So now, I'm thinking that > maybe the main problem is identifying when you've got such a case, so > that you know when you need to be less aggressive. Assuming your concern is more or less limited to those cases where the same page could be frozen an unbounded number of times (once or almost once per VACUUM), then I think we fully agree. We ought to converge on the right behavior over time, but it's particularly important that we never converge on the wrong behavior instead. > > Won't the algorithm that you've sketched always think that > > "unfreezing" pages doesn't affect recently frozen pages with such a > > workload? Isn't the definition of "recently frozen" that emerges from > > this algorithm not in any way related to the order delivery time, or > > anything like that? You know, rather like vacuum_freeze_min_age. > > FWIW, I agree that vacuum_freeze_min_age sucks. I have been reluctant > to endorse changes in this area mostly because I fear replacing one > bad idea with another, not because I think that what we have now is > particularly good. It's better to be wrong in the same way in every > release than to have every release be equally wrong but in a different > way. 
> > Also, I think the question of what "recently frozen" means is a good > one, but I'm not convinced that it ought to relate to the order > delivery time. I don't think it should, either. The TPC-C scenario is partly interesting because it isn't actually obvious what the most desirable behavior is, even assuming that you had perfect information, and were not subject to practical considerations about the complexity of your algorithm. There doesn't seem to be perfect clarity on what the goal should actually be in such scenarios -- it's not like the problem is just that we can't agree on the best way to accomplish those goals with this specific workload. If performance/efficiency and performance stability are directly in tension (as they sometimes are), how much do you want to prioritize one or the other? It's not an easy question to answer. It's a value judgement as much as anything else. > If we insert into a table and 12-14 hours go by before > it's updated, it doesn't seem particularly bad to me if we froze that > data meanwhile (regardless of what metric drove that freezing). Same > thing if it's 2-4 hours. What seems bad to me is if we're constantly > updating the table and vacuum comes sweeping through and freezing > everything to no purpose over and over again and then it gets > un-frozen a few seconds or minutes later. Right -- we agree here. I even think that it makes sense to freeze pages knowing for sure that the pages will be unfrozen on that sort of timeline (at least with a large and ever growing table like this). It may technically be less efficient, but not when you consider how everything else is affected by the disruptive impact of freezing a great deal of stuff all at once. (Of course it's also true that we don't really know what will happen, which is all the more reason to freeze eagerly.) > Now maybe that's the wrong idea. 
After all, as a percentage, the > overhead is the same either way, regardless of whether we're talking > about WAL volume or CPU cycles. But somehow it feels worse to make the > same mistakes every few minutes or potentially even tens of seconds > than it does to make the same mistakes every few hours. The absolute > cost is a lot higher. I agree. Another problem with the algorithm that Andres sketched is that it supposes that vacuum_freeze_min_age means something relevant -- that's how we decide whether or not freezing should count as "opportunistic". But there really isn't that much difference between (say) an XID age of 25 million and 50 million. At least not with a table like the TPC-C tables, where VACUUMs are naturally big operations that take place relatively infrequently. Assume a default vacuum_freeze_min_age of 50 million. How can it make sense to deem freezing a page "opportunistic" when its oldest XID has only attained an age of 25 million, if the subsequent unfreezing happens when that same XID would have attained an age of 75 million, had we not frozen it? And if you agree that it doesn't make sense, how can we compensate for this effect, as
Re: Eager page freeze criteria clarification
On Thu, Sep 28, 2023 at 12:03 AM Peter Geoghegan wrote: > But isn't the main problem *not* freezing when we could and > should have? (Of course the cost of freezing is very relevant, but > it's still secondary.) Perhaps this is all in how you look at it, but I don't see it this way. It's easy to see how to solve the "not freezing" problem: just freeze everything as often as possible. If that were cheap enough that we could just do it, then we'd just do it and be done here. The problem is that, at least in my opinion, that seems too expensive in some cases. I'm starting to believe that those cases are narrower than I once thought, but I think they do exist. So now, I'm thinking that maybe the main problem is identifying when you've got such a case, so that you know when you need to be less aggressive. > Won't the algorithm that you've sketched always think that > "unfreezing" pages doesn't affect recently frozen pages with such a > workload? Isn't the definition of "recently frozen" that emerges from > this algorithm not in any way related to the order delivery time, or > anything like that? You know, rather like vacuum_freeze_min_age. FWIW, I agree that vacuum_freeze_min_age sucks. I have been reluctant to endorse changes in this area mostly because I fear replacing one bad idea with another, not because I think that what we have now is particularly good. It's better to be wrong in the same way in every release than to have every release be equally wrong but in a different way. Also, I think the question of what "recently frozen" means is a good one, but I'm not convinced that it ought to relate to the order delivery time. If we insert into a table and 12-14 hours go by before it's updated, it doesn't seem particularly bad to me if we froze that data meanwhile (regardless of what metric drove that freezing). Same thing if it's 2-4 hours. 
What seems bad to me is if we're constantly updating the table and vacuum comes sweeping through and freezing everything to no purpose over and over again and then it gets un-frozen a few seconds or minutes later. Now maybe that's the wrong idea. After all, as a percentage, the overhead is the same either way, regardless of whether we're talking about WAL volume or CPU cycles. But somehow it feels worse to make the same mistakes every few minutes or potentially even tens of seconds than it does to make the same mistakes every few hours. The absolute cost is a lot higher. > On a positive note, I like that what you've laid out freezes eagerly > when an FPI won't result -- this much we can all agree on. I guess > that that part is becoming uncontroversial. I don't think that we're going to be able to get away with freezing rows in a small, frequently-updated table just because no FPI will result. I think Melanie's results show that the cost is not negligible. But Andres's pseudocode algorithm, although it is more aggressive in that case, doesn't necessarily seem bad to me, because it still has some ability to hold off freezing in such cases if our statistics show that it isn't working out. -- Robert Haas EDB: http://www.enterprisedb.com
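The per-table "statistics" Robert alludes to here could take roughly the shape described elsewhere in the thread: back off opportunistic freezing only when un-freezing is both a significant expense (a noticeable fraction of foreground page modifications hit a recently frozen page) and ineffective (most frozen pages don't stay frozen). A hedged sketch, with all counter names and thresholds invented for illustration rather than taken from the patch:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy per-table counters; nothing here is actual PostgreSQL code. We only
 * back off when un-freezing is both expensive (meaningful share of page
 * modifications) and ineffective (most frozen pages get unfrozen again).
 */
static bool
back_off_freezing(long page_modifications, long recent_unfreezes,
                  long pages_frozen)
{
    double cost;
    double waste;

    if (page_modifications == 0 || pages_frozen == 0)
        return false;

    /* share of foreground page modifications that hit a recent freeze */
    cost = (double) recent_unfreezes / page_modifications;
    /* share of frozen pages that failed to stay frozen */
    waste = (double) recent_unfreezes / pages_frozen;

    /* thresholds are arbitrary illustrative values */
    return cost > 0.01 && waste > 0.5;
}
```

With the million-page example from earlier in the thread (a million modifications, nine recent unfreezes), the cost term is tiny, so no backing off is indicated even though 90% of the frozen pages encountered were recently frozen.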
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 7:09 PM Melanie Plageman wrote: > At the risk of seeming too execution-focused, I want to try and get more > specific. Here is a description of an example implementation to test my > understanding: > > In table-level stats, save two numbers: younger_than_cpt/older_than_cpt > storing the number of instances of unfreezing a page which is either > younger or older than the start of the most recent checkpoint at the > time of its unfreezing When I first read this, I thought this sounded right, except for the naming which doesn't mention freezing, but on further thought, this feels like it might not be the right thing. Why do I care how many super-old pages (regardless of how exactly we define super-old) I unfroze? I think what I care about is more like: how often do I have to unfreeze a recently-frozen page in order to complete a DML operation? For example, consider a table with a million pages, 10 of which are frozen. 1 was frozen a long time ago, 9 were frozen recently, and the other 999,990 are unfrozen. I update every page in the table. Well, now older_than_cpt = 1 and younger_than_cpt = 9, so the ratio of the two is terrible: 90% of the frozen pages I encountered were recently-frozen. But that's irrelevant. What matters is that if I had frozen less aggressively, I could have saved myself 9 unfreeze operations across a million page modifications. So actually I'm doing great: only 0.0009% of my page modifications required an unfreeze that I maybe should have avoided and didn't. Likewise, if all of the pages had been frozen, but only 9 were frozen recently, I'm doing exactly equally great. But if the number of recently frozen pages were higher, then I'd be doing less great. So I think comparing the number of young pages unfrozen to the number of old pages unfrozen is the wrong test, and comparing it instead to the number of pages modified is a better test. But I don't think it's completely right, either. 
Imagine a table with 100 recently frozen pages. I modify the table and unfreeze one of them. So, 100% of the pages that I unfroze were recently-frozen. Does this mean that we should back off and freeze less aggressively? Not really. It seems like I actually got this case 99% right. So another idea could be to look at what percentage of frozen pages I end up un-freezing shortly thereafter. That gets this example right. But it gets the earlier example with a million pages wrong. Maybe there's something to combining these ideas. If the percentage of page modifications (of clean pages, say, so that we don't skew the stats as much when a page is modified many times in a row) that have to un-freeze a recently-frozen page is low, then that means there's no real performance penalty for whatever aggressive freezing we're doing. Even if we un-freeze a zillion pages, if that happens over the course of modifying 100 zillion pages, it's not material. On the flip side, if the percentage of frozen pages that get unfrozen soon thereafter is low, then it feels worth doing even if most page modifications end up un-freezing a recently-frozen page, because on net we're still ending up with a lot of extra stuff frozen. Said differently, we're freezing too aggressively if un-freezing is both adding a significant expense (i.e. the foreground work we're doing often requires un-freezing pages) and ineffective (i.e. we don't end up with frozen pages left over). If it has one of those problems, it's still OK, but if it has both, we need to back off. Maybe that's still not quite right, but it's the best I've got this morning. > You would likely want to combine it with one of the other heuristics we > discussed. 
> > For example: > For a table with only 20% younger unfreezings, when vacuuming that page, > > if insert LSN - RedoRecPtr < insert LSN - page LSN > page is older than the most recent checkpoint start, so freeze it > regardless of whether or not it would emit an FPI > > What aggressiveness levels should there be? What should change at each > level? What criteria should pages have to meet to be subject to the > aggressiveness level? My guess is that, when we're being aggressive, we should be aiming to freeze all pages that can be completely frozen with no additional criterion. Any additional criterion that we add here is likely to mess up the insert-only table case, and I'm not really sure what it saves us from. On the other hand, when we're being non-aggressive, we might still sometimes want to freeze in situations where we historically haven't. Consider the following three examples: 1. pgbench_accounts table, standard run 2. pgbench_accounts table, highly skewed run 3. slowly growing table with a hot tail. records are inserted, updated heavily for a while, thereafter updated only very occasionally. Aggressive freezing could misfire in all of these cases, because all of them have some portion of the table where rows are being updated very frequently. But in the second and third cases, the
Re: Eager page freeze criteria clarification
On Fri, Sep 8, 2023 at 12:07 AM Andres Freund wrote: > > Hi, > > On 2023-09-06 10:35:17 -0400, Robert Haas wrote: > > On Wed, Sep 6, 2023 at 1:09 AM Andres Freund wrote: > > > Yea, it'd be useful to have a reasonably approximate wall clock time for > > > the > > > last modification of a page. We just don't have infrastructure for > > > determining > > > that. We'd need an LSN->time mapping (xid->time wouldn't be particularly > > > useful, I think). > > > > > > A very rough approximate modification time can be computed by assuming an > > > even > > > rate of WAL generation, and using the LSN at the time of the last vacuum > > > and > > > the time of the last vacuum, to compute the approximate age. > > > > > > For a while I thought that'd not give us anything that just using LSNs > > > gives > > > us, but I think it might allow coming up with a better cutoff logic: > > > Instead > > > of using a cutoff like "page LSN is older than 10% of the LSNs since the > > > last > > > vacuum of the table", it would allow us to approximate "page has not been > > > modified in the last 15 seconds" or such. I think that might help avoid > > > unnecessary freezing on tables with very frequent vacuuming. > > > > Yes. I'm uncomfortable with the last-vacuum-LSN approach mostly > > because of the impact on very frequently vacuumed tables, and > > secondarily because of the impact on very infrequently vacuumed > > tables. > > > > Downthread, I proposed using the RedoRecPtr of the latest checkpoint > > rather than the LSN of the previous vacuum. I still like that idea. > > Assuming that "downthread" references > https://postgr.es/m/CA%2BTgmoYb670VcDFbekjn2YQOKF9a7e-kBFoj2WJF1HtH7YPaWQ%40mail.gmail.com > could you sketch out the logic you're imagining a bit more? > > > > It's a value that we already have, with no additional bookkeeping. It > > varies over a much narrower range than the interval between vacuums on > > a table. 
The vacuum interval could be as short as tens of seconds or as > > long as years, while the checkpoint interval is almost always going to > > be between a few minutes at the low end and some tens of minutes at > > the high end, hours at the very most. That's quite appealing. > > The reason I was thinking of using the "lsn at the end of the last vacuum", is > that it seems to be more adaptive to the frequency of vacuuming. > > On the one hand, if a table is rarely autovacuumed because it is huge, > (InsertLSN-RedoRecPtr) might or might not be representative of the workload > over a longer time. On the other hand, if a page in a frequently vacuumed > table has an LSN from around the last vacuum (or even before), it should be > frozen, but will appear to be recent in RedoRecPtr based heuristics? > > > Perhaps we can mix both approaches. We can use the LSN and time of the last > vacuum to establish an LSN->time mapping that's reasonably accurate for a > relation. For infrequently vacuumed tables we can use the time between > checkpoints to establish a *more aggressive* cutoff for freezing than what a > percent-of-time-since-last-vacuum approach would provide. If e.g. a table gets > vacuumed every 100 hours and checkpoint timeout is 1 hour, no realistic > percent-of-time-since-last-vacuum setting will allow freezing, as all dirty > pages will be too new. To allow freezing a decent proportion of those, we > could allow freezing pages that lived longer than ~20% > time-between-recent-checkpoints. One big sticking point for me (brought up elsewhere in this thread, but, AFAICT never resolved) is that it seems like the fact that we mark pages all-visible even when not freezing them means that no matter what heuristic we use, we won't have the opportunity to freeze the pages we most want to freeze. Getting to the specific proposal here -- having two+ heuristics and using them based on table characteristics: Take the append-only case. 
Let's say that we had a way to determine if an insert-only table should use the vacuum LSN heuristic or older than X checkpoints heuristic. Given the default autovacuum_vacuum_insert_threshold, every 1000 tuples inserted, autovacuum will vacuum the table. If you insert to the table often, then it is a frequently vacuumed table and, if I'm understanding the proposal correctly, you would want to use a vacuum LSN-based threshold to determine which pages are old enough to freeze. Using only the page-older-than-X%-of-LSNs-since-last-vacuum heuristic and assuming no concurrent workloads, each autovacuum will leave X% of the pages unfrozen. However, those pages will likely be marked all visible. That means that the next non-aggressive autovacuum will not even scan those pages (assuming that run of pages exceeds the skip pages threshold). So some chunk of pages will get left behind on every autovacuum. On the other hand, let's say that the table is not frequently inserted to, so we consider it an infrequently updated table and want to use the checkpoint-based heuristic. If the page hasn't
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 6:35 PM Andres Freund wrote: > > if insert LSN - RedoRecPtr < insert LSN - page LSN > > page is older than the most recent checkpoint start, so freeze it > > regardless of whether or not it would emit an FPI > > > > What aggressiveness levels should there be? What should change at each > > level? What criteria should pages have to meet to be subject to the > > aggressiveness level? > > I'm thinking something very roughly along these lines could make sense: > > page_lsn_age = insert_lsn - page_lsn; While there is no reason to not experiment here, I have my doubts about what you've sketched out. Most importantly, it doesn't have anything to say about the cost of not freezing -- just the cost of freezing. But isn't the main problem *not* freezing when we could and should have? (Of course the cost of freezing is very relevant, but it's still secondary.) But even leaving that aside, I just don't get why this will work with the case that you yourself emphasized earlier on: a workload with inserts plus "hot tail" updates. If you run TPC-C according to spec, there is about 12 or 14 hours between the initial inserts into the orders and order lines table (by a new order transaction), and the subsequent updates (from the delivery transaction). When I run the benchmark, I usually don't stick with the spec (it's rather limiting on modern hardware), so it's more like 2 - 4 hours before each new order is delivered (meaning updated in those two big tables). Either way, it's a fairly long time relative to everything else. Won't the algorithm that you've sketched always think that "unfreezing" pages doesn't affect recently frozen pages with such a workload? Isn't the definition of "recently frozen" that emerges from this algorithm not in any way related to the order delivery time, or anything like that? You know, rather like vacuum_freeze_min_age. Separately, at one point you also said "Yes. 
If the ratio of opportunistically frozen pages (which I'd define as pages that were frozen not because they strictly needed to be) vs the number of unfrozen pages increases, we need to make opportunistic freezing less aggressive and vice versa". Can we expect a discount for freezing that happened to be very cheap anyway, when that doesn't work out? What about a page that we would have had to have frozen anyway (based on the conventional vacuum_freeze_min_age criteria) not too long after it was frozen by this new mechanism, that nevertheless became unfrozen some time later? That is, a page where "the unfreezing" cannot reasonably be blamed on the initial so-called opportunistic freezing, because really it was a total accident involving when VACUUM showed up? You know, just like we'd expect with the TPC-C tables. Aside: "unfrozen pages" seems to refer to pages that were frozen, and became unfrozen. Not pages that are simply frozen. Lots of opportunities for confusion here. I'm not saying that it's wrong to freeze like this in the specific case of TPC-C. But do you really need to invent all this complicated infrastructure, just to avoid freezing the same pages again in a tight loop? On a positive note, I like that what you've laid out freezes eagerly when an FPI won't result -- this much we can all agree on. I guess that that part is becoming uncontroversial. -- Peter Geoghegan
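As a side note on the criterion quoted at the top of this message: "insert LSN - RedoRecPtr < insert LSN - page LSN" simplifies algebraically to page_lsn < RedoRecPtr, i.e. the page was last modified before the most recent checkpoint's redo pointer. A minimal sketch of that equivalence, with function and variable names invented for illustration (real LSNs are XLogRecPtr values, here treated as plain 64-bit byte positions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only, not PostgreSQL code. Cancelling insert_lsn from both
 * sides of the quoted inequality leaves page_lsn < redo_rec_ptr, i.e. the
 * page predates the latest checkpoint start.
 */
static bool
modified_before_last_checkpoint(uint64_t insert_lsn, uint64_t redo_rec_ptr,
                                uint64_t page_lsn)
{
    return (insert_lsn - redo_rec_ptr) < (insert_lsn - page_lsn);
}
```

For example, with insert LSN 1000 and a checkpoint redo pointer of 800, a page last stamped at LSN 700 predates the checkpoint, while one stamped at 900 does not.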
Re: Eager page freeze criteria clarification
On 2023-09-27 19:09:41 -0400, Melanie Plageman wrote: > On Wed, Sep 27, 2023 at 3:25 PM Robert Haas wrote: > > > > On Wed, Sep 27, 2023 at 12:34 PM Andres Freund wrote: > > > One way to deal with that would be to not track the average age in > > > LSN-difference-bytes, but convert the value to some age metric at that > > > time. If we e.g. were to convert the byte-age into an approximate age in > > > checkpoints, with quadratic bucketing (e.g. 0 -> current checkpoint, 1 -> 1 checkpoint ago, 2 -> 2 checkpoints ago, 3 -> 4 checkpoints ago, ...), using a > > > mean > > > of that age would probably be fine. > > > > Yes. I think it's possible that we could even get by with just two > > buckets. Say current checkpoint and not. Or current-or-previous > > checkpoint and not. And just look at what percentage of accesses fall > > into this first bucket -- it should be small or we're doing it wrong. > > It seems like the only thing we actually need to avoid is freezing the > > same pages over and over again in a tight loop. > > At the risk of seeming too execution-focused, I want to try and get more > specific. I think that's a good intuition :) > Here is a description of an example implementation to test my > understanding: > > In table-level stats, save two numbers: younger_than_cpt/older_than_cpt > storing the number of instances of unfreezing a page which is either > younger or older than the start of the most recent checkpoint at the > time of its unfreezing > This has the downside of counting most unfreezings directly after a > checkpoint in the older_than_cpt bucket. That is: older_than_cpt != > longer_frozen_duration at certain times in the checkpoint cycle. Yea - I don't think just using before/after checkpoint is a good measure. As you say, it'd be quite jumpy around checkpoints - even though the freezing behaviour hasn't materially changed. I think using the *distance* between checkpoints would be a more reliable measure, i.e. 
if (insert_lsn - page_lsn) < recent_average_lsn_diff_between_checkpoints, then it's recently modified, otherwise not.

One problem with using checkpoint "distances" to control things is forced/immediate checkpoints. The fact that a base backup was started (and thus a checkpoint completed much earlier than it would have otherwise) shouldn't make our system assume that the overall behaviour is quite different going forward.

> Now, I'm trying to imagine how this would interact in a meaningful way
> with opportunistic freezing behavior during vacuum.
>
> You would likely want to combine it with one of the other heuristics we
> discussed.
>
> For example:
> For a table with only 20% younger unfreezings, when vacuuming that page,

Fwiw, I wouldn't say that unfreezing 20% of recently frozen pages is a low value.

> if insert LSN - RedoRecPtr < insert LSN - page LSN
> page is older than the most recent checkpoint start, so freeze it
> regardless of whether or not it would emit an FPI
>
> What aggressiveness levels should there be? What should change at each
> level? What criteria should pages have to meet to be subject to the
> aggressiveness level?

I'm thinking something very roughly along these lines could make sense:

    page_lsn_age = insert_lsn - page_lsn;

    if (dirty && !fpi)
    {
        /*
         * If we can freeze without an FPI, be quite aggressive about
         * opportunistically freezing. We just need to prevent freezing
         * when the table is constantly being rewritten. It's ok to make
         * mistakes initially - the rate of unfreezes will quickly stop us
         * from making mistakes as often.
         */
#define NO_FPI_FREEZE_FACTOR 10.0
        if (page_lsn_age > average_lsn_bytes_per_checkpoint *
            (1 - recent_unfreeze_ratio) * NO_FPI_FREEZE_FACTOR)
            freeze = true;
    }
    else
    {
        /*
         * Freezing would emit an FPI and/or dirty the page, making freezing
         * quite a bit more costly. Be more hesitant about freezing recently
         * modified data, unless it's very rare that we unfreeze recently
         * modified data. For insert-only/mostly tables, unfreezes should be
         * rare, so we'll still freeze most of the time.
         */
#define FPI_FREEZE_FACTOR 1
        if (page_lsn_age > average_lsn_bytes_per_checkpoint *
            (1 - recent_unfreeze_ratio) * FPI_FREEZE_FACTOR)
            freeze = true;
    }

Greetings, Andres Freund
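For concreteness, Andres's rough sketch can be transcribed into a single self-contained function. This is only a restatement of the pseudocode above, not PostgreSQL code: should_freeze and its parameters are invented names, and LSNs are treated as plain byte offsets into the WAL stream.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_FPI_FREEZE_FACTOR 10.0
#define FPI_FREEZE_FACTOR    1.0

/*
 * Direct transcription of the sketch: compare the page's LSN age against a
 * threshold derived from the average WAL volume per checkpoint, the recent
 * unfreeze ratio, and a factor that depends on whether freezing would emit
 * a full-page image.
 */
static bool
should_freeze(uint64_t insert_lsn, uint64_t page_lsn,
              bool dirty, bool would_emit_fpi,
              double avg_lsn_bytes_per_checkpoint,
              double recent_unfreeze_ratio)
{
    double page_lsn_age = (double) (insert_lsn - page_lsn);
    double factor = (dirty && !would_emit_fpi) ? NO_FPI_FREEZE_FACTOR
                                               : FPI_FREEZE_FACTOR;

    return page_lsn_age >
        avg_lsn_bytes_per_checkpoint * (1.0 - recent_unfreeze_ratio) * factor;
}
```

For instance, with 1 MB of WAL per checkpoint on average and no recent unfreezes, a page whose LSN is 2 MB behind the insert position clears the FPI-case threshold but not the no-FPI-case threshold, given the constants as written in the sketch.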
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 5:42 PM Melanie Plageman wrote: > On Wed, Sep 27, 2023 at 5:27 PM Peter Geoghegan wrote: > > What about my idea of holding back when some tuples are already frozen > > from before? Admittedly that's still a fairly raw idea, but something > > along those lines seems promising. > > Originally, when you proposed this, the idea was to avoid freezing a > page that has some tuples already frozen because that would mean we > have frozen it and it was unfrozen after. Right. > But, today, I was thinking about what heuristics to combine with one > that considers the average amount of time pages in a table spend > frozen [1], and I think it might be interesting to use the "some > tuples on the page are frozen" test in the opposite way than it was > originally proposed. > > Take a case where you insert a row and then update it once and repeat > forever. Let's say you freeze the page before you've filled it up, > and, on the next update, the page is unfrozen. Most of the tuples on > the page will still be frozen. The next time the page is vacuumed, the > fact that the page has a lot of frozen tuples actually means that it > is probably worth freezing the remaining few tuples and marking the > page frozen. Assuming that you can get away with not writing an FPI, yeah, probably. Otherwise, maybe, less clear. > Basically, I'm wondering if we can use the page-is-partially-frozen > test to catch cases where we are willing to freeze a page a couple of > times to make sure it gets frozen. That doesn't seem like the opposite of my suggestion (except maybe in a mechanical sort of way). To me this sounds like a refinement of the same idea (not surprising that it would need to be refined). You seem to be suggesting that we would freeze if either no tuples were frozen at all, or if almost all tuples were already frozen -- but not if the page was "somewhere in between". I think I see where you're going with that. -- Peter Geoghegan
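The refinement Peter summarizes here (freeze when no tuples are frozen yet, or when nearly all already are, but hold back when the page is "somewhere in between") can be written as a tiny predicate. This is a toy sketch; the function name and the 90% cutoff are invented for illustration, not proposed anywhere in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy predicate for the "somewhere in between" idea: pages with no frozen
 * tuples are fair game, pages that are almost entirely frozen are worth
 * topping off, and partially frozen pages (possible evidence of recent
 * unfreezing) are left alone. The 0.9 cutoff is arbitrary.
 */
static bool
freeze_given_frozen_fraction(int frozen_tuples, int total_tuples)
{
    if (frozen_tuples == 0)
        return true;
    return (double) frozen_tuples / total_tuples >= 0.9;
}
```

On a 50-tuple page this would freeze at 0 or 48 frozen tuples, but hold back at 25.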
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 6:07 PM Andres Freund wrote:
> > I would be sure to look out for new inserts that "unfreeze" pages, too
> > -- ideally you'd have instrumentation that caught that, in order to
> > get a general sense of the extent of the problem in each of your
> > chosen representative workloads. This is particularly likely to be a
> > concern when there is enough space on a heap page to fit one more heap
> > tuple, that's smaller than most other tuples. The FSM will "helpfully"
> > make sure of it. This problem isn't rare at all, unfortunately.
>
> I'm not as convinced as you are that this is a problem / that the solution
> won't cause more problems than it solves. Users are concerned when free space
> can't be used - you don't have to look further than the discussion in the last
> weeks about adding the ability to disable HOT to fight bloat.

All that I said was that it would be a good idea to keep an eye out for
it during performance validation work.

> I do agree that the FSM code tries way too hard to fit things onto early pages
> - it e.g. can slow down concurrent copy workloads by 3-4x due to contention in
> the FSM - and that it has more size classes than necessary, but I don't think
> just closing frozen pages against further insertions of small tuples will
> cause its own set of issues.

Of course it'll cause other issues. Of course it's very subtle.

> I think at the very least there'd need to be something causing pages to reopen
> once the aggregate unused space in the table reaches some threshold.

Of course that's true. ISTM that you might well need some kind of
hysteresis to avoid pages ping-ponging. If it isn't sticky, it might
never settle, or take way too long to settle.

--
Peter Geoghegan
Re: Eager page freeze criteria clarification
Hi,

On 2023-09-27 17:43:04 -0700, Peter Geoghegan wrote:
> On Wed, Sep 27, 2023 at 5:20 PM Melanie Plageman wrote:
> > > Can you define "unfreeze"? I don't know if this newly invented term
> > > refers to unsetting a page that was marked all-frozen following (say)
> > > an UPDATE, or if it refers to choosing to not freeze when the option
> > > was available (in the sense that it was possible to do it and fully
> > > mark the page all-frozen in the VM). Or something else.
> >
> > By "unfreeze", I mean unsetting a page all frozen in the visibility
> > map when modifying the page for the first time after it was last
> > frozen.
>
> I see. So I guess that Andres meant that you'd track that within all
> backends, using pgstats infrastructure (when he summarized your call
> earlier today)?

That call was just between Robert and me (and not dedicated just to this
topic, fwiw). Yes, I was thinking of tracking that in pgstat. I can
imagine occasionally rolling it over into pg_class, to better deal with
crashes / failovers, but am fairly agnostic on whether that's really
useful / necessary.

> And that that information would be an important input for VACUUM, as opposed
> to something that it maintained itself?

Yes. If the ratio of opportunistically frozen pages (which I'd define as
pages that were frozen not because they strictly needed to be) vs the
number of unfrozen pages increases, we need to make opportunistic
freezing less aggressive and vice versa.

> ISTM that the concept of "unfreezing" a page is equivalent to
> "opening" the page that was "closed" at some point (by VACUUM). It's
> not limited to freezing per se -- it's "closed for business until
> further notice", which is a slightly broader concept (and one not
> unique to Postgres). You don't just need to be concerned about updates
> and deletes -- inserts are also a concern.
>
> I would be sure to look out for new inserts that "unfreeze" pages, too
> -- ideally you'd have instrumentation that caught that, in order to
> get a general sense of the extent of the problem in each of your
> chosen representative workloads. This is particularly likely to be a
> concern when there is enough space on a heap page to fit one more heap
> tuple, that's smaller than most other tuples. The FSM will "helpfully"
> make sure of it. This problem isn't rare at all, unfortunately.

I'm not as convinced as you are that this is a problem / that the
solution won't cause more problems than it solves. Users are concerned
when free space can't be used - you don't have to look further than the
discussion in the last weeks about adding the ability to disable HOT to
fight bloat.

I do agree that the FSM code tries way too hard to fit things onto early
pages - it e.g. can slow down concurrent copy workloads by 3-4x due to
contention in the FSM - and that it has more size classes than
necessary, but I don't think just closing frozen pages against further
insertions of small tuples will cause its own set of issues.

I think at the very least there'd need to be something causing pages to
reopen once the aggregate unused space in the table reaches some
threshold.

Greetings,

Andres Freund
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 5:20 PM Melanie Plageman wrote:
> > Can you define "unfreeze"? I don't know if this newly invented term
> > refers to unsetting a page that was marked all-frozen following (say)
> > an UPDATE, or if it refers to choosing to not freeze when the option
> > was available (in the sense that it was possible to do it and fully
> > mark the page all-frozen in the VM). Or something else.
>
> By "unfreeze", I mean unsetting a page all frozen in the visibility
> map when modifying the page for the first time after it was last
> frozen.

I see. So I guess that Andres meant that you'd track that within all
backends, using pgstats infrastructure (when he summarized your call
earlier today)? And that that information would be an important input
for VACUUM, as opposed to something that it maintained itself?

> I would probably call choosing not to freeze when the option is
> available "no freeze". I have been thinking of what to call it because
> I want to add some developer stats for myself indicating why a page
> that was freezable was not frozen.

I think that having that sort of information available via custom
instrumentation (just for the performance validation side) makes a lot
of sense.

ISTM that the concept of "unfreezing" a page is equivalent to "opening"
the page that was "closed" at some point (by VACUUM). It's not limited
to freezing per se -- it's "closed for business until further notice",
which is a slightly broader concept (and one not unique to Postgres).
You don't just need to be concerned about updates and deletes -- inserts
are also a concern.

I would be sure to look out for new inserts that "unfreeze" pages, too
-- ideally you'd have instrumentation that caught that, in order to get
a general sense of the extent of the problem in each of your chosen
representative workloads. This is particularly likely to be a concern
when there is enough space on a heap page to fit one more heap tuple,
that's smaller than most other tuples. The FSM will "helpfully" make
sure of it. This problem isn't rare at all, unfortunately.

> > The choice to freeze or not freeze pretty much always relies on
> > guesswork about what'll happen to the page in the future, no?
> > Obviously we wouldn't even apply the FPI trigger criteria if we could
> > somehow easily determine that it won't work out (to some degree that's
> > what conditioning it on being able to set the all-frozen VM bit
> > actually does).
>
> I suppose you are thinking of "opportunistic" as freezing whenever we
> aren't certain it is the right thing to do simply because we have the
> opportunity to do it?

I have heard the term "opportunistic freezing" used to refer to freezing
that takes place outside of VACUUM before now. You know, something
perfectly analogous to pruning in VACUUM versus opportunistic pruning.
(I knew that you can't have meant that -- my point is that the
terminology in this area has problems.)

> I want a way to express "freeze when freeze min age doesn't require it"

That makes sense when you consider where we are right now, but it'll
sound odd in a world where freezing via min_freeze_age is the exception
rather than the rule. If anything, it would make more sense if the
traditional min_freeze_age trigger criteria was the type of freezing
that needed its own adjective.

--
Peter Geoghegan
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 5:27 PM Peter Geoghegan wrote:
> On Wed, Sep 27, 2023 at 1:45 PM Andres Freund wrote:
> > I am much more concerned about cases where
> > opportunistic freezing requires an FPI - it'll often *still* be the right
> > choice to freeze the page, but we need a way to prevent that from causing a
> > lot of WAL in worse cases.
>
> What about my idea of holding back when some tuples are already frozen
> from before? Admittedly that's still a fairly raw idea, but something
> along those lines seems promising.

Originally, when you proposed this, the idea was to avoid freezing a
page that has some tuples already frozen because that would mean we have
frozen it and it was unfrozen after.

But, today, I was thinking about what heuristics to combine with one
that considers the average amount of time pages in a table spend frozen
[1], and I think it might be interesting to use the "some tuples on the
page are frozen" test in the opposite way than it was originally
proposed.

Take a case where you insert a row and then update it once and repeat
forever. Let's say you freeze the page before you've filled it up, and,
on the next update, the page is unfrozen. Most of the tuples on the page
will still be frozen. The next time the page is vacuumed, the fact that
the page has a lot of frozen tuples actually means that it is probably
worth freezing the remaining few tuples and marking the page frozen.

Basically, I'm wondering if we can use the page-is-partially-frozen test
to catch cases where we are willing to freeze a page a couple of times
to make sure it gets frozen.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_Y0nLmQ%3DYS1c2ORzLi7bu3eWjdx%2B32BuFc0Tho2o7E3rw%40mail.gmail.com
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 7:39 PM Peter Geoghegan wrote:
> On Wed, Sep 27, 2023 at 4:09 PM Melanie Plageman wrote:
> > At the risk of seeming too execution-focused, I want to try and get more
> > specific. Here is a description of an example implementation to test my
> > understanding:
> >
> > In table-level stats, save two numbers: younger_than_cpt/older_than_cpt
> > storing the number of instances of unfreezing a page which is either
> > younger or older than the start of the most recent checkpoint at the
> > time of its unfreezing
>
> Can you define "unfreeze"? I don't know if this newly invented term
> refers to unsetting a page that was marked all-frozen following (say)
> an UPDATE, or if it refers to choosing to not freeze when the option
> was available (in the sense that it was possible to do it and fully
> mark the page all-frozen in the VM). Or something else.

By "unfreeze", I mean unsetting a page all frozen in the visibility map
when modifying the page for the first time after it was last frozen.

I would probably call choosing not to freeze when the option is
available "no freeze". I have been thinking of what to call it because I
want to add some developer stats for myself indicating why a page that
was freezable was not frozen.

> I also find the term "opportunistic freezing" confusing. What's
> opportunistic about it? It seems as if the term has been used as a
> synonym of "freezing not triggered by min_freeze_age" on this thread,
> or "freezing that is partly conditioned on setting the page all-frozen
> in the VM", but I'm not even sure of that.

By "opportunistic freezing", I refer to freezing not triggered by
min_freeze_age. Though we only do so now if the page could be set
all-frozen, that is not a requirement for use of the term, IMO. I assume
that we are not considering changing any behavior related to freezing
when a tuple is older than freeze_min_age, so I think of that freezing
as "required". And any freezing of newer tuples is opportunistic.

> The choice to freeze or not freeze pretty much always relies on
> guesswork about what'll happen to the page in the future, no?
> Obviously we wouldn't even apply the FPI trigger criteria if we could
> somehow easily determine that it won't work out (to some degree that's
> what conditioning it on being able to set the all-frozen VM bit
> actually does).

I suppose you are thinking of "opportunistic" as freezing whenever we
aren't certain it is the right thing to do simply because we have the
opportunity to do it? If so, I'm not sure we need that distinction
currently. It seems more useful in a narrative context as a way of
characterizing a situation than it does when categorizing different
behaviors. I want a way to express "freeze when freeze min age doesn't
require it".

- Melanie
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 4:09 PM Melanie Plageman wrote:
> At the risk of seeming too execution-focused, I want to try and get more
> specific. Here is a description of an example implementation to test my
> understanding:
>
> In table-level stats, save two numbers: younger_than_cpt/older_than_cpt
> storing the number of instances of unfreezing a page which is either
> younger or older than the start of the most recent checkpoint at the
> time of its unfreezing

Can you define "unfreeze"? I don't know if this newly invented term
refers to unsetting a page that was marked all-frozen following (say) an
UPDATE, or if it refers to choosing to not freeze when the option was
available (in the sense that it was possible to do it and fully mark the
page all-frozen in the VM). Or something else.

I also find the term "opportunistic freezing" confusing. What's
opportunistic about it? It seems as if the term has been used as a
synonym of "freezing not triggered by min_freeze_age" on this thread, or
"freezing that is partly conditioned on setting the page all-frozen in
the VM", but I'm not even sure of that.

The choice to freeze or not freeze pretty much always relies on
guesswork about what'll happen to the page in the future, no? Obviously
we wouldn't even apply the FPI trigger criteria if we could somehow
easily determine that it won't work out (to some degree that's what
conditioning it on being able to set the all-frozen VM bit actually
does).

--
Peter Geoghegan
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 3:25 PM Robert Haas wrote:
> On Wed, Sep 27, 2023 at 12:34 PM Andres Freund wrote:
> > One way to deal with that would be to not track the average age in
> > LSN-difference-bytes, but convert the value to some age metric at that
> > time. If we e.g. were to convert the byte-age into an approximate age in
> > checkpoints, with quadratic bucketing (e.g. 0 -> current checkpoint, 1 -> 1
> > checkpoint ago, 2 -> 2 checkpoints ago, 3 -> 4 checkpoints ago, ...), using
> > a mean of that age would probably be fine.
>
> Yes. I think it's possible that we could even get by with just two
> buckets. Say current checkpoint and not. Or current-or-previous
> checkpoint and not. And just look at what percentage of accesses fall
> into this first bucket -- it should be small or we're doing it wrong.
> It seems like the only thing we actually need to avoid is freezing the
> same pages over and over again in a tight loop.

At the risk of seeming too execution-focused, I want to try and get more
specific. Here is a description of an example implementation to test my
understanding:

In table-level stats, save two numbers, younger_than_cpt/older_than_cpt,
storing the number of instances of unfreezing a page which is either
younger or older than the start of the most recent checkpoint at the
time of its unfreezing.

Upon update or delete (and insert?), if the page being modified is
frozen:

    if insert LSN - RedoRecPtr > insert LSN - old page LSN
        page is younger, younger_than_cpt += 1
    otherwise,
        older_than_cpt += 1

The ratio of younger/total and older/total can be used to determine how
aggressive opportunistic freezing will be.

This has the downside of counting most unfreezings directly after a
checkpoint in the older_than_cpt bucket. That is: older_than_cpt !=
longer_frozen_duration at certain times in the checkpoint cycle.

Now, I'm trying to imagine how this would interact in a meaningful way
with opportunistic freezing behavior during vacuum. You would likely
want to combine it with one of the other heuristics we discussed. For
example, for a table with only 20% younger unfreezings, when vacuuming a
page:

    if insert LSN - RedoRecPtr < insert LSN - page LSN
        page is older than the most recent checkpoint start, so freeze
        it regardless of whether or not it would emit an FPI

What aggressiveness levels should there be? What should change at each
level? What criteria should pages have to meet to be subject to the
aggressiveness level? I have some ideas, but I'd like to try an
algorithm along these lines with an updated work queue workload and the
insert-only workload. And I want to make sure I understand the proposal
first.

- Melanie
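The bookkeeping described above can be written down as a rough, self-contained C sketch. All names, the stand-in XLogRecPtr typedef, and the 20% cutoff are hypothetical; in PostgreSQL proper the counters would presumably live in pgstats and the LSNs would come from GetInsertRecPtr() and GetRedoRecPtr():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's WAL position */

/* Hypothetical per-table counters; the real ones would live in pgstats. */
typedef struct UnfreezeStats
{
    uint64_t    younger_than_cpt;
    uint64_t    older_than_cpt;
} UnfreezeStats;

/*
 * Classify an unfreezing.  "insert LSN - RedoRecPtr > insert LSN - old page
 * LSN" simplifies to "page LSN > RedoRecPtr": the page was last modified
 * after the most recent checkpoint started, so it counts as "younger".
 */
void
count_unfreeze(UnfreezeStats *stats, XLogRecPtr page_lsn, XLogRecPtr redo_ptr)
{
    if (page_lsn > redo_ptr)
        stats->younger_than_cpt++;
    else
        stats->older_than_cpt++;
}

/*
 * Vacuum-side sketch: for a table whose unfreezings are mostly "older"
 * (here: under 20% younger, an arbitrary cutoff), freeze a page that
 * predates the checkpoint start regardless of whether it would emit an FPI.
 */
bool
freeze_despite_fpi(const UnfreezeStats *stats, XLogRecPtr page_lsn,
                   XLogRecPtr redo_ptr)
{
    uint64_t    total = stats->younger_than_cpt + stats->older_than_cpt;
    double      young_ratio;

    young_ratio = (total == 0) ? 0.0 : (double) stats->younger_than_cpt / total;
    return young_ratio < 0.2 && page_lsn <= redo_ptr;
}
```

Note that this sketch inherits the downside mentioned above: a page unfrozen just after a checkpoint lands in the older_than_cpt bucket regardless of how long it actually stayed frozen.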
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 2:26 PM Peter Geoghegan wrote:
> On Wed, Sep 27, 2023 at 1:45 PM Andres Freund wrote:
> > I think we need to make vacuums on large tables much more aggressive than
> > they are now, independent of opportunistic freezing heuristics. It's
> > idiotic that on large tables we delay vacuuming until multi-pass vacuums
> > are pretty much guaranteed.
>
> Not having to do all of the freezing at once will often still make
> sense in cases where we "lose".

One more thing on this, and the subject of large tables that keep
getting larger (including those with a "hot tail" of updates):

Since autovacuum runs against such tables at geometric intervals (as
determined by autovacuum_vacuum_insert_scale_factor), the next VACUUM is
always going to be longer and more expensive than this VACUUM, forever
(ignoring the influence of aggressive mode for a second). This would
even be true if we didn't have the related problem of
autovacuum_vacuum_insert_scale_factor not accounting for the fact that
when VACUUM starts and when VACUUM ends aren't exactly the same thing in
large tables [1] -- that aspect just makes the problem even worse.

Basically, the whole "wait and see" approach makes zero sense here
because we really do need to be aggressive about freezing just to keep
up with the workload. The number of pages we'll scan in the next VACUUM
will always be significantly larger, even if we're very aggressive about
freezing (theoretically it might not be, but then what VACUUM does
doesn't matter that much either way). Time is very much not on our side
here.

So we need to anticipate what happens next with the workload, and how
that affects VACUUM in the future -- not just how VACUUM affects the
workload. (VACUUM is just another part of the workload, in fact.)

[1] https://www.postgresql.org/message-id/CAH2-Wzn=bZ4wynYB0hBAeF4kGXGoqC=PZVKHeerBU-je9AQF=g...@mail.gmail.com

--
Peter Geoghegan
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 1:45 PM Andres Freund wrote:
> On 2023-09-27 13:14:41 -0700, Peter Geoghegan wrote:
> > As a general rule, I think that we're better off gambling against
> > future FPIs, and then pulling back if we go too far. The fact that we
> > went one VACUUM operation without the workload unsetting an
> > all-visible page isn't that much of a signal about what might happen
> > to that page.
>
> I think we can afford to be quite aggressive about opportunistically
> freezing when doing so wouldn't emit an FPI.

I agree. This part is relatively easy -- it is more or less v2 of the
FPI thing. The only problem that I see is that it just isn't that
compelling on its own -- it just doesn't seem ambitious enough.

> I am much more concerned about cases where
> opportunistic freezing requires an FPI - it'll often *still* be the right
> choice to freeze the page, but we need a way to prevent that from causing a
> lot of WAL in worse cases.

What about my idea of holding back when some tuples are already frozen
from before? Admittedly that's still a fairly raw idea, but something
along those lines seems promising.

If you limit yourself to what I've called the easy cases, then you can't
really expect to make a dent in the problems that we see with large
tables -- where FPIs are pretty much the norm rather than the exception.
Again, that's fine, but I'd be disappointed if you and Melanie can't do
better than that for 17.

> > 2. Large tables (i.e. the tables where it really matters) just don't
> > have that many VACUUM operations, relative to everything else.
>
> I think we need to make vacuums on large tables much more aggressive than
> they are now, independent of opportunistic freezing heuristics. It's
> idiotic that on large tables we delay vacuuming until multi-pass vacuums
> are pretty much guaranteed.

Not having to do all of the freezing at once will often still make sense
in cases where we "lose". It's hard to precisely describe how to assess
such things (what's the break-even point?), but that makes it no less
true. Constantly losing by a small amount is usually better than massive
drops in performance.

> The current logic made some sense when we didn't have the VM, but now
> autovacuum scheduling is influenced by the portion of the table that that
> vacuum will never look at, which makes no sense.

Yep:
https://www.postgresql.org/message-id/CAH2-Wz=MGFwJEpEjVzXwEjY5yx=uunpza6bt4dsmasrgluq...@mail.gmail.com

> > Who says we'll get more than one opportunity per page with these tables,
> > even with this behavior of scanning all-visible pages in non-aggressive
> > VACUUMs? Big append-only tables simply won't get the opportunity to catch
> > up in the next non-aggressive VACUUM if there simply isn't one.
>
> I agree that we need to freeze pages in append only tables ASAP. I don't
> think they're that hard a case to detect though. The harder case is the -
> IME very common - case of tables that are largely immutable but have a
> moving tail that's hotly updated.

Doing well with such "moving tail of updates" tables is exactly what
doing well on TPC-C requires. It's easy to detect that a table is either
one of these two things. Which of the two it is specifically presents
greater difficulties. So I don't see strict append-only versus "append
and update each row once" as all that different, practically speaking,
from the point of view of VACUUM.

Often, the individual pages from TPC-C's order lines table look very
much like pgbench_history style pages would -- just because VACUUM is
either ahead or behind the current "hot tail" position with almost all
pages that it scans, due to the current phase of the moon. This is
partly why I place so much emphasis on teaching VACUUM to understand
that what it sees in each page might well have a lot to do with when it
happened to show up, as opposed to something about the workload/table
itself.

BTW, if you're worried about the "hot tail" case in particular then you
should definitely put the FSM stuff in scope here -- it's a huge part of
the overall problem, since it makes the pages take so much longer to
"settle" than you might expect when just considering the workload
abstractly.

--
Peter Geoghegan
Re: Eager page freeze criteria clarification
Hi,

On 2023-09-27 13:14:41 -0700, Peter Geoghegan wrote:
> On Wed, Sep 27, 2023 at 1:03 PM Andres Freund wrote:
> > I suspect that medium term the better approach would be to be much more
> > aggressive about setting all-visible, including as part of page-level
> > visibility checks, and to deal with the concern of vacuum not processing
> > such pages soon by not just looking at unmarked pages, but also a portion
> > of the all-visible-but-not-frozen pages (the more
> > all-visible-but-not-frozen pages there are, the more of them each vacuum
> > should process). I think all-visible is too important for index only
> > scans, for us to be able to remove it, or delay setting it until freezing
> > makes sense.
> >
> > My confidence in my gut feeling isn't all that high here, though.
>
> I think that this is a bad idea. I see two main problems with
> "splitting the difference" like this:
>
> 1. If we randomly scan some all-visible pages in non-aggressive
> VACUUMs, then we're sure to get FPIs in order to be able to freeze.
>
> As a general rule, I think that we're better off gambling against
> future FPIs, and then pulling back if we go too far. The fact that we
> went one VACUUM operation without the workload unsetting an
> all-visible page isn't that much of a signal about what might happen
> to that page.

I think we can afford to be quite aggressive about opportunistically
freezing when doing so wouldn't emit an FPI. I am much more concerned
about cases where opportunistic freezing requires an FPI - it'll often
*still* be the right choice to freeze the page, but we need a way to
prevent that from causing a lot of WAL in worse cases.

I think the cases where no-FPI-required freezing is a problem are small,
frequently updated tables, which are very frequently vacuumed.

> 2. Large tables (i.e. the tables where it really matters) just don't
> have that many VACUUM operations, relative to everything else.

I think we need to make vacuums on large tables much more aggressive
than they are now, independent of opportunistic freezing heuristics.
It's idiotic that on large tables we delay vacuuming until multi-pass
vacuums are pretty much guaranteed. We don't have the stats to detect
that, but if we had them, we should trigger autovacuums when dead items
+ dead tuples get to a significant fraction of what can be tracked in
m_w_m.

The current logic made some sense when we didn't have the VM, but now
autovacuum scheduling is influenced by the portion of the table that
that vacuum will never look at, which makes no sense. One can argue that
that frozen portion is a proxy for the size of the indexes that would
need to be scanned - but that also doesn't make much sense to me,
because the number & size of indexes depends a lot on the workload and
because we skip index vacuums when not necessary.

I guess we could just use (relation_size - skipped_pages) *
autovacuum_vacuum_scale_factor as the threshold, but we don't have a
cheap way to compute the number of frozen pages right now (we do have
relallvisible for non-aggressive vacuums).

> Who says we'll get more than one opportunity per page with these tables,
> even with this behavior of scanning all-visible pages in non-aggressive
> VACUUMs? Big append-only tables simply won't get the opportunity to catch
> up in the next non-aggressive VACUUM if there simply isn't one.

I agree that we need to freeze pages in append only tables ASAP. I don't
think they're that hard a case to detect though. The harder case is the
- IME very common - case of tables that are largely immutable but have a
moving tail that's hotly updated.

Greetings,

Andres Freund
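The alternative threshold mentioned above can be written down directly. A hedged sketch in C (function and parameter names are made up; the rule, like the message says, only approximates since skipped_pages is not cheaply available today):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the suggested scheduling rule: base the autovacuum threshold on
 * the portion of the table a vacuum would actually look at
 * (relation_pages - skipped_pages), scaled by autovacuum_vacuum_scale_factor,
 * rather than on the full relation size.  Units here are pages throughout.
 */
bool
needs_autovacuum(uint64_t relation_pages, uint64_t skipped_pages,
                 uint64_t changed_pages, double scale_factor)
{
    double      threshold;

    threshold = (double) (relation_pages - skipped_pages) * scale_factor;
    return (double) changed_pages > threshold;
}
```

With a 1000-page table of which 900 pages would be skipped, a scale factor of 0.2 (the GUC's default) yields a threshold of 20 pages instead of 200 -- which is the whole point: the frozen portion no longer inflates the threshold.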
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 1:03 PM Andres Freund wrote:
> I suspect that medium term the better approach would be to be much more
> aggressive about setting all-visible, including as part of page-level
> visibility checks, and to deal with the concern of vacuum not processing
> such pages soon by not just looking at unmarked pages, but also a portion
> of the all-visible-but-not-frozen pages (the more
> all-visible-but-not-frozen pages there are, the more of them each vacuum
> should process). I think all-visible is too important for index only
> scans, for us to be able to remove it, or delay setting it until freezing
> makes sense.
>
> My confidence in my gut feeling isn't all that high here, though.

I think that this is a bad idea. I see two main problems with "splitting
the difference" like this:

1. If we randomly scan some all-visible pages in non-aggressive VACUUMs,
then we're sure to get FPIs in order to be able to freeze.

As a general rule, I think that we're better off gambling against future
FPIs, and then pulling back if we go too far. The fact that we went one
VACUUM operation without the workload unsetting an all-visible page
isn't that much of a signal about what might happen to that page.

2. Large tables (i.e. the tables where it really matters) just don't
have that many VACUUM operations, relative to everything else.

Who says we'll get more than one opportunity per page with these tables,
even with this behavior of scanning all-visible pages in non-aggressive
VACUUMs? Big append-only tables simply won't get the opportunity to
catch up in the next non-aggressive VACUUM if there simply isn't one.

--
Peter Geoghegan
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 12:34 PM Andres Freund wrote:
> What do you mean with "always freeze aggressively" - do you mean
> 'aggressive' autovacuums? Or opportunistic freezing being aggressive? I
> don't know why the former would be the case?

I meant the latter.

> > When it grows large enough, we suddenly stop freezing it aggressively, and
> > now it starts experiencing vacuums that do a whole bunch of work all at
> > once. A user who notices that is likely to be pretty confused about what
> > happened, and maybe not too happy when they find out.
>
> Hm - isn't my proposal exactly the other way round? I'm proposing that a
> page is frozen more aggressively if not already in shared buffers - which
> will become more common once the table has grown "large enough"?

OK, but it's counterintuitive either way, IMHO.

> I think there were three main ideas that we discussed:
>
> 1) We don't need to be accurate in the freezing decisions for individual
>    pages, we "just" need to avoid the situation that over time we commonly
>    freeze pages that will be updated again "soon".
> 2) It might be valuable to adjust the "should freeze page opportunistically"
>    based on feedback.
> 3) We might need to classify the workload for a table and use different
>    heuristics for different workloads.

I agree with all of that. Good summary.

> One way to deal with that would be to not track the average age in
> LSN-difference-bytes, but convert the value to some age metric at that
> time. If we e.g. were to convert the byte-age into an approximate age in
> checkpoints, with quadratic bucketing (e.g. 0 -> current checkpoint, 1 -> 1
> checkpoint ago, 2 -> 2 checkpoints ago, 3 -> 4 checkpoints ago, ...),
> using a mean of that age would probably be fine.

Yes. I think it's possible that we could even get by with just two
buckets. Say current checkpoint and not. Or current-or-previous
checkpoint and not. And just look at what percentage of accesses fall
into this first bucket -- it should be small or we're doing it wrong. It
seems like the only thing we actually need to avoid is freezing the same
pages over and over again in a tight loop.

--
Robert Haas
EDB: http://www.enterprisedb.com
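A tiny sketch of the quadratic (power-of-two) bucketing Andres describes, in C for concreteness. The function name is made up, and the age in checkpoints is assumed to already have been derived from the byte-age:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Power-of-two bucketing of an age measured in checkpoints:
 *   bucket 0 -> current checkpoint, bucket 1 -> 1 checkpoint ago,
 *   bucket 2 -> 2-3 checkpoints ago, bucket 3 -> 4-7 checkpoints ago, ...
 * i.e. bucket n (for n >= 1) covers ages in [2^(n-1), 2^n).
 */
int
checkpoint_age_bucket(uint64_t checkpoints_ago)
{
    int         bucket = 0;

    /* position of the highest set bit, plus one */
    while (checkpoints_ago > 0)
    {
        checkpoints_ago >>= 1;
        bucket++;
    }
    return bucket;
}
```

The two-bucket variant Robert floats is just this collapsed to "bucket 0 (or 0-1) vs. everything else": track what fraction of unfreezings land in the first bucket, and back off opportunistic freezing when that fraction grows.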
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 2:36 PM Melanie Plageman wrote: > It seems like the ideal freeze pattern for an insert-only table would be > to freeze as soon as the page is full before any checkpoints which could > force you to emit an FPI. Yes. So imagine we have two freeze criteria: 1. Do not ever opportunistically freeze pages. 2. Opportunistically freeze pages whenever we can completely freeze the page. For a table where the same rows are frequently updated or where the table is frequently truncated, we want to apply (1). For an insert-only table, we want to apply (2). Maybe I'm all wet here, but it's starting to seem to me that in almost all other cases, we also want to apply (2). If that's true, then the only thing we need to do here is identify the special cases where (1) is correct. It'd be even better if we could adapt this behavior down to the page level - imagine a large mostly cold table with a hot tail. In a perfect world we apply (1) to the tail and (2) to the rest. But this might be too much of a stretch to actually get right. > One big sticking point for me (brought up elsewhere in this thread, but, > AFAICT never resolved) is that it seems like the fact that we mark pages > all-visible even when not freezing them means that no matter what > heuristic we use, we won't have the opportunity to freeze the pages we > most want to freeze. The only solution to this problem that I can see is what Peter proposed earlier: if we're not prepared to freeze the page, then don't mark it all-visible either. This might be the right thing to do, and if it is, we could even go further and get rid of the two as separate concepts completely. However, I think it would be OK to leave all of that to one side for the moment, *especially* if we adopt some proposal that does a lot more opportunistic freezing than we do currently. Because then the problem just wouldn't come up nearly as much as it does now. One patch can't fix everything, and shouldn't try. 
-- Robert Haas EDB: http://www.enterprisedb.com
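The two freeze criteria Robert contrasts above can be stated as a tiny decision rule. This is a hedged illustration in Python, not PostgreSQL code; the page representation and function names are invented for the example.

```python
# Sketch of the two policies discussed above:
#   policy 1: never freeze opportunistically (hot / frequently-truncated tables)
#   policy 2: freeze whenever the whole page can be frozen (insert-only tables)
# The page model (a dict of tuples with a "freezable" flag) is invented.

def should_freeze(page, policy):
    """Decide whether VACUUM would opportunistically freeze this page."""
    if policy == 1:
        return False  # never freeze opportunistically
    # policy 2: only freeze if every tuple on the page is freezable,
    # so the page can be set all-frozen in one go
    return all(t["freezable"] for t in page["tuples"])

hot_page = {"tuples": [{"freezable": True}, {"freezable": False}]}
cold_page = {"tuples": [{"freezable": True}, {"freezable": True}]}
```

The open question in the thread is precisely how to pick between these policies (ideally per page, not per table) without a heuristic that misfires on hot tails of mostly cold tables.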
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 11:36 AM Melanie Plageman wrote: > One big sticking point for me (brought up elsewhere in this thread, but, > AFAICT never resolved) is that it seems like the fact that we mark pages > all-visible even when not freezing them means that no matter what > heuristic we use, we won't have the opportunity to freeze the pages we > most want to freeze. I think of it as being a bit like an absorbing state from a Markov chain. It's a perfect recipe for weird path dependent behavior, which seems to make your task just about impossible. It literally makes vacuuming more frequently lead to less useful work being performed! It does so *reliably*, in the simplest of cases. Ideally, you'd be designing an algorithm for a system where we could expect to get approximately the desired outcome whenever we have approximately the right idea. A system where mistakes tend to "cancel each other out" over time. As long as you have "set all-visible but don't freeze" behavior as your starting point, then ISTM that your algorithm almost has to make no mistakes at all to really improve on what we have right now. As things stand, when your algorithm decides to not freeze (for pages that are still set all-visible in the VM), you really can't afford to be wrong. (I think that you get this already, but it's a point worth emphasizing.) -- Peter Geoghegan
Re: Eager page freeze criteria clarification
Hi, On 2023-09-27 15:45:06 -0400, Robert Haas wrote: > > One big sticking point for me (brought up elsewhere in this thread, but, > > AFAICT never resolved) is that it seems like the fact that we mark pages > > all-visible even when not freezing them means that no matter what > > heuristic we use, we won't have the opportunity to freeze the pages we > > most want to freeze. > > The only solution to this problem that I can see is what Peter > proposed earlier: if we're not prepared to freeze the page, then don't > mark it all-visible either. This might be the right thing to do, and > if it is, we could even go further and get rid of the two as separate > concepts completely. I suspect that medium term the better approach would be to be much more aggressive about setting all-visible, including as part of page-level visibility checks, and to deal with the concern of vacuum not processing such pages soon by not just looking at unmarked pages, but also a portion of the all-visible-but-not-frozen pages (the more all-visible-but-not-frozen pages there are, the more of them each vacuum should process). I think all-visible is too important for index only scans, for us to be able to remove it, or delay setting it until freezing makes sense. My confidence in my gut feeling isn't all that high here, though. > However, I think it would be OK to leave all of > that to one side for the moment, *especially* if we adopt some > proposal that does a lot more opportunistic freezing than we do > currently. Because then the problem just wouldn't come up nearly as > much as it does now. One patch can't fix everything, and shouldn't > try. Agreed. Greetings, Andres Freund
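Andres's suggestion above (have each vacuum also process a share of the all-visible-but-not-frozen pages, a larger share the more of them there are) can be sketched as a simple proportional rule. This is an invented illustration of the idea, not actual vacuum scheduling code; the scaling formula and cap are assumptions.

```python
# Hedged sketch: the more all-visible-but-not-frozen (AVNF) pages a table
# has, the larger the fraction of them each vacuum should visit. The
# fraction here is simply the AVNF share of the table, capped at 50%;
# both choices are illustrative, not proposed values.

def avnf_pages_to_scan(num_avnf_pages, total_pages, max_fraction=0.5):
    """Number of all-visible-but-not-frozen pages this vacuum should process."""
    if total_pages == 0:
        return 0
    fraction = min(max_fraction, num_avnf_pages / total_pages)
    return int(num_avnf_pages * fraction)
```

With such a rule, pages set all-visible early are still revisited for freezing at a rate that grows with the backlog, instead of being skipped until an aggressive vacuum.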
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 12:45 PM Robert Haas wrote: > > One big sticking point for me (brought up elsewhere in this thread, but, > > AFAICT never resolved) is that it seems like the fact that we mark pages > > all-visible even when not freezing them means that no matter what > > heuristic we use, we won't have the opportunity to freeze the pages we > > most want to freeze. > > The only solution to this problem that I can see is what Peter > proposed earlier: if we're not prepared to freeze the page, then don't > mark it all-visible either. This might be the right thing to do, and > if it is, we could even go further and get rid of the two as separate > concepts completely. However, I think it would be OK to leave all of > that to one side for the moment, *especially* if we adopt some > proposal that does a lot more opportunistic freezing than we do > currently. Because then the problem just wouldn't come up nearly as > much as it does now. One patch can't fix everything, and shouldn't > try. I guess that it might make sense to leave the issue of all-visible-only pages out of it for the time being. I would just point out that that means that you'll be limited to adding strictly opportunistic behaviors, where it just so happens that VACUUM stumbles upon pages where freezing early likely makes sense. Basically, an incremental improvement on the FPI thing -- not the sort of thing where I'd expect Melanie's current approach of optimizing a whole cross-section of representative workloads to really help with. I have a separate concern about it that I'll raise shortly, in my response to Andres. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 10:46 AM Andres Freund wrote: > I don't disagree that we should do something in that direction - I just don't > see the raw number of unfrozen pages being useful in that regard. If you have > a database where no pages live long, we don't need to freeze > opportunistically, yet the fraction of unfrozen pages will be huge. We don't know how to reliably predict the future. We can do our best to ameliorate problems with such workloads, using a slew of different strategies (including making the behaviors configurable, holding off on freezing a second or a third time, etc). Such workloads are not very common, and won't necessarily suffer too much. I think that they're pretty much limited to queue-like tables. Any course of action will likely have some downsides. Melanie/you/whomever will need to make a trade-off, knowing that somebody somewhere isn't going to be completely happy about it. I just don't think that anybody will ever be able to come up with an algorithm that can generalize well enough for things to not work out that way. I really care about the problems in this area being addressed more comprehensively, so I'm certainly not going to be the one that just refuses to accept a trade-off (within reason, of course). > > > If we want to take global freeze debt into account, which I think is a > > > good > > > idea, we'll need a smarter way to represent the debt than just the number > > > of > > > unfrozen pages. I think we would need to track the age of unfrozen pages > > > in > > > some way. If there are a lot of unfrozen pages with a recent xid, then > > > it's > > > fine, but if they are older and getting older, it's a problem and we need > > > to > > > be more aggressive. > > > > Tables like pgbench_history will have lots of unfrozen pages with a > > recent XID that get scanned during every VACUUM. We should be freezing > > such pages at the earliest opportunity. 
> > I think we ought to be able to freeze tables with as simple a workload as > pgbench_history has aggressively without taking a global freeze debt into > account. Agreed. > We definitely *also* should take the number of unfrozen pages into account. I > just don't think determining freeze debt primarily using the number of unfrozen > pages will be useful. The presence of unfrozen pages that are likely to be > updated again soon is not a problem and makes the simple metric pretty much > useless. I never said "primarily". > > To be clear, that doesn't mean that XID age shouldn't play an > > important role in helping VACUUM to differentiate between pages that > > should not be frozen and pages that should be frozen. > > I think we need to take it into account to determine a useful freeze debt on a > table level (and potentially system wide too). If we reach the stage where XID age starts to matter, we're no longer talking about performance stability IMV. We're talking about avoiding wraparound, which is mostly a different problem. Recall that I started this whole line of discussion about debt in reaction to a point of Robert's about lazy freezing. My original point was that (as a general rule) we can afford to be lazy when we're not risking too much -- when the eventual cost of catching up (having gotten it wrong) isn't too painful (i.e. doesn't lead to a big balloon payment down the road). Laziness is the thing that requires justification. Putting off vital maintenance work for as long as possible only makes sense under fairly limited circumstances. A complicating factor here is the influence of the free space map. The way that that works (the total lack of hysteresis to discourage using space from old pages) is probably something that makes the problems with freezing harder to solve. Maybe that should be put in scope. 
> Assuming we could compute it cheaply enough, if we had an approximate median > oldest-64bit-xid-on-page and the number of unfrozen pages, we could > differentiate between tables that have lots of recent unfrozen pages (the > median will be low) and pages with lots of unfrozen pages that are unlikely to > be updated again (the median will be high and growing). Something like the > median 64bit xid would be interesting because it'd not get "invalidated" if > relfrozenxid is increased. I'm glad that you're mostly of the view that we should be freezing a lot more aggressively overall, but I think that you're still too focussed on avoiding small problems. I understand why novel new problems are generally more of a concern than established old problems, but there needs to be a sense of proportion. Performance stability is incredibly important, and isn't zero cost. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
Hi, On 2023-09-27 10:25:00 -0700, Peter Geoghegan wrote: > On Wed, Sep 27, 2023 at 10:01 AM Andres Freund wrote: > > On 2023-09-26 09:07:13 -0700, Peter Geoghegan wrote: > > I don't think doing this on a system wide basis with a metric like #unfrozen > > pages is a good idea. It's quite common to have short lived data in some > > tables while also having long-lived data in other tables. Making > > opportunistic > > freezing more aggressive in that situation will just hurt, without a benefit > > (potentially even slowing down the freezing of older data!). And even > > within a > > single table, making freezing more aggressive because there's a decent sized > > part of the table that is updated regularly and thus not frozen, doesn't > > make > > sense. > > I never said that #unfrozen pages should be the sole criterion, for > anything. Just that it would influence the overall strategy, making > the system veer towards more aggressive freezing. It would complement > a more sophisticated algorithm that decides whether or not to freeze a > page based on its individual characteristics. > > For example, maybe the page-level algorithm would have a random > component. That could potentially be where the global (or at least > table level) view gets to influence things -- the random aspect is > weighed using the global view of debt. That kind of thing seems like > an interesting avenue of investigation. I don't disagree that we should do something in that direction - I just don't see the raw number of unfrozen pages being useful in that regard. If you have a database where no pages live long, we don't need to freeze opportunistically, yet the fraction of unfrozen pages will be huge. > > If we want to take global freeze debt into account, which I think is a good > > idea, we'll need a smarter way to represent the debt than just the number of > > unfrozen pages. I think we would need to track the age of unfrozen pages in > > some way. 
If there are a lot of unfrozen pages with a recent xid, then it's > > fine, but if they are older and getting older, it's a problem and we need to > > be more aggressive. > > Tables like pgbench_history will have lots of unfrozen pages with a > recent XID that get scanned during every VACUUM. We should be freezing > such pages at the earliest opportunity. I think we ought to be able to freeze tables with as simple a workload as pgbench_history has aggressively without taking a global freeze debt into account. > > The problem I see is how to track the age of unfrozen data - > > it'd be easy enough to track the mean(oldest-64bit-xid-on-page), but then we > > again have the issue of rare outliers moving the mean too much... > > I think that XID age is mostly not very important compared to the > absolute amount of unfrozen pages, and the cost profile of freezing > now versus later. (XID age *is* important in emergencies, but that's > mostly not what we're discussing right now.) We definitely *also* should take the number of unfrozen pages into account. I just don't think determining freeze debt primarily using the number of unfrozen pages will be useful. The presence of unfrozen pages that are likely to be updated again soon is not a problem and makes the simple metric pretty much useless. > To be clear, that doesn't mean that XID age shouldn't play an > important role in helping VACUUM to differentiate between pages that > should not be frozen and pages that should be frozen. I think we need to take it into account to determine a useful freeze debt on a table level (and potentially system wide too). Assuming we could compute it cheaply enough, if we had an approximate median oldest-64bit-xid-on-page and the number of unfrozen pages, we could differentiate between tables that have lots of recent unfrozen pages (the median will be low) and tables with lots of unfrozen pages that are unlikely to be updated again (the median will be high and growing). 
Something like the median 64bit xid would be interesting because it'd not get "invalidated" if relfrozenxid is increased. Greetings, Andres Freund
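The median-based metric Andres floats above can be sketched in a few lines. This is a hedged illustration in Python, not proposed PostgreSQL code; the function name, the use of an exact median, and the age threshold are all invented for the example.

```python
# Sketch: combine the set of per-page oldest 64-bit XIDs on unfrozen pages
# with the current next XID to classify a table's freeze debt. A low median
# age means the unfrozen pages are recent (likely hot, fine to leave alone);
# a high and growing median signals accumulating debt. Because the XIDs are
# 64-bit, the metric is not "invalidated" when relfrozenxid advances.

from statistics import median

def freeze_debt_signal(oldest_xid_per_unfrozen_page, next_xid, age_threshold):
    """Return True if the table looks like it is accumulating freeze debt."""
    if not oldest_xid_per_unfrozen_page:
        return False  # nothing unfrozen, no debt
    median_age = next_xid - median(oldest_xid_per_unfrozen_page)
    return median_age > age_threshold
```

A real implementation would need an approximate, cheaply maintained median (the thread's open question), but the classification logic would look like this.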
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 10:01 AM Andres Freund wrote: > On 2023-09-26 09:07:13 -0700, Peter Geoghegan wrote: > I don't think doing this on a system wide basis with a metric like #unfrozen > pages is a good idea. It's quite common to have short lived data in some > tables while also having long-lived data in other tables. Making opportunistic > freezing more aggressive in that situation will just hurt, without a benefit > (potentially even slowing down the freezing of older data!). And even within a > single table, making freezing more aggressive because there's a decent sized > part of the table that is updated regularly and thus not frozen, doesn't make > sense. I never said that #unfrozen pages should be the sole criterion, for anything. Just that it would influence the overall strategy, making the system veer towards more aggressive freezing. It would complement a more sophisticated algorithm that decides whether or not to freeze a page based on its individual characteristics. For example, maybe the page-level algorithm would have a random component. That could potentially be where the global (or at least table level) view gets to influence things -- the random aspect is weighed using the global view of debt. That kind of thing seems like an interesting avenue of investigation. > If we want to take global freeze debt into account, which I think is a good > idea, we'll need a smarter way to represent the debt than just the number of > unfrozen pages. I think we would need to track the age of unfrozen pages in > some way. If there are a lot of unfrozen pages with a recent xid, then it's > fine, but if they are older and getting older, it's a problem and we need to > be more aggressive. Tables like pgbench_history will have lots of unfrozen pages with a recent XID that get scanned during every VACUUM. We should be freezing such pages at the earliest opportunity. 
> The problem I see is how to track the age of unfrozen data - > it'd be easy enough to track the mean(oldest-64bit-xid-on-page), but then we > again have the issue of rare outliers moving the mean too much... I think that XID age is mostly not very important compared to the absolute amount of unfrozen pages, and the cost profile of freezing now versus later. (XID age *is* important in emergencies, but that's mostly not what we're discussing right now.) To be clear, that doesn't mean that XID age shouldn't play an important role in helping VACUUM to differentiate between pages that should not be frozen and pages that should be frozen. But even there it should probably be treated as a cue. And, the relationship between the XIDs on the page is probably more important than their absolute age (or relationship to XIDs that appear elsewhere more generally). For example, if a page is filled with heap tuples whose XIDs all match (i.e. tuples that were all inserted by the same transaction), or XIDs that are at least very close together, then that could make VACUUM more enthusiastic about freezing now. OTOH if the XIDs are more heterogeneous (but never very old), or if some xmin fields were frozen, then VACUUM should show much less enthusiasm for freezing if it's expensive. -- Peter Geoghegan
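The page-level cue Peter describes (homogeneous xmins make a page a better freeze candidate than heterogeneous ones) can be sketched as a small scoring function. This is an invented illustration, not PostgreSQL code; the spread threshold and the three-level score are assumptions made for the example.

```python
# Sketch: score a page's freeze candidacy from the spread of its tuple
# xmin values. Tuples all inserted by one transaction (spread 0), or by
# transactions very close together, suggest bulk-loaded data worth
# freezing eagerly; a wide spread suggests ongoing churn.

def freeze_enthusiasm(xmins, spread_threshold=16):
    """Return 2 = freeze eagerly, 1 = neutral, 0 = hold off if expensive."""
    if not xmins:
        return 1  # empty page: no signal either way
    spread = max(xmins) - min(xmins)
    if spread <= spread_threshold:
        return 2  # XIDs identical or tightly clustered: likely bulk insert
    return 0      # heterogeneous XIDs: show less enthusiasm for freezing
```

Note this scores the *relationship between* the XIDs on the page, not their absolute age, matching the distinction drawn above.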
Re: Eager page freeze criteria clarification
Hi, On 2023-09-26 09:07:13 -0700, Peter Geoghegan wrote: > On Tue, Sep 26, 2023 at 8:19 AM Andres Freund wrote: > > However, I'm not at all convinced doing this on a system wide level is a > > good > > idea. Databases do often contain multiple types of workloads at the same > > time. E.g., we want to freeze aggressively in a database that has the bulk > > of > > its size in archival partitions but has lots of unfrozen data in an active > > partition. And databases often have loads of data that's going to change > > frequently / isn't long lived, and we don't want to super aggressively > > freeze > > that, just because it's a large portion of the data. > > I didn't say that we should always have most of the data in the > database frozen, though. Just that we can reasonably be more lazy > about freezing the remainder of pages if we observe that most pages > are already frozen. How they got that way is another discussion. > > I also think that the absolute amount of debt (measured in physical > units such as unfrozen pages) should be kept under control. But that > isn't something that can ever be expected to work on the basis of a > simple threshold -- if only because autovacuum scheduling just doesn't > work that way, and can't really be adapted to work that way. I don't think doing this on a system wide basis with a metric like #unfrozen pages is a good idea. It's quite common to have short lived data in some tables while also having long-lived data in other tables. Making opportunistic freezing more aggressive in that situation will just hurt, without a benefit (potentially even slowing down the freezing of older data!). And even within a single table, making freezing more aggressive because there's a decent sized part of the table that is updated regularly and thus not frozen, doesn't make sense. If we want to take global freeze debt into account, which I think is a good idea, we'll need a smarter way to represent the debt than just the number of unfrozen pages. 
I think we would need to track the age of unfrozen pages in some way. If there are a lot of unfrozen pages with a recent xid, then it's fine, but if they are older and getting older, it's a problem and we need to be more aggressive. The problem I see is how to track the age of unfrozen data - it'd be easy enough to track the mean(oldest-64bit-xid-on-page), but then we again have the issue of rare outliers moving the mean too much... Greetings, Andres Freund
Re: Eager page freeze criteria clarification
Hi, (I wrote the first part of the email before Robert and I chatted on a call, I left it in the email for posterity) On 2023-09-26 13:49:32 -0400, Robert Haas wrote: > On Tue, Sep 26, 2023 at 11:11 AM Andres Freund wrote: > > As long as the most extreme cases are prevented, unnecessarily freezing is > > imo > > far less harmful than freezing too little. > > > > I'm worried that using something as long as 100-200% > > time-between-recent-checkpoints won't handle insert-mostly workload well, > > which IME are also workloads suffering quite badly under our current scheme > > - > > and which are quite common. > > I wrote about this problem in my other reply and I'm curious as to > your thoughts about it. Basically, suppose we forget about all of > Melanie's tests except for three cases: (1) an insert-only table, (2) > an update-heavy workload with uniform distribution, and (3) an > update-heavy workload with skew. In case (1), freezing is good. In > case (2), freezing is bad. In case (3), freezing is good for cooler > pages and bad for hotter ones. I postulate that any > recency-of-modification threshold that handles (1) well will handle > (2) poorly, and that the only way to get both right is to take some > other factor into account. You seem to be arguing that we can just > freeze aggressively in case (2) and it won't cost much, but it doesn't > sound to me like Melanie believes that and I don't think I do either. I don't believe we can freeze aggressively in all cases of 2) without causing problems. A small-ish table that's vacuumed constantly, where all rows are constantly frozen and then updated, will suffer a lot from the WAL overhead. Whereas superfluously freezing a row in a table with many millions of rows, where each row is only occasionally updated, due to the update rate being much smaller than the number of rows, will have acceptable overhead. 
What I *do* believe is that for all but the most extreme cases, it's safer to freeze too much than to freeze too little. There definitely are negative consequences, but they're more bounded and less surprising than not freezing for ages and then suddenly freezing everything at once. Whether 2) really exists in the real world for huge tables is of course somewhat debatable... > > > This doesn't seem completely stupid, but I fear it would behave > > > dramatically differently on a workload a little smaller than s_b vs. > > > one a little larger than s_b, and that doesn't seem good. > > > > Hm. I'm not sure that that's a real problem. In the case of a workload > > bigger > > than s_b, having to actually read the page again increases the cost of > > freezing later, even if the workload is just a bit bigger than s_b. > > That is true, but I don't think it means that there is no problem. It > could lead to a situation where, for a while, a table never needs any > significant freezing, because we always freeze aggressively. What do you mean by "always freeze aggressively" - do you mean 'aggressive' autovacuums? Or opportunistic freezing being aggressive? I don't know why the former would be the case? > When it grows large enough, we suddenly stop freezing it aggressively, and > now it starts experiencing vacuums that do a whole bunch of work all at > once. A user who notices that is likely to be pretty confused about what > happened, and maybe not too happy when they find out. Hm - isn't my proposal exactly the other way round? I'm proposing that a page is frozen more aggressively if not already in shared buffers - which will become more common once the table has grown "large enough"? 
(the remainder was written after that call) I think there were three main ideas that we discussed: 1) We don't need to be accurate in the freezing decisions for individual pages, we "just" need to avoid the situation that over time we commonly freeze pages that will be updated again "soon". 2) It might be valuable to adjust the "should freeze page opportunistically" based on feedback. 3) We might need to classify the workload for a table and use different heuristics for different workloads. For 2), one of the mechanisms we discussed was to collect information about the "age" of a page when we "unfreeze" it. If we frequently unfreeze pages that were recently frozen, we need to be less aggressive in opportunistic freezing going forward. If that never happens, we can be more aggressive. The basic idea for classifying the age of a page when unfreezing is to use "insert_lsn - page_lsn", pretty simple. We can convert that into time using the averaged WAL generation rate. What's a bit harder is figuring out how to usefully aggregate the age across multiple "unfreezes". I was initially thinking of just using the mean, but Robert was rightly concerned that the mean would be moved a lot when occasionally freezing very old pages, potentially leading to opportunistically freezing young pages too aggressively. The median would be a better choice, bu
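The feedback mechanism for idea 2) above can be sketched concretely. This is a hedged Python illustration of the scheme from the call notes, not PostgreSQL code; the function names, the use of an exact median as the aggregate, and the parameters are assumptions.

```python
# Sketch of the "unfreeze age" feedback loop: when a frozen page is
# modified again ("unfrozen"), estimate how long it stayed frozen as
# (insert_lsn - page_lsn) divided by the averaged WAL generation rate,
# then aggregate those ages (median, per Robert's concern about means
# being skewed by occasional very old pages) to decide whether
# opportunistic freezing has been too aggressive.

from statistics import median

def unfreeze_age_seconds(insert_lsn, page_lsn, wal_bytes_per_second):
    """Approximate wall-clock time the page spent frozen."""
    return (insert_lsn - page_lsn) / wal_bytes_per_second

def freezing_too_aggressive(unfreeze_ages, target_frozen_duration):
    """True if the median frozen page was modified again before the
    target duration, i.e. we keep freezing pages that come back hot."""
    return bool(unfreeze_ages) and median(unfreeze_ages) < target_frozen_duration
```

The LSN-to-time conversion assumes a steady averaged WAL rate; the later LSNTimeline work in this thread replaces that with interpolation over recorded (LSN, timestamp) pairs.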
Re: Eager page freeze criteria clarification
On Tue, Sep 26, 2023 at 11:11 AM Andres Freund wrote: > That I'd like you to expand on "using the RedoRecPtr of the latest checkpoint > rather than the LSN of the previous vacuum." - I can think of ways of doing so > that could end up with quite different behaviour... Yeah, me too. I'm not sure what is best. > As long as the most extreme cases are prevented, unnecessarily freezing is imo > far less harmful than freezing too little. > > I'm worried that using something as long as 100-200% > time-between-recent-checkpoints won't handle insert-mostly workloads well, > which IME are also workloads suffering quite badly under our current scheme - > and which are quite common. I wrote about this problem in my other reply and I'm curious as to your thoughts about it. Basically, suppose we forget about all of Melanie's tests except for three cases: (1) an insert-only table, (2) an update-heavy workload with uniform distribution, and (3) an update-heavy workload with skew. In case (1), freezing is good. In case (2), freezing is bad. In case (3), freezing is good for cooler pages and bad for hotter ones. I postulate that any recency-of-modification threshold that handles (1) well will handle (2) poorly, and that the only way to get both right is to take some other factor into account. You seem to be arguing that we can just freeze aggressively in case (2) and it won't cost much, but it doesn't sound to me like Melanie believes that and I don't think I do either. > > This doesn't seem completely stupid, but I fear it would behave > > dramatically differently on a workload a little smaller than s_b vs. > > one a little larger than s_b, and that doesn't seem good. > > Hm. I'm not sure that that's a real problem. In the case of a workload bigger > than s_b, having to actually read the page again increases the cost of > freezing later, even if the workload is just a bit bigger than s_b. That is true, but I don't think it means that there is no problem. 
It could lead to a situation where, for a while, a table never needs any significant freezing, because we always freeze aggressively. When it grows large enough, we suddenly stop freezing it aggressively, and now it starts experiencing vacuums that do a whole bunch of work all at once. A user who notices that is likely to be pretty confused about what happened, and maybe not too happy when they find out. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
On Tue, Sep 26, 2023 at 8:19 AM Andres Freund wrote: > However, I'm not at all convinced doing this on a system wide level is a good > idea. Databases do often contain multiple types of workloads at the same > time. E.g., we want to freeze aggressively in a database that has the bulk of > its size in archival partitions but has lots of unfrozen data in an active > partition. And databases often have loads of data that's going to change > frequently / isn't long lived, and we don't want to super aggressively freeze > that, just because it's a large portion of the data. I didn't say that we should always have most of the data in the database frozen, though. Just that we can reasonably be more lazy about freezing the remainder of pages if we observe that most pages are already frozen. How they got that way is another discussion. I also think that the absolute amount of debt (measured in physical units such as unfrozen pages) should be kept under control. But that isn't something that can ever be expected to work on the basis of a simple threshold -- if only because autovacuum scheduling just doesn't work that way, and can't really be adapted to work that way. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
Hi, On 2023-09-25 14:16:46 -0700, Peter Geoghegan wrote: > On Mon, Sep 25, 2023 at 11:45 AM Robert Haas wrote: > I'm surprised that there hasn't been any discussion of the absolute > amount of system-wide freeze debt on this thread. If 90% of the pages > in the entire database are frozen, it'll generally be okay if we make > the wrong call by freezing lazily when we shouldn't. This is doubly > true within small to medium sized tables, where the cost of catching > up on freezing cannot ever be too bad (concentrations of unfrozen > pages in one big table are what really hurt users). I generally agree with the idea of using "freeze debt" as an input - IIRC I proposed using the number of frozen pages vs number of pages as the input to the heuristic in one of the other threads about the topic a while back. I also think we should read a portion of all-visible-but-not-frozen pages during non-aggressive vacuums to manage the freeze debt. However, I'm not at all convinced doing this on a system wide level is a good idea. Databases do often contain multiple types of workloads at the same time. E.g., we want to freeze aggressively in a database that has the bulk of its size in archival partitions but has lots of unfrozen data in an active partition. And databases often have loads of data that's going to change frequently / isn't long lived, and we don't want to super aggressively freeze that, just because it's a large portion of the data. Greetings, Andres Freund
Re: Eager page freeze criteria clarification
Hi, On 2023-09-25 14:45:07 -0400, Robert Haas wrote: > On Fri, Sep 8, 2023 at 12:07 AM Andres Freund wrote: > > > Downthread, I proposed using the RedoRecPtr of the latest checkpoint > > > rather than the LSN of the previous vacuum. I still like that idea. > > > > Assuming that "downthread" references > > https://postgr.es/m/CA%2BTgmoYb670VcDFbekjn2YQOKF9a7e-kBFoj2WJF1HtH7YPaWQ%40mail.gmail.com > > could you sketch out the logic you're imagining a bit more? > > I'm not exactly sure what the question is here. That I'd like you to expand on "using the RedoRecPtr of the latest checkpoint rather than the LSN of the previous vacuum." - I can think of ways of doing so that could end up with quite different behaviour... > > Perhaps we can mix both approaches. We can use the LSN and time of the last > > vacuum to establish an LSN->time mapping that's reasonably accurate for a > > relation. For infrequently vacuumed tables we can use the time between > > checkpoints to establish a *more aggressive* cutoff for freezing than what a > > percent-of-time-since-last-vacuum approach would provide. If e.g. a table gets > > vacuumed every 100 hours and checkpoint timeout is 1 hour, no realistic > > percent-of-time-since-last-vacuum setting will allow freezing, as all dirty > > pages will be too new. To allow freezing a decent proportion of those, we > > could allow freezing pages that lived longer than ~20% > > time-between-recent-checkpoints. > > Yeah, I don't know if that's exactly the right idea, but I think it's > in the direction that I was thinking about. I'd even be happy with > 100% of the time-between-recent checkpoints, maybe even 200% of > time-between-recent checkpoints. But I think there probably should be > some threshold beyond which we say "look, this doesn't look like it > gets touched that much, let's just freeze it so we don't have to come > back to it again later." 
As long as the most extreme cases are prevented, unnecessarily freezing is imo far less harmful than freezing too little. I'm worried that using something as long as 100-200% time-between-recent-checkpoints won't handle insert-mostly workloads well, which IME are also workloads suffering quite badly under our current scheme - and which are quite common. > I think part of the calculus here should probably be that when the > freeze threshold is long, the potential gains from making it even > longer are not that much. If I change the freeze threshold on a table > from 1 minute to 1 hour, I can potentially save uselessly freezing > that page 59 times per hour, every hour, forever, if the page always > gets modified right after I touch it. If I change the freeze threshold > on a table from 1 hour to 1 day, I can only save 23 unnecessary > freezes per day. Percentage-wise, the overhead of being wrong is the > same in both cases: I can have as many extra freeze operations as I > have page modifications, if I pick the worst possible times to freeze > in every case. But in absolute terms, the savings in the second > scenario are a lot less. I think if a user is accessing a table > frequently, the overhead of jamming a useless freeze in between every > table access is going to be a lot more noticeable than when the table > is only accessed every once in a while. And I also think it's a lot > less likely that we'll reliably get it wrong. Workloads that touch a > page and then touch it again ~N seconds later can exist for all values > of N, but I bet they're way more common for small values of N than > large ones. True. And with larger Ns it also becomes more likely that we'd need to freeze the rows anyway. I've seen tables being hit with several anti-wraparound vacuums a day, but not several anti-wraparound vacuums a minute... > Is there also a need for a similar guard in the other direction? 
> Let's say that autovacuum_naptime=15s and on some particular table it triggers every time. I've actually seen this on small queue tables. Do you think that, in such tables, we should freeze pages that haven't been modified in 15s?

I don't think it matters much, proportionally to the workload of rewriting nearly all of the table every few seconds, the overhead of freezing a bunch of already dirty pages is negligible.

> > Hm, possibly stupid idea: What about using shared_buffers residency as a factor? If vacuum had to read in a page to vacuum it, a) we would need read IO to freeze it later, as we'll soon evict the page via the ringbuffer b) non-residency indicates the page isn't constantly being modified?
>
> This doesn't seem completely stupid, but I fear it would behave dramatically differently on a workload a little smaller than s_b vs. one a little larger than s_b, and that doesn't seem good.

Hm. I'm not sure that that's a real problem. In the case of a workload bigger than s_b, having to actually read the page again increases the cost of freezing later, even if the workload is just a bit bigger than s_b.

Greetings,

Andres
Re: Eager page freeze criteria clarification
On Mon, Sep 25, 2023 at 11:45 AM Robert Haas wrote: > > The reason I was thinking of using the "lsn at the end of the last vacuum", > > is > > that it seems to be more adapative to the frequency of vacuuming. > > Yes, but I think it's *too* adaptive. The frequency of vacuuming can > plausibly be multiple times per minute or not even annually. That's > too big a range of variation. +1. The risk of VACUUM chasing its own tail seems very real. We want VACUUM to be adaptive to the workload, not adaptive to itself. > Yeah, I don't know if that's exactly the right idea, but I think it's > in the direction that I was thinking about. I'd even be happy with > 100% of the time-between-recent checkpoints, maybe even 200% of > time-between-recent checkpoints. But I think there probably should be > some threshold beyond which we say "look, this doesn't look like it > gets touched that much, let's just freeze it so we don't have to come > back to it again later." The sole justification for any strategy that freezes lazily is that it can avoid useless freezing when freezing turns out to be unnecessary -- that's it. So I find it more natural to think of freezing as the default action, and *not freezing* as the thing that requires justification. Thinking about it "backwards" like that just seems simpler to me. There is only one possible reason to not freeze, but several reasons to freeze. > I think part of the calculus here should probably be that when the > freeze threshold is long, the potential gains from making it even > longer are not that much. If I change the freeze threshold on a table > from 1 minute to 1 hour, I can potentially save uselessly freezing > that page 59 times per hour, every hour, forever, if the page always > gets modified right after I touch it. If I change the freeze threshold > on a table from 1 hour to 1 day, I can only save 23 unnecessary > freezes per day. I totally agree with you on this point. 
It seems related to my point about "freezing being the conceptual default action" in VACUUM. Generally speaking, over-freezing is a problem when we reach the same wrong conclusion (freeze the page) about the same relatively few pages over and over -- senselessly repeating those mistakes really adds up when you're vacuuming the same table very frequently. On the other hand, under-freezing is typically a problem when we reach the same wrong conclusion (don't freeze the page) about lots of pages only once in a very long while. I strongly suspect that there is very little gray area between the two, across the full spectrum of application characteristics.

Most individual pages have very little chance of being modified in the short to medium term. In a perfect world, with a perfect algorithm, we'd almost certainly be freezing most pages at the earliest opportunity. It is nevertheless also true that a freezing policy that is only somewhat more aggressive than this ideal oracle algorithm will freeze way too aggressively (by at least some measures). There isn't much of a paradox to resolve here: it's all down to the cadence of vacuuming, and of rows subject to constant churn.

As you point out, the "same policy" can produce dramatically different outcomes when you actually consider what the consequences of the policy are over time, when applied by VACUUM under a variety of different workload conditions. So any freezing policy must be designed with due consideration for those sorts of things. If VACUUM doesn't freeze the page now, then when will it freeze it? For most individual pages, that time will come (again, pages that benefit from lazy vacuuming are the exception rather than the rule). Right now, VACUUM almost behaves as if it thought "that's not my problem, it's a problem for future me!".

Trying to differentiate between pages that we must not over-freeze and pages that we must not under-freeze seems important.
Generational garbage collection (as used by managed VM runtimes) does something that seems a little like this. It's based on the empirical observation that "most objects die young". What the precise definition of "young" really is varies significantly, but that turns out to be less of a problem than you might think -- it can be derived through feedback cycles. If you look at memory lifetimes on a logarithmic scale, very different sorts of applications tend to look like they have remarkably similar memory allocation characteristics. > Percentage-wise, the overhead of being wrong is the > same in both cases: I can have as many extra freeze operations as I > have page modifications, if I pick the worst possible times to freeze > in every case. But in absolute terms, the savings in the second > scenario are a lot less. Very true. I'm surprised that there hasn't been any discussion of the absolute amount of system-wide freeze debt on this thread. If 90% of the pages in the entire database are frozen, it'll generally be okay if we make the wrong call by freezing lazily when we shouldn't. This i
Re: Eager page freeze criteria clarification
On Sat, Sep 23, 2023 at 3:53 PM Melanie Plageman wrote: > Freeze tuples on a page opportunistically if the page would be totally > frozen and: > > 4. Buffer is already dirty and no FPI is required OR page LSN is older > than 33% of the LSNs since the last vacuum of the table. > > 5. Buffer is already dirty and no FPI is required AND page LSN is older > than 33% of the LSNs since the last vacuum of the table. > > On master, the heuristic is to freeze a page opportunistically if it > would be totally frozen and if pruning emitted an FPI. > --- > > My takeaways from all of the workload results are as follows: > > Algorithm 4 is too aggressive and regresses performance compared to > master. > > Algorithm 5 freezes surprisingly few pages, especially in workloads > where only the most recent data is being accessed or modified > (append-only, update recent data). > > A work queue-like workload with other concurrent workloads is a > particular challenge for the freeze heuristic, and we should think more > about how to handle this. I feel like we have a sort of Goldilocks problem here: the porridge is either too hot or too cold, never just right. Say we just look at workload B, looking at the difference between pgbench_accounts (which is randomly and frequently updated and thus shouldn't be opportunistically frozen) and pgbench_history (which is append-only and should thus be frozen aggressively). Algorithm 4 gets pgbench_history right and pgbench_accounts wrong, and master does the opposite. In a perfect world, we'd have an algorithm which could distinguish sharply between those cases, ramping up to maximum aggressiveness on pgbench_history while doing nothing at all to pgbench_accounts. Algorithm 5 partially accomplishes this, but the results aren't super-inspiring either. It doesn't add many page freezes in the case where freezing is bad, but it also only manages to freeze a quarter of pgbench_history, where algorithm 4 manages to freeze basically all of it. 
That's a pretty stark difference. Given that algorithm 5 seems to make some mistakes on some of the other workloads, I don't think it's entirely clear that it's an improvement over master, at least in practical terms. It might be worth thinking harder about what it takes specifically to get the pgbench_history case, aka the append-only table case, correct.

One thing that probably doesn't work very well is to freeze pages that are more than X minutes old. Algorithm 5 uses an LSN threshold instead of a wall-clock based threshold, but the effect is the same. I think the problem here is that the vacuum operation essentially happens in an instant. At the instant that it happens, some fraction of the data added since the last vacuum is older than whatever threshold you pick, and the rest is newer. If data is added at a constant rate and you want to freeze at least 90% of the data, your recency threshold has to be no more than 10% of the time since the last vacuum. But with autovacuum_naptime=60s, that's like 6 seconds, and that's way too aggressive for a table like pgbench_accounts.

It seems to me that it's not possible to get both cases right by twiddling the threshold, because pgbench_history wants the threshold to be 0, and pgbench_accounts wants it to be ... perhaps not infinity, because maybe the distribution is Gaussian or Zipfian or something rather than uniform, but probably a couple of minutes. So I feel like if we want to get both pgbench_history and pgbench_accounts right, we need to consider some additional piece of information that makes those cases distinguishable. Either of those tables can contain a page that hasn't been accessed in 20 seconds, but the correct behavior for such a page differs between one case and the other.

One random idea that I had was to refuse to opportunistically freeze a page more than once while it remains resident in shared_buffers.
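The constant-rate arithmetic above can be made concrete with a few lines. This is a purely illustrative sketch (the function names are made up, not anything in Postgres): under a perfectly constant insert rate, a recency threshold of X% of the inter-vacuum interval leaves exactly X% of the newly inserted data too young to freeze at each vacuum.

```python
# Toy model: data inserted at a constant rate, vacuum happens "in an instant".
# A page is freezable if it is older than threshold_fraction of the time
# since the last vacuum, so the frozen share is simply the complement.

def frozen_fraction(threshold_fraction):
    """Share of data inserted since the last vacuum that is old enough
    to freeze, assuming a perfectly constant insert rate."""
    return 1.0 - threshold_fraction

# To freeze >= 90% of an append-only table, the threshold must be <= 10%
# of the inter-vacuum interval; with vacuums every 60s that is 6 seconds.
naptime_s = 60
threshold_s = 0.10 * naptime_s
```

This is why no single threshold fits both pgbench_history (wants ~0) and pgbench_accounts (wants minutes).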
The first time we do it, we set a bit in the buffer header or something that suppresses further opportunistic freezing. When the buffer is evicted the bit is cleared. So we can still be wrong on a heavily updated table like pgbench_accounts, but if the table fits in shared_buffers, we'll soon realize that we're getting it wrong a lot and will stop making the same mistake over and over. But this kind of idea only works if the working set is small enough to fit in shared_buffers, so I don't think it's actually a great plan, unless we only care about suppressing excess freezing on workloads that fit in shared_buffers.

A variant on the same theme could be to keep some table-level counters and use them to assess how often we're getting it wrong. If we're often thawing recently-frozen pages, don't freeze so aggressively. But this will not work if different parts of the same table behave differently. If we don't want to do something like this that somehow responds to the characteristics of a particular page or table, then it seems we either have to freeze quite aggressively to pick up
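The "freeze at most once per residency" idea can be modeled in a few lines. This is a toy sketch only: in Postgres the state would presumably be a flag bit in the buffer header, not a Python set, and the names here are hypothetical.

```python
# Toy model of suppressing repeat opportunistic freezing: remember, per
# resident buffer, whether we already froze it once this residency, and
# clear that memory when the buffer is evicted.

class BufferPool:
    def __init__(self):
        self.frozen_once = set()   # ids of buffers frozen while resident

    def maybe_freeze(self, buf_id):
        """Return True if we opportunistically freeze this buffer now."""
        if buf_id in self.frozen_once:
            return False           # already froze it once this residency
        self.frozen_once.add(buf_id)
        return True

    def evict(self, buf_id):
        self.frozen_once.discard(buf_id)  # next residency may freeze again

pool = BufferPool()
```

As noted above, this only self-corrects when the working set fits in shared_buffers, since eviction is what resets the suppression.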
Re: Eager page freeze criteria clarification
On Fri, Sep 8, 2023 at 12:07 AM Andres Freund wrote:
> > Downthread, I proposed using the RedoRecPtr of the latest checkpoint rather than the LSN of the previous vacuum. I still like that idea.
>
> Assuming that "downthread" references https://postgr.es/m/CA%2BTgmoYb670VcDFbekjn2YQOKF9a7e-kBFoj2WJF1HtH7YPaWQ%40mail.gmail.com could you sketch out the logic you're imagining a bit more?

I'm not exactly sure what the question is here. I mean, it doesn't quite make sense to just ask whether the page LSN is newer than the last checkpoint's REDO record, because I think that's basically just asking whether or not we would need an FPI, and the freeze criteria that Melanie has been considering incorporate that check in some other way already. But maybe some variant on that idea is useful - the distance to the second-most-recent checkpoint, or a multiple or percentage of the distance to the most-recent checkpoint, or whatever.

> The reason I was thinking of using the "lsn at the end of the last vacuum", is that it seems to be more adaptive to the frequency of vacuuming.

Yes, but I think it's *too* adaptive. The frequency of vacuuming can plausibly be multiple times per minute or not even annually. That's too big a range of variation. The threshold for freezing can vary by how actively the table or the page is updated, but I don't think it should vary by six orders of magnitude. Under what theory does it make sense to say "this row in table A hasn't been modified in 20 seconds, so let's freeze it, but this row in table B hasn't been modified in 8 months, so let's not freeze it because it might be modified again soon"? If you use the LSN at the end of the last vacuum, you're going to end up making decisions exactly like that, which seems wrong to me.

> Perhaps we can mix both approaches. We can use the LSN and time of the last vacuum to establish an LSN->time mapping that's reasonably accurate for a relation.
> For infrequently vacuumed tables we can use the time between checkpoints to establish a *more aggressive* cutoff for freezing than what a percent-of-time-since-last-vacuum approach would provide. If e.g. a table gets vacuumed every 100 hours and checkpoint timeout is 1 hour, no realistic percent-of-time-since-last-vacuum setting will allow freezing, as all dirty pages will be too new. To allow freezing a decent proportion of those, we could allow freezing pages that lived longer than ~20% time-between-recent-checkpoints.

Yeah, I don't know if that's exactly the right idea, but I think it's in the direction that I was thinking about. I'd even be happy with 100% of the time-between-recent checkpoints, maybe even 200% of time-between-recent checkpoints. But I think there probably should be some threshold beyond which we say "look, this doesn't look like it gets touched that much, let's just freeze it so we don't have to come back to it again later."

I think part of the calculus here should probably be that when the freeze threshold is long, the potential gains from making it even longer are not that much. If I change the freeze threshold on a table from 1 minute to 1 hour, I can potentially save uselessly freezing that page 59 times per hour, every hour, forever, if the page always gets modified right after I touch it. If I change the freeze threshold on a table from 1 hour to 1 day, I can only save 23 unnecessary freezes per day. Percentage-wise, the overhead of being wrong is the same in both cases: I can have as many extra freeze operations as I have page modifications, if I pick the worst possible times to freeze in every case. But in absolute terms, the savings in the second scenario are a lot less. I think if a user is accessing a table frequently, the overhead of jamming a useless freeze in between every table access is going to be a lot more noticeable than when the table is only accessed every once in a while.
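The mixed cutoff being discussed can be stated in a few lines. This is an illustrative sketch, not proposed patch code: the function, parameter names, and the 33%/2x constants are assumptions chosen to mirror the numbers mentioned in the thread, and LSNs are modeled as plain integers.

```python
# Hypothetical mixed heuristic: freeze a page whose LSN is older than some
# fraction of the WAL generated since the last vacuum, but bound that by a
# small multiple of the recent checkpoint distance so that infrequently
# vacuumed tables still get frozen. Larger cutoff LSN == more aggressive.

def freeze_cutoff_lsn(insert_lsn, last_vacuum_lsn, checkpoint_distance,
                      vacuum_fraction=0.33, checkpoint_multiple=2.0):
    """Pages with page_lsn <= returned cutoff would be frozen."""
    since_vacuum = insert_lsn - last_vacuum_lsn
    # Fraction-of-WAL-since-last-vacuum cutoff...
    vacuum_cutoff = insert_lsn - vacuum_fraction * since_vacuum
    # ...but at least as aggressive as N recent checkpoint intervals.
    checkpoint_cutoff = insert_lsn - checkpoint_multiple * checkpoint_distance
    return max(vacuum_cutoff, checkpoint_cutoff)

# Table "vacuumed every 100 hours" with 1-hour checkpoints (units abstract):
# the vacuum-based cutoff alone only reaches back 33 hours of WAL, while
# the checkpoint bound freezes everything older than ~2 hours of WAL.
cutoff = freeze_cutoff_lsn(insert_lsn=100, last_vacuum_lsn=0,
                           checkpoint_distance=1)
```

For frequently vacuumed tables the vacuum-based term dominates; for rarely vacuumed ones the checkpoint bound takes over, which is the behavior the quoted proposal is after.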
And I also think it's a lot less likely that we'll reliably get it wrong. Workloads that touch a page and then touch it again ~N seconds later can exist for all values of N, but I bet they're way more common for small values of N than large ones. Is there also a need for a similar guard in the other direction? Let's say that autovacuum_naptime=15s and on some particular table it triggers every time. I've actually seen this on small queue tables. Do you think that, in such tables, we should freeze pages that haven't been modified in 15s? > Hm, possibly stupid idea: What about using shared_buffers residency as a > factor? If vacuum had to read in a page to vacuum it, a) we would need read IO > to freeze it later, as we'll soon evict the page via the ringbuffer b) > non-residency indicates the page isn't constantly being modified? This doesn't seem completely stupid, but I fear it would behave dramatically differently on a workload a little smaller than s_b vs. one a little larger than s_b, and that doesn't seem good. -- Robert Haas EDB: http://www.enterprisedb
Re: Eager page freeze criteria clarification
On Sat, Sep 23, 2023 at 3:53 PM Melanie Plageman wrote:
> Workload F:
>
>   algo | WAL GB | cptr/bgwriter writes | other reads/writes | IO time AV worker
>   -----+--------+----------------------+--------------------+------------------
>   M    |  173   | 1,202,231            | 53,957,448         | 12,389
>   4    |  189   | 1,212,521            | 55,589,140         | 13,084
>   5    |  173   | 1,194,242            | 54,260,118         | 13,407
>
>   algo | P99 latency
>   -----+------------
>   M    | 19875
>   4    | 19314
>   5    | 19701

Andres mentioned that the P99 latency for the COPY workload (workload F) might not be meaningful, so I have calculated the duration total, mean, median, min, max and standard deviation in milliseconds.

Workload F:

  algo | Total     | Mean   | Median | Min    | Max    | Stddev
  -----+-----------+--------+--------+--------+--------+-------
  M    | 1,270,903 | 18,155 | 17,755 | 17,090 | 19,994 |   869
  4    | 1,167,135 | 16,673 | 16,421 | 15,585 | 19,485 |   811
  5    | 1,250,145 | 17,859 | 17,704 | 15,763 | 19,871 | 1,009

Interestingly, algorithm 4 had the lowest total duration for all COPYs. Some investigation of other data collected during the runs led us to believe this may be due to autovacuum workers doing more IO with algorithm 4 and thus generating more WAL and ending up initializing more WAL files themselves. Whereas on master and with algorithm 5, client backends had to initialize WAL files themselves, leading COPYs to take longer. This was supported by the presence of more WALInit wait events for client backends on master and with algorithm 5.

Calculating these made me realize that my conclusions about the work queue workload (workload I) didn't make much sense. Because this workload updated a non-indexed column, most pruning was HOT pruning done on access and basically no page freezing was done by vacuum. This means we weren't seeing negative performance effects of freezing related to the work queue table. The difference in this benchmark came from the relatively poor performance of the concurrent COPYs when that table was frozen more aggressively.
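For reference, statistics like these can be recomputed from pgbench's per-transaction log (pgbench -l), where the third field of each line is the transaction's elapsed time in microseconds. A minimal sketch, assuming that field layout; the sample log lines and the helper name are fabricated for illustration.

```python
# Compute duration total/mean/median/min/max/stddev in milliseconds from
# pgbench per-transaction log lines (field 3 = elapsed time in microseconds).

import statistics

def duration_stats_ms(log_lines):
    ms = [int(line.split()[2]) / 1000.0 for line in log_lines]
    return {
        "total": sum(ms),
        "mean": statistics.mean(ms),
        "median": statistics.median(ms),
        "min": min(ms),
        "max": max(ms),
        "stddev": statistics.stdev(ms) if len(ms) > 1 else 0.0,
    }

# Fabricated sample: client_id txn_no time_us script_no epoch_s epoch_us
sample = [
    "0 1 17090000 0 1695000000 0",
    "0 2 18155000 0 1695000030 0",
    "0 3 19994000 0 1695000060 0",
]
stats = duration_stats_ms(sample)
```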
I plan to run a new version of this workload which updates an indexed column for comparison and does not use a concurrent COPY.

This is the duration total, mean, median, min, max, and standard deviation in milliseconds of the COPYs which ran concurrently with the work queue pgbench.

Workload I COPYs:

  algo | Total   | Mean  | Median | Min   | Max   | Stddev
  -----+---------+-------+--------+-------+-------+-------
  M    | 191,032 | 4,898 | 4,726  | 4,486 | 9,353 |   800
  4    | 193,534 | 4,962 | 4,793  | 4,533 | 9,381 |   812
  5    | 194,351 | 4,983 | 4,771  | 4,617 | 9,159 |   783

I think this shows that algorithm 4 COPYs performed the worst. This is in contrast to the COPY-only workload (F) which did not show worse performance for algorithm 4. I think this means I should modify the work queue example and use something other than concurrent COPYs to avoid obscuring characteristics of the work queue example.

- Melanie
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 4:30 PM Melanie Plageman wrote: > On Mon, Aug 28, 2023 at 12:26 PM Robert Haas wrote: > > In row D, your algorithms are all bad, really bad. I don't quite > > understand how it can be that bad, actually. > > So, I realize now that this test was poorly designed. I meant it to be a > worst case scenario, but I think one critical part was wrong. In this > example one client is going at full speed inserting a row and then > updating it. Then another rate-limited client is deleting old data > periodically to keep the table at a constant size. I meant to bulk load > the table with enough data that the delete job would have data to delete > from the start. With the default autovacuum settings, over the course of > 45 minutes, I usually saw around 40 autovacuums of the table. Due to the > rate limiting, the first autovacuum of the table ends up freezing many > pages that are deleted soon after. Thus the total number of page freezes > is very high. > > I will redo benchmarking of workload D and start the table with the > number of rows which the DELETE job seeks to maintain. My back of the > envelope math says that this will mean ratios closer to a dozen (and not > 5000). ... > I'll rerun workload D in a more reasonable way and be back with results. Hi, I have edited and rerun all of the benchmarks for a subset of the algorithms. I ran workloads A-I, except for G (G is not an interesting workload), on Postgres master as well as Postgres with freeze heuristic algorithms 4 and 5. As a refresher, algorithms 4 and 5 are as follows: Freeze tuples on a page opportunistically if the page would be totally frozen and: 4. Buffer is already dirty and no FPI is required OR page LSN is older than 33% of the LSNs since the last vacuum of the table. 5. Buffer is already dirty and no FPI is required AND page LSN is older than 33% of the LSNs since the last vacuum of the table. 
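Restating the two heuristics as predicates makes the OR/AND difference explicit. A hypothetical sketch: the booleans and the LSN threshold are stand-ins for the values vacuum would actually compute (the threshold being the LSN 33% of the way back through the WAL generated since the table's last vacuum).

```python
# Algorithms 4 and 5 differ only in OR vs AND between the two conditions.
# All inputs are illustrative stand-ins; LSNs are modeled as integers, and
# "page_lsn < lsn_threshold" means the page is older than the cutoff.

def would_freeze_algo4(page_all_frozen, dirty, needs_fpi, page_lsn,
                       lsn_threshold):
    return page_all_frozen and (
        (dirty and not needs_fpi) or page_lsn < lsn_threshold)

def would_freeze_algo5(page_all_frozen, dirty, needs_fpi, page_lsn,
                       lsn_threshold):
    return page_all_frozen and (
        (dirty and not needs_fpi) and page_lsn < lsn_threshold)
```

A dirty, no-FPI page with a recent LSN is frozen by algorithm 4 but not by algorithm 5, which is why 4 over-freezes hot tables while 5 under-freezes append-only ones.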
On master, the heuristic is to freeze a page opportunistically if it would be totally frozen and if pruning emitted an FPI.

I made a few configuration changes for all benchmarks -- most notably autovacuum is on for all workloads and autovacuum_naptime is decreased to 10 seconds. The runtime of all time-based workloads was changed to 40 minutes. All workloads were run for a fixed duration except for the COPY workload (workload F). It is also worth noting that data loaded into the standard pgbench tables (with pgbench -i) was loaded with COPY and not COPY FREEZE. Also, pgbench -i and the pgbench runs were passed the --no-vacuum option.

The updated workloads are as follows:

A. Gaussian TPC-B like + select-only: pgbench scale 450 (DB < SB) with indexes on updated columns
   WL 1: 16 clients doing TPC-B like pgbench but with Gaussian access distribution (with parameter 6) for updated tables
   WL 2: 2 clients, rate-limited to 1000 TPS, doing select-only pgbench

B. TPC-B like: pgbench scale 450
   16 clients doing built-in pgbench TPC-B like script

C. Shifting hot set
   16 clients inserting a single row then updating an indexed column in that row

D. Shifting hot set, delete old data
   WL 1: 16 clients inserting one row then updating an indexed column in that row
   WL 2: 1 client, rate limited to 10 TPS, deleting data more than 1 minute old

E. Shifting hot set, delete new data, access new data
   WL 1: 16 clients cycling through 2 single row inserts, an update of an indexed column in a single recently inserted row, and a delete of a single recently inserted row
   WL 2: 1 client, rate-limited to 0.2 TPS, selecting data from the last 5 minutes

F. Many COPYs
   1 client, copying a total of 90 GB of data in 70 individual COPYs

G. N/A

H. Append only table
   16 clients, inserting a single row at a time

I.
Work queue
   WL 1: 16 clients inserting a single row, updating a non-indexed column in that row twice, then deleting that row
   WL 2: 1 client, rate-limited to 0.02 TPS, COPYing data into another table

Below are some of the data I collected, including how much WAL was generated, how much IO various backends did, TPS, and P99 latency. These were collected at the end of the pgbench run. Most numbers were rounded to the nearest whole number to make the chart easier to see. So, for example, a 0 in WAL GB does not mean that no WAL was emitted. P99 latency below 0 was rounded to a single decimal place. All IO time is in milliseconds. "P99 lat" was calculated using the "time" field from pgbench's "per-transaction logging" option [1]. It is the "transaction's elapsed time" or duration. I converted it to milliseconds. "pages frozen" here refers to the number of pages marked frozen in the visibility map at the end of the run. "page freezes" refers to the number of times vacuum froze tuples in lazy_scan_prune(). This does not include cases in which the page was found to be all frozen but no tuples were explicitly frozen. "AVs" is the number of autovacuums for the table or tables specified
Re: Eager page freeze criteria clarification
Hi, On 2023-09-07 22:29:04 -0700, Peter Geoghegan wrote: > On Thu, Sep 7, 2023 at 9:45 PM Andres Freund wrote: > > I.e. setting an, otherwise unmodified, page all-visible won't trigger an FPI > > if checksums are disabled, but will FPI with checksums enabled. I think > > that's > > a substantial difference in WAL volume for insert-only workloads... > > Note that all RDS Postgres users get page-level checksums. Overall, > the FPI trigger mechanism is going to noticeably improve performance > characteristics for many users. Simple and crude though it is. You mean the current heuristic or some new heuristic we're coming up with in this thread? If the former, unfortunately I think the current heuristic often won't trigger in cases where freezing would be fine, e.g. on an insert-mostly (or hot pruned) workload with some read accesses. If the tuples are already hint-bitted, there's no FPI during heap_page_prune(), and thus we don't freeze - even though we *do* subsequently trigger an FPI, while setting all-visible. See e.g. the stats for the modified version of the scenario, where there the table is hint-bitted that I have since posted: https://postgr.es/m/20230908053634.hyn46pugqp4lsiw5%40awork3.anarazel.de There we freeze neither with nor without checksums, despite incurring FPIs when checksums are enabled. 
> > Type                      N      (%)      Record size (%)       FPI size  (%)        Combined size (%)
> > ------------------------  ---------------  -------------------  -------------------  -------------------
> > XLOG/FPI_FOR_HINT         44253  ( 33.34)  2168397   (  7.84)   361094232 (100.00)   363262629 ( 93.44)
> > Transaction/INVALIDATION      1  (  0.00)        78   (  0.00)           0 (  0.00)         78 (  0.00)
> > Standby/INVALIDATIONS         1  (  0.00)        90   (  0.00)           0 (  0.00)         90 (  0.00)
> > Heap2/FREEZE_PAGE         44248  ( 33.33)  22876120   ( 82.72)           0 (  0.00)   22876120 (  5.88)
> > Heap2/VISIBLE             44248  ( 33.33)   2610642   (  9.44)       16384 (  0.00)    2627026 (  0.68)
> > Heap/INPLACE                  1  (  0.00)       188   (  0.00)           0 (  0.00)        188 (  0.00)
> >
> > Total                    132752            27655515   [7.11%]   361110616 [92.89%]   388766131 [100%]
> >
> > In realistic tables, where rows are wider than a single int, FPI_FOR_HINT dominates even further, as the FREEZE_PAGE would be smaller if there weren't 226 tuples on each page...
>
> If FREEZE_PAGE WAL volume is really what holds back further high level improvements in this area, then it can be worked on directly -- it's not a fixed cost. It wouldn't be particularly difficult, in fact.

Agreed!

> These are records that still mostly consist of long runs of contiguous page offset numbers. They're ideally suited for compression using some kind of simple variant of run length encoding. The freeze plan deduplication stuff in 16 made a big difference, but it's still not very hard to improve matters here.

Yea, even just using ranges of offsets should help in a lot of cases.

Greetings,

Andres Freund
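The run-length idea both messages gesture at can be sketched directly: a sorted array of page offset numbers collapses into (start, length) runs, so a freeze record covering offsets 1..226 encodes as a single run. Illustrative only, not the actual WAL record format.

```python
# Encode a sorted list of page offset numbers as (start, length) runs,
# and decode back. Contiguous runs -- the common case for freeze plans
# that cover every tuple on a page -- compress to a single pair.

def encode_runs(offsets):
    runs = []
    for off in offsets:
        if runs and off == runs[-1][0] + runs[-1][1]:
            # Extends the current run by one.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((off, 1))
    return runs

def decode_runs(runs):
    return [start + i for start, length in runs for i in range(length)]

runs = encode_runs([1, 2, 3, 4, 10, 11, 20])
```

A fully frozen page's offset array (1..N) would shrink from N offsets to one (start, length) pair, which is the saving the thread is pointing at.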
Re: Eager page freeze criteria clarification
On Thu, Sep 7, 2023 at 10:46 PM Andres Freund wrote: > You mean the current heuristic or some new heuristic we're coming up with in > this thread? If the former, unfortunately I think the current heuristic often > won't trigger in cases where freezing would be fine, e.g. on an insert-mostly > (or hot pruned) workload with some read accesses. If the tuples are already > hint-bitted, there's no FPI during heap_page_prune(), and thus we don't freeze > - even though we *do* subsequently trigger an FPI, while setting all-visible. It's obviously not adequate, or anything like that. I didn't say that it was. I've heard plenty of complaints about freezing and antiwraparound autovacuuming, and basically no complaints about the cost of FPIs due to checksums (apparently Robert wasn't aware of the problem, at least not as it relates to VACUUM setting the VM). This is true even though one might expect it to be the other way around, based on simple analysis using pg_waldump. There are probably lots of reasons for why that is, but the biggest reason is likely that users just really hate huge balloon payments for things like freezing. It's just about the worst thing about the system. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
Hi,

On 2023-09-07 21:45:22 -0700, Andres Freund wrote:
> In contrast to that, freezing will almost always trigger an FPI (except for empty pages, but we imo ought to stop setting empty pages all frozen [1]).
>
> Yep, a quick experiment confirms that:
>
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo AS SELECT generate_series(1, 10000000);
> CHECKPOINT;
> VACUUM (VERBOSE) foo;
>
> checksums off: WAL usage: 44249 records, 3 full page images, 2632091 bytes
> checksums on:  WAL usage: 132748 records, 44253 full page images, 388758161 bytes
>
> I initially was confused by the 3x wal records - I was expecting 2x. The reason is that with checksums on, we emit an FPI during the visibility check, which then triggers the current heuristic for opportunistic freezing. The saving grace is that WAL volume is completely dominated by the FPIs:
>
> Type                      N      (%)      Record size (%)       FPI size  (%)        Combined size (%)
> ------------------------  ---------------  -------------------  -------------------  -------------------
> XLOG/FPI_FOR_HINT         44253  ( 33.34)  2168397   (  7.84)   361094232 (100.00)   363262629 ( 93.44)
> Transaction/INVALIDATION      1  (  0.00)        78   (  0.00)           0 (  0.00)         78 (  0.00)
> Standby/INVALIDATIONS         1  (  0.00)        90   (  0.00)           0 (  0.00)         90 (  0.00)
> Heap2/FREEZE_PAGE         44248  ( 33.33)  22876120   ( 82.72)           0 (  0.00)   22876120 (  5.88)
> Heap2/VISIBLE             44248  ( 33.33)   2610642   (  9.44)       16384 (  0.00)    2627026 (  0.68)
> Heap/INPLACE                  1  (  0.00)       188   (  0.00)           0 (  0.00)        188 (  0.00)
>
> Total                    132752            27655515   [7.11%]   361110616 [92.89%]   388766131 [100%]
>
> In realistic tables, where rows are wider than a single int, FPI_FOR_HINT dominates even further, as the FREEZE_PAGE would be smaller if there weren't 226 tuples on each page...

The above is not a great demonstration of the overhead of setting all-visible, as the FPIs are triggered via FPI_FOR_HINTs, independent of setting all-visible.
Adding "SELECT count(*) FROM foo" before the checkpoint sets them earlier and results in:

checksums off: WAL usage: 44249 records, 3 full page images, 2627915 bytes

Type                      N      (%)      Record size (%)       FPI size  (%)       Combined size (%)
------------------------  ---------------  -------------------  ------------------  -------------------
Transaction/INVALIDATION      1  (  0.00)        78   (  0.00)       0 (  0.00)          78 (  0.00)
Standby/INVALIDATIONS         1  (  0.00)        90   (  0.00)       0 (  0.00)          90 (  0.00)
Heap2/VISIBLE             44248  ( 99.99)   2610642   ( 99.99)   16384 ( 95.15)     2627026 ( 99.96)
Heap/INPLACE                  1  (  0.00)        53   (  0.00)     836 (  4.85)         889 (  0.03)

Total                     44251             2610863   [99.34%]   17220 [0.66%]      2628083 [100%]

checksums on: WAL usage: 44252 records, 44254 full page images, 363935830 bytes

Type                      N      (%)      Record size (%)       FPI size  (%)       Combined size (%)
------------------------  ---------------  -------------------  ------------------  -------------------
XLOG/FPI_FOR_HINT             3  (  0.01)       147   (  0.01)   24576 (  0.01)       24723 (  0.01)
Transaction/INVALIDATION      1  (  0.00)        78   (  0.00)       0 (  0.00)          78 (  0.00)
Standby/INVALIDATIONS         1  (  0.00)        90   (  0.00)       0 (  0.00)          90 (  0.00)
Heap2/VISIBLE             44248  ( 99.99)   2831882   ( 99.99)
Re: Eager page freeze criteria clarification
On Thu, Sep 7, 2023 at 9:45 PM Andres Freund wrote:
> I.e. setting an, otherwise unmodified, page all-visible won't trigger an FPI if checksums are disabled, but will FPI with checksums enabled. I think that's a substantial difference in WAL volume for insert-only workloads...

Note that all RDS Postgres users get page-level checksums. Overall, the FPI trigger mechanism is going to noticeably improve performance characteristics for many users. Simple and crude though it is.

> Type                      N      (%)      Record size (%)       FPI size  (%)        Combined size (%)
> ------------------------  ---------------  -------------------  -------------------  -------------------
> XLOG/FPI_FOR_HINT         44253  ( 33.34)  2168397   (  7.84)   361094232 (100.00)   363262629 ( 93.44)
> Transaction/INVALIDATION      1  (  0.00)        78   (  0.00)           0 (  0.00)         78 (  0.00)
> Standby/INVALIDATIONS         1  (  0.00)        90   (  0.00)           0 (  0.00)         90 (  0.00)
> Heap2/FREEZE_PAGE         44248  ( 33.33)  22876120   ( 82.72)           0 (  0.00)   22876120 (  5.88)
> Heap2/VISIBLE             44248  ( 33.33)   2610642   (  9.44)       16384 (  0.00)    2627026 (  0.68)
> Heap/INPLACE                  1  (  0.00)       188   (  0.00)           0 (  0.00)        188 (  0.00)
>
> Total                    132752            27655515   [7.11%]   361110616 [92.89%]   388766131 [100%]
>
> In realistic tables, where rows are wider than a single int, FPI_FOR_HINT dominates even further, as the FREEZE_PAGE would be smaller if there weren't 226 tuples on each page...

If FREEZE_PAGE WAL volume is really what holds back further high level improvements in this area, then it can be worked on directly -- it's not a fixed cost. It wouldn't be particularly difficult, in fact. These are records that still mostly consist of long runs of contiguous page offset numbers. They're ideally suited for compression using some kind of simple variant of run length encoding. The freeze plan deduplication stuff in 16 made a big difference, but it's still not very hard to improve matters here.

--
Peter Geoghegan
Re: Eager page freeze criteria clarification
Hi, On 2023-09-06 16:21:31 -0400, Robert Haas wrote: > On Wed, Sep 6, 2023 at 12:20 PM Peter Geoghegan wrote: > > If VACUUM freezes too aggressively, then (pretty much by definition) > > we can be sure that the next VACUUM will scan the same pages -- there > > may be some scope for VACUUM to "learn from its mistake" when we err > > in the direction of over-freezing. But when VACUUM makes the opposite > > mistake (doesn't freeze when it should have), it won't scan those same > > pages again for a long time, by design. It therefore has no plausible > > way of "learning from its mistakes" before it becomes an extremely > > expensive and painful lesson (which happens whenever the next > > aggressive VACUUM takes place). This is in large part a consequence of > > the way that VACUUM dutifully sets pages all-visible whenever > > possible. That behavior interacts badly with many workloads, over > > time. > > I think this is an insightful commentary with which I partially agree. > As I see it, the difference is that when you make the mistake of > marking something all-visible or freezing it too aggressively, you > incur a price that you pay almost immediately. When you make the > mistake of not marking something all-visible when it would have been > best to do so, you incur a price that you pay later, when the next > VACUUM happens. When you make the mistake of not marking something > all-frozen when it would have been best to do so, you incur a price > that you pay even later, not at the next VACUUM but at some VACUUM > further off. So there are different trade-offs. When you pay the price > for a mistake immediately or nearly immediately, it can potentially > harm the performance of the foreground workload, if you're making a > lot of mistakes. We have to make a *lot* of mistakes to badly harm the foreground workload. 
As long as we don't constantly trigger FPIs, the worst case effects of freezing unnecessarily aren't that large, particularly if we keep the number of XLogInsert()s constant (via the combining of records that Melanie is working on). Once/if we want to opportunistically freeze when it *does* trigger an FPI, we *do* need to be more certain that it's not pointless work.

We could further bound the worst case overhead by having a range representation for freeze plans. That should be quite doable, we can add another flag for xl_heap_freeze_plan.frzflags, which indicates the tuples in the offset array are start/end tids, rather than individual tids.

> That sucks. On the other hand, when you defer paying
> the price until some later bulk operation, the costs of all of your
> mistakes get added up and then you pay the whole price all at once,
> which means you can be suddenly slapped with an enormous bill that you
> weren't expecting. That sucks, too, just in a different way.

It's particularly bad in the case of freezing because there's practically no backpressure against deferring more work than the system can handle. If we had made foreground processes freeze one page for every unfrozen page they create, once the table reaches a certain percentage of old unfrozen pages, it'd be a different story...

I think it's important that we prevent exploding the WAL volume due to opportunistically freezing the same page over and over, but as long as we take care not to do that, I think the impact on the foreground is going to be small. I'm sure that we can come up with cases where it's noticeable, e.g. because the system is already completely bottlenecked by WAL IO throughput and a small increase in WAL volume is going to push things over the edge. But such systems are going to be in substantial trouble at the next anti-wraparound vacuum and will be out of commission for days once they hit the anti-wraparound shutdown limits.
Some backpressure in the form of a small performance decrease for foreground work might even be good there.

> > VACUUM simply ignores such second-order effects. Perhaps it would be
> > practical to address some of the issues in this area by avoiding
> > setting pages all-visible without freezing them, in some general
> > sense. That at least creates a kind of symmetry between mistakes in
> > the direction of under-freezing and mistakes in the direction of
> > over-freezing. That might enable VACUUM to course-correct in either
> > direction.
> >
> > Melanie is already planning on combining the WAL records (PRUNE,
> > FREEZE_PAGE, and VISIBLE). Perhaps that'll weaken the argument for
> > setting unfrozen pages all-visible even further.
>
> Yeah, so I think the question here is whether it's ever a good idea to
> mark a page all-visible without also freezing it. If it's not, then we
> should either mark fewer pages all-visible, or freeze more of them.
> Maybe I'm all wet here, but I think it depends on the situation. If a
> page is already dirty and has had an FPI since the last checkpoint,
> then it's pretty appealing to freeze whenever we mark all-visible. We
> still have to consider whether the incremental CPU cost and WAL volume
> are worth it, but assuming those costs are small enough not to be a big
> problem, it seems like a pretty good bet.
Re: Eager page freeze criteria clarification
Hi,

On 2023-09-06 10:46:03 -0400, Robert Haas wrote:
> On Fri, Sep 1, 2023 at 9:07 PM Peter Geoghegan wrote:
> > Why not also avoid setting pages all-visible? The WAL records aren't
> > too much smaller than most freeze records these days -- 64 bytes on
> > most systems. I realize that the rules for FPIs are a bit different
> > when page-level checksums aren't enabled, but fundamentally it's the
> > same situation. No?
>
> It's an interesting point. AFAIK, whether or not page-level checksums
> are enabled doesn't really matter here.

I think it does matter:

void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
                  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
                  uint8 flags)
...
                recptr = log_heap_visible(rel, heapBuf, vmBuf, cutoff_xid, flags);

                /*
                 * If data checksums are enabled (or wal_log_hints=on), we
                 * need to protect the heap page from being torn.
                 *
                 * If not, then we must *not* update the heap page's LSN. In
                 * this case, the FPI for the heap page was omitted from the
                 * WAL record inserted above, so it would be incorrect to
                 * update the heap page's LSN.
                 */
                if (XLogHintBitIsNeeded())
                {
                    Page        heapPage = BufferGetPage(heapBuf);

                    PageSetLSN(heapPage, recptr);
                }

and

/*
 * Perform XLogInsert for a heap-visible operation. 'block' is the block
 * being marked all-visible, and vm_buffer is the buffer containing the
 * corresponding visibility map block. Both should have already been modified
 * and dirtied.
 *
 * snapshotConflictHorizon comes from the largest xmin on the page being
 * marked all-visible. REDO routine uses it to generate recovery conflicts.
 *
 * If checksums or wal_log_hints are enabled, we may also generate a full-page
 * image of heap_buffer. Otherwise, we optimize away the FPI (by specifying
 * REGBUF_NO_IMAGE for the heap buffer), in which case the caller should *not*
 * update the heap page's LSN.
 */
XLogRecPtr
log_heap_visible(Relation rel, Buffer heap_buffer, Buffer vm_buffer,
                 TransactionId snapshotConflictHorizon, uint8 vmflags)
{

I.e.
setting an, otherwise unmodified, page all-visible won't trigger an FPI if checksums are disabled, but will FPI with checksums enabled. I think that's a substantial difference in WAL volume for insert-only workloads...

In contrast to that, freezing will almost always trigger an FPI (except for empty pages, but we imo ought to stop setting empty pages all frozen [1]).

Yep, a quick experiment confirms that:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo AS SELECT generate_series(1, 1000);
CHECKPOINT;
VACUUM (VERBOSE) foo;

checksums off:
WAL usage: 44249 records, 3 full page images, 2632091 bytes

checksums on:
WAL usage: 132748 records, 44253 full page images, 388758161 bytes

I initially was confused by the 3x wal records - I was expecting 2x. The reason is that with checksums on, we emit an FPI during the visibility check, which then triggers the current heuristic for opportunistic freezing.

The saving grace is that WAL volume is completely dominated by the FPIs:

Type                           N      (%)  Record size      (%)    FPI size      (%)  Combined size      (%)
----                           -      ---  -----------      ---    --------      ---  -------------      ---
XLOG/FPI_FOR_HINT          44253  ( 33.34)      2168397  (  7.84) 361094232  (100.00)      363262629  ( 93.44)
Transaction/INVALIDATION       1  (  0.00)           78  (  0.00)         0  (  0.00)             78  (  0.00)
Standby/INVALIDATIONS          1  (  0.00)           90  (  0.00)         0  (  0.00)             90  (  0.00)
Heap2/FREEZE_PAGE          44248  ( 33.33)     22876120  ( 82.72)         0  (  0.00)       22876120  (  5.88)
Heap2/VISIBLE              44248  ( 33.33)      2610642  (  9.44)     16384  (  0.00)        2627026  (  0.68)
Heap/INPLACE                   1  (  0.00)          188  (  0.00)         0  (  0.00)            188  (  0.00)

Total                     132752              27655515  [7.11%]  361110616  [92.89%]      388766131  [100%]

In realistic tables, where rows are wider than a single int, FPI_FOR_HINT dominates even further, as the FREEZE_PAGE would be smaller if there weren't 226 tuples on each page.
Re: Eager page freeze criteria clarification
Hi,

On 2023-09-06 10:35:17 -0400, Robert Haas wrote:
> On Wed, Sep 6, 2023 at 1:09 AM Andres Freund wrote:
> > Yea, it'd be useful to have a reasonably approximate wall clock time for the
> > last modification of a page. We just don't have infrastructure for determining
> > that. We'd need an LSN->time mapping (xid->time wouldn't be particularly
> > useful, I think).
> >
> > A very rough approximate modification time can be computed by assuming an even
> > rate of WAL generation, and using the LSN at the time of the last vacuum and
> > the time of the last vacuum, to compute the approximate age.
> >
> > For a while I thought that'd not give us anything that just using LSNs gives
> > us, but I think it might allow coming up with a better cutoff logic: Instead
> > of using a cutoff like "page LSN is older than 10% of the LSNs since the last
> > vacuum of the table", it would allow us to approximate "page has not been
> > modified in the last 15 seconds" or such. I think that might help avoid
> > unnecessary freezing on tables with very frequent vacuuming.
>
> Yes. I'm uncomfortable with the last-vacuum-LSN approach mostly
> because of the impact on very frequently vacuumed tables, and
> secondarily because of the impact on very infrequently vacuumed
> tables.
>
> Downthread, I proposed using the RedoRecPtr of the latest checkpoint
> rather than the LSN of the previous vacuum. I still like that idea.

Assuming that "downthread" references
https://postgr.es/m/CA%2BTgmoYb670VcDFbekjn2YQOKF9a7e-kBFoj2WJF1HtH7YPaWQ%40mail.gmail.com
could you sketch out the logic you're imagining a bit more?

> It's a value that we already have, with no additional bookkeeping. It
> varies over a much narrower range than the interval between vacuums on
> a table.
> The vacuum interval could be as short as tens of seconds or as
> long as years, while the checkpoint interval is almost always going to
> be between a few minutes at the low end and some tens of minutes at
> the high end, hours at the very most. That's quite appealing.

The reason I was thinking of using the "lsn at the end of the last vacuum" is that it seems to be more adaptive to the frequency of vacuuming. On the one hand, if a table is rarely autovacuumed because it is huge, (InsertLSN-RedoRecPtr) might or might not be representative of the workload over a longer time. On the other hand, if a page in a frequently vacuumed table has an LSN from around the last vacuum (or even before), it should be frozen, but will appear to be recent in RedoRecPtr based heuristics?

Perhaps we can mix both approaches. We can use the LSN and time of the last vacuum to establish an LSN->time mapping that's reasonably accurate for a relation. For infrequently vacuumed tables we can use the time between checkpoints to establish a *more aggressive* cutoff for freezing than what a percent-of-time-since-last-vacuum approach would provide. If e.g. a table gets vacuumed every 100 hours and checkpoint timeout is 1 hour, no realistic percent-of-time-since-last-vacuum setting will allow freezing, as all dirty pages will be too new. To allow freezing a decent proportion of those, we could allow freezing pages that lived longer than ~20% time-between-recent-checkpoints.

Hm, possibly stupid idea: What about using shared_buffers residency as a factor? If vacuum had to read in a page to vacuum it, a) we would need read IO to freeze it later, as we'll soon evict the page via the ringbuffer b) non-residency indicates the page isn't constantly being modified?
> Also, I think the time between checkpoints actually matters here, because in
> some sense we're looking to get dirty, already-FPI'd pages frozen before
> they get written out, or before a new FPI becomes necessary, and checkpoints
> are one way for the first of those things to happen and the only way for the
> second one to happen.

Intuitively it seems easier to take care of that by checking if an FPI is needed or not?

I guess we, eventually, might want to also freeze if an FPI would be generated, if we are reasonably certain that the page isn't going to be modified again soon (e.g. when table stats indicate there effectively aren't any updates / deletes). Although perhaps it's better to take care of that via freeze-on-writeout style logic. I suspect that even for freeze-on-writeout we might end up needing some heuristics.

Greetings,

Andres Freund
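The rough LSN->time mapping described above can be sketched in a few lines: assume WAL was generated at an even rate between the last vacuum and now, and extrapolate a page's last-modification time from its LSN. This is an illustrative Python sketch of the heuristic, not patch code; all names (and the use of raw byte positions for LSNs) are assumptions for the example:

```python
def estimate_page_age_seconds(page_lsn, vacuum_lsn, vacuum_time,
                              current_lsn, current_time):
    """Approximate seconds since a page was last modified, assuming WAL was
    generated at a constant rate between the last vacuum and now.
    LSNs are plain byte positions; times are seconds."""
    wal_bytes = current_lsn - vacuum_lsn
    elapsed = current_time - vacuum_time
    if wal_bytes <= 0 or elapsed <= 0:
        return 0.0  # no WAL or no time elapsed: nothing to extrapolate from
    rate = wal_bytes / elapsed  # bytes of WAL per second
    return (current_lsn - page_lsn) / rate

# If 1 GB of WAL was written over the last 100 seconds, a page whose LSN
# trails the insert LSN by 150 MB was modified roughly 15 seconds ago --
# so a cutoff like "frozen only if unmodified for >= 15 seconds" would
# just barely include it.
age = estimate_page_age_seconds(
    page_lsn=1_000_000_000 - 150_000_000,
    vacuum_lsn=0, vacuum_time=0.0,
    current_lsn=1_000_000_000, current_time=100.0)
print(age)  # 15.0
```

The appeal over a pure "percent of LSNs since the last vacuum" cutoff is that the threshold is expressed in wall-clock terms, so a table vacuumed every 30 seconds and one vacuumed every 100 hours can share the same "unmodified for N seconds" policy.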
Re: Eager page freeze criteria clarification
On Wed, Sep 6, 2023 at 12:20 PM Peter Geoghegan wrote:
> This was a case where index vacuuming was never required. It's just a
> simple and easy to recreate example of what I think of as a more
> general problem.

OK.

> Why wouldn't we expect a table to have some pages that ought to be
> frozen right away, and others where freezing should in theory be put
> off indefinitely? I think that that's very common.

Oh, I see. I agree that that's pretty common. I think what that means in practice is that we need to avoid relying too much on relation-level statistics to guide behavior with respect to individual relation pages. On the other hand, I don't think it means that we can't look at relation-wide or even system-wide statistics at all. Sure, those statistics may not be perfect, but some things are not practical to track on a page granularity, and having some coarse-grained information can, I think, be better than having nothing at all, if you're careful about how much and in what way you rely on it.

> As you know, I am particularly concerned about the tendency of
> unfrozen all-visible pages to accumulate without bound (at least
> without bound expressed in physical units such as pages). The very
> fact that pages are being set all-visible by VACUUM can be seen as a
> part of a high-level systemic problem -- a problem that plays out over
> time, across multiple VACUUM operations. So even if the cost of
> setting pages all-visible happened to be much lower than the cost of
> freezing (which it isn't), setting pages all-visible without freezing
> has unique downsides.

I generally agree with all of that.

> If VACUUM freezes too aggressively, then (pretty much by definition)
> we can be sure that the next VACUUM will scan the same pages -- there
> may be some scope for VACUUM to "learn from its mistake" when we err
> in the direction of over-freezing.
But when VACUUM makes the opposite > mistake (doesn't freeze when it should have), it won't scan those same > pages again for a long time, by design. It therefore has no plausible > way of "learning from its mistakes" before it becomes an extremely > expensive and painful lesson (which happens whenever the next > aggressive VACUUM takes place). This is in large part a consequence of > the way that VACUUM dutifully sets pages all-visible whenever > possible. That behavior interacts badly with many workloads, over > time. I think this is an insightful commentary with which I partially agree. As I see it, the difference is that when you make the mistake of marking something all-visible or freezing it too aggressively, you incur a price that you pay almost immediately. When you make the mistake of not marking something all-visible when it would have been best to do so, you incur a price that you pay later, when the next VACUUM happens. When you make the mistake of not marking something all-frozen when it would have been best to do so, you incur a price that you pay even later, not at the next VACUUM but at some VACUUM further off. So there are different trade-offs. When you pay the price for a mistake immediately or nearly immediately, it can potentially harm the performance of the foreground workload, if you're making a lot of mistakes. That sucks. On the other hand, when you defer paying the price until some later bulk operation, the costs of all of your mistakes get added up and then you pay the whole price all at once, which means you can be suddenly slapped with an enormous bill that you weren't expecting. That sucks, too, just in a different way. > VACUUM simply ignores such second-order effects. Perhaps it would be > practical to address some of the issues in this area by avoiding > setting pages all-visible without freezing them, in some general > sense. 
That at least creates a kind of symmetry between mistakes in > the direction of under-freezing and mistakes in the direction of > over-freezing. That might enable VACUUM to course-correct in either > direction. > > Melanie is already planning on combining the WAL records (PRUNE, > FREEZE_PAGE, and VISIBLE). Perhaps that'll weaken the argument for > setting unfrozen pages all-visible even further. Yeah, so I think the question here is whether it's ever a good idea to mark a page all-visible without also freezing it. If it's not, then we should either mark fewer pages all-visible, or freeze more of them. Maybe I'm all wet here, but I think it depends on the situation. If a page is already dirty and has had an FPI since the last checkpoint, then it's pretty appealing to freeze whenever we mark all-visible. We still have to consider whether the incremental CPU cost and WAL volume are worth it, but assuming those costs are small enough not to be a big problem, it seems like a pretty good bet. Making a page un-all-visible has some cost, but making a page un-all-frozen really doesn't, so cool. On the other hand, if we have a page that isn't dirty, hasn't had a recent FPI, and doesn't need pruning, but which can
Re: Eager page freeze criteria clarification
On Wed, Sep 6, 2023 at 7:46 AM Robert Haas wrote: > It's interesting that when you raised the threshold the rate of > vacuuming dropped to zero. I haven't seen that behavior. pgbench does > rely very, very heavily on HOT-pruning, so VACUUM only cleans up a > small percentage of the dead tuples that get created. However, it's > still needed for index vacuuming, and I'm not sure what would cause it > not to trigger for that purpose. This was a case where index vacuuming was never required. It's just a simple and easy to recreate example of what I think of as a more general problem. > > > And I'm not sure why we > > > want to do that. If the table is being vacuumed a lot, it's probably > > > also being modified a lot, which suggests that we ought to be more > > > cautious about freezing, rather than the reverse. > > > > Why wouldn't it be both things at the same time, for the same table? > > Both of what things? Why wouldn't we expect a table to have some pages that ought to be frozen right away, and others where freezing should in theory be put off indefinitely? I think that that's very common. > > Why not also avoid setting pages all-visible? The WAL records aren't > > too much smaller than most freeze records these days -- 64 bytes on > > most systems. I realize that the rules for FPIs are a bit different > > when page-level checksums aren't enabled, but fundamentally it's the > > same situation. No? > > It's an interesting point. AFAIK, whether or not page-level checksums > are enabled doesn't really matter here. But it seems fair to ask - if > freezing is too aggressive, why is setting all-visible not also too > aggressive? I don't have a good answer to that question right now. As you know, I am particularly concerned about the tendency of unfrozen all-visible pages to accumulate without bound (at least without bound expressed in physical units such as pages). 
The very fact that pages are being set all-visible by VACUUM can be seen as a part of a high-level systemic problem -- a problem that plays out over time, across multiple VACUUM operations. So even if the cost of setting pages all-visible happened to be much lower than the cost of freezing (which it isn't), setting pages all-visible without freezing has unique downsides. If VACUUM freezes too aggressively, then (pretty much by definition) we can be sure that the next VACUUM will scan the same pages -- there may be some scope for VACUUM to "learn from its mistake" when we err in the direction of over-freezing. But when VACUUM makes the opposite mistake (doesn't freeze when it should have), it won't scan those same pages again for a long time, by design. It therefore has no plausible way of "learning from its mistakes" before it becomes an extremely expensive and painful lesson (which happens whenever the next aggressive VACUUM takes place). This is in large part a consequence of the way that VACUUM dutifully sets pages all-visible whenever possible. That behavior interacts badly with many workloads, over time. VACUUM simply ignores such second-order effects. Perhaps it would be practical to address some of the issues in this area by avoiding setting pages all-visible without freezing them, in some general sense. That at least creates a kind of symmetry between mistakes in the direction of under-freezing and mistakes in the direction of over-freezing. That might enable VACUUM to course-correct in either direction. Melanie is already planning on combining the WAL records (PRUNE, FREEZE_PAGE, and VISIBLE). Perhaps that'll weaken the argument for setting unfrozen pages all-visible even further. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Fri, Sep 1, 2023 at 9:07 PM Peter Geoghegan wrote: > When I reported this a couple of years ago, I noticed that autovacuum > would spin whenever I set autovacuum_vacuum_scale_factor to 0.02. But > autovacuum would *never* run (outside of antiwraparound autovacuums) > when it was set just a little higher (perhaps 0.03 or 0.04). So there > was some inflection point at which its behavior totally changed. I think that tables have a "natural" amount of bloat that is a function of the workload. If you remove all the bloat by, say, running VACUUM FULL, they quickly bloat again to that point, and then stabilize (hopefully). It seems like if you set the scale factor to a value below that "natural" amount of bloat, you might well spin, especially on a very write-heavy workload like pgbench, because you're basically trying to squeeze water from a stone on that point. It's interesting that when you raised the threshold the rate of vacuuming dropped to zero. I haven't seen that behavior. pgbench does rely very, very heavily on HOT-pruning, so VACUUM only cleans up a small percentage of the dead tuples that get created. However, it's still needed for index vacuuming, and I'm not sure what would cause it not to trigger for that purpose. > > And I'm not sure why we > > want to do that. If the table is being vacuumed a lot, it's probably > > also being modified a lot, which suggests that we ought to be more > > cautious about freezing, rather than the reverse. > > Why wouldn't it be both things at the same time, for the same table? Both of what things? > Why not also avoid setting pages all-visible? The WAL records aren't > too much smaller than most freeze records these days -- 64 bytes on > most systems. I realize that the rules for FPIs are a bit different > when page-level checksums aren't enabled, but fundamentally it's the > same situation. No? It's an interesting point. AFAIK, whether or not page-level checksums are enabled doesn't really matter here. 
But it seems fair to ask - if freezing is too aggressive, why is setting all-visible not also too aggressive? I don't have a good answer to that question right now. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
On Wed, Sep 6, 2023 at 1:09 AM Andres Freund wrote:
> Yea, it'd be useful to have a reasonably approximate wall clock time for the
> last modification of a page. We just don't have infrastructure for determining
> that. We'd need an LSN->time mapping (xid->time wouldn't be particularly
> useful, I think).
>
> A very rough approximate modification time can be computed by assuming an even
> rate of WAL generation, and using the LSN at the time of the last vacuum and
> the time of the last vacuum, to compute the approximate age.
>
> For a while I thought that'd not give us anything that just using LSNs gives
> us, but I think it might allow coming up with a better cutoff logic: Instead
> of using a cutoff like "page LSN is older than 10% of the LSNs since the last
> vacuum of the table", it would allow us to approximate "page has not been
> modified in the last 15 seconds" or such. I think that might help avoid
> unnecessary freezing on tables with very frequent vacuuming.

Yes. I'm uncomfortable with the last-vacuum-LSN approach mostly because of the impact on very frequently vacuumed tables, and secondarily because of the impact on very infrequently vacuumed tables.

Downthread, I proposed using the RedoRecPtr of the latest checkpoint rather than the LSN of the previous vacuum. I still like that idea. It's a value that we already have, with no additional bookkeeping. It varies over a much narrower range than the interval between vacuums on a table. The vacuum interval could be as short as tens of seconds or as long as years, while the checkpoint interval is almost always going to be between a few minutes at the low end and some tens of minutes at the high end, hours at the very most. That's quite appealing.
Also, I think the time between checkpoints actually matters here, because in some sense we're looking to get dirty, already-FPI'd pages frozen before they get written out, or before a new FPI becomes necessary, and checkpoints are one way for the first of those things to happen and the only way for the second one to happen. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
Hi, On 2023-08-28 12:26:01 -0400, Robert Haas wrote: > On Mon, Aug 28, 2023 at 10:00 AM Melanie Plageman > wrote: > > For the second goal, I've relied on past data to predict future > > behavior, so I tried several criteria to estimate the likelihood that a > > page will not be imminently modified. What was most effective was > > Andres' suggestion of comparing the page LSN to the insert LSN at the > > end of the last vacuum of that table; this approximates whether the page > > has been recently modified, which is a decent proxy for whether it'll be > > modified in the future. To do this, we need to save that insert LSN > > somewhere. In the attached WIP patch, I saved it in the table stats, for > > now -- knowing that those are not crash-safe. > > I wonder what the real plan here is for where to store this. It's not > obvious that we need this to be crash-safe; it's after all only for > use by a heuristic, and there's no actual breakage if the heuristic > goes wrong. At the same time, it doesn't exactly feel like a > statistic. I'm not certain either. This is generally something that's not satisfying right now - although IMO not necessarily for the reason you mention. Given that we already store, e.g., the time of the last autovacuum in the stats, I don't see a problem also storing a corresponding LSN. My issue is more that this kind of information not being crashsafe is really problematic - it's a well known issue that autovacuum just doesn't do anything for a while after a crash-restart (or pitr restore or ...), for example. Given that all the other datapoints are stored in the stats, I think just storing the LSNs alongside is reasonable. > Then there's the question of whether it's the right metric. My first > reaction is to think that it sounds pretty good. One thing I really > like about it is that if the table is being vacuumed frequently, then > we freeze less aggressively, and if the table is being vacuumed > infrequently, then we freeze more aggressively. 
That seems like a very > desirable property. It also seems broadly good that this metric > doesn't really care about reads. If there are a lot of reads on the > system, or no reads at all, it doesn't really change the chances that > a certain page is going to be written again soon, and since reads > don't change the insert LSN, here again it seems to do the right > thing. I'm a little less clear about whether it's good that it doesn't > really depend on wall-clock time. Yea, it'd be useful to have a reasonably approximate wall clock time for the last modification of a page. We just don't have infrastructure for determining that. We'd need an LSN->time mapping (xid->time wouldn't be particularly useful, I think). A very rough approximate modification time can be computed by assuming an even rate of WAL generation, and using the LSN at the time of the last vacuum and the time of the last vacuum, to compute the approximate age. For a while I thought that'd not give us anything that just using LSNs gives us, but I think it might allow coming up with a better cutoff logic: Instead of using a cutoff like "page LSN is older than 10% of the LSNs since the last vacuum of the table", it would allow us to approximate "page has not been modified in the last 15 seconds" or such. I think that might help avoid unnecessary freezing on tables with very frequent vacuuming. > Certainly, that's desirable from the point of view of not wanting to have to > measure wall-clock time in places where we otherwise wouldn't have to, which > tends to end up being expensive. IMO the bigger issue is that we don't want to store a timestamp on each page. > >Page Freezes/Page Frozen (less is better) As, I think, Robert mentioned downthread, I'm not sure this is a useful thing to judge the different heuristics by. If the number of pages frozen is small, the ratio quickly can be very large, without the freezing having a negative effect. 
I suspect interesting top-level figures to compare would be:

1) WAL volume (to judge the amount of unnecessary FPIs)

2) data reads + writes (to see the effect of repeated vacuuming of the same blocks)

3) number of vacuums and/or time spent vacuuming (freezing less aggressively might increase the number of vacuums due to anti-wrap vacuums; at the same time, freezing too aggressively could lead to vacuums taking too long)

4) throughput of the workload (to see potential regressions due to vacuuming overhead)

5) for transactional workloads: p99 latency (to see if vacuuming increases commit latency and such; just using the average tends to hide too much)

Greetings,

Andres Freund
Re: Eager page freeze criteria clarification
On Fri, Sep 1, 2023 at 12:34 PM Robert Haas wrote: > If the table > is being vacuumed frequently, then the last-vacuum-LSN will be newer, > which means we'll freeze *more* aggressively. Sort of like how if you set autovacuum_vacuum_scale_factor low enough, with standard pgbench, with heap fill factor tuned, autovacuum will never think that the table doesn't need to be vacuumed. It will continually see enough dead heap-only tuples to get another autovacuum each time. Though there won't be any LP_DEAD items at any point -- regardless of when VACUUM actually runs. When I reported this a couple of years ago, I noticed that autovacuum would spin whenever I set autovacuum_vacuum_scale_factor to 0.02. But autovacuum would *never* run (outside of antiwraparound autovacuums) when it was set just a little higher (perhaps 0.03 or 0.04). So there was some inflection point at which its behavior totally changed. > And I'm not sure why we > want to do that. If the table is being vacuumed a lot, it's probably > also being modified a lot, which suggests that we ought to be more > cautious about freezing, rather than the reverse. Why wouldn't it be both things at the same time, for the same table? Why not also avoid setting pages all-visible? The WAL records aren't too much smaller than most freeze records these days -- 64 bytes on most systems. I realize that the rules for FPIs are a bit different when page-level checksums aren't enabled, but fundamentally it's the same situation. No? -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Tue, Aug 29, 2023 at 10:22 AM Robert Haas wrote: > Let's assume for a moment that the rate at which the insert LSN is > advancing is roughly constant over time, so that it serves as a good > proxy for wall-clock time. Consider four tables A, B, C, and D that > are, respectively, vacuumed once per minute, once per hour, once per > day, and once per week. With a 33% threshold, pages in table A will be > frozen if they haven't been modified in 20 seconds, page in table B > will be frozen if they haven't been modified in 20 minutes, pages in > table C will be frozen if they haven't been modified in 8 hours, and > pages in table D will be frozen if they haven't been modified in 2 > days, 8 hours. My intuition is that this feels awfully aggressive for > A and awfully passive for D. > > [ discussion of freeze-on-evict ] Another way of thinking about this is: instead of replacing this heuristic with a complicated freeze-on-evict system, maybe we just need to adjust the heuristic. Maybe using an LSN is a good idea, but maybe the LSN of the last table vacuum isn't the right one to be using, or maybe it shouldn't be the only one that we use. For instance, what about using the redo LSN of the last checkpoint, or the checkpoint before the last checkpoint, or something like that? Somebody might find that unprincipled, but it seems to me that the checkpoint cycle has a lot to do with whether or not opportunistic freezing makes sense. If a page is likely to be modified again before a checkpoint forces it to be written, then freezing it is likely a waste. If it's likely to be written out of shared_buffers before it's modified again, then freezing it now is a pretty good bet. A given page could be evicted from shared_buffers, and thus written, sooner than the next checkpoint, but if it's dirty now, it definitely won't be written any later than the next checkpoint. 
By looking at the LSN of a page that I'm about to modify just before I modify it, I can make some kind of a guess as to whether this is a page that is being modified more or less than once per checkpoint cycle and adjust freezing behavior accordingly. One could also use a hybrid of the two values e.g. normally use the insert LSN of the last VACUUM, but if that's newer than the redo LSN of the last checkpoint, then use the latter instead, to avoid doing too much freezing of pages that may have been quiescent for only a few tens of seconds. I don't know if that's better or worse. As I think about it, I realize that I don't really know why Andres suggested a last-vacuum-LSN-based heuristic in the first place. Before, I wrote of this that "One thing I really like about it is that if the table is being vacuumed frequently, then we freeze less aggressively, and if the table is being vacuumed infrequently, then we freeze more aggressively." But actually, I don't think it does that. If the table is being vacuumed frequently, then the last-vacuum-LSN will be newer, which means we'll freeze *more* aggressively. And I'm not sure why we want to do that. If the table is being vacuumed a lot, it's probably also being modified a lot, which suggests that we ought to be more cautious about freezing, rather than the reverse. Just spitballing here. -- Robert Haas EDB: http://www.enterprisedb.com
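The hybrid Robert describes — normally the insert LSN of the last VACUUM, falling back to the checkpoint redo LSN when the former is newer — amounts to taking the older of the two candidates as the freeze cutoff. A toy sketch with integer LSNs and invented names, not the patch's actual code:

```python
def freeze_cutoff_lsn(last_vacuum_insert_lsn, checkpoint_redo_lsn):
    """Hybrid cutoff: use the insert LSN of the last VACUUM, but if
    that is newer than the redo LSN of the last checkpoint, use the
    redo LSN instead -- i.e. take the older of the two."""
    return min(last_vacuum_insert_lsn, checkpoint_redo_lsn)

def should_freeze(page_lsn, last_vacuum_insert_lsn, checkpoint_redo_lsn):
    # Freeze only pages that have not been modified since the cutoff.
    return page_lsn < freeze_cutoff_lsn(last_vacuum_insert_lsn,
                                        checkpoint_redo_lsn)
```

The `min()` captures the stated intent: a frequently vacuumed table can't push the cutoff past the last checkpoint, so pages quiescent for only a few tens of seconds aren't swept up.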
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 4:30 PM Melanie Plageman wrote: > By low-velocity, do you mean lower overall TPS? In that case, wouldn't you be > less likely to run into xid wraparound and thus need less aggressive > opportunistic freezing? Yes. But it also means that we've got slack capacity to do extra work now without really hurting anything. If we can leverage that capacity to reduce the pain of future bulk operations, that seems good. When resources are tight, doing speculative work immediately becomes less appealing. It's pretty hard to take such things into account, though. I was just mentioning it. > So, this is where the caveat about absolute number of page freezes > matters. In algorithm A, master only did 57 page freezes (spread across > the various pgbench tables). At the end of the run, 2 pages were still > frozen. I'm increasingly feeling that it's hard to make sense of the ratio. Maybe show the number of pages frozen, the number that are frozen at the end, and the number of pages in the database at the end of the run as three separate values. > (1) seems bad to me because it doesn't consider whether or not freezing > will be useful -- only if it will be cheap. It froze very little of the > cold data in a workload where a small percentage of it was being > modified (especially workloads A, C, H). And it froze a lot of data in > workloads where it was being uniformly modified (workload B). Sure, but if the cost of being wrong is low, you can be wrong a lot and still be pretty happy. It's unclear to me to what extent we should gamble on making only inexpensive mistakes and to what extent we should gamble on making only infrequent mistakes, but they're both valid strategies. > I suggested (4) and (5) because I think the "older than 33%" threshold > is better than the "older than 10%" threshold. I chose both because I am > still unclear on our values. Are we willing to freeze more aggressively > at the expense of emitting more FPIs? 
As long as it doesn't affect > throughput? For pretty much all of these workloads, the algorithms which > froze based on page modification recency OR FPI required emitted many > more FPIs than those which froze based only on page modification > recency. Let's assume for a moment that the rate at which the insert LSN is advancing is roughly constant over time, so that it serves as a good proxy for wall-clock time. Consider four tables A, B, C, and D that are, respectively, vacuumed once per minute, once per hour, once per day, and once per week. With a 33% threshold, pages in table A will be frozen if they haven't been modified in 20 seconds, pages in table B will be frozen if they haven't been modified in 20 minutes, pages in table C will be frozen if they haven't been modified in 8 hours, and pages in table D will be frozen if they haven't been modified in 2 days, 8 hours. My intuition is that this feels awfully aggressive for A and awfully passive for D. To expand on that: apparently D doesn't actually get much write activity, else there would be more vacuuming happening. So it's very likely that pages in table D are going to get checkpointed and evicted before they get modified again. Freezing them therefore seems like a good bet: it's a lot cheaper to freeze those pages when they're already in shared_buffers and dirty than it is if we have to read and write them specifically for freezing. It's less obvious that what we're doing in table A is wrong, and it could be exactly right, but workloads where a row is modified, a human thinks about something (e.g. whether to complete the proposed purchase), and then the same row is modified again are not uncommon, and human thinking times can easily exceed 20 seconds. On the other hand, workloads where a row is modified, a computer thinks about something, and then the row is modified again are also quite common, and computer thinking times can easily be less than 20 seconds. It feels like a toss-up whether we get it right.
For this kind of table, I suspect we'd be happier freezing pages that are about to be evicted or about to be written as part of a checkpoint rather than freezing pages opportunistically in vacuum. Maybe that's something we need to think harder about. If we froze dirty pages that wouldn't need a new FPI just before evicting them, and just before they would be written out for a checkpoint, under what circumstances would we still want vacuum to opportunistically freeze? I think the answer might be "only when it's very cheap." If it's very cheap to freeze now, it's appealing to gamble on doing it before a checkpoint comes along and makes the same operation require an extra FPI. But if it isn't too cheap, then why not just wait and see what happens? If the buffer gets modified again before it gets written out, then freezing immediately is a waste and freeze-on-evict is better. If it doesn't, freezing immediately and freeze-on-evict are the same price. I'm hand-waving a bit here because maybe freeze-on-evict is
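The wall-clock figures Robert quotes for tables A through D follow directly from the "older than 33%" rule under the constant-LSN-rate assumption — the threshold in wall-clock terms is one third of the vacuum interval (using exactly 1/3, which is what the quoted 20 s / 20 min / 8 h / 2 days 8 h figures correspond to). A quick arithmetic check:

```python
def freeze_age_threshold_s(vacuum_interval_s, fraction=1 / 3):
    """With a constant insert-LSN advance rate, an 'older than 33%'
    rule freezes pages last modified more than fraction * vacuum
    interval ago, in wall-clock terms."""
    return vacuum_interval_s * fraction

# Vacuum intervals: A once per minute, B per hour, C per day, D per week.
intervals = {"A": 60, "B": 3600, "C": 86400, "D": 7 * 86400}
for table, interval in intervals.items():
    print(table, freeze_age_threshold_s(interval))
```

This reproduces 20 s for A, 20 min (1200 s) for B, 8 h (28800 s) for C, and 2 days 8 h (201600 s) for D.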
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 4:15 PM Melanie Plageman wrote: > On Mon, Aug 28, 2023 at 5:06 PM Peter Geoghegan wrote: > > What I've described in a scheme for deciding to not freeze where that > > would usually happen -- a scheme for *vetoing* page freezing -- rather > > than a scheme for deciding to freeze. On its own, what I suggested > > cannot help at all. It assumes a world in which we're already deciding > > to freeze much more frequently, based on whatever other criteria. It's > > intended to complement something like the LSN scheme. > > I like the direction of this idea. It could avoid repeatedly freezing > a page that is being modified over and over. I tried it out with a > short-running version of workload I (approximating a work queue) -- > and it prevents unnecessary freezing. I see a lot of value in it as an enabler of freezing at the earliest opportunity, which is usually the way to go. > The problem with this criterion is that if you freeze a page and then > ever modify it again -- even once, vacuum will not be allowed to > freeze it again until it is forced to. Right. Like you, I was thinking of something that dampened VACUUM's newfound enthusiasm for freezing, in whatever way -- not a discrete rule. A mechanism with a sense of proportion about what it meant for the page to look a certain way. The strength of the signal would perhaps be highest (i.e. most discouraging of further freezing) on a page that has only a few relatively recent unfrozen tuples. XID age could actually be useful here! > Perhaps we could use a very > strict LSN cutoff for pages which contain any frozen tuples -- for > example: only freeze a page containing a mix of frozen and unfrozen > tuples if it has not been modified since before the last vacuum of the > relation ended. You might also need to account for things like interactions with the FSM. 
If the new unfrozen tuple packs the page to the brim, and the new row is from an insert, then maybe the mechanism shouldn't dampen our enthusiasm for freezing the page at all. Similarly, on a page that has no unfrozen tuples at all just yet, it would make sense for the algorithm to be most enthusiastic about freezing those pages that can fit no more tuples. Perhaps we'd be willing to accept the cost of an extra FPI with such a page, for example. It will take time to validate these sorts of ideas thoroughly, of course. I agree with what you said upthread about it also being a question of values and priorities. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 5:06 PM Peter Geoghegan wrote: > > On Mon, Aug 28, 2023 at 1:17 PM Robert Haas wrote: > > I'm sure this could be implemented, but it's unclear to me why you > > would expect it to perform well. Freezing a page that has no frozen > > tuples yet isn't cheaper than freezing one that does, so for this idea > > to be a win, the presence of frozen tuples on the page would have to > > be a signal that the page is likely to be modified again in the near > > future. In general, I don't see any reason why we should expect that > > to be the case. > > What I've described in a scheme for deciding to not freeze where that > would usually happen -- a scheme for *vetoing* page freezing -- rather > than a scheme for deciding to freeze. On its own, what I suggested > cannot help at all. It assumes a world in which we're already deciding > to freeze much more frequently, based on whatever other criteria. It's > intended to complement something like the LSN scheme. I like the direction of this idea. It could avoid repeatedly freezing a page that is being modified over and over. I tried it out with a short-running version of workload I (approximating a work queue) -- and it prevents unnecessary freezing. The problem with this criterion is that if you freeze a page and then ever modify it again -- even once, vacuum will not be allowed to freeze it again until it is forced to. Perhaps we could use a very strict LSN cutoff for pages which contain any frozen tuples -- for example: only freeze a page containing a mix of frozen and unfrozen tuples if it has not been modified since before the last vacuum of the relation ended. - Melanie
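Melanie's strict-cutoff idea could be sketched as a veto predicate: a page already containing frozen tuples is only re-frozen if it has not been modified since before the previous vacuum of the relation ended. The names below are invented for illustration; the real decision would sit alongside vacuum's other freeze criteria:

```python
def may_freeze(page_lsn, prev_vacuum_end_lsn, n_frozen_tuples):
    """Veto rule: pages with no frozen tuples are always candidates
    (subject to the usual criteria); pages that already contain frozen
    tuples are re-frozen only if unmodified since before the previous
    vacuum of the relation ended."""
    if n_frozen_tuples == 0:
        return True
    return page_lsn < prev_vacuum_end_lsn
```

This directly addresses the failure mode described above: a page frozen once and then modified even a single time no longer passes an absolute veto, whereas this version merely demands a longer quiescent period before freezing it again.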
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 1:17 PM Robert Haas wrote: > I'm sure this could be implemented, but it's unclear to me why you > would expect it to perform well. Freezing a page that has no frozen > tuples yet isn't cheaper than freezing one that does, so for this idea > to be a win, the presence of frozen tuples on the page would have to > be a signal that the page is likely to be modified again in the near > future. In general, I don't see any reason why we should expect that > to be the case. What I've described is a scheme for deciding to not freeze where that would usually happen -- a scheme for *vetoing* page freezing -- rather than a scheme for deciding to freeze. On its own, what I suggested cannot help at all. It assumes a world in which we're already deciding to freeze much more frequently, based on whatever other criteria. It's intended to complement something like the LSN scheme. > What really matters here is finding a criterion that is likely to > perform well in general, on a test case not known to us beforehand. > This isn't an entirely feasible goal, because just as you can > construct a test case where any given criterion performs well, so you > can also construct one where any given criterion performs poorly. But > I think a rule that has a clear theory of operation must be preferable > to one that doesn't. The theory that Melanie and Andres are advancing > is that a page that has been modified recently (in insert-LSN-time) is > more likely to be modified again soon than one that has not i.e. the > near future will be like the recent past. I don't think that it's all that useful on its own. You just cannot ignore the fact that the choice to not freeze now doesn't necessarily mean that you get to re-review that choice in the near future. Particularly with large tables, the opportunities to freeze at all are few and far between -- if for no other reason than the general design of autovacuum scheduling.
Worse still, any unfrozen all-visible pages can just accumulate as all-visible pages, until the next aggressive VACUUM happens whenever. How can that not be extremely important? That isn't an argument against a scheme that uses LSNs (many kinds of information might be weighed) -- it's an argument in favor of paying attention to the high level cadence of VACUUM. That much seems essential. I think that there might well be room for having several complementary schemes like the LSN scheme. Or one big scheme that weighs multiple factors together, if you prefer. That all seems basically reasonable to me. Adaptive behavior is important with something as complicated as this. Adaptive schemes all seem to involve trial and error. The cost of freezing too much is relatively well understood, and can be managed sensibly. So we should err in that direction -- a direction that is relatively easy to understand, to notice, and to pull back from having gone too far. Putting off freezing for a very long time is a source of much of the seemingly intractable complexity in this area. Another way of addressing that is getting rid of aggressive VACUUM as a concept. But I'm not going to revisit that topic now, or likely ever. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 12:26 PM Robert Haas wrote: > > On Mon, Aug 28, 2023 at 10:00 AM Melanie Plageman > wrote: > Then there's the question of whether it's the right metric. My first > reaction is to think that it sounds pretty good. One thing I really > like about it is that if the table is being vacuumed frequently, then > we freeze less aggressively, and if the table is being vacuumed > infrequently, then we freeze more aggressively. That seems like a very > desirable property. It also seems broadly good that this metric > doesn't really care about reads. If there are a lot of reads on the > system, or no reads at all, it doesn't really change the chances that > a certain page is going to be written again soon, and since reads > don't change the insert LSN, here again it seems to do the right > thing. I'm a little less clear about whether it's good that it doesn't > really depend on wall-clock time. Certainly, that's desirable from the > point of view of not wanting to have to measure wall-clock time in > places where we otherwise wouldn't have to, which tends to end up > being expensive. However, if I were making all of my freezing > decisions manually, I might be more freeze-positive on a low-velocity > system where writes are more stretched out across time than on a > high-velocity system where we're blasting through the LSN space at a > higher rate. But maybe that's not a very important consideration, and > I don't know what we'd do about it anyway. By low-velocity, do you mean lower overall TPS? In that case, wouldn't you be less likely to run into xid wraparound and thus need less aggressive opportunistic freezing? 
> > Page Freezes/Page Frozen (less is better)
> >
> > |   | Master |     (1) |     (2) |     (3) |     (4) |     (5) |
> > |---|--------|---------|---------|---------|---------|---------|
> > | A |  28.50 |    3.89 |    1.08 |    1.15 |    1.10 |    1.10 |
> > | B |   1.00 |    1.06 |    1.65 |    1.03 |    1.59 |    1.00 |
> > | C |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> > | D |   2.00 | 5199.15 | 5276.85 | 4830.45 | 5234.55 | 2193.55 |
> > | E |   7.90 |    3.21 |    2.73 |    2.70 |    2.69 |    2.43 |
> > | F |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> > | G |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> > | H |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> > | I |    N/A |   42.00 |   42.00 |     N/A |   41.00 |     N/A |
>
> Hmm. I would say that the interesting rows here are A, D, and I, with
> rows C and E deserving honorable mention. In row A, master is bad.

So, this is where the caveat about absolute number of page freezes matters. In workload A, master only did 57 page freezes (spread across the various pgbench tables). At the end of the run, 2 pages were still frozen.

> In row D, your algorithms are all bad, really bad. I don't quite
> understand how it can be that bad, actually.

So, I realize now that this test was poorly designed. I meant it to be a worst-case scenario, but I think one critical part was wrong. In this example, one client is going at full speed inserting a row and then updating it. Then another rate-limited client is deleting old data periodically to keep the table at a constant size. I meant to bulk load the table with enough data that the delete job would have data to delete from the start. With the default autovacuum settings, over the course of 45 minutes, I usually saw around 40 autovacuums of the table. Due to the rate limiting, the first autovacuum of the table ends up freezing many pages that are deleted soon after. Thus the total number of page freezes is very high. I will redo benchmarking of workload D and start the table with the number of rows which the DELETE job seeks to maintain. My back-of-the-envelope math says that this will mean ratios closer to a dozen (and not 5000).
Also, I had doubled checkpoint timeout, which likely led master to freeze so few pages (2 total freezes, neither of which were still frozen at the end of the run). This is an example where master's overall low number of page freezes makes it difficult to compare to the alternatives using a ratio. I didn't initially question the numbers because it seems like freezing data and then deleting it right after would naturally be one of the worst cases for opportunistic freezing, but certainly not this bad.

> Row I looks bad for algorithms 1, 2, and 4: they freeze pages because
> it looks cheap, but the work doesn't really pay off.

Yes, the work queue example looks like it is hard to handle.

> > % Frozen at end of run
> >
> > |   | Master | (1) | (2) | (3) | (4) | (5) |
> > |---|--------|-----|-----|-----|-----|-----|
> > | A |      0 |   1 |  99 |   0 |  81 |   0 |
> > | B |     71 |  96 |  99 |   3 |  98 |   2 |
> > | C |      0 |   9 | 100 |   6 |  92 |   5 |
> > | D |      0 |   1 |   1 |   1 |   1 |   1 |
> > | E |      0 |  63 | 100 |  68 | 100 |  67 |
> > | F
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 3:09 PM Peter Geoghegan wrote: > I've long emphasized the importance of designs that just try to avoid > disaster. With that in mind, I wonder: have you thought about > conditioning page freezing on whether or not there are already some > frozen tuples on the page? You could perhaps give some weight to > whether or not the page already has at least one or two preexisting > frozen tuples when deciding on whether to freeze it once again now. > You'd be more eager about freezing pages that have no frozen tuples > whatsoever, compared to what you'd do with an otherwise equivalent > page that has no unfrozen tuples. I'm sure this could be implemented, but it's unclear to me why you would expect it to perform well. Freezing a page that has no frozen tuples yet isn't cheaper than freezing one that does, so for this idea to be a win, the presence of frozen tuples on the page would have to be a signal that the page is likely to be modified again in the near future. In general, I don't see any reason why we should expect that to be the case. One could easily construct a workload where it is the case -- for instance, set up one table T1 where 90% of the tuples are repeatedly updated and the other 10% are never touched, and another table T2 that is insert-only. Once frozen, the never-updated tuples in T1 become sentinels that we can use to know that the table isn't insert-only. But I don't think that's very interesting: you can construct a test case like this for any proposed criterion, just by structuring the test workload so that whatever criterion is being tested is a perfect predictor of whether the page will be modified soon. What really matters here is finding a criterion that is likely to perform well in general, on a test case not known to us beforehand. 
This isn't an entirely feasible goal, because just as you can construct a test case where any given criterion performs well, so you can also construct one where any given criterion performs poorly. But I think a rule that has a clear theory of operation must be preferable to one that doesn't. The theory that Melanie and Andres are advancing is that a page that has been modified recently (in insert-LSN-time) is more likely to be modified again soon than one that has not i.e. the near future will be like the recent past. I'm not sure what the theory behind the rule you propose here might be; if you articulated it somewhere in your email, I seem to have missed it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 7:00 AM Melanie Plageman wrote: > To do this, we need to save that insert LSN > somewhere. In the attached WIP patch, I saved it in the table stats, for > now -- knowing that those are not crash-safe. What patch? You didn't actually attach one. > Other discarded heuristic ideas included comparing the next transaction > ID at the end of the vacuum of a relation to the visibility cutoff xid > in the page -- but that wasn't effective for freezing data from bulk > loads. I've long emphasized the importance of designs that just try to avoid disaster. With that in mind, I wonder: have you thought about conditioning page freezing on whether or not there are already some frozen tuples on the page? You could perhaps give some weight to whether or not the page already has at least one or two preexisting frozen tuples when deciding on whether to freeze it once again now. You'd be more eager about freezing pages that have no frozen tuples whatsoever, compared to what you'd do with an otherwise equivalent page that has no unfrozen tuples. Small mistakes are inevitable here. They have to be okay. What's not okay is a small mistake that effectively becomes a big mistake because it gets repeated across each VACUUM operation, again and again, ceaselessly. You can probably be quite aggressive about freezing, to good effect -- provided you can be sure that the downside when it turns out to have been a bad idea is self-limiting, in whatever way. Making more small mistakes might actually be a good thing -- especially if it can dramatically lower the chances of ever making any really big mistakes. VACUUM is not a passive observer of the system. It's just another component in the system. So what VACUUM sees in the table depends in no small part on what previous VACUUMs actually did. It follows that VACUUM should be concerned about how what it does might either help or hinder future VACUUMs. 
My preexisting frozen tuples suggestion is really just an example of how that principle might be applied. Many variations on the same general idea are possible. There are already various extremely weird feedback loops where what VACUUM did last time affects what it'll do this time. These are vicious circles. So ISTM that there is a lot to be said for disrupting those vicious circles, and even deliberately engineering heuristics that have the potential to create virtuous circles for the right workload. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 7:00 AM Melanie Plageman wrote: > I believe that there are two goals that should dictate whether or not we > should perform opportunistic freezing: > > 1. Is it cheap? For example, if the buffer is already dirty, then no > write amplification occurs, since it must be written out anyway. > Freezing is also less expensive if we can do it without emitting an > FPI. > > 2. Will it be effective; that is, will it stay frozen? > Opportunistically freezing a page that will immediately be modified is > a waste. > > The current heuristic on master meets neither of these goals: it freezes > a page if pruning emitted an FPI for it. This doesn't evaluate whether > or not freezing itself would be cheap, but rather attempts to hide > freezing behind an expensive operation. The goal is to minimize the number of FPIs over time, in general. That's hardly the same thing as hiding the cost. > Furthermore, it often fails to > freeze cold data and may indiscriminately freeze hot data. You need to account for the cost of not freezing, too. Controlling the overall freeze debt at the level of whole tables (and the level of the whole system) is very important. In fact it's probably the single most important thing. A good model might end up increasing the cost of freezing on a simple quantitative basis, while making life much better overall. Huge balloon payments for freezing are currently a huge problem. Performance stability might come with a cost in some cases. There isn't necessarily anything wrong with that at all. -- Peter Geoghegan
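For reference, the two goals quoted above combine into a simple predicate. This is only a toy model with invented parameter names — not the logic of any posted patch — using the "older than 33% of the LSN space consumed since the last vacuum" variant as the effectiveness test:

```python
def opportunistic_freeze(buffer_dirty, freeze_would_emit_fpi,
                         page_lsn, last_vacuum_insert_lsn,
                         current_insert_lsn, threshold=0.33):
    """Toy combination of the two goals:
    (1) cheap: the buffer is already dirty (no write amplification)
        and freezing would not emit a new FPI;
    (2) effective: the page LSN falls outside the most recent
        `threshold` fraction of LSN space consumed since the last
        vacuum, so it was probably not modified recently."""
    cheap = buffer_dirty and not freeze_would_emit_fpi
    cutoff = current_insert_lsn - threshold * (current_insert_lsn
                                               - last_vacuum_insert_lsn)
    likely_to_stay_frozen = page_lsn < cutoff
    return cheap and likely_to_stay_frozen
```

Peter's point above is that such a model prices only the freezing itself; a fuller accounting would also weigh the deferred cost of *not* freezing, which this sketch ignores.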
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 10:00 AM Melanie Plageman wrote: > For the second goal, I've relied on past data to predict future > behavior, so I tried several criteria to estimate the likelihood that a > page will not be imminently modified. What was most effective was > Andres' suggestion of comparing the page LSN to the insert LSN at the > end of the last vacuum of that table; this approximates whether the page > has been recently modified, which is a decent proxy for whether it'll be > modified in the future. To do this, we need to save that insert LSN > somewhere. In the attached WIP patch, I saved it in the table stats, for > now -- knowing that those are not crash-safe. I wonder what the real plan here is for where to store this. It's not obvious that we need this to be crash-safe; it's after all only for use by a heuristic, and there's no actual breakage if the heuristic goes wrong. At the same time, it doesn't exactly feel like a statistic. Then there's the question of whether it's the right metric. My first reaction is to think that it sounds pretty good. One thing I really like about it is that if the table is being vacuumed frequently, then we freeze less aggressively, and if the table is being vacuumed infrequently, then we freeze more aggressively. That seems like a very desirable property. It also seems broadly good that this metric doesn't really care about reads. If there are a lot of reads on the system, or no reads at all, it doesn't really change the chances that a certain page is going to be written again soon, and since reads don't change the insert LSN, here again it seems to do the right thing. I'm a little less clear about whether it's good that it doesn't really depend on wall-clock time. Certainly, that's desirable from the point of view of not wanting to have to measure wall-clock time in places where we otherwise wouldn't have to, which tends to end up being expensive. 
However, if I were making all of my freezing decisions manually, I might be more freeze-positive on a low-velocity system where writes are more stretched out across time than on a high-velocity system where we're blasting through the LSN space at a higher rate. But maybe that's not a very important consideration, and I don't know what we'd do about it anyway.

> Page Freezes/Page Frozen (less is better)
>
> |   | Master |     (1) |     (2) |     (3) |     (4) |     (5) |
> |---|--------|---------|---------|---------|---------|---------|
> | A |  28.50 |    3.89 |    1.08 |    1.15 |    1.10 |    1.10 |
> | B |   1.00 |    1.06 |    1.65 |    1.03 |    1.59 |    1.00 |
> | C |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> | D |   2.00 | 5199.15 | 5276.85 | 4830.45 | 5234.55 | 2193.55 |
> | E |   7.90 |    3.21 |    2.73 |    2.70 |    2.69 |    2.43 |
> | F |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> | G |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> | H |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> | I |    N/A |   42.00 |   42.00 |     N/A |   41.00 |     N/A |

Hmm. I would say that the interesting rows here are A, D, and I, with rows C and E deserving honorable mention. In row A, master is bad. In row D, your algorithms are all bad, really bad. I don't quite understand how it can be that bad, actually. Row I looks bad for algorithms 1, 2, and 4: they freeze pages because it looks cheap, but the work doesn't really pay off.

> % Frozen at end of run
>
> |   | Master | (1) | (2) | (3) | (4) | (5) |
> |---|--------|-----|-----|-----|-----|-----|
> | A |      0 |   1 |  99 |   0 |  81 |   0 |
> | B |     71 |  96 |  99 |   3 |  98 |   2 |
> | C |      0 |   9 | 100 |   6 |  92 |   5 |
> | D |      0 |   1 |   1 |   1 |   1 |   1 |
> | E |      0 |  63 | 100 |  68 | 100 |  67 |
> | F |      0 |   5 |  14 |   6 |  14 |   5 |
> | G |      0 | 100 | 100 |  92 | 100 |  67 |
> | H |      0 |  11 | 100 |   9 |  86 |   5 |
> | I |      0 | 100 | 100 |   0 | 100 |   0 |

So all of the algorithms here, but especially 1, 2, and 4, freeze a lot more often than master. If I understand correctly, we'd like to see small numbers for B, D, and I, and large numbers for the other workloads. None of the algorithms seem to achieve that.
(3) and (5) seem like they always behave as well or better than master, but they produce small numbers for A, C, F, and H. (1), (2), and (4) regress B and I relative to master but do better than (3) and (5) on A, C, and the latter two also on E. B is such an important benchmarking workload that I'd be loath to regress it, so if I had to pick on the basis of this data, my vote would be (3) or (5), provided whatever is happening with (D) in the previous metric is not as bad as it looks. What's your reason for preferring (4) and (5) over (2) and (3)? I'm not clear that these numbers give us much of an idea whether 10% or 33% or something else is better in general. To be honest, having now spent more time l
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 3:27 PM Melanie Plageman wrote: > On Fri, Jul 28, 2023 at 3:00 PM Peter Geoghegan wrote: > > > Is this test meant to guard against unnecessary freezing or to avoid > > > freezing when the cost is too high? That is, are we trying to > > > determine how likely it is that the page has been recently modified > > > and avoid eager freezing when it would be pointless (because the page > > > will soon be modified again)? > > > > Sort of. This cost of freezing over time is weirdly nonlinear, so it's > > hard to give a simple answer. > > > > The justification for the FPI trigger optimization is that FPIs are > > overwhelmingly the cost that really matters when it comes to freezing > > (and vacuuming in general) -- so we might as well make the best out of > > a bad situation when pruning happens to get an FPI. There can easily > > be a 10x or more cost difference (measured in total WAL volume) > > between freezing without an FPI and freezing with an FPI. > ... > > In 16 VACUUM just "makes the best > > out of a bad situation" when an FPI was already required during > > pruning. We have already "paid for the privilege" of writing some WAL > > for the page at that point, so it's reasonable to not squander a > > window of opportunity to avoid future FPIs in future VACUUM > > operations, by freezing early. > > > > We're "taking a chance" on being able to get freezing out of the way > > early when an FPI triggers freezing. It's not guaranteed to work out > > in each individual case, of course, but even if we assume it's fairly > > unlikely to work out (which is very pessimistic) it's still very > > likely a good deal. > > > > This strategy (the 16 strategy of freezing eagerly because we already > > got an FPI) seems safer than a strategy involving freezing eagerly > > because we won't get an FPI as a result. If for no other reason than > > this: with the approach in 16 we already know for sure that we'll have > > written an FPI anyway. 
It's hard to imagine somebody being okay with > > the FPIs, but not being okay with the other extra WAL. > > I see. I don't have an opinion on the "best of a bad situation" > argument. Though, I think it is worth amending the comment in the code > to include this explanation. > > But, ISTM that there should also be some independent heuristic to > determine whether or not it makes sense to freeze the page. That could > be related to whether or not it will be cheap to do so (in which case > we can check if we will have to emit an FPI as part of the freeze > record) or it could be related to whether or not the freezing is > likely to be pointless (we are likely to update the page again soon). > > It sounds like it was discussed before, but I'd be interested in > revisiting it and happy to test out various ideas.

Hi, in service of "testing various ideas", I've benchmarked the heuristics proposed in this thread, as well as a few others that Andres and I considered, for determining whether or not to opportunistically freeze a page during vacuum. Note that this heuristic would be in addition to the existing criterion that we only opportunistically freeze pages that can be subsequently set all frozen in the visibility map.

I believe that there are two goals that should dictate whether or not we should perform opportunistic freezing:

1. Is it cheap? For example, if the buffer is already dirty, then no write amplification occurs, since it must be written out anyway. Freezing is also less expensive if we can do it without emitting an FPI.

2. Will it be effective; that is, will it stay frozen? Opportunistically freezing a page that will immediately be modified is a waste.

The current heuristic on master meets neither of these goals: it freezes a page if pruning emitted an FPI for it. This doesn't evaluate whether or not freezing itself would be cheap, but rather attempts to hide freezing behind an expensive operation.
Furthermore, it often fails to freeze cold data and may indiscriminately freeze hot data. For the second goal, I've relied on past data to predict future behavior, so I tried several criteria to estimate the likelihood that a page will not be imminently modified. What was most effective was Andres' suggestion of comparing the page LSN to the insert LSN at the end of the last vacuum of that table; this approximates whether the page has been recently modified, which is a decent proxy for whether it'll be modified in the future. To do this, we need to save that insert LSN somewhere. In the attached WIP patch, I saved it in the table stats, for now -- knowing that those are not crash-safe. Other discarded heuristic ideas included comparing the next transaction ID at the end of the vacuum of a relation to the visibility cutoff xid in the page -- but that wasn't effective for freezing data from bulk loads. The algorithms I evaluated all attempt to satisfy goal (1) by freezing only if the buffer is already dirty and also by considering whether or not
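The two goals above can be condensed into a standalone predicate. This is only an illustrative sketch, not the WIP patch's code: the function and parameter names (`should_freeze_opportunistically`, `last_vacuum_insert_lsn`) and the direct LSN comparison are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t XLogRecPtr;

/*
 * Hypothetical sketch of the two goals:
 *  1. cheap: the buffer is already dirty and the freeze record would not
 *     emit an FPI, so freezing adds little extra I/O and WAL;
 *  2. effective: the page LSN predates the insert LSN recorded at the end
 *     of the table's last vacuum, i.e. the page has not been modified
 *     recently and is therefore likely to stay frozen.
 */
static bool
should_freeze_opportunistically(bool buffer_dirty, bool freeze_emits_fpi,
                                XLogRecPtr page_lsn,
                                XLogRecPtr last_vacuum_insert_lsn)
{
    if (!buffer_dirty || freeze_emits_fpi)
        return false;           /* goal 1 not met: freezing isn't cheap */
    /* goal 2: page unmodified since the last vacuum of this table */
    return page_lsn <= last_vacuum_insert_lsn;
}
```

A page dirtied after the last vacuum (its LSN beyond the recorded insert LSN) would be skipped here as likely hot.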
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 5:03 PM Melanie Plageman wrote: > When you were working on this, what were the downsides of only having the > criteria that 1) page would be all_visible/all_frozen and 2) we did prune > something (i.e. no consideration of FPIs)? Conditioning freezing on "all_visible && all_frozen" seems likely to always be a good idea when it comes to any sort of speculative trigger criteria. Most of the wins in this area will come from avoiding future FPIs, and you can't hope to do that unless you freeze everything on the page. The cost of freezing when "all_visible && all_frozen" does not hold may be quite low, but the benefits will also be very low. Also, the way that recovery conflicts are generated for freeze records somewhat depends on this right now -- though you could probably fix that if you had to. Note that we *don't* really limit ourselves to cases where an FPI was generated from pruning, exactly -- even on 16. With page-level checksums we can also generate an FPI just to set a hint bit. That triggers freezing too, even on insert-only tables (assuming checksums are enabled). I expect that that will be an important condition for triggering freezing in practice. Your question about 2 seems equivalent to "why not just always freeze?". I don't think that that's a bad question -- quite the opposite. Even trying to give an answer to this question would amount to getting involved in new work on VACUUM, though. So I don't think I can help you here. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 4:49 PM Peter Geoghegan wrote: > > On Fri, Jul 28, 2023 at 4:30 PM Andres Freund wrote: > > > Put differently, I can't see any reason to care whether pruning > > > emitted an FPI or not. Either way, it's very unlikely that freezing > > > needs to do so. > > > > +many > > > > Using FPIs having been generated during pruning as part of the logic to > > decide > > whether to freeze makes no sense to me. We likely need some heuristic here, > > but I suspect it has to look quite different than the FPI condition. > > I think that it depends on how you frame it. It makes zero sense that > that's the only thing that can do something like this at all. It makes > some sense as an incremental improvement, since there is no way that > we can lose by too much, even if you make arbitrarily pessimistic > assumptions. In other words, you won't be able to come up with an > adversarial benchmark where this loses by too much. When you were working on this, what were the downsides of only having the criteria that 1) page would be all_visible/all_frozen and 2) we did prune something (i.e. no consideration of FPIs)? > As I said, I'm no longer interested in working on VACUUM (though I do > hope that Melanie or somebody else picks up where I left off). I have > nothing to say about any new work in this area. If you want me to do > something in the scope of the work on 16, as a release management > task, please be clear about what that is. I'm only talking about this in the context of future work for 17. - Melanie
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 4:30 PM Andres Freund wrote: > Using FPIs having been generated during pruning as part of the logic to decide > whether to freeze makes no sense to me. We likely need some heuristic here, > but I suspect it has to look quite different than the FPI condition. Why do we need a heuristic at all? What we're talking about here, AIUI, is under what circumstance we should eagerly freeze a page that can be fully frozen, leaving no tuples that vacuum ever needs to worry about again except if the page is further modified. Freezing in such circumstances seems highly likely to work out in our favor. The only argument against it is that the page might soon be modified again, in which case the effort of freezing would be mostly wasted. But I'm not sure whether that's really a valid rationale, because the win from not having to vacuum the page again if it isn't modified seems pretty big, and emitting a freeze record for a page we've just pruned doesn't seem that expensive. If it is a valid rationale, my first thought would be to make the heuristic based on vacuum_freeze_min_age. XID age isn't a particularly great proxy for whether the page is still actively being modified, but at least it has tradition going for it. -- Robert Haas EDB: http://www.enterprisedb.com
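Robert's fallback idea of gating eager freezing on vacuum_freeze_min_age reduces to an XID age test. A minimal sketch, assuming plain 32-bit modular arithmetic and deliberately omitting the epoch/FullTransactionId bookkeeping that real TransactionId comparisons involve; the function name is illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: consider a tuple for eager freezing only once its
 * xmin is at least freeze_min_age transactions old. Unsigned subtraction
 * yields the age modulo 2^32, loosely mirroring XID wraparound; the real
 * code's epoch handling is omitted.
 */
static bool
xmin_age_exceeds(uint32_t next_xid, uint32_t tuple_xmin,
                 uint32_t freeze_min_age)
{
    return (uint32_t) (next_xid - tuple_xmin) >= freeze_min_age;
}
```

As Robert notes, XID age is a poor proxy for whether the page is still being modified, but it is at least a traditional threshold.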
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 4:30 PM Andres Freund wrote: > > Put differently, I can't see any reason to care whether pruning > > emitted an FPI or not. Either way, it's very unlikely that freezing > > needs to do so. > > +many > > Using FPIs having been generated during pruning as part of the logic to decide > whether to freeze makes no sense to me. We likely need some heuristic here, > but I suspect it has to look quite different than the FPI condition. I think that it depends on how you frame it. It makes zero sense that that's the only thing that can do something like this at all. It makes some sense as an incremental improvement, since there is no way that we can lose by too much, even if you make arbitrarily pessimistic assumptions. In other words, you won't be able to come up with an adversarial benchmark where this loses by too much. As I said, I'm no longer interested in working on VACUUM (though I do hope that Melanie or somebody else picks up where I left off). I have nothing to say about any new work in this area. If you want me to do something in the scope of the work on 16, as a release management task, please be clear about what that is. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
Hi, On 2023-07-28 16:19:47 -0400, Robert Haas wrote: > On Fri, Jul 28, 2023 at 3:00 PM Peter Geoghegan wrote: > > > Or are we trying to determine how likely > > > the freeze record is to emit an FPI and avoid eager freezing when it > > > isn't worth the cost? > > > > No, that's not something that we're doing right now (we just talked > > about doing something like that). In 16 VACUUM just "makes the best > > out of a bad situation" when an FPI was already required during > > pruning. We have already "paid for the privilege" of writing some WAL > > for the page at that point, so it's reasonable to not squander a > > window of opportunity to avoid future FPIs in future VACUUM > > operations, by freezing early. > > This doesn't make sense to me. It's true that if the pruning record > emitted an FPI, then the freezing record probably won't need to do so > either, unless by considerable ill-fortune the redo pointer was moved > in the tiny window between pruning and freezing. But isn't that also > true that if the pruning record *didn't* emit an FPI? In that case, > the pruning record wasn't the first WAL-logged modification to the > page during the then-current checkpoint cycle, and some earlier > modification already did it. But in that case also the freezing won't > need to emit a new FPI, except for the same identical case where the > redo pointer moves at just the wrong time. > > Put differently, I can't see any reason to care whether pruning > emitted an FPI or not. Either way, it's very unlikely that freezing > needs to do so. +many Using FPIs having been generated during pruning as part of the logic to decide whether to freeze makes no sense to me. We likely need some heuristic here, but I suspect it has to look quite different than the FPI condition. Greetings, Andres
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 3:00 PM Peter Geoghegan wrote: > > Or are we trying to determine how likely > > the freeze record is to emit an FPI and avoid eager freezing when it > > isn't worth the cost? > > No, that's not something that we're doing right now (we just talked > about doing something like that). In 16 VACUUM just "makes the best > out of a bad situation" when an FPI was already required during > pruning. We have already "paid for the privilege" of writing some WAL > for the page at that point, so it's reasonable to not squander a > window of opportunity to avoid future FPIs in future VACUUM > operations, by freezing early. This doesn't make sense to me. It's true that if the pruning record emitted an FPI, then the freezing record probably won't need to do so either, unless by considerable ill-fortune the redo pointer was moved in the tiny window between pruning and freezing. But isn't that also true that if the pruning record *didn't* emit an FPI? In that case, the pruning record wasn't the first WAL-logged modification to the page during the then-current checkpoint cycle, and some earlier modification already did it. But in that case also the freezing won't need to emit a new FPI, except for the same identical case where the redo pointer moves at just the wrong time. Put differently, I can't see any reason to care whether pruning emitted an FPI or not. Either way, it's very unlikely that freezing needs to do so. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 3:27 PM Melanie Plageman wrote: > I see. I don't have an opinion on the "best of a bad situation" > argument. Though, I think it is worth amending the comment in the code > to include this explanation. I think that the user-facing docs should be completely overhauled to make that clear, and reasonably accessible. It's hard to talk about in code comments because it's something that can only be effective over time, across multiple VACUUM operations. > But, ISTM that there should also be some independent heuristic to > determine whether or not it makes sense to freeze the page. That could > be related to whether or not it will be cheap to do so (in which case > we can check if we will have to emit an FPI as part of the freeze > record) or it could be related to whether or not the freezing is > likely to be pointless (we are likely to update the page again soon). It sounds like you're interested in adding additional criteria to trigger page-level freezing. That's something that the current structure anticipates. To give a slightly contrived example: it would be very easy to add another condition, where (say) we call random() to determine whether or not to freeze the page. You'd very likely want to gate this new trigger criteria in the same way as the FPI criteria (i.e. only do it when "prunestate->all_visible && prunestate->all_frozen" hold). We've decoupled the decision to freeze from what it means to execute freezing itself. Having additional trigger criteria makes a lot of sense. I'm sure that it makes sense to add at least one more, and it seems possible that adding several more makes sense. Obviously, that will have to be new work targeting 17, though. I made a decision to stop working on VACUUM, though, so I'm afraid I won't be able to offer much help with any of this. (Happy to give more background information, though.) -- Peter Geoghegan
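The decoupled structure Peter describes — mandatory freezing, plus a disjunction of speculative trigger criteria gated on the page ending up all-visible and all-frozen — can be sketched as follows. The names are illustrative stand-ins, not the actual vacuumlazy.c code:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the decision structure: freezing that is required (an old
 * XID/MXID forces it) or trivial (nothing to freeze) always proceeds;
 * any speculative trigger -- the FPI condition, or a hypothetical new
 * criterion -- only fires when the page will be all-visible and
 * all-frozen afterwards.
 */
static bool
page_freeze_decision(bool freeze_required, int tuples_frozen,
                     bool all_visible, bool all_frozen,
                     bool speculative_trigger_fired)
{
    if (freeze_required || tuples_frozen == 0)
        return true;
    if (!(all_visible && all_frozen))
        return false;           /* speculative freezing is gated */
    return speculative_trigger_fired;
}
```

Adding a new trigger criterion then amounts to adding another disjunct behind the all-visible/all-frozen gate.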
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 3:00 PM Peter Geoghegan wrote: > > On Fri, Jul 28, 2023 at 11:13 AM Melanie Plageman > wrote: > > if (pagefrz.freeze_required || tuples_frozen == 0 || > > (prunestate->all_visible && prunestate->all_frozen && > > fpi_before != pgWalUsage.wal_fpi)) > > { > > > > I'm trying to understand the condition fpi_before != > > pgWalUsage.wal_fpi -- don't eager freeze if pruning emitted an FPI. > > You mean "prunestate->all_visible && prunestate->all_frozen", which is > a condition of applying FPI-based eager freezing, but not traditional > lazy freezing? Ah no, a very key word in my sentence was off: I meant "don't eager freeze *unless* pruning emitted an FPI" I was asking about the FPI-trigger optimization specifically. I'm clear on the all_frozen/all_visible criteria. > > Is this test meant to guard against unnecessary freezing or to avoid > > freezing when the cost is too high? That is, are we trying to > > determine how likely it is that the page has been recently modified > > and avoid eager freezing when it would be pointless (because the page > > will soon be modified again)? > > Sort of. This cost of freezing over time is weirdly nonlinear, so it's > hard to give a simple answer. > > The justification for the FPI trigger optimization is that FPIs are > overwhelmingly the cost that really matters when it comes to freezing > (and vacuuming in general) -- so we might as well make the best out of > a bad situation when pruning happens to get an FPI. There can easily > be a 10x or more cost difference (measured in total WAL volume) > between freezing without an FPI and freezing with an FPI. ... > In 16 VACUUM just "makes the best > out of a bad situation" when an FPI was already required during > pruning. We have already "paid for the privilege" of writing some WAL > for the page at that point, so it's reasonable to not squander a > window of opportunity to avoid future FPIs in future VACUUM > operations, by freezing early. 
> > We're "taking a chance" on being able to get freezing out of the way > early when an FPI triggers freezing. It's not guaranteed to work out > in each individual case, of course, but even if we assume it's fairly > unlikely to work out (which is very pessimistic) it's still very > likely a good deal. > > This strategy (the 16 strategy of freezing eagerly because we already > got an FPI) seems safer than a strategy involving freezing eagerly > because we won't get an FPI as a result. If for no other reason than > this: with the approach in 16 we already know for sure that we'll have > written an FPI anyway. It's hard to imagine somebody being okay with > the FPIs, but not being okay with the other extra WAL. I see. I don't have an opinion on the "best of a bad situation" argument. Though, I think it is worth amending the comment in the code to include this explanation. But, ISTM that there should also be some independent heuristic to determine whether or not it makes sense to freeze the page. That could be related to whether or not it will be cheap to do so (in which case we can check if we will have to emit an FPI as part of the freeze record) or it could be related to whether or not the freezing is likely to be pointless (we are likely to update the page again soon). It sounds like it was discussed before, but I'd be interested in revisiting it and happy to test out various ideas. - Melanie
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 11:13 AM Melanie Plageman wrote:
> if (pagefrz.freeze_required || tuples_frozen == 0 ||
> (prunestate->all_visible && prunestate->all_frozen &&
> fpi_before != pgWalUsage.wal_fpi))
> {
>
> I'm trying to understand the condition fpi_before !=
> pgWalUsage.wal_fpi -- don't eager freeze if pruning emitted an FPI.

You mean "prunestate->all_visible && prunestate->all_frozen", which is a condition of applying FPI-based eager freezing, but not traditional lazy freezing? Obviously, the immediate impact of that is that the FPI trigger condition is not met unless we know for sure that the page will be marked all-visible and all-frozen in the visibility map afterwards. A page that's eligible to become all-visible will also be seen as eligible to become all-frozen in the vast majority of cases, but there are some rare and obscure cases involving MultiXacts that must be considered here.

There is little point in freezing early unless we have at least some chance of not having to freeze the same page again in the future (ideally forever). There is generally no point in freezing most of the tuples on a page when there'll still be one or more not-yet-eligible unfrozen tuples that get left behind -- we might as well not bother with freezing at all when we see a page where that'll be the outcome from freezing.

However, there is (once again) a need to account for rare and extreme cases -- it still needs to be possible to do that. Specifically, we may be forced to freeze a page that's left with some remaining unfrozen tuples when VACUUM is fundamentally incapable of freezing them due to its OldestXmin/removable cutoff being too old. That can happen when VACUUM needs to freeze according to the traditional age-based settings, and yet the OldestXmin/removable cutoff gets held back (by a leaked replication slot or whatever).

(Actually, VACUUM FREEZE will freeze only a subset of the tuples from some heap pages far more often.
VACUUM FREEZE seems like a bad design to me, though -- it uses the most aggressive possible XID cutoff for freezing when it should probably hold off on freezing those individual pages where we determine that it makes little sense. We need to focus more on physical pages and their costs, and less on XID cutoffs.) > Is this test meant to guard against unnecessary freezing or to avoid > freezing when the cost is too high? That is, are we trying to > determine how likely it is that the page has been recently modified > and avoid eager freezing when it would be pointless (because the page > will soon be modified again)? Sort of. This cost of freezing over time is weirdly nonlinear, so it's hard to give a simple answer. The justification for the FPI trigger optimization is that FPIs are overwhelmingly the cost that really matters when it comes to freezing (and vacuuming in general) -- so we might as well make the best out of a bad situation when pruning happens to get an FPI. There can easily be a 10x or more cost difference (measured in total WAL volume) between freezing without an FPI and freezing with an FPI. > Or are we trying to determine how likely > the freeze record is to emit an FPI and avoid eager freezing when it > isn't worth the cost? No, that's not something that we're doing right now (we just talked about doing something like that). In 16 VACUUM just "makes the best out of a bad situation" when an FPI was already required during pruning. We have already "paid for the privilege" of writing some WAL for the page at that point, so it's reasonable to not squander a window of opportunity to avoid future FPIs in future VACUUM operations, by freezing early. We're "taking a chance" on being able to get freezing out of the way early when an FPI triggers freezing. It's not guaranteed to work out in each individual case, of course, but even if we assume it's fairly unlikely to work out (which is very pessimistic) it's still very likely a good deal. 
This strategy (the 16 strategy of freezing eagerly because we already got an FPI) seems safer than a strategy involving freezing eagerly because we won't get an FPI as a result. If for no other reason than this: with the approach in 16 we already know for sure that we'll have written an FPI anyway. It's hard to imagine somebody being okay with the FPIs, but not being okay with the other extra WAL. > But, the final rationale is still not clear to me. Could we add a > comment above the if condition specifying both: > a) what the test is a proxy for > b) the intended outcome (when do we expect to eager freeze) > And perhaps we could even describe a scenario where this heuristic is > effective? There are lots of scenarios where it'll be effective. I agree that there is a need to document this stuff a lot better. I have a pending doc patch that overhauls the user-facing docs in this area. My latest draft is here: https://postgr.es/m/CAH2-Wz=UUJz+MMb1AxFzz-HDA=1t1fx_kmrdovopzxkpa-t...@mail.gmail.com
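The fpi_before != pgWalUsage.wal_fpi test works by snapshotting a cumulative FPI counter before pruning and comparing it afterwards. Below is a standalone model of that bookkeeping; the counter and function names are stand-ins for illustration, not PostgreSQL internals:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the cumulative full-page-image count in pgWalUsage. */
static uint64_t wal_fpi_counter = 0;

/* Model of pruning a page; real pruning may or may not emit an FPI. */
static void
model_prune_page(bool emits_fpi)
{
    if (emits_fpi)
        wal_fpi_counter++;
}

/* True when pruning since the snapshot emitted an FPI -- the same "we
 * already paid for an FPI, so freeze eagerly" signal the 16 code derives
 * from fpi_before != pgWalUsage.wal_fpi. */
static bool
pruning_emitted_fpi(uint64_t fpi_before)
{
    return fpi_before != wal_fpi_counter;
}

/* Tiny simulation: snapshot the counter, prune with an FPI, detect it. */
static bool
demo_detects_fpi(void)
{
    uint64_t before = wal_fpi_counter;

    model_prune_page(true);
    return pruning_emitted_fpi(before);
}
```

The snapshot-and-compare idiom is what lets the caller distinguish "this particular prune wrote an FPI" from FPIs written earlier in the vacuum.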
Eager page freeze criteria clarification
Hi,

While hacking on the pruning and page-level freezing code, I got a bit confused by the test guarding eager freezing [1]:

    /*
     * Freeze the page when heap_prepare_freeze_tuple indicates that at least
     * one XID/MXID from before FreezeLimit/MultiXactCutoff is present. Also
     * freeze when pruning generated an FPI, if doing so means that we set the
     * page all-frozen afterwards (might not happen until final heap pass).
     */
    if (pagefrz.freeze_required || tuples_frozen == 0 ||
        (prunestate->all_visible && prunestate->all_frozen &&
         fpi_before != pgWalUsage.wal_fpi))
    {

I'm trying to understand the condition fpi_before != pgWalUsage.wal_fpi -- don't eager freeze if pruning emitted an FPI.

Is this test meant to guard against unnecessary freezing or to avoid freezing when the cost is too high? That is, are we trying to determine how likely it is that the page has been recently modified and avoid eager freezing when it would be pointless (because the page will soon be modified again)? Or are we trying to determine how likely the freeze record is to emit an FPI and avoid eager freezing when it isn't worth the cost? Or something else?

The commit message says:

> Also teach VACUUM to trigger page-level freezing whenever it detects
> that heap pruning generated an FPI. We'll have already written a large
> amount of WAL just to do that much, so it's very likely a good idea to
> get freezing out of the way for the page early.

And I found the thread where it was discussed [2]. Several possible explanations are mentioned in the thread. But, the final rationale is still not clear to me. Could we add a comment above the if condition specifying both:

a) what the test is a proxy for
b) the intended outcome (when do we expect to eager freeze)

And perhaps we could even describe a scenario where this heuristic is effective?
- Melanie [1] https://github.com/postgres/postgres/blob/master/src/backend/access/heap/vacuumlazy.c#L1802 [2] https://www.postgresql.org/message-id/flat/CAH2-Wzm_%3DmrWO%2BkUAJbR_gM_6RzpwVA8n8e4nh3dJGHdw_urew%40mail.gmail.com#c5aae6ea65e07de92a24a35445473448