Re: Eager page freeze criteria clarification
On Thu, Dec 21, 2023 at 3:58 PM Melanie Plageman wrote:
>
> On Wed, Dec 13, 2023 at 12:24 PM Robert Haas wrote:
> > On Sat, Dec 9, 2023 at 5:12 AM Melanie Plageman wrote:
> > > The goal is to keep pages frozen for at least target_freeze_duration. target_freeze_duration is in seconds and pages only have a last modification LSN, so target_freeze_duration must be converted to LSNs. To accomplish this, I've added an LSNTimeline data structure, containing XLogRecPtr, TimestampTz pairs stored with decreasing precision as they age. When we need to translate the guc value to LSNs, we linearly interpolate it on this timeline. For the time being, the global LSNTimeline is in PgStat_WalStats and is only updated by vacuum. There is no reason it can't be updated with some other cadence and/or by some other process (nothing about it is inherently tied to vacuum). The cached translated value of target_freeze_duration is stored in each table's stats. This is arbitrary as it is not a table-level stat. However, it needs to be located somewhere that is accessible on update/delete. We may want to recalculate it more often than once per table vacuum, especially in case of long-running vacuums.
> >
> > This part sounds like it isn't quite baked yet. The idea of the data structure seems fine, but updating it once per vacuum sounds fairly unprincipled to me? Don't we want the updates to happen on a somewhat regular wall clock cadence?
>
> Yes, this part was not fully baked. I actually discussed this with Andres at PGConf EU last week and he suggested that the background writer update the LSNTimeline. He also suggested I propose the LSNTimeline in a new thread. I could add a pageinspect function returning the estimated time of last page modification given the page LSN (so it is proposed with a user).
I've rebased this over top of the revised LSNTimeline functionality proposed separately in [1]. It is also registered for the current commitfest. I plan to add the decay logic and benchmark it this week.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_Z7tR7D1=DR=xwqdefyk_nu_gxgw88x0htxn6ask-8...@mail.gmail.com

From 6fe9f26ebe6bd773e631bcdf0b7ea928a4ca8e27 Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Wed, 27 Dec 2023 16:40:27 -0500
Subject: [PATCH v3 02/12] Add LSNTimeline for converting LSN <-> time

Add a new structure, LSNTimeline, consisting of LSNTimes -- each an
LSN, time pair. Each LSNTime can represent multiple logical LSN, time
pairs, referred to as members. LSN <-> time conversions can be done
using linear interpolation with two LSNTimes on the LSNTimeline.

This commit does not add a global instance of LSNTimeline. It adds the
structures and functions needed to maintain and access such a timeline.
---
 src/backend/utils/activity/pgstat_wal.c | 199
 src/include/pgstat.h                    |  34
 src/tools/pgindent/typedefs.list        |   2 +
 3 files changed, 235 insertions(+)

diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 6a81b781357..ba40aad258a 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -17,8 +17,11 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "utils/pgstat_internal.h"
 #include "executor/instrument.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
 
 PgStat_PendingWalStats PendingWalStats = {0};
 
@@ -32,6 +35,12 @@ PgStat_PendingWalStats PendingWalStats = {0};
 
 static WalUsage prevWalUsage;
 
+static void lsntime_absorb(LSNTime *a, const LSNTime *b);
+void lsntime_insert(LSNTimeline *timeline, TimestampTz time, XLogRecPtr lsn);
+
+XLogRecPtr estimate_lsn_at_time(const LSNTimeline *timeline, TimestampTz time);
+TimestampTz estimate_time_at_lsn(const LSNTimeline *timeline, XLogRecPtr lsn);
+
 /*
  * Calculate how much WAL usage counters have increased and update
  * shared WAL and IO statistics.
@@ -184,3 +193,193 @@ pgstat_wal_snapshot_cb(void)
 		sizeof(pgStatLocal.snapshot.wal));
 	LWLockRelease(&stats_shmem->lock);
 }
+
+/*
+ * Set *a to be the earlier of *a or *b.
+ */
+static void
+lsntime_absorb(LSNTime *a, const LSNTime *b)
+{
+	LSNTime		result;
+	int			new_members = a->members + b->members;
+
+	if (a->time < b->time)
+		result = *a;
+	else if (b->time < a->time)
+		result = *b;
+	else if (a->lsn < b->lsn)
+		result = *a;
+	else if (b->lsn < a->lsn)
+		result = *b;
+	else
+		result = *a;
+
+	*a = result;
+	a->members = new_members;
+}
+
+/*
+ * Insert a new LSNTime into the LSNTimeline in the first element with spare
+ * capacity.
+ */
+void
+lsntime_insert(LSNTimeline *timeline, TimestampTz time, XLogRecPtr lsn)
+{
+	LSNTime		temp;
+	LSNTime		carry = {.lsn = lsn, .time = time, .members = 1};
+
+	for (int i = 0; i < timeline->length; i++)
+	{
+		bool		full;
+		LSNTime    *cur = &timeline->data[i];
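For readers skimming the thread, the LSN <-> time conversion the patch's commit message describes -- linear interpolation between two (LSN, time) pairs on the timeline -- can be pictured with the following standalone sketch. The types and function names here are simplified stand-ins of my own, not the patch's actual API:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the patch's LSNTime: one (LSN, timestamp) pair. */
typedef struct SketchLSNTime
{
	uint64_t	lsn;			/* WAL position, in bytes */
	int64_t		time;			/* timestamp, e.g. in microseconds */
} SketchLSNTime;

/*
 * Estimate the LSN in effect at time t by linear interpolation between two
 * timeline entries.  Assumes a->time <= t <= b->time and a->time < b->time.
 */
uint64_t
sketch_estimate_lsn_at_time(const SketchLSNTime *a, const SketchLSNTime *b,
							int64_t t)
{
	double		frac = (double) (t - a->time) / (double) (b->time - a->time);

	return a->lsn + (uint64_t) (frac * (double) (b->lsn - a->lsn));
}

/*
 * The inverse: estimate the time at which the WAL reached a given LSN.
 */
int64_t
sketch_estimate_time_at_lsn(const SketchLSNTime *a, const SketchLSNTime *b,
							uint64_t lsn)
{
	double		frac = (double) (lsn - a->lsn) / (double) (b->lsn - a->lsn);

	return a->time + (int64_t) (frac * (double) (b->time - a->time));
}
```

Storing entries with decreasing precision as they age (the "members" counter in the patch) only changes which two points bracket the lookup; the interpolation itself stays this simple.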
Re: Eager page freeze criteria clarification
On Thu, Dec 21, 2023 at 10:56 AM Melanie Plageman wrote:
> Agreed. I plan to test with another distribution. Though, the exercise of determining which ones are useful is probably more challenging. I imagine we will have to choose one distribution (as opposed to supporting different distributions and choosing based on data access patterns for a table). Though, even with a normal distribution, I think it should be an improvement.

Our current algorithm isn't adaptive at all, so I like our chances of coming out ahead. It won't surprise me if somebody finds a case where there is a regression, but if we handle some common and important cases correctly (e.g. append-only, update-everything-nonstop) then I think we're probably ahead even if there are some cases where we do worse. It does depend on how much worse they are, and how realistic they are, but we don't want to be too fearful here: we know what we're doing right now isn't too great.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
On Wed, Dec 13, 2023 at 12:24 PM Robert Haas wrote:
>
> Great results.

Thanks!

> On Sat, Dec 9, 2023 at 5:12 AM Melanie Plageman wrote:
> > Values can be "removed" from the accumulator by simply decrementing its cardinality and decreasing the sum and sum squared by a value that will not change the mean and standard deviation of the overall distribution. To adapt to a table's changing access patterns, we'll need to remove values from this accumulator over time, but this patch doesn't yet decide when to do this. A simple solution may be to cap the cardinality of the accumulator to the greater of 1% of the table size, or some fixed number of values (perhaps 200?). Even without such removal of values, the distribution recorded in the accumulator will eventually skew toward more recent data, albeit at a slower rate.
>
> I think we're going to need something here. Otherwise, after 6 months of use, changing a table's perceived access pattern will be quite difficult.
>
> I think one challenge here is to find something that doesn't decay too often and end up with cases where it basically removes all the data.
>
> As long as you avoid that, I suspect that the algorithm might not be terribly sensitive to other kinds of changes. If you decay after 200 values or 2000 or 20,000, it will only affect how fast we can change our notion of the access pattern, and my guess would be that any of those values would produce broadly acceptable results, with some differences in the details. If you decay after 200,000,000 values or not at all, then I think there will be problems.

I'll add the decay logic and devise a benchmark that will exercise it. I can test at least one or two of these ideas.

> > The goal is to keep pages frozen for at least target_freeze_duration. target_freeze_duration is in seconds and pages only have a last modification LSN, so target_freeze_duration must be converted to LSNs. To accomplish this, I've added an LSNTimeline data structure, containing XLogRecPtr, TimestampTz pairs stored with decreasing precision as they age. When we need to translate the guc value to LSNs, we linearly interpolate it on this timeline. For the time being, the global LSNTimeline is in PgStat_WalStats and is only updated by vacuum. There is no reason it can't be updated with some other cadence and/or by some other process (nothing about it is inherently tied to vacuum). The cached translated value of target_freeze_duration is stored in each table's stats. This is arbitrary as it is not a table-level stat. However, it needs to be located somewhere that is accessible on update/delete. We may want to recalculate it more often than once per table vacuum, especially in case of long-running vacuums.
>
> This part sounds like it isn't quite baked yet. The idea of the data structure seems fine, but updating it once per vacuum sounds fairly unprincipled to me? Don't we want the updates to happen on a somewhat regular wall clock cadence?

Yes, this part was not fully baked. I actually discussed this with Andres at PGConf EU last week and he suggested that the background writer update the LSNTimeline. He also suggested I propose the LSNTimeline in a new thread. I could add a pageinspect function returning the estimated time of last page modification given the page LSN (so it is proposed with a user).

- Melanie
Re: Eager page freeze criteria clarification
On 12/21/23 10:56, Melanie Plageman wrote:
> On Sat, Dec 9, 2023 at 9:24 AM Joe Conway wrote:
> > However, even if we assume a more-or-less normal distribution, we should consider using subgroups in a way similar to Statistical Process Control[1]. The reasoning is explained in this quote:
> >
> >     The Math Behind Subgroup Size
> >
> >     The Central Limit Theorem (CLT) plays a pivotal role here. According to CLT, as the subgroup size (n) increases, the distribution of the sample means will approximate a normal distribution, regardless of the shape of the population distribution. Therefore, as your subgroup size increases, your control chart limits will narrow, making the chart more sensitive to special cause variation and more prone to false alarms.
>
> I haven't read anything about statistical process control until you mentioned this. I read the link you sent and also googled around a bit. I was under the impression that the more samples we have, the better. But, it seems like this may not be the assumption in statistical process control? It may help us to get more specific. I'm not sure what the relationship between "unsets" in my code and subgroup members would be. The article you linked suggests that each subgroup should be of size 5 or smaller. Translating that to my code, were you imagining subgroups of "unsets" (each time we modify a page that was previously all-visible)?

Basically, yes. It might not make sense, but I think we could test the theory by plotting a histogram of the raw data, and then also plot a histogram based on sub-grouping every 5 sequential values in your accumulator. If the former does not look very normal (I would guess most workloads it will be skewed with a long tail) and the latter looks to be more normal, then it would say we were on the right track. There are statistical tests for "normalness" that could be applied too (e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6350423/#sec2-13title) which would be a more rigorous approach, but the quick look at histograms might be sufficiently convincing.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
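Joe's subgrouping suggestion could be tried offline on collected "unset" durations with something like the sketch below (my own illustration, not code from the patchset; subgroup size 5 follows the article's suggestion). Each run of five sequential values is collapsed to its mean, as in an X-bar control chart, and the resulting means can then be histogrammed next to the raw values:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Collapse raw values into subgroup means of size `group`, as in an X-bar
 * control chart.  Writes one mean per complete subgroup to `out` (which
 * must have room for n / group entries) and returns how many were written.
 * A trailing incomplete subgroup is dropped.
 */
size_t
sketch_subgroup_means(const double *raw, size_t n, size_t group, double *out)
{
	size_t		n_groups = n / group;

	for (size_t g = 0; g < n_groups; g++)
	{
		double		total = 0.0;

		for (size_t i = 0; i < group; i++)
			total += raw[g * group + i];
		out[g] = total / group;
	}
	return n_groups;
}
```

Per the CLT argument quoted above, the subgroup means should look closer to normal than the raw (likely long-tailed) durations, which is the comparison Joe proposes making with the two histograms.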
Re: Eager page freeze criteria clarification
On Sat, Dec 9, 2023 at 9:24 AM Joe Conway wrote:
>
> On 12/8/23 23:11, Melanie Plageman wrote:
> >
> > I'd be delighted to receive any feedback, ideas, questions, or review.
>
> This is well thought out, well described, and a fantastic improvement in my view -- well done!

Thanks, Joe! That means a lot! I see work done by hackers on the mailing list a lot that makes me think, "hey, that's cool/clever/awesome!" but I don't give that feedback. I appreciate you doing that!

> I do think we will need to consider distributions other than normal, but I don't know offhand what they will be.

Agreed. I plan to test with another distribution. Though, the exercise of determining which ones are useful is probably more challenging. I imagine we will have to choose one distribution (as opposed to supporting different distributions and choosing based on data access patterns for a table). Though, even with a normal distribution, I think it should be an improvement.

> However, even if we assume a more-or-less normal distribution, we should consider using subgroups in a way similar to Statistical Process Control[1]. The reasoning is explained in this quote:
>
>     The Math Behind Subgroup Size
>
>     The Central Limit Theorem (CLT) plays a pivotal role here. According to CLT, as the subgroup size (n) increases, the distribution of the sample means will approximate a normal distribution, regardless of the shape of the population distribution. Therefore, as your subgroup size increases, your control chart limits will narrow, making the chart more sensitive to special cause variation and more prone to false alarms.

I haven't read anything about statistical process control until you mentioned this. I read the link you sent and also googled around a bit. I was under the impression that the more samples we have, the better. But, it seems like this may not be the assumption in statistical process control? It may help us to get more specific. I'm not sure what the relationship between "unsets" in my code and subgroup members would be. The article you linked suggests that each subgroup should be of size 5 or smaller. Translating that to my code, were you imagining subgroups of "unsets" (each time we modify a page that was previously all-visible)?

Thanks for the feedback!

- Melanie
Re: Eager page freeze criteria clarification
Great results.

On Sat, Dec 9, 2023 at 5:12 AM Melanie Plageman wrote:
> Values can be "removed" from the accumulator by simply decrementing its cardinality and decreasing the sum and sum squared by a value that will not change the mean and standard deviation of the overall distribution. To adapt to a table's changing access patterns, we'll need to remove values from this accumulator over time, but this patch doesn't yet decide when to do this. A simple solution may be to cap the cardinality of the accumulator to the greater of 1% of the table size, or some fixed number of values (perhaps 200?). Even without such removal of values, the distribution recorded in the accumulator will eventually skew toward more recent data, albeit at a slower rate.

I think we're going to need something here. Otherwise, after 6 months of use, changing a table's perceived access pattern will be quite difficult.

I think one challenge here is to find something that doesn't decay too often and end up with cases where it basically removes all the data.

As long as you avoid that, I suspect that the algorithm might not be terribly sensitive to other kinds of changes. If you decay after 200 values or 2000 or 20,000, it will only affect how fast we can change our notion of the access pattern, and my guess would be that any of those values would produce broadly acceptable results, with some differences in the details. If you decay after 200,000,000 values or not at all, then I think there will be problems.

> This algorithm is able to predict when pages will be modified before target_freeze_threshold as long as their modification pattern fits a normal distribution -- this accommodates access patterns where, after some period of time, pages become less likely to be modified the older they are. As an example, however, access patterns where modification times are bimodal aren't handled well by this model (imagine that half your modifications are to very new data and the other half are to data that is much older but still younger than target_freeze_duration). If this is a common access pattern, the normal distribution could easily be swapped out for another distribution. The current accumulator is capable of tracking a distribution's first two moments of central tendency (the mean and standard deviation). We could track more if we wanted to use a fancier distribution.

I think it might be fine to ignore this problem. What we're doing right now is way worse.

> We also must consider what to do when we have few unsets, e.g. with an insert-only workload. When there are very few unsets (I chose 30 which the internet says is the approximate minimum number required for the central limit theorem to hold), we can choose to always freeze freezable pages; above this limit, the calculated page threshold is used. However, we may not want to make predictions based on 31 values either. To avoid this "cliff", we could modify the algorithm a bit to skew the mean and standard deviation of the distribution using a confidence interval based on the number of values we've collected.

I think some kind of smooth transition might be a good idea, but I also don't know how much it matters. I think the important thing is that we freeze aggressively unless there's a clear signal telling us not to do that. Absence of evidence should cause us to tend toward aggressive freezing. Otherwise, we get insert-only tables wrong.

> The goal is to keep pages frozen for at least target_freeze_duration. target_freeze_duration is in seconds and pages only have a last modification LSN, so target_freeze_duration must be converted to LSNs. To accomplish this, I've added an LSNTimeline data structure, containing XLogRecPtr, TimestampTz pairs stored with decreasing precision as they age. When we need to translate the guc value to LSNs, we linearly interpolate it on this timeline. For the time being, the global LSNTimeline is in PgStat_WalStats and is only updated by vacuum. There is no reason it can't be updated with some other cadence and/or by some other process (nothing about it is inherently tied to vacuum). The cached translated value of target_freeze_duration is stored in each table's stats. This is arbitrary as it is not a table-level stat. However, it needs to be located somewhere that is accessible on update/delete. We may want to recalculate it more often than once per table vacuum, especially in case of long-running vacuums.

This part sounds like it isn't quite baked yet. The idea of the data structure seems fine, but updating it once per vacuum sounds fairly unprincipled to me? Don't we want the updates to happen on a somewhat regular wall clock cadence?

> To benchmark this new heuristic (I'm calling it algo 6 since it is the 6th I've proposed on this thread), I've used a modified subset of my original workloads:
>
> [ everything worked perfectly ]

Hard to beat that.

--
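For concreteness, the "< 5% likelihood" rule quoted in this exchange corresponds, under the normality assumption, to a one-sided bound roughly 1.645 standard deviations from the mean of the early-unset ages. The sketch below is my reading of that rule, not the patch's implementation; the constant and the direction of the comparison are assumptions:

```c
#include <assert.h>
#include <stdbool.h>

/* One-sided 95th percentile z-score of the standard normal distribution. */
#define SKETCH_Z_95 1.645

/*
 * Given the mean and standard deviation of observed "early unset" page
 * ages, decide whether a page that has already gone `page_age` without
 * modification is past the point where fewer than 5% of early unsets
 * would still be expected to occur -- i.e. it is probably safe to freeze.
 */
bool
sketch_page_is_freeze_candidate(double mean_age, double stddev_age,
								double page_age)
{
	double		threshold = mean_age + SKETCH_Z_95 * stddev_age;

	return page_age >= threshold;
}
```

In the patch's terms, this age threshold would then be translated through the LSNTimeline into a page LSN cutoff, so the check at freeze time is a simple LSN comparison.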
Re: Eager page freeze criteria clarification
On 12/8/23 23:11, Melanie Plageman wrote:
> On Wed, Nov 8, 2023 at 9:23 PM Melanie Plageman wrote:
> > The next step is to devise different heuristics and measure their efficacy. IMO, the goal of the algorithm is to freeze pages in a relation such that we drive early unfreezes/freezes -> 0 and pages frozen/number of pages of a certain age -> 1.
>
> Attached is a patchset with an adaptive freezing algorithm that works well. It keeps track of the pages' unmodified duration after having been set all-visible and uses that to predict how likely a page is to be modified given its age.
>
> Each time an all-visible page is modified by an update, delete, or tuple lock, if the page was all-visible for less than target_freeze_duration (a new guc that specifies the minimum amount of time you would like data to stay frozen), it is counted as an "early unset" and the duration that it was unmodified is added into an accumulator. Before each relation is vacuumed, we calculate the mean and standard deviation of these durations using the accumulator. This is used to calculate a page LSN threshold demarcating pages with a < 5% likelihood of being modified before target_freeze_duration. We don't freeze pages younger than this threshold.
>
> I've done away with the ring buffer of vacuums and the idea of attributing an "unset" to the vacuum that set it all visible. Instead, using an "accumulator", I keep a running sum of the page ages, along with the cardinality of the accumulated set and the sum squared of page ages. This data structure allows us to extract the mean and standard deviation, at any time, from an arbitrary number of values in constant space; and is used to model the pattern of these unsets as a normal distribution that we can use to try and predict whether or not a page will be modified.
>
> Values can be "removed" from the accumulator by simply decrementing its cardinality and decreasing the sum and sum squared by a value that will not change the mean and standard deviation of the overall distribution. To adapt to a table's changing access patterns, we'll need to remove values from this accumulator over time, but this patch doesn't yet decide when to do this. A simple solution may be to cap the cardinality of the accumulator to the greater of 1% of the table size, or some fixed number of values (perhaps 200?). Even without such removal of values, the distribution recorded in the accumulator will eventually skew toward more recent data, albeit at a slower rate.
>
> This algorithm is able to predict when pages will be modified before target_freeze_threshold as long as their modification pattern fits a normal distribution -- this accommodates access patterns where, after some period of time, pages become less likely to be modified the older they are. As an example, however, access patterns where modification times are bimodal aren't handled well by this model (imagine that half your modifications are to very new data and the other half are to data that is much older but still younger than target_freeze_duration). If this is a common access pattern, the normal distribution could easily be swapped out for another distribution. The current accumulator is capable of tracking a distribution's first two moments of central tendency (the mean and standard deviation). We could track more if we wanted to use a fancier distribution.
>
> We also must consider what to do when we have few unsets, e.g. with an insert-only workload. When there are very few unsets (I chose 30 which the internet says is the approximate minimum number required for the central limit theorem to hold), we can choose to always freeze freezable pages; above this limit, the calculated page threshold is used. However, we may not want to make predictions based on 31 values either. To avoid this "cliff", we could modify the algorithm a bit to skew the mean and standard deviation of the distribution using a confidence interval based on the number of values we've collected.
>
> The goal is to keep pages frozen for at least target_freeze_duration. target_freeze_duration is in seconds and pages only have a last modification LSN, so target_freeze_duration must be converted to LSNs. To accomplish this, I've added an LSNTimeline data structure, containing XLogRecPtr, TimestampTz pairs stored with decreasing precision as they age. When we need to translate the guc value to LSNs, we linearly interpolate it on this timeline. For the time being, the global LSNTimeline is in PgStat_WalStats and is only updated by vacuum. There is no reason it can't be updated with some other cadence and/or by some other process (nothing about it is inherently tied to vacuum). The cached translated value of target_freeze_duration is stored in each table's stats. This is arbitrary as it is not a table-level stat. However, it needs to be located somewhere that is accessible on update/delete. We may want to recalculate it more often than once per table vacuum, especially in case of long-running vacuums.
>
> To be
Re: Eager page freeze criteria clarification
On Wed, Nov 8, 2023 at 9:23 PM Melanie Plageman wrote:
> The next step is to devise different heuristics and measure their efficacy. IMO, the goal of the algorithm is to freeze pages in a relation such that we drive early unfreezes/freezes -> 0 and pages frozen/number of pages of a certain age -> 1.

Attached is a patchset with an adaptive freezing algorithm that works well. It keeps track of the pages' unmodified duration after having been set all-visible and uses that to predict how likely a page is to be modified given its age.

Each time an all-visible page is modified by an update, delete, or tuple lock, if the page was all-visible for less than target_freeze_duration (a new guc that specifies the minimum amount of time you would like data to stay frozen), it is counted as an "early unset" and the duration that it was unmodified is added into an accumulator. Before each relation is vacuumed, we calculate the mean and standard deviation of these durations using the accumulator. This is used to calculate a page LSN threshold demarcating pages with a < 5% likelihood of being modified before target_freeze_duration. We don't freeze pages younger than this threshold.

I've done away with the ring buffer of vacuums and the idea of attributing an "unset" to the vacuum that set it all visible. Instead, using an "accumulator", I keep a running sum of the page ages, along with the cardinality of the accumulated set and the sum squared of page ages. This data structure allows us to extract the mean and standard deviation, at any time, from an arbitrary number of values in constant space; and is used to model the pattern of these unsets as a normal distribution that we can use to try and predict whether or not a page will be modified.

Values can be "removed" from the accumulator by simply decrementing its cardinality and decreasing the sum and sum squared by a value that will not change the mean and standard deviation of the overall distribution. To adapt to a table's changing access patterns, we'll need to remove values from this accumulator over time, but this patch doesn't yet decide when to do this. A simple solution may be to cap the cardinality of the accumulator to the greater of 1% of the table size, or some fixed number of values (perhaps 200?). Even without such removal of values, the distribution recorded in the accumulator will eventually skew toward more recent data, albeit at a slower rate.

This algorithm is able to predict when pages will be modified before target_freeze_threshold as long as their modification pattern fits a normal distribution -- this accommodates access patterns where, after some period of time, pages become less likely to be modified the older they are. As an example, however, access patterns where modification times are bimodal aren't handled well by this model (imagine that half your modifications are to very new data and the other half are to data that is much older but still younger than target_freeze_duration). If this is a common access pattern, the normal distribution could easily be swapped out for another distribution. The current accumulator is capable of tracking a distribution's first two moments of central tendency (the mean and standard deviation). We could track more if we wanted to use a fancier distribution.

We also must consider what to do when we have few unsets, e.g. with an insert-only workload. When there are very few unsets (I chose 30 which the internet says is the approximate minimum number required for the central limit theorem to hold), we can choose to always freeze freezable pages; above this limit, the calculated page threshold is used. However, we may not want to make predictions based on 31 values either. To avoid this "cliff", we could modify the algorithm a bit to skew the mean and standard deviation of the distribution using a confidence interval based on the number of values we've collected.

The goal is to keep pages frozen for at least target_freeze_duration. target_freeze_duration is in seconds and pages only have a last modification LSN, so target_freeze_duration must be converted to LSNs. To accomplish this, I've added an LSNTimeline data structure, containing XLogRecPtr, TimestampTz pairs stored with decreasing precision as they age. When we need to translate the guc value to LSNs, we linearly interpolate it on this timeline. For the time being, the global LSNTimeline is in PgStat_WalStats and is only updated by vacuum. There is no reason it can't be updated with some other cadence and/or by some other process (nothing about it is inherently tied to vacuum). The cached translated value of target_freeze_duration is stored in each table's stats. This is arbitrary as it is not a table-level stat. However, it needs to be located somewhere that is accessible on update/delete. We may want to recalculate it more often than once per table vacuum, especially in case of long-running vacuums.

To benchmark this new heuristic (I'm callin
Re: Eager page freeze criteria clarification
On Wed, Oct 11, 2023 at 8:43 PM Andres Freund wrote: > > A rough sketch of a freezing heuristic: > > - We concluded that to intelligently control opportunistic freezing we need > statistics about the number of freezes and unfreezes > > - We should track page freezes / unfreezes in shared memory stats on a > per-relation basis > > - To use such statistics to control heuristics, we need to turn them into > rates. For that we need to keep snapshots of absolute values at certain > times (when vacuuming), allowing us to compute a rate. > > - If we snapshot some stats, we need to limit the amount of data that > occupies > > - evict based on wall clock time (we don't care about unfreezing pages > frozen a month ago) > > - "thin out" data when exceeding limited amount of stats per relation > using random sampling or such > > - need a smarter approach than just keeping N last vacuums, as there are > situations where a table is (auto-) vacuumed at a high frequency > > > - only looking at recent-ish table stats is fine, because we > - a) don't want to look at too old data, as we need to deal with changing >workloads > > - b) if there aren't recent vacuums, falsely freezing is of bounded cost > > - shared memory stats being lost on crash-restart/failover might be a > problem > > - we certainly don't want to immediate store these stats in a table, due > to the xid consumption that'd imply > > > - Attributing "unfreezes" to specific vacuums would be powerful: > > - "Number of pages frozen during vacuum" and "Number of pages unfrozen that > were frozen during the same vacuum" provides numerator / denominator for > an "error rate" > > - We can perform this attribution by comparing the page LSN with recorded > start/end LSNs of recent vacuums > > - If the freezing error rate of recent vacuums is low, freeze more > aggressively. This is important to deal with insert mostly workloads. 
> > - If old data is "unfrozen", that's fine, we can ignore such unfreezes when > controlling "freezing aggressiveness" > > - Ignoring unfreezing of old pages is important to e.g. deal with > workloads that delete old data > > - This approach could provide "goals" for opportunistic freezing in a > somewhat understandable way. E.g. aiming to rarely unfreeze data that has > been frozen within 1h/1d/... I have taken a stab at implementing the freeze stats tracking infrastructure we would need to drive a heuristic based on error rate. Attached is a series of patches which adds a ring buffer of vacuum freeze statistics to each table's stats (PgStat_StatTabEntry). It also introduces a guc: target_page_freeze_duration. This is the time in seconds which we would like pages to stay frozen for. The aim is to adjust opportunistic freeze behavior such that pages stay frozen for at least target_page_freeze_duration seconds. When a table is vacuumed the next entry is initialized in the ring buffer (PgStat_StatTabEntry->frz_buckets[frz_current]). If all of the buckets are in use, the two oldest buckets both ending either before or after now - target_page_freeze_duration are combined. When vacuum freezes a page, we increment the freeze counter in the current PgStat_Frz entry in the ring buffer and update the counters used to calculate the max, min, and average page age. When a frozen page is modified, we increment "unfreezes" in the bucket spanning the page freeze LSN. We also update the counters used to calculate the min, max, and average frozen duration. Because we have both the LSNs and time elapsed during the vacuum which froze the page, we can derive the approximate time at which the page was frozen (using linear interpolation) and use that to speculate how long it remained frozen. If we are unfreezing it sooner than target_page_freeze_duration, this is counted as an "early unfreeze". Using early unfreezes and page freezes, we can calculate an error rate. 
The GUC, opp_freeze_algo, remains to allow us to test different freeze heuristics during development. I included a dummy algorithm (algo 2) to demonstrate what we could do with the data. If the error rate is above a hard-coded threshold and the page is older than the average page age, it is frozen. This is missing an "aggressiveness" knob.

There are two workloads I think we can focus on. The first is a variant of our former workload I. The second workload, J, is the pgbench built-in simple-update workload.

I2. Work queue
  WL 1: 32 clients inserting a single row, updating an indexed column in another row twice, then deleting the last updated row

  pgbench scale 100
  WL 2: 2 clients running the TPC-B-like built-in workload, rate-limited to 10,000 TPS

J. simple-update
  pgbench scale 450
  WL 1: 16 clients running the built-in simple-update workload

The simple-update workload works well because it only updates pgbench_accounts and inserts into pgbench_
Re: Eager page freeze criteria clarification
On Thu, Oct 12, 2023 at 11:50 AM Melanie Plageman wrote:
> I was under the impression that we decided we still had to consider the number of clean pages dirtied as well as the number of pages unfrozen. The number of pages frozen and unfrozen over a time period gives us some idea of whether we are freezing the wrong pages -- but it doesn't tell us if we are freezing the right pages. A riff on an earlier example by Robert:
>
> While vacuuming a relation, we freeze 100 pages. During the same time period, we modify 1,000,000 previously clean pages. Of these 1,000,000 pages modified, 90 were frozen. So we unfroze 90% of the pages frozen during this time. Does this mean we should back off of trying to freeze any pages in the relation?

I didn't think we decided the thing your first sentence says you thought we decided ... but I also didn't think of this example. That said, it might be fine to back off freezing in this case because we weren't doing enough of it to make any real difference in the first place. Maybe there's a more moderate example where it feels like a bigger problem?

> > - "Number of pages frozen during vacuum" and "Number of pages unfrozen that were frozen during the same vacuum" provides numerator / denominator for an "error rate"
> >
> > - We can perform this attribution by comparing the page LSN with recorded start/end LSNs of recent vacuums
>
> While implementing a rough sketch of this, I realized I had a question about this.
>
> vacuum 1 starts at lsn 10 and ends at lsn 200. It froze 100 pages. vacuum 2 then starts at lsn 600. 5 frozen pages with page lsn > 10 and < 200 were updated. We count those in vacuum 1's stats. 3 frozen pages with page lsn > 200 and < 600 were updated. Do we count those somewhere?

How did those pages get frozen when no VACUUM was running? The LSN of the frozen page just prior to unfreezing it should be from the operation that froze it, which should be some VACUUM. 
I think the case you're talking about could happen if we did on-access freezing, but today I believe we don't. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
On Wed, Oct 11, 2023 at 8:43 PM Andres Freund wrote:
> Robert, Melanie and I spent an evening discussing this topic around pgconf.nyc. Here are, mildly revised, notes from that:

Thanks for taking notes!

> The main thing we are worried about is repeated freezing / unfreezing of pages within a relatively short time period.
>
> - Computing an average "modification distance" as I (Andres) proposed for each page is complicated / "fuzzy"
>
>   The main problem is that it's not clear how to come up with a good number for workloads that have many more inserts into new pages than modifications of existing pages.
>
>   It's also hard to use an average for this kind of thing, e.g. in cases where new pages are frequently updated, but also some old data is updated, it's easy for the updates to the old data to completely skew the average, even though that shouldn't prevent us from freezing.
>
> - We also discussed an idea by Robert to track the number of times we need to dirty a page when unfreezing and to compare that to the number of pages dirtied overall (IIRC), but I don't think we really came to a conclusion around that - and I didn't write down anything so this is purely from memory.

I was under the impression that we decided we still had to consider the number of clean pages dirtied as well as the number of pages unfrozen. The number of pages frozen and unfrozen over a time period gives us some idea of whether we are freezing the wrong pages -- but it doesn't tell us if we are freezing the right pages. A riff on an earlier example by Robert:

While vacuuming a relation, we freeze 100 pages. During the same time period, we modify 1,000,000 previously clean pages. Of these 1,000,000 pages modified, 90 were frozen. So we unfroze 90% of the pages frozen during this time. Does this mean we should back off of trying to freeze any pages in the relation?

> A rough sketch of a freezing heuristic: ... 
> - Attributing "unfreezes" to specific vacuums would be powerful: > > - "Number of pages frozen during vacuum" and "Number of pages unfrozen that > were frozen during the same vacuum" provides numerator / denominator for > an "error rate" > > - We can perform this attribution by comparing the page LSN with recorded > start/end LSNs of recent vacuums While implementing a rough sketch of this, I realized I had a question about this. vacuum 1 starts at lsn 10 and ends at lsn 200. It froze 100 pages. vacuum 2 then starts at lsn 600. 5 frozen pages with page lsn > 10 and < 200 were updated. We count those in vacuum 1's stats. 3 frozen pages with page lsn > 200 and < 600 were updated. Do we count those somewhere? > - This approach could provide "goals" for opportunistic freezing in a > somewhat understandable way. E.g. aiming to rarely unfreeze data that has > been frozen within 1h/1d/... Similar to the above question, if we are tracking pages frozen and unfrozen during a time period, if there are many vacuums in quick succession, we might care if a page was frozen by one vacuum and then unfrozen during a subsequent vacuum if not too much time has passed. - Melanie
Re: Eager page freeze criteria clarification
Thanks for these notes.

On Wed, Oct 11, 2023 at 8:43 PM Andres Freund wrote:
> - We also discussed an idea by Robert to track the number of times we need to dirty a page when unfreezing and to compare that to the number of pages dirtied overall (IIRC), but I don't think we really came to a conclusion around that - and I didn't write down anything so this is purely from memory.

See http://postgr.es/m/ca+tgmoycwisxlbl-pxu13oevthloxm20ojqjnrztkkhxsy9...@mail.gmail.com for my on-list discussion of this.

> - Attributing "unfreezes" to specific vacuums would be powerful:
>
>   - "Number of pages frozen during vacuum" and "Number of pages unfrozen that were frozen during the same vacuum" provides numerator / denominator for an "error rate"

I want to highlight the denominator issue here. I think we all have the intuition that if we count the number of times that a recently-frozen page gets unfrozen, and that's a big number, that's bad, and a sign that we need to freeze less aggressively. But a lot of the struggle has been around answering the question "big compared to what". A lot of the obvious candidates fail to behave nicely in corner cases, as discussed in the above email. I think this is one of the better candidates so far proposed, possibly the best.

> - This approach could provide "goals" for opportunistic freezing in a somewhat understandable way. E.g. aiming to rarely unfreeze data that has been frozen within 1h/1d/...

This strikes me as another important point. Making the behavior understandable to users is going to be important, because sometimes whatever system we might craft will misbehave, and then people are going to need to be able to understand why it's misbehaving and how to tune/fix it so it works.

> Around this point my laptop unfortunately ran out of battery. Possibly the attendees of this mini summit also ran out of steam (and tea).

When the tea is gone, there's little point in continuing. 
> I likely mangled this substantially, both when taking notes during the lively > discussion, and when revising them to make them a bit more readable. I think it's quite a good summary, actually. Thanks! -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
Hi,

Robert, Melanie and I spent an evening discussing this topic around pgconf.nyc. Here are, mildly revised, notes from that:

First a few random points that didn't fit with the sketch of an approach below:

- Are unlogged tables a problem for using LSN based heuristics for freezing?

  We concluded, no, not a problem, because aggressively freezing does not increase overhead meaningfully, as we would already dirty both the heap and VM page to set the all-visible flag.

- "Unfreezing" pages that were frozen hours / days ago isn't too bad and can be desirable. The main thing we are worried about is repeated freezing / unfreezing of pages within a relatively short time period.

- Computing an average "modification distance" as I (Andres) proposed for each page is complicated / "fuzzy"

  The main problem is that it's not clear how to come up with a good number for workloads that have many more inserts into new pages than modifications of existing pages.

  It's also hard to use an average for this kind of thing, e.g. in cases where new pages are frequently updated, but also some old data is updated, it's easy for the updates to the old data to completely skew the average, even though that shouldn't prevent us from freezing.

- We also discussed an idea by Robert to track the number of times we need to dirty a page when unfreezing and to compare that to the number of pages dirtied overall (IIRC), but I don't think we really came to a conclusion around that - and I didn't write down anything so this is purely from memory.

A rough sketch of a freezing heuristic:

- We concluded that to intelligently control opportunistic freezing we need statistics about the number of freezes and unfreezes

- We should track page freezes / unfreezes in shared memory stats on a per-relation basis

- To use such statistics to control heuristics, we need to turn them into rates. For that we need to keep snapshots of absolute values at certain times (when vacuuming), allowing us to compute a rate.
- If we snapshot some stats, we need to limit the amount of data that they occupy

  - evict based on wall clock time (we don't care about unfreezing pages frozen a month ago)

  - "thin out" data when exceeding the limited amount of stats per relation, using random sampling or such

  - need a smarter approach than just keeping the N last vacuums, as there are situations where a table is (auto-) vacuumed at a high frequency

- only looking at recent-ish table stats is fine, because we

  - a) don't want to look at too old data, as we need to deal with changing workloads

  - b) if there aren't recent vacuums, falsely freezing is of bounded cost

- shared memory stats being lost on crash-restart/failover might be a problem

  - we certainly don't want to immediately store these stats in a table, due to the xid consumption that'd imply

- Attributing "unfreezes" to specific vacuums would be powerful:

  - "Number of pages frozen during vacuum" and "Number of pages unfrozen that were frozen during the same vacuum" provides a numerator / denominator for an "error rate"

  - We can perform this attribution by comparing the page LSN with recorded start/end LSNs of recent vacuums

  - If the freezing error rate of recent vacuums is low, freeze more aggressively. This is important to deal with insert-mostly workloads.

- If old data is "unfrozen", that's fine, we can ignore such unfreezes when controlling "freezing aggressiveness"

  - Ignoring unfreezing of old pages is important to e.g. deal with workloads that delete old data

- This approach could provide "goals" for opportunistic freezing in a somewhat understandable way. E.g. aiming to rarely unfreeze data that has been frozen within 1h/1d/...

Around this point my laptop unfortunately ran out of battery. Possibly the attendees of this mini summit also ran out of steam (and tea).

We had a few "disagreements" or "unresolved issues":

- How aggressive should we be when we have no stats?
- Should the freezing heuristic take into account whether freezing would require an FPI? Or whether the page was not in s_b, or ...

I likely mangled this substantially, both when taking notes during the lively discussion, and when revising them to make them a bit more readable.

Greetings,

Andres Freund
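As a rough illustration of the numerator/denominator idea in the notes above, here is a sketch (thresholds and names are invented; nothing here is from an actual patch) of turning per-vacuum freeze/unfreeze counts into an error rate that gates freezing aggressiveness. Per the notes, unfreezes of long-frozen pages are assumed to have been filtered out by the caller, so only "early" unfreezes count against us.

```python
def freeze_error_rate(pages_frozen, early_unfreezes):
    """Fraction of frozen pages that were unfrozen "too soon".  With no
    recent freezes there is nothing to count against, so return 0."""
    if pages_frozen == 0:
        return 0.0
    return early_unfreezes / pages_frozen


def freeze_more_aggressively(recent_vacuums, threshold=0.1):
    """recent_vacuums: list of (pages_frozen, early_unfreezes) pairs, one
    per recent vacuum.  Freeze more aggressively while the aggregate error
    rate stays below an (arbitrary, illustrative) threshold."""
    total_frozen = sum(f for f, _ in recent_vacuums)
    total_early = sum(e for _, e in recent_vacuums)
    return freeze_error_rate(total_frozen, total_early) < threshold


# 1000 pages frozen across two recent vacuums, 50 unfrozen early: a 5%
# error rate, so keep freezing eagerly.
freeze_more_aggressively([(600, 20), (400, 30)])
```

Note this deliberately makes the no-stats case (zero freezes) look like a zero error rate, i.e. maximally aggressive, which is exactly the unresolved "how aggressive should we be when we have no stats?" question.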
Re: Eager page freeze criteria clarification
On Mon, Oct 2, 2023 at 4:25 PM Robert Haas wrote: > On Mon, Oct 2, 2023 at 11:37 AM Peter Geoghegan wrote: > > If no vacuuming against pgbench_accounts is strictly better than some > > vacuuming (unless it's just to advance relfrozenxid, which can't be > > avoided), then to what extent is Melanie's freezing patch obligated to > > not make the situation worse? I'm not saying that it doesn't matter at > > all; just that there needs to be a sense of proportion. > > I agree. I think Melanie's previous test results showed no measurable > performance regression but a significant regression in WAL volume. I > don't remember how much of a regression it was, but it was enough to > make me think that the extra WAL volume could probably be turned into > a performance loss by adjusting the test scenario (a bit less > hardware, or a bandwidth-constrained standby connection, etc.). I > think I'd be OK to accept a small amount of additional WAL, though. Do > you see it differently? That's pretty much how I see it too. > I wonder how much useless work we're doing in this area today. I mean, > most pruning in that workload gets done by HOT rather than by vacuum, > because vacuum simply can't keep up, but I don't think it's worthless > work: if it wasn't done in the background by VACUUM, it would happen > in the foreground anyway very soon after. Yeah, but it probably wouldn't be as efficient, since only VACUUM uses a somewhat older OldestXmin/prune cutoff. As much as anything else, I'm arguing that we should treat the general problem of useless work as relevant in the context of the performance validation work/benchmarks. > I don't have a good feeling > for how much index cleanup we end up doing in a pgbench workload, but > it seems to me that if we don't do index cleanup at least now and > then, we'll lose the ability to recycle line pointers in the wake of > non-HOT updates, and that seems bad. 
Maybe that's rare in pgbench > specifically, or maybe it isn't, but it seems like you'd only have to > change the workload a tiny bit for that to become a real problem. There's no question that pruning that's required to go ahead in order for index cleanup to take place isn't really ever something that we'd want to skip. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Mon, Oct 2, 2023 at 11:37 AM Peter Geoghegan wrote: > If no vacuuming against pgbench_accounts is strictly better than some > vacuuming (unless it's just to advance relfrozenxid, which can't be > avoided), then to what extent is Melanie's freezing patch obligated to > not make the situation worse? I'm not saying that it doesn't matter at > all; just that there needs to be a sense of proportion. I agree. I think Melanie's previous test results showed no measurable performance regression but a significant regression in WAL volume. I don't remember how much of a regression it was, but it was enough to make me think that the extra WAL volume could probably be turned into a performance loss by adjusting the test scenario (a bit less hardware, or a bandwidth-constrained standby connection, etc.). I think I'd be OK to accept a small amount of additional WAL, though. Do you see it differently? > Whatever your priorities might be, workload-wise, it could still be > useful to recognize useless freezing as part of a broader problem of > useless (or at least dubious) work performed by VACUUM -- especially > with pgbench-like workloads. The utility of freezing is a lot easier > to measure than the utility of pruning, but why should we assume that > pruning isn't already just as much of a problem? (Maybe that's not a > problem that particularly interests you right now; I'm bringing it up > because it seems possible that putting it in scope could somehow > clarify what to do about freezing.) I wonder how much useless work we're doing in this area today. I mean, most pruning in that workload gets done by HOT rather than by vacuum, because vacuum simply can't keep up, but I don't think it's worthless work: if it wasn't done in the background by VACUUM, it would happen in the foreground anyway very soon after. 
I don't have a good feeling for how much index cleanup we end up doing in a pgbench workload, but it seems to me that if we don't do index cleanup at least now and then, we'll lose the ability to recycle line pointers in the wake of non-HOT updates, and that seems bad. Maybe that's rare in pgbench specifically, or maybe it isn't, but it seems like you'd only have to change the workload a tiny bit for that to become a real problem. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
On Mon, Oct 2, 2023 at 6:26 AM Robert Haas wrote: > I think it's true that the fact that pgbench does what pgbench does > makes us think more about that workload than about some other, equally > plausible workload. It's the test we have, so we end up running it a > lot. If we had some other test, we'd run that one. I find pgbench very useful as a stress-test. It usually represents the most extreme possible adversarial case, which puts certain patches in context, helping to paint a complete picture. > As for non-uniform access, that is real and does matter, but there are > certainly installations with tables where no rows survive long enough > to need freezing, either because the table is regularly emptied, or > just because the update load is high enough to hit all the rows fairly > quickly. That's true, but I think that they're all queue-like tables. Not large tables where there just isn't any chance of any row/page not being updating after a fairly short period of time. > Maybe I'm misunderstanding your point here, in which case all of the > above may be irrelevant. But my feeling is that we can't simply ignore > cases where all/many rows are short-lived and say, well, those are > unrealistic, so let's just freeze everything super-aggressively and > that should be fine. I don't think that's OK. We can (and I think we > should) treat that situation as a special case rather than as the > typical case, but I think it would be a bad idea to dismiss it as a > case that we don't need to worry about at all. I think that my point about pgbench being generally unrepresentative might have overshadowed the more important point about the VACUUM requirements of pgbench_accounts in particular. Those are particularly unrealistic. If no vacuuming against pgbench_accounts is strictly better than some vacuuming (unless it's just to advance relfrozenxid, which can't be avoided), then to what extent is Melanie's freezing patch obligated to not make the situation worse? 
I'm not saying that it doesn't matter at all; just that there needs to be a sense of proportion. If even pruning (by VACUUM) is kinda, sorta, useless, then it seems like we might need to revisit basic assumptions. (Or maybe the problem itself can just be avoided for the most part; whatever works.) I have a patch that provides a simple way of tracking whether each PRUNE record was generated by VACUUM or by opportunistic pruning (also something Melanie is involved with): https://commitfest.postgresql.org/44/4288/ Although it's really hard (maybe impossible) to track right now, I've noticed that the vast majority of all PRUNE records are from opportunistic vacuuming with just about every write-heavy benchmark, including pgbench and TPC-C. And that's just going on the raw count of each type of record, not the utility provided by each pruning operation. Admittedly "utility" is a very hard thing to measure here, but it might still matter. VACUUM might be doing some pruning, but only pruning targeting pages that never particularly "need to be pruned" by VACUUM itself (since pruning could happen whenever via opportunistic pruning anyway). Whatever your priorities might be, workload-wise, it could still be useful to recognize useless freezing as part of a broader problem of useless (or at least dubious) work performed by VACUUM -- especially with pgbench-like workloads. The utility of freezing is a lot easier to measure than the utility of pruning, but why should we assume that pruning isn't already just as much of a problem? (Maybe that's not a problem that particularly interests you right now; I'm bringing it up because it seems possible that putting it in scope could somehow clarify what to do about freezing.) -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Fri, Sep 29, 2023 at 8:50 PM Peter Geoghegan wrote: > While pgbench makes a fine stress-test, for the most part its workload > is highly unrealistic. And yet we seem to think that it's just about > the most important benchmark of all. If we're not willing to get over > even small regressions in pgbench, I fear we'll never make significant > progress in this area. It's at least partly a cultural problem IMV. I think it's true that the fact that pgbench does what pgbench does makes us think more about that workload than about some other, equally plausible workload. It's the test we have, so we end up running it a lot. If we had some other test, we'd run that one. But I don't think I buy that it's "highly unrealistic." It is true that pgbench with default options is limited only by the speed of the database, while real update-heavy workloads are typically limited by something external, like the number of users hitting the web site that the database is backing. It's also true that real workloads tend to involve some level of non-uniform access. But I'm not sure that either of those things really matter that much in the end. The problem I have with the external rate-limiting argument is that it ignores hardware selection, architectural decision-making, and workload growth. Sure, people are unlikely to stand up a database that can do 10,000 transactions per second and hit it with a workload that requires doing 20,000 transactions per second, because they're going to find out in testing that it doesn't work. Then they will either buy more hardware or rearchitect the system to reduce the required number of transactions per second or give up on using PostgreSQL. So when they do put it into production, it's unlikely to be overloaded on day one. But that's just because all of the systems that would have been overloaded on day one never make it to day one. They get killed off or changed before they get there. 
So it isn't like higher maximum throughput wouldn't have been beneficial. And over time, systems that initially started out not running maxed out can and, in my experience, fairly often do end up maxed out because once you've got the thing in production, it's hard to change anything, and load often does grow over time. As for non-uniform access, that is real and does matter, but there are certainly installations with tables where no rows survive long enough to need freezing, either because the table is regularly emptied, or just because the update load is high enough to hit all the rows fairly quickly. Maybe I'm misunderstanding your point here, in which case all of the above may be irrelevant. But my feeling is that we can't simply ignore cases where all/many rows are short-lived and say, well, those are unrealistic, so let's just freeze everything super-aggressively and that should be fine. I don't think that's OK. We can (and I think we should) treat that situation as a special case rather than as the typical case, but I think it would be a bad idea to dismiss it as a case that we don't need to worry about at all. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
On Fri, Sep 29, 2023 at 11:27 AM Robert Haas wrote:
> I think that's true. For me, the issue is what a user is practically likely to notice and care about. I submit that on a not-particularly-busy system, it would probably be fine to freeze aggressively in almost every situation, because you're only incurring costs you can afford to pay. On a busy system, it's more important to be right, or at least not too badly wrong. But even on a busy system, I think that when the time between data being written and being frozen is more than a few tens of minutes, it's very doubtful that anyone is going to notice the contribution that freezing makes to the overall workload. They're much more likely to notice an annoying autovacuum than they are to notice a bit of excess freezing that ends up getting reversed. But when you start cranking the time between writing data and freezing it down into the single-digit numbers of minutes, and even more if you push down to tens of seconds or less, now I think people are going to care more about useless freezing work than about long-term autovacuum risks. Because now their database is really busy so they care a lot about performance, and seemingly most of the data involved is ephemeral anyway.

I think that that's all true. As I believe I pointed out at least once in the past, my thinking about the practical requirements from users was influenced by this paper/talk:

https://www.microsoft.com/en-us/research/video/cost-performance-in-modern-data-stores-how-data-cashing-systems-succeed/

For the types of applications that Postgres is a natural fit for, most of the data is cold -- very cold ("freezing" cold, even). If you have a 50GB pgbench_accounts table, Postgres will always perform suboptimally. But so will SQL Server, Oracle, and MySQL. While pgbench makes a fine stress-test, for the most part its workload is highly unrealistic. 
And yet we seem to think that it's just about the most important benchmark of all. If we're not willing to get over even small regressions in pgbench, I fear we'll never make significant progress in this area. It's at least partly a cultural problem IMV.

The optimal freezing strategy for pgbench_accounts is to never freeze, even once. It doesn't matter if you vary the distributions of the accounts table updates -- it's still inevitable that each and every row will get updated before too long. In fact, it's not just freezing that should always be avoided here -- the same applies to pruning by VACUUM.

As I suggested in my last email, the lesson offered by the pgbench_accounts table seems to be "never VACUUM at all, except perhaps to advance relfrozenxid" (which shouldn't actually require freezing even one page). If you haven't tuned heap fill factor, then you might want to VACUUM a bit, at first. But, overall, vacuuming is bad. That is the logical though absurd conclusion. It completely flies in the face of practical experience.

-- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Fri, Sep 29, 2023 at 11:27 AM Robert Haas wrote: > > Even if you're willing to assume that vacuum_freeze_min_age isn't just > > an arbitrary threshold, this still seems wrong. vacuum_freeze_min_age > > is applied by VACUUM, at the point that it scans pages. If VACUUM were > > infinitely fast, and new VACUUMs were launched constantly, then > > vacuum_freeze_min_age (and this bucketing scheme) might make more > > sense. But, you know, they're not. So whether or not VACUUM (with > > Andres' algorithm) deems a page that it has frozen to have been > > opportunistically frozen or not is greatly influenced by factors that > > couldn't possibly be relevant. > > I'm not totally sure that I'm understanding what you're concerned > about here, but I *think* that the issue you're worried about here is: > if we have various rules that can cause freezing, let's say X Y and Z, > and we adjust the aggressiveness of rule X based on the performance of > rule Y, that would be stupid and might suck. Your summary is pretty close. There are a couple of specific nuances to it, though: 1. Anything that uses XID age or even LSN age necessarily depends on when VACUUM shows up, which itself depends on many other random things. With small to medium sized tables that don't really grow, it's perhaps reasonable to expect this to not matter. But with tables like the TPC-C order/order lines table, or even pgbench_history, the next VACUUM operation will reliably be significantly longer and more expensive than the last one, forever (ignoring the influence of aggressive mode, and assuming typical autovacuum settings). So VACUUMs get bigger and less frequent as the table grows. As the table continues to grow, at some point we reach a stage where many XIDs encountered by VACUUM will be significantly older than vacuum_freeze_min_age, while others will be significantly younger. 
And so whether we apply the vacuum_freeze_min_age rule (or some other age based rule) is increasingly a matter of random happenstance (i.e. is more and more due to when VACUUM happens to show up), and has less and less to do with what the workload signals we should do. This is a moving target, but (if I'm not mistaken) under the scheme described by Andres we're not even trying to compensate for that. Separately, I have a practical concern: 2. It'll be very hard to independently track the effectiveness of rules X, Y, and Z as a practical matter, because the application of each rule quite naturally influences the application of every other rule over time. They simply aren't independent things in any practical sense. Even if this wasn't an issue, I can't think of a reasonable cost model. Is it good or bad if "opportunistic freezing" results in unfreezing 50% of the time? AFAICT that's an *extremely* complicated question. You cannot just interpolate from the 0% case (definitely good) and the 100% case (definitely bad) and expect to get a sensible answer. You can't split the difference -- even if we allow ourselves to ignore tricky value judgement type questions. > Assuming that the previous sentence is a correct framing, let's take X > to be "freezing based on the page LSN age" and Y to be "freezing based > on vacuum_freeze_min_age". I think the problem scenario here would be > if it turns out that, under some set of circumstances, Y freezes more > aggressively than X. For example, suppose the user runs VACUUM FREEZE, > effectively setting vacuum_freeze_min_age=0 for that operation. If the > table is being modified at all, it's likely to suffer a bunch of > unfreezing right afterward, which could cause us to decide to make > future vacuums freeze less aggressively. That's not necessarily what > we want, because evidently the user, at least at that moment in time, > thought that previous freezing hadn't been aggressive enough. 
They > might be surprised to find that flash-freezing the table inhibited > future automatic freezing. I didn't think of that one myself, but it's a great example. > Or suppose that they just have a very high XID consumption rate > compared to the rate of modifications to this particular table, such > that criteria related to vacuum_freeze_min_age tend to be satisfied a > lot, and thus vacuums tend to freeze a lot no matter what the page LSN > age is. This scenario actually doesn't seem like a problem, though. In > this case the freezing criterion based on page LSN age is already not > getting used, so it doesn't really matter whether we tune it up or > down or whatever. It would have to be a smaller table, which I'm relatively unconcerned about. > > Okay then. I guess it's more accurate to say that we'll have a strong > > bias in the direction of freezing when an FPI won't result, though not > > an infinitely strong bias. We'll at least have something that can be > > thought of as an improved version of the FPI thing for 17, I think -- > > which is definitely significant progress. > > I do kind of wonder whether we're going to care about
Re: Eager page freeze criteria clarification
On Fri, Sep 29, 2023 at 11:57 AM Peter Geoghegan wrote: > Assuming your concern is more or less limited to those cases where the > same page could be frozen an unbounded number of times (once or almost > once per VACUUM), then I think we fully agree. We ought to converge on > the right behavior over time, but it's particularly important that we > never converge on the wrong behavior instead. I think that more or less matches my current thinking on the subject. A caveat might be: If it were once per two vacuums rather than once per vacuum, that might still be an issue. But I agree with the idea that the case that matters is *repeated* wasteful freezing. I don't think freezing is expensive enough that individual instances of mistaken freezing are worth getting too stressed about, but as you say, the overall pattern does matter. > The TPC-C scenario is partly interesting because it isn't actually > obvious what the most desirable behavior is, even assuming that you > had perfect information, and were not subject to practical > considerations about the complexity of your algorithm. There doesn't > seem to be perfect clarity on what the goal should actually be in such > scenarios -- it's not like the problem is just that we can't agree on > the best way to accomplish those goals with this specific workload. > > If performance/efficiency and performance stability are directly in > tension (as they sometimes are), how much do you want to prioritize > one or the other? It's not an easy question to answer. It's a value > judgement as much as anything else. I think that's true. For me, the issue is what a user is practically likely to notice and care about. I submit that on a not-particularly-busy system, it would probably be fine to freeze aggressively in almost every situation, because you're only incurring costs you can afford to pay. On a busy system, it's more important to be right, or at least not too badly wrong. 
But even on a busy system, I think that when the time between data being written and being frozen is more than a few tens of minutes, it's very doubtful that anyone is going to notice the contribution that freezing makes to the overall workload. They're much more likely to notice an annoying autovacuum than they are to notice a bit of excess freezing that ends up getting reversed. But when you start cranking the time between writing data and freezing it down into the single-digit numbers of minutes, and even more if you push down to tens of seconds or less, now I think people are going to care more about useless freezing work than about long-term autovacuum risks. Because now their database is really busy so they care a lot about performance, and seemingly most of the data involved is ephemeral anyway. > Even if you're willing to assume that vacuum_freeze_min_age isn't just > an arbitrary threshold, this still seems wrong. vacuum_freeze_min_age > is applied by VACUUM, at the point that it scans pages. If VACUUM were > infinitely fast, and new VACUUMs were launched constantly, then > vacuum_freeze_min_age (and this bucketing scheme) might make more > sense. But, you know, they're not. So whether or not VACUUM (with > Andres' algorithm) deems a page that it has frozen to have been > opportunistically frozen or not is greatly influenced by factors that > couldn't possibly be relevant. I'm not totally sure that I'm understanding what you're concerned about here, but I *think* that the issue you're worried about here is: if we have various rules that can cause freezing, let's say X, Y, and Z, and we adjust the aggressiveness of rule X based on the performance of rule Y, that would be stupid and might suck. Assuming that the previous sentence is a correct framing, let's take X to be "freezing based on the page LSN age" and Y to be "freezing based on vacuum_freeze_min_age". 
I think the problem scenario here would be if it turns out that, under some set of circumstances, Y freezes more aggressively than X. For example, suppose the user runs VACUUM FREEZE, effectively setting vacuum_freeze_min_age=0 for that operation. If the table is being modified at all, it's likely to suffer a bunch of unfreezing right afterward, which could cause us to decide to make future vacuums freeze less aggressively. That's not necessarily what we want, because evidently the user, at least at that moment in time, thought that previous freezing hadn't been aggressive enough. They might be surprised to find that flash-freezing the table inhibited future automatic freezing. Or suppose that they just have a very high XID consumption rate compared to the rate of modifications to this particular table, such that criteria related to vacuum_freeze_min_age tend to be satisfied a lot, and thus vacuums tend to freeze a lot no matter what the page LSN age is. This scenario actually doesn't seem like a problem, though. In this case the freezing criterion based on page LSN age is already not getting used, so it doesn't really matter wheth
Re: Eager page freeze criteria clarification
On Fri, Sep 29, 2023 at 7:55 AM Robert Haas wrote: > On Thu, Sep 28, 2023 at 12:03 AM Peter Geoghegan wrote: > > But isn't the main problem *not* freezing when we could and > > should have? (Of course the cost of freezing is very relevant, but > > it's still secondary.) > > Perhaps this is all in how you look at it, but I don't see it this > way. It's easy to see how to solve the "not freezing" problem: just > freeze everything as often as possible. If that were cheap enough that > we could just do it, then we'd just do it and be done here. The > problem is that, at least in my opinion, that seems too expensive in > some cases. I'm starting to believe that those cases are narrower than > I once thought, but I think they do exist. So now, I'm thinking that > maybe the main problem is identifying when you've got such a case, so > that you know when you need to be less aggressive. Assuming your concern is more or less limited to those cases where the same page could be frozen an unbounded number of times (once or almost once per VACUUM), then I think we fully agree. We ought to converge on the right behavior over time, but it's particularly important that we never converge on the wrong behavior instead. > > Won't the algorithm that you've sketched always think that > > "unfreezing" pages doesn't affect recently frozen pages with such a > > workload? Isn't the definition of "recently frozen" that emerges from > > this algorithm not in any way related to the order delivery time, or > > anything like that? You know, rather like vacuum_freeze_min_age. > > FWIW, I agree that vacuum_freeze_min_age sucks. I have been reluctant > to endorse changes in this area mostly because I fear replacing one > bad idea with another, not because I think that what we have now is > particularly good. It's better to be wrong in the same way in every > release than to have every release be equally wrong but in a different > way. 
> > Also, I think the question of what "recently frozen" means is a good > one, but I'm not convinced that it ought to relate to the order > delivery time. I don't think it should, either. The TPC-C scenario is partly interesting because it isn't actually obvious what the most desirable behavior is, even assuming that you had perfect information, and were not subject to practical considerations about the complexity of your algorithm. There doesn't seem to be perfect clarity on what the goal should actually be in such scenarios -- it's not like the problem is just that we can't agree on the best way to accomplish those goals with this specific workload. If performance/efficiency and performance stability are directly in tension (as they sometimes are), how much do you want to prioritize one or the other? It's not an easy question to answer. It's a value judgement as much as anything else. > If we insert into a table and 12-14 hours go by before > it's updated, it doesn't seem particularly bad to me if we froze that > data meanwhile (regardless of what metric drove that freezing). Same > thing if it's 2-4 hours. What seems bad to me is if we're constantly > updating the table and vacuum comes sweeping through and freezing > everything to no purpose over and over again and then it gets > un-frozen a few seconds or minutes later. Right -- we agree here. I even think that it makes sense to freeze pages knowing for sure that the pages will be unfrozen on that sort of timeline (at least with a large and ever growing table like this). It may technically be less efficient, but not when you consider how everything else is affected by the disruptive impact of freezing a great deal of stuff all at once. (Of course it's also true that we don't really know what will happen, which is all the more reason to freeze eagerly.) > Now maybe that's the wrong idea. 
After all, as a percentage, the > overhead is the same either way, regardless of whether we're talking > about WAL volume or CPU cycles. But somehow it feels worse to make the > same mistakes every few minutes or potentially even tens of seconds > than it does to make the same mistakes every few hours. The absolute > cost is a lot higher. I agree. Another problem with the algorithm that Andres sketched is that it supposes that vacuum_freeze_min_age means something relevant -- that's how we decide whether or not freezing should count as "opportunistic". But there really isn't that much difference between (say) an XID age of 25 million and 50 million. At least not with a table like the TPC-C tables, where VACUUMs are naturally big operations that take place relatively infrequently. Assume a default vacuum_freeze_min_age of 50 million. How can it make sense to deem freezing a page "opportunistic" when its oldest XID has only attained an age of 25 million, if the subsequent unfreezing happens when that same XID would have attained an age of 75 million, had we not frozen it? And if you agree that it doesn't make sense, how can we compensate for this effect, as
Re: Eager page freeze criteria clarification
On Thu, Sep 28, 2023 at 12:03 AM Peter Geoghegan wrote: > But isn't the main problem *not* freezing when we could and > should have? (Of course the cost of freezing is very relevant, but > it's still secondary.) Perhaps this is all in how you look at it, but I don't see it this way. It's easy to see how to solve the "not freezing" problem: just freeze everything as often as possible. If that were cheap enough that we could just do it, then we'd just do it and be done here. The problem is that, at least in my opinion, that seems too expensive in some cases. I'm starting to believe that those cases are narrower than I once thought, but I think they do exist. So now, I'm thinking that maybe the main problem is identifying when you've got such a case, so that you know when you need to be less aggressive. > Won't the algorithm that you've sketched always think that > "unfreezing" pages doesn't affect recently frozen pages with such a > workload? Isn't the definition of "recently frozen" that emerges from > this algorithm not in any way related to the order delivery time, or > anything like that? You know, rather like vacuum_freeze_min_age. FWIW, I agree that vacuum_freeze_min_age sucks. I have been reluctant to endorse changes in this area mostly because I fear replacing one bad idea with another, not because I think that what we have now is particularly good. It's better to be wrong in the same way in every release than to have every release be equally wrong but in a different way. Also, I think the question of what "recently frozen" means is a good one, but I'm not convinced that it ought to relate to the order delivery time. If we insert into a table and 12-14 hours go by before it's updated, it doesn't seem particularly bad to me if we froze that data meanwhile (regardless of what metric drove that freezing). Same thing if it's 2-4 hours. 
What seems bad to me is if we're constantly updating the table and vacuum comes sweeping through and freezing everything to no purpose over and over again and then it gets un-frozen a few seconds or minutes later. Now maybe that's the wrong idea. After all, as a percentage, the overhead is the same either way, regardless of whether we're talking about WAL volume or CPU cycles. But somehow it feels worse to make the same mistakes every few minutes or potentially even tens of seconds than it does to make the same mistakes every few hours. The absolute cost is a lot higher. > On a positive note, I like that what you've laid out freezes eagerly > when an FPI won't result -- this much we can all agree on. I guess > that that part is becoming uncontroversial. I don't think that we're going to be able to get away with freezing rows in a small, frequently-updated table just because no FPI will result. I think Melanie's results show that the cost is not negligible. But Andres's pseudocode algorithm, although it is more aggressive in that case, doesn't necessarily seem bad to me, because it still has some ability to hold off freezing in such cases if our statistics show that it isn't working out. -- Robert Haas EDB: http://www.enterprisedb.com
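The per-table "statistics" Robert alludes to here could take roughly the shape described elsewhere in the thread: back off opportunistic freezing only when un-freezing is both a significant expense (a noticeable fraction of foreground page modifications hit a recently frozen page) and ineffective (most frozen pages don't stay frozen). A hedged sketch, with all counter names and thresholds invented for illustration rather than taken from the patch:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy per-table counters; nothing here is actual PostgreSQL code. We only
 * back off when un-freezing is both expensive (meaningful share of page
 * modifications) and ineffective (most frozen pages get unfrozen again).
 */
static bool
back_off_freezing(long page_modifications, long recent_unfreezes,
                  long pages_frozen)
{
    double cost;
    double waste;

    if (page_modifications == 0 || pages_frozen == 0)
        return false;

    /* share of foreground page modifications that hit a recent freeze */
    cost = (double) recent_unfreezes / page_modifications;
    /* share of frozen pages that failed to stay frozen */
    waste = (double) recent_unfreezes / pages_frozen;

    /* thresholds are arbitrary illustrative values */
    return cost > 0.01 && waste > 0.5;
}
```

With the million-page example from earlier in the thread (a million modifications, nine recent unfreezes), the cost term is tiny, so no backing off is indicated even though 90% of the frozen pages encountered were recently frozen.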
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 7:09 PM Melanie Plageman wrote: > At the risk of seeming too execution-focused, I want to try and get more > specific. Here is a description of an example implementation to test my > understanding: > > In table-level stats, save two numbers: younger_than_cpt/older_than_cpt > storing the number of instances of unfreezing a page which is either > younger or older than the start of the most recent checkpoint at the > time of its unfreezing When I first read this, I thought this sounded right, except for the naming which doesn't mention freezing, but on further thought, this feels like it might not be the right thing. Why do I care how many super-old pages (regardless of how exactly we define super-old) I unfroze? I think what I care about is more like: how often do I have to unfreeze a recently-frozen page in order to complete a DML operation? For example, consider a table with a million pages, 10 of which are frozen. 1 was frozen a long time ago, 9 were frozen recently, and the other 999,990 are unfrozen. I update every page in the table. Well, now older_than_cpt = 1 and younger_than_cpt = 9, so the ratio of the two is terrible: 90% of the frozen pages I encountered were recently-frozen. But that's irrelevant. What matters is that if I had frozen less aggressively, I could have saved myself 9 unfreeze operations across a million page modifications. So actually I'm doing great: only 0.0009% of my page modifications required an unfreeze that I maybe should have avoided and didn't. Likewise, if all of the pages had been frozen, but only 9 were frozen recently, I'm doing exactly equally great. But if the number of recently frozen pages were higher, then I'd be doing less great. So I think comparing the number of young pages unfrozen to the number of old pages unfrozen is the wrong test, and comparing it instead to the number of pages modified is a better test. But I don't think it's completely right, either. 
Imagine a table with 100 recently frozen pages. I modify the table and unfreeze one of them. So, 100% of the pages that I unfroze were recently-frozen. Does this mean that we should back off and freeze less aggressively? Not really. It seems like I actually got this case 99% right. So another idea could be to look at what percentage of frozen pages I end up un-freezing shortly thereafter. That gets this example right. But it gets the earlier example with a million pages wrong. Maybe there's something to combining these ideas. If the percentage of page modifications (of clean pages, say, so that we don't skew the stats as much when a page is modified many times in a row) that have to un-freeze a recently-frozen page is low, then that means there's no real performance penalty for whatever aggressive freezing we're doing. Even if we un-freeze a zillion pages, if that happens over the course of modifying 100 zillion pages, it's not material. On the flip side, if the percentage of frozen pages that get unfrozen soon thereafter is low, then it feels worth doing even if most page modifications end up un-freezing a recently-frozen page, because on net we're still ending up with a lot of extra stuff frozen. Said differently, we're freezing too aggressively if un-freezing is both adding a significant expense (i.e. the foreground work we're doing often requires un-freezing pages) and ineffective (i.e. we don't end up with frozen pages left over). If it has one of those problems, it's still OK, but if it has both, we need to back off. Maybe that's still not quite right, but it's the best I've got this morning. > You would likely want to combine it with one of the other heuristics we > discussed. 
> > For example: > For a table with only 20% younger unfreezings, when vacuuming that page, > > if insert LSN - RedoRecPtr < insert LSN - page LSN > page is older than the most recent checkpoint start, so freeze it > regardless of whether or not it would emit an FPI > > What aggressiveness levels should there be? What should change at each > level? What criteria should pages have to meet to be subject to the > aggressiveness level? My guess is that, when we're being aggressive, we should be aiming to freeze all pages that can be completely frozen with no additional criterion. Any additional criterion that we add here is likely to mess up the insert-only table case, and I'm not really sure what it saves us from. On the other hand, when we're being non-aggressive, we might still sometimes want to freeze in situations where we historically haven't. Consider the following three examples: 1. pgbench_accounts table, standard run 2. pgbench_accounts table, highly skewed run 3. slowly growing table with a hot tail. records are inserted, updated heavily for a while, thereafter updated only very occasionally. Aggressive freezing could misfire in all of these cases, because all of them have some portion of the table where rows are being updated very frequently. But in the second and third cases, the
Re: Eager page freeze criteria clarification
On Fri, Sep 8, 2023 at 12:07 AM Andres Freund wrote: > > Hi, > > On 2023-09-06 10:35:17 -0400, Robert Haas wrote: > > On Wed, Sep 6, 2023 at 1:09 AM Andres Freund wrote: > > > Yea, it'd be useful to have a reasonably approximate wall clock time for > > > the > > > last modification of a page. We just don't have infrastructure for > > > determining > > > that. We'd need an LSN->time mapping (xid->time wouldn't be particularly > > > useful, I think). > > > > > > A very rough approximate modification time can be computed by assuming an > > > even > > > rate of WAL generation, and using the LSN at the time of the last vacuum > > > and > > > the time of the last vacuum, to compute the approximate age. > > > > > > For a while I thought that'd not give us anything that just using LSNs > > > gives > > > us, but I think it might allow coming up with a better cutoff logic: > > > Instead > > > of using a cutoff like "page LSN is older than 10% of the LSNs since the > > > last > > > vacuum of the table", it would allow us to approximate "page has not been > > > modified in the last 15 seconds" or such. I think that might help avoid > > > unnecessary freezing on tables with very frequent vacuuming. > > > > Yes. I'm uncomfortable with the last-vacuum-LSN approach mostly > > because of the impact on very frequently vacuumed tables, and > > secondarily because of the impact on very infrequently vacuumed > > tables. > > > > Downthread, I proposed using the RedoRecPtr of the latest checkpoint > > rather than the LSN of the previous vacuum. I still like that idea. > > Assuming that "downthread" references > https://postgr.es/m/CA%2BTgmoYb670VcDFbekjn2YQOKF9a7e-kBFoj2WJF1HtH7YPaWQ%40mail.gmail.com > could you sketch out the logic you're imagining a bit more? > > > > It's a value that we already have, with no additional bookkeeping. It > > varies over a much narrower range than the interval between vacuums on > > a table. 
The vacuum interval could be as short as tens of seconds or as > > long as years, while the checkpoint interval is almost always going to > > be between a few minutes at the low end and some tens of minutes at > > the high end, hours at the very most. That's quite appealing. > > The reason I was thinking of using the "lsn at the end of the last vacuum", is > that it seems to be more adaptive to the frequency of vacuuming. > > On the one hand, if a table is rarely autovacuumed because it is huge, > (InsertLSN-RedoRecPtr) might or might not be representative of the workload > over a longer time. On the other hand, if a page in a frequently vacuumed > table has an LSN from around the last vacuum (or even before), it should be > frozen, but will appear to be recent in RedoRecPtr based heuristics? > > > Perhaps we can mix both approaches. We can use the LSN and time of the last > vacuum to establish an LSN->time mapping that's reasonably accurate for a > relation. For infrequently vacuumed tables we can use the time between > checkpoints to establish a *more aggressive* cutoff for freezing than what a > percent-of-time-since-last-vacuum approach would provide. If e.g. a table gets > vacuumed every 100 hours and checkpoint timeout is 1 hour, no realistic > percent-of-time-since-last-vacuum setting will allow freezing, as all dirty > pages will be too new. To allow freezing a decent proportion of those, we > could allow freezing pages that lived longer than ~20% > time-between-recent-checkpoints. One big sticking point for me (brought up elsewhere in this thread, but, AFAICT never resolved) is that it seems like the fact that we mark pages all-visible even when not freezing them means that no matter what heuristic we use, we won't have the opportunity to freeze the pages we most want to freeze. Getting to the specific proposal here -- having two+ heuristics and using them based on table characteristics: Take the append-only case. 
Let's say that we had a way to determine if an insert-only table should use the vacuum LSN heuristic or older than X checkpoints heuristic. Given the default autovacuum_vacuum_insert_threshold, every 1000 tuples inserted, autovacuum will vacuum the table. If you insert to the table often, then it is a frequently vacuumed table and, if I'm understanding the proposal correctly, you would want to use a vacuum LSN-based threshold to determine which pages are old enough to freeze. Using only the page-older-than-X%-of-LSNs-since-last-vacuum heuristic and assuming no concurrent workloads, each autovacuum will leave X% of the pages unfrozen. However, those pages will likely be marked all visible. That means that the next non-aggressive autovacuum will not even scan those pages (assuming that run of pages exceeds the skip pages threshold). So some chunk of pages will get left behind on every autovacuum. On the other hand, let's say that the table is not frequently inserted to, so we consider it an infrequently updated table and want to use the checkpoint-based heuristic. If the page hasn't
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 6:35 PM Andres Freund wrote: > > if insert LSN - RedoRecPtr < insert LSN - page LSN > > page is older than the most recent checkpoint start, so freeze it > > regardless of whether or not it would emit an FPI > > > > What aggressiveness levels should there be? What should change at each > > level? What criteria should pages have to meet to be subject to the > > aggressiveness level? > > I'm thinking something very roughly along these lines could make sense: > > page_lsn_age = insert_lsn - page_lsn; While there is no reason to not experiment here, I have my doubts about what you've sketched out. Most importantly, it doesn't have anything to say about the cost of not freezing -- just the cost of freezing. But isn't the main problem *not* freezing when we could and should have? (Of course the cost of freezing is very relevant, but it's still secondary.) But even leaving that aside, I just don't get why this will work with the case that you yourself emphasized earlier on: a workload with inserts plus "hot tail" updates. If you run TPC-C according to spec, there is about 12 or 14 hours between the initial inserts into the orders and order lines table (by a new order transaction), and the subsequent updates (from the delivery transaction). When I run the benchmark, I usually don't stick with the spec (it's rather limiting on modern hardware), so it's more like 2 - 4 hours before each new order is delivered (meaning updated in those two big tables). Either way, it's a fairly long time relative to everything else. Won't the algorithm that you've sketched always think that "unfreezing" pages doesn't affect recently frozen pages with such a workload? Isn't the definition of "recently frozen" that emerges from this algorithm not in any way related to the order delivery time, or anything like that? You know, rather like vacuum_freeze_min_age. Separately, at one point you also said "Yes. 
If the ratio of opportunistically frozen pages (which I'd define as pages that were frozen not because they strictly needed to be) vs the number of unfrozen pages increases, we need to make opportunistic freezing less aggressive and vice versa". Can we expect a discount for freezing that happened to be very cheap anyway, when that doesn't work out? What about a page that we would have had to have frozen anyway (based on the conventional vacuum_freeze_min_age criteria) not too long after it was frozen by this new mechanism, that nevertheless became unfrozen some time later? That is, a page where "the unfreezing" cannot reasonably be blamed on the initial so-called opportunistic freezing, because really it was a total accident involving when VACUUM showed up? You know, just like we'd expect with the TPC-C tables. Aside: "unfrozen pages" seems to refer to pages that were frozen, and became unfrozen. Not pages that are simply frozen. Lots of opportunities for confusion here. I'm not saying that it's wrong to freeze like this in the specific case of TPC-C. But do you really need to invent all this complicated infrastructure, just to avoid freezing the same pages again in a tight loop? On a positive note, I like that what you've laid out freezes eagerly when an FPI won't result -- this much we can all agree on. I guess that that part is becoming uncontroversial. -- Peter Geoghegan
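As a side note on the criterion quoted at the top of this message: "insert LSN - RedoRecPtr < insert LSN - page LSN" simplifies algebraically to page_lsn < RedoRecPtr, i.e. the page was last modified before the most recent checkpoint's redo pointer. A minimal sketch of that equivalence, with function and variable names invented for illustration (real LSNs are XLogRecPtr values, here treated as plain 64-bit byte positions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only, not PostgreSQL code. Cancelling insert_lsn from both
 * sides of the quoted inequality leaves page_lsn < redo_rec_ptr, i.e. the
 * page predates the latest checkpoint start.
 */
static bool
modified_before_last_checkpoint(uint64_t insert_lsn, uint64_t redo_rec_ptr,
                                uint64_t page_lsn)
{
    return (insert_lsn - redo_rec_ptr) < (insert_lsn - page_lsn);
}
```

For example, with insert LSN 1000 and a checkpoint redo pointer of 800, a page last stamped at LSN 700 predates the checkpoint, while one stamped at 900 does not.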
Re: Eager page freeze criteria clarification
On 2023-09-27 19:09:41 -0400, Melanie Plageman wrote: > On Wed, Sep 27, 2023 at 3:25 PM Robert Haas wrote: > > > > On Wed, Sep 27, 2023 at 12:34 PM Andres Freund wrote: > > > One way to deal with that would be to not track the average age in > > > LSN-difference-bytes, but convert the value to some age metric at that > > > time. If we e.g. were to convert the byte-age into an approximate age in > > > checkpoints, with quadratic bucketing (e.g. 0 -> current checkpoint, 1 -> 1 checkpoint ago, 2 -> 2 checkpoints ago, 3 -> 4 checkpoints ago, ...), using a > > > mean > > > of that age would probably be fine. > > > > Yes. I think it's possible that we could even get by with just two > > buckets. Say current checkpoint and not. Or current-or-previous > > checkpoint and not. And just look at what percentage of accesses fall > > into this first bucket -- it should be small or we're doing it wrong. > > It seems like the only thing we actually need to avoid is freezing the > > same pages over and over again in a tight loop. > > At the risk of seeming too execution-focused, I want to try and get more > specific. I think that's a good intuition :) > Here is a description of an example implementation to test my > understanding: > > In table-level stats, save two numbers: younger_than_cpt/older_than_cpt > storing the number of instances of unfreezing a page which is either > younger or older than the start of the most recent checkpoint at the > time of its unfreezing > This has the downside of counting most unfreezings directly after a > checkpoint in the older_than_cpt bucket. That is: older_than_cpt != > longer_frozen_duration at certain times in the checkpoint cycle. Yea - I don't think just using before/after checkpoint is a good measure. As you say, it'd be quite jumpy around checkpoints - even though the freezing behaviour hasn't materially changed. I think using the *distance* between checkpoints would be a more reliable measure, i.e. 
if (insert_lsn - page_lsn) < recent_average_lsn_diff_between_checkpoints, then it's recently modified, otherwise not.

One problem with using checkpoint "distances" to control things is forced/immediate checkpoints. The fact that a base backup was started (and thus a checkpoint completed much earlier than it would have otherwise) shouldn't make our system assume that the overall behaviour is quite different going forward.

> Now, I'm trying to imagine how this would interact in a meaningful way
> with opportunistic freezing behavior during vacuum.
>
> You would likely want to combine it with one of the other heuristics we
> discussed.
>
> For example:
> For a table with only 20% younger unfreezings, when vacuuming that page,

Fwiw, I wouldn't say that unfreezing 20% of recently frozen pages is a low value.

> if insert LSN - RedoRecPtr < insert LSN - page LSN
> page is older than the most recent checkpoint start, so freeze it
> regardless of whether or not it would emit an FPI
>
> What aggressiveness levels should there be? What should change at each
> level? What criteria should pages have to meet to be subject to the
> aggressiveness level?

I'm thinking something very roughly along these lines could make sense:

    page_lsn_age = insert_lsn - page_lsn;

    if (dirty && !fpi)
    {
        /*
         * If we can freeze without an FPI, be quite aggressive about
         * opportunistically freezing. We just need to prevent freezing
         * when the table is constantly being rewritten. It's ok to make
         * mistakes initially - the rate of unfreezes will quickly stop us
         * from making mistakes as often.
         */
#define NO_FPI_FREEZE_FACTOR 10.0
        if (page_lsn_age > average_lsn_bytes_per_checkpoint *
            (1 - recent_unfreeze_ratio) * NO_FPI_FREEZE_FACTOR)
            freeze = true;
    }
    else
    {
        /*
         * Freezing would emit an FPI and/or dirty the page, making freezing
         * quite a bit more costly. Be more hesitant about freezing recently
         * modified data, unless it's very rare that we unfreeze recently
         * modified data. For insert-only/mostly tables, unfreezes should be
         * rare, so we'll still freeze most of the time.
         */
#define FPI_FREEZE_FACTOR 1
        if (page_lsn_age > average_lsn_bytes_per_checkpoint *
            (1 - recent_unfreeze_ratio) * FPI_FREEZE_FACTOR)
            freeze = true;
    }

Greetings, Andres Freund
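For concreteness, Andres's rough sketch can be transcribed into a single self-contained function. This is only a restatement of the pseudocode above, not PostgreSQL code: should_freeze and its parameters are invented names, and LSNs are treated as plain byte offsets into the WAL stream.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_FPI_FREEZE_FACTOR 10.0
#define FPI_FREEZE_FACTOR    1.0

/*
 * Direct transcription of the sketch: compare the page's LSN age against a
 * threshold derived from the average WAL volume per checkpoint, the recent
 * unfreeze ratio, and a factor that depends on whether freezing would emit
 * a full-page image.
 */
static bool
should_freeze(uint64_t insert_lsn, uint64_t page_lsn,
              bool dirty, bool would_emit_fpi,
              double avg_lsn_bytes_per_checkpoint,
              double recent_unfreeze_ratio)
{
    double page_lsn_age = (double) (insert_lsn - page_lsn);
    double factor = (dirty && !would_emit_fpi) ? NO_FPI_FREEZE_FACTOR
                                               : FPI_FREEZE_FACTOR;

    return page_lsn_age >
        avg_lsn_bytes_per_checkpoint * (1.0 - recent_unfreeze_ratio) * factor;
}
```

For instance, with 1 MB of WAL per checkpoint on average and no recent unfreezes, a page whose LSN is 2 MB behind the insert position clears the FPI-case threshold but not the no-FPI-case threshold, given the constants as written in the sketch.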
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 5:42 PM Melanie Plageman wrote: > On Wed, Sep 27, 2023 at 5:27 PM Peter Geoghegan wrote: > > What about my idea of holding back when some tuples are already frozen > > from before? Admittedly that's still a fairly raw idea, but something > > along those lines seems promising. > > Originally, when you proposed this, the idea was to avoid freezing a > page that has some tuples already frozen because that would mean we > have frozen it and it was unfrozen after. Right. > But, today, I was thinking about what heuristics to combine with one > that considers the average amount of time pages in a table spend > frozen [1], and I think it might be interesting to use the "some > tuples on the page are frozen" test in the opposite way than it was > originally proposed. > > Take a case where you insert a row and then update it once and repeat > forever. Let's say you freeze the page before you've filled it up, > and, on the next update, the page is unfrozen. Most of the tuples on > the page will still be frozen. The next time the page is vacuumed, the > fact that the page has a lot of frozen tuples actually means that it > is probably worth freezing the remaining few tuples and marking the > page frozen. Assuming that you can get away with not writing an FPI, yeah, probably. Otherwise, maybe, less clear. > Basically, I'm wondering if we can use the page-is-partially-frozen > test to catch cases where we are willing to freeze a page a couple of > times to make sure it gets frozen. That doesn't seem like the opposite of my suggestion (except maybe in a mechanical sort of way). To me this sounds like a refinement of the same idea (not surprising that it would need to be refined). You seem to be suggesting that we would freeze if either no tuples were frozen at all, or if almost all tuples were already frozen -- but not if the page was "somewhere in between". I think I see where you're going with that. -- Peter Geoghegan
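The refinement Peter summarizes here (freeze when no tuples are frozen yet, or when nearly all already are, but hold back when the page is "somewhere in between") can be written as a tiny predicate. This is a toy sketch; the function name and the 90% cutoff are invented for illustration, not proposed anywhere in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy predicate for the "somewhere in between" idea: pages with no frozen
 * tuples are fair game, pages that are almost entirely frozen are worth
 * topping off, and partially frozen pages (possible evidence of recent
 * unfreezing) are left alone. The 0.9 cutoff is arbitrary.
 */
static bool
freeze_given_frozen_fraction(int frozen_tuples, int total_tuples)
{
    if (frozen_tuples == 0)
        return true;
    return (double) frozen_tuples / total_tuples >= 0.9;
}
```

On a 50-tuple page this would freeze at 0 or 48 frozen tuples, but hold back at 25.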
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 6:07 PM Andres Freund wrote:
> > I would be sure to look out for new inserts that "unfreeze" pages, too
> > -- ideally you'd have instrumentation that caught that, in order to
> > get a general sense of the extent of the problem in each of your
> > chosen representative workloads. This is particularly likely to be a
> > concern when there is enough space on a heap page to fit one more heap
> > tuple, that's smaller than most other tuples. The FSM will "helpfully"
> > make sure of it. This problem isn't rare at all, unfortunately.
>
> I'm not as convinced as you are that this is a problem / that the solution
> won't cause more problems than it solves. Users are concerned when free space
> can't be used - you don't have to look further than the discussion in the last
> weeks about adding the ability to disable HOT to fight bloat.

All that I said was that it would be a good idea to keep an eye out for
it during performance validation work.

> I do agree that the FSM code tries way too hard to fit things onto early pages
> - it e.g. can slow down concurrent copy workloads by 3-4x due to contention in
> the FSM - and that it has more size classes than necessary, but I don't think
> just closing frozen pages against further insertions of small tuples will
> cause its own set of issues.

Of course it'll cause other issues. Of course it's very subtle.

> I think at the very least there'd need to be something causing pages to reopen
> once the aggregate unused space in the table reaches some threshold.

Of course that's true. ISTM that you might well need some kind of
hysteresis to avoid pages ping-ponging. If it isn't sticky, it might
never settle, or take way too long to settle.

--
Peter Geoghegan
Re: Eager page freeze criteria clarification
Hi,

On 2023-09-27 17:43:04 -0700, Peter Geoghegan wrote:
> On Wed, Sep 27, 2023 at 5:20 PM Melanie Plageman wrote:
> > > Can you define "unfreeze"? I don't know if this newly invented term
> > > refers to unsetting a page that was marked all-frozen following (say)
> > > an UPDATE, or if it refers to choosing to not freeze when the option
> > > was available (in the sense that it was possible to do it and fully
> > > mark the page all-frozen in the VM). Or something else.
> >
> > By "unfreeze", I mean unsetting a page all frozen in the visibility
> > map when modifying the page for the first time after it was last
> > frozen.
>
> I see. So I guess that Andres meant that you'd track that within all
> backends, using pgstats infrastructure (when he summarized your call
> earlier today)?

That call was just between Robert and me (and not dedicated just to this
topic, fwiw). Yes, I was thinking of tracking that in pgstat. I can
imagine occasionally rolling it over into pg_class, to better deal with
crashes / failovers, but am fairly agnostic on whether that's really
useful / necessary.

> And that that information would be an important input for VACUUM, as opposed
> to something that it maintained itself?

Yes. If the ratio of opportunistically frozen pages (which I'd define as
pages that were frozen not because they strictly needed to be) vs the
number of unfrozen pages increases, we need to make opportunistic
freezing less aggressive and vice versa.

> ISTM that the concept of "unfreezing" a page is equivalent to
> "opening" the page that was "closed" at some point (by VACUUM). It's
> not limited to freezing per se -- it's "closed for business until
> further notice", which is a slightly broader concept (and one not
> unique to Postgres). You don't just need to be concerned about updates
> and deletes -- inserts are also a concern.
>
> I would be sure to look out for new inserts that "unfreeze" pages, too
> -- ideally you'd have instrumentation that caught that, in order to
> get a general sense of the extent of the problem in each of your
> chosen representative workloads. This is particularly likely to be a
> concern when there is enough space on a heap page to fit one more heap
> tuple, that's smaller than most other tuples. The FSM will "helpfully"
> make sure of it. This problem isn't rare at all, unfortunately.

I'm not as convinced as you are that this is a problem / that the
solution won't cause more problems than it solves. Users are concerned
when free space can't be used - you don't have to look further than the
discussion in the last weeks about adding the ability to disable HOT to
fight bloat.

I do agree that the FSM code tries way too hard to fit things onto early
pages - it e.g. can slow down concurrent copy workloads by 3-4x due to
contention in the FSM - and that it has more size classes than
necessary, but I don't think just closing frozen pages against further
insertions of small tuples will cause its own set of issues.

I think at the very least there'd need to be something causing pages to
reopen once the aggregate unused space in the table reaches some
threshold.

Greetings,

Andres Freund
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 5:20 PM Melanie Plageman wrote:
> > Can you define "unfreeze"? I don't know if this newly invented term
> > refers to unsetting a page that was marked all-frozen following (say)
> > an UPDATE, or if it refers to choosing to not freeze when the option
> > was available (in the sense that it was possible to do it and fully
> > mark the page all-frozen in the VM). Or something else.
>
> By "unfreeze", I mean unsetting a page all frozen in the visibility
> map when modifying the page for the first time after it was last
> frozen.

I see. So I guess that Andres meant that you'd track that within all
backends, using pgstats infrastructure (when he summarized your call
earlier today)? And that that information would be an important input
for VACUUM, as opposed to something that it maintained itself?

> I would probably call choosing not to freeze when the option is
> available "no freeze". I have been thinking of what to call it because
> I want to add some developer stats for myself indicating why a page
> that was freezable was not frozen.

I think that having that sort of information available via custom
instrumentation (just for the performance validation side) makes a lot
of sense.

ISTM that the concept of "unfreezing" a page is equivalent to "opening"
the page that was "closed" at some point (by VACUUM). It's not limited
to freezing per se -- it's "closed for business until further notice",
which is a slightly broader concept (and one not unique to Postgres).
You don't just need to be concerned about updates and deletes -- inserts
are also a concern.

I would be sure to look out for new inserts that "unfreeze" pages, too
-- ideally you'd have instrumentation that caught that, in order to get
a general sense of the extent of the problem in each of your chosen
representative workloads. This is particularly likely to be a concern
when there is enough space on a heap page to fit one more heap tuple,
that's smaller than most other tuples. The FSM will "helpfully" make
sure of it. This problem isn't rare at all, unfortunately.

> > The choice to freeze or not freeze pretty much always relies on
> > guesswork about what'll happen to the page in the future, no?
> > Obviously we wouldn't even apply the FPI trigger criteria if we could
> > somehow easily determine that it won't work out (to some degree that's
> > what conditioning it on being able to set the all-frozen VM bit
> > actually does).
>
> I suppose you are thinking of "opportunistic" as freezing whenever we
> aren't certain it is the right thing to do simply because we have the
> opportunity to do it?

I have heard the term "opportunistic freezing" used to refer to freezing
that takes place outside of VACUUM before now. You know, something
perfectly analogous to pruning in VACUUM versus opportunistic pruning.
(I knew that you can't have meant that -- my point is that the
terminology in this area has problems.)

> I want a way to express "freeze when freeze min age doesn't require it"

That makes sense when you consider where we are right now, but it'll
sound odd in a world where freezing via min_freeze_age is the exception
rather than the rule. If anything, it would make more sense if the
traditional min_freeze_age trigger criteria was the type of freezing
that needed its own adjective.

--
Peter Geoghegan
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 5:27 PM Peter Geoghegan wrote:
> On Wed, Sep 27, 2023 at 1:45 PM Andres Freund wrote:
> > I am much more concerned about cases where
> > opportunistic freezing requires an FPI - it'll often *still* be the right
> > choice to freeze the page, but we need a way to prevent that from causing a
> > lot of WAL in worse cases.
>
> What about my idea of holding back when some tuples are already frozen
> from before? Admittedly that's still a fairly raw idea, but something
> along those lines seems promising.

Originally, when you proposed this, the idea was to avoid freezing a
page that has some tuples already frozen because that would mean we have
frozen it and it was unfrozen after.

But, today, I was thinking about what heuristics to combine with one
that considers the average amount of time pages in a table spend frozen
[1], and I think it might be interesting to use the "some tuples on the
page are frozen" test in the opposite way than it was originally
proposed.

Take a case where you insert a row and then update it once and repeat
forever. Let's say you freeze the page before you've filled it up, and,
on the next update, the page is unfrozen. Most of the tuples on the page
will still be frozen. The next time the page is vacuumed, the fact that
the page has a lot of frozen tuples actually means that it is probably
worth freezing the remaining few tuples and marking the page frozen.

Basically, I'm wondering if we can use the page-is-partially-frozen test
to catch cases where we are willing to freeze a page a couple of times
to make sure it gets frozen.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_Y0nLmQ%3DYS1c2ORzLi7bu3eWjdx%2B32BuFc0Tho2o7E3rw%40mail.gmail.com
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 7:39 PM Peter Geoghegan wrote:
> On Wed, Sep 27, 2023 at 4:09 PM Melanie Plageman wrote:
> > At the risk of seeming too execution-focused, I want to try and get more
> > specific. Here is a description of an example implementation to test my
> > understanding:
> >
> > In table-level stats, save two numbers: younger_than_cpt/older_than_cpt
> > storing the number of instances of unfreezing a page which is either
> > younger or older than the start of the most recent checkpoint at the
> > time of its unfreezing
>
> Can you define "unfreeze"? I don't know if this newly invented term
> refers to unsetting a page that was marked all-frozen following (say)
> an UPDATE, or if it refers to choosing to not freeze when the option
> was available (in the sense that it was possible to do it and fully
> mark the page all-frozen in the VM). Or something else.

By "unfreeze", I mean unsetting a page all frozen in the visibility map
when modifying the page for the first time after it was last frozen.

I would probably call choosing not to freeze when the option is
available "no freeze". I have been thinking of what to call it because I
want to add some developer stats for myself indicating why a page that
was freezable was not frozen.

> I also find the term "opportunistic freezing" confusing. What's
> opportunistic about it? It seems as if the term has been used as a
> synonym of "freezing not triggered by min_freeze_age" on this thread,
> or "freezing that is partly conditioned on setting the page all-frozen
> in the VM", but I'm not even sure of that.

By "opportunistic freezing", I refer to freezing not triggered by
min_freeze_age. Though we only do so now if the page could be set
all-frozen, that is not a requirement for use of the term, IMO. I assume
that we are not considering changing any behavior related to freezing
when a tuple is older than freeze_min_age, so I think of that freezing
as "required". And any freezing of newer tuples is opportunistic.

> The choice to freeze or not freeze pretty much always relies on
> guesswork about what'll happen to the page in the future, no?
> Obviously we wouldn't even apply the FPI trigger criteria if we could
> somehow easily determine that it won't work out (to some degree that's
> what conditioning it on being able to set the all-frozen VM bit
> actually does).

I suppose you are thinking of "opportunistic" as freezing whenever we
aren't certain it is the right thing to do simply because we have the
opportunity to do it? If so, I'm not sure we need that distinction
currently. It seems more useful in a narrative context as a way of
characterizing a situation than it does when categorizing different
behaviors. I want a way to express "freeze when freeze min age doesn't
require it".

- Melanie
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 4:09 PM Melanie Plageman wrote:
> At the risk of seeming too execution-focused, I want to try and get more
> specific. Here is a description of an example implementation to test my
> understanding:
>
> In table-level stats, save two numbers: younger_than_cpt/older_than_cpt
> storing the number of instances of unfreezing a page which is either
> younger or older than the start of the most recent checkpoint at the
> time of its unfreezing

Can you define "unfreeze"? I don't know if this newly invented term
refers to unsetting a page that was marked all-frozen following (say) an
UPDATE, or if it refers to choosing to not freeze when the option was
available (in the sense that it was possible to do it and fully mark the
page all-frozen in the VM). Or something else.

I also find the term "opportunistic freezing" confusing. What's
opportunistic about it? It seems as if the term has been used as a
synonym of "freezing not triggered by min_freeze_age" on this thread, or
"freezing that is partly conditioned on setting the page all-frozen in
the VM", but I'm not even sure of that.

The choice to freeze or not freeze pretty much always relies on
guesswork about what'll happen to the page in the future, no? Obviously
we wouldn't even apply the FPI trigger criteria if we could somehow
easily determine that it won't work out (to some degree that's what
conditioning it on being able to set the all-frozen VM bit actually
does).

--
Peter Geoghegan
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 3:25 PM Robert Haas wrote:
> On Wed, Sep 27, 2023 at 12:34 PM Andres Freund wrote:
> > One way to deal with that would be to not track the average age in
> > LSN-difference-bytes, but convert the value to some age metric at that
> > time. If we e.g. were to convert the byte-age into an approximate age in
> > checkpoints, with quadratic bucketing (e.g. 0 -> current checkpoint, 1 -> 1
> > checkpoint ago, 2 -> 2 checkpoints ago, 3 -> 4 checkpoints ago, ...), using
> > a mean of that age would probably be fine.
>
> Yes. I think it's possible that we could even get by with just two
> buckets. Say current checkpoint and not. Or current-or-previous
> checkpoint and not. And just look at what percentage of accesses fall
> into this first bucket -- it should be small or we're doing it wrong.
> It seems like the only thing we actually need to avoid is freezing the
> same pages over and over again in a tight loop.

At the risk of seeming too execution-focused, I want to try and get more
specific. Here is a description of an example implementation to test my
understanding:

In table-level stats, save two numbers, younger_than_cpt/older_than_cpt,
storing the number of instances of unfreezing a page which is either
younger or older than the start of the most recent checkpoint at the
time of its unfreezing.

Upon update or delete (and insert?), if the page being modified is
frozen:

    if insert LSN - RedoRecPtr > insert LSN - old page LSN
        page is younger, younger_than_cpt += 1
    otherwise,
        older_than_cpt += 1

The ratio of younger/total and older/total can be used to determine how
aggressive opportunistic freezing will be.

This has the downside of counting most unfreezings directly after a
checkpoint in the older_than_cpt bucket. That is: older_than_cpt !=
longer_frozen_duration at certain times in the checkpoint cycle.

Now, I'm trying to imagine how this would interact in a meaningful way
with opportunistic freezing behavior during vacuum. You would likely
want to combine it with one of the other heuristics we discussed. For
example, for a table with only 20% younger unfreezings, when vacuuming a
page:

    if insert LSN - RedoRecPtr < insert LSN - page LSN
        page is older than the most recent checkpoint start, so freeze
        it regardless of whether or not it would emit an FPI

What aggressiveness levels should there be? What should change at each
level? What criteria should pages have to meet to be subject to the
aggressiveness level? I have some ideas, but I'd like to try an
algorithm along these lines with an updated work queue workload and the
insert-only workload. And I want to make sure I understand the proposal
first.

- Melanie
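The bookkeeping described above can be written down as a rough, self-contained C sketch. All names, the stand-in XLogRecPtr typedef, and the 20% cutoff are hypothetical; in PostgreSQL proper the counters would presumably live in pgstats and the LSNs would come from GetInsertRecPtr() and GetRedoRecPtr():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's WAL position */

/* Hypothetical per-table counters; the real ones would live in pgstats. */
typedef struct UnfreezeStats
{
    uint64_t    younger_than_cpt;
    uint64_t    older_than_cpt;
} UnfreezeStats;

/*
 * Classify an unfreezing.  "insert LSN - RedoRecPtr > insert LSN - old page
 * LSN" simplifies to "page LSN > RedoRecPtr": the page was last modified
 * after the most recent checkpoint started, so it counts as "younger".
 */
void
count_unfreeze(UnfreezeStats *stats, XLogRecPtr page_lsn, XLogRecPtr redo_ptr)
{
    if (page_lsn > redo_ptr)
        stats->younger_than_cpt++;
    else
        stats->older_than_cpt++;
}

/*
 * Vacuum-side sketch: for a table whose unfreezings are mostly "older"
 * (here: under 20% younger, an arbitrary cutoff), freeze a page that
 * predates the checkpoint start regardless of whether it would emit an FPI.
 */
bool
freeze_despite_fpi(const UnfreezeStats *stats, XLogRecPtr page_lsn,
                   XLogRecPtr redo_ptr)
{
    uint64_t    total = stats->younger_than_cpt + stats->older_than_cpt;
    double      young_ratio;

    young_ratio = (total == 0) ? 0.0 : (double) stats->younger_than_cpt / total;
    return young_ratio < 0.2 && page_lsn <= redo_ptr;
}
```

Note that this sketch inherits the downside mentioned above: a page unfrozen just after a checkpoint lands in the older_than_cpt bucket regardless of how long it actually stayed frozen.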
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 2:26 PM Peter Geoghegan wrote:
> On Wed, Sep 27, 2023 at 1:45 PM Andres Freund wrote:
> > I think we need to make vacuums on large tables much more aggressive than
> > they are now, independent of opportunistic freezing heuristics. It's
> > idiotic that on large tables we delay vacuuming until multi-pass vacuums
> > are pretty much guaranteed.
>
> Not having to do all of the freezing at once will often still make
> sense in cases where we "lose".

One more thing on this, and the subject of large tables that keep
getting larger (including those with a "hot tail" of updates):

Since autovacuum runs against such tables at geometric intervals (as
determined by autovacuum_vacuum_insert_scale_factor), the next VACUUM is
always going to be longer and more expensive than this VACUUM, forever
(ignoring the influence of aggressive mode for a second). This would
even be true if we didn't have the related problem of
autovacuum_vacuum_insert_scale_factor not accounting for the fact that
when VACUUM starts and when VACUUM ends aren't exactly the same thing in
large tables [1] -- that aspect just makes the problem even worse.

Basically, the whole "wait and see" approach makes zero sense here
because we really do need to be aggressive about freezing just to keep
up with the workload. The number of pages we'll scan in the next VACUUM
will always be significantly larger, even if we're very aggressive about
freezing (theoretically it might not be, but then what VACUUM does
doesn't matter that much either way). Time is very much not on our side
here.

So we need to anticipate what happens next with the workload, and how
that affects VACUUM in the future -- not just how VACUUM affects the
workload. (VACUUM is just another part of the workload, in fact.)

[1] https://www.postgresql.org/message-id/CAH2-Wzn=bZ4wynYB0hBAeF4kGXGoqC=PZVKHeerBU-je9AQF=g...@mail.gmail.com

--
Peter Geoghegan
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 1:45 PM Andres Freund wrote:
> On 2023-09-27 13:14:41 -0700, Peter Geoghegan wrote:
> > As a general rule, I think that we're better off gambling against
> > future FPIs, and then pulling back if we go too far. The fact that we
> > went one VACUUM operation without the workload unsetting an
> > all-visible page isn't that much of a signal about what might happen
> > to that page.
>
> I think we can afford to be quite aggressive about opportunistically
> freezing when doing so wouldn't emit an FPI.

I agree. This part is relatively easy -- it is more or less v2 of the
FPI thing. The only problem that I see is that it just isn't that
compelling on its own -- it just doesn't seem ambitious enough.

> I am much more concerned about cases where
> opportunistic freezing requires an FPI - it'll often *still* be the right
> choice to freeze the page, but we need a way to prevent that from causing a
> lot of WAL in worse cases.

What about my idea of holding back when some tuples are already frozen
from before? Admittedly that's still a fairly raw idea, but something
along those lines seems promising.

If you limit yourself to what I've called the easy cases, then you can't
really expect to make a dent in the problems that we see with large
tables -- where FPIs are pretty much the norm rather than the exception.
Again, that's fine, but I'd be disappointed if you and Melanie can't do
better than that for 17.

> > 2. Large tables (i.e. the tables where it really matters) just don't
> > have that many VACUUM operations, relative to everything else.
>
> I think we need to make vacuums on large tables much more aggressive than
> they are now, independent of opportunistic freezing heuristics. It's
> idiotic that on large tables we delay vacuuming until multi-pass vacuums
> are pretty much guaranteed.

Not having to do all of the freezing at once will often still make sense
in cases where we "lose". It's hard to precisely describe how to assess
such things (what's the break-even point?), but that makes it no less
true. Constantly losing by a small amount is usually better than massive
drops in performance.

> The current logic made some sense when we didn't have the VM, but now
> autovacuum scheduling is influenced by the portion of the table that that
> vacuum will never look at, which makes no sense.

Yep:
https://www.postgresql.org/message-id/CAH2-Wz=MGFwJEpEjVzXwEjY5yx=uunpza6bt4dsmasrgluq...@mail.gmail.com

> > Who says we'll get more than one opportunity per page with these tables,
> > even with this behavior of scanning all-visible pages in non-aggressive
> > VACUUMs? Big append-only tables simply won't get the opportunity to catch
> > up in the next non-aggressive VACUUM if there simply isn't one.
>
> I agree that we need to freeze pages in append only tables ASAP. I don't
> think they're that hard a case to detect though. The harder case is the -
> IME very common - case of tables that are largely immutable but have a
> moving tail that's hotly updated.

Doing well with such "moving tail of updates" tables is exactly what
doing well on TPC-C requires. It's easy to detect that a table is either
one of these two things. Which of the two it is specifically presents
greater difficulties. So I don't see strict append-only versus "append
and update each row once" as all that different, practically speaking,
from the point of view of VACUUM.

Often, the individual pages from TPC-C's order lines table look very
much like pgbench_history style pages would -- just because VACUUM is
either ahead or behind the current "hot tail" position with almost all
pages that it scans, due to the current phase of the moon. This is
partly why I place so much emphasis on teaching VACUUM to understand
that what it sees in each page might well have a lot to do with when it
happened to show up, as opposed to something about the workload/table
itself.

BTW, if you're worried about the "hot tail" case in particular then you
should definitely put the FSM stuff in scope here -- it's a huge part of
the overall problem, since it makes the pages take so much longer to
"settle" than you might expect when just considering the workload
abstractly.

--
Peter Geoghegan
Re: Eager page freeze criteria clarification
Hi,

On 2023-09-27 13:14:41 -0700, Peter Geoghegan wrote:
> On Wed, Sep 27, 2023 at 1:03 PM Andres Freund wrote:
> > I suspect that medium term the better approach would be to be much more
> > aggressive about setting all-visible, including as part of page-level
> > visibility checks, and to deal with the concern of vacuum not processing
> > such pages soon by not just looking at unmarked pages, but also a portion
> > of the all-visible-but-not-frozen pages (the more
> > all-visible-but-not-frozen pages there are, the more of them each vacuum
> > should process). I think all-visible is too important for index only
> > scans, for us to be able to remove it, or delay setting it until freezing
> > makes sense.
> >
> > My confidence in my gut feeling isn't all that high here, though.
>
> I think that this is a bad idea. I see two main problems with
> "splitting the difference" like this:
>
> 1. If we randomly scan some all-visible pages in non-aggressive
> VACUUMs, then we're sure to get FPIs in order to be able to freeze.
>
> As a general rule, I think that we're better off gambling against
> future FPIs, and then pulling back if we go too far. The fact that we
> went one VACUUM operation without the workload unsetting an
> all-visible page isn't that much of a signal about what might happen
> to that page.

I think we can afford to be quite aggressive about opportunistically
freezing when doing so wouldn't emit an FPI. I am much more concerned
about cases where opportunistic freezing requires an FPI - it'll often
*still* be the right choice to freeze the page, but we need a way to
prevent that from causing a lot of WAL in worse cases.

I think the cases where no-FPI-required freezing is a problem are small,
frequently updated tables, which are very frequently vacuumed.

> 2. Large tables (i.e. the tables where it really matters) just don't
> have that many VACUUM operations, relative to everything else.

I think we need to make vacuums on large tables much more aggressive
than they are now, independent of opportunistic freezing heuristics.
It's idiotic that on large tables we delay vacuuming until multi-pass
vacuums are pretty much guaranteed. We don't have the stats to detect
that, but if we had them, we should trigger autovacuums when dead items
+ dead tuples get to a significant fraction of what can be tracked in
m_w_m.

The current logic made some sense when we didn't have the VM, but now
autovacuum scheduling is influenced by the portion of the table that
that vacuum will never look at, which makes no sense. One can argue that
that frozen portion is a proxy for the size of the indexes that would
need to be scanned - but that also doesn't make much sense to me,
because the number & size of indexes depends a lot on the workload and
because we skip index vacuums when not necessary.

I guess we could just use (relation_size - skipped_pages) *
autovacuum_vacuum_scale_factor as the threshold, but we don't have a
cheap way to compute the number of frozen pages right now (we do have
relallvisible for non-aggressive vacuums).

> Who says we'll get more than one opportunity per page with these tables,
> even with this behavior of scanning all-visible pages in non-aggressive
> VACUUMs? Big append-only tables simply won't get the opportunity to catch
> up in the next non-aggressive VACUUM if there simply isn't one.

I agree that we need to freeze pages in append only tables ASAP. I don't
think they're that hard a case to detect though. The harder case is the
- IME very common - case of tables that are largely immutable but have a
moving tail that's hotly updated.

Greetings,

Andres Freund
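The alternative threshold mentioned above can be written down directly. A hedged sketch in C (function and parameter names are made up; the rule, like the message says, only approximates since skipped_pages is not cheaply available today):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the suggested scheduling rule: base the autovacuum threshold on
 * the portion of the table a vacuum would actually look at
 * (relation_pages - skipped_pages), scaled by autovacuum_vacuum_scale_factor,
 * rather than on the full relation size.  Units here are pages throughout.
 */
bool
needs_autovacuum(uint64_t relation_pages, uint64_t skipped_pages,
                 uint64_t changed_pages, double scale_factor)
{
    double      threshold;

    threshold = (double) (relation_pages - skipped_pages) * scale_factor;
    return (double) changed_pages > threshold;
}
```

With a 1000-page table of which 900 pages would be skipped, a scale factor of 0.2 (the GUC's default) yields a threshold of 20 pages instead of 200 -- which is the whole point: the frozen portion no longer inflates the threshold.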
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 1:03 PM Andres Freund wrote:
> I suspect that medium term the better approach would be to be much more
> aggressive about setting all-visible, including as part of page-level
> visibility checks, and to deal with the concern of vacuum not processing
> such pages soon by not just looking at unmarked pages, but also a portion
> of the all-visible-but-not-frozen pages (the more
> all-visible-but-not-frozen pages there are, the more of them each vacuum
> should process). I think all-visible is too important for index only
> scans, for us to be able to remove it, or delay setting it until freezing
> makes sense.
>
> My confidence in my gut feeling isn't all that high here, though.

I think that this is a bad idea. I see two main problems with "splitting
the difference" like this:

1. If we randomly scan some all-visible pages in non-aggressive VACUUMs,
then we're sure to get FPIs in order to be able to freeze.

As a general rule, I think that we're better off gambling against future
FPIs, and then pulling back if we go too far. The fact that we went one
VACUUM operation without the workload unsetting an all-visible page
isn't that much of a signal about what might happen to that page.

2. Large tables (i.e. the tables where it really matters) just don't
have that many VACUUM operations, relative to everything else.

Who says we'll get more than one opportunity per page with these tables,
even with this behavior of scanning all-visible pages in non-aggressive
VACUUMs? Big append-only tables simply won't get the opportunity to
catch up in the next non-aggressive VACUUM if there simply isn't one.

--
Peter Geoghegan
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 12:34 PM Andres Freund wrote:
> What do you mean with "always freeze aggressively" - do you mean
> 'aggressive' autovacuums? Or opportunistic freezing being aggressive? I
> don't know why the former would be the case?

I meant the latter.

> > When it grows large enough, we suddenly stop freezing it aggressively, and
> > now it starts experiencing vacuums that do a whole bunch of work all at
> > once. A user who notices that is likely to be pretty confused about what
> > happened, and maybe not too happy when they find out.
>
> Hm - isn't my proposal exactly the other way round? I'm proposing that a
> page is frozen more aggressively if not already in shared buffers - which
> will become more common once the table has grown "large enough"?

OK, but it's counterintuitive either way, IMHO.

> I think there were three main ideas that we discussed:
>
> 1) We don't need to be accurate in the freezing decisions for individual
>    pages, we "just" need to avoid the situation that over time we commonly
>    freeze pages that will be updated again "soon".
> 2) It might be valuable to adjust the "should freeze page opportunistically"
>    based on feedback.
> 3) We might need to classify the workload for a table and use different
>    heuristics for different workloads.

I agree with all of that. Good summary.

> One way to deal with that would be to not track the average age in
> LSN-difference-bytes, but convert the value to some age metric at that
> time. If we e.g. were to convert the byte-age into an approximate age in
> checkpoints, with quadratic bucketing (e.g. 0 -> current checkpoint, 1 -> 1
> checkpoint ago, 2 -> 2 checkpoints ago, 3 -> 4 checkpoints ago, ...),
> using a mean of that age would probably be fine.

Yes. I think it's possible that we could even get by with just two
buckets. Say current checkpoint and not. Or current-or-previous
checkpoint and not. And just look at what percentage of accesses fall
into this first bucket -- it should be small or we're doing it wrong. It
seems like the only thing we actually need to avoid is freezing the same
pages over and over again in a tight loop.

--
Robert Haas
EDB: http://www.enterprisedb.com
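A tiny sketch of the quadratic (power-of-two) bucketing Andres describes, in C for concreteness. The function name is made up, and the age in checkpoints is assumed to already have been derived from the byte-age:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Power-of-two bucketing of an age measured in checkpoints:
 *   bucket 0 -> current checkpoint, bucket 1 -> 1 checkpoint ago,
 *   bucket 2 -> 2-3 checkpoints ago, bucket 3 -> 4-7 checkpoints ago, ...
 * i.e. bucket n (for n >= 1) covers ages in [2^(n-1), 2^n).
 */
int
checkpoint_age_bucket(uint64_t checkpoints_ago)
{
    int         bucket = 0;

    /* position of the highest set bit, plus one */
    while (checkpoints_ago > 0)
    {
        checkpoints_ago >>= 1;
        bucket++;
    }
    return bucket;
}
```

The two-bucket variant Robert floats is just this collapsed to "bucket 0 (or 0-1) vs. everything else": track what fraction of unfreezings land in the first bucket, and back off opportunistic freezing when that fraction grows.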
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 2:36 PM Melanie Plageman wrote: > It seems like the ideal freeze pattern for an insert-only table would be > to freeze as soon as the page is full before any checkpoints which could > force you to emit an FPI. Yes. So imagine we have two freeze criteria: 1. Do not ever opportunistically freeze pages. 2. Opportunistically freeze pages whenever we can completely freeze the page. For a table where the same rows are frequently updated or where the table is frequently truncated, we want to apply (1). For an insert-only table, we want to apply (2). Maybe I'm all wet here, but it's starting to seem to me that in almost all other cases, we also want to apply (2). If that's true, then the only thing we need to do here is identify the special cases where (1) is correct. It'd be even better if we could adapt this behavior down to the page level - imagine a large mostly cold table with a hot tail. In a perfect world we apply (1) to the tail and (2) to the rest. But this might be too much of a stretch to actually get right. > One big sticking point for me (brought up elsewhere in this thread, but, > AFAICT never resolved) is that it seems like the fact that we mark pages > all-visible even when not freezing them means that no matter what > heuristic we use, we won't have the opportunity to freeze the pages we > most want to freeze. The only solution to this problem that I can see is what Peter proposed earlier: if we're not prepared to freeze the page, then don't mark it all-visible either. This might be the right thing to do, and if it is, we could even go further and get rid of the two as separate concepts completely. However, I think it would be OK to leave all of that to one side for the moment, *especially* if we adopt some proposal that does a lot more opportunistic freezing than we do currently. Because then the problem just wouldn't come up nearly as much as it does now. One patch can't fix everything, and shouldn't try. 
-- Robert Haas EDB: http://www.enterprisedb.com
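The two freeze criteria Robert contrasts above can be stated as a tiny decision rule. This is a hedged illustration in Python, not PostgreSQL code; the page representation and function names are invented for the example.

```python
# Sketch of the two policies discussed above:
#   policy 1: never freeze opportunistically (hot / frequently-truncated tables)
#   policy 2: freeze whenever the whole page can be frozen (insert-only tables)
# The page model (a dict of tuples with a "freezable" flag) is invented.

def should_freeze(page, policy):
    """Decide whether VACUUM would opportunistically freeze this page."""
    if policy == 1:
        return False  # never freeze opportunistically
    # policy 2: only freeze if every tuple on the page is freezable,
    # so the page can be set all-frozen in one go
    return all(t["freezable"] for t in page["tuples"])

hot_page = {"tuples": [{"freezable": True}, {"freezable": False}]}
cold_page = {"tuples": [{"freezable": True}, {"freezable": True}]}
```

The open question in the thread is precisely how to pick between these policies (ideally per page, not per table) without a heuristic that misfires on hot tails of mostly cold tables.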
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 11:36 AM Melanie Plageman wrote: > One big sticking point for me (brought up elsewhere in this thread, but, > AFAICT never resolved) is that it seems like the fact that we mark pages > all-visible even when not freezing them means that no matter what > heuristic we use, we won't have the opportunity to freeze the pages we > most want to freeze. I think of it as being a bit like an absorbing state from a Markov chain. It's a perfect recipe for weird path dependent behavior, which seems to make your task just about impossible. It literally makes vacuuming more frequently lead to less useful work being performed! It does so *reliably*, in the simplest of cases. Ideally, you'd be designing an algorithm for a system where we could expect to get approximately the desired outcome whenever we have approximately the right idea. A system where mistakes tend to "cancel each other out" over time. As long as you have "set all-visible but don't freeze" behavior as your starting point, then ISTM that your algorithm almost has to make no mistakes at all to really improve on what we have right now. As things stand, when your algorithm decides to not freeze (for pages that are still set all-visible in the VM), you really can't afford to be wrong. (I think that you get this already, but it's a point worth emphasizing.) -- Peter Geoghegan
Re: Eager page freeze criteria clarification
Hi, On 2023-09-27 15:45:06 -0400, Robert Haas wrote: > > One big sticking point for me (brought up elsewhere in this thread, but, > > AFAICT never resolved) is that it seems like the fact that we mark pages > > all-visible even when not freezing them means that no matter what > > heuristic we use, we won't have the opportunity to freeze the pages we > > most want to freeze. > > The only solution to this problem that I can see is what Peter > proposed earlier: if we're not prepared to freeze the page, then don't > mark it all-visible either. This might be the right thing to do, and > if it is, we could even go further and get rid of the two as separate > concepts completely. I suspect that medium term the better approach would be to be much more aggressive about setting all-visible, including as part of page-level visibility checks, and to deal with the concern of vacuum not processing such pages soon by not just looking at unmarked pages, but also a portion of the all-visible-but-not-frozen pages (the more all-visible-but-not-frozen pages there are, the more of them each vacuum should process). I think all-visible is too important for index only scans, for us to be able to remove it, or delay setting it until freezing makes sense. My confidence in my gut feeling isn't all that high here, though. > However, I think it would be OK to leave all of > that to one side for the moment, *especially* if we adopt some > proposal that does a lot more opportunistic freezing than we do > currently. Because then the problem just wouldn't come up nearly as > much as it does now. One patch can't fix everything, and shouldn't > try. Agreed. Greetings, Andres Freund
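Andres's suggestion above (have each vacuum also process a share of the all-visible-but-not-frozen pages, a larger share the more of them there are) can be sketched as a simple proportional rule. This is an invented illustration of the idea, not actual vacuum scheduling code; the scaling formula and cap are assumptions.

```python
# Hedged sketch: the more all-visible-but-not-frozen (AVNF) pages a table
# has, the larger the fraction of them each vacuum should visit. The
# fraction here is simply the AVNF share of the table, capped at 50%;
# both choices are illustrative, not proposed values.

def avnf_pages_to_scan(num_avnf_pages, total_pages, max_fraction=0.5):
    """Number of all-visible-but-not-frozen pages this vacuum should process."""
    if total_pages == 0:
        return 0
    fraction = min(max_fraction, num_avnf_pages / total_pages)
    return int(num_avnf_pages * fraction)
```

With such a rule, pages set all-visible early are still revisited for freezing at a rate that grows with the backlog, instead of being skipped until an aggressive vacuum.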
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 12:45 PM Robert Haas wrote: > > One big sticking point for me (brought up elsewhere in this thread, but, > > AFAICT never resolved) is that it seems like the fact that we mark pages > > all-visible even when not freezing them means that no matter what > > heuristic we use, we won't have the opportunity to freeze the pages we > > most want to freeze. > > The only solution to this problem that I can see is what Peter > proposed earlier: if we're not prepared to freeze the page, then don't > mark it all-visible either. This might be the right thing to do, and > if it is, we could even go further and get rid of the two as separate > concepts completely. However, I think it would be OK to leave all of > that to one side for the moment, *especially* if we adopt some > proposal that does a lot more opportunistic freezing than we do > currently. Because then the problem just wouldn't come up nearly as > much as it does now. One patch can't fix everything, and shouldn't > try. I guess that it might make sense to leave the issue of all-visible-only pages out of it for the time being. I would just point out that that means that you'll be limited to adding strictly opportunistic behaviors, where it just so happens that VACUUM stumbles upon pages where freezing early likely makes sense. Basically, an incremental improvement on the FPI thing -- not the sort of thing where I'd expect Melanie's current approach of optimizing a whole cross-section of representative workloads to really help with. I have a separate concern about it that I'll raise shortly, in my response to Andres. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 10:46 AM Andres Freund wrote: > I don't disagree that we should do something in that direction - I just don't > see the raw number of unfrozen pages being useful in that regard. If you have > a database where no pages live long, we don't need to freeze > opportunistically, yet the fraction of unfrozen pages will be huge. We don't know how to reliably predict the future. We can do our best to ameliorate problems with such workloads, using a slew of different strategies (including making the behaviors configurable, holding off on freezing a second or a third time, etc). Such workloads are not very common, and won't necessarily suffer too much. I think that they're pretty much limited to queue-like tables. Any course of action will likely have some downsides. Melanie/you/whomever will need to make a trade-off, knowing that somebody somewhere isn't going to be completely happy about it. I just don't think that anybody will ever be able to come up with an algorithm that can generalize well enough for things to not work out that way. I really care about the problems in this area being addressed more comprehensively, so I'm certainly not going to be the one that just refuses to accept a trade-off (within reason, of course). > > > If we want to take global freeze debt into account, which I think is a > > > good > > > idea, we'll need a smarter way to represent the debt than just the number > > > of > > > unfrozen pages. I think we would need to track the age of unfrozen pages > > > in > > > some way. If there are a lot of unfrozen pages with a recent xid, then > > > it's > > > fine, but if they are older and getting older, it's a problem and we need > > > to > > > be more aggressive. > > > > Tables like pgbench_history will have lots of unfrozen pages with a > > recent XID that get scanned during every VACUUM. We should be freezing > > such pages at the earliest opportunity. 
> > I think we ought to be able to freeze tables with as simple a workload as > pgbench_history has aggressively without taking a global freeze debt into > account. Agreed. > We definitely *also* should take the number of unfrozen pages into account. I > just don't think determining freeze debt primarily using the number of unfrozen > pages will be useful. The presence of unfrozen pages that are likely to be > updated again soon is not a problem and makes the simple metric pretty much > useless. I never said "primarily". > > To be clear, that doesn't mean that XID age shouldn't play an > > important role in helping VACUUM to differentiate between pages that > > should not be frozen and pages that should be frozen. > > I think we need to take it into account to determine a useful freeze debt on a > table level (and potentially system wide too). If we reach the stage where XID age starts to matter, we're no longer talking about performance stability IMV. We're talking about avoiding wraparound, which is mostly a different problem. Recall that I started this whole line of discussion about debt in reaction to a point of Robert's about lazy freezing. My original point was that (as a general rule) we can afford to be lazy when we're not risking too much -- when the eventual cost of catching up (having gotten it wrong) isn't too painful (i.e. doesn't lead to a big balloon payment down the road). Laziness is the thing that requires justification. Putting off vital maintenance work for as long as possible only makes sense under fairly limited circumstances. A complicating factor here is the influence of the free space map. The way that that works (the total lack of hysteresis to discourage using space from old pages) is probably something that makes the problems with freezing harder to solve. Maybe that should be put in scope. 
> Assuming we could compute it cheaply enough, if we had an approximate median > oldest-64bit-xid-on-page and the number of unfrozen pages, we could > differentiate between tables that have lots of recent unfrozen pages (the > median will be low) and pages with lots of unfrozen pages that are unlikely to > be updated again (the median will be high and growing). Something like the > median 64bit xid would be interesting because it'd not get "invalidated" if > relfrozenxid is increased. I'm glad that you're mostly of the view that we should be freezing a lot more aggressively overall, but I think that you're still too focussed on avoiding small problems. I understand why novel new problems are generally more of a concern than established old problems, but there needs to be a sense of proportion. Performance stability is incredibly important, and isn't zero cost. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
Hi, On 2023-09-27 10:25:00 -0700, Peter Geoghegan wrote: > On Wed, Sep 27, 2023 at 10:01 AM Andres Freund wrote: > > On 2023-09-26 09:07:13 -0700, Peter Geoghegan wrote: > > I don't think doing this on a system wide basis with a metric like #unfrozen > > pages is a good idea. It's quite common to have short lived data in some > > tables while also having long-lived data in other tables. Making > > opportunistic > > freezing more aggressive in that situation will just hurt, without a benefit > > (potentially even slowing down the freezing of older data!). And even > > within a > > single table, making freezing more aggressive because there's a decent sized > > part of the table that is updated regularly and thus not frozen, doesn't > > make > > sense. > > I never said that #unfrozen pages should be the sole criterion, for > anything. Just that it would influence the overall strategy, making > the system veer towards more aggressive freezing. It would complement > a more sophisticated algorithm that decides whether or not to freeze a > page based on its individual characteristics. > > For example, maybe the page-level algorithm would have a random > component. That could potentially be where the global (or at least > table level) view gets to influence things -- the random aspect is > weighed using the global view of debt. That kind of thing seems like > an interesting avenue of investigation. I don't disagree that we should do something in that direction - I just don't see the raw number of unfrozen pages being useful in that regard. If you have a database where no pages live long, we don't need to freeze opportunistically, yet the fraction of unfrozen pages will be huge. > > If we want to take global freeze debt into account, which I think is a good > > idea, we'll need a smarter way to represent the debt than just the number of > > unfrozen pages. I think we would need to track the age of unfrozen pages in > > some way. 
If there are a lot of unfrozen pages with a recent xid, then it's > > fine, but if they are older and getting older, it's a problem and we need to > > be more aggressive. > > Tables like pgbench_history will have lots of unfrozen pages with a > recent XID that get scanned during every VACUUM. We should be freezing > such pages at the earliest opportunity. I think we ought to be able to freeze tables with as simple a workload as pgbench_history has aggressively without taking a global freeze debt into account. > > The problem I see is how to track the age of unfrozen data - > > it'd be easy enough to track the mean(oldest-64bit-xid-on-page), but then we > > again have the issue of rare outliers moving the mean too much... > > I think that XID age is mostly not very important compared to the > absolute amount of unfrozen pages, and the cost profile of freezing > now versus later. (XID age *is* important in emergencies, but that's > mostly not what we're discussing right now.) We definitely *also* should take the number of unfrozen pages into account. I just don't think determining freeze debt primarily using the number of unfrozen pages will be useful. The presence of unfrozen pages that are likely to be updated again soon is not a problem and makes the simple metric pretty much useless. > To be clear, that doesn't mean that XID age shouldn't play an > important role in helping VACUUM to differentiate between pages that > should not be frozen and pages that should be frozen. I think we need to take it into account to determine a useful freeze debt on a table level (and potentially system wide too). Assuming we could compute it cheaply enough, if we had an approximate median oldest-64bit-xid-on-page and the number of unfrozen pages, we could differentiate between tables that have lots of recent unfrozen pages (the median will be low) and tables with lots of unfrozen pages that are unlikely to be updated again (the median will be high and growing). 
Something like the median 64bit xid would be interesting because it'd not get "invalidated" if relfrozenxid is increased. Greetings, Andres Freund
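The median-based metric Andres floats above can be sketched in a few lines. This is a hedged illustration in Python, not proposed PostgreSQL code; the function name, the use of an exact median, and the age threshold are all invented for the example.

```python
# Sketch: combine the set of per-page oldest 64-bit XIDs on unfrozen pages
# with the current next XID to classify a table's freeze debt. A low median
# age means the unfrozen pages are recent (likely hot, fine to leave alone);
# a high and growing median signals accumulating debt. Because the XIDs are
# 64-bit, the metric is not "invalidated" when relfrozenxid advances.

from statistics import median

def freeze_debt_signal(oldest_xid_per_unfrozen_page, next_xid, age_threshold):
    """Return True if the table looks like it is accumulating freeze debt."""
    if not oldest_xid_per_unfrozen_page:
        return False  # nothing unfrozen, no debt
    median_age = next_xid - median(oldest_xid_per_unfrozen_page)
    return median_age > age_threshold
```

A real implementation would need an approximate, cheaply maintained median (the thread's open question), but the classification logic would look like this.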
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 10:01 AM Andres Freund wrote: > On 2023-09-26 09:07:13 -0700, Peter Geoghegan wrote: > I don't think doing this on a system wide basis with a metric like #unfrozen > pages is a good idea. It's quite common to have short lived data in some > tables while also having long-lived data in other tables. Making opportunistic > freezing more aggressive in that situation will just hurt, without a benefit > (potentially even slowing down the freezing of older data!). And even within a > single table, making freezing more aggressive because there's a decent sized > part of the table that is updated regularly and thus not frozen, doesn't make > sense. I never said that #unfrozen pages should be the sole criterion, for anything. Just that it would influence the overall strategy, making the system veer towards more aggressive freezing. It would complement a more sophisticated algorithm that decides whether or not to freeze a page based on its individual characteristics. For example, maybe the page-level algorithm would have a random component. That could potentially be where the global (or at least table level) view gets to influence things -- the random aspect is weighed using the global view of debt. That kind of thing seems like an interesting avenue of investigation. > If we want to take global freeze debt into account, which I think is a good > idea, we'll need a smarter way to represent the debt than just the number of > unfrozen pages. I think we would need to track the age of unfrozen pages in > some way. If there are a lot of unfrozen pages with a recent xid, then it's > fine, but if they are older and getting older, it's a problem and we need to > be more aggressive. Tables like pgbench_history will have lots of unfrozen pages with a recent XID that get scanned during every VACUUM. We should be freezing such pages at the earliest opportunity. 
> The problem I see is how to track the age of unfrozen data - > it'd be easy enough to track the mean(oldest-64bit-xid-on-page), but then we > again have the issue of rare outliers moving the mean too much... I think that XID age is mostly not very important compared to the absolute amount of unfrozen pages, and the cost profile of freezing now versus later. (XID age *is* important in emergencies, but that's mostly not what we're discussing right now.) To be clear, that doesn't mean that XID age shouldn't play an important role in helping VACUUM to differentiate between pages that should not be frozen and pages that should be frozen. But even there it should probably be treated as a cue. And, the relationship between the XIDs on the page is probably more important than their absolute age (or relationship to XIDs that appear elsewhere more generally). For example, if a page is filled with heap tuples whose XIDs all match (i.e. tuples that were all inserted by the same transaction), or XIDs that are at least very close together, then that could make VACUUM more enthusiastic about freezing now. OTOH if the XIDs are more heterogeneous (but never very old), or if some xmin fields were frozen, then VACUUM should show much less enthusiasm for freezing if it's expensive. -- Peter Geoghegan
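The page-level cue Peter describes (homogeneous xmins make a page a better freeze candidate than heterogeneous ones) can be sketched as a small scoring function. This is an invented illustration, not PostgreSQL code; the spread threshold and the three-level score are assumptions made for the example.

```python
# Sketch: score a page's freeze candidacy from the spread of its tuple
# xmin values. Tuples all inserted by one transaction (spread 0), or by
# transactions very close together, suggest bulk-loaded data worth
# freezing eagerly; a wide spread suggests ongoing churn.

def freeze_enthusiasm(xmins, spread_threshold=16):
    """Return 2 = freeze eagerly, 1 = neutral, 0 = hold off if expensive."""
    if not xmins:
        return 1  # empty page: no signal either way
    spread = max(xmins) - min(xmins)
    if spread <= spread_threshold:
        return 2  # XIDs identical or tightly clustered: likely bulk insert
    return 0      # heterogeneous XIDs: show less enthusiasm for freezing
```

Note this scores the *relationship between* the XIDs on the page, not their absolute age, matching the distinction drawn above.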
Re: Eager page freeze criteria clarification
Hi, On 2023-09-26 09:07:13 -0700, Peter Geoghegan wrote: > On Tue, Sep 26, 2023 at 8:19 AM Andres Freund wrote: > > However, I'm not at all convinced doing this on a system wide level is a > > good > > idea. Databases do often contain multiple types of workloads at the same > > time. E.g., we want to freeze aggressively in a database that has the bulk > > of > > its size in archival partitions but has lots of unfrozen data in an active > > partition. And databases often have loads of data that's going to change > > frequently / isn't long lived, and we don't want to super aggressively > > freeze > > that, just because it's a large portion of the data. > > I didn't say that we should always have most of the data in the > database frozen, though. Just that we can reasonably be more lazy > about freezing the remainder of pages if we observe that most pages > are already frozen. How they got that way is another discussion. > > I also think that the absolute amount of debt (measured in physical > units such as unfrozen pages) should be kept under control. But that > isn't something that can ever be expected to work on the basis of a > simple threshold -- if only because autovacuum scheduling just doesn't > work that way, and can't really be adapted to work that way. I don't think doing this on a system wide basis with a metric like #unfrozen pages is a good idea. It's quite common to have short lived data in some tables while also having long-lived data in other tables. Making opportunistic freezing more aggressive in that situation will just hurt, without a benefit (potentially even slowing down the freezing of older data!). And even within a single table, making freezing more aggressive because there's a decent sized part of the table that is updated regularly and thus not frozen, doesn't make sense. If we want to take global freeze debt into account, which I think is a good idea, we'll need a smarter way to represent the debt than just the number of unfrozen pages. 
I think we would need to track the age of unfrozen pages in some way. If there are a lot of unfrozen pages with a recent xid, then it's fine, but if they are older and getting older, it's a problem and we need to be more aggressive. The problem I see is how to track the age of unfrozen data - it'd be easy enough to track the mean(oldest-64bit-xid-on-page), but then we again have the issue of rare outliers moving the mean too much... Greetings, Andres Freund
Re: Eager page freeze criteria clarification
Hi, (I wrote the first part of the email before Robert and I chatted on a call, I left it in the email for posterity) On 2023-09-26 13:49:32 -0400, Robert Haas wrote: > On Tue, Sep 26, 2023 at 11:11 AM Andres Freund wrote: > > As long as the most extreme cases are prevented, unnecessarily freezing is > > imo > > far less harmful than freezing too little. > > > > I'm worried that using something as long as 100-200% > > time-between-recent-checkpoints won't handle insert-mostly workload well, > > which IME are also workloads suffering quite badly under our current scheme > > - > > and which are quite common. > > I wrote about this problem in my other reply and I'm curious as to > your thoughts about it. Basically, suppose we forget about all of > Melanie's tests except for three cases: (1) an insert-only table, (2) > an update-heavy workload with uniform distribution, and (3) an > update-heavy workload with skew. In case (1), freezing is good. In > case (2), freezing is bad. In case (3), freezing is good for cooler > pages and bad for hotter ones. I postulate that any > recency-of-modification threshold that handles (1) well will handle > (2) poorly, and that the only way to get both right is to take some > other factor into account. You seem to be arguing that we can just > freeze aggressively in case (2) and it won't cost much, but it doesn't > sound to me like Melanie believes that and I don't think I do either. I don't believe we can freeze aggressively in all cases of 2) without causing problems. A small-ish table that's vacuumed constantly, where all rows are constantly frozen and then updated, will suffer a lot from the WAL overhead. Whereas superfluously freezing a row in a table with many millions of rows, where each row is only occasionally updated, due to the update rate being much smaller than the number of rows, will have acceptable overhead. 
What I *do* believe is that for all but the most extreme cases, it's safer to freeze too much than to freeze too little. There definitely are negative consequences, but they're more bounded and less surprising than not freezing for ages and then suddenly freezing everything at once. Whether 2) really exists in the real world for huge tables is of course somewhat debatable... > > > This doesn't seem completely stupid, but I fear it would behave > > > dramatically differently on a workload a little smaller than s_b vs. > > > one a little larger than s_b, and that doesn't seem good. > > > > Hm. I'm not sure that that's a real problem. In the case of a workload > > bigger > > than s_b, having to actually read the page again increases the cost of > > freezing later, even if the workload is just a bit bigger than s_b. > > That is true, but I don't think it means that there is no problem. It > could lead to a situation where, for a while, a table never needs any > significant freezing, because we always freeze aggressively. What do you mean by "always freeze aggressively" - do you mean 'aggressive' autovacuums? Or opportunistic freezing being aggressive? I don't know why the former would be the case? > When it grows large enough, we suddenly stop freezing it aggressively, and > now it starts experiencing vacuums that do a whole bunch of work all at > once. A user who notices that is likely to be pretty confused about what > happened, and maybe not too happy when they find out. Hm - isn't my proposal exactly the other way round? I'm proposing that a page is frozen more aggressively if not already in shared buffers - which will become more common once the table has grown "large enough"? 
(the remainder was written after that call) I think there were three main ideas that we discussed: 1) We don't need to be accurate in the freezing decisions for individual pages, we "just" need to avoid the situation that over time we commonly freeze pages that will be updated again "soon". 2) It might be valuable to adjust the "should freeze page opportunistically" based on feedback. 3) We might need to classify the workload for a table and use different heuristics for different workloads. For 2), one of the mechanisms we discussed was to collect information about the "age" of a page when we "unfreeze" it. If we frequently unfreeze pages that were recently frozen, we need to be less aggressive in opportunistic freezing going forward. If that never happens, we can be more aggressive. The basic idea for classifying the age of a page when unfreezing is to use "insert_lsn - page_lsn", pretty simple. We can convert that into time using the averaged WAL generation rate. What's a bit harder is figuring out how to usefully aggregate the age across multiple "unfreezes". I was initially thinking of just using the mean, but Robert was rightly concerned that the mean would be moved a lot when occasionally freezing very old pages, potentially leading to opportunistically freezing young pages too aggressively. The median would be a better choice, bu
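The feedback mechanism for idea 2) above can be sketched concretely. This is a hedged Python illustration of the scheme from the call notes, not PostgreSQL code; the function names, the use of an exact median as the aggregate, and the parameters are assumptions.

```python
# Sketch of the "unfreeze age" feedback loop: when a frozen page is
# modified again ("unfrozen"), estimate how long it stayed frozen as
# (insert_lsn - page_lsn) divided by the averaged WAL generation rate,
# then aggregate those ages (median, per Robert's concern about means
# being skewed by occasional very old pages) to decide whether
# opportunistic freezing has been too aggressive.

from statistics import median

def unfreeze_age_seconds(insert_lsn, page_lsn, wal_bytes_per_second):
    """Approximate wall-clock time the page spent frozen."""
    return (insert_lsn - page_lsn) / wal_bytes_per_second

def freezing_too_aggressive(unfreeze_ages, target_frozen_duration):
    """True if the median frozen page was modified again before the
    target duration, i.e. we keep freezing pages that come back hot."""
    return bool(unfreeze_ages) and median(unfreeze_ages) < target_frozen_duration
```

The LSN-to-time conversion assumes a steady averaged WAL rate; the later LSNTimeline work in this thread replaces that with interpolation over recorded (LSN, timestamp) pairs.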
Re: Eager page freeze criteria clarification
On Tue, Sep 26, 2023 at 11:11 AM Andres Freund wrote: > That I'd like you to expand on "using the RedoRecPtr of the latest checkpoint > rather than the LSN of the previous vacuum." - I can think of ways of doing so > that could end up with quite different behaviour... Yeah, me too. I'm not sure what is best. > As long as the most extreme cases are prevented, unnecessarily freezing is imo > far less harmful than freezing too little. > > I'm worried that using something as long as 100-200% > time-between-recent-checkpoints won't handle insert-mostly workloads well, > which IME are also workloads suffering quite badly under our current scheme - > and which are quite common. I wrote about this problem in my other reply and I'm curious as to your thoughts about it. Basically, suppose we forget about all of Melanie's tests except for three cases: (1) an insert-only table, (2) an update-heavy workload with uniform distribution, and (3) an update-heavy workload with skew. In case (1), freezing is good. In case (2), freezing is bad. In case (3), freezing is good for cooler pages and bad for hotter ones. I postulate that any recency-of-modification threshold that handles (1) well will handle (2) poorly, and that the only way to get both right is to take some other factor into account. You seem to be arguing that we can just freeze aggressively in case (2) and it won't cost much, but it doesn't sound to me like Melanie believes that and I don't think I do either. > > This doesn't seem completely stupid, but I fear it would behave > > dramatically differently on a workload a little smaller than s_b vs. > > one a little larger than s_b, and that doesn't seem good. > > Hm. I'm not sure that that's a real problem. In the case of a workload bigger > than s_b, having to actually read the page again increases the cost of > freezing later, even if the workload is just a bit bigger than s_b. That is true, but I don't think it means that there is no problem. 
It could lead to a situation where, for a while, a table never needs any significant freezing, because we always freeze aggressively. When it grows large enough, we suddenly stop freezing it aggressively, and now it starts experiencing vacuums that do a whole bunch of work all at once. A user who notices that is likely to be pretty confused about what happened, and maybe not too happy when they find out. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
On Tue, Sep 26, 2023 at 8:19 AM Andres Freund wrote: > However, I'm not at all convinced doing this on a system wide level is a good > idea. Databases do often contain multiple types of workloads at the same > time. E.g., we want to freeze aggressively in a database that has the bulk of > its size in archival partitions but has lots of unfrozen data in an active > partition. And databases often have loads of data that's going to change > frequently / isn't long lived, and we don't want to super aggressively freeze > that, just because it's a large portion of the data. I didn't say that we should always have most of the data in the database frozen, though. Just that we can reasonably be more lazy about freezing the remainder of pages if we observe that most pages are already frozen. How they got that way is another discussion. I also think that the absolute amount of debt (measured in physical units such as unfrozen pages) should be kept under control. But that isn't something that can ever be expected to work on the basis of a simple threshold -- if only because autovacuum scheduling just doesn't work that way, and can't really be adapted to work that way. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
Hi, On 2023-09-25 14:16:46 -0700, Peter Geoghegan wrote: > On Mon, Sep 25, 2023 at 11:45 AM Robert Haas wrote: > I'm surprised that there hasn't been any discussion of the absolute > amount of system-wide freeze debt on this thread. If 90% of the pages > in the entire database are frozen, it'll generally be okay if we make > the wrong call by freezing lazily when we shouldn't. This is doubly > true within small to medium sized tables, where the cost of catching > up on freezing cannot ever be too bad (concentrations of unfrozen > pages in one big table are what really hurt users). I generally agree with the idea of using "freeze debt" as an input - IIRC I proposed using the number of frozen pages vs number of pages as the input to the heuristic in one of the other threads about the topic a while back. I also think we should read a portion of all-visible-but-not-frozen pages during non-aggressive vacuums to manage the freeze debt. However, I'm not at all convinced doing this on a system wide level is a good idea. Databases do often contain multiple types of workloads at the same time. E.g., we want to freeze aggressively in a database that has the bulk of its size in archival partitions but has lots of unfrozen data in an active partition. And databases often have loads of data that's going to change frequently / isn't long lived, and we don't want to super aggressively freeze that, just because it's a large portion of the data. Greetings, Andres Freund
Re: Eager page freeze criteria clarification
Hi, On 2023-09-25 14:45:07 -0400, Robert Haas wrote: > On Fri, Sep 8, 2023 at 12:07 AM Andres Freund wrote: > > > Downthread, I proposed using the RedoRecPtr of the latest checkpoint > > > rather than the LSN of the previous vacuum. I still like that idea. > > > > Assuming that "downthread" references > > https://postgr.es/m/CA%2BTgmoYb670VcDFbekjn2YQOKF9a7e-kBFoj2WJF1HtH7YPaWQ%40mail.gmail.com > > could you sketch out the logic you're imagining a bit more? > > I'm not exactly sure what the question is here. That I'd like you to expand on "using the RedoRecPtr of the latest checkpoint rather than the LSN of the previous vacuum." - I can think of ways of doing so that could end up with quite different behaviour... > > Perhaps we can mix both approaches. We can use the LSN and time of the last > > vacuum to establish an LSN->time mapping that's reasonably accurate for a > > relation. For infrequently vacuumed tables we can use the time between > > checkpoints to establish a *more aggressive* cutoff for freezing than what a > > percent-of-time-since-last-vacuum approach would provide. If e.g. a table gets > > vacuumed every 100 hours and checkpoint timeout is 1 hour, no realistic > > percent-of-time-since-last-vacuum setting will allow freezing, as all dirty > > pages will be too new. To allow freezing a decent proportion of those, we > > could allow freezing pages that lived longer than ~20% > > time-between-recent-checkpoints. > > Yeah, I don't know if that's exactly the right idea, but I think it's > in the direction that I was thinking about. I'd even be happy with > 100% of the time-between-recent checkpoints, maybe even 200% of > time-between-recent checkpoints. But I think there probably should be > some threshold beyond which we say "look, this doesn't look like it > gets touched that much, let's just freeze it so we don't have to come > back to it again later." 
As long as the most extreme cases are prevented, unnecessarily freezing is imo far less harmful than freezing too little. I'm worried that using something as long as 100-200% time-between-recent-checkpoints won't handle insert-mostly workloads well, which IME are also workloads suffering quite badly under our current scheme - and which are quite common. > I think part of the calculus here should probably be that when the > freeze threshold is long, the potential gains from making it even > longer are not that much. If I change the freeze threshold on a table > from 1 minute to 1 hour, I can potentially save uselessly freezing > that page 59 times per hour, every hour, forever, if the page always > gets modified right after I touch it. If I change the freeze threshold > on a table from 1 hour to 1 day, I can only save 23 unnecessary > freezes per day. Percentage-wise, the overhead of being wrong is the > same in both cases: I can have as many extra freeze operations as I > have page modifications, if I pick the worst possible times to freeze > in every case. But in absolute terms, the savings in the second > scenario are a lot less. I think if a user is accessing a table > frequently, the overhead of jamming a useless freeze in between every > table access is going to be a lot more noticeable than when the table > is only accessed every once in a while. And I also think it's a lot > less likely that we'll reliably get it wrong. Workloads that touch a > page and then touch it again ~N seconds later can exist for all values > of N, but I bet they're way more common for small values of N than > large ones. True. And with larger Ns it also becomes more likely that we'd need to freeze the rows anyway. I've seen tables being hit with several anti-wraparound vacuums a day, but not several anti-wraparound vacuums a minute... > Is there also a need for a similar guard in the other direction? 
> Let's say that autovacuum_naptime=15s and on some particular table it triggers every time. I've actually seen this on small queue tables. Do you think that, in such tables, we should freeze pages that haven't been modified in 15s?

I don't think it matters much, proportionally to the workload of rewriting nearly all of the table every few seconds, the overhead of freezing a bunch of already dirty pages is negligible.

> > Hm, possibly stupid idea: What about using shared_buffers residency as a factor? If vacuum had to read in a page to vacuum it, a) we would need read IO to freeze it later, as we'll soon evict the page via the ringbuffer b) non-residency indicates the page isn't constantly being modified?
>
> This doesn't seem completely stupid, but I fear it would behave dramatically differently on a workload a little smaller than s_b vs. one a little larger than s_b, and that doesn't seem good.

Hm. I'm not sure that that's a real problem. In the case of a workload bigger than s_b, having to actually read the page again increases the cost of freezing later, even if the workload is just a bit bigger than s_b.

Greetings,

Andres
Re: Eager page freeze criteria clarification
On Mon, Sep 25, 2023 at 11:45 AM Robert Haas wrote: > > The reason I was thinking of using the "lsn at the end of the last vacuum", > > is > > that it seems to be more adapative to the frequency of vacuuming. > > Yes, but I think it's *too* adaptive. The frequency of vacuuming can > plausibly be multiple times per minute or not even annually. That's > too big a range of variation. +1. The risk of VACUUM chasing its own tail seems very real. We want VACUUM to be adaptive to the workload, not adaptive to itself. > Yeah, I don't know if that's exactly the right idea, but I think it's > in the direction that I was thinking about. I'd even be happy with > 100% of the time-between-recent checkpoints, maybe even 200% of > time-between-recent checkpoints. But I think there probably should be > some threshold beyond which we say "look, this doesn't look like it > gets touched that much, let's just freeze it so we don't have to come > back to it again later." The sole justification for any strategy that freezes lazily is that it can avoid useless freezing when freezing turns out to be unnecessary -- that's it. So I find it more natural to think of freezing as the default action, and *not freezing* as the thing that requires justification. Thinking about it "backwards" like that just seems simpler to me. There is only one possible reason to not freeze, but several reasons to freeze. > I think part of the calculus here should probably be that when the > freeze threshold is long, the potential gains from making it even > longer are not that much. If I change the freeze threshold on a table > from 1 minute to 1 hour, I can potentially save uselessly freezing > that page 59 times per hour, every hour, forever, if the page always > gets modified right after I touch it. If I change the freeze threshold > on a table from 1 hour to 1 day, I can only save 23 unnecessary > freezes per day. I totally agree with you on this point. 
It seems related to my point about "freezing being the conceptual default action" in VACUUM. Generally speaking, over-freezing is a problem when we reach the same wrong conclusion (freeze the page) about the same relatively few pages over and over -- senselessly repeating those mistakes really adds up when you're vacuuming the same table very frequently. On the other hand, under-freezing is typically a problem when we reach the same wrong conclusion (don't freeze the page) about lots of pages only once in a very long while. I strongly suspect that there is very little gray area between the two, across the full spectrum of application characteristics.

Most individual pages have very little chance of being modified in the short to medium term. In a perfect world, with a perfect algorithm, we'd almost certainly be freezing most pages at the earliest opportunity. It is nevertheless also true that a freezing policy that is only somewhat more aggressive than this ideal oracle algorithm will freeze way too aggressively (by at least some measures). There isn't much of a paradox to resolve here: it's all down to the cadence of vacuuming, and of rows subject to constant churn.

As you point out, the "same policy" can produce dramatically different outcomes when you actually consider what the consequences of the policy are over time, when applied by VACUUM under a variety of different workload conditions. So any freezing policy must be designed with due consideration for those sorts of things. If VACUUM doesn't freeze the page now, then when will it freeze it? For most individual pages, that time will come (again, pages that benefit from lazy vacuuming are the exception rather than the rule). Right now, VACUUM almost behaves as if it thought "that's not my problem, it's a problem for future me!".

Trying to differentiate between pages that we must not over-freeze and pages that we must not under-freeze seems important.
Generational garbage collection (as used by managed VM runtimes) does something that seems a little like this. It's based on the empirical observation that "most objects die young". What the precise definition of "young" really is varies significantly, but that turns out to be less of a problem than you might think -- it can be derived through feedback cycles. If you look at memory lifetimes on a logarithmic scale, very different sorts of applications tend to look like they have remarkably similar memory allocation characteristics. > Percentage-wise, the overhead of being wrong is the > same in both cases: I can have as many extra freeze operations as I > have page modifications, if I pick the worst possible times to freeze > in every case. But in absolute terms, the savings in the second > scenario are a lot less. Very true. I'm surprised that there hasn't been any discussion of the absolute amount of system-wide freeze debt on this thread. If 90% of the pages in the entire database are frozen, it'll generally be okay if we make the wrong call by freezing lazily when we shouldn't. This i
Re: Eager page freeze criteria clarification
On Sat, Sep 23, 2023 at 3:53 PM Melanie Plageman wrote: > Freeze tuples on a page opportunistically if the page would be totally > frozen and: > > 4. Buffer is already dirty and no FPI is required OR page LSN is older > than 33% of the LSNs since the last vacuum of the table. > > 5. Buffer is already dirty and no FPI is required AND page LSN is older > than 33% of the LSNs since the last vacuum of the table. > > On master, the heuristic is to freeze a page opportunistically if it > would be totally frozen and if pruning emitted an FPI. > --- > > My takeaways from all of the workload results are as follows: > > Algorithm 4 is too aggressive and regresses performance compared to > master. > > Algorithm 5 freezes surprisingly few pages, especially in workloads > where only the most recent data is being accessed or modified > (append-only, update recent data). > > A work queue-like workload with other concurrent workloads is a > particular challenge for the freeze heuristic, and we should think more > about how to handle this. I feel like we have a sort of Goldilocks problem here: the porridge is either too hot or too cold, never just right. Say we just look at workload B, looking at the difference between pgbench_accounts (which is randomly and frequently updated and thus shouldn't be opportunistically frozen) and pgbench_history (which is append-only and should thus be frozen aggressively). Algorithm 4 gets pgbench_history right and pgbench_accounts wrong, and master does the opposite. In a perfect world, we'd have an algorithm which could distinguish sharply between those cases, ramping up to maximum aggressiveness on pgbench_history while doing nothing at all to pgbench_accounts. Algorithm 5 partially accomplishes this, but the results aren't super-inspiring either. It doesn't add many page freezes in the case where freezing is bad, but it also only manages to freeze a quarter of pgbench_history, where algorithm 4 manages to freeze basically all of it. 
That's a pretty stark difference. Given that algorithm 5 seems to make some mistakes on some of the other workloads, I don't think it's entirely clear that it's an improvement over master, at least in practical terms. It might be worth thinking harder about what it takes specifically to get the pgbench_history case, aka the append-only table case, correct.

One thing that probably doesn't work very well is to freeze pages that are more than X minutes old. Algorithm 5 uses an LSN threshold instead of a wall-clock based threshold, but the effect is the same. I think the problem here is that the vacuum operation essentially happens in an instant. At the instant that it happens, some fraction of the data added since the last vacuum is older than whatever threshold you pick, and the rest is newer. If data is added at a constant rate and you want to freeze at least 90% of the data, your recency threshold has to be no more than 10% of the time since the last vacuum. But with autovacuum_naptime=60s, that's like 6 seconds, and that's way too aggressive for a table like pgbench_accounts.

It seems to me that it's not possible to get both cases right by twiddling the threshold, because pgbench_history wants the threshold to be 0, and pgbench_accounts wants it to be ... perhaps not infinity, because maybe the distribution is Gaussian or Zipfian or something rather than uniform, but probably a couple of minutes. So I feel like if we want to get both pgbench_history and pgbench_accounts right, we need to consider some additional piece of information that makes those cases distinguishable. Either of those tables can contain a page that hasn't been accessed in 20 seconds, but the correct behavior for such a page differs between one case and the other.

One random idea that I had was to refuse to opportunistically freeze a page more than once while it remains resident in shared_buffers.
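The constant-rate arithmetic above can be made concrete with a few lines. This is a purely illustrative sketch (the function names are made up, not anything in Postgres): under a perfectly constant insert rate, a recency threshold of X% of the inter-vacuum interval leaves exactly X% of the newly inserted data too young to freeze at each vacuum.

```python
# Toy model: data inserted at a constant rate, vacuum happens "in an instant".
# A page is freezable if it is older than threshold_fraction of the time
# since the last vacuum, so the frozen share is simply the complement.

def frozen_fraction(threshold_fraction):
    """Share of data inserted since the last vacuum that is old enough
    to freeze, assuming a perfectly constant insert rate."""
    return 1.0 - threshold_fraction

# To freeze >= 90% of an append-only table, the threshold must be <= 10%
# of the inter-vacuum interval; with vacuums every 60s that is 6 seconds.
naptime_s = 60
threshold_s = 0.10 * naptime_s
```

This is why no single threshold fits both pgbench_history (wants ~0) and pgbench_accounts (wants minutes).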
The first time we do it, we set a bit in the buffer header or something that suppresses further opportunistic freezing. When the buffer is evicted the bit is cleared. So we can still be wrong on a heavily updated table like pgbench_accounts, but if the table fits in shared_buffers, we'll soon realize that we're getting it wrong a lot and will stop making the same mistake over and over. But this kind of idea only works if the working set is small enough to fit in shared_buffers, so I don't think it's actually a great plan, unless we only care about suppressing excess freezing on workloads that fit in shared_buffers.

A variant on the same theme could be to keep some table-level counters and use them to assess how often we're getting it wrong. If we're often thawing recently-frozen pages, don't freeze so aggressively. But this will not work if different parts of the same table behave differently. If we don't want to do something like this that somehow responds to the characteristics of a particular page or table, then it seems we either have to freeze quite aggressively to pick up
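The "freeze at most once per residency" idea can be modeled in a few lines. This is a toy sketch only: in Postgres the state would presumably be a flag bit in the buffer header, not a Python set, and the names here are hypothetical.

```python
# Toy model of suppressing repeat opportunistic freezing: remember, per
# resident buffer, whether we already froze it once this residency, and
# clear that memory when the buffer is evicted.

class BufferPool:
    def __init__(self):
        self.frozen_once = set()   # ids of buffers frozen while resident

    def maybe_freeze(self, buf_id):
        """Return True if we opportunistically freeze this buffer now."""
        if buf_id in self.frozen_once:
            return False           # already froze it once this residency
        self.frozen_once.add(buf_id)
        return True

    def evict(self, buf_id):
        self.frozen_once.discard(buf_id)  # next residency may freeze again

pool = BufferPool()
```

As noted above, this only self-corrects when the working set fits in shared_buffers, since eviction is what resets the suppression.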
Re: Eager page freeze criteria clarification
On Fri, Sep 8, 2023 at 12:07 AM Andres Freund wrote:
> > Downthread, I proposed using the RedoRecPtr of the latest checkpoint rather than the LSN of the previous vacuum. I still like that idea.
>
> Assuming that "downthread" references https://postgr.es/m/CA%2BTgmoYb670VcDFbekjn2YQOKF9a7e-kBFoj2WJF1HtH7YPaWQ%40mail.gmail.com could you sketch out the logic you're imagining a bit more?

I'm not exactly sure what the question is here. I mean, it doesn't quite make sense to just ask whether the page LSN is newer than the last checkpoint's REDO record, because I think that's basically just asking whether or not we would need an FPI, and the freeze criteria that Melanie has been considering incorporate that check in some other way already. But maybe some variant on that idea is useful - the distance to the second-most-recent checkpoint, or a multiple or percentage of the distance to the most-recent checkpoint, or whatever.

> The reason I was thinking of using the "lsn at the end of the last vacuum", is that it seems to be more adaptive to the frequency of vacuuming.

Yes, but I think it's *too* adaptive. The frequency of vacuuming can plausibly be multiple times per minute or not even annually. That's too big a range of variation. The threshold for freezing can vary by how actively the table or the page is updated, but I don't think it should vary by six orders of magnitude. Under what theory does it make sense to say "this row in table A hasn't been modified in 20 seconds, so let's freeze it, but this row in table B hasn't been modified in 8 months, so let's not freeze it because it might be modified again soon"? If you use the LSN at the end of the last vacuum, you're going to end up making decisions exactly like that, which seems wrong to me.

> Perhaps we can mix both approaches. We can use the LSN and time of the last vacuum to establish an LSN->time mapping that's reasonably accurate for a relation.
> For infrequently vacuumed tables we can use the time between checkpoints to establish a *more aggressive* cutoff for freezing than what a percent-of-time-since-last-vacuum approach would provide. If e.g. a table gets vacuumed every 100 hours and checkpoint timeout is 1 hour, no realistic percent-of-time-since-last-vacuum setting will allow freezing, as all dirty pages will be too new. To allow freezing a decent proportion of those, we could allow freezing pages that lived longer than ~20% time-between-recent-checkpoints.

Yeah, I don't know if that's exactly the right idea, but I think it's in the direction that I was thinking about. I'd even be happy with 100% of the time-between-recent checkpoints, maybe even 200% of time-between-recent checkpoints. But I think there probably should be some threshold beyond which we say "look, this doesn't look like it gets touched that much, let's just freeze it so we don't have to come back to it again later."

I think part of the calculus here should probably be that when the freeze threshold is long, the potential gains from making it even longer are not that much. If I change the freeze threshold on a table from 1 minute to 1 hour, I can potentially save uselessly freezing that page 59 times per hour, every hour, forever, if the page always gets modified right after I touch it. If I change the freeze threshold on a table from 1 hour to 1 day, I can only save 23 unnecessary freezes per day. Percentage-wise, the overhead of being wrong is the same in both cases: I can have as many extra freeze operations as I have page modifications, if I pick the worst possible times to freeze in every case. But in absolute terms, the savings in the second scenario are a lot less. I think if a user is accessing a table frequently, the overhead of jamming a useless freeze in between every table access is going to be a lot more noticeable than when the table is only accessed every once in a while.
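The mixed cutoff being discussed can be stated in a few lines. This is an illustrative sketch, not proposed patch code: the function, parameter names, and the 33%/2x constants are assumptions chosen to mirror the numbers mentioned in the thread, and LSNs are modeled as plain integers.

```python
# Hypothetical mixed heuristic: freeze a page whose LSN is older than some
# fraction of the WAL generated since the last vacuum, but bound that by a
# small multiple of the recent checkpoint distance so that infrequently
# vacuumed tables still get frozen. Larger cutoff LSN == more aggressive.

def freeze_cutoff_lsn(insert_lsn, last_vacuum_lsn, checkpoint_distance,
                      vacuum_fraction=0.33, checkpoint_multiple=2.0):
    """Pages with page_lsn <= returned cutoff would be frozen."""
    since_vacuum = insert_lsn - last_vacuum_lsn
    # Fraction-of-WAL-since-last-vacuum cutoff...
    vacuum_cutoff = insert_lsn - vacuum_fraction * since_vacuum
    # ...but at least as aggressive as N recent checkpoint intervals.
    checkpoint_cutoff = insert_lsn - checkpoint_multiple * checkpoint_distance
    return max(vacuum_cutoff, checkpoint_cutoff)

# Table "vacuumed every 100 hours" with 1-hour checkpoints (units abstract):
# the vacuum-based cutoff alone only reaches back 33 hours of WAL, while
# the checkpoint bound freezes everything older than ~2 hours of WAL.
cutoff = freeze_cutoff_lsn(insert_lsn=100, last_vacuum_lsn=0,
                           checkpoint_distance=1)
```

For frequently vacuumed tables the vacuum-based term dominates; for rarely vacuumed ones the checkpoint bound takes over, which is the behavior the quoted proposal is after.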
And I also think it's a lot less likely that we'll reliably get it wrong. Workloads that touch a page and then touch it again ~N seconds later can exist for all values of N, but I bet they're way more common for small values of N than large ones. Is there also a need for a similar guard in the other direction? Let's say that autovacuum_naptime=15s and on some particular table it triggers every time. I've actually seen this on small queue tables. Do you think that, in such tables, we should freeze pages that haven't been modified in 15s? > Hm, possibly stupid idea: What about using shared_buffers residency as a > factor? If vacuum had to read in a page to vacuum it, a) we would need read IO > to freeze it later, as we'll soon evict the page via the ringbuffer b) > non-residency indicates the page isn't constantly being modified? This doesn't seem completely stupid, but I fear it would behave dramatically differently on a workload a little smaller than s_b vs. one a little larger than s_b, and that doesn't seem good. -- Robert Haas EDB: http://www.enterprisedb
Re: Eager page freeze criteria clarification
On Sat, Sep 23, 2023 at 3:53 PM Melanie Plageman wrote:
> Workload F:
>
>   algo | WAL GB | cptr/bgwriter writes | other reads/writes | IO time AV worker
>   -----+--------+----------------------+--------------------+------------------
>   M    |  173   | 1,202,231            | 53,957,448         | 12,389
>   4    |  189   | 1,212,521            | 55,589,140         | 13,084
>   5    |  173   | 1,194,242            | 54,260,118         | 13,407
>
>   algo | P99 latency
>   -----+------------
>   M    | 19875
>   4    | 19314
>   5    | 19701

Andres mentioned that the P99 latency for the COPY workload (workload F) might not be meaningful, so I have calculated the duration total, mean, median, min, max and standard deviation in milliseconds.

Workload F:

  algo | Total     | Mean   | Median | Min    | Max    | Stddev
  -----+-----------+--------+--------+--------+--------+-------
  M    | 1,270,903 | 18,155 | 17,755 | 17,090 | 19,994 |   869
  4    | 1,167,135 | 16,673 | 16,421 | 15,585 | 19,485 |   811
  5    | 1,250,145 | 17,859 | 17,704 | 15,763 | 19,871 | 1,009

Interestingly, algorithm 4 had the lowest total duration for all COPYs. Some investigation of other data collected during the runs led us to believe this may be due to autovacuum workers doing more IO with algorithm 4 and thus generating more WAL and ending up initializing more WAL files themselves. Whereas on master and with algorithm 5, client backends had to initialize WAL files themselves, leading COPYs to take longer. This was supported by the presence of more WALInit wait events for client backends on master and with algorithm 5.

Calculating these made me realize that my conclusions about the work queue workload (workload I) didn't make much sense. Because this workload updated a non-indexed column, most pruning was HOT pruning done on access and basically no page freezing was done by vacuum. This means we weren't seeing negative performance effects of freezing related to the work queue table. The difference in this benchmark came from the relatively poor performance of the concurrent COPYs when that table was frozen more aggressively.
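For reference, statistics like these can be recomputed from pgbench's per-transaction log (pgbench -l), where the third field of each line is the transaction's elapsed time in microseconds. A minimal sketch, assuming that field layout; the sample log lines and the helper name are fabricated for illustration.

```python
# Compute duration total/mean/median/min/max/stddev in milliseconds from
# pgbench per-transaction log lines (field 3 = elapsed time in microseconds).

import statistics

def duration_stats_ms(log_lines):
    ms = [int(line.split()[2]) / 1000.0 for line in log_lines]
    return {
        "total": sum(ms),
        "mean": statistics.mean(ms),
        "median": statistics.median(ms),
        "min": min(ms),
        "max": max(ms),
        "stddev": statistics.stdev(ms) if len(ms) > 1 else 0.0,
    }

# Fabricated sample: client_id txn_no time_us script_no epoch_s epoch_us
sample = [
    "0 1 17090000 0 1695000000 0",
    "0 2 18155000 0 1695000030 0",
    "0 3 19994000 0 1695000060 0",
]
stats = duration_stats_ms(sample)
```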
I plan to run a new version of this workload which updates an indexed column for comparison and does not use a concurrent COPY.

This is the duration total, mean, median, min, max, and standard deviation in milliseconds of the COPYs which ran concurrently with the work queue pgbench.

Workload I COPYs:

  algo | Total   | Mean  | Median | Min   | Max   | Stddev
  -----+---------+-------+--------+-------+-------+-------
  M    | 191,032 | 4,898 | 4,726  | 4,486 | 9,353 |   800
  4    | 193,534 | 4,962 | 4,793  | 4,533 | 9,381 |   812
  5    | 194,351 | 4,983 | 4,771  | 4,617 | 9,159 |   783

I think this shows that algorithm 4 COPYs performed the worst. This is in contrast to the COPY-only workload (F) which did not show worse performance for algorithm 4. I think this means I should modify the work queue example and use something other than concurrent COPYs to avoid obscuring characteristics of the work queue example.

- Melanie
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 4:30 PM Melanie Plageman wrote: > On Mon, Aug 28, 2023 at 12:26 PM Robert Haas wrote: > > In row D, your algorithms are all bad, really bad. I don't quite > > understand how it can be that bad, actually. > > So, I realize now that this test was poorly designed. I meant it to be a > worst case scenario, but I think one critical part was wrong. In this > example one client is going at full speed inserting a row and then > updating it. Then another rate-limited client is deleting old data > periodically to keep the table at a constant size. I meant to bulk load > the table with enough data that the delete job would have data to delete > from the start. With the default autovacuum settings, over the course of > 45 minutes, I usually saw around 40 autovacuums of the table. Due to the > rate limiting, the first autovacuum of the table ends up freezing many > pages that are deleted soon after. Thus the total number of page freezes > is very high. > > I will redo benchmarking of workload D and start the table with the > number of rows which the DELETE job seeks to maintain. My back of the > envelope math says that this will mean ratios closer to a dozen (and not > 5000). ... > I'll rerun workload D in a more reasonable way and be back with results. Hi, I have edited and rerun all of the benchmarks for a subset of the algorithms. I ran workloads A-I, except for G (G is not an interesting workload), on Postgres master as well as Postgres with freeze heuristic algorithms 4 and 5. As a refresher, algorithms 4 and 5 are as follows: Freeze tuples on a page opportunistically if the page would be totally frozen and: 4. Buffer is already dirty and no FPI is required OR page LSN is older than 33% of the LSNs since the last vacuum of the table. 5. Buffer is already dirty and no FPI is required AND page LSN is older than 33% of the LSNs since the last vacuum of the table. 
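Restating the two heuristics as predicates makes the OR/AND difference explicit. A hypothetical sketch: the booleans and the LSN threshold are stand-ins for the values vacuum would actually compute (the threshold being the LSN 33% of the way back through the WAL generated since the table's last vacuum).

```python
# Algorithms 4 and 5 differ only in OR vs AND between the two conditions.
# All inputs are illustrative stand-ins; LSNs are modeled as integers, and
# "page_lsn < lsn_threshold" means the page is older than the cutoff.

def would_freeze_algo4(page_all_frozen, dirty, needs_fpi, page_lsn,
                       lsn_threshold):
    return page_all_frozen and (
        (dirty and not needs_fpi) or page_lsn < lsn_threshold)

def would_freeze_algo5(page_all_frozen, dirty, needs_fpi, page_lsn,
                       lsn_threshold):
    return page_all_frozen and (
        (dirty and not needs_fpi) and page_lsn < lsn_threshold)
```

A dirty, no-FPI page with a recent LSN is frozen by algorithm 4 but not by algorithm 5, which is why 4 over-freezes hot tables while 5 under-freezes append-only ones.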
On master, the heuristic is to freeze a page opportunistically if it would be totally frozen and if pruning emitted an FPI.

I made a few configuration changes for all benchmarks -- most notably autovacuum is on for all workloads and autovacuum_naptime is decreased to 10 seconds. The runtime of all time-based workloads was changed to 40 minutes. All workloads were run for a fixed duration except for the COPY workload (workload F). It is also worth noting that data loaded into the standard pgbench tables (with pgbench -i) was loaded with COPY and not COPY FREEZE. Also, pgbench -i and the pgbench runs were passed the --no-vacuum option.

The updated workloads are as follows:

A. Gaussian TPC-B like + select-only: pgbench scale 450 (DB < SB) with indexes on updated columns
   WL 1: 16 clients doing TPC-B like pgbench but with Gaussian access distribution (with parameter 6) for updated tables
   WL 2: 2 clients, rate-limited to 1000 TPS, doing select-only pgbench

B. TPC-B like: pgbench scale 450
   16 clients doing built-in pgbench TPC-B like script

C. Shifting hot set
   16 clients inserting a single row then updating an indexed column in that row

D. Shifting hot set, delete old data
   WL 1: 16 clients inserting one row then updating an indexed column in that row
   WL 2: 1 client, rate limited to 10 TPS, deleting data more than 1 minute old

E. Shifting hot set, delete new data, access new data
   WL 1: 16 clients cycling through 2 single row inserts, an update of an indexed column in a single recently inserted row, and a delete of a single recently inserted row
   WL 2: 1 client, rate-limited to 0.2 TPS, selecting data from the last 5 minutes

F. Many COPYs
   1 client, copying a total of 90 GB of data in 70 individual COPYs

G. N/A

H. Append only table
   16 clients, inserting a single row at a time

I.
Work queue
   WL 1: 16 clients inserting a single row, updating a non-indexed column in that row twice, then deleting that row
   WL 2: 1 client, rate-limited to 0.02 TPS, COPYing data into another table

Below are some of the data I collected, including how much WAL was generated, how much IO various backends did, TPS, and P99 latency. These were collected at the end of the pgbench run. Most numbers were rounded to the nearest whole number to make the chart easier to see. So, for example, a 0 in WAL GB does not mean that no WAL was emitted. P99 latency below 0 was rounded to a single decimal place. All IO time is in milliseconds. "P99 lat" was calculated using the "time" field from pgbench's "per-transaction logging" option [1]. It is the "transaction's elapsed time" or duration. I converted it to milliseconds. "pages frozen" here refers to the number of pages marked frozen in the visibility map at the end of the run. "page freezes" refers to the number of times vacuum froze tuples in lazy_scan_prune(). This does not include cases in which the page was found to be all frozen but no tuples were explicitly frozen. "AVs" is the number of autovacuums for the table or tables specified
Re: Eager page freeze criteria clarification
Hi, On 2023-09-07 22:29:04 -0700, Peter Geoghegan wrote: > On Thu, Sep 7, 2023 at 9:45 PM Andres Freund wrote: > > I.e. setting an, otherwise unmodified, page all-visible won't trigger an FPI > > if checksums are disabled, but will FPI with checksums enabled. I think > > that's > > a substantial difference in WAL volume for insert-only workloads... > > Note that all RDS Postgres users get page-level checksums. Overall, > the FPI trigger mechanism is going to noticeably improve performance > characteristics for many users. Simple and crude though it is. You mean the current heuristic or some new heuristic we're coming up with in this thread? If the former, unfortunately I think the current heuristic often won't trigger in cases where freezing would be fine, e.g. on an insert-mostly (or hot pruned) workload with some read accesses. If the tuples are already hint-bitted, there's no FPI during heap_page_prune(), and thus we don't freeze - even though we *do* subsequently trigger an FPI, while setting all-visible. See e.g. the stats for the modified version of the scenario, where there the table is hint-bitted that I have since posted: https://postgr.es/m/20230908053634.hyn46pugqp4lsiw5%40awork3.anarazel.de There we freeze neither with nor without checksums, despite incurring FPIs when checksums are enabled. 
> > Type                      N      (%)      Record size (%)       FPI size  (%)        Combined size (%)
> > ------------------------  ---------------  -------------------  -------------------  -------------------
> > XLOG/FPI_FOR_HINT         44253  ( 33.34)  2168397   (  7.84)   361094232 (100.00)   363262629 ( 93.44)
> > Transaction/INVALIDATION      1  (  0.00)        78   (  0.00)           0 (  0.00)         78 (  0.00)
> > Standby/INVALIDATIONS         1  (  0.00)        90   (  0.00)           0 (  0.00)         90 (  0.00)
> > Heap2/FREEZE_PAGE         44248  ( 33.33)  22876120   ( 82.72)           0 (  0.00)   22876120 (  5.88)
> > Heap2/VISIBLE             44248  ( 33.33)   2610642   (  9.44)       16384 (  0.00)    2627026 (  0.68)
> > Heap/INPLACE                  1  (  0.00)       188   (  0.00)           0 (  0.00)        188 (  0.00)
> >
> > Total                    132752            27655515   [7.11%]   361110616 [92.89%]   388766131 [100%]
> >
> > In realistic tables, where rows are wider than a single int, FPI_FOR_HINT dominates even further, as the FREEZE_PAGE would be smaller if there weren't 226 tuples on each page...
>
> If FREEZE_PAGE WAL volume is really what holds back further high level improvements in this area, then it can be worked on directly -- it's not a fixed cost. It wouldn't be particularly difficult, in fact.

Agreed!

> These are records that still mostly consist of long runs of contiguous page offset numbers. They're ideally suited for compression using some kind of simple variant of run length encoding. The freeze plan deduplication stuff in 16 made a big difference, but it's still not very hard to improve matters here.

Yea, even just using ranges of offsets should help in a lot of cases.

Greetings,

Andres Freund
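The run-length idea both messages gesture at can be sketched directly: a sorted array of page offset numbers collapses into (start, length) runs, so a freeze record covering offsets 1..226 encodes as a single run. Illustrative only, not the actual WAL record format.

```python
# Encode a sorted list of page offset numbers as (start, length) runs,
# and decode back. Contiguous runs -- the common case for freeze plans
# that cover every tuple on a page -- compress to a single pair.

def encode_runs(offsets):
    runs = []
    for off in offsets:
        if runs and off == runs[-1][0] + runs[-1][1]:
            # Extends the current run by one.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((off, 1))
    return runs

def decode_runs(runs):
    return [start + i for start, length in runs for i in range(length)]

runs = encode_runs([1, 2, 3, 4, 10, 11, 20])
```

A fully frozen page's offset array (1..N) would shrink from N offsets to one (start, length) pair, which is the saving the thread is pointing at.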
Re: Eager page freeze criteria clarification
On Thu, Sep 7, 2023 at 10:46 PM Andres Freund wrote: > You mean the current heuristic or some new heuristic we're coming up with in > this thread? If the former, unfortunately I think the current heuristic often > won't trigger in cases where freezing would be fine, e.g. on an insert-mostly > (or hot pruned) workload with some read accesses. If the tuples are already > hint-bitted, there's no FPI during heap_page_prune(), and thus we don't freeze > - even though we *do* subsequently trigger an FPI, while setting all-visible. It's obviously not adequate, or anything like that. I didn't say that it was. I've heard plenty of complaints about freezing and antiwraparound autovacuuming, and basically no complaints about the cost of FPIs due to checksums (apparently Robert wasn't aware of the problem, at least not as it relates to VACUUM setting the VM). This is true even though one might expect it to be the other way around, based on simple analysis using pg_waldump. There are probably lots of reasons for why that is, but the biggest reason is likely that users just really hate huge balloon payments for things like freezing. It's just about the worst thing about the system. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
Hi,

On 2023-09-07 21:45:22 -0700, Andres Freund wrote:
> In contrast to that, freezing will almost always trigger an FPI (except for empty pages, but we imo ought to stop setting empty pages all frozen [1]).
>
> Yep, a quick experiment confirms that:
>
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo AS SELECT generate_series(1, 10000000);
> CHECKPOINT;
> VACUUM (VERBOSE) foo;
>
> checksums off: WAL usage: 44249 records, 3 full page images, 2632091 bytes
> checksums on:  WAL usage: 132748 records, 44253 full page images, 388758161 bytes
>
> I initially was confused by the 3x wal records - I was expecting 2x. The reason is that with checksums on, we emit an FPI during the visibility check, which then triggers the current heuristic for opportunistic freezing. The saving grace is that WAL volume is completely dominated by the FPIs:
>
> Type                      N      (%)      Record size (%)       FPI size  (%)        Combined size (%)
> ------------------------  ---------------  -------------------  -------------------  -------------------
> XLOG/FPI_FOR_HINT         44253  ( 33.34)  2168397   (  7.84)   361094232 (100.00)   363262629 ( 93.44)
> Transaction/INVALIDATION      1  (  0.00)        78   (  0.00)           0 (  0.00)         78 (  0.00)
> Standby/INVALIDATIONS         1  (  0.00)        90   (  0.00)           0 (  0.00)         90 (  0.00)
> Heap2/FREEZE_PAGE         44248  ( 33.33)  22876120   ( 82.72)           0 (  0.00)   22876120 (  5.88)
> Heap2/VISIBLE             44248  ( 33.33)   2610642   (  9.44)       16384 (  0.00)    2627026 (  0.68)
> Heap/INPLACE                  1  (  0.00)       188   (  0.00)           0 (  0.00)        188 (  0.00)
>
> Total                    132752            27655515   [7.11%]   361110616 [92.89%]   388766131 [100%]
>
> In realistic tables, where rows are wider than a single int, FPI_FOR_HINT dominates even further, as the FREEZE_PAGE would be smaller if there weren't 226 tuples on each page...

The above is not a great demonstration of the overhead of setting all-visible, as the FPIs are triggered via FPI_FOR_HINTs, independent of setting all-visible.
Adding "SELECT count(*) FROM foo" before the checkpoint sets them earlier and results in:

checksums off: WAL usage: 44249 records, 3 full page images, 2627915 bytes

Type                      N      (%)      Record size (%)       FPI size  (%)       Combined size (%)
------------------------  ---------------  -------------------  ------------------  -------------------
Transaction/INVALIDATION      1  (  0.00)        78   (  0.00)       0 (  0.00)          78 (  0.00)
Standby/INVALIDATIONS         1  (  0.00)        90   (  0.00)       0 (  0.00)          90 (  0.00)
Heap2/VISIBLE             44248  ( 99.99)   2610642   ( 99.99)   16384 ( 95.15)     2627026 ( 99.96)
Heap/INPLACE                  1  (  0.00)        53   (  0.00)     836 (  4.85)         889 (  0.03)

Total                     44251             2610863   [99.34%]   17220 [0.66%]      2628083 [100%]

checksums on: WAL usage: 44252 records, 44254 full page images, 363935830 bytes

Type                      N      (%)      Record size (%)       FPI size  (%)       Combined size (%)
------------------------  ---------------  -------------------  ------------------  -------------------
XLOG/FPI_FOR_HINT             3  (  0.01)       147   (  0.01)   24576 (  0.01)       24723 (  0.01)
Transaction/INVALIDATION      1  (  0.00)        78   (  0.00)       0 (  0.00)          78 (  0.00)
Standby/INVALIDATIONS         1  (  0.00)        90   (  0.00)       0 (  0.00)          90 (  0.00)
Heap2/VISIBLE             44248  ( 99.99)   2831882   ( 99.99)
Re: Eager page freeze criteria clarification
On Thu, Sep 7, 2023 at 9:45 PM Andres Freund wrote:
> I.e. setting an, otherwise unmodified, page all-visible won't trigger an FPI if checksums are disabled, but will FPI with checksums enabled. I think that's a substantial difference in WAL volume for insert-only workloads...

Note that all RDS Postgres users get page-level checksums. Overall, the FPI trigger mechanism is going to noticeably improve performance characteristics for many users. Simple and crude though it is.

> Type                      N      (%)      Record size (%)       FPI size  (%)        Combined size (%)
> ------------------------  ---------------  -------------------  -------------------  -------------------
> XLOG/FPI_FOR_HINT         44253  ( 33.34)  2168397   (  7.84)   361094232 (100.00)   363262629 ( 93.44)
> Transaction/INVALIDATION      1  (  0.00)        78   (  0.00)           0 (  0.00)         78 (  0.00)
> Standby/INVALIDATIONS         1  (  0.00)        90   (  0.00)           0 (  0.00)         90 (  0.00)
> Heap2/FREEZE_PAGE         44248  ( 33.33)  22876120   ( 82.72)           0 (  0.00)   22876120 (  5.88)
> Heap2/VISIBLE             44248  ( 33.33)   2610642   (  9.44)       16384 (  0.00)    2627026 (  0.68)
> Heap/INPLACE                  1  (  0.00)       188   (  0.00)           0 (  0.00)        188 (  0.00)
>
> Total                    132752            27655515   [7.11%]   361110616 [92.89%]   388766131 [100%]
>
> In realistic tables, where rows are wider than a single int, FPI_FOR_HINT dominates even further, as the FREEZE_PAGE would be smaller if there weren't 226 tuples on each page...

If FREEZE_PAGE WAL volume is really what holds back further high level improvements in this area, then it can be worked on directly -- it's not a fixed cost. It wouldn't be particularly difficult, in fact. These are records that still mostly consist of long runs of contiguous page offset numbers. They're ideally suited for compression using some kind of simple variant of run length encoding. The freeze plan deduplication stuff in 16 made a big difference, but it's still not very hard to improve matters here.

--
Peter Geoghegan
Re: Eager page freeze criteria clarification
Hi, On 2023-09-06 16:21:31 -0400, Robert Haas wrote: > On Wed, Sep 6, 2023 at 12:20 PM Peter Geoghegan wrote: > > If VACUUM freezes too aggressively, then (pretty much by definition) > > we can be sure that the next VACUUM will scan the same pages -- there > > may be some scope for VACUUM to "learn from its mistake" when we err > > in the direction of over-freezing. But when VACUUM makes the opposite > > mistake (doesn't freeze when it should have), it won't scan those same > > pages again for a long time, by design. It therefore has no plausible > > way of "learning from its mistakes" before it becomes an extremely > > expensive and painful lesson (which happens whenever the next > > aggressive VACUUM takes place). This is in large part a consequence of > > the way that VACUUM dutifully sets pages all-visible whenever > > possible. That behavior interacts badly with many workloads, over > > time. > > I think this is an insightful commentary with which I partially agree. > As I see it, the difference is that when you make the mistake of > marking something all-visible or freezing it too aggressively, you > incur a price that you pay almost immediately. When you make the > mistake of not marking something all-visible when it would have been > best to do so, you incur a price that you pay later, when the next > VACUUM happens. When you make the mistake of not marking something > all-frozen when it would have been best to do so, you incur a price > that you pay even later, not at the next VACUUM but at some VACUUM > further off. So there are different trade-offs. When you pay the price > for a mistake immediately or nearly immediately, it can potentially > harm the performance of the foreground workload, if you're making a > lot of mistakes. We have to make a *lot* of mistakes to badly harm the foreground workload. 
As long as we don't constantly trigger FPIs, the worst case effects of freezing unnecessarily aren't that large, particularly if we keep the number of XLogInsert()s constant (via the combining of records that Melanie is working on). Once/if we want to opportunistically freeze when it *does* trigger an FPI, we *do* need to be more certain that it's not pointless work.

We could further bound the worst case overhead by having a range representation for freeze plans. That should be quite doable, we can add another flag for xl_heap_freeze_plan.frzflags, which indicates the tuples in the offset array are start/end tids, rather than individual tids.

> That sucks. On the other hand, when you defer paying
> the price until some later bulk operation, the costs of all of your
> mistakes get added up and then you pay the whole price all at once,
> which means you can be suddenly slapped with an enormous bill that you
> weren't expecting. That sucks, too, just in a different way.

It's particularly bad in the case of freezing because there's practically no backpressure against deferring more work than the system can handle. If we had made foreground processes freeze one page for every unfrozen page they create, once the table reaches a certain percentage of old unfrozen pages, it'd be a different story...

I think it's important that we prevent exploding the WAL volume due to opportunistically freezing the same page over and over, but as long as we take care not to do that, I think the impact on the foreground is going to be small. I'm sure that we can come up with cases where it's noticeable, e.g. because the system is already completely bottlenecked by WAL IO throughput and a small increase in WAL volume is going to push things over the edge. But such systems are going to be in substantial trouble at the next anti-wraparound vacuum and will be out of commission for days once they hit the anti-wraparound shutdown limits.
Some backpressure in the form of a small performance decrease for foreground work might even be good there.

> > VACUUM simply ignores such second-order effects. Perhaps it would be
> > practical to address some of the issues in this area by avoiding
> > setting pages all-visible without freezing them, in some general
> > sense. That at least creates a kind of symmetry between mistakes in
> > the direction of under-freezing and mistakes in the direction of
> > over-freezing. That might enable VACUUM to course-correct in either
> > direction.
> >
> > Melanie is already planning on combining the WAL records (PRUNE,
> > FREEZE_PAGE, and VISIBLE). Perhaps that'll weaken the argument for
> > setting unfrozen pages all-visible even further.
>
> Yeah, so I think the question here is whether it's ever a good idea to
> mark a page all-visible without also freezing it. If it's not, then we
> should either mark fewer pages all-visible, or freeze more of them.
> Maybe I'm all wet here, but I think it depends on the situation. If a
> page is already dirty and has had an FPI since the last checkpoint,
> then it's pretty appealing to freeze whenever we mark all-visible. We
> still have to consider whether the incremental CPU cost and WAL volume
> are worth it, but assuming those costs are small enough not to be a big
> problem, it seems like a pretty good bet.
Re: Eager page freeze criteria clarification
Hi,

On 2023-09-06 10:46:03 -0400, Robert Haas wrote:
> On Fri, Sep 1, 2023 at 9:07 PM Peter Geoghegan wrote:
> > Why not also avoid setting pages all-visible? The WAL records aren't
> > too much smaller than most freeze records these days -- 64 bytes on
> > most systems. I realize that the rules for FPIs are a bit different
> > when page-level checksums aren't enabled, but fundamentally it's the
> > same situation. No?
>
> It's an interesting point. AFAIK, whether or not page-level checksums
> are enabled doesn't really matter here.

I think it does matter:

void
visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
                  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
                  uint8 flags)
...
                recptr = log_heap_visible(rel, heapBuf, vmBuf, cutoff_xid, flags);

                /*
                 * If data checksums are enabled (or wal_log_hints=on), we
                 * need to protect the heap page from being torn.
                 *
                 * If not, then we must *not* update the heap page's LSN. In
                 * this case, the FPI for the heap page was omitted from the
                 * WAL record inserted above, so it would be incorrect to
                 * update the heap page's LSN.
                 */
                if (XLogHintBitIsNeeded())
                {
                    Page        heapPage = BufferGetPage(heapBuf);

                    PageSetLSN(heapPage, recptr);
                }

and

/*
 * Perform XLogInsert for a heap-visible operation. 'block' is the block
 * being marked all-visible, and vm_buffer is the buffer containing the
 * corresponding visibility map block. Both should have already been modified
 * and dirtied.
 *
 * snapshotConflictHorizon comes from the largest xmin on the page being
 * marked all-visible. REDO routine uses it to generate recovery conflicts.
 *
 * If checksums or wal_log_hints are enabled, we may also generate a full-page
 * image of heap_buffer. Otherwise, we optimize away the FPI (by specifying
 * REGBUF_NO_IMAGE for the heap buffer), in which case the caller should *not*
 * update the heap page's LSN.
 */
XLogRecPtr
log_heap_visible(Relation rel, Buffer heap_buffer, Buffer vm_buffer,
                 TransactionId snapshotConflictHorizon, uint8 vmflags)
{

I.e.
setting an, otherwise unmodified, page all-visible won't trigger an FPI if checksums are disabled, but will FPI with checksums enabled. I think that's a substantial difference in WAL volume for insert-only workloads...

In contrast to that, freezing will almost always trigger an FPI (except for empty pages, but we imo ought to stop setting empty pages all frozen [1]).

Yep, a quick experiment confirms that:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo AS SELECT generate_series(1, 1000);
CHECKPOINT;
VACUUM (VERBOSE) foo;

checksums off:
WAL usage: 44249 records, 3 full page images, 2632091 bytes

checksums on:
WAL usage: 132748 records, 44253 full page images, 388758161 bytes

I initially was confused by the 3x wal records - I was expecting 2x. The reason is that with checksums on, we emit an FPI during the visibility check, which then triggers the current heuristic for opportunistic freezing.

The saving grace is that WAL volume is completely dominated by the FPIs:

Type                           N      (%)  Record size      (%)    FPI size      (%)  Combined size      (%)
----                           -      ---  -----------      ---    --------      ---  -------------      ---
XLOG/FPI_FOR_HINT          44253  ( 33.34)      2168397  (  7.84) 361094232  (100.00)      363262629  ( 93.44)
Transaction/INVALIDATION       1  (  0.00)           78  (  0.00)         0  (  0.00)             78  (  0.00)
Standby/INVALIDATIONS          1  (  0.00)           90  (  0.00)         0  (  0.00)             90  (  0.00)
Heap2/FREEZE_PAGE          44248  ( 33.33)     22876120  ( 82.72)         0  (  0.00)       22876120  (  5.88)
Heap2/VISIBLE              44248  ( 33.33)      2610642  (  9.44)     16384  (  0.00)        2627026  (  0.68)
Heap/INPLACE                   1  (  0.00)          188  (  0.00)         0  (  0.00)            188  (  0.00)

Total                     132752              27655515  [7.11%]  361110616  [92.89%]      388766131  [100%]

In realistic tables, where rows are wider than a single int, FPI_FOR_HINT dominates even further, as the FREEZE_PAGE would be smaller if there weren't 226 tuples on each page.
Re: Eager page freeze criteria clarification
Hi,

On 2023-09-06 10:35:17 -0400, Robert Haas wrote:
> On Wed, Sep 6, 2023 at 1:09 AM Andres Freund wrote:
> > Yea, it'd be useful to have a reasonably approximate wall clock time for the
> > last modification of a page. We just don't have infrastructure for determining
> > that. We'd need an LSN->time mapping (xid->time wouldn't be particularly
> > useful, I think).
> >
> > A very rough approximate modification time can be computed by assuming an even
> > rate of WAL generation, and using the LSN at the time of the last vacuum and
> > the time of the last vacuum, to compute the approximate age.
> >
> > For a while I thought that'd not give us anything that just using LSNs gives
> > us, but I think it might allow coming up with a better cutoff logic: Instead
> > of using a cutoff like "page LSN is older than 10% of the LSNs since the last
> > vacuum of the table", it would allow us to approximate "page has not been
> > modified in the last 15 seconds" or such. I think that might help avoid
> > unnecessary freezing on tables with very frequent vacuuming.
>
> Yes. I'm uncomfortable with the last-vacuum-LSN approach mostly
> because of the impact on very frequently vacuumed tables, and
> secondarily because of the impact on very infrequently vacuumed
> tables.
>
> Downthread, I proposed using the RedoRecPtr of the latest checkpoint
> rather than the LSN of the previous vacuum. I still like that idea.

Assuming that "downthread" references
https://postgr.es/m/CA%2BTgmoYb670VcDFbekjn2YQOKF9a7e-kBFoj2WJF1HtH7YPaWQ%40mail.gmail.com
could you sketch out the logic you're imagining a bit more?

> It's a value that we already have, with no additional bookkeeping. It
> varies over a much narrower range than the interval between vacuums on
> a table.
> The vacuum interval could be as short as tens of seconds or as
> long as years, while the checkpoint interval is almost always going to
> be between a few minutes at the low end and some tens of minutes at
> the high end, hours at the very most. That's quite appealing.

The reason I was thinking of using the "lsn at the end of the last vacuum" is that it seems to be more adaptive to the frequency of vacuuming. On the one hand, if a table is rarely autovacuumed because it is huge, (InsertLSN-RedoRecPtr) might or might not be representative of the workload over a longer time. On the other hand, if a page in a frequently vacuumed table has an LSN from around the last vacuum (or even before), it should be frozen, but will appear to be recent in RedoRecPtr based heuristics?

Perhaps we can mix both approaches. We can use the LSN and time of the last vacuum to establish an LSN->time mapping that's reasonably accurate for a relation. For infrequently vacuumed tables we can use the time between checkpoints to establish a *more aggressive* cutoff for freezing than what a percent-of-time-since-last-vacuum approach would provide. If e.g. a table gets vacuumed every 100 hours and checkpoint timeout is 1 hour, no realistic percent-of-time-since-last-vacuum setting will allow freezing, as all dirty pages will be too new. To allow freezing a decent proportion of those, we could allow freezing pages that lived longer than ~20% time-between-recent-checkpoints.

Hm, possibly stupid idea: What about using shared_buffers residency as a factor? If vacuum had to read in a page to vacuum it, a) we would need read IO to freeze it later, as we'll soon evict the page via the ringbuffer b) non-residency indicates the page isn't constantly being modified?
> Also, I think the time between checkpoints actually matters here, because in
> some sense we're looking to get dirty, already-FPI'd pages frozen before
> they get written out, or before a new FPI becomes necessary, and checkpoints
> are one way for the first of those things to happen and the only way for the
> second one to happen.

Intuitively it seems easier to take care of that by checking if an FPI is needed or not?

I guess we, eventually, might want to also freeze if an FPI would be generated, if we are reasonably certain that the page isn't going to be modified again soon (e.g. when table stats indicate there effectively aren't any updates / deletes). Although perhaps it's better to take care of that via freeze-on-writeout style logic. I suspect that even for freeze-on-writeout we might end up needing some heuristics.

Greetings,

Andres Freund
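The rough LSN->time mapping described above can be sketched in a few lines: assume WAL was generated at an even rate between the last vacuum and now, and extrapolate a page's last-modification time from its LSN. This is an illustrative Python sketch of the heuristic, not patch code; all names (and the use of raw byte positions for LSNs) are assumptions for the example:

```python
def estimate_page_age_seconds(page_lsn, vacuum_lsn, vacuum_time,
                              current_lsn, current_time):
    """Approximate seconds since a page was last modified, assuming WAL was
    generated at a constant rate between the last vacuum and now.
    LSNs are plain byte positions; times are seconds."""
    wal_bytes = current_lsn - vacuum_lsn
    elapsed = current_time - vacuum_time
    if wal_bytes <= 0 or elapsed <= 0:
        return 0.0  # no WAL or no time elapsed: nothing to extrapolate from
    rate = wal_bytes / elapsed  # bytes of WAL per second
    return (current_lsn - page_lsn) / rate

# If 1 GB of WAL was written over the last 100 seconds, a page whose LSN
# trails the insert LSN by 150 MB was modified roughly 15 seconds ago --
# so a cutoff like "frozen only if unmodified for >= 15 seconds" would
# just barely include it.
age = estimate_page_age_seconds(
    page_lsn=1_000_000_000 - 150_000_000,
    vacuum_lsn=0, vacuum_time=0.0,
    current_lsn=1_000_000_000, current_time=100.0)
print(age)  # 15.0
```

The appeal over a pure "percent of LSNs since the last vacuum" cutoff is that the threshold is expressed in wall-clock terms, so a table vacuumed every 30 seconds and one vacuumed every 100 hours can share the same "unmodified for N seconds" policy.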
Re: Eager page freeze criteria clarification
On Wed, Sep 6, 2023 at 12:20 PM Peter Geoghegan wrote:
> This was a case where index vacuuming was never required. It's just a
> simple and easy to recreate example of what I think of as a more
> general problem.

OK.

> Why wouldn't we expect a table to have some pages that ought to be
> frozen right away, and others where freezing should in theory be put
> off indefinitely? I think that that's very common.

Oh, I see. I agree that that's pretty common. I think what that means in practice is that we need to avoid relying too much on relation-level statistics to guide behavior with respect to individual relation pages. On the other hand, I don't think it means that we can't look at relation-wide or even system-wide statistics at all. Sure, those statistics may not be perfect, but some things are not practical to track on a page granularity, and having some coarse-grained information can, I think, be better than having nothing at all, if you're careful about how much and in what way you rely on it.

> As you know, I am particularly concerned about the tendency of
> unfrozen all-visible pages to accumulate without bound (at least
> without bound expressed in physical units such as pages). The very
> fact that pages are being set all-visible by VACUUM can be seen as a
> part of a high-level systemic problem -- a problem that plays out over
> time, across multiple VACUUM operations. So even if the cost of
> setting pages all-visible happened to be much lower than the cost of
> freezing (which it isn't), setting pages all-visible without freezing
> has unique downsides.

I generally agree with all of that.

> If VACUUM freezes too aggressively, then (pretty much by definition)
> we can be sure that the next VACUUM will scan the same pages -- there
> may be some scope for VACUUM to "learn from its mistake" when we err
> in the direction of over-freezing.
But when VACUUM makes the opposite > mistake (doesn't freeze when it should have), it won't scan those same > pages again for a long time, by design. It therefore has no plausible > way of "learning from its mistakes" before it becomes an extremely > expensive and painful lesson (which happens whenever the next > aggressive VACUUM takes place). This is in large part a consequence of > the way that VACUUM dutifully sets pages all-visible whenever > possible. That behavior interacts badly with many workloads, over > time. I think this is an insightful commentary with which I partially agree. As I see it, the difference is that when you make the mistake of marking something all-visible or freezing it too aggressively, you incur a price that you pay almost immediately. When you make the mistake of not marking something all-visible when it would have been best to do so, you incur a price that you pay later, when the next VACUUM happens. When you make the mistake of not marking something all-frozen when it would have been best to do so, you incur a price that you pay even later, not at the next VACUUM but at some VACUUM further off. So there are different trade-offs. When you pay the price for a mistake immediately or nearly immediately, it can potentially harm the performance of the foreground workload, if you're making a lot of mistakes. That sucks. On the other hand, when you defer paying the price until some later bulk operation, the costs of all of your mistakes get added up and then you pay the whole price all at once, which means you can be suddenly slapped with an enormous bill that you weren't expecting. That sucks, too, just in a different way. > VACUUM simply ignores such second-order effects. Perhaps it would be > practical to address some of the issues in this area by avoiding > setting pages all-visible without freezing them, in some general > sense. 
That at least creates a kind of symmetry between mistakes in > the direction of under-freezing and mistakes in the direction of > over-freezing. That might enable VACUUM to course-correct in either > direction. > > Melanie is already planning on combining the WAL records (PRUNE, > FREEZE_PAGE, and VISIBLE). Perhaps that'll weaken the argument for > setting unfrozen pages all-visible even further. Yeah, so I think the question here is whether it's ever a good idea to mark a page all-visible without also freezing it. If it's not, then we should either mark fewer pages all-visible, or freeze more of them. Maybe I'm all wet here, but I think it depends on the situation. If a page is already dirty and has had an FPI since the last checkpoint, then it's pretty appealing to freeze whenever we mark all-visible. We still have to consider whether the incremental CPU cost and WAL volume are worth it, but assuming those costs are small enough not to be a big problem, it seems like a pretty good bet. Making a page un-all-visible has some cost, but making a page un-all-frozen really doesn't, so cool. On the other hand, if we have a page that isn't dirty, hasn't had a recent FPI, and doesn't need pruning, but which can
Re: Eager page freeze criteria clarification
On Wed, Sep 6, 2023 at 7:46 AM Robert Haas wrote: > It's interesting that when you raised the threshold the rate of > vacuuming dropped to zero. I haven't seen that behavior. pgbench does > rely very, very heavily on HOT-pruning, so VACUUM only cleans up a > small percentage of the dead tuples that get created. However, it's > still needed for index vacuuming, and I'm not sure what would cause it > not to trigger for that purpose. This was a case where index vacuuming was never required. It's just a simple and easy to recreate example of what I think of as a more general problem. > > > And I'm not sure why we > > > want to do that. If the table is being vacuumed a lot, it's probably > > > also being modified a lot, which suggests that we ought to be more > > > cautious about freezing, rather than the reverse. > > > > Why wouldn't it be both things at the same time, for the same table? > > Both of what things? Why wouldn't we expect a table to have some pages that ought to be frozen right away, and others where freezing should in theory be put off indefinitely? I think that that's very common. > > Why not also avoid setting pages all-visible? The WAL records aren't > > too much smaller than most freeze records these days -- 64 bytes on > > most systems. I realize that the rules for FPIs are a bit different > > when page-level checksums aren't enabled, but fundamentally it's the > > same situation. No? > > It's an interesting point. AFAIK, whether or not page-level checksums > are enabled doesn't really matter here. But it seems fair to ask - if > freezing is too aggressive, why is setting all-visible not also too > aggressive? I don't have a good answer to that question right now. As you know, I am particularly concerned about the tendency of unfrozen all-visible pages to accumulate without bound (at least without bound expressed in physical units such as pages). 
The very fact that pages are being set all-visible by VACUUM can be seen as a part of a high-level systemic problem -- a problem that plays out over time, across multiple VACUUM operations. So even if the cost of setting pages all-visible happened to be much lower than the cost of freezing (which it isn't), setting pages all-visible without freezing has unique downsides. If VACUUM freezes too aggressively, then (pretty much by definition) we can be sure that the next VACUUM will scan the same pages -- there may be some scope for VACUUM to "learn from its mistake" when we err in the direction of over-freezing. But when VACUUM makes the opposite mistake (doesn't freeze when it should have), it won't scan those same pages again for a long time, by design. It therefore has no plausible way of "learning from its mistakes" before it becomes an extremely expensive and painful lesson (which happens whenever the next aggressive VACUUM takes place). This is in large part a consequence of the way that VACUUM dutifully sets pages all-visible whenever possible. That behavior interacts badly with many workloads, over time. VACUUM simply ignores such second-order effects. Perhaps it would be practical to address some of the issues in this area by avoiding setting pages all-visible without freezing them, in some general sense. That at least creates a kind of symmetry between mistakes in the direction of under-freezing and mistakes in the direction of over-freezing. That might enable VACUUM to course-correct in either direction. Melanie is already planning on combining the WAL records (PRUNE, FREEZE_PAGE, and VISIBLE). Perhaps that'll weaken the argument for setting unfrozen pages all-visible even further. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Fri, Sep 1, 2023 at 9:07 PM Peter Geoghegan wrote: > When I reported this a couple of years ago, I noticed that autovacuum > would spin whenever I set autovacuum_vacuum_scale_factor to 0.02. But > autovacuum would *never* run (outside of antiwraparound autovacuums) > when it was set just a little higher (perhaps 0.03 or 0.04). So there > was some inflection point at which its behavior totally changed. I think that tables have a "natural" amount of bloat that is a function of the workload. If you remove all the bloat by, say, running VACUUM FULL, they quickly bloat again to that point, and then stabilize (hopefully). It seems like if you set the scale factor to a value below that "natural" amount of bloat, you might well spin, especially on a very write-heavy workload like pgbench, because you're basically trying to squeeze water from a stone on that point. It's interesting that when you raised the threshold the rate of vacuuming dropped to zero. I haven't seen that behavior. pgbench does rely very, very heavily on HOT-pruning, so VACUUM only cleans up a small percentage of the dead tuples that get created. However, it's still needed for index vacuuming, and I'm not sure what would cause it not to trigger for that purpose. > > And I'm not sure why we > > want to do that. If the table is being vacuumed a lot, it's probably > > also being modified a lot, which suggests that we ought to be more > > cautious about freezing, rather than the reverse. > > Why wouldn't it be both things at the same time, for the same table? Both of what things? > Why not also avoid setting pages all-visible? The WAL records aren't > too much smaller than most freeze records these days -- 64 bytes on > most systems. I realize that the rules for FPIs are a bit different > when page-level checksums aren't enabled, but fundamentally it's the > same situation. No? It's an interesting point. AFAIK, whether or not page-level checksums are enabled doesn't really matter here. 
But it seems fair to ask - if freezing is too aggressive, why is setting all-visible not also too aggressive? I don't have a good answer to that question right now. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
On Wed, Sep 6, 2023 at 1:09 AM Andres Freund wrote:
> Yea, it'd be useful to have a reasonably approximate wall clock time for the
> last modification of a page. We just don't have infrastructure for determining
> that. We'd need an LSN->time mapping (xid->time wouldn't be particularly
> useful, I think).
>
> A very rough approximate modification time can be computed by assuming an even
> rate of WAL generation, and using the LSN at the time of the last vacuum and
> the time of the last vacuum, to compute the approximate age.
>
> For a while I thought that'd not give us anything that just using LSNs gives
> us, but I think it might allow coming up with a better cutoff logic: Instead
> of using a cutoff like "page LSN is older than 10% of the LSNs since the last
> vacuum of the table", it would allow us to approximate "page has not been
> modified in the last 15 seconds" or such. I think that might help avoid
> unnecessary freezing on tables with very frequent vacuuming.

Yes. I'm uncomfortable with the last-vacuum-LSN approach mostly because of the impact on very frequently vacuumed tables, and secondarily because of the impact on very infrequently vacuumed tables.

Downthread, I proposed using the RedoRecPtr of the latest checkpoint rather than the LSN of the previous vacuum. I still like that idea. It's a value that we already have, with no additional bookkeeping. It varies over a much narrower range than the interval between vacuums on a table. The vacuum interval could be as short as tens of seconds or as long as years, while the checkpoint interval is almost always going to be between a few minutes at the low end and some tens of minutes at the high end, hours at the very most. That's quite appealing.
Also, I think the time between checkpoints actually matters here, because in some sense we're looking to get dirty, already-FPI'd pages frozen before they get written out, or before a new FPI becomes necessary, and checkpoints are one way for the first of those things to happen and the only way for the second one to happen. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
Hi, On 2023-08-28 12:26:01 -0400, Robert Haas wrote: > On Mon, Aug 28, 2023 at 10:00 AM Melanie Plageman > wrote: > > For the second goal, I've relied on past data to predict future > > behavior, so I tried several criteria to estimate the likelihood that a > > page will not be imminently modified. What was most effective was > > Andres' suggestion of comparing the page LSN to the insert LSN at the > > end of the last vacuum of that table; this approximates whether the page > > has been recently modified, which is a decent proxy for whether it'll be > > modified in the future. To do this, we need to save that insert LSN > > somewhere. In the attached WIP patch, I saved it in the table stats, for > > now -- knowing that those are not crash-safe. > > I wonder what the real plan here is for where to store this. It's not > obvious that we need this to be crash-safe; it's after all only for > use by a heuristic, and there's no actual breakage if the heuristic > goes wrong. At the same time, it doesn't exactly feel like a > statistic. I'm not certain either. This is generally something that's not satisfying right now - although IMO not necessarily for the reason you mention. Given that we already store, e.g., the time of the last autovacuum in the stats, I don't see a problem also storing a corresponding LSN. My issue is more that this kind of information not being crashsafe is really problematic - it's a well known issue that autovacuum just doesn't do anything for a while after a crash-restart (or pitr restore or ...), for example. Given that all the other datapoints are stored in the stats, I think just storing the LSNs alongside is reasonable. > Then there's the question of whether it's the right metric. My first > reaction is to think that it sounds pretty good. One thing I really > like about it is that if the table is being vacuumed frequently, then > we freeze less aggressively, and if the table is being vacuumed > infrequently, then we freeze more aggressively. 
That seems like a very > desirable property. It also seems broadly good that this metric > doesn't really care about reads. If there are a lot of reads on the > system, or no reads at all, it doesn't really change the chances that > a certain page is going to be written again soon, and since reads > don't change the insert LSN, here again it seems to do the right > thing. I'm a little less clear about whether it's good that it doesn't > really depend on wall-clock time. Yea, it'd be useful to have a reasonably approximate wall clock time for the last modification of a page. We just don't have infrastructure for determining that. We'd need an LSN->time mapping (xid->time wouldn't be particularly useful, I think). A very rough approximate modification time can be computed by assuming an even rate of WAL generation, and using the LSN at the time of the last vacuum and the time of the last vacuum, to compute the approximate age. For a while I thought that'd not give us anything that just using LSNs gives us, but I think it might allow coming up with a better cutoff logic: Instead of using a cutoff like "page LSN is older than 10% of the LSNs since the last vacuum of the table", it would allow us to approximate "page has not been modified in the last 15 seconds" or such. I think that might help avoid unnecessary freezing on tables with very frequent vacuuming. > Certainly, that's desirable from the point of view of not wanting to have to > measure wall-clock time in places where we otherwise wouldn't have to, which > tends to end up being expensive. IMO the bigger issue is that we don't want to store a timestamp on each page. > >Page Freezes/Page Frozen (less is better) As, I think, Robert mentioned downthread, I'm not sure this is a useful thing to judge the different heuristics by. If the number of pages frozen is small, the ratio quickly can be very large, without the freezing having a negative effect. 
I suspect interesting top-level figures to compare would be:

1) WAL volume (to judge the amount of unnecessary FPIs)

2) data reads + writes (to see the effect of repeated vacuuming of the same blocks)

3) number of vacuums and/or time spent vacuuming (freezing less aggressively might increase the number of vacuums due to anti-wrap vacuums; at the same time, freezing too aggressively could lead to vacuums taking too long)

4) throughput of the workload (to see potential regressions due to vacuuming overhead)

5) for transactional workloads: p99 latency (to see if vacuuming increases commit latency and such; just using the average tends to hide too much)

Greetings,

Andres Freund
Re: Eager page freeze criteria clarification
On Fri, Sep 1, 2023 at 12:34 PM Robert Haas wrote: > If the table > is being vacuumed frequently, then the last-vacuum-LSN will be newer, > which means we'll freeze *more* aggressively. Sort of like how if you set autovacuum_vacuum_scale_factor low enough, with standard pgbench, with heap fill factor tuned, autovacuum will never think that the table doesn't need to be vacuumed. It will continually see enough dead heap-only tuples to get another autovacuum each time. Though there won't be any LP_DEAD items at any point -- regardless of when VACUUM actually runs. When I reported this a couple of years ago, I noticed that autovacuum would spin whenever I set autovacuum_vacuum_scale_factor to 0.02. But autovacuum would *never* run (outside of antiwraparound autovacuums) when it was set just a little higher (perhaps 0.03 or 0.04). So there was some inflection point at which its behavior totally changed. > And I'm not sure why we > want to do that. If the table is being vacuumed a lot, it's probably > also being modified a lot, which suggests that we ought to be more > cautious about freezing, rather than the reverse. Why wouldn't it be both things at the same time, for the same table? Why not also avoid setting pages all-visible? The WAL records aren't too much smaller than most freeze records these days -- 64 bytes on most systems. I realize that the rules for FPIs are a bit different when page-level checksums aren't enabled, but fundamentally it's the same situation. No? -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Tue, Aug 29, 2023 at 10:22 AM Robert Haas wrote: > Let's assume for a moment that the rate at which the insert LSN is > advancing is roughly constant over time, so that it serves as a good > proxy for wall-clock time. Consider four tables A, B, C, and D that > are, respectively, vacuumed once per minute, once per hour, once per > day, and once per week. With a 33% threshold, pages in table A will be > frozen if they haven't been modified in 20 seconds, page in table B > will be frozen if they haven't been modified in 20 minutes, pages in > table C will be frozen if they haven't been modified in 8 hours, and > pages in table D will be frozen if they haven't been modified in 2 > days, 8 hours. My intuition is that this feels awfully aggressive for > A and awfully passive for D. > > [ discussion of freeze-on-evict ] Another way of thinking about this is: instead of replacing this heuristic with a complicated freeze-on-evict system, maybe we just need to adjust the heuristic. Maybe using an LSN is a good idea, but maybe the LSN of the last table vacuum isn't the right one to be using, or maybe it shouldn't be the only one that we use. For instance, what about using the redo LSN of the last checkpoint, or the checkpoint before the last checkpoint, or something like that? Somebody might find that unprincipled, but it seems to me that the checkpoint cycle has a lot to do with whether or not opportunistic freezing makes sense. If a page is likely to be modified again before a checkpoint forces it to be written, then freezing it is likely a waste. If it's likely to be written out of shared_buffers before it's modified again, then freezing it now is a pretty good bet. A given page could be evicted from shared_buffers, and thus written, sooner than the next checkpoint, but if it's dirty now, it definitely won't be written any later than the next checkpoint. 
By looking at the LSN of a page that I'm about to modify just before I modify it, I can make some kind of a guess as to whether this is a page that is being modified more or less than once per checkpoint cycle and adjust freezing behavior accordingly. One could also use a hybrid of the two values e.g. normally use the insert LSN of the last VACUUM, but if that's newer than the redo LSN of the last checkpoint, then use the latter instead, to avoid doing too much freezing of pages that may have been quiescent for only a few tens of seconds. I don't know if that's better or worse. As I think about it, I realize that I don't really know why Andres suggested a last-vacuum-LSN-based heuristic in the first place. Before, I wrote of this that "One thing I really like about it is that if the table is being vacuumed frequently, then we freeze less aggressively, and if the table is being vacuumed infrequently, then we freeze more aggressively." But actually, I don't think it does that. If the table is being vacuumed frequently, then the last-vacuum-LSN will be newer, which means we'll freeze *more* aggressively. And I'm not sure why we want to do that. If the table is being vacuumed a lot, it's probably also being modified a lot, which suggests that we ought to be more cautious about freezing, rather than the reverse. Just spitballing here. -- Robert Haas EDB: http://www.enterprisedb.com
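The hybrid Robert describes — normally the insert LSN of the last VACUUM, falling back to the checkpoint redo LSN when the former is newer — amounts to taking the older of the two candidates as the freeze cutoff. A toy sketch with integer LSNs and invented names, not the patch's actual code:

```python
def freeze_cutoff_lsn(last_vacuum_insert_lsn, checkpoint_redo_lsn):
    """Hybrid cutoff: use the insert LSN of the last VACUUM, but if
    that is newer than the redo LSN of the last checkpoint, use the
    redo LSN instead -- i.e. take the older of the two."""
    return min(last_vacuum_insert_lsn, checkpoint_redo_lsn)

def should_freeze(page_lsn, last_vacuum_insert_lsn, checkpoint_redo_lsn):
    # Freeze only pages that have not been modified since the cutoff.
    return page_lsn < freeze_cutoff_lsn(last_vacuum_insert_lsn,
                                        checkpoint_redo_lsn)
```

The `min()` captures the stated intent: a frequently vacuumed table can't push the cutoff past the last checkpoint, so pages quiescent for only a few tens of seconds aren't swept up.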
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 4:30 PM Melanie Plageman wrote: > By low-velocity, do you mean lower overall TPS? In that case, wouldn't you be > less likely to run into xid wraparound and thus need less aggressive > opportunistic freezing? Yes. But it also means that we've got slack capacity to do extra work now without really hurting anything. If we can leverage that capacity to reduce the pain of future bulk operations, that seems good. When resources are tight, doing speculative work immediately becomes less appealing. It's pretty hard to take such things into account, though. I was just mentioning it. > So, this is where the caveat about absolute number of page freezes > matters. In algorithm A, master only did 57 page freezes (spread across > the various pgbench tables). At the end of the run, 2 pages were still > frozen. I'm increasingly feeling that it's hard to make sense of the ratio. Maybe show the number of pages frozen, the number that are frozen at the end, and the number of pages in the database at the end of the run as three separate values. > (1) seems bad to me because it doesn't consider whether or not freezing > will be useful -- only if it will be cheap. It froze very little of the > cold data in a workload where a small percentage of it was being > modified (especially workloads A, C, H). And it froze a lot of data in > workloads where it was being uniformly modified (workload B). Sure, but if the cost of being wrong is low, you can be wrong a lot and still be pretty happy. It's unclear to me to what extent we should gamble on making only inexpensive mistakes and to what extent we should gamble on making only infrequent mistakes, but they're both valid strategies. > I suggested (4) and (5) because I think the "older than 33%" threshold > is better than the "older than 10%" threshold. I chose both because I am > still unclear on our values. Are we willing to freeze more aggressively > at the expense of emitting more FPIs? 
As long as it doesn't affect > throughput? For pretty much all of these workloads, the algorithms which > froze based on page modification recency OR FPI required emitted many > more FPIs than those which froze based only on page modification > recency. Let's assume for a moment that the rate at which the insert LSN is advancing is roughly constant over time, so that it serves as a good proxy for wall-clock time. Consider four tables A, B, C, and D that are, respectively, vacuumed once per minute, once per hour, once per day, and once per week. With a 33% threshold, pages in table A will be frozen if they haven't been modified in 20 seconds, pages in table B will be frozen if they haven't been modified in 20 minutes, pages in table C will be frozen if they haven't been modified in 8 hours, and pages in table D will be frozen if they haven't been modified in 2 days, 8 hours. My intuition is that this feels awfully aggressive for A and awfully passive for D. To expand on that: apparently D doesn't actually get much write activity, else there would be more vacuuming happening. So it's very likely that pages in table D are going to get checkpointed and evicted before they get modified again. Freezing them therefore seems like a good bet: it's a lot cheaper to freeze those pages when they're already in shared_buffers and dirty than it is if we have to read and write them specifically for freezing. It's less obvious that what we're doing in table A is wrong, and it could be exactly right, but workloads where a row is modified, a human thinks about something (e.g. whether to complete the proposed purchase), and then the same row is modified again are not uncommon, and human thinking times can easily exceed 20 seconds. On the other hand, workloads where a row is modified, a computer thinks about something, and then the row is modified again are also quite common, and computer thinking times can easily be less than 20 seconds. It feels like a toss-up whether we get it right.
For this kind of table, I suspect we'd be happier freezing pages that are about to be evicted or about to be written as part of a checkpoint rather than freezing pages opportunistically in vacuum. Maybe that's something we need to think harder about. If we froze dirty pages that wouldn't need a new FPI just before evicting them, and just before they would be written out for a checkpoint, under what circumstances would we still want vacuum to opportunistically freeze? I think the answer might be "only when it's very cheap." If it's very cheap to freeze now, it's appealing to gamble on doing it before a checkpoint comes along and makes the same operation require an extra FPI. But if it isn't too cheap, then why not just wait and see what happens? If the buffer gets modified again before it gets written out, then freezing immediately is a waste and freeze-on-evict is better. If it doesn't, freezing immediately and freeze-on-evict are the same price. I'm hand-waving a bit here because maybe freeze-on-evict is
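The wall-clock figures Robert quotes for tables A through D follow directly from the "older than 33%" rule under the constant-LSN-rate assumption — the threshold in wall-clock terms is one third of the vacuum interval (using exactly 1/3, which is what the quoted 20 s / 20 min / 8 h / 2 days 8 h figures correspond to). A quick arithmetic check:

```python
def freeze_age_threshold_s(vacuum_interval_s, fraction=1 / 3):
    """With a constant insert-LSN advance rate, an 'older than 33%'
    rule freezes pages last modified more than fraction * vacuum
    interval ago, in wall-clock terms."""
    return vacuum_interval_s * fraction

# Vacuum intervals: A once per minute, B per hour, C per day, D per week.
intervals = {"A": 60, "B": 3600, "C": 86400, "D": 7 * 86400}
for table, interval in intervals.items():
    print(table, freeze_age_threshold_s(interval))
```

This reproduces 20 s for A, 20 min (1200 s) for B, 8 h (28800 s) for C, and 2 days 8 h (201600 s) for D.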
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 4:15 PM Melanie Plageman wrote: > On Mon, Aug 28, 2023 at 5:06 PM Peter Geoghegan wrote: > > What I've described in a scheme for deciding to not freeze where that > > would usually happen -- a scheme for *vetoing* page freezing -- rather > > than a scheme for deciding to freeze. On its own, what I suggested > > cannot help at all. It assumes a world in which we're already deciding > > to freeze much more frequently, based on whatever other criteria. It's > > intended to complement something like the LSN scheme. > > I like the direction of this idea. It could avoid repeatedly freezing > a page that is being modified over and over. I tried it out with a > short-running version of workload I (approximating a work queue) -- > and it prevents unnecessary freezing. I see a lot of value in it as an enabler of freezing at the earliest opportunity, which is usually the way to go. > The problem with this criterion is that if you freeze a page and then > ever modify it again -- even once, vacuum will not be allowed to > freeze it again until it is forced to. Right. Like you, I was thinking of something that dampened VACUUM's newfound enthusiasm for freezing, in whatever way -- not a discrete rule. A mechanism with a sense of proportion about what it meant for the page to look a certain way. The strength of the signal would perhaps be highest (i.e. most discouraging of further freezing) on a page that has only a few relatively recent unfrozen tuples. XID age could actually be useful here! > Perhaps we could use a very > strict LSN cutoff for pages which contain any frozen tuples -- for > example: only freeze a page containing a mix of frozen and unfrozen > tuples if it has not been modified since before the last vacuum of the > relation ended. You might also need to account for things like interactions with the FSM. 
If the new unfrozen tuple packs the page to the brim, and the new row is from an insert, then maybe the mechanism shouldn't dampen our enthusiasm for freezing the page at all. Similarly, on a page that has no unfrozen tuples at all just yet, it would make sense for the algorithm to be most enthusiastic about freezing those pages that can fit no more tuples. Perhaps we'd be willing to accept the cost of an extra FPI with such a page, for example. It will take time to validate these sorts of ideas thoroughly, of course. I agree with what you said upthread about it also being a question of values and priorities. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 5:06 PM Peter Geoghegan wrote: > > On Mon, Aug 28, 2023 at 1:17 PM Robert Haas wrote: > > I'm sure this could be implemented, but it's unclear to me why you > > would expect it to perform well. Freezing a page that has no frozen > > tuples yet isn't cheaper than freezing one that does, so for this idea > > to be a win, the presence of frozen tuples on the page would have to > > be a signal that the page is likely to be modified again in the near > > future. In general, I don't see any reason why we should expect that > > to be the case. > > What I've described in a scheme for deciding to not freeze where that > would usually happen -- a scheme for *vetoing* page freezing -- rather > than a scheme for deciding to freeze. On its own, what I suggested > cannot help at all. It assumes a world in which we're already deciding > to freeze much more frequently, based on whatever other criteria. It's > intended to complement something like the LSN scheme. I like the direction of this idea. It could avoid repeatedly freezing a page that is being modified over and over. I tried it out with a short-running version of workload I (approximating a work queue) -- and it prevents unnecessary freezing. The problem with this criterion is that if you freeze a page and then ever modify it again -- even once, vacuum will not be allowed to freeze it again until it is forced to. Perhaps we could use a very strict LSN cutoff for pages which contain any frozen tuples -- for example: only freeze a page containing a mix of frozen and unfrozen tuples if it has not been modified since before the last vacuum of the relation ended. - Melanie
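Melanie's strict-cutoff idea could be sketched as a veto predicate: a page already containing frozen tuples is only re-frozen if it has not been modified since before the previous vacuum of the relation ended. The names below are invented for illustration; the real decision would sit alongside vacuum's other freeze criteria:

```python
def may_freeze(page_lsn, prev_vacuum_end_lsn, n_frozen_tuples):
    """Veto rule: pages with no frozen tuples are always candidates
    (subject to the usual criteria); pages that already contain frozen
    tuples are re-frozen only if unmodified since before the previous
    vacuum of the relation ended."""
    if n_frozen_tuples == 0:
        return True
    return page_lsn < prev_vacuum_end_lsn
```

This directly addresses the failure mode described above: a page frozen once and then modified even a single time no longer passes an absolute veto, whereas this version merely demands a longer quiescent period before freezing it again.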
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 1:17 PM Robert Haas wrote: > I'm sure this could be implemented, but it's unclear to me why you > would expect it to perform well. Freezing a page that has no frozen > tuples yet isn't cheaper than freezing one that does, so for this idea > to be a win, the presence of frozen tuples on the page would have to > be a signal that the page is likely to be modified again in the near > future. In general, I don't see any reason why we should expect that > to be the case. What I've described is a scheme for deciding to not freeze where that would usually happen -- a scheme for *vetoing* page freezing -- rather than a scheme for deciding to freeze. On its own, what I suggested cannot help at all. It assumes a world in which we're already deciding to freeze much more frequently, based on whatever other criteria. It's intended to complement something like the LSN scheme. > What really matters here is finding a criterion that is likely to > perform well in general, on a test case not known to us beforehand. > This isn't an entirely feasible goal, because just as you can > construct a test case where any given criterion performs well, so you > can also construct one where any given criterion performs poorly. But > I think a rule that has a clear theory of operation must be preferable > to one that doesn't. The theory that Melanie and Andres are advancing > is that a page that has been modified recently (in insert-LSN-time) is > more likely to be modified again soon than one that has not i.e. the > near future will be like the recent past. I don't think that it's all that useful on its own. You just cannot ignore the fact that the choice to not freeze now doesn't necessarily mean that you get to re-review that choice in the near future. Particularly with large tables, the opportunities to freeze at all are few and far between -- if for no other reason than the general design of autovacuum scheduling.
Worse still, any unfrozen all-visible pages can just accumulate as all-visible pages, until the next aggressive VACUUM happens whenever. How can that not be extremely important? That isn't an argument against a scheme that uses LSNs (many kinds of information might be weighed) -- it's an argument in favor of paying attention to the high level cadence of VACUUM. That much seems essential. I think that there might well be room for having several complementary schemes like the LSN scheme. Or one big scheme that weighs multiple factors together, if you prefer. That all seems basically reasonable to me. Adaptive behavior is important with something as complicated as this. Adaptive schemes all seem to involve trial and error. The cost of freezing too much is relatively well understood, and can be managed sensibly. So we should err in that direction -- a direction that is relatively easy to understand, to notice, and to pull back from having gone too far. Putting off freezing for a very long time is a source of much of the seemingly intractable complexity in this area. Another way of addressing that is getting rid of aggressive VACUUM as a concept. But I'm not going to revisit that topic now, or likely ever. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 12:26 PM Robert Haas wrote: > > On Mon, Aug 28, 2023 at 10:00 AM Melanie Plageman > wrote: > Then there's the question of whether it's the right metric. My first > reaction is to think that it sounds pretty good. One thing I really > like about it is that if the table is being vacuumed frequently, then > we freeze less aggressively, and if the table is being vacuumed > infrequently, then we freeze more aggressively. That seems like a very > desirable property. It also seems broadly good that this metric > doesn't really care about reads. If there are a lot of reads on the > system, or no reads at all, it doesn't really change the chances that > a certain page is going to be written again soon, and since reads > don't change the insert LSN, here again it seems to do the right > thing. I'm a little less clear about whether it's good that it doesn't > really depend on wall-clock time. Certainly, that's desirable from the > point of view of not wanting to have to measure wall-clock time in > places where we otherwise wouldn't have to, which tends to end up > being expensive. However, if I were making all of my freezing > decisions manually, I might be more freeze-positive on a low-velocity > system where writes are more stretched out across time than on a > high-velocity system where we're blasting through the LSN space at a > higher rate. But maybe that's not a very important consideration, and > I don't know what we'd do about it anyway. By low-velocity, do you mean lower overall TPS? In that case, wouldn't you be less likely to run into xid wraparound and thus need less aggressive opportunistic freezing? 
> > Page Freezes/Page Frozen (less is better)
> >
> > |   | Master |     (1) |     (2) |     (3) |     (4) |     (5) |
> > |---|--------|---------|---------|---------|---------|---------|
> > | A |  28.50 |    3.89 |    1.08 |    1.15 |    1.10 |    1.10 |
> > | B |   1.00 |    1.06 |    1.65 |    1.03 |    1.59 |    1.00 |
> > | C |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> > | D |   2.00 | 5199.15 | 5276.85 | 4830.45 | 5234.55 | 2193.55 |
> > | E |   7.90 |    3.21 |    2.73 |    2.70 |    2.69 |    2.43 |
> > | F |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> > | G |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> > | H |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> > | I |    N/A |   42.00 |   42.00 |     N/A |   41.00 |     N/A |
>
> Hmm. I would say that the interesting rows here are A, D, and I, with
> rows C and E deserving honorable mention. In row A, master is bad.

So, this is where the caveat about absolute number of page freezes matters. In workload A, master only did 57 page freezes (spread across the various pgbench tables). At the end of the run, 2 pages were still frozen.

> In row D, your algorithms are all bad, really bad. I don't quite
> understand how it can be that bad, actually.

So, I realize now that this test was poorly designed. I meant it to be a worst-case scenario, but I think one critical part was wrong. In this example, one client is going at full speed inserting a row and then updating it. Then another rate-limited client is deleting old data periodically to keep the table at a constant size. I meant to bulk load the table with enough data that the delete job would have data to delete from the start. With the default autovacuum settings, over the course of 45 minutes, I usually saw around 40 autovacuums of the table. Due to the rate limiting, the first autovacuum of the table ends up freezing many pages that are deleted soon after. Thus the total number of page freezes is very high. I will redo benchmarking of workload D and start the table with the number of rows which the DELETE job seeks to maintain. My back-of-the-envelope math says that this will mean ratios closer to a dozen (and not 5000).
Also, I had doubled checkpoint timeout, which likely led master to freeze so few pages (2 total freezes, neither of which were still frozen at the end of the run). This is an example where master's overall low number of page freezes makes it difficult to compare to the alternatives using a ratio. I didn't initially question the numbers because it seems like freezing data and then deleting it right after would naturally be one of the worst cases for opportunistic freezing, but certainly not this bad.

> Row I looks bad for algorithms 1, 2, and 4: they freeze pages because
> it looks cheap, but the work doesn't really pay off.

Yes, the work queue example looks like it is hard to handle.

> > % Frozen at end of run
> >
> > |   | Master | (1) | (2) | (3) | (4) | (5) |
> > |---|--------|-----|-----|-----|-----|-----|
> > | A |      0 |   1 |  99 |   0 |  81 |   0 |
> > | B |     71 |  96 |  99 |   3 |  98 |   2 |
> > | C |      0 |   9 | 100 |   6 |  92 |   5 |
> > | D |      0 |   1 |   1 |   1 |   1 |   1 |
> > | E |      0 |  63 | 100 |  68 | 100 |  67 |
> > | F
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 3:09 PM Peter Geoghegan wrote: > I've long emphasized the importance of designs that just try to avoid > disaster. With that in mind, I wonder: have you thought about > conditioning page freezing on whether or not there are already some > frozen tuples on the page? You could perhaps give some weight to > whether or not the page already has at least one or two preexisting > frozen tuples when deciding on whether to freeze it once again now. > You'd be more eager about freezing pages that have no frozen tuples > whatsoever, compared to what you'd do with an otherwise equivalent > page that has no unfrozen tuples. I'm sure this could be implemented, but it's unclear to me why you would expect it to perform well. Freezing a page that has no frozen tuples yet isn't cheaper than freezing one that does, so for this idea to be a win, the presence of frozen tuples on the page would have to be a signal that the page is likely to be modified again in the near future. In general, I don't see any reason why we should expect that to be the case. One could easily construct a workload where it is the case -- for instance, set up one table T1 where 90% of the tuples are repeatedly updated and the other 10% are never touched, and another table T2 that is insert-only. Once frozen, the never-updated tuples in T1 become sentinels that we can use to know that the table isn't insert-only. But I don't think that's very interesting: you can construct a test case like this for any proposed criterion, just by structuring the test workload so that whatever criterion is being tested is a perfect predictor of whether the page will be modified soon. What really matters here is finding a criterion that is likely to perform well in general, on a test case not known to us beforehand. 
This isn't an entirely feasible goal, because just as you can construct a test case where any given criterion performs well, so you can also construct one where any given criterion performs poorly. But I think a rule that has a clear theory of operation must be preferable to one that doesn't. The theory that Melanie and Andres are advancing is that a page that has been modified recently (in insert-LSN-time) is more likely to be modified again soon than one that has not i.e. the near future will be like the recent past. I'm not sure what the theory behind the rule you propose here might be; if you articulated it somewhere in your email, I seem to have missed it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 7:00 AM Melanie Plageman wrote: > To do this, we need to save that insert LSN > somewhere. In the attached WIP patch, I saved it in the table stats, for > now -- knowing that those are not crash-safe. What patch? You didn't actually attach one. > Other discarded heuristic ideas included comparing the next transaction > ID at the end of the vacuum of a relation to the visibility cutoff xid > in the page -- but that wasn't effective for freezing data from bulk > loads. I've long emphasized the importance of designs that just try to avoid disaster. With that in mind, I wonder: have you thought about conditioning page freezing on whether or not there are already some frozen tuples on the page? You could perhaps give some weight to whether or not the page already has at least one or two preexisting frozen tuples when deciding on whether to freeze it once again now. You'd be more eager about freezing pages that have no frozen tuples whatsoever, compared to what you'd do with an otherwise equivalent page that has no unfrozen tuples. Small mistakes are inevitable here. They have to be okay. What's not okay is a small mistake that effectively becomes a big mistake because it gets repeated across each VACUUM operation, again and again, ceaselessly. You can probably be quite aggressive about freezing, to good effect -- provided you can be sure that the downside when it turns out to have been a bad idea is self-limiting, in whatever way. Making more small mistakes might actually be a good thing -- especially if it can dramatically lower the chances of ever making any really big mistakes. VACUUM is not a passive observer of the system. It's just another component in the system. So what VACUUM sees in the table depends in no small part on what previous VACUUMs actually did. It follows that VACUUM should be concerned about how what it does might either help or hinder future VACUUMs. 
My preexisting frozen tuples suggestion is really just an example of how that principle might be applied. Many variations on the same general idea are possible. There are already various extremely weird feedback loops where what VACUUM did last time affects what it'll do this time. These are vicious circles. So ISTM that there is a lot to be said for disrupting those vicious circles, and even deliberately engineering heuristics that have the potential to create virtuous circles for the right workload. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 7:00 AM Melanie Plageman wrote: > I believe that there are two goals that should dictate whether or not we > should perform opportunistic freezing: > > 1. Is it cheap? For example, if the buffer is already dirty, then no > write amplification occurs, since it must be written out anyway. > Freezing is also less expensive if we can do it without emitting an > FPI. > > 2. Will it be effective; that is, will it stay frozen? > Opportunistically freezing a page that will immediately be modified is > a waste. > > The current heuristic on master meets neither of these goals: it freezes > a page if pruning emitted an FPI for it. This doesn't evaluate whether > or not freezing itself would be cheap, but rather attempts to hide > freezing behind an expensive operation. The goal is to minimize the number of FPIs over time, in general. That's hardly the same thing as hiding the cost. > Furthermore, it often fails to > freeze cold data and may indiscriminately freeze hot data. You need to account for the cost of not freezing, too. Controlling the overall freeze debt at the level of whole tables (and the level of the whole system) is very important. In fact it's probably the single most important thing. A good model might end up increasing the cost of freezing on a simple quantitative basis, while making life much better overall. Huge balloon payments for freezing are currently a huge problem. Performance stability might come with a cost in some cases. There isn't necessarily anything wrong with that at all. -- Peter Geoghegan
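For reference, the two goals quoted above combine into a simple predicate. This is only a toy model with invented parameter names — not the logic of any posted patch — using the "older than 33% of the LSN space consumed since the last vacuum" variant as the effectiveness test:

```python
def opportunistic_freeze(buffer_dirty, freeze_would_emit_fpi,
                         page_lsn, last_vacuum_insert_lsn,
                         current_insert_lsn, threshold=0.33):
    """Toy combination of the two goals:
    (1) cheap: the buffer is already dirty (no write amplification)
        and freezing would not emit a new FPI;
    (2) effective: the page LSN falls outside the most recent
        `threshold` fraction of LSN space consumed since the last
        vacuum, so it was probably not modified recently."""
    cheap = buffer_dirty and not freeze_would_emit_fpi
    cutoff = current_insert_lsn - threshold * (current_insert_lsn
                                               - last_vacuum_insert_lsn)
    likely_to_stay_frozen = page_lsn < cutoff
    return cheap and likely_to_stay_frozen
```

Peter's point above is that such a model prices only the freezing itself; a fuller accounting would also weigh the deferred cost of *not* freezing, which this sketch ignores.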
Re: Eager page freeze criteria clarification
On Mon, Aug 28, 2023 at 10:00 AM Melanie Plageman wrote: > For the second goal, I've relied on past data to predict future > behavior, so I tried several criteria to estimate the likelihood that a > page will not be imminently modified. What was most effective was > Andres' suggestion of comparing the page LSN to the insert LSN at the > end of the last vacuum of that table; this approximates whether the page > has been recently modified, which is a decent proxy for whether it'll be > modified in the future. To do this, we need to save that insert LSN > somewhere. In the attached WIP patch, I saved it in the table stats, for > now -- knowing that those are not crash-safe. I wonder what the real plan here is for where to store this. It's not obvious that we need this to be crash-safe; it's after all only for use by a heuristic, and there's no actual breakage if the heuristic goes wrong. At the same time, it doesn't exactly feel like a statistic. Then there's the question of whether it's the right metric. My first reaction is to think that it sounds pretty good. One thing I really like about it is that if the table is being vacuumed frequently, then we freeze less aggressively, and if the table is being vacuumed infrequently, then we freeze more aggressively. That seems like a very desirable property. It also seems broadly good that this metric doesn't really care about reads. If there are a lot of reads on the system, or no reads at all, it doesn't really change the chances that a certain page is going to be written again soon, and since reads don't change the insert LSN, here again it seems to do the right thing. I'm a little less clear about whether it's good that it doesn't really depend on wall-clock time. Certainly, that's desirable from the point of view of not wanting to have to measure wall-clock time in places where we otherwise wouldn't have to, which tends to end up being expensive. 
However, if I were making all of my freezing decisions manually, I might be more freeze-positive on a low-velocity system where writes are more stretched out across time than on a high-velocity system where we're blasting through the LSN space at a higher rate. But maybe that's not a very important consideration, and I don't know what we'd do about it anyway.

> Page Freezes/Page Frozen (less is better)
>
> |   | Master |     (1) |     (2) |     (3) |     (4) |     (5) |
> |---|--------|---------|---------|---------|---------|---------|
> | A |  28.50 |    3.89 |    1.08 |    1.15 |    1.10 |    1.10 |
> | B |   1.00 |    1.06 |    1.65 |    1.03 |    1.59 |    1.00 |
> | C |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> | D |   2.00 | 5199.15 | 5276.85 | 4830.45 | 5234.55 | 2193.55 |
> | E |   7.90 |    3.21 |    2.73 |    2.70 |    2.69 |    2.43 |
> | F |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> | G |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> | H |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> | I |    N/A |   42.00 |   42.00 |     N/A |   41.00 |     N/A |

Hmm. I would say that the interesting rows here are A, D, and I, with rows C and E deserving honorable mention. In row A, master is bad. In row D, your algorithms are all bad, really bad. I don't quite understand how it can be that bad, actually. Row I looks bad for algorithms 1, 2, and 4: they freeze pages because it looks cheap, but the work doesn't really pay off.

> % Frozen at end of run
>
> |   | Master | (1) | (2) | (3) | (4) | (5) |
> |---|--------|-----|-----|-----|-----|-----|
> | A |      0 |   1 |  99 |   0 |  81 |   0 |
> | B |     71 |  96 |  99 |   3 |  98 |   2 |
> | C |      0 |   9 | 100 |   6 |  92 |   5 |
> | D |      0 |   1 |   1 |   1 |   1 |   1 |
> | E |      0 |  63 | 100 |  68 | 100 |  67 |
> | F |      0 |   5 |  14 |   6 |  14 |   5 |
> | G |      0 | 100 | 100 |  92 | 100 |  67 |
> | H |      0 |  11 | 100 |   9 |  86 |   5 |
> | I |      0 | 100 | 100 |   0 | 100 |   0 |

So all of the algorithms here, but especially 1, 2, and 4, freeze a lot more often than master. If I understand correctly, we'd like to see small numbers for B, D, and I, and large numbers for the other workloads. None of the algorithms seem to achieve that.
(3) and (5) seem like they always behave as well or better than master, but they produce small numbers for A, C, F, and H. (1), (2), and (4) regress B and I relative to master but do better than (3) and (5) on A, C, and the latter two also on E. B is such an important benchmarking workload that I'd be loath to regress it, so if I had to pick on the basis of this data, my vote would be (3) or (5), provided whatever is happening with (D) in the previous metric is not as bad as it looks. What's your reason for preferring (4) and (5) over (2) and (3)? I'm not clear that these numbers give us much of an idea whether 10% or 33% or something else is better in general. To be honest, having now spent more time l
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 3:27 PM Melanie Plageman wrote: > On Fri, Jul 28, 2023 at 3:00 PM Peter Geoghegan wrote: > > > Is this test meant to guard against unnecessary freezing or to avoid > > > freezing when the cost is too high? That is, are we trying to > > > determine how likely it is that the page has been recently modified > > > and avoid eager freezing when it would be pointless (because the page > > > will soon be modified again)? > > > > Sort of. This cost of freezing over time is weirdly nonlinear, so it's > > hard to give a simple answer. > > > > The justification for the FPI trigger optimization is that FPIs are > > overwhelmingly the cost that really matters when it comes to freezing > > (and vacuuming in general) -- so we might as well make the best out of > > a bad situation when pruning happens to get an FPI. There can easily > > be a 10x or more cost difference (measured in total WAL volume) > > between freezing without an FPI and freezing with an FPI. > ... > > In 16 VACUUM just "makes the best > > out of a bad situation" when an FPI was already required during > > pruning. We have already "paid for the privilege" of writing some WAL > > for the page at that point, so it's reasonable to not squander a > > window of opportunity to avoid future FPIs in future VACUUM > > operations, by freezing early. > > > > We're "taking a chance" on being able to get freezing out of the way > > early when an FPI triggers freezing. It's not guaranteed to work out > > in each individual case, of course, but even if we assume it's fairly > > unlikely to work out (which is very pessimistic) it's still very > > likely a good deal. > > > > This strategy (the 16 strategy of freezing eagerly because we already > > got an FPI) seems safer than a strategy involving freezing eagerly > > because we won't get an FPI as a result. If for no other reason than > > this: with the approach in 16 we already know for sure that we'll have > > written an FPI anyway. 
It's hard to imagine somebody being okay with > > the FPIs, but not being okay with the other extra WAL. > > I see. I don't have an opinion on the "best of a bad situation" > argument. Though, I think it is worth amending the comment in the code > to include this explanation. > > But, ISTM that there should also be some independent heuristic to > determine whether or not it makes sense to freeze the page. That could > be related to whether or not it will be cheap to do so (in which case > we can check if we will have to emit an FPI as part of the freeze > record) or it could be related to whether or not the freezing is > likely to be pointless (we are likely to update the page again soon). > > It sounds like it was discussed before, but I'd be interested in > revisiting it and happy to test out various ideas.

Hi, in service of "testing various ideas", I've benchmarked the heuristics proposed in this thread, as well as a few others that Andres and I considered, for determining whether or not to opportunistically freeze a page during vacuum. Note that this heuristic would be in addition to the existing criterion that we only opportunistically freeze pages that can be subsequently set all frozen in the visibility map.

I believe that there are two goals that should dictate whether or not we should perform opportunistic freezing:

1. Is it cheap? For example, if the buffer is already dirty, then no write amplification occurs, since it must be written out anyway. Freezing is also less expensive if we can do it without emitting an FPI.

2. Will it be effective; that is, will it stay frozen? Opportunistically freezing a page that will immediately be modified is a waste.

The current heuristic on master meets neither of these goals: it freezes a page if pruning emitted an FPI for it. This doesn't evaluate whether or not freezing itself would be cheap, but rather attempts to hide freezing behind an expensive operation.
Furthermore, it often fails to freeze cold data and may indiscriminately freeze hot data. For the second goal, I've relied on past data to predict future behavior, so I tried several criteria to estimate the likelihood that a page will not be imminently modified. What was most effective was Andres' suggestion of comparing the page LSN to the insert LSN at the end of the last vacuum of that table; this approximates whether the page has been recently modified, which is a decent proxy for whether it'll be modified in the future. To do this, we need to save that insert LSN somewhere. In the attached WIP patch, I saved it in the table stats, for now -- knowing that those are not crash-safe. Other discarded heuristic ideas included comparing the next transaction ID at the end of the vacuum of a relation to the visibility cutoff xid in the page -- but that wasn't effective for freezing data from bulk loads. The algorithms I evaluated all attempt to satisfy goal (1) by freezing only if the buffer is already dirty and also by considering whether or not
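The two goals above can be condensed into a standalone predicate. This is only an illustrative sketch, not the WIP patch's code: the function and parameter names (`should_freeze_opportunistically`, `last_vacuum_insert_lsn`) and the direct LSN comparison are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t XLogRecPtr;

/*
 * Hypothetical sketch of the two goals:
 *  1. cheap: the buffer is already dirty and the freeze record would not
 *     emit an FPI, so freezing adds little extra I/O and WAL;
 *  2. effective: the page LSN predates the insert LSN recorded at the end
 *     of the table's last vacuum, i.e. the page has not been modified
 *     recently and is therefore likely to stay frozen.
 */
static bool
should_freeze_opportunistically(bool buffer_dirty, bool freeze_emits_fpi,
                                XLogRecPtr page_lsn,
                                XLogRecPtr last_vacuum_insert_lsn)
{
    if (!buffer_dirty || freeze_emits_fpi)
        return false;           /* goal 1 not met: freezing isn't cheap */
    /* goal 2: page unmodified since the last vacuum of this table */
    return page_lsn <= last_vacuum_insert_lsn;
}
```

A page dirtied after the last vacuum (its LSN beyond the recorded insert LSN) would be skipped here as likely hot.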
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 5:03 PM Melanie Plageman wrote: > When you were working on this, what were the downsides of only having the > criteria that 1) page would be all_visible/all_frozen and 2) we did prune > something (i.e. no consideration of FPIs)? Conditioning freezing on "all_visible && all_frozen" seems likely to always be a good idea when it comes to any sort of speculative trigger criteria. Most of the wins in this area will come from avoiding future FPIs, and you can't hope to do that unless you freeze everything on the page. The cost of freezing when "all_visible && all_frozen" does not hold may be quite low, but the benefits will also be very low. Also, the way that recovery conflicts are generated for freeze records somewhat depends on this right now -- though you could probably fix that if you had to. Note that we *don't* really limit ourselves to cases where an FPI was generated from pruning, exactly -- even on 16. With page-level checksums we can also generate an FPI just to set a hint bit. That triggers freezing too, even on insert-only tables (assuming checksums are enabled). I expect that that will be an important condition for triggering freezing in practice. Your question about 2 seems equivalent to "why not just always freeze?". I don't think that that's a bad question -- quite the opposite. Even trying to give an answer to this question would amount to getting involved in new work on VACUUM, though. So I don't think I can help you here. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 4:49 PM Peter Geoghegan wrote: > > On Fri, Jul 28, 2023 at 4:30 PM Andres Freund wrote: > > > Put differently, I can't see any reason to care whether pruning > > > emitted an FPI or not. Either way, it's very unlikely that freezing > > > needs to do so. > > > > +many > > > > Using FPIs having been generated during pruning as part of the logic to > > decide > > whether to freeze makes no sense to me. We likely need some heuristic here, > > but I suspect it has to look quite different than the FPI condition. > > I think that it depends on how you frame it. It makes zero sense that > that's the only thing that can do something like this at all. It makes > some sense as an incremental improvement, since there is no way that > we can lose by too much, even if you make arbitrarily pessimistic > assumptions. In other words, you won't be able to come up with an > adversarial benchmark where this loses by too much. When you were working on this, what were the downsides of only having the criteria that 1) page would be all_visible/all_frozen and 2) we did prune something (i.e. no consideration of FPIs)? > As I said, I'm no longer interested in working on VACUUM (though I do > hope that Melanie or somebody else picks up where I left off). I have > nothing to say about any new work in this area. If you want me to do > something in the scope of the work on 16, as a release management > task, please be clear about what that is. I'm only talking about this in the context of future work for 17. - Melanie
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 4:30 PM Andres Freund wrote: > Using FPIs having been generated during pruning as part of the logic to decide > whether to freeze makes no sense to me. We likely need some heuristic here, > but I suspect it has to look quite different than the FPI condition. Why do we need a heuristic at all? What we're talking about here, AIUI, is under what circumstance we should eagerly freeze a page that can be fully frozen, leaving no tuples that vacuum ever needs to worry about again except if the page is further modified. Freezing in such circumstances seems highly likely to work out in our favor. The only argument against it is that the page might soon be modified again, in which case the effort of freezing would be mostly wasted. But I'm not sure whether that's really a valid rationale, because the win from not having to vacuum the page again if it isn't modified seems pretty big, and emitting a freeze record for a page we've just pruned doesn't seem that expensive. If it is a valid rationale, my first thought would be to make the heuristic based on vacuum_freeze_min_age. XID age isn't a particularly great proxy for whether the page is still actively being modified, but at least it has tradition going for it. -- Robert Haas EDB: http://www.enterprisedb.com
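Robert's fallback idea of gating eager freezing on vacuum_freeze_min_age reduces to an XID age test. A minimal sketch, assuming plain 32-bit modular arithmetic and deliberately omitting the epoch/FullTransactionId bookkeeping that real TransactionId comparisons involve; the function name is illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: consider a tuple for eager freezing only once its
 * xmin is at least freeze_min_age transactions old. Unsigned subtraction
 * yields the age modulo 2^32, loosely mirroring XID wraparound; the real
 * code's epoch handling is omitted.
 */
static bool
xmin_age_exceeds(uint32_t next_xid, uint32_t tuple_xmin,
                 uint32_t freeze_min_age)
{
    return (uint32_t) (next_xid - tuple_xmin) >= freeze_min_age;
}
```

As Robert notes, XID age is a poor proxy for whether the page is still being modified, but it is at least a traditional threshold.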
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 4:30 PM Andres Freund wrote: > > Put differently, I can't see any reason to care whether pruning > > emitted an FPI or not. Either way, it's very unlikely that freezing > > needs to do so. > > +many > > Using FPIs having been generated during pruning as part of the logic to decide > whether to freeze makes no sense to me. We likely need some heuristic here, > but I suspect it has to look quite different than the FPI condition. I think that it depends on how you frame it. It makes zero sense that that's the only thing that can do something like this at all. It makes some sense as an incremental improvement, since there is no way that we can lose by too much, even if you make arbitrarily pessimistic assumptions. In other words, you won't be able to come up with an adversarial benchmark where this loses by too much. As I said, I'm no longer interested in working on VACUUM (though I do hope that Melanie or somebody else picks up where I left off). I have nothing to say about any new work in this area. If you want me to do something in the scope of the work on 16, as a release management task, please be clear about what that is. -- Peter Geoghegan
Re: Eager page freeze criteria clarification
Hi, On 2023-07-28 16:19:47 -0400, Robert Haas wrote: > On Fri, Jul 28, 2023 at 3:00 PM Peter Geoghegan wrote: > > > Or are we trying to determine how likely > > > the freeze record is to emit an FPI and avoid eager freezing when it > > > isn't worth the cost? > > > > No, that's not something that we're doing right now (we just talked > > about doing something like that). In 16 VACUUM just "makes the best > > out of a bad situation" when an FPI was already required during > > pruning. We have already "paid for the privilege" of writing some WAL > > for the page at that point, so it's reasonable to not squander a > > window of opportunity to avoid future FPIs in future VACUUM > > operations, by freezing early. > > This doesn't make sense to me. It's true that if the pruning record > emitted an FPI, then the freezing record probably won't need to do so > either, unless by considerable ill-fortune the redo pointer was moved > in the tiny window between pruning and freezing. But isn't that also > true that if the pruning record *didn't* emit an FPI? In that case, > the pruning record wasn't the first WAL-logged modification to the > page during the then-current checkpoint cycle, and some earlier > modification already did it. But in that case also the freezing won't > need to emit a new FPI, except for the same identical case where the > redo pointer moves at just the wrong time. > > Put differently, I can't see any reason to care whether pruning > emitted an FPI or not. Either way, it's very unlikely that freezing > needs to do so. +many Using FPIs having been generated during pruning as part of the logic to decide whether to freeze makes no sense to me. We likely need some heuristic here, but I suspect it has to look quite different than the FPI condition. Greetings, Andres
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 3:00 PM Peter Geoghegan wrote: > > Or are we trying to determine how likely > > the freeze record is to emit an FPI and avoid eager freezing when it > > isn't worth the cost? > > No, that's not something that we're doing right now (we just talked > about doing something like that). In 16 VACUUM just "makes the best > out of a bad situation" when an FPI was already required during > pruning. We have already "paid for the privilege" of writing some WAL > for the page at that point, so it's reasonable to not squander a > window of opportunity to avoid future FPIs in future VACUUM > operations, by freezing early. This doesn't make sense to me. It's true that if the pruning record emitted an FPI, then the freezing record probably won't need to do so either, unless by considerable ill-fortune the redo pointer was moved in the tiny window between pruning and freezing. But isn't that also true that if the pruning record *didn't* emit an FPI? In that case, the pruning record wasn't the first WAL-logged modification to the page during the then-current checkpoint cycle, and some earlier modification already did it. But in that case also the freezing won't need to emit a new FPI, except for the same identical case where the redo pointer moves at just the wrong time. Put differently, I can't see any reason to care whether pruning emitted an FPI or not. Either way, it's very unlikely that freezing needs to do so. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 3:27 PM Melanie Plageman wrote: > I see. I don't have an opinion on the "best of a bad situation" > argument. Though, I think it is worth amending the comment in the code > to include this explanation. I think that the user-facing docs should be completely overhauled to make that clear, and reasonably accessible. It's hard to talk about in code comments because it's something that can only be effective over time, across multiple VACUUM operations. > But, ISTM that there should also be some independent heuristic to > determine whether or not it makes sense to freeze the page. That could > be related to whether or not it will be cheap to do so (in which case > we can check if we will have to emit an FPI as part of the freeze > record) or it could be related to whether or not the freezing is > likely to be pointless (we are likely to update the page again soon). It sounds like you're interested in adding additional criteria to trigger page-level freezing. That's something that the current structure anticipates. To give a slightly contrived example: it would be very easy to add another condition, where (say) we call random() to determine whether or not to freeze the page. You'd very likely want to gate this new trigger criteria in the same way as the FPI criteria (i.e. only do it when "prunestate->all_visible && prunestate->all_frozen" hold). We've decoupled the decision to freeze from what it means to execute freezing itself. Having additional trigger criteria makes a lot of sense. I'm sure that it makes sense to add at least one more, and it seems possible that adding several more makes sense. Obviously, that will have to be new work targeting 17, though. I made a decision to stop working on VACUUM, though, so I'm afraid I won't be able to offer much help with any of this. (Happy to give more background information, though.) -- Peter Geoghegan
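The decoupled structure Peter describes — mandatory freezing, plus a disjunction of speculative trigger criteria gated on the page ending up all-visible and all-frozen — can be sketched as follows. The names are illustrative stand-ins, not the actual vacuumlazy.c code:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the decision structure: freezing that is required (an old
 * XID/MXID forces it) or trivial (nothing to freeze) always proceeds;
 * any speculative trigger -- the FPI condition, or a hypothetical new
 * criterion -- only fires when the page will be all-visible and
 * all-frozen afterwards.
 */
static bool
page_freeze_decision(bool freeze_required, int tuples_frozen,
                     bool all_visible, bool all_frozen,
                     bool speculative_trigger_fired)
{
    if (freeze_required || tuples_frozen == 0)
        return true;
    if (!(all_visible && all_frozen))
        return false;           /* speculative freezing is gated */
    return speculative_trigger_fired;
}
```

Adding a new trigger criterion then amounts to adding another disjunct behind the all-visible/all-frozen gate.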
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 3:00 PM Peter Geoghegan wrote: > > On Fri, Jul 28, 2023 at 11:13 AM Melanie Plageman > wrote: > > if (pagefrz.freeze_required || tuples_frozen == 0 || > > (prunestate->all_visible && prunestate->all_frozen && > > fpi_before != pgWalUsage.wal_fpi)) > > { > > > > I'm trying to understand the condition fpi_before != > > pgWalUsage.wal_fpi -- don't eager freeze if pruning emitted an FPI. > > You mean "prunestate->all_visible && prunestate->all_frozen", which is > a condition of applying FPI-based eager freezing, but not traditional > lazy freezing? Ah no, a very key word in my sentence was off: I meant "don't eager freeze *unless* pruning emitted an FPI" I was asking about the FPI-trigger optimization specifically. I'm clear on the all_frozen/all_visible criteria. > > Is this test meant to guard against unnecessary freezing or to avoid > > freezing when the cost is too high? That is, are we trying to > > determine how likely it is that the page has been recently modified > > and avoid eager freezing when it would be pointless (because the page > > will soon be modified again)? > > Sort of. This cost of freezing over time is weirdly nonlinear, so it's > hard to give a simple answer. > > The justification for the FPI trigger optimization is that FPIs are > overwhelmingly the cost that really matters when it comes to freezing > (and vacuuming in general) -- so we might as well make the best out of > a bad situation when pruning happens to get an FPI. There can easily > be a 10x or more cost difference (measured in total WAL volume) > between freezing without an FPI and freezing with an FPI. ... > In 16 VACUUM just "makes the best > out of a bad situation" when an FPI was already required during > pruning. We have already "paid for the privilege" of writing some WAL > for the page at that point, so it's reasonable to not squander a > window of opportunity to avoid future FPIs in future VACUUM > operations, by freezing early. 
> > We're "taking a chance" on being able to get freezing out of the way > early when an FPI triggers freezing. It's not guaranteed to work out > in each individual case, of course, but even if we assume it's fairly > unlikely to work out (which is very pessimistic) it's still very > likely a good deal. > > This strategy (the 16 strategy of freezing eagerly because we already > got an FPI) seems safer than a strategy involving freezing eagerly > because we won't get an FPI as a result. If for no other reason than > this: with the approach in 16 we already know for sure that we'll have > written an FPI anyway. It's hard to imagine somebody being okay with > the FPIs, but not being okay with the other extra WAL. I see. I don't have an opinion on the "best of a bad situation" argument. Though, I think it is worth amending the comment in the code to include this explanation. But, ISTM that there should also be some independent heuristic to determine whether or not it makes sense to freeze the page. That could be related to whether or not it will be cheap to do so (in which case we can check if we will have to emit an FPI as part of the freeze record) or it could be related to whether or not the freezing is likely to be pointless (we are likely to update the page again soon). It sounds like it was discussed before, but I'd be interested in revisiting it and happy to test out various ideas. - Melanie
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 11:13 AM Melanie Plageman wrote:
> if (pagefrz.freeze_required || tuples_frozen == 0 ||
> (prunestate->all_visible && prunestate->all_frozen &&
> fpi_before != pgWalUsage.wal_fpi))
> {
>
> I'm trying to understand the condition fpi_before !=
> pgWalUsage.wal_fpi -- don't eager freeze if pruning emitted an FPI.

You mean "prunestate->all_visible && prunestate->all_frozen", which is a condition of applying FPI-based eager freezing, but not traditional lazy freezing? Obviously, the immediate impact of that is that the FPI trigger condition is not met unless we know for sure that the page will be marked all-visible and all-frozen in the visibility map afterwards. A page that's eligible to become all-visible will also be seen as eligible to become all-frozen in the vast majority of cases, but there are some rare and obscure cases involving MultiXacts that must be considered here.

There is little point in freezing early unless we have at least some chance of not having to freeze the same page again in the future (ideally forever). There is generally no point in freezing most of the tuples on a page when there'll still be one or more not-yet-eligible unfrozen tuples that get left behind -- we might as well not bother with freezing at all when we see a page where that'll be the outcome from freezing.

However, there is (once again) a need to account for rare and extreme cases -- it still needs to be possible to do that. Specifically, we may be forced to freeze a page that's left with some remaining unfrozen tuples when VACUUM is fundamentally incapable of freezing them due to its OldestXmin/removable cutoff being too old. That can happen when VACUUM needs to freeze according to the traditional age-based settings, and yet the OldestXmin/removable cutoff gets held back (by a leaked replication slot or whatever).

(Actually, VACUUM FREEZE will freeze only a subset of the tuples from some heap pages far more often.
VACUUM FREEZE seems like a bad design to me, though -- it uses the most aggressive possible XID cutoff for freezing when it should probably hold off on freezing those individual pages where we determine that it makes little sense. We need to focus more on physical pages and their costs, and less on XID cutoffs.) > Is this test meant to guard against unnecessary freezing or to avoid > freezing when the cost is too high? That is, are we trying to > determine how likely it is that the page has been recently modified > and avoid eager freezing when it would be pointless (because the page > will soon be modified again)? Sort of. This cost of freezing over time is weirdly nonlinear, so it's hard to give a simple answer. The justification for the FPI trigger optimization is that FPIs are overwhelmingly the cost that really matters when it comes to freezing (and vacuuming in general) -- so we might as well make the best out of a bad situation when pruning happens to get an FPI. There can easily be a 10x or more cost difference (measured in total WAL volume) between freezing without an FPI and freezing with an FPI. > Or are we trying to determine how likely > the freeze record is to emit an FPI and avoid eager freezing when it > isn't worth the cost? No, that's not something that we're doing right now (we just talked about doing something like that). In 16 VACUUM just "makes the best out of a bad situation" when an FPI was already required during pruning. We have already "paid for the privilege" of writing some WAL for the page at that point, so it's reasonable to not squander a window of opportunity to avoid future FPIs in future VACUUM operations, by freezing early. We're "taking a chance" on being able to get freezing out of the way early when an FPI triggers freezing. It's not guaranteed to work out in each individual case, of course, but even if we assume it's fairly unlikely to work out (which is very pessimistic) it's still very likely a good deal. 
This strategy (the 16 strategy of freezing eagerly because we already got an FPI) seems safer than a strategy involving freezing eagerly because we won't get an FPI as a result. If for no other reason than this: with the approach in 16 we already know for sure that we'll have written an FPI anyway. It's hard to imagine somebody being okay with the FPIs, but not being okay with the other extra WAL. > But, the final rationale is still not clear to me. Could we add a > comment above the if condition specifying both: > a) what the test is a proxy for > b) the intended outcome (when do we expect to eager freeze) > And perhaps we could even describe a scenario where this heuristic is > effective? There are lots of scenarios where it'll be effective. I agree that there is a need to document this stuff a lot better. I have a pending doc patch that overhauls the user-facing docs in this area. My latest draft is here: https://postgr.es/m/CAH2-Wz=UUJz+MMb1AxFzz-HDA=1t1fx_kmrdovopzxkpa-t...@mail.gmail.com
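The fpi_before != pgWalUsage.wal_fpi test works by snapshotting a cumulative FPI counter before pruning and comparing it afterwards. Below is a standalone model of that bookkeeping; the counter and function names are stand-ins for illustration, not PostgreSQL internals:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the cumulative full-page-image count in pgWalUsage. */
static uint64_t wal_fpi_counter = 0;

/* Model of pruning a page; real pruning may or may not emit an FPI. */
static void
model_prune_page(bool emits_fpi)
{
    if (emits_fpi)
        wal_fpi_counter++;
}

/* True when pruning since the snapshot emitted an FPI -- the same "we
 * already paid for an FPI, so freeze eagerly" signal the 16 code derives
 * from fpi_before != pgWalUsage.wal_fpi. */
static bool
pruning_emitted_fpi(uint64_t fpi_before)
{
    return fpi_before != wal_fpi_counter;
}

/* Tiny simulation: snapshot the counter, prune with an FPI, detect it. */
static bool
demo_detects_fpi(void)
{
    uint64_t before = wal_fpi_counter;

    model_prune_page(true);
    return pruning_emitted_fpi(before);
}
```

The snapshot-and-compare idiom is what lets the caller distinguish "this particular prune wrote an FPI" from FPIs written earlier in the vacuum.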
Eager page freeze criteria clarification
Hi,

While hacking on the pruning and page-level freezing code, I got a bit confused by the test guarding eager freezing [1]:

    /*
     * Freeze the page when heap_prepare_freeze_tuple indicates that at least
     * one XID/MXID from before FreezeLimit/MultiXactCutoff is present. Also
     * freeze when pruning generated an FPI, if doing so means that we set the
     * page all-frozen afterwards (might not happen until final heap pass).
     */
    if (pagefrz.freeze_required || tuples_frozen == 0 ||
        (prunestate->all_visible && prunestate->all_frozen &&
         fpi_before != pgWalUsage.wal_fpi))
    {

I'm trying to understand the condition fpi_before != pgWalUsage.wal_fpi -- don't eager freeze if pruning emitted an FPI.

Is this test meant to guard against unnecessary freezing or to avoid freezing when the cost is too high? That is, are we trying to determine how likely it is that the page has been recently modified and avoid eager freezing when it would be pointless (because the page will soon be modified again)? Or are we trying to determine how likely the freeze record is to emit an FPI and avoid eager freezing when it isn't worth the cost? Or something else?

The commit message says:

> Also teach VACUUM to trigger page-level freezing whenever it detects
> that heap pruning generated an FPI. We'll have already written a large
> amount of WAL just to do that much, so it's very likely a good idea to
> get freezing out of the way for the page early.

And I found the thread where it was discussed [2]. Several possible explanations are mentioned in the thread. But, the final rationale is still not clear to me. Could we add a comment above the if condition specifying both:

a) what the test is a proxy for
b) the intended outcome (when do we expect to eager freeze)

And perhaps we could even describe a scenario where this heuristic is effective?
- Melanie [1] https://github.com/postgres/postgres/blob/master/src/backend/access/heap/vacuumlazy.c#L1802 [2] https://www.postgresql.org/message-id/flat/CAH2-Wzm_%3DmrWO%2BkUAJbR_gM_6RzpwVA8n8e4nh3dJGHdw_urew%40mail.gmail.com#c5aae6ea65e07de92a24a35445473448