Re: New strategies for freezing, advancing relfrozenxid early
On Fri, Jan 27, 2023 at 12:52 AM Andres Freund wrote: > I agree with bringing high-level context into the decision about whether to > freeze aggressively - my problem with the eager freezing strategy patch isn't > that it did that too much, it's that it didn't do it enough. > > > But I also don't think what I describe above is really comparable to "table > level" eager freezing though - the potential worst case overhead is a small > fraction of the WAL volume, and there's zero increase in data write volume. All I meant was that I initially thought that you were trying to replace the FPI thing with something at the same level of ambition, that could work in a low context way. But I now see that you're actually talking about something quite a bit more ambitious for Postgres 16, which is structurally similar to a freezing strategy, from a code point of view -- it relies on high-level context for the VACUUM/table as a whole. I wasn't equating it with the eager freezing strategy in any other way. It might also be true that this other thing happens to render the FPI mechanism redundant. I'm actually not completely sure that it will just yet. Let me verify my understanding of your proposal: You mean that we'd take the page LSN before doing anything with the page, right at the top of lazy_scan_prune, at the same point that "fpi_before" is initialized currently. Then, if we subsequently dirtied the page (as determined by its LSN, so as to focus on "dirtied via WAL-logged operation") during pruning, *and* if the "lsn_before" of the page was from before our cutoff (derived via "lsn_threshold = insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1" or similar), *and* if the page is eligible to become all-frozen, then we'd freeze the page. That's it, right? It's about pages that *we* (VACUUM) dirtied, and wrote records and/or FPIs for already?
> I suspect the absolute worst case of "always freeze dirty pages" is when a > single tuple on the page gets updated immediately after every time we freeze > the page - a single tuple is where the freeze record is the least space > efficient. The smallest update is about the same size as the smallest freeze > record. For that to amount to a large WAL increase you'd need a crazy rate of such > updates interspersed with vacuums. In slightly more realistic cases (i.e. not > columnless tuples that constantly get updated and freezing happening all the > time) you end up with a reasonably small WAL rate overhead. The other thing is that we'd be doing this in situations where we already know that a VISIBLE record is required, which is comparable in size to a FREEZE_PAGE record with one tuple/plan (around 64 bytes). The smallest WAL records are mostly just generic WAL record header overhead. > Obviously that's a pointless workload, but I do think that > analyzing the "outer boundaries" of the regression something can cause, can be > helpful. I agree about the "outer boundaries" being a useful guide. > I think one way forward with the eager strategy approach would be to have a > very narrow gating condition for now, and then incrementally expand it in > later releases. > > One use-case where the eager strategy is particularly useful is > [nearly-]append-only tables - and it's also the one workload that's reasonably > easy to detect using stats. Maybe something like > (dead_tuples_since_last_vacuum / inserts_since_last_vacuum) < 0.05 > or so. > > That'll definitely leave out loads of workloads where eager freezing would be > useful - but are there semi-reasonable workloads where it'll hurt badly? I > don't *think* so. I have no further plans to work on eager freezing strategy, or anything of the sort, in light of recent developments.
My goal at this point is very unambitious: to get the basic page-level freezing work into a form that makes sense as a standalone thing for Postgres 16. To put things on a good footing, so that I can permanently bow out of all work on VACUUM having left everything in good order. That's all. Now, that might still mean that I'd facilitate future work of this sort, by getting the right basic structure in place. But my involvement in any work on freezing or anything of the sort ends here, both as a patch author and a committer of anybody else's work. I'm proud of the work I've done on VACUUM, but I'm keen to move on from it. > > What about unlogged/temporary tables? The obvious thing to do there is > > what I did in the patch that was reverted (freeze whenever the page > > will thereby become all-frozen), and forget about LSNs. But you have > > already objected to that part, specifically. > > My main concern about that is the data write amplification it could cause when > page is clean when we start freezing. But I can't see a large potential > downside to always freezing unlogged/temp tables when the page is already > dirty. But we have to dirty the page anyway, just to set PD_ALL_VISIBLE. That was always a gating condition. Actually, that may have depen
Re: New strategies for freezing, advancing relfrozenxid early
Hi, On 2023-01-27 12:53:58 -0500, Robert Haas wrote: > On Fri, Jan 27, 2023 at 12:58 AM Andres Freund wrote: > > Essentially the "any fpi" logic is a very coarse grained way of using the > > page > > LSN as a measurement. As I said, I don't think "has a checkpoint occurred > > since the last write" is a good metric to avoid unnecessary freezing - it's > > too coarse. But I think using the LSN is the right thought. What about > > something like > > > > lsn_threshold = insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1 > > if (/* other conds */ && PageGetLSN(page) <= lsn_threshold) > > FreezeMe(); > > > > I probably got some details wrong, what I am going for with lsn_threshold is > > that we'd freeze an already dirty page if it's not been updated within 10% > > of > > the LSN distance to the last VACUUM. > > I think this might not be quite the right idea for a couple of reasons. It's definitely not perfect. If we had an approximate LSN->time map as general infrastructure, we could do a lot better. I think it'd be reasonably easy to maintain that in the autovacuum launcher, for example. One thing worth calling out here, because it's not obvious from the code quoted above in isolation, is that what I was trying to refine here was the decision when to perform opportunistic freezing *of already dirty pages that do not require an FPI*. So all that we need to prevent here is freezing very hotly updated data, where the WAL overhead of the freeze records would be noticeable, because we constantly VACUUM, due to the high turnover. > First, suppose that the table is being processed just by autovacuum > (no manual VACUUM operations) and that the rate of WAL generation is > pretty even, so that LSN age is a good proxy for time. If autovacuum > processes the table once per hour, this will freeze if it hasn't been > updated in the last six minutes. That sounds good.
But if autovacuum > processes the table once per day, then this will freeze if it hasn't > been updated in 2.4 hours. That might be OK, but it sounds a little on > the long side. You're right. I was thinking of the "lsn_since_last_vacuum" because I was postulating it being useful elsewhere in the thread (but for eager strategy logic) - but here that's really not very relevant. Given that we're dealing with already dirty pages not requiring an FPI, I think a much better "reference LSN" would be the LSN of the last checkpoint (LSN of the last checkpoint record, not the current REDO pointer). > Second, and more seriously, I think this would, in some circumstances, > lead to tremendously unstable behavior. Suppose somebody does a bunch > of work on a table and then they're like "oh, we should clean up, > VACUUM" and it completes quickly because it's been a while since the > last vacuum and so it doesn't freeze much. Then, for whatever reason, > they decide to run it one more time, and it goes bananas and starts > freezing all kinds of stuff because the LSN distance since the last > vacuum is basically zero. Or equally, you run a manual VACUUM, and you > get completely different behavior depending on how long it's been > since the last autovacuum ran. I don't think this quite applies to the scenario at hand, because it's restricted to already dirty pages. And the max increased overhead is also small due to that - so occasionally getting it wrong isn't that impactful. Greetings, Andres Freund
Re: New strategies for freezing, advancing relfrozenxid early
On Fri, Jan 27, 2023 at 12:58 AM Andres Freund wrote: > Essentially the "any fpi" logic is a very coarse grained way of using the page > LSN as a measurement. As I said, I don't think "has a checkpoint occurred > since the last write" is a good metric to avoid unnecessary freezing - it's > too coarse. But I think using the LSN is the right thought. What about > something like > > lsn_threshold = insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1 > if (/* other conds */ && PageGetLSN(page) <= lsn_threshold) > FreezeMe(); > > I probably got some details wrong, what I am going for with lsn_threshold is > that we'd freeze an already dirty page if it's not been updated within 10% of > the LSN distance to the last VACUUM. I think this might not be quite the right idea for a couple of reasons. First, suppose that the table is being processed just by autovacuum (no manual VACUUM operations) and that the rate of WAL generation is pretty even, so that LSN age is a good proxy for time. If autovacuum processes the table once per hour, this will freeze if it hasn't been updated in the last six minutes. That sounds good. But if autovacuum processes the table once per day, then this will freeze if it hasn't been updated in 2.4 hours. That might be OK, but it sounds a little on the long side. If autovacuum processes the table once per week, then this will freeze if it hasn't been updated in 16.8 hours. That sounds too conservative. Conversely, if autovacuum processes the table every 3 minutes, then this will freeze the data if it hasn't been updated in the last 18 seconds, which sounds awfully aggressive. Maybe I'm wrong here, but I feel like the absolute amount of wall-clock time we're talking about here probably matters to some degree. I'm not sure whether a strict time-based threshold like, say, 10 minutes would be a good idea, leaving aside the difficulties of implementation. 
It might be right to think that if the table is being vacuumed a lot, freezing more aggressively is smart, and if it's being vacuumed infrequently, freezing less aggressively is smart, because if the table has enough activity that it's being vacuumed frequently, that might also be a sign that we need to freeze more aggressively in order to avoid having things go sideways. However, I'm not completely sure about that, and I think it's possible that we need some guardrails to avoid going too far in either direction. Second, and more seriously, I think this would, in some circumstances, lead to tremendously unstable behavior. Suppose somebody does a bunch of work on a table and then they're like "oh, we should clean up, VACUUM" and it completes quickly because it's been a while since the last vacuum and so it doesn't freeze much. Then, for whatever reason, they decide to run it one more time, and it goes bananas and starts freezing all kinds of stuff because the LSN distance since the last vacuum is basically zero. Or equally, you run a manual VACUUM, and you get completely different behavior depending on how long it's been since the last autovacuum ran. In some ways, I think this proposal has many of the same problems as vacuum_freeze_min_age. In both cases, the instinct is that we should use something on the page to let us know how long it's been since the page was modified, and proceed on the theory that if the page has not been modified recently, it probably isn't about to be modified again. That's a reasonable instinct, but the rate of XID advancement and the rate of LSN advancement are both highly variable, even on a system that's always under some load. -- Robert Haas EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
On Fri, Jan 27, 2023 at 6:48 AM Robert Haas wrote: > > One of the key strengths of systems like Postgres is the ability to > > inexpensively store a relatively large amount of data that has just > > about zero chance of being read, let alone modified. While at the same > > time having decent OLTP performance for the hot data. Not nearly as > > good as an in-memory system, mind you -- and yet in-memory systems > > remain largely a niche thing. > > I think it's interesting that TPC-C suffers from the kind of problem > that your patch was intended to address. I hadn't considered that. But > I do not think it detracts from the basic point I was making, which is > that you need to think about the downsides of your patch, not just the > upsides. > > If you want to argue that there is *no* OLTP workload that will be > harmed by freezing as aggressively as possible, then that would be a > good argument in favor of your patch, because it would be arguing that > the downside simply doesn't exist, at least for OLTP workloads. The > fact that you can think of *one particular* OLTP workload that can > benefit from the patch is just doubling down on the "my patch has an > upside" argument, which literally no one is disputing. You've treated me to another multi-paragraph talking-down, as if I were still clinging to my original position, which is of course not the case. I've literally said I'm done with VACUUM for good, and that I just want to put a line under this. Yet you still persist in doing this sort of thing. I'm not fighting you, I'm not fighting Andres. I was making a point about the need to do something in this area in general. That's all. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 6:37 PM Peter Geoghegan wrote: > > I don't see what your reference to checkpoint timeout is about here? > > > > Also, as I mentioned before, the problem isn't specific to > > checkpoint_timeout > > = 1min. It just makes it cheaper to reproduce. > > That's flagrantly intellectually dishonest. This kind of ad hominem attack has no place on this mailing list, or anywhere in the PostgreSQL community. If you think there's a problem with Andres's test case, or his analysis of it, you can talk about those problems without accusing him of intellectual dishonesty. I don't see anything to indicate that he was being intentionally dishonest, either. At most he was mistaken. More than likely, not even that. -- Robert Haas EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 4:51 PM Peter Geoghegan wrote: > This is the kind of remark that makes me think that you don't get it. > > The most influential OLTP benchmark of all time is TPC-C, which has > exactly this problem. In spades -- it's enormously disruptive. Which > is one reason why I used it as a showcase for a lot of this work. Plus > practical experience (like the Heroku database in the blog post I > linked to) fully agrees with that benchmark, as far as this stuff goes > -- that was also a busy OLTP database. > > Online transaction processing involves transactions. Right? There is presumably > some kind of ledger, some kind of orders table. Naturally these have > entries that age out fairly predictably. After a while, almost all the > data is cold data. It is usually about that simple. > > One of the key strengths of systems like Postgres is the ability to > inexpensively store a relatively large amount of data that has just > about zero chance of being read, let alone modified. While at the same > time having decent OLTP performance for the hot data. Not nearly as > good as an in-memory system, mind you -- and yet in-memory systems > remain largely a niche thing. I think it's interesting that TPC-C suffers from the kind of problem that your patch was intended to address. I hadn't considered that. But I do not think it detracts from the basic point I was making, which is that you need to think about the downsides of your patch, not just the upsides. If you want to argue that there is *no* OLTP workload that will be harmed by freezing as aggressively as possible, then that would be a good argument in favor of your patch, because it would be arguing that the downside simply doesn't exist, at least for OLTP workloads. The fact that you can think of *one particular* OLTP workload that can benefit from the patch is just doubling down on the "my patch has an upside" argument, which literally no one is disputing. I don't think you can make such an argument stick, though.
OLTP workloads come in all shapes and sizes. It's pretty common to have tables where the application inserts a bunch of data, updates it over and over again, truncates the table, and starts over. In such a case, aggressive freezing has to be a loss, because no freezing is ever needed. It's also surprisingly common to have tables where a bunch of data is inserted and then, after a bit of processing, a bunch of rows are updated exactly once, after which the data is not modified any further. In those kinds of cases, aggressive freezing is a great idea if it happens after that round of updates but a poor idea if it happens before that round of updates. It's also pretty common to have cases where portions of the table become very hot, get a lot of updates for a while, and then that part of the table becomes cool and some other part of the table becomes very hot for a while. I think it's possible that aggressive freezing might do OK in such environments, actually. It will be a negative if we aggressively freeze the part of the table that's currently hot, but I think typically tables that have this access pattern are quite big, so VACUUM isn't going to sweep through the table all that often. It will probably freeze a lot more data-that-was-hot-a-bit-ago than it will freeze data-that-is-hot-this-very-minute. Then again, maybe that would happen without the patch, too. Maybe this kind of case is a wash for your patch? I don't know. Whatever you think of these examples, I don't see how it can be right to suppose that *in general* freezing very aggressively has no downsides. If that were true, then we probably wouldn't have vacuum_freeze_min_age at all. We would always just freeze everything ASAP. I mean, you could theorize that whoever invented that GUC is an idiot and that they had absolutely no good reason for introducing it, but that seems pretty ridiculous.
Someone put guards against overly-aggressive freezing into the system *for a reason* and if you just go rip them all out, you're going to reintroduce the problems against which they were intended to guard. -- Robert Haas EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
Hi, On 2023-01-27 00:51:59 -0800, Andres Freund wrote: > One use-case where the eager strategy is particularly useful is > [nearly-]append-only tables - and it's also the one workload that's reasonably > easy to detect using stats. Maybe something like > (dead_tuples_since_last_vacuum / inserts_since_last_vacuum) < 0.05 > or so. > > That'll definitely leave out loads of workloads where eager freezing would be > useful - but are there semi-reasonable workloads where it'll hurt badly? I > don't *think* so. That 0.05 could be a GUC + relopt combo, which'd allow users to opt in tables with known usage pattern into always using eager freezing. Greetings, Andres Freund
Re: New strategies for freezing, advancing relfrozenxid early
Hi, On 2023-01-26 23:11:41 -0800, Peter Geoghegan wrote: > > Essentially the "any fpi" logic is a very coarse grained way of using the > > page > > LSN as a measurement. As I said, I don't think "has a checkpoint occurred > > since the last write" is a good metric to avoid unnecessary freezing - it's > > too coarse. But I think using the LSN is the right thought. What about > > something like > > > > lsn_threshold = insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1 > > if (/* other conds */ && PageGetLSN(page) <= lsn_threshold) > > FreezeMe(); > > > > I probably got some details wrong, what I am going for with lsn_threshold is > > that we'd freeze an already dirty page if it's not been updated within 10% > > of > > the LSN distance to the last VACUUM. > > It seems to me that you're reinventing something akin to eager > freezing strategy here. At least that's how I define it, since now > you're bringing the high level context into it; what happens with the > table, with VACUUM operations, and so on. Obviously this requires > tracking the metadata that you suppose will be available in some way > or other, in particular things like lsn_of_last_vacuum. I agree with bringing high-level context into the decision about whether to freeze aggressively - my problem with the eager freezing strategy patch isn't that it did that too much, it's that it didn't do it enough. But I also don't think what I describe above is really comparable to "table level" eager freezing though - the potential worst case overhead is a small fraction of the WAL volume, and there's zero increase in data write volume. I suspect the absolute worst case of "always freeze dirty pages" is when a single tuple on the page gets updated immediately after every time we freeze the page - a single tuple is where the freeze record is the least space efficient. The smallest update is about the same size as the smallest freeze record.
For that to amount to a large WAL increase you'd need a crazy rate of such updates interspersed with vacuums. In slightly more realistic cases (i.e. not columnless tuples that constantly get updated and freezing happening all the time) you end up with a reasonably small WAL rate overhead. That worst case of "freeze dirty" is bad enough to spend some brain and compute cycles to prevent. But if we don't always get it right in some workload, it's not *awful*. The worst case of the "eager freeze strategy" is a lot larger - it's probably something like updating one narrow tuple on every page, once per checkpoint, so that each freeze generates an FPI. I think that results in a max overhead of 2x for data writes, and about 150x for WAL volume (the ratio of an FPI to one small update record). Obviously that's a pointless workload, but I do think that analyzing the "outer boundaries" of the regression something can cause, can be helpful. I think one way forward with the eager strategy approach would be to have a very narrow gating condition for now, and then incrementally expand it in later releases. One use-case where the eager strategy is particularly useful is [nearly-]append-only tables - and it's also the one workload that's reasonably easy to detect using stats. Maybe something like (dead_tuples_since_last_vacuum / inserts_since_last_vacuum) < 0.05 or so. That'll definitely leave out loads of workloads where eager freezing would be useful - but are there semi-reasonable workloads where it'll hurt badly? I don't *think* so. > What about unlogged/temporary tables? The obvious thing to do there is > what I did in the patch that was reverted (freeze whenever the page > will thereby become all-frozen), and forget about LSNs. But you have > already objected to that part, specifically. My main concern about that is the data write amplification it could cause when the page is clean when we start freezing.
But I can't see a large potential downside to always freezing unlogged/temp tables when the page is already dirty. > BTW, you still haven't changed the fact that you get rather different > behavior with checksums/wal_log_hints. I think that that's good, but > you didn't seem to. I think that, if we had something like the recency test I was talking about, we could afford to always freeze when the page is already dirty and not very recently modified. I.e. not even insist on a WAL record having been generated during pruning/HTSV. But I need to think through the dangers of that more. Greetings, Andres Freund
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 9:58 PM Andres Freund wrote: > It doesn't seem like a great proxy to me. ISTM that this means that how > aggressive vacuum is about opportunistically freezing pages depends on config > variables like checkpoint_timeout & max_wal_size (less common opportunistic > freezing), full_page_writes & use of unlogged tables (no opportunistic > freezing), and the largely random scheduling of autovac workers. The FPI thing was originally supposed to complement the freezing strategies stuff, and possibly other rules that live in lazy_scan_prune. Obviously you can freeze a page by following any rule that you care to invent -- you can decide by calling random(). Two rules can coexist during the same VACUUM (actually, they do already). > Essentially the "any fpi" logic is a very coarse grained way of using the page > LSN as a measurement. As I said, I don't think "has a checkpoint occurred > since the last write" is a good metric to avoid unnecessary freezing - it's > too coarse. But I think using the LSN is the right thought. What about > something like > > lsn_threshold = insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1 > if (/* other conds */ && PageGetLSN(page) <= lsn_threshold) > FreezeMe(); > > I probably got some details wrong, what I am going for with lsn_threshold is > that we'd freeze an already dirty page if it's not been updated within 10% of > the LSN distance to the last VACUUM. It seems to me that you're reinventing something akin to eager freezing strategy here. At least that's how I define it, since now you're bringing the high level context into it; what happens with the table, with VACUUM operations, and so on. Obviously this requires tracking the metadata that you suppose will be available in some way or other, in particular things like lsn_of_last_vacuum. What about unlogged/temporary tables? 
The obvious thing to do there is what I did in the patch that was reverted (freeze whenever the page will thereby become all-frozen), and forget about LSNs. But you have already objected to that part, specifically. BTW, you still haven't changed the fact that you get rather different behavior with checksums/wal_log_hints. I think that that's good, but you didn't seem to. > I don't think the speculation is that fundamentally different - a heavily > updated table with a bit of a historic, non-changing portion, makes > vacuum_freeze_strategy_threshold freeze way more aggressively than either "any > record" or "any fpi". That's true. The point I was making is that both this proposal and eager freezing are based on some kind of high level picture of the needs of the table, based on high level metadata. To me that's the defining characteristic. > > And even when we lose, you generally still won't have been completely > > wrong -- even then there generally will indeed be a second FPI later > > on for the same page, to go with everything else. This makes the > > wasted freezing even less significant, on a comparative basis! > > This is precisely why I think that we can afford to be quite aggressive about > freezing already dirty pages... I'm beginning to warm to this idea, now that I understand it a little better. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
Hi, On 2023-01-26 19:01:03 -0800, Peter Geoghegan wrote: > On Thu, Jan 26, 2023 at 6:37 PM Andres Freund wrote: > > I also don't really see how that is responsive to anything else in my > > email. That's just as true for the current gating condition (the issuance of > > an FPI during heap_page_prune() / HTSV()). > > > > What I was wondering about is whether we should replace the > > fpi_before != pgWalUsage.wal_fpi > > with > > records_before != pgWalUsage.wal_records && !WouldIssueFpi(page) > > I understand that. What I'm saying is that that's going to create a > huge problem of its own, unless you separately account for that > problem. > The simplest and obvious example is something like a pgbench_tellers > table. VACUUM will generally run fast enough relative to the workload > that it will set some of those pages all-visible. Now it's going to > freeze them, too. Arguably it shouldn't even be setting the pages > all-visible, but now you make that existing problem much worse. So the benefit of the FPI condition is that it indicates that the page hasn't been updated all that recently, because, after all, a checkpoint has happened since? If that's the intention, it needs a huge honking comment - at least I can't read that out of: Also freeze when pruning generated an FPI, if doing so means that we set the page all-frozen afterwards (might not happen until final heap pass). It doesn't seem like a great proxy to me. ISTM that this means that how aggressive vacuum is about opportunistically freezing pages depends on config variables like checkpoint_timeout & max_wal_size (less common opportunistic freezing), full_page_writes & use of unlogged tables (no opportunistic freezing), and the largely random scheduling of autovac workers. I can see it making a difference for pgbench_tellers, but it's a pretty small difference in overall WAL volume. I can think of more adverse workloads though - but even there the difference seems not huge, and not predictably reached. 
Due to the freeze plan stuff you added, the amount of WAL for freezing a page is pretty darn small compared to the amount of WAL needed to fill a page with non-frozen tuples. That's not to say we shouldn't reduce the risk - I agree that both the "any fpi" and the "any record" condition can have adverse effects! However, an already dirty page getting frozen is also the one case where freezing won't have a meaningful write amplification effect. So I think it's worth spending some effort figuring out how we can make freezing in that situation have unlikely and small downsides. The cases with downsides are tables that are very heavily updated throughout, where the page is going to be defrosted again almost immediately. As you say, the all-visible marking has a similar problem. Essentially the "any fpi" logic is a very coarse grained way of using the page LSN as a measurement. As I said, I don't think "has a checkpoint occurred since the last write" is a good metric to avoid unnecessary freezing - it's too coarse. But I think using the LSN is the right thought. What about something like lsn_threshold = insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1 if (/* other conds */ && PageGetLSN(page) <= lsn_threshold) FreezeMe(); I probably got some details wrong, what I am going for with lsn_threshold is that we'd freeze an already dirty page if it's not been updated within 10% of the LSN distance to the last VACUUM. > The important point is that there doesn't seem to be any good way > around thinking about the table as a whole if you're going to freeze > speculatively. This is not the same dynamic as we see with the FPI > thing IMV -- that's not nearly so speculative as what you're talking > about, since it is speculative in roughly the same sense that eager > freezing was speculative (hence the suggestion that something like > vacuum_freeze_strategy_threshold could have a role to play).
I don't think the speculation is that fundamentally different - a heavily updated table with a bit of a historic, non-changing portion, makes vacuum_freeze_strategy_threshold freeze way more aggressively than either "any record" or "any fpi". > The FPI thing is mostly about the cost now versus the cost later on. > You're gambling that you won't get another FPI later on if you freeze > now. But the cost of a second FPI later on is so much higher than the > added cost of freezing now that that's a very favorable bet, that we > can afford to "lose" many times while still coming out ahead overall. Agreed. And not just avoiding FPIs, avoiding another dirtying of the page! The latter part is especially huge IMO. Depending on s_b size it can also avoid another *read* of the page... > And even when we lose, you generally still won't have been completely > wrong -- even then there generally will indeed be a second FPI later > on for the same page, to go with everything else. This makes the > wasted freezing even
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 6:37 PM Andres Freund wrote:
> I also don't really see how that is responsive to anything else in my
> email. That's just as true for the current gating condition (the issuance of
> an FPI during heap_page_prune() / HTSV()).
>
> What I was wondering about is whether we should replace the
>   fpi_before != pgWalUsage.wal_fpi
> with
>   records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)

I understand that. What I'm saying is that that's going to create a huge problem of its own, unless you separately account for that problem. The simplest and obvious example is something like a pgbench_tellers table. VACUUM will generally run fast enough relative to the workload that it will set some of those pages all-visible. Now it's going to freeze them, too. Arguably it shouldn't even be setting the pages all-visible, but now you make that existing problem much worse.

The important point is that there doesn't seem to be any good way around thinking about the table as a whole if you're going to freeze speculatively. This is not the same dynamic as we see with the FPI thing IMV -- that's not nearly so speculative as what you're talking about, since it is speculative in roughly the same sense that eager freezing was speculative (hence the suggestion that something like vacuum_freeze_strategy_threshold could have a role to play).

The FPI thing is mostly about the cost now versus the cost later on. You're gambling that you won't get another FPI later on if you freeze now. But the cost of a second FPI later on is so much higher than the added cost of freezing now that that's a very favorable bet, that we can afford to "lose" many times while still coming out ahead overall.

And even when we lose, you generally still won't have been completely wrong -- even then there generally will indeed be a second FPI later on for the same page, to go with everything else. This makes the wasted freezing even less significant, on a comparative basis!
It's also likely true that an FPI in lazy_scan_prune is a much stronger signal, but I think that the important dynamic is that we're reasoning about "costs now vs costs later on". The asymmetry is really important. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
Hi, On 2023-01-26 15:36:52 -0800, Peter Geoghegan wrote: > On Thu, Jan 26, 2023 at 12:45 PM Andres Freund wrote: > > > Most of the overhead of FREEZE WAL records (with freeze plan > > > deduplication and page-level freezing in) is generic WAL record header > > > overhead. Your recent adversarial test case is going to choke on that, > > > too. At least if you set checkpoint_timeout to 1 minute again. > > > > I don't quite follow. What do you mean with "record header overhead"? Unless > > that includes FPIs, I don't think that's that commonly true? > > Even if there are no directly observable FPIs, there is still extra > WAL, which can cause FPIs indirectly, just by making checkpoints more > frequent. I feel ridiculous even having to explain this to you. What does that have to do with "generic WAL record overhead"? I also don't really see how that is responsive to anything else in my email. That's just as true for the current gating condition (the issuance of an FPI during heap_page_prune() / HTSV()). What I was wondering about is whether we should replace the fpi_before != pgWalUsage.wal_fpi with records_before != pgWalUsage.wal_records && !WouldIssueFpi(page) > > The problematic case I am talking about is when we do *not* emit a WAL > > record > > during pruning (because there's nothing to prune), but want to freeze the > > table. If you don't log an FPI, the remaining big overhead is that > > increasing > > the LSN on the page will often cause an XLogFlush() when writing out the > > buffer. > > > > I don't see what your reference to checkpoint timeout is about here? > > > > Also, as I mentioned before, the problem isn't specific to > > checkpoint_timeout > > = 1min. It just makes it cheaper to reproduce. > > That's flagrantly intellectually dishonest. Sure, it made it easier to > reproduce. But that's not all it did! > > You had *lots* of specific numbers and technical details in your first > email, such as "Time for vacuuming goes up to ~5x. 
WAL volume to
> ~9x.". But you did not feel that it was worth bothering with details
> like having set checkpoint_timeout to 1 minute, which is a setting
> that nobody uses, and obviously had a multiplicative effect. That
> detail was unimportant. I had to drag it out of you!

The multiples were for checkpoint_timeout=5min, with WHERE ts < now() - '250s' instead of WHERE ts < now() - '120s'.

I started out with checkpoint_timeout=1min, as I wanted to quickly test my theory. Then I increased checkpoint_timeout back to 5min, adjusted the query to some randomly guessed value. Happened to get nearly the same results. I then experimented more with '1min', because it's less annoying to have to wait for 120s until deletions start than to wait for 250s. Because it's quicker to run, I thought I'd share the less resource intensive version. A mistake, as I now realize.

This wasn't intended as a carefully designed benchmark, or anything. It was a quick proof for a problem that I found obvious. And it's not something worth testing carefully - e.g. the constants in the test are actually quite hardware specific, because the inserts/second rate is very machine specific, and it's completely unnecessarily hardware intensive due to the use of single-row inserts, instead of batched operations. It's just a POC.

> You basically found a way to add WAL overhead to a system/workload
> that is already in a write amplification vicious cycle, with latent
> tipping point type behavior.
>
> There is a practical point here, that is equally obvious, and yet
> somehow still needs to be said: benchmarks like that one are basically
> completely free of useful information. If we can't agree on how to
> assess such things in general, then what can we agree on when it comes
> to what should be done about it, what trade-off to make, when it comes
> to any similar question?

It's not at all free of useful information.
It reproduces a problem I predicted repeatedly, that others in the discussion also wondered about, that you refused to acknowledge or address.

It's not a good benchmark - I completely agree with that much. It was not designed to carefully benchmark different settings or such. It was designed to show a problem. And it does that.

> > You're right, it makes sense to consider whether we'll emit a
> > XLOG_HEAP2_VISIBLE anyway.
>
> As written the page-level freezing FPI mechanism probably doesn't
> really stand to benefit much from doing that. Either checksums are
> disabled and it's just a hint, or they're enabled and there is a very
> high chance that we'll get an FPI inside lazy_scan_prune rather than
> right after it is called, when PD_ALL_VISIBLE is set.

I think it might be useful with logged hint bits - consider cases where all the tuples on the page were already fully hinted. That's not uncommon, I think?

> > > > A less aggressive version would be to check if any WAL records were emitted
> > > > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI
> > > > if we modified the page again.
Re: New strategies for freezing, advancing relfrozenxid early
Hi,

On 2023-01-26 14:27:53 -0500, Robert Haas wrote:
> One idea that I've had about how to solve this problem is to try to
> make vacuum try to aggressively freeze some portion of the table on
> each pass, and to behave less aggressively on the rest of the table so
> that, hopefully, no single vacuum does too much work.

I agree that this rough direction is worthwhile to pursue.

> Unfortunately, I don't really know how to do that effectively. If we knew
> that the table was going to see 10 vacuums before we hit
> autovacuum_freeze_max_age, we could try to have each one do 10% of the
> amount of freezing that was going to need to be done rather than letting any
> single vacuum do all of it, but we don't have that sort of information.

I think, quite fundamentally, it's not possible to bound the amount of work an anti-wraparound vacuum has to do if we don't have an age based autovacuum trigger kicking in before autovacuum_freeze_max_age. After all, there might be no autovacuum at all before autovacuum_freeze_max_age is reached. But there's just no reason not to have a trigger below autovacuum_freeze_max_age. That's why I think Peter's patch to split age and anti-"auto-cancel" autovacuums is a strictly necessary change if we want to make autovacuum fundamentally suck less. There's a few boring details to figure out how to set/compute those limits, but I don't think there's anything fundamentally hard.

I think we also need the number of all-frozen pages in pg_class if we want to make better scheduling decisions. As we already compute the number of all-visible pages at the end of vacuuming, we can compute the number of all-frozen pages as well. The space for another integer in pg_class doesn't bother me one bit.

Let's say we had an autovacuum_vacuum_age trigger of 100m, and autovacuum_freeze_max_age=500m.
We know that we're roughly going to be vacuuming 5 times before reaching autovacuum_freeze_max_age (very slow autovacuums are an issue, but if one autovacuum takes 100m+ xids long, there's not much we can do). With that we could determine the eager percentage along the lines of:

    frozen_target = Min(age(relfrozenxid), autovacuum_freeze_max_age) / autovacuum_freeze_max_age
    eager_percentage = Max(0, frozen_target * relpages - pg_class.relallfrozen)

One thing I don't know fully how to handle is how to ensure that we try to freeze a different part of the table each vacuum. I guess we could store a page number in pgstats?

This would help address the "cliff" issue of reaching autovacuum_freeze_max_age. What it would *not* address, on its own, is the number of times we rewrite pages.

I can guess at a few ways to heuristically identify when tables are "append mostly" from vacuum's view (a table can be update heavy, but very localized to recent rows, and still be append mostly from vacuum's view). There's obvious cases, e.g. when there are way more inserts than dead rows. But other cases are harder.

> Also, even if we did have that sort of information, the idea only works if
> the pages that we freeze sooner are ones that we're not about to update or
> delete again, and we don't have any idea what is likely there.

Perhaps we could use something like

    (age(relfrozenxid) - age(newest_xid_on_page)) / age(relfrozenxid)

as a heuristic?

I have a gut feeling that we should somehow collect/use statistics about the number of frozen pages, marked as such by the last (or recent?) vacuum, that had to be "defrosted" by backends. But I don't quite know how to do that yet. I think we could collect statistics about that by storing the LSN of the last vacuum in the shared stats, and incrementing that counter when defrosting.

A lot of things like that would work a whole lot better if we had statistics that take older data into account, but weigh it less than more recent data.
But that's hard/expensive to collect. Greetings, Andres Freund
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 12:45 PM Andres Freund wrote: > > Most of the overhead of FREEZE WAL records (with freeze plan > > deduplication and page-level freezing in) is generic WAL record header > > overhead. Your recent adversarial test case is going to choke on that, > > too. At least if you set checkpoint_timeout to 1 minute again. > > I don't quite follow. What do you mean with "record header overhead"? Unless > that includes FPIs, I don't think that's that commonly true? Even if there are no directly observable FPIs, there is still extra WAL, which can cause FPIs indirectly, just by making checkpoints more frequent. I feel ridiculous even having to explain this to you. > The problematic case I am talking about is when we do *not* emit a WAL record > during pruning (because there's nothing to prune), but want to freeze the > table. If you don't log an FPI, the remaining big overhead is that increasing > the LSN on the page will often cause an XLogFlush() when writing out the > buffer. > > I don't see what your reference to checkpoint timeout is about here? > > Also, as I mentioned before, the problem isn't specific to checkpoint_timeout > = 1min. It just makes it cheaper to reproduce. That's flagrantly intellectually dishonest. Sure, it made it easier to reproduce. But that's not all it did! You had *lots* of specific numbers and technical details in your first email, such as "Time for vacuuming goes up to ~5x. WAL volume to ~9x.". But you did not feel that it was worth bothering with details like having set checkpoint_timeout to 1 minute, which is a setting that nobody uses, and obviously had a multiplicative effect. That detail was unimportant. I had to drag it out of you! You basically found a way to add WAL overhead to a system/workload that is already in a write amplification vicious cycle, with latent tipping point type behavior. 
There is a practical point here, that is equally obvious, and yet somehow still needs to be said: benchmarks like that one are basically completely free of useful information. If we can't agree on how to assess such things in general, then what can we agree on when it comes to what should be done about it, what trade-off to make, when it comes to any similar question? > > In many cases we'll have to dirty the page anyway, just to set > > PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether > > triggered by an FPI or triggered by my now-reverted GUC) on being able > > to set the whole page all-frozen in the VM. > > IIRC setting PD_ALL_VISIBLE doesn't trigger an FPI unless we need to log hint > bits. But freezing does trigger one even without wal_log_hint_bits. That is correct. > You're right, it makes sense to consider whether we'll emit a > XLOG_HEAP2_VISIBLE anyway. As written the page-level freezing FPI mechanism probably doesn't really stand to benefit much from doing that. Either checksums are disabled and it's just a hint, or they're enabled and there is a very high chance that we'll get an FPI inside lazy_scan_prune rather than right after it is called, when PD_ALL_VISIBLE is set. That's not perfect, of course, but it doesn't have to be. Perhaps it should still be improved, just on general principle. > > > A less aggressive version would be to check if any WAL records were > > > emitted > > > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI > > > if we > > > modified the page again. Similar to what we do now, except not requiring > > > an > > > FPI to have been emitted. > > > > Also way more aggressive. Not nearly enough on its own. > > In which cases will it be problematically more aggressive? > > If we emitted a WAL record during pruning we've already set the LSN of the > page to a very recent LSN. We know the page is dirty. Thus we'll already > trigger an XLogFlush() during ringbuffer replacement. 
We won't emit an FPI.

You seem to be talking about this as if the only thing that could matter is the immediate FPI -- the first order effects -- and not any second order effects. You certainly didn't get to 9x extra WAL overhead by controlling for that before. Should I take it that you've decided to assess these things more sensibly now? Out of curiosity: why the change of heart?

> > > But to me it seems a bit odd that VACUUM now is more aggressive if
> > > checksums / wal_log_hint bits is on, than without them. Which I think
> > > is how using either of pgWalUsage.wal_fpi, pgWalUsage.wal_records
> > > ends up working?
> >
> > Which part is the odd part? Is it odd that page-level freezing works
> > that way, or is it odd that page-level checksums work that way?
>
> That page-level freezing works that way.

I think that it will probably cause a little confusion, and should be specifically documented. But other than that, it seems reasonable enough to me. I mean, should I not do something that's going to be of significant help to users with checksums, just because it'll be somewhat confusing to a small minority?
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 1:22 PM Robert Haas wrote:
> On Thu, Jan 26, 2023 at 4:06 PM Peter Geoghegan wrote:
> > There is very good reason to believe that the large majority of all
> > data that people store in a system like Postgres is extremely cold
> > data:
>
> The systems where I end up troubleshooting problems seem to be, most
> typically, busy OLTP systems. I'm not in a position to say whether
> that's more or less common than systems with extremely cold data, but
> I am in a position to say that my employer will have a lot fewer happy
> customers if we regress that use case. Naturally I'm keen to avoid
> that.

This is the kind of remark that makes me think that you don't get it.

The most influential OLTP benchmark of all time is TPC-C, which has exactly this problem. In spades -- it's enormously disruptive. Which is one reason why I used it as a showcase for a lot of this work. Plus practical experience (like the Heroku database in the blog post I linked to) fully agrees with that benchmark, as far as this stuff goes -- that was also a busy OLTP database.

Online transaction processing involves transactions. Right? There is presumably some kind of ledger, some kind of orders table. Naturally these have entries that age out fairly predictably. After a while, almost all the data is cold data. It is usually about that simple.

One of the key strengths of systems like Postgres is the ability to inexpensively store a relatively large amount of data that has just about zero chance of being read, let alone modified. While at the same time having decent OLTP performance for the hot data. Not nearly as good as an in-memory system, mind you -- and yet in-memory systems remain largely a niche thing.

-- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 4:06 PM Peter Geoghegan wrote: > There is very good reason to believe that the large majority of all > data that people store in a system like Postgres is extremely cold > data: The systems where I end up troubleshooting problems seem to be, most typically, busy OLTP systems. I'm not in a position to say whether that's more or less common than systems with extremely cold data, but I am in a position to say that my employer will have a lot fewer happy customers if we regress that use case. Naturally I'm keen to avoid that. > Having a separate aggressive step that rewrites an entire large table, > apparently at random, is just a huge burden to users. You've said that > you agree that it sucks, but somehow I still can't shake the feeling > that you don't fully understand just how much it sucks. Ha! Well, that's possible. But maybe you don't understand how much your patch makes other things suck. I don't think we can really get anywhere here by postulating that the problem is the other person's lack of understanding, even if such a postulate should happen to be correct. -- Robert Haas EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 12:54 PM Robert Haas wrote: > > The overwhelming cost is usually FPIs in any case. If you're not > > mostly focussing on that, you're focussing on the wrong thing. At > > least with larger tables. You just have to focus on the picture over > > time, across multiple VACUUM operations. > > I think that's all mostly true, but the cases where being more > aggressive can cause *extra* FPIs are worthy of just as much attention > as the cases where we can reduce them. It's a question of our exposure to real problems, in no small part. What can we afford to be wrong about? What problem can be fixed by the user more or less as it emerges, and what problem doesn't have that quality? There is very good reason to believe that the large majority of all data that people store in a system like Postgres is extremely cold data: https://www.microsoft.com/en-us/research/video/cost-performance-in-modern-data-stores-how-data-cashing-systems-succeed/ https://brandur.org/fragments/events Having a separate aggressive step that rewrites an entire large table, apparently at random, is just a huge burden to users. You've said that you agree that it sucks, but somehow I still can't shake the feeling that you don't fully understand just how much it sucks. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
Hi,

On 2023-01-26 20:26:00 +0100, Matthias van de Meent wrote:
> Could someone explain to me why we don't currently (optionally)
> include the functionality of page freezing in the PRUNE records?

I think we definitely should (and have argued for it a couple of times). It's not just about reducing WAL overhead, it's also about reducing redundant visibility checks - which are where a very significant portion of the CPU time for VACUUMing goes. Besides performance considerations, it's also just plain weird that lazy_scan_prune() can end up with a different visibility than heap_page_prune() (mostly due to concurrent aborts).

The number of WAL records we often end up emitting for processing a single page in vacuum is just plain absurd:
- PRUNE
- FREEZE_PAGE
- VISIBLE

There's afaict no justification whatsoever for these to be separate records.

> I think they're quite closely related (in that they both execute in VACUUM
> and are required for long-term system stability), and are even more related
> now that we have opportunistic page-level freezing. I think adding a "freeze
> this page as well"-flag in PRUNE records would go a long way to reducing the
> WAL overhead of aggressive and more opportunistic freezing.

Yep.

I think we should also seriously consider setting all-visible during on-access pruning, and freezing rows during on-access pruning.

Greetings,

Andres Freund
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 2:57 PM Peter Geoghegan wrote: > Relatively difficult for Andres, or for somebody else? What are the > real parameters here? Obviously there are no clear answers available. Andres is certainly smarter than the average guy, but practically any scenario that someone can create in a few lines of SQL is something to which code will be exposed to on some real-world system. If Andres came along and said, hey, well I found a way to make this patch suck, and proceeded to describe a scenario that involved a complex set of tables and multiple workloads running simultaneously and using a debugger to trigger some race condition and whatever, I'd be like "OK, but is that really going to happen?". The actual scenario he came up with is three lines of SQL, and it's nothing remotely obscure. That kind of thing is going to happen *all the time*. > The overwhelming cost is usually FPIs in any case. If you're not > mostly focussing on that, you're focussing on the wrong thing. At > least with larger tables. You just have to focus on the picture over > time, across multiple VACUUM operations. I think that's all mostly true, but the cases where being more aggressive can cause *extra* FPIs are worthy of just as much attention as the cases where we can reduce them. -- Robert Haas EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
Hi,

On 2023-01-26 10:44:45 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 9:53 AM Andres Freund wrote:
> > > That's going to be very significantly more aggressive. For example
> > > it'll impact small tables very differently.
> >
> > Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
> > relatively small, with one important exception below, if we are 99.99% sure
> > that it's not going to require an FPI and isn't going to dirty the page.
> >
> > The exception is that a newer LSN on the page can cause the ringbuffer
> > replacement to trigger more aggressive WAL flushing. No meaningful
> > difference if we modified the page during pruning, or if the page was already
> > in s_b (since it likely won't be written out via the ringbuffer in that case),
> > but if checksums are off and we just hint-dirtied the page, it could be a
> > significant issue.
>
> Most of the overhead of FREEZE WAL records (with freeze plan
> deduplication and page-level freezing in) is generic WAL record header
> overhead. Your recent adversarial test case is going to choke on that,
> too. At least if you set checkpoint_timeout to 1 minute again.

I don't quite follow. What do you mean with "record header overhead"? Unless that includes FPIs, I don't think that's that commonly true?

The problematic case I am talking about is when we do *not* emit a WAL record during pruning (because there's nothing to prune), but want to freeze the table. If you don't log an FPI, the remaining big overhead is that increasing the LSN on the page will often cause an XLogFlush() when writing out the buffer.

I don't see what your reference to checkpoint timeout is about here?

Also, as I mentioned before, the problem isn't specific to checkpoint_timeout = 1min. It just makes it cheaper to reproduce.
> > Thus a modification of the above logic could be to opportunistically freeze if
> > a) it won't cause an FPI and either
> > b1) the page was already dirty before pruning, as we'll not do a ringbuffer
> > replacement in that case
> > or
> > b2) We wrote a WAL record during pruning, as the difference in flush position
> > is marginal
> >
> > An even more aggressive version would be to replace b1) with logic that'd
> > allow newly dirtying the page if it wasn't read through the ringbuffer. But
> > newly dirtying the page feels like it'd be more dangerous.
>
> In many cases we'll have to dirty the page anyway, just to set
> PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether
> triggered by an FPI or triggered by my now-reverted GUC) on being able
> to set the whole page all-frozen in the VM.

IIRC setting PD_ALL_VISIBLE doesn't trigger an FPI unless we need to log hint bits. But freezing does trigger one even without wal_log_hint_bits.

You're right, it makes sense to consider whether we'll emit a XLOG_HEAP2_VISIBLE anyway.

> > A less aggressive version would be to check if any WAL records were emitted
> > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
> > modified the page again. Similar to what we do now, except not requiring an
> > FPI to have been emitted.
>
> Also way more aggressive. Not nearly enough on its own.

In which cases will it be problematically more aggressive?

If we emitted a WAL record during pruning we've already set the LSN of the page to a very recent LSN. We know the page is dirty. Thus we'll already trigger an XLogFlush() during ringbuffer replacement. We won't emit an FPI.

> > But to me it seems a bit odd that VACUUM now is more aggressive if checksums /
> > wal_log_hint bits is on, than without them. Which I think is how using either
> > of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?
>
> Which part is the odd part?
Is it odd that page-level freezing works
> that way, or is it odd that page-level checksums work that way?

That page-level freezing works that way.

> In any case this seems like an odd thing for you to say, having
> eviscerated a patch that really just made the same behavior trigger
> independently of FPIs in some tables, controlled via a GUC.

jdksjfkjdlkajsd;lfkjasd;lkfj;alskdfj

That behaviour I criticized was causing a torrent of FPIs and additional dirtying of pages. My proposed replacement for the current FPI check doesn't, because a) it only triggers when we wrote a WAL record, and b) it doesn't trigger if we would write an FPI.

Greetings,

Andres Freund
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 11:26 AM Matthias van de Meent wrote: > Could someone explain to me why we don't currently (optionally) > include the functionality of page freezing in the PRUNE records? I > think they're quite closely related (in that they both execute in > VACUUM and are required for long-term system stability), and are even > more related now that we have opportunistic page-level freezing. I > think adding a "freeze this page as well"-flag in PRUNE records would > go a long way to reducing the WAL overhead of aggressive and more > opportunistic freezing. Yeah, we've talked about doing that in the past year. It's quite possible. It would make quite a lot of sense, because the actual overhead of the WAL record for freezing tends to come from the generic WAL record header stuff itself. If there was only one record for both, then you'd only need to include the relfilenode and block number (and so on) once. It would be tricky to handle Multis, so what you'd probably do is just freezing xmin, and possibly aborted and locker XIDs in xmax. So you wouldn't completely get rid of the main freeze record, but would be able to avoid it in many important cases. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 11:28 AM Robert Haas wrote: > I think it's pretty much impossible to freeze more aggressively > without losing in some scenario or other. If waiting longer to freeze > would have resulted in the data getting updated again or deleted > before we froze it, then waiting longer reduces the total amount of > freezing work that ever has to be done. Freezing more aggressively > inevitably gives up some amount of that potential benefit in order to > try to secure some other benefit. It's a trade-off. There is no question about that. > I think that the goal of a patch that makes vacuum more (or less) > aggressive should be to make the cases where we lose as obscure as > possible, and the cases where we win as broad as possible. I think > that, in order to be a good patch, it needs to be relatively difficult > to find cases where we incur a big loss. If it's easy to find a big > loss, then I think it's better to stick with the current behavior, > even if it's also easy to find a big gain. Again, this seems totally uncontroversial. It's just incredibly vague, and not at all actionable. Relatively difficult for Andres, or for somebody else? What are the real parameters here? Obviously there are no clear answers available. > However, I'm also not > prepared to go all the way to the other end of the spectrum and say > that all of your ideas and everything in this patch are great. I don't > think either of those things, either. It doesn't matter. I'm done with it. This is not a negotiation about what gets in and what doesn't get in. All that I aim to do now is to draw some kind of line under the basic page-level freezing work, since of course I'm still responsible for that. And perhaps to defend my personal reputation. 
> I certainly think that freezing more aggressively in some scenarios
> could be a great idea, but it seems like the patch's theory is to be
> very nearly maximally aggressive in every vacuum run if the table size
> is greater than some threshold, and I don't think that's right at all.

We'll systematically avoid accumulating debt past a certain point -- that's its purpose. That is, we'll avoid accumulating all-visible pages that eventually need to be frozen.

> I'm not exactly sure what information we should use to decide how
> aggressive to be, but I am pretty sure that the size of the table is
> not it. It's true that, for a small table, the cost of having to
> eventually vacuum the whole table at once isn't going to be very high,
> whereas for a large table, it will be. That line of reasoning makes a
> size threshold sound reasonable. However, the amount of extra work
> that we can potentially do by vacuuming more aggressively *also*
> increases with the table size, which to me means that using that as a
> criterion actually isn't sensible at all.

The overwhelming cost is usually FPIs in any case. If you're not mostly focussing on that, you're focussing on the wrong thing. At least with larger tables. You just have to focus on the picture over time, across multiple VACUUM operations.

> One idea that I've had about how to solve this problem is to try to
> make vacuum try to aggressively freeze some portion of the table on
> each pass, and to behave less aggressively on the rest of the table so
> that, hopefully, no single vacuum does too much work. Unfortunately, I
> don't really know how to do that effectively.

That has been proposed a couple of times in the context of this thread. It won't work, because the way autovacuum works in general (and likely always will work) doesn't allow it. With an append-only table, each VACUUM will naturally have to scan significantly more pages than the last one, forever (barring antiwraparound vacuums). Why wouldn't it continue that way?
I mean it might not (the table might stop growing altogether), but then it doesn't matter much what we do. If you're not behaving very proactively at the level of each VACUUM operation, then the picture over time is that you're *already* falling behind. At least with an append-only table. You have to think of the sequence of operations, not just one. > In theory we could have some system that tracks how > recently each page range in a table has been modified, and direct our > freezing activity toward the ones less-recently modified on the theory > that they're not so likely to be modified again in the near future, > but in reality we have no such system. So I don't really feel like I > know what the right answer is here, yet. So we need to come up with a way of getting reliable information from the future, about an application that we have no particular understanding of. As opposed to just eating the cost to some degree, and making it configurable. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 11:35 AM Peter Geoghegan wrote:
> You complained about the descriptions being theoretical. But there's
> nothing theoretical about the fact that we more or less do *all*
> freezing in an eventual aggressive VACUUM in many important cases,
> including very simple cases like pgbench_history -- the simplest
> possible append-only table case. We'll merrily rewrite the entire
> table, all at once, for no good reason at all. Consistently, reliably.
> It's so incredibly obvious that this makes zero sense! And yet I don't
> think you've ever engaged with such basic points as that one.

I'm aware that that's a problem, and I agree that it sucks. I think that what this patch does is make vacuum freeze more aggressively, and I expect that would help this problem. I haven't said much about that because I don't think it's controversial. However, the patch also has a cost, and that's what I think is controversial.

I think it's pretty much impossible to freeze more aggressively without losing in some scenario or other. If waiting longer to freeze would have resulted in the data getting updated again or deleted before we froze it, then waiting longer reduces the total amount of freezing work that ever has to be done. Freezing more aggressively inevitably gives up some amount of that potential benefit in order to try to secure some other benefit. It's a trade-off.

I think that the goal of a patch that makes vacuum more (or less) aggressive should be to make the cases where we lose as obscure as possible, and the cases where we win as broad as possible. I think that, in order to be a good patch, it needs to be relatively difficult to find cases where we incur a big loss. If it's easy to find a big loss, then I think it's better to stick with the current behavior, even if it's also easy to find a big gain.
There's nothing wonderful about the current behavior, but (to paraphrase what I think Andres has already said several times) it's better to keep shipping code with the same bad behavior than to put out a new major release with behaviors that are just as bad, but different.

I feel like your emails sometimes seem to suppose that I think that you're a bad person, or a bad developer, or that you have no good ideas, or that you have no good ideas about this topic, or that this topic is not important, or that we don't need to do better than we are currently doing. I think none of those things. However, I'm also not prepared to go all the way to the other end of the spectrum and say that all of your ideas and everything in this patch are great. I don't think either of those things, either.

I certainly think that freezing more aggressively in some scenarios could be a great idea, but it seems like the patch's theory is to be very nearly maximally aggressive in every vacuum run if the table size is greater than some threshold, and I don't think that's right at all.

I'm not exactly sure what information we should use to decide how aggressive to be, but I am pretty sure that the size of the table is not it. It's true that, for a small table, the cost of having to eventually vacuum the whole table at once isn't going to be very high, whereas for a large table, it will be. That line of reasoning makes a size threshold sound reasonable. However, the amount of extra work that we can potentially do by vacuuming more aggressively *also* increases with the table size, which to me means using that as a criterion actually isn't sensible at all.

One idea that I've had about how to solve this problem is to try to make vacuum try to aggressively freeze some portion of the table on each pass, and to behave less aggressively on the rest of the table so that, hopefully, no single vacuum does too much work. Unfortunately, I don't really know how to do that effectively.
If we knew that the table was going to see 10 vacuums before we hit autovacuum_freeze_max_age, we could try to have each one do 10% of the amount of freezing that was going to need to be done rather than letting any single vacuum do all of it, but we don't have that sort of information. Also, even if we did have that sort of information, the idea only works if the pages that we freeze sooner are ones that we're not about to update or delete again, and we don't have any idea what is likely there. In theory we could have some system that tracks how recently each page range in a table has been modified, and direct our freezing activity toward the ones less-recently modified on the theory that they're not so likely to be modified again in the near future, but in reality we have no such system. So I don't really feel like I know what the right answer is here, yet. -- Robert Haas EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, 26 Jan 2023 at 19:45, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 9:53 AM Andres Freund wrote:
> > I assume the case you're thinking of is that pruning did *not* do any changes,
> > but in the process of figuring out that nothing needed to be pruned, we did a
> > MarkBufferDirtyHint(), and as part of that emitted an FPI?
>
> Yes.
>
> > > That's going to be very significantly more aggressive. For example
> > > it'll impact small tables very differently.
> >
> > Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
> > relatively small, with one important exception below, if we are 99.99% sure
> > that it's not going to require an FPI and isn't going to dirty the page.
> >
> > The exception is that a newer LSN on the page can cause the ringbuffer
> > replacement to trigger more aggressive WAL flushing. No meaningful
> > difference if we modified the page during pruning, or if the page was already
> > in s_b (since it likely won't be written out via the ringbuffer in that case),
> > but if checksums are off and we just hint-dirtied the page, it could be a
> > significant issue.
>
> Most of the overhead of FREEZE WAL records (with freeze plan
> deduplication and page-level freezing in) is generic WAL record header
> overhead. Your recent adversarial test case is going to choke on that,
> too. At least if you set checkpoint_timeout to 1 minute again.

Could someone explain to me why we don't currently (optionally) include the functionality of page freezing in the PRUNE records? I think they're quite closely related (in that they both execute in VACUUM and are required for long-term system stability), and are even more related now that we have opportunistic page-level freezing.

I think adding a "freeze this page as well"-flag in PRUNE records would go a long way to reducing the WAL overhead of aggressive and more opportunistic freezing.

-Matthias
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 9:53 AM Andres Freund wrote:
> I assume the case you're thinking of is that pruning did *not* do any changes,
> but in the process of figuring out that nothing needed to be pruned, we did a
> MarkBufferDirtyHint(), and as part of that emitted an FPI?

Yes.

> > That's going to be very significantly more aggressive. For example
> > it'll impact small tables very differently.
>
> Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
> relatively small, with one important exception below, if we are 99.99% sure
> that it's not going to require an FPI and isn't going to dirty the page.
>
> The exception is that a newer LSN on the page can cause the ringbuffer
> replacement to trigger more aggressive WAL flushing. No meaningful
> difference if we modified the page during pruning, or if the page was already
> in s_b (since it likely won't be written out via the ringbuffer in that case),
> but if checksums are off and we just hint-dirtied the page, it could be a
> significant issue.

Most of the overhead of FREEZE WAL records (with freeze plan deduplication and page-level freezing in) is generic WAL record header overhead. Your recent adversarial test case is going to choke on that, too. At least if you set checkpoint_timeout to 1 minute again.

> Thus a modification of the above logic could be to opportunistically freeze if
> a) it won't cause an FPI and either
> b1) the page was already dirty before pruning, as we'll not do a ringbuffer
>     replacement in that case
> or
> b2) we wrote a WAL record during pruning, as the difference in flush position
>     is marginal
>
> An even more aggressive version would be to replace b1) with logic that'd
> allow newly dirtying the page if it wasn't read through the ringbuffer. But
> newly dirtying the page feels like it'd be more dangerous.

In many cases we'll have to dirty the page anyway, just to set PD_ALL_VISIBLE.
The whole way the logic works is conditioned (whether triggered by an FPI or triggered by my now-reverted GUC) on being able to set the whole page all-frozen in the VM. > A less aggressive version would be to check if any WAL records were emitted > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we > modified the page again. Similar to what we do now, except not requiring an > FPI to have been emitted. Also way more aggressive. Not nearly enough on its own. > But to me it seems a bit odd that VACUUM now is more aggressive if checksums / > wal_log_hint bits is on, than without them. Which I think is how using either > of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working? Which part is the odd part? Is it odd that page-level freezing works that way, or is it odd that page-level checksums work that way? In any case this seems like an odd thing for you to say, having eviscerated a patch that really just made the same behavior trigger independently of FPIs in some tables, controlled via a GUC. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
Hi,

On 2023-01-26 08:54:55 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 8:35 AM Andres Freund wrote:
> > I think it's probably ok, but perhaps deserves a bit more thought about when
> > to "opportunistically" freeze. Perhaps to make it *more* aggressive than it
> > is now.
> >
> > With "opportunistic freezing" I mean freezing the page, even though we don't
> > *have* to freeze any of the tuples.
> >
> > The overall condition gating freezing is:
> >     if (pagefrz.freeze_required || tuples_frozen == 0 ||
> >         (prunestate->all_visible && prunestate->all_frozen &&
> >          fpi_before != pgWalUsage.wal_fpi))
> >
> > fpi_before is set before the heap_page_prune() call.
>
> Have you considered page-level checksums, and how the impact on hint
> bits needs to be accounted for here?
>
> All RDS customers use page-level checksums. And I've noticed that it's
> very common for the number of FPIs to only be very slightly less than
> the number of pages dirtied. Much of which is just hint bits. The
> "fpi_before != pgWalUsage.wal_fpi" test catches that.

I assume the case you're thinking of is that pruning did *not* do any changes, but in the process of figuring out that nothing needed to be pruned, we did a MarkBufferDirtyHint(), and as part of that emitted an FPI?

> > To me a condition that checked if the buffer is already dirty and if another
> > XLogInsert() would be likely to generate an FPI would make more sense. The
> > rare race case of a checkpoint starting concurrently doesn't matter IMO.
>
> That's going to be very significantly more aggressive. For example
> it'll impact small tables very differently.

Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is relatively small, with one important exception below, if we are 99.99% sure that it's not going to require an FPI and isn't going to dirty the page.

The exception is that a newer LSN on the page can cause the ringbuffer replacement to trigger more aggressive WAL flushing. No meaningful difference if we modified the page during pruning, or if the page was already in s_b (since it likely won't be written out via the ringbuffer in that case), but if checksums are off and we just hint-dirtied the page, it could be a significant issue.

Thus a modification of the above logic could be to opportunistically freeze if
a) it won't cause an FPI and either
b1) the page was already dirty before pruning, as we'll not do a ringbuffer
    replacement in that case
or
b2) we wrote a WAL record during pruning, as the difference in flush position
    is marginal

An even more aggressive version would be to replace b1) with logic that'd allow newly dirtying the page if it wasn't read through the ringbuffer. But newly dirtying the page feels like it'd be more dangerous.

A less aggressive version would be to check if any WAL records were emitted during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we modified the page again. Similar to what we do now, except not requiring an FPI to have been emitted.

But to me it seems a bit odd that VACUUM now is more aggressive if checksums / wal_log_hints is on, than without them. Which I think is how using either of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?

Greetings,

Andres Freund
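To make the two gates being compared in this subthread concrete, here is a standalone sketch of the decision logic as boolean predicates. All names are illustrative; the real code consults pgWalUsage, the buffer header, and XLog internals rather than a struct like this, and the `tuples_frozen == 0` arm of the committed condition is omitted for clarity.

```c
#include <stdbool.h>

/* Illustrative inputs only -- not actual PostgreSQL structures */
typedef struct PageState
{
    bool freeze_required;     /* some tuple is past the freeze cutoff     */
    bool all_visible_frozen;  /* page would become all-visible+all-frozen */
    bool pruning_emitted_fpi; /* fpi_before != pgWalUsage.wal_fpi         */
    bool pruning_wrote_wal;   /* any WAL record emitted during pruning    */
    bool was_dirty_before;    /* buffer was dirty before pruning          */
    bool freeze_needs_fpi;    /* would freezing now emit a new FPI?       */
} PageState;

/* committed gate: opportunistically freeze only when pruning emitted an FPI */
static bool
freeze_page_current(const PageState *p)
{
    return p->freeze_required ||
           (p->all_visible_frozen && p->pruning_emitted_fpi);
}

/*
 * Variant sketched above: a) freezing itself must not cause an FPI, and
 * either b1) the page was already dirty or b2) pruning wrote a WAL record.
 */
static bool
freeze_page_proposed(const PageState *p)
{
    return p->freeze_required ||
           (p->all_visible_frozen && !p->freeze_needs_fpi &&
            (p->was_dirty_before || p->pruning_wrote_wal));
}
```

For a recently modified page where pruning wrote a record but no FPI, the committed gate declines to freeze while the proposed variant freezes, which is exactly the case Andres complains about.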
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 8:35 AM Andres Freund wrote: > I think it's probably ok, but perhaps deserves a bit more thought about when > to "opportunistically" freeze. Perhaps to make it *more* aggressive than it's > now. > > With "opportunistic freezing" I mean freezing the page, even though we don't > *have* to freeze any of the tuples. > > The overall condition gating freezing is: > if (pagefrz.freeze_required || tuples_frozen == 0 || > (prunestate->all_visible && prunestate->all_frozen && > fpi_before != pgWalUsage.wal_fpi)) > > fpi_before is set before the heap_page_prune() call. Have you considered page-level checksums, and how the impact on hint bits needs to be accounted for here? All RDS customers use page-level checksums. And I've noticed that it's very common for the number of FPIs to only be very slightly less than the number of pages dirtied. Much of which is just hint bits. The "fpi_before != pgWalUsage.wal_fpi" test catches that. > To me a condition that checked if the buffer is already dirty and if another > XLogInsert() would be likely to generate an FPI would make more sense. The > rare race case of a checkpoint starting concurrently doesn't matter IMO. That's going to be very significantly more aggressive. For example it'll impact small tables very differently. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
Hi,

On 2023-01-26 09:20:57 -0500, Robert Haas wrote:
> On Wed, Jan 25, 2023 at 10:56 PM Andres Freund wrote:
> > but that's only true because page level freezing neutered
> > vacuum_freeze_min_age. Compared to <16, it's a *huge* change.
>
> Do you think that page-level freezing
> (1de58df4fec7325d91f5a8345757314be7ac05da) was improvidently
> committed?

I think it's probably ok, but perhaps deserves a bit more thought about when to "opportunistically" freeze. Perhaps to make it *more* aggressive than it is now.

With "opportunistic freezing" I mean freezing the page, even though we don't *have* to freeze any of the tuples.

The overall condition gating freezing is:

    if (pagefrz.freeze_required || tuples_frozen == 0 ||
        (prunestate->all_visible && prunestate->all_frozen &&
         fpi_before != pgWalUsage.wal_fpi))

fpi_before is set before the heap_page_prune() call.

To me the "fpi_before != pgWalUsage.wal_fpi" part doesn't make a whole lot of sense. For one, it won't at all work if full_page_writes=off. But more importantly, it also means we'll not freeze when VACUUMing a recently modified page, even if pruning already emitted a WAL record and we'd not emit an FPI if we froze the page now.

To me a condition that checked if the buffer is already dirty and if another XLogInsert() would be likely to generate an FPI would make more sense. The rare race case of a checkpoint starting concurrently doesn't matter IMO.

A minor complaint I have about the code is that the "tuples_frozen == 0" path imo is confusing. We go into the "freeze" path, which then inside has another if for the tuples_frozen == 0 part. I get that this deduplicates the NewRelFrozenXid handling, but it still looks odd.

> I have always been a bit skeptical of vacuum_freeze_min_age as a
> mechanism. It's certainly true that it is a waste of energy to freeze
> tuples that will soon be removed anyway, but on the other hand,
> repeatedly dirtying the same page for various different freezing and
> visibility related reasons *really sucks*, and even repeatedly reading
> the page because we kept deciding not to do anything yet isn't great.
> It seems possible that the page-level freezing mechanism could help
> with that quite a bit, and I think that the heuristic that patch
> proposes is basically reasonable: if there's at least one tuple on the
> page that is old enough to justify freezing, it doesn't seem like a
> bad bet to freeze all the others that can be frozen at the same time,
> at least if it means that we can mark the page all-visible or
> all-frozen. If it doesn't, then I'm not so sure; maybe we're best off
> deferring as much work as possible to a time when we *can* mark the
> page all-visible or all-frozen.

Agreed. Freezing everything if we need to freeze some things seems quite safe to me.

> In short, I think that neutering vacuum_freeze_min_age at least to
> some degree might be a good thing, but that's not to say that I'm
> altogether confident in that patch, either.

I am not too worried about the neutering in the page level freezing patch. The combination of the page level work with the eager strategy is where the sensibly-more-aggressive freeze_min_age got turbocharged to an imo dangerous degree.

Greetings,

Andres Freund
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 5:41 AM Robert Haas wrote: > On Wed, Jan 25, 2023 at 11:25 PM Peter Geoghegan wrote: > > On Wed, Jan 25, 2023 at 7:41 PM Robert Haas wrote: > > > Both Andres and I have repeatedly expressed concern about how much is > > > being changed in the behavior of vacuum, and how quickly, and IMHO on > > > the basis of very limited evidence that the changes are improvements. > > > The fact that Andres was very quickly able to find cases where the > > > patch produces large regression is just more evidence of that. It's > > > also hard to even understand what has been changed, because the > > > descriptions are so theoretical. > > > > Did you actually read the motivating examples Wiki page? > > I don't know. I've read a lot of stuff that you've written on this > topic, which has taken a significant amount of time, and I still don't > understand a lot of what you're changing, and I don't agree with all > of the things that I do understand. You complained about the descriptions being theoretical. But there's nothing theoretical about the fact that we more or less do *all* freezing in an eventual aggressive VACUUM in many important cases, including very simple cases like pgbench_history -- the simplest possible append-only table case. We'll merrily rewrite the entire table, all at once, for no good reason at all. Consistently, reliably. It's so incredibly obvious that this makes zero sense! And yet I don't think you've ever engaged with such basic points as that one. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Jan 25, 2023 at 7:56 PM Andres Freund wrote: > > https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2 > > > > The difference between this and VACUUM FREEZE is described here: > > > > "Note how we freeze most pages, but still leave a significant number > > unfrozen each time, despite using an eager approach to freezing > > (2981204 scanned - 2355230 frozen = 625974 pages scanned but left > > unfrozen). Again, this is because we don't freeze pages unless they're > > already eligible to be set all-visible. > > The only reason there is a substantial difference is because of pgbench's > uniform access pattern. Most real-world applications don't have that. It's not pgbench! It's TPC-C. It's actually an adversarial case for the patch series. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Jan 25, 2023 at 10:56 PM Andres Freund wrote: > but that's only true because page level freezing neutered > vacuum_freeze_min_age. Compared to <16, it's a *huge* change. Do you think that page-level freezing (1de58df4fec7325d91f5a8345757314be7ac05da) was improvidently committed? I have always been a bit skeptical of vacuum_freeze_min_age as a mechanism. It's certainly true that it is a waste of energy to freeze tuples that will soon be removed anyway, but on the other hand, repeatedly dirtying the same page for various different freezing and visibility related reasons *really sucks*, and even repeatedly reading the page because we kept deciding not to do anything yet isn't great. It seems possible that the page-level freezing mechanism could help with that quite a bit, and I think that the heuristic that patch proposes is basically reasonable: if there's at least one tuple on the page that is old enough to justify freezing, it doesn't seem like a bad bet to freeze all the others that can be frozen at the same time, at least if it means that we can mark the page all-visible or all-frozen. If it doesn't, then I'm not so sure; maybe we're best off deferring as much work as possible to a time when we *can* mark the page all-visible or all-frozen. In short, I think that neutering vacuum_freeze_min_age at least to some degree might be a good thing, but that's not to say that I'm altogether confident in that patch, either. -- Robert Haas EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Jan 25, 2023 at 11:25 PM Peter Geoghegan wrote: > On Wed, Jan 25, 2023 at 7:41 PM Robert Haas wrote: > > Both Andres and I have repeatedly expressed concern about how much is > > being changed in the behavior of vacuum, and how quickly, and IMHO on > > the basis of very limited evidence that the changes are improvements. > > The fact that Andres was very quickly able to find cases where the > > patch produces large regression is just more evidence of that. It's > > also hard to even understand what has been changed, because the > > descriptions are so theoretical. > > Did you actually read the motivating examples Wiki page? I don't know. I've read a lot of stuff that you've written on this topic, which has taken a significant amount of time, and I still don't understand a lot of what you're changing, and I don't agree with all of the things that I do understand. I can't state with confidence that the motivating examples wiki page was or was not among the things that I read. But, you know, when people start running PostgreSQL 16, and have some problem, they're not going to read the motivating examples wiki page. They're going to read the documentation. If they can't find the answer there, they (or some hacker that they contact) will probably read the code comments and the relevant commit messages. Those either clearly explain what was changed in a way that somebody can understand, or they don't. If they don't, *the commits are not good enough*, regardless of what other information may exist in any other place. -- Robert Haas EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Jan 25, 2023 at 8:24 PM Peter Geoghegan wrote:
> > I think we're on a very dangerous path here. I want VACUUM to be
> > better as much as the next person, but I really don't believe that's
> > the direction we're headed. I think if we release like this, we're
> > going to experience more VACUUM pain, not less. And worse still, I
> > don't think anyone other than Peter and Andres is going to understand
> > why it's happening.
>
> I think that the only sensible course of action at this point is for
> me to revert the page-level freezing commit from today, and abandon
> all outstanding work on VACUUM. I will still stand by the basic
> page-level freezing work, at least to the extent that I am able to.

I have now reverted today's commit. I have also withdrawn all remaining work from the patch series as a whole, which is reflected in the CF app.

Separately, I have withdrawn 2 other VACUUM related patches of mine via the CF app: the antiwraparound autovacuum patch series, plus a patch that did some further work on freezing MultiXacts. I have no intention of picking any of these patches back up again. I also intend to completely avoid new work on both VACUUM and autovacuum, not including ambulkdelete() code run by index access methods. I will continue to do maintenance and bugfix work when it happens to involve VACUUM, though.

For the record, in case it matters: I certainly have no objection to anybody else picking up any of this unfinished work for themselves, in part or in full.

-- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Jan 25, 2023 at 8:12 PM John Naylor wrote: > That was followed by several paragraphs that never got around to explaining > why table size should drive freezing strategy. You were talking about the system level view of freeze debt, and how the table view might not be a sufficient proxy for that. What does that have to do with anything that we've discussed on this thread recently? > Review is a feedback mechanism alerting the patch author to possible > problems. Listening to feedback is like vacuum, in a way: If it hurts, you're > not doing it enough. An elegant analogy. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Jan 25, 2023 at 7:41 PM Robert Haas wrote:
> Both Andres and I have repeatedly expressed concern about how much is
> being changed in the behavior of vacuum, and how quickly, and IMHO on
> the basis of very limited evidence that the changes are improvements.
> The fact that Andres was very quickly able to find cases where the
> patch produces large regressions is just more evidence of that. It's
> also hard to even understand what has been changed, because the
> descriptions are so theoretical.

Did you actually read the motivating examples Wiki page?

> I think we're on a very dangerous path here. I want VACUUM to be
> better as much as the next person, but I really don't believe that's
> the direction we're headed. I think if we release like this, we're
> going to experience more VACUUM pain, not less. And worse still, I
> don't think anyone other than Peter and Andres is going to understand
> why it's happening.

I think that the only sensible course of action at this point is for me to revert the page-level freezing commit from today, and abandon all outstanding work on VACUUM. I will still stand by the basic page-level freezing work, at least to the extent that I am able to.

Honestly, just typing that makes me feel a big sense of relief.

I am a proud, stubborn man. While the experience of working on the earlier related stuff for Postgres 15 was itself enough to make me seriously reassess my choice to work on VACUUM in general, I still wanted to finish off what I'd started. I don't see how that'll be possible now -- I'm just not in a position to be in the center of another controversy, and I just don't seem to be able to avoid them here, as a practical matter. I will resolve to be a less stubborn person. I don't have the constitution for it anymore.

-- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 10:11 AM Andres Freund wrote: > I am. Just not every tradeoff. I just don't see any useful tradeoffs purely > based on the relation size. I expressed reservations about relation size six weeks ago: On Wed, Dec 14, 2022 at 12:16 AM Peter Geoghegan wrote: > > On Tue, Dec 13, 2022 at 12:29 AM John Naylor > wrote: > > If the number of unfrozen heap pages is the thing we care about, perhaps that, and not the total size of the table, should be the parameter that drives freezing strategy? > > That's not the only thing we care about, though. That was followed by several paragraphs that never got around to explaining why table size should drive freezing strategy. Review is a feedback mechanism alerting the patch author to possible problems. Listening to feedback is like vacuum, in a way: If it hurts, you're not doing it enough. -- John Naylor EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
Hi,

On 2023-01-25 18:43:10 -0800, Peter Geoghegan wrote:
> On Wed, Jan 25, 2023 at 6:33 PM Andres Freund wrote:
> > As far as I can tell, with the eager strategy, the only thing
> > vacuum_freeze_min_age really influences is whether we'll block waiting for a
> > cleanup lock. IOW, VACUUM on a table > vacuum_freeze_strategy_threshold is
> > now a slightly less-blocking version of VACUUM FREEZE.
>
> That's simply not true, at all. I'm very surprised that you think
> that. The commit message very clearly addresses this.

It says something like that, but it's not really true. Looking at the results of:

DROP TABLE IF EXISTS frak;
-- autovac disabled so we see just the result of the vacuum below
CREATE TABLE frak WITH (autovacuum_enabled=0) AS SELECT generate_series(1, 10000000);
VACUUM frak;
SELECT pg_relation_size('frak') / 8192 AS relsize_pages,
       SUM(all_visible::int) AS all_vis_pages,
       SUM(all_frozen::int) AS all_frozen_pages
FROM pg_visibility('frak');

across releases. In < 16 you'll get:

┌───────────────┬───────────────┬──────────────────┐
│ relsize_pages │ all_vis_pages │ all_frozen_pages │
├───────────────┼───────────────┼──────────────────┤
│         44248 │         44248 │                0 │
└───────────────┴───────────────┴──────────────────┘

You simply can't freeze these rows, because they're not vacuum_freeze_min_age xids old. With 16 and the default vacuum_freeze_strategy_threshold you'll get the same (even though we wouldn't actually trigger an FPW). With 16 and vacuum_freeze_strategy_threshold=0, you'll get:

┌───────────────┬───────────────┬──────────────────┐
│ relsize_pages │ all_vis_pages │ all_frozen_pages │
├───────────────┼───────────────┼──────────────────┤
│         44248 │         44248 │            44248 │
└───────────────┴───────────────┴──────────────────┘

IOW, basically what you get with VACUUM FREEZE. That's actually what I was complaining about.

The commit message in a way is right that

    Settings like vacuum_freeze_min_age still get applied in just the same
    way in every VACUUM, independent of the strategy in use. The only
    mechanical difference between eager and lazy freezing strategies is that
    only the former applies its own additional criteria to trigger freezing
    pages.

but that's only true because page level freezing neutered vacuum_freeze_min_age. Compared to <16, it's a *huge* change.

Yes, it's true that VACUUM still is less aggressive than VACUUM FREEZE, even disregarding cleanup locks, because it won't freeze if there's non-removable rows on the page. But more often than not that's a pretty small difference.

> Once again I'll refer you to my Wiki page on this:
>
> https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2
>
> The difference between this and VACUUM FREEZE is described here:
>
> "Note how we freeze most pages, but still leave a significant number
> unfrozen each time, despite using an eager approach to freezing
> (2981204 scanned - 2355230 frozen = 625974 pages scanned but left
> unfrozen). Again, this is because we don't freeze pages unless they're
> already eligible to be set all-visible.

The only reason there is a substantial difference is because of pgbench's uniform access pattern. Most real-world applications don't have that.

Greetings,

Andres Freund
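The frak example can be reduced to two toy predicates. This is an illustrative model only (names and the wraparound-free XID arithmetic are simplifications, not the real heap_prepare_freeze_tuple() logic): the pre-16 lazy trigger freezes a tuple only once its XID is at least vacuum_freeze_min_age (default 50 million) transactions old, so freshly loaded rows stay unfrozen; the eager strategy instead freezes any page it can mark all-frozen, regardless of tuple age.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* lazy (<16-style) trigger: freeze only tuples older than the min-age cutoff */
static bool
lazy_would_freeze(TransactionId tuple_xid, TransactionId next_xid,
                  uint32_t vacuum_freeze_min_age)
{
    /* toy age computation; ignores XID wraparound for clarity */
    return (next_xid - tuple_xid) >= vacuum_freeze_min_age;
}

/* eager-strategy trigger: freeze whenever the page can become all-frozen */
static bool
eager_would_freeze(bool page_all_frozen_eligible)
{
    return page_all_frozen_eligible;
}
```

With rows inserted just before the VACUUM, the lazy predicate is false for every tuple (hence 0 all-frozen pages on <16), while the eager predicate is true for every all-visible page (hence all 44248 frozen), which is the VACUUM FREEZE-like behavior being complained about.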
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Jan 25, 2023 at 7:11 PM Andres Freund wrote:
> > > I switched between vacuum_freeze_strategy_threshold=0 and
> > > vacuum_freeze_strategy_threshold=too-high, because it's quicker/takes less
> > > warmup to set up something with smaller tables.
> >
> > This makes no sense to me, at all.
>
> It's quicker to run the workload with a table that initially is below 4GB, but
> still be able to test the eager strategy. It wouldn't change anything
> fundamental to just make the rows a bit wider, or to have a static portion of
> the table.

What does that actually mean? Wouldn't change anything fundamental? What it would do is significantly reduce the write amplification effect that you encountered. You came up with numbers of up to 7x, a number that you used without any mention of checkpoint_timeout being lowered to only 1 minute (I had to push to get that information). Had you done things differently (larger table, larger setting) then that would have made the regression far smaller. So yeah, "nothing fundamental".

> > How, in general, can we detect what kind of 1TB table it will be, in the
> > absence of user input?
>
> I suspect we'll need some form of heuristics to differentiate between tables
> that are more append heavy and tables that are changing more heavily.

The TPC-C tables are actually a perfect adversarial case for this, because it's both, together. What then?

> I think
> it might be preferable to not have a hard cliff but a gradual changeover -
> hard cliffs tend to lead to issues one can't see coming.

As soon as you change your behavior you have to account for the fact that you behaved lazily in all prior VACUUMs. I think that you're better off just being eager with new pages and modified pages, while not specifically going

> I IIRC previously was handwaving at keeping track of the average age of tuples
> on all-visible pages. That could extend the prior heuristic.
> A heavily changing table will have a relatively young average, a more append
> only table will have an increasing average age.
>
> It might also make sense to look at the age of relfrozenxid - there's really
> no point in being overly eager if the relation is quite young.

I don't think that's true. What about bulk loading? It's a totally valid and common requirement.

--
Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Jan 25, 2023 at 8:49 PM Andres Freund wrote:
> The concrete setting of vacuum_freeze_strategy_threshold doesn't matter.
> Table size simply isn't a usable proxy for whether eager freezing is a good
> idea or not.

I strongly agree. I can't imagine how a size-based threshold can make any sense at all.

Both Andres and I have repeatedly expressed concern about how much is being changed in the behavior of vacuum, and how quickly, and IMHO on the basis of very limited evidence that the changes are improvements. The fact that Andres was very quickly able to find cases where the patch produces large regressions is just more evidence of that. It's also hard to even understand what has been changed, because the descriptions are so theoretical.

I think we're on a very dangerous path here. I want VACUUM to be better as much as the next person, but I really don't believe that's the direction we're headed. I think if we release like this, we're going to experience more VACUUM pain, not less. And worse still, I don't think anyone other than Peter and Andres is going to understand why it's happening.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
Hi,

On 2023-01-25 18:31:16 -0800, Peter Geoghegan wrote:
> On Wed, Jan 25, 2023 at 5:49 PM Andres Freund wrote:
> > Sure. But significantly regressing plausible if not common workloads is
> > different than knowing that there'll be some edge case where we'll do
> > something worse.
>
> That's very vague. Significant to whom, for what purpose?

Sure it's vague. But you can't tell me that it's uncommon to use postgres to store rows that aren't retained for > 50 million xids.

> > I reproduced both with checkpoint_timeout=5min and 1min. 1min is easier for
> > impatient me.
>
> You said "Autovacuum on average generates between 1.5x-7x as much WAL
> as before". Why stop there, though? There's a *big* multiplicative
> effect in play here from FPIs, obviously, so the sky's the limit. Why
> not set checkpoint_timeout to 30s?

The amount of WAL increases substantially even with 5min, the degree of the increase varies more though. But that largely vanishes if you increase the time after which rows are deleted a bit. I just am not patient enough to wait for that.

> > I switched between vacuum_freeze_strategy_threshold=0 and
> > vacuum_freeze_strategy_threshold=too-high, because it's quicker/takes less
> > warmup to set up something with smaller tables.
>
> This makes no sense to me, at all.

It's quicker to run the workload with a table that initially is below 4GB, but still be able to test the eager strategy. It wouldn't change anything fundamental to just make the rows a bit wider, or to have a static portion of the table. And changing between vacuum_freeze_strategy_threshold=0/very-large (or I assume -1, didn't check) while the workload is running avoids having to wait until the 120s to start deleting have passed.

> > The concrete setting of vacuum_freeze_strategy_threshold doesn't matter.
> > Table size simply isn't a usable proxy for whether eager freezing is a good
> > idea or not.
>
> It's not supposed to be - you have it backwards.
> It's intended to work as a proxy for whether lazy freezing is a bad idea,
> particularly in the worst case.

That's a distinction without a difference.

> There is also an effect that likely would have been protective with
> your test case had you used a larger table with the same test case
> (and had you not lowered vacuum_freeze_strategy_threshold from its
> already low default).

Again, you just need a less heavily changing portion of the table or a slightly larger "deletion delay" and you end up with a table well over 4GB. Even as stated I end up with > 4GB after a bit of running. It's just a shortcut to make testing this easier.

> > You can have a 1TB table full of transient data, or you can have a 1TB table
> > where part of the data is transient and only settles after a time. In neither
> > case eager freezing is ok.
>
> It sounds like you're not willing to accept any kind of trade-off.

I am. Just not every tradeoff. I just don't see any useful tradeoffs purely based on the relation size.

> How, in general, can we detect what kind of 1TB table it will be, in the
> absence of user input?

I suspect we'll need some form of heuristics to differentiate between tables that are more append heavy and tables that are changing more heavily. I think it might be preferable to not have a hard cliff but a gradual changeover - hard cliffs tend to lead to issues one can't see coming. I think several of the heuristics below become easier once we introduce "xid age" vacuums.

One idea is to start tracking the number of all-frozen pages in pg_class. If there's a significant percentage of all-visible but not all-frozen pages, vacuum should be more eager. If only a small portion of the table is not frozen, there's no need to be eager. If only a small portion of the table is all-visible, there similarly is no need to freeze eagerly.

I IIRC previously was handwaving at keeping track of the average age of tuples on all-visible pages. That could extend the prior heuristic.
A heavily changing table will have a relatively young average, a more append only table will have an increasing average age. It might also make sense to look at the age of relfrozenxid - there's really no point in being overly eager if the relation is quite young. And a very heavily changing table will tend to be younger. But likely the approach of tracking the age of all-visible pages will be more accurate. The heuristics don't have to be perfect. If we get progressively more eager, an occasional somewhat eager vacuum isn't a huge issue, as long as it then leads to the next few vacuums to become less eager. > And in the absence of user input, why would we prefer to default to a > behavior that is highly destabilizing when we get it wrong? Users know the current behaviour. Introducing significant issues that didn't previously exist will cause new issues and new frustrations. Greetings, Andres Freund
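[Editor's note: a minimal sketch of the heuristics Andres outlines above -- combining the fraction of all-visible-but-not-frozen pages with the average tuple age on all-visible pages into a gradual "eagerness" score rather than a hard cliff. The scoring function, its name, and all threshold constants are invented for illustration; nothing like this exists in the patch under discussion.]

```python
# Editorial sketch: score how eagerly VACUUM should freeze, based on the
# two signals proposed above. All constants are arbitrary placeholders.

def eagerness(rel_pages, all_visible, all_frozen, avg_all_visible_age,
              freeze_max_age=200_000_000):
    """Return a 0..1 eagerness score (gradual changeover, no hard cliff)."""
    if rel_pages == 0:
        return 0.0
    # Signal 1: a large backlog of all-visible-but-not-frozen pages means
    # an aggressive VACUUM would eventually have to do a lot of work.
    backlog = (all_visible - all_frozen) / rel_pages
    # Signal 2: old tuples on all-visible pages suggest an append-mostly
    # table; young ones suggest heavy churn, where freezing is wasted.
    age_factor = min(avg_all_visible_age / freeze_max_age, 1.0)
    return min(1.0, backlog * age_factor * 2)

# Append-only table: big unfrozen backlog, old all-visible tuples.
append_only = eagerness(100_000, 95_000, 5_000, 150_000_000)
# Heavily updated table: few all-visible pages, young tuples.
churny = eagerness(100_000, 10_000, 5_000, 5_000_000)
print(append_only > churny)  # True
```

The point of returning a score rather than a boolean is the "gradual changeover" Andres argues for: an occasional somewhat-eager vacuum then makes the next few vacuums less eager, instead of flipping behavior at a single threshold.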
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Jan 25, 2023 at 6:33 PM Andres Freund wrote: > My point was the other way round. That vacuum_freeze_min_age *prevented* us > from freezing rows "too soon" - obviously a very blunt instrument. Yes, not freezing at all until aggressive vacuum is definitely good when you don't really need to freeze at all. > Since page level freezing, it only partially does that, because we'll freeze > even newer rows, if pruning triggered an FPI (I don't think that's quite the > right check, but that's a separate discussion). But the added cost is very low, and it might well make all the difference. > As far as I can tell, with the eager strategy, the only thing > vacuum_freeze_min_age really influences is whether we'll block waiting for a > cleanup lock. IOW, VACUUM on a table > vacuum_freeze_strategy_threshold is > now a slightly less-blocking version of VACUUM FREEZE. That's simply not true, at all. I'm very surprised that you think that. The commit message very clearly addresses this. You know, the part that you specifically quoted to complain about today! Once again I'll refer you to my Wiki page on this: https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2 The difference between this and VACUUM FREEZE is described here: "Note how we freeze most pages, but still leave a significant number unfrozen each time, despite using an eager approach to freezing (2981204 scanned - 2355230 frozen = 625974 pages scanned but left unfrozen). Again, this is because we don't freeze pages unless they're already eligible to be set all-visible. We saw the same effect with the first pgbench_history example, but it was hardly noticeable at all there. Whereas here we see that even eager freezing opts to hold off on freezing relatively many individual heap pages, due to the observed conditions on those particular heap pages." 
If it was true that eager freezing strategy behaved just the same as VACUUM FREEZE (at least as far as freezing is concerned) then scenarios like this one would show that VACUUM froze practically all of the pages it scanned -- maybe fully 100% of all scanned pages would be frozen. This effect is absent from small tables, and I suspect that it's absent from your test case in part because you used a table that was too small. Obviously the way that eager freezing strategy avoids freezing concurrently modified pages isn't perfect. It's one approach to limiting the downside from eager freezing, in tables (or even individual pages) where it's inappropriate. Of course that isn't perfect, but it's a significant factor. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
Hi,

On 2023-01-25 17:28:48 -0800, Peter Geoghegan wrote:
> On Wed, Jan 25, 2023 at 5:15 PM Andres Freund wrote:
> > However, it significantly increases the overall work when rows have a
> > somewhat limited lifetime. That is the documented reason why
> > vacuum_freeze_min_age exists - although I think it doesn't really achieve
> > its documented goal anymore, after the recent page-level freezing changes.
>
> Huh? vacuum_freeze_min_age hasn't done that, at all. At least not
> since the visibility map went in back in 8.4:

My point was the other way round. That vacuum_freeze_min_age *prevented* us from freezing rows "too soon" - obviously a very blunt instrument.

Since page level freezing, it only partially does that, because we'll freeze even newer rows, if pruning triggered an FPI (I don't think that's quite the right check, but that's a separate discussion).

As far as I can tell, with the eager strategy, the only thing vacuum_freeze_min_age really influences is whether we'll block waiting for a cleanup lock. IOW, VACUUM on a table > vacuum_freeze_strategy_threshold is now a slightly less-blocking version of VACUUM FREEZE.

The paragraph I was referencing:

  One disadvantage of decreasing vacuum_freeze_min_age is that it might cause
  VACUUM to do useless work: freezing a row version is a waste of time if the
  row is modified soon thereafter (causing it to acquire a new XID). So the
  setting should be large enough that rows are not frozen until they are
  unlikely to change any more.

But now vacuum_freeze_min_age doesn't reliably influence whether we'll freeze rows anymore. Am I missing something here?

> > > VACUUM determines its freezing strategy based on the value of the new
> > > vacuum_freeze_strategy_threshold GUC (or reloption) with logged tables;
> > > tables that exceed the size threshold use the eager freezing strategy.
> >
> > I think that's not a sufficient guard at all. The size of a table doesn't say
> > much about how a table is used.
> Sufficient for what purpose?

To not regress a substantial portion of our userbase.

Greetings,

Andres Freund
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Jan 25, 2023 at 5:49 PM Andres Freund wrote:
> Sure. But significantly regressing plausible if not common workloads is
> different than knowing that there'll be some edge case where we'll do
> something worse.

That's very vague. Significant to whom, for what purpose?

> prep:
> CREATE TABLE pgbench_time_data(client_id int8 NOT NULL, ts timestamptz NOT
>   NULL, filla int8 NOT NULL, fillb int8 not null, fillc int8 not null);
> CREATE INDEX ON pgbench_time_data(ts);
> ALTER SYSTEM SET autovacuum_naptime = '10s';
> ALTER SYSTEM SET autovacuum_vacuum_cost_delay TO -1;
> ALTER SYSTEM SET synchronous_commit = off; -- otherwise more clients are needed
>
> pgbench script, with 15 clients:
> INSERT INTO pgbench_time_data(client_id, ts, filla, fillb, fillc) VALUES
>   (:client_id, now(), 0, 0, 0);
>
> psql session deleting old data:
> EXPLAIN ANALYZE DELETE FROM pgbench_time_data WHERE ts < now() - '120s'::interval \watch 1
>
> Realistically the time should be longer, but I didn't want to wait that long
> for the deletions to actually start.

I'll review this tomorrow.

> I reproduced both with checkpoint_timeout=5min and 1min. 1min is easier for
> impatient me.

You said "Autovacuum on average generates between 1.5x-7x as much WAL as before". Why stop there, though? There's a *big* multiplicative effect in play here from FPIs, obviously, so the sky's the limit. Why not set checkpoint_timeout to 30s?

> I switched between vacuum_freeze_strategy_threshold=0 and
> vacuum_freeze_strategy_threshold=too-high, because it's quicker/takes less
> warmup to set up something with smaller tables.

This makes no sense to me, at all.

> The concrete setting of vacuum_freeze_strategy_threshold doesn't matter.
> Table size simply isn't a usable proxy for whether eager freezing is a good
> idea or not.

It's not supposed to be - you have it backwards. It's intended to work as a proxy for whether lazy freezing is a bad idea, particularly in the worst case.
There is also an effect that likely would have been protective with your test case had you used a larger table with the same test case (and had you not lowered vacuum_freeze_strategy_threshold from its already low default). In general there'd be a much better chance of concurrent reuse of space by new inserts discouraging page-level freezing, since VACUUM would take much longer relative to everything else, as compared to a small table.

> You can have a 1TB table full of transient data, or you can have a 1TB table
> where part of the data is transient and only settles after a time. In neither
> case eager freezing is ok.

It sounds like you're not willing to accept any kind of trade-off.

How, in general, can we detect what kind of 1TB table it will be, in the absence of user input? And in the absence of user input, why would we prefer to default to a behavior that is highly destabilizing when we get it wrong?

--
Peter Geoghegan
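[Editor's note: the trade-off the two sides are arguing can be illustrated with a toy accounting model -- eager freezing does useless work on short-lived rows, while lazy freezing defers a large bill on append-only data to a later aggressive VACUUM. The model, its names, and its units are editorial inventions, not measurements from the thread.]

```python
# Editorial toy model of total page-freezing work under each strategy.
# Pages that survive long-term must be frozen eventually either way;
# pages whose rows die young only cost work under the eager strategy.

def freeze_work(strategy, vacuum_rounds, pages_added, survive_fraction):
    """Total pages frozen across all vacuums plus the final aggressive one."""
    frozen_now = deferred = 0
    for _ in range(vacuum_rounds):
        if strategy == "eager":
            frozen_now += pages_added            # freeze everything up front
        else:
            # Lazy: surviving pages are left for a later aggressive VACUUM.
            deferred += int(pages_added * survive_fraction)
    return frozen_now + deferred

# Append-only: every page survives. Same total work either way, but eager
# spreads it out instead of concentrating it in one aggressive VACUUM.
assert freeze_work("eager", 10, 100, 1.0) == freeze_work("lazy", 10, 100, 1.0)

# Short-lived rows (the regression scenario): eager freezes pages whose
# rows are deleted shortly afterwards; lazy never touches them.
print(freeze_work("eager", 10, 100, 0.0))  # 1000
print(freeze_work("lazy", 10, 100, 0.0))   # 0
```

Under this (deliberately crude) model, the dispute reduces to which `survive_fraction` a given table actually has -- which is exactly the "what kind of 1TB table is it" question posed above.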
Re: New strategies for freezing, advancing relfrozenxid early
Hi, On 2023-01-25 17:37:17 -0800, Peter Geoghegan wrote: > On Wed, Jan 25, 2023 at 5:26 PM Andres Freund wrote: > > Another bad scenario: Some longrunning / hung transaction caused us to get > > close to the xid wraparound. Problem was resolved, autovacuum runs. > > Previously > > we wouldn't have frozen the portion of the table that was actively changing, > > now we will. Consequence: We get closer to the "no write" limit / the outage > > lasts longer. > > Obviously it isn't difficult to just invent a new rule that gets > applied by lazy_scan_strategy. For example, it would take me less than > 5 minutes to write a patch that disables eager freezing when the > failsafe is in effect. Sure. I'm not saying that these issues cannot be addressed. Of course no patch of a meaningful size is perfect and we all can't predict the future. But this is a very significant behavioural change to vacuum, and there are pretty simple scenarios in which it causes significant regressions. And at least some of the issues have been pointed out before. Greetings, Andres Freund
Re: New strategies for freezing, advancing relfrozenxid early
Hi,

On 2023-01-25 17:22:32 -0800, Peter Geoghegan wrote:
> On Wed, Jan 25, 2023 at 4:43 PM Andres Freund wrote:
> > I unfortunately haven't been able to keep up with the thread and saw this
> > just now. But I've expressed the concern below several times before, so it
> > shouldn't come as a surprise.
>
> You missed the announcement 9 days ago, and the similar clear
> signalling of a commit from yesterday. I guess I'll need to start
> personally reaching out to you any time I commit anything in this area
> in the future. I almost considered doing that here, in fact.

There's just too much email on -hackers to keep up with, if I ever want to do any development of my own. I raised this concern before though, so it's not like it's a surprise.

> > The most common problematic scenario I see are tables full of rows with
> > limited lifetime. E.g. because rows get aggregated up after a while. Before
> > those rows practically never got frozen - but now we'll freeze them all the
> > time.
>
> Fundamentally, the choice to freeze or not freeze is driven by
> speculation about the needs of the table, with some guidance from the
> user. That isn't new. It seems to me that it will always be possible
> for you to come up with an adversarial case that makes any given
> approach look bad, no matter how good it is. Of course that doesn't
> mean that this particular complaint has no validity; but it does mean
> that you need to be willing to draw the line somewhere.

Sure. But significantly regressing plausible if not common workloads is different than knowing that there'll be some edge case where we'll do something worse.

> > I whipped up a quick test: 15 pgbench threads insert rows, 1 psql \while
> > loop deletes older rows.
>
> Can you post the script? And what setting did you use?
prep:

CREATE TABLE pgbench_time_data(client_id int8 NOT NULL, ts timestamptz NOT NULL,
  filla int8 NOT NULL, fillb int8 not null, fillc int8 not null);
CREATE INDEX ON pgbench_time_data(ts);
ALTER SYSTEM SET autovacuum_naptime = '10s';
ALTER SYSTEM SET autovacuum_vacuum_cost_delay TO -1;
ALTER SYSTEM SET synchronous_commit = off; -- otherwise more clients are needed

pgbench script, with 15 clients:

INSERT INTO pgbench_time_data(client_id, ts, filla, fillb, fillc) VALUES
  (:client_id, now(), 0, 0, 0);

psql session deleting old data:

EXPLAIN ANALYZE DELETE FROM pgbench_time_data WHERE ts < now() - '120s'::interval \watch 1

Realistically the time should be longer, but I didn't want to wait that long for the deletions to actually start. I reproduced both with checkpoint_timeout=5min and 1min. 1min is easier for impatient me.

I switched between vacuum_freeze_strategy_threshold=0 and vacuum_freeze_strategy_threshold=too-high, because it's quicker/takes less warmup to set up something with smaller tables. shared_buffers=32GB for fits in s_b, 1GB otherwise. max_wal_size=150GB, log_autovacuum_min_duration=0, and a bunch of logging settings.

> > Workload fits in s_b:
> >
> > Autovacuum on average generates between 1.5x-7x as much WAL as before,
> > depending on how things interact with checkpoints. And not just that, each
> > autovac cycle also takes substantially longer than before - the average time
> > for an autovacuum roughly doubled. Which of course increases the amount of
> > bloat.
>
> Anything that causes an autovacuum to take longer will effectively
> make autovacuum think that it has removed more bloat than it really
> has, which will then make autovacuum less aggressive when it really
> should be more aggressive. That's a preexisting issue, that needs to
> be accounted for in the context of this discussion.

That's not the problem here - on my system autovac starts again very quickly. The problem is that we accumulate bloat while autovacuum is running.
Wasting time/WAL volume on freezing pages that don't need to be frozen is an issue.

> In particular, it would be very useful to know what the parameters of
> the discussion are. Obviously I cannot come up with an algorithm that
> can literally predict the future. But I may be able to handle specific
> cases of concern better, or to better help users cope in whatever way.

> > This is significantly worse than I predicted. This was my first attempt at
> > coming up with a problematic workload. There'll likely be way worse in
> > production.
>
> As I said in the commit message, the current default for
> vacuum_freeze_strategy_threshold is considered low, and was always
> intended to be provisional. Something that I explicitly noted would be
> reviewed after the beta period is over, once we gained more experience
> with the setting.
>
> I think that a far higher setting could be almost as effective. 32GB,
> or even 64GB could work quite well, since you'll still have the FPI
> optimization.

The concrete setting of vacuum_freeze_strategy_threshold doesn't matter. Table size simply isn't a usable proxy for whether eager freezing is a good idea or not.

You can have a 1TB table full of transient data, or you can have a 1TB table where part of the data is transient and only settles after a time. In neither case eager freezing is ok.

Greetings,

Andres Freund
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Jan 25, 2023 at 5:26 PM Andres Freund wrote: > Another bad scenario: Some longrunning / hung transaction caused us to get > close to the xid wraparound. Problem was resolved, autovacuum runs. Previously > we wouldn't have frozen the portion of the table that was actively changing, > now we will. Consequence: We get closer to the "no write" limit / the outage > lasts longer. Obviously it isn't difficult to just invent a new rule that gets applied by lazy_scan_strategy. For example, it would take me less than 5 minutes to write a patch that disables eager freezing when the failsafe is in effect. > I don't see an alternative to reverting this for now. I want to see your test case before acting. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Jan 25, 2023 at 5:15 PM Andres Freund wrote:
> However, it significantly increases the overall work when rows have a somewhat
> limited lifetime. That is the documented reason why vacuum_freeze_min_age
> exists - although I think it doesn't really achieve its documented goal
> anymore, after the recent page-level freezing changes.

Huh? vacuum_freeze_min_age hasn't done that, at all. At least not since the visibility map went in back in 8.4:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Today.2C_on_Postgres_HEAD_2

That's why we literally do ~100% of all freezing in aggressive mode VACUUM with append-only or append-mostly tables.

> > VACUUM determines its freezing strategy based on the value of the new
> > vacuum_freeze_strategy_threshold GUC (or reloption) with logged tables;
> > tables that exceed the size threshold use the eager freezing strategy.
>
> I think that's not a sufficient guard at all. The size of a table doesn't say
> much about how a table is used.

Sufficient for what purpose?

> > Eager freezing is strictly more aggressive than lazy freezing. Settings
> > like vacuum_freeze_min_age still get applied in just the same way in
> > every VACUUM, independent of the strategy in use. The only mechanical
> > difference between eager and lazy freezing strategies is that only the
> > former applies its own additional criteria to trigger freezing pages.
>
> That's only true because vacuum_freeze_min_age has been fairly radically
> redefined recently.

So? This part of the commit message is a simple statement of fact.

--
Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
Hi, On 2023-01-25 16:43:47 -0800, Andres Freund wrote: > I think, as committed, this will cause serious issues for some reasonably > common workloads, due to substantially increased WAL traffic. > > > The most common problematic scenario I see are tables full of rows with > limited lifetime. E.g. because rows get aggregated up after a while. Before > those rows practically never got frozen - but now we'll freeze them all the > time. Another bad scenario: Some longrunning / hung transaction caused us to get close to the xid wraparound. Problem was resolved, autovacuum runs. Previously we wouldn't have frozen the portion of the table that was actively changing, now we will. Consequence: We get closer to the "no write" limit / the outage lasts longer. I don't see an alternative to reverting this for now. Greetings, Andres Freund
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Jan 25, 2023 at 4:43 PM Andres Freund wrote: > I unfortunately haven't been able to keep up with the thread and saw this just > now. But I've expressed the concern below several times before, so it > shouldn't come as a surprise. You missed the announcement 9 days ago, and the similar clear signalling of a commit from yesterday. I guess I'll need to start personally reaching out to you any time I commit anything in this area in the future. I almost considered doing that here, in fact. > The most common problematic scenario I see are tables full of rows with > limited lifetime. E.g. because rows get aggregated up after a while. Before > those rows practically never got frozen - but now we'll freeze them all the > time. Fundamentally, the choice to freeze or not freeze is driven by speculation about the needs of the table, with some guidance from the user. That isn't new. It seems to me that it will always be possible for you to come up with an adversarial case that makes any given approach look bad, no matter how good it is. Of course that doesn't mean that this particular complaint has no validity; but it does mean that you need to be willing to draw the line somewhere. In particular, it would be very useful to know what the parameters of the discussion are. Obviously I cannot come up with an algorithm that can literally predict the future. But I may be able to handle specific cases of concern better, or to better help users cope in whatever way. > I whipped up a quick test: 15 pgbench threads insert rows, 1 psql \while loop > deletes older rows. Can you post the script? And what setting did you use? > Workload fits in s_b: > > Autovacuum on average generates between 1.5x-7x as much WAL as before, > depending on how things interact with checkpoints. And not just that, each > autovac cycle also takes substantially longer than before - the average time > for an autovacuum roughly doubled. Which of course increases the amount of > bloat. 
Anything that causes an autovacuum to take longer will effectively make autovacuum think that it has removed more bloat than it really has, which will then make autovacuum less aggressive when it really should be more aggressive. That's a preexisting issue, that needs to be accounted for in the context of this discussion. > This is significantly worse than I predicted. This was my first attempt at > coming up with a problematic workload. There'll likely be way worse in > production. As I said in the commit message, the current default for vacuum_freeze_strategy_threshold is considered low, and was always intended to be provisional. Something that I explicitly noted would be reviewed after the beta period is over, once we gained more experience with the setting. I think that a far higher setting could be almost as effective. 32GB, or even 64GB could work quite well, since you'll still have the FPI optimization. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
Hi,

On 2023-01-24 14:49:38 -0800, Peter Geoghegan wrote:
> From e41d3f45fcd6f639b768c22139006ad11422575f Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan
> Date: Thu, 24 Nov 2022 18:20:36 -0800
> Subject: [PATCH v17 1/3] Add eager and lazy freezing strategies to VACUUM.
>
> Eager freezing strategy avoids large build-ups of all-visible pages. It
> makes VACUUM trigger page-level freezing whenever doing so will enable
> the page to become all-frozen in the visibility map. This is useful for
> tables that experience continual growth, particularly strict append-only
> tables such as pgbench's history table. Eager freezing significantly
> improves performance stability by spreading out the cost of freezing
> over time, rather than doing most freezing during aggressive VACUUMs.
> It complements the insert autovacuum mechanism added by commit b07642db.

However, it significantly increases the overall work when rows have a somewhat limited lifetime. E.g. because rows get aggregated up after a while. That is the documented reason why vacuum_freeze_min_age exists - although I think it doesn't really achieve its documented goal anymore, after the recent page-level freezing changes.

> VACUUM determines its freezing strategy based on the value of the new
> vacuum_freeze_strategy_threshold GUC (or reloption) with logged tables;
> tables that exceed the size threshold use the eager freezing strategy.

I think that's not a sufficient guard at all. The size of a table doesn't say much about how a table is used.

> Unlogged tables and temp tables will always use eager freezing strategy,
> since there is essentially no downside.

I somewhat doubt that that is true, but certainly the cost is lower.

> Eager freezing is strictly more aggressive than lazy freezing. Settings
> like vacuum_freeze_min_age still get applied in just the same way in
> every VACUUM, independent of the strategy in use.
> The only mechanical
> difference between eager and lazy freezing strategies is that only the
> former applies its own additional criteria to trigger freezing pages.

That's only true because vacuum_freeze_min_age has been fairly radically redefined recently.

Greetings,

Andres Freund
Re: New strategies for freezing, advancing relfrozenxid early
Hi, On 2023-01-24 14:49:38 -0800, Peter Geoghegan wrote: > On Mon, Jan 16, 2023 at 5:55 PM Peter Geoghegan wrote: > > 0001 (the freezing strategies patch) is now committable IMV. Or at > > least will be once I polish the docs a bit more. I plan on committing > > 0001 some time next week, barring any objections. > > I plan on committing 0001 (the freezing strategies commit) tomorrow > morning, US Pacific time. I unfortunately haven't been able to keep up with the thread and saw this just now. But I've expressed the concern below several times before, so it shouldn't come as a surprise. I think, as committed, this will cause serious issues for some reasonably common workloads, due to substantially increased WAL traffic. The most common problematic scenario I see are tables full of rows with limited lifetime. E.g. because rows get aggregated up after a while. Before those rows practically never got frozen - but now we'll freeze them all the time. I whipped up a quick test: 15 pgbench threads insert rows, 1 psql \while loop deletes older rows. Workload fits in s_b: Autovacuum on average generates between 1.5x-7x as much WAL as before, depending on how things interact with checkpoints. And not just that, each autovac cycle also takes substantially longer than before - the average time for an autovacuum roughly doubled. Which of course increases the amount of bloat. When workload doesn't fit in s_b: Time for vacuuming goes up to ~5x. WAL volume to ~9x. Autovacuum can't keep up with bloat, every vacuum takes longer than the prior one: 65s->78s->139s->176s And that's with autovac cost limits removed! Relation size nearly doubles due to bloat. After I disabled the new strategy autovac started to catch up again: 124s->101s->103->46s->20s->28s->24s This is significantly worse than I predicted. This was my first attempt at coming up with a problematic workload. There'll likely be way worse in production. I think as-is this logic will cause massive issues. Andres
Re: New strategies for freezing, advancing relfrozenxid early
On Tue, 24 Jan 2023 at 23:50, Peter Geoghegan wrote:
>
> On Mon, Jan 16, 2023 at 5:55 PM Peter Geoghegan wrote:
> > 0001 (the freezing strategies patch) is now committable IMV. Or at
> > least will be once I polish the docs a bit more. I plan on committing
> > 0001 some time next week, barring any objections.
>
> I plan on committing 0001 (the freezing strategies commit) tomorrow
> morning, US Pacific time.
>
> Attached is v17. There are no significant differences compared to v16.
> I decided to post a new version now, ahead of commit, to show how I've
> cleaned up the docs in 0001 -- docs describing the new GUC, freeze
> strategies, and so on.

LGTM, +1 on 0001

Some more comments on 0002:

> +lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
> scanned_pages_lazy & scanned_pages_eager

We have not yet scanned the pages, so I suggest plan/scan_pages_eager and *_lazy as variable names instead, to minimize confusion about the naming.

I'll await the next iteration of 0002 in which you've completed more TODOs before I'll dig deeper into that patch.

Kind regards,

Matthias van de Meent
Re: New strategies for freezing, advancing relfrozenxid early
On Mon, Jan 23, 2023 at 3:17 AM Dilip Kumar wrote:
> My final set of comments for 0002

Thanks for the review!

> I do not understand much use of maintaining these two
> 'scanned_pages_lazy' and 'scanned_pages_eager' variables. I think
> just maintaining 'scanned_pages' should be sufficient. I do not see
> in patches also they are really used.

I agree that the visibility map snapshot struct could stand to be cleaned up -- some of that state may not be needed, and it wouldn't be that hard to use memory a little more economically, particularly with very small tables. It's on my TODO list already.

> +#define MAX_PAGES_YOUNG_TABLEAGE    0.05    /* 5% of rel_pages */
> +#define MAX_PAGES_OLD_TABLEAGE      0.70    /* 70% of rel_pages */
>
> What is the logic behind 5% and 70%? Are those based on some
> experiments? Should those be tuning parameters so that with real
> world use cases if we realise that it would be good if the eager scan
> is getting selected more frequently or less frequently then we can
> tune those parameters?

The specific multiplier constants chosen (for MAX_PAGES_YOUNG_TABLEAGE and MAX_PAGES_OLD_TABLEAGE) were based on both experiments and intuition. The precise values could be somewhat different without it really mattering, though. For example, with a table like pgbench_history (which is a really important case for the patch in general), there won't be any all-visible pages at all (at least after a short while), so it won't matter what these constants are -- eager scanning will always be chosen.

I don't think that they should be parameters. The useful parameter for users remains vacuum_freeze_table_age/autovacuum_freeze_max_age (note that vacuum_freeze_table_age usually gets its value from autovacuum_freeze_max_age due to changes in 0002). Like today, vacuum_freeze_table_age forces VACUUM to scan all not-all-frozen pages so that relfrozenxid can be advanced. Unlike today, it forces eager scanning (not aggressive mode).
But even long before eager scanning is *forced*, pressure to use eager scanning gradually builds. That pressure will usually cause some VACUUM to use eager scanning before it's strictly necessary. Overall, vacuum_freeze_table_age/autovacuum_freeze_max_age now provide loose guidance. It really has to be loose in this sense in order for lazy_scan_strategy() to have the freedom to do the right thing based on the characteristics of the table as a whole, according to its visibility map snapshot. This allows lazy_scan_strategy() to stumble upon once-off opportunities to advance relfrozenxid inexpensively, including cases where it could never happen with the current model. These opportunities are side-effects of workload characteristics that can be hard to predict [1][2]. > I think this should be moved as first if case, I mean why to do all > the calculations based on the 'tableagefrac' and > 'TABLEAGEFRAC_XXPOINT' if we are forced to scan them all. I agree the > extra computation we are doing might not really matter compared to the > vacuum work we are going to perform but still seems logical to me to > do the simple check first. This is only needed for DISABLE_PAGE_SKIPPING, which is an escape hatch option that is never supposed to be needed. I don't think that it's worth going to the trouble of indenting the code more just so this is avoided -- it really is an afterthought. Besides, the compiler might well be doing this for us. > 4. Should we move prefetching as a separate patch, instead of merging > with the scanning strategy? I don't think that breaking that out would be an improvement. A lot of the prefetching stuff informs how the visibility map code is structured. [1] https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_3 [2] https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Opportunistically_advancing_relfrozenxid_with_bursty.2C_real-world_workloads -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Jan 18, 2023 at 1:47 PM Dilip Kumar wrote:
> > On Tue, Jan 17, 2023 at 10:05 AM Peter Geoghegan wrote:

My final set of comments for 0002

1.
+struct vmsnapshot
+{
+    /* Target heap rel */
+    Relation    rel;
+    /* Scanning strategy used by VACUUM operation */
+    vmstrategy  strat;
+    /* Per-strategy final scanned_pages */
+    BlockNumber rel_pages;
+    BlockNumber scanned_pages_lazy;
+    BlockNumber scanned_pages_eager;

I do not understand much use of maintaining these two 'scanned_pages_lazy' and 'scanned_pages_eager' variables. I think just maintaining 'scanned_pages' should be sufficient. I also do not see in the patches that they are really used. lazy_scan_strategy() is using these variables, but it is getting the values of these out parameters from visibilitymap_snap_acquire(). And visibilitymap_snap_strategy() is also using them, but it seems there we just need the final result of 'scanned_pages' instead of these two variables.

2.
+#define MAX_PAGES_YOUNG_TABLEAGE    0.05    /* 5% of rel_pages */
+#define MAX_PAGES_OLD_TABLEAGE      0.70    /* 70% of rel_pages */

What is the logic behind 5% and 70%? Are those based on some experiments? Should those be tuning parameters, so that with real-world use cases, if we realise that it would be good for the eager scan to be selected more or less frequently, we can tune those parameters?

3.
+    /*
+     * VACUUM's DISABLE_PAGE_SKIPPING option overrides our decision by forcing
+     * VACUUM to scan every page (VACUUM effectively distrusts rel's VM)
+     */
+    if (force_scan_all)
+        vacrel->vmstrat = VMSNAP_SCAN_ALL;

I think this should be moved to be the first if case; I mean, why do all the calculations based on 'tableagefrac' and 'TABLEAGEFRAC_XXPOINT' if we are forced to scan them all? I agree the extra computation we are doing might not really matter compared to the vacuum work we are going to perform, but it still seems logical to me to do the simple check first.

4. Should we move prefetching to a separate patch, instead of merging it with the scanning strategy?
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
On Tue, Jan 17, 2023 at 10:05 AM Peter Geoghegan wrote: > > On Mon, Jan 16, 2023 at 8:13 PM Dilip Kumar wrote: > > I think that it makes sense to keep 'vacuum_freeze_strategy_threshold' > > strictly for freezing. But the point is that the eager scanning > > strategy is driven by table freezing needs of the table (tableagefrac) > > that make sense, but if we have selected the eager freezing based on > > the table age and its freezing need then why don't we force the eager > > freezing as well if we have selected eager scanning, after all the > > eager scanning is selected for satisfying the freezing need. > > Don't think of eager scanning as the new name for aggressive mode -- > it's a fairly different concept, because we care about costs now. > Eager scanning can be chosen just because it's very cheap relative to > the alternative of lazy scanning, even when relfrozenxid is still very > recent. (This kind of behavior isn't really new [1], but the exact > implementation from the patch is new.) > > Tables such as pgbench_branches and pgbench_tellers will reliably use > eager scanning strategy, no matter how any GUC has been set -- just > because the added cost is always zero (relative to lazy scanning). It > really doesn't matter how far along tableagefrac here, ever. These > same tables will never use eager freezing strategy, unless the > vacuum_freeze_strategy_threshold GUC is misconfigured. (This is > another example of how scanning strategy and freezing strategy may > differ for the same table.) Yes, I agree with that. Thanks for explaining in detail. > You do have a good point, though. I think that I know what you mean. > Note that antiwraparound autovacuums (or VACUUMs of tables very near > to that point) *will* always use both the eager freezing strategy and > the eager scanning strategy -- which is probably close to what you > meant. 
Right > The important point is that there can be more than one reason to > prefer one strategy to another -- and the reasons can be rather > different. Occasionally it'll be a combination of two factors together > that push things in favor of one strategy over the other -- even > though either factor on its own would not have resulted in the same > choice. Yes, that makes sense to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
On Mon, Jan 16, 2023 at 8:13 PM Dilip Kumar wrote:
> I think that it makes sense to keep 'vacuum_freeze_strategy_threshold'
> strictly for freezing. But the point is that the eager scanning
> strategy is driven by table freezing needs of the table (tableagefrac)
> that make sense, but if we have selected the eager freezing based on
> the table age and its freezing need then why don't we force the eager
> freezing as well if we have selected eager scanning, after all the
> eager scanning is selected for satisfying the freezing need.

Don't think of eager scanning as the new name for aggressive mode -- it's a fairly different concept, because we care about costs now. Eager scanning can be chosen just because it's very cheap relative to the alternative of lazy scanning, even when relfrozenxid is still very recent. (This kind of behavior isn't really new [1], but the exact implementation from the patch is new.)

Tables such as pgbench_branches and pgbench_tellers will reliably use the eager scanning strategy, no matter how any GUC has been set -- just because the added cost is always zero (relative to lazy scanning). It really doesn't matter how far along tableagefrac is here, ever. These same tables will never use the eager freezing strategy, unless the vacuum_freeze_strategy_threshold GUC is misconfigured. (This is another example of how scanning strategy and freezing strategy may differ for the same table.)

You do have a good point, though. I think that I know what you mean. Note that antiwraparound autovacuums (or VACUUMs of tables very near to that point) *will* always use both the eager freezing strategy and the eager scanning strategy -- which is probably close to what you meant.

The important point is that there can be more than one reason to prefer one strategy to another -- and the reasons can be rather different.
Occasionally it'll be a combination of two factors together that push things in favor of one strategy over the other -- even though either factor on its own would not have resulted in the same choice. [1] https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Constantly_updated_tables_.28usually_smaller_tables.29 -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Mon, Jan 16, 2023 at 11:31 PM Peter Geoghegan wrote: > > > I think '(nextXID - cutoffs->relfrozenxid) / freeze_table_age' should > > be the actual fraction right? What is the point of adding 0.5 to the > > divisor? If there is a logical reason, maybe we can explain in the > > comments. > > It's just a way of avoiding division by zero. oh, correct :) > > While looking into the logic of 'lazy_scan_strategy', I think the idea > > looks very good but the only thing is that > > we have kept eager freeze and eager scan completely independent. > > Don't you think that if a table is chosen for an eager scan > > then we should force the eager freezing as well? > > Earlier versions of the patch kind of worked that way. > lazy_scan_strategy would actually use twice the GUC setting to > determine scanning strategy. That approach could make our "transition > from lazy to eager strategies" involve an excessive amount of > "catch-up freezing" in the VACUUM operation that advanced relfrozenxid > for the first time, which you see an example of here: > > https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch > > Now we treat the scanning and freezing strategies as two independent > choices. Of course they're not independent in any practical sense, but > I think it's slightly simpler and more elegant that way -- it makes > the GUC vacuum_freeze_strategy_threshold strictly about freezing > strategy, while still leading to VACUUM advancing relfrozenxid in a > way that makes sense. It just happens as a second order effect. Why > add a special case? I think that it makes sense to keep 'vacuum_freeze_strategy_threshold' strictly for freezing. 
But the point is that the eager scanning strategy is driven by the freezing needs of the table (tableagefrac) -- that makes sense. But if we have selected eager scanning based on the table age and its freezing need, then why don't we force eager freezing as well? After all, eager scanning was selected to satisfy the freezing need. But OTOH, eager scanning might get selected because it appears that we might not have to scan too many extra pages compared to a lazy scan, so in those cases forcing eager freezing might not be wise. So maybe it is a good idea to keep them the way you have in your patch.

--
Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
On Mon, Jan 16, 2023 at 10:00 AM Peter Geoghegan wrote: > Now we treat the scanning and freezing strategies as two independent > choices. Of course they're not independent in any practical sense, but > I think it's slightly simpler and more elegant that way -- it makes > the GUC vacuum_freeze_strategy_threshold strictly about freezing > strategy, while still leading to VACUUM advancing relfrozenxid in a > way that makes sense. It just happens as a second order effect. Why > add a special case? This might be a better way to explain it: The main page-level freezing commit (commit 1de58df4) already added an optimization that triggers page-level freezing "early" (early relative to vacuum_freeze_min_age). This happens whenever a page already needs to have an FPI logged inside lazy_scan_prune -- even when we're using the lazy freezing strategy. The optimization isn't configurable, and gets applied regardless of freezing strategy (technically there is no such thing as freezing strategies on HEAD just yet, though HEAD still has this optimization). There will be workloads where the FPI optimization will result in freezing many more pages -- especially when data checksums are in use (since then we could easily need to log an FPI just so pruning can set a hint bit). As a result, certain VACUUMs that use the lazy freezing strategy will freeze in almost the same way as an equivalent VACUUM using the eager freezing strategy. Such a "nominally lazy but actually quite eager" VACUUM operation should get the same benefit in terms of relfrozenxid advancement as it would if it really had used the eager freezing strategy instead. It's fairly obvious that we'll get the same benefit in relfrozenxid advancement (comparable relfrozenxid results for comparable freezing work), since the way that VACUUM decides on its scanning strategy is not conditioned on freezing strategy (whether by the ongoing VACUUM or any other VACUUM against the same table). 
All that matters is the conditions in the table (in particular the added cost of opting for eager scanning over lazy scanning) as indicated by the visibility map at the start of each VACUUM -- how those conditions came about really isn't interesting at that point. And so lazy_scan_strategy doesn't care about them when it chooses VACUUM's scanning strategy.

There are even tables/workloads where relfrozenxid will be able to jump forward by a huge amount whenever VACUUM chooses the eager scanning strategy, despite the fact that VACUUM generally does little or no freezing to make that possible:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_3

-- Peter Geoghegan
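The FPI optimization described above (from commit 1de58df4) can be modeled roughly as follows -- a Python sketch of the heuristic, not actual vacuumlazy.c code, with function and parameter names of my own choosing:

```python
def want_opportunistic_freeze(fpis_before, fpis_after, all_frozen_eligible):
    """Sketch of the "freeze early when an FPI was already logged" idea.

    fpis_before / fpis_after: a counter of full-page images emitted by
    this backend, sampled before and after pruning the page (compare the
    "fpi_before" variable mentioned upthread for lazy_scan_prune).

    If pruning already forced an FPI -- e.g. just to set a hint bit with
    data checksums enabled -- the marginal WAL cost of freezing the page
    now is small, so freeze regardless of the configured strategy.
    """
    return all_frozen_eligible and fpis_after > fpis_before
```

This is why a "nominally lazy" VACUUM can end up freezing almost as much as an eager one on FPI-heavy workloads, which is the premise of the argument above.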
Re: New strategies for freezing, advancing relfrozenxid early
On Mon, Jan 16, 2023 at 10:10 AM Peter Geoghegan wrote: > Attached is v16, which incorporates some of Matthias' feedback. 0001 (the freezing strategies patch) is now committable IMV. Or at least will be once I polish the docs a bit more. I plan on committing 0001 some time next week, barring any objections. I should point out that 0001 is far shorter and simpler than the page-level freezing commit that already went in (commit 1de58df4). The only thing in 0001 that seems like it might be a bit controversial (when considered on its own) is the addition of the vacuum_freeze_strategy_threshold GUC/reloption. Note in particular that vacuum_freeze_strategy_threshold doesn't look like any other existing GUC; it gets applied as a threshold on the size of the rel's main fork at the beginning of vacuumlazy.c processing. As far as I know there are no objections to that approach at this time, but it does still seem worth drawing attention to now. 0001 also makes unlogged tables and temp tables always use eager freezing strategy, no matter how the GUC/reloption are set. This seems *very* easy to justify, since the potential downside of such a policy is obviously extremely low, even when we make very pessimistic assumptions. The usual cost we need to worry about when it comes to freezing is the added WAL overhead -- that clearly won't apply when we're vacuuming non-permanent tables. That really just leaves the cost of dirtying extra pages, which in general could have a noticeable system-level impact in the case of unlogged tables. Dirtying extra pages when vacuuming an unlogged table is also a non-issue. Even the eager freezing strategy only freezes "extra" pages ("extra" relative to the lazy strategy behavior) given a page that will be set all-visible in any case [1]. Such a page will need to have its page-level PD_ALL_VISIBLE bit set in any case -- which is already enough to dirty the page. 
And so there can never be any additional pages dirtied as a result of the special policy 0001 adds for non-permanent relations. [1] https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2 -- Peter Geoghegan
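Stated as code, the 0001 policy sketched above might look like this (an illustrative Python rendering -- names are mine; the real check is applied to the size of the rel's main fork at the start of vacuumlazy.c processing):

```python
def choose_freeze_strategy(rel_is_permanent, rel_main_fork_pages,
                           freeze_strategy_threshold_pages):
    """Sketch of 0001's freezing-strategy choice (not the actual code)."""
    if not rel_is_permanent:
        # Unlogged and temp tables always freeze eagerly: freezing them
        # writes no WAL, and (as argued above) never dirties extra pages.
        return "eager"
    # Permanent tables: vacuum_freeze_strategy_threshold is applied as a
    # threshold on the main fork size at the start of VACUUM.
    if rel_main_fork_pages >= freeze_strategy_threshold_pages:
        return "eager"
    return "lazy"
```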
Re: New strategies for freezing, advancing relfrozenxid early
On Sun, Jan 15, 2023 at 9:13 PM Dilip Kumar wrote: > I have looked into the patch set, I think 0001 looks good to me about > 0002 I have a few questions, 0003 I haven't yet looked at Thanks for taking a look. > I think '(nextXID - cutoffs->relfrozenxid) / freeze_table_age' should > be the actual fraction right? What is the point of adding 0.5 to the > divisor? If there is a logical reason, maybe we can explain in the > comments. It's just a way of avoiding division by zero. > While looking into the logic of 'lazy_scan_strategy', I think the idea > looks very good but the only thing is that > we have kept eager freeze and eager scan completely independent. > Don't you think that if a table is chosen for an eager scan > then we should force the eager freezing as well? Earlier versions of the patch kind of worked that way. lazy_scan_strategy would actually use twice the GUC setting to determine scanning strategy. That approach could make our "transition from lazy to eager strategies" involve an excessive amount of "catch-up freezing" in the VACUUM operation that advanced relfrozenxid for the first time, which you see an example of here: https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch Now we treat the scanning and freezing strategies as two independent choices. Of course they're not independent in any practical sense, but I think it's slightly simpler and more elegant that way -- it makes the GUC vacuum_freeze_strategy_threshold strictly about freezing strategy, while still leading to VACUUM advancing relfrozenxid in a way that makes sense. It just happens as a second order effect. Why add a special case? In principle the break-even point for eager scanning strategy (i.e. advancing relfrozenxid) is based on the added cost only under this scheme. There is no reason for lazy_scan_strategy to care about what happened in the past to make the eager scanning strategy look like a good idea. 
Similarly, there isn't any practical reason why lazy_scan_strategy needs to anticipate what will happen in the near future with freezing. -- Peter Geoghegan
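The division-by-zero guard discussed above is easy to see in isolation -- a Python rendering of the quoted C expression (the behavior when freeze_table_age is 0 is the whole point; the 0.5 changes the result only negligibly for realistic settings):

```python
def table_age_frac(next_xid, relfrozenxid, freeze_table_age):
    # Adding 0.5 to the divisor keeps the expression well-defined even
    # when freeze_table_age is configured to 0, without meaningfully
    # changing the fraction for normal (large) settings.
    return (next_xid - relfrozenxid) / (freeze_table_age + 0.5)
```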
Re: New strategies for freezing, advancing relfrozenxid early
On Mon, Jan 9, 2023 at 7:16 AM Peter Geoghegan wrote:
>
> On Tue, Jan 3, 2023 at 12:30 PM Peter Geoghegan wrote:
> > Attached is v14.
>
> This has stopped applying due to conflicts with nearby work on VACUUM
> from Tom. So I attached a new revision, v15, just to make CFTester
> green again.
>
> I didn't have time to incorporate any of the feedback from Matthias
> just yet. That will have to wait until v16.

I have looked into the patch set. I think 0001 looks good to me; about 0002 I have a few questions; 0003 I haven't yet looked at.

1.
+    /*
+     * Finally, set tableagefrac for VACUUM. This can come from either XID or
+     * XMID table age (whichever is greater currently).
+     */
+    XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
+        ((double) freeze_table_age + 0.5);

I think '(nextXID - cutoffs->relfrozenxid) / freeze_table_age' should be the actual fraction, right? What is the point of adding 0.5 to the divisor? If there is a logical reason, maybe we can explain it in the comments.

2.
While looking into the logic of 'lazy_scan_strategy', I think the idea looks very good, but the only thing is that we have kept eager freeze and eager scan completely independent. Don't you think that if a table is chosen for an eager scan then we should force eager freezing as well?

--
Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 5, 2023 at 10:19 AM Matthias van de Meent wrote: > Could this use something like autovacuum_cost_delay? I don't quite > like the use of arbitrary hardcoded millisecond delays It's not unlike (say) the way that there can sometimes be hardcoded waits inside GetMultiXactIdMembers(), which does run during VACUUM. It's not supposed to be noticeable at all. If it is noticeable in any practical sense, then the design is flawed, and should be fixed. > it can slow a > system down by a significant fraction, especially on high-contention > systems, and this potential of 60ms delay per scanned page can limit > the throughput of this new vacuum strategy to < 17 pages/second > (<136kB/sec) for highly contended sections, which is not great. We're only willing to wait the full 60ms when smaller waits don't work out. And when 60ms doesn't do it, we'll then accept an older final NewRelfrozenXid value. Our willingness to wait at all is conditioned on the existing NewRelfrozenXid tracker being affected at all by whether or not we accept reduced lazy_scan_noprune processing for the page. So the waits are naturally self-limiting. You may be right that I need to do more about the possibility of something like that happening -- it's a legitimate concern. But I think that this may be enough on its own. I've never seen a workload where more than a small fraction of all pages couldn't be cleanup locked right away. But I *have* seen workloads where VACUUM vainly waited forever for a cleanup lock on one single heap page. > It is also not unlikely that in the time it was waiting, the page > contents were updated significantly (concurrent prune, DELETEs > committed), which could result in improved bounds. I think we should > redo the dead items check if we waited, but failed to get a lock - any > tuples removed now reduce work we'll have to do later. I don't think that it matters very much. That's always true. 
It seems very unlikely that we'll get better bounds here, unless it happens by getting a full cleanup lock and then doing full lazy_scan_prune processing after all. Sure, it's possible that a concurrent opportunistic prune could make the crucial difference, even though we ourselves couldn't get a cleanup lock despite going to considerable trouble. I just don't think that it's worth doing anything about. > > +++ b/doc/src/sgml/ref/vacuum.sgml > > [...] Pages where > > + all tuples are known to be frozen are always skipped. > > "...are always skipped, unless the >DISABLE_PAGE_SKIPPING< option is used." I'll look into changing this. > > +++ b/doc/src/sgml/maintenance.sgml > > There are a lot of details being lost from the previous version of > that document. Some of the details are obsolete (mentions of > aggressive VACUUM and freezing behavior), but others are not > (FrozenTransactionId in rows from a pre-9.4 system, the need for > vacuum for prevention of issues surrounding XID wraparound). I will admit that I really hate the "Routine Vacuuming" docs, and think that they explain things in just about the worst possible way. I also think that this needs to be broken up into pieces. As I said recently, the docs are the part of the patch series that is the least worked out. > I also am not sure this is the best place to store most of these > mentions, but I can't find a different place where these details on > certain interesting parts of the system are documented, and plain > removal of the information does not sit right with me. I'm usually the person that argues for describing more implementation details in the docs. But starting with low-level details here is deeply confusing. At most these are things that should be discussed in the context of internals, as part of some completely different chapter. I'll see about moving details of things like FrozenTransactionId somewhere else. 
> Specifically, I don't like the removal of the following information > from our documentation: > > - Size of pg_xact and pg_commit_ts data in relation to > autovacuum_freeze_max_age >Although it is less likely with the new behaviour that we'll hit > these limits due to more eager freezing of transactions, it is still > important for users to have easy access to this information, and > tuning this for storage size is not useless information. That is a fair point. Though note that these things have weaker relationships with settings like autovacuum_freeze_max_age now. Mostly this is a positive improvement (in the sense that we can truncate SLRUs much more aggressively on average), but not always. > - The reason why VACUUM is essential to the long-term consistency of > Postgres' MVCC system > Informing the user about our use of 32-bit transaction IDs and > that we update an epoch when this XID wraps around does not > automatically make the user aware of the issues that surface around > XID wraparound. Retaining the explainer for XID wraparound in the docs > seems like a decent idea - it may be moved, but please do
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Jan 4, 2023 at 5:21 PM Matthias van de Meent wrote:
> Some reviews (untested; only code review so far) on these versions of
> the patches:

Thanks for the review!

> > [PATCH v14 1/3] Add eager and lazy freezing strategies to VACUUM.
> I don't think the mention of 'cutoff point' is necessary when it has
> 'Threshold'.

Fair. Will fix.

> > +int        freeze_strategy_threshold;    /* threshold to use eager
> > [...]
> > +BlockNumber freeze_strategy_threshold;
>
> Is there a way to disable the 'eager' freezing strategy? `int` cannot
> hold the maximum BlockNumber...

I'm going to fix this by switching over to making the GUC (and the reloption) GUC_UNIT_MB, while keeping it in ConfigureNamesInt[]. That approach is a little bit more cumbersome, but not by much. That'll solve this problem.

> > +lazy_scan_strategy(vacrel);
> > if (verbose)
>
> I'm slightly surprised you didn't update the message for verbose vacuum
> to indicate whether we used the eager strategy: there are several GUCs
> for tuning this behaviour, so you'd expect to want direct confirmation
> that the configuration is effective.

Perhaps that would be worth doing, but I don't think that it's all that useful in the grand scheme of things. I wouldn't mind including it, but I think that it shouldn't be given much prominence. It's certainly far less important than "aggressive vs non-aggressive" is right now. Eagerness is not just a synonym of aggressiveness. For example, every VACUUM of a table like pgbench_tellers or pgbench_branches will use the eager scanning strategy.

More generally, you have to bear in mind that the actual state of the table is just as important as the GUCs themselves. We try to avoid obligations that could be very hard or even impossible for vacuumlazy.c to fulfill. There are far weaker constraints on things like the final relfrozenxid value we'll set in pg_class (more on this below, when I talk about MinXid/MinMulti).
It will advance far more frequently and by many more XIDs than it would today, on average. But occasionally it will allow a far earlier relfrozenxid than aggressive mode would ever allow, since making some small amount of progress now is almost always much better than making no progress at all. > (looks at further patches) I see that the message for verbose vacuum > sees significant changes in patch 2 instead. It just works out to be slightly simpler that way. I want to add the scanned_pages stuff to VERBOSE in the vmsnap/scanning strategies commit, so I need to make significant changes to the initial VERBOSE message in that commit. There is little point in preserving information about aggressive mode if it's removed in the very next commit anyway. > > [PATCH v14 2/3] Add eager and lazy VM strategies to VACUUM. > Right now, we don't use syncscan to determine a startpoint. I can't > find the reason why in the available documentation: [0] discusses the > issue but without clearly describing an issue why it wouldn't be > interesting from a 'nothing lost' perspective. That's not something I've given much thought to. It's a separate issue, I think. Though I will say that one reason why I think that the vm snapshot concept will become important is that working off an immutable structure makes various things much easier, in fairly obvious ways. It makes it straightforward to reorder work. So things like parallel heap vacuuming are a lot more straightforward. I also think that it would be useful to teach VACUUM to speculatively scan a random sample of pages, just like a normal VACUUM. We start out doing a normal VACUUM that just processes scanned_pages in a random order. At some point we look at the state of pages so far. If it looks like the table really doesn't urgently need to be vacuumed, then we can give up before paying much of a cost. If it looks like the table really needs to be VACUUM'd, we can press on almost like any other VACUUM would. 
This is related to the problem of bad statistics that drive autovacuum. Deciding as much as possible at runtime, dynamically, seems promising to me.

> In addition, I noticed that progress reporting of blocks scanned
> ("heap_blocks_scanned", duh) includes skipped pages. Now that we have
> a solid grasp of how many blocks we're planning to scan, we can update
> the reported stats to how many blocks we're planning to scan (and have
> scanned), increasing the user value of that progress view.

Yeah, that's definitely a natural direction to go with this. Knowing scanned_pages from the start is a basis for much more useful progress reporting.

> > +    double      tableagefrac;
>
> I think this can use some extra info on the field itself, that it is
> the fraction of how "old" the relfrozenxid and relminmxid fields are,
> as a fraction between 0 (latest values; nextXID and nextMXID), and 1
> (values that are old by at least freeze_table_age and
> multixact_freeze_table_age (multi)transaction ids, respectively).

Agreed that that needs more t
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, 5 Jan 2023 at 02:21, I wrote:
> > On Tue, 3 Jan 2023 at 21:30, Peter Geoghegan wrote:
> > >
> > > Attached is v14.
>
> > [PATCH v14 3/3] Finish removing aggressive mode VACUUM.
>
> I've not completed a review for this patch - I'll continue on that
> tomorrow:

This is that.

> @@ -2152,10 +2109,98 @@ lazy_scan_noprune(LVRelState *vacrel,
> [...]
> +/* wait 10ms, then 20ms, then 30ms, then give up */
> [...]
> +pg_usleep(1000L * 10L * i);

Could this use something like autovacuum_cost_delay? I don't quite like the use of arbitrary hardcoded millisecond delays - it can slow a system down by a significant fraction, especially on high-contention systems, and this potential 60ms delay per scanned page can limit the throughput of this new vacuum strategy to < 17 pages/second (< 136kB/sec) for highly contended sections, which is not great.

It is also not unlikely that in the time it was waiting, the page contents were updated significantly (concurrent prune, DELETEs committed), which could result in improved bounds. I think we should redo the dead items check if we waited but failed to get a lock - any tuples removed now reduce work we'll have to do later.

> +++ b/doc/src/sgml/ref/vacuum.sgml
> [...] Pages where
> + all tuples are known to be frozen are always skipped.

"...are always skipped, unless the >DISABLE_PAGE_SKIPPING< option is used."

> +++ b/doc/src/sgml/maintenance.sgml

There are a lot of details being lost from the previous version of that document. Some of the details are obsolete (mentions of aggressive VACUUM and freezing behavior), but others are not (FrozenTransactionId in rows from a pre-9.4 system, the need for vacuum for prevention of issues surrounding XID wraparound). I also am not sure this is the best place to store most of these mentions, but I can't find a different place where these details on certain interesting parts of the system are documented, and plain removal of the information does not sit right with me.
Specifically, I don't like the removal of the following information from our documentation: - Size of pg_xact and pg_commit_ts data in relation to autovacuum_freeze_max_age Although it is less likely with the new behaviour that we'll hit these limits due to more eager freezing of transactions, it is still important for users to have easy access to this information, and tuning this for storage size is not useless information. - The reason why VACUUM is essential to the long-term consistency of Postgres' MVCC system Informing the user about our use of 32-bit transaction IDs and that we update an epoch when this XID wraps around does not automatically make the user aware of the issues that surface around XID wraparound. Retaining the explainer for XID wraparound in the docs seems like a decent idea - it may be moved, but please don't delete it. - Special transaction IDs, their meaning and where they can occur I can't seem to find any other information in the docs section, and it is useful to have users understand that certain values are considered special: FrozenTransactionId and BootstrapTransactionId. Kind regards, Matthias van de Meent
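The throughput bound mentioned in the review above (60ms worst-case delay per page, under 17 pages/second) is easy to verify. The following is a standalone sketch of that arithmetic; the constants mirror the quoted hunk ("wait 10ms, then 20ms, then 30ms, then give up"), not any committed code:

```c
#include <assert.h>

/* Linear backoff as in the quoted hunk: pg_usleep(1000L * 10L * i) for
 * i = 1..3 waits 10ms, 20ms, then 30ms before giving up. */
#define BACKOFF_ATTEMPTS 3
#define BACKOFF_UNIT_MS  10L

static long
worst_case_wait_ms(void)
{
    long total = 0;

    for (int i = 1; i <= BACKOFF_ATTEMPTS; i++)
        total += BACKOFF_UNIT_MS * i;   /* 10 + 20 + 30 = 60ms per page */
    return total;
}

/* At a worst case of 60ms per page, a highly contended section caps out
 * below 17 pages/second, i.e. under 17 * 8kB = 136kB/sec of heap. */
static long
pages_per_second_bound(void)
{
    return 1000 / worst_case_wait_ms(); /* integer floor */
}
```

The per-page cap is what makes a tunable delay (as suggested above) attractive over hardcoded constants.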
Re: New strategies for freezing, advancing relfrozenxid early
On Tue, 3 Jan 2023 at 21:30, Peter Geoghegan wrote:
>
> Attached is v14.

Some reviews (untested; only code review so far) on these versions of the patches:

> [PATCH v14 1/3] Add eager and lazy freezing strategies to VACUUM.

> +/*
> + * Threshold cutoff point (expressed in # of physical heap rel blocks in
> + * rel's main fork) that triggers VACUUM's eager freezing strategy
> + */

I don't think the mention of 'cutoff point' is necessary when it has 'Threshold'.

> +int freeze_strategy_threshold; /* threshold to use eager
> [...]
> +BlockNumber freeze_strategy_threshold;

Is there a way to disable the 'eager' freezing strategy? `int` cannot hold the maximum BlockNumber...

> + lazy_scan_strategy(vacrel);
> if (verbose)

I'm slightly surprised you didn't update the message for verbose vacuum to indicate whether we used the eager strategy: there are several GUCs for tuning this behaviour, so you'd expect to want direct confirmation that the configuration is effective.

(looks at further patches) I see that the message for verbose vacuum sees significant changes in patch 2 instead.

---

> [PATCH v14 2/3] Add eager and lazy VM strategies to VACUUM.

General comments:

I don't see anything regarding scan synchronization in the vmsnap scan system. I understand that VACUUM is a load that is significantly different from normal SEQSCANs, but are there good reasons to _not_ synchronize the start of VACUUM? Right now, we don't use syncscan to determine a startpoint. I can't find the reason why in the available documentation: [0] discusses the issue but without clearly describing why it wouldn't be interesting from a 'nothing lost' perspective.

In addition, I noticed that progress reporting of blocks scanned ("heap_blocks_scanned", duh) includes skipped pages. Now that we have a solid grasp of how many blocks we're planning to scan, we can update the reported stats to how many blocks we're planning to scan (and have scanned), increasing the user value of that progress view.
[0] https://www.postgresql.org/message-id/flat/19398.1212328662%40sss.pgh.pa.us#17b2feb0fde6a41779290632d8c70ef1

> +double tableagefrac;

I think this can use some extra info on the field itself, that it is the fraction of how "old" the relfrozenxid and relminmxid fields are, as a fraction between 0 (latest values; nextXID and nextMXID), and 1 (values that are old by at least freeze_table_age and multixact_freeze_table_age (multi)transaction ids, respectively).

> -#define VACOPT_DISABLE_PAGE_SKIPPING 0x80 /* don't skip any pages */
> +#define VACOPT_DISABLE_PAGE_SKIPPING 0x80 /* don't skip using VM */

I'm not super happy with this change. I don't think we should touch the VM using snapshots _at all_ when disable_page_skipping is set:

> + * Decide vmsnap scanning strategy.
> *
> - * This test also enables more frequent relfrozenxid advancement during
> - * non-aggressive VACUUMs. If the range has any all-visible pages then
> - * skipping makes updating relfrozenxid unsafe, which is a real downside.
> + * First acquire a visibility map snapshot, which determines the number of
> + * pages that each vmsnap scanning strategy is required to scan for us in
> + * passing.

I think we should not take disk-backed vm snapshots when force_scan_all is set. We need VACUUM to be able to run on very resource-constrained environments, and this does not do that - it adds a disk space requirement for the VM snapshot for all but the smallest relation sizes, which is bad when you realize that we need VACUUM when we want to clean up things like CLOG.

Additionally, it took me several reads of the code and comments to understand what the decision-making process for lazy vs eager is, and why. The comments are interspersed with the code, with no single place that describes it from a bird's eye view.
I think something like the following would be appreciated by other readers of the code:

+ We determine whether we choose the eager or lazy scanning strategy based on how many extra pages the eager strategy would take over the lazy strategy, and how "old" the table is (as determined in tableagefrac):
+ When a table is still "young" (tableagefrac < TABLEAGEFRAC_MIDPOINT), the eager strategy is accepted if we need to scan 5% (MAX_PAGES_YOUNG_TABLEAGE) more of the table.
+ As the table gets "older" (tableagefrac between MIDPOINT and HIGHPOINT), the threshold for eager scanning is relaxed linearly from this 5% to 70% (MAX_PAGES_OLD_TABLEAGE) of the table being scanned extra (over what would be scanned by the lazy strategy).
+ Once the tableagefrac passes HIGHPOINT, we stop considering the lazy strategy, and always eagerly scan the table.

> @@ -1885,6 +1902,30 @@ retry:
> tuples_frozen = 0; /* avoid miscounts in instrumentation */
> }
>
> /*
> + * There should never be dead or deleted tuples when PD_ALL_VISIBLE is
> + * set. Che
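Read together with the tableagefrac definition quoted above, the decision rule being described can be sketched as standalone code. This is a hypothetical illustration only: the MIDPOINT/HIGHPOINT values are assumed (the real constants live in the patch), and XID wraparound handling is deliberately ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Illustrative values only -- the actual constants live in the patch */
#define TABLEAGEFRAC_MIDPOINT    0.5
#define TABLEAGEFRAC_HIGHPOINT   0.9
#define MAX_PAGES_YOUNG_TABLEAGE 0.05   /* 5% extra pages tolerated  */
#define MAX_PAGES_OLD_TABLEAGE   0.70   /* 70% extra pages tolerated */

/* 0.0: relfrozenxid is brand new (== next XID); 1.0: relfrozenxid is at
 * least freeze_table_age transactions old.  Clamped to [0, 1]; real code
 * must use wraparound-aware XID arithmetic. */
static double
table_age_frac(TransactionId next_xid, TransactionId relfrozenxid,
               uint32_t freeze_table_age)
{
    double frac = (double) (next_xid - relfrozenxid) /
                  (double) freeze_table_age;

    return frac > 1.0 ? 1.0 : (frac < 0.0 ? 0.0 : frac);
}

/* extra_frac: fraction of the table's pages the eager strategy would scan
 * over and above what the lazy strategy would scan */
static bool
choose_eager_strategy(double tableagefrac, double extra_frac)
{
    double threshold;

    if (tableagefrac >= TABLEAGEFRAC_HIGHPOINT)
        return true;            /* lazy strategy no longer considered */

    if (tableagefrac < TABLEAGEFRAC_MIDPOINT)
        threshold = MAX_PAGES_YOUNG_TABLEAGE;
    else
    {
        /* linear ramp from 5% up to 70% between MIDPOINT and HIGHPOINT */
        double t = (tableagefrac - TABLEAGEFRAC_MIDPOINT) /
                   (TABLEAGEFRAC_HIGHPOINT - TABLEAGEFRAC_MIDPOINT);

        threshold = MAX_PAGES_YOUNG_TABLEAGE +
                    t * (MAX_PAGES_OLD_TABLEAGE - MAX_PAGES_YOUNG_TABLEAGE);
    }
    return extra_frac <= threshold;
}
```

The point of the ramp is that the extra-scan budget grows with table age, so relfrozenxid advancement becomes progressively more attractive before any hard cutoff is reached.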
Re: New strategies for freezing, advancing relfrozenxid early
On Mon, 2023-01-02 at 11:45 -0800, Peter Geoghegan wrote: > What do you think of the wording adjustments in the attached patch? > It's based on your suggested wording. Great, thank you. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: New strategies for freezing, advancing relfrozenxid early
On Sat, Dec 31, 2022 at 12:45 PM Peter Geoghegan wrote: > On Sat, Dec 31, 2022 at 11:46 AM Jeff Davis wrote: > > "We have no freeze plans to execute, so there's no cost to following > > the freeze path. This is important in the case where the page is > > entirely frozen already, so that the page will be marked as such in the > > VM." > > I'm happy to use your wording instead -- I'll come up with a patch for that. What do you think of the wording adjustments in the attached patch? It's based on your suggested wording. -- Peter Geoghegan 0001-Tweak-page-level-freezing-comments.patch Description: Binary data
Re: New strategies for freezing, advancing relfrozenxid early
On Sat, Dec 31, 2022 at 11:46 AM Jeff Davis wrote: > On Fri, 2022-12-30 at 16:58 -0800, Peter Geoghegan wrote: > > Following the path of freezing a page is *always* valid, by > > definition. Including when there are zero freeze plans to execute, or > > even zero tuples to examine in the first place -- we'll at least be > > able to perform nominal freezing, no matter what. > > This is a much clearer description, in my opinion. Do you think this is > already reflected in the comments (and I missed it)? I am arguably the person least qualified to answer this question. :-) > Perhaps the comment in the "if (tuples_frozen == 0)" branch could be > something more like: > > "We have no freeze plans to execute, so there's no cost to following > the freeze path. This is important in the case where the page is > entirely frozen already, so that the page will be marked as such in the > VM." I'm happy to use your wording instead -- I'll come up with a patch for that. In my mind it's just a restatement of what's there already. I assume that you're right about it being clearer this way. > Of course, I'm sure there are some nuances that I'm still missing. I don't think that there is, actually. I now believe that you totally understand the mechanics involved here. I'm glad that I was able to ascertain that that's all it was. It's worth going to the trouble of getting something like this exactly right. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Fri, 2022-12-30 at 16:58 -0800, Peter Geoghegan wrote: > Following the path of freezing a page is *always* valid, by > definition. Including when there are zero freeze plans to execute, or > even zero tuples to examine in the first place -- we'll at least be > able to perform nominal freezing, no matter what. This is a much clearer description, in my opinion. Do you think this is already reflected in the comments (and I missed it)? Perhaps the comment in the "if (tuples_frozen == 0)" branch could be something more like: "We have no freeze plans to execute, so there's no cost to following the freeze path. This is important in the case where the page is entirely frozen already, so that the page will be marked as such in the VM." I'm not even sure we really want a new concept of "nominal freezing". I think you are right to just call it a degenerate case where it can be interpreted as either freezing zero things or not freezing; and the former is convenient for us because we want to follow that code path. That would be another good way of writing the comment, in my opinion. Of course, I'm sure there are some nuances that I'm still missing. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: New strategies for freezing, advancing relfrozenxid early
On Fri, Dec 30, 2022 at 1:12 PM Peter Geoghegan wrote: > > "Nominal freezing" is happening when there are no freeze plans at all. > > I get that it's to manage control flow so that the right thing happens > > later. But I think it should be defined in terms of what state the page > > is in so that we know that following a given path is valid. Defining > > "nominal freezing" as a case where there are no freeze plans is just > > confusing to me. > > What would you prefer? The state that the page is in is not something > that I want to draw much attention to, because it's confusing in a way > that mostly isn't worth talking about. I probably should have addressed what you said more directly. Here goes: Following the path of freezing a page is *always* valid, by definition. Including when there are zero freeze plans to execute, or even zero tuples to examine in the first place -- we'll at least be able to perform nominal freezing, no matter what. OTOH, following the "no freeze" path is permissible whenever the freeze_required flag hasn't been set during any call to heap_prepare_freeze_tuple(). It is never actually mandatory for lazy_scan_prune() to *not* freeze. It's a bit like how a simple point can be understood as a degenerate circle of radius 0. It's an abstract definition, which is just a tool for describing things precisely -- hopefully a useful tool. I welcome the opportunity to be able to describe things in a way that is clearer or more useful, in whatever way. But it's not like I haven't already put in significant effort to this exact question of what "freezing the page" really means to lazy_scan_prune(). Naming things is hard. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Fri, Dec 30, 2022 at 12:43 PM Jeff Davis wrote: > I always understood "freezing" to mean that a concrete action was > taken, and associated WAL generated. "When I use a word… it means just what I choose it to mean -- neither more nor less". I have also always understood freezing that way too. In fact, I still do understand it that way -- I don't think that it has been undermined by any of this. I've just invented this esoteric concept of nominal freezing that can be ignored approximately all the time, to solve one narrow problem that needed to be solved, that isn't that interesting anywhere else. > "Nominal freezing" is happening when there are no freeze plans at all. > I get that it's to manage control flow so that the right thing happens > later. But I think it should be defined in terms of what state the page > is in so that we know that following a given path is valid. Defining > "nominal freezing" as a case where there are no freeze plans is just > confusing to me. What would you prefer? The state that the page is in is not something that I want to draw much attention to, because it's confusing in a way that mostly isn't worth talking about. When we do nominal freezing, we don't necessarily go on to set the page all-frozen. In fact, it's not particularly likely that that will end up happening! Bear in mind that the exact definition of "freeze the page" is somewhat creative, even without bringing nominal freezing into it. It just has to be in order to support the requirements we have for MultiXacts (in particular for FRM_NOOP processing). The new concepts don't quite map directly on to the old ones. At the same time, it really is very often the case that "freezing the page" will perform maximally aggressive freezing, in the sense that it does precisely what a VACUUM FREEZE would do given the same page (in any Postgres version). -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Mon, 2022-12-26 at 12:53 -0800, Peter Geoghegan wrote: > * v12 merges together the code for the "freeze the page" > lazy_scan_prune path with the block that actually calls > heap_freeze_execute_prepared(). > > This should make it clear that pagefrz.freeze_required really does > mean that freezing is required. Hopefully that addresses Jeff's > recent > concern. It's certainly an improvement, in any case. Better, thank you. > * On a related note, comments around the same point in > lazy_scan_prune > as well as comments above the HeapPageFreeze struct now explain a > concept I decided to call "nominal freezing". This is the case where > we "freeze a page" without having any freeze plans to execute. > > "nominal freezing" is the new name for a concept I invented many > months ago, which helps to resolve subtle problems with the way that > heap_prepare_freeze_tuple is tasked with doing two different things > for its lazy_scan_prune caller: 1. telling lazy_scan_prune how it > would freeze each tuple (were it to freeze the page), and 2. helping > lazy_scan_prune to determine if the page should become all-frozen in > the VM. The latter is always conditioned on page-level freezing > actually going ahead, since everything else in > heap_prepare_freeze_tuple has to work that way. > > We always freeze a page with zero freeze plans (or "nominally freeze" > the page) in lazy_scan_prune (which is nothing new in itself). We > thereby avoid breaking heap_prepare_freeze_tuple's working assumption > that all it needs to focus on what the page will look like after > freezing executes, while also avoiding senselessly throwing away the > ability to set a page all-frozen in the VM in lazy_scan_prune when > it'll cost us nothing extra. 
That is, by always freezing in the event > of zero freeze plans, we won't senselessly miss out on setting a page > all-frozen in cases where we don't actually have to execute any > freeze > plans to make that safe, while the "freeze the page path versus don't > freeze the page path" dichotomy still works as a high level > conceptual > abstraction. I always understood "freezing" to mean that a concrete action was taken, and associated WAL generated. "Nominal freezing" is happening when there are no freeze plans at all. I get that it's to manage control flow so that the right thing happens later. But I think it should be defined in terms of what state the page is in so that we know that following a given path is valid. Defining "nominal freezing" as a case where there are no freeze plans is just confusing to me. -- Jeff Davis PostgreSQL Contributor Team - AWS
RE: New strategies for freezing, advancing relfrozenxid early
Dear Peter, > Anyway, this assertion is wrong, and simply needs to be removed. > Thanks for the report Thanks for modifying for quickly! I found your commit in the remote repository. I will watch and report again if there are another issue. Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: New strategies for freezing, advancing relfrozenxid early
On Mon, Dec 26, 2022 at 10:57 PM Hayato Kuroda (Fujitsu) wrote:
> I guessed that this assertion failure seemed to be caused by the commit
> 4ce3af[2], because the Assert() seemed to be added by the commit.

I agree that the problem is with this assertion, which is on the master branch (not in recent versions of the patch series itself) following commit 4ce3af:

    else
    {
        /*
         * Freeze plan for tuple "freezes xmax" in the strictest sense:
         * it'll leave nothing in xmax (neither an Xid nor a MultiXactId).
         */
        Assert(MultiXactIdPrecedes(xid, cutoffs->OldestMxact));
        ...
    }

The problem is that FRM_INVALIDATE_XMAX multi processing can occur both with Multis from before OldestMxact and with Multis >= OldestMxact. The latter case (the >= case) is far less common, but still quite possible. Not sure how I missed that.

Anyway, this assertion is wrong, and simply needs to be removed. Thanks for the report
--
Peter Geoghegan
RE: New strategies for freezing, advancing relfrozenxid early
Dear Peter, Jeff, While reviewing other patches, I found that cfbot raised ERROR during the VACUUM FREEZE [1] on FreeBSD instance. It seemed that same error has been occurred in other threads. ``` 2022-12-23 08:50:20.175 UTC [34653][postmaster] LOG: server process (PID 37171) was terminated by signal 6: Abort trap 2022-12-23 08:50:20.175 UTC [34653][postmaster] DETAIL: Failed process was running: VACUUM FREEZE tab_freeze; 2022-12-23 08:50:20.175 UTC [34653][postmaster] LOG: terminating any other active server processes ``` I guessed that this assertion failure seemed to be caused by the commit 4ce3af[2], because the Assert() seemed to be added by the commit. ``` [08:51:31.189] #3 0x009b88d7 in ExceptionalCondition (conditionName=, fileName=0x2fd9df "../src/backend/access/heap/heapam.c", lineNumber=lineNumber@entry=6618) at ../src/backend/utils/error/assert.c:66 [08:51:31.189] No locals. [08:51:31.189] #4 0x00564205 in heap_prepare_freeze_tuple (tuple=0x8070f0bb0, cutoffs=cutoffs@entry=0x80222e768, frz=0x7fffb2d0, totally_frozen=totally_frozen@entry=0x7fffc478, relfrozenxid_out=, relfrozenxid_out@entry=0x7fffc4a8, relminmxid_out=, relminmxid_out@entry=0x7fffc474) at ../src/backend/access/heap/heapam.c:6618 ``` Sorry for noise if you have already known or it is not related with this thread. [1]: https://cirrus-ci.com/task/4580705867399168 [2]: https://github.com/postgres/postgres/commit/4ce3afb82ecfbf64d4f6247e725004e1da30f47c Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Dec 21, 2022 at 4:30 PM Jeff Davis wrote: > The confusing thing to me is perhaps just the name -- to me, > "freeze_required" suggests that if it were set to true, it would cause > freezing to happen. But as far as I can tell, it does not cause > freezing to happen, it causes some other things to happen that are > necessary when freezing happens (updating and using the right > trackers). freeze_required is about what's required, which tells us nothing about what will happen when it's not required (could go either way, depending on how lazy_scan_prune feels about it). Setting freeze_required=true implies that heap_prepare_freeze_tuple has stopped doing maintenance of the "no freeze" trackers. When it sets freeze_required=true, it really *does* force freezing to happen, in every practical sense. This happens because lazy_scan_prune does what it's told to do when it's told that freezing is required. Because of course it does, why wouldn't it? So...I still don't get what you mean. Why would lazy_scan_prune ever break its contract with heap_prepare_freeze_tuple? And in what sense would you say that heap_prepare_freeze_tuple's setting freeze_required=true doesn't quite amount to "forcing freezing"? Are you worried about the possibility that lazy_scan_prune will decide to rebel at some point, and fail to honor its contract with heap_prepare_freeze_tuple? :-) > A minor point, no need to take action here. Perhaps rename the > variable. Andres was the one that suggested this name, actually. I initially just called it "freeze", but I think that Andres had it right. > I think 0001+0002 are about ready. Great. I plan on committing 0001 in the next few days. Committing 0002 might take a bit longer. Thanks -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Tue, 2022-12-20 at 21:26 -0800, Peter Geoghegan wrote: > When freeze_required is set to true, that means that lazy_scan_prune > literally has no choice -- it simply must freeze the page as > instructed by heap_prepare_freeze_tuple/FreezeMultiXactId. It's not > just a strong suggestion -- it's crucial that lazy_scan_prune freezes > the page as instructed. The confusing thing to me is perhaps just the name -- to me, "freeze_required" suggests that if it were set to true, it would cause freezing to happen. But as far as I can tell, it does not cause freezing to happen, it causes some other things to happen that are necessary when freezing happens (updating and using the right trackers). A minor point, no need to take action here. Perhaps rename the variable. I think 0001+0002 are about ready. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: New strategies for freezing, advancing relfrozenxid early
On Tue, Dec 20, 2022 at 7:15 PM Peter Geoghegan wrote: > On Tue, Dec 20, 2022 at 5:44 PM Jeff Davis wrote: > > Next, the 'freeze_required' field suggests that it's more involved in > > the control flow that causes freezing than it actually is. All it does > > is communicate how the trackers need to be adjusted. The return value > > of heap_prepare_freeze_tuple() (and underneath, the flags set by > > FreezeMultiXactId()) are what actually control what happens. It would > > be nice to make this more clear somehow. > > I'm not sure what you mean. Page-level freezing *doesn't* have to go > ahead when freeze_required is not ever set to true for any tuple on > the page (which is most of the time, in practice). lazy_scan_prune > gets to make a choice about freezing the page, when the choice is > available. Oh wait, I think I see the point of confusion now. When freeze_required is set to true, that means that lazy_scan_prune literally has no choice -- it simply must freeze the page as instructed by heap_prepare_freeze_tuple/FreezeMultiXactId. It's not just a strong suggestion -- it's crucial that lazy_scan_prune freezes the page as instructed. The "no freeze" trackers (HeapPageFreeze.NoFreezePageRelfrozenXid and HeapPageFreeze.NoFreezePageRelminMxid) won't have been maintained properly when freeze_required was set, so lazy_scan_prune can't expect to use them -- doing so would lead to VACUUM setting incorrect values in pg_class later on. Avoiding the work of maintaining those "no freeze" trackers isn't just a nice-to-have microoptimization -- it is sometimes very important. We kind of rely on this to be able to avoid getting too many MultiXact member SLRU buffer misses inside FreezeMultiXactId. There is a comment above FreezeMultiXactId that advises its caller that it had better not call heap_tuple_should_freeze when freeze_required is set to true, because that could easily lead to multixact member SLRU buffer misses -- misses that FreezeMultiXactId set out to avoid itself. 
It could actually be cheaper to freeze than to not freeze, in the case of a Multi -- member space misses can sometimes be really expensive. And so FreezeMultiXactId sometimes freezes a Multi even though it's not strictly required to do so. Note also that this isn't a new behavior -- it's actually an old one, for the most part. It kinda doesn't look that way, because we haven't passed down separate FreezeLimit/OldestXmin cutoffs (and separate OldestMxact/MultiXactCutoff cutoffs) until now. But we often don't need that granular information to be able to process Multis before the multi value is < MultiXactCutoff. If you look at how FreezeMultiXactId works, in detail, you'll see that even on Postgres HEAD it can (say) set a tuple's xmax to InvalidTransactionId long before the multi value is < MultiXactCutoff. It just needs to detect that the multi is not still running, and notice that it's HEAP_XMAX_IS_LOCKED_ONLY(). Stuff like that happens quite a bit. So for the most part "eager processing of Multis as a special case" is an old behavior, that has only been enhanced a little bit (the really important, new change in FreezeMultiXactId is how the FRM_NOOP case works with FreezeLimit, even though OldestXmin is used nearby -- this is extra confusing because 0002 doesn't change how we use FreezeLimit -- it actually changes every other use of FreezeLimit nearby, making it OldestXmin). -- Peter Geoghegan
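The freeze_required contract discussed in this exchange can be shown in miniature. The names below are modeled on the thread's HeapPageFreeze trackers (FreezePageRelfrozenXid vs. NoFreezePageRelfrozenXid); this is an illustrative sketch of the caller-side rule, not the committed code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Miniature stand-in for the page-freeze bookkeeping discussed above.
 * When freeze_required is true, the "no freeze" tracker was never
 * maintained, so the caller must take the freeze path and use only the
 * "freeze" tracker for its new relfrozenxid. */
typedef struct
{
    bool          freeze_required;
    TransactionId FreezePageRelfrozenXid;   /* valid after freezing       */
    TransactionId NoFreezePageRelfrozenXid; /* valid only if not required */
} PageFreezeState;

static TransactionId
relfrozenxid_for_page(const PageFreezeState *frz, bool caller_wants_freeze)
{
    /* freeze_required removes the caller's choice entirely */
    bool do_freeze = frz->freeze_required || caller_wants_freeze;

    return do_freeze ? frz->FreezePageRelfrozenXid
                     : frz->NoFreezePageRelfrozenXid;
}
```

This is why "freeze_required" really does force freezing in every practical sense: ignoring it would feed a stale tracker into the final pg_class update.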
Re: New strategies for freezing, advancing relfrozenxid early
On Tue, Dec 20, 2022 at 5:44 PM Jeff Davis wrote:
> Comments on 0002:
>
> Can you explain the following portion of the diff:
>
> > - else if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
> > + else if (MultiXactIdPrecedes(multi, cutoffs->OldestMxact))
>
> ...
>
> > + /* Can't violate the MultiXactCutoff invariant, either */
> > + if (!need_replace)
> > + need_replace = MultiXactIdPrecedes(
> > + multi, cutoffs->MultiXactCutoff);

Don't forget the historic context: before Postgres 15's commit 0b018fab, VACUUM's final relfrozenxid always came from FreezeLimit. Almost all of this code predates that work. So the general idea that you can make a "should I freeze or should I ratchet back my relfrozenxid tracker instead?" trade-off at the level of individual tuples and pages is still a very new one. Right now it's only applied within lazy_scan_noprune(), but 0002 leverages the same principles here.

Before now, these heapam.c freezing routines had cutoffs called cutoff_xid and cutoff_multi. These had values that actually came from vacuumlazy.c's FreezeLimit and MultiXactCutoff cutoffs (which was rather unclear). But cutoff_xid and cutoff_multi were *also* used as inexact proxies for OldestXmin and OldestMxact (also kind of unclear, but true). For example, there are some sanity checks in heapam.c that kind of pretend that cutoff_xid is OldestXmin, even though it usually isn't the same value (it can be, but only during VACUUM FREEZE, or when the min freeze age is 0 in some other way).

So 0002 teaches the same heapam.c code about everything -- about all of the different cutoffs, and about the true requirements of VACUUM around relfrozenxid advancement. In fact, 0002 makes vacuumlazy.c cede a lot of control of "XID stuff" to the same heapam.c code, freeing it up to think about freezing as something that works at the level of physical pages. This is key to allowing vacuumlazy.c to reason about freezing at the level of the whole table.
It thinks about physical blocks, leaving logical XIDs up to heapam.c code. This business that you asked about in FreezeMultiXactId() is needed so that we can allow vacuumlazy.c to "think in terms of physical pages", while at the same time avoiding allocating new Multis in VACUUM -- which requires "thinking about individual xmax fields" instead -- a somewhat conflicting goal. We're really trying to have it both ways (we get page-level freezing, with a little tuple-level freezing on the side, sufficient to avoid allocating new Multis during VACUUMs in roughly the same way as we do right now).

In most cases "freezing a page" removes all XIDs < OldestXmin, and all MXIDs < OldestMxact. It doesn't quite work that way in certain rare cases involving MultiXacts, though. It is convenient to define "freeze the page" in a way that gives heapam.c's FreezeMultiXactId() the leeway to put off the work of processing an individual tuple's xmax, whenever it happens to be a MultiXactId that would require an expensive second pass to process aggressively (allocating a new Multi during VACUUM is especially worth avoiding here). Our definition of "freeze the page" is a bit creative, at least if you're used to thinking about it in terms of strict XID-wise cutoffs like OldestXmin/FreezeLimit. But even if you do think of it in terms of XIDs, the practical difference is extremely small.

FreezeMultiXactId() effectively makes a decision on how to proceed with processing at the level of each individual xmax field. Its no-op multi processing "freezes" an xmax in the event of a costly-to-process xmax on a page when (for whatever reason) page-level freezing is triggered. If, on the other hand, page-level freezing isn't triggered for the page, then page-level no-op processing takes care of the multi for us instead.
Either way, the remaining Multi will ratchet back VACUUM's relfrozenxid and/or relminmxid trackers as required, and we won't need an expensive second pass over the multi (unless we really have no choice, for example during a VACUUM FREEZE, where OldestXmin==FreezeLimit). > Regarding correctness, it seems like the basic structure and invariants > are the same, and it builds on the changes already in 9e5405993c. Patch > 0002 seems *mostly* about making choices within the existing framework. > That gives me more confidence. You're right that it's the same basic invariants as before, of course. Turns out that those invariants can be pushed quite far. Though note that I kind of invented a new invariant (not really, sort of). Well, it's a postcondition, which is a sort of invariant: any scanned heap page that can be cleanup locked must never have any remaining XIDs < FreezeLimit, nor can any MXIDs < MultiXactCutoff remain. But a cleanup-locked page does *not* need to get rid of all XIDs < OldestXmin, nor MXIDs < OldestMxact. This flexibility is mostly useful because it allows lazy_scan_prune to just decide to not freeze. But, to a much lesser degree, it's useful bec
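The page-level postcondition stated above (no remaining XIDs < FreezeLimit on a cleanup-locked page, while XIDs in [FreezeLimit, OldestXmin) may legitimately remain unfrozen) can be expressed as a toy check. This is illustrative only: real XID comparisons use wraparound-aware TransactionIdPrecedes(), which a plain `<` ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Postcondition for a scanned, cleanup-locked heap page: every XID still
 * on the page must be >= FreezeLimit.  XIDs between FreezeLimit and
 * OldestXmin may remain -- lazy_scan_prune was free not to freeze those. */
static bool
page_meets_postcondition(const TransactionId *remaining, int n,
                         TransactionId freeze_limit)
{
    for (int i = 0; i < n; i++)
        if (remaining[i] < freeze_limit)    /* ignoring XID wraparound */
            return false;
    return true;
}
```

The flexibility lives entirely in the gap between FreezeLimit and OldestXmin; anything older than FreezeLimit must have been frozen.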
Re: New strategies for freezing, advancing relfrozenxid early
On Sun, 2022-12-18 at 14:20 -0800, Peter Geoghegan wrote: > On Thu, Dec 15, 2022 at 10:53 AM Peter Geoghegan wrote: > > I agree that the burden of catch-up freezing is excessive here (in > > fact I already wrote something to that effect on the wiki page). > > The > > likely solution can be simple enough. > > Attached is v10, which fixes this issue, but using a different > approach to the one I sketched here. Comments on 0002: Can you explain the following portion of the diff: - else if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff)) + else if (MultiXactIdPrecedes(multi, cutoffs->OldestMxact)) ... + /* Can't violate the MultiXactCutoff invariant, either */ + if (!need_replace) + need_replace = MultiXactIdPrecedes( +multi, cutoffs->MultiXactCutoff); Regarding correctness, it seems like the basic structure and invariants are the same, and it builds on the changes already in 9e5405993c. Patch 0002 seems *mostly* about making choices within the existing framework. That gives me more confidence. That being said, it does push harder against the limits on both sides. If I understand correctly, that means pages with wider distributions of xids are going to persist longer, which could expose pre-existing bugs in new and interesting ways. Next, the 'freeze_required' field suggests that it's more involved in the control flow that causes freezing than it actually is. All it does is communicate how the trackers need to be adjusted. The return value of heap_prepare_freeze_tuple() (and underneath, the flags set by FreezeMultiXactId()) are what actually control what happens. It would be nice to make this more clear somehow. The comment: /* * If we freeze xmax, make absolutely sure that it's not an XID that * is important. (Note, a lock-only xmax can be removed independent * of committedness, since a committed lock holder has released the * lock). */ caused me to go down a rabbit hole looking for edge cases where we might want to freeze an xmax but not an xmin; e.g. 
tup.xmax < OldestXmin < tup.xmin or the related case where tup.xmax < RecentXmin < tup.xmin. I didn't find a problem, so that's good news. I also tried some pgbench activity along with concurrent vacuums (and vacuum freezes) along with periodic verify_heapam(). No problems there. Did you already describe the testing you've done for 0001+0002 specifically? It's not radically new logic, but it would be good to try to catch minor state-handling errors. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: New strategies for freezing, advancing relfrozenxid early
Hi! I'll try to apply this patch on top of my branch with Pluggable TOAST to test these mechanics with the new TOAST. I'll reply with the results. It could be difficult though, because both have a lot of changes that affect the same code. >I'm not sure how much this would help with bloat. I suspect that it >could make a big difference with the right workload. If you always >need frequent autovacuums, just to deal with bloat, then there is >never a good time to run an aggressive antiwraparound autovacuum. An >aggressive AV will probably end up taking much longer than the typical >autovacuum that deals with bloat. While the aggressive AV will remove >as much bloat as any other AV, in theory, that might not help much. If >the aggressive AV takes as long as (say) 5 regular autovacuums would >have taken, and if you really needed those 5 separate autovacuums to >run, just to deal with the bloat, then that's a real problem. The >aggressive AV effectively causes bloat with such a workload. On Tue, Dec 20, 2022 at 12:01 PM Jeff Davis wrote: > On Sun, 2022-12-18 at 14:20 -0800, Peter Geoghegan wrote: > > Attached is v10, which fixes this issue, but using a different > > approach to the one I sketched here. > > In 0001, it's fairly straightforward rearrangement and looks like an > improvement to me. I have a few complaints, but they are about pre- > existing code that you moved around, and I like that you didn't > editorialize too much while just moving code around. +1 from me. > > -- > Jeff Davis > PostgreSQL Contributor Team - AWS -- Regards, Nikita Malakhov Postgres Professional https://postgrespro.ru/
Re: New strategies for freezing, advancing relfrozenxid early
On Sun, 2022-12-18 at 14:20 -0800, Peter Geoghegan wrote: > Attached is v10, which fixes this issue, but using a different > approach to the one I sketched here. In 0001, it's fairly straightforward rearrangement and looks like an improvement to me. I have a few complaints, but they are about pre- existing code that you moved around, and I like that you didn't editorialize too much while just moving code around. +1 from me. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Dec 15, 2022 at 11:59 PM Nikita Malakhov wrote: > I've found this discussion very interesting, in view of vacuuming > TOAST tables is always a problem because these tables tend to > bloat very quickly with dead data - just to remind, all TOAST-able > columns of the relation use the same TOAST table which is one > for the relation, and TOASTed data are not updated - there are > only insert and delete operations. I don't think that it would be any different to any other table that happened to have lots of inserts and deletes, such as the table described here: https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Mixed_inserts_and_deletes In the real world, a table like this would probably consist of some completely static data, combined with other data that is constantly deleted and re-inserted -- probably only a small fraction of the table at any one time. I would expect such a table to work quite well, because the static pages would all become frozen (at least after a while), leaving behind only the tuples that are deleted quickly, most of the time. VACUUM would have a decent chance of noticing that it will be cheap to advance relfrozenxid in earlier VACUUM operations, as bloat is cleaned up -- even a VACUUM that happens long before the point that autovacuum.c will launch an antiwraparound autovacuum has a decent chance of it. That's not a new idea, really; the pgbench_branches example from the Wiki page looks like that already, and even works on Postgres 15. Here is the part that's new: the pressure to advance relfrozenxid grows gradually, as table age grows. If table age is still very young, then we'll only do it if the number of "extra" scanned pages is < 5% of rel_pages -- only when the added cost is very low (again, like the pgbench_branches example, mostly). Once table age gets about halfway towards the point that antiwraparound autovacuuming is required, VACUUM then starts caring less about costs. 
It gradually worries less about the costs, and more about the need to advance it. Ideally it will happen before antiwraparound autovacuum is actually required. I'm not sure how much this would help with bloat. I suspect that it could make a big difference with the right workload. If you always need frequent autovacuums, just to deal with bloat, then there is never a good time to run an aggressive antiwraparound autovacuum. An aggressive AV will probably end up taking much longer than the typical autovacuum that deals with bloat. While the aggressive AV will remove as much bloat as any other AV, in theory, that might not help much. If the aggressive AV takes as long as (say) 5 regular autovacuums would have taken, and if you really needed those 5 separate autovacuums to run, just to deal with the bloat, then that's a real problem. The aggressive AV effectively causes bloat with such a workload. -- Peter Geoghegan
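The "growing pressure" idea described above can be sketched as a simple budget function. The 5% cap and the "halfway to antiwraparound" inflection point come from the description above; the linear interpolation shape past that point, and all of the names, are assumptions for illustration, not the patch's actual cost model:

```python
def extra_scan_budget(rel_pages, table_age, antiwrap_age):
    """Sketch: how many 'extra' scanned pages VACUUM tolerates in
    order to advance relfrozenxid, as a function of table age.
    While the table is young, only a very cheap advancement
    (< 5% of rel_pages extra) is acceptable; past roughly the
    halfway point toward the antiwraparound threshold, the cap
    relaxes until any cost is acceptable."""
    if table_age < antiwrap_age // 2:
        return round(rel_pages * 0.05)
    frac = min(1.0, table_age / antiwrap_age)
    # Scale linearly from 5% at the halfway point up to 100% at the threshold.
    return round(rel_pages * (0.05 + 0.95 * (frac - 0.5) / 0.5))
```

The point of the shape is that relfrozenxid advancement ideally happens opportunistically, during some VACUUM that was going to run anyway, well before an antiwraparound autovacuum becomes necessary.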
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Dec 15, 2022 at 11:48 PM John Naylor wrote: > Thanks for this. This is the kind of concrete, data-based evidence that I > find much more convincing, or at least easy to reason about. I'm glad to hear that it helped. It's always difficult to judge where other people are coming from, especially when it's not clear how much context is shared. Face time would have helped here, too. > One motivating example mentioned is the append-only table. If we detected > that case, which I assume we can because autovacuum_vacuum_insert_* GUCs > exist, we could use that information as one way to drive eager freezing > independently of size. At least in theory -- it's very possible size will be > a necessary part of the decision, but it's less clear that it's as useful as > a user-tunable knob. I am not strongly opposed to that idea, though I have my doubts about it. I have thought about it already, and it wouldn't be hard to get the information to vacuumlazy.c (I plan on doing it as part of related work on antiwraparound autovacuum, in fact [1]). I'm skeptical of the general idea that autovacuum.c has enough reliable information to give detailed recommendations as to how vacuumlazy.c should process the table. I have pointed out several major flaws with the autovacuum.c dead tuple accounting in the past [2][3], but I also think that there are significant problems with the tuples inserted accounting. Basically, I think that there are effects which are arguably an example of the inspection paradox [4]. Insert-based autovacuums occur on a timeline determined by the "inserted since last autovacuum" statistics. These statistics are (in part) maintained by autovacuum/VACUUM itself. Which has no specific understanding of how it might end up chasing its own tail. Let me be more concrete about what I mean about autovacuum chasing its own tail. 
The autovacuum_vacuum_insert_threshold mechanism works by triggering an autovacuum whenever the number of tuples inserted since the last autovacuum/VACUUM reaches a certain threshold -- usually some fixed proportion of pg_class.reltuples. But the tuples-inserted-since-last-VACUUM counter gets reset at the end of VACUUM, not at the start. Whereas VACUUM itself processes only the subset of pages that needed to be vacuumed at the start of the VACUUM. There is no attempt to compensate for that disparity. This *isn't* really a measure of "unvacuumed tuples" (you'd need to compensate to get that). This "at the start vs at the end" difference won't matter at all with smaller tables. And even in larger tables we might hope that the effect would kind of average out. But what about cases where one particular VACUUM operation takes an unusually long time, out of a sequence of successive VACUUMs that run against the same table? For example, the sequence that you see on the Wiki page, when Postgres HEAD autovacuum does an aggressive VACUUM on one occasion, which takes dramatically longer [5]. Notice that the sequence in [5] shows that the patch does one more autovacuum operation in total, compared to HEAD/master. That's a lot more -- we're talking about VACUUMs that each take 40+ minutes. That can be explained by the fact that VACUUM (quite naturally) resets the "tuples inserted since last VACUUM" at the end of that unusually long running aggressive autovacuum -- just like any other VACUUM would. That seems very weird to me. If (say) we happened to have a much higher vacuum_freeze_table_age setting, then we wouldn't have had an aggressive VACUUM until much later on (or never, because the benchmark would just end). 
And the VACUUM that was aggressive would have been a regular VACUUM instead, and would therefore have completed far sooner, and would therefore have had a *totally* different cadence, compared to what we actually saw -- it becomes distorted in a way that outlasts the aggressive VACUUM. With a far higher vacuum_freeze_table_age, we might have even managed to do two regular autovacuums in the same period that it took a single aggressive VACUUM to run in (that's not too far from what actually happened with the patch). The *second* regular autovacuum would then end up resetting the "inserted since last VACUUM" counter to 0 at the same time as the long running aggressive VACUUM actually did so (same wall clock time, same time since the start of the benchmark). Notice that we'll have done much less useful work (on cleaning up bloat and setting newer pages all-visible) with the "one long aggressive mode VACUUM" setup/scenario -- we'll be way behind -- but the statistics will nevertheless look about the same as they do in the "two fast autovacuums instead of one slow autovacuum" counterfactual scenario. In short, autovacuum.c fails to appreciate that a lot of stuff about the table changes when VACUUM runs. Time hasn't stood still -- the table was modified and extended throughout. So autovacuum.c hasn't compensated for how VACUUM actually performed, and, in effect, forgets how far i
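For reference, the insert-driven trigger that this accounting critique is about amounts to a simple inequality. This sketch uses the stock defaults for autovacuum_vacuum_insert_threshold (1000) and autovacuum_vacuum_insert_scale_factor (0.2); the function name is invented. The critique above is that inserted_since_vacuum is reset at the *end* of a VACUUM, while the VACUUM only processed the pages that needed it at the *start*, and nothing compensates for that disparity:

```python
def insert_autovacuum_due(inserted_since_vacuum, reltuples,
                          threshold=1000, scale_factor=0.2):
    """Sketch of the autovacuum_vacuum_insert_* trigger: an
    insert-driven autovacuum launches once the tuples inserted
    since the last vacuum exceed a base threshold plus a fixed
    proportion of pg_class.reltuples."""
    return inserted_since_vacuum > threshold + scale_factor * reltuples
```

With a long-running aggressive VACUUM, the counter keeps accumulating for the whole duration and is then zeroed all at once, so the cadence of subsequent autovacuums is distorted in the way described above.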
Re: New strategies for freezing, advancing relfrozenxid early
Hi! I've found this discussion very interesting, given that vacuuming TOAST tables is always a problem because these tables tend to bloat very quickly with dead data - just to remind, all TOAST-able columns of the relation use the same TOAST table, which is one for the relation, and TOASTed data are not updated - there are only insert and delete operations. Have you tested it with large and constantly used TOAST tables? How would it work with the current TOAST implementation? We propose a different approach to the TOAST mechanics [1], and a new vacuum would be very promising. Thank you! [1] https://commitfest.postgresql.org/41/3490/ On Fri, Dec 16, 2022 at 10:48 AM John Naylor wrote: > > On Wed, Dec 14, 2022 at 6:07 AM Peter Geoghegan wrote: > > > > At the suggestion of Jeff, I wrote a Wiki page that shows motivating > > examples for the patch series: > > > > > https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples > > > > These are all cases where VACUUM currently doesn't do the right thing > > around freezing, in a way that is greatly ameliorated by the patch. > > Perhaps this will help other hackers to understand the motivation > > behind some of these mechanisms. There are plenty of details that only > > make sense in the context of a certain kind of table, with certain > > performance characteristics that the design is sensitive to, and seeks > > to take advantage of in one way or another. > > Thanks for this. This is the kind of concrete, data-based evidence that I > find much more convincing, or at least easy to reason about. I'd actually > recommend in the future to open discussion with this kind of analysis -- > even before coding, it's possible to indicate what a design is *intended* > to achieve. And reviewers can likewise bring up cases of their own in a > concrete fashion. 
> > On Wed, Dec 14, 2022 at 12:16 AM Peter Geoghegan wrote: > > > At the very least, a given VACUUM operation has to choose its freezing > > strategy based on how it expects the table will look when it's done > > vacuuming the table, and how that will impact the next VACUUM against > > the same table. Without that, then vacuuming an append-only table will > > fall into a pattern of setting pages all-visible in one vacuum, and > > then freezing those same pages all-frozen in the very next vacuum > > because there are too many. Which makes little sense; we're far better > > off freezing the pages at the earliest opportunity instead. > > That makes sense, but I wonder if we can actually be more specific: One > motivating example mentioned is the append-only table. If we detected that > case, which I assume we can because autovacuum_vacuum_insert_* GUCs exist, > we could use that information as one way to drive eager freezing > independently of size. At least in theory -- it's very possible size will > be a necessary part of the decision, but it's less clear that it's as > useful as a user-tunable knob. > > If we then ignored the append-only case when evaluating a freezing policy, > maybe other ideas will fall out. I don't have a well-thought out idea about > policy or knobs, but it's worth thinking about. > > Aside from that, I've only given the patches a brief reading. Having seen > the VM snapshot in practice (under "Scanned pages, visibility map snapshot" > in the wiki page), it's neat to see fewer pages being scanned. Prefetching > not only seems superior to SKIP_PAGES_THRESHOLD, but anticipates > asynchronous IO. Keeping only one VM snapshot page in memory makes perfect > sense. > > I do have a cosmetic, but broad-reaching, nitpick about terms regarding > "skipping strategy". That's phrased as a kind of negative -- what we're > *not* doing. Many times I had to pause and compute in my head what we're > *doing*, i.e. the "scanning strategy". 
For example, I wonder if the VM > strategies would be easier to read as: > > VMSNAP_SKIP_ALL_VISIBLE -> VMSNAP_SCAN_LAZY > VMSNAP_SKIP_ALL_FROZEN -> VMSNAP_SCAN_EAGER > VMSNAP_SKIP_NONE -> VMSNAP_SCAN_ALL > > Notice here they're listed in order of increasing eagerness. > > -- > John Naylor > EDB: http://www.enterprisedb.com > -- Regards, Nikita Malakhov Postgres Professional https://postgrespro.ru/
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Dec 14, 2022 at 6:07 AM Peter Geoghegan wrote: > > At the suggestion of Jeff, I wrote a Wiki page that shows motivating > examples for the patch series: > > https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples > > These are all cases where VACUUM currently doesn't do the right thing > around freezing, in a way that is greatly ameliorated by the patch. > Perhaps this will help other hackers to understand the motivation > behind some of these mechanisms. There are plenty of details that only > make sense in the context of a certain kind of table, with certain > performance characteristics that the design is sensitive to, and seeks > to take advantage of in one way or another. Thanks for this. This is the kind of concrete, data-based evidence that I find much more convincing, or at least easy to reason about. I'd actually recommend in the future to open discussion with this kind of analysis -- even before coding, it's possible to indicate what a design is *intended* to achieve. And reviewers can likewise bring up cases of their own in a concrete fashion. On Wed, Dec 14, 2022 at 12:16 AM Peter Geoghegan wrote: > At the very least, a given VACUUM operation has to choose its freezing > strategy based on how it expects the table will look when it's done > vacuuming the table, and how that will impact the next VACUUM against > the same table. Without that, then vacuuming an append-only table will > fall into a pattern of setting pages all-visible in one vacuum, and > then freezing those same pages all-frozen in the very next vacuum > because there are too many. Which makes little sense; we're far better > off freezing the pages at the earliest opportunity instead. That makes sense, but I wonder if we can actually be more specific: One motivating example mentioned is the append-only table. 
If we detected that case, which I assume we can because autovacuum_vacuum_insert_* GUCs exist, we could use that information as one way to drive eager freezing independently of size. At least in theory -- it's very possible size will be a necessary part of the decision, but it's less clear that it's as useful as a user-tunable knob. If we then ignored the append-only case when evaluating a freezing policy, maybe other ideas will fall out. I don't have a well-thought out idea about policy or knobs, but it's worth thinking about. Aside from that, I've only given the patches a brief reading. Having seen the VM snapshot in practice (under "Scanned pages, visibility map snapshot" in the wiki page), it's neat to see fewer pages being scanned. Prefetching not only seems superior to SKIP_PAGES_THRESHOLD, but anticipates asynchronous IO. Keeping only one VM snapshot page in memory makes perfect sense. I do have a cosmetic, but broad-reaching, nitpick about terms regarding "skipping strategy". That's phrased as a kind of negative -- what we're *not* doing. Many times I had to pause and compute in my head what we're *doing*, i.e. the "scanning strategy". For example, I wonder if the VM strategies would be easier to read as: VMSNAP_SKIP_ALL_VISIBLE -> VMSNAP_SCAN_LAZY VMSNAP_SKIP_ALL_FROZEN -> VMSNAP_SCAN_EAGER VMSNAP_SKIP_NONE -> VMSNAP_SCAN_ALL Notice here they're listed in order of increasing eagerness. -- John Naylor EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Dec 15, 2022 at 11:11 AM Justin Pryzby wrote: > The patches (003 and 005) are missing a word > should use to decide whether to its eager freezing strategy. I mangled this during rebasing for v9, which reordered the commits. Will be fixed in v10. > On the wiki, missing a word: > builds on related added Fixed. Thanks -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
The patches (003 and 005) are missing a word should use to decide whether to its eager freezing strategy. On the wiki, missing a word: builds on related added
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Dec 15, 2022 at 6:50 AM Matthias van de Meent wrote: > This first run of (auto)vacuum after the 8GB threshold seems > to appear as a significant IO event (both in WAL and relation > read/write traffic) with 50% of the table updated and WAL-logged. I > think this should be limited to some degree, such as only freeze > all_visible blocks up to 10% of the table's blocks in eager vacuum, so > that the load is spread across a larger time frame and more VACUUM > runs. I agree that the burden of catch-up freezing is excessive here (in fact I already wrote something to that effect on the wiki page). The likely solution can be simple enough. In v9 of the patch, we switch over to eager freezing when table size crosses 4GB (since that is the value of the vacuum_freeze_strategy_threshold GUC). The catch-up freezing that you draw attention to here occurs when table size exceeds 8GB, which is a separate physical table size threshold that forces eager relfrozenxid advancement. The second threshold is hard-coded to 2x the first one. I think that this issue can be addressed by making the second threshold 4x or even 8x vacuum_freeze_strategy_threshold, not just 2x. That would mean that we'd have to freeze just as many pages whenever we did the catch-up freezing -- so no change in the added *absolute* cost of freezing. But, the *relative* cost would be much lower, simply because catch-up freezing would take place when the table was much larger. So it would be a lot less noticeable. Note that we might never reach the second table size threshold before we must advance relfrozenxid, in any case. The catch-up freezing might actually take place because table age created pressure to advance relfrozenxid. It's useful to have a purely physical/table-size threshold like this, especially in bulk loading scenarios. But it's not like table age doesn't have any influence at all, anymore. 
The cost model weighs physical units/costs as well as table age, and in general the most likely trigger for advancing relfrozenxid is usually some combination of the two, not any one factor on its own [1]. [1] https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Opportunistically_advancing_relfrozenxid_with_bursty.2C_real-world_workloads -- Peter Geoghegan
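The two table-size thresholds being discussed can be summarized in a few lines. The 4GB default and the hard-coded 2x multiplier describe v9 of the patch; the function name and structure here are illustrative only:

```python
def vacuum_table_size_triggers(rel_size_bytes,
                               freeze_strategy_threshold=4 << 30,
                               advance_multiplier=2):
    """Sketch of the v9 behavior: eager freezing begins once table
    size crosses vacuum_freeze_strategy_threshold (4GB by default),
    and eager relfrozenxid advancement is forced at a hard-coded
    multiple of that (2x in v9; the proposal above is to raise the
    multiplier to 4x or even 8x, so the same absolute amount of
    catch-up freezing is a smaller fraction of a larger table)."""
    eager_freeze = rel_size_bytes >= freeze_strategy_threshold
    force_advance = rel_size_bytes >= advance_multiplier * freeze_strategy_threshold
    return eager_freeze, force_advance
```

Raising advance_multiplier doesn't change how many pages the catch-up freezing touches, only how large the table is when it happens.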
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, 14 Dec 2022 at 00:07, Peter Geoghegan wrote: > > On Tue, Dec 13, 2022 at 9:16 AM Peter Geoghegan wrote: > > That's not the only thing we care about, though. And to the extent we > > care about it, we mostly care about the consequences of either > > freezing or not freezing eagerly. Concentration of unfrozen pages in > > one particular table is a lot more of a concern than the same number > > of heap pages being spread out across multiple tables. Those tables > > can all be independently vacuumed, and come with their own > > relfrozenxid, that can be advanced independently, and are very likely > > to be frozen as part of a vacuum that needed to happen anyway. > > At the suggestion of Jeff, I wrote a Wiki page that shows motivating > examples for the patch series: > > https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples > > These are all cases where VACUUM currently doesn't do the right thing > around freezing, in a way that is greatly ameliorated by the patch. > Perhaps this will help other hackers to understand the motivation > behind some of these mechanisms. There are plenty of details that only > make sense in the context of a certain kind of table, with certain > performance characteristics that the design is sensitive to, and seeks > to take advantage of in one way or another. In this mentioned wiki page, section "Simple append-only", the following is written: > Our "transition from lazy to eager strategies" concludes with an autovacuum > that actually advanced relfrozenxid eagerly: >> automatic vacuum of table "regression.public.pgbench_history": index scans: 0 >> pages: 0 removed, 1078444 remain, 561143 scanned (52.03% of total) >> [...] >> frozen: 560841 pages from table (52.00% of total) had 88051825 tuples frozen >> [...] 
>> WAL usage: 1121683 records, 557662 full page images, 4632208091 bytes I think that this 'transition from lazy to eager' could benefit from a limit on how many all_visible blocks each autovacuum iteration can freeze. This first run of (auto)vacuum after the 8GB threshold seems to appear as a significant IO event (both in WAL and relation read/write traffic) with 50% of the table updated and WAL-logged. I think this should be limited to some degree, such as only freeze all_visible blocks up to 10% of the table's blocks in eager vacuum, so that the load is spread across a larger time frame and more VACUUM runs. Kind regards, Matthias van de Meent.
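The suggested limit could be sketched as a simple cap; the 10% figure comes from the suggestion above, and the function name is invented. Applied to the numbers in the quoted autovacuum log (560841 pages frozen out of 1078444), the cap would spread that one 52% spike over roughly five eager VACUUMs:

```python
def capped_catchup_freeze(all_visible_blocks, rel_blocks, cap_fraction=0.10):
    """Sketch of the proposed limit: freeze at most ~10% of the
    table's blocks per eager VACUUM, so catch-up freezing is spread
    across several runs rather than landing as one large IO event."""
    cap = int(rel_blocks * cap_fraction)
    return min(all_visible_blocks, cap)
```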
Re: New strategies for freezing, advancing relfrozenxid early
On Tue, Dec 13, 2022 at 9:16 AM Peter Geoghegan wrote: > That's not the only thing we care about, though. And to the extent we > care about it, we mostly care about the consequences of either > freezing or not freezing eagerly. Concentration of unfrozen pages in > one particular table is a lot more of a concern than the same number > of heap pages being spread out across multiple tables. Those tables > can all be independently vacuumed, and come with their own > relfrozenxid, that can be advanced independently, and are very likely > to be frozen as part of a vacuum that needed to happen anyway. At the suggestion of Jeff, I wrote a Wiki page that shows motivating examples for the patch series: https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples These are all cases where VACUUM currently doesn't do the right thing around freezing, in a way that is greatly ameliorated by the patch. Perhaps this will help other hackers to understand the motivation behind some of these mechanisms. There are plenty of details that only make sense in the context of a certain kind of table, with certain performance characteristics that the design is sensitive to, and seeks to take advantage of in one way or another. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Tue, Dec 13, 2022 at 12:29 AM John Naylor wrote: > If the number of unfrozen heap pages is the thing we care about, perhaps > that, and not the total size of the table, should be the parameter that > drives freezing strategy? That's not the only thing we care about, though. And to the extent we care about it, we mostly care about the consequences of either freezing or not freezing eagerly. Concentration of unfrozen pages in one particular table is a lot more of a concern than the same number of heap pages being spread out across multiple tables. Those tables can all be independently vacuumed, and come with their own relfrozenxid, that can be advanced independently, and are very likely to be frozen as part of a vacuum that needed to happen anyway. Pages become frozen pages because VACUUM freezes those pages. Same with all-visible pages, which could in principle have been made all-frozen instead, had VACUUM opted to do it that way back when it processed the page. So VACUUM is not a passive, neutral observer here. What happens over time and across multiple VACUUM operations is very relevant. VACUUM needs to pick up where it left off last time, at least with larger tables, where the time between VACUUMs is naturally very high, and where each individual VACUUM has to process a huge number of individual pages. It's not really practical to take a "wait and see" approach with big tables. At the very least, a given VACUUM operation has to choose its freezing strategy based on how it expects the table will look when it's done vacuuming the table, and how that will impact the next VACUUM against the same table. Without that, then vacuuming an append-only table will fall into a pattern of setting pages all-visible in one vacuum, and then freezing those same pages all-frozen in the very next vacuum because there are too many. Which makes little sense; we're far better off freezing the pages at the earliest opportunity instead. 
We're going to have to write a WAL record for the visibility map anyway, so doing everything at the same time has a lot to recommend it. Even if it turns out to be quite wrong, we may still come out ahead in terms of absolute volume of WAL written, and especially in terms of performance stability. To a limited extent we need to reason about what will happen in the near future. But we also need to reason about which kinds of mispredictions we cannot afford to make, and which kinds are okay. Some mistakes hurt a lot more than others. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
On Tue, Dec 13, 2022 at 8:00 AM Peter Geoghegan wrote: > > On Mon, Dec 12, 2022 at 3:47 PM Jeff Davis wrote: > > But the heuristic also seems off to me. What if you have lots of partitions > > in an append-only range-partitioned table? That would tend to use the > > lazy freezing strategy (because each partition is small), but that's > > not what you want. I understand heuristics aren't perfect, but it feels > > like we could do something better. > > It is at least vastly superior to vacuum_freeze_min_age in cases like > this. Not that that's hard -- vacuum_freeze_min_age just doesn't ever > trigger freezing in any autovacuum given a table like pgbench_history > (barring during aggressive mode), due to how it interacts with the > visibility map. So we're practically guaranteed to do literally all > freezing for an append-only table in an aggressive mode VACUUM. > > Worst of all, that happens on a timeline that has nothing to do with > the physical characteristics of the table itself (like the number of > unfrozen heap pages or something). If the number of unfrozen heap pages is the thing we care about, perhaps that, and not the total size of the table, should be the parameter that drives freezing strategy? > That said, I agree that the system-level picture of debt (the system > level view of the number of unfrozen heap pages) is relevant, and that > it isn't directly considered by the patch. I think that that can be > treated as work for a future release. In fact, I think that there is a > great deal that we could teach autovacuum.c about the system level > view of things -- this is only one. It seems an easier path to considering system-level debt (as measured by unfrozen heap pages) would be to start with considering table-level debt measured the same way. -- John Naylor EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
On Mon, Dec 12, 2022 at 3:47 PM Jeff Davis wrote: > Freezing is driven by a need to keep the age of the oldest > transaction ID in a table to less than ~2B; and also the need to > truncate the clog (and reduce lookups of really old xids). It's fine to > give a brief explanation about why we can't track very old xids, but > it's more of an internal detail and not the main point. I agree that that's the conventional definition. What I am proposing is that we revise that definition a little. We should start the discussion of freezing in the user level docs by pointing out that freezing also plays a role at the level of individual pages. An all-frozen page is self-contained, now and forever (or until it gets dirtied again, at least). Even on a standby we will reliably avoid having to do clog lookups for a page that happens to have all of its tuples frozen. I don't want to push back too much here. I just don't think that it makes terribly much sense for the docs to start the conversation about freezing by talking about the worst consequences of not freezing for an extended period of time. That's relevant, and it's probably going to end up as the aspect of freezing that we spend most time on, but it still doesn't seem like a useful starting point to me. To me this seems related to the fallacy that relfrozenxid age is any kind of indicator about how far behind we are on freezing. I think that there is value in talking about freezing as a maintenance task for physical heap pages, and only then talking about relfrozenxid and the circular XID space. The 64-bit XID patch doesn't get rid of freezing at all, because it is still needed to break the dependency of tuples stored in heap pages on the pg_xact, and other SLRUs -- which suggests that you can talk about freezing and advancing relfrozenxid as different (though still closely related) concepts. > * I'm still having a hard time with vacuum_freeze_strategy_threshold. 
> Part of it is the name, which doesn't seem to convey the meaning.

I chose the name long ago, and never gave it terribly much thought. I'm
happy to go with whatever name you prefer.

> But the heuristic also seems off to me. What if you have lots of partitions
> in an append-only range-partitioned table? That would tend to use the
> lazy freezing strategy (because each partition is small), but that's
> not what you want. I understand heuristics aren't perfect, but it feels
> like we could do something better.

It is at least vastly superior to vacuum_freeze_min_age in cases like this.
Not that that's hard -- vacuum_freeze_min_age just doesn't ever trigger
freezing in any autovacuum given a table like pgbench_history (barring
aggressive mode), due to how it interacts with the visibility map. So we're
practically guaranteed to do literally all freezing for an append-only
table in an aggressive mode VACUUM. Worst of all, that happens on a
timeline that has nothing to do with the physical characteristics of the
table itself (like the number of unfrozen heap pages). In fact, it doesn't
even have anything to do with how many distinct XIDs modified that
particular table -- XID age works at the system level.

By working at the heap rel level (which means the partition level if it's a
partitioned table), and by being based on physical units (table size),
vacuum_freeze_strategy_threshold at least manages to limit the accumulation
of unfrozen heap pages in each individual relation. This is the fundamental
unit at which VACUUM operates. So even if you get very unlucky and
accumulate many unfrozen heap pages that happen to be distributed across
many different tables, you can at least vacuum each table independently,
and in parallel. The really big problems all seem to involve a
concentration of unfrozen pages in one particular table (usually the events
table, the largest table in the system by a couple of orders of magnitude).
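The per-relation gating can be sketched minimally as follows. This is an
illustration of the idea, not the patch's actual code: the function name,
parameter names, and the choice of expressing the threshold in pages are
all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;	/* stand-in for PostgreSQL's BlockNumber */

/*
 * Hypothetical sketch: choose the freezing strategy for one heap rel based
 * on that rel's own physical size.  For a partitioned table each partition
 * is its own heap rel with its own rel_pages, which is exactly why many
 * small partitions of a large append-only table would each individually
 * fall back to the lazy strategy -- the behavior Jeff questions above.
 */
static bool
use_eager_freeze_strategy(BlockNumber rel_pages, BlockNumber threshold_pages)
{
	return rel_pages >= threshold_pages;
}
```

The point of the sketch is only that the decision input is physical
(pages), and scoped to the relation that VACUUM actually operates on.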
That said, I agree that the system-level picture of debt (the system-level
view of the number of unfrozen heap pages) is relevant, and that it isn't
directly considered by the patch. I think that that can be treated as work
for a future release. In fact, I think that there is a great deal that we
could teach autovacuum.c about the system-level view of things -- this is
only one example.

> Also, another purpose of this seems
> to be to achieve v15 behavior (if v16 behavior causes a problem for
> some workload), which seems like a good idea, but perhaps we should
> have a more direct setting for that?

Why, though? I think that it happens to make sense to do both with one
setting. Not because it's better to have 1 setting than 2 (though it is) --
just because it makes sense here, given these specifics.

> * The comment above lazy_scan_strategy() is phrased in terms of the
> "traditional approach". It would be more clear if you described the
> current strategies and how they're chosen. The pre-16 behavior was as
> lazy as possible, so that's easy enough to describe without referring
> to history.
Re: New strategies for freezing, advancing relfrozenxid early
On Sat, 2022-12-10 at 18:11 -0800, Peter Geoghegan wrote:
> On Tue, Dec 6, 2022 at 1:45 PM Peter Geoghegan wrote:
> > v9 will also address some of the concerns you raised in your review
> > that weren't covered by v8, especially about the VM snapshotting
> > infrastructure. But also your concerns about the transition from lazy
> > strategies to eager strategies.
>
> Attached is v9. Highlights:

Comments:

* The documentation shouldn't have a heading like "Managing the 32-bit
Transaction ID address space". We already have a concept of "age"
documented, and I think that's all that's needed in the relevant section.
Freezing is driven by a need to keep the age of the oldest transaction ID
in a table to less than ~2B; and also the need to truncate the clog (and
reduce lookups of really old xids). It's fine to give a brief explanation
about why we can't track very old xids, but it's more of an internal detail
and not the main point.

* I'm still having a hard time with vacuum_freeze_strategy_threshold. Part
of it is the name, which doesn't seem to convey the meaning. But the
heuristic also seems off to me. What if you have lots of partitions in an
append-only range-partitioned table? That would tend to use the lazy
freezing strategy (because each partition is small), but that's not what
you want. I understand heuristics aren't perfect, but it feels like we
could do something better. Also, another purpose of this seems to be to
achieve v15 behavior (if v16 behavior causes a problem for some workload),
which seems like a good idea, but perhaps we should have a more direct
setting for that?

* The comment above lazy_scan_strategy() is phrased in terms of the
"traditional approach". It would be more clear if you described the current
strategies and how they're chosen. The pre-16 behavior was as lazy as
possible, so that's easy enough to describe without referring to history.
* "eager skipping behavior" seems like a weird phrasing because it's not
immediately clear if that means "skip more pages" (eager to skip pages and
lazy to process them) or "skip fewer pages" (lazy to skip the pages and
eager to process the pages).

* The skipping behavior for all-visible pages is binary: skip them all, or
skip none. That makes sense in the context of relfrozenxid advancement. But
how does that avoid IO spikes? It would seem perfectly reasonable to me, if
relfrozenxid advancement is not a pressing problem, to process some
fraction of the all-visible pages (or perhaps process enough of them to
freeze some fraction). That would ensure that each VACUUM makes a payment
on the deferred costs of freezing. I think this has already been discussed
but it keeps reappearing in my mind, so maybe we can settle this with a
comment (and/or docs)?

* I'm wondering whether vacuum_freeze_min_age makes sense anymore. It
doesn't take effect unless the page is not skipped, which is confusing from
a usability standpoint, and we have better heuristics to decide if the
whole page should be frozen or not anyway (i.e. if an FPI was already taken
then freezing is cheaper).

--
Jeff Davis
PostgreSQL Contributor Team - AWS
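Jeff's parenthetical heuristic in the last point (an FPI was already taken,
so freezing is comparatively cheap) could be modeled roughly like this. The
function and parameter names are illustrative assumptions, not actual
vacuumlazy.c code:

```c
#include <stdbool.h>

/*
 * Simplified model of a page-level freeze decision.  "fpi_emitted" stands
 * for the check discussed elsewhere in this thread: pruning already caused
 * a full-page image to be written for this page, so the marginal WAL cost
 * of freezing it now is small.  "trigger_xid_past_freeze_limit" is the
 * classic age-based trigger.
 */
static bool
should_freeze_page(bool fpi_emitted, bool trigger_xid_past_freeze_limit)
{
	if (trigger_xid_past_freeze_limit)
		return true;			/* freezing is forced by FreezeLimit */
	if (fpi_emitted)
		return true;			/* FPI cost already paid; freeze now */
	return false;				/* stay lazy */
}
```

The appeal of such a rule is that it needs no table-level context at all --
it only looks at what just happened to the page being vacuumed.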
Re: New strategies for freezing, advancing relfrozenxid early
On Tue, Dec 6, 2022 at 10:42 AM Andres Freund wrote: > The docs don't build: > https://cirrus-ci.com/task/5456939761532928 > [20:00:58.203] postgres.sgml:52: element link: validity error : IDREF > attribute linkend references an unknown ID "vacuum-for-wraparound" Thanks for pointing this out. FWIW it is a result of Bruce's recent addition of the transaction processing chapter to the docs. My intention is to post v9 later in the week, which will fix the doc build, and a lot more besides that. If you are planning on doing another round of review, I'd suggest that you hold off until then. v9 will have structural improvements that will likely make it easier to understand all the steps leading up to removing aggressive mode completely. It'll be easier to relate each local step/patch to the bigger picture for VACUUM. v9 will also address some of the concerns you raised in your review that weren't covered by v8, especially about the VM snapshotting infrastructure. But also your concerns about the transition from lazy strategies to eager strategies. The "catch up freezing" performed by the first VACUUM operation run against a table that just exceeded the GUC-controlled table size threshold will have far more limited impact, because the burden of freezing will be spread out across multiple VACUUM operations. The big idea behind the patch series is to relieve users from having to think about a special type of VACUUM that has to do much more freezing than other VACUUMs that ran against the same table in the recent past, of course, so it is important to avoid accidentally allowing any behavior that looks kind of like the ghost of aggressive VACUUM. -- Peter Geoghegan
Re: New strategies for freezing, advancing relfrozenxid early
Hi, On 2022-11-23 15:06:52 -0800, Peter Geoghegan wrote: > Attached is v8. The docs don't build: https://cirrus-ci.com/task/5456939761532928 [20:00:58.203] postgres.sgml:52: element link: validity error : IDREF attribute linkend references an unknown ID "vacuum-for-wraparound" Greetings, Andres Freund
Re: New strategies for freezing, advancing relfrozenxid early
Hi,

On 2022-11-15 19:02:12 -0800, Peter Geoghegan wrote:
> From 352867c5027fae6194ab1c6480cd326963e201b1 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan
> Date: Sun, 12 Jun 2022 15:46:08 -0700
> Subject: [PATCH v6 1/6] Add page-level freezing to VACUUM.
>
> Teach VACUUM to decide on whether or not to trigger freezing at the
> level of whole heap pages, not individual tuple fields. OldestXmin is
> now treated as the cutoff for freezing eligibility in all cases, while
> FreezeLimit is used to trigger freezing at the level of each page (we
> now freeze all eligible XIDs on a page when freezing is triggered for
> the page).
>
> This approach decouples the question of _how_ VACUUM could/will freeze a
> given heap page (which of its XIDs are eligible to be frozen) from the
> question of whether it actually makes sense to do so right now.
>
> Just adding page-level freezing does not change all that much on its
> own: VACUUM will still typically freeze very lazily, since we're only
> forcing freezing of all of a page's eligible tuples when we decide to
> freeze at least one (on the basis of XID age and FreezeLimit). For now
> VACUUM still freezes everything almost as lazily as it always has.
> Later work will teach VACUUM to apply an alternative eager freezing
> strategy that triggers page-level freezing earlier, based on additional
> criteria.
>
> ---
>  src/include/access/heapam.h          |  42 +-
>  src/backend/access/heap/heapam.c     | 199 +--
>  src/backend/access/heap/vacuumlazy.c |  95 -
>  3 files changed, 222 insertions(+), 114 deletions(-)
>
> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index ebe723abb..ea709bf1b 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -112,6 +112,38 @@ typedef struct HeapTupleFreeze
>  	OffsetNumber offset;
>  } HeapTupleFreeze;
>
> +/*
> + * State used by VACUUM to track what the oldest extant XID/MXID will become
> + * when determing whether and how to freeze a page's heap tuples via calls to
> + * heap_prepare_freeze_tuple.

Perhaps this could say something like "what the oldest extant XID/MXID
currently is and what it would be if we decide to freeze the page" or such?

> + * The relfrozenxid_out and relminmxid_out fields are the current target
> + * relfrozenxid and relminmxid for VACUUM caller's heap rel. Any and all

"VACUUM caller's heap rel." could stand to be rephrased.

> + * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
> + * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
> + * This includes XIDs that remain as MultiXact members from any tuple's xmax.
> + * Each heap_prepare_freeze_tuple call pushes back relfrozenxid_out and/or
> + * relminmxid_out as needed to avoid unsafe values in rel's authoritative
> + * pg_class tuple.
> + *
> + * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
> + * relminmxid_nofreeze_out must also be maintained for !freeze pages.
> + */

relfrozenxid_nofreeze_out isn't really a "no freeze variant" :)

I think it might be better to just always maintain the nofreeze state.

> +typedef struct HeapPageFreeze
> +{
> +	/* Is heap_prepare_freeze_tuple caller required to freeze page? */
> +	bool		freeze;

s/freeze/freeze_required/?

> +	/* Values used when page is to be frozen based on freeze plans */
> +	TransactionId relfrozenxid_out;
> +	MultiXactId relminmxid_out;
> +
> +	/* Used by caller for '!freeze' pages */
> +	TransactionId relfrozenxid_nofreeze_out;
> +	MultiXactId relminmxid_nofreeze_out;
> +
> +} HeapPageFreeze;
> +

Given the number of parameters to heap_prepare_freeze_tuple, why don't we
pass more of them in via HeapPageFreeze?

> /*
>  * function prototypes for heap access method
>  *
> @@ -180,17 +212,17 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
>  extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
>  									  TransactionId relfrozenxid, TransactionId relminmxid,
>  									  TransactionId cutoff_xid, TransactionId cutoff_multi,
> +									  TransactionId limit_xid, MultiXactId limit_multi,
>  									  HeapTupleFreeze *frz, bool *totally_frozen,
> -									  TransactionId *relfrozenxid_out,
> -									  MultiXactId *relminmxid_out);
> +									  HeapPageFreeze *xtrack);

What does 'xtrack' stand for? Xid Tracking?

> * VACUUM caller must assemble HeapFreezeTuple entries for
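The push-back behavior described in the quoted comment amounts to a simple
ratchet. A toy model, with field names following the quoted struct; plain
`<` stands in for PostgreSQL's circular XID comparison, and none of this is
the patch's actual code:

```c
#include <stdint.h>

typedef uint32_t TransactionId;	/* stand-in; real XIDs compare circularly */

typedef struct HeapPageFreezeSketch
{
	/* target relfrozenxid if the page's freeze plans are executed */
	TransactionId relfrozenxid_out;
	/* target relfrozenxid if the page is instead left unfrozen */
	TransactionId relfrozenxid_nofreeze_out;
} HeapPageFreezeSketch;

/*
 * Every XID that will remain unfrozen in the rel must ratchet the matching
 * tracker backwards, so that the final value VACUUM writes to pg_class
 * never exceeds any extant unfrozen XID.
 */
static void
push_back(TransactionId remaining_xid, TransactionId *tracker)
{
	if (remaining_xid < *tracker)
		*tracker = remaining_xid;
}
```

Maintaining both trackers on every call (as suggested above for the
nofreeze state) would just mean calling the ratchet on both fields, with
different notions of which XIDs "remain".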
Re: New strategies for freezing, advancing relfrozenxid early
Note that this fails under -fsanitize=align:

Subject: [PATCH v5 2/6] Teach VACUUM to use visibility map snapshot.

performing post-bootstrap initialization ...
../src/backend/access/heap/visibilitymap.c:482:38: runtime error: load of
misaligned address 0x5559e1352424 for type 'uint64', which requires 8 byte
alignment

> *all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
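For what it's worth, the usual portable remedy for this class of alignment
failure is to load through memcpy rather than dereferencing a cast pointer.
A sketch of that general technique (an assumption about a possible fix, not
necessarily what the patch should do):

```c
#include <stdint.h>
#include <string.h>

/*
 * Read a uint64 from a possibly misaligned address.  -fsanitize=align
 * flags a direct dereference because the buffer carries no 8-byte
 * alignment guarantee; memcpy has no alignment requirement, and compilers
 * lower this to a single load on platforms where that is safe.
 */
static uint64_t
load_u64_unaligned(const void *p)
{
	uint64_t	v;

	memcpy(&v, p, sizeof(v));
	return v;
}
```

With this helper, the quoted line would operate on `load_u64_unaligned(&umap[i])`
instead of `umap[i]` directly.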
Re: New strategies for freezing, advancing relfrozenxid early
On Tue, Oct 4, 2022 at 7:59 PM Jeff Davis wrote: > I am fine with that, but I'd like us all to understand what the > downsides are. Although I'm sure that there must be one case that loses measurably, it's not particularly obvious where to start looking for one. I mean it's easy to imagine individual pages that we lose on, but a practical test case where most of the pages are like that reliably is harder to imagine. > If I understand correctly: > > 1. Eager freezing (meaning to freeze at the same time as setting all- > visible) causes a modest amount of WAL traffic, hopefully before the > next checkpoint so we can avoid FPIs. Lazy freezing (meaning set all- > visible but don't freeze) defers the work, and it might never need to > be done; but if it does, it can cause spikes at unfortunate times and > is more likely to generate more FPIs. Lazy freezing means to freeze every eligible tuple (every XID < OldestXmin) when one or more XIDs are before FreezeLimit. Eager freezing means freezing every eligible tuple when the page is about to be set all-visible, or whenever lazy freezing would trigger freezing. Eager freezing tends to avoid big spikes in larger tables, which is very important. It can sometimes be cheaper and better in every way than lazy freezing. Though lazy freezing sometimes retains an advantage by avoiding freezing that is never going to be needed altogether, typically only in small tables. Lazy freezing is fairly similar to what we do on HEAD now -- though it's not identical. It's still "page level freezing". It has lazy criteria for triggering page freezing. > 2. You're trying to mitigate the downsides of eager freezing by: > a. when freezing a tuple, eagerly freeze other tuples on that page > b. optimize WAL freeze records Sort of. Both of these techniques apply to eager freezing too, in fact. It's just that eager freezing is likely to do the bulk of all freezing that actually goes ahead. 
It'll disproportionately be helped by these techniques because it'll do most actual freezing that goes ahead (even when most VACUUM operations use the lazy freezing strategy, which is probably the common case -- just because lazy freezing freezes lazily). > 3. You're trying to capture the trade-off in #1 by using the table size > as a proxy. Deferred work is only really a problem for big tables, so > that's where you use eager freezing. Right. > But maybe we can just always use > eager freezing?: That doesn't seem like a bad idea, though it might be tricky to put into practice. It might be possible to totally unite the concept of all-visible and all-frozen pages in the scope of this work. But there are surprisingly many tricky details involved. I'm not surprised that you're suggesting this -- it basically makes sense to me. It's just the practicalities that I worry about here. > a. You're mitigating the WAL work for freezing. I don't see why this would be true. Lazy vs Eager are exactly the same for a given page at the point that freezing is triggered. We'll freeze all eligible tuples (often though not always every tuple), or none at all. Lazy vs Eager describe the policy for deciding to freeze a page, but do not affect the actual execution steps taken once we decide to freeze. > b. A lot of people run with checksums on, meaning that setting the > all-visible bit requires WAL work anyway, and often FPIs. The idea of rolling the WAL records into one does seem appealing, but we'd still need the original WAL record to set a page all-visible in VACUUM's second heap pass (only setting a page all-visible in the first heap pass could be optimized by making the FREEZE_PAGE WAL record mark the page all-visible too). Or maybe we'd roll that into the VACUUM WAL record at the same time. In any case the second heap pass would have to have a totally different WAL logging strategy to the first heap pass. Not insurmountable, but not exactly an easy thing to do in passing either. 
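The trigger criteria Peter defines above (lazy vs. eager) can be sketched
as follows. This is a simplified model where plain `<` stands in for
PostgreSQL's circular TransactionIdPrecedes(), and the function names are
invented for the example:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* stand-in; real XIDs compare circularly */

/* A tuple XID is eligible for freezing once it precedes OldestXmin. */
static bool
xid_eligible(TransactionId xid, TransactionId oldest_xmin)
{
	return xid < oldest_xmin;
}

/* Lazy trigger: freeze only when some XID on the page precedes FreezeLimit. */
static bool
lazy_trigger(const TransactionId *xids, int nxids, TransactionId freeze_limit)
{
	for (int i = 0; i < nxids; i++)
		if (xids[i] < freeze_limit)
			return true;
	return false;
}

/*
 * Eager trigger: additionally freeze when the page is about to be set
 * all-visible, so that large tables don't accumulate all-visible pages
 * full of unfrozen XIDs.
 */
static bool
eager_trigger(const TransactionId *xids, int nxids,
			  TransactionId freeze_limit, bool will_set_all_visible)
{
	return will_set_all_visible || lazy_trigger(xids, nxids, freeze_limit);
}
```

Either way, once a trigger fires, exactly the same thing happens: every
eligible XID on the page is frozen. That is why the two strategies share
all of their execution machinery and differ only in the trigger.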
> c. All-visible is conceptually similar to freezing, but less
> important, and it feels more and more like the design concept of all-
> visible isn't carrying its weight.

Well, not quite -- at least not on the VM side itself. There are cases
where heap_lock_tuple() will update a tuple's xmax, replacing it with a new
Multi. This necessitates clearing the page's all-frozen bit in the VM --
but the all-visible bit will stay set. This is why it's possible for small
numbers of all-visible pages to appear even in large tables that have been
eagerly frozen.

> d. (tangent) I had an old patch[1] that actually removed
> PD_ALL_VISIBLE (the page bit, not the VM bit), which was rejected, but
> perhaps its time has come?

I remember that pgCon developer meeting well. :-)

If anything your original argument for getting rid of PD_ALL_VISIBLE is
weakened by the proposal to merge together the WAL records for freezing and
for setting a heap page all visible. You'd know for sure that the page will
be dirtied when such a WAL reco