Re: Minimal logical decoding on standbys
Hi,

On 1/26/23 9:13 PM, Andres Freund wrote:
> Hi,
>
> On 2023-01-26 18:56:10 +0100, Drouvot, Bertrand wrote:
> > - I'm struggling to create a test for btree killtuples as there is a need for
> > rows removal on the table (that could produce a conflict too): Do you have a
> > scenario in mind for this one? (and btw in what kind of WAL record should the
> > conflict be detected in such a case? xl_btree_delete?)
>
> Hm, it might indeed be hard in "modern" postgres. I think you'd need at least
> two concurrent sessions, to prevent on-access pruning on the table.
>
> DROP TABLE IF EXISTS indexdel;
> CREATE TABLE indexdel(id int8 primary key);
> INSERT INTO indexdel SELECT generate_series(1, 1);
> VACUUM indexdel; -- ensure hint bits are set etc
> DELETE FROM indexdel;
> SELECT pg_current_wal_insert_lsn();
> SET enable_indexonlyscan = false;
> -- This scan finds that the index items are dead - but doesn't yet issue a
> -- btree delete WAL record, that only happens when needing space on the page
> -- again.
> EXPLAIN (COSTS OFF, SUMMARY OFF) SELECT id FROM indexdel WHERE id < 10 ORDER BY id ASC;
> SELECT id FROM indexdel WHERE id < 100 ORDER BY id ASC;
> -- The insertions into the range of values prev
> INSERT INTO indexdel SELECT generate_series(1, 100);
>
> Does generate the btree deletion record, but it also does emit a PRUNE (from
> heapam_index_fetch_tuple() -> heap_page_prune_opt()).
>
> While the session could create a cursor to prevent later HOT cleanup, the
> query would also trigger hot pruning (or prevent the rows from being dead, if
> you declare the cursor before the DELETE). So you'd need overlapping cursors
> in a concurrent session...

Thanks for the scenario and explanation! I agree that a second session would be
needed (and so I understand why I was struggling when trying with a single
session ;-) )

> Too complicated.

Yeah.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: drop postmaster symlink
On 26.01.23 19:36, Karl O. Pinc wrote:
> I see a possible problem at line 1,412 of runtime.sgml. This says:
>
>     in the postmaster's startup script just before invoking the postmaster.
>
> Depending on how this is read, it could be interpreted to mean that a
> "postmaster" binary is invoked. It might be more clear to write:
>
>     ... just before invoking postgres.
>
> Or it might not be worth bothering about; at this point, probably not, but I
> thought you might want the heads-up anyway.

Good find. I have adjusted that, and a few more nearby.
Re: Syncrep and improving latency due to WAL throttling
On Thu, Jan 26, 2023 at 9:21 PM Andres Freund wrote: > > > 7. I think we need to not let backends throttle too frequently even > > though they have crossed wal_throttle_threshold bytes. The best way is > > to rely on replication lag, after all the goal of this feature is to > > keep replication lag under check - say, throttle only when > > wal_distance > wal_throttle_threshold && replication_lag > > > wal_throttle_replication_lag_threshold. > > I think my idea of only forcing to flush/wait an LSN some distance in the past > would automatically achieve that? I'm sorry, I couldn't get your point, can you please explain it a bit more? Looking at the patch, the feature, in its current shape, focuses on improving replication lag (by throttling WAL on the primary) only when synchronous replication is enabled. Why is that? Why can't we design it for replication in general (async, sync, and logical replication)? Keeping replication lag under check enables one to provide a better RPO guarantee as discussed in the other thread https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
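For illustration, a minimal sketch of the "flush/wait an LSN some distance in the past" idea mentioned above (the GUC name wal_throttle_threshold comes from this discussion and does not exist; the helper name and placement are hypothetical, not from the patch):

```c
/*
 * Hypothetical sketch: once a backend has generated wal_throttle_threshold
 * bytes of unacknowledged WAL, make it wait for synchronous standbys to
 * confirm an LSN that lies wal_throttle_threshold bytes *behind* the current
 * insert position.  A standby that is keeping up has normally confirmed that
 * older LSN already, so the backend rarely sleeps; only when replication lag
 * actually builds up does the wait block.  That is how waiting for a lagging
 * LSN implicitly ties throttling to replication lag.
 */
static void
maybe_throttle_wal(uint64 unacked_wal_bytes)
{
	XLogRecPtr	insert_lsn;
	XLogRecPtr	target_lsn;

	if (unacked_wal_bytes < wal_throttle_threshold)
		return;

	insert_lsn = GetXLogInsertRecPtr();
	if (insert_lsn <= wal_throttle_threshold)
		return;
	target_lsn = insert_lsn - wal_throttle_threshold;

	/* Wait until synchronous standbys have confirmed the older target LSN. */
	SyncRepWaitForLSN(target_lsn, false);
}
```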
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 9:58 PM Andres Freund wrote: > It doesn't seem like a great proxy to me. ISTM that this means that how > aggressive vacuum is about opportunistically freezing pages depends on config > variables like checkpoint_timeout & max_wal_size (less common opportunistic > freezing), full_page_writes & use of unlogged tables (no opportunistic > freezing), and the largely random scheduling of autovac workers. The FPI thing was originally supposed to complement the freezing strategies stuff, and possibly other rules that live in lazy_scan_prune. Obviously you can freeze a page by following any rule that you care to invent -- you can decide by calling random(). Two rules can coexist during the same VACUUM (actually, they do already). > Essentially the "any fpi" logic is a very coarse grained way of using the page > LSN as a measurement. As I said, I don't think "has a checkpoint occurred > since the last write" is a good metric to avoid unnecessary freezing - it's > too coarse. But I think using the LSN is the right thought. What about > something like > > lsn_threshold = insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1 > if (/* other conds */ && PageGetLSN(page) <= lsn_threshold) > FreezeMe(); > > I probably got some details wrong, what I am going for with lsn_threshold is > that we'd freeze an already dirty page if it's not been updated within 10% of > the LSN distance to the last VACUUM. It seems to me that you're reinventing something akin to eager freezing strategy here. At least that's how I define it, since now you're bringing the high level context into it; what happens with the table, with VACUUM operations, and so on. Obviously this requires tracking the metadata that you suppose will be available in some way or other, in particular things like lsn_of_last_vacuum. What about unlogged/temporary tables? The obvious thing to do there is what I did in the patch that was reverted (freeze whenever the page will thereby become all-frozen), and forget about LSNs. But you have already objected to that part, specifically. BTW, you still haven't changed the fact that you get rather different behavior with checksums/wal_log_hints. I think that that's good, but you didn't seem to. > I don't think the speculation is that fundamentally different - a heavily > updated table with a bit of a historic, non-changing portion, makes > vacuum_freeze_strategy_threshold freeze way more aggressively than either "any > record" or "any fpi". That's true. The point I was making is that both this proposal and eager freezing are based on some kind of high level picture of the needs of the table, based on high level metadata. To me that's the defining characteristic. > > And even when we lose, you generally still won't have been completely > > wrong -- even then there generally will indeed be a second FPI later > > on for the same page, to go with everything else. This makes the > > wasted freezing even less significant, on a comparative basis! > > This is precisely why I think that we can afford to be quite aggressive about > freezing already dirty pages... I'm beginning to warm to this idea, now that I understand it a little better. -- Peter Geoghegan
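For concreteness, a hedged sketch of the LSN-distance heuristic quoted above; lsn_of_last_vacuum is hypothetical (VACUUM does not currently record such a value), and the 10% factor is just the number from the email:

```c
/*
 * Sketch only: opportunistically freeze an already-dirty page if it has not
 * been modified within the most recent ~10% of the WAL distance since the
 * table's last VACUUM, i.e. the page looks "cold" relative to the table's
 * recent write activity.
 */
static inline bool
page_cold_enough_to_freeze(Page page, XLogRecPtr insert_lsn,
						   XLogRecPtr lsn_of_last_vacuum)
{
	XLogRecPtr	lsn_threshold;

	lsn_threshold = insert_lsn -
		(XLogRecPtr) ((insert_lsn - lsn_of_last_vacuum) * 0.1);

	return PageGetLSN(page) <= lsn_threshold;
}
```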
Re: Reducing power consumption on idle servers
On Fri, Jan 27, 2023 at 7:37 PM Bharath Rupireddy wrote: > On Wed, Jan 25, 2023 at 2:10 AM Tom Lane wrote: > > It's kind of moot until we've reached the point where we can > > credibly claim to have explicit wakeups for every event of > > interest. I don't think we're very close to that today, and > > I do think we should try to get closer. There may come a point > > of diminishing returns though. > > IIUC, we're discussing here whether or not to get rid of hibernate > loops, IOW, sleep-wakeup-doworkifthereisany-sleep loops and rely on > other processes' wakeup signals to reduce the overall power > consumption, am I right? > > I'm trying to understand this a bit - can the signals (especially, > SIGURG that we use to set latches to wake up processes) ever get lost > on the way before reaching the target process? If yes, how? How > frequently can it happen? Is there any history of reported issues in > postgres because a signal got lost? > > I'm reading about Pending Signals and queuing of signals with > sigqueue() (in linux), can any of these guarantee that signals sent > never get lost? Signals don't get lost. At least with the current definition of so called "reliable" signals, in the POSIX standard. (There was a previous system of signals in ancient UNIX that was harder to program with; we still see echos of that today, for example in C standard there are basic signals, and Windows implements those, but they're borderline useless; before a signal handler runs, the handler is also de-registered, so many interesting handlers would have to re-register it, but signals can't be atomically blocked at the same time and there is a window of time where the default behaviour applies, which (from our modern perspective) is clearly insane as there is literally no way to close that race and avoid some fatal default behaviour). BTW sigqueue (signals that are queued up without being consolidated, and carry a user-controlled value with them) are "realtime signals" AKA "signal queuing", which I guess we'll never be able to use because at least Apple didn't implement them; the "reliable signals" we use are the plain kind where signals don't carry any kind of payload and are collapsed into a single "pending" bit, so that if 2 people do kill(SIGFOO) around the same time your handler might only run once (if you registered one and its not blocked), but it will definitely run > 0 times (or be delievered via various other means such as sigwait(), etc...). Then there is the question of races as you go into the wait primitive. I mean in between our reading latch->is_set and entering the kernel, which we defend against using various techniques which you can read about at the top of latch.c; the short version is that if that happens, those system calls will immediately return. The uncertainty discussed here comes from the comment beginning "There is a race condition here, ..." in bgwriter.c, referring to the protocol implemented by StrategyNotifyBgWriter(). I haven't really looked into it. I *guess* it's probably approximately OK if the bgwriter misses a wakeup from time to time, because there's another chance to wake it up as soon as someone needs another buffer, and if you don't need any more buffers soon your system is probably idle; but really "needing another buffer" isn't a great proxy for "more buffers are being made dirty" at all! Someone would need to think about it and construct a complete argument for why it's OK to go to sleep for 60 seconds, or infinity, with that sloppiness in place. 
There's also the walwriter to look into; from memory it was a little less fuzzy but I haven't looked recently.
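As background for why a wakeup cannot be silently lost, a minimal sketch of the flag-plus-latch protocol that latch.c documents (the handler and loop below are illustrative, not copied from any particular auxiliary process; do_work() is hypothetical):

```c
static volatile sig_atomic_t work_pending = false;

/* Signal handler: remember the request and poke the latch. */
static void
handle_wakeup(SIGNAL_ARGS)
{
	work_pending = true;
	SetLatch(MyLatch);			/* at worst causes one spurious wakeup */
}

static void
main_loop(void)
{
	for (;;)
	{
		ResetLatch(MyLatch);

		if (work_pending)
		{
			work_pending = false;
			do_work();			/* hypothetical */
		}

		/*
		 * If SetLatch() ran after the ResetLatch() above -- even between the
		 * work_pending check and this call -- WaitLatch() returns
		 * immediately, so the request is picked up on the next iteration
		 * rather than lost.
		 */
		(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
						 -1, WAIT_EVENT_BGWRITER_MAIN);
	}
}
```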
RE: [Proposal] Add foreign-server health checks infrastructure
I found a cfbot failure; PSA the fixed version. Sorry for the noise.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:
v29-0001-Add-PQconnCheck-and-PQcanConnCheck-to-libpq.patch
v29-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch
v29-0003-add-test.patch
v29-0004-add-kqueue-support-for-PQconnCheck-and-PQcanConn.patch
Re: Considering additional sort specialisation functions for PG16
On Thu, Jan 26, 2023 at 7:15 PM David Rowley wrote: > > On Thu, 26 Jan 2023 at 23:29, John Naylor wrote: > > Coming back to this, I wanted to sketch out this idea in a bit more detail. > > > > Have two memtuple arrays, one for first sortkey null and one for first sortkey non-null: > > - Qsort the non-null array, including whatever specialization is available. Existing specialized comparators could ignore nulls (and their ordering) taking less space in the binary. > > - Only if there is more than one sort key, qsort the null array. Ideally at some point we would have a method of ignoring the first sortkey (this is an existing opportunity that applies elsewhere as well). > > - To handle two arrays, grow_memtuples() would need some adjustment, as would any callers that read the final result of an in-memory sort -- they would need to retrieve the tuples starting with the appropriate array depending on NULLS FIRST/LAST behavior. > > Thanks for coming back to this. I've been thinking about this again > recently due to what was discovered in [1]. That was indeed part of the motivation for bringing this up. > Why I'm bringing this up here is that I wondered, in addition to what > you're mentioning above, if we're making some changes to allow > multiple in-memory arrays, would it be worth going a little further > and allowing it so we could have N arrays which we'd try to keep under > L3 cache size. Maybe some new GUC could be used to know what a good > value is. With fast enough disks, it's often faster to use smaller > values of work_mem which don't exceed L3 to keep these batches > smaller. Keeping them in memory would remove the need to write out > tapes. That's interesting. I don't know enough to guess how complex it would be to make "external" merges agnostic about whether the tapes are on disk or in memory. If in-memory sorts were designed analogously to external ones, grow_memtuples would never have to repalloc, it could just allocate a new tape, which could further shave some cycles. My hunch in my last email was that having separate groups of tapes for each null/non-null first sortkey would be complex, because it increases the number of places that have to know about nulls and their ordering. If we wanted to go to N arrays instead of 2, and additionally wanted separate null/non-null treatment, it seems we would still need 2 sets of arrays, one with N non-null-first-sortkey tapes and one with M null-first-sortkey tapes. So using 2 arrays seems like the logical first step. I'd be curious to hear other possible development paths. > I think the slower sorts I found in [2] could also be partially caused > by the current sort specialisation comparators re-comparing the > leading column during a tie-break. I've not gotten around to disabling > the sort specialisations to see if and how much this is a factor for > that test. Right, that's worth addressing independently of the window function consideration. I'm still swapping this area back in my head, but I believe one issue is that state->base.onlyKey signals two things: "one sortkey, not abbreviated". We could add a separate branch for "first key unabbreviated, nkeys>1" -- I don't think we'd need to specialize, just branch -- and instead of state->base.comparetup, call a set of analogous functions that only handle keys 2 and above (comparetup_tail_* ? or possibly just add a boolean parameter compare_first). 
That would not pose a huge challenge, I think, since they're already written like this: /* Compare the leading sort key */ compare = ApplySortComparator(...); if (compare != 0) return compare; /* Compare additional sort keys */ ... The null/non-null separation would eliminate a bunch of branches in inlined comparators, so we could afford to add another branch for number of keys. I haven't thought through either of these ideas in the gory detail, but I don't yet see any big obstacles. -- John Naylor EDB: http://www.enterprisedb.com
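To make the "compare_first flag / comparetup_tail" idea above a bit more concrete, a rough sketch; comparetup_tail() is hypothetical, while the SortTuple datum1/isnull1 fields and ApplySortComparator() are the existing pieces it leans on. It assumes the leading key is not abbreviated (the case discussed above):

```c
/*
 * Sketch: a comparetup variant that can skip the leading sort key when the
 * specialized qsort has already fully resolved it (including the tie-break),
 * and otherwise compares it from the cached datum1/isnull1.
 */
static int
comparetup_generic(const SortTuple *a, const SortTuple *b,
				   Tuplesortstate *state, bool compare_first)
{
	if (compare_first)
	{
		int32		compare;

		compare = ApplySortComparator(a->datum1, a->isnull1,
									  b->datum1, b->isnull1,
									  &state->base.sortKeys[0]);
		if (compare != 0)
			return compare;
	}

	/* Keys 2..n: defer to a hypothetical tail-only routine. */
	return comparetup_tail(a, b, state);
}
```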
Re: Improve WALRead() to suck data directly from WAL buffers when possible
On Fri, Jan 27, 2023 at 3:17 PM Andres Freund wrote: > > Hi, > > On 2023-01-27 14:24:51 +0900, Masahiko Sawada wrote: > > If I'm understanding this result correctly, it seems to me that your > > patch works well with the WAL DIO patch (WALDIO vs. WAL DIO & WAL > > BUFFERS READ), but there seems no visible performance gain with only > > your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your > > patch should be included in the WAL DIO patch rather than applying it > > alone. Am I missing something? > > We already support using DIO for WAL - it's just restricted in a way that > makes it practically not usable. And the reason for that is precisely that > walsenders need to read the WAL. See get_sync_bit(): > > /* > * Optimize writes by bypassing kernel cache with O_DIRECT when using > * O_SYNC and O_DSYNC. But only if archiving and streaming are > disabled, > * otherwise the archive command or walsender process will read the > WAL > * soon after writing it, which is guaranteed to cause a physical > read if > * we bypassed the kernel cache. We also skip the > * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the > same > * reason. > * > * Never use O_DIRECT in walreceiver process for similar reasons; the > WAL > * written by walreceiver is normally read by the startup process soon > * after it's written. Also, walreceiver performs unaligned writes, > which > * don't work with O_DIRECT, so it is required for correctness too. > */ > if (!XLogIsNeeded() && !AmWalReceiverProcess()) > o_direct_flag = PG_O_DIRECT; > > > Even if that weren't the case, splitting up bigger commits in incrementally > committable chunks is a good idea. Agreed. I was wondering about the fact that the test result doesn't show things to satisfy the first motivation of this patch, which is to improve performance by reducing disk I/O and system calls regardless of the DIO patch. But it makes sense to me that this patch is a part of the DIO patch series. I'd like to confirm whether there is any performance regression caused by this patch in some cases, especially when not using DIO. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
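For readers not following the other thread, a heavily simplified sketch of what "reading WAL directly from wal_buffers" means; the real patch's routine is along the lines of an XLogReadFromBuffers(), and the locking/validation details below are illustrative and from memory, not authoritative:

```c
/*
 * Sketch: try to satisfy a read of one WAL page from wal_buffers.  Copy
 * optimistically, then re-check that the buffer still maps the requested
 * page; if it was concurrently recycled, fall back to reading the file.
 */
static bool
wal_page_from_buffers(XLogRecPtr pagestart, char *dst)
{
	int			idx = XLogRecPtrToBufIdx(pagestart);
	XLogRecPtr	expected_endptr = pagestart + XLOG_BLCKSZ;

	if (XLogCtl->xlblocks[idx] != expected_endptr)
		return false;			/* page not (or no longer) cached */

	memcpy(dst, XLogCtl->pages + (Size) idx * XLOG_BLCKSZ, XLOG_BLCKSZ);

	pg_read_barrier();
	if (XLogCtl->xlblocks[idx] != expected_endptr)
		return false;			/* evicted while we copied; discard */

	return true;				/* no pread(), no kernel cache involved */
}
```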
Re: Reducing power consumption on idle servers
On Wed, Jan 25, 2023 at 2:10 AM Tom Lane wrote: > > Thomas Munro writes: > > Yeah, I definitely want to fix it. I just worry that 60s is so long > > that it also needs that analysis work to be done to explain that it's > > OK that we're a bit sloppy on noticing when to wake up, at which point > > you might as well go to infinity. > > Yeah. The perfectionist in me wants to say that there should be > explicit wakeups for every event of interest, in which case there's > no need for a timeout. The engineer in me says "but what about bugs?". > Better a slow reaction than never reacting at all. OTOH, then you > have to have a discussion about whether 60s (or any other > ice-cap-friendly value) is an acceptable response time even in the > presence of bugs. > > It's kind of moot until we've reached the point where we can > credibly claim to have explicit wakeups for every event of > interest. I don't think we're very close to that today, and > I do think we should try to get closer. There may come a point > of diminishing returns though. IIUC, we're discussing here whether or not to get rid of hibernate loops, IOW, sleep-wakeup-doworkifthereisany-sleep loops and rely on other processes' wakeup signals to reduce the overall power consumption, am I right? I'm trying to understand this a bit - can the signals (especially, SIGURG that we use to set latches to wake up processes) ever get lost on the way before reaching the target process? If yes, how? How frequently can it happen? Is there any history of reported issues in postgres because a signal got lost? I'm reading about Pending Signals and queuing of signals with sigqueue() (in linux), can any of these guarantee that signals sent never get lost? FWIW, a recent commit cd4329d that removed promote_trigger_file also removed hibernation and added wait forever relying on the latch set. If we're worried that the signals can get lost on the way, then this also needs to be fixed. And, I also see lot of WaitLatch() with waiting forever relying on others to wake them up. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Improve WALRead() to suck data directly from WAL buffers when possible
Hi, On 2023-01-27 14:24:51 +0900, Masahiko Sawada wrote: > If I'm understanding this result correctly, it seems to me that your > patch works well with the WAL DIO patch (WALDIO vs. WAL DIO & WAL > BUFFERS READ), but there seems no visible performance gain with only > your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your > patch should be included in the WAL DIO patch rather than applying it > alone. Am I missing something? We already support using DIO for WAL - it's just restricted in a way that makes it practically not usable. And the reason for that is precisely that walsenders need to read the WAL. See get_sync_bit(): /* * Optimize writes by bypassing kernel cache with O_DIRECT when using * O_SYNC and O_DSYNC. But only if archiving and streaming are disabled, * otherwise the archive command or walsender process will read the WAL * soon after writing it, which is guaranteed to cause a physical read if * we bypassed the kernel cache. We also skip the * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same * reason. * * Never use O_DIRECT in walreceiver process for similar reasons; the WAL * written by walreceiver is normally read by the startup process soon * after it's written. Also, walreceiver performs unaligned writes, which * don't work with O_DIRECT, so it is required for correctness too. */ if (!XLogIsNeeded() && !AmWalReceiverProcess()) o_direct_flag = PG_O_DIRECT; Even if that weren't the case, splitting up bigger commits in incrementally committable chunks is a good idea. Greetings, Andres Freund
Re: New strategies for freezing, advancing relfrozenxid early
Hi, On 2023-01-26 19:01:03 -0800, Peter Geoghegan wrote: > On Thu, Jan 26, 2023 at 6:37 PM Andres Freund wrote: > > I also don't really see how that is responsive to anything else in my > > email. That's just as true for the current gating condition (the issuance of > > an FPI during heap_page_prune() / HTSV()). > > > > What I was wondering about is whether we should replace the > > fpi_before != pgWalUsage.wal_fpi > > with > > records_before != pgWalUsage.wal_records && !WouldIssueFpi(page) > > I understand that. What I'm saying is that that's going to create a > huge problem of its own, unless you separately account for that > problem. > The simplest and obvious example is something like a pgbench_tellers > table. VACUUM will generally run fast enough relative to the workload > that it will set some of those pages all-visible. Now it's going to > freeze them, too. Arguably it shouldn't even be setting the pages > all-visible, but now you make that existing problem much worse. So the benefit of the FPI condition is that it indicates that the page hasn't been updated all that recently, because, after all, a checkpoint has happened since? If that's the intention, it needs a huge honking comment - at least I can't read that out of: Also freeze when pruning generated an FPI, if doing so means that we set the page all-frozen afterwards (might not happen until final heap pass). It doesn't seem like a great proxy to me. ISTM that this means that how aggressive vacuum is about opportunistically freezing pages depends on config variables like checkpoint_timeout & max_wal_size (less common opportunistic freezing), full_page_writes & use of unlogged tables (no opportunistic freezing), and the largely random scheduling of autovac workers. I can see it making a difference for pgbench_tellers, but it's a pretty small difference in overall WAL volume. I can think of more adverse workloads though - but even there the difference seems not huge, and not predictably reached. Due to the freeze plan stuff you added, the amount of WAL for freezing a page is pretty darn small compared to the amount of WAL if compared to the amount of WAL needed to fill a page with non-frozen tuples. That's not to say we shouldn't reduce the risk - I agree that both the "any fpi" and the "any record" condition can have adverse effects! However, an already dirty page getting frozen is also the one case where freezing won't have meaningful write amplication effect. So I think it's worth trying spending effort figuring out how we can make freezing in that situation have unlikely and small downsides. The cases with downsides are tables that are very heavily updated througout, where the page is going to be defrosted again almost immediately. As you say, the all-visible marking has a similar problem. Essentially the "any fpi" logic is a very coarse grained way of using the page LSN as a measurement. As I said, I don't think "has a checkpoint occurred since the last write" is a good metric to avoid unnecessary freezing - it's too coarse. But I think using the LSN is the right thought. What about something like lsn_threshold = insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1 if (/* other conds */ && PageGetLSN(page) <= lsn_threshold) FreezeMe(); I probably got some details wrong, what I am going for with lsn_threshold is that we'd freeze an already dirty page if it's not been updated within 10% of the LSN distance to the last VACUUM. 
> The important point is that there doesn't seem to be any good way > around thinking about the table as a whole if you're going to freeze > speculatively. This is not the same dynamic as we see with the FPI > thing IMV -- that's not nearly so speculative as what you're talking > about, since it is speculative in roughly the same sense that eager > freezing was speculative (hence the suggestion that something like > vacuum_freeze_strategy_threshold could have a roll to play). I don't think the speculation is that fundamentally different - a heavily updated table with a bit of a historic, non-changing portion, makes vacuum_freeze_strategy_threshold freeze way more aggressively than either "any record" or "any fpi". > The FPI thing is mostly about the cost now versus the cost later on. > You're gambling that you won't get another FPI later on if you freeze > now. But the cost of a second FPI later on is so much higher than the > added cost of freezing now that that's a very favorable bet, that we > can afford to "lose" many times while still coming out ahead overall. Agreed. And not just avoiding FPIs, avoiding another dirtying of the page! The latter part is especially huge IMO. Depending on s_b size it can also avoid another *read* of the page... > And even when we lose, you generally still won't have been completely > wrong -- even then there generally will indeed be a second FPI later > on for the same page, to go with everything else. This makes the > wasted freezing even
Re: recovery modules
On Wed, Jan 25, 2023 at 04:34:21PM +0900, Michael Paquier wrote:
> The loop part is annoying.. I've never been a fan of adding this
> cross-value checks for the archiver modules in the first place, and it
> would make things much simpler in the checkpointer if we need to think
> about that as we want these values to be reloadable. Perhaps this
> could just be an exception where we just give priority on one over the
> other archive_cleanup_command? The startup process has a well-defined
> sequence after a failure, while the checkpointer is designed to remain
> robust.

Yeah, there are some problems here. If we ERROR, we'll just bounce back to the
sigsetjmp() block once a second, and we'll never pick up configuration reloads,
shutdown signals, etc. If we FATAL, we'll just rapidly restart over and over.
Given the discussion about misconfigured archiving parameters [0], I doubt
folks will be okay with giving priority to one or the other.

I'm currently thinking that the checkpointer should set a flag and clear the
recovery callbacks when a misconfiguration is detected. Anytime the
checkpointer tries to use the archive-cleanup callback, a WARNING would be
emitted. This is similar to an approach I proposed for archiving
misconfigurations (that we didn't proceed with) [1]. Given the aforementioned
problems, this approach might be more suitable for the checkpointer than it is
for the archiver. Thoughts?

[0] https://postgr.es/m/9ee5d180-2c32-a1ca-d3d7-63a723f68d9a%40enterprisedb.com
[1] https://postgr.es/m/20220914222736.GA3042279%40nathanxps13

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
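A minimal sketch of the flag-and-WARNING approach described above (all names here are made up for illustration; this is not code from the patch):

```c
/* Set on reload when archive_cleanup_command and a recovery module conflict. */
static bool recovery_callbacks_misconfigured = false;

static void
checkpointer_check_recovery_config(void)
{
	if (archiveCleanupCommandSet && recoveryModuleCallbacksSet)
	{
		/* Don't ERROR/FATAL in the checkpointer; remember and warn later. */
		recovery_callbacks_misconfigured = true;
	}
	else
		recovery_callbacks_misconfigured = false;
}

static void
checkpointer_archive_cleanup(void)
{
	if (recovery_callbacks_misconfigured)
	{
		ereport(WARNING,
				(errmsg("archive cleanup is disabled due to conflicting recovery settings")));
		return;
	}

	/* ... invoke the configured archive-cleanup callback ... */
}
```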
Re: Improve WALRead() to suck data directly from WAL buffers when possible
On Thu, Jan 26, 2023 at 2:33 PM Bharath Rupireddy wrote: > > On Thu, Jan 26, 2023 at 2:45 AM Andres Freund wrote: > > > > Hi, > > > > On 2023-01-14 12:34:03 -0800, Andres Freund wrote: > > > On 2023-01-14 00:48:52 -0800, Jeff Davis wrote: > > > > On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote: > > > > > Please review the attached v2 patch further. > > > > > > > > I'm still unclear on the performance goals of this patch. I see that it > > > > will reduce syscalls, which sounds good, but to what end? > > > > > > > > Does it allow a greater number of walsenders? Lower replication > > > > latency? Less IO bandwidth? All of the above? > > > > > > One benefit would be that it'd make it more realistic to use direct IO > > > for WAL > > > - for which I have seen significant performance benefits. But when we > > > afterwards have to re-read it from disk to replicate, it's less clearly a > > > win. > > > > Satya's email just now reminded me of another important reason: > > > > Eventually we should add the ability to stream out WAL *before* it has > > locally > > been written out and flushed. Obviously the relevant positions would have to > > be noted in the relevant message in the streaming protocol, and we couldn't > > generally allow standbys to apply that data yet. > > > > That'd allow us to significantly reduce the overhead of synchronous > > replication, because instead of commonly needing to send out all the pending > > WAL at commit, we'd just need to send out the updated flush position. The > > reason this would lower the overhead is that: > > > > a) The reduced amount of data to be transferred reduces latency - it's easy > > to > >accumulate a few TCP packets worth of data even in a single small OLTP > >transaction > > b) The remote side can start to write out data earlier > > > > > > Of course this would require additional infrastructure on the receiver > > side. E.g. some persistent state indicating up to where WAL is allowed to be > > applied, to avoid the standby getting ahead of th eprimary, in case the > > primary crash-restarts (or has more severe issues). > > > > > > With a bit of work we could perform WAL replay on standby without waiting > > for > > the fdatasync of the received WAL - that only needs to happen when a) we > > need > > to confirm a flush position to the primary b) when we need to write back > > pages > > from the buffer pool (and some other things). > > Thanks Andres, Jeff and Satya for taking a look at the thread. Andres > is right, the eventual plan is to do a bunch of other stuff as > described above and we've discussed this in another thread (see > below). I would like to once again clarify motivation behind this > feature: > > 1. It enables WAL readers (callers of WALRead() - wal senders, > pg_walinspect etc.) to use WAL buffers as first level cache which > might reduce number of IOPS at a peak load especially when the pread() > results in a disk read (WAL isn't available in OS page cache). I had > earlier presented the buffer hit ratio/amount of pread() system calls > reduced with wal senders in the first email of this thread (95% of the > time wal senders are able to read from WAL buffers without impacting > anybody). Now, here are the results with the WAL DIO patch [1] - where > WAL pread() turns into a disk read, see the results [2] and attached > graph. > > 2. As Andres rightly mentioned, it helps WAL DIO; since there's no OS > page cache, using WAL buffers as read cache helps a lot. 
It is clearly > evident from my experiment with WAL DIO patch [1], see the results [2] > and attached graph. As expected, WAL DIO brings down the TPS, whereas > WAL buffers read i.e. this patch brings it up. > > [2] Test case is an insert pgbench workload.
>
> clients   HEAD     WAL DIO   WAL DIO & WAL BUFFERS READ   WAL BUFFERS READ
> 1         1404     1070      1424                         1375
> 2         1487     796       1454                         1517
> 4         3064     1743      3011                         3019
> 8         6114     3556      6026                         5954
> 16        11560    7051      12216                        12132
> 32        23181    13079     23449                        23561
> 64        43607    26983     43997                        45636
> 128       80723    45169     81515                        81911
> 256       110925   90185     107332                       114046
> 512       119354   109817    110287                       117506
> 768       112435   105795    106853                       111605
> 1024      107554   105541    105942                       109370
> 2048      88552    79024     80699                        90555
> 4096      61323    54814     58704                        61743

If I'm understanding this result correctly, it seems to me that your patch
works well with the WAL DIO patch (WAL DIO vs. WAL DIO & WAL BUFFERS READ), but
there seems to be no visible performance gain with only your patch (HEAD vs.
WAL BUFFERS READ). So it seems to me that your patch should be included in the
WAL DIO patch rather than applying it alone. Am I missing something?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Assertion failure in SnapBuildInitialSnapshot()
On Thu, Dec 8, 2022 at 8:17 AM Masahiko Sawada wrote: > > The same assertion failure has been reported on another thread[1]. > Since I could reproduce this issue several times in my environment > I've investigated the root cause. > > I think there is a race condition of updating > procArray->replication_slot_xmin by CreateInitDecodingContext() and > LogicalConfirmReceivedLocation(). > > What I observed in the test was that a walsender process called: > SnapBuildProcessRunningXacts() > LogicalIncreaseXminForSlot() > LogicalConfirmReceivedLocation() > ReplicationSlotsComputeRequiredXmin(false). > > In ReplicationSlotsComputeRequiredXmin() it acquired the > ReplicationSlotControlLock and got 0 as the minimum xmin since there > was no wal sender having effective_xmin. > What about the current walsender process which is processing running_xacts via SnapBuildProcessRunningXacts()? Isn't that walsender slot's effective_xmin have a non-zero value? If not, then why? -- With Regards, Amit Kapila.
Re: Inconsistency in reporting checkpointer stats
On Wed, Dec 21, 2022 at 5:15 PM Nitin Jadhav wrote:
> I have modified the code accordingly and attached the new version of
> patches. patch 0001 fixes the inconsistency in checkpointer stats and
> patch 0002 separates main buffer and SLRU buffer count from checkpoint
> complete log message.

IMO, there's no need for 2 separate patches for these changes.

+(errmsg("restartpoint complete: wrote %d buffers (%.1f%%), "
+"wrote %d slru buffers (%.1f%%); %d WAL file(s) added, "
+"%d removed, %d recycled; write=%ld.%03d s, "
+"sync=%ld.%03d s, total=%ld.%03d s; sync files=%d, "
+"longest=%ld.%03d s, average=%ld.%03d s; distance=%d kB, "
+"estimate=%d kB; lsn=%X/%X, redo lsn=%X/%X",

Hm, the restartpoint/checkpoint complete message is already too long to read,
and adding SLRU buffers to it makes it even longer. Note that we don't need to
add every checkpoint stat to the log message, but to pg_stat_bgwriter. Isn't it
enough to show SLRU buffers information in pg_stat_bgwriter alone? Can't one
look at pg_stat_slru's blks_written (pgstat_count_slru_page_written()) to
really track the SLRUs written? Or is it that one may want to track SLRUs
during a checkpoint separately? Is there a real use-case/customer-reported
issue driving this change?

After looking at pg_stat_slru.blks_written, I think the best way is to just
leave things as-is and let CheckpointStats count slru buffers too unless
there's really a reported issue that says pg_stat_slru.blks_written doesn't
serve the purpose.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Something is wrong with wal_compression
On Fri, 2023-01-27 at 16:15 +1300, Thomas Munro wrote: > On Fri, Jan 27, 2023 at 3:04 PM Tom Lane wrote: > > Thomas Munro writes: > > > On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier > > > wrote: > > > > My opinion would be to make this function more reliable, FWIW, even if > > > > that involves a performance impact when called in a close loop by > > > > forcing more WAL flushes to ensure its report durability and > > > > consistency. > > > > > Yeah, the other thread has a patch for that. But it would hurt some > > > workloads. > > > > I think we need to get the thing correct first and worry about > > performance later. What's wrong with simply making pg_xact_status > > write and flush a record of the XID's existence before returning it? > > Yeah, it will cost you if you use that function, but not if you don't. > > There is no > doubt that the current situation is unacceptable, though, so maybe we > really should just do it and make a faster one later. Anyone else > want to vote on this? I wasn't aware of the existence of pg_xact_status, so I suspect that it is not a widely known and used feature. After reading the documentation, I'd say that anybody who uses it will want it to give a reliable answer. So I'd agree that it is better to make it more expensive, but live up to its promise. Yours, Laurenz Albe
RE: Exit walsender before confirming remote flush in logical replication
Dear Dilip, hackers,

Thanks for giving your opinion. I analyzed the relation with the given commit,
and I think I can keep my patch. What do you think?

# Abstract

* Some modifications should be needed.
* We cannot roll back the shutdown if walsenders are stuck.
* We don't have a good way to detect the stuck state.

# Discussion

Compared to physical replication, it is more likely that logical replication
gets stuck. I think the risk should be avoided as much as possible by fixing
the code. Then, if it leads to another failure, we can document the caution to
users.

While shutting down the server, the checkpointer sends a SIGUSR1 signal to
walsenders. It is done after the other processes have exited, so we cannot
raise an ERROR and roll back the operation if the checkpointer recognizes that
a process is stuck at that time.

We don't have any feature by which the postmaster can check whether this node
is a publisher or not. So if we want to add a mechanism that can check the
health of walsenders before shutting down, we must do that at the top of
process_pm_shutdown_request() even if we are not in logical replication. I
think that affects the basis of postgres largely, and in the first place,
PostgreSQL does not have a mechanism to check the health of a process.
Therefore, I want to adopt the approach that the walsender itself exits
immediately when it gets the signal.

## About patch

- Were fixes correct?

In ProcessPendingWrites(), in my patch, the walsender calls WalSndDone() when
it gets the SIGUSR1 signal. I think this is how it should be. From the
patch [1]:

```
@@ -1450,6 +1450,10 @@ ProcessPendingWrites(void)
 		/* Try to flush pending output to the client */
 		if (pq_flush_if_writable() != 0)
 			WalSndShutdown();
+
+		/* If we got shut down requested, try to exit the process */
+		if (got_STOPPING)
+			WalSndDone(XLogSendLogical);
 	}
 
 	/* reactivate latch so WalSndLoop knows to continue */
```

Per my analysis, in the case of logical replication, walsenders exit with the
following steps. Note that a logical walsender does not receive the SIGUSR2
signal; it sets the flag by itself instead:

1. postmaster sends shutdown requests to checkpointer
2. checkpointer sends SIGUSR1 to walsenders and waits
3. when walsenders accept SIGUSR1, they turn got_SIGUSR1 on
4. walsenders consume all WALs @XLogSendLogical
5. walsenders turn got_SIGUSR2 on by themselves @XLogSendLogical
6. walsenders recognize the flag is on, so call WalSndDone() @WalSndLoop
7. proc_exit(0)
8. checkpointer writes shutdown record
...

Type (i) stuck, which I reported in -hackers [1], means that processes stop at
step 6, and Type (ii) stuck means that processes stop at step 4. In step 4,
got_SIGUSR2 is never set to on, so we must use the got_STOPPING flag.

[1]: https://www.postgresql.org/message-id/tycpr01mb58701a47f35fed0a2b399662f5...@tycpr01mb5870.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Improve GetConfigOptionValues function
On Tue, Jan 24, 2023 at 8:43 PM Tom Lane wrote: > > Bharath Rupireddy writes: > > On Mon, Jan 23, 2023 at 9:51 PM Tom Lane wrote: > >> Also, I intentionally dropped the GUC_NO_SHOW_ALL check in > >> get_explain_guc_options, because it seems redundant given > >> the preceding GUC_EXPLAIN check. It's unlikely we'd ever have > >> a variable that's marked both GUC_EXPLAIN and GUC_NO_SHOW_ALL ... > >> but if we did, shouldn't the former take precedence here anyway? > > > You're right, but there's nothing that prevents users writing GUCs > > with GUC_EXPLAIN and GUC_NO_SHOW_ALL. > > "Users"? You do realize those flags are only settable by C code, > right? I meant extensions here. > Moreover, you haven't explained why it would be good that > you can't get at the behavior that a GUC is both shown in EXPLAIN > and not shown in SHOW ALL. If you want "not shown by either", > that's already accessible by setting only the GUC_NO_SHOW_ALL > flag. So I'd almost argue this is a bug fix, though I concede > it's a bit hard to imagine why somebody would want that choice. > Still, if we have two independent flags they should produce four > behaviors, not just three. Got it. I think I should've looked at GUC_NO_SHOW_ALL and GUC_EXPLAIN as separate flags not depending on each other in any way, meaning, GUCs marked with GUC_NO_SHOW_ALL mustn't be shown in SHOW ALL and its friends pg_settings and pg_show_all_settings(), and GUCs marked with GUC_EXPLAIN must be shown in explain output. This understanding is clear and simple IMO. Having said that, I have no objection to the v4 patch posted upthread. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
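As an aside for extension authors, the two flags really are independent; a hedged sketch of a custom GUC that is shown by EXPLAIN but hidden from SHOW ALL (the extension and variable names are invented for illustration):

```c
static int	my_batch_size = 64;

void
_PG_init(void)
{
	DefineCustomIntVariable("my_ext.batch_size",
							"Batch size used by my_ext.",
							NULL,
							&my_batch_size,
							64, 1, 1024,
							PGC_USERSET,
							GUC_EXPLAIN | GUC_NO_SHOW_ALL,	/* both flags at once */
							NULL, NULL, NULL);
}
```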
Re: Add LZ4 compression in pg_dump
On Mon, Jan 16, 2023 at 11:54:46AM +0900, Michael Paquier wrote:
> On Sun, Jan 15, 2023 at 07:56:25PM -0600, Justin Pryzby wrote:
> > On Mon, Jan 16, 2023 at 10:28:50AM +0900, Michael Paquier wrote:
> >> The functions changed by 0001 are cfopen[_write](),
> >> AllocateCompressor() and ReadDataFromArchive(). Why is it a good idea
> >> to change these interfaces which basically exist to handle inputs?
> >
> > I changed to pass pg_compress_specification as a pointer, since that's
> > the usual convention for structs, as followed by the existing uses of
> > pg_compress_specification.
>
> Okay, but what do we gain here? It seems to me that this introduces
> the risk that a careless change in one of the internal routines could
> slightly change compress_spec, hence impacting any of their
> callers? Or is that fixing an actual bug (except if I am missing your
> point, that does not seem to be the case)?

To circle back to this: I was not saying there's any bug. The proposed change
was only to follow the normal, existing conventions for passing structs. It
could also be a pointer to const. It's fine with me if you say that it's
intentional how it's written already.

> >> Is there some benefit in changing compression_spec within the
> >> internals of these routines before going back one layer down to their
> >> callers? Changing the compression_spec on-the-fly in these internal
> >> paths could be risky, actually, no?
> >
> > I think what you're saying is that if the spec is passed as a pointer,
> > then the called functions shouldn't set spec->algorithm=something.
>
> Yes. HEAD makes sure of that, 0001 would not prevent that. So I am a
> bit confused in seeing how this is a benefit.
RE: [Proposal] Add foreign-server health checks infrastructure
Dear hackers,

I have updated my patch for error handling and kqueue() support. I do not
actually have a BSD-like machine, but I developed it using the GitHub CI. I
think we should focus on 0001-0003 first, and then work on 0004. The following
are change notes and my analysis.

0001
* Fix missed replacements of PQConnCheck() with PQconnCheck().
* Error handling was improved. Now we can detect a failure of poll() and
  return -1 in that case.
* I think we do not have to add select(2) support in PQconnCheck(). According
  to the man page [1], select(2) can only be used to watch whether a
  descriptor is readable, writable, or has an exceptional condition. It means
  that select() does not have an event corresponding to POLLRDHUP.

0002, 0003
Not changed.

0004
* Add kqueue(2) support for the BSD family.
* I did not add epoll() support, because it can be used only on Linux, and
  such systems have POLLRDHUP instead. I checked other code in libpq, and it
  does not use epoll(). Such an event is not specified in POSIX [2] and can be
  used on Linux only. I decided to use poll() as much as possible to keep that
  policy.
* While coding, I found that there is no good time to close the kernel event
  queue. It means that the lifetime of the kqueue becomes the same as the
  client program's, occupying a small amount of memory forever. I'm not sure
  whether that is acceptable.

[1]: https://man7.org/linux/man-pages/man2/select.2.html
[2]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/poll.html

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:
v28-0001-Add-PQconnCheck-and-PQcanConnCheck-to-libpq.patch
v28-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch
v28-0003-add-test.patch
v28-0004-add-kqueue-support-for-PQconnCheck-and-PQcanConn.patch
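For reference, a rough sketch of the kind of poll()-based liveness probe discussed here (the helper name is invented; the patch's PQconnCheck() presumably does something along these lines, possibly handling more cases):

```c
#define _GNU_SOURCE				/* for POLLRDHUP on Linux */
#include <poll.h>

/*
 * Returns 1 if the peer appears to have closed the connection, 0 if it still
 * looks alive, -1 if poll() itself failed.
 */
static int
socket_peer_closed(int sock)
{
	struct pollfd pfd;
	int			rc;

	pfd.fd = sock;
	pfd.events = POLLRDHUP;		/* not in POSIX; Linux-specific, hence kqueue for BSDs */
	pfd.revents = 0;

	rc = poll(&pfd, 1, 0);		/* timeout 0: never block */
	if (rc < 0)
		return -1;
	if (pfd.revents & (POLLRDHUP | POLLHUP | POLLERR | POLLNVAL))
		return 1;
	return 0;
}
```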
Re: Something is wrong with wal_compression
Thomas Munro writes: > On Fri, Jan 27, 2023 at 3:04 PM Tom Lane wrote: >> I think we need to get the thing correct first and worry about >> performance later. What's wrong with simply making pg_xact_status >> write and flush a record of the XID's existence before returning it? >> Yeah, it will cost you if you use that function, but not if you don't. > It would be pg_current_xact_id() that would have to pay the cost of > the WAL flush, not pg_xact_status() itself, Right, typo on my part. regards, tom lane
Re: Monotonic WindowFunc support for ntile(), percent_rank() and cume_dist()
On Wed, 25 Jan 2023 at 02:39, Melanie Plageman wrote:
> On Tue, Jan 24, 2023 at 02:00:33PM +1300, David Rowley wrote:
> > If you feel strongly about that, then feel free to show me what you
> > have in mind in more detail so I can think harder about it.
>
> Nah, I don't feel strongly. I think it was because looking at the patch
> in isolation, the repetition stands out but in the context of the rest
> of the code it doesn't.

I played around with this patch again and the performance gains for the best
case are not quite as good as we got for row_number(), rank() and dense_rank()
in the original commit. The reason for this is that ntile() and co all require
getting a count of the total rows in the partition so it can calculate the
result. ntile() needs to know how large the tiles are, for example. That
obviously requires pulling all tuples into the tuplestore. Despite this, the
performance with the patch is still much better than without.

Here's master:

explain (analyze, timing off, costs off) select * from (select a,ntile(10) over (order by a) nt from a) where nt < 1;
 Subquery Scan on unnamed_subquery (actual rows=0 loops=1)
   Filter: (unnamed_subquery.nt < 1)
   Rows Removed by Filter: 100
   ->  WindowAgg (actual rows=100 loops=1)
         ->  Index Only Scan using a_a_idx on a (actual rows=100 loops=1)
               Heap Fetches: 0
 Planning Time: 0.073 ms
 Execution Time: 254.118 ms
(8 rows)

and with the patch:

 WindowAgg (actual rows=0 loops=1)
   Run Condition: (ntile(10) OVER (?) < 1)
   ->  Index Only Scan using a_a_idx on a (actual rows=100 loops=1)
         Heap Fetches: 0
 Planning Time: 0.072 ms
 Execution Time: 140.023 ms
(6 rows)

Here's with row_number() for reference:

explain (analyze, timing off, costs off) select * from (select a,row_number() over (order by a) rn from a) where rn < 1;
 WindowAgg (actual rows=0 loops=1)
   Run Condition: (row_number() OVER (?) < 1)
   ->  Index Only Scan using a_a_idx on a (actual rows=1 loops=1)
         Heap Fetches: 0
 Planning Time: 0.089 ms
 Execution Time: 0.054 ms
(6 rows)

you can clearly see the difference with the number of rows pulled out of the
index only scan. This is just a 1 million row table with a single INT column
and an index on that column.

Anyway, all seems like clear wins and low controversy so I've now pushed it.

David
Re: Something is wrong with wal_compression
On Fri, Jan 27, 2023 at 3:04 PM Tom Lane wrote: > Thomas Munro writes: > > On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier wrote: > >> My opinion would be to make this function more reliable, FWIW, even if > >> that involves a performance impact when called in a close loop by > >> forcing more WAL flushes to ensure its report durability and > >> consistency. > > > Yeah, the other thread has a patch for that. But it would hurt some > > workloads. > > I think we need to get the thing correct first and worry about > performance later. What's wrong with simply making pg_xact_status > write and flush a record of the XID's existence before returning it? > Yeah, it will cost you if you use that function, but not if you don't. It would be pg_current_xact_id() that would have to pay the cost of the WAL flush, not pg_xact_status() itself, but yeah that's what the patch does (with some optimisations). I guess one question is whether there are any other reasonable real world uses of pg_current_xact_id(), other than the original goal[1]. If not, then at least you are penalising the right users, even though they probably only actually call pg_current_status() in extremely rare circumstances (if COMMIT hangs up). But I wouldn't be surprised if people have found other reasons to be interested in xid observability, related to distributed transactions and snapshots and suchlike. There is no doubt that the current situation is unacceptable, though, so maybe we really should just do it and make a faster one later. Anyone else want to vote on this? [1] https://www.postgresql.org/message-id/flat/CAMsr%2BYHQiWNEi0daCTboS40T%2BV5s_%2Bdst3PYv_8v2wNVH%2BXx4g%40mail.gmail.com
Re: Generating code for query jumbling through gen_node_support.pl
On Thu, Jan 26, 2023 at 09:39:05AM +0100, Peter Eisentraut wrote:
> There are a couple of repetitive comments, like "typmod and collation
> information are irrelevant for the query jumbling". This applies to all
> nodes, so we don't need to repeat it for a number of nodes (and then not
> mention it for other nodes). Maybe there should be a central place somewhere
> that describes "these kinds of fields should normally be ignored".

The place that would make the most sense to me to centralize this knowledge is
nodes.h itself, because that's where the node attributes are defined?
--
Michael
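For context, the patch set expresses these rules as per-field attributes in the node definitions, so centralizing the knowledge in the headers could look roughly like the hedged sketch below (the node and field names are invented; the attribute spelling is the one used by the patch as I understand it, and may differ):

```c
/* Illustrative node definition, not an actual PostgreSQL node. */
typedef struct MyFuncExpr
{
	Expr		xpr;
	Oid			funcid;			/* participates in the jumble */
	/* typmod/collation carry no semantic weight, so skip them */
	Oid			funccollid pg_node_attr(query_jumble_ignore);
	List	   *args;			/* jumbled recursively */
	/* token position: ignored for the hash (a Const's location would instead
	 * be marked so constants can be masked in the normalized query text) */
	int			location pg_node_attr(query_jumble_ignore);
} MyFuncExpr;
```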
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 6:37 PM Andres Freund wrote: > I also don't really see how that is responsive to anything else in my > email. That's just as true for the current gating condition (the issuance of > an FPI during heap_page_prune() / HTSV()). > > What I was wondering about is whether we should replace the > fpi_before != pgWalUsage.wal_fpi > with > records_before != pgWalUsage.wal_records && !WouldIssueFpi(page) I understand that. What I'm saying is that that's going to create a huge problem of its own, unless you separately account for that problem. The simplest and obvious example is something like a pgbench_tellers table. VACUUM will generally run fast enough relative to the workload that it will set some of those pages all-visible. Now it's going to freeze them, too. Arguably it shouldn't even be setting the pages all-visible, but now you make that existing problem much worse. The important point is that there doesn't seem to be any good way around thinking about the table as a whole if you're going to freeze speculatively. This is not the same dynamic as we see with the FPI thing IMV -- that's not nearly so speculative as what you're talking about, since it is speculative in roughly the same sense that eager freezing was speculative (hence the suggestion that something like vacuum_freeze_strategy_threshold could have a roll to play). The FPI thing is mostly about the cost now versus the cost later on. You're gambling that you won't get another FPI later on if you freeze now. But the cost of a second FPI later on is so much higher than the added cost of freezing now that that's a very favorable bet, that we can afford to "lose" many times while still coming out ahead overall. And even when we lose, you generally still won't have been completely wrong -- even then there generally will indeed be a second FPI later on for the same page, to go with everything else. This makes the wasted freezing even less significant, on a comparative basis! It's also likely true that an FPI in lazy_scan_prune is a much stronger signal, but I think that the important dynamic is that we're reasoning about "costs now vs costs later on". The asymmetry is really important. -- Peter Geoghegan
Re: Generating code for query jumbling through gen_node_support.pl
On Thu, Jan 26, 2023 at 09:37:13AM +0100, Peter Eisentraut wrote:
> Ok, the documentation makes sense now. I wonder what the performance impact
> is. Probably, nobody cares about microoptimizing CREATE TABLE statements.
> But BEGIN/COMMIT could matter. However, whatever you do in between the
> BEGIN and COMMIT will already be jumbled, so you're already paying the
> overhead. Hopefully, jumbling such simple commands will have no noticeable
> overhead.
>
> In other words, we should test this and hopefully get rid of the 'string'
> method.

Yep. I have mentioned a few numbers upthread, and this deserves discussion.
FYI, I have done more micro-benchmarking to compare both methods for utility
queries by hijacking JumbleQuery() to run the computation in a tight loop run
N times (could not come up with a better idea to avoid the repeated
palloc/pfree overhead), as the path to stress is _jumbleNode(). See the
attached, that should be able to apply on top of the latest patch set (named
as .txt to not feed it to the CF bot, and need to recompile to switch the
iteration). Using that, I can compile the following results for various cases
(-O2 and compute_query_id=on):

          query          |  mode  | iterations | avg_runtime_ns | avg_jumble_ns
-------------------------+--------+------------+----------------+---------------
 begin                   | string |       5000 |        4.53116 |          4.54
 begin                   | jumble |       5000 |       30.94578 |         30.94
 commit                  | string |       5000 |        4.76004 |          4.74
 commit                  | jumble |       5000 |        31.4791 |         31.48
 create table 1 column   | string |       5000 |        7.22836 |          7.08
 create table 1 column   | jumble |       5000 |      152.10852 |        151.96
 create table 5 columns  | string |       5000 |       12.43412 |         12.28
 create table 5 columns  | jumble |       5000 |      352.88976 |         349.1
 create table 20 columns | string |        500 |         49.591 |          48.2
 create table 20 columns | jumble |        500 |      2272.4066 |          2271
 drop table 1 column     | string |       5000 |        6.70538 |          6.56
 drop table 1 column     | jumble |       5000 |          50.38 |         50.24
 drop table 5 columns    | string |       5000 |        6.88256 |          6.74
 drop table 5 columns    | jumble |       5000 |       50.02898 |          49.9
 SET work_mem            | string |       5000 |        7.28752 |          7.28
 SET work_mem            | jumble |       5000 |       91.66588 |         91.64
(16 rows)

avg_runtime_ns is (query runtime / iterations) and avg_jumble_ns is the same
with the difference between the start/end logs in the txt patch attached. The
overhead to run the query does not matter much if you compare both. The time
it takes to run a jumble is correlated to the number of nodes to go through
for each query, and there is a larger gap for more nodes to go through. Well,
a simple "begin" or "commit" query has its computation time increase from 4ns
to 30ns on average, which would be unnoticeable. The gap is larger for larger
nodes, like SET, still we jump from 7ns to 90ns in this case. DDLs take the
most hit with this method, where a 20-column CREATE TABLE jumps from 50ns to
2us (note that the iteration count is 10 times lower here).
At the end, that would be unnoticeable for the average user, I guess, but here are the numbers I get on my laptop :) -- Michael From ee47c4e1137adc857184d76ab62232e9c3c95451 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Fri, 27 Jan 2023 10:58:34 +0900 Subject: [PATCH 1/1] Add benchmark tweaks to stress jumbling code --- src/backend/nodes/queryjumblefuncs.c | 36 +--- 1 file changed, 27 insertions(+), 9 deletions(-) diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c index d55adf5020..9c134cc5ae 100644 --- a/src/backend/nodes/queryjumblefuncs.c +++ b/src/backend/nodes/queryjumblefuncs.c @@ -101,32 +101,48 @@ CleanQuerytext(const char *query, int *location, int *len) JumbleState * JumbleQuery(Query *query, const char *querytext) { +#define MAX_QUERY_COMPUTE 500 + JumbleState *jstate = NULL; Assert(IsQueryIdEnabled()); + elog(WARNING, "start JumbleQuery"); + if (query->utilityStmt && utility_query_id == UTILITY_QUERY_ID_STRING) { - query->queryId = compute_utility_query_id(querytext, - query->stmt_location, - query->stmt_len); + for (int i = 0; i < MAX_QUERY_COMPUTE ; i++) + { + query->queryId = compute_utility_query_id(querytext, +
Re: New strategies for freezing, advancing relfrozenxid early
Hi, On 2023-01-26 15:36:52 -0800, Peter Geoghegan wrote: > On Thu, Jan 26, 2023 at 12:45 PM Andres Freund wrote: > > > Most of the overhead of FREEZE WAL records (with freeze plan > > > deduplication and page-level freezing in) is generic WAL record header > > > overhead. Your recent adversarial test case is going to choke on that, > > > too. At least if you set checkpoint_timeout to 1 minute again. > > > > I don't quite follow. What do you mean with "record header overhead"? Unless > > that includes FPIs, I don't think that's that commonly true? > > Even if there are no directly observable FPIs, there is still extra > WAL, which can cause FPIs indirectly, just by making checkpoints more > frequent. I feel ridiculous even having to explain this to you. What does that have to do with "generic WAL record overhead"? I also don't really see how that is responsive to anything else in my email. That's just as true for the current gating condition (the issuance of an FPI during heap_page_prune() / HTSV()). What I was wondering about is whether we should replace the fpi_before != pgWalUsage.wal_fpi with records_before != pgWalUsage.wal_records && !WouldIssueFpi(page) > > The problematic case I am talking about is when we do *not* emit a WAL > > record > > during pruning (because there's nothing to prune), but want to freeze the > > table. If you don't log an FPI, the remaining big overhead is that > > increasing > > the LSN on the page will often cause an XLogFlush() when writing out the > > buffer. > > > > I don't see what your reference to checkpoint timeout is about here? > > > > Also, as I mentioned before, the problem isn't specific to > > checkpoint_timeout > > = 1min. It just makes it cheaper to reproduce. > > That's flagrantly intellectually dishonest. Sure, it made it easier to > reproduce. But that's not all it did! > > You had *lots* of specific numbers and technical details in your first > email, such as "Time for vacuuming goes up to ~5x. WAL volume to > ~9x.". But you did not feel that it was worth bothering with details > like having set checkpoint_timeout to 1 minute, which is a setting > that nobody uses, and obviously had a multiplicative effect. That > detail was unimportant. I had to drag it out of you! The multiples were for checkpoint_timeout=5min, with '250s' instead of WHERE ts < now() - '120s' I started out with checkpoint_timeout=1min, as I wanted to quickly test my theory. Then I increased checkpoint_timeout back to 5min, adjusted the query to some randomly guessed value. Happened to get nearly the same results. I then experimented more with '1min', because it's less annoying to have to wait for 120s until deletions start, than to wait for 250s. Because it's quicker to run I thought I'd share the less resource intensive version. A mistake as I now realize. This wasn't intended as a carefully designed benchmark, or anything. It was a quick proof for a problem that I found obvious. And it's not something worth testing carefully - e.g. the constants in the test are actually quite hardware specific, because the insert/seconds rate is very machine specific, and it's completely unnecessarily hardware intensive due to the use of single-row inserts, instead of batched operations. It's just a POC. > You basically found a way to add WAL overhead to a system/workload > that is already in a write amplification vicious cycle, with latent > tipping point type behavior. 
> > There is a practical point here, that is equally obvious, and yet > somehow still needs to be said: benchmarks like that one are basically > completely free of useful information. If we can't agree on how to > assess such things in general, then what can we agree on when it comes > to what should be done about it, what trade-off to make, when it comes > to any similar question? It's not at all free of useful information. It reproduces a problem I predicted repeatedly, that others in the discussion also wondered about, that you refused to acknowledge or address. It's not a good benchmark - I completely agree with that much. It was not designed to carefully benchmark different settings or such. It was designed to show a problem. And it does that. > > You're right, it makes sense to consider whether we'll emit a > > XLOG_HEAP2_VISIBLE anyway. > > As written the page-level freezing FPI mechanism probably doesn't > really stand to benefit much from doing that. Either checksums are > disabled and it's just a hint, or they're enabled and there is a very > high chance that we'll get an FPI inside lazy_scan_prune rather than > right after it is called, when PD_ALL_VISIBLE is set. I think it might be useful with logged hint bits, consider cases where all the tuples on the page were already fully hinted. That's not uncommon, I think? > > > > A less aggressive version would be to check if any WAL records were > > > > emitted > > > > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI > > >
Re: Something is wrong with wal_compression
Thomas Munro writes: > On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier wrote: >> My opinion would be to make this function more reliable, FWIW, even if >> that involves a performance impact when called in a close loop by >> forcing more WAL flushes to ensure its report durability and >> consistency. > Yeah, the other thread has a patch for that. But it would hurt some > workloads. I think we need to get the thing correct first and worry about performance later. What's wrong with simply making pg_xact_status write and flush a record of the XID's existence before returning it? Yeah, it will cost you if you use that function, but not if you don't. > A better patch would do some kind of amortisation > (reserving N xids at a time or some such scheme, while being careful > to make sure the right CLOG pages etc exist if you crash and skip a > bunch of xids on recovery) but be more complicated. Maybe that would be appropriate for HEAD, but I'd be wary of adding anything complicated to the back branches. This is clearly a very under-tested area. regards, tom lane
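As a point of reference, the documented usage pattern at stake looks roughly
like this (an illustrative sketch, not taken from any message in this thread;
the literal xid value is made up and stands in for whatever the client
recorded before losing its connection):

    BEGIN;
    SELECT pg_current_xact_id();          -- client records the returned xid8, say 1234
    COMMIT;                               -- suppose the connection is lost around here
    -- later, on a new connection, the client asks what became of that transaction:
    SELECT pg_xact_status('1234'::xid8);  -- 'committed', 'aborted' or 'in progress'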
Re: wrong Append/MergeAppend elision?
On Fri, Jan 27, 2023 at 5:43 AM Tom Lane wrote: > David Rowley writes: > > On Fri, 27 Jan 2023 at 01:30, Amit Langote wrote: > >> It seems that the planner currently elides an Append/MergeAppend that > >> has run-time pruning info (part_prune_index) set, but which I think is > >> a bug. > > > There is still the trade-off of having to pull tuples through the > > Append node for when run-time pruning is unable to prune the last > > partition. So your proposal to leave the Append alone when there's > > run-time pruning info is certainly not a no-brainer. > > Yeah. Amit's proposal amounts to optimizing for the case that all > partitions get pruned, which does not seem to me to be the way > to bet. I'm inclined to think it's fine as-is. Fair enough. I thought for a second that maybe it was simply an oversight but David confirms otherwise. This was interacting badly with the other patch I'm working on and I just figured out the problem was with that other patch. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Re: Add LZ4 compression in pg_dump
On Thu, Jan 26, 2023 at 12:22:45PM -0600, Justin Pryzby wrote: > Yeah. But the original log_warning text was better, and should be > restored: > > - if (AH->compression != 0) > - pg_log_warning("archive is compressed, but this installation > does not support compression -- no data will be available"); Yeah, this one's on me. So I have gone with the simplest solution and applied a fix to restore the original behavior, with the same warning showing up. -- Michael
Re: New strategies for freezing, advancing relfrozenxid early
Hi,

On 2023-01-26 14:27:53 -0500, Robert Haas wrote:
> One idea that I've had about how to solve this problem is to try to
> make vacuum try to aggressively freeze some portion of the table on
> each pass, and to behave less aggressively on the rest of the table so
> that, hopefully, no single vacuum does too much work.

I agree that this rough direction is worthwhile to pursue.

> Unfortunately, I don't really know how to do that effectively. If we knew
> that the table was going to see 10 vacuums before we hit
> autovacuum_freeze_max_age, we could try to have each one do 10% of the
> amount of freezing that was going to need to be done rather than letting any
> single vacuum do all of it, but we don't have that sort of information.

I think, quite fundamentally, it's not possible to bound the amount of work an
anti-wraparound vacuum has to do if we don't have an age-based autovacuum
trigger kicking in before autovacuum_freeze_max_age. After all, there might be
no autovacuum at all before autovacuum_freeze_max_age is reached.

But there's just no reason not to have a trigger below
autovacuum_freeze_max_age. That's why I think Peter's patch to split age and
anti-"auto-cancel" autovacuums is a strictly necessary change if we want to
make autovacuum fundamentally suck less. There are a few boring details to
figure out about how to set/compute those limits, but I don't think there's
anything fundamentally hard.

I think we also need the number of all-frozen pages in pg_class if we want to
make better scheduling decisions. As we already compute the number of
all-visible pages at the end of vacuuming, we can compute the number of
all-frozen pages as well. The space for another integer in pg_class doesn't
bother me one bit.

Let's say we had an autovacuum_vacuum_age trigger of 100m, and
autovacuum_freeze_max_age=500m. We know that we're roughly going to be
vacuuming 5 times before reaching autovacuum_freeze_max_age (very slow
autovacuums are an issue, but if one autovacuum takes 100m+ xids long, there's
not much we can do). With that we could determine the eager percentage along
the lines of:

  frozen_target = Min(age(relfrozenxid), autovacuum_freeze_max_age)/autovacuum_freeze_max_age
  eager_percentage = Min(0, frozen_target * relpages - pg_class.relallfrozen * relpages)

One thing I don't know fully how to handle is how to ensure that we try to
freeze a different part of the table each vacuum. I guess we could store a
page number in pgstats?

This would help address the "cliff" issue of reaching
autovacuum_freeze_max_age. What it would *not* address, on its own, is the
number of times we rewrite pages.

I can guess at a few ways to heuristically identify when tables are "append
mostly" from vacuum's view (a table can be update heavy, but very localized to
recent rows, and still be append mostly from vacuum's view). There are obvious
cases, e.g. when there are way more inserts than dead rows. But other cases
are harder.

> Also, even if we did have that sort of information, the idea only works if
> the pages that we freeze sooner are ones that we're not about to update or
> delete again, and we don't have any idea what is likely there.

Perhaps we could use something like
  (age(relfrozenxid) - age(newest_xid_on_page)) / age(relfrozenxid)
as a heuristic?

I have a gut feeling that we should somehow collect/use statistics about the
number of frozen pages, marked as such by the last (or recent?) vacuum, that
had to be "defrosted" by backends. But I don't quite know how yet.
I think we could collect statistics about that by storing the LSN of the last vacuum in the shared stats, and incrementing that counter when defrosting. A lot of things like that would work a whole lot better if we had statistics that take older data into account, but weigh it less than more recent data. But that's hard/expensive to collect. Greetings, Andres Freund
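As a reference point for the relallfrozen idea above: pg_class already
exposes relallvisible, but has no all-frozen counterpart today, so the
closest one can get with a plain catalog query is a sketch like the one
below (the relallfrozen column is only a proposal and does not exist yet):

    SELECT relname, relpages, relallvisible, age(relfrozenxid) AS xid_age
    FROM pg_class
    WHERE relkind = 'r'
    ORDER BY age(relfrozenxid) DESC
    LIMIT 10;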
Re: Something is wrong with wal_compression
On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier wrote: > On Thu, Jan 26, 2023 at 04:14:57PM -0800, Andrey Borodin wrote: > > If we agree that xid allocation is not something persistent, let's fix > > the test? We can replace a check with select * from pg_class or, > > maybe, add an amcheck run. > > As far as I recollect, this test was introduced to test this new > > function in 857ee8e391f. > > My opinion would be to make this function more reliable, FWIW, even if > that involves a performance impact when called in a close loop by > forcing more WAL flushes to ensure its report durability and > consistency. As things stand, this is basically unreliable, and we > document it as something applications can *use*. Adding a note in the > docs to say that this function can be unstable for some edge cases > does not make much sense to me, either. Commit 857ee8e itself says > that we can use it if a database connection is lost, which could > happen on a crash.. Yeah, the other thread has a patch for that. But it would hurt some workloads. A better patch would do some kind of amortisation (reserving N xids at a time or some such scheme, while being careful to make sure the right CLOG pages etc exist if you crash and skip a bunch of xids on recovery) but be more complicated. For the record, back before release 13 added the 64 bit xid allocator, these functions (or rather their txid_XXX ancestors) were broken in a different way: they didn't track epochs reliably, the discovery of which led to the new xid8-based functions, so that might provide a natural back-patching range, if a back-patchable solution can be agreed on.
Re: Record queryid when auto_explain.log_verbose is on
On Thu, Jan 26, 2023 at 10:00:04PM +0900, torikoshia wrote: > I'll work on this next. Cool, thanks! -- Michael
Re: Something is wrong with wal_compression
On Thu, Jan 26, 2023 at 04:14:57PM -0800, Andrey Borodin wrote: > If we agree that xid allocation is not something persistent, let's fix > the test? We can replace a check with select * from pg_class or, > maybe, add an amcheck run. > As far as I recollect, this test was introduced to test this new > function in 857ee8e391f. My opinion would be to make this function more reliable, FWIW, even if that involves a performance impact when called in a close loop by forcing more WAL flushes to ensure its report durability and consistency. As things stand, this is basically unreliable, and we document it as something applications can *use*. Adding a note in the docs to say that this function can be unstable for some edge cases does not make much sense to me, either. Commit 857ee8e itself says that we can use it if a database connection is lost, which could happen on a crash.. -- Michael
Re: Partition key causes problem for volatile target list query
On Thu, Jan 26, 2023 at 07:21:16PM -0500, Tom Lane wrote:
> Well, if you looked further than the first few rows, it wouldn't be
> "always zero". But the select from the partitioned table will read
> the first partition first, and that partition will have the rows
> with d1=0, by definition.
>
> =# explain select * from case_test2 limit 10;
>                                        QUERY PLAN
> -----------------------------------------------------------------------------------------
>  Limit  (cost=0.00..0.19 rows=10 width=8)
>    ->  Append  (cost=0.00..1987.90 rows=102260 width=8)
>          ->  Seq Scan on case_test2_0 case_test2_1  (cost=0.00..478.84 rows=33184 width=8)
>          ->  Seq Scan on case_test2_1 case_test2_2  (cost=0.00..480.86 rows=33286 width=8)
>          ->  Seq Scan on case_test2_2 case_test2_3  (cost=0.00..484.30 rows=33530 width=8)
>          ->  Seq Scan on case_test2_3 case_test2_4  (cost=0.00..32.60 rows=2260 width=8)
> (6 rows)
>
> The result appears sorted by d1, but that's an implementation artifact.

Wow, thanks. Not sure how I missed something so obvious. I just saw it
myself by generating only 10 rows and noticing the numbers were always
increasing.

--
Bruce Momjian  https://momjian.us
EDB            https://enterprisedb.com

Embrace your flaws. They make you human, rather than perfect,
which you will never be.
Re: Something is wrong with wal_compression
Andrey Borodin writes: > On Thu, Jan 26, 2023 at 3:04 PM Tom Lane wrote: >> Indeed, it seems like this behavior makes pg_xact_status() basically >> useless as things stand. > If we agree that xid allocation is not something persistent, let's fix > the test? If we're not going to fix this behavior, we need to fix the docs to disclaim that pg_xact_status() is of use for what it's said to be good for. regards, tom lane
Re: Partition key causes problem for volatile target list query
Bruce Momjian writes:
> I have found an odd behavior --- a query in the target list that assigns
> to a partitioned column causes queries that would normally be volatile
> to return always zero.

Well, if you looked further than the first few rows, it wouldn't be
"always zero". But the select from the partitioned table will read
the first partition first, and that partition will have the rows
with d1=0, by definition.

=# explain select * from case_test2 limit 10;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.19 rows=10 width=8)
   ->  Append  (cost=0.00..1987.90 rows=102260 width=8)
         ->  Seq Scan on case_test2_0 case_test2_1  (cost=0.00..478.84 rows=33184 width=8)
         ->  Seq Scan on case_test2_1 case_test2_2  (cost=0.00..480.86 rows=33286 width=8)
         ->  Seq Scan on case_test2_2 case_test2_3  (cost=0.00..484.30 rows=33530 width=8)
         ->  Seq Scan on case_test2_3 case_test2_4  (cost=0.00..32.60 rows=2260 width=8)
(6 rows)

The result appears sorted by d1, but that's an implementation artifact.

			regards, tom lane
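A quick way to see that this is only an ordering artifact, reusing the test
case's table (column names as in the output above; a hedged sketch, not part
of the original report):

    SELECT * FROM case_test2 LIMIT 10;   -- reads case_test2_0 first, so d1 is always 0

    SELECT d1, count(*)
    FROM case_test2
    GROUP BY d1
    ORDER BY d1;                         -- every partition's d1 value shows up after all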
Re: Something is wrong with wal_compression
On Thu, Jan 26, 2023 at 3:04 PM Tom Lane wrote: > > Indeed, it seems like this behavior makes pg_xact_status() basically > useless as things stand. > If we agree that xid allocation is not something persistent, let's fix the test? We can replace a check with select * from pg_class or, maybe, add an amcheck run. As far as I recollect, this test was introduced to test this new function in 857ee8e391f. Best regards, Andrey Borodin.
Re: improving user.c error messages
On Thu, Jan 26, 2023 at 05:41:32PM -0500, Tom Lane wrote: > Well, it's not a hint. I think the above is fine for non-password > cases, but for passwords maybe > > ERROR: permission denied to alter role password > DETAIL: To change another role's password, you must have CREATEROLE > privilege and ADMIN OPTION on role "%s". Okay. I used this phrasing in v3. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 02938746f72d0545d6c5cc3d1e9c6dc30e6e4175 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 26 Jan 2023 11:05:13 -0800 Subject: [PATCH v3 1/1] Improve user.c error messages. --- src/backend/commands/user.c | 166 -- src/test/regress/expected/create_role.out | 77 ++ src/test/regress/expected/dependency.out | 4 + src/test/regress/expected/privileges.out | 23 +-- 4 files changed, 196 insertions(+), 74 deletions(-) diff --git a/src/backend/commands/user.c b/src/backend/commands/user.c index 3a92e930c0..ad22a89298 100644 --- a/src/backend/commands/user.c +++ b/src/backend/commands/user.c @@ -316,23 +316,33 @@ CreateRole(ParseState *pstate, CreateRoleStmt *stmt) if (!has_createrole_privilege(currentUserId)) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("permission denied to create role"))); + errmsg("permission denied to create role"), + errdetail("You must have %s privilege to create roles.", + "CREATEROLE"))); if (issuper) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must be superuser to create superusers"))); + errmsg("permission denied to create role"), + errdetail("You must have %s privilege to create roles with %s.", + "SUPERUSER", "SUPERUSER"))); if (createdb && !have_createdb_privilege()) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must have createdb permission to create createdb users"))); + errmsg("permission denied to create role"), + errdetail("You must have %s privilege to create roles with %s.", + "CREATEDB", "CREATEDB"))); if (isreplication && !has_rolreplication(currentUserId)) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must have replication permission to create replication users"))); + errmsg("permission denied to create role"), + errdetail("You must have %s privilege to create roles with %s.", + "REPLICATION", "REPLICATION"))); if (bypassrls && !has_bypassrls_privilege(currentUserId)) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must have bypassrls to create bypassrls users"))); + errmsg("permission denied to create role"), + errdetail("You must have %s privilege to create roles with %s.", + "BYPASSRLS", "BYPASSRLS"))); } /* @@ -744,10 +754,18 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt) roleid = authform->oid; /* To mess with a superuser in any way you gotta be superuser. 
*/ - if (!superuser() && (authform->rolsuper || dissuper)) + if (!superuser() && authform->rolsuper) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must be superuser to alter superuser roles or change superuser attribute"))); + errmsg("permission denied to alter role"), + errdetail("You must have %s privilege to alter roles with %s.", + "SUPERUSER", "SUPERUSER"))); + if (!superuser() && dissuper) + ereport(ERROR, +(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("permission denied to alter role"), + errdetail("You must have %s privilege to change the %s attribute.", + "SUPERUSER", "SUPERUSER"))); /* * Most changes to a role require that you both have CREATEROLE privileges @@ -761,13 +779,17 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt) dvalidUntil || disreplication || dbypassRLS) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("permission denied"))); + errmsg("permission denied to alter role"), + errdetail("You must have %s privilege and %s on role \"%s\".", + "CREATEROLE", "ADMIN OPTION", rolename))); /* an unprivileged user can change their own password */ if (dpassword && roleid != currentUserId) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must have CREATEROLE privilege to change another user's password"))); + errmsg("permission denied to alter role"), + errdetail("To change another role's password, you must have %s privilege and %s on the role.", + "CREATEROLE", "ADMIN OPTION"))); } else if (!superuser()) { @@ -779,23 +801,30 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt) if (dcreatedb && !have_createdb_privilege()) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must have createdb privilege to change createdb attribute"))); + errmsg("permission denied to
Partition key causes problem for volatile target list query
I have found an odd behavior --- a query in the target list that assigns
to a partitioned column causes queries that would normally be volatile to
return always zero. In the first query, no partitioning is used:

 d1 | d2
----+----
  1 |  0
  2 |  0
  2 |  1
  1 |  0
  1 |  2
  1 |  2
  1 |  0
  0 |  2
  2 |  0
  2 |  2

In the next query, 'd1' is a partition key and it gets a constant value of
zero for all rows:

     d1 | d2
    ----+----
-->   0 |  1
-->   0 |  2
      0 |  2
      0 |  1
      0 |  2
      0 |  1
      0 |  2
      0 |  2
      0 |  2
      0 |  2

The self-contained query is attached. The value is _always_ zero, which
suggests random() is not being called; calling setseed() does not change
that. If I change "SELECT x" with "SELECT 2", the "2" is used. I see this
behavior back to PG 11.

--
Bruce Momjian  https://momjian.us
EDB            https://enterprisedb.com

Embrace your flaws. They make you human, rather than perfect,
which you will never be.

partition_test.sql
Description: application/sql
Re: Rework of collation code, extensibility
Attached v9 and added perf numbers below.

I'm hoping to commit 0002 and 0003 soon-ish, maybe a week or two, please let
me know if you want me to hold off. (I won't commit the GUCs unless others
find them generally useful; they are included here to more easily reproduce
my performance tests.)

My primary motivation is still related to
https://commitfest.postgresql.org/41/3956/ but the combination of cleaner
code and a performance boost seems like reasonable justification for this
patch set independently.

There aren't any clear open items on this patch. Peter Eisentraut asked me to
focus this thread on the refactoring, which I've done by reducing it to 2
patches, and I left multilib ICU up to the other thread. He also questioned
the increased line count, but I think the currently-low line count is due to
bad style. PeterG provided some review comments, in particular when to do the
tiebreaking, which I addressed.

This patch has been around for a while, so I'll take a fresh look and see if
I see risk areas, and re-run a few sanity checks. Of course more feedback
would also be welcome.

PERFORMANCE:

== Setup: ==

base: master with v9-0001 applied (GUCs only)
refactor: master with v9-0001, v9-0002, v9-0003 applied

Note that I wasn't able to see any performance difference between the base
and master, v9-0001 just adds some GUCs to make testing easier.

glibc 2.35    ICU 70.1    gcc 11.3.0    LLVM 14.0.0
built with meson (uses -O3)

$ perl text_generator.pl 1000 10 > /tmp/strings.utf8.txt

CREATE TABLE s (t TEXT);
COPY s FROM '/tmp/strings.utf8.txt';
VACUUM FREEZE s;
CHECKPOINT;
SET work_mem='10GB';
SET max_parallel_workers = 0;
SET max_parallel_workers_per_gather = 0;

= Test queries: =

EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "C";
EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en_US";
EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu";

Timings are measured as the milliseconds to return the first tuple from the
Sort operator (as reported in EXPLAIN ANALYZE). Median of three runs.

Results:

                                base      refactor    speedup
  sort_abbreviated_keys=false:
    C                           7377      7273        1.4%
    en_US                       35081     35090       0.0%
    en-US-x-icu                 20520     19465       5.4%

  sort_abbreviated_keys=true:
    C                           8105      8008        1.2%
    en_US                       35067     34850       0.6%
    en-US-x-icu                 22626     21507       5.2%

=== Conclusion: ===

These numbers can move +/-1 percentage point, so I'd interpret anything less
than that as noise. This happens to be the first run where all the numbers
favored the refactoring patch, but it is generally consistent with what I had
seen before. The important part is that, for ICU, it appears to be a
substantial speedup when using meson (-O3).

Also, when/if the multilib ICU support goes in, that may lose some of these
gains due to an extra indirect function call.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

text_generator.pl
Description: Perl program

From 39ed011cc51ba3a4af5e3b559a7b8de25fb895a5 Mon Sep 17 00:00:00 2001
From: Jeff Davis
Date: Sat, 21 Jan 2023 12:44:07 -0800
Subject: [PATCH v9 1/3] Introduce GUCs to control abbreviated keys sort optimization.

The setting sort_abbreviated_keys turns the optimization on or off overall.
The optimization relies on collation providers, which are complex
dependencies, and the performance of the optimization may rely on many
factors. Introducing a GUC allows easier diagnosis when this optimization
results in worse perforamnce.

The setting trust_strxfrm replaces the define TRUST_STRXFRM, allowing users
to experiment with the abbreviated keys optimization when using the libc
provider.
Previously, the optimization only applied to collations using the ICU provider unless specially compiled. By default, allowed only for superusers (because an incorrect setting could lead to wrong results), but can be granted to others. --- doc/src/sgml/config.sgml | 40 ++ src/backend/utils/adt/varlena.c| 10 +++--- src/backend/utils/misc/guc_tables.c| 24 + src/backend/utils/sort/tuplesortvariants.c | 17 ++--- 4 files changed, 82 insertions(+), 9 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index f985afc009..8f55b89f35 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -11252,6 +11252,46 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' + + sort_abbreviated_keys (boolean) + + sort_abbreviated_keys configuration parameter + + + + +Enables or disables the use of abbreviated sort keys, a sort optimization, +if applicable. The default is true. Disabling may +be useful to diagnose
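For anyone wanting to reproduce the comparison, a hedged usage sketch of the
proposed GUCs (they exist only with the v9-0001 patch applied and are not
part of core), building on the test setup above:

    -- disable abbreviated keys entirely, to compare against the default:
    SET sort_abbreviated_keys = off;
    EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "C";

    -- or, with abbreviation enabled, let the libc provider use strxfrm():
    SET sort_abbreviated_keys = on;
    SET trust_strxfrm = on;
    EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en_US";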
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 12:45 PM Andres Freund wrote: > > Most of the overhead of FREEZE WAL records (with freeze plan > > deduplication and page-level freezing in) is generic WAL record header > > overhead. Your recent adversarial test case is going to choke on that, > > too. At least if you set checkpoint_timeout to 1 minute again. > > I don't quite follow. What do you mean with "record header overhead"? Unless > that includes FPIs, I don't think that's that commonly true? Even if there are no directly observable FPIs, there is still extra WAL, which can cause FPIs indirectly, just by making checkpoints more frequent. I feel ridiculous even having to explain this to you. > The problematic case I am talking about is when we do *not* emit a WAL record > during pruning (because there's nothing to prune), but want to freeze the > table. If you don't log an FPI, the remaining big overhead is that increasing > the LSN on the page will often cause an XLogFlush() when writing out the > buffer. > > I don't see what your reference to checkpoint timeout is about here? > > Also, as I mentioned before, the problem isn't specific to checkpoint_timeout > = 1min. It just makes it cheaper to reproduce. That's flagrantly intellectually dishonest. Sure, it made it easier to reproduce. But that's not all it did! You had *lots* of specific numbers and technical details in your first email, such as "Time for vacuuming goes up to ~5x. WAL volume to ~9x.". But you did not feel that it was worth bothering with details like having set checkpoint_timeout to 1 minute, which is a setting that nobody uses, and obviously had a multiplicative effect. That detail was unimportant. I had to drag it out of you! You basically found a way to add WAL overhead to a system/workload that is already in a write amplification vicious cycle, with latent tipping point type behavior. There is a practical point here, that is equally obvious, and yet somehow still needs to be said: benchmarks like that one are basically completely free of useful information. If we can't agree on how to assess such things in general, then what can we agree on when it comes to what should be done about it, what trade-off to make, when it comes to any similar question? > > In many cases we'll have to dirty the page anyway, just to set > > PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether > > triggered by an FPI or triggered by my now-reverted GUC) on being able > > to set the whole page all-frozen in the VM. > > IIRC setting PD_ALL_VISIBLE doesn't trigger an FPI unless we need to log hint > bits. But freezing does trigger one even without wal_log_hint_bits. That is correct. > You're right, it makes sense to consider whether we'll emit a > XLOG_HEAP2_VISIBLE anyway. As written the page-level freezing FPI mechanism probably doesn't really stand to benefit much from doing that. Either checksums are disabled and it's just a hint, or they're enabled and there is a very high chance that we'll get an FPI inside lazy_scan_prune rather than right after it is called, when PD_ALL_VISIBLE is set. That's not perfect, of course, but it doesn't have to be. Perhaps it should still be improved, just on general principle. > > > A less aggressive version would be to check if any WAL records were > > > emitted > > > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI > > > if we > > > modified the page again. Similar to what we do now, except not requiring > > > an > > > FPI to have been emitted. > > > > Also way more aggressive. 
Not nearly enough on its own. > > In which cases will it be problematically more aggressive? > > If we emitted a WAL record during pruning we've already set the LSN of the > page to a very recent LSN. We know the page is dirty. Thus we'll already > trigger an XLogFlush() during ringbuffer replacement. We won't emit an FPI. You seem to be talking about this as if the only thing that could matter is the immediate FPI -- the first order effects -- and not any second order effects. You certainly didn't get to 9x extra WAL overhead by controlling for that before. Should I take it that you've decided to assess these things more sensibly now? Out of curiosity: why the change of heart? > > > But to me it seems a bit odd that VACUUM now is more aggressive if > > > checksums / > > > wal_log_hint bits is on, than without them. Which I think is how using > > > either > > > of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working? > > > > Which part is the odd part? Is it odd that page-level freezing works > > that way, or is it odd that page-level checksums work that way? > > That page-level freezing works that way. I think that it will probably cause a little confusion, and should be specifically documented. But other than that, it seems reasonable enough to me. I mean, should I not do something that's going to be of significant help to users with checksums, just because it'll be somewhat confusing to a small min
Re: GUCs to control abbreviated sort keys
On Thu, 2023-01-26 at 22:39 +0100, Peter Eisentraut wrote: > Maybe an easier way to enable or disable it in the source code with a > #define would serve this. Making it a GUC right away seems a bit > heavy-handed. Further exploration and tweaking might well require > further source code changes, so relying on a source code level toggle > would seem appropriate. I am using these GUCs for testing the various collation paths in my collation refactoring branch. I find them pretty useful, and when I saw a regression, I thought others might think it was useful, too. But if not I'll just leave them in my branch and withdraw from this thread. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: Something is wrong with wal_compression
Thomas Munro writes: > On Fri, Jan 27, 2023 at 11:14 AM Tom Lane wrote: >> If any tuples made by that transaction had reached disk, >> we'd have a problem. > The problem is that the WAL wasn't flushed, allowing the same xid to > be allocated again after crash recovery. But for any data pages to > hit the disk, we'd have to flush WAL first, so then it couldn't > happen, no? Ah, now I get the point: the "committed xact" seen after restart isn't the same one as we saw before the crash, but a new one that was given the same XID because nothing about the old one had made it to disk yet. > FWIW I also re-complained about the dangers of anyone > relying on pg_xact_status() for its stated purpose after seeing > tanager's failure[1]. Indeed, it seems like this behavior makes pg_xact_status() basically useless as things stand. regards, tom lane
Re: run pgindent on a regular basis / scripted manner
On Thu, 26 Jan 2023 at 22:46, Andrew Dunstan wrote: > Hmm, but I usually run with -a, I even have a git alias for it. I guess > what this discussion illustrates is that there are various patters for > using git, and we shouldn't assume that everyone else is using the same > patterns we are. I definitely agree that there are lots of ways to use git. And I now understand why my hook didn't work well for your existing workflow. I've pretty much unlearned the -a flag. Because the easiest way I've been able to split up changes into different commits is using "git add -p", which adds partial pieces of files to the staging area. And that workflow combines terribly with "git commit -a" because -a adds all the things that I specifically didn't put in the staging area into the final commit anyway. > I'm still mildly inclined to say this material would be better placed > in the developer wiki. After all, this isn't the only thing a postgres > developer might use a git hook for I think it should definitely be somewhere. I have a preference for the repo, since I think the docs on codestyle are already in too many different places. But the wiki is already much better than having no shared hook at all. I mainly think we should try to make it as easy as possible for people to commit well indented code. > (mine has more material in it than in what I posted). Anything that is useful for the wider community and could be part of this example/template git hook? (e.g. some perltidy automation)
Re: Something is wrong with wal_compression
On Fri, Jan 27, 2023 at 11:14 AM Tom Lane wrote: > Andrey Borodin writes: > > On Thu, Jan 26, 2023 at 12:12 PM Tom Lane wrote: > >> That test case is demonstrating fundamental > >> database corruption after a crash. > > > Not exactly corruption. XID was not persisted and buffer data did not > > hit a disk. Database is in the correct state. > > Really? I don't see how this part is even a little bit okay: > > [00:40:50.744](0.046s) not ok 3 - xid is aborted after crash > [00:40:50.745](0.001s) > [00:40:50.745](0.000s) # Failed test 'xid is aborted after crash' > # at t/011_crash_recovery.pl line 57. > [00:40:50.746](0.001s) # got: 'committed' > # expected: 'aborted' > > If any tuples made by that transaction had reached disk, > we'd have a problem. The problem is that the WAL wasn't flushed, allowing the same xid to be allocated again after crash recovery. But for any data pages to hit the disk, we'd have to flush WAL first, so then it couldn't happen, no? FWIW I also re-complained about the dangers of anyone relying on pg_xact_status() for its stated purpose after seeing tanager's failure[1]. [1] https://www.postgresql.org/message-id/CA%2BhUKGJ9p2JPPMA4eYAKq%3Dr9d_4_8vziet_tS1LEBbiny5-ypA%40mail.gmail.com
Re: improving user.c error messages
Nathan Bossart writes: > On Thu, Jan 26, 2023 at 03:07:43PM -0500, Tom Lane wrote: >> I think the password case needs to be kept separate, because the >> conditions for it are different (specifically the exception that >> you can alter your own password). Lumping the rest together >> seems OK to me. > Hm. In v2, the error message for both cases is the same: > ERROR: permission denied to alter role > DETAIL: You must have CREATEROLE privilege and ADMIN OPTION on role > "regress_priv_user2". > We could add "to change its attributes" and "to change its password" to > separate the two, but I'm not sure that adds much. ISTM the current error > message for ALTER ROLE PASSWORD implies that you can change your own > password, and that's lost with my patch. Perhaps we should add an > errhint() with that information instead. WDYT? Well, it's not a hint. I think the above is fine for non-password cases, but for passwords maybe ERROR: permission denied to alter role password DETAIL: To change another role's password, you must have CREATEROLE privilege and ADMIN OPTION on role "%s". regards, tom lane
Re: suppressing useless wakeups in logical/worker.c
Nathan Bossart writes:
> On Thu, Jan 26, 2023 at 04:09:51PM -0500, Tom Lane wrote:
>> Right, so more like this.

> LGTM

Thanks, pushed.

Returning to the prior patch ... I don't much care for this:

+		/* Maybe there will be a free slot in a second... */
+		retry_time = TimestampTzPlusSeconds(now, 1);
+		LogRepWorkerUpdateSyncStartWakeup(retry_time);

We're not moving the goalposts very far on unnecessary wakeups if we
have to do that. Do we need to get a wakeup on sync slot free?
Although having to send that to every worker seems ugly. Maybe this
is being done in the wrong place and we need to find a way to get the
launcher to handle it.

As for the business about process_syncing_tables being only called
conditionally, I was already of the opinion that the way it's getting
called is loony. Why isn't it called from LogicalRepApplyLoop (and
noplace else)? With more certainty about when it runs, we might not
need so many kluges elsewhere.

			regards, tom lane
Re: Something is wrong with wal_compression
Andrey Borodin writes:
> On Thu, Jan 26, 2023 at 12:12 PM Tom Lane wrote:
>> That test case is demonstrating fundamental
>> database corruption after a crash.

> Not exactly corruption. XID was not persisted and buffer data did not
> hit a disk. Database is in the correct state.

Really? I don't see how this part is even a little bit okay:

[00:40:50.744](0.046s) not ok 3 - xid is aborted after crash
[00:40:50.745](0.001s)
[00:40:50.745](0.000s) #   Failed test 'xid is aborted after crash'
#   at t/011_crash_recovery.pl line 57.
[00:40:50.746](0.001s) #          got: 'committed'
#     expected: 'aborted'

If any tuples made by that transaction had reached disk,
we'd have a problem.

			regards, tom lane
Re: improving user.c error messages
On Thu, Jan 26, 2023 at 03:07:43PM -0500, Tom Lane wrote: > Nathan Bossart writes: >> On Thu, Jan 26, 2023 at 02:42:05PM -0500, Robert Haas wrote: >>> Basically my question is whether having one error message for all of >>> those cases is good enough, or whether we should be trying harder. > > I think the password case needs to be kept separate, because the > conditions for it are different (specifically the exception that > you can alter your own password). Lumping the rest together > seems OK to me. Hm. In v2, the error message for both cases is the same: ERROR: permission denied to alter role DETAIL: You must have CREATEROLE privilege and ADMIN OPTION on role "regress_priv_user2". We could add "to change its attributes" and "to change its password" to separate the two, but I'm not sure that adds much. ISTM the current error message for ALTER ROLE PASSWORD implies that you can change your own password, and that's lost with my patch. Perhaps we should add an errhint() with that information instead. WDYT? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
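For what it's worth, the exception under discussion can be illustrated along
these lines (role names are hypothetical, and the error wording is the v3
patch's, so this is a sketch rather than current behavior):

    SET ROLE alice;                          -- ordinary role, no CREATEROLE
    ALTER ROLE alice PASSWORD 'new-secret';  -- allowed: a role may change its own password
    ALTER ROLE bob PASSWORD 'new-secret';    -- ERROR:  permission denied to alter role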
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 1:22 PM Robert Haas wrote: > On Thu, Jan 26, 2023 at 4:06 PM Peter Geoghegan wrote: > > There is very good reason to believe that the large majority of all > > data that people store in a system like Postgres is extremely cold > > data: > > The systems where I end up troubleshooting problems seem to be, most > typically, busy OLTP systems. I'm not in a position to say whether > that's more or less common than systems with extremely cold data, but > I am in a position to say that my employer will have a lot fewer happy > customers if we regress that use case. Naturally I'm keen to avoid > that. This is the kind of remark that makes me think that you don't get it. The most influential OLTP benchmark of all time is TPC-C, which has exactly this problem. In spades -- it's enormously disruptive. Which is one reason why I used it as a showcase for a lot of this work. Plus practical experience (like the Heroku database in the blog post I linked to) fully agrees with that benchmark, as far as this stuff goes -- that was also a busy OLTP database. Online transaction involves transactions. Right? There is presumably some kind of ledger, some kind of orders table. Naturally these have entries that age out fairly predictably. After a while, almost all the data is cold data. It is usually about that simple. One of the key strengths of systems like Postgres is the ability to inexpensively store a relatively large amount of data that has just about zero chance of being read, let alone modified. While at the same time having decent OLTP performance for the hot data. Not nearly as good as an in-memory system, mind you -- and yet in-memory systems remain largely a niche thing. -- Peter Geoghegan
Re: run pgindent on a regular basis / scripted manner
On 2023-01-26 Th 12:05, Jelte Fennema wrote:
>> At this stage the files are now indented, so if it failed and you run
>> `git commit` again it will commit with the indention changes.
> No, because at no point a "git add" is happening, so the changes
> made by pgindent are not staged. As long as you don't run the
> second "git commit" with the -a flag the commit will be exactly
> the same as you prepared it before.

Hmm, but I usually run with -a, I even have a git alias for it. I guess
what this discussion illustrates is that there are various patterns for
using git, and we shouldn't assume that everyone else is using the same
patterns we are.

I'm still mildly inclined to say this material would be better placed
in the developer wiki. After all, this isn't the only thing a postgres
developer might use a git hook for (mine has more material in it than
in what I posted).

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: GUCs to control abbreviated sort keys
On 25.01.23 22:16, Jeff Davis wrote: I am highlighting this case because the existence of a single non- contrived case or regression suggests that we may want to explore further and tweak heuristics. That's quite natural when the heuristics are based on a complex dependency like a collation provider. The sort_abbreviated_keys GUC makes that kind of exploration and tweaking a lot easier. Maybe an easier way to enable or disable it in the source code with a #define would serve this. Making it a GUC right away seems a bit heavy-handed. Further exploration and tweaking might well require further source code changes, so relying on a source code level toggle would seem appropriate.
Re: Something is wrong with wal_compression
On Thu, Jan 26, 2023 at 12:12 PM Tom Lane wrote:
>
> That test case is demonstrating fundamental
> database corruption after a crash.
>

Not exactly corruption. The XID was not persisted and the buffer data did not
hit disk. The database is in the correct state. It was discussed long before
WAL compression, here [0]. The thing is, it is easier to reproduce with
compression, but compression has nothing to do with it, as far as I
understand.

A proposed fix is here [1], but I think it's better to fix the test. It
should not verify the XID, but rather the side effects of
"CREATE TABLE mine(x integer);".

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/565FB155-C6B0-41E2-8C44-7B514DC25132%2540yandex-team.ru
[1] https://www.postgresql.org/message-id/flat/20210313012820.GJ29463%40telsasoft.com#0f18d3a4d593ea656fdc761e026fee81
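A sketch of what checking the side effects could look like (illustrative
only; the table name comes from the statement quoted above): since the
transaction that created the table never committed, the test would assert
that nothing of it survived the crash, e.g.

    SELECT count(*) FROM pg_class WHERE relname = 'mine';  -- expect 0 after recovery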
Re: suppressing useless wakeups in logical/worker.c
On Thu, Jan 26, 2023 at 04:09:51PM -0500, Tom Lane wrote: > Right, so more like this. LGTM -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 4:06 PM Peter Geoghegan wrote: > There is very good reason to believe that the large majority of all > data that people store in a system like Postgres is extremely cold > data: The systems where I end up troubleshooting problems seem to be, most typically, busy OLTP systems. I'm not in a position to say whether that's more or less common than systems with extremely cold data, but I am in a position to say that my employer will have a lot fewer happy customers if we regress that use case. Naturally I'm keen to avoid that. > Having a separate aggressive step that rewrites an entire large table, > apparently at random, is just a huge burden to users. You've said that > you agree that it sucks, but somehow I still can't shake the feeling > that you don't fully understand just how much it sucks. Ha! Well, that's possible. But maybe you don't understand how much your patch makes other things suck. I don't think we can really get anywhere here by postulating that the problem is the other person's lack of understanding, even if such a postulate should happen to be correct. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Cygwin cleanup
Note that cirrus failed like this:
https://api.cirrus-ci.com/v1/artifact/task/4881596411543552/testrun/build/testrun/subscription/010_truncate/log/010_truncate_publisher.log

2023-01-25 23:17:10.417 GMT [29821][walsender] [sub1][3/0:0] ERROR: could not open file "pg_logical/snapshots/0-14F2060.snap": Is a directory
2023-01-25 23:17:10.417 GMT [29821][walsender] [sub1][3/0:0] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"pub1"')
2023-01-25 23:17:10.418 GMT [29850][walsender] [pg_16413_sync_16394_7192732880582452157][6/0:0] PANIC: could not open file "pg_logical/snapshots/0-14F2060.snap": No such file or directory
2023-01-25 23:17:10.418 GMT [29850][walsender] [pg_16413_sync_16394_7192732880582452157][6/0:0] STATEMENT: START_REPLICATION SLOT "pg_16413_sync_16394_7192732880582452157" LOGICAL 0/14F2060 (proto_version '4', origin 'any', publication_names '"pub3"')

I don't understand how "Is a directory" happened ..

It looks like maybe the call stack would've been:

SnapBuildSerializationPoint()
xlog_decode() or standby_decode() ?
LogicalDecodingProcessRecord()
XLogSendLogical()
WalSndLoop()
StartLogicalReplication()

--
Justin
Re: suppressing useless wakeups in logical/worker.c
Nathan Bossart writes: > On Thu, Jan 26, 2023 at 03:04:30PM -0500, Tom Lane wrote: >> Hmm. I'm disinclined to add an assumption that the epoch is in the past, >> but I take your point that the subtraction would overflow with >> TIMESTAMP_INFINITY and a negative finite timestamp. Maybe we should >> make use of pg_sub_s64_overflow()? > That would be my vote. I think the 'diff <= 0' check might need to be > replaced with something like 'start_time > stop_time' so that we return 0 > for the underflow case. Right, so more like this. regards, tom lane diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c index 05470057f5..2bb6c89f8c 100644 --- a/src/backend/backup/basebackup_copy.c +++ b/src/backend/backup/basebackup_copy.c @@ -215,7 +215,8 @@ bbsink_copystream_archive_contents(bbsink *sink, size_t len) * the system clock was set backward, so that such occurrences don't * have the effect of suppressing further progress messages. */ - if (ms < 0 || ms >= PROGRESS_REPORT_MILLISECOND_THRESHOLD) + if (ms >= PROGRESS_REPORT_MILLISECOND_THRESHOLD || + now < mysink->last_progress_report_time) { mysink->last_progress_report_time = now; diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 5b775cf7d0..62fba5fcee 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -1670,11 +1670,12 @@ DetermineSleepTime(void) if (next_wakeup != 0) { - /* Ensure we don't exceed one minute, or go under 0. */ - return Max(0, - Min(60 * 1000, - TimestampDifferenceMilliseconds(GetCurrentTimestamp(), - next_wakeup))); + int ms; + + /* result of TimestampDifferenceMilliseconds is in [0, INT_MAX] */ + ms = (int) TimestampDifferenceMilliseconds(GetCurrentTimestamp(), + next_wakeup); + return Min(60 * 1000, ms); } return 60 * 1000; diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index e95398db05..b0cfddd548 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -445,7 +445,7 @@ WalReceiverMain(void) pgsocket wait_fd = PGINVALID_SOCKET; int rc; TimestampTz nextWakeup; -int nap; +long nap; /* * Exit walreceiver if we're not in recovery. This should not @@ -528,15 +528,9 @@ WalReceiverMain(void) for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) nextWakeup = Min(wakeup[i], nextWakeup); -/* - * Calculate the nap time. WaitLatchOrSocket() doesn't accept - * timeouts longer than INT_MAX milliseconds, so we limit the - * result accordingly. Also, we round up to the next - * millisecond to avoid waking up too early and spinning until - * one of the wakeup times. - */ +/* Calculate the nap time, clamping as necessary. */ now = GetCurrentTimestamp(); -nap = (int) Min(INT_MAX, Max(0, (nextWakeup - now + 999) / 1000)); +nap = TimestampDifferenceMilliseconds(now, nextWakeup); /* * Ideally we would reuse a WaitEventSet object repeatedly diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c index 928c330897..47e059a409 100644 --- a/src/backend/utils/adt/timestamp.c +++ b/src/backend/utils/adt/timestamp.c @@ -1690,26 +1690,31 @@ TimestampDifference(TimestampTz start_time, TimestampTz stop_time, * * This is typically used to calculate a wait timeout for WaitLatch() * or a related function. The choice of "long" as the result type - * is to harmonize with that. 
It is caller's responsibility that the - * input timestamps not be so far apart as to risk overflow of "long" - * (which'd happen at about 25 days on machines with 32-bit "long"). - * - * Both inputs must be ordinary finite timestamps (in current usage, - * they'll be results from GetCurrentTimestamp()). + * is to harmonize with that; furthermore, we clamp the result to at most + * INT_MAX milliseconds, because that's all that WaitLatch() allows. * * We expect start_time <= stop_time. If not, we return zero, * since then we're already past the previously determined stop_time. * + * Subtracting finite and infinite timestamps works correctly, returning + * zero or INT_MAX as appropriate. + * * Note we round up any fractional millisecond, since waiting for just * less than the intended timeout is undesirable. */ long TimestampDifferenceMilliseconds(TimestampTz start_time, TimestampTz stop_time) { - TimestampTz diff = stop_time - start_time; + TimestampTz diff; - if (diff <= 0) + /* Deal with zero or negative elapsed time quickly. */ + if (start_time >= stop_time) return 0; + /* To not fail with timestamp infinities, we must detect overflow. */ + if (pg_sub_s64_overflow(stop_time, start_time, &diff)) + return (long) INT_MAX; + if (diff >= (INT_MAX * INT64CONST(1000) - 999)) + return (long) INT_MAX; else return (long) ((diff + 9
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
On Fri, Jan 27, 2023 at 9:57 AM Thomas Munro wrote: > On Fri, Jan 27, 2023 at 9:49 AM Tom Lane wrote: > > Tomas Vondra writes: > > > I received an alert dikkop (my rpi4 buildfarm animal running freebsd 14) > > > did not report any results for a couple days, and it seems it got into > > > an infinite loop in REL_11_STABLE when building hash table in a parallel > > > hashjoin, or something like that. > > > > > It seems to be progressing now, probably because I attached gdb to the > > > workers to get backtraces, which does signals etc. > > > > That reminds me of cases that I saw several times on my now-deceased > > animal florican: > > > > https://www.postgresql.org/message-id/flat/2245838.1645902425%40sss.pgh.pa.us > > > > There's clearly something rotten somewhere in there, but whether > > it's our bug or FreeBSD's isn't clear. > > And if it's ours, it's possibly in latch code and not anything higher > (I mean, not in condition variables, barriers, or parallel hash join) > because I saw a similar hang in the shm_mq stuff which uses the latch > API directly. Note that 13 switched to kqueue but still used the > self-pipe, and 14 switched to a signal event, and this hasn't been > reported in those releases or later, which makes the poll() code path > a key suspect. Also, 14 changed the flag/memory barrier dance (maybe_sleeping), but 13 did it the same way as 11 + 12. So between 12 and 13 we have just the poll -> kqueue change.
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 12:54 PM Robert Haas wrote: > > The overwhelming cost is usually FPIs in any case. If you're not > > mostly focussing on that, you're focussing on the wrong thing. At > > least with larger tables. You just have to focus on the picture over > > time, across multiple VACUUM operations. > > I think that's all mostly true, but the cases where being more > aggressive can cause *extra* FPIs are worthy of just as much attention > as the cases where we can reduce them. It's a question of our exposure to real problems, in no small part. What can we afford to be wrong about? What problem can be fixed by the user more or less as it emerges, and what problem doesn't have that quality? There is very good reason to believe that the large majority of all data that people store in a system like Postgres is extremely cold data: https://www.microsoft.com/en-us/research/video/cost-performance-in-modern-data-stores-how-data-cashing-systems-succeed/ https://brandur.org/fragments/events Having a separate aggressive step that rewrites an entire large table, apparently at random, is just a huge burden to users. You've said that you agree that it sucks, but somehow I still can't shake the feeling that you don't fully understand just how much it sucks. -- Peter Geoghegan
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
On Fri, Jan 27, 2023 at 9:49 AM Tom Lane wrote: > Tomas Vondra writes: > > I received an alert dikkop (my rpi4 buildfarm animal running freebsd 14) > > did not report any results for a couple days, and it seems it got into > > an infinite loop in REL_11_STABLE when building hash table in a parallel > > hashjoin, or something like that. > > > It seems to be progressing now, probably because I attached gdb to the > > workers to get backtraces, which does signals etc. > > That reminds me of cases that I saw several times on my now-deceased > animal florican: > > https://www.postgresql.org/message-id/flat/2245838.1645902425%40sss.pgh.pa.us > > There's clearly something rotten somewhere in there, but whether > it's our bug or FreeBSD's isn't clear. And if it's ours, it's possibly in latch code and not anything higher (I mean, not in condition variables, barriers, or parallel hash join) because I saw a similar hang in the shm_mq stuff which uses the latch API directly. Note that 13 switched to kqueue but still used the self-pipe, and 14 switched to a signal event, and this hasn't been reported in those releases or later, which makes the poll() code path a key suspect.
Re: Non-superuser subscription owners
On Thu, Jan 26, 2023 at 12:36 PM Jeff Davis wrote: > On Thu, 2023-01-26 at 09:43 -0500, Robert Haas wrote: > > I have no issue with that as a long-term plan. However, I think that > > for right now we should just introduce pg_create_subscription. It > > would make sense to add pg_create_connection in the same patch that > > adds a CREATE CONNECTION command (or whatever exact syntax we end up > > with) -- and that patch can also change CREATE SUBSCRIPTION to > > require > > both privileges where a connection string is specified directly. > > I assumed it would be a problem to say that pg_create_subscription was > enough to create a subscription today, and then later require > additional privileges (e.g. pg_create_connection). > > If that's not a problem, then this sounds fine with me. Wonderful! I'm working on a patch, but due to various distractions, it's not done yet. -- Robert Haas EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
Hi, On 2023-01-26 20:26:00 +0100, Matthias van de Meent wrote: > Could someone explain to me why we don't currently (optionally) > include the functionality of page freezing in the PRUNE records? I think we definitely should (and have argued for it a couple times). It's not just about reducing WAL overhead, it's also about reducing redundant visibility checks - which is where a very significant portion of the CPU time for VACUUMing goes. Besides performance considerations, it's also just plain weird that lazy_scan_prune() can end up with a different visibility than heap_page_prune() (mostly due to concurrent aborts). The number of WAL records we often end up emitting for processing a single page in vacuum is just plain absurd: - PRUNE - FREEZE_PAGE - VISIBLE There's afaict no justification whatsoever for these to be separate records. > I think they're quite closely related (in that they both execute in VACUUM > and are required for long-term system stability), and are even more related > now that we have opportunistic page-level freezing. I think adding a "freeze > this page as well"-flag in PRUNE records would go a long way to reducing the > WAL overhead of aggressive and more opportunistic freezing. Yep. I think we should also seriously consider setting all-visible during on-access pruning, and freezing rows during on-access pruning. Greetings, Andres Freund
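A purely illustrative sketch of what such a combined record could look like; none of these type, field, or flag names exist in heapam_xlog.h, and the real xl_heap_prune layout differs - they are assumptions only, to make the idea concrete:

/*
 * Hypothetical layout.  A single record header would carry the prune
 * arrays, the freeze plans, and the all-visible/all-frozen request,
 * instead of paying the generic WAL record overhead three times
 * (PRUNE, FREEZE_PAGE, VISIBLE).
 */
#define XLHP_SKETCH_FREEZE_PAGE		0x01	/* freeze plans follow the prune data */
#define XLHP_SKETCH_SET_ALL_VISIBLE	0x02	/* also set the VM bits at redo */

typedef struct xl_heap_prune_and_freeze_sketch
{
	TransactionId snapshotConflictHorizon;
	uint16		nredirected;
	uint16		ndead;
	uint16		nfreezeplans;	/* 0 unless XLHP_SKETCH_FREEZE_PAGE is set */
	uint8		flags;
	/* redirect/dead offset arrays and freeze plans follow as block data */
} xl_heap_prune_and_freeze_sketch;

A single redo routine would then prune, freeze, and set the visibility map bits for the page, which is also roughly where the redundant visibility checks mentioned above could be avoided.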
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 2:57 PM Peter Geoghegan wrote: > Relatively difficult for Andres, or for somebody else? What are the > real parameters here? Obviously there are no clear answers available. Andres is certainly smarter than the average guy, but practically any scenario that someone can create in a few lines of SQL is something to which code will be exposed on some real-world system. If Andres came along and said, hey, well I found a way to make this patch suck, and proceeded to describe a scenario that involved a complex set of tables and multiple workloads running simultaneously and using a debugger to trigger some race condition and whatever, I'd be like "OK, but is that really going to happen?". The actual scenario he came up with is three lines of SQL, and it's nothing remotely obscure. That kind of thing is going to happen *all the time*. > The overwhelming cost is usually FPIs in any case. If you're not > mostly focussing on that, you're focussing on the wrong thing. At > least with larger tables. You just have to focus on the picture over > time, across multiple VACUUM operations. I think that's all mostly true, but the cases where being more aggressive can cause *extra* FPIs are worthy of just as much attention as the cases where we can reduce them. -- Robert Haas EDB: http://www.enterprisedb.com
Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Tomas Vondra writes: > I received an alert dikkop (my rpi4 buildfarm animal running freebsd 14) > did not report any results for a couple days, and it seems it got into > an infinite loop in REL_11_STABLE when building hash table in a parallel > hashjoin, or something like that. > It seems to be progressing now, probably because I attached gdb to the > workers to get backtraces, which does signals etc. That reminds me of cases that I saw several times on my now-deceased animal florican: https://www.postgresql.org/message-id/flat/2245838.1645902425%40sss.pgh.pa.us There's clearly something rotten somewhere in there, but whether it's our bug or FreeBSD's isn't clear. regards, tom lane
Re: doc: add missing "id" attributes to extension packaging page
On 18.01.2023 at 06:50, Brar Piening wrote: I'll give it a proper look this weekend. It turns out I didn't get a round tuit. ... and I'm afraid I probably will not be able to work on this until mid/end February so we'll have to move this to the next commitfest until somebody wants to take it over and push it through.
Re: New strategies for freezing, advancing relfrozenxid early
Hi, On 2023-01-26 10:44:45 -0800, Peter Geoghegan wrote: > On Thu, Jan 26, 2023 at 9:53 AM Andres Freund wrote: > > > That's going to be very significantly more aggressive. For example > > > it'll impact small tables very differently. > > > > Maybe it would be too aggressive, not sure. The cost of a freeze WAL record > > is > > relatively small, with one important exception below, if we are 99.99% sure > > that it's not going to require an FPI and isn't going to dirty the page. > > > > The exception is that a newer LSN on the page can cause the ringbuffer > > replacement to trigger more more aggressive WAL flushing. No meaningful > > difference if we modified the page during pruning, or if the page was > > already > > in s_b (since it likely won't be written out via the ringbuffer in that > > case), > > but if checksums are off and we just hint-dirtied the page, it could be a > > significant issue. > > Most of the overhead of FREEZE WAL records (with freeze plan > deduplication and page-level freezing in) is generic WAL record header > overhead. Your recent adversarial test case is going to choke on that, > too. At least if you set checkpoint_timeout to 1 minute again. I don't quite follow. What do you mean with "record header overhead"? Unless that includes FPIs, I don't think that's that commonly true? The problematic case I am talking about is when we do *not* emit a WAL record during pruning (because there's nothing to prune), but want to freeze the table. If you don't log an FPI, the remaining big overhead is that increasing the LSN on the page will often cause an XLogFlush() when writing out the buffer. I don't see what your reference to checkpoint timeout is about here? Also, as I mentioned before, the problem isn't specific to checkpoint_timeout = 1min. It just makes it cheaper to reproduce. > > Thus a modification of the above logic could be to opportunistically freeze > > if > > a ) it won't cause an FPI and either > > b1) the page was already dirty before pruning, as we'll not do a ringbuffer > > replacement in that case > > or > > b2) We wrote a WAL record during pruning, as the difference in flush > > position > > is marginal > > > > An even more aggressive version would be to replace b1) with logic that'd > > allow newly dirtying the page if it wasn't read through the ringbuffer. But > > newly dirtying the page feels like it'd be more dangerous. > > In many cases we'll have to dirty the page anyway, just to set > PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether > triggered by an FPI or triggered by my now-reverted GUC) on being able > to set the whole page all-frozen in the VM. IIRC setting PD_ALL_VISIBLE doesn't trigger an FPI unless we need to log hint bits. But freezing does trigger one even without wal_log_hint_bits. You're right, it makes sense to consider whether we'll emit a XLOG_HEAP2_VISIBLE anyway. > > A less aggressive version would be to check if any WAL records were emitted > > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if > > we > > modified the page again. Similar to what we do now, except not requiring an > > FPI to have been emitted. > > Also way more aggressive. Not nearly enough on its own. In which cases will it be problematically more aggressive? If we emitted a WAL record during pruning we've already set the LSN of the page to a very recent LSN. We know the page is dirty. Thus we'll already trigger an XLogFlush() during ringbuffer replacement. We won't emit an FPI. 
> > But to me it seems a bit odd that VACUUM now is more aggressive if > > checksums / > > wal_log_hint bits is on, than without them. Which I think is how using > > either > > of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working? > > Which part is the odd part? Is it odd that page-level freezing works > that way, or is it odd that page-level checksums work that way? That page-level freezing works that way. > In any case this seems like an odd thing for you to say, having > eviscerated a patch that really just made the same behavior trigger > independently of FPIs in some tables, controlled via a GUC. jdksjfkjdlkajsd;lfkjasd;lkfj;alskdfj That behaviour I criticized was causing a torrent of FPIs and additional dirtying of pages. My proposed replacement for the current FPI check doesn't, because a) it only triggers when we wrote a WAL record and b) it doesn't trigger if we would write an FPI. Greetings, Andres Freund
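A minimal sketch of the opportunistic-freeze condition being discussed here; the helper name and boolean parameters are assumptions rather than actual lazy_scan_prune() code, though XLogCheckBufferNeedsBackup() is the existing way to ask whether modifying a buffer would emit an FPI:

/*
 * Sketch only: freeze opportunistically when it cannot cost an FPI, and
 * either (b1) the page was already dirty before pruning, so there is no
 * extra ring-buffer write-out, or (b2) pruning already emitted a WAL
 * record, so the difference in flush position is marginal.
 */
static bool
opportunistic_freeze_wanted(Buffer buf,
							bool page_was_dirty_before_prune,
							bool prune_emitted_wal_record)
{
	/* a) never freeze opportunistically if doing so would emit an FPI */
	if (XLogCheckBufferNeedsBackup(buf))
		return false;

	return page_was_dirty_before_prune || prune_emitted_wal_record;
}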
Re: wrong Append/MergeAppend elision?
David Rowley writes: > On Fri, 27 Jan 2023 at 01:30, Amit Langote wrote: >> It seems that the planner currently elides an Append/MergeAppend that >> has run-time pruning info (part_prune_index) set, but which I think is >> a bug. > There is still the trade-off of having to pull tuples through the > Append node for when run-time pruning is unable to prune the last > partition. So your proposal to leave the Append alone when there's > run-time pruning info is certainly not a no-brainer. Yeah. Amit's proposal amounts to optimizing for the case that all partitions get pruned, which does not seem to me to be the way to bet. I'm inclined to think it's fine as-is. regards, tom lane
lockup in parallel hash join on dikkop (freebsd 14.0-current)
Hi, I received an alert dikkop (my rpi4 buildfarm animal running freebsd 14) did not report any results for a couple days, and it seems it got into an infinite loop in REL_11_STABLE when building hash table in a parallel hashjoin, or something like that. It seems to be progressing now, probably because I attached gdb to the workers to get backtraces, which does signals etc. Anyway, in 'ps ax' I saw this: 94545 - Ss 0:03.39 postgres: buildfarm regression [local] SELECT 94627 - Is 0:00.03 postgres: parallel worker for PID 94545 94628 - Is 0:00.02 postgres: parallel worker for PID 94545 and the backend was stuck waiting on this query: select final > 1 as multibatch from hash_join_batches( $$ select count(*) from join_foo left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1; $$); This started on 2023-01-20 23:23:18.125, and the next log (after I did the gdb stuff), is from 2023-01-26 20:05:16.751. Quite a bit of time. It seems all three processes are doing WaitEventSetWait, either through a ConditionVariable, or WaitLatch. But I don't have any good idea of what might have broken - and as it got "unstuck" I can't investigate more. But I see there's nodeHash and parallelism, and I recall there's a lot of gotchas due to how the backends cooperate when building the hash table, etc. Thomas, any idea what might be wrong? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company[Switching to LWP 100877 of process 94628] _poll () at _poll.S:4 #0 _poll () at _poll.S:4 No locals. #1 0x88ef6850 in __thr_poll (fds=0x4, nfds=2, timeout=-1) at /usr/src/lib/libthr/thread/thr_syscalls.c:338 curthread = 0x8f09b000 ret = #2 0x007d3318 in WaitEventSetWaitBlock (set=0x8f19ceb0, cur_timeout=-1, occurred_events=0x80dcba98, nevents=1) at latch.c:1171 returned_events = 0 rc = cur_pollfd = cur_event = errflags = #3 WaitEventSetWait (set=0x8f19ceb0, timeout=, occurred_events=, occurred_events@entry=0x80dcba98, nevents=nevents@entry=1, wait_event_info=, wait_event_info@entry=134217745) at latch.c:1000 rc = start_time = {tv_sec = 32, tv_nsec = 2401711808} cur_time = {tv_sec = 2161949248, tv_nsec = 9648676} cur_timeout = -1 returned_events = #4 0x007f2898 in ConditionVariableSleep (cv=cv@entry=0x9c14f350, wait_event_info=wait_event_info@entry=134217745) at condition_variable.c:157 event = {pos = -1676348592, events = 0, fd = -2133017936, user_data = 0x7d3128 } done = #5 0x007cfd60 in BarrierArriveAndWait (barrier=0x9c14f338, barrier@entry=0x80dcbbc0, wait_event_info=134217745, wait_event_info@entry=1) at barrier.c:191 release = start_phase = 5 next_phase = 6 elected = #6 0x006b4808 in ExecParallelHashIncreaseNumBuckets (hashtable=hashtable@entry=0x8f0e16f0) at nodeHash.c:1572 pstate = i = chunk = chunk_s = #7 0x006b2d30 in ExecParallelHashTupleAlloc (hashtable=hashtable@entry=0x8f0e16f0, size=40, shared=shared@entry=0x80dcbbc0) at nodeHash.c:2795 growth = PHJ_GROWTH_NEED_MORE_BUCKETS chunk = pstate = 0x9c14f2a0 chunk_size = chunk_shared = result = curbatch = #8 0x006b27ac in ExecParallelHashTableInsert (hashtable=hashtable@entry=0x8f0e16f0, slot=, slot@entry=0x8f0e19d8, hashvalue=, hashvalue@entry=3919329229) at nodeHash.c:1693 hashTuple = tuple = 0x8f0ebd78 batchno = shared = 2400066360 bucketno = #9 0x006b0ac8 in MultiExecParallelHash (node=0x8f0e0908) at nodeHash.c:288 outerNode = 0x8f0e1170 hashtable = 0x8f0e16f0 hashkeys = 0x8f0ebac8 econtext = 0x8f0e1508 pstate = build_barrier = slot = 
0x8f0e19d8 i = hashvalue = #10 MultiExecHash (node=node@entry=0x8f0e0b70) at nodeHash.c:112 No locals. #11 0x006a13e4 in MultiExecProcNode (node=node@entry=0x8f0e0908) at execProcnode.c:502 result = #12 0x006b5b00 in ExecHashJoinImpl (pstate=0x8f0e0b70, parallel=true) at nodeHashjoin.c:290 hashNode = econtext = 0x8f0e07d8 joinqual = 0x0 otherqual = 0x0 outerNode = 0x8f0e0a58 hashtable = 0x8f0e16f0 node = 0x8f0e0b70 parallel_state = 0x9c14f2a0 hashvalue = outerTupleSlot = batchno = build_barrier = #13 ExecParallelHashJoin (pstate=0x8f0e0b70) at nodeHashjoin.c:600 No locals. #14 0x006a193c in ExecProcNodeInstr (node=0x8f0e0b70) at execProcnode.c:462 result = #15 0x0069a60c in ExecProcNode (node=0x8f0e0b70) at ../../../src/include/e
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 11:26 AM Matthias van de Meent wrote: > Could someone explain to me why we don't currently (optionally) > include the functionality of page freezing in the PRUNE records? I > think they're quite closely related (in that they both execute in > VACUUM and are required for long-term system stability), and are even > more related now that we have opportunistic page-level freezing. I > think adding a "freeze this page as well"-flag in PRUNE records would > go a long way to reducing the WAL overhead of aggressive and more > opportunistic freezing. Yeah, we've talked about doing that in the past year. It's quite possible. It would make quite a lot of sense, because the actual overhead of the WAL record for freezing tends to come from the generic WAL record header stuff itself. If there was only one record for both, then you'd only need to include the relfilenode and block number (and so on) once. It would be tricky to handle Multis, so what you'd probably do is just freezing xmin, and possibly aborted and locker XIDs in xmax. So you wouldn't completely get rid of the main freeze record, but would be able to avoid it in many important cases. -- Peter Geoghegan
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On Mon, 2023-01-23 at 12:31 -0800, Andres Freund wrote: > Hi, > > I think it's basically still waiting on author, until the O(N) cost is gone > from the overflow limit check. > > Greetings, > > Andres Freund Yes, just a rebase. There is still work to be done per earlier in the thread. I do want to follow up and note, re palloc/pfree vs malloc/free, that the tracking code (0001-Add-tracking-...) is not tracking palloc/pfree but is explicitly tracking malloc/free. Not every palloc/pfree call executes the tracking code, only those where the path followed includes malloc() or free(). Routine palloc() calls fulfilled from the context's freelist/emptyblocks/freeblock/etc and pfree() calls not invoking free() avoid the tracking code. Thanks, Reid
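A tiny sketch of that distinction (the counter and wrapper names here are invented for illustration, not taken from the patch): the accounting sits next to the actual malloc()/free() calls, so a palloc() satisfied from a context's freelist never executes it.

static uint64 backend_malloced_bytes = 0;	/* hypothetical per-backend counter */

static void *
tracked_malloc(size_t size)
{
	void	   *ptr = malloc(size);

	if (ptr != NULL)
		backend_malloced_bytes += size;		/* only paid when malloc() really runs */
	return ptr;
}

static void
tracked_free(void *ptr, size_t size)
{
	backend_malloced_bytes -= size;
	free(ptr);
}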
Re: suppressing useless wakeups in logical/worker.c
On Thu, Jan 26, 2023 at 03:04:30PM -0500, Tom Lane wrote: > Nathan Bossart writes: >> I wonder if we should explicitly reject negative timestamps to eliminate >> any chance of int64 overflow, too. > > Hmm. I'm disinclined to add an assumption that the epoch is in the past, > but I take your point that the subtraction would overflow with > TIMESTAMP_INFINITY and a negative finite timestamp. Maybe we should > make use of pg_sub_s64_overflow()? That would be my vote. I think the 'diff <= 0' check might need to be replaced with something like 'start_time > stop_time' so that we return 0 for the underflow case. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
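Folding both suggestions in, the clamping might look roughly like this (a sketch, not the committed code; pg_sub_s64_overflow() is the existing overflow-checked subtraction from common/int.h):

/*
 * Overflow-safe sketch: return 0 when stop_time has already been reached,
 * clamp to INT_MAX (the most WaitLatch() accepts), and let
 * pg_sub_s64_overflow() catch the TIMESTAMP_INFINITY-minus-negative case.
 */
long
TimestampDifferenceMillisecondsSketch(TimestampTz start_time,
									  TimestampTz stop_time)
{
	TimestampTz diff;

	/* underflow case: the wait target is already in the past */
	if (start_time >= stop_time)
		return 0;

	if (pg_sub_s64_overflow(stop_time, start_time, &diff) ||
		diff >= (INT_MAX - 999) * INT64CONST(1000))
		return (long) INT_MAX;

	/* round up to the next millisecond so callers don't wake early and spin */
	return (long) ((diff + 999) / 1000);
}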
Re: Something is wrong with wal_compression
On Thu, Jan 26, 2023 at 02:08:27PM -0600, Justin Pryzby wrote: > On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote: > > The symptom being exhibited by Michael's new BF animal tanager > > is perfectly reproducible elsewhere. > > I think these tests have always failed with wal_compression ? > > https://www.postgresql.org/message-id/20210308.173242.463790587797836129.horikyota.ntt%40gmail.com > https://www.postgresql.org/message-id/20210313012820.gj29...@telsasoft.com > https://www.postgresql.org/message-id/2022031948.gj9...@telsasoft.com + https://www.postgresql.org/message-id/c86ce84f-dd38-9951-102f-13a931210f52%40dunslane.net
Re: Minimal logical decoding on standbys
Hi, On 2023-01-26 18:56:10 +0100, Drouvot, Bertrand wrote: > - I'm struggling to create a test for btree killtuples as there is a need for > rows removal on the table (that could produce a conflict too): > Do you've a scenario in mind for this one? (and btw in what kind of WAL > record should the conflict be detected in such a case? xl_btree_delete?) Hm, it might indeed be hard in "modern" postgres. I think you'd need at least two concurrent sessions, to prevent on-access pruning on the table. DROP TABLE IF EXISTS indexdel; CREATE TABLE indexdel(id int8 primary key); INSERT INTO indexdel SELECT generate_series(1, 1); VACUUM indexdel; -- ensure hint bits are set etc DELETE FROM indexdel; SELECT pg_current_wal_insert_lsn(); SET enable_indexonlyscan = false; -- This scan finds that the index items are dead - but doesn't yet issue a -- btree delete WAL record, that only happens when needing space on the page -- again. EXPLAIN (COSTS OFF, SUMMARY OFF) SELECT id FROM indexdel WHERE id < 10 ORDER BY id ASC; SELECT id FROM indexdel WHERE id < 100 ORDER BY id ASC; -- The insertions into the range of values prev INSERT INTO indexdel SELECT generate_series(1, 100); Does generate the btree deletion record, but it also does emit a PRUNE (from heapam_index_fetch_tuple() -> heap_page_prune_opt()). While the session could create a cursor to prevent later HOT cleanup, the query would also trigger hot pruning (or prevent the rows from being dead, if you declare the cursor before the DELETE). So you'd need overlapping cursors in a concurrent session... Too complicated. Greetings, Andres Freund
Re: wrong Append/MergeAppend elision?
On Fri, 27 Jan 2023 at 01:30, Amit Langote wrote: > It seems that the planner currently elides an Append/MergeAppend that > has run-time pruning info (part_prune_index) set, but which I think is > a bug. This is actually how I intended it to work. Whether it was a good idea or not, I'm currently unsure. I mentioned it in [1]. I think the plan shapes I was talking about were some ordered paths from partial paths per what is being added right at the end of add_paths_to_append_rel(). However, now that I look at it again, I'm not sure why it wouldn't be correct to still have those paths with a single-child Append. Certainly, the "if (list_length(live_childrels) == 1)" test made in add_paths_to_append_rel() is no longer aligned to the equivalent test in set_append_references(), so it's possible even now that we make a plan that uses the extra sorted partial paths added in add_paths_to_append_rel() and still have the Append in the final plan. There is still the trade-off of having to pull tuples through the Append node for when run-time pruning is unable to prune the last partition. So your proposal to leave the Append alone when there's run-time pruning info is certainly not a no-brainer. [1] https://www.postgresql.org/message-id/cakjs1f_utf1mbp8ueobyaarzio4e4qb4z8fksurpam+3q_h...@mail.gmail.com
Re: Something is wrong with wal_compression
Justin Pryzby writes: > On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote: >> The symptom being exhibited by Michael's new BF animal tanager >> is perfectly reproducible elsewhere. > I think these tests have always failed with wal_compression ? If that's a known problem, and we've done nothing about it, that is pretty horrid. That test case is demonstrating fundamental database corruption after a crash. regards, tom lane
Re: Something is wrong with wal_compression
On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote: > The symptom being exhibited by Michael's new BF animal tanager > is perfectly reproducible elsewhere. I think these tests have always failed with wal_compression ? https://www.postgresql.org/message-id/20210308.173242.463790587797836129.horikyota.ntt%40gmail.com https://www.postgresql.org/message-id/20210313012820.gj29...@telsasoft.com https://www.postgresql.org/message-id/2022031948.gj9...@telsasoft.com https://www.postgresql.org/message-id/ynqwd2gsmrnqw...@paquier.xyz |My buildfarm machine has been changed to use wal_compression = lz4, |while on it for HEAD runs.
Re: improving user.c error messages
Nathan Bossart writes: > On Thu, Jan 26, 2023 at 02:42:05PM -0500, Robert Haas wrote: >> Basically my question is whether having one error message for all of >> those cases is good enough, or whether we should be trying harder. I think the password case needs to be kept separate, because the conditions for it are different (specifically the exception that you can alter your own password). Lumping the rest together seems OK to me. regards, tom lane
Re: suppressing useless wakeups in logical/worker.c
Nathan Bossart writes: > On Thu, Jan 26, 2023 at 01:54:08PM -0500, Tom Lane wrote: >> - * Both inputs must be ordinary finite timestamps (in current usage, >> - * they'll be results from GetCurrentTimestamp()). >> + * At least one input must be an ordinary finite timestamp, else the "diff" >> + * calculation might overflow. We do support stop_time == >> TIMESTAMP_INFINITY, >> + * which will result in INT_MAX wait time. > I wonder if we should explicitly reject negative timestamps to eliminate > any chance of int64 overflow, too. Hmm. I'm disinclined to add an assumption that the epoch is in the past, but I take your point that the subtraction would overflow with TIMESTAMP_INFINITY and a negative finite timestamp. Maybe we should make use of pg_sub_s64_overflow()? BTW, I just noticed that the adjacent function TimestampDifference has a lot of callers that would be much happier using TimestampDifferenceMilliseconds. regards, tom lane
Re: improving user.c error messages
On Thu, Jan 26, 2023 at 02:42:05PM -0500, Robert Haas wrote: > @@ -758,16 +776,13 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt) > { > /* things an unprivileged user certainly can't do */ > if (dinherit || dcreaterole || dcreatedb || dcanlogin || dconnlimit || > - dvalidUntil || disreplication || dbypassRLS) > + dvalidUntil || disreplication || dbypassRLS || > + (dpassword && roleid != currentUserId)) > ereport(ERROR, > (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > - errmsg("permission denied"))); > - > - /* an unprivileged user can change their own password */ > - if (dpassword && roleid != currentUserId) > - ereport(ERROR, > - (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > - errmsg("must have CREATEROLE privilege to change another user's > password"))); > + errmsg("permission denied to alter role"), > + errdetail("You must have %s privilege and %s on role \"%s\".", > +"CREATEROLE", "ADMIN OPTION", rolename))); > } > else if (!superuser()) > { > > Basically my question is whether having one error message for all of > those cases is good enough, or whether we should be trying harder. I > don't mind if the conclusion is that it's OK as-is, and I'm not > entirely sure what would be better. But when I was working on this > code, all of those cases OR'd together feeding into a single error > message seemed a little sketchy to me, so I am wondering what others > think. I wondered the same thing, but I hesitated because I didn't want to change too much in a patch for error messaging. I can give it a try. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 11:28 AM Robert Haas wrote: > I think it's pretty much impossible to freeze more aggressively > without losing in some scenario or other. If waiting longer to freeze > would have resulted in the data getting updated again or deleted > before we froze it, then waiting longer reduces the total amount of > freezing work that ever has to be done. Freezing more aggressively > inevitably gives up some amount of that potential benefit in order to > try to secure some other benefit. It's a trade-off. There is no question about that. > I think that the goal of a patch that makes vacuum more (or less) > aggressive should be to make the cases where we lose as obscure as > possible, and the cases where we win as broad as possible. I think > that, in order to be a good patch, it needs to be relatively difficult > to find cases where we incur a big loss. If it's easy to find a big > loss, then I think it's better to stick with the current behavior, > even if it's also easy to find a big gain. Again, this seems totally uncontroversial. It's just incredibly vague, and not at all actionable. Relatively difficult for Andres, or for somebody else? What are the real parameters here? Obviously there are no clear answers available. > However, I'm also not > prepared to go all the way to the other end of the spectrum and say > that all of your ideas and everything in this patch are great. I don't > think either of those things, either. It doesn't matter. I'm done with it. This is not a negotiation about what gets in and what doesn't get in. All that I aim to do now is to draw some kind of line under the basic page-level freezing work, since of course I'm still responsible for that. And perhaps to defend my personal reputation. > I certainly think that freezing more aggressively in some scenarios > could be a great idea, but it seems like the patch's theory is to be > very nearly maximally aggressive in every vacuum run if the table size > is greater than some threshold, and I don't think that's right at all. We'll systematically avoid accumulating debt past a certain point -- that's its purpose. That is, we'll avoid accumulating all-visible pages that eventually need to be frozen. > I'm not exactly sure what information we should use to decide how > aggressive to be, but I am pretty sure that the size of the table is > not it. It's true that, for a small table, the cost of having to > eventually vacuum the whole table at once isn't going to be very high, > whereas for a large table, it will be. That line of reasoning makes a > size threshold sound reasonable. However, the amount of extra work > that we can potentially do by vacuuming more aggressively *also* > increases with the table size, which to me means using that a > criterion actually isn't sensible at all. The overwhelming cost is usually FPIs in any case. If you're not mostly focussing on that, you're focussing on the wrong thing. At least with larger tables. You just have to focus on the picture over time, across multiple VACUUM operations. > One idea that I've had about how to solve this problem is to try to > make vacuum try to aggressively freeze some portion of the table on > each pass, and to behave less aggressively on the rest of the table so > that, hopefully, no single vacuum does too much work. Unfortunately, I > don't really know how to do that effectively. That has been proposed a couple of times in the context of this thread. 
It won't work, because the way autovacuum works in general (and likely always will work) doesn't allow it. With an append-only table, each VACUUM will naturally have to scan significantly more pages than the last one, forever (barring antiwraparound vacuums). Why wouldn't it continue that way? I mean it might not (the table might stop growing altogether), but then it doesn't matter much what we do. If you're not behaving very proactively at the level of each VACUUM operation, then the picture over time is that you're *already* falling behind. At least with an append-only table. You have to think of the sequence of operations, not just one. > In theory we could have some system that tracks how > recently each page range in a table has been modified, and direct our > freezing activity toward the ones less-recently modified on the theory > that they're not so likely to be modified again in the near future, > but in reality we have no such system. So I don't really feel like I > know what the right answer is here, yet. So we need to come up with a way of getting reliable information from the future, about an application that we have no particular understanding of. As opposed to just eating the cost to some degree, and making it configurable. -- Peter Geoghegan
Re: suppressing useless wakeups in logical/worker.c
On Thu, Jan 26, 2023 at 01:54:08PM -0500, Tom Lane wrote: > After looking closer, I see that TimestampDifferenceMilliseconds > already explicitly states that its output is intended for WaitLatch > and friends, which makes it perfectly sane for it to clamp the result > to [0, INT_MAX] rather than depending on the caller to not pass > out-of-range values. +1 > * This is typically used to calculate a wait timeout for WaitLatch() > * or a related function. The choice of "long" as the result type > - * is to harmonize with that. It is caller's responsibility that the > - * input timestamps not be so far apart as to risk overflow of "long" > - * (which'd happen at about 25 days on machines with 32-bit "long"). > + * is to harmonize with that; furthermore, we clamp the result to at most > + * INT_MAX milliseconds, because that's all that WaitLatch() allows. > * > - * Both inputs must be ordinary finite timestamps (in current usage, > - * they'll be results from GetCurrentTimestamp()). > + * At least one input must be an ordinary finite timestamp, else the "diff" > + * calculation might overflow. We do support stop_time == > TIMESTAMP_INFINITY, > + * which will result in INT_MAX wait time. I wonder if we should explicitly reject negative timestamps to eliminate any chance of int64 overflow, too. Alternatively, we could detect that the operation will overflow and return either 0 or INT_MAX, but I assume there's minimal use of this function with negative timestamps. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Something is wrong with wal_compression
The symptom being exhibited by Michael's new BF animal tanager is perfectly reproducible elsewhere. $ cat /home/postgres/tmp/temp_config #default_toast_compression = lz4 wal_compression = lz4 $ export TEMP_CONFIG=/home/postgres/tmp/temp_config $ cd ~/pgsql/src/test/recovery $ make check PROVE_TESTS=t/011_crash_recovery.pl ... +++ tap check in src/test/recovery +++ t/011_crash_recovery.pl .. 1/? # Failed test 'new xid after restart is greater' # at t/011_crash_recovery.pl line 53. # '729' # > # '729' # Failed test 'xid is aborted after crash' # at t/011_crash_recovery.pl line 57. # got: 'committed' # expected: 'aborted' # Looks like you failed 2 tests of 3. Maybe this is somehow the test script's fault, but I don't see how. It fails the same way with 'wal_compression = pglz', so I think it's generic to that whole feature rather than specific to LZ4. regards, tom lane
Re: Helper functions for wait_for_catchup() in Cluster.pm
Hi, On 1/26/23 10:42 AM, Alvaro Herrera wrote: On 2023-Jan-26, Drouvot, Bertrand wrote: On 1/24/23 7:27 PM, Alvaro Herrera wrote: 1. I don't think wait_for_write_catchup is necessary, because calling wait_for_catchup() and omitting the 'mode' and 'lsn' arguments would already do the same thing. Having a closer look, it does not seem to be the case. The default mode in wait_for_catchup() is 'replay' and the default mode for the lsn is 'write'. But in wait_for_write_catchup() we are making use of 'write' for both. 2. Because wait_for_replay_catchup is an instance method, passing the second node as argument is needlessly noisy, because that's already known as $self. So we can just say $primary_node->wait_for_replay_catchup($standby_node); Yeah, but same here, there is places where the node passed as the second argument is not the "$self": src/bin/pg_rewind/t/007_standby_source.pl:$node_b->wait_for_replay_catchup('node_c', $node_a); src/test/recovery/t/001_stream_rep.pl:$node_standby_1->wait_for_replay_catchup($node_standby_2, $node_primary); src/test/recovery/t/001_stream_rep.pl:$node_standby_1->wait_for_replay_catchup($node_standby_2, $node_primary); src/test/recovery/t/001_stream_rep.pl: $node_standby_1->wait_for_replay_catchup($node_standby_2, $node_primary); So it looks like there is still a need for wait_for_replay_catchup() with 2 parameters. Ah, cascading replication. In that case, let's make the second parameter optional. If it's not given, $self is used. Good point, done in V3 attached. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.comdiff --git a/src/bin/pg_rewind/t/007_standby_source.pl b/src/bin/pg_rewind/t/007_standby_source.pl index 52644c2c0d..7236a3b177 100644 --- a/src/bin/pg_rewind/t/007_standby_source.pl +++ b/src/bin/pg_rewind/t/007_standby_source.pl @@ -74,9 +74,8 @@ $node_a->safe_psql('postgres', "INSERT INTO tbl1 values ('in A, before promotion')"); $node_a->safe_psql('postgres', 'CHECKPOINT'); -my $lsn = $node_a->lsn('write'); -$node_a->wait_for_catchup('node_b', 'write', $lsn); -$node_b->wait_for_catchup('node_c', 'write', $lsn); +$node_a->wait_for_write_catchup('node_b', $node_a); +$node_b->wait_for_write_catchup('node_c', $node_a); # Promote C # @@ -160,7 +159,7 @@ in A, after C was promoted $node_a->safe_psql('postgres', "INSERT INTO tbl1 values ('in A, after rewind')"); -$node_b->wait_for_catchup('node_c', 'replay', $node_a->lsn('write')); +$node_b->wait_for_replay_catchup('node_c', $node_a); check_query( 'SELECT * FROM tbl1', diff --git a/src/test/modules/brin/t/02_wal_consistency.pl b/src/test/modules/brin/t/02_wal_consistency.pl index 5983ef208e..8b2b244feb 100644 --- a/src/test/modules/brin/t/02_wal_consistency.pl +++ b/src/test/modules/brin/t/02_wal_consistency.pl @@ -70,6 +70,6 @@ my ($ret, $out, $err) = $whiskey->psql( }); cmp_ok($out, '>=', 1); -$whiskey->wait_for_catchup($charlie, 'replay', $whiskey->lsn('insert')); +$whiskey->wait_for_replay_catchup($charlie); done_testing(); diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm index 04921ca3a3..3ba1545688 100644 --- a/src/test/perl/PostgreSQL/Test/Cluster.pm +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm @@ -2709,6 +2709,45 @@ sub wait_for_catchup return; } +# Now defining helper functions wait_for_replay_catchup() and +# wait_for_write_catchup(). 
+# Please note that wait_for_flush_catchup() and wait_for_sent_catchup() are not +# defined as: 1) they are not used yet and 2) it lets their author (if any) +# decide the node->lsn() mode to be used. + +=pod + +=item $node->wait_for_replay_catchup(standby_name, node) + +Helper function for wait_for_catchup when waiting for the replay_lsn column +to catchup. + +=cut + +sub wait_for_replay_catchup +{ + my ($self, $standby_name, $node) = @_; + $node = defined($node) ? $node : $self; + + $self->wait_for_catchup($standby_name, 'replay', $node->lsn('flush')); +} + +=pod + +=item $node->wait_for_write_catchup(standby_name, node) + +Helper function for wait_for_catchup when waiting for the write_lsn column +to catchup. + +=cut + +sub wait_for_write_catchup +{ + my ($self, $standby_name, $node) = @_; + + $self->wait_for_catchup($standby_name, 'write', $node->lsn('write')); +} + =pod =item $node->wait_for_slot_catchup(slot_name, mode, target_lsn) diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl index 23a90dd85b..76846905a7 100644 --- a/src/test/recovery/t/001_stream_rep.pl +++ b/src/test/recovery/t/001_stream_rep.pl @@ -47,9 +47,8 @@ $node_primary->safe_psql('postgres', "CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a"); # Wait for standbys to catch up -my $primary_lsn = $node_primary->lsn('write'); -$node_primary->wait_for_catchup($node_standby_1, 'replay', $pri
Re: improving user.c error messages
On Thu, Jan 26, 2023 at 2:14 PM Nathan Bossart wrote: > Yeah, it's probably better to say "to alter roles with %s" to refer to > roles that presently have the attribute and "to change the %s attribute" > when referring to privileges for the attribute. I did this in v2, too. > > I've also switched from errhint() to errdetail() as suggested by Tom. This seems fine to me in general but I'm not entirely sure about this part: @@ -758,16 +776,13 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt) { /* things an unprivileged user certainly can't do */ if (dinherit || dcreaterole || dcreatedb || dcanlogin || dconnlimit || - dvalidUntil || disreplication || dbypassRLS) + dvalidUntil || disreplication || dbypassRLS || + (dpassword && roleid != currentUserId)) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("permission denied"))); - - /* an unprivileged user can change their own password */ - if (dpassword && roleid != currentUserId) - ereport(ERROR, - (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must have CREATEROLE privilege to change another user's password"))); + errmsg("permission denied to alter role"), + errdetail("You must have %s privilege and %s on role \"%s\".", +"CREATEROLE", "ADMIN OPTION", rolename))); } else if (!superuser()) { Basically my question is whether having one error message for all of those cases is good enough, or whether we should be trying harder. I don't mind if the conclusion is that it's OK as-is, and I'm not entirely sure what would be better. But when I was working on this code, all of those cases OR'd together feeding into a single error message seemed a little sketchy to me, so I am wondering what others think. -- Robert Haas EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 11:35 AM Peter Geoghegan wrote: > You complained about the descriptions being theoretical. But there's > nothing theoretical about the fact that we more or less do *all* > freezing in an eventual aggressive VACUUM in many important cases, > including very simple cases like pgbench_history -- the simplest > possible append-only table case. We'll merrily rewrite the entire > table, all at once, for no good reason at all. Consistently, reliably. > It's so incredibly obvious that this makes zero sense! And yet I don't > think you've ever engaged with such basic points as that one. I'm aware that that's a problem, and I agree that it sucks. I think that what this patch does is make vacuum more aggressive, and I expect that would help this problem. I haven't said much about that because I don't think it's controversial. However, the patch also has a cost, and that's what I think is controversial. I think it's pretty much impossible to freeze more aggressively without losing in some scenario or other. If waiting longer to freeze would have resulted in the data getting updated again or deleted before we froze it, then waiting longer reduces the total amount of freezing work that ever has to be done. Freezing more aggressively inevitably gives up some amount of that potential benefit in order to try to secure some other benefit. It's a trade-off. I think that the goal of a patch that makes vacuum more (or less) aggressive should be to make the cases where we lose as obscure as possible, and the cases where we win as broad as possible. I think that, in order to be a good patch, it needs to be relatively difficult to find cases where we incur a big loss. If it's easy to find a big loss, then I think it's better to stick with the current behavior, even if it's also easy to find a big gain. There's nothing wonderful about the current behavior, but (to paraphrase what I think Andres has already said several times) it's better to keep shipping code with the same bad behavior than to put out a new major release with behaviors that are just as bad, but different. I feel like your emails sometimes seem to suppose that I think that you're a bad person, or a bad developer, or that you have no good ideas, or that you have no good ideas about this topic, or that this topic is not important, or that we don't need to do better than we are currently doing. I think none of those things. However, I'm also not prepared to go all the way to the other end of the spectrum and say that all of your ideas and everything in this patch are great. I don't think either of those things, either. I certainly think that freezing more aggressively in some scenarios could be a great idea, but it seems like the patch's theory is to be very nearly maximally aggressive in every vacuum run if the table size is greater than some threshold, and I don't think that's right at all. I'm not exactly sure what information we should use to decide how aggressive to be, but I am pretty sure that the size of the table is not it. It's true that, for a small table, the cost of having to eventually vacuum the whole table at once isn't going to be very high, whereas for a large table, it will be. That line of reasoning makes a size threshold sound reasonable. However, the amount of extra work that we can potentially do by vacuuming more aggressively *also* increases with the table size, which to me means using that as a criterion actually isn't sensible at all.
One idea that I've had about how to solve this problem is to try to make vacuum try to aggressively freeze some portion of the table on each pass, and to behave less aggressively on the rest of the table so that, hopefully, no single vacuum does too much work. Unfortunately, I don't really know how to do that effectively. If we knew that the table was going to see 10 vacuums before we hit autovacuum_freeze_max_age, we could try to have each one do 10% of the amount of freezing that was going to need to be done rather than letting any single vacuum do all of it, but we don't have that sort of information. Also, even if we did have that sort of information, the idea only works if the pages that we freeze sooner are ones that we're not about to update or delete again, and we don't have any idea what is likely there. In theory we could have some system that tracks how recently each page range in a table has been modified, and direct our freezing activity toward the ones less-recently modified on the theory that they're not so likely to be modified again in the near future, but in reality we have no such system. So I don't really feel like I know what the right answer is here, yet. -- Robert Haas EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, 26 Jan 2023 at 19:45, Peter Geoghegan wrote: > > On Thu, Jan 26, 2023 at 9:53 AM Andres Freund wrote: > > I assume the case you're thinking of is that pruning did *not* do any > > changes, > > but in the process of figuring out that nothing needed to be pruned, we did > > a > > MarkBufferDirtyHint(), and as part of that emitted an FPI? > > Yes. > > > > That's going to be very significantly more aggressive. For example > > > it'll impact small tables very differently. > > > > Maybe it would be too aggressive, not sure. The cost of a freeze WAL record > > is > > relatively small, with one important exception below, if we are 99.99% sure > > that it's not going to require an FPI and isn't going to dirty the page. > > > > The exception is that a newer LSN on the page can cause the ringbuffer > > replacement to trigger more more aggressive WAL flushing. No meaningful > > difference if we modified the page during pruning, or if the page was > > already > > in s_b (since it likely won't be written out via the ringbuffer in that > > case), > > but if checksums are off and we just hint-dirtied the page, it could be a > > significant issue. > > Most of the overhead of FREEZE WAL records (with freeze plan > deduplication and page-level freezing in) is generic WAL record header > overhead. Your recent adversarial test case is going to choke on that, > too. At least if you set checkpoint_timeout to 1 minute again. Could someone explain to me why we don't currently (optionally) include the functionality of page freezing in the PRUNE records? I think they're quite closely related (in that they both execute in VACUUM and are required for long-term system stability), and are even more related now that we have opportunistic page-level freezing. I think adding a "freeze this page as well"-flag in PRUNE records would go a long way to reducing the WAL overhead of aggressive and more opportunistic freezing. -Matthias
Re: improving user.c error messages
Thanks for taking a look. On Thu, Jan 26, 2023 at 10:07:39AM +0100, Alvaro Herrera wrote: > Please use > errdetail("You must have %s privilege to create roles with %s.", > "SUPERUSER", "SUPERUSER"))); > > in this kind of message where multiple copies appear that only differ in > the keyword to use, to avoid creating four copies of essentially the > same string. > > This applies in several places. I did this in v2. >> - errmsg("must have createdb privilege >> to change createdb attribute"))); >> + errmsg("permission denied to alter >> role"), >> + errhint("You must have CREATEDB >> privilege to alter roles with CREATEDB."))); > > I think this one is a bit ambiguous; does "with" mean that roles that > have that priv cannot be changed, or does it mean that you cannot meddle > with that bit in particular? I think it'd be better to say > "You must have %s privilege to change the %s attribute." > or something like that. Yeah, it's probably better to say "to alter roles with %s" to refer to roles that presently have the attribute and "to change the %s attribute" when referring to privileges for the attribute. I did this in v2, too. I've also switched from errhint() to errdetail() as suggested by Tom. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From a6db1f28f0e7079193af4fca3fde27cf4780dbb7 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 26 Jan 2023 11:05:13 -0800 Subject: [PATCH v2 1/1] Improve user.c error messages. --- src/backend/commands/user.c | 171 -- src/test/regress/expected/create_role.out | 77 ++ src/test/regress/expected/dependency.out | 4 + src/test/regress/expected/privileges.out | 23 ++- 4 files changed, 195 insertions(+), 80 deletions(-) diff --git a/src/backend/commands/user.c b/src/backend/commands/user.c index 3a92e930c0..e2c80f5060 100644 --- a/src/backend/commands/user.c +++ b/src/backend/commands/user.c @@ -316,23 +316,33 @@ CreateRole(ParseState *pstate, CreateRoleStmt *stmt) if (!has_createrole_privilege(currentUserId)) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("permission denied to create role"))); + errmsg("permission denied to create role"), + errdetail("You must have %s privilege to create roles.", + "CREATEROLE"))); if (issuper) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must be superuser to create superusers"))); + errmsg("permission denied to create role"), + errdetail("You must have %s privilege to create roles with %s.", + "SUPERUSER", "SUPERUSER"))); if (createdb && !have_createdb_privilege()) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must have createdb permission to create createdb users"))); + errmsg("permission denied to create role"), + errdetail("You must have %s privilege to create roles with %s.", + "CREATEDB", "CREATEDB"))); if (isreplication && !has_rolreplication(currentUserId)) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must have replication permission to create replication users"))); + errmsg("permission denied to create role"), + errdetail("You must have %s privilege to create roles with %s.", + "REPLICATION", "REPLICATION"))); if (bypassrls && !has_bypassrls_privilege(currentUserId)) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must have bypassrls to create bypassrls users"))); + errmsg("permission denied to create role"), + errdetail("You must have %s privilege to create roles with %s.", + "BYPASSRLS", "BYPASSRLS"))); } /* @@ -744,10 +754,18 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt) roleid = 
authform->oid; /* To mess with a superuser in any way you gotta be superuser. */ - if (!superuser() && (authform->rolsuper || dissuper)) + if (!superuser() && authform->rolsuper) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("must be superuser to alter superuser roles or change superuser attribute"))); + errmsg("permission denied to alter role"), + errdetail("You must have %s privilege to alter roles with %s.", + "SUPERUSER", "SUPERUSER"))); + if (!superuser() && dissuper) + ereport(ERROR, +(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("permission denied to alter role"), + errdetail("You must have %s privilege to change the %s attribute.", + "SUPERUSER", "SUPERUSER"))); /* * Most changes to a role require that you both have CREATEROLE privileges @@ -758,16 +776,13 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt) { /* things an unprivileged user certainly can't do */ if (dinherit || dcreaterole || dcreatedb || dcanlogin || dc
Re: suppressing useless wakeups in logical/worker.c
I wrote: >>> It'd probably be reasonable to file down that sharp edge by instead >>> specifying that TimestampDifferenceMilliseconds will clamp overflowing >>> differences to LONG_MAX. Maybe there should be a clamp on the underflow >>> side too ... but should it be to LONG_MIN or to zero? After looking closer, I see that TimestampDifferenceMilliseconds already explicitly states that its output is intended for WaitLatch and friends, which makes it perfectly sane for it to clamp the result to [0, INT_MAX] rather than depending on the caller to not pass out-of-range values. I checked existing callers, and found one place in basebackup_copy.c that had not read the memo about TimestampDifferenceMilliseconds never returning a negative value, and another in postmaster.c that had not read the memo about Min() and Max() being macros. There are none that are unhappy about clamping to INT_MAX, and at least one that was already assuming it could just cast the result to int. Hence, I propose the attached. I haven't gone as far as to change the result type from long to int; that seems like a lot of code churn (if we are to update WaitLatch and all callers to match) and it won't really buy anything semantically. regards, tom lane diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c index 05470057f5..2bb6c89f8c 100644 --- a/src/backend/backup/basebackup_copy.c +++ b/src/backend/backup/basebackup_copy.c @@ -215,7 +215,8 @@ bbsink_copystream_archive_contents(bbsink *sink, size_t len) * the system clock was set backward, so that such occurrences don't * have the effect of suppressing further progress messages. */ - if (ms < 0 || ms >= PROGRESS_REPORT_MILLISECOND_THRESHOLD) + if (ms >= PROGRESS_REPORT_MILLISECOND_THRESHOLD || + now < mysink->last_progress_report_time) { mysink->last_progress_report_time = now; diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 5b775cf7d0..62fba5fcee 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -1670,11 +1670,12 @@ DetermineSleepTime(void) if (next_wakeup != 0) { - /* Ensure we don't exceed one minute, or go under 0. */ - return Max(0, - Min(60 * 1000, - TimestampDifferenceMilliseconds(GetCurrentTimestamp(), - next_wakeup))); + int ms; + + /* result of TimestampDifferenceMilliseconds is in [0, INT_MAX] */ + ms = (int) TimestampDifferenceMilliseconds(GetCurrentTimestamp(), + next_wakeup); + return Min(60 * 1000, ms); } return 60 * 1000; diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index e95398db05..b0cfddd548 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -445,7 +445,7 @@ WalReceiverMain(void) pgsocket wait_fd = PGINVALID_SOCKET; int rc; TimestampTz nextWakeup; -int nap; +long nap; /* * Exit walreceiver if we're not in recovery. This should not @@ -528,15 +528,9 @@ WalReceiverMain(void) for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) nextWakeup = Min(wakeup[i], nextWakeup); -/* - * Calculate the nap time. WaitLatchOrSocket() doesn't accept - * timeouts longer than INT_MAX milliseconds, so we limit the - * result accordingly. Also, we round up to the next - * millisecond to avoid waking up too early and spinning until - * one of the wakeup times. - */ +/* Calculate the nap time, clamping as necessary. 
*/ now = GetCurrentTimestamp(); -nap = (int) Min(INT_MAX, Max(0, (nextWakeup - now + 999) / 1000)); +nap = TimestampDifferenceMilliseconds(now, nextWakeup); /* * Ideally we would reuse a WaitEventSet object repeatedly diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c index 928c330897..b1d1963729 100644 --- a/src/backend/utils/adt/timestamp.c +++ b/src/backend/utils/adt/timestamp.c @@ -1690,12 +1690,12 @@ TimestampDifference(TimestampTz start_time, TimestampTz stop_time, * * This is typically used to calculate a wait timeout for WaitLatch() * or a related function. The choice of "long" as the result type - * is to harmonize with that. It is caller's responsibility that the - * input timestamps not be so far apart as to risk overflow of "long" - * (which'd happen at about 25 days on machines with 32-bit "long"). + * is to harmonize with that; furthermore, we clamp the result to at most + * INT_MAX milliseconds, because that's all that WaitLatch() allows. * - * Both inputs must be ordinary finite timestamps (in current usage, - * they'll be results from GetCurrentTimestamp()). + * At least one input must be an ordinary finite timestamp, else the "diff" + * calculation might overflow. We do support stop_time == TIMESTAMP_INFINITY, + * which will result in INT_MAX wait time. * * We expect start_time <= stop_tim
Re: Syncrep and improving latency due to WAL throttling
On 1/26/23 16:40, Andres Freund wrote: > Hi, > > On 2023-01-26 12:08:16 +0100, Tomas Vondra wrote: >> It's not clear to me how could it cause deadlocks, as we're not waiting >> for a lock/resource locked by someone else, but it's certainly an issue >> for uninterruptible hangs. > > Maybe not. But I wouldn't want to bet on it. It's a violation of all kinds of > lock-ordering rules. > Not sure what lock ordering issues you have in mind, but I agree it's not the right place for the sleep, no argument here. > >>> My best idea for how to implement this in a somewhat safe way would be for >>> XLogInsertRecord() to set a flag indicating that we should throttle, and set >>> InterruptPending = true. Then the next CHECK_FOR_INTERRUPTS that's allowed >>> to >>> proceed (i.e. we'll not be in a critical / interrupts off section) can >>> actually perform the delay. That should fix the hard deadlock danger and >>> remove most of the increase in lock contention. >>> >> >> The solution I've imagined is something like autovacuum throttling - do >> some accounting of how much "WAL bandwidth" each process consumed, and >> then do the delay/sleep in a suitable place. > > Where would such a point be, except for ProcessInterrupts()? Iteratively > adding a separate set of "wal_delay()" points all over the executor, > commands/, ... seems like a bad plan. > I haven't thought about where to do the work, TBH. ProcessInterrupts() may very well be a good place. I should have been clearer, but the main benefit of autovacuum-like throttling is IMHO that it involves all processes and a global limit, while the current approach is per-backend. > >>> I don't think doing an XLogFlush() of a record that we just wrote is a good >>> idea - that'll sometimes significantly increase the number of fdatasyncs/sec >>> we're doing. To make matters worse, this will often cause partially filled >>> WAL >>> pages to be flushed out - rewriting a WAL page multiple times can >>> significantly increase overhead on the storage level. At the very least >>> this'd >>> have to flush only up to the last fully filled page. >>> >> >> I don't see why would that significantly increase the number of flushes. >> The goal is to do this only every ~1MB of a WAL or so, and chances are >> there were many (perhaps hundreds or more) commits in between. At least >> in a workload with a fair number of OLTP transactions (which is kinda >> the point of all this). > > Because the accounting is done separately in each process. Even if you just > add a few additional flushes for each connection, in aggregate that'll be a > lot. > How come? Imagine the backend does flush only after generating e.g. 1MB of WAL. Small transactions won't do any additional flushes at all (because commit resets the accounting). Large transactions will do an extra flush every 1MB, so 16x per WAL segment. But in between there will be many commits from the small transactions. If we backoff to the last complete page, that should eliminate even most of these flushes. So where would the additional flushes come from? > >> And the "small" OLTP transactions don't really do any extra fsyncs, >> because the accounting resets at commit (or places that flush WAL). >> [...] >>> Just counting the number of bytes inserted by a backend will make the >>> overhead >>> even worse, as the flush will be triggered even for OLTP sessions doing tiny >>> transactions, even though they don't contribute to the problem you're trying >>> to address. 
How about counting how many bytes of WAL a backend has inserted >>> since the last time that backend did an XLogFlush()? >>> >> >> No, we should reset the counter at commit, so small OLTP transactions >> should not reach really trigger this. That's kinda the point, to only >> delay "large" transactions producing a lot of WAL. > > I might have missed something, but I don't think the first version of patch > resets the accounting at commit? > Yeah, it doesn't. It's mostly a minimal PoC patch, sufficient to test the behavior / demonstrate the effect. > >> Same for the flushes of partially flushed pages - if there's enough of >> small OLTP transactions, we're already having this issue. I don't see >> why would this make it measurably worse. > > Yes, we already have that problem, and it hurts. *Badly*. I don't see how v1 > could *not* make it worse. Suddenly you get a bunch of additional XLogFlush() > calls to partial boundaries by every autovacuum, by every process doing a bit > more bulky work. Because you're flushing the LSN after a record you just > inserted, you're commonly not going to be "joining" the work of an already > in-process XLogFlush(). > Right. We do ~16 additional flushes per 16MB segment (or something like that, depending on the GUC value). Considering we do thousand of commits per segment, each of which does a flush, I don't see how would this make it measurably worse. > >> But if needed, we can simply backoff to the last page boundary, s
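To make the flow of the idea discussed in this thread concrete, here is a hedged sketch: flag the throttle where WAL is inserted, sleep only at the next safe interrupt-processing point, and reset the accounting at commit or flush. All names below are invented for illustration and are not the actual patch or PostgreSQL symbols:

#include <stdbool.h>
#include <stdint.h>

/* Per-backend accounting; in a real patch this would be backend-local state
 * maintained by the WAL insertion path. */
static uint64_t wal_bytes_since_reset = 0;
static bool     wal_throttle_pending = false;
static bool     interrupt_pending = false;      /* stand-in for InterruptPending */

#define WAL_THROTTLE_THRESHOLD  (1024 * 1024)   /* ~1MB, as discussed above */

/* Called after inserting a WAL record: never sleep here (we may be inside a
 * critical section holding locks); just request a throttle for later. */
static void
account_wal_insert(uint64_t record_bytes)
{
    wal_bytes_since_reset += record_bytes;
    if (wal_bytes_since_reset >= WAL_THROTTLE_THRESHOLD)
    {
        wal_throttle_pending = true;
        interrupt_pending = true;
    }
}

/* Called at commit and after any WAL flush: small OLTP transactions reset
 * the counter before ever reaching the threshold. */
static void
reset_wal_throttle_accounting(void)
{
    wal_bytes_since_reset = 0;
    wal_throttle_pending = false;
}

/* Called from the interrupt-processing path, outside critical sections:
 * wait for standbys to catch up to an LSN slightly in the past (e.g. the
 * last fully filled WAL page) rather than the just-inserted record. */
static void
maybe_throttle_wal(void)
{
    if (!wal_throttle_pending)
        return;
    /* wait_for_sync_rep_up_to(last_complete_page_lsn());  -- hypothetical */
    reset_wal_throttle_accounting();
}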
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 9:53 AM Andres Freund wrote: > I assume the case you're thinking of is that pruning did *not* do any changes, > but in the process of figuring out that nothing needed to be pruned, we did a > MarkBufferDirtyHint(), and as part of that emitted an FPI? Yes. > > That's going to be very significantly more aggressive. For example > > it'll impact small tables very differently. > > Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is > relatively small, with one important exception below, if we are 99.99% sure > that it's not going to require an FPI and isn't going to dirty the page. > > The exception is that a newer LSN on the page can cause the ringbuffer > replacement to trigger more more aggressive WAL flushing. No meaningful > difference if we modified the page during pruning, or if the page was already > in s_b (since it likely won't be written out via the ringbuffer in that case), > but if checksums are off and we just hint-dirtied the page, it could be a > significant issue. Most of the overhead of FREEZE WAL records (with freeze plan deduplication and page-level freezing in) is generic WAL record header overhead. Your recent adversarial test case is going to choke on that, too. At least if you set checkpoint_timeout to 1 minute again. > Thus a modification of the above logic could be to opportunistically freeze if > a ) it won't cause an FPI and either > b1) the page was already dirty before pruning, as we'll not do a ringbuffer > replacement in that case > or > b2) We wrote a WAL record during pruning, as the difference in flush position > is marginal > > An even more aggressive version would be to replace b1) with logic that'd > allow newly dirtying the page if it wasn't read through the ringbuffer. But > newly dirtying the page feels like it'd be more dangerous. In many cases we'll have to dirty the page anyway, just to set PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether triggered by an FPI or triggered by my now-reverted GUC) on being able to set the whole page all-frozen in the VM. > A less aggressive version would be to check if any WAL records were emitted > during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we > modified the page again. Similar to what we do now, except not requiring an > FPI to have been emitted. Also way more aggressive. Not nearly enough on its own. > But to me it seems a bit odd that VACUUM now is more aggressive if checksums / > wal_log_hint bits is on, than without them. Which I think is how using either > of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working? Which part is the odd part? Is it odd that page-level freezing works that way, or is it odd that page-level checksums work that way? In any case this seems like an odd thing for you to say, having eviscerated a patch that really just made the same behavior trigger independently of FPIs in some tables, controlled via a GUC. -- Peter Geoghegan
Re: drop postmaster symlink
On Wed, 25 Jan 2023 18:03:25 -0600 "Karl O. Pinc" wrote: > Buried in > https://www.postgresql.org/message-id/20230107165942.748ccf4e%40slate.karlpinc.com > is the one change I see that should be made. > > > In doc/src/sgml/ref/allfiles.sgml at line 222 there is an ENTITY > > defined which references the deleted postmaster.sgml file. > > This line needs to be removed and the > 0002-Don-t-install-postmaster-symlink-anymore.patch > updated. (Unless there's some magic going on > with the various allfiles.sgml files of which I am > not aware.) > > If this is fixed I see no other problems. Buried in the same email, and I apologize for not mentioning this, is one other bit of documentation text that might or might not need attention. > I see a possible problem at line 1,412 of runtime.sgml This says: in the postmaster's startup script just before invoking the postmaster. Depending on how this is read, it could be interpreted to mean that a "postmaster" binary is invoked. It might be more clear to write: ... just before invoking postgres. Or it might not be worth bothering about; at this point, probably not, but I thought you might want the heads-up anyway. Regards, Karl Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
Re: Add LZ4 compression in pg_dump
On Wed, Jan 25, 2023 at 07:57:18PM +, gkokola...@pm.me wrote: > On Wednesday, January 25th, 2023 at 7:00 PM, Justin Pryzby > wrote: > > While looking at this, I realized that commit 5e73a6048 introduced a > > regression: > > > > @@ -3740,19 +3762,24 @@ ReadHead(ArchiveHandle *AH) > > > > - if (AH->compression != 0) > > > > - pg_log_warning("archive is compressed, but this installation does not > > support compression -- no data will be available"); > > + if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP) > > > > + pg_fatal("archive is compressed, but this installation does not support > > compression"); > > > > Before, it was possible to restore non-data chunks of a dump file, even > > if the current build didn't support its compression. But that's now > > impossible - and it makes the code we're discussing in RestoreArchive() > > unreachable. On Thu, Jan 26, 2023 at 08:53:28PM +0900, Michael Paquier wrote: > On Thu, Jan 26, 2023 at 11:24:47AM +, gkokola...@pm.me wrote: > > I gave this a little bit of thought. I think that ReadHead should not > > emit a warning, or at least not this warning as it is slightly misleading. > > It implies that it will automatically turn off data restoration, which is > > false. Further ahead, the code will fail with a conflicting error message > > stating that the compression is not available. > > > > Instead, it would be cleaner both for the user and the maintainer to > > move the check in RestoreArchive and make it the sole responsible for > > this logic. > > -pg_fatal("cannot restore from compressed archive (compression not > supported in this installation)"); > +pg_fatal("cannot restore data from compressed archive (compression not > supported in this installation)"); > Hmm. I don't mind changing this part as you suggest. > > -#ifndef HAVE_LIBZ > - if (AH->compression_spec.algorithm == PG_COMPRESSION_GZIP) > - pg_fatal("archive is compressed, but this installation does > not support compression"); > -#endif > However I think that we'd better keep the warning, as it can offer a > hint when using pg_restore -l not built with compression support if > looking at a dump that has been compressed. Yeah. But the original log_warning text was better, and should be restored: - if (AH->compression != 0) - pg_log_warning("archive is compressed, but this installation does not support compression -- no data will be available"); That commit also added this to pg-dump.c: + case PG_COMPRESSION_ZSTD: + pg_fatal("compression with %s is not yet supported", "ZSTD"); + break; + case PG_COMPRESSION_LZ4: + pg_fatal("compression with %s is not yet supported", "LZ4"); + break; In 002, that could be simplified by re-using the supports_compression() function. (And maybe the same in WriteDataToArchive()?) -- Justin
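For what it's worth, a rough sketch of the simplification suggested above, assuming pg_dump's usual headers; the exact signature of the supports_compression() helper from the patch series, and the get_compress_algorithm_name() call, are assumptions here rather than confirmed APIs:

/* Sketch only: collapse the per-algorithm ZSTD/LZ4 pg_fatal() cases into a
 * single check against the (assumed) supports_compression() helper. */
static void
check_compression_supported(pg_compress_specification compression_spec)
{
    if (!supports_compression(compression_spec))
        pg_fatal("compression with \"%s\" is not supported in this build",
                 get_compress_algorithm_name(compression_spec.algorithm));
}

The same check could then plausibly be reused in pg_dump option validation and in WriteDataToArchive(), which is the reuse Justin is pointing at.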
Re: meson oddities
On 2023-01-26 10:20:58 +0100, Peter Eisentraut wrote: > On 19.01.23 21:45, Andres Freund wrote: > > Hi, > > > > On 2023-01-19 21:37:15 +0100, Peter Eisentraut wrote: > > > On 11.01.23 12:05, Peter Eisentraut wrote: > > > > I think there is also an adjacent issue: The subdir options may be > > > > absolute or relative. So if you specify --prefix=/usr/local and > > > > --sysconfdir=/etc/postgresql, then > > > > > > > > config_paths_data.set_quoted('SYSCONFDIR', dir_prefix / > > > > dir_sysconf) > > > > > > > > would produce something like /usr/local/etc/postgresql. > > > > I don't think it would. The / operator understands absolute paths and > > doesn't > > add the "first component" if the second component is absolute. > > Oh, that is interesting. In that case, this is not the right patch. We > should proceed with my previous patch in [0] then. WFM. I still think it'd be slightly more legible if we tested the prefix for postgres|pgsql once, rather than do the per-variable .contains() checks on the "combined" path. But it's a pretty minor difference, and I'd have no problem with you committing your version. Basically:

is_pg_prefix = dir_prefix.contains('pgsql') or dir_prefix.contains('postgres')
...
if not (is_pg_prefix or dir_data.contains('pgsql') or dir_data.contains('postgres'))

instead of "your":

if not ((dir_prefix/dir_data).contains('pgsql') or (dir_prefix/dir_data).contains('postgres'))

Greetings, Andres Freund
Re: fix and document CLUSTER privileges
On Wed, Jan 25, 2023 at 08:27:57PM -0800, Jeff Davis wrote: > Committed these extra clarifications. Thank you. Thanks! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: New strategies for freezing, advancing relfrozenxid early
Hi, On 2023-01-26 08:54:55 -0800, Peter Geoghegan wrote: > On Thu, Jan 26, 2023 at 8:35 AM Andres Freund wrote: > > I think it's probably ok, but perhaps deserves a bit more thought about when > > to "opportunistically" freeze. Perhaps to make it *more* aggressive than > > it's > > now. > > > > With "opportunistic freezing" I mean freezing the page, even though we don't > > *have* to freeze any of the tuples. > > > > The overall condition gating freezing is: > > if (pagefrz.freeze_required || tuples_frozen == 0 || > > (prunestate->all_visible && prunestate->all_frozen && > > fpi_before != pgWalUsage.wal_fpi)) > > > > fpi_before is set before the heap_page_prune() call. > > Have you considered page-level checksums, and how the impact on hint > bits needs to be accounted for here? > > All RDS customers use page-level checksums. And I've noticed that it's > very common for the number of FPIs to only be very slightly less than > the number of pages dirtied. Much of which is just hint bits. The > "fpi_before != pgWalUsage.wal_fpi" test catches that. I assume the case you're thinking of is that pruning did *not* do any changes, but in the process of figuring out that nothing needed to be pruned, we did a MarkBufferDirtyHint(), and as part of that emitted an FPI? > > To me a condition that checked if the buffer is already dirty and if another > > XLogInsert() would be likely to generate an FPI would make more sense. The > > rare race case of a checkpoint starting concurrently doesn't matter IMO. > > That's going to be very significantly more aggressive. For example > it'll impact small tables very differently. Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is relatively small, with one important exception below, if we are 99.99% sure that it's not going to require an FPI and isn't going to dirty the page. The exception is that a newer LSN on the page can cause the ringbuffer replacement to trigger more more aggressive WAL flushing. No meaningful difference if we modified the page during pruning, or if the page was already in s_b (since it likely won't be written out via the ringbuffer in that case), but if checksums are off and we just hint-dirtied the page, it could be a significant issue. Thus a modification of the above logic could be to opportunistically freeze if a ) it won't cause an FPI and either b1) the page was already dirty before pruning, as we'll not do a ringbuffer replacement in that case or b2) We wrote a WAL record during pruning, as the difference in flush position is marginal An even more aggressive version would be to replace b1) with logic that'd allow newly dirtying the page if it wasn't read through the ringbuffer. But newly dirtying the page feels like it'd be more dangerous. A less aggressive version would be to check if any WAL records were emitted during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we modified the page again. Similar to what we do now, except not requiring an FPI to have been emitted. But to me it seems a bit odd that VACUUM now is more aggressive if checksums / wal_log_hint bits is on, than without them. Which I think is how using either of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working? Greetings, Andres Freund
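Translating the alternative gating condition described above into code may make the comparison easier. This is only a sketch of condition a) combined with b1)/b2); the helper parameters are invented for illustration, and only the shape of the test is taken from the discussion:

/* Hypothetical sketch: freeze opportunistically when it will not emit an
 * FPI (a), and either the page was dirty before pruning (b1) or pruning
 * already wrote a WAL record (b2). */
static bool
should_freeze_opportunistically(bool freeze_required,
                                int tuples_frozen,
                                bool all_visible_all_frozen,
                                bool page_was_dirty_before_prune,
                                bool prune_emitted_wal,
                                bool freeze_would_emit_fpi)
{
    if (freeze_required || tuples_frozen == 0)
        return true;                    /* unchanged from the current logic */

    if (!all_visible_all_frozen)
        return false;                   /* only freeze whole pages */

    /* (a): don't opportunistically freeze if it would cost an FPI */
    if (freeze_would_emit_fpi)
        return false;

    /* (b1) or (b2): cheap because the page is dirty anyway, or because the
     * difference in WAL flush position is marginal */
    return page_was_dirty_before_prune || prune_emitted_wal;
}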
Re: Non-superuser subscription owners
On Thu, 2023-01-26 at 09:43 -0500, Robert Haas wrote: > I have no issue with that as a long-term plan. However, I think that > for right now we should just introduce pg_create_subscription. It > would make sense to add pg_create_connection in the same patch that > adds a CREATE CONNECTION command (or whatever exact syntax we end up > with) -- and that patch can also change CREATE SUBSCRIPTION to > require > both privileges where a connection string is specified directly. I assumed it would be a problem to say that pg_create_subscription was enough to create a subscription today, and then later require additional privileges (e.g. pg_create_connection). If that's not a problem, then this sounds fine with me. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: run pgindent on a regular basis / scripted manner
> At this stage the files are now indented, so if it failed and you run > `git commit` again it will commit with the indention changes. No, because at no point a "git add" is happening, so the changes made by pgindent are not staged. As long as you don't run the second "git commit" with the -a flag the commit will be exactly the same as you prepared it before. > Sure, that's your choice. My intended audience here is committers, who > of course do work on master. Yes I understand, I meant it would be nice if the script had a environment variable like PG_COMMIT_HOOK_ALL_BRANCHES (bad name) for this purpose. On Thu, 26 Jan 2023 at 17:54, Andrew Dunstan wrote: > > > On 2023-01-26 Th 11:16, Jelte Fennema wrote: > > On Thu, 26 Jan 2023 at 15:40, Andrew Dunstan wrote: > >> I didn't really like your hook, as it forces a reindent, and many people > >> won't want that (for reasons given elsewhere in this thread). > > I'm not sure what you mean by "forces a reindent". Like I explained > > you can simply run "git commit" again to ignore the changes and > > commit anyway. As long as the files are indented on your filesystem > > the hook doesn't care if you actually included the indentation changes > > in the changes that you're currently committing. > > > Your hook does this: > > > +git diff --cached --name-only --diff-filter=ACMR | grep '\.[ch]$' |\ > +xargs src/tools/pgindent/pgindent --silent-diff \ > +|| { > +echo ERROR: Aborting commit because pgindent was not run > +git diff --cached --name-only --diff-filter=ACMR | grep > '\.[ch]$' | xargs src/tools/pgindent/pgindent > +exit 1 > +} > > > At this stage the files are now indented, so if it failed and you run > `git commit` again it will commit with the indention changes. > > > > > > So to be completely clear you can do the following with my hook: > > git commit # runs pgindent and fails > > git commit # commits changes anyway > > git commit -am 'Run pgindent' # commit indentation changes separately > > > > Or what I usually do: > > git commit # runs pgindent and fails > > git add --patch # choose relevant changes to add to commit > > git commit # commit the changes > > git checkout -- . # undo irrelevant changes on filesystem > > > > Honestly PGAUTOINDENT=no seems stricter, since the only > > way to bypass the failure is now to run manually run pgindent > > or git commit with the --no-verify flag. > > > >> files=$(git diff --cached --name-only --diff-filter=ACMR) > >> src/tools/pgindent/pgindent $files > > That seems like it would fail if there's any files or directories with > > spaces in them. Maybe this isn't something we care about though. > > > We don't have any, and the filenames git produces are relative to the > git root. I don't think this is an issue. > > > > > >> # no need to filter files - pgindent ignores everything that isn't a > >> # .c or .h file > > If the first argument is a non .c or .h file, then pgindent interprets > > it as the typedefs file. So it's definitely important to filter non .c > > and .h files out. Because now if you commit a single > > non .c or .h file this hook messes up the indentation in all of > > your files. You can reproduce by running: > > src/tools/pgindent/pgindent README > > > > I have a patch at [1] to remove this misfeature. > > > > > >> # only do this on master > >> test "$branch" = "master" || return 0 > > I would definitely want a way to disable this check. As a normal > > submitter I never work directly on master. > > > Sure, that's your choice. 
My intended audience here is committers, who > of course do work on master. > > > cheers > > > andrew > > > [1] https://postgr.es/m/21bb8573-9e56-812b-84cf-1e4f3c4c2...@dunslane.net > > -- > Andrew Dunstan > EDB: https://www.enterprisedb.com >
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Jan 26, 2023 at 8:35 AM Andres Freund wrote: > I think it's probably ok, but perhaps deserves a bit more thought about when > to "opportunistically" freeze. Perhaps to make it *more* aggressive than it's > now. > > With "opportunistic freezing" I mean freezing the page, even though we don't > *have* to freeze any of the tuples. > > The overall condition gating freezing is: > if (pagefrz.freeze_required || tuples_frozen == 0 || > (prunestate->all_visible && prunestate->all_frozen && > fpi_before != pgWalUsage.wal_fpi)) > > fpi_before is set before the heap_page_prune() call. Have you considered page-level checksums, and how the impact on hint bits needs to be accounted for here? All RDS customers use page-level checksums. And I've noticed that it's very common for the number of FPIs to only be very slightly less than the number of pages dirtied. Much of which is just hint bits. The "fpi_before != pgWalUsage.wal_fpi" test catches that. > To me a condition that checked if the buffer is already dirty and if another > XLogInsert() would be likely to generate an FPI would make more sense. The > rare race case of a checkpoint starting concurrently doesn't matter IMO. That's going to be very significantly more aggressive. For example it'll impact small tables very differently. -- Peter Geoghegan
Re: run pgindent on a regular basis / scripted manner
On 2023-01-26 Th 11:16, Jelte Fennema wrote: > On Thu, 26 Jan 2023 at 15:40, Andrew Dunstan wrote: >> I didn't really like your hook, as it forces a reindent, and many people >> won't want that (for reasons given elsewhere in this thread). > I'm not sure what you mean by "forces a reindent". Like I explained > you can simply run "git commit" again to ignore the changes and > commit anyway. As long as the files are indented on your filesystem > the hook doesn't care if you actually included the indentation changes > in the changes that you're currently committing. Your hook does this: +git diff --cached --name-only --diff-filter=ACMR | grep '\.[ch]$' |\ + xargs src/tools/pgindent/pgindent --silent-diff \ + || { + echo ERROR: Aborting commit because pgindent was not run + git diff --cached --name-only --diff-filter=ACMR | grep '\.[ch]$' | xargs src/tools/pgindent/pgindent + exit 1 + } At this stage the files are now indented, so if it failed and you run `git commit` again it will commit with the indention changes. > > So to be completely clear you can do the following with my hook: > git commit # runs pgindent and fails > git commit # commits changes anyway > git commit -am 'Run pgindent' # commit indentation changes separately > > Or what I usually do: > git commit # runs pgindent and fails > git add --patch # choose relevant changes to add to commit > git commit # commit the changes > git checkout -- . # undo irrelevant changes on filesystem > > Honestly PGAUTOINDENT=no seems stricter, since the only > way to bypass the failure is now to run manually run pgindent > or git commit with the --no-verify flag. > >> files=$(git diff --cached --name-only --diff-filter=ACMR) >> src/tools/pgindent/pgindent $files > That seems like it would fail if there's any files or directories with > spaces in them. Maybe this isn't something we care about though. We don't have any, and the filenames git produces are relative to the git root. I don't think this is an issue. > >> # no need to filter files - pgindent ignores everything that isn't a >> # .c or .h file > If the first argument is a non .c or .h file, then pgindent interprets > it as the typedefs file. So it's definitely important to filter non .c > and .h files out. Because now if you commit a single > non .c or .h file this hook messes up the indentation in all of > your files. You can reproduce by running: > src/tools/pgindent/pgindent README I have a patch at [1] to remove this misfeature. > >> # only do this on master >> test "$branch" = "master" || return 0 > I would definitely want a way to disable this check. As a normal > submitter I never work directly on master. Sure, that's your choice. My intended audience here is committers, who of course do work on master. cheers andrew [1] https://postgr.es/m/21bb8573-9e56-812b-84cf-1e4f3c4c2...@dunslane.net -- Andrew Dunstan EDB: https://www.enterprisedb.com