Re: pg_logical_emit_message() misses a XLogFlush()
Hi,

On 2023-08-16 12:37:21 +0900, Michael Paquier wrote:
> I won't fight much if there are objections to backpatching, but that's
> not really wise (no idea how much EDB's close flavor of BDR relies on
> that).

To be clear: I don't just object to backpatching, I also object to making
existing invocations flush WAL in HEAD.

I do not at all object to adding a parameter that indicates flushing, or a
separate function to do so. The latter might be better, because it allows you
to flush a group of messages, rather than a single one.

Greetings,

Andres Freund
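To make the "flush a group of messages" point concrete, here is a minimal
sketch under stated assumptions: the helper name and the msgs/lens/nmsgs
inputs are hypothetical, while LogLogicalMessage() and XLogFlush() are the
existing backend functions behind pg_logical_emit_message().

#include "postgres.h"

#include "access/xlog.h"			/* XLogFlush() */
#include "replication/message.h"	/* LogLogicalMessage() */

/*
 * Emit a batch of non-transactional logical messages and make them durable
 * with a single XLogFlush() at the end, instead of one flush per message.
 */
static void
emit_message_batch_durably(const char *prefix, const char **msgs,
						   const Size *lens, int nmsgs)
{
	XLogRecPtr	last_lsn = InvalidXLogRecPtr;

	for (int i = 0; i < nmsgs; i++)
		last_lsn = LogLogicalMessage(prefix, msgs[i], lens[i], false);

	/* one durability barrier for the whole group */
	if (!XLogRecPtrIsInvalid(last_lsn))
		XLogFlush(last_lsn);
}

With the per-message flush that was proposed for backpatching, the same batch
would instead incur one WAL flush per message.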
Re: pg_logical_emit_message() misses a XLogFlush()
Hi,

On 2023-08-16 03:20:53 +0200, Tomas Vondra wrote:
> On 8/16/23 02:33, Andres Freund wrote:
> > Hi,
> >
> > On 2023-08-16 06:58:56 +0900, Michael Paquier wrote:
> >> On Tue, Aug 15, 2023 at 11:37:32AM +0200, Tomas Vondra wrote:
> >>> Shouldn't the flush be done only for non-transactional messages? The
> >>> transactional case will be flushed by regular commit flush.
> >>
> >> Indeed, that would be better. I am sending an updated patch.
> >>
> >> I'd like to backpatch that, would there be any objections to that?
> >
> > Yes, I object. This would completely cripple the performance of some uses of
> > logical messages - a slowdown of several orders of magnitude. It's not clear
> > to me that flushing would be the right behaviour if it weren't released, but
> > it certainly doesn't seem right to make such a change in a minor release.
> >
> So are you objecting to adding the flush in general, or just to the
> backpatching part?

Both, I think. I don't object to adding a way to trigger flushing, but I
think it needs to be optional.

> IMHO we either guarantee durability of non-transactional messages, in which
> case this would be a clear bug - and I'd say a fairly serious one. I'm
> curious what the workload that'd see order of magnitude of slowdown does
> with logical messages I've used it, but even if such workload exists, would
> it really be enough to fix any other durability bug?

Not sure what you mean with the last sentence?

I've e.g. used non-transactional messages for:

- A non-transactional queuing system. Where sometimes one would dump a
  portion of tables into messages, with something like
  SELECT pg_logical_emit_message(false, 'app:', to_json(r)) FROM r;
  Obviously flushing after every row would be bad.

  This is useful when you need to coordinate with other systems in a
  non-transactional way. E.g. letting other parts of the system know that
  files on disk (or in S3 or ...) were created/deleted, since a database
  rollback wouldn't unlink/revive the files.

- Audit logging, when you want to log in a way that isn't undone by rolling
  back the transaction - just flushing every pg_logical_emit_message() would
  increase the WAL flush rate many times, because instead of once per
  transaction, you'd now flush once per modified row. It'd basically make it
  impractical to use for such things.

- Optimistic locking. Emitting things that need to be locked on logical
  replicas, to be able to commit on the primary. A pre-commit hook would wait
  for the WAL to be replayed sufficiently - but only once per transaction,
  not once per object.

> Or perhaps we don't want to guarantee durability for such messages, in
> which case we don't need to fix it at all (even in master).

Well, I can see adding an option to flush, or perhaps a separate function to
flush, to master.

> The docs are not very clear on what to expect, unfortunately. It says
> that non-transactional messages are "written immediately" which I could
> interpret in either way.

Yea, the docs certainly should be improved, regardless of what we end up
with.

Greetings,

Andres Freund
Re: pg_logical_emit_message() misses a XLogFlush()
Hi,

On 2023-08-16 06:58:56 +0900, Michael Paquier wrote:
> On Tue, Aug 15, 2023 at 11:37:32AM +0200, Tomas Vondra wrote:
> > Shouldn't the flush be done only for non-transactional messages? The
> > transactional case will be flushed by regular commit flush.
>
> Indeed, that would be better. I am sending an updated patch.
>
> I'd like to backpatch that, would there be any objections to that?

Yes, I object. This would completely cripple the performance of some uses of
logical messages - a slowdown of several orders of magnitude. It's not clear
to me that flushing would be the right behaviour if it weren't released, but
it certainly doesn't seem right to make such a change in a minor release.

Greetings,

Andres Freund
Re: run pgindent on a regular basis / scripted manner
Hi,

On 2023-08-12 17:03:37 -0400, Andrew Dunstan wrote:
> On 2023-08-11 Fr 19:02, Tom Lane wrote:
> > Peter Geoghegan writes:
> > > My workflow up until now has avoiding making updates to typedefs.list
> > > in patches. I only update typedefs locally, for long enough to indent
> > > my code. The final patch doesn't retain any typedefs.list changes.
> > Yeah, I've done the same and will have to stop.
> >
> > > I guess that I can't do that anymore. Hopefully maintaining the
> > > typedefs.list file isn't as inconvenient as it once seemed to me to
> > > be.
> > I don't think it'll be a problem. If your rule is "add new typedef
> > names added by your patch to typedefs.list, keeping them in
> > alphabetical order" then it doesn't seem very complicated, and
> > hopefully conflicts between concurrently-developed patches won't
> > be common.
>
> My recollection is that missing typedefs cause indentation that kinda sticks
> out like a sore thumb.
>
> The reason we moved to a buildfarm based typedefs list was that some
> typedefs are platform dependent, so any list really needs to be the union of
> the found typedefs on various platforms, and the buildfarm was a convenient
> vehicle for doing that. But that doesn't mean you shouldn't manually add a
> typedef you have added in your code.

It's a somewhat annoying task though: find all the typedefs, add them to the
right place in the file (we have an out of order entry right now). I think a
script that *adds* (but doesn't remove) local typedefs would make this less
painful.

Greetings,

Andres Freund
Re: A failure in 031_recovery_conflict.pl on Debian/s390x
Hi,

On 2023-08-12 15:50:24 +1200, Thomas Munro wrote:
> Thanks. I realised that it's easy enough to test that theory about
> cleanup locks by hacking ConditionalLockBufferForCleanup() to return
> false randomly. Then the test occasionally fails as described. Seems
> like we'll need to fix that test, but it's not evidence of a server
> bug, and my signal handler refactoring patch is in the clear. Thanks
> for testing it!

WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?
After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
passes even after I make ConditionalLockBufferForCleanup() fail 100%.

Greetings,

Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi,

On 2023-08-08 12:45:05 +0900, Masahiko Sawada wrote:
> > I think there could be a quite simple fix: Track by how much we've extended
> > the relation previously in the same bistate. If we already extended by many
> > blocks, it's very likey that we'll do so further.
> >
> > A simple prototype patch attached. The results for me are promising. I
> > copied
> > a smaller file [1], to have more accurate throughput results at shorter runs
> > (15s).
>
> Thank you for the patch!

Attached is a mildly updated version of that patch. No functional changes,
just polished comments and added a commit message.

> > Any chance you could your benchmark? I don't see as much of a regression vs
> > 16
> > as you...
>
> Sure. The results are promising for me too:
>
> nclients = 1, execution time = 13.743
> nclients = 2, execution time = 7.552
> nclients = 4, execution time = 4.758
> nclients = 8, execution time = 3.035
> nclients = 16, execution time = 2.172
> nclients = 32, execution time = 1.959
> nclients = 64, execution time = 1.819
> nclients = 128, execution time = 1.583
> nclients = 256, execution time = 1.631

Nice. We are consistently better than both 15 and "post integer parsing 16".

I'm really a bit baffled at myself for not using this approach from the get
go.

This also would make it much more beneficial to use a BulkInsertState in
nodeModifyTable.c, even without converting to table_multi_insert().

I was tempted to optimize RelationAddBlocks() a bit, by not calling
RelationExtensionLockWaiterCount() if we are already extending by
MAX_BUFFERS_TO_EXTEND_BY. Before this patch, it was pretty much impossible to
reach that case, because of the MAX_BUFFERED_* limits in copyfrom.c, but now
it's more common. But that probably ought to be done only in HEAD, not 16.

A related oddity: RelationExtensionLockWaiterCount()->LockWaiterCount() uses
an exclusive lock on the lock partition - which seems not at all necessary?

Unless somebody sees a reason not to, I'm planning to push this?

Greetings,

Andres Freund

>From 26e7faad65273beefeb7b40a169c1b631e204501 Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Sat, 12 Aug 2023 12:31:43 -0700
Subject: [PATCH v2] hio: Take number of prior relation extensions into account

The new relation extension logic, introduced in 00d1e02be24, could lead to
slowdowns in some scenarios. E.g., when loading narrow rows into a table
using COPY, the caller of RelationGetBufferForTuple() will only request a
small number of pages. Without concurrency, we just extended using pwritev()
in that case. However, if there is *some* concurrency, we switched between
extending by a small number of pages and a larger number of pages, depending
on the number of waiters for the relation extension logic. However, some
filesystems, XFS in particular, do not perform well when switching between
extending files using fallocate() and pwritev().

To avoid that issue, remember the number of prior relation extensions in
BulkInsertState and extend more aggressively if there were prior relation
extensions. That not just avoids the aforementioned slowdown, but also leads
to noticeable performance gains in other situations, primarily due to
extending more aggressively when there is no concurrency. I should have done
it this way from the get go.
Reported-by: Masahiko Sawada
Author: Andres Freund
Discussion: https://postgr.es/m/CAD21AoDvDmUQeJtZrau1ovnT_smN940=kp6mszngk3bq9yr...@mail.gmail.com
Backpatch: 16-, where the new relation extension code was added
---
 src/include/access/hio.h         | 13 ++---
 src/backend/access/heap/heapam.c |  1 +
 src/backend/access/heap/hio.c    | 19 +++
 3 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index 228433ee4a2..9bc563b7628 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -32,15 +32,22 @@ typedef struct BulkInsertStateData
 	Buffer		current_buf;	/* current insertion target page */
 
 	/*
-	 * State for bulk extensions. Further pages that were unused at the time
-	 * of the extension. They might be in use by the time we use them though,
-	 * so rechecks are needed.
+	 * State for bulk extensions.
+	 *
+	 * last_free..next_free are further pages that were unused at the time of
+	 * the last extension. They might be in use by the time we use them
+	 * though, so rechecks are needed.
 	 *
 	 * XXX: Eventually these should probably live in RelationData instead,
 	 * alongside targetblock.
+	 *
+	 * already_extended_by is the number of pages that this bulk inserted
+	 * extended by. If we already extended by a significant number of pages,
+	 * we can be more aggressive about extending going forward.
 	 */
 	BlockNumber next_free;
 	BlockNumber last_free;
+	uint32		already_extended_by;
 } BulkInser
Re: Fix pg_stat_reset_single_table_counters function
Hi,

On 2023-08-10 17:48:10 +0900, Masahiko Sawada wrote:
> Good catch! I've confirmed that the issue has been fixed by your patch.

Indeed.

> However, I'm not sure the added regression tests are stable since
> autovacuum workers may scan the pg_database and increment the
> statistics after resetting the stats.

What about updating the table and checking the update count is reset? That'd
not be reset by autovacuum.

Greetings,

Andres Freund
Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }
Hi,

On 2023-08-11 19:35:22 -0700, Jeff Davis wrote:
> Controlling search_path is critical for the correctness and security of
> functions. Right now, the author of a function without a SET clause has
> little ability to control the function's behavior, because even basic
> operators like "+" involve search_path. This is a big problem for, e.g.
> functions used in expression indexes which are called by any user with
> write privileges on the table.
> Motivation:
>
> I'd like to (eventually) get to safe-by-default behavior. In other
> words, the simplest function declaration should be safe for the most
> common use cases.

I'm not sure that anything based, directly or indirectly, on search_path
really is a realistic way to get there.

> To get there, we need some way to explicitly specify the less common
> cases. Right now there's no way for the function author to indicate
> that a function intends to use the session's search path. We also need
> an easier way to specify that the user wants a safe search_path ("SET
> search_path = pg_catalog, pg_temp" is arcane).

No disagreement with that. Even if I don't yet agree that your proposal is a
convincing path to "easy security for PLs" - just making the search path
stuff less arcane is good.

> And when we know more about the user's actual intent, then it will be
> easier to either form a transition plan to push users into the safer
> behavior, or at least warn strongly when the user is doing something
> dangerous (i.e. using a function that depends on the session's search
> path as part of an expression index).

I think that'd be pretty painful from a UX perspective. Having to write
e.g. operators as operator(schema, op) just sucks as an experience. And with
extensions plenty of operators will live outside of pg_catalog, so there are
plenty of things that will need qualifying. And because of things like type
coercion search, which prefers "better fitting" coercions over search path
order, you can't just put "less important" things later in search path.

I wonder if we ought to work more on "fossilizing" the result of search path
resolutions at the time functions are created, rather than requiring the user
to do so explicitly.

Most of the problem here comes down to the fact that if a user creates a
function like 'a + b' we'll not resolve the operator, the potential type
coercions etc, when the function is created - we do so when the function is
executed. We can't just store the oids at the time, because that'd end up
very fragile - tables/functions/... might be dropped and recreated etc and
thus change their oid.

But we could change the core PLs to rewrite all the queries (*) so that they
schema qualify absolutely everything, including operators and implicit type
casts. That way objects referenced by functions can still be replaced, but
search path can't be used to "inject" objects in different schemas. Obviously
it could lead to errors on some schema changes - e.g. changing a column type
might mean that a relevant cast lives in a different place than with the old
type - but I think that'll be quite rare.

Perhaps we could offer an ALTER FUNCTION ... REFRESH REFERENCES; or such?

One obvious downside of such an approach is that it requires some work with
each PL. I'm not sure that's avoidable - and I suspect that most "security
sensitive" functions are written in just a few languages.

(*) Obviously the one thing that doesn't work for is use of EXECUTE in
plpgsql and similar constructs elsewhere.
I'm not sure there's much that can be done to make that safe, but it's worth thinking about more. Greetings, Andres Freund
Re: run pgindent on a regular basis / scripted manner
Hi,

On 2023-08-11 20:11:34 -0400, Tom Lane wrote:
> Andres Freund writes:
> > On 2023-08-11 18:30:02 -0400, Tom Lane wrote:
> >> +1 for including this in CI tests
>
> > I didn't even mean CI - I meant 'make check-world' / 'meson test'. Which of
> > course would include CI automatically.
>
> Hmm. I'm allergic to anything that significantly increases the cost
> of check-world, and this seems like it'd do that.

Hm, compared to the cost of check-world it's not that large, but still,
annoying to make it larger.

We can make it a lot cheaper, but perhaps not in a general enough fashion
that it's suitable for a test.

pgindent already can query git (for --commit). We could teach pgindent to ask
git what remote branch is being tracked, and construct a list of files from
the difference between the remote branch and the local branch? That option
could do something like:

  git diff --name-only $(git rev-parse --abbrev-ref --symbolic-full-name @{upstream})

That's pretty quick, even for a relatively large delta.

> Maybe we could automate it, but not as part of check-world per se?

We should definitely do that.

Another related thing that'd be useful to script is updating typedefs.list
with the additional typedefs found locally. Right now the script for that
still lives in the

Greetings,

Andres Freund
Re: run pgindent on a regular basis / scripted manner
Hi,

On 2023-08-11 18:30:02 -0400, Tom Lane wrote:
> Peter Geoghegan writes:
> > On Fri, Aug 11, 2023 at 2:25 PM Andres Freund wrote:
> >> We could add a test that fails when there's some mis-indented code. That seems
> >> like it might catch things earlier?
>
> +1 for including this in CI tests

I didn't even mean CI - I meant 'make check-world' / 'meson test'. Which of
course would include CI automatically.

> > It definitely would. That would go a long way towards addressing my
> > concerns. But I suspect that that would run into problems that stem
> > from the fact that the buildfarm is testing something that isn't all
> > that simple. Don't typedefs need to be downloaded from some other
> > blessed buildfarm animal?
>
> No. I presume koel is using src/tools/pgindent/typedefs.list,
> which has always been the "canonical" list but up to now we've
> been lazy about maintaining it. Part of the new regime is that
> typedefs.list should now be updated on-the-fly by patches that
> add new typedefs.

Yea. Otherwise nobody else can indent reliably, without repeating the work of
adding typedefs.list entries of all the patches since the last time it was
updated in the repository.

Greetings,

Andres Freund
Re: run pgindent on a regular basis / scripted manner
Hi,

On 2023-08-11 13:59:40 -0700, Peter Geoghegan wrote:
> On Sat, Jun 17, 2023 at 7:08 AM Andrew Dunstan wrote:
> > I have set up a new buildfarm animal called koel which will run the module.
>
> I'm starting to have doubts about this policy. There have now been
> quite a few follow-up "fixes" to indentation issues that koel
> complained about. None of these fixups have been included in
> .git-blame-ignore-revs. If things continue like this then "git blame"
> is bound to become much less usable over time.

I'm not sure I buy that that's going to be a huge problem - most of the time
such fixups are pretty small compared to larger reindents.

> I don't think that it makes sense to invent yet another rule for
> .git-blame-ignore-revs, though. Will we need another buildfarm member
> to enforce that rule, too?

We could add a test that fails when there's some mis-indented code. That
seems like it might catch things earlier?

Greetings,

Andres Freund
Re: AssertLog instead of Assert in some places
Hi,

On 2023-08-11 13:19:34 -0700, Peter Geoghegan wrote:
> On Fri, Aug 11, 2023 at 12:26 PM Andres Freund wrote:
> > > For example, dealing with core dumps left behind by the regression
> > > tests can be annoying.
> >
> > Hm. I don't have a significant problem with that. But I can see it being
> > problematic. Unfortunately, short of preventing core dumps from happening,
> > I don't think we really can do much about that - whatever is running the
> > tests
> > shouldn't have privileges to change system wide settings about where core
> > dumps end up etc.
>
> I was unclear. I wasn't talking about managing core dumps. I was
> talking about using core dumps to get a simple backtrace, that just
> gives me some very basic information. I probably shouldn't have even
> mentioned core dumps, because what I'm really concerned about is the
> workflow around getting very basic information about assertion
> failures. Not around core dumps per se.
>
> The inconsistent approach needed to get a simple, useful backtrace for
> assertion failures (along with other basic information associated with
> the failure) is a real problem. Particularly when running the tests.
> Most individual assertion failures that I see are for code that I'm
> practically editing in real time. So shaving cycles here really
> matters.

Ah, yes. Agreed, obviously.

> For one thing the symbol mangling that we have around the built-in
> backtraces makes them significantly less useful. I really hope that
> your libbacktrace patch gets committed soon, since that looks like it
> would be a nice quality of life improvement, all on its own.

I've been hacking further on it:
https://github.com/anarazel/postgres/tree/backtrace

Haven't yet posted a new version. Doing it properly for frontend tools has
some dependencies on threading stuff. I'm hoping Thomas' patch for that will
go in.

Now it also intercepts segfaults and prints backtraces for them, if that's
possible (it requires libbacktrace to be async signal safe, which it isn't on
all platforms).

Where supported, a crash (distinguishing from assertion failures) will now
report something like:

process with pid: 2900527 received signal: SIGSEGV, si_code: 1, si_addr: 0xdeadbeef
[0x5628ec45212f] pg_fatalsig_handler+0x20f: ../../../../home/andres/src/postgresql/src/common/pg_backtrace.c:615
[0x7fc4b743650f] [unknown]
[0x5628ec14897c] check_root+0x19c (inlined): ../../../../home/andres/src/postgresql/src/backend/main/main.c:393
[0x5628ec14897c] main+0x19c: ../../../../home/andres/src/postgresql/src/backend/main/main.c:183

after I added
  *(volatile int*)0xdeadbeef = 1;
to check_root().

For signals sent by users, it'd show the pid of the process sending the
signal on most OSs. I really would like some generalized infrastructure for
that, so that we can report for things like query cancellations.

As the patch stands, the only OS that doesn't yet have useful "backtrace on
crash" support with that is windows, as libbacktrace unfortunately isn't
signal safe on windows. But it'd still provide useful backtraces on
Assert()s. I haven't yet figured out whether/when it's required to be signal
safe on windows though - crashes are intercepted by
SetUnhandledExceptionFilter() - I don't understand the precise constraints of
that. Plenty of people seem to generate backtraces on crashes on windows
using that, without concerns about signal safety and the like.

Currently Macos CI doesn't use libbacktrace, but as it turns out
backtrace_symbols() on macos is a heck of a lot better than on glibc
platforms.
CI for windows with visual studio doesn't have libbacktrace installed yet
(and has the aforementioned signal safety issue), I think it's installed for
windows w/ mingw.

> It would also be great if the tests spit out information about
> assertion failures that was reasonably complete (backtrace without any
> mangling, query text included, other basic context), reliably and
> uniformly -- it shouldn't matter if it's from TAP or pg_regress test
> SQL scripts.

Hm. What other basic context are you thinking of?

Pid is obvious. I guess backend type could be useful too, but that can
normally be inferred from the stack trace pretty easily. Application name
could be useful to know which test caused the crash.

I'm a bit wary about trying to print things like query text - what if that
string points to memory not terminated by a \0? I guess we could use an
approach similar to pgstat_get_crashed_backend_activity().

One issue with reporting crashes from signal handlers is that the obvious way
to make that signal safe (lots of small writes) leads to the potential for
interspersed lines. It's probably worth having a stat
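For reference, a minimal standalone sketch of the "print a backtrace from a
fatal signal handler" approach being discussed, using glibc's/macOS's
backtrace() and backtrace_symbols_fd() rather than libbacktrace; all names
here are illustrative and not taken from the patch:

#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/*
 * Print a backtrace for a fatal signal, then re-raise it with the default
 * disposition so a core dump is still produced. backtrace_symbols_fd()
 * writes straight to a file descriptor, avoiding malloc() in the handler.
 * Note that backtrace() itself may allocate on first use, so real code
 * typically calls it once at startup to pre-load the unwinder.
 */
static void
fatal_signal_handler(int signo)
{
	void	   *frames[100];
	int			nframes;

	nframes = backtrace(frames, 100);
	backtrace_symbols_fd(frames, nframes, STDERR_FILENO);

	signal(signo, SIG_DFL);
	raise(signo);
}

static void
install_fatal_signal_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = fatal_signal_handler;
	sigemptyset(&sa.sa_mask);

	sigaction(SIGSEGV, &sa, NULL);
	sigaction(SIGBUS, &sa, NULL);
	sigaction(SIGABRT, &sa, NULL);
}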
Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
Hi,

On 2023-08-12 07:51:09 +1200, Thomas Munro wrote:
> Oh, I see what's happening. Maybe commit b91dd9de wasn't the best
> idea, but bc971f4025c broke an assumption, since it doesn't use
> ConditionVariableSleep(). That is confusing the signal forwarding
> logic which expects to find our entry in the wait list in the common
> case.

Hm, I guess I got confused by the cv code once more. I thought that
ConditionVariableCancelSleep() would wake us up anyway, because once we
return from ConditionVariableSleep(), we'd be off the list. But I now realize
(and I think not for the first time), that ConditionVariableSleep() always
puts us *back* on the list.

Leaving aside the issue in this thread, isn't always adding us back into the
list bad from a contention POV alone - it doubles the write traffic on the CV
and is guaranteed to cause contention for ConditionVariableBroadcast(). I
wonder if this explains some performance issues I've seen in the past.

What if we instead reset cv_sleep_target once we've been taken off the list?
I think it'd not be too hard to make it safe to do the proclist_contains()
without the spinlock. LWLocks have something similar, there we solve it by
this sequence:

		proclist_delete(&wakeup, iter.cur, lwWaitLink);

		/*
		 * Guarantee that lwWaiting being unset only becomes visible once the
		 * unlink from the link has completed. Otherwise the target backend
		 * could be woken up for other reason and enqueue for a new lock - if
		 * that happens before the list unlink happens, the list would end up
		 * being corrupted.
		 *
		 * The barrier pairs with the LWLockWaitListLock() when enqueuing for
		 * another lock.
		 */
		pg_write_barrier();
		waiter->lwWaiting = LW_WS_NOT_WAITING;
		PGSemaphoreUnlock(waiter->sem);

I guess this means we'd need something like lwWaiting for CVs as well.

> From a85b2827f4500bc2e7c533feace474bc47086dfa Mon Sep 17 00:00:00 2001
> From: Thomas Munro
> Date: Sat, 12 Aug 2023 07:06:08 +1200
> Subject: [PATCH] De-pessimize ConditionVariableCancelSleep().
>
> Commit b91dd9de was concerned with a theoretical problem with our
> non-atomic condition variable operations. If you stop sleeping, and
> then cancel the sleep in a separate step, you might be signaled in
> between, and that could be lost. That doesn't matter for callers of
> ConditionVariableBroadcast(), but callers of ConditionVariableSignal()
> might be upset if a signal went missing like this.

FWIW I suspect at least some of the places that'd have a problem without that
forwarding might also be racy with it.

> New idea: ConditionVariableCancelSleep() can just return true if you've
> been signaled. Hypothetical users of ConditionVariableSignal() would
> then still have a way to deal with rare lost signals if they are
> concerned about that problem.

Sounds like a plan to me.

Greetings,

Andres Freund
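For context, this is the standard condition variable wait pattern, plus a
sketch of how a caller could use the proposed boolean return of
ConditionVariableCancelSleep() to pass a possibly-consumed signal on; the
done flag and the use of PG_WAIT_EXTENSION are placeholders, not code from
the patch:

#include "postgres.h"

#include "storage/condition_variable.h"
#include "utils/wait_event.h"

/*
 * Sketch only: wait until *done is set, then forward a stray wakeup to
 * another waiter if we were signaled after we stopped checking the CV.
 */
static void
wait_for_condition(ConditionVariable *cv, volatile bool *done)
{
	ConditionVariablePrepareToSleep(cv);
	while (!*done)
		ConditionVariableSleep(cv, PG_WAIT_EXTENSION);

	/*
	 * With the proposed change, ConditionVariableCancelSleep() returns true
	 * if we were signaled; pass the wakeup on so a ConditionVariableSignal()
	 * aimed at a single waiter is not lost.
	 */
	if (ConditionVariableCancelSleep())
		ConditionVariableSignal(cv);
}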
Re: AssertLog instead of Assert in some places
Hi,

On 2023-08-11 11:56:27 -0700, Peter Geoghegan wrote:
> On Fri, Aug 11, 2023 at 11:23 AM Andres Freund wrote:
> > > Couldn't you say the same thing about defensive "can't happen" ERRORs?
> > > They are essentially a form of assertion that isn't limited to
> > > assert-enabled builds.
> >
> > Yes. A lot of them I hate with the passion of a thousand suns ;). "Oh,
> > our transaction state machinery is confused. Yes, let's just continue going
> > through the same machinery again, that'll resolve it.".
>
> I am not unsympathetic to Ashutosh's point about conventional ERRORs
> being easier to deal with when debugging your own code, during initial
> development work.

Oh, I am as well - I just don't think it's a good idea to introduce "log +
error" assertions to core postgres, because it seems very likely that they'll
end up getting used a lot.

> But that seems like a problem with the tooling in other areas.

Agreed.

> For example, dealing with core dumps left behind by the regression
> tests can be annoying.

Hm. I don't have a significant problem with that. But I can see it being
problematic. Unfortunately, short of preventing core dumps from happening, I
don't think we really can do much about that - whatever is running the tests
shouldn't have privileges to change system wide settings about where core
dumps end up etc.

> Don't you also hate it when there's a regression.diffs that just shows 20k
> lines of subtractions? Perhaps you don't -- perhaps your custom setup makes
> it quick and easy to get relevant information about what actually went
> wrong.

I do really hate that. At the very least we should switch to using
restart-after-crash by default, not start new tests once the server has
crashed, and do a waitpid(postmaster, WNOHANG) after each failing test, to
see if the reason the test failed is that the backend died.

> But it seems like that sort of thing could be easier to deal with by
> default, without using custom shell scripts or anything -- particularly for
> those of us that haven't been Postgres hackers for eons.

Yes, wholeheartedly agreed.

Greetings,

Andres Freund
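A minimal sketch of the waitpid() check mentioned above, against plain POSIX;
the function name and how the test driver would call it (e.g. pg_regress
checking the postmaster it spawned after each failed test) are assumptions:

#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Has the given child process (e.g. the postmaster started by the test
 * driver) exited? WNOHANG makes this a non-blocking check.
 */
static bool
child_has_exited(pid_t child_pid)
{
	int			status;
	pid_t		rc = waitpid(child_pid, &status, WNOHANG);

	/* 0: still running; child_pid: it exited; -1: error (not our child) */
	return rc == child_pid;
}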
Re: AssertLog instead of Assert in some places
Hi,

On 2023-08-11 11:14:34 -0700, Peter Geoghegan wrote:
> On Fri, Aug 11, 2023 at 10:57 AM Andres Freund wrote:
> > I am quite strongly against this. This will lead to assertions being hit in
> > tests without that being noticed, e.g. because they happen in a background
> > process that just restarts.
>
> Couldn't you say the same thing about defensive "can't happen" ERRORs?
> They are essentially a form of assertion that isn't limited to
> assert-enabled builds.

Yes. A lot of them I hate with the passion of a thousand suns ;). "Oh, our
transaction state machinery is confused. Yes, let's just continue going
through the same machinery again, that'll resolve it.".

Even worse are the ones that are just WARNINGS. Like "Oh, something wrote
beyond the length of allocated memory, that's something to issue a WARNING
about and happily continue". There've been people arguing in the past that
it's good for robustness to do stuff like that. I think it's exactly the
opposite.

Now, don't get me wrong, there are plenty of cases where a "this shouldn't
happen" elog(ERROR) is appropriate...

> I have sometimes thought that it would be handy if there was a variant
> of "can't happen" ERRORs that took on the structure of an assertion.

What are you imagining? Basically something that generates an elog(ERROR)
with the stringified expression as part of the error message?

Greetings,

Andres Freund
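A minimal sketch of what such a variant could look like - the macro name is
made up, and this is just the "stringify the expression into an elog(ERROR)"
idea, not an agreed design:

#include "postgres.h"

/*
 * Like Assert(), but raises a regular ERROR in all builds, with the failed
 * expression in the message, so only the current transaction/backend is
 * aborted rather than the whole cluster.
 */
#define AssertErr(condition) \
	do { \
		if (!(condition)) \
			elog(ERROR, "assertion failed: \"%s\", file: \"%s\", line: %d", \
				 #condition, __FILE__, __LINE__); \
	} while (0)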
Re: [PATCH] Support static linking against LLVM
Hi,

On 2023-08-11 12:59:27 -0500, Marcelo Juchem wrote:
> In my case, my product has a very controlled environment.
> We build all our infrastructure from source and we avoid dynamic linking by
> design, except where technically not viable (e.g.: pgsql extensions).
>
> LLVM is one of the libraries we're specifically required to statically link.
> Unfortunately I can't share the specifics of why that's the case.

If you build llvm with just static lib support it should work as-is?

> On a side note, I'd like to take this opportunity to ask you if there's any
> work being done towards migrating away from deprecated LLVM APIs.
> If there's no work being done on that front, I might take a stab at it if
> there's any interest from the PostgreSQL community in that contribution.

Yes:
https://postgr.es/m/CA%2BhUKGKNX_%3Df%2B1C4r06WETKTq0G4Z_7q4L4Fxn5WWpMycDj9Fw%40mail.gmail.com

> Migrating to the new APIs will have PostgreSQL require at the minimum
> version 8 of LLVM (released in March 2019).

I think Thomas' patch abstracts it enough so that older versions continue to
work...

Greetings,

Andres Freund
Re: LLVM 16 (opaque pointers)
Hi,

On 2023-05-21 15:01:41 +1200, Thomas Munro wrote:
> *For anyone working with this type of IR generation code and
> questioning their sanity, I can pass on some excellent advice I got
> from Andres: build LLVM yourself with assertions enabled, as they
> catch some classes of silly mistake that otherwise just segfault
> inscrutably on execution.

Hm. I think we need a buildfarm animal with an assertion enabled llvm 16 once
we merge this. I think after an upgrade my buildfarm machine has the
necessary resources.

> @@ -150,7 +150,7 @@ llvm_compile_expr(ExprState *state)
>
>  	/* create function */
>  	eval_fn = LLVMAddFunction(mod, funcname,
> -							  llvm_pg_var_func_type("TypeExprStateEvalFunc"));
> +							  llvm_pg_var_func_type("ExecInterpExprStillValid"));

Hm, that's a bit ugly. But ...

> @@ -77,9 +80,44 @@ extern Datum AttributeTemplate(PG_FUNCTION_ARGS);
>  Datum
>  AttributeTemplate(PG_FUNCTION_ARGS)
>  {
> +	PGFunction	fp PG_USED_FOR_ASSERTS_ONLY;
> +
> +	fp = &AttributeTemplate;
>  	PG_RETURN_NULL();
>  }

Other parts of the file do this by putting the functions into
referenced_functions[], i'd copy that here and below.

> +void
> +ExecEvalSubroutineTemplate(ExprState *state,
> +						   struct ExprEvalStep *op,
> +						   ExprContext *econtext)
> +{
> +	ExecEvalSubroutine fp PG_USED_FOR_ASSERTS_ONLY;
> +
> +	fp = &ExecEvalSubroutineTemplate;
> +}
> +
> +extern bool ExecEvalBoolSubroutineTemplate(ExprState *state,
> +										   struct ExprEvalStep *op,
> +										   ExprContext *econtext);
> +bool
> +ExecEvalBoolSubroutineTemplate(ExprState *state,
> +							   struct ExprEvalStep *op,
> +							   ExprContext *econtext)
> +{
> +	ExecEvalBoolSubroutine fp PG_USED_FOR_ASSERTS_ONLY;
> +
> +	fp = &ExecEvalBoolSubroutineTemplate;
> +	return false;
> +}
> +

Thanks for working on this!

Greetings,

Andres Freund
Re: LLVM 16 (opaque pointers)
Hi,

On 2023-08-10 16:56:54 +0200, Ronan Dunklau wrote:
> I tried my hand at backporting it to previous versions, and not knowing
> anything about it made me indeed question my sanity. It's quite easy for PG
> 15, 14, 13. PG 12 is nothing insurmontable either, but PG 11 is a bit hairier
> most notably due to to the change in fcinfo args representation. But I guess
> that's also a topic for another day :-)

Given that 11 is about to be EOL, I don't think it's worth spending the time
to support a new LLVM version for it.

Greetings,

Andres Freund
Re: AssertLog instead of Assert in some places
Hi,

On 2023-08-11 17:59:37 +0530, Ashutosh Bapat wrote:
> Most of the Asserts are recoverable by rolling back the transaction
> without crashing the backend. So an elog(ERROR, ) is enough. But just
> by themselves elogs are compiled into non-debug binary and the
> condition check can waste CPU cycles esp. conditions are mostly true
> like the ones we use in Assert.
>
> Attached patch combines Assert and elog(ERROR, ) so that when an
> Assert is triggered in assert-enabled binary, it will throw an error
> while keeping the backend intact. Thus it does not affect gdb session
> or psql session. These elogs do not make their way to non-assert
> binary so do not make it to production like Assert.

I am quite strongly against this. This will lead to assertions being hit in
tests without that being noticed, e.g. because they happen in a background
process that just restarts.

Greetings,

Andres Freund
Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
Hi,

On 2023-08-11 15:31:43 +0200, Tomas Vondra wrote:
> That's an awful lot of CPU for a cluster doing essentially "nothing"
> (there's no WAL to decode/replicate, etc.). It usually disappears after
> a couple seconds, but sometimes it's a rather persistent state.

Ugh, that's not great.

> The profile from the walsender processes looks like this:
>
>     --99.94%--XLogSendLogical
>               |
>               |--99.23%--XLogReadRecord
>               |          XLogReadAhead
>               |          XLogDecodeNextRecord
>               |          ReadPageInternal
>               |          logical_read_xlog_page
>               |          |
>               |          |--97.80%--WalSndWaitForWal
>               |          |          |
>               |          |          |--68.48%--WalSndWait
>
> It seems to me the issue is in WalSndWait, which was reworked to use
> ConditionVariableCancelSleep() in bc971f4025c. The walsenders end up
> waking each other in a busy loop, until the timing changes just enough
> to break the cycle.

IMO ConditionVariableCancelSleep()'s behaviour of waking up additional
processes can nearly be considered a bug, at least when combined with
ConditionVariableBroadcast(). In that case the wakeup is never needed, and it
can cause situations like this, where condition variables basically
deteriorate to a busy loop.

I hit this with AIO as well. I've been "solving" it by adding a
ConditionVariableCancelSleepEx(), which has an only_broadcasts argument. I'm
inclined to think that any code that needs the forwarding behaviour is pretty
much buggy.

Greetings,

Andres Freund
Re: [PATCH] Support static linking against LLVM
Hi,

On 2023-08-10 14:45:47 -0500, Marcelo Juchem wrote:
> By default, PostgreSQL doesn't explicitly choose whether to link
> statically or dynamically against LLVM when LLVM JIT is enabled (e.g.:
> `./configure --with-llvm`).
>
> `llvm-config` will choose to dynamically link by default.
>
> In order to statically link, one must pass `--link-static` to
> `llvm-config` when listing linker flags (`--ldflags`) and libraries
> (`--libs`).
>
> This patch enables custom flags to be passed to `llvm-config` linker
> related invocations through the environment variable
> `LLVM_CONFIG_LINK_ARGS`.
>
> To statically link against LLVM it suffices, then, to call `configure`
> with environment variable `LLVM_CONFIG_LINK_ARGS=--link-static`.

I'm not opposed to it, but I'm not sure I see much need for it either. Do you
have a specific use case for this?

Greetings,

Andres Freund
Re: Add PG CI to older PG releases
Hi,

On 2023-08-10 18:09:03 -0400, Tom Lane wrote:
> Nazir Bilal Yavuz writes:
> > PG CI is added starting from PG 15, adding PG CI to PG 14 and below
> > could be beneficial. So, firstly I tried adding it to the
> > REL_14_STABLE branch. If that makes sense, I will try to add PG CI to
> > other old PG releases.
>
> I'm not actually sure this is worth spending time on. We don't
> do new development in three-release-back branches, nor do we have
> so many spare CI cycles that we can routinely run CI testing in
> those branches [1].

I find it much less stressful to backpatch if I can test the patches via CI
first. I don't think I'm alone in that - Alvaro for example has previously
done this in a more limited fashion (no windows):
https://postgr.es/m/20220928192226.4c6zeenujaoqq4bq%40alvherre.pgsql

Greetings,

Andres Freund
Re: Add PG CI to older PG releases
Hi,

On 2023-08-10 19:55:15 +0300, Nazir Bilal Yavuz wrote:
> v2-0001-windows-Only-consider-us-to-be-running-as-service.patch is an
> older commit (59751ae47fd43add30350a4258773537e98d4063). A couple of
> tests were failing without this because the log file was empty and the
> tests were comparing strings with the log files (like in the first
> mail).

Hm. I'm a bit worried about backpatching this one, it's not a small
behavioural change. But the prior behaviour is pretty awful and could very
justifiably be viewed as a bug.

At the very least this would need to be combined with

commit 950e64fa46b164df87b5eb7c6e15213ab9880f87
Author: Andres Freund
Date:   2022-07-18 17:06:34 -0700

    Use STDOUT/STDERR_FILENO in most of syslogger.

Greetings,

Andres Freund
Re: Support to define custom wait events for extensions
Hi,

On 2023-08-09 20:10:42 +0900, Masahiro Ikeda wrote:
> * Is there any way to not force extensions that don't use shared
> memory for their use like dblink to acquire AddinShmemInitLock?;

I think the caller shouldn't need to deal with AddinShmemInitLock at all.

Greetings,

Andres Freund
Re: Remove distprep
Hi,

Thanks for sending the -packagers email Peter!

On 2023-08-09 16:25:28 +0200, Christoph Berg wrote:
> Re: Tom Lane
> > Meh ... the fact that it works fine for you doesn't mean it will work
> > fine elsewhere. Since we're trying to get out from under maintaining
> > the autoconf build system, I don't think it makes sense to open
> > ourselves up to having to do more work on it. A policy of benign
> > neglect seems appropriate to me.
>
> Understood, I was just pointing out there are more types of generated
> files in there.

The situation for configure is somewhat different, due to being maintained in
the repository, rather than just being included in the tarball...

Greetings,

Andres Freund
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?
Hi,

On 2023-08-08 22:29:50 -0400, Robert Treat wrote:
> In case it's helpful, from an SPI oriented perspective, $7K/month is
> probably an order of magnitude more than what we can sustain, so I
> don't see a way to make that work without some kind of additional
> magic that includes other non-profits and/or commercial companies
> changing donation habits between now and September.

Yea, I think that'd make no sense, even if we could afford it. I think the
patches I've written should drop it to 1/2 already. Thomas added some
throttling to push it down further.

> Purchasing a couple of mac-mini's (and/or similar gear) would be near
> trivial though, just a matter of figuring out where/how to host it
> (but I think infra can chime in on that if that's what get's decided).

Cool. Because of the limitation of running two VMs at a time on macos and the
comparatively low cost of mac minis, it seems they beat alternative models by
a fair bit.

Pginfra/sysadmin: ^

Based on being off by an order of magnitude, as you mentioned earlier, it
seems that:

1) reducing test runtime etc, as already in progress
2) getting 2 mac minis as runners
3) using ~350 USD / mo in GCP costs for windows, linux, freebsd (*)

Would be viable for a month or three? I hope we can get some cloud providers
to chip in for 3), but I'd like to have something in place that doesn't
depend on that.

Given the cost of macos VMs at AWS, the only one of the big cloud providers
to have macos instances, I think we'd burn pointlessly quickly through
credits if we used VMs for that.

(*) I think we should be able to get below that, but ...

> The other likely option would be to seek out cloud credits from one of
> the big three (or others); Amazon has continually said they would be
> happy to donate more credits to us if we had a use, and I think some
> of the other hosting providers have said similarly at times; so we'd
> need to ask and hope it's not too bureaucratic.

Yep. I tried to start that process within microsoft, fwiw. Perhaps Joe and
Jonathan know how to start within AWS? And perhaps Noah inside GCP? It'd be
the least work to get it up and running in GCP, as it's already running
there, but should be quite doable at the others as well.

Greetings,

Andres Freund
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?
Hi,

On 2023-08-08 11:58:25 +0300, Heikki Linnakangas wrote:
> On 08/08/2023 05:15, Andres Freund wrote:
> > With the improvements detailed below, cirrus' free CI would last
> > about ~65 runs / month.
>
> I think that's plenty.

Not so sure, I would regularly exceed it, I think. But it definitely will
suffice for more casual contributors.

> > Potential paths forward for cfbot, in addition to the above:
> >
> > - Pay for compute / ask the various cloud providers to grant us compute
> >   credits. At least some of the cloud providers can be used via cirrus-ci.
> >
> > - Host (some) CI runners ourselves. Particularly with macos and windows,
> >   that could provide significant savings.
> >
> > - Build our own system, using buildbot, jenkins or whatnot.
> >
> >
> > Opinions as to what to do?
>
> The resources for running our own system isn't free either. I'm sure we can
> get sponsors for the cirrus-ci credits, or use donations.

As outlined in my reply to Alvaro, just using credits likely is financially
not viable...

> > 5) Move use of -Dsegsize_blocks=6 from macos to linux
> >
> >    Macos is expensive, -Dsegsize_blocks=6 slows things down. Alternatively
> >    we could stop covering both meson and autoconf segsize_blocks. It does
> >    affect runtime on linux as well.
>
> Could we have a comment somewhere on why we use -Dsegsize_blocks on these
> particular CI runs? It seems pretty random. I guess the idea is to have one
> autoconf task and one meson task with that option, to check that the
> autoconf/meson option works?

Hm, some of that was in the commit message, but I should have added it to
.cirrus.yml as well.

Normally, the "relation segment" code basically has no coverage in our tests,
because we (quite reasonably) don't generate tables large enough. We've had
plenty of bugs that we didn't notice due to the code not being exercised
much. So it seemed useful to add CI coverage, by making the segments very
small. I chose the tasks by looking at how long they took at the time, I
think. Adding them to the slower ones.

> > 6) Disable write cache flushes on windows
> >
> >    It's a bit ugly to do this without using the UI... Shaves off about 30s
> >    from the tests.
>
> A brief comment would be nice: "We don't care about persistence over hard
> crashes in the CI, so disable write cache flushes to speed it up."

Turns out that patch doesn't work on its own anyway, at least not
reliably... I tested it by interactively logging into a windows vm and
testing it there. It doesn't actually seem to suffice when run in isolation,
because the relevant registry key doesn't yet exist. I haven't yet figured
out the magic incantations for adding the missing "intermediary", but I'm
getting there...

Greetings,

Andres Freund
Re: Support to define custom wait events for extensions
Hi,

On 2023-08-09 08:03:29 +0900, Michael Paquier wrote:
> On Tue, Aug 08, 2023 at 03:59:54PM -0700, Andres Freund wrote:
> > On 2023-08-08 08:54:10 +0900, Michael Paquier wrote:
> >> - WaitEventExtensionShmemInit() should gain a dshash_create(), to make
> >> sure that the shared table is around, and we are going to have a
> >> reference to it in WaitEventExtensionCounterData, saved from
> >> dshash_get_hash_table_handle().
> >
> > I'm not even sure it's worth using dshash here. Why don't we just create a
> > decently sized dynahash (say 128 entries) in shared memory? We overallocate
> > shared memory by enough that there's a lot of headroom for further entries,
> > in the rare cases they're needed.
>
> The question here would be how many slots the most popular extensions
> actually need, but that could always be sized up based on the
> feedback.

On a default initdb (i.e. 128MB s_b), after explicitly disabling huge pages,
we over-allocate shared memory by 1922304 bytes, according to
pg_shmem_allocations. We allow that memory to be used for things like shared
hashtables that grow beyond their initial size. So even if the hash table's
static size is too small, there's lots of room to grow, even on small
systems.

Just because it's somewhat interesting: With huge pages available and not
disabled, we over-allocate by 3364096 bytes.

Greetings,

Andres Freund
Re: Support to define custom wait events for extensions
Hi,

On 2023-08-08 08:54:10 +0900, Michael Paquier wrote:
> - WaitEventExtensionShmemInit() should gain a dshash_create(), to make
> sure that the shared table is around, and we are going to have a
> reference to it in WaitEventExtensionCounterData, saved from
> dshash_get_hash_table_handle().

I'm not even sure it's worth using dshash here. Why don't we just create a
decently sized dynahash (say 128 entries) in shared memory? We overallocate
shared memory by enough that there's a lot of headroom for further entries,
in the rare cases they're needed.

> We are going to need a fixed size for these custom strings, but perhaps a
> hard limit of 256 characters for each entry of the hash table is more than
> enough for most users?

I'd just use NAMEDATALEN.

- Andres
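A minimal sketch of the fixed-size shared-memory dynahash being suggested,
using the existing ShmemInitHash() API; the struct, table name, sizes and
init function are illustrative only, not the actual patch, and the init
function would be called from shared-memory initialization:

#include "postgres.h"

#include "storage/shmem.h"
#include "utils/hsearch.h"

/* one entry per custom wait event, keyed by a NAMEDATALEN-sized name */
typedef struct WaitEventExtensionEntry
{
	char		name[NAMEDATALEN];	/* hash key */
	uint16		event_id;			/* assigned wait event number */
} WaitEventExtensionEntry;

static HTAB *WaitEventExtensionHash = NULL;

static void
WaitEventExtensionHashInit(void)
{
	HASHCTL		info;

	info.keysize = NAMEDATALEN;
	info.entrysize = sizeof(WaitEventExtensionEntry);

	/*
	 * Shared dynahash tables created via ShmemInitHash() can grow past their
	 * initial size into the shared-memory headroom mentioned above, so a
	 * modest fixed size is fine in the common case.
	 */
	WaitEventExtensionHash = ShmemInitHash("WaitEventExtension hash",
										   128, 128,
										   &info,
										   HASH_ELEM | HASH_STRINGS);
}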
Re: Configurable FP_LOCK_SLOTS_PER_BACKEND
Hi,

On 2023-08-08 16:44:37 -0400, Robert Haas wrote:
> On Mon, Aug 7, 2023 at 6:05 PM Andres Freund wrote:
> > I think the biggest flaw of the locking scheme is that the LockHash locks
> > protect two, somewhat independent, things:
> > 1) the set of currently lockable objects, i.e. the entries in the hash
> > table [partition]
> > 2) the state of all the locks [in a partition]
> >
> > It'd not be that hard to avoid the shared hashtable lookup in a number of
> > cases, e.g. by keeping LOCALLOCK entries around for longer, as I suggest
> > above. But we can't, in general, avoid the lock on the partition anyway, as
> > the each lock's state is also protected by the partition lock.
>
> Yes, and that's a huge problem. The main selling point of the whole
> fast-path mechanism is to ease the pressure on the lock manager
> partition locks, and if we did something like what you described in
> the previous email without changing the locking regimen, we'd bring
> all of that contention back. I'm pretty sure that would suck.

Yea - I tried to outline how I think we could implement the fastpath locking
scheme in a less limited way in the earlier email, that I had quoted above
this bit. Here I was pontificating on what we possibly should do in addition
to that. I think even if we had "unlimited" fastpath locking, there's still
enough pressure on the lock manager locks that it's worth improving the
overall locking scheme.

Greetings,

Andres Freund
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?
Hi,

On 2023-08-08 18:34:58 +0200, Alvaro Herrera wrote:
> On 2023-Aug-08, Andres Freund wrote:
>
> > Given the cost of macos, it seems like it'd be by far the most affordable
> > to just buy 1-2 mac minis (2x ~660USD) and stick them in a shelf somewhere,
> > as persistent runners. Cirrus has builtin macos virtualization support -
> > but can only host two VMs on each mac, due to macos licensing restrictions.
> > A single mac mini would suffice to keep up with our unoptimized monthly
> > runtime (although there likely would be some overhead).
>
> If using persistent workers is an option, maybe we should explore that.
> I think we could move all or some of the Linux - Debian builds to
> hardware that we already have in shelves (depending on how much compute
> power is really needed.)

(76+830+860+935)/((365/12)*24) = 3.7

3.7 instances with 4 "vcores" are busy 100% of the time. So we'd need at
least ~16 cpu threads - I think cirrus sometimes uses instances that disable
HT, so it'd perhaps be 16 cores actually.

> I think using other OSes is more difficult, mostly because I doubt we
> want to deal with licenses; but even FreeBSD might not be a realistic
> option, at least not in the short term.

They can be VMs, so that shouldn't be a big issue.

> >                    task_name                    |    sum
> > ------------------------------------------------+------------
> >  FreeBSD - 13 - Meson                           | 1017:56:09
> >  Windows - Server 2019, MinGW64 - Meson         | 00:00:00
> >  SanityCheck                                    | 76:48:41
> >  macOS - Ventura - Meson                        | 873:12:43
> >  Windows - Server 2019, VS 2019 - Meson & ninja | 1251:08:06
> >  Linux - Debian Bullseye - Autoconf             | 830:17:26
> >  Linux - Debian Bullseye - Meson                | 860:37:21
> >  CompilerWarnings                               | 935:30:35
> > (8 rows)
>
> moving just Debian, that might alleviate 76+830+860+935 hours from the
> Cirrus infra, which is ~46%. Not bad.
>
> (How come Windows - Meson reports allballs?)

It's mingw64, which we've marked as "manual", because we didn't have the cpu
cycles to run it.

Greetings,

Andres Freund
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?
Hi,

On 2023-08-08 10:25:52 -0500, Tristan Partin wrote:
> On Mon Aug 7, 2023 at 9:15 PM CDT, Andres Freund wrote:
> > FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
> > like the following (depends on caching etc):
> >
> > task                      costs in credits
> > linux-sanity:             0.01
> > linux-compiler-warnings:  0.05
> > linux-meson:              0.07
> > freebsd:                  0.08
> > linux-autoconf:           0.09
> > windows:                  0.18
> > macos:                    0.28
> >
> > total task runtime is 40.8
> > cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month
>
> I am not in the loop on the autotools vs meson stuff. How much longer do we
> anticipate keeping autotools around?

I think it depends in what fashion. We've been talking about supporting
building out-of-tree modules with "pgxs" for at least a 5 year support
window. But the replacement isn't yet finished [1], so that clock hasn't yet
started ticking.

> Seems like it could be a good opportunity to reduce some CI usage if
> autotools were finally dropped, but I know there are still outstanding tasks
> to complete.
>
> Back of the napkin math says autotools is about 12% of the credit cost,
> though I haven't looked to see if linux-meson and linux-autotools are 1:1.

The autoconf task is actually doing quite useful stuff right now, leaving the
use of configure aside, as it builds with address sanitizer. Without that
it'd be a lot faster. But we'd lose, imo quite important, coverage.

The tests would run a bit faster with meson, but it'd be overall a difference
on the margins.

Greetings,

Andres Freund

[1] https://github.com/anarazel/postgres/tree/meson-pkgconfig
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?
Hi,

On 2023-08-08 16:28:49 +0200, Peter Eisentraut wrote:
> On 08.08.23 04:15, Andres Freund wrote:
> > Potential paths forward for cfbot, in addition to the above:
> >
> > - Pay for compute / ask the various cloud providers to grant us compute
> >   credits. At least some of the cloud providers can be used via cirrus-ci.
> >
> > - Host (some) CI runners ourselves. Particularly with macos and windows,
> >   that could provide significant savings.
> >
> > - Build our own system, using buildbot, jenkins or whatnot.
>
> I think we should use the "compute credits" plan from Cirrus CI. It should
> be possible to estimate the costs for that. Money is available, I think.

Unfortunately just doing that seems like it would end up considerably on the
too expensive side. Here are the stats for last months' cfbot runtimes
(provided by Thomas):

                   task_name                    |    sum
------------------------------------------------+------------
 FreeBSD - 13 - Meson                           | 1017:56:09
 Windows - Server 2019, MinGW64 - Meson         | 00:00:00
 SanityCheck                                    | 76:48:41
 macOS - Ventura - Meson                        | 873:12:43
 Windows - Server 2019, VS 2019 - Meson & ninja | 1251:08:06
 Linux - Debian Bullseye - Autoconf             | 830:17:26
 Linux - Debian Bullseye - Meson                | 860:37:21
 CompilerWarnings                               | 935:30:35
(8 rows)

If I did the math right, that's about 7000 credits (and 1 credit costs 1 USD).

task                      costs in credits
linux-sanity:             55.30
linux-autoconf:           598.04
linux-meson:              619.40
linux-compiler-warnings:  674.28
freebsd:                  732.24
windows:                  1201.09
macos:                    3143.52

Now, those times are before optimizing test runtime. And besides optimizing
the tasks, we can also optimize not running tests for docs patches etc. And
optimize cfbot to schedule a bit better. But still, the costs look not
realistic to me.

If instead we were to use our own GCP account, it's a lot less.
t2d-standard-4 instances, which are faster than what we use right now, cost
$0.168984 / hour as "normal" instances and $0.026764 as "spot" instances
right now [1]. Windows VMs are considerably more expensive due to licensing -
0.184$/h in addition.

Assuming spot instances, linux+freebsd tasks would cost ~100 USD / month
(maybe 10-20% more in reality, due to a) spot instances getting terminated
requiring retries and b) disks). Windows would be ~255 USD / month (same
retries caveats).

Given the cost of macos, it seems like it'd be by far the most affordable to
just buy 1-2 mac minis (2x ~660USD) and stick them in a shelf somewhere, as
persistent runners. Cirrus has builtin macos virtualization support - but can
only host two VMs on each mac, due to macos licensing restrictions. A single
mac mini would suffice to keep up with our unoptimized monthly runtime
(although there likely would be some overhead).

Greetings,

Andres Freund

[1] https://cloud.google.com/compute/all-pricing
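Roughly checking the per-month arithmetic above against the hours table and
the quoted spot prices (retries and disks excluded), which lands in the same
ballpark as the ~100 and ~255 USD figures:

\begin{align*}
\text{Linux + FreeBSD:}\quad & (1018 + 77 + 830 + 861 + 936)\,\mathrm{h} \times 0.0268\,\$/\mathrm{h} \approx 100\,\$/\text{month} \\
\text{Windows:}\quad & 1251\,\mathrm{h} \times (0.0268 + 0.184)\,\$/\mathrm{h} \approx 264\,\$/\text{month}
\end{align*}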
Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?
Hi,

On 2023-08-07 19:15:41 -0700, Andres Freund wrote:
> The set of free CI providers has shrunk since we chose cirrus, as have the
> "free" resources provided. I started a, quite incomplete as of now, wiki
> page at [4].

Oops, as Thomas just noticed, I left off that link:

[4] https://wiki.postgresql.org/wiki/CI_Providers

Greetings,

Andres Freund
Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?
Hi,

As some of you might have seen when running CI, cirrus-ci is restricting how
many CI cycles everyone can use for free (announcement at [1]). This takes
effect September 1st.

This obviously has consequences both for individual users of CI as well as
cfbot.

The first thing I think we should do is to lower the cost of CI. One thing I
had not entirely realized previously is that macos CI is by far the most
expensive CI to provide. That's not just the case with cirrus-ci, but also
with other providers. See the series of patches described later in the email.

To me, the situation for cfbot is different than the one for individual
users. IMO, for the individual user case it's important to use CI for "free",
without a whole lot of complexity. Which imo rules out approaches like
providing $cloud_provider compute accounts - that's too much setup work. With
the improvements detailed below, cirrus' free CI would last about ~65 runs /
month.

For cfbot I hope we can find funding to pay for compute to use for CI. The,
by far, most expensive bit is macos. To a significant degree due to macos
licensing terms not allowing more than 2 VMs on a physical host :(.

The reasons we chose cirrus-ci were:

a) Ability to use full VMs, rather than a pre-selected set of VMs, which
   allows us to test a larger number

b) Ability to link to log files, without requiring an account. E.g. github
   actions doesn't allow viewing logs unless logged in.

c) Amount of compute available.

The set of free CI providers has shrunk since we chose cirrus, as have the
"free" resources provided. I started a, quite incomplete as of now, wiki page
at [4].

Potential paths forward for individual CI:

- migrate wholesale to another CI provider

- split CI tasks across different CI providers, rely on github et al
  displaying the CI status for different platforms

- give up

Potential paths forward for cfbot, in addition to the above:

- Pay for compute / ask the various cloud providers to grant us compute
  credits. At least some of the cloud providers can be used via cirrus-ci.

- Host (some) CI runners ourselves. Particularly with macos and windows, that
  could provide significant savings.

- Build our own system, using buildbot, jenkins or whatnot.

Opinions as to what to do?

The attached series of patches:

1) Makes startup of macos instances faster, using more efficient caching of
   the required packages. Also submitted as [2].

2) Introduces a template initdb that's reused during the tests. Also
   submitted as [3].

3) Remove use of -DRANDOMIZE_ALLOCATED_MEMORY from macos tasks. It's
   expensive. And CI also uses asan on linux, so I don't think it's really
   needed.

4) Switch tasks to use debugoptimized builds. Previously many tasks used -Og,
   to get decent backtraces etc. But the amount of CPU burned that way is too
   large. One issue with that is that use of ccache becomes much more
   crucial, uncached build times do significantly increase.

5) Move use of -Dsegsize_blocks=6 from macos to linux

   Macos is expensive, -Dsegsize_blocks=6 slows things down. Alternatively we
   could stop covering both meson and autoconf segsize_blocks. It does affect
   runtime on linux as well.

6) Disable write cache flushes on windows

   It's a bit ugly to do this without using the UI... Shaves off about 30s
   from the tests.

7) pg_regress only checked once a second whether postgres started up, but
   it's usually much faster. Use pg_ctl's logic. It might be worth replacing
   the use of psql with directly using libpq in pg_regress instead; it looks
   like the overhead of repeatedly starting psql is noticeable.
FWIW: with the patches applied, the "credit costs" in cirrus CI are roughly
like the following (depends on caching etc):

task                      costs in credits
linux-sanity:             0.01
linux-compiler-warnings:  0.05
linux-meson:              0.07
freebsd:                  0.08
linux-autoconf:           0.09
windows:                  0.18
macos:                    0.28

total task runtime is 40.8
cost in credits is 0.76, monthly credits of 50 allow approx 66.10 runs/month

Greetings,

Andres Freund

[1] https://cirrus-ci.org/blog/2023/07/17/limiting-free-usage-of-cirrus-ci/
[2] https://www.postgresql.org/message-id/20230805202539.r3umyamsnctysdc7%40awork3.anarazel.de
[3] https://postgr.es/m/20220120021859.3zpsfqn4z7ob7...@alap3.anarazel.de

>From 00c48a90f3acc6cba6af873b29429ee6d4ba38a6 Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Thu, 3 Aug 2023 23:29:13 -0700
Subject: [PATCH v1 1/9] ci: macos: used cached macports install

This substantially speeds up the mac CI time.

Discussion: https://postgr.es/m/20230805202539.r3umyamsnctys...@awork3.anarazel.de
---
 .cirrus.yml                          | 63 +++---
 src/tools/ci/ci_macports_packages.sh | 97
 2 files changed, 122 insertions(+), 38 deletions(-)
 create mode 100755 s
Re: cpluspluscheck vs ICU
Hi,

On 2023-03-10 20:10:30 -0800, Andres Freund wrote:
> On 2023-03-10 19:37:27 -0800, Andres Freund wrote:
> > I just hit this once more - and I figured out a fairly easy fix:
> >
> > We just need a
> > #ifndef U_DEFAULT_SHOW_DRAFT
> > #define U_DEFAULT_SHOW_DRAFT 0
> > #endif
> > before including unicode/ucol.h.
> >
> > At first I was looking at
> > #define U_SHOW_CPLUSPLUS_API 0
> > and
> > #define U_HIDE_INTERNAL_API 1
> > which both work, but they are documented to be internal.
>
> Err. Unfortunately only the U_SHOW_CPLUSPLUS_API approach actually works. The
> others don't, not quite sure what I was doing earlier.
>
> So it's either relying on a define marked as internal, or the below:
>
> > Alternatively we could emit U_DEFAULT_SHOW_DRAFT 0 into pg_config.h to avoid
> > that issue.
> >
> >
> > The only other thing I see is to do something like:
> > [ugly]
> > which seems mighty ugly.

The ICU docs talk about it like it's not really internal:
https://github.com/unicode-org/icu/blob/720e5741ccaa112c4faafffdedeb7459b66c5673/docs/processes/release/tasks/healthy-code.md#test-icu4c-headers

So I'm inclined to go with that solution.

Any comments? Arguments against?

Greetings,

Andres Freund
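[Editor's note] To make the approach under discussion concrete, here is a
minimal sketch of the guard (an illustration of the idea, not necessarily the
exact form that ends up in the tree):

/*
 * Sketch of the discussed fix: define U_SHOW_CPLUSPLUS_API to 0 before any
 * ICU header is included, so cpluspluscheck/headerscheck don't trip over
 * ICU's C++-only declarations.  This is the define that actually works,
 * even though ICU nominally documents it as internal; ICU's own
 * header-testing docs (linked above) use it for the same purpose.
 */
#ifndef U_SHOW_CPLUSPLUS_API
#define U_SHOW_CPLUSPLUS_API 0
#endif

#include <unicode/ucol.h>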
Re: A failure in 031_recovery_conflict.pl on Debian/s390x
Hi,

On 2023-08-07 12:57:40 +0200, Christoph Berg wrote:
> Re: Thomas Munro
> > Thanks for testing!  Would you mind trying v8 from that thread?  V7
> > had a silly bug (I accidentally deleted a 'case' label while cleaning
> > some stuff up, resulting in the above error...)
>
> v8 worked better. It succeeded a few times (at least 12, my screen
> scrollback didn't catch more) before erroring like this:
> [10:21:58.410](0.151s) ok 15 - startup deadlock: logfile contains terminated
> connection due to recovery conflict
> [10:21:58.463](0.053s) not ok 16 - startup deadlock: stats show conflict on
> standby
> [10:21:58.463](0.000s)
> [10:21:58.463](0.000s) # Failed test 'startup deadlock: stats show conflict
> on standby' # at t/031_recovery_conflict.pl line 332.
> [10:21:58.463](0.000s) # got: '0' # expected: '1'

Hm, that could just be a "harmless" race. Does it still happen if you apply
the attached patch in addition?

Greetings,

Andres Freund

diff --git i/src/backend/utils/activity/pgstat_database.c w/src/backend/utils/activity/pgstat_database.c
index 7149f22f729..bb36d73ec04 100644
--- i/src/backend/utils/activity/pgstat_database.c
+++ w/src/backend/utils/activity/pgstat_database.c
@@ -81,12 +81,22 @@ void
 pgstat_report_recovery_conflict(int reason)
 {
 	PgStat_StatDBEntry *dbentry;
+	PgStat_EntryRef *entry_ref;
+	PgStatShared_Database *sharedent;
 
 	Assert(IsUnderPostmaster);
 	if (!pgstat_track_counts)
 		return;
 
-	dbentry = pgstat_prep_database_pending(MyDatabaseId);
+	/*
+	 * Update the shared stats directly - recovery conflicts should never be
+	 * common enough for that to be a problem.
+	 */
+	entry_ref =
+		pgstat_get_entry_ref_locked(PGSTAT_KIND_DATABASE, MyDatabaseId, InvalidOid, false);
+
+	sharedent = (PgStatShared_Database *) entry_ref->shared_stats;
+	dbentry = &sharedent->stats;
 
 	switch (reason)
 	{
@@ -116,6 +126,8 @@ pgstat_report_recovery_conflict(int reason)
 			dbentry->conflict_startup_deadlock++;
 			break;
 	}
+
+	pgstat_unlock_entry(entry_ref);
 }
 
 /*
Re: Configurable FP_LOCK_SLOTS_PER_BACKEND
Hi, On 2023-08-07 14:36:48 -0700, Andres Freund wrote: > What if fast path locks entered PROCLOCK into the shared hashtable, just like > with normal locks, the first time a lock is acquired by a backend. Except that > we'd set a flag indicating the lock is a fastpath lock. When the lock is > released, neither the LOCALLOCK nor the PROCLOCK entry would be > removed. Instead, the LOCK/PROCLOCK would be modified to indicate that the > lock is not held anymore. > > That itself wouldn't buy us much - we'd still need to do a lookup in the > shared hashtable. But, by the time we decide whether to use fast path locks, > we've already done a hash lookup in the LOCALLOCK hashtable. Because the > PROCLOCK entry would continue to exist, we can use LOCALLOCK->proclock to get > the PROCLOCK entry without a shared hash table lookup. > > Acquiring a strong lock on a fastpath lock would basically entail modifying > all the relevant PROCLOCKs atomically to indicate that fast path locks aren't > possible anymore. Acquiring a fast path lock would just require atomically > modifying the PROCLOCK to indicate that the lock is held. > > On a first blush, this sounds like it could end up being fairly clean and > generic? On 2023-08-07 13:05:32 -0400, Robert Haas wrote: > Of course, another thing we could do is try to improve the main lock > manager somehow. I confess that I don't have a great idea for that at > the moment, but the current locking scheme there is from a very, very > long time ago and clearly wasn't designed with modern hardware in > mind. I think the biggest flaw of the locking scheme is that the LockHash locks protect two, somewhat independent, things: 1) the set of currently lockable objects, i.e. the entries in the hash table [partition] 2) the state of all the locks [in a partition] It'd not be that hard to avoid the shared hashtable lookup in a number of cases, e.g. by keeping LOCALLOCK entries around for longer, as I suggest above. But we can't, in general, avoid the lock on the partition anyway, as the each lock's state is also protected by the partition lock. The amount of work to do a lookup in the shared hashtable and/or create a new entry therein, is quite bound. But the work for acquiring a lock is much less so. We'll e.g. often have to iterate over the set of lock holders etc. I think we ought to investigate whether pushing down the locking for the "lock state" into the individual locks is worth it. That way the partitioned lock would just protect the hashtable. The biggest issue I see is deadlock checking. Right now acquiring all lock partitions gives you a consistent view of all the non-fastpath locks - and fastpath locks can't participate in deadlocks. Any scheme that makes "lock state" locking in general more granular, will make it next to impossible to have a similarly consistent view of all locks. I'm not sure the current degree of consistency is required however - the lockers participating in a lock cycle, pretty much by definition, are blocked. A secondary issue is that making the locks more granular could affect the happy path measurably - we'd need two atomics for each heavyweight lock acquisition, not one. But if we cached the lookup in the shared hashtable, we'd commonly be able to skip the hashtable lookup... Greetings, Andres Freund
Re: Configurable FP_LOCK_SLOTS_PER_BACKEND
Hi, On 2023-08-07 13:05:32 -0400, Robert Haas wrote: > I would also argue that the results are actually not that great, > because once you get past 64 partitions you're right back where you > started, or maybe worse off. To me, there's nothing magical about > cases between 16 and 64 relations that makes them deserve special > treatment - plenty of people are going to want to use hundreds of > partitions, and even if you only use a few dozen, this isn't going to > help as soon as you join two or three partitioned tables, and I > suspect it hurts whenever it doesn't help. > > I think we need a design that scales better. I don't really know what > that would look like, exactly, but your idea of a hybrid approach > seems like it might be worth further consideration. We don't have to > store an infinite number of fast-path locks in an array that we search > linearly, and it might be better that if we moved to some other > approach we could avoid some of the regression. My gut feeling is that the state for fast path locks doesn't live in quite the right place. What if fast path locks entered PROCLOCK into the shared hashtable, just like with normal locks, the first time a lock is acquired by a backend. Except that we'd set a flag indicating the lock is a fastpath lock. When the lock is released, neither the LOCALLOCK nor the PROCLOCK entry would be removed. Instead, the LOCK/PROCLOCK would be modified to indicate that the lock is not held anymore. That itself wouldn't buy us much - we'd still need to do a lookup in the shared hashtable. But, by the time we decide whether to use fast path locks, we've already done a hash lookup in the LOCALLOCK hashtable. Because the PROCLOCK entry would continue to exist, we can use LOCALLOCK->proclock to get the PROCLOCK entry without a shared hash table lookup. Acquiring a strong lock on a fastpath lock would basically entail modifying all the relevant PROCLOCKs atomically to indicate that fast path locks aren't possible anymore. Acquiring a fast path lock would just require atomically modifying the PROCLOCK to indicate that the lock is held. On a first blush, this sounds like it could end up being fairly clean and generic? Greetings, Andres Freund
Re: Configurable FP_LOCK_SLOTS_PER_BACKEND
Hi, On 2023-08-07 13:59:26 -0700, Matt Smiley wrote: > I have not yet written a reproducer since we see this daily in production. > I have a sketch of a few ways that I think will reproduce the behavior > we're observing, but haven't had time to implement it. > > I'm not sure if we're seeing this behavior in production It might be worth for you to backpatch commit 92daeca45df Author: Andres Freund Date: 2022-11-21 20:34:17 -0800 Add wait event for pg_usleep() in perform_spin_delay() into 12. That should be low risk and have only trivially resolvable conflicts. Alternatively, you could use bpftrace et al to set a userspace probe on perform_spin_delay(). > , but it's definitely an interesting find. Currently we are running > postgres 12.11, with an upcoming upgrade to 15 planned. Good to know > there's a potential improvement waiting in 16. I noticed that in > LWLockAcquire the call to LWLockDequeueSelf occurs ( > https://github.com/postgres/postgres/blob/REL_12_11/src/backend/storage/lmgr/lwlock.c#L1218) > directly between the unsuccessful attempt to immediately acquire the lock > and reporting the backend's wait event. That's normal. > > I'm also wondering if it's possible that the reason for the throughput > > drops > > are possibly correlated with heavyweight contention or higher frequency > > access > > to the pg_locks view. Deadlock checking and the locks view acquire locks on > > all lock manager partitions... So if there's a bout of real lock contention > > (for longer than deadlock_timeout)... > > > > Great questions, but we ruled that out. The deadlock_timeout is 5 seconds, > so frequently hitting that would massively violate SLO and would alert the > on-call engineers. The pg_locks view is scraped a couple times per minute > for metrics collection, but the lock_manager lwlock contention can be > observed thousands of times every second, typically with very short > durations. The following example (captured just now) shows the number of > times per second over a 10-second window that any 1 of the 16 > "lock_manager" lwlocks was contended: Some short-lived contention is fine and expected - the question is how long the waits are... Unfortunately my experience is that the overhead of bpftrace means that analyzing things like this with bpftrace is very hard... :(. > > Given that most of your lock manager traffic comes from query planning - > > have you evaluated using prepared statements more heavily? > > > > Yes, there are unrelated obstacles to doing so -- that's a separate can of > worms, unfortunately. But in this pathology, even if we used prepared > statements, the backend would still need to reacquire the same locks during > each executing transaction. So in terms of lock acquisition rate, whether > it's via the planner or executor doing it, the same relations have to be > locked. Planning will often lock more database objects than query execution. Which can keep you using fastpath locks for longer. Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi,

On 2023-08-07 23:05:39 +0900, Masahiko Sawada wrote:
> On Mon, Aug 7, 2023 at 3:16 PM David Rowley wrote:
> >
> > On Wed, 2 Aug 2023 at 13:35, David Rowley wrote:
> > > So, it looks like this item can be closed off.  I'll hold off from
> > > doing that for a few days just in case anyone else wants to give
> > > feedback or test themselves.
> >
> > Alright, closed.
>
> IIUC the problem with multiple concurrent COPY is not resolved yet.

Yea - it was just hard to analyze until the other regressions were fixed.

> The result of nclients = 1 became better thanks to recent fixes, but
> there still seems to be the performance regression at nclient = 2~16
> (on RHEL 8 and 9). Andres reported[1] that after changing
> MAX_BUFFERED_TUPLES to 5000 the numbers became a lot better but it
> would not be the solution, as he mentioned.

I think there could be a quite simple fix: Track by how much we've extended
the relation previously in the same bistate. If we already extended by many
blocks, it's very likely that we'll do so further.

A simple prototype patch attached.

The results for me are promising. I copied a smaller file [1], to have more
accurate throughput results at shorter runs (15s).

HEAD before:
clients   tps
1          41
2          76
4         136
8         248
16        360
32        375
64        317

HEAD after:
clients   tps
1          43
2          80
4         155
8         280
16        369
32        405
64        344

Any chance you could rerun your benchmark? I don't see as much of a
regression vs 16 as you...

Greetings,

Andres Freund

[1] COPY (SELECT generate_series(1, 10)) TO '/tmp/data.copy';

diff --git i/src/include/access/hio.h w/src/include/access/hio.h
index 228433ee4a2..5ae39aec7f8 100644
--- i/src/include/access/hio.h
+++ w/src/include/access/hio.h
@@ -41,6 +41,7 @@ typedef struct BulkInsertStateData
 	 */
 	BlockNumber next_free;
 	BlockNumber last_free;
+	uint32		already_extended_by;
 } BulkInsertStateData;
 
diff --git i/src/backend/access/heap/heapam.c w/src/backend/access/heap/heapam.c
index 7ed72abe597..6a66214a580 100644
--- i/src/backend/access/heap/heapam.c
+++ w/src/backend/access/heap/heapam.c
@@ -1776,6 +1776,7 @@ GetBulkInsertState(void)
 	bistate->current_buf = InvalidBuffer;
 	bistate->next_free = InvalidBlockNumber;
 	bistate->last_free = InvalidBlockNumber;
+	bistate->already_extended_by = 0;
 	return bistate;
 }
 
diff --git i/src/backend/access/heap/hio.c w/src/backend/access/heap/hio.c
index c275b08494d..a2be4273df1 100644
--- i/src/backend/access/heap/hio.c
+++ w/src/backend/access/heap/hio.c
@@ -283,6 +283,13 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate,
 	 */
 	extend_by_pages += extend_by_pages * waitcount;
 
+	/*
+	 * If we previously extended using the same bistate, it's very likely
+	 * we'll extend some more. Try to extend by as many pages.
+	 */
+	if (bistate)
+		extend_by_pages = Max(extend_by_pages, bistate->already_extended_by);
+
 	/*
 	 * Can't extend by more than MAX_BUFFERS_TO_EXTEND_BY, we need to pin
 	 * them all concurrently.
@@ -409,6 +416,7 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate,
 		/* maintain bistate->current_buf */
 		IncrBufferRefCount(buffer);
 		bistate->current_buf = buffer;
+		bistate->already_extended_by += extend_by_pages;
 	}
 
 	return buffer;
Re: Configurable FP_LOCK_SLOTS_PER_BACKEND
Hi, On 2023-08-02 16:51:29 -0700, Matt Smiley wrote: > I thought it might be helpful to share some more details from one of the > case studies behind Nik's suggestion. > > Bursty contention on lock_manager lwlocks recently became a recurring cause > of query throughput drops for GitLab.com, and we got to study the behavior > via USDT and uprobe instrumentation along with more conventional > observations (see > https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301). This > turned up some interesting finds, and I thought sharing some of that > research might be helpful. Hm, I'm curious whether you have a way to trigger the issue outside of your prod environment. Mainly because I'm wondering if you're potentially hitting the issue fixed in a4adc31f690 - we ended up not backpatching that fix, so you'd not see the benefit unless you reproduced the load in 16+. I'm also wondering if it's possible that the reason for the throughput drops are possibly correlated with heavyweight contention or higher frequency access to the pg_locks view. Deadlock checking and the locks view acquire locks on all lock manager partitions... So if there's a bout of real lock contention (for longer than deadlock_timeout)... Given that most of your lock manager traffic comes from query planning - have you evaluated using prepared statements more heavily? Greetings, Andres Freund
Re: initdb caching during tests
Hi,

On 2023-08-05 16:58:38 -0400, Tom Lane wrote:
> Andres Freund writes:
> > Times for running all tests under meson, on my workstation (20 cores / 40
> > threads):
>
> > cassert build -O2:
>
> > Before:
> > real    0m44.638s
> > user    7m58.780s
> > sys     2m48.773s
>
> > After:
> > real    0m38.938s
> > user    2m37.615s
> > sys     2m0.570s
>
> Impressive results.  Even though your bottom-line time doesn't change that
> much

Unfortunately we have a few tests that take quite a while - for those the
initdb removal doesn't make that much of a difference. Particularly because
this machine has enough CPUs to not be fully busy except for the first few
seconds...

E.g. for a run with the patch applied:

258/265 postgresql:pg_basebackup / pg_basebackup/010_pg_basebackup  OK  16.58s   187 subtests passed
259/265 postgresql:subscription / subscription/100_bugs             OK   6.69s   12 subtests passed
260/265 postgresql:regress / regress/regress                        OK  24.95s   215 subtests passed
261/265 postgresql:ssl / ssl/001_ssltests                           OK   7.97s   205 subtests passed
262/265 postgresql:pg_dump / pg_dump/002_pg_dump                    OK  19.65s   11262 subtests passed
263/265 postgresql:recovery / recovery/027_stream_regress           OK  29.34s   6 subtests passed
264/265 postgresql:isolation / isolation/isolation                  OK  33.94s   112 subtests passed
265/265 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade           OK  38.22s   18 subtests passed

The pg_upgrade test is faster in isolation (29s), but not that much. The
overall runtime is reduced due to the reduced "competing" cpu usage, but
other than that...

Looking at where the time is spent when running the pg_upgrade test on its
own:

grep -E '^\[' testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade |sed -E -e 's/.*\(([0-9.]+)s\)(.*)/\1 \2/g'|sort -n -r

cassert:
13.094  ok 5 - regression tests pass
 6.147  ok 14 - run of pg_upgrade for new instance
 2.340  ok 6 - dump before running pg_upgrade
 1.638  ok 17 - dump after running pg_upgrade
 1.375  ok 12 - run of pg_upgrade --check for new instance
 0.798  ok 1 - check locales in original cluster
 0.371  ok 9 - invalid database causes failure status (got 1 vs expected 1)
 0.149  ok 7 - run of pg_upgrade --check for new instance with incorrect binary path
 0.131  ok 16 - check that locales in new cluster match original cluster

optimized:
 8.372  ok 5 - regression tests pass
 3.641  ok 14 - run of pg_upgrade for new instance
 1.371  ok 12 - run of pg_upgrade --check for new instance
 1.104  ok 6 - dump before running pg_upgrade
 0.636  ok 17 - dump after running pg_upgrade
 0.594  ok 1 - check locales in original cluster
 0.359  ok 9 - invalid database causes failure status (got 1 vs expected 1)
 0.148  ok 7 - run of pg_upgrade --check for new instance with incorrect binary path
 0.127  ok 16 - check that locales in new cluster match original cluster

The time for "dump before running pg_upgrade" is misleadingly high - there's
no output between starting initdb and the dump, so the timing includes initdb
and a bunch of other work. But it's still not fast (1.637s) after.

A small factor is that the initdb times are not insignificant, because the
template initdb can't be used due to a bunch of parameters passed to initdb
:)

> the big reduction in CPU time should translate to a nice speedup on slower
> buildfarm animals.

Yea. It's a particularly large win when using valgrind. Under valgrind, a
very large portion of the time for many tests is just spent doing initdb...
So I am hoping to see some nice gains for skink.

Greetings,

Andres Freund
ci: Improve macos startup using a cached macports installation
Hi, We have some issues with CI on macos and windows being too expensive (more on that soon in a separate email). For macos most of the obviously wasted time is spent installing packages with homebrew. Even with the package downloads being cached, it takes about 1m20s to install them. We can't just cache the whole homebrew installation, because it contains a lot of pre-installed packages. After a bunch of experimenting, I found a way to do this a lot faster: The attached patch installs macports and installs our dependencies from that. Because there aren't pre-existing macports packages, we can just cache the whole thing. Doing so naively wouldn't yield that much of a speedup, because it takes a while to unpack a tarball (or whatnot) with as many files as our dependencies have - that's a lot of filesystem metadata operations. Instead the script creates a HFS+ filesystem in a file and caches that - that's mounted within a few seconds. To further keep the size in check, that file is compressed with zstd in the cache. As macports has a package for IPC::Run and IO::Pty, we can use those instead of the separate cache we had for the perl installation. After the patch, the cached case takes ~5s to "install" and the cache is half the size than the one for homebrew. The comparison isn't entirely fair, because I used the occasion to not install 'make' (since meson is used for building) and llvm (we didn't enable it for the build anyway). That gain is a bit smaller without that, but still significant. An alternative implementation would be to create the "base" .dmg file outside of CI and download it onto the CI instances. But I didn't want to figure out the hosting situation for such files, so I thought this was the best near-medium term path. Greetings, Andres Freund >From e0f64949e7b2b2aca4398acb3bfa8e859857e910 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Thu, 3 Aug 2023 23:29:13 -0700 Subject: [PATCH v1] ci: macos: used cached macports install This substantially speeds up the mac CI time. Author: Reviewed-by: Discussion: https://postgr.es/m/ Backpatch: --- .cirrus.yml | 63 +++--- src/tools/ci/ci_macports_packages.sh | 97 2 files changed, 122 insertions(+), 38 deletions(-) create mode 100755 src/tools/ci/ci_macports_packages.sh diff --git a/.cirrus.yml b/.cirrus.yml index d260f15c4e2..e9cfc542cfe 100644 --- a/.cirrus.yml +++ b/.cirrus.yml @@ -429,8 +429,7 @@ task: CIRRUS_WORKING_DIR: ${HOME}/pgsql/ CCACHE_DIR: ${HOME}/ccache -HOMEBREW_CACHE: ${HOME}/homebrew-cache -PERL5LIB: ${HOME}/perl5/lib/perl5 +MACPORTS_CACHE: ${HOME}/macports-cache CC: ccache cc CXX: ccache c++ @@ -454,55 +453,43 @@ task: - mkdir ${HOME}/cores - sudo sysctl kern.corefile="${HOME}/cores/core.%P" - perl_cache: -folder: ~/perl5 - cpan_install_script: -- perl -mIPC::Run -e 1 || cpan -T IPC::Run -- perl -mIO::Pty -e 1 || cpan -T IO::Pty - upload_caches: perl - - - # XXX: Could we instead install homebrew into a cached directory? The - # homebrew installation takes a good bit of time every time, even if the - # packages do not need to be downloaded. - homebrew_cache: -folder: $HOMEBREW_CACHE + # Use macports, even though homebrew is installed. The installation + # of the additional packages we need would take quite a while with + # homebrew, even if we cache the downloads. We can't cache all of + # homebrew, because it's already large. So we use macports. To cache + # the installation we create a .dmg file that we mount if it already + # exists. 
+ # XXX: The reason for the direct p5.34* references is that we'd need + # the large macport tree around to figure out that p5-io-tty is + # actually p5.34-io-tty. Using the unversioned name works, but + # updates macports every time. + macports_cache: +folder: ${MACPORTS_CACHE} setup_additional_packages_script: | -brew install \ +sh src/tools/ci/ci_macports_packages.sh \ ccache \ - icu4c \ - krb5 \ - llvm \ + icu \ + kerberos5 \ lz4 \ - make \ meson \ openldap \ openssl \ - python \ - tcl-tk \ + p5.34-io-tty \ + p5.34-ipc-run \ + tcl \ zstd - -brew cleanup -s # to reduce cache size - upload_caches: homebrew +# Make macports install visible for subsequent steps +echo PATH=/opt/local/sbin/:/opt/local/bin/:$PATH >> $CIRRUS_ENV + upload_caches: macports ccache_cache: folder: $CCACHE_DIR configure_script: | -brewpath="/opt/homebrew" -PKG_CONFIG_PATH="${brewpath}/lib/pkgconfig:${PKG_CONFIG_PATH}" - -for pkg in icu4c krb5 openldap openssl zstd ; do - pkgpath="${brewpath}/opt/${pkg}" - PKG_CONFIG_PATH="${pkgpath}/lib/pkgconfig:${PKG_CONFIG_PATH}&quo
initdb caching during tests
Hi, We have some issues with CI on macos and windows being too expensive (more on that soon in a separate email), which reminded me of this thread (with original title: [1]) I've attached a somewhat cleaned up version of the patch to cache initdb across runs. The results are still fairly impressive in my opinion. One thing I do not like, but don't have a good idea for how to improve, is that there's a bunch of duplicated logic in pg_regress.c and Cluster.pm. I've tried to move that into initdb.c itself, but that ends up pretty ugly, because we need to be a lot more careful about checking whether options are compatible etc. I've also thought about just putting this into a separate perl script, but right now we still allow basic regression tests without perl being available. So I concluded that for now just having the copies is the best answer. Times for running all tests under meson, on my workstation (20 cores / 40 threads): cassert build -O2: Before: real0m44.638s user7m58.780s sys 2m48.773s After: real0m38.938s user2m37.615s sys 2m0.570s cassert build -O0: Before: real1m11.290s user13m9.817s sys 2m54.946s After: real1m2.959s user3m5.835s sys 1m59.887s non-cassert build: Before: real0m34.579s user5m30.418s sys 2m40.507s After: real0m27.710s user2m20.644s sys 1m55.770s On CI this reduces the test times substantially: Freebsd 8:51 -> 5:35 Debian w/ asan, autoconf 6:43 -> 4:55 Debian w/ alignmentsan, ubsan 4:02 -> 2:33 macos 5:07 -> 4:29 windows 10:21 -> 9:49 This is ignoring a bit of run-to-run variance, but the trend is obvious enough that it's not worth worrying about that. Greetings, Andres Freund [1] https://postgr.es/m/20220120021859.3zpsfqn4z7ob7afz%40alap3.anarazel.de >From 0fd431c277f01284a91999a04368de6b59b6691e Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Thu, 2 Feb 2023 21:51:53 -0800 Subject: [PATCH v3 1/2] Use "template" initdb in tests Discussion: https://postgr.es/m/20220120021859.3zpsfqn4z7ob7...@alap3.anarazel.de --- meson.build | 30 ++ .cirrus.yml | 3 +- src/test/perl/PostgreSQL/Test/Cluster.pm | 46 ++- src/test/regress/pg_regress.c| 74 ++-- src/Makefile.global.in | 52 + 5 files changed, 161 insertions(+), 44 deletions(-) diff --git a/meson.build b/meson.build index 04ea3488522..47429a18c3f 100644 --- a/meson.build +++ b/meson.build @@ -3056,8 +3056,10 @@ testport = 4 test_env = environment() temp_install_bindir = test_install_location / get_option('bindir') +test_initdb_template = meson.build_root() / 'tmp_install' / 'initdb-template' test_env.set('PG_REGRESS', pg_regress.full_path()) test_env.set('REGRESS_SHLIB', regress_module.full_path()) +test_env.set('INITDB_TEMPLATE', test_initdb_template) # Test suites that are not safe by default but can be run if selected # by the user via the whitespace-separated list in variable PG_TEST_EXTRA. @@ -3072,6 +3074,34 @@ if library_path_var != '' endif +# Create (and remove old) initdb template directory. Tests use that, where +# possible, to make it cheaper to run tests. +# +# Use python to remove the old cached initdb, as we cannot rely on a working +# 'rm' binary on windows. 
+test('initdb_cache', + python, + args: [ + '-c', ''' +import shutil +import sys +import subprocess + +shutil.rmtree(sys.argv[1], ignore_errors=True) +sp = subprocess.run(sys.argv[2:] + [sys.argv[1]]) +sys.exit(sp.returncode) +''', + test_initdb_template, + temp_install_bindir / 'initdb', + '-A', 'trust', '-N', '--no-instructions' + ], + priority: setup_tests_priority - 1, + timeout: 300, + is_parallel: false, + env: test_env, + suite: ['setup']) + + ### # Test Generation diff --git a/.cirrus.yml b/.cirrus.yml index d260f15c4e2..8fce17dff08 100644 --- a/.cirrus.yml +++ b/.cirrus.yml @@ -115,8 +115,9 @@ task: test_minimal_script: | su postgres <<-EOF ulimit -c unlimited + meson test $MTEST_ARGS --suite setup meson test $MTEST_ARGS --num-processes ${TEST_JOBS} \ -tmp_install cube/regress pg_ctl/001_start_stop +cube/regress pg_ctl/001_start_stop EOF on_failure: diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm index 5e161dbee60..4d449c35de9 100644 --- a/src/test/perl/PostgreSQL/Test/Cluster.pm +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm @@ -
Re: SIGQUIT handling, redux
Hi, On 2023-08-02 12:35:19 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2020-09-11 11:52:55 -0400, Tom Lane wrote: > >> It's simple enough that maybe we could back-patch it, once it's > >> aged awhile in HEAD. OTOH, given the lack of field reports of > >> trouble here, I'm not sure back-patching is worth the risk. > > > FWIW, looking at collected stack traces in azure, there's a slow but steady > > stream of crashes below StartupPacketTimeoutHandler. ... > > Unsurprisingly just in versions before 14, where this change went in. > > I think that might be enough evidence for backpatching the commit? I've not > > heard of issues due to the checks in check_on_shmem_exit_lists_are_empty(). > > I'd be willing to take a look at this in a few weeks when $real_life > is a bit less demanding. Cool. > Right before minor releases would likely be a bad idea anyway. Agreed. I had just waded through the stacks, so I thought it'd be worth bringing up, didn't intend to imply we should backpatch immediately. Aside: Analyzing this kind of thing at scale is made considerably more painful by "expected" ereport(PANIC)s (say out of disk space during WAL writes) calling abort() and dumping core... While there are other PANICs you really want cores of. Greetings, Andres Freund
Re: SIGQUIT handling, redux
Hi, On 2020-09-11 11:52:55 -0400, Tom Lane wrote: > It's simple enough that maybe we could back-patch it, once it's > aged awhile in HEAD. OTOH, given the lack of field reports of > trouble here, I'm not sure back-patching is worth the risk. FWIW, looking at collected stack traces in azure, there's a slow but steady stream of crashes below StartupPacketTimeoutHandler. Most seem to be things like libcrypto->malloc->StartupPacketTimeoutHandler->proc_exit->socket_close->free->crash there's a few other variants, some where the stack apparently was not decipherable for the relevant tooling. Note that this wouldn't even include cases where this caused hangs - which is quite common IME. Unsurprisingly just in versions before 14, where this change went in. I think that might be enough evidence for backpatching the commit? I've not heard of issues due to the checks in check_on_shmem_exit_lists_are_empty(). Greetings, Andres Freund
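[Editor's note] To illustrate the failure mode behind those collected stack
traces (a sketch of the general hazard only, not of any actual backend code):
when the timeout signal arrives while libcrypto is inside malloc(), a handler
that ends up calling free() or other non-async-signal-safe cleanup can corrupt
or deadlock on the allocator state, whereas an immediate _exit() cannot.

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static char *some_buffer;		/* hypothetical state the handler touches */

/*
 * Sketch only - not PostgreSQL code.  A handler like this can fire while
 * the interrupted code is in the middle of malloc(); re-entering the
 * allocator or running atexit-style cleanup from here is unsafe.
 */
static void
timeout_handler_unsafe(int signo)
{
	free(some_buffer);			/* re-enters the allocator: undefined behavior */
	exit(1);					/* runs atexit handlers: also unsafe here */
}

static void
timeout_handler_safe(int signo)
{
	_exit(1);					/* async-signal-safe immediate exit */
}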
Re: stats test intermittent failure
Hi, On 2023-07-31 21:03:07 +0900, Masahiko Sawada wrote: > Regarding the patch, I have a comment: > > -- Test that reuse of strategy buffers and reads of blocks into these reused > --- buffers while VACUUMing are tracked in pg_stat_io. > +-- buffers while VACUUMing are tracked in pg_stat_io. If there is sufficient > +-- demand for shared buffers from concurrent queries, some blocks may be > +-- evicted from the strategy ring before they can be reused. In such cases > +-- this, the backend will evict a block from a shared buffer outside of the > +-- ring and add it to the ring. This is considered an eviction and not a > reuse. > > The new comment seems not to be accurate if my understanding is correct. How > about the following? > > Test that reuse of strategy buffers and reads of blocks into these > reused buffers while VACUUMing are tracked in pg_stat_io. If there is > sufficient demand for shared buffers from concurrent queries, some > buffers may be pinned by other backends before they can be reused. In > such cases, the backend will evict a buffer from a shared buffer > outside of the ring and add it to the ring. This is considered an > eviction and not a reuse. I integrated the suggested change of the comment and tweaked it a bit more. And finally pushed the fix. Sorry that it took so long. Greetings, Andres Freund
Re: Support to define custom wait events for extensions
Hi, On 2023-08-01 12:14:49 +0900, Michael Paquier wrote: > On Tue, Aug 01, 2023 at 11:51:35AM +0900, Masahiro Ikeda wrote: > > Thanks for committing the main patch. > > > > In my understanding, the rest works are > > * to support WaitEventExtensionMultiple() > > * to replace WAIT_EVENT_EXTENSION to custom wait events > > > > Do someone already works for them? If not, I'll consider > > how to realize them. > > Note that postgres_fdw and dblink use WAIT_EVENT_EXTENSION, but have > no dependency to shared_preload_libraries. Perhaps these could just > use a dynamic handling but that deserves a separate discussion because > of the fact that they'd need shared memory without being able to > request it. autoprewarm.c is much simpler. This is why the scheme as implemented doesn't really make sense to me. It'd be much easier to use if we had a shared hashtable with the identifiers than what's been merged now. In plenty of cases it's not realistic for an extension library to run in each backend, while still needing to wait for things. Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi,

On 2023-07-27 20:53:16 +1200, David Rowley wrote:
> To summarise, REL_15_STABLE can run this benchmark in 526.014 ms on my
> AMD 3990x machine.  Today's REL_16_STABLE takes 530.344 ms. We're
> talking about another patch to speed up the pg_strtoint functions
> which gets this down to 483.790 ms.  Do we need to do this for v16 or
> can we just leave this as it is already?  The slowdown does not seem
> to be much above what we'd ordinarily classify as noise using this
> test on my machine.

I think we need to do something for 16 - it appears on recent-ish AMD the
regression is quite a bit smaller than on intel. You see something < 1%, I
see more like 4%.

I think there's also other cases where the slowdown is more substantial.

Besides intel vs amd, it also looks like the gcc version might make a
difference. The code generated by 13 is noticeably slower than 12 for me...

> Benchmark setup:
>
> COPY (SELECT generate_series(1, 200) a, (random() * 10 -
> 5)::int b, 3243423 c) TO '/tmp/lotsaints.copy';
> DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);

There's a lot of larger numbers in the file, which likely reduces the impact
some. And there's the overhead of actually inserting the rows into the table,
making the difference appear smaller than it is.

If I avoid the actual insert into the table and use more columns, I see about
a 10% regression here.

COPY (SELECT generate_series(1, 1000) a, 10 b, 20 c, 30 d, 40 e, 50 f FROM generate_series(1, 1)) TO '/tmp/lotsaints_wide.copy';

psql -c 'DROP TABLE IF EXISTS lotsaints_wide; CREATE UNLOGGED TABLE lotsaints_wide(a int, b int, c int, d int, e int, f int);' && \
  pgbench -n -P1 -f <( echo "COPY lotsaints_wide FROM '/tmp/lotsaints_wide.copy' WHERE false") -t 5

15:               2992.605
HEAD:             3325.201
fastpath1.patch   2932.606
fastpath2.patch   2783.915

Interestingly fastpath1 is slower now, even though it wasn't with earlier
patches (which still is repeatable). I do not have the foggiest as to why.

Greetings,

Andres Freund
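[Editor's note] To make the fastpath idea from further up in this thread
concrete, here is a minimal sketch (an illustration of the approach, not the
actual fastpath1/fastpath2 patches; pg_strtoint32_general is a hypothetical
stand-in for the existing general routine): only an optional sign followed by
plain decimal digits is handled, and anything else - leading whitespace,
hex/octal/binary prefixes, underscores, overflow - falls back to the slow
path.

#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* hypothetical stand-in for the existing, fully general parser */
extern bool pg_strtoint32_general(const char *s, const char *end, int32_t *result);

static bool
pg_strtoint32_fast(const char *s, const char *end, int32_t *result)
{
	const char *p = s;
	bool		neg = false;
	int64_t		val = 0;

	if (p < end && (*p == '-' || *p == '+'))
		neg = (*p++ == '-');

	/* empty after the sign: let the general code produce the error */
	if (p == end)
		return pg_strtoint32_general(s, end, result);

	for (; p < end; p++)
	{
		if (*p < '0' || *p > '9')
			return pg_strtoint32_general(s, end, result);
		val = val * 10 + (*p - '0');
		if (val > (int64_t) INT32_MAX + 1)	/* clearly too large: punt */
			return pg_strtoint32_general(s, end, result);
	}

	val = neg ? -val : val;
	if (val < INT32_MIN || val > INT32_MAX)
		return pg_strtoint32_general(s, end, result);

	*result = (int32_t) val;
	return true;
}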
Re: pltcl tests fail with FreeBSD 13.2
Hi, On 2023-07-31 19:11:38 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2023-07-31 18:33:37 -0400, Tom Lane wrote: > >> (And could it be that we had one in the predecessor 13.1 image?) > > > No, I checked, and it's not in there either... It looks like the difference > > is > > that 13.1 reads the UTC zoneinfo in that case, whereas 13.2 doesn't. > > Huh. Maybe worth reporting as a FreeBSD bug? Yea. Hoping our local freebsd developer has a suggestion as to which component to report it to, or even fix it :). > In any case I don't think it's *our* bug. Agreed. An explicit tzsetup UTC isn't much to carry going forward... Greetings, Andres Freund
Re: CDC/ETL system on top of logical replication with pgoutput, custom client
Hi, On 2023-07-31 21:25:06 +, José Neves wrote: > Ok, if I understood you correctly, I start to see where my logic is faulty. > Just to make sure that I got it right, taking the following example again: > > T-1 > INSERT LSN1-1000 > UPDATE LSN2-2000 > UPDATE LSN3-3000 > COMMIT LSN4-4000 > > T-2 > INSERT LSN1-500 > UPDATE LSN2-1500 > UPDATE LSN3-2500 > COMMIT LSN4-5500 > > Where data will arrive in this order: > > INSERT LSN1-500 > INSERT LSN1-1000 > UPDATE LSN2-1500 > UPDATE LSN2-2000 > UPDATE LSN3-2500 > UPDATE LSN3-3000 > COMMIT LSN4-4000 > COMMIT LSN4-5500 No, they won't arrive in that order. They will arive as BEGIN INSERT LSN1-1000 UPDATE LSN2-2000 UPDATE LSN3-3000 COMMIT LSN4-4000 BEGIN INSERT LSN1-500 UPDATE LSN2-1500 UPDATE LSN3-2500 COMMIT LSN4-5500 Because T1 committed before T2. Changes are only streamed out at commit / prepare transaction (*). Within a transaction, they however *will* be ordered by LSN. (*) Unless you use streaming mode, in which case it'll all be more complicated, as you'll also receive changes for transactions that might still abort. > You are saying that the LSN3-3000 will never be missing, either the entire > connection will fail at that point, or all should be received in the > expected order (which is different from the "numeric order" of LSNs). I'm not quite sure what you mean with the "different from the numeric order" bit... > If the connection is down, upon restart, I will receive the entire T-1 > transaction again (well, all example data again). Yes, unless you already acknowledged receipt up to LSN4-4000 and/or are only asking for newer transactions when reconnecting. > In addition to that, if I commit LSN4-4000, even tho that LSN has a "bigger > numeric value" than the ones representing INSERT and UPDATE events on T-2, I > will be receiving the entire T-2 transaction again, as the LSN4-5500 is > still uncommitted. I don't quite know what you mean with "commit LSN4-4000" here. > This makes sense to me, but just to be extra clear, I will never receive a > transaction commit before receiving all other events for that transaction. Correct. Greetings, Andres Freund
Re: pltcl tests fail with FreeBSD 13.2
Hi, On 2023-07-31 18:33:37 -0400, Tom Lane wrote: > Andres Freund writes: > > I saw that CI image builds for freebsd were failing, because 13.1, used > > until > > now, is EOL. Update to 13.2, no problem, I thought. Ran a CI run against > > 13.2 > > - unfortunately that failed. In pltcl of all places. > > I tried to replicate this in a freshly-installed 13.2 VM, and could > not, which is unsurprising because the FreeBSD installation process > does not let you skip selecting a timezone --- which creates > /etc/localtime AFAICT. So I'm unconvinced that we ought to worry > about the case where that's not there. How is it that the CI image > lacks that file? I don't know why it lacks the file - the CI image is based on the google cloud image freebsd maintains / publishes ([1][2]). Which doesn't have /etc/localtime. I've now added a "tzsetup UTC" to the image generation, which fixes the test failure. > (And could it be that we had one in the predecessor 13.1 image?) No, I checked, and it's not in there either... It looks like the difference is that 13.1 reads the UTC zoneinfo in that case, whereas 13.2 doesn't. Upstream 13.1 image: truss date 2>&1|grep -E 'open|stat' ... open("/etc/localtime",O_RDONLY,0101401200) ERR#2 'No such file or directory' open("/usr/share/zoneinfo/UTC",O_RDONLY,00) = 3 (0x3) fstat(3,{ mode=-r--r--r-- ,inode=351417,size=118,blksize=32768 }) = 0 (0x0) open("/usr/share/zoneinfo/posixrules",O_RDONLY,014330460400) = 3 (0x3) fstat(3,{ mode=-r--r--r-- ,inode=356621,size=3535,blksize=32768 }) = 0 (0x0) fstat(1,{ mode=p- ,inode=696,size=0,blksize=4096 }) = 0 (0x0) Upstream 13.2 image: open("/etc/localtime",O_RDONLY,01745)ERR#2 'No such file or directory' fstat(1,{ mode=p- ,inode=658,size=0,blksize=4096 }) = 0 (0x0) Why not reading the UTC zone leads to timestamps being out of range, I do not know... Greetings, Andres Freund [1] https://wiki.freebsd.org/GoogleCloudPlatform [2] https://cloud.google.com/compute/docs/images#freebsd
Re: CDC/ETL system on top of logical replication with pgoutput, custom client
Hi, On 2023-07-31 14:16:22 +, José Neves wrote: > Hi Amit, thanks for the reply. > > In our worker (custom pg replication client), we care only about INSERT, > UPDATE, and DELETE operations, which - sure - may be part of the issue. That seems likely. Postgres streams out changes in commit order, not in order of the changes having been made (that'd not work due to rollbacks etc). If you just disregard transactions entirely, you'll get something bogus after retries. You don't need to store the details for each commit in the target system, just up to which LSN you have processed *commit records*. E.g. if you have received and safely stored up to commit 0/1000, you need to remember that. Are you using the 'streaming' mode / option to pgoutput? > 1. We have no way to match LSN operations with the respective commit, as > they have unordered offsets. Not sure what you mean with "unordered offsets"? > Assuming that all of them were received in order, we would commit all data > with the commit message LSN4-4000 as other events would match the transaction > start and end LSN interval of it. Logical decoding sends out changes in a deterministic order and you won't see out of order data when using TCP (the entire connection can obviously fail though). Andres
Re: pltcl tests fail with FreeBSD 13.2
Hi, On 2023-07-31 12:15:10 -0700, Andres Freund wrote: > It sure looks like freebsd 13.2 tcl is just busted. Notably it can't even > parse what it generates: > > echo 'puts [clock scan [clock format [clock seconds] -format "%Y/%m/%d"] > -format "%Y/%m/%d"]'|tclsh8.6 > > Which works on 13.1 (and other operating systems), without a problem. > > I used truss as a very basic way to see differences between 13.1 and 13.2 - > the big difference was .2 failing just after > access("/etc/localtime",F_OK) ERR#2 'No such file or > directory' > open("/etc/localtime",O_RDONLY,077)ERR#2 'No such file or > directory' > > whereas 13.1 also saw that, but then continued to > > issetugid()= 0 (0x0) > open("/usr/share/zoneinfo/UTC",O_RDONLY,00)= 3 (0x3) > fstat(3,{ mode=-r--r--r-- ,inode=351417,size=118,blksize=32768 }) = 0 (0x0) > ... > > which made me test specifying the timezone explicitly: > echo 'puts [clock scan [clock format [clock seconds] -format "%Y/%m/%d" > -timezone "UTC"] -format "%Y/%m/%d" -timezone "UTC"]'|tclsh8.6 > > Which, surprise, works. > > So does specifying the timezone via the TZ='UTC' environment variable. > > > I guess there could be a libc behaviour change or such around timezones? I do > see > https://www.freebsd.org/releases/13.2R/relnotes/ > "tzcode has been upgraded to version 2022g with improved timezone change > detection and reliability fixes." One additional datapoint: If I configure the timezone with "tzsetup" (which creates /etc/localtime), the problem vanishes as well. Greetings, Andres Freund
pltcl tests fail with FreeBSD 13.2
Hi,

I saw that CI image builds for freebsd were failing, because 13.1, used until
now, is EOL. Update to 13.2, no problem, I thought. Ran a CI run against 13.2
- unfortunately that failed. In pltcl of all places.

https://api.cirrus-ci.com/v1/artifact/task/5275616266682368/testrun/build/testrun/pltcl/regress/regression.diffs

Notably both 13.1 and 13.2 are using tcl 8.6.13.

The code for the relevant function is this:

create function tcl_date_week(int4,int4,int4) returns text as $$
    return [clock format [clock scan "$2/$3/$1"] -format "%U"]
$$ language pltcl immutable;

select tcl_date_week(2010,1,26);

It doesn't surprise me that there are problems with above clock scan, it uses
the MM/DD/ format without making that explicit. But why that doesn't work on
freebsd 13.2, I can't explain. It looks like tcl specifies the MM/DD/ bit for
"free format scans":
https://www.tcl.tk/man/tcl8.6/TclCmd/clock.html#M80

It sure looks like freebsd 13.2 tcl is just busted. Notably it can't even
parse what it generates:

echo 'puts [clock scan [clock format [clock seconds] -format "%Y/%m/%d"] -format "%Y/%m/%d"]'|tclsh8.6

Which works on 13.1 (and other operating systems), without a problem.

I used truss as a very basic way to see differences between 13.1 and 13.2 -
the big difference was .2 failing just after

access("/etc/localtime",F_OK)            ERR#2 'No such file or directory'
open("/etc/localtime",O_RDONLY,077)      ERR#2 'No such file or directory'

whereas 13.1 also saw that, but then continued to

issetugid()                              = 0 (0x0)
open("/usr/share/zoneinfo/UTC",O_RDONLY,00) = 3 (0x3)
fstat(3,{ mode=-r--r--r-- ,inode=351417,size=118,blksize=32768 }) = 0 (0x0)
...

which made me test specifying the timezone explicitly:

echo 'puts [clock scan [clock format [clock seconds] -format "%Y/%m/%d" -timezone "UTC"] -format "%Y/%m/%d" -timezone "UTC"]'|tclsh8.6

Which, surprise, works. So does specifying the timezone via the TZ='UTC'
environment variable.

I guess there could be a libc behaviour change or such around timezones? I do
see https://www.freebsd.org/releases/13.2R/relnotes/
"tzcode has been upgraded to version 2022g with improved timezone change
detection and reliability fixes."

Greetings,

Andres Freund
Postmaster doesn't correctly handle crashes in PM_STARTUP state
Hi,

While testing something I made the checkpointer process intentionally crash
as soon as it started up. The odd thing I observed on macOS is that we start
a *new* checkpointer before shutting down:

2023-07-29 14:32:39.241 PDT [65031] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-07-29 14:32:39.244 PDT [65031] DEBUG:  reaping dead processes
2023-07-29 14:32:39.244 PDT [65031] LOG:  checkpointer process (PID 65032) was terminated by signal 11: Segmentation fault: 11
2023-07-29 14:32:39.244 PDT [65031] LOG:  terminating any other active server processes
2023-07-29 14:32:39.244 PDT [65031] DEBUG:  sending SIGQUIT to process 65034
2023-07-29 14:32:39.245 PDT [65031] DEBUG:  sending SIGQUIT to process 65033
2023-07-29 14:32:39.245 PDT [65031] DEBUG:  reaping dead processes
2023-07-29 14:32:39.245 PDT [65035] LOG:  process 65035 taking over ProcSignal slot 126, but it's not empty
2023-07-29 14:32:39.245 PDT [65031] DEBUG:  reaping dead processes
2023-07-29 14:32:39.245 PDT [65031] LOG:  shutting down because restart_after_crash is off

Note that a new process (65035) is started after the crash has been observed.
I added logging to StartChildProcess(), and the process that's started is
another checkpointer.

I could not initially reproduce this on linux. After a fair bit of confusion,
I figured out the reason: On macOS it takes a bit longer for the startup
process to finish, which means we're still in PM_STARTUP state when we see
that crash, instead of PM_RECOVERY or PM_RUN or ...

The problem is that unfortunately HandleChildCrash() doesn't change pmState
when in PM_STARTUP:

	/* We now transit into a state of waiting for children to die */
	if (pmState == PM_RECOVERY ||
		pmState == PM_HOT_STANDBY ||
		pmState == PM_RUN ||
		pmState == PM_STOP_BACKENDS ||
		pmState == PM_SHUTDOWN)
		pmState = PM_WAIT_BACKENDS;

Once I figured that out, I put a sleep(1) in StartupProcessMain(), and the
problem reproduces on linux as well.

I haven't fully dug through the history, but this looks to be a quite old
problem.

Arguably we might also be missing PM_SHUTDOWN_2, but I can't really see a bad
consequence of that.

Greetings,

Andres Freund
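[Editor's note] For illustration, the most direct shape of a fix would be to
include PM_STARTUP in that state list - a sketch of the idea only; whether
PM_STARTUP can simply be added here without further adjustments is exactly
what would need to be verified:

	/* We now transit into a state of waiting for children to die */
	if (pmState == PM_STARTUP ||	/* sketch: also cover a crash seen during startup */
		pmState == PM_RECOVERY ||
		pmState == PM_HOT_STANDBY ||
		pmState == PM_RUN ||
		pmState == PM_STOP_BACKENDS ||
		pmState == PM_SHUTDOWN)
		pmState = PM_WAIT_BACKENDS;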
Re: Support worker_spi to execute the function dynamically.
Hi, On 2023-07-28 13:45:29 +0900, Michael Paquier wrote: > Having each bgworker on its own schema would be enough to prevent > conflicts, but I'd like to add a second thing: a check on > pg_stat_activity.wait_event after starting the workers. I have added > something like that in the patch I have posted today for the custom > wait events at [1] and it enforces the startup sequences of the > workers in a stricter way. Is that very meaningful? ISTM the interesting thing to check for would be that the state is idle? Greetings, Andres Freund
Re: Eager page freeze criteria clarification
Hi, On 2023-07-28 16:19:47 -0400, Robert Haas wrote: > On Fri, Jul 28, 2023 at 3:00 PM Peter Geoghegan wrote: > > > Or are we trying to determine how likely > > > the freeze record is to emit an FPI and avoid eager freezing when it > > > isn't worth the cost? > > > > No, that's not something that we're doing right now (we just talked > > about doing something like that). In 16 VACUUM just "makes the best > > out of a bad situation" when an FPI was already required during > > pruning. We have already "paid for the privilege" of writing some WAL > > for the page at that point, so it's reasonable to not squander a > > window of opportunity to avoid future FPIs in future VACUUM > > operations, by freezing early. > > This doesn't make sense to me. It's true that if the pruning record > emitted an FPI, then the freezing record probably won't need to do so > either, unless by considerable ill-fortune the redo pointer was moved > in the tiny window between pruning and freezing. But isn't that also > true that if the pruning record *didn't* emit an FPI? In that case, > the pruning record wasn't the first WAL-logged modification to the > page during the then-current checkpoint cycle, and some earlier > modification already did it. But in that case also the freezing won't > need to emit a new FPI, except for the same identical case where the > redo pointer moves at just the wrong time. > > Put differently, I can't see any reason to care whether pruning > emitted an FPI or not. Either way, it's very unlikely that freezing > needs to do so. +many Using FPIs having been generated during pruning as part of the logic to decide whether to freeze makes no sense to me. We likely need some heuristic here, but I suspect it has to look quite different than the FPI condition. Greetings, Andres
Re: Support worker_spi to execute the function dynamically.
Hi, The new test fails with my AIO branch occasionally. But I'm fairly certain that's just due to timing differences. Excerpt from the log: 2023-07-27 21:43:00.385 UTC [42339] LOG: worker_spi worker 3 initialized with schema3.counted 2023-07-27 21:43:00.399 UTC [42344] 001_worker_spi.pl LOG: statement: SELECT datname, count(datname) FROM pg_stat_activity WHERE backend_type = 'worker_spi' GROUP BY datname; 2023-07-27 21:43:00.403 UTC [42340] LOG: worker_spi worker 2 initialized with schema2.counted 2023-07-27 21:43:00.407 UTC [42341] LOG: worker_spi worker 1 initialized with schema1.counted 2023-07-27 21:43:00.420 UTC [42346] 001_worker_spi.pl LOG: statement: SELECT worker_spi_launch(1); 2023-07-27 21:43:00.423 UTC [42347] LOG: worker_spi dynamic worker 1 initialized with schema1.counted 2023-07-27 21:43:00.432 UTC [42349] 001_worker_spi.pl LOG: statement: SELECT worker_spi_launch(2); 2023-07-27 21:43:00.437 UTC [42350] LOG: worker_spi dynamic worker 2 initialized with schema2.counted 2023-07-27 21:43:00.443 UTC [42347] ERROR: duplicate key value violates unique constraint "pg_namespace_nspname_index" 2023-07-27 21:43:00.443 UTC [42347] DETAIL: Key (nspname)=(schema1) already exists. 2023-07-27 21:43:00.443 UTC [42347] CONTEXT: SQL statement "CREATE SCHEMA "schema1" CREATE TABLE "counted" ( type text CHECK (type IN ('total', 'delta')), value integer)CREATE UNIQUE INDEX "counted_unique_total" ON "counted" (type) WHERE type = 'total'" As written, dynamic and static workers race each other. It doesn't make a lot of sense to me to use the same ids for either? The attached patch reproduces the problem on master. Note that without the sleep(3) in the test the workers don't actually finish starting, the test shuts down the cluster before that happens... Greetings, Andres Freund diff --git i/src/test/modules/worker_spi/t/001_worker_spi.pl w/src/test/modules/worker_spi/t/001_worker_spi.pl index c2938713134..093a038b005 100644 --- i/src/test/modules/worker_spi/t/001_worker_spi.pl +++ w/src/test/modules/worker_spi/t/001_worker_spi.pl @@ -52,6 +52,7 @@ $node->append_conf( shared_preload_libraries = 'worker_spi' worker_spi.database = 'mydb' worker_spi.total_workers = 3 +log_statement=all }); $node->restart; @@ -68,6 +69,9 @@ ok( $node->poll_query_until( # check their existence. my $worker1_pid = $node->safe_psql('mydb', 'SELECT worker_spi_launch(1);'); my $worker2_pid = $node->safe_psql('mydb', 'SELECT worker_spi_launch(2);'); + +sleep(3); + ok( $node->poll_query_until( 'mydb', qq[SELECT datname, count(datname) FROM pg_stat_activity diff --git i/src/test/modules/worker_spi/worker_spi.c w/src/test/modules/worker_spi/worker_spi.c index 903dcddef97..ebd2120ec11 100644 --- i/src/test/modules/worker_spi/worker_spi.c +++ w/src/test/modules/worker_spi/worker_spi.c @@ -120,6 +120,8 @@ initialize_worker_spi(worktable *table) elog(FATAL, "failed to create my schema"); debug_query_string = NULL; /* rest is not statement-specific */ + + pg_usleep(USECS_PER_SEC*1); } SPI_finish(); @@ -127,6 +129,8 @@ initialize_worker_spi(worktable *table) CommitTransactionCommand(); debug_query_string = NULL; pgstat_report_activity(STATE_IDLE, NULL); + + elog(LOG, "done"); } void
Re: Buildfarm animal grassquit is failing
Hi, On 2023-07-25 20:10:06 -0400, Tom Lane wrote: > grassquit has been failing to run the regression tests for the last > few days, since [1]: > > # +++ regress check in src/test/regress +++ > # using temp instance on port 6880 with PID 766305 > ERROR: stack depth limit exceeded > HINT: Increase the configuration parameter "max_stack_depth" (currently > 2048kB), after ensuring the platform's stack depth limit is adequate. > ERROR: stack depth limit exceeded > HINT: Increase the configuration parameter "max_stack_depth" (currently > 2048kB), after ensuring the platform's stack depth limit is adequate. > # command failed: "psql" -X -q -c "CREATE DATABASE \\"regression\\" > TEMPLATE=template0 LOCALE='C'" -c "ALTER DATABASE \\"regression\\" SET > lc_messages TO 'C';ALTER DATABASE \\"regression\\" SET lc_monetary TO > 'C';ALTER DATABASE \\"regression\\" SET lc_numeric TO 'C';ALTER DATABASE > \\"regression\\" SET lc_time TO 'C';ALTER DATABASE \\"regression\\" SET > bytea_output TO 'hex';ALTER DATABASE \\"regression\\" SET > timezone_abbreviations TO 'Default';" "postgres" > > This seems to be happening in all branches, so I doubt it > has anything to do with recent PG commits. Instead, I notice > that grassquit seems to have been updated to a newer Debian > release: 6.1.0-6-amd64 works, 6.3.0-2-amd64 doesn't. Indeed, the automated updates had been stuck, and I fixed that a few days ago. I missed grassquit's failures unfortunately. I think I know what the issue is - I have seen them locally. Newer versions of gcc and clang (or libasan?) default to enabling "stack use after return" checks - for some reason that implies using a shadow stack *sometimes*. Which can confuse our stack depth checking terribly, not so surprisingly. I've been meaning to look into what we could do to fix that, but I haven't found the cycles... For now I added :detect_stack_use_after_return=0 to ASAN_OPTIONS, which I think should fix the issue. Greetings, Andres Freund
Re: WAL Insertion Lock Improvements
Hi, On 2023-07-26 07:40:31 +0900, Michael Paquier wrote: > On Tue, Jul 25, 2023 at 09:49:01AM -0700, Andres Freund wrote: > > FWIW, I'm working on a patch that replaces WAL insert locks as a whole, > > because they don't scale all that well. > > What were you looking at here? Just wondering. Here's what I had written offlist a few days ago: > The basic idea is to have a ringbuffer of in-progress insertions. The > acquisition of a position in the ringbuffer is done at the same time as > advancing the reserved LSN, using a 64bit xadd. The trick that makes that > possible is to use the high bits of the atomic for the position in the > ringbuffer. The point of using the high bits is that they wrap around, without > affecting the rest of the value. > > Of course, when using xadd, we can't keep the "prev" pointer in the > atomic. That's where the ringbuffer comes into play. Whenever one inserter has > determined the byte pos of its insertion, it updates the "prev byte pos" in > the *next* ringbuffer entry. > > Of course that means that insertion N+1 needs to wait for N to set the prev > position - but that happens very quickly. In my very hacky prototype the > relevant path (which for now just spins) is reached very rarely, even when > massively oversubscribed. While I've not implemented that, N+1 could actually > do the first "iteration" in CopyXLogRecordToWAL() before it needs the prev > position, the COMP_CRC32C() could happen "inside" the buffer. > > > There's a fair bit of trickyness in turning that into something working, of > course :). Ensuring that the ring buffer of insertions doesn't wrap around is > non-trivial. Nor is trivial to ensure that the "reduced" space LSN in the > atomic can't overflow. > > I do wish MAX_BACKENDS were smaller... > > > Until last night I thought all my schemes would continue to need something > like the existing WAL insertion locks, to implement > WALInsertLockAcquireExclusive(). > > But I think I came up with an idea to do away with that (not even prototyped > yet): Use one bit in the atomic that indicates that no new insertions are > allowed. Whenever the xadd finds that old value actually was locked, it > "aborts" the insertion, and instead waits for a condition variable (or > something similar). Of course that's after modifying the atomic - to deal with > that the "lock holder" reverts all modifications that have been made to the > atomic when releasing the "lock", they weren't actually successful and all > those backends will retry. > > Except that this doesn't quite suffice - XLogInsertRecord() needs to be able > to "roll back", when it finds that we now need to log FPIs. I can't quite see > how to make that work with what I describe above. The only idea I have so far > is to just waste the space with a NOOP record - it should be pretty rare. At > least if we updated RedoRecPtr eagerly (or just stopped this stupid business > of having an outdated backend-local copy). > > > > My prototype shows this idea to be promising. It's a tad slower at low > concurrency, but much better at high concurrency. I think most if not all of > the low-end overhead isn't inherent, but comes from having both old and new > infrastructure in place (as well as a lot of debugging cruft). Greetings, Andres Freund
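[Editor's note] To make the central trick a bit more concrete, here is a very
rough sketch of reserving both insert space and a ringbuffer slot with a
single 64bit xadd. This is purely an illustration of the description above,
not the prototype itself; the bit split and NUM_RING_ENTRIES are made-up
constants, and the real complications mentioned - preventing the ringbuffer
and the reduced-width position from wrapping, exclusive acquisition, and
rolling back when FPIs become necessary - are exactly what it glosses over.

#include "postgres.h"
#include "port/atomics.h"

/* assumptions for illustration only */
#define RINGPOS_SHIFT		48		/* low 48 bits: reserved byte position */
#define BYTEPOS_MASK		((UINT64CONST(1) << RINGPOS_SHIFT) - 1)
#define NUM_RING_ENTRIES	128		/* size of the in-progress-insertion ringbuffer */

/*
 * Sketch: a single atomic fetch-add advances the reserved byte position by
 * the record size and, via the high bits, hands out the next ringbuffer
 * slot.  The high bits are allowed to wrap around without disturbing the
 * byte position kept in the low bits.
 */
static inline void
ReserveInsertPosition(pg_atomic_uint64 *reserved, uint64 size,
					  uint64 *startbytepos, uint32 *ringidx)
{
	uint64		add = (UINT64CONST(1) << RINGPOS_SHIFT) + size;
	uint64		old = pg_atomic_fetch_add_u64(reserved, add);

	*startbytepos = old & BYTEPOS_MASK;
	*ringidx = (uint32) (old >> RINGPOS_SHIFT) % NUM_RING_ENTRIES;
}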
Re: WAL Insertion Lock Improvements
Hi, On 2023-07-25 16:43:16 +0900, Michael Paquier wrote: > On Sat, Jul 22, 2023 at 01:08:49PM +0530, Bharath Rupireddy wrote: > > Yes, it looks safe to me too. > > 0001 has been now applied. I have done more tests while looking at > this patch since yesterday and was surprised to see higher TPS numbers > on HEAD with the same tests as previously, and the patch was still > shining with more than 256 clients. Just a small heads up: I just rebased my aio tree over the commit and promptly, on the first run, saw a hang. I did some debugging on that. Unfortunately repeated runs haven't repeated that hang, despite quite a bit of trying. The symptom I was seeing is that all running backends were stuck in LWLockWaitForVar(), even though the value they're waiting for had changed. Which obviously "shouldn't be possible". It's of course possible that this is AIO specific, but I didn't see anything in stacks to suggest that. I do wonder if this possibly exposed an undocumented prior dependency on the value update always happening under the list lock. Greetings, Andres Freund
Re: Logical walsenders don't process XLOG_CHECKPOINT_SHUTDOWN
Hi, On 2023-07-25 14:31:00 +0530, Amit Kapila wrote: > To ensure that all the data has been sent during the upgrade, we can > ensure that each logical slot's confirmed_flush_lsn (position in the > WAL till which subscriber has confirmed that it has applied the WAL) > is the same as current_wal_insert_lsn. Now, because we don't send > XLOG_CHECKPOINT_SHUTDOWN even on clean shutdown, confirmed_flush_lsn > will never be the same as current_wal_insert_lsn. The one idea being > discussed in patch [1] (see 0003) is to ensure that each slot's LSN is > exactly XLOG_CHECKPOINT_SHUTDOWN ago which probably has some drawbacks > like what if we tomorrow add some other WAL in the shutdown checkpoint > path or the size of record changes then we would need to modify the > corresponding code in upgrade. Yea, that doesn't seem like a good path. But there is a variant that seems better: We could just scan the end of the WAL for records that should have been streamed out? Greetings, Andres Freund
Re: WAL Insertion Lock Improvements
Hi, On 2023-07-25 16:43:16 +0900, Michael Paquier wrote: > On Sat, Jul 22, 2023 at 01:08:49PM +0530, Bharath Rupireddy wrote: > > Yes, it looks safe to me too. > > 0001 has been now applied. I have done more tests while looking at > this patch since yesterday and was surprised to see higher TPS numbers > on HEAD with the same tests as previously, and the patch was still > shining with more than 256 clients. > > > FWIW, 0001 essentially implements what > > an existing TODO comment introduced by commit 008608b9d5106 says: > > We really need to do something in terms of documentation with > something like 0002, so I'll try to look at that next. Regarding > 0003, I don't know. I think that we'd better look more into cases > where it shows actual benefits for specific workloads (like workloads > with a fixed rate of read and/or write operations?). FWIW, I'm working on a patch that replaces WAL insert locks as a whole, because they don't scale all that well. If there's no very clear improvements, I'm not sure it's worth putting too much effort into polishing them all that much. Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-07-25 08:50:19 -0700, Andres Freund wrote: > One idea I had was to add a fastpath that won't parse all strings, but will > parse the strings that we would generate, and fall back to the more general > variant if it fails. See the attached, rough, prototype: > > fix_COPY_DEFAULT.patch + fastpath.patch: > 746.971 > > fix_COPY_DEFAULT.patch + fastpath.patch + isdigit.patch: > 715.570 > > Now, the precise contents of this fastpath are not yet clear (wrt imul or > not), but I think the idea has promise. Btw, I strongly suspect that fastpath wants to be branchless SSE when it grows up. Greetings, Andres Freund
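For reference, the general shape such a fastpath could take, as a standalone sketch (the function name and exact rules are invented here, and this is deliberately simpler than the attached prototype): accept only an optional sign followed by plain decimal digits, and report failure for everything else so the caller can fall back to the fully general parser.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool
strtoint32_fast(const char *s, int32_t *result)
{
	const char *p = s;
	bool		neg = false;
	uint32_t	acc = 0;

	if (*p == '-')
	{
		neg = true;
		p++;
	}
	else if (*p == '+')
		p++;

	if (*p == '\0')
		return false;			/* no digits at all: use the general parser */

	for (; *p; p++)
	{
		/* branch-free digit check: anything outside '0'..'9' ends up >= 10 */
		unsigned char d = (unsigned char) (*p - '0');

		if (d >= 10)
			return false;		/* whitespace, hex, underscores, ...: fall back */
		if (acc > (UINT32_C(2147483648) - d) / 10)
			return false;		/* would overflow: let the slow path report it */
		acc = acc * 10 + d;
	}

	if (!neg && acc > (uint32_t) INT32_MAX)
		return false;
	*result = neg ? (int32_t) (-(int64_t) acc) : (int32_t) acc;
	return true;
}

int
main(void)
{
	int32_t		v;

	if (strtoint32_fast("-12345", &v))
		printf("fast path parsed %d\n", v);
	if (!strtoint32_fast("0x2a", &v))
		printf("\"0x2a\" would be handed to the general parser\n");
	return 0;
}

The point is that the strings COPY itself produces never hit the slow path, while hex, underscores, leading whitespace etc. still work via the fallback.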
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-07-25 23:37:08 +1200, David Rowley wrote: > On Tue, 25 Jul 2023 at 17:34, Andres Freund wrote: > I've not really studied the fix_COPY_DEFAULT.patch patch. Is there a > reason to delay committing that? It would be good to eliminate that > as a variable for the current performance regression. Yea, I don't think there's a reason to hold off on that. Sawada-san, do you want to commit it? Or Andrew? > > prep: > > COPY (SELECT generate_series(1, 200) a, (random() * 10 - > > 5)::int b, 3243423 c) TO '/tmp/lotsaints.copy'; > > DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int); > > > > benchmark: > > psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY > > lotsaints FROM '/tmp/lotsaints.copy';") -t 15 > > > > I disabled turbo mode, pinned the server to a single core of my Xeon Gold > > 5215: > > > > HEAD: 812.690 > > > > your patch: 821.354 > > > > strtoint from 8692f6644e7: 824.543 > > > > strtoint from 6b423ec677d^: 806.678 > > I'm surprised to see the imul version is faster. It's certainly not > what we found when working on 6b423ec67. What CPUs did you test it on? I'd not be surprised if this were heavily dependent on the microarchitecture. > I've been fooling around a bit to try to learn what might be going on. > I wrote 2 patches: > > 1) pg_strtoint_imul.patch: Effectively reverts 6b423ec67 and puts the > code how it likely would have looked had I not done that work at all. > (I've assumed that the hex, octal, binary parsing would have been > added using the overflow multiplication functions just as the base 10 > parsing). > 2) pg_strtoint_optimize.patch: I've made another pass over the > functions with the current overflow checks and optimized the digit > checking code so that it can be done in a single check like: if (digit > < 10). This can be done by assigning the result of *ptr - '0' to an > unsigned char. Anything less than '0' will wrap around and anything > above '9' will remain so. I've also removed a few inefficiencies with > the isspace checking. We didn't need to do "while (*ptr && > isspace(*ptr)). It's fine just to do while (isspace(*ptr)) since '\0' > isn't a space char. I also got rid of the isxdigit call. The > hexlookup array contains -1 for non-hex chars. We may as well check > the digit value is >= 0. > > Here are the results I get using your test as quoted above: > > master @ 62e9af4c63f + fix_COPY_DEFAULT.patch > > latency average = 568.102 ms > > master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_optimize.patch > > latency average = 531.238 ms > > master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch > > latency average = 559.333 ms (surprisingly also faster on my machine) > > The optimized version of the pg_strtoint functions wins over the imul > patch. Could you test to see if this is also the case in your Xeon > machine? 
(these are the numbers with turbo disabled, if I enable it they're all in the 6xx ms range, but the variance is higher)

fix_COPY_DEFAULT.patch 774.344
fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch 777.673
fix_COPY_DEFAULT.patch + pg_strtoint_optimize.patch 777.545
fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch 795.298
fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch + likely(isdigit()) 773.477
fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + pg_strtoint_imul.patch 774.443
fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + pg_strtoint_imul.patch + likely(isdigit()) 774.513
fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + pg_strtoint_imul.patch + likely(isdigit()) + unlikely(*ptr == '_') 763.879

One idea I had was to add a fastpath that won't parse all strings, but will parse the strings that we would generate, and fall back to the more general variant if it fails. See the attached, rough, prototype:

fix_COPY_DEFAULT.patch + fastpath.patch: 746.971
fix_COPY_DEFAULT.patch + fastpath.patch + isdigit.patch: 715.570

Now, the precise contents of this fastpath are not yet clear (wrt imul or not), but I think the idea has promise. > > (when I say strtoint from, I did not replace the goto labels, so that part > > is > > unchanged and unrelated) > > I didn't quite follow this. I meant that I didn't undo ereport()->ereturn(). Greetings, Andres Freund diff --git c/src/backend/utils/adt/numutils.c i/src/backend/utils/adt/numutils.c ind
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, Hm, in some cases your patch is better, but in others both the old code (8692f6644e7) and HEAD beat yours on my machine. TBH, not entirely sure why.

prep: COPY (SELECT generate_series(1, 200) a, (random() * 10 - 5)::int b, 3243423 c) TO '/tmp/lotsaints.copy'; DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);

benchmark: psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY lotsaints FROM '/tmp/lotsaints.copy';") -t 15

I disabled turbo mode, pinned the server to a single core of my Xeon Gold 5215:

HEAD: 812.690
your patch: 821.354
strtoint from 8692f6644e7: 824.543
strtoint from 6b423ec677d^: 806.678

(when I say strtoint from, I did not replace the goto labels, so that part is unchanged and unrelated)

IOW, for me the code from 15 is the fastest by a good bit... There's an imul, sure, but the fact that it sets a flag makes it faster than having to add more tests and branches.

Looking at a profile reminded me of how silly it is that we are wasting a good chunk of the time in these isdigit() checks, even though we already rely on the ASCII values (via *ptr++ - '0'). I think that's done in the headers for some platforms, but not others (glibc). And we've even done that already for octal and binary! Open coding isdigit() gives us:

HEAD: 797.434
your patch: 803.570
strtoint from 8692f6644e7: 778.943
strtoint from 6b423ec677d^: 777.741

It's somewhat odd that HEAD and your patch switch position here...

> - else if (ptr[0] == '0' && (ptr[1] == 'o' || ptr[1] == 'O'))
> + /* process hex digits */
> + else if (ptr[1] == 'x' || ptr[1] == 'X')
> {
>
> firstdigit = ptr += 2;

I find this unnecessarily hard to read. I realize it's been added in 6fcda9aba83, but I don't see a reason to use such a construct here.

I find it somewhat grating how much duplication there now is in this code due to being repeated for all the bases...

Greetings, Andres Freund
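The open-coded isdigit() mentioned above boils down to a single unsigned comparison; a standalone version of the trick (the same idea David describes elsewhere in the thread):

#include <stdbool.h>

/*
 * Relies on the digits being contiguous in ASCII, which the surrounding
 * parsing code already assumes via (*ptr - '0'): values below '0' wrap
 * around to something large, values above '9' stay >= 10, so one unsigned
 * comparison replaces the isdigit() call / table lookup.
 */
static inline bool
ascii_isdigit(char c)
{
	return (unsigned char) (c - '0') < 10;
}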
Re: [BUG] Crash on pgbench initialization.
Hi, On 2023-07-24 09:42:44 -0700, Andres Freund wrote: > > I don't know this code at all, but I hope that this can be solved with > > just Anton's proposed patch. > > Yes, it's just that off-by-one. I need to check if there's a similar bug for > local / temp table buffers though. Doesn't appear that way. We *do* fail if there's only 1 remaining buffer, but we already did before my change (because we also need to pin the fsm). I don't think that's an issue worth worrying about; if all but one of the local buffers are pinned, you're going to have problems. Thanks Anton / Victoria for the report and fix. Pushed. Greetings, Andres Freund
Re: [BUG] Crash on pgbench initialization.
Hi, On 2023-07-24 15:54:33 +0200, Alvaro Herrera wrote: > On 2023-Jul-24, Michael Paquier wrote: > > > On Sun, Jul 23, 2023 at 11:21:47PM +0300, Anton A. Melnikov wrote: > > > After some research, found this happens when the LimitAdditionalPins() > > > returns exactly zero. > > > In the current master, this will happen e.g. if shared_buffers = 10MB and > > > max_worker_processes = 40. > > > Then the command "pgbench --initialize postgres" will lead to crash. > > > See the backtrace attached. > > > > > > There is a fix in the patch applied. Please take a look on it. > > > > Nice catch, issue reproduced here so I am adding an open item for now. > > (I have not looked at the patch, yet.) > > Hmm, I see that all this code was added by 31966b151e6a, which makes > this Andres' item. I see Michael marked it as such in the open items > page, but did not CC Andres, so I'm doing that here now. Thanks - I had indeed not seen this. I can't really keep up with the list at all times... > I don't know this code at all, but I hope that this can be solved with > just Anton's proposed patch. Yes, it's just that off-by-one. I need to check if there's a similar bug for local / temp table buffers though. Greetings, Andres Freund
Re: Avoid stack frame setup in performance critical routines using tail calls
Hi, David and I were chatting about this patch, in the context of his bump allocator patch. Attached is a rebased version that is also split up into two steps, and a bit more polished. I wasn't sure what a good test was. I ended up measuring COPY pgbench_accounts TO '/dev/null' WITH (FORMAT 'binary'); of a scale 1 database with pgbench: c=1;pgbench -q -i -s1 && pgbench -n -c$c -j$c -t100 -f <(echo "COPY pgbench_accounts TO '/dev/null' WITH (FORMAT 'binary');") average latency HEAD: 33.865 ms 01: 32.820 ms 02: 29.934 ms The server was pinned to the one core, turbo mode disabled. That's a pretty nice win, I'd say. And I don't think this is actually the most allocator bound workload, I just tried something fairly random... Greetings, Andres Freund >From 4b97070e32ed323ae0f083bf865717c6d271ce53 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Tue, 18 Jul 2023 18:55:58 -0700 Subject: [PATCH v2 1/4] Optimize palloc() etc to allow sibling calls The reason palloc() etc previously couldn't use sibling calls is that they did check the returned value (to e.g. raise an error when the allocation fails). Push the error handling down into the memory context implementation - they can avoid performing the check at all in the hot code paths. --- src/include/nodes/memnodes.h | 4 +- src/include/utils/memutils_internal.h | 29 +++- src/backend/utils/mmgr/alignedalloc.c | 11 +- src/backend/utils/mmgr/aset.c | 23 ++-- src/backend/utils/mmgr/generation.c | 14 +- src/backend/utils/mmgr/mcxt.c | 186 ++ src/backend/utils/mmgr/slab.c | 12 +- 7 files changed, 102 insertions(+), 177 deletions(-) diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h index ff6453bb7ac..47d2932e6ed 100644 --- a/src/include/nodes/memnodes.h +++ b/src/include/nodes/memnodes.h @@ -57,10 +57,10 @@ typedef void (*MemoryStatsPrintFunc) (MemoryContext context, void *passthru, typedef struct MemoryContextMethods { - void *(*alloc) (MemoryContext context, Size size); + void *(*alloc) (MemoryContext context, Size size, int flags); /* call this free_p in case someone #define's free() */ void (*free_p) (void *pointer); - void *(*realloc) (void *pointer, Size size); + void *(*realloc) (void *pointer, Size size, int flags); void (*reset) (MemoryContext context); void (*delete_context) (MemoryContext context); MemoryContext (*get_chunk_context) (void *pointer); diff --git a/src/include/utils/memutils_internal.h b/src/include/utils/memutils_internal.h index 2d107bbf9d4..3a050819263 100644 --- a/src/include/utils/memutils_internal.h +++ b/src/include/utils/memutils_internal.h @@ -19,9 +19,9 @@ #include "utils/memutils.h" /* These functions implement the MemoryContext API for AllocSet context. */ -extern void *AllocSetAlloc(MemoryContext context, Size size); +extern void *AllocSetAlloc(MemoryContext context, Size size, int flags); extern void AllocSetFree(void *pointer); -extern void *AllocSetRealloc(void *pointer, Size size); +extern void *AllocSetRealloc(void *pointer, Size size, int flags); extern void AllocSetReset(MemoryContext context); extern void AllocSetDelete(MemoryContext context); extern MemoryContext AllocSetGetChunkContext(void *pointer); @@ -36,9 +36,9 @@ extern void AllocSetCheck(MemoryContext context); #endif /* These functions implement the MemoryContext API for Generation context. 
*/ -extern void *GenerationAlloc(MemoryContext context, Size size); +extern void *GenerationAlloc(MemoryContext context, Size size, int flags); extern void GenerationFree(void *pointer); -extern void *GenerationRealloc(void *pointer, Size size); +extern void *GenerationRealloc(void *pointer, Size size, int flags); extern void GenerationReset(MemoryContext context); extern void GenerationDelete(MemoryContext context); extern MemoryContext GenerationGetChunkContext(void *pointer); @@ -54,9 +54,9 @@ extern void GenerationCheck(MemoryContext context); /* These functions implement the MemoryContext API for Slab context. */ -extern void *SlabAlloc(MemoryContext context, Size size); +extern void *SlabAlloc(MemoryContext context, Size size, int flags); extern void SlabFree(void *pointer); -extern void *SlabRealloc(void *pointer, Size size); +extern void *SlabRealloc(void *pointer, Size size, int flags); extern void SlabReset(MemoryContext context); extern void SlabDelete(MemoryContext context); extern MemoryContext SlabGetChunkContext(void *pointer); @@ -75,7 +75,7 @@ extern void SlabCheck(MemoryContext context); * part of a fully-fledged MemoryContext type. */ extern void AlignedAllocFree(void *pointer); -extern void *AlignedAllocRealloc(void *pointer, Size size); +extern void *AlignedAllocRealloc(void *pointer, Size size, int flags); extern MemoryContext AlignedAllocGetChunkContext(v
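To summarize the shape of the (truncated) patch above: the allocation-failure check moves into the memory context implementation, so the wrapper no longer has to inspect the return value and the compiler can emit a sibling (tail) call. A standalone sketch with invented types, not the actual MemoryContext machinery:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct Context
{
	void	   *(*alloc) (struct Context *cxt, size_t size, int flags);
} Context;

static void *
example_alloc(Context *cxt, size_t size, int flags)
{
	void	   *ret = malloc(size);

	(void) cxt;
	(void) flags;

	/* error handling lives down here, out of the wrapper's hot path */
	if (ret == NULL)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return ret;
}

static Context example_context = {example_alloc};
static Context *current_context = &example_context;

void *
my_palloc(size_t size)
{
	Context    *cxt = current_context;

	/* no check of the result here, so this can compile down to a plain jump */
	return cxt->alloc(cxt, size, 0);
}

int
main(void)
{
	char	   *p = my_palloc(16);

	free(p);
	return 0;
}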
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
Hi, I wanted this feature a couple times before... On 2023-07-03 13:56:29 +0530, Palak Chaturvedi wrote: > +PG_FUNCTION_INFO_V1(pg_buffercache_invalidate); > +Datum > +pg_buffercache_invalidate(PG_FUNCTION_ARGS) I don't think "invalidating" is the right terminology. Note that we already have InvalidateBuffer() - but it's something we can't allow users to do, as it throws away dirty buffer contents (it's used for things like dropping a table). How about using "discarding" for this functionality? Using the buffer ID as the identifier doesn't seem great, because what that buffer is used for, could have changed since the buffer ID has been acquired (via the pg_buffercache view presumably)? My suspicion is that the usual usecase for this would be to drop all buffers that can be dropped? > + if (bufnum < 0 || bufnum > NBuffers) > + { > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("buffernum is not valid"))); > + > + } > + > + result = TryInvalidateBuffer(bufnum); > + PG_RETURN_BOOL(result); > +} I think this should be restricted to superuser by default (by revoking permissions from PUBLIC). We allow normal users to use pg_prewarm(...), true - but we perform an ACL check on the relation, so it can only be used for relations you have access too. This function could be used to affect performance of other users quite substantially. > +/* > +Try Invalidating a buffer using bufnum. > +If the buffer is invalid, the function returns false. > +The function checks for dirty buffer and flushes the dirty buffer before > invalidating. > +If the buffer is still dirty it returns false. > +*/ > +bool > +TryInvalidateBuffer(Buffer bufnum) > +{ > + BufferDesc *bufHdr = GetBufferDescriptor(bufnum - 1); > + uint32 buf_state; > + > + ReservePrivateRefCountEntry(); > + > + buf_state = LockBufHdr(bufHdr); > + if ((buf_state & BM_VALID) == BM_VALID) > + { > + /* > + * The buffer is pinned therefore cannot invalidate. > + */ > + if (BUF_STATE_GET_REFCOUNT(buf_state) > 0) > + { > + UnlockBufHdr(bufHdr, buf_state); > + return false; > + } > + if ((buf_state & BM_DIRTY) == BM_DIRTY) > + { > + /* > + * Try once to flush the dirty buffer. > + */ > + PinBuffer_Locked(bufHdr); > + LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), > LW_SHARED); > + FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, > IOCONTEXT_NORMAL); > + LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); > + UnpinBuffer(bufHdr); > + buf_state = LockBufHdr(bufHdr); > + if (BUF_STATE_GET_REFCOUNT(buf_state) > 0) > + { > + UnlockBufHdr(bufHdr, buf_state); > + return false; > + } > + > + /* > + * If its dirty again or not valid anymore give up. > + */ > + > + if ((buf_state & (BM_DIRTY | BM_VALID)) != (BM_VALID)) > + { > + UnlockBufHdr(bufHdr, buf_state); > + return false; > + } > + > + } > + > + InvalidateBuffer(bufHdr); I'm wary of using InvalidateBuffer() here, it's typically used for different purposes, including throwing valid contents away. That seems a bit scary. I think you should be able to just use InvalidateVictimBuffer()? Greetings, Andres Freund
Re: Atomic ops for unlogged LSN
Hi, On 2023-07-17 16:15:52 -0700, Nathan Bossart wrote: > On Mon, Jul 17, 2023 at 07:08:03PM -0400, Stephen Frost wrote: > > Awesome. Was there any other feedback on this change which basically is > > just getting rid of a spinlock and replacing it with using atomics? > > It's still in needs-review status but there's been a number of > > comments/reviews (drive-by, at least) but without any real ask for any > > changes to be made. > > LGTM Why don't we just use a barrier around reading the value? It's not like CreateCheckPoint() is frequent? Greetings, Andres Freund
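As a standalone sketch of that suggestion (C11 atomics standing in for pg_atomic_*, all names invented; an illustration of the direction, not the patch): keep the per-record assignment a single atomic increment, and only the rare checkpoint-time reader pays for a barrier to order the read against the rest of the checkpoint work.

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t unlogged_lsn;

/* hot path: one atomic increment, no spinlock */
static uint64_t
next_unlogged_lsn(void)
{
	return atomic_fetch_add_explicit(&unlogged_lsn, 1, memory_order_relaxed) + 1;
}

/* rare path (checkpoint): a full barrier before the read is cheap here */
static uint64_t
read_unlogged_lsn_for_checkpoint(void)
{
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&unlogged_lsn, memory_order_relaxed);
}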
Report distinct wait events when waiting for WAL "operation"
Hi, In a number of workloads one can see two wait events prominently: LWLock:WALWrite and LWLock:WALInsert. Unfortunately for both that is not very informative:

LWLock:WALWrite can be reported because there is genuine contention on the LWLock, or, more commonly, because several backends are waiting for another to finish IO. In the latter case we are not actually waiting to acquire the lock, we are waiting for the lock to be released *without* then acquiring it.

LWLock:WALInsert can be reported because there are not enough WALInsert locks (c.f. NUM_XLOGINSERT_LOCKS) or because we are waiting for another backend to finish copying a WAL record into wal_buffers. In the latter case we are therefore not waiting to acquire an LWLock.

I think both of these cases are relevant to distinguish from an operational perspective. Secondarily, I've received many questions about making those locks more scalable / granular, when in most of the cases the issue was not actual lock contention. Today it's surprisingly hard to figure out whether the issue is lock contention or the speed of copying buffers for WAL insert locks / computing the last part of the CRC checksum.

Therefore I'm proposing that LWLockAcquireOrWait() and LWLockWaitForVar() not use the "generic" LWLockReportWaitStart(), but use caller-provided wait events. The attached patch adds two new wait events for the existing callers. I waffled a bit about which wait event section to add these to. Ended up with "IPC", but would be happy to change that.

WAIT_EVENT_WAL_WAIT_INSERT  WALWaitInsert  "Waiting for WAL record to be copied into buffers."
WAIT_EVENT_WAL_WAIT_WRITE   WALWaitWrite   "Waiting for WAL buffers to be written or flushed to disk."

Previously it was e.g. not really possible to distinguish that something like this:

┌────────────────┬─────────────────┬────────────┬───────┐
│  backend_type  │ wait_event_type │ wait_event │ count │
├────────────────┼─────────────────┼────────────┼───────┤
│ client backend │ LWLock          │ WALInsert  │    32 │
│ client backend │ (null)          │ (null)     │     9 │
│ walwriter      │ IO              │ WALWrite   │     1 │
│ client backend │ Client          │ ClientRead │     1 │
│ client backend │ LWLock          │ WALWrite   │     1 │
└────────────────┴─────────────────┴────────────┴───────┘

is a workload with a very different bottleneck than this:

┌────────────────┬─────────────────┬───────────────┬───────┐
│  backend_type  │ wait_event_type │  wait_event   │ count │
├────────────────┼─────────────────┼───────────────┼───────┤
│ client backend │ IPC             │ WALWaitInsert │    22 │
│ client backend │ LWLock          │ WALInsert     │    13 │
│ client backend │ LWLock          │ WALBufMapping │     5 │
│ walwriter      │ (null)          │ (null)        │     1 │
│ client backend │ Client          │ ClientRead    │     1 │
│ client backend │ (null)          │ (null)        │     1 │
└────────────────┴─────────────────┴───────────────┴───────┘

even though they are very different. FWIW, the former is bottlenecked by the number of WAL insertion locks, the second is bottlenecked by copying WAL into buffers due to needing to flush them.

Greetings, Andres Freund

>From 9dcf4a45b6ed7d8fca1a1cd6782f11dff3a84406 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 17 Jul 2023 09:20:05 -0700 Subject: [PATCH v1] Caller specified wait events for LWLockWaitForVar(), LWLockAcquireOrWait() Reporting waits within those functions as LWLock wait events is misleading, as we do not actually wait for the locks themselves.
Author: Reviewed-by: Discussion: https://postgr.es/m/ Backpatch: --- src/include/storage/lwlock.h | 6 -- src/backend/access/transam/xlog.c | 6 -- src/backend/storage/lmgr/lwlock.c | 19 +-- .../utils/activity/wait_event_names.txt | 2 ++ 4 files changed, 23 insertions(+), 10 deletions(-) diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h index 34169e5889e..0f345076f41 100644 --- a/src/include/storage/lwlock.h +++ b/src/include/storage/lwlock.h @@ -127,7 +127,8 @@ extern PGDLLIMPORT bool Trace_lwlocks; extern bool LWLockAcquire(LWLock *lock, LWLockMode mode); extern bool LWLockConditionalAcquire(LWLock *lock, LWLockMode mode); -extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode); +extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode, +uint32 wait_event_info); extern void LWLockRelease(LWLock *lock); extern void LWLockReleaseClearVar(LWLock *lock, uint64 *valptr, uint64 val); extern void LWLockReleaseAll(void); @@ -135,7 +136,8 @@ extern bool LWLockHeldByMe(LWLock *lock); extern bool LWLockAnyHeldByMe(LWLock *lock, int nlocks, size_t stride); extern bool LWLockHeldByMeInMode(LWLock *lock, LWLockMode mode); -extern bool LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval);
Re: Improve heapgetpage() performance, overhead from serializable
Hi, On 2023-07-17 09:55:07 +0800, Zhang Mingli wrote: > LGTM and I have a fool question: > > if (likely(all_visible)) > { > if (likely(!check_serializable)) > scan->rs_ntuples = heapgetpage_collect(scan, snapshot, > page, buffer, > > block, lines, 1, 0); > else > scan->rs_ntuples = heapgetpage_collect(scan, snapshot, > page, buffer, > > block, lines, 1, 1); > } > else > { > if (likely(!check_serializable)) > scan->rs_ntuples = heapgetpage_collect(scan, snapshot, > page, buffer, > > block, lines, 0, 0); > else > scan->rs_ntuples = heapgetpage_collect(scan, snapshot, > page, buffer, > > block, lines, 0, 1); > > > Does it make sense to combine if else condition and put it to the incline > function’s param? > > Like: > scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer, > > block, lines, all_visible, check_serializable); I think that makes it less likely that the compiler actually generates a constant-folded version for each of the branches. Perhaps worth some experimentation. Greetings, Andres Freund
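A standalone toy illustration of the point (nothing to do with heapam.c itself; the loop and names are invented): when the always-inline helper is called with literal constant flags, each call site gets its own specialized copy with the branches folded away, which is exactly what collapsing the call sites into one call with variable arguments would make less likely.

#include <stdbool.h>
#include <stdio.h>

__attribute__((always_inline)) static inline int
collect(const int *vals, int n, bool all_visible, bool check_extra)
{
	int			ntup = 0;

	for (int i = 0; i < n; i++)
	{
		bool		valid = all_visible ? true : (vals[i] % 2 == 0);

		if (check_extra)
			valid = valid && vals[i] >= 0;
		if (valid)
			ntup++;
	}
	return ntup;
}

int
main(void)
{
	int			vals[] = {1, 2, 3, 4, -5, 6};

	/* constant arguments: the compiler can drop the flag tests from the loop */
	printf("%d\n", collect(vals, 6, true, false));
	printf("%d\n", collect(vals, 6, false, true));
	return 0;
}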
Re: Autogenerate some wait events code and documentation
Hi, On 2023-07-14 13:49:22 +0900, Michael Paquier wrote: > I have looked again at 0001 and 0002 and applied them to get them out > of the way. 0003 and 0004 are rebased and attached. I'll add them to > the CF for later consideration. More opinions are welcome. > From b6390183bdcc054df82279bb0b2991730f85a0a3 Mon Sep 17 00:00:00 2001 > From: Michael Paquier > Date: Thu, 13 Jul 2023 10:14:47 +0900 > Subject: [PATCH v3 4/4] Remove column for enum elements in > wait_event_names.txt > > This file is now made of two columns, removing the column listing the > enum elements for each wait event class: > - Camelcase event name used in pg_stat_activity. There are now > unquoted. > - Description of the documentation. > The enum elements are generated from what is now the first column. I think the search issue is valid, so I do think going the other way is preferable. I.e. use just the enum value in the .txt and generate the camel case name from that. That allows you to search the define used in code and find a hit in the file. I personally would still leave off the WAIT_EVENT prefix in the .txt; I think most of us can remember to chop that off. I don't think we need to be particularly consistent with wait events across major versions. They're necessarily tied to how the code works, and we've yanked that around plenty. Greetings, Andres Freund
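A standalone sketch of the direction suggested above (the real generator is a Perl script; the function here is invented): derive the camel-case name shown in pg_stat_activity from the enum identifier, so wait_event_names.txt only needs the value you would grep for in the code. As the comment notes, acronyms such as "WAL" would need special-casing to reproduce the existing names exactly.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

static void
enum_to_camel(const char *enum_name, char *out)
{
	const char	prefix[] = "WAIT_EVENT_";
	const char *p = enum_name;
	int			j = 0;
	int			start_of_word = 1;

	/* chop the WAIT_EVENT_ prefix if present */
	if (strncmp(p, prefix, sizeof(prefix) - 1) == 0)
		p += sizeof(prefix) - 1;

	for (; *p; p++)
	{
		if (*p == '_')
		{
			start_of_word = 1;
			continue;
		}
		out[j++] = start_of_word ? *p : (char) tolower((unsigned char) *p);
		start_of_word = 0;
	}
	out[j] = '\0';
}

int
main(void)
{
	char		buf[64];

	enum_to_camel("WAIT_EVENT_WAL_WAIT_INSERT", buf);
	puts(buf);					/* prints "WalWaitInsert"; "WAL" would need an acronym rule */
	return 0;
}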
Improve heapgetpage() performance, overhead from serializable
Hi, Several loops which are important for query performance, like heapgetpage()'s loop over all tuples, have to call functions like HeapCheckForSerializableConflictOut() and PredicateLockTID() in every iteration. When serializable is not in use, all those functions do is return. But being situated in a different translation unit, the compiler can't inline (without LTO at least) the check whether serializability is needed. It's not just the function call overhead that's noticeable, it's also that registers have to be spilled to the stack / reloaded from memory etc.

On a freshly loaded pgbench scale 100, with turbo mode disabled, postgres pinned to one core. Parallel workers disabled to reduce noise. All times are the average of 15 executions with pgbench, in a newly started, but prewarmed postgres.

SELECT * FROM pgbench_accounts OFFSET 1000;

HEAD: 397.977

removing the HeapCheckForSerializableConflictOut() from heapgetpage() (incorrect!), to establish the baseline of what serializable costs: 336.695

pulling out CheckForSerializableConflictOutNeeded() from HeapCheckForSerializableConflictOut() in heapgetpage(), and avoiding calling HeapCheckForSerializableConflictOut() in the loop: 339.742

moving the loop into a static inline function, marked as pg_always_inline, called with static arguments for always_visible, check_serializable: 326.546

marking the always_visible, !check_serializable case likely(): 322.249

removing TestForOldSnapshot() calls, which we pretty much already decided on: 312.987

FWIW, there's more we can do, with some hacky changes I got the time down to 273.261, but the tradeoffs start to be a bit more complicated. And 397->320ms for something as core as this is imo worth considering on its own.

Now, this just affects the sequential scan case. heap_hot_search_buffer() shares many of the same pathologies. I find it a bit harder to improve, because the compiler's code generation seems to switch between good / bad with changes that seem unrelated...

I wonder why we haven't used PageIsAllVisible() in heap_hot_search_buffer() so far?

Greetings, Andres Freund

>From 92ca2c15ccf2adb9b7f2da9cf86b571534287136 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Sat, 15 Jul 2023 15:13:30 -0700 Subject: [PATCH v1] optimize heapgetpage() --- src/backend/access/heap/heapam.c | 100 ++- 1 file changed, 72 insertions(+), 28 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 7ed72abe597..d662d2898a2 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -366,6 +366,56 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk scan->rs_numblocks = numBlks; } +/* + * The loop for heapgetpage() in pagemode. Pulled out so it can be called + * multiple times, with constant arguments for all_visible, + * check_serializable.
+ */ +pg_attribute_always_inline +static int +heapgetpage_collect(HeapScanDesc scan, Snapshot snapshot, + Page page, Buffer buffer, + BlockNumber block, int lines, + bool all_visible, bool check_serializable) +{ + int ntup = 0; + OffsetNumber lineoff; + + for (lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++) + { + ItemId lpp = PageGetItemId(page, lineoff); + bool valid; + HeapTupleData loctup; + + if (!ItemIdIsNormal(lpp)) + continue; + + loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp); + loctup.t_len = ItemIdGetLength(lpp); + loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd); + ItemPointerSet(&(loctup.t_self), block, lineoff); + + if (all_visible) + valid = true; + else + valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer); + + if (check_serializable) + HeapCheckForSerializableConflictOut(valid, scan->rs_base.rs_rd, +&loctup, buffer, snapshot); + + if (valid) + { + scan->rs_vistuples[ntup] = lineoff; + ntup++; + } + } + + Assert(ntup <= MaxHeapTuplesPerPage); + + return ntup; +} + /* * heapgetpage - subroutine for heapgettup() * @@ -381,9 +431,8 @@ heapgetpage(TableScanDesc sscan, BlockNumber block) Snapshot snapshot; Page page; int lines; - int ntup; - OffsetNumber lineoff; bool all_visible; + bool check_serializable; Assert(block < scan->rs_nblocks); @@ -427,7 +476,6 @@ heapgetpage(TableScanDesc sscan, BlockNumber block) page = BufferGetPage(buffer); TestForOldSnapshot(snapshot, scan->rs_base.rs_rd, page); lines = PageGetMaxOffsetNumber(page); - ntup = 0; /* * If the all-visible flag indicates that all tuples on the page are @@ -450,37 +498,33 @@ heapgetpage(TableScanDesc sscan, BlockNumber block) * tuple for visibility the hard way. */ all_visible = PageIsAllVisible(page) && !snapshot->takenDuringRecovery; + check_serializable = + CheckForSerializableConflictOutNeeded(scan->rs_base.rs_
Re: Inefficiency in parallel pg_restore with many tables
Hi, On 2023-07-15 13:47:12 -0400, Tom Lane wrote: > I wonder if we could replace the sorted ready-list with a priority heap, > although that might be complicated by the fact that pop_next_work_item > has to be capable of popping something that's not necessarily the > largest remaining job. Another idea could be to be a little less eager > to sort the list every time; I think in practice scheduling wouldn't > get much worse if we only re-sorted every so often. Perhaps we could keep track of where the newly inserted items are, and use insertion sort or such when the number of new elements is much smaller than the size of the already sorted elements? As you say, a straight priority heap might not be easy. But we could just open code using two sorted arrays, one large, one for recent additions that needs to be newly sorted. And occasionally merge the small array into the big array, once it has gotten large enough that sorting becomes expensive. We could go for a heap of N>2 such arrays, but I doubt it would be worth much. Greetings, Andres Freund
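A standalone sketch of that two-array idea (plain ints standing in for ready-list entries, kept in descending order since the scheduler wants large jobs first; the merge thresholds are arbitrary placeholders): new items are insertion-sorted into a small array, and once that array gets big enough that repeated insertion sorting would hurt, it is folded into the large array with a single merge pass.

#include <stdlib.h>

typedef struct ReadyList
{
	int		   *big;			/* large array, sorted descending */
	int			nbig;
	int		   *recent;			/* small array of new entries, sorted descending */
	int			nrecent;
} ReadyList;

/* Fold the small array into the big one with one linear merge pass. */
static void
ready_merge(ReadyList *rl)
{
	int			n = rl->nbig + rl->nrecent;
	int		   *merged = malloc(sizeof(int) * n);
	int			i = 0, j = 0, k = 0;

	while (i < rl->nbig && j < rl->nrecent)
		merged[k++] = (rl->big[i] >= rl->recent[j]) ? rl->big[i++] : rl->recent[j++];
	while (i < rl->nbig)
		merged[k++] = rl->big[i++];
	while (j < rl->nrecent)
		merged[k++] = rl->recent[j++];

	free(rl->big);
	rl->big = merged;
	rl->nbig = n;
	rl->nrecent = 0;
}

static void
ready_insert(ReadyList *rl, int item)
{
	int			i = rl->nrecent++;

	rl->recent = realloc(rl->recent, sizeof(int) * rl->nrecent);

	/* insertion sort: cheap as long as the recent array stays small */
	while (i > 0 && rl->recent[i - 1] < item)
	{
		rl->recent[i] = rl->recent[i - 1];
		i--;
	}
	rl->recent[i] = item;

	/* arbitrary thresholds: merge once the small array is no longer small */
	if (rl->nrecent >= 1024 || (rl->nbig > 0 && rl->nrecent > rl->nbig / 8))
		ready_merge(rl);
}

Popping would presumably scan the big array and then the recent one, so the requirement that pop_next_work_item be able to skip over entries that aren't the largest remaining job isn't affected by the split.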
XLogSaveBufferForHint() correctness and more
Hi, While looking at [1] I started to wonder why it is safe that CreateCheckPoint() updates XLogCtl->RedoRecPtr after releasing the WAL insertion lock:

/* * Now we can release the WAL insertion locks, allowing other xacts to * proceed while we are flushing disk buffers. */ WALInsertLockRelease(); /* Update the info_lck-protected copy of RedoRecPtr as well */ SpinLockAcquire(&XLogCtl->info_lck); XLogCtl->RedoRecPtr = checkPoint.redo; SpinLockRelease(&XLogCtl->info_lck);

The most important user of that is GetRedoRecPtr(). Right now I'm a bit confused why it's ok that

/* Update the info_lck-protected copy of RedoRecPtr as well */ SpinLockAcquire(&XLogCtl->info_lck); XLogCtl->RedoRecPtr = checkPoint.redo; SpinLockRelease(&XLogCtl->info_lck);

happens after WALInsertLockRelease(). But then I started to wonder, even if that weren't the case, how come XLogSaveBufferForHint() and other uses of GetRedoRecPtr() aren't racy as hell?

The reason XLogInsertRecord() can safely check if an FPW is needed is that it holds a WAL insertion lock; the redo pointer cannot change until the insertion lock is released. But there's *zero* interlock in XLogSaveBufferForHint() from what I can tell? A checkpoint could easily start between the GetRedoRecPtr() and the check whether this buffer needs to be WAL logged?

While XLogSaveBufferForHint() makes no note of this, its sole caller, MarkBufferDirtyHint(), tries to deal with some related concerns to some degree:

/* * If the block is already dirty because we either made a change * or set a hint already, then we don't need to write a full page * image. Note that aggressive cleaning of blocks dirtied by hint * bit setting would increase the call rate. Bulk setting of hint * bits would reduce the call rate... * * We must issue the WAL record before we mark the buffer dirty. * Otherwise we might write the page before we write the WAL. That * causes a race condition, since a checkpoint might occur between * writing the WAL record and marking the buffer dirty. We solve * that with a kluge, but one that is already in use during * transaction commit to prevent race conditions. Basically, we * simply prevent the checkpoint WAL record from being written * until we have marked the buffer dirty. We don't start the * checkpoint flush until we have marked dirty, so our checkpoint * must flush the change to disk successfully or the checkpoint * never gets written, so crash recovery will fix. * * It's possible we may enter here without an xid, so it is * essential that CreateCheckPoint waits for virtual transactions * rather than full transactionids. */ Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0); MyProc->delayChkptFlags |= DELAY_CHKPT_START; delayChkptFlags = true; lsn = XLogSaveBufferForHint(buffer, buffer_std);

but I don't think that really does all that much, because the DELAY_CHKPT_START handling in CreateCheckPoint() happens after we determine the redo pointer. This code isn't even reached if we wrongly skipped due to the if (lsn <= RedoRecPtr).

I seriously doubt this can correctly be implemented outside of xlog*.c / without the use of a WALInsertLock? I feel like I must be missing something here; this isn't a particularly narrow race?

It looks to me like the use of GetRedoRecPtr() in nextval_internal() is also wrong. I think the uses in slot.c, snapbuild.c, rewriteheap.c are fine.

Greetings, Andres Freund

[1] https://www.postgresql.org/message-id/20230714151626.rhgae7taigk2xrq7%40awork3.anarazel.de
Re: New WAL record to detect the checkpoint redo location
Hi, As I think I mentioned before, I like this idea. However, I don't like the implementation too much. On 2023-06-15 13:11:57 +0530, Dilip Kumar wrote: > diff --git a/src/backend/access/transam/xlog.c > b/src/backend/access/transam/xlog.c > index b2430f617c..a025fb91e2 100644 > --- a/src/backend/access/transam/xlog.c > +++ b/src/backend/access/transam/xlog.c > @@ -744,6 +744,7 @@ XLogInsertRecord(XLogRecData *rdata, > XLogRecPtr StartPos; > XLogRecPtr EndPos; > boolprevDoPageWrites = doPageWrites; > + boolcallerHoldingExlock = holdingAllLocks; > TimeLineID insertTLI; > > /* we assume that all of the record header is in the first chunk */ > @@ -792,10 +793,18 @@ XLogInsertRecord(XLogRecData *rdata, >*-- >*/ > START_CRIT_SECTION(); > - if (isLogSwitch) > - WALInsertLockAcquireExclusive(); > - else > - WALInsertLockAcquire(); > + > + /* > + * Acquire wal insertion lock, nothing to do if the caller is already > + * holding the exclusive lock. > + */ > + if (!callerHoldingExlock) > + { > + if (isLogSwitch) > + WALInsertLockAcquireExclusive(); > + else > + WALInsertLockAcquire(); > + } > > /* >* Check to see if my copy of RedoRecPtr is out of date. If so, may have This might work right now, but doesn't really seem maintainable. Nor do I like adding branches into this code a whole lot. > @@ -6597,6 +6612,32 @@ CreateCheckPoint(int flags) I think the commentary above this function would need a fair bit of revising... >*/ > RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo; > > + /* > + * Insert a special purpose CHECKPOINT_REDO record as the first record > at > + * checkpoint redo lsn. Although we have the checkpoint record that > + * contains the exact redo lsn, there might have been some other records > + * those got inserted between the redo lsn and the actual checkpoint > + * record. So when processing the wal, we cannot rely on the checkpoint > + * record if we want to stop at the checkpoint-redo LSN. > + * > + * This special record, however, is not required when we doing a > shutdown > + * checkpoint, as there will be no concurrent wal insertions during that > + * time. So, the shutdown checkpoint LSN will be the same as > + * checkpoint-redo LSN. > + * > + * This record is guaranteed to be the first record at checkpoint redo > lsn > + * because we are inserting this while holding the exclusive wal > insertion > + * lock. > + */ > + if (!shutdown) > + { > + int dummy = 0; > + > + XLogBeginInsert(); > + XLogRegisterData((char *) &dummy, sizeof(dummy)); > + recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO); > + } It seems to me that we should be able to do better than this. I suspect we might be able to get rid of the need for exclusive inserts here. If we rid of that, we could determine the redoa location based on the LSN determined by the XLogInsert(). Alternatively, I think we should split XLogInsertRecord() so that the part with the insertion locks held is a separate function, that we could use here. Greetings, Andres Freund
Re: WAL Insertion Lock Improvements
Hi, On 2023-07-13 14:04:31 -0700, Andres Freund wrote: > From b74b6e953cb5a7e7ea1a89719893f6ce9e231bba Mon Sep 17 00:00:00 2001 > From: Bharath Rupireddy > Date: Fri, 19 May 2023 15:00:21 + > Subject: [PATCH v8] Optimize WAL insertion lock acquisition and release > > This commit optimizes WAL insertion lock acquisition and release > in the following way: I think this commit does too many things at once. > 1. WAL insertion lock's variable insertingAt is currently read and > written with the help of lwlock's wait list lock to avoid > torn-free reads/writes. This wait list lock can become a point of > contention on a highly concurrent write workloads. Therefore, make > insertingAt a 64-bit atomic which inherently provides torn-free > reads/writes. "inherently" is a bit strong, given that we emulate 64bit atomics where not available... > 2. LWLockUpdateVar currently acquires lwlock's wait list lock even when > there are no waiters at all. Add a fastpath exit to LWLockUpdateVar when > there are no waiters to avoid unnecessary locking. I don't think there's enough of an explanation for why this isn't racy. The reason it's, I think, safe, is that anyone using LWLockConflictsWithVar() will do so twice in a row, with a barrier inbetween. But that really relies on what I explain in the paragraph below: > It also adds notes on why LWLockConflictsWithVar doesn't need a > memory barrier as far as its current usage is concerned. I don't think: * NB: LWLockConflictsWithVar (which is called from * LWLockWaitForVar) relies on the spinlock used above in this * function and doesn't use a memory barrier. helps to understand why any of this is safe to a meaningful degree. The existing comments aren't obviously aren't sufficient to explain this, but the reason it's, I think, safe today, is that we are only waiting for insertions that started before WaitXLogInsertionsToFinish() was called. The lack of memory barriers in the loop means that we might see locks as "unused" that have *since* become used, which is fine, because they only can be for later insertions that we wouldn't want to wait on anyway. Not taking a lock to acquire the current insertingAt value means that we might see older insertingAt value. Which should also be fine, because if we read a too old value, we'll add ourselves to the queue, which contains atomic operations. > diff --git a/src/backend/storage/lmgr/lwlock.c > b/src/backend/storage/lmgr/lwlock.c > index 59347ab951..82266e6897 100644 > --- a/src/backend/storage/lmgr/lwlock.c > +++ b/src/backend/storage/lmgr/lwlock.c > @@ -1547,9 +1547,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode) > * *result is set to true if the lock was free, and false otherwise. > */ > static bool > -LWLockConflictsWithVar(LWLock *lock, > -uint64 *valptr, uint64 oldval, > uint64 *newval, > -bool *result) > +LWLockConflictsWithVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 oldval, > +uint64 *newval, bool *result) > { > boolmustwait; > uint64 value; > @@ -1572,13 +1571,11 @@ LWLockConflictsWithVar(LWLock *lock, > *result = false; > > /* > - * Read value using the lwlock's wait list lock, as we can't generally > - * rely on atomic 64 bit reads/stores. TODO: On platforms with a way to > - * do atomic 64 bit reads/writes the spinlock should be optimized away. > + * Reading the value atomically ensures that we don't need any explicit > + * locking. Note that in general, 64 bit atomic APIs in postgres > inherently > + * provide explicit locking for the platforms without atomics support. 
>*/ This comment seems off to me. Using atomics doesn't guarantee not needing locking. It just guarantees that we are reading a non-torn value. > @@ -1605,9 +1602,14 @@ LWLockConflictsWithVar(LWLock *lock, > * > * Note: this function ignores shared lock holders; if the lock is held > * in shared mode, returns 'true'. > + * > + * Be careful that LWLockConflictsWithVar() does not include a memory > barrier, > + * hence the caller of this function may want to rely on an explicit barrier > or > + * a spinlock to avoid memory ordering issues. > */ s/careful/aware/? s/spinlock/implied barrier via spinlock or lwlock/? Greetings, Andres
Re: WAL Insertion Lock Improvements
On 2023-07-11 09:20:45 +0900, Michael Paquier wrote: > On Mon, Jun 05, 2023 at 08:00:00AM +0530, Bharath Rupireddy wrote: > > On Wed, May 31, 2023 at 5:05 PM Michael Paquier wrote: > >> @Andres: Were there any extra tests you wanted to be run for more > >> input? > > > > @Andres Freund please let us know your thoughts. > > Err, ping. It seems like this thread is waiting on input from you, > Andres? Looking. Sorry for not getting to this earlier. - Andres
Re: DROP DATABASE is interruptible
Hi, On 2023-07-12 11:54:18 +0200, Daniel Gustafsson wrote: > > On 12 Jul 2023, at 03:59, Andres Freund wrote: > > On 2023-07-07 14:09:08 +0200, Daniel Gustafsson wrote: > >>> On 25 Jun 2023, at 19:03, Andres Freund wrote: > >>> On 2023-06-21 12:02:04 -0700, Andres Freund wrote: > > >>> There don't need to be explict checks, because pg_upgrade will fail, > >>> because > >>> it connects to every database. Obviously the error could be nicer, but it > >>> seems ok for something hopefully very rare. I did add a test ensuring > >>> that the > >>> behaviour is caught. > >> > >> I don't see any pg_upgrade test in the patch? > > > > Oops, I stashed them alongside some unrelated changes... Included this time. > > Looking more at this I wonder if we in HEAD should make this a bit nicer by > extending the --check phase to catch this? I did a quick hack along these > lines in the 0003 commit attached here (0001 and 0002 are your unchanged > patches, just added for consistency and to be CFBot compatible). If done it > could be a separate commit to make the 0002 patch backport cleaner of course. I don't really have an opinion on that, tbh... > >> + errhint("Use DROP DATABASE to drop invalid databases")); > >> Should end with a period as a complete sentence? > > > > I get confused about this every time. It's not helped by this example in > > sources.sgml: > > > > > > Primary:could not create shared memory segment: %m > > Detail: Failed syscall was shmget(key=%d, size=%u, 0%o). > > Hint: the addendum > > > > > > Which notably does not use punctuation for the hint. But indeed, later we > > say: > > > >Detail and hint messages: Use complete sentences, and end each with > >a period. Capitalize the first word of sentences. Put two spaces after > >the period if another sentence follows (for English text; might be > >inappropriate in other languages). > > > > That's not a very helpful example, and one which may give the wrong impression > unless the entire page is read. I've raised this with a small diff to improve > it on -docs. Thanks for doing that! > > Updated patches attached. > > This version of the patchset LGTM. Backpatching indeed was no fun. Not having BackgroundPsql.pm was the worst part. But also a lot of other conflicts in tests... Took me 5-6 hours or so. But I now finally pushed the fixes. Hope the buildfarm agrees with it... Thanks for the review! Greetings, Andres Freund
Re: vac_truncate_clog()'s bogus check leads to bogusness
On 2023-06-22 22:29:12 -0700, Noah Misch wrote: > On Thu, Jun 22, 2023 at 09:45:18AM -0700, Andres Freund wrote: > > On 2023-06-21 21:50:39 -0700, Noah Misch wrote: > > > On Wed, Jun 21, 2023 at 03:12:08PM -0700, Andres Freund wrote: > > > > When vac_truncate_clog() returns early > > > ... > > > > we haven't released the lwlock that we acquired earlier > > > > > > > Until there's some cause for the session to call LWLockReleaseAll(), > > > > the lock > > > > is held. Until then neither the process holding the lock, nor any other > > > > process, can finish vacuuming. We don't even have an assert against a > > > > self-deadlock with an already held lock, oddly enough. > > > > > > I agree with this finding. Would you like to add the lwlock releases, or > > > would you like me to? > > > > Happy with either. I do have code and testcase, so I guess it would make > > sense for me to do it? > > Sounds good. Thanks. Done.
Re: Meson build updates
Hi, On 2023-06-29 13:34:42 -0500, Tristan Partin wrote: > On Thu Jun 29, 2023 at 12:35 PM CDT, Andres Freund wrote: > > Hm. One minor disadvantage of this is that if no c++ compiler was found, you > > can't really see anything about llvm in the the output, nor in > > meson-log.txt, > > making it somewhat hard to figure out why llvm was disabled. > > > > I think something like > > > > elif llvmopt.auto() > > message('llvm requires a C++ compiler') > > endif > > > > Should do the trick? > > Your patch looks great to me. > > > > v5-0011-Pass-feature-option-through-to-required-kwarg.patch > > > > I'm a bit confused how it ended how it's looking like it is right now, but > > ... :) > > > > I'm thinking of merging 0011 and relevant parts of 0012 and 0015 into one. > > Your patch looks great to me. Pushed these. Thanks! Greetings, Andres Freund
Re: Remove distprep
Hi, On 2023-06-09 11:10:14 +0200, Peter Eisentraut wrote: > Per discussion at the unconference[0], I started to write a patch that > removes the make distprep target. A more detailed explanation of the > rationale is also in the patch. Thanks for tackling this! > It needs some polishing around the edges, but I wanted to put it out here to > get things moving and avoid duplicate work. It'd be nice if we could get this in soon, so we could move ahead with the src/tools/msvc removal. > One thing in particular that isn't clear right now is how "make world" > should behave if the documentation tools are not found. Maybe we should > make a build option, like "--with-docs", to mirror the meson behavior. Isn't that somewhat unrelated to distprep? I see that you removed missing, but I don't really understand why as part of this commit? > -# If there are any files in the source directory that we also generate in the > -# build directory, they might get preferred over the newly generated files, > -# e.g. because of a #include "file", which always will search in the current > -# directory first. > -message('checking for file conflicts between source and build directory') You're thinking this can be removed because distclean is now reliable? There were some pretty annoying to debug issues early on, where people switched from an in-tree autoconf build to meson, with some files left over from the source build, causing problems at a *later* time (when the files should have changed, but the wrong ones were picked up). That's not really related distprep etc, so I'd change this separately, if we want to change it. > diff --git a/src/tools/pginclude/cpluspluscheck > b/src/tools/pginclude/cpluspluscheck > index 4e09c4686b..287395887c 100755 > --- a/src/tools/pginclude/cpluspluscheck > +++ b/src/tools/pginclude/cpluspluscheck > @@ -134,6 +134,9 @@ do > test "$f" = src/interfaces/ecpg/preproc/preproc.h && continue > test "$f" = src/test/isolation/specparse.h && continue > > + # FIXME > + test "$f" = src/backend/utils/adt/jsonpath_internal.h && continue > + > # ppport.h is not under our control, so we can't make it standalone. > test "$f" = src/pl/plperl/ppport.h && continue Hm, what's that about? We really ought to replace these scripts with something better, which understands concurrency... Greetings, Andres Freund
Re: Forgive trailing semicolons inside of config files
Hi, On 2023-07-11 12:21:46 -0400, Greg Sabino Mullane wrote: > On Tue, Jul 11, 2023 at 11:04 AM Isaac Morland > > Or maybe there could be a "check configuration" subcommand which checks > > the configuration. > > > > There are things inside of Postgres once it has started, but yeah, > something akin to visudo would be nice for editing config files. You can also do it kind-of-reasonably with the server binary, like: PGDATA=/srv/dev/pgdev-dev /path/to/postgres -C server_version; echo $? > > I'd be more interested in improvements in visibility of errors. For > > example, maybe if I try to start the server and there is a config file > > problem, I could somehow get a straightforward error message right in the > > terminal window complaining about the line of the configuration which is > > wrong. > > > > That ship has long since sailed. We already send a detailed error message > with the line number, but in today's world of "service start", "systemctl > start", and higher level of control such as Patroni and Kubernetes, getting > things to show in a terminal window isn't happening. We can't work around > 2>&1. At least with debian's infrastructure, both systemctl start and reload show errors reasonably well: start with broken config: Jul 11 19:13:40 awork3 systemd[1]: Starting postgresql@15-test.service - PostgreSQL Cluster 15-test... Jul 11 19:13:40 awork3 postgresql@15-test[3217452]: Error: invalid line 3 in /var/lib/postgresql/15/test/postgresql.auto.conf: dd Jul 11 19:13:40 awork3 systemd[1]: postgresql@15-test.service: Can't open PID file /run/postgresql/15-test.pid (yet?) after start: No such file or directory reload with broken config: Jul 11 19:10:38 awork3 systemd[1]: Reloading postgresql@15-test.service - PostgreSQL Cluster 15-test... Jul 11 19:10:38 awork3 postgresql@15-test[3217175]: Error: invalid line 3 in /var/lib/postgresql/15/test/postgresql.auto.conf: dd Jul 11 19:10:38 awork3 systemd[1]: postgresql@15-test.service: Control process exited, code=exited, status=1/FAILURE Jul 11 19:10:38 awork3 systemd[1]: Reload failed for postgresql@15-test.service - PostgreSQL Cluster 15-test. However: It looks like that's all implemented in debian specific tooling, rather than PG itself. Oops. Looks like we could make this easier in core postgres by adding one more sd_notify() call, with something like STATUS=reload failed due to syntax error in file "/srv/dev/pgdev-dev/postgresql.conf" line 821, near end of line Greetings, Andres Freund
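For illustration, roughly what that extra call could look like (hypothetical helper name and placement; assumes a build with libsystemd, as the existing sd_notify() readiness reporting already does):

#include <systemd/sd-daemon.h>

/* hypothetical helper; PostgreSQL would guard this with USE_SYSTEMD */
static void
report_reload_failure(const char *filename, unsigned int lineno)
{
	sd_notifyf(0,
			   "STATUS=reload failed due to syntax error in file \"%s\" line %u",
			   filename, lineno);
}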
Re: DROP DATABASE is interruptible
Hi, On 2023-07-07 14:09:08 +0200, Daniel Gustafsson wrote: > > On 25 Jun 2023, at 19:03, Andres Freund wrote: > > On 2023-06-21 12:02:04 -0700, Andres Freund wrote: > >> I'm hacking on this bugfix again, > > This patch LGTM from reading through and testing (manually and with your > supplied tests in the patch), I think this is a sound approach to deal with > this. Thanks! > >> I haven't yet added checks to pg_upgrade, even though that's clearly > >> needed. I'm waffling a bit between erroring out and just ignoring the > >> database? pg_upgrade already fails when datallowconn is set "wrongly", see > >> check_proper_datallowconn(). Any opinions? > > > > There don't need to be explict checks, because pg_upgrade will fail, because > > it connects to every database. Obviously the error could be nicer, but it > > seems ok for something hopefully very rare. I did add a test ensuring that > > the > > behaviour is caught. > > I don't see any pg_upgrade test in the patch? Oops, I stashed them alongside some unrelated changes... Included this time. > > It's somewhat odd that pg_upgrade prints errors on stdout... > > There are many odd things about pg_upgrade logging, updating it to use the > common logging framework of other utils would be nice. Indeed. > >> I'm not sure what should be done for psql. It's probably not a good idea to > >> change tab completion, that'd just make it appear the database is gone. > >> But \l > >> could probably show dropped databases more prominently? > > > > I have not done that. I wonder if this is something that should be done in > > the > > back branches? > > Possibly, I'm not sure where we usually stand on changing the output format of > \ commands in psql in minor revisions. I'd normally be quite careful, people do script psql. While breaking things when encountering an invalid database doesn't actually sound like a bad thing, I don't think it fits into any of the existing column output by psql for \l. > A few small comments on the patch: > > + * Max connections allowed (-1=no limit, -2=invalid database). A database > + * is set to invalid partway through eing dropped. Using datconnlimit=-2 > + * for this purpose isn't particularly clean, but is backpatchable. > Typo: s/eing/being/. Fixed. > A limit of -1 makes sense, but the meaning of -2 is less intuitive IMO. > Would it make sense to add a #define with a more descriptive name to save > folks reading this having to grep around and figure out what -2 means? I went back and forth about this one. We don't use defines for such things in all the frontend code today, so the majority of places won't be improved by adding this. I added them now, which required touching a few otherwise untouched places, but not too bad. > + errhint("Use DROP DATABASE to drop invalid databases")); > Should end with a period as a complete sentence? I get confused about this every time. It's not helped by this example in sources.sgml: Primary:could not create shared memory segment: %m Detail: Failed syscall was shmget(key=%d, size=%u, 0%o). Hint: the addendum Which notably does not use punctuation for the hint. But indeed, later we say: Detail and hint messages: Use complete sentences, and end each with a period. Capitalize the first word of sentences. Put two spaces after the period if another sentence follows (for English text; might be inappropriate in other languages). > + errmsg("cannot alter invalid database \"%s\"", stmt->dbname), > + errdetail("Use DROP DATABASE to drop invalid databases")); > Shouldn't this be an errhint() instead? 
Also ending with a period. Yep. > + if (database_is_invalid_form((Form_pg_database) dbform)) > + continue; > Would it make sense to stick a DEBUG2 log entry in there to signal that such a > database exist? (The same would apply for the similar hunk in autovacuum.c.) I don't really have an opinion on it. Added. elog(DEBUG2, "skipping invalid database \"%s\" while computing relfrozenxid", NameStr(dbform->datname)); and elog(DEBUG2, "autovacuum: skipping invalid database \"%s\"", NameStr(pgdatabase->datname)); Updated patches attached. Not looking forward to fixing all the conflicts. Does anybody have an opinion about whether we should add a dedicated field to pg_database for representing invali
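To make the datconnlimit=-2 convention concrete, a sketch of the defines and helper being discussed could look like the following; the macro names and their exact home are illustrative assumptions, not necessarily what the committed patch uses.

/* Illustrative sketch only - the committed names/location may differ. */

/* Special values for pg_database.datconnlimit */
#define DATCONNLIMIT_UNLIMITED		-1	/* no limit on connections */
#define DATCONNLIMIT_INVALID_DB		-2	/* database is invalid, i.e. partway
									 * through being dropped */

/*
 * Helper matching the database_is_invalid_form() calls quoted above, so the
 * relfrozenxid computation and autovacuum loops can skip such databases.
 */
static inline bool
database_is_invalid_form(Form_pg_database datform)
{
	return datform->datconnlimit == DATCONNLIMIT_INVALID_DB;
}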
Re: Support to define custom wait events for extensions
ents/t/001_basic.pl
> @@ -0,0 +1,34 @@
> +# Copyright (c) 2023, PostgreSQL Global Development Group
> +
> +use strict;
> +use warnings;
> +
> +use PostgreSQL::Test::Cluster;
> +use PostgreSQL::Test::Utils;
> +use Test::More;
> +
> +my $node = PostgreSQL::Test::Cluster->new('main');
> +
> +$node->init;
> +$node->append_conf(
> +	'postgresql.conf',
> +	"shared_preload_libraries = 'test_custom_wait_events'"
> +);
> +$node->start;

I think this should also test registering a wait event later.

> @@ -0,0 +1,188 @@
> +/*--
> + *
> + * test_custom_wait_events.c
> + *		Code for testing custom wait events
> + *
> + * This code initializes a custom wait event in shmem_request_hook() and
> + * provides a function to launch a background worker waiting forever
> + * with the custom wait event.

Isn't this vast overkill? Why not just have a simple C function that waits
with a custom wait event, until cancelled? That'd be maybe 1/10th of the LOC.

Greetings,

Andres Freund
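For comparison, the kind of "simple C function that waits with a custom wait event, until cancelled" could look roughly like the sketch below. The wait-event value and how it gets registered are assumptions here, since that registration API is precisely what the patch under review provides.

#include "postgres.h"

#include "fmgr.h"
#include "miscadmin.h"
#include "storage/latch.h"
#include "utils/wait_event.h"

PG_MODULE_MAGIC;

/* assumed to be filled in by whatever registration API the patch settles on */
static uint32 test_custom_wait_event = 0;

PG_FUNCTION_INFO_V1(test_custom_wait);

/* Wait with the custom wait event until the calling statement is cancelled. */
Datum
test_custom_wait(PG_FUNCTION_ARGS)
{
	for (;;)
	{
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 1000L /* ms */ ,
						 test_custom_wait_event);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();	/* statement cancel / termination ends the wait */
	}

	PG_RETURN_VOID();			/* not reached */
}

A test would then only need a CREATE FUNCTION ... LANGUAGE C wrapper, a background psql session calling it, and a poll of pg_stat_activity.wait_event.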
Re: Cleaning up threading code
On 2023-07-12 08:58:29 +1200, Thomas Munro wrote: > On Mon, Jul 10, 2023 at 10:45 AM Thomas Munro wrote: > > * defined ENABLE_THREAD_SAFETY 1 in c.h, for the benefit of extensions > > I may lack imagination but I couldn't think of any use for that > vestigial macro in backend/extension code, and client code doesn't see > c.h and might not get the right answer anyway if it's dynamically > linked which is the usual case. I took it out for now. Open to > discussing further if someone can show what kinds of realistic > external code would be affected. WFM.
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi,

On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
> While testing PG16, I observed that in PG16 there is a big performance
> degradation in concurrent COPY into a single relation with 2 - 16
> clients in my environment. I've attached a test script that measures
> the execution time of COPYing 5GB data in total to the single relation
> while changing the number of concurrent insertions, in PG16 and PG15.
> Here are the results on my environment (EC2 instance, RHEL 8.6, 128
> vCPUs, 512GB RAM):
>
> * PG15 (4b15868b69)
> PG15: nclients = 1, execution time = 14.181
>
> * PG16 (c24e9ef330)
> PG16: nclients = 1, execution time = 17.112

> The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
> extend tables more efficiently". With commit 1cbbee0338 (the previous
> commit of 00d1e02be2), I got better numbers; it didn't have better
> scalability, though:
>
> PG16: nclients = 1, execution time = 17.444

I think the single client case is indicative of an independent regression, or
rather regressions - it can't have anything to do with the fallocate() issue
and reproduces before that too in your numbers.

1) COPY got slower, due to:
   9f8377f7a27 Add a DEFAULT option to COPY FROM

   This added a new palloc()/free() to every call to NextCopyFrom(). It's not
   at all clear to me why this needs to happen in NextCopyFrom(), particularly
   because it's already stored in CopyFromState?

2) pg_strtoint32_safe() got substantially slower, mainly due to
   faff8f8e47f Allow underscores in integer and numeric constants.
   6fcda9aba83 Non-decimal integer literals

Pinned to one cpu, turbo mode disabled, I get the following best-of-three
times for "copy test from '/tmp/tmp_4.data'" (too impatient to use the larger
file every time):

15:                                         6281.107 ms
HEAD:                                       7000.469 ms
backing out 9f8377f7a27:                    6433.516 ms
also backing out faff8f8e47f, 6fcda9aba83:  6235.453 ms

I suspect 1) can relatively easily be fixed properly. But 2) seems much
harder. The changes increased the number of branches substantially, and
that's gonna cost in something as (previously) tight as pg_strtoint32().

For higher concurrency numbers, I now was able to reproduce the regression,
to a smaller degree. Much smaller after fixing the above. The reason we run
into the issue here is basically that the rows in the test are very narrow
and reach
  #define MAX_BUFFERED_TUPLES 1000
at a small number of pages, so we go back and forth between extending with
fallocate() and not.

I'm *not* saying that that is the solution, but after changing that to 5000,
the numbers look a lot better (with the other regressions "worked around"):

(this is again with turboboost disabled, to get more reproducible numbers)

clients                      1      2      4      8     16     32
15,buffered=1000         25725  13211   9232   5639   4862   4700
15,buffered=5000         26107  14550   8644   6050   4943   4766
HEAD+fixes,buffered=1000 25875  14505   8200   4900   3565   3433
HEAD+fixes,buffered=5000 25830  12975   6527   3594   2739   2642

Greetings,

Andres Freund

[1] https://postgr.es/m/CAD21AoAEwHTLYhuQ6PaBRPXKWN-CgW9iw%2B4hm%3D2EOFXbJQ3tOg%40mail.gmail.com
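The cost behind point 1) is easy to demonstrate outside of Postgres. The standalone program below is not PostgreSQL code, just an illustration of the shape of fix being hinted at: allocate the per-row scratch array once up front and reset it per row, instead of allocating and freeing it on every NextCopyFrom() call.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define ROWS	10000000L
#define NATTS	32

int
main(void)
{
	clock_t		t;
	volatile char sink = 0;

	/* allocate and free a small scratch array for every "row" */
	t = clock();
	for (long i = 0; i < ROWS; i++)
	{
		char	   *defaults = calloc(NATTS, 1);

		sink ^= defaults[i % NATTS];
		free(defaults);
	}
	printf("alloc per row: %.2fs\n", (double) (clock() - t) / CLOCKS_PER_SEC);

	/* allocate once, just reset it for every "row" */
	t = clock();
	{
		char	   *defaults = calloc(NATTS, 1);

		for (long i = 0; i < ROWS; i++)
		{
			memset(defaults, 0, NATTS);
			sink ^= defaults[i % NATTS];
		}
		free(defaults);
	}
	printf("alloc once:    %.2fs\n", (double) (clock() - t) / CLOCKS_PER_SEC);

	return sink;
}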
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi,

On 2023-07-11 09:09:43 +0200, Jakub Wartak wrote:
> On Mon, Jul 10, 2023 at 6:24 PM Andres Freund wrote:
> > Hi,
> >
> > On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
> > > Out of curiosity I've tried and it is reproducible as you have stated: XFS
> > > @ 4.18.0-425.10.1.el8_7.x86_64:
> > > ...
> > > According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output,
> > > the XFS issues sync writes while ext4 does not, xfs looks like a constant
> > > loop of sync writes (D) by kworker/2:1H-kblockd:
> >
> > That clearly won't go well. It's not reproducible on newer systems,
> > unfortunately :(. Or well, fortunately maybe.
> >
> > I wonder if a trick to avoid this could be to memorialize the fact that we
> > bulk extended before and extend by that much going forward? That'd avoid the
> > swapping back and forth.
>
> I haven't seen this thread [1] "Question on slow fallocate", from the XFS
> mailing list, being mentioned here (it was started by Masahiko), but I
> do feel it contains very important hints, even challenging the whole
> idea of zeroing out files (or posix_fallocate()). Please especially
> see Dave's reply.

I think that's just due to the reproducer being a bit too minimal and the
actual problem being addressed not being explained.

> He also argues that posix_fallocate() != fallocate(). What's interesting is
> that it's by design and newer kernel versions should not prevent such
> behaviour, see my testing result below.

I think the problem there was that I was not targeting a different file
between the different runs, somehow assuming the test program would be
taking care of that.

I don't think the test program actually tests things in a particularly useful
way - it does fallocate()s in 8k chunks - which postgres never does.

> All I can add is that those kernel versions (4.18.0) seem to be very
> popular across customers (RHEL, Rocky) right now and that I've tested
> on the most recent available one (4.18.0-477.15.1.el8_8.x86_64) using
> Masahiko's test.c and still got 6-7x slower times when using XFS on that
> kernel. After installing kernel-ml (6.4.2) the test.c result seems to
> be the same (note it occurs only when first allocating space, but of
> course it doesn't if the same file is rewritten/"reallocated"):

test.c really doesn't reproduce postgres behaviour in any meaningful way;
using it as a benchmark is a bad idea.

Greetings,

Andres Freund
Re: Refactoring backend fork+exec code
nds = { [PMC_AV_LAUNCHER] = {"autovacuum launcher", AutoVacLauncherMain, true}, ... } or such should work. I'd also use designated initializers for the fields, it's otherwise hard to know what true means etc. I think it might be good to put more into array. If we e.g. knew whether a particular child type is a backend-like, and aux process or syslogger, we could avoid the duplicated InitAuxiliaryProcess(), MemoryContextDelete(PostmasterContext) etc calls everywhere. > +/* > + * SubPostmasterMain -- Get the fork/exec'd process into a state equivalent > + * to what it would be if we'd simply forked on Unix, and > then > + * dispatch to the appropriate place. > + * > + * The first two command line arguments are expected to be "--forkFOO" > + * (where FOO indicates which postmaster child we are to become), and > + * the name of a variables file that we can read to load data that would > + * have been inherited by fork() on Unix. Remaining arguments go to the > + * subprocess FooMain() routine. XXX > + */ > +void > +SubPostmasterMain(int argc, char *argv[]) > +{ > + PostmasterChildType child_type; > + char *startup_data; > + size_t startup_data_len; > + > + /* In EXEC_BACKEND case we will not have inherited these settings */ > + IsPostmasterEnvironment = true; > + whereToSendOutput = DestNone; > + > + /* Setup essential subsystems (to ensure elog() behaves sanely) */ > + InitializeGUCOptions(); > + > + /* Check we got appropriate args */ > + if (argc < 3) > + elog(FATAL, "invalid subpostmaster invocation"); > + > + if (strncmp(argv[1], "--forkchild=", 12) == 0) > + { > + char *entry_name = argv[1] + 12; > + boolfound = false; > + > + for (int idx = 0; idx < lengthof(entry_kinds); idx++) > + { > + if (strcmp(entry_kinds[idx].name, entry_name) == 0) > + { > + child_type = idx; > + found = true; > + break; > + } > + } > + if (!found) > + elog(ERROR, "unknown child kind %s", entry_name); > + } Hm, shouldn't we error out when called without --forkchild? > +/* Save critical backend variables into the BackendParameters struct */ > +#ifndef WIN32 > +static bool > +save_backend_variables(BackendParameters *param, ClientSocket *client_sock) > +#else There's so much of this kind of thing. Could we hide it in a struct or such instead of needing ifdefs everywhere? > --- a/src/backend/storage/ipc/shmem.c > +++ b/src/backend/storage/ipc/shmem.c > @@ -144,6 +144,8 @@ InitShmemAllocation(void) > /* >* Initialize ShmemVariableCache for transaction manager. (This doesn't >* really belong here, but not worth moving.) > + * > + * XXX: we really should move this >*/ > ShmemVariableCache = (VariableCache) > ShmemAlloc(sizeof(*ShmemVariableCache)); Heh. Indeed. And probably just rename it to something less insane. Greetings, Andres Freund
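Spelled out with designated initializers for the fields as well, the table could look roughly like this. The struct and field names are made up for illustration; only PMC_AV_LAUNCHER, entry_kinds, and the general shape come from the patch quoted above.

typedef struct PostmasterChildEntry
{
	const char *name;			/* used for --forkchild=<name> in EXEC_BACKEND */
	void		(*main_fn) (int argc, char *argv[]);
	bool		attaches_shmem; /* backend-like child vs. e.g. syslogger */
} PostmasterChildEntry;

static const PostmasterChildEntry entry_kinds[] = {
	[PMC_AV_LAUNCHER] = {
		.name = "autovacuum launcher",
		.main_fn = AutoVacLauncherMain,
		.attaches_shmem = true,
	},
	/* ... one designated entry per PostmasterChildType ... */
};

With a field like attaches_shmem (or a small enum) in the table, the duplicated InitAuxiliaryProcess() / MemoryContextDelete(PostmasterContext) calls mentioned above could be driven from one place instead of being repeated per child type.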
Re: ResourceOwner refactoring
Hi, On 2023-07-10 12:14:46 -0700, Andres Freund wrote: > > /* > > - * Initially allocated size of a ResourceArray. Must be power of two since > > - * we'll use (arraysize - 1) as mask for hashing. > > + * Size of the fixed-size array to hold most-recently remembered resources. > > */ > > -#define RESARRAY_INIT_SIZE 16 > > +#define RESOWNER_ARRAY_SIZE 32 > > That's 512 bytes, pretty large to be searched sequentially. It's somewhat sad > that the array needs to include 8 byte of ResourceOwnerFuncs... On that note: It's perhaps worth noting that this change actually *increases* the size of ResourceOwnerData. Previously: /* size: 576, cachelines: 9, members: 19 */ /* sum members: 572, holes: 1, sum holes: 4 */ now: /* size: 696, cachelines: 11, members: 13 */ /* sum members: 693, holes: 1, sum holes: 3 */ It won't make a difference memory-usage wise, given aset.c's rounding behaviour. But it does mean that more memory needs to be zeroed, as ResourceOwnerCreate() uses MemoryContextAllocZero. As there's really no need to initialize the long ->arr, ->locks, it seems worth to just initialize explicitly instead of using MemoryContextAllocZero(). With the new representation, is there any point in having ->locks? We still need ->nlocks to be able to switch to the lossy behaviour, but there doesn't seem to be much point in having the separate array. Greetings, Andres Freund
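What "initialize explicitly instead of using MemoryContextAllocZero()" might look like, as a sketch: it assumes, purely for illustration, that the small bookkeeping fields sit in front of the two big arrays so only that prefix needs zeroing; the actual field layout and names in the patch may differ.

ResourceOwner
ResourceOwnerCreate(ResourceOwner parent, const char *name)
{
	ResourceOwner owner;

	owner = (ResourceOwner) MemoryContextAlloc(TopMemoryContext,
											   sizeof(ResourceOwnerData));

	/*
	 * arr[] and locks[] are only read up to the stored counts, so their
	 * contents can stay uninitialized; zero just the bookkeeping fields
	 * laid out before them.
	 */
	memset(owner, 0, offsetof(ResourceOwnerData, arr));

	owner->name = name;
	if (parent)
	{
		owner->parent = parent;
		owner->nextchild = parent->firstchild;
		parent->firstchild = owner;
	}

	return owner;
}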
Re: ResourceOwner refactoring
owner->releasing = true;

Is it a problem that a possible error would lead to ->releasing = true being
"leaked"? I think it might be, because we haven't called ResourceOwnerSort(),
which means we'd not process resources in the right order during abort
processing.

> +/*
> + * Hash function for value+kind combination.
> + */
>  static inline uint32
>  hash_resource_elem(Datum value, ResourceOwnerFuncs *kind)
>  {
> -	Datum		data[2];
> -
> -	data[0] = value;
> -	data[1] = PointerGetDatum(kind);
> -
> -	return hash_bytes((unsigned char *) &data, 2 * SIZEOF_DATUM);
> +	/*
> +	 * Most resource kinds store a pointer in 'value', and pointers are unique
> +	 * all on their own. But some resources store plain integers (Files and
> +	 * Buffers as of this writing), so we want to incorporate the 'kind' in
> +	 * the hash too, otherwise those resources will collide a lot. But
> +	 * because there are only a few resource kinds like that - and only a few
> +	 * resource kinds to begin with - we don't need to work too hard to mix
> +	 * 'kind' into the hash. Just add it with hash_combine(), it perturbs the
> +	 * result enough for our purposes.
> +	 */
> +#if SIZEOF_DATUM == 8
> +	return hash_combine64(murmurhash64((uint64) value), (uint64) kind);
> +#else
> +	return hash_combine(murmurhash32((uint32) value), (uint32) kind);
> +#endif
>  }

Why are we using a 64bit hash to then just throw out the high bits?

Greetings,

Andres Freund
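For reference, a 32-bit-only variant could look like the sketch below; whether it distributes well enough for the Datum values involved is exactly the open question being raised, so this is only an illustration of the alternative, not a claim that it is better.

static inline uint32
hash_resource_elem(Datum value, ResourceOwnerFuncs *kind)
{
	/*
	 * The low 32 bits of the value carry essentially all of the entropy for
	 * pointer values, and integer-valued resources (Files, Buffers) fit in
	 * them entirely; mix the kind in to keep those from colliding.
	 */
	return hash_combine(murmurhash32((uint32) value),
						murmurhash32((uint32) PointerGetDatum(kind)));
}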
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote: > Out of curiosity I've tried and it is reproducible as you have stated : XFS > @ 4.18.0-425.10.1.el8_7.x86_64: >... > According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output , > the XFS issues sync writes while ext4 does not, xfs looks like constant > loop of sync writes (D) by kworker/2:1H-kblockd: That clearly won't go well. It's not reproducible on newer systems, unfortunately :(. Or well, fortunately maybe. I wonder if a trick to avoid this could be to memorialize the fact that we bulk extended before and extend by that much going forward? That'd avoid the swapping back and forth. One thing that confuses me is that Sawada-san observed a regression at a single client - yet from what I can tell, there's actually not a single fallocate() being generated in that case, because the table is so narrow that we never end up extending by a sufficient number of blocks in RelationAddBlocks() to reach that path. Yet there's a regression at 1 client. I don't yet have a RHEL8 system at hand, could either of you send the result of a strace -c -p $pid_of_backend_doing_copy Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi,

On 2023-07-03 11:59:38 +0900, Masahiko Sawada wrote:
> On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada wrote:
> >
> > After further investigation, the performance degradation comes from
> > calling posix_fallocate() (called via FileFallocate()) and pwritev()
> > (called via FileZero) alternatively depending on how many blocks we
> > extend by. And it happens only on the xfs filesystem.
>
> FYI, the attached simple C program proves the fact that calling
> alternatively posix_fallocate() and pwrite() causes slow performance
> on posix_fallocate():
>
> $ gcc -o test test.c
> $ time ./test test.1 1
> total 20
> fallocate 20
> filewrite 0
>
> real    0m1.305s
> user    0m0.050s
> sys     0m1.255s
>
> $ time ./test test.2 2
> total 20
> fallocate 10
> filewrite 10
>
> real    1m29.222s
> user    0m0.139s
> sys     0m3.139s

On an xfs filesystem, with a very recent kernel:

time /tmp/msw_test /srv/dev/fio/msw 0
total 20
fallocate 0
filewrite 20

real    0m0.456s
user    0m0.017s
sys     0m0.439s

time /tmp/msw_test /srv/dev/fio/msw 1
total 20
fallocate 20
filewrite 0

real    0m0.141s
user    0m0.010s
sys     0m0.131s

time /tmp/msw_test /srv/dev/fio/msw 2
total 20
fallocate 10
filewrite 10

real    0m0.297s
user    0m0.017s
sys     0m0.280s

So I don't think I can reproduce your problem on that system... I also tried
adding a fdatasync() into the loop, but that just made things uniformly slow.

I guess I'll try to dig up whether this is a problem in older upstream
kernels, or whether it's been introduced in RHEL.

Greetings,

Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote: > While testing PG16, I observed that in PG16 there is a big performance > degradation in concurrent COPY into a single relation with 2 - 16 > clients in my environment. I've attached a test script that measures > the execution time of COPYing 5GB data in total to the single relation > while changing the number of concurrent insertions, in PG16 and PG15. > Here are the results on my environment (EC2 instance, RHEL 8.6, 128 > vCPUs, 512GB RAM): Gah, RHEL with its frankenkernels, the bane of my existance. FWIW, I had extensively tested this with XFS, just with a newer kernel. Have you tested this on RHEL9 as well by any chance? Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-07-10 15:25:41 +0200, Alvaro Herrera wrote: > On 2023-Jul-03, Masahiko Sawada wrote: > > > While testing PG16, I observed that in PG16 there is a big performance > > degradation in concurrent COPY into a single relation with 2 - 16 > > clients in my environment. I've attached a test script that measures > > the execution time of COPYing 5GB data in total to the single relation > > while changing the number of concurrent insertions, in PG16 and PG15. > > This item came up in the RMT meeting. Andres, I think this item belongs > to you, because of commit 00d1e02be2. I'll take a look - I wasn't even aware of this thread until now. Greetings, Andres Freund
Re: Autogenerate some wait events code and documentation
Hi,

On 2023-07-06 09:36:12 +0900, Michael Paquier wrote:
> On Wed, Jul 05, 2023 at 02:59:39PM -0700, Andres Freund wrote:
> > Rebasing a patch over this I was a bit confused because I got a bunch of
> > "unable to parse wait_event_names.txt" errors. Took me a while to figure out
> > that that was just because I didn't include a trailing . in the description.
> > Perhaps that could be turned into a more meaningful error?
> >
> > die "unable to parse wait_event_names.txt"
> >   unless $line =~ /^(\w+)\t+(\w+)\t+("\w+")\t+("\w.*\.")$/;
>
> Agreed that we could at least add the $line in the error message
> generated, at least, to help with debugging.
>
> > I also do wonder if we should invest in generating the lwlock names as
> > well. Except for a few abbreviations, the second column is always the
> > camel-cased version of what follows WAIT_EVENT_. Feels pretty tedious to
> > write that out.
>
> And you mean getting rid of lwlocknames.txt?

No, I meant the second column in wait_event_names.txt. If you look at stuff
like:

WAIT_EVENT_ARCHIVER_MAIN	"ArchiverMain"	"Waiting in main loop of archiver process."

It'd be pretty trivial to generate ArchiverMain from ARCHIVER_MAIN.

> The impact on dtrace or other similar tools is uncertain to me because we
> have free number on this list, and stuff like GetLWLockIdentifier() rely on
> the input ID from lwlocknames.txt.

I don't really care, tbh. If we wanted to keep the names the same in case of
abbreviations, we could just make the name optional, and auto-generated if
not explicitly specified.

> > Perhaps we should just change the case of the upper-cased names (DSM, SSL,
> > WAL, ...) to follow the other names?
>
> So you mean renaming the existing events like WalSenderWaitForWAL to
> WalSenderWaitForWal?

Yes.

> The impact on existing monitoring queries is not zero because any changes
> would be silent, and that's the part that worried me the most even if it can
> remove one column in the txt file.

Then let's just use - or so to indicate the inferred name, with a "string"
overriding it?

Greetings,

Andres Freund