Re: slab allocator performance issues
On Fri, 11 Nov 2022 at 22:20, John Naylor wrote:
>
> On Wed, Oct 12, 2022 at 4:37 PM David Rowley wrote:
> > [v3]
>
> + /*
> + * Compute a shift that guarantees that shifting chunksPerBlock with it
> + * yields is smaller than SLAB_FREELIST_COUNT - 1 (one freelist is used for
> full blocks).
> + */
> + slab->freelist_shift = 0;
> + while ((slab->chunksPerBlock >> slab->freelist_shift) >=
> (SLAB_FREELIST_COUNT - 1))
> + slab->freelist_shift++;
>
> + /*
> + * Ensure, without a branch, that index 0 is only used for blocks entirely
> + * without free chunks.
> + * XXX: There probably is a cheaper way to do this. Needing to shift twice
> + * by slab->freelist_shift isn't great.
> + */
> + index = (freecount + (1 << slab->freelist_shift) - 1) >>
> slab->freelist_shift;
>
> How about something like
>
> #define SLAB_FREELIST_COUNT ((1<<3) + 1)
> index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);

Doesn't this create a sort of round-robin use of the free list? What we want is a sort of "histogram" bucket set of free lists so we can group together blocks that have a close-enough free number of chunks. Unless I'm mistaken, I think what you have doesn't do that.

I wondered if simply:

index = -(-freecount >> slab->freelist_shift);

would be faster than Andres' version. I tried it out and on my AMD machine, it's about the same speed. Same on a Raspberry Pi 4.

Going by [2], the instructions are very different with each method, so other machines with different latencies on those instructions might show something different. I attached what I used to test if anyone else wants a go.

AMD Zen2
$ ./freecount 20
Test 'a' in 0.922766 seconds
Test 'd' in 0.922762 seconds (0.000433% faster)

RPI4
$ ./freecount 20
Test 'a' in 3.341350 seconds
Test 'd' in 3.338690 seconds (0.079672% faster)

That was gcc. Trying it with clang, it went in a little heavy-handed and optimized out my loop, so some more trickery might be needed for a useful test on that compiler.

David

[2] https://godbolt.org/z/dh95TohEG

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SLAB_FREELIST_COUNT 9

int main(int argc, char **argv)
{
	clock_t		start, end;
	double		v1_time, v2_time;
	int			freecount;
	int			freelist_shift = 0;
	int			chunksPerBlock;
	int			index;
	int			zerocount = 0;

	if (argc < 2)
	{
		printf("Syntax: %s <chunksPerBlock>\n", argv[0]);
		return -1;
	}

	chunksPerBlock = atoi(argv[1]);
	printf("chunksPerBlock = %d\n", chunksPerBlock);

	while ((chunksPerBlock >> freelist_shift) >= (SLAB_FREELIST_COUNT - 1))
		freelist_shift++;

	printf("freelist_shift = %d\n", freelist_shift);

	start = clock();
	for (freecount = 0; freecount <= chunksPerBlock; freecount++)
	{
		index = (freecount + (1 << freelist_shift) - 1) >> freelist_shift;

		/* try to prevent optimizing the above out */
		if (index == 0)
			zerocount++;
	}
	end = clock();

	v1_time = (double) (end - start) / CLOCKS_PER_SEC;
	printf("Test 'a' in %f seconds\n", v1_time);
	printf("zerocount = %d\n", zerocount);

	zerocount = 0;
	start = clock();
	for (freecount = 0; freecount <= chunksPerBlock; freecount++)
	{
		index = -(-freecount >> freelist_shift);

		/* try to prevent optimizing the above out */
		if (index == 0)
			zerocount++;
	}
	end = clock();

	v2_time = (double) (end - start) / CLOCKS_PER_SEC;
	printf("Test 'd' in %f seconds (%f%% faster)\n", v2_time,
		   v1_time / v2_time * 100.0 - 100.0);
	printf("zerocount = %d\n", zerocount);

	return 0;
}
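For readers following along: both expressions compute the same bucket index, namely a ceiling division of freecount by 2^freelist_shift. A tiny standalone check of that equivalence (separate from the attached test program; note the negated-shift form relies on arithmetic right shift of negative values, which C leaves implementation-defined but which behaves as expected on the platforms discussed here):

```
#include <assert.h>

int
main(void)
{
	/* -(-x >> s) and (x + (1 << s) - 1) >> s both yield ceil(x / 2^s) */
	for (int s = 0; s < 8; s++)
		for (int x = 0; x < 4096; x++)
			assert(-(-x >> s) == (x + (1 << s) - 1) >> s);
	return 0;
}
```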
Re: pg_dump: Remove "blob" terminology
On 02.12.22 09:34, Daniel Gustafsson wrote:
> > On 2 Dec 2022, at 08:09, Peter Eisentraut wrote:
> > fixed
>
> +1 on this version of the patch, LGTM.

committed

> > I also put back the old long options forms in the documentation and
> > help but marked them deprecated.
> >
> >   +  --blobs (deprecated)
>
> While not in scope for this patch, I wonder if we should add a similar
> (deprecated) marker on other commandline options which are documented
> to be deprecated.  -i on bin/postgres comes to mind as one candidate.

makes sense
Re: Generate pg_stat_get_* functions with Macros
Hi,

On 12/5/22 8:44 AM, Michael Paquier wrote:
> On Mon, Dec 05, 2022 at 08:27:15AM +0100, Drouvot, Bertrand wrote:
> > On 12/4/22 6:32 PM, Nathan Bossart wrote:
> > > Alright.  I marked this as ready-for-committer.
> >
> > Thanks!
>
> Well, that's kind of nice:
>  5 files changed, 139 insertions(+), 396 deletions(-)
> And I like removing code, so..

Thanks for looking at it!

> In the same area, I am counting a total of 21 (?) pgstat routines for
> databases that rely on pgstat_fetch_stat_dbentry() while returning an
> int64.  This would lead to more cleanup.

Yeah, good point, thanks!

I'll look at the "databases" ones, but I think in a separate patch. The reason is that the current one is preparatory work for [1]. That means, once the current patch is committed, working on [1] and "cleaning" the databases one could be done in parallel. Sounds good to you?

[1]: https://commitfest.postgresql.org/41/3984/

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Operation log for major operations
Hi!

> I think storing this in pg_control is a bad idea. That file is
> extremely critical and if you break it, you're pretty much SOL on
> recovering your data. I suggest that this should use a separate file.

Thanks. The operation log location was changed to 'global/pg_control_log' (if the name is not pretty, it can be changed).

I attached the patch (v2-0001-Operation-log.patch) and a description of the operation log (Operation-log.txt).

With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com

Introduction.
-------------
The operation log is designed to store information about critical system events (like pg_upgrade, pg_resetwal, pg_rewind, etc.). This information is not interesting to the ordinary user, but it is very important for the vendor's technical support.

An example: a client complains about DB damage to technical support (the damage was caused by pg_resetwal, which was "silently" executed by one of the administrators).

Concepts.
---------
* The operation log is placed in the separate file 'global/pg_control_log' (log size is 8kB);
* the user can not modify the operation log; the log can be changed by function call only (from postgres or from postgres utilities);
* the operation log is a ring buffer (with CRC-32 protection); deleting entries from the operation log is possible only when the buffer is overflowed;
* an SQL function is used to read the data of the operation log.

Example of operation log data.
------------------------------
> select * from pg_operation_log();

    event    | edition | version |    lsn    |          last          | count
-------------+---------+---------+-----------+------------------------+-------
 startup     | vanilla | 10.22.0 | 0/828     | 2022-11-18 23:06:27+03 |     1
 pg_upgrade  | vanilla | 15.0.0  | 0/828     | 2022-11-18 23:06:27+03 |     1
 startup     | vanilla | 15.0.0  | 0/80001F8 | 2022-11-18 23:11:53+03 |     3
 pg_resetwal | vanilla | 15.0.0  | 0/80001F8 | 2022-11-18 23:09:53+03 |     2
(4 rows)

Some details about inserting data to the operation log.
-------------------------------------------------------
There are two modes of inserting information about events in the operation log.

* MERGE mode (events "startup", "pg_resetwal", "pg_rewind"). We search the ring buffer of the operation log for an event of the same type ("startup" for example) with the same version number. If the event was found, we increment the event counter by 1 and update the date/time of the event (the "last" field) with the current value. If the event was not found, we add this event to the ring buffer (see INSERT mode).

* INSERT mode (events "bootstrap", "pg_upgrade", "promoted"). We add an event to the ring buffer (without searching).

Operation log structures.
-------------------------
The operation log is placed in the pg_control_log file (file size PG_OPERATION_LOG_FULL_SIZE=8192). The operation log is a ring buffer of 341 elements (24 bytes each) with a header (8 bytes).

a) Operation log element:

typedef struct OperationLogData
{
	uint8		ol_type;		/* operation type */
	uint8		ol_edition;		/* postgres edition */
	uint16		ol_count;		/* number of operations */
	uint32		ol_version;		/* postgres version */
	pg_time_t	ol_timestamp;	/* = int64, operation date/time */
	XLogRecPtr	ol_lsn;			/* = uint64, last check point record ptr */
} OperationLogData;

Description of fields:

* ol_type - event type. The types are contained in an enum:

typedef enum
{
	OLT_BOOTSTRAP = 1,	/* bootstrap */
	OLT_STARTUP,		/* server startup */
	OLT_RESETWAL,		/* pg_resetwal */
	OLT_REWIND,			/* pg_rewind */
	OLT_UPGRADE,		/* pg_upgrade */
	OLT_PROMOTED,		/* promoted */
	OLT_NumberOfTypes	/* should be last */
} ol_type_enum;

* ol_edition - PostgreSQL edition (this field is important for custom PostgreSQL editions).
The editions are contained in an enum:

typedef enum
{
	ED_PG_ORIGINAL = 0
	/* Here can be custom editions */
} PgNumEdition;

* ol_count - the number of fired events in case events of this type can be merged (otherwise 1). Maximum is 65535;

* ol_version - version formed using PG_VERSION_NUM (PG_VERSION_NUM*100; for example: 13000800 for v13.8). The two last digits can be used for the patch version;

* ol_timestamp - date/time of the last fired event in case events of this type can be merged (otherwise - the date/time of the event);

* ol_lsn - "Latest checkpoint location" value (ControlFile->checkPoint) contained in the pg_control file at the time of the event.

b) Operation log header:

typedef struct OperationLogHeader
{
	pg_crc32c	ol_crc;		/* CRC of operation l
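Stepping back to the MERGE/INSERT modes described earlier, a simplified sketch of the merge path may make the behaviour concrete. It uses the OperationLogData layout from this message; ol_insert_event() is a hypothetical helper standing in for the INSERT-mode path, and all buffer locking and CRC maintenance is omitted:

```
/* hypothetical helper: INSERT-mode path (adds a fresh ring entry) */
static void ol_insert_event(OperationLogData *ring, int nused, uint8 type,
							uint32 version, pg_time_t now, XLogRecPtr lsn);

/*
 * MERGE-mode insertion: find an existing entry with the same type and
 * version and bump its counter, otherwise fall back to INSERT mode.
 */
static void
ol_merge_event(OperationLogData *ring, int nused, uint8 type,
			   uint32 version, pg_time_t now, XLogRecPtr lsn)
{
	for (int i = 0; i < nused; i++)
	{
		if (ring[i].ol_type == type && ring[i].ol_version == version)
		{
			ring[i].ol_count++;			/* merge: bump the event counter */
			ring[i].ol_timestamp = now; /* remember the latest occurrence */
			ring[i].ol_lsn = lsn;
			return;
		}
	}

	/* No matching entry: fall back to INSERT mode. */
	ol_insert_event(ring, nused, type, version, now, lsn);
}
```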
Re: Generate pg_stat_get_* functions with Macros
On Mon, Dec 05, 2022 at 09:11:43AM +0100, Drouvot, Bertrand wrote:
> Means, once the current patch is committed, working on [1] and
> "cleaning" the databases one could be done in parallel. Sounds good
> to you?

Doing that in a separate patch is fine by me.
--
Michael
Order getopt arguments
I had noticed that most getopt() or getopt_long() calls had their letter lists in pretty crazy orders. There might have been occasional attempts at grouping, but those then haven't been maintained as new options were added. To restore some sanity to this, I went through and ordered them alphabetically.

From 022b6f7665e7f38fd773f7d44c892344b714f795 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 5 Dec 2022 09:15:53 +0100
Subject: [PATCH] Order getopt arguments

Order the letters in the arguments of getopt() and getopt_long(), as well as in the subsequent switch statements. In most cases, I used alphabetical with lower case first. In a few cases, existing different orders (e.g., upper case first) was kept to reduce the diff size.
---
 src/backend/bootstrap/bootstrap.c             |  52 ++---
 src/backend/postmaster/postmaster.c           |  56 ++---
 src/backend/tcop/postgres.c                   |  54 ++---
 src/bin/pg_amcheck/pg_amcheck.c               |  10 +-
 src/bin/pg_archivecleanup/pg_archivecleanup.c |   2 +-
 src/bin/pg_basebackup/pg_basebackup.c         | 108 -
 src/bin/pg_basebackup/pg_receivewal.c         |  41 ++--
 src/bin/pg_basebackup/pg_recvlogical.c        |   8 +-
 src/bin/pg_checksums/pg_checksums.c           |  14 +-
 src/bin/pg_upgrade/option.c                   |   2 +-
 src/bin/pgbench/pgbench.c                     | 206 +-
 src/bin/scripts/clusterdb.c                   |  38 ++--
 src/bin/scripts/createdb.c                    |  44 ++--
 src/bin/scripts/dropdb.c                      |  20 +-
 src/bin/scripts/dropuser.c                    |  14 +-
 src/bin/scripts/reindexdb.c                   |  56 ++---
 src/bin/scripts/vacuumdb.c                    |  98 -
 src/interfaces/ecpg/preproc/ecpg.c            |  91
 .../modules/libpq_pipeline/libpq_pipeline.c   |   8 +-
 19 files changed, 458 insertions(+), 464 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 661ebacb0c..d623ad9d50 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -228,6 +228,32 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 			case 'B':
 				SetConfigOption("shared_buffers", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
+			case 'c':
+			case '-':
+				{
+					char	   *name,
+							   *value;
+
+					ParseLongOption(optarg, &name, &value);
+					if (!value)
+					{
+						if (flag == '-')
+							ereport(ERROR,
+									(errcode(ERRCODE_SYNTAX_ERROR),
+									 errmsg("--%s requires a value",
+											optarg)));
+						else
+							ereport(ERROR,
+									(errcode(ERRCODE_SYNTAX_ERROR),
+									 errmsg("-c %s requires a value",
+											optarg)));
+					}
+
+					SetConfigOption(name, value, PGC_POSTMASTER, PGC_S_ARGV);
+					pfree(name);
+					pfree(value);
+					break;
+				}
 			case 'D':
 				userDoption = pstrdup(optarg);
 				break;
@@ -265,32 +291,6 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 							PGC_S_DYNAMIC_DEFAULT);
 				}
 				break;
-			case 'c':
-			case '-':
-				{
-					char	   *name,
-							   *value;
-
-					ParseLongOption(optarg, &name, &value);
-					if (!value)
-					{
-						if (flag == '-')
-							ereport(ERROR,
-
Re: Order getopt arguments
Hello Peter,

> I had noticed that most getopt() or getopt_long() calls had their letter
> lists in pretty crazy orders. There might have been occasional attempts
> at grouping, but those then haven't been maintained as new options were
> added. To restore some sanity to this, I went through and ordered them
> alphabetically.

I agree that a more or less random historical order does not make much sense.

For pgbench, ISTM that sorting per functionality then alphabetical would be better than pure alphabetical because it has 2 modes. Such sections might be (1) general, (2) connection, (3) common/shared, (4) initialization and (5) benchmarking, with some comments on each.

What do you think? If okay, I'll send you a patch for that.

--
Fabien.
Re: Optimize common expressions in projection evaluation
> Which is properly written as the following, using lateral, which also avoids
> the problem you describe:
>
> INSERT INTO tbl
> SELECT func_call.*
> FROM ft
> JOIN LATERAL convert_func(ft.rawdata) AS func_call ON true;

I didn't fully realize this syntax until you pointed it out. I just tried it out in our case and it works well. I think my problem is mostly resolved without the need for this patch. Thanks!

It's still good to do something about the normal (func(v)).* syntax if it's still considered legal. I will give a try to expanding it more cleverly and see if we can avoid the duplicate evaluation issue.

Peifeng Qiu
Re: [PATCH] Add native windows on arm64 support
On 05/12/2022 05:12, Michael Paquier wrote:
> - Last comes OpenSSL, that supports amd64_arm64 as build target (see
> NOTES-WINDOWS.md), and the library names are libssl.lib and
> libcrypto.lib. Looking at https://slproweb.com/products/Win32OpenSSL.html,
> there are experimental builds for arm64 with OpenSSL 3.0. Niyas or
> somebody else, could you look at the contents of lib/VC/ and see what we
> could rely on for the debug builds after installing this MSI? We should
> rely on something like lib/VC/sslcrypto64MD.lib or
> lib/VC/sslcrypto32MD.lib, but for arm64.

I tried that installer, and I can see the following libraries installed in the lib/VC location:

libcryptoarm64MD.lib
libcryptoarm64MDd.lib
libcryptoarm64MT.lib
libcryptoarm64MTd.lib
libsslarm64MD.lib
libsslarm64MDd.lib
libsslarm64MT.lib
libsslarm64MTd.lib

> -	USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK => undef,
> +	USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK => $self->{platform} eq "ARM64" ? : 1 : undef,
>
> Did you actually test this patch? This won't work at all with perl, per
> se the double colon after the question mark.

Yes, I did a full build. I am not sure why I didn't see any error with that. My perl skills are very limited and I started with something a bit more naive like "$self->{platform} == "ARM64" ? : 1 : undef". But that didn't work and I changed to using eq, and the compilation was fine. Thanks for fixing the patch.

> For now, please find attached an updated patch with all the fixes I
> could come up with.

Thanks. I did a quick sanity build with your updated patch and it looks fine.

--
Niyas
Re: Perform streaming logical transactions by background workers and parallel apply
On Sun, Dec 4, 2022 at 4:48 PM houzj.f...@fujitsu.com wrote:
>
> On Friday, December 2, 2022 4:59 PM Peter Smith wrote:
> >
> > ~~~
> >
> > 12. pa_clean_subtrans
> >
> > +/* Reset the list that maintains subtransactions. */
> > +void
> > +pa_clean_subtrans(void)
> > +{
> > +	subxactlist = NIL;
> > +}
> >
> > Maybe a more informative function name would be pa_reset_subxactlist()?
>
> I thought the current name is more consistent with pa_start_subtrans.
>

Then how about changing the name to pa_reset_subtrans()?

> >
> > 21. apply_worker_clean_exit
> >
> > I wasn't sure if calling this a 'clean' exit meant anything much.
> >
> > How about:
> > - apply_worker_proc_exit, or
> > - apply_worker_exit
>
> I thought the clean means the exit number is 0 (proc_exit(0)) and is
> not due to any ERROR. I am not sure if proc_exit or exit is better.
>
> I have addressed other comments in the new version patch.
>

+1 for apply_worker_exit. One minor suggestion for a recent change in v56-0001*:

 /*
- * A hash table used to cache streaming transactions being applied and the
- * parallel application workers required to apply transactions.
+ * A hash table used to cache the state of streaming transactions being
+ * applied by the parallel apply workers.
  */
 static HTAB *ParallelApplyTxnHash = NULL;

--
With Regards,
Amit Kapila.
Marking options deprecated in help output
In the pg_dump blob terminology thread it was briefly discussed [0] to mark parameters as deprecated in the --help output. The attached is a quick diff to show what that would look like. Personally I think it makes sense; not everyone will read the docs.

--
Daniel Gustafsson	https://vmware.com/

[0] 95467596-834d-baa4-c67c-5db1096ed...@enterprisedb.com

deprecated_options.diff
Description: Binary data
Re: Using WaitEventSet in the postmaster
On Sat, Dec 3, 2022 at 10:41 AM Thomas Munro wrote:
> Here's an iteration like that.  Still WIP grade.  It passes, but there
> must be something I don't understand about this computer program yet,
> because if I move the "if (pending_..." section up into the block
> where WL_LATCH_SET has arrived (instead of testing those variables
> every time through the loop), a couple of tests leave zombie
> (unreaped) processes behind, indicating that something funky happened
> to the state machine that I haven't yet grokked.  Will look more next
> week.

Duh. The reason for that was the pre-existing special case for PM_WAIT_DEAD_END, which used a sleep(100ms) loop to wait for children to exit, which I needed to change to a latch wait. Fixed in the next iteration, attached.

The reason for the existing sleep-based approach was that we didn't want to accept any more connections (or spin furiously because the listen queue was non-empty). So in this version I invented a way to suppress socket events temporarily with WL_SOCKET_IGNORE, and then reactivate them after crash reinit.

Still WIP, but I hope travelling in the right direction.

From 65b5fa1f7024cb78cee9ba57d36a78dc17ffe492 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Wed, 9 Nov 2022 22:59:58 +1300
Subject: [PATCH v3] Give the postmaster a WaitEventSet and a latch.

Traditionally, the postmaster's architecture was quite unusual. It did a lot of work inside signal handlers, which were only unblocked while waiting in select() to make that safe.

Switch to a more typical architecture, where signal handlers just set flags and use a latch to close races. Now the postmaster looks like all other PostgreSQL processes, multiplexing its event processing in epoll_wait()/kevent()/poll()/WaitForMultipleObjects() depending on the OS.

Changes:

* WL_SOCKET_ACCEPT is a new event for an incoming connection (on Unix, this is just another name for WL_SOCKET_READABLE, but Windows has a different underlying event; this mirrors WL_SOCKET_CONNECTED on the other end of a connection)

* WL_SOCKET_IGNORE is a new way to stop waking up for new incoming connections while shutting down.

* Small adjustments to WaitEventSet to allow running in the postmaster.

* Allow the postmaster to set up its own local latch. For now we don't want other backends setting the postmaster's latch directly (perhaps later we'll figure out how to use a shared latch "robustly", so that memory corruption can't interfere with the postmaster's cleanup-and-restart responsibilities, but for now there is a two-step signal protocol SIGUSR1 -> SIGURG).

* The existing signal handlers are cut in two: a handle_XXX part that sets a pending_XXX variable and sets the local latch, and a process_XXX part.

* ServerLoop(), the process_XXX() functions and PostmasterStateMachine() now all take a pointer to a Postmaster object that lives on the stack as a parameter that initially holds the WaitEventSet they need to do their job. Many other global variables could be moved into it, but that's not done here.

* Signal handlers are now installed with the regular pqsignal() function rather than the special pqsignal_pm() function; the concerns about the portability of SA_RESTART vs select() are no longer relevant: SUSv2 left it implementation-defined whether select() restarts, but didn't add that qualification for poll(), and it doesn't matter anyway because we call SetLatch() creating a new reason to wake up.
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/CA%2BhUKG%2BZ-HpOj1JsO9eWUP%2Bar7npSVinsC_npxSy%2BjdOMsx%3DGg%40mail.gmail.com
---
 src/backend/libpq/pqsignal.c        |  40 ---
 src/backend/postmaster/postmaster.c | 413 +++-
 src/backend/storage/ipc/latch.c     |  22 ++
 src/backend/tcop/postgres.c         |   1 -
 src/backend/utils/init/miscinit.c   |  13 +-
 src/include/libpq/pqsignal.h        |   3 -
 src/include/miscadmin.h             |   1 +
 src/include/storage/latch.h         |   9 +-
 8 files changed, 266 insertions(+), 236 deletions(-)

diff --git a/src/backend/libpq/pqsignal.c b/src/backend/libpq/pqsignal.c
index 1ab34c5214..718043a39d 100644
--- a/src/backend/libpq/pqsignal.c
+++ b/src/backend/libpq/pqsignal.c
@@ -97,43 +97,3 @@ pqinitmask(void)
 	sigdelset(&StartupBlockSig, SIGALRM);
 #endif
 }
-
-/*
- * Set up a postmaster signal handler for signal "signo"
- *
- * Returns the previous handler.
- *
- * This is used only in the postmaster, which has its own odd approach to
- * signal handling.  For signals with handlers, we block all signals for the
- * duration of signal handler execution.  We also do not set the SA_RESTART
- * flag; this should be safe given the tiny range of code in which the
- * postmaster ever unblocks signals.
- *
- * pqinitmask() must have been invoked previously.
- */
-pqsigfunc
-pqsignal_pm(int signo, pqsigfunc func)
-{
-	struct sigaction act,
-				oact;
-
-	act.sa_handler = f
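The handler split described in the commit message (handle_XXX sets a flag and the latch; process_XXX does the work in the main loop) follows the standard latch pattern used elsewhere in the tree. A minimal sketch, with simplified names rather than the patch's actual code:

```
#include "postgres.h"
#include "storage/latch.h"

/* set from signal context; sig_atomic_t writes are async-signal-safe */
static volatile sig_atomic_t pending_pmsignal = false;

/* handle_XXX: runs in signal context, only records the event */
static void
handle_pmsignal(SIGNAL_ARGS)
{
	pending_pmsignal = true;
	SetLatch(MyLatch);			/* wake the event loop (the postmaster's
								 * own local latch, in this patch) */
}

/* process_XXX: runs in the main event loop, does the real work */
static void
process_pmsignal(void)
{
	if (!pending_pmsignal)
		return;
	pending_pmsignal = false;
	/* ... perform the actual state-machine work here ... */
}
```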
Re: doc: add missing "id" attributes to extension packaging page
On 2022-Dec-05, Ian Lawrence Barwick wrote:

> On this page:
>
>     https://www.postgresql.org/docs/current/extend-extensions.html
>
> three of the sections are missing an "id" attribute; patch adds
> these. Noticed when trying to create a stable link to one of the affected
> sections.

Hm, I was reminded of this patch here that adds IDs in a lot of places
https://postgr.es/m/3bac458c-b121-1b20-8dea-0665986fa...@gmx.de

and this other one
https://postgr.es/m/76287ac6-f415-8562-fdaa-5876380c0...@gmx.de
which adds XSL stuff for adding selectable anchors next to each id-carrying item.

--
Álvaro Herrera               Breisgau, Deutschland — https://www.EnterpriseDB.com/
Re: Avoid streaming the transaction which are skipped (in corner cases)
On Mon, Dec 5, 2022 at 9:21 AM Dilip Kumar wrote:
>
> On Mon, Dec 5, 2022 at 8:59 AM Amit Kapila wrote:
> >
> > On Sun, Dec 4, 2022 at 5:14 PM houzj.f...@fujitsu.com wrote:
> > >
> > > On Saturday, December 3, 2022 7:37 PM Amit Kapila wrote:
> > > > Apart from the above, I have slightly adjusted the comments in the
> > > > attached. Do let me know what you think of the attached.
> > >
> > > Thanks for updating the patch. It looks good to me.
> > >
> >
> > I feel the function name ReorderBufferLargestTopTXN() is slightly
> > misleading because it also checks some of the streaming properties
> > (like whether the TXN has partial changes and whether it contains any
> > streamable change). Shall we rename it to
> > ReorderBufferLargestStreamableTopTXN() or something like that?
>
> Yes that makes sense

I have done this change in the attached patch.

> > The other point to consider is whether we need to have a test case for
> > this patch. I think before this patch if the size of DDL changes in a
> > transaction exceeds logical_decoding_work_mem, the empty streams will
> > be output in the plugin but after this patch, there won't be any such
> > stream.

I tried this test, but I think generating 64k of data with just CID messages will make the test case really big. I tried using multiple sessions such that one session makes the reorder buffer full but contains partial changes so that we try to stream another transaction, but that is not possible in an automated test to consistently generate the partial change. I think we need something like this [1] so that we can better control the streaming.

[1] https://www.postgresql.org/message-id/OSZPR01MB631042582805A8E8615BC413FD329%40OSZPR01MB6310.jpnprd01.prod.outlook.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

From d15b33d74952a78c4050ed7f1235bbb675643478 Mon Sep 17 00:00:00 2001
From: Amit Kapila
Date: Sat, 3 Dec 2022 12:11:29 +0530
Subject: [PATCH v5] Avoid unnecessary streaming of transactions during
 logical replication.

After restart, we don't perform streaming of an in-progress transaction if it was previously decoded and confirmed by the client. To achieve that we were comparing the END location of the WAL record being decoded with the WAL location we have already decoded and confirmed by the client. While decoding the commit record, to decide whether to process and send the complete transaction, we compare its START location with the WAL location we have already decoded and confirmed by the client.

Now, if we need to queue some change in the transaction while decoding the commit record (e.g. snapshot), it is possible that we decide to stream the transaction but later commit processing decides to skip it. In such a case, we would needlessly send the changes and later, when we decide to skip it, we will send a stream abort.

We also sometimes decide to stream the changes when we actually just need to process them locally, like a change for invalidations. This will lead us to send empty streams. To avoid this, while queuing each change for decoding, we remember whether the transaction has any change that actually needs to be sent downstream and use that information later to decide whether to stream the transaction or not.
Author: Dilip Kumar
Reviewed-by: Hou Zhijie, Ashutosh Bapat, Shi yu, Amit Kapila
Discussion: https://postgr.es/m/CAFiTN-tHK=7lzfrps8fbt2ksrojgqbzywcgxst2bm9-rjja...@mail.gmail.com
---
 src/backend/replication/logical/reorderbuffer.c | 51 ++---
 src/include/replication/reorderbuffer.h         | 23 +++
 2 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 31f7381..6e11056 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -695,9 +695,9 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
  * Record the partial change for the streaming of in-progress transactions. We
  * can stream only complete changes so if we have a partial change like toast
  * table insert or speculative insert then we mark such a 'txn' so that it
- * can't be streamed. We also ensure that if the changes in such a 'txn' are
- * above logical_decoding_work_mem threshold then we stream them as soon as we
- * have a complete change.
+ * can't be streamed. We also ensure that if the changes in such a 'txn' can
+ * be streamed and are above logical_decoding_work_mem threshold then we stream
+ * them as soon as we have a complete change.
  */
 static void
 ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -762,7 +762,8 @@ ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	 */
 	if (ReorderBufferCanStartStreaming(rb) &&
 		!(rbtxn_has_partial_change(toptxn)) &&
-		rbtxn_is_serialized(txn))
+		rbtxn_is_serialized(txn) &&
+		rbtxn_has_streamable_change(to
Re: slab allocator performance issues
On Mon, Dec 5, 2022 at 3:02 PM David Rowley wrote:
>
> On Fri, 11 Nov 2022 at 22:20, John Naylor wrote:
> > #define SLAB_FREELIST_COUNT ((1<<3) + 1)
> > index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);
>
> Doesn't this create a sort of round-robin use of the free list? What
> we want is a sort of "histogram" bucket set of free lists so we can
> group together blocks that have a close-enough free number of chunks.
> Unless I'm mistaken, I think what you have doesn't do that.

The intent must have slipped my mind along the way.

> I wondered if simply:
>
> index = -(-freecount >> slab->freelist_shift);
>
> would be faster than Andres' version. I tried it out and on my AMD
> machine, it's about the same speed. Same on a Raspberry Pi 4.
>
> Going by [2], the instructions are very different with each method, so
> other machines with different latencies on those instructions might
> show something different. I attached what I used to test if anyone
> else wants a go.

I get about 0.1% difference on my machine. Both ways boil down to (on gcc) 3 instructions with low latency. The later ones need the prior results to execute, which I think is what the XXX comment "isn't great" was referring to.

The new coding is more mysterious (does it do the right thing on all platforms?), so I guess the original is still the way to go unless we get a better idea.

--
John Naylor
EDB: http://www.enterprisedb.com
MemoizePath fails to work for partitionwise join
I happened to notice that currently MemoizePath cannot be generated for partitionwise join. I traced that and found the problem.

One condition we need to meet to generate MemoizePath is that the inner path's ppi_clauses should have the form "outer op inner" or "inner op outer", as checked by clause_sides_match_join in paraminfo_get_equal_hashops. Note that when we are at this check, the inner path has not yet had the chance to be re-parameterized (that is done later in try_nestloop_path), so if it is parameterized, it is parameterized by the topmost parent of the outer rel, not the outer rel itself. Thus this check performed by clause_sides_match_join cannot succeed if we are joining two child rels.

The fix is straightforward: just use outerrel->top_parent if it is not null and leave the reparameterization work to try_nestloop_path. In addition, I believe when reparameterizing MemoizePath we need to adjust its param_exprs too.

Attached is the patch for the fix.

Thanks
Richard

v1-0001-Fix-MemoizePath-for-partitionwise-join.patch
Description: Binary data
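To illustrate the kind of plan shape at stake, here is a hypothetical setup (table and column names invented for this example, not taken from the patch's regression tests) where partitionwise child joins are planned separately, and where the fix would let a Memoize node be considered above a parameterized scan of the inner child. Depending on statistics and settings the planner may of course still prefer another join method:

```
-- Two hash-partitioned tables joined on the partition key.
CREATE TABLE prt1 (a int, b int) PARTITION BY HASH (a);
CREATE TABLE prt1_p0 PARTITION OF prt1 FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES WITH (MODULUS 2, REMAINDER 1);
CREATE TABLE prt2 (a int, c int) PARTITION BY HASH (a);
CREATE TABLE prt2_p0 PARTITION OF prt2 FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE prt2_p1 PARTITION OF prt2 FOR VALUES WITH (MODULUS 2, REMAINDER 1);

SET enable_partitionwise_join = on;

-- Each child join (prt1_p0/prt2_p0 and prt1_p1/prt2_p1) is planned on
-- its own; with the fix, a nested loop over a child pair may use Memoize.
EXPLAIN (COSTS OFF)
SELECT * FROM prt1 t1 JOIN prt2 t2 ON t1.a = t2.a;
```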
Re: Avoid streaming the transaction which are skipped (in corner cases)
On Mon, Dec 5, 2022 at 3:41 PM Dilip Kumar wrote:
>
> On Mon, Dec 5, 2022 at 9:21 AM Dilip Kumar wrote:
> >
> > On Mon, Dec 5, 2022 at 8:59 AM Amit Kapila wrote:
> > >
> > > On Sun, Dec 4, 2022 at 5:14 PM houzj.f...@fujitsu.com wrote:
> > > >
> > > > On Saturday, December 3, 2022 7:37 PM Amit Kapila wrote:
> > > > > Apart from the above, I have slightly adjusted the comments in the
> > > > > attached. Do let me know what you think of the attached.
> > > >
> > > > Thanks for updating the patch. It looks good to me.
> > > >
> > >
> > > I feel the function name ReorderBufferLargestTopTXN() is slightly
> > > misleading because it also checks some of the streaming properties
> > > (like whether the TXN has partial changes and whether it contains any
> > > streamable change). Shall we rename it to
> > > ReorderBufferLargestStreamableTopTXN() or something like that?
> >
> > Yes that makes sense
>
> I have done this change in the attached patch.
>
> > > The other point to consider is whether we need to have a test case for
> > > this patch. I think before this patch if the size of DDL changes in a
> > > transaction exceeds logical_decoding_work_mem, the empty streams will
> > > be output in the plugin but after this patch, there won't be any such
> > > stream.
>
> I tried this test, but I think generating 64k data with just CID
> messages will make the test case really big. I tried using multiple
> sessions such that one session makes the reorder buffer full but
> contains partial changes so that we try to stream another transaction
> but that is not possible in an automated test to consistently generate
> the partial change.
>

I also don't see a way to achieve it in an automated way because both toast and speculative inserts are part of one statement, so we need a real concurrent test to make it happen. Can anyone else think of a way to achieve it?

> I think we need something like this [1] so that we can better control
> the streaming.
>

+1. The additional advantage would be that we can generate parallel apply and new streaming tests with much less data. Shi-San, can you please start a new thread for the GUC patch proposed by you, as indicated by Dilip?

--
With Regards,
Amit Kapila.
Re: Missing MaterialPath support in reparameterize_path_by_child
On Fri, Dec 2, 2022 at 9:13 PM Tom Lane wrote:
>
> Ashutosh Bapat writes:
> > On Fri, Dec 2, 2022 at 8:25 AM Tom Lane wrote:
> >> Unfortunately, I don't have an example that produces such a
> >> failure against HEAD.  It seems certain to me that such cases
> >> exist, though, so I'd like to apply and back-patch the attached.
>
> > From this comment, that I wrote back when I implemented that function,
> > I wonder if we thought MaterialPath wouldn't appear on the inner side
> > of nestloop join. But that can't be the case. Or probably we didn't
> > find MaterialPath being there from our tests.
>
> >  * This function is currently only applied to the inner side of a nestloop
> >  * join that is being partitioned by the partitionwise-join code.  Hence,
> >  * we need only support path types that plausibly arise in that context.
>
> > But I think it's good to have MaterialPath there.
>
> So thinking about this a bit: the reason it is okay if reparameterize_path
> fails is that it's not fatal.  We just go on our way without making
> a parameterized path for that appendrel.  However, if
> reparameterize_path_by_child fails for every available child path,
> we end up with "could not devise a query plan", because the
> partitionwise-join code is brittle and won't tolerate failure
> to build a parent-join path.  Seems like we should be willing to
> fall back to a non-partitionwise join in that case.
>
>                         regards, tom lane

Partitionwise join should be willing to fall back to a non-partitionwise join in such a case. After spending a few minutes with the code, I think generate_partitionwise_join_paths() should not call set_cheapest() if the pathlist of the child is NULL, and should just wind up and avoid adding any path.

--
Best Wishes,
Ashutosh Bapat
Re: Allow placeholders in ALTER ROLE w/o superuser
> On Thu, Dec 1, 2022 at 6:14 AM Alexander Korotkov wrote:
> > On Wed, Nov 23, 2022 at 1:53 AM Steve Chavez wrote:
> > > So from my side this all looks good!
> >
> > Thank you for your feedback.
> >
> > The next revision of the patch is attached.  It contains code
> > improvements, comments and documentation.  I'm going to also write
> > some tests.  pg_db_role_setting doesn't seem to be well-covered with
> > tests.  I will probably need to write a new module into
> > src/tests/modules to check how placeholders interact with dynamically
> > defined GUCs.
>
> Another revision of the patch is attached.  It's fixed now that USER SET
> values can't be used for PGC_SUSET parameters.  Tests are added.  That
> required a new module test_pg_db_role_setting to check dynamically
> defined GUCs.

I've looked through the last version of the patch. The tests in v3 failed due to naming mismatches. I fixed this in v4 (PFA).

The other thing that may seem unexpected is whether the value should apply to the ordinary user only, encoded in the parameter name. The pro of this is that it doesn't break catalog compatibility by adding a separate field for GUC permissions, a concept that doesn't exist today (and maybe is not needed at all). Also, it makes the patch more minimalistic in the code. This is also fully compatible with the previous parameter naming, due to parentheses being an unsupported symbol for a parameter name.

I've also tried to revise the comments and docs a little bit to reflect the changes. The CI-enabled build of patch v4 for reference is at https://github.com/pashkinelfe/postgres/tree/placeholders-in-alter-role-v4

Overall the patch looks useful and good enough to be committed.

Kind regards,
Pavel Borisov, Supabase

v4-0001-USER-SET-parameters-for-pg_db_role_setting.patch
Description: Binary data
Re: Allow placeholders in ALTER ROLE w/o superuser
After posting the patch I've found my own typo in the docs, so I corrected it in v5 (PFA).

Regards,
Pavel.

v5-0001-USER-SET-parameters-for-pg_db_role_setting.patch
Description: Binary data
move some bitmapset.c macros to bitmapset.h
Over in [1], Masahiko and I found that using some bitmapset logic yields a useful speedup in one part of the proposed radix tree patch. In addition to what's in bitmapset.h now, we'd need WORDNUM, BITNUM, RIGHTMOST_ONE and bmw_rightmost_one_pos() from bitmapset.c. The file tidbitmap.c has its own copies of the first two, and we could adapt that strategy again. But instead of three files defining these, it seems like it's time to consider moving them somewhere more central.

Attached is a simple form of this concept, including moving HAS_MULTIPLE_ONES just for consistency. One possible objection is the possibility of namespace clash. Thoughts?

I also considered putting the macros and typedefs in pg_bitutils.h. One organizational advantage is that pg_bitutils.h already offers convenience function symbols where the parameter depends on SIZEOF_SIZE_T, so putting the BITS_PER_BITMAPWORD versions there makes sense. But that way is not a clear win, so I didn't go that far.

[1] https://www.postgresql.org/message-id/CAFBsxsHgP5LP9q%3DRq_3Q2S6Oyut67z%2BVpemux%2BKseSPXhJF7sg%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com

 src/backend/nodes/bitmapset.c | 24 ------------------------
 src/backend/nodes/tidbitmap.c |  5 -----
 src/include/nodes/bitmapset.h | 24 ++++++++++++++++++++++++
 3 files changed, 24 insertions(+), 29 deletions(-)

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..3204b49738 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -26,33 +26,9 @@
 #include "port/pg_bitutils.h"
 
 
-#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)
-
 #define BITMAPSET_SIZE(nwords)	\
 	(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
 
-/*--
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word.  It assumes two's complement arithmetic.  Consider any
- * nonzero value, and focus attention on the rightmost one.  The value is
- * then something like
- *		xxxxxx10000
- * where x's are unspecified bits.  The two's complement negative is formed
- * by inverting all the bits and adding one.  Inversion gives
- *		yyyyyy01111
- * where each y is the inverse of the corresponding x.  Incrementing gives
- *		yyyyyy10000
- * and then ANDing with the original value gives
- *		00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *--
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x)	((bitmapword) RIGHTMOST_ONE(x) != (x))
-
 /* Select appropriate bit-twiddling functions for bitmap word size */
 #if BITS_PER_BITMAPWORD == 32
 #define bmw_leftmost_one_pos(w)		pg_leftmost_one_pos32(w)

diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index a7a6b26668..8a1fd1d205 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -72,11 +72,6 @@
  */
 #define PAGES_PER_CHUNK (BLCKSZ / 32)
 
-/* We use BITS_PER_BITMAPWORD and typedef bitmapword from nodes/bitmapset.h */
-
-#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)
-
 /* number of active words for an exact page: */
 #define WORDS_PER_PAGE	((MAX_TUPLES_PER_PAGE - 1) / BITS_PER_BITMAPWORD + 1)
 /* number of active words for a lossy chunk: */

diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..8462282b9e 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -48,6 +48,30 @@ typedef int32 signedbitmapword; /* must be the matching signed type */
 
 #endif
 
+#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)
+
+/*--
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word.  It assumes two's complement arithmetic.  Consider any
+ * nonzero value, and focus attention on the rightmost one.  The value is
+ * then something like
+ *		xxxxxx10000
+ * where x's are unspecified bits.  The two's complement negative is formed
+ * by inverting all the bits and adding one.  Inversion gives
+ *		yyyyyy01111
+ * where each y is the inverse of the corresponding x.  Incrementing gives
+ *		yyyyyy10000
+ * and then ANDing with the original value gives
+ *		00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *--
+ */
+#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
+
+#define HAS_MULTIPLE_ONES(x)	((bitmapword) RIGHTMOST_ONE(x) != (x))
+
 typedef struct Bitmapset
 {
 	pg_node_attr(custom_copy_equal, special_read_write)
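As a concrete check of the trick in that comment, here is a tiny standalone example in plain C (independent of the PostgreSQL typedefs; int32_t stands in for signedbitmapword):

```
#include <assert.h>
#include <stdint.h>

/* Isolate the rightmost one-bit: x & -x, assuming two's complement. */
#define RIGHTMOST_ONE(x) ((int32_t) (x) & -((int32_t) (x)))

int
main(void)
{
	assert(RIGHTMOST_ONE(0x2C) == 0x04);	/* 0b101100 -> 0b000100 */
	assert(RIGHTMOST_ONE(0x01) == 0x01);	/* already the rightmost bit */
	assert(RIGHTMOST_ONE(0) == 0);			/* the documented zero case */
	return 0;
}
```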
Re: move some bitmapset.c macros to bitmapset.h
On 2022-Dec-05, John Naylor wrote:

> diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
> index b7b274aeff..3204b49738 100644
> --- a/src/backend/nodes/bitmapset.c
> +++ b/src/backend/nodes/bitmapset.c
> @@ -26,33 +26,9 @@
>  #include "port/pg_bitutils.h"
>
>
> -#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
> -#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)

In this location, nobody can complain about the naming of these macros, since they're just used to implement other bitmapset.c code. However, if you move them to the .h file, ISTM you should give them more meaningful names.

--
Álvaro Herrera         48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Oh, great altar of passive entertainment, bestow upon me thy discordant images at such speed as to render linear thought impossible" (Calvin a la TV)
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Hi hackers,

I've been working on (and struggling with) this patch for a while, but I haven't updated this thread regularly, so here is what I did with this patch so far.

> On Sat, 6 Aug 2022 at 16:01, Amit Kapila wrote:
> >
> > I think there is some basic flaw in slot reuse design. Currently, we
> > copy the table by starting a repeatable read transaction (BEGIN READ
> > ONLY ISOLATION LEVEL REPEATABLE READ) and create a slot that
> > establishes a snapshot which is first used for copy and then LSN
> > returned by it is used in the catchup phase after the copy is done.
> > The patch won't establish such a snapshot before a table copy as it
> > won't create a slot each time. If this understanding is correct, I
> > think we need to use ExportSnapshot/ImportSnapshot functionality to
> > achieve it or do something else to avoid the problem mentioned.

This issue that Amit mentioned causes some problems such as duplicated rows on the subscriber.

Basically, with this patch, the tablesync worker creates a replication slot only in its first run. To ensure table copy and sync are consistent with each other, the worker needs the correct snapshot and LSN, which are both returned by the slot create operation. Since this patch does not create a replication slot for each table copy and instead reuses the one created in the beginning, we do not get a new snapshot and LSN for each table anymore. The snapshot gets lost right after the transaction is committed, but the patch continues to use the same LSN for the next tables without the proper snapshot. In the end, for example, the worker might first copy some rows, then apply changes from the replication slot and insert those rows again for the tables in later iterations.

I discussed some possible ways to resolve this with Amit offline:

1- Copy all tables in one transaction so that we wouldn't need to deal with snapshots. Not easy to keep track of the progress. If the transaction fails, we would need to start all over again.

2- Don't lose the first snapshot (by keeping a transaction open with the snapshot imported, or some other way) and use the same snapshot and LSN for all tables. I'm not sure about the side effects of keeping a transaction open that long or using a snapshot that might be too old after some time. Still seems like it might work.

3- For each table, get a new snapshot and LSN by using an existing replication slot. Even though this approach wouldn't create a new replication slot, preparing the slot for a snapshot and then taking the snapshot may be costly. Need some numbers here to see how much this approach would improve the performance.

I decided to go with approach 3 (creating a new snapshot with an existing replication slot) for now since it would require less change in the tablesync worker logic than the other options would. To achieve this, this patch introduces a new command for the Streaming Replication Protocol. The new REPLICATION_SLOT_SNAPSHOT command basically mimics how CREATE_REPLICATION_SLOT creates a snapshot, but without actually creating a new replication slot. The tablesync worker calls this command if it decides not to create a new replication slot, and uses the snapshot created and LSN returned by the command.

Also: after the changes discussed here [1], concurrent replication origin drops by the apply worker and tablesync workers may hold each other on wait due to locks taken by replorigin_drop_by_name. I see that this harms the performance of logical replication quite a bit in terms of speed.

Even though reusing replication origins was discussed in this thread before, the patch didn't include any change to do so. The updated version of the patch now also reuses replication origins. It seems that even the changes to reuse origins by themselves improve the performance significantly.

Attached two patches:
0001: adds the REPLICATION_SLOT_SNAPSHOT command for the replication protocol.
0002: reuses workers/replication slots and origins for tablesync.

I would appreciate any feedback/review/thoughts on the approach and both patches. I will also share some numbers comparing the performance of the patch and the master branch soon.

[1] https://www.postgresql.org/message-id/flat/20220714115155.GA5439%40depesz.com

Best,
--
Melih Mutlu
Microsoft

0001-Add-replication-protocol-cmd-to-create-a-snapshot.patch
Description: Binary data

v4-0002-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data
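To make the flow concrete, here is a sketch of how a tablesync worker's walsender session might use the new command. The exact REPLICATION_SLOT_SNAPSHOT grammar and result columns are defined by the 0001 patch; the invocation below is an assumed shape, shown only to illustrate the flow:

```
-- First table: create the slot once; this also builds a snapshot and
-- returns a consistent point LSN for the catch-up phase.
CREATE_REPLICATION_SLOT tablesync_slot_1 LOGICAL pgoutput (SNAPSHOT 'use');

-- Later tables: reuse the existing slot, asking only for a fresh
-- snapshot and LSN (hypothetical syntax; see the 0001 patch).
REPLICATION_SLOT_SNAPSHOT tablesync_slot_1;
```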
RE: suppressing useless wakeups in logical/worker.c
Dear Nathan,

Thank you for making the patch! I tested your patch, and it basically worked well.

About the following part:

```
		ConfigReloadPending = false;
		ProcessConfigFile(PGC_SIGHUP);

+		now = GetCurrentTimestamp();
+		for (int i = 0; i < NUM_LRW_WAKEUPS; i++)
+			LogRepWorkerComputeNextWakeup(i, now);
+
+		/*
+		 * If a wakeup time for starting sync workers was set, just set it
+		 * to right now. It will be recalculated as needed.
+		 */
+		if (next_sync_start != PG_INT64_MAX)
+			next_sync_start = now;
	}
```

Do we have to recalculate the NextWakeup when the subscriber receives a SIGHUP signal? I think this may cause an unexpected change like the following.

Assume that wal_receiver_timeout is 60s, wal_sender_timeout on the publisher is 0s (or the network between nodes is disconnected), and we send a SIGHUP signal to the subscriber's postmaster every 20s.

Currently the last_recv_time is calculated when the worker accepts messages, and the value is used for deciding to send a ping. The worker will exit if the walsender does not reply. But with your patch, the apply worker recalculates wakeup[LRW_WAKEUP_PING] and wakeup[LRW_WAKEUP_TERMINATE] when it gets SIGHUP, so the worker never sends a ping with requestReply = true, and never exits due to the timeout.

My case may seem crazy, but there may be other issues if this remains.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: HOT chain validation in verify_heapam()
On Fri, Dec 2, 2022 at 5:43 AM Andres Freund wrote:
>
> > +	/* Loop over offsets and validate the data in the predecessor array. */
> > +	for (OffsetNumber currentoffnum = FirstOffsetNumber; currentoffnum <= maxoff;
> > +		 currentoffnum = OffsetNumberNext(currentoffnum))
> > +	{
> > +		HeapTupleHeader pred_htup;
> > +		HeapTupleHeader curr_htup;
> > +		TransactionId pred_xmin;
> > +		TransactionId curr_xmin;
> > +		ItemId		pred_lp;
> > +		ItemId		curr_lp;
> > +		bool		pred_in_progress;
> > +		XidCommitStatus xid_commit_status;
> > +		XidBoundsViolation xid_status;
> > +
> > +		ctx.offnum = predecessor[currentoffnum];
> > +		ctx.attnum = -1;
> > +		curr_lp = PageGetItemId(ctx.page, currentoffnum);
> > +		if (!lp_valid[currentoffnum] || ItemIdIsRedirected(curr_lp))
> > +			continue;
>
> I don't think we should do PageGetItemId(ctx.page, currentoffnum); if
> !lp_valid[currentoffnum].
>

Fixed.

> > +		if (ctx.offnum == 0)
>
> For one, I think it'd be better to use InvalidOffsetNumber here. But more
> generally, storing the predecessor in ctx.offnum seems quite confusing.
>

Changed all relevant places to use InvalidOffsetNumber.

> > +		{
> > +			/*
> > +			 * No harm in overriding value of ctx.offnum as we will always
> > +			 * continue if we are here.
> > +			 */
> > +			ctx.offnum = currentoffnum;
> > +			if (TransactionIdIsInProgress(curr_xmin) || TransactionIdDidCommit(curr_xmin))
>
> Is it actually ok to call TransactionIdDidCommit() here? There's a reason
> get_xid_status() exists after all. And we do have the xid status for xmin
> already, so this could just check xid_commit_status, no?
>

I think it will be good to pass NULL to get_xid_status, like "get_xid_status(curr_xmin, &ctx, NULL);", so that we only check the xid status at the time when it is actually required. This way we can avoid checking the xid status in cases when we simply 'continue' due to some check.
--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com

From 02f0d19bd82f390df8e3d732d60cf3eb0ae1dc97 Mon Sep 17 00:00:00 2001
From: Himanshu Upadhyaya
Date: Mon, 5 Dec 2022 18:28:33 +0530
Subject: [PATCH v8] Implement HOT chain validation in verify_heapam()

Himanshu Upadhyaya, reviewed by Robert Haas, Aleksander Alekseev, Andres Freund
---
 contrib/amcheck/verify_heapam.c           | 278 ++++++++++++++++++++++++++++++
 src/bin/pg_amcheck/t/004_verify_heapam.pl | 193 ++++++++++++++++++++-
 2 files changed, 460 insertions(+), 11 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index b72a5c96d1..ebad73d64b 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -399,6 +399,9 @@ verify_heapam(PG_FUNCTION_ARGS)
 	for (ctx.blkno = first_block; ctx.blkno <= last_block; ctx.blkno++)
 	{
 		OffsetNumber maxoff;
+		OffsetNumber predecessor[MaxOffsetNumber] = {InvalidOffsetNumber};
+		OffsetNumber successor[MaxOffsetNumber] = {InvalidOffsetNumber};
+		bool		lp_valid[MaxOffsetNumber] = {false};
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -433,6 +436,8 @@ verify_heapam(PG_FUNCTION_ARGS)
 		for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
 			 ctx.offnum = OffsetNumberNext(ctx.offnum))
 		{
+			OffsetNumber nextoffnum;
+
 			ctx.itemid = PageGetItemId(ctx.page, ctx.offnum);
 
 			/* Skip over unused/dead line pointers */
@@ -469,6 +474,13 @@ verify_heapam(PG_FUNCTION_ARGS)
 					report_corruption(&ctx,
 									  psprintf("line pointer redirection to unused item at offset %u",
											   (unsigned) rdoffnum));
+
+				/*
+				 * make entry in successor array, redirected tuple will be
+				 * validated at the time when we loop over successor array
+				 */
+				successor[ctx.offnum] = rdoffnum;
+				lp_valid[ctx.offnum] = true;
 				continue;
 			}
 
@@ -504,9 +516,268 @@ verify_heapam(PG_FUNCTION_ARGS)
 			/* It should be safe to examine the tuple's header, at least */
 			ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, ctx.itemid);
 			ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr);
+			lp_valid[ctx.offnum] = true;
 
 			/* Ok, ready to check this next tuple */
 			check_tuple(&ctx);
+
+			/*
+			 * Add the data to the successor array if next updated tuple is in
+			 * the same page. It will be used later to generate the
+			 * predecessor array.
+			 *
+			 * We need to access the tuple's header to populate the
+			 * predecessor array. However the tuple is not necessarily sanity
+			 * checked yet so
Re: daitch_mokotoff module
Hi Ian,

Ian Lawrence Barwick writes:

> Hi Dag
>
> On Thu, 3 Feb 2022 at 23:27, Dag Lem wrote:
> >
> > Hi,
> >
> > Just some minor adjustments to the patch:
> >
> > * Removed call to locale-dependent toupper()
> > * Cleaned up input normalization
>
> This patch was marked as "Waiting on Author" in the CommitFest entry [1]
> but I see you provided an updated version which hasn't received any feedback,
> so I've moved this to the next CommitFest [2] and set it to "Needs Review".
>
> [1] https://commitfest.postgresql.org/40/3451/
> [2] https://commitfest.postgresql.org/41/3451/
>
> > I have been asked to sign up to review a commitfest patch or patches -
> > unfortunately I've been ill with COVID-19 and it's not until now that
> > I feel well enough to have a look.
> >
> > Julien: I'll have a look at https://commitfest.postgresql.org/36/3468/
> > as you suggested (https://commitfest.postgresql.org/36/3379/ seems to
> > have been reviewed now).
> >
> > If there are other suggestions for a patch or patches to review for
> > someone new to PostgreSQL internals, I'd be grateful for that.
>
> I see you provided some feedback on https://commitfest.postgresql.org/36/3468/,
> though the patch seems to have not been accepted (but not conclusively rejected
> either). If you still have the chance to review another patch (or more) it would
> be much appreciated, as there's quite a few piling up. Things like documentation
> or small improvements to client applications are always a good place to start.
> Reviews can be provided at any time, there's no need to wait for the next
> CommitFest.

OK, I'll try to find another patch to review.

Regards

Dag Lem
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Mon, Dec 5, 2022 at 6:30 PM Melih Mutlu wrote:
>
> Attached two patches:
> 0001: adds the REPLICATION_SLOT_SNAPSHOT command for the replication protocol.
> 0002: reuses workers/replication slots and origins for tablesync.
>
> I would appreciate any feedback/review/thoughts on the approach and both
> patches. I will also share some numbers comparing the performance of the
> patch and the master branch soon.
>

It would be interesting to see the numbers separately for the reuse of replication slots and origins. This will let us know how much each of those optimizations helps beyond the reuse of workers.

--
With Regards,
Amit Kapila.
Re: Allow placeholders in ALTER ROLE w/o superuser
On Mon, Dec 5, 2022 at 2:27 PM Pavel Borisov wrote:
> After posting the patch I've found my own typo in docs. So corrected
> it in v5 (PFA).

The new revision of the patch is attached.

I've removed the mention of the "(s)" suffix from the "Server Configuration" docs section. I think it might be confusing since this suffix isn't a part of the variable name; it is only used for storage. Instead, I've added the description of this suffix to the catalog structure description and the psql documentation.

Also, I've added psql tab completion for the USER SET flag, and made some enhancements to comments, tests, and the commit message.

--
Regards,
Alexander Korotkov

0001-Add-USER-SET-parameter-values-for-pg_db_role_sett-v6.patch
Description: Binary data
Re: move some bitmapset.c macros to bitmapset.h
Alvaro Herrera writes:
> On 2022-Dec-05, John Naylor wrote:
> > -#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
> > -#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)

> In this location, nobody can complain about the naming of these macros,
> since they're just used to implement other bitmapset.c code. However,
> if you move them to the .h file, ISTM you should give them more
> meaningful names.

IMV these are absolutely private to bitmapset.c. I reject the idea that they should be exposed publicly, under these names or any others.

Maybe we need some more bitmapset primitive functions? What is it you actually want to accomplish in the end?

			regards, tom lane
Re: Missing MaterialPath support in reparameterize_path_by_child
Ashutosh Bapat writes:
> partition-wise join should be willing to fallback to non-partitionwise
> join in such a case. After spending a few minutes with the code, I
> think generate_partitionwise_join_paths() should not call
> set_cheapest() is the pathlist of the child is NULL and should just
> wind up and avoid adding any path.

We clearly need to not call set_cheapest(), but that's not sufficient; we still fail at higher levels, as you'll see if you try the example Richard found. I ended up making fe12f2f8f to fix this.

I don't especially like "rel->nparts = 0" as a way of disabling partitionwise join; ISTM it'd be clearer and more flexible to reset consider_partitionwise_join instead of destroying the data structure. But that's the way it's being done elsewhere, and I didn't want to tamper with it in a bug fix. I see various assertions about parent and child consider_partitionwise_join flags being equal, which we might have to revisit if we try to make it work that way.

			regards, tom lane
Re: Allow placeholders in ALTER ROLE w/o superuser
Hi, Alexander! On Mon, 5 Dec 2022 at 17:51, Alexander Korotkov wrote: > > On Mon, Dec 5, 2022 at 2:27 PM Pavel Borisov wrote: > > After posting the patch I've found my own typo in docs. So corrected > > it in v5 (PFA). > > The new revision of the patch is attached. > > I've removed the mention of "(s)" suffix from the "Server > Configuration" docs section. I think it might be confusing since this > suffix isn't a part of the variable name. It is only used for storage. > Instead, I've added the description of this suffix to the catalog > structure description and psql documentation. > > Also, I've added psql tab completion for the USER SET flag, and made > some enhancements to comments, tests, and commit message. The changes in expected test results were somehow lost in v6, so I've corrected them in v7. Otherwise, I've looked through the updated patch and it is good. Regards, Pavel. v7-0001-Add-USER-SET-parameter-values-for-pg_db_role_sett.patch Description: Binary data
ANY_VALUE aggregate
The SQL:2023 Standard defines a new aggregate named ANY_VALUE. It returns an implementation-dependent (i.e. non-deterministic) value from the rows in its group. PFA an implementation of this aggregate. Ideally, the transition function would stop being called after the first non-null was found, and then the entire aggregation would stop when all functions say they are finished[*], but this patch does not go anywhere near that far. This patch is based off of commit fb958b5da8. [*] I can imagine something like array_agg(c ORDER BY x LIMIT 5) to get the top five of something without going through a LATERAL subquery. -- Vik FearingFrom 7465fac12fc636ff26088ae31de2937f7c3a459f Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Sat, 9 Apr 2022 00:07:38 +0200 Subject: [PATCH] Implement ANY_VALUE aggregate SQL:2023 defines an ANY_VALUE aggregate whose purpose is to emit an implementation-dependent (i.e. non-deterministic) value from the aggregated rows. --- doc/src/sgml/func.sgml | 14 ++ src/backend/utils/adt/misc.c | 12 src/include/catalog/pg_aggregate.dat | 4 src/include/catalog/pg_proc.dat | 8 src/test/regress/expected/aggregates.out | 18 ++ src/test/regress/sql/aggregates.sql | 5 + 6 files changed, 61 insertions(+) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 2052d3c844..1823ee71d7 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -19706,16 +19706,30 @@ SELECT NULLIF(value, '(none)') ... Description Partial Mode + + + + any_value + +any_value ( "any" ) +same as input type + + +Chooses a non-deterministic value from the non-null input values. + + Yes + + array_agg array_agg ( anynonarray ) anyarray diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c index 9c13251231..94c92de06d 100644 --- a/src/backend/utils/adt/misc.c +++ b/src/backend/utils/adt/misc.c @@ -928,8 +928,20 @@ pg_get_replica_identity_index(PG_FUNCTION_ARGS) idxoid = RelationGetReplicaIndex(rel); table_close(rel, AccessShareLock); if (OidIsValid(idxoid)) PG_RETURN_OID(idxoid); else PG_RETURN_NULL(); } + +Datum +any_value_trans(PG_FUNCTION_ARGS) +{ + /* Return the first non-null argument */ + if (!PG_ARGISNULL(0)) + PG_RETURN_DATUM(PG_GETARG_DATUM(0)); + if (!PG_ARGISNULL(1)) + PG_RETURN_DATUM(PG_GETARG_DATUM(1)); + PG_RETURN_NULL(); +} + diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat index b9110a5298..37626d6f0c 100644 --- a/src/include/catalog/pg_aggregate.dat +++ b/src/include/catalog/pg_aggregate.dat @@ -620,9 +620,13 @@ aggtransfn => 'ordered_set_transition_multi', aggfinalfn => 'cume_dist_final', aggfinalextra => 't', aggfinalmodify => 'w', aggmfinalmodify => 'w', aggtranstype => 'internal' }, { aggfnoid => 'dense_rank(any)', aggkind => 'h', aggnumdirectargs => '1', aggtransfn => 'ordered_set_transition_multi', aggfinalfn => 'dense_rank_final', aggfinalextra => 't', aggfinalmodify => 'w', aggmfinalmodify => 'w', aggtranstype => 'internal' }, +# any_value +{ aggfnoid => 'any_value(anyelement)', aggtransfn => 'any_value_trans', + aggcombinefn => 'any_value_trans', aggtranstype => 'anyelement' }, + ] diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index f9301b2627..2ee4797559 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -11849,9 +11849,17 @@ proname => 'brin_minmax_multi_summary_recv', provolatile => 's', prorettype => 'pg_brin_minmax_multi_summary', proargtypes => 'internal', prosrc => 'brin_minmax_multi_summary_recv' }, { oid => '4641', descr => 'I/O', 
proname => 'brin_minmax_multi_summary_send', provolatile => 's', prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary', prosrc => 'brin_minmax_multi_summary_send' }, +{ oid => '8981', descr => 'arbitrary value from among input values', + proname => 'any_value', prokind => 'a', proisstrict => 'f', + prorettype => 'anyelement', proargtypes => 'anyelement', + prosrc => 'aggregate_dummy' }, +{ oid => '8982', descr => 'any_value transition function', + proname => 'any_value_trans', prorettype => 'anyelement', proargtypes => 'anyelement anyelement', + prosrc => 'any_value_trans' }, + ] diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index fc2bd40be2..fb87b9abf1 100644 --- a/src/test/regress/expected/aggregates.out +++ b/src/test/regress/expected/aggregates.out @@ -20,16 +20,28 @@ SELECT avg(four) AS avg_1 FROM onek; (1 row) SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100; avg_32 - 32.666
Re: pg_basebackup: add test about zstd compress option
On Fri, Dec 2, 2022 at 11:29 PM Ian Lawrence Barwick wrote: > Though on reflection maybe it's overkill and the existing tests > suffice. Anyway leaving the patch here in the interests of pushing > this forward in some direction. Do you think that there is a scenario where 008_untar.pl and 009_extract.pl pass but this test fails, alerting us to a problem that would otherwise have gone undetected? If so, what is that scenario? The only thing that I can think of would be if $decompress_program --test were failing, but actually trying to decompress succeeded. I would be inclined to dismiss that particular scenario as not important enough to be worth the additional CPU cycles. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_dump: Remove "blob" terminology
On Sat, Dec 3, 2022 at 11:07 AM Andrew Dunstan wrote: > > Well, what this would lose is the ability to selectively restore > > individual large objects using "pg_restore -L". I'm not sure who > > out there might be depending on that, but if we assume that's the > > empty set I fear we'll find out it's not. So a workaround switch > > seemed possibly worth the trouble. I don't have a position yet > > on which way ought to be the default. > > OK, fair point. I suspect making the batched mode the default would gain > more friends than enemies. A lot of people probably don't know that selective restore even exists but it is an AWESOME feature and I'd hate to see us break it, or even just degrade it. I wonder if we can't find a better solution than bunching TOC entries together. Perhaps we don't need every TOC entry in memory simultaneously, for instance, especially things like LOBs that don't have dependencies. -- Robert Haas EDB: http://www.enterprisedb.com
Re: slab allocator performance issues
On Fri, Sep 10, 2021 at 5:07 PM Tomas Vondra wrote: > Turns out it's pretty difficult to benchmark this, because the results > strongly depend on what the backend did before. What you report here seems to be mostly cold-cache effects, with which I don't think we need to be overly concerned. We don't want big regressions in the cold-cache case, but there is always going to be some overhead when a new backend starts up, because you've got to fault some pages into the heap/malloc arena/whatever before they can be efficiently accessed. What would be more concerning is if we found out that the performance depended heavily on the internal state of the allocator. For example, suppose you have two warmup routines W1 and W2, each of which touches the same amount of total memory, but with different details. Then you have a benchmark B. If you do W1-B and W2-B and the time for B varies dramatically between them, then you've maybe got an issue. For instance, it could indicate that the allocator has issues when the old and new allocations are very different sizes, or something like that. -- Robert Haas EDB: http://www.enterprisedb.com
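A minimal harness for the W1-B/W2-B experiment described above might look like this sketch. The warmup patterns, sizes, and the workload B are all invented stand-ins, and error checking is omitted:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Touch `total` bytes in chunks of `chunk` bytes; the memory is leaked
 * on purpose, since the point is to leave the allocator in a given
 * internal state before B runs. */
static void
warmup(size_t chunk, size_t total)
{
	for (size_t done = 0; done < total; done += chunk)
		memset(malloc(chunk), 0x7f, chunk);
}

/* The benchmark B: an arbitrary allocation-heavy loop. */
static double
run_benchmark(void)
{
	clock_t		start = clock();

	for (int i = 0; i < 1000000; i++)
		free(malloc(64));

	return (double) (clock() - start) / CLOCKS_PER_SEC;
}

int
main(int argc, char **argv)
{
	/* Run the binary twice, once per warmup, and compare B's time. */
	if (argc > 1 && strcmp(argv[1], "w2") == 0)
		warmup(64, 64 * 1024 * 1024);	/* W2: many small allocations */
	else
		warmup(8 * 1024 * 1024, 64 * 1024 * 1024);	/* W1: a few large ones */

	printf("B took %.3f seconds\n", run_benchmark());
	return 0;
}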
Re: [PoC] Reducing planning time when tables have many partitions
On Sun, 4 Dec 2022 at 00:35, David Rowley wrote: > > On Tue, 29 Nov 2022 at 21:59, Yuya Watari wrote: > > Thank you for testing the patch with an actual query. This speedup is > > very impressive. When I used an original query with 1024 partitions, > > its planning time was about 200ms. Given that each partition is also > > partitioned in your workload, I think the result of 1415ms is > > reasonable. > > I was looking again at the v9-0001 patch and I think we can do a > little better when building the Bitmapset of matching EMs. For > example, in the v9 patch, the code for get_ecmember_indexes_strict() > is doing: > > + if (!with_children) > + matching_ems = bms_copy(ec->ec_nonchild_indexes); > + else > + matching_ems = bms_copy(ec->ec_member_indexes); > + > + i = -1; > + while ((i = bms_next_member(relids, i)) >= 0) > + { > + RelOptInfo *rel = root->simple_rel_array[i]; > + > + matching_ems = bms_int_members(matching_ems, > rel->eclass_member_indexes); > + } > > It seems reasonable that if there are a large number of partitions > then ec_member_indexes will have a large number of Bitmapwords. When > we do bms_int_members() on that, we're going to probably end up with a > bunch of trailing zero words in the set. In the v10 patch, I've > changed this to become: > > +inti = bms_next_member(relids, -1); > + > +if (i >= 0) > +{ > +RelOptInfo *rel = root->simple_rel_array[i]; > + > +/* > + * bms_intersect to the first relation to try to keep the resulting > + * Bitmapset as small as possible. This saves having to make a > + * complete bms_copy() of one of them. One may contain significantly > + * more words than the other. > + */ > +if (!with_children) > +matching_ems = bms_intersect(rel->eclass_member_indexes, > + ec->ec_nonchild_indexes); > +else > +matching_ems = bms_intersect(rel->eclass_member_indexes, > + ec->ec_member_indexes); > + > +while ((i = bms_next_member(relids, i)) >= 0) > +{ > +rel = root->simple_rel_array[i]; > +matching_ems = bms_int_members(matching_ems, > + rel->eclass_member_indexes); > +} > +} > > so, effectively we first bms_intersect to the first member of relids > before masking out the bits for the remaining ones. This should mean > we'll have a Bitmapset with fewer words in many complex planning > problems. There's no longer the dilemma of having to decide if we > should start with RelOptInfo's eclass_member_indexes or the > EquivalenceClass's member indexes. When using bms_int_member, we > really want to start with the smallest of those so we get the smallest > resulting set. With bms_intersect(), it will always make a copy of > the smallest set. v10 does that instead of bms_copy()ing the > EquivalenceClass's member's Bitmapset. > > I also wondered how much we're losing to the fact that > bms_int_members() zeros the trailing words and does not trim the > Bitmapset down. > > The problem there is 2-fold; > 1) we have to zero the trailing words on the left input. That'll > pollute the CPU cache a bit as it may have to fetch a bunch of extra > cache lines, and; > 2) subsequent bms_int_members() done afterwards may have to mask out > additional words. If we can make the shortest input really short, then > subsequent bms_int_members() are going to be very fast. 
> > You might argue there that setting nwords to the shortest length may > cause us to have to repalloc the Bitmapset if we need to later add > more members again, but if you look at the repalloc() code, it's > effectively a no-op when the allocated size >= the requested size, so > repalloc() should be very fast in this case. So, worst case, there's > an additional "no-op" repalloc() (which should be very fast) followed > by maybe a bms_add_members() which has to zero the words instead of > bms_int_members(). I changed this in the v10-0002 patch. I'm not sure > if we should do this or not. > > I also changed v10-0001 so that we still store the EquivalenceClass's > members list. There were a few places where the code just wanted to > get the first member and having to look at the Bitmapset index and > fetch the first match from PlannerInfo seemed convoluted. If the > query is simple, it seems like it's not going to be very expensive to > add a few EquivalenceMembers to this list. When planning more complex > problems, there's probably enough other extra overhead that we're > unlikely to notice the extra lappend()s. This also allows v10-0003 to > work, see below. > > In v10-0003, I experimented with the iterator concept that I mentioned > earlier. Since v10-0001 is now storing the EquivalenceMember list in > EquivalenceClass again, it's now quite simple to have the iterator > decide if it should be scanning the inde
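For readers following along, the v10-0002 idea of shrinking nwords after an intersection can be sketched as below. The Bitmapset fields (nwords, words[]) are real; the helper itself is illustrative, not the patch:

/*
 * Drop trailing zero words so that later bms_int_members() calls have
 * fewer words to visit and to zero. A later bms_add_member() past the
 * trimmed end just repalloc()s, which is nearly a no-op when the
 * underlying allocation is already big enough.
 */
static void
bms_trim_trailing_zeroes(Bitmapset *a)
{
	int			last = a->nwords - 1;

	while (last > 0 && a->words[last] == 0)
		last--;
	a->nwords = last + 1;
}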
Re: Collation version tracking for macOS
On Sun, Dec 4, 2022 at 10:12 PM Thomas Munro wrote: > My tentative votes are: > > 1. I think we should seriously consider provider = ICU63. I still > think search-by-collversion is a little too magical, even though it > clearly can be made to work. Of the non-magical systems, I think > encoding the choice of library into the provider name would avoid the > need to add a second confusing "X_version" concept alongside our > existing "X_version" columns in catalogues and DDL syntax, while still > making it super clear what is going on. This would include adding DDL > commands so you can do ALTER DATABASE/COLLATION ... PROVIDER = ICU63 > to make warnings go away. +1. I wouldn't lose any sleep if we picked a different non-magical option, but I think this is probably my favorite of the explicit-library-version options (though it is close) and I like it better than search-by-collversion. (It's possible that I'm wrong to like it better, but I do.) > 2. I think we should ignore minor versions for now (other than > reporting them in the relevant introspection functions), but not make > any choices that would prevent us from changing our mind about that in > a later release. For example, having two levels of specificity ICU > and ICU68 in the libver-in-provider-name design wouldn't preclude us > from adding support for ICU68_2 later +1. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Error-safe user functions
On Sat, Dec 3, 2022 at 10:57 PM Corey Huinker wrote: > 2. ereturn_* => errfeedback / error_feedback / feedback Oh, I like that, especially errfeedback. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_dump: Remove "blob" terminology
Robert Haas writes: > I wonder if we can't find a better solution than bunching TOC entries > together. Perhaps we don't need every TOC entry in memory > simultaneously, for instance, especially things like LOBs that don't > have dependencies. Interesting idea. We'd have to either read the TOC multiple times, or shove the LOB TOC entries into a temporary file, either of which has downsides. regards, tom lane
Re: Order getopt arguments
On Mon, Dec 5, 2022 at 3:42 AM Fabien COELHO wrote: > > I had noticed that most getopt() or getopt_long() calls had their letter > > lists in pretty crazy orders. There might have been occasional attempts > > at grouping, but those then haven't been maintained as new options were > > added. To restore some sanity to this, I went through and ordered them > > alphabetically. > > I agree that a more or less random historical order does not make much > sense. > > For pgbench, ISTM that sorting per functionality then alphabetical would > be better than pure alphabetical because it has 2 modes. Such sections > might be (1) general (2) connection (3) common/shared (4) initialization > and (5) benchmarking, with some comments on each. I don't see the value in this. Grouping related options often makes sense, but it seems more confusing than helpful in the case of a getopt string. +1 for Peter's proposal to just alphabetize. That's easy to maintain, at least in theory. -- Robert Haas EDB: http://www.enterprisedb.com
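As a trivial illustration of the alphabetical convention (an invented option set, not any real PG tool):

#include <getopt.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
	int			c;

	/* Letters stay strictly alphabetical; each ':' stays attached to
	 * the option it modifies, so duplicates and gaps are easy to spot. */
	while ((c = getopt(argc, argv, "ac:dF:hp:qU:v")) != -1)
	{
		switch (c)
		{
			case 'c':
			case 'F':
			case 'p':
			case 'U':
				printf("option -%c = %s\n", c, optarg);
				break;
			case '?':
				return 1;
			default:
				printf("option -%c\n", c);
				break;
		}
	}
	return 0;
}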
Re: Error-safe user functions
Robert Haas writes: > On Sat, Dec 3, 2022 at 10:57 PM Corey Huinker wrote: >> 2. ereturn_* => errfeedback / error_feedback / feedback > Oh, I like that, especially errfeedback. efeedback? But TBH I do not think any of these are better than ereturn. Whether or not you agree with my position that it'd be best if the new macro name is the same length as "ereport", I hope we can all agree that it had better be short. ereport call nests already tend to contain quite long lines. We don't need to add another couple tab-stops worth of indentation there. As for it being the same length: if you take a close look at my 0002 patch, you will realize that replacing ereport with a different-length name will double or triple the number of lines that need to be touched in many input functions. Yeah, we could sweep that under the rug to some extent by submitting non-pgindent'd patches and running a separate pgindent commit later, but that is not without pain. I don't want to go there for the sake of a name that isn't really compellingly The Right Word, and "feedback" just isn't very compelling IMO. regards, tom lane
Re: Order getopt arguments
Robert Haas writes: > +1 for Peter's proposal to just alphabetize. That's easy to maintain, > at least in theory. Agreed for single-letter options. Long options complicate matters: are we going to order their code stanzas by the actual long name, or by the character/number returned by getopt? Or are we going to be willing to repeatedly renumber the assigned codes to keep those the same? I don't think I want to go that far. regards, tom lane
Re: pg_dump: Remove "blob" terminology
On Mon, Dec 5, 2022 at 10:56 AM Tom Lane wrote: > Interesting idea. We'd have to either read the TOC multiple times, > or shove the LOB TOC entries into a temporary file, either of which > has downsides. I wonder if we'd need to read the TOC entries repeatedly, or just incrementally. I haven't studied the pg_dump format much, but I wonder if we could arrange things so that the "boring" entries without dependencies, or maybe the LOB entries specifically, are grouped together in some way where we know the byte-length of that section and can just skip over it. Then when we need them we go back and read 'em one by one. (Of course this doesn't work if reading from standard input or if multiple phases of processing need them or whatever. Not trying to say I've solved the problem. But in general I think we need to give more thought to the possibility that "just read all the data into memory" is not an adequate solution to $PROBLEM.) -- Robert Haas EDB: http://www.enterprisedb.com
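A sketch of the seek-past idea: a length-prefixed section of "boring" entries that a reader can skip and revisit later. The format details and function names are invented, not pg_dump's actual archive format:

#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>

/*
 * Writer side: reserve space for a section length, emit the
 * dependency-free entries, then backpatch the real length. A reader
 * that doesn't need the entries yet can fread() the length and
 * fseeko() past the whole section, coming back later to read them
 * one by one.
 */
static void
write_skippable_section(FILE *fp, void (*emit_entries) (FILE *))
{
	uint64_t	len = 0;
	off_t		len_pos = ftello(fp);
	off_t		end_pos;

	fwrite(&len, sizeof(len), 1, fp);	/* placeholder */
	emit_entries(fp);
	end_pos = ftello(fp);

	len = (uint64_t) (end_pos - len_pos - (off_t) sizeof(len));
	fseeko(fp, len_pos, SEEK_SET);
	fwrite(&len, sizeof(len), 1, fp);	/* backpatch the real length */
	fseeko(fp, end_pos, SEEK_SET);
}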
Re: Error-safe user functions
On Mon, Dec 5, 2022 at 11:09 AM Tom Lane wrote: > Robert Haas writes: > > On Sat, Dec 3, 2022 at 10:57 PM Corey Huinker > > wrote: > >> 2. ereturn_* => errfeedback / error_feedback / feedback > > > Oh, I like that, especially errfeedback. > > efeedback? But TBH I do not think any of these are better than ereturn. I do. Having a macro name that is "return" plus one character is going to make people think that it returns. I predict that if you insist on using that name people are still going to be making mistakes based on that confusion 10 years from now. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Order getopt arguments
On Mon, Dec 5, 2022 at 11:14 AM Tom Lane wrote: > Robert Haas writes: > > +1 for Peter's proposal to just alphabetize. That's easy to maintain, > > at least in theory. > > Agreed for single-letter options. Long options complicate matters: > are we going to order their code stanzas by the actual long name, or > by the character/number returned by getopt? Or are we going to be > willing to repeatedly renumber the assigned codes to keep those the > same? I don't think I want to go that far. I was only talking about the actual argument to getopt(), not the order of the code stanzas. I'm not sure what we ought to do about the latter. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Error-safe user functions
On 2022-12-05 Mo 11:20, Robert Haas wrote: > On Mon, Dec 5, 2022 at 11:09 AM Tom Lane wrote: >> Robert Haas writes: >>> On Sat, Dec 3, 2022 at 10:57 PM Corey Huinker >>> wrote: 2. ereturn_* => errfeedback / error_feedback / feedback >>> Oh, I like that, especially errfeedback. >> efeedback? But TBH I do not think any of these are better than ereturn. > I do. Having a macro name that is "return" plus one character is going > to make people think that it returns. I predict that if you insist on > using that name people are still going to be making mistakes based on > that confusion 10 years from now. > OK, I take both this point and Tom's about trying to keep it the same length. So we need something that's 7 letters, doesn't say 'return' and preferably begins with 'e'. I modestly suggest 'eseterr', or if we like the 'feedback' idea 'efeedbk'. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Order getopt arguments
Robert Haas writes: > I was only talking about the actual argument to getopt(), not the > order of the code stanzas. I'm not sure what we ought to do about the > latter. 100% agreed that the getopt argument should just be alphabetical. But the bulk of Peter's patch is rearranging switch cases to agree with that, and if you want to do that then you have to also think about long options, which are not in the getopt argument. I'm not entirely convinced that reordering the switch cases is worth troubling over. regards, tom lane
Re: Error-safe user functions
On 12/5/22 11:36, Andrew Dunstan wrote: On 2022-12-05 Mo 11:20, Robert Haas wrote: On Mon, Dec 5, 2022 at 11:09 AM Tom Lane wrote: Robert Haas writes: On Sat, Dec 3, 2022 at 10:57 PM Corey Huinker wrote: 2. ereturn_* => errfeedback / error_feedback / feedback Oh, I like that, especially errfeedback. efeedback? But TBH I do not think any of these are better than ereturn. I do. Having a macro name that is "return" plus one character is going to make people think that it returns. I predict that if you insist on using that name people are still going to be making mistakes based on that confusion 10 years from now. OK, I take both this point and Tom's about trying to keep it the same length. So we need something that's 7 letters, doesn't say 'return' and preferably begins with 'e'. I modestly suggest 'eseterr', or if we like the 'feedback' idea 'efeedbk'. Maybe eretort? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Order getopt arguments
On Mon, Dec 5, 2022 at 11:51 AM Tom Lane wrote: > Robert Haas writes: > > I was only talking about the actual argument to getopt(), not the > > order of the code stanzas. I'm not sure what we ought to do about the > > latter. > > 100% agreed that the getopt argument should just be alphabetical. > But the bulk of Peter's patch is rearranging switch cases to agree > with that, and if you want to do that then you have to also think > about long options, which are not in the getopt argument. I'm > not entirely convinced that reordering the switch cases is worth > troubling over. I'm not particularly sold on that either, but neither am I particularly opposed to it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Error-safe user functions
Robert Haas writes: > On Mon, Dec 5, 2022 at 11:09 AM Tom Lane wrote: >> efeedback? But TBH I do not think any of these are better than ereturn. > I do. Having a macro name that is "return" plus one character is going > to make people think that it returns. But it does return, or at least you need to code on the assumption that it will. (The cases where it doesn't aren't much different from any situation where a called subroutine unexpectedly throws an error. Callers typically don't have to consider that.) regards, tom lane
Re: Allow placeholders in ALTER ROLE w/o superuser
I couldn't find any discussion of the idea of adding "(s)" to the variable name in order to mark the variable userset in the catalog, and I have to admit I find it a bit strange. Are we really agreed that that's the way to proceed? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Having your biases confirmed independently is how scientific progress is made, and hence made our great society what it is today" (Mary Gardiner)
Re: Error-safe user functions
Joe Conway writes: > On 12/5/22 11:36, Andrew Dunstan wrote: >> OK, I take both this point and Tom's about trying to keep it the same >> length. So we need something that's 7 letters, doesn't say 'return' and >> preferably begins with 'e'. I modestly suggest 'eseterr', or if we like >> the 'feedback' idea 'efeedbk'. > Maybe eretort? Nah, it's so close to ereport that it looks like a typo. eseterr isn't awful, perhaps. Or maybe err<something>, but I've not thought of suitable <something>. regards, tom lane
Re: Allow placeholders in ALTER ROLE w/o superuser
Alvaro Herrera writes: > I couldn't find any discussion of the idea of adding "(s)" to the > variable name in order to mark the variable userset in the catalog, and > I have to admit I find it a bit strange. Are we really agreed that > that's the way to proceed? I hadn't been paying close attention to this thread, sorry. I agree that that seems like a very regrettable choice, especially if you anticipate having to bump catversion anyway. Better to add a bool column to the catalog. regards, tom lane
Re: Error-safe user functions
On Mon, Dec 5, 2022 at 12:09 PM Tom Lane wrote: > But it does return, or at least you need to code on the assumption > that it will. (The cases where it doesn't aren't much different > from any situation where a called subroutine unexpectedly throws > an error. Callers typically don't have to consider that.) Are you just trolling me here? AIUI, the macro never returns in the sense of using the return statement, unlike PG_RETURN_WHATEVER(), which do. It possibly transfers control by throwing an error. But that is also true of just about everything you do in PostgreSQL code, because errors can get thrown from almost anywhere. So clearly the possibility of a non-local transfer of control is not the issue here. The issue is the possibility that there will be NO transfer of control. That is, you are compelled to write ereturn() and then afterwards you still need a return statement. I do not understand how it is possible to sensibly argue that someone won't see a macro called ereturn() and perhaps come to the false conclusion that it will always return. -- Robert Haas EDB: http://www.enterprisedb.com
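To spell out the hazard as Robert reads it, here is a sketch in the style of the patch under discussion. None of this is committed API, and the validity check and parser are invented:

/*
 * If ereturn() only records a soft error and falls through, the caller
 * still needs its own return statement afterwards, which is what makes
 * the name feel misleading.
 */
Datum
widget_in(PG_FUNCTION_ARGS)
{
	char	   *str = PG_GETARG_CSTRING(0);

	if (!widget_syntax_ok(str))		/* hypothetical validity check */
	{
		ereturn(fcinfo->context,
				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
				 errmsg("invalid widget: \"%s\"", str)));
		PG_RETURN_NULL();			/* still reached in the soft-error case */
	}

	PG_RETURN_POINTER(widget_parse(str));	/* hypothetical parser */
}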
Re: Error-safe user functions
I wrote: > Nah, it's so close to ereport that it looks like a typo. eseterr isn't > awful, perhaps. Or maybe err<something>, but I've not thought of suitable <something>. ... "errsave", maybe? regards, tom lane
Re: suppressing useless wakeups in logical/worker.c
On Mon, Dec 05, 2022 at 01:00:19PM +0000, Hayato Kuroda (Fujitsu) wrote: > But in your patch, the apply worker calculates wakeup[LRW_WAKEUP_PING] and > wakeup[LRW_WAKEUP_TERMINATE] again when it gets SIGHUP, so the worker never > sends > ping with requestReply = true, and never exits due to the timeout. This is the case for the walreceiver patch, too. If a SIGHUP arrives just before we are due to ping the server, the ping wakeup time will be pushed back. To me, this seems unlikely to cause any issues in practice unless the server is being constantly SIGHUP'd. If we wanted to fix this, we'd probably need to recompute the wakeup times using the values currently set. I haven't looked into this too closely, but it doesn't sound tremendously difficult. Thoughts? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
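Roughly what "recompute using the values currently set" could look like in the worker's SIGHUP handling. The wakeup[] array and enum names follow the patch under discussion, last_recv_time is an assumed variable, and only wal_receiver_timeout, ConfigReloadPending, and the two helper calls are real:

/*
 * Sketch: rebase the deadlines on the last successful receipt time
 * rather than "now", so a reload cannot keep pushing them into the
 * future.
 */
if (ConfigReloadPending)
{
	ConfigReloadPending = false;
	ProcessConfigFile(PGC_SIGHUP);

	wakeup[LRW_WAKEUP_PING] =
		TimestampTzPlusMilliseconds(last_recv_time, wal_receiver_timeout / 2);
	wakeup[LRW_WAKEUP_TERMINATE] =
		TimestampTzPlusMilliseconds(last_recv_time, wal_receiver_timeout);
}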
Re: Error-safe user functions
On Mon, Dec 5, 2022 at 12:27 PM Tom Lane wrote: > I wrote: > > Nah, it's so close to ereport that it looks like a typo. eseterr isn't > > awful, perhaps. Or maybe err<something>, but I've not thought of suitable <something>. > > ... "errsave", maybe? eseterr or errsave seem totally fine to me, FWIW. I would probably choose a more verbose name if I were doing it, but I do get the point that keeping line lengths reasonable is important, and if someone were to accuse me of excessive prolixity, I would be unable to mount much of a defense. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Error-safe user functions
On 12/5/22 12:35, Robert Haas wrote: On Mon, Dec 5, 2022 at 12:27 PM Tom Lane wrote: I wrote: > Nah, it's so close to ereport that it looks like a typo. eseterr isn't > awful, perhaps. Or maybe err<something>, but I've not thought of suitable <something>. ... "errsave", maybe? eseterr or errsave seem totally fine to me, FWIW. +1 I would probably choose a more verbose name if I were doing it, but I do get the point that keeping line lengths reasonable is important, and if someone were to accuse me of excessive prolixity, I would be unable to mount much of a defense. prolixity -- nice word! I won't comment on its applicability to you in particular ;-P -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Request to modify view_table_usage to include materialized views
Hello, I think this is the correct mailing list for feature/modification requests. If not please let me know which mailing list I should use.

Would it be possible to modify the information_schema.view_table_usage (VTU) view to include materialized views? (https://www.postgresql.org/docs/current/infoschema-view-table-usage.html)

Currently when querying VTU, if the view you're interested in queries a materialized view, then it doesn't show up in VTU. For example, I was trying to determine which tables/views made up a particular view:

-- View is present in pg_views
drps=> select schemaname, viewname, viewowner
drps->   from pg_views
drps->  where viewname = 'platform_version_v';
 schemaname |      viewname      | viewowner
------------+--------------------+-----------
 event      | platform_version_v | drps

-- Check view_table_usage for objects that are queried by the
-- platform_version_v view, but it doesn't find any:
drps=> select *
drps->   from information_schema.view_table_usage
drps->  where view_name = 'platform_version_v';
 view_catalog | view_schema | view_name | table_catalog | table_schema | table_name
--------------+-------------+-----------+---------------+--------------+------------
(0 rows)

I looked at the pg_views.definition column for platform_version_v, and it is querying a materialized view. The source code for the information_schema.view_table_usage view is at https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/catalog/information_schema.sql;h=18725a02d1fb6ffda3d218033b972a0ff23aac3b;hb=HEAD#l2605

If I change lines 2605 and 2616 to:

2605: AND v.relkind in ('v','m')
2616: AND t.relkind IN ('r', 'v', 'f', 'p','m')

and compile the modified version of VTU in my test schema, then I see the MV that is used in the query of the platform_version_v view:

drps=> select *
drps->   from test.view_table_usage
drps->  where view_name = 'platform_version_v';
 view_catalog | view_schema |     view_name      | table_catalog | table_schema |      table_name
--------------+-------------+--------------------+---------------+--------------+----------------------
 drps         | event       | platform_version_v | drps          | event        | platform_version_mv

My method of changing those 2 lines of code may not be the best or correct solution, it's just to illustrate what I'm looking for. Thanks! Jon
Re: Failed Assert while pgstat_unlink_relation
Hi, On 2022-12-05 15:20:55 +0900, Kyotaro Horiguchi wrote: > The in-xact created relation t1 happened to be scanned during the > CREATE RULE and a stats entry is attached. So the stats entry loses t1 > at roll-back, then crashes. Thus, if I understand it correctly, it > seems to me that just unlinking the stats from t1 (when relkind is > changed) works. > > But the fix doesn't change the behavior in relkind-not-changing > cases. If an in-xact-created table gets a stats entry, the > relcache entry for t1 is refreshed to a table relation again, and then the > transaction rolls back, a crash will happen for the same reason. I'm not > sure if there is such a case actually. We unlink the stats in that case already. See RelationDestroyRelation(). > When I tried to check that behavior further, I found that > CREATE ROLE is no longer allowed... I assume you mean RULE, not ROLE? It should still work in 15. Greetings, Andres Freund
Re: Collation version tracking for macOS
On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote: > 1. I think we should seriously consider provider = ICU63. I still > think search-by-collversion is a little too magical, even though it > clearly can be made to work. Of the non-magical systems, I think > encoding the choice of library into the provider name would avoid the > need to add a second confusing "X_version" concept alongside our > existing "X_version" columns in catalogues and DDL syntax, while > still > making it super clear what is going on. As I understand it, this is #2 in your previous list? Can we put the naming of the provider into the hands of the user, e.g.: CREATE COLLATION PROVIDER icu63 TYPE icu AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63'; In this model, icu would be a "provider kind" and icu63 would be the specific provider, which is named by the user. That seems like the least magical approach, to me. We need an ICU library; the administrator gives us one that looks like ICU; and we're happy. It avoids a lot of the annoyances we're discussing, and puts the power in the hands of the admin. If they want to allow minor version updates, they specify the library with .so.63, and let the symlinking handle it. Of course, we can still do some sanity checks (WARNINGs or ERRORs) when we think something is going wrong; like the version of ICU is too new, or the reported version (ucol_getVersion()) doesn't match what's in collversion. But we basically get out of the business of understanding ICU versioning and leave that up to the administrator. It's easier to document, and would require fewer GUCs (if any). And it avoids mixing version information from another project into our data model. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: Error-safe user functions
On 2022-Dec-05, Tom Lane wrote: > I wrote: > > Nah, it's so close to ereport that it looks like a typo. eseterr isn't > > awful, perhaps. Or maybe err<something>, but I've not thought of suitable <something>. > > ... "errsave", maybe? IMO eseterr is quite awful while errsave is not, so here goes my vote for the latter. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ Tom: There seems to be something broken here. Teodor: I'm in sackcloth and ashes... Fixed. http://archives.postgresql.org/message-id/482d1632.8010...@sigaev.ru
Re: MemoizePath fails to work for partitionwise join
Richard Guo writes: > The fix is straightforward, just to use outerrel->top_parent if it is > not null and leave the reparameterization work to try_nestloop_path. In > addition, I believe when reparameterizing MemoizePath we need to adjust > its param_exprs too. Right you are. I'd noticed the apparent omission in reparameterize_path_by_child, but figured that we'd need a test case to find any other holes before it'd be worth fixing. Thanks for finding a usable test case. One small problem is that top_parent doesn't exist in the back branches, so I had to substitute a much uglier lookup in order to make this work there. I'm surprised that we got away without top_parent for this long TBH, but anyway this fix validates the wisdom of 2f17b5701. So, pushed with some cosmetic adjustments and the modified back-branch code. regards, tom lane
Re: Error-safe user functions
Robert Haas writes: > AIUI, the macro never returns in the sense of using the return > statement, unlike PG_RETURN_WHATEVER(), which do. Oh! Now I see what you don't like about it. I thought you meant "return to the call site", not "return to the call site's caller". Agreed that that could be confusing. regards, tom lane
Re: Collation version tracking for macOS
On 12/5/22 12:41, Jeff Davis wrote: On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote: 1. I think we should seriously consider provider = ICU63. I still think search-by-collversion is a little too magical, even though it clearly can be made to work. Of the non-magical systems, I think encoding the choice of library into the provider name would avoid the need to add a second confusing "X_version" concept alongside our existing "X_version" columns in catalogues and DDL syntax, while still making it super clear what is going on. As I understand it, this is #2 in your previous list? Can we put the naming of the provider into the hands of the user, e.g.: CREATE COLLATION PROVIDER icu63 TYPE icu AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63'; In this model, icu would be a "provider kind" and icu63 would be the specific provider, which is named by the user. That seems like the least magical approach, to me. We need an ICU library; the administrator gives us one that looks like ICU; and we're happy. +1 I like this. The provider kind defines which path we take in our code, and the specific library unambiguously defines a specific collation behavior (I think, ignoring bugs?) -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Error-safe user functions
On Mon, Dec 5, 2022 at 12:44 PM Tom Lane wrote: > Robert Haas writes: > > AIUI, the macro never returns in the sense of using the return > > statement, unlike PG_RETURN_WHATEVER(), which do. > > Oh! Now I see what you don't like about it. I thought you > meant "return to the call site", not "return to the call site's > caller". Agreed that that could be confusing. OK, good. I couldn't figure out what in the world we were arguing about... apparently I wasn't being as clear as I thought I was. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Error-safe user functions
On 2022-12-05 Mo 12:42, Alvaro Herrera wrote: > On 2022-Dec-05, Tom Lane wrote: > >> I wrote: >>> Nah, it's so close to ereport that it looks like a typo. eseterr isn't >>> awful, perhaps. Or maybe err, but I've not thought of suitable . >> ... "errsave", maybe? > IMO eseterr is quite awful while errsave is not, so here goes my vote > for the latter. Wait a minute! Oh, no, sorry, as you were, 'errsave' is fine. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Request to modify view_table_usage to include materialized views
Jonathan Lemig writes: > Would it be possible to modify the information_schema.view_table_usage > (VTU) to include materialized views? Is it physically possible? Sure, it'd just take adjustment of some relkind checks. However, it's against project policy. We consider that because the information_schema views are defined by the SQL standard, they should only show standardized properties of standardized objects. If the standard ever gains materialized views, we'd adjust those views to show them. In the meantime, they aren't there. It would make little sense in any case to adjust only this one view. But if we were to revisit that policy, there are a lot of corner cases that would have to be thought through --- things that almost fit into the views, or that might appear in a very misleading way, etc. regards, tom lane
Re: ANY_VALUE aggregate
On Mon, Dec 5, 2022 at 7:57 AM Vik Fearing wrote: > The SQL:2023 Standard defines a new aggregate named ANY_VALUE. It > returns an implementation-dependent (i.e. non-deterministic) value from > the rows in its group. > > PFA an implementation of this aggregate. > > Can we please add "first_value" and "last_value" if we are going to add "some_random_value" to our library of aggregates? Also, maybe we should have any_value do something like compute a 50/50 chance that any new value seen replaces the existing chosen value, instead of simply returning the first value all the time. Maybe even prohibit the first value from being chosen so long as a second value appears. David J.
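For reference, the textbook way to get a uniformly random pick from a stream is reservoir sampling: replace the kept value with probability 1/n for the n-th non-null input, not 50/50. A sketch of such a transition function follows; the state struct and function are invented, the allocation should really happen in the aggregate memory context, and datumCopy() for pass-by-reference types is omitted:

#include "postgres.h"
#include "fmgr.h"
#include "common/pg_prng.h"

typedef struct AnyValueState	/* hypothetical state type */
{
	Datum		value;
	bool		isnull;
	int64		nseen;			/* non-null inputs seen so far */
} AnyValueState;

Datum
any_value_reservoir_trans(PG_FUNCTION_ARGS)
{
	AnyValueState *state;

	if (PG_ARGISNULL(0))
	{
		/* first call: allocate state (agg-context handling omitted) */
		state = (AnyValueState *) palloc0(sizeof(AnyValueState));
		state->isnull = true;
	}
	else
		state = (AnyValueState *) PG_GETARG_POINTER(0);

	if (!PG_ARGISNULL(1))
	{
		state->nseen++;
		/* keep the n-th input with probability 1/n => uniform choice */
		if (pg_prng_uint64_range(&pg_global_prng_state, 1, state->nseen) == 1)
		{
			state->value = PG_GETARG_DATUM(1);	/* datumCopy() omitted */
			state->isnull = false;
		}
	}
	PG_RETURN_POINTER(state);
}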
Re: Error-safe user functions
Andrew Dunstan writes: > Wait a minute! Oh, no, sorry, as you were, 'errsave' is fine. Seems like everybody's okay with errsave. I'll make a v2 in a little bit. I'd like to try updating array_in and/or record_in just to verify that indirection cases work okay, before we consider the design to be set. regards, tom lane
Re: ANY_VALUE aggregate
"David G. Johnston" writes: > Can we please add "first_value" and "last_value" if we are going to add > "some_random_value" to our library of aggregates? First and last according to what ordering? We have those in the window-aggregate case, and I don't think we want to encourage people to believe that "first" and "last" are meaningful otherwise. ANY_VALUE at least makes it clear that you're getting an unspecified one of the inputs. regards, tom lane
Re: ANY_VALUE aggregate
On Mon, Dec 5, 2022 at 1:04 PM Tom Lane wrote: > "David G. Johnston" writes: > > Can we please add "first_value" and "last_value" if we are going to add > > "some_random_value" to our library of aggregates? > > First and last according to what ordering? We have those in the > window-aggregate case, and I don't think we want to encourage people > to believe that "first" and "last" are meaningful otherwise. > > ANY_VALUE at least makes it clear that you're getting an unspecified > one of the inputs. I have personally implemented first_value() and last_value() in the past in cases where I had guaranteed the ordering myself, or didn't care what ordering was used. I think they're perfectly sensible. But if we don't add them to core, at least they're easy to add in user-space. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Using WaitEventSet in the postmaster
Hi, On 2022-12-05 22:45:57 +1300, Thomas Munro wrote: > On Sat, Dec 3, 2022 at 10:41 AM Thomas Munro wrote: > > Here's an iteration like that. Still WIP grade. It passes, but there > > must be something I don't understand about this computer program yet, > > because if I move the "if (pending_..." section up into the block > > where WL_LATCH_SET has arrived (instead of testing those variables > > every time through the loop), a couple of tests leave zombie > > (unreaped) processes behind, indicating that something funky happened > > to the state machine that I haven't yet grokked. Will look more next > > week. > > Duh. The reason for that was the pre-existing special case for > PM_WAIT_DEAD_END, which used a sleep(100ms) loop to wait for children > to exit, which I needed to change to a latch wait. Fixed in the next > iteration, attached. > > The reason for the existing sleep-based approach was that we didn't > want to accept any more connections (or spin furiously because the > listen queue was non-empty). So in this version I invented a way to > suppress socket events temporarily with WL_SOCKET_IGNORE, and then > reactivate them after crash reinit. That seems like an odd special flag. Why do we need it? Is it just because we want to have assertions ensuring that something is being queried? > * WL_SOCKET_ACCEPT is a new event for an incoming connection (on Unix, >this is just another name for WL_SOCKET_READABLE, but Windows has a >different underlying event; this mirrors WL_SOCKET_CONNECTED on the >other end of a connection) Perhaps worth committing separately and soon? Seems pretty uncontroversial from here. > +/* > + * Object representing the state of a postmaster. > + * > + * XXX Lots of global variables could move in here. > + */ > +typedef struct > +{ > + WaitEventSet*wes; > +} Postmaster; > + Seems weird to introduce this but then basically have it be unused. I'd say either have a preceding patch move at least a few members into it, or just omit it for now. > + /* This may configure SIGURG, depending on platform. */ > + InitializeLatchSupport(); > + InitLocalLatch(); I'm mildly preferring InitProcessLocalLatch(), but not sure why - there's not really a conflicting meaning of "local" here. > +/* > + * Initialize the WaitEventSet we'll use in our main event loop. > + */ > +static void > +InitializeWaitSet(Postmaster *postmaster) > +{ > + /* Set up a WaitEventSet for our latch and listening sockets. */ > + postmaster->wes = CreateWaitEventSet(CurrentMemoryContext, 1 + > MAXLISTEN); > + AddWaitEventToSet(postmaster->wes, WL_LATCH_SET, PGINVALID_SOCKET, > MyLatch, NULL); > + for (int i = 0; i < MAXLISTEN; i++) > + { > + int fd = ListenSocket[i]; > + > + if (fd == PGINVALID_SOCKET) > + break; > + AddWaitEventToSet(postmaster->wes, WL_SOCKET_ACCEPT, fd, NULL, > NULL); > + } > +} The naming seems like it could be confused with latch.h infrastructure. InitPostmasterWaitSet()? > +/* > + * Activate or deactivate the server socket events. > + */ > +static void > +AdjustServerSocketEvents(Postmaster *postmaster, bool active) > +{ > + for (int pos = 1; pos < GetNumRegisteredWaitEvents(postmaster->wes); > ++pos) > + ModifyWaitEvent(postmaster->wes, > + pos, active ? WL_SOCKET_ACCEPT > : WL_SOCKET_IGNORE, > + NULL); > +} This seems to hardcode the specific wait events we're waiting for based on latch.c infrastructure. Not really convinced that's a good idea. > + /* > + * Latch set by signal handler, or new connection pending on > any of our > + * sockets? 
If the latter, fork a child process to deal with it. > + */ > + for (int i = 0; i < nevents; i++) > { > - if (errno != EINTR && errno != EWOULDBLOCK) > + if (events[i].events & WL_LATCH_SET) > { > - ereport(LOG, > - (errcode_for_socket_access(), > - errmsg("select() failed in > postmaster: %m"))); > - return STATUS_ERROR; > + ResetLatch(MyLatch); > + > + /* Process work scheduled by signal handlers. */ > + if (pending_action_request) > + process_action_request(postmaster); > + if (pending_child_exit) > + process_child_exit(postmaster); > + if (pending_reload_request) > + process_reload_request(); > + if (pending_shutdown_request) > +
Re: [PATCH] Add native windows on arm64 support
Hi, On 2022-12-05 14:12:41 +0900, Michael Paquier wrote: > With meson gaining in maturity, perhaps that's not the most urgent > thing as we will likely remove src/tools/msvc/ soon but I'd rather do > that right anyway as much as I can to avoid an incorrect state in the > tree at any time in its history. I'd actually argue that we should just not add arm64 support to src/tools/msvc/. > --- a/src/include/storage/s_lock.h > +++ b/src/include/storage/s_lock.h > @@ -708,13 +708,21 @@ typedef LONG slock_t; > #define SPIN_DELAY() spin_delay() > > /* If using Visual C++ on Win64, inline assembly is unavailable. > - * Use a _mm_pause intrinsic instead of rep nop. > + * Use _mm_pause (x64) or __isb (arm64) intrinsic instead of rep nop. > */ > #if defined(_WIN64) > static __forceinline void > spin_delay(void) > { > +#ifdef _M_ARM64 > + /* > + * See spin_delay aarch64 inline assembly definition above for details > + * ref: > https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions > + */ > + __isb(_ARM64_BARRIER_SY); > +#else > _mm_pause(); > +#endif > } > #else > static __forceinline void This looks somewhat wrong to me. We end up with some ifdefs on the function definition level, and some others inside the function body. I think it should be either or. > diff --git a/meson.build b/meson.build > index 725e10d815..e354ad7650 100644 > --- a/meson.build > +++ b/meson.build > @@ -1944,7 +1944,13 @@ int main(void) > > elif host_cpu == 'arm' or host_cpu == 'aarch64' > > - prog = ''' > + if cc.get_id() == 'msvc' > +cdata.set('USE_ARMV8_CRC32C', false) > +cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1) > +have_optimized_crc = true > + else > + > +prog = ''' > #include > > int main(void) Why does this need to be hardcoded? The compiler probe should just work for msvc. > @@ -1960,18 +1966,19 @@ int main(void) > } > ''' > > - if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd > without -march=armv8-a+crc', > - args: test_c_args) > -# Use ARM CRC Extension unconditionally > -cdata.set('USE_ARMV8_CRC32C', 1) > -have_optimized_crc = true > - elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd > with -march=armv8-a+crc', > - args: test_c_args + ['-march=armv8-a+crc']) > -# Use ARM CRC Extension, with runtime check > -cflags_crc += '-march=armv8-a+crc' > -cdata.set('USE_ARMV8_CRC32C', false) > -cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1) > -have_optimized_crc = true > +if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd > without -march=armv8-a+crc', > +args: test_c_args) > + # Use ARM CRC Extension unconditionally > + cdata.set('USE_ARMV8_CRC32C', 1) > + have_optimized_crc = true > +elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and > __crc32cd with -march=armv8-a+crc', > +args: test_c_args + ['-march=armv8-a+crc']) > + # Use ARM CRC Extension, with runtime check > + cflags_crc += '-march=armv8-a+crc' > + cdata.set('USE_ARMV8_CRC32C', false) > + cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1) > + have_optimized_crc = true > +endif >endif > endif And then this reindent wouldn't be needed. Greetings, Andres Freund
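For what it's worth, keeping the conditionals at the function-definition level as suggested could look like this (a reshuffle of the patch's own intrinsics, untested):

#if defined(_WIN64) && defined(_M_ARM64)
/* MSVC on ARM64: inline assembly is unavailable, use the ISB intrinsic */
static __forceinline void
spin_delay(void)
{
	__isb(_ARM64_BARRIER_SY);
}
#elif defined(_WIN64)
/* MSVC on x64: inline assembly is unavailable, use the pause intrinsic */
static __forceinline void
spin_delay(void)
{
	_mm_pause();
}
#endif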
Re: Request to modify view_table_usage to include materialized views
Hey Tom, Thanks for the info. I'll submit a document change request instead. Thanks! Jon On Mon, Dec 5, 2022 at 11:53 AM Tom Lane wrote: > Jonathan Lemig writes: > > Would it be possible to modify the information_schema.view_table_usage > > (VTU) to include materialized views? > > Is it physically possible? Sure, it'd just take adjustment of some > relkind checks. > > However, it's against project policy. We consider that because the > information_schema views are defined by the SQL standard, they should > only show standardized properties of standardized objects. If the > standard ever gains materialized views, we'd adjust those views to > show them. In the meantime, they aren't there. > > It would make little sense in any case to adjust only this one view. > But if we were to revisit that policy, there are a lot of corner > cases that would have to be thought through --- things that almost > fit into the views, or that might appear in a very misleading way, > etc. > > regards, tom lane >
Re: WAL Insertion Lock Improvements (was: Re: Avoid LWLockWaitForVar() for currently held WAL insertion lock in WaitXLogInsertionsToFinish())
Hi, FWIW, I don't see an advantage in 0003. If it allows us to make something else simpler / faster, cool, but on its own it doesn't seem worthwhile. On 2022-12-02 16:31:58 -0800, Nathan Bossart wrote: > On Fri, Dec 02, 2022 at 04:32:38PM +0530, Bharath Rupireddy wrote: > > On Fri, Dec 2, 2022 at 6:10 AM Andres Freund wrote: > >> I'm not sure this is quite right - don't we need a memory barrier. But I > >> don't > >> see a reason to not just leave this code as-is. I think this should be > >> optimized entirely in lwlock.c > > > > Actually, we don't need that at all as LWLockWaitForVar() will return > > immediately if the lock is free. So, I removed it. > > I briefly looked at the latest patch set, and I'm curious how this change > avoids introducing memory ordering bugs. Perhaps I am missing something > obvious. I'm a bit confused too - the comment above talks about LWLockWaitForVar(), but the patches seem to optimize LWLockUpdateVar(). I think it'd be safe to optimize LWLockConflictsWithVar(), due to some pre-existing, quite crufty, code. LWLockConflictsWithVar() says: * Test first to see if it the slot is free right now. * * XXX: the caller uses a spinlock before this, so we don't need a memory * barrier here as far as the current usage is concerned. But that might * not be safe in general. which happens to be true in the single, indirect, caller: /* Read the current insert position */ SpinLockAcquire(&Insert->insertpos_lck); bytepos = Insert->CurrBytePos; SpinLockRelease(&Insert->insertpos_lck); reservedUpto = XLogBytePosToEndRecPtr(bytepos); I think at the very least we ought to have a comment in WaitXLogInsertionsToFinish() highlighting this. It's not at all clear to me that the proposed fast-path for LWLockUpdateVar() is safe. I think at the very least we could end up missing waiters that we should have woken up. I think it ought to be safe to do something like pg_atomic_exchange_u64().. if (!(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS)) return; because the pg_atomic_exchange_u64() will provide the necessary memory barrier. Greetings, Andres Freund
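Putting Andres's fragments together, the safe fast path he outlines for LWLockUpdateVar() would be roughly the following. The lock-internal names (pg_atomic_exchange_u64, LW_FLAG_HAS_WAITERS) are real; the function shape and the slow-path helper are assumptions:

static void
LWLockUpdateVarFastPath(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
{
	/*
	 * The atomic exchange acts as a full memory barrier, so the flag
	 * read below cannot be reordered ahead of publishing the new value.
	 */
	pg_atomic_exchange_u64(valptr, val);

	if (!(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS))
		return;					/* no waiters registered: nothing to wake */

	/* otherwise take the wait-list lock and wake LWLockWaitForVar()
	 * waiters, as the existing code does */
	LWLockUpdateVarSlowPath(lock, valptr, val);	/* hypothetical */
}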
Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert
On Fri, Dec 2, 2022 at 9:58 AM Jacob Champion wrote: > Thanks for the nudge -- running with OpenSSL 3.0.7 in CI did not fix > the issue. I suspect a problem with our error stack handling... It is a problem with the error queue, but *whose* problem is probably up for debate. The queue looks like this after SSL_connect() returns: error:1669:STORE routines:ossl_store_get0_loader_int:unregistered scheme:crypto/store/store_register.c:237:scheme=file error:8002:system library:file_open:No such file or directory:providers/implementations/storemgmt/file_store.c:269:calling stat(/usr/local/etc/openssl@3/certs) error:1669:STORE routines:ossl_store_get0_loader_int:unregistered scheme:crypto/store/store_register.c:237:scheme=file error:8002:system library:file_open:No such file or directory:providers/implementations/storemgmt/file_store.c:269:calling stat(/usr/local/etc/openssl@3/certs) error:1669:STORE routines:ossl_store_get0_loader_int:unregistered scheme:crypto/store/store_register.c:237:scheme=file error:8002:system library:file_open:No such file or directory:providers/implementations/storemgmt/file_store.c:269:calling stat(/usr/local/etc/openssl@3/certs) error:0A86:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:1883: Note that the error we care about is at the end, not the front. We are not the first using Homebrew to run into this, and best I can tell, it is a brew-specific bug. The certificate directory that's been configured isn't actually installed by the formula. (A colleague here was able to verify the same behavior on their local machine, so it's not a Cirrus problem.) The confusing "unrecognized scheme" message has thrown at least a few people off the scent. That refers to an OpenSSL STORE URI, not the URI describing the peer. (Why `file://` is considered "unregistered" is beyond me, considering the documentation says that file URI support is built into libcrypto.) From inspection, that error is put onto the queue before checking to see if the certificate directory exists, and then it's popped back off the queue if the directory is found(?!). Unfortunately, the directory isn't there for Homebrew, which means we get both errors, the first of which is not actually helpful. And then it pushes the pair of errors two more times, for reasons I haven't bothered looking into yet. Maybe this is considered an internal error caused by a packaging bug, in which case I expect the formula maintainers to ask why it worked for 1.1. Maybe it's a client error because we're not looking for the best error on the queue, in which case I ask how we're supposed to know which error is the most interesting. (I actually kind of know the answer to this -- OpenSSL's builtin clients appear to check the front of the queue first, to see if it's an SSL-related error, and then if it's not they grab the error at the end of the queue instead. To which I ask: *what?*) Maybe clients are expected to present the entirety of the queue. But then, why are three separate copies of the same errors spamming the queue? We can't present that. I'm considering filing an issue with OpenSSL, to see what they suggest a responsible client should do in this situation... > Separately from this, our brew cache in Cirrus is extremely out of > date. Is there something that's supposed to be running `brew update` > (or autoupdate) that is stuck or broken? (If this is eventually considered a bug in the formula, we'll need to update to get the fix regardless.) --Jacob
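For reference, the front-else-last heuristic described above can be expressed with the real OpenSSL error-queue API like so; this is a sketch of client-side triage, not libpq code:

#include <openssl/err.h>

/*
 * Prefer an SSL-library error at the front of the queue; otherwise fall
 * back to the most recently pushed error, mimicking what OpenSSL's own
 * builtin clients appear to do.
 */
static unsigned long
pick_interesting_error(void)
{
	unsigned long first = ERR_peek_error();
	unsigned long last = ERR_peek_last_error();

	if (ERR_GET_LIB(first) == ERR_LIB_SSL)
		return first;
	return last;
}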
Re: PGDOCS - Logical replication GUCs - added some xrefs
Hi, On Mon, Oct 24, 2022 at 12:45 AM Peter Smith wrote: > Hi hackers. > > There is a docs Logical Replication section "31.10 Configuration > Settings" [1] which describes some logical replication GUCs, and > details on how they interact with each other and how to take that into > account when setting their values. > > There is another docs Server Configuration section "20.6 Replication" > [2] which lists the replication-related GUC parameters, and what they > are for. > > Currently AFAIK those two pages are unconnected, but I felt it might > be helpful if some of the parameters in the list [2] had xref links to > the additional logical replication configuration information [1]. PSA > a patch to do that. > +1 on the patch. Some feedback on v5 below. > + > + For logical replication configuration settings refer > + also to . > + > + I feel the top paragraph needs to explain terminology for logical replication like it does for physical replication in addition to linking to the logical replication config page. I'm recommending this as we use terms like subscriber etc. in description of parameters without introducing them first. As an example, something like below might work. These settings control the behavior of the built-in streaming replication feature (see Section 27.2.5) and logical replication (link). For physical replication, servers will be either a primary or a standby server. Primaries can send data, while standbys are always receivers of replicated data. When cascading replication (see Section 27.2.7) is used, standby servers can also be senders, as well as receivers. Parameters are mainly for sending and standby servers, though some parameters have meaning only on the primary server. Settings may vary across the cluster without problems if that is required. For logical replication, servers will either be publishers (also called senders in the sections below) or subscribers. Publishers are , Subscribers are... > + > + See for more details > + about setting max_replication_slots for logical > + replication. > + The link doesn't add any new information regarding max_replication_slots other than "to reserve some for table sync" and has a good amount of unrelated info. I think it might be useful to just put a line here asking to reserve some for table sync instead of linking to the entire logical replication config section. > - Logical replication requires several configuration options to be set. > + Logical replication requires several configuration parameters to be set. May not be needed? The docs have references to both options and parameters but I don't feel strongly about it. Feel free to use what you prefer. I think we should add an additional line to the intro here saying that parameters are mostly relevant only one of the subscriber or publisher. Maybe a better written version of "While max_replication_slots means different things on the publisher and subscriber, all other parameters are relevant only on either the publisher or the subscriber." > + > + Notes I don't think we need this sub-section. If I understand correctly, these parameters are effective only on the subscriber side. So, any reason to not include them in that section? > + > + > +Logical replication workers are also affected by > +wal_receiver_timeout, > +wal_receiver_status_interval and > +wal_receiver_retry_interval. > + > + I like moving this; it makes more sense here. Should we remove it from config.sgml? 
It seems a bit out of place there as we generally talk only about individual parameters there, and this line is general logical replication subscriber advice which is more suited to logical-replication.sgml.

> +
> +Configuration parameter
> +max_worker_processes
> +may need to be adjusted to accommodate for replication workers, at least (
> +max_logical_replication_workers
> ++ 1). Some extensions and parallel queries also take
> +worker slots from max_worker_processes.
> +
> +

I think we should move this to the subscriber section as said above. It's useful to know this and people might skip over the notes.

> ~~
>
> Meanwhile, I also suspect that the main blurb top of [1] is not
> entirely correct... it says "These settings control the behaviour of
> the built-in streaming replication feature", although some of the GUCs
> mentioned later in this section are clearly for "logical replication".
>
> Thoughts?
>

I shared an idea above.

Regards,
Samay

>
> --
> [1] 31.10 Configuration Settings -
> https://www.postgresql.org/docs/current/logical-replication-config.html
> [2] 20.6 Replication -
> https://www.postgresql.org/docs/current/runtime-config-replication.html
>
> Kind Regards,
> Peter Smith.
> Fujitsu Australia
>
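To make the max_worker_processes arithmetic discussed above concrete, a minimal subscriber-side postgresql.conf sketch (the values are illustrative only, not recommendations):

    max_replication_slots = 10             # one per subscription, plus some
                                           # spare for table synchronization
    max_logical_replication_workers = 4
    max_worker_processes = 8               # at least
                                           # max_logical_replication_workers + 1;
                                           # extensions and parallel queries
                                           # also take slots from this pool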
Re: Error-safe user functions
On Mon, Dec 5, 2022 at 11:36 AM Andrew Dunstan wrote: > > On 2022-12-05 Mo 11:20, Robert Haas wrote: > > On Mon, Dec 5, 2022 at 11:09 AM Tom Lane wrote: > >> Robert Haas writes: > >>> On Sat, Dec 3, 2022 at 10:57 PM Corey Huinker > wrote: > 2. ereturn_* => errfeedback / error_feedback / feedback > >>> Oh, I like that, especially errfeedback. > >> efeedback? But TBH I do not think any of these are better than ereturn. > > I do. Having a macro name that is "return" plus one character is going > > to make people think that it returns. I predict that if you insist on > > using that name people are still going to be making mistakes based on > > that confusion 10 years from now. > > > > OK, I take both this point and Tom's about trying to keep it the same > length. So we need something that's 7 letters, doesn't say 'return' and > preferably begins with 'e'. I modestly suggest 'eseterr', or if we like > the 'feedback' idea 'efeedbk'. > > > Consulting https://www.thesaurus.com/browse/feedback again: ereply clocks in at 7 characters.
Re: Error-safe user functions
On Mon, Dec 5, 2022 at 1:00 PM Tom Lane wrote: > Andrew Dunstan writes: > > Wait a minute! Oh, no, sorry, as you were, 'errsave' is fine. > > Seems like everybody's okay with errsave. I'll make a v2 in a > little bit. I'd like to try updating array_in and/or record_in > just to verify that indirection cases work okay, before we consider > the design to be set. > +1 to errsave.
Re: ANY_VALUE aggregate
On Mon, Dec 5, 2022 at 12:57 PM David G. Johnston < david.g.johns...@gmail.com> wrote: > On Mon, Dec 5, 2022 at 7:57 AM Vik Fearing > wrote: > >> The SQL:2023 Standard defines a new aggregate named ANY_VALUE. It >> returns an implementation-dependent (i.e. non-deterministic) value from >> the rows in its group. >> >> PFA an implementation of this aggregate. >> >> > Can we please add "first_value" and "last_value" if we are going to add > "some_random_value" to our library of aggregates? > > Also, maybe we should have any_value do something like compute a 50/50 > chance that any new value seen replaces the existing chosen value, instead > of simply returning the first value all the time. Maybe even prohibit the > first value from being chosen so long as a second value appears. > > David J. > Adding to the pile of wanted aggregates: in the past I've lobbied for only_value() which is like first_value() but it raises an error on encountering a second value.
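For what it's worth, only_value() is easy enough to hand-roll today. A rough sketch -- the name and the exact semantics (erroring on a second distinct non-null value) are my own assumptions, not a concrete proposal:

    -- Error out when the group contains more than one distinct
    -- non-null value.
    CREATE FUNCTION only_value_trans(state anyelement, val anyelement)
    RETURNS anyelement
    LANGUAGE plpgsql IMMUTABLE STRICT AS
    $$
    BEGIN
        IF state IS DISTINCT FROM val THEN
            RAISE EXCEPTION 'only_value() found more than one value';
        END IF;
        RETURN state;
    END;
    $$;

    CREATE AGGREGATE only_value(anyelement) (
        SFUNC = only_value_trans,
        STYPE = anyelement
    );

With this, SELECT only_value(x) FROM t errors as soon as a group contains two different non-null values; nulls are skipped because the transition function is strict, so the first non-null input seeds the state.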
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Hi,

- I think it might be worth renaming IOCONTEXT_BUFFER_POOL to IOCONTEXT_{NORMAL, PLAIN, DEFAULT}. I'd like at some point to track WAL IO, temporary file IO, etc., and it doesn't seem useful to define a version of BUFFER_POOL for each of them. And it'd make it less confusing, because all the other existing contexts are also in the buffer pool (for now; can't wait for "bypass" or whatever to be tracked as well).

- Given that IOContextForStrategy() is defined in freelist.c, and that declaring it in pgstat.h requires including buf.h, I think it's probably better to move IOContextForStrategy()'s declaration to freelist.h (doesn't exist, but whatever the right one is).

- pgstat_backend_io_stats_assert_well_formed() doesn't seem to belong in pgstat.c. Why not pgstat_io_ops.c?

- Do pgstat_io_context_ops_assert_zero(), pgstat_io_op_assert_zero() have to be in pgstat.h?

I think the only non-trivial thing is the first point; the rest is stuff that I can also evolve during commit.

Greetings,

Andres Freund
Re: Allow placeholders in ALTER ROLE w/o superuser
On Mon, Dec 5, 2022 at 8:18 PM Tom Lane wrote:
> Alvaro Herrera writes:
> > I couldn't find any discussion of the idea of adding "(s)" to the
> > variable name in order to mark the variable userset in the catalog, and
> > I have to admit I find it a bit strange. Are we really agreed that
> > that's the way to proceed?
>
> I hadn't been paying close attention to this thread, sorry.
>
> I agree that that seems like a very regrettable choice,
> especially if you anticipate having to bump catversion anyway.

I totally understand that this change requires a catversion bump. I've reflected this in the commit message.

> Better to add a bool column to the catalog.

What about adding a boolean array to pg_db_role_setting? So, pg_db_role_setting would have the following columns:
 * setdatabase oid
 * setrole oid
 * setconfig text[]
 * setuser bool[]

--
Regards,
Alexander Korotkov
Re: ANY_VALUE aggregate
On Mon, Dec 5, 2022 at 2:31 PM Corey Huinker wrote: > Adding to the pile of wanted aggregates: in the past I've lobbied for > only_value() which is like first_value() but it raises an error on > encountering a second value. Yeah, that's another that I have hand-rolled in the past. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Error-safe user functions
On 2022-12-05 Mo 14:22, Corey Huinker wrote: > > On Mon, Dec 5, 2022 at 11:36 AM Andrew Dunstan > wrote: > > > On 2022-12-05 Mo 11:20, Robert Haas wrote: > > On Mon, Dec 5, 2022 at 11:09 AM Tom Lane wrote: > >> Robert Haas writes: > >>> On Sat, Dec 3, 2022 at 10:57 PM Corey Huinker > wrote: > 2. ereturn_* => errfeedback / error_feedback / feedback > >>> Oh, I like that, especially errfeedback. > >> efeedback? But TBH I do not think any of these are better than > ereturn. > > I do. Having a macro name that is "return" plus one character is > going > > to make people think that it returns. I predict that if you > insist on > > using that name people are still going to be making mistakes > based on > > that confusion 10 years from now. > > > > OK, I take both this point and Tom's about trying to keep it the same > length. So we need something that's 7 letters, doesn't say > 'return' and > preferably begins with 'e'. I modestly suggest 'eseterr', or if we > like > the 'feedback' idea 'efeedbk'. > > > > Consulting https://www.thesaurus.com/browse/feedback again: > ereply clocks in at 7 characters. It does? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: ANY_VALUE aggregate
On 12/5/22 15:57, Vik Fearing wrote: The SQL:2023 Standard defines a new aggregate named ANY_VALUE. It returns an implementation-dependent (i.e. non-deterministic) value from the rows in its group. PFA an implementation of this aggregate. Here is v2 of this patch. I had forgotten to update sql_features.txt. -- Vik Fearing From a9bb61aab9788ae25fdcd28f7dcfb54a263665cc Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Sat, 9 Apr 2022 00:07:38 +0200 Subject: [PATCH] Implement ANY_VALUE aggregate SQL:2023 defines an ANY_VALUE aggregate whose purpose is to emit an implementation-dependent (i.e. non-deterministic) value from the aggregated rows. --- doc/src/sgml/func.sgml | 14 ++ src/backend/catalog/sql_features.txt | 1 + src/backend/utils/adt/misc.c | 12 src/include/catalog/pg_aggregate.dat | 4 src/include/catalog/pg_proc.dat | 8 src/test/regress/expected/aggregates.out | 18 ++ src/test/regress/sql/aggregates.sql | 5 + 7 files changed, 62 insertions(+) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 2052d3c844..1823ee71d7 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -19706,16 +19706,30 @@ SELECT NULLIF(value, '(none)') ... Description Partial Mode + + + + any_value + +any_value ( "any" ) +same as input type + + +Chooses a non-deterministic value from the non-null input values. + + Yes + + array_agg array_agg ( anynonarray ) anyarray diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt index 8704a42b60..b7b6ad6334 100644 --- a/src/backend/catalog/sql_features.txt +++ b/src/backend/catalog/sql_features.txt @@ -515,16 +515,17 @@ T617 FIRST_VALUE and LAST_VALUE functions YES T618 NTH_VALUE function NO function exists, but some options missing T619 Nested window functions NO T620 WINDOW clause: GROUPS option YES T621 Enhanced numeric functions YES T622 Trigonometric functions YES T623 General logarithm functions YES T624 Common logarithm functions YES T625 LISTAGG NO +T626 ANY_VALUE YES T631 IN predicate with one list element YES T641 Multiple column assignment NO only some syntax variants supported T651 SQL-schema statements in SQL routines YES T652 SQL-dynamic statements in SQL routines NO T653 SQL-schema statements in external routines YES T654 SQL-dynamic statements in external routines NO T655 Cyclically dependent routines YES T811 Basic SQL/JSON constructor functions NO diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c index 9c13251231..94c92de06d 100644 --- a/src/backend/utils/adt/misc.c +++ b/src/backend/utils/adt/misc.c @@ -928,8 +928,20 @@ pg_get_replica_identity_index(PG_FUNCTION_ARGS) idxoid = RelationGetReplicaIndex(rel); table_close(rel, AccessShareLock); if (OidIsValid(idxoid)) PG_RETURN_OID(idxoid); else PG_RETURN_NULL(); } + +Datum +any_value_trans(PG_FUNCTION_ARGS) +{ + /* Return the first non-null argument */ + if (!PG_ARGISNULL(0)) + PG_RETURN_DATUM(PG_GETARG_DATUM(0)); + if (!PG_ARGISNULL(1)) + PG_RETURN_DATUM(PG_GETARG_DATUM(1)); + PG_RETURN_NULL(); +} + diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat index b9110a5298..37626d6f0c 100644 --- a/src/include/catalog/pg_aggregate.dat +++ b/src/include/catalog/pg_aggregate.dat @@ -620,9 +620,13 @@ aggtransfn => 'ordered_set_transition_multi', aggfinalfn => 'cume_dist_final', aggfinalextra => 't', aggfinalmodify => 'w', aggmfinalmodify => 'w', aggtranstype => 'internal' }, { aggfnoid => 'dense_rank(any)', aggkind => 'h', aggnumdirectargs => '1', aggtransfn => 'ordered_set_transition_multi', 
aggfinalfn => 'dense_rank_final', aggfinalextra => 't', aggfinalmodify => 'w', aggmfinalmodify => 'w', aggtranstype => 'internal' }, +# any_value +{ aggfnoid => 'any_value(anyelement)', aggtransfn => 'any_value_trans', + aggcombinefn => 'any_value_trans', aggtranstype => 'anyelement' }, + ] diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index f9301b2627..2ee4797559 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -11849,9 +11849,17 @@ proname => 'brin_minmax_multi_summary_recv', provolatile => 's', prorettype => 'pg_brin_minmax_multi_summary', proargtypes => 'internal', prosrc => 'brin_minmax_multi_summary_recv' }, { oid => '4641', descr => 'I/O', proname => 'brin_minmax_multi_summary_send', provolatile => 's', prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary', prosrc => 'brin_minmax_multi_summary_send' }, +{ oid => '8981', descr => 'arbitrary value fro
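For context, using the new aggregate looks something like this (the table and data here are made up):

    SELECT dept, any_value(name)
    FROM employees
    GROUP BY dept;
    -- returns one name per department; which one is
    -- implementation-dependent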
Re: predefined role(s) for VACUUM and ANALYZE
Hello,

While looking into the new feature, I found the following situation with the \dp command displaying privileges on the system tables:

GRANT VACUUM, ANALYZE ON TABLE pg_type TO alice;

SELECT relacl FROM pg_class WHERE oid = 'pg_type'::regclass;
                           relacl
-------------------------------------------------------------
 {=r/postgres,postgres=arwdDxtvz/postgres,alice=vz/postgres}
(1 row)

But the \dp command does not show the granted privileges:

\dp pg_type
                          Access privileges
 Schema | Name | Type | Access privileges | Column privileges | Policies
--------+------+------+-------------------+-------------------+----------
(0 rows)

The comment in src/bin/psql/describe.c explains the situation:

    /*
     * Unless a schema pattern is specified, we suppress system and temp
     * tables, since they normally aren't very interesting from a permissions
     * point of view. You can see 'em by explicit request though, eg with \z
     * pg_catalog.*
     */

So to see the privileges you have to explicitly specify the schema name:

\dp pg_catalog.pg_type
                                     Access privileges
   Schema   |  Name   | Type  |      Access privileges      | Column privileges | Policies
------------+---------+-------+-----------------------------+-------------------+----------
 pg_catalog | pg_type | table | =r/postgres                +|                   |
            |         |       | postgres=arwdDxtvz/postgres+|                   |
            |         |       | alice=vz/postgres           |                   |
(1 row)

But perhaps this behavior should be reviewed or at least documented?

- Pavel Luzanov
Re: [PoC] Reducing planning time when tables have many partitions
On Tue, 6 Dec 2022 at 04:45, Thom Brown wrote:
> Testing your patches with the same 1024 partitions, each with 64
> sub-partitions, I get a planning time of 205.020 ms, which is now a
> 1,377x speedup. This has essentially reduced the planning time from a
> catastrophe to a complete non-issue. Huge win!

Thanks for testing the v10 patches. I wouldn't have expected such additional gains from v10. I was mostly focused on trying to minimise any performance regression for simple queries that wouldn't benefit from indexing the EquivalenceMembers. Your query sounds like it does not fit into that category.

Perhaps it is down to the fact that v9-0002 or v9-0003 reverts a couple of the optimisations, and that is causing v9 to be slower than v10 for your query. It's hard to tell without more details of what you're running. Is this a schema and query you're able to share? Or perhaps mock up a script of something similar enough to allow us to see why v9 and v10 are so different?

Additionally, it would be interesting to see if patching with v10-0002 alone helps the performance of your query at all. I didn't imagine that change would give us anything easily measurable, but partition pruning makes extensive use of Bitmapsets, so perhaps you've found something. If you have, then it might be worth considering v10-0002 independently of the EquivalenceMember indexing work.

David
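In case it helps to reproduce something comparable, a mock-up in the shape described (1024 range partitions, each hash-partitioned 64 ways) might look like the following. This is purely a guess at the schema, since the real one hasn't been shared:

    CREATE TABLE t (a int, b int) PARTITION BY RANGE (a);

    DO $$
    BEGIN
        FOR i IN 0..1023 LOOP
            EXECUTE format('CREATE TABLE t_%s PARTITION OF t FOR VALUES '
                           'FROM (%s) TO (%s) PARTITION BY HASH (b)',
                           i, i, i + 1);
            FOR j IN 0..63 LOOP
                EXECUTE format('CREATE TABLE t_%s_%s PARTITION OF t_%s '
                               'FOR VALUES WITH (MODULUS 64, REMAINDER %s)',
                               i, j, i, j);
            END LOOP;
        END LOOP;
    END $$;

    -- then time something like:
    EXPLAIN (SUMMARY) SELECT * FROM t WHERE a = 1 AND b = 1;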
Re: Collation version tracking for macOS
On Tue, Dec 6, 2022 at 6:45 AM Joe Conway wrote: > On 12/5/22 12:41, Jeff Davis wrote: > > On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote: > >> 1. I think we should seriously consider provider = ICU63. I still > >> think search-by-collversion is a little too magical, even though it > >> clearly can be made to work. Of the non-magical systems, I think > >> encoding the choice of library into the provider name would avoid the > >> need to add a second confusing "X_version" concept alongside our > >> existing "X_version" columns in catalogues and DDL syntax, while > >> still > >> making it super clear what is going on. > > > > As I understand it, this is #2 in your previous list? > > > > Can we put the naming of the provider into the hands of the user, e.g.: > > > >CREATE COLLATION PROVIDER icu63 TYPE icu > > AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63'; > > > > In this model, icu would be a "provider kind" and icu63 would be the > > specific provider, which is named by the user. > > > > That seems like the least magical approach, to me. We need an ICU > > library; the administrator gives us one that looks like ICU; and we're > > happy. > > +1 > > I like this. The provider kind defines which path we take in our code, > and the specific library unambiguously defines a specific collation > behavior (I think, ignoring bugs?) OK, I'm going to see what happens if I try to wrangle that stuff into a new catalogue table.
Re: Error-safe user functions
I wrote: > Seems like everybody's okay with errsave. I'll make a v2 in a > little bit. I'd like to try updating array_in and/or record_in > just to verify that indirection cases work okay, before we consider > the design to be set. v2 as promised, incorporating the discussed renamings as well as some follow-on ones (ErrorReturnContext -> ErrorSaveContext, notably). I also tried moving the struct into a new header file, miscnodes.h after Andrew's suggestion upthread. That seems at least marginally cleaner than putting it in nodes.h, although I'm not wedded to this choice. I was really glad that I took the trouble to update some less-trivial input functions, because I learned two things: * It's better if InputFunctionCallSafe will tolerate the case of not being passed an ErrorSaveContext. In the COPY hack it felt worthwhile to have completely separate code paths calling InputFunctionCallSafe or InputFunctionCall, but that's less appetizing elsewhere. * There's a crying need for a macro that wraps up errsave() with an immediate return. Hence, ereturn() is reborn from the ashes. I hope Robert won't object to that name if it *does* do a return. I feel pretty good about this version; it seems committable if there are not objections. Not sure if we should commit 0003 like this, though. regards, tom lane diff --git a/doc/src/sgml/ref/create_type.sgml b/doc/src/sgml/ref/create_type.sgml index 693423e524..53b8d15f97 100644 --- a/doc/src/sgml/ref/create_type.sgml +++ b/doc/src/sgml/ref/create_type.sgml @@ -900,6 +900,15 @@ CREATE TYPE name function is written in C. + + In PostgreSQL version 16 and later, it is + desirable for base types' input functions to return safe + errors using the new errsave() mechanism, rather + than throwing ereport() exceptions as in previous + versions. See src/backend/utils/fmgr/README for + more information. + + diff --git a/src/backend/nodes/Makefile b/src/backend/nodes/Makefile index 4368c30fdb..7c594be583 100644 --- a/src/backend/nodes/Makefile +++ b/src/backend/nodes/Makefile @@ -56,6 +56,7 @@ node_headers = \ nodes/bitmapset.h \ nodes/extensible.h \ nodes/lockoptions.h \ + nodes/miscnodes.h \ nodes/replnodes.h \ nodes/supportnodes.h \ nodes/value.h \ diff --git a/src/backend/nodes/gen_node_support.pl b/src/backend/nodes/gen_node_support.pl index 7212bc486f..08992dfd47 100644 --- a/src/backend/nodes/gen_node_support.pl +++ b/src/backend/nodes/gen_node_support.pl @@ -68,6 +68,7 @@ my @all_input_files = qw( nodes/bitmapset.h nodes/extensible.h nodes/lockoptions.h + nodes/miscnodes.h nodes/replnodes.h nodes/supportnodes.h nodes/value.h @@ -89,6 +90,7 @@ my @nodetag_only_files = qw( executor/tuptable.h foreign/fdwapi.h nodes/lockoptions.h + nodes/miscnodes.h nodes/replnodes.h nodes/supportnodes.h ); diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 2585e24845..81727ecb28 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -71,6 +71,7 @@ #include "libpq/libpq.h" #include "libpq/pqformat.h" #include "mb/pg_wchar.h" +#include "nodes/miscnodes.h" #include "miscadmin.h" #include "pgstat.h" #include "postmaster/bgworker.h" @@ -686,6 +687,154 @@ errfinish(const char *filename, int lineno, const char *funcname) } +/* + * errsave_start --- begin a "safe" error-reporting cycle + * + * If "context" isn't an ErrorSaveContext node, this behaves as + * errstart(ERROR, domain), and the errsave() macro ends up acting + * exactly like ereport(ERROR, ...). 
+ * + * If "context" is an ErrorSaveContext node, but the node creator only wants + * notification of the fact of a safe error without any details, just set + * the error_occurred flag in the ErrorSaveContext node and return false, + * which will cause us to skip the remaining error processing steps. + * + * Otherwise, create and initialize error stack entry and return true. + * Subsequently, errmsg() and perhaps other routines will be called to further + * populate the stack entry. Finally, errsave_finish() will be called to + * tidy up. + */ +bool +errsave_start(void *context, const char *domain) +{ + ErrorSaveContext *escontext; + ErrorData *edata; + + /* + * Do we have a context for safe error reporting? If not, just punt to + * errstart(). + */ + if (context == NULL || !IsA(context, ErrorSaveContext)) + return errstart(ERROR, domain); + + /* Report that an error was detected */ + escontext = (ErrorSaveContext *) context; + escontext->error_occurred = true; + + /* Nothing else to do if caller wants no further details */ + if (!escontext->details_wanted) + return false; + + /* + * Okay, crank up a stack entry to store the info in. + */ + + recursion_depth++; + if (++errordata_stack_depth >= ERRORDATA_STACK_SIZE) + { + /* + * Wups, stack not big enough. We treat this as a PANIC condition + * because it suggests an infinite loop of errors dur
Re: Transaction timeout
On Sat, Dec 3, 2022 at 9:41 AM Andrey Borodin wrote: > Fixed. Added test for this. > Thanks! Tested (gitpod: https://gitpod.io/#https://gitlab.com/NikolayS/postgres/tree/transaction_timeout-v2 ), works as expected.
Re: Questions regarding distinct operation implementation
On Mon, 5 Dec 2022 at 02:34, Ankit Kumar Pandey wrote:
> Interesting problem, Hashtables created in normal aggregates (AGG_HASHED
> mode) may provide some reference as they have hashagg_spill_tuple but I
> am not sure of any prior implementation of hashtable with counter and
> spill.

I'm unsure if there's such guidance that can be gleaned from studying nodeAgg.c. IIRC, that works by deciding up-front how many partitions we're going to split the hash key space into and then writing out tuples to files based on "hashkey MOD number-of-partitions". At the end of that, you can just aggregate tuples one partition at a time. All groups are in the same file/partition.

The reason this does not seem useful to your case is that you need to be able to quickly look up a given Datum or set of Datums to check if they are unique or not. For that, you'd need to reload the hash table every time your lookup lands on a different partition of the hashkey space. I fail to see how that could ever be fast unless there happened to only be 1 partition. To make that worse, when a tuple goes out of the frame and the counter that's tracking how many times the Datum(s) appeared reaches 0, you need to write the entire file out again minus that tuple.

Let's say your window function is on a column which is distinct or *very* close to it and the given window is moving the window frame forward 1 tuple per input tuple. If each subsequent Datum hashes to a different partition, then you're going to need to load the file for that hash key space to check if that Datum has already been seen, then you're going to have to evict that tuple from the file as it moves out of frame, so that means reading and writing that entire file per input tuple consumed. That'll perform very poorly!

It's possible that you could maybe speed it up a bit with some lossy hash table that sits atop of this and can only tell you if the given key definitely does *not* exist. You'd then be able to just write that tuple out to the partition and you'd not have to read or write out the file again. It's going to slow down to a crawl when the lossy table contains too many false positives though.

> Major concern is, if we go through tuplesort route (without order
> by case), would we get handicapped in future if we want order by or more
> features?

Yeah, deciding that before you go down one of the paths is going to be important. I imagine the reason that you've not found another database that supports DISTINCT window functions in a window with an ORDER BY clause is that it's very hard to make it in a way where it performs well in all cases.

Maybe another way to go about it that will give you less lock-in if we decide to make ORDER BY work later would be to design some new tuple-store-like data structure that can be defined with a lookup key so you could ask it if a given key is stored and it would return the answer quickly without having to trawl through all stored tuples. It would also need to support the same positional lookups as tuplestore does today so that all the evicting-tuples-from-the-window-frame stuff works as it does today. If you made something like that, then the changes required in nodeWindowAgg.c would be significantly reduced. You'd also just have 1 work_mem limit to abide by instead of having to consider sharing that between a tuplestore and a hashtable/tuplesort.

Maybe as step 1, you could invent keyedtuplestore.c and consume tuplestore's functions but layer on the lossy hashtable idea that I mentioned above.
That'd have to be more than just a bloom filter as you need a payload of the count of tuples matching the given hashkey MOD nbuckets. If you then had a function like keyedtuplestore_key_definitely_does_not_exist() (can't think of a better name now) then you can just look up the lossy table, and if there are 0 tuples at that lossy bucket, then you can keyedtuplestore_puttupleslot() from nodeWindowAgg.c. keyedtuplestore_key_definitely_does_not_exist() would have to work much harder if there were > 0 tuples with the same lossy hashkey. You'd need to trawl through the tuples and check each one. Perhaps that could be tuned a bit so if you get too many collisions then the lossy table could be rehashed to a larger size. It's going to fall flat on its face, performance-wise, when the hash table can't be made larger due to work_mem constraints.

Anyway, that's a lot of only partially thought-through ideas above. If you're working on a patch like this, you should expect to have to rewrite it a dozen or 2 times as new ideas arrive. If you're good at using the fact that the new patch is better than the old one as motivation to continue, then you're onto an attitude that is PostgreSQL-community-proof :-) (thankfully) we're often not good at "let's just commit it now and make it better later".

David
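In case it's useful, the lossy-table idea in its most stripped-down form looks something like this. A plain C sketch, not PostgreSQL code; rehashing, work_mem accounting and the tuplestore side are all omitted:

    #include <stdbool.h>
    #include <stdint.h>

    #define NBUCKETS 8192           /* illustrative size only */

    /*
     * Per-bucket counters over "hashkey MOD nbuckets".  A zero bucket
     * proves the key was never stored; a non-zero bucket is only a
     * "maybe", so the caller must then check the stored tuples.
     */
    typedef struct LossyTable
    {
        uint32 counts[NBUCKETS];
    } LossyTable;

    static void
    lossy_add(LossyTable *t, uint32 hashkey)
    {
        t->counts[hashkey % NBUCKETS]++;
    }

    static void
    lossy_remove(LossyTable *t, uint32 hashkey)
    {
        t->counts[hashkey % NBUCKETS]--;    /* tuple left the frame */
    }

    static bool
    key_definitely_does_not_exist(const LossyTable *t, uint32 hashkey)
    {
        return t->counts[hashkey % NBUCKETS] == 0;
    }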
Re: Transaction timeout
The following review has been posted through the commitfest application:

make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

Tested, works as expected; documentation is not yet added
Re: initdb: Refactor PG_CMD_PUTS loops
On 02.12.22 15:07, Andrew Dunstan wrote: On 2022-12-01 Th 05:02, Peter Eisentraut wrote: Keeping the SQL commands that initdb runs in string arrays before feeding them to PG_CMD_PUTS() seems unnecessarily verbose and inflexible. In some cases, the array only has one member. In other cases, one might want to use PG_CMD_PRINTF() instead, to parametrize a command, but that would require breaking up the loop or using workarounds like replace_token(). This patch unwinds all that; it's much simpler that way. Looks reasonable. (Most of this dates back to 2003/2004, the very early days of initdb.c.) committed
Re: Transaction timeout
Hi,

On 2022-12-03 09:41:04 -0800, Andrey Borodin wrote:
> @@ -2720,6 +2723,7 @@ finish_xact_command(void)
>
>  	if (xact_started)
>  	{
> +
>  		CommitTransactionCommand();
>
>  #ifdef MEMORY_CONTEXT_CHECKING

Spurious newline added.

> @@ -4460,6 +4473,10 @@ PostgresMain(const char *dbname, const char *username)
>
>  			enable_timeout_after(IDLE_SESSION_TIMEOUT,
>  								 IdleSessionTimeout);
>  		}
> +
> +
> +		if (get_timeout_active(TRANSACTION_TIMEOUT))
> +			disable_timeout(TRANSACTION_TIMEOUT, false);
>  	}
>
>  	/* Report any recently-changed GUC options */

Too many newlines added.

I'm a bit worried about adding ever more branches and function calls for the processing of single statements. We already spend a noticeable percentage of the cycles for a single statement in PostgresMain(); this adds additional overhead.

I'm somewhat inclined to think that we need some redesign here before we add more overhead.

> @@ -1360,6 +1363,16 @@ IdleInTransactionSessionTimeoutHandler(void)
>  	SetLatch(MyLatch);
>  }
>
> +static void
> +TransactionTimeoutHandler(void)
> +{
> +#ifdef HAVE_SETSID
> +	/* try to signal whole process group */
> +	kill(-MyProcPid, SIGINT);
> +#endif
> +	kill(MyProcPid, SIGINT);
> +}
> +

Why does this use signals instead of just setting the latch like IdleInTransactionSessionTimeoutHandler() etc?

> diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
> index 0081873a72..5229fe3555 100644
> --- a/src/bin/pg_dump/pg_backup_archiver.c
> +++ b/src/bin/pg_dump/pg_backup_archiver.c
> @@ -3089,6 +3089,7 @@ _doSetFixedOutputState(ArchiveHandle *AH)
>  	ahprintf(AH, "SET statement_timeout = 0;\n");
>  	ahprintf(AH, "SET lock_timeout = 0;\n");
>  	ahprintf(AH, "SET idle_in_transaction_session_timeout = 0;\n");
> +	ahprintf(AH, "SET transaction_timeout = 0;\n");

Hm - why is that the right thing to do?

> diff --git a/src/test/isolation/specs/timeouts.spec b/src/test/isolation/specs/timeouts.spec
> index c747b4ae28..a7f27811c7 100644
> --- a/src/test/isolation/specs/timeouts.spec
> +++ b/src/test/isolation/specs/timeouts.spec
> @@ -23,6 +23,9 @@ step sto	{ SET statement_timeout = '10ms'; }
>  step lto	{ SET lock_timeout = '10ms'; }
>  step lsto	{ SET lock_timeout = '10ms'; SET statement_timeout = '10s'; }
>  step slto	{ SET lock_timeout = '10s'; SET statement_timeout = '10ms'; }
> +step tto	{ SET transaction_timeout = '10ms'; }
> +step sleep0	{ SELECT pg_sleep(0.0001) }
> +step sleep10	{ SELECT pg_sleep(0.01) }
>  step locktbl	{ LOCK TABLE accounts; }
>  step update	{ DELETE FROM accounts WHERE accountid = 'checking'; }
>  teardown	{ ABORT; }
> @@ -47,3 +50,5 @@ permutation wrtbl lto update(*)
>  permutation wrtbl lsto update(*)
>  # statement timeout expires first, row-level lock
>  permutation wrtbl slto update(*)
> +# transaction timeout
> +permutation tto sleep0 sleep0 sleep10(*)
> \ No newline at end of file

I don't think this is quite sufficient. I think the test should verify that transaction timeout interacts correctly with statement timeout / idle in tx timeout.

Greetings,

Andres Freund
Re: [PATCH] Add native windows on arm64 support
On Mon, Dec 05, 2022 at 10:14:49AM -0800, Andres Freund wrote:
> On 2022-12-05 14:12:41 +0900, Michael Paquier wrote:
>> With meson gaining in maturity, perhaps that's not the most urgent
>> thing as we will likely remove src/tools/msvc/ soon but I'd rather do
>> that right anyway as much as I can to avoid an incorrect state in the
>> tree at any time in its history.
>
> I'd actually argue that we should just not add win32 support to
> src/tools/msvc/.

Are you arguing about ripping out the existing win32 support or doing nothing for arm64? I guess the latter, but these words also sound like the former to me.
--
Michael
Re: Transaction timeout
Thanks for looking into this Andres!

On Mon, Dec 5, 2022 at 3:07 PM Andres Freund wrote:
>
> I'm a bit worried about adding ever more branches and function calls for
> the processing of single statements. We already spend a noticeable
> percentage of the cycles for a single statement in PostgresMain(), this
> adds additional overhead.
>
> I'm somewhat inclined to think that we need some redesign here before we
> add more overhead.
>

We can cap statement_timeout/idle_session_timeout by the budget of transaction_timeout left. Either way, we can use the batch function enable_timeouts() instead of enable_timeout_after(). Does any of this make sense?

>
> > @@ -1360,6 +1363,16 @@ IdleInTransactionSessionTimeoutHandler(void)
> > 	SetLatch(MyLatch);
> > }
> >
> > +static void
> > +TransactionTimeoutHandler(void)
> > +{
> > +#ifdef HAVE_SETSID
> > +	/* try to signal whole process group */
> > +	kill(-MyProcPid, SIGINT);
> > +#endif
> > +	kill(MyProcPid, SIGINT);
> > +}
> > +
>
> Why does this use signals instead of just setting the latch like
> IdleInTransactionSessionTimeoutHandler() etc?

I just copied statement_timeout behaviour. As I understand it, this implementation is preferred if the timeout can catch the backend running at full steam.

> > diff --git a/src/bin/pg_dump/pg_backup_archiver.c
> > b/src/bin/pg_dump/pg_backup_archiver.c
> > index 0081873a72..5229fe3555 100644
> > --- a/src/bin/pg_dump/pg_backup_archiver.c
> > +++ b/src/bin/pg_dump/pg_backup_archiver.c
> > @@ -3089,6 +3089,7 @@ _doSetFixedOutputState(ArchiveHandle *AH)
> > 	ahprintf(AH, "SET statement_timeout = 0;\n");
> > 	ahprintf(AH, "SET lock_timeout = 0;\n");
> > 	ahprintf(AH, "SET idle_in_transaction_session_timeout = 0;\n");
> > +	ahprintf(AH, "SET transaction_timeout = 0;\n");
>
> Hm - why is that the right thing to do?

Because transaction_timeout has the effects of statement_timeout.

Thank you!

Best regards, Andrey Borodin.
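To illustrate the batching idea, here is roughly what it could look like with the existing enable_timeouts() API from utils/timeout.h. A sketch only: TRANSACTION_TIMEOUT is the new timeout ID this patch would register, and error paths are omitted:

    #include "utils/timeout.h"

    /* Arm both timeouts with one enable_timeouts() call instead of two
     * enable_timeout_after() calls. */
    static void
    arm_statement_timeouts(int statement_timeout_ms, int transaction_timeout_ms)
    {
        EnableTimeoutParams timeouts[2];
        int         ntimeouts = 0;

        if (statement_timeout_ms > 0)
        {
            timeouts[ntimeouts].id = STATEMENT_TIMEOUT;
            timeouts[ntimeouts].type = TMPARAM_AFTER;
            timeouts[ntimeouts].delay_ms = statement_timeout_ms;
            ntimeouts++;
        }
        if (transaction_timeout_ms > 0)
        {
            timeouts[ntimeouts].id = TRANSACTION_TIMEOUT;   /* from this patch */
            timeouts[ntimeouts].type = TMPARAM_AFTER;
            timeouts[ntimeouts].delay_ms = transaction_timeout_ms;
            ntimeouts++;
        }
        if (ntimeouts > 0)
            enable_timeouts(timeouts, ntimeouts);
    }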
Re: Error-safe user functions
Hi, On 2022-12-05 16:40:06 -0500, Tom Lane wrote: > +/* > + * errsave_start --- begin a "safe" error-reporting cycle > + * > + * If "context" isn't an ErrorSaveContext node, this behaves as > + * errstart(ERROR, domain), and the errsave() macro ends up acting > + * exactly like ereport(ERROR, ...). > + * > + * If "context" is an ErrorSaveContext node, but the node creator only wants > + * notification of the fact of a safe error without any details, just set > + * the error_occurred flag in the ErrorSaveContext node and return false, > + * which will cause us to skip the remaining error processing steps. > + * > + * Otherwise, create and initialize error stack entry and return true. > + * Subsequently, errmsg() and perhaps other routines will be called to > further > + * populate the stack entry. Finally, errsave_finish() will be called to > + * tidy up. > + */ > +bool > +errsave_start(void *context, const char *domain) Why is context a void *? > +{ > + ErrorSaveContext *escontext; > + ErrorData *edata; > + > + /* > + * Do we have a context for safe error reporting? If not, just punt to > + * errstart(). > + */ > + if (context == NULL || !IsA(context, ErrorSaveContext)) > + return errstart(ERROR, domain); I don't think we should "accept" !IsA(context, ErrorSaveContext) - that seems likely to hide things like use-after-free. > + if (++errordata_stack_depth >= ERRORDATA_STACK_SIZE) > + { > + /* > + * Wups, stack not big enough. We treat this as a PANIC > condition > + * because it suggests an infinite loop of errors during error > + * recovery. > + */ > + errordata_stack_depth = -1; /* make room on stack */ > + ereport(PANIC, (errmsg_internal("ERRORDATA_STACK_SIZE > exceeded"))); > + } This is the fourth copy of this code... > +/* > + * errsave_finish --- end a "safe" error-reporting cycle > + * > + * If errsave_start() decided this was a regular error, behave as > + * errfinish(). Otherwise, package up the error details and save > + * them in the ErrorSaveContext node. > + */ > +void > +errsave_finish(void *context, const char *filename, int lineno, > +const char *funcname) > +{ > + ErrorSaveContext *escontext = (ErrorSaveContext *) context; > + ErrorData *edata = &errordata[errordata_stack_depth]; > + > + /* verify stack depth before accessing *edata */ > + CHECK_STACK_DEPTH(); > + > + /* > + * If errsave_start punted to errstart, then elevel will be ERROR or > + * perhaps even PANIC. Punt likewise to errfinish. > + */ > + if (edata->elevel >= ERROR) > + errfinish(filename, lineno, funcname); I'd put a pg_unreachable() or such after the errfinish() call. > + /* > + * Else, we should package up the stack entry contents and deliver them > to > + * the caller. > + */ > + recursion_depth++; > + > + /* Save the last few bits of error state into the stack entry */ > + if (filename) > + { > + const char *slash; > + > + /* keep only base name, useful especially for vpath builds */ > + slash = strrchr(filename, '/'); > + if (slash) > + filename = slash + 1; > + /* Some Windows compilers use backslashes in __FILE__ strings */ > + slash = strrchr(filename, '\\'); > + if (slash) > + filename = slash + 1; > + } > + > + edata->filename = filename; > + edata->lineno = lineno; > + edata->funcname = funcname; > + edata->elevel = ERROR; /* hide the LOG value used above */ > + > + /* > + * We skip calling backtrace and context functions, which are more > likely > + * to cause trouble than provide useful context; they might act on the > + * assumption that a transaction abort is about to occur. 
> + */ This seems like a fair bit of duplicated code. > + * This is the same as InputFunctionCall, but the caller may also pass a > + * previously-initialized ErrorSaveContext node. (We declare that as > + * "void *" to avoid including miscnodes.h in fmgr.h.) It seems way cleaner to forward declare ErrorSaveContext instead of using void *. > If escontext points > + * to an ErrorSaveContext, any "safe" errors detected by the input function > + * will be reported by filling the escontext struct. The caller must > + * check escontext->error_occurred before assuming that the function result > + * is meaningful. I wonder if we shouldn't instead make InputFunctionCallSafe() return a boolean and return the Datum via a pointer. As callers are otherwise going to need to do SAFE_ERROR_OCCURRED(escontext) themselves, I think it should also lead to more concise (and slightly more efficient) code. > +Datum > +InputFunctionCallSafe(FmgrInfo *flinfo, char *str, > +
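For readers following along, call sites in datatype input functions would presumably end up looking something like this rough sketch. The macro spellings were still being settled in this thread, and the "widget" type and parse_widget() helper are made up:

    Datum
    widget_in(PG_FUNCTION_ARGS)
    {
        char       *str = PG_GETARG_CSTRING(0);
        int32       result;

        /* parse_widget() and the "widget" type are hypothetical */
        if (!parse_widget(str, &result))
            ereturn(fcinfo->context, (Datum) 0,
                    (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                     errmsg("invalid input syntax for type %s: \"%s\"",
                            "widget", str)));

        PG_RETURN_INT32(result);
    }

On a soft error the function returns a dummy Datum and the caller inspects the ErrorSaveContext; with no such context supplied, the error is thrown as a normal ERROR.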
Re: [PATCH] Add native windows on arm64 support
Hi, On 2022-12-06 08:31:16 +0900, Michael Paquier wrote: > On Mon, Dec 05, 2022 at 10:14:49AM -0800, Andres Freund wrote: > > On 2022-12-05 14:12:41 +0900, Michael Paquier wrote: > >> With meson gaining in maturity, perhaps that's not the most urgent > >> thing as we will likely remove src/tools/msvc/ soon but I'd rather do > >> that right anyway as much as I can to avoid an incorrect state in the > >> tree at any time in its history. > > > > I'd actually argue that we should just not add win32 support to > > src/tools/msvc/. > > Are you arguing about ripping out the existing win32 or do nothing for > arm64? I guess the later, but these words also sound like the former > to me. Yes, that's a typo. I meant that we shouldn't add arm64 support to src/tools/msvc/. I do think we should rip out src/tools/msvc/ soon-ish, but we need buildfarm support first. Greetings, Andres Freund
Re: Perform streaming logical transactions by background workers and parallel apply
Here are my review comments for patch v55-0002

==
.../replication/logical/applyparallelworker.c

1. pa_can_start

@@ -276,9 +278,9 @@ pa_can_start(TransactionId xid) /* * Don't start a new parallel worker if user has set skiplsn as it's * possible that user want to skip the streaming transaction. For - * streaming transaction, we need to spill the transaction to disk so that - * we can get the last LSN of the transaction to judge whether to skip - * before starting to apply the change. + * streaming transaction, we need to serialize the transaction to a file + * so that we can get the last LSN of the transaction to judge whether to + * skip before starting to apply the change. */ if (!XLogRecPtrIsInvalid(MySubscription->skiplsn)) return false;

I think the wording change may belong in patch 0001 because it has nothing to do with partial serializing.

~~~

2. pa_free_worker

+ /* + * Stop the worker if there are enough workers in the pool. + * + * XXX The worker is also stopped if the leader apply worker needed to + * serialize part of the transaction data due to a send timeout. This is + * because the message could be partially written to the queue due to send + * timeout and there is no way to clean the queue other than resending the + * message until it succeeds. To avoid complexity, we directly stop the + * worker in this case. + */ + if (winfo->serialize_changes || + napplyworkers > (max_parallel_apply_workers_per_subscription / 2))

Don't need to say "due to send timeout" 2 times in 2 sentences.

SUGGESTION
XXX The worker is also stopped if the leader apply worker needed to serialize part of the transaction data due to a send timeout. This is because the message could be partially written to the queue but there is no way to clean the queue other than resending the message until it succeeds. Directly stopping the worker avoids needing this complexity.

~~~

3. pa_spooled_messages

Previously I suggested this function name should be changed but that was rejected (see [1] #6a)

> 6a.
> IMO a better name for this function would be
> pa_apply_spooled_messages();

Not sure about this.

~

FYI, the reason for the previous suggestion is that there is no verb in the current function name, so the reader is left thinking pa_spooled_messages "what"? It means the caller has to have extra comments like:

/* Check if changes have been serialized to a file. */
pa_spooled_messages();

OTOH, if the function was called something better -- e.g. pa_check_for_spooled_messages() or similar -- then it would be self-explanatory.

~

4.

/* + * Replay the spooled messages in the parallel apply worker if the leader apply + * worker has finished serializing changes to the file. + */ +static void +pa_spooled_messages(void)

I'm not 100% sure of the logic, so IMO maybe the comment should say a bit more about how this works: Specifically, let's say there was some timeout and the LA needed to write the spool file, then let's say the PA timed out and found itself inside this function. Now, let's say the LA is still busy writing the file -- so what happens next? Does this function simply return, then the main PA loop waits again, then it times out again, then PA finds itself back inside this function again... and that keeps happening over and over until eventually the spool file is found FS_READY? Some explanatory comments might help.

~

5.

+ /* + * Check if changes have been serialized to a file. if so, read and apply + * them.
+ */ + SpinLockAcquire(&MyParallelShared->mutex); + fileset_state = MyParallelShared->fileset_state; + SpinLockRelease(&MyParallelShared->mutex); "if so" -> "If so" ~~~ 6. pa_send_data + * + * If the attempt to send data via shared memory times out, then we will switch + * to "PARTIAL_SERIALIZE mode" for the current transaction to prevent possible + * deadlocks with another parallel apply worker (refer to the comments atop + * applyparallelworker.c for details). This means that the current data and any + * subsequent data for this transaction will be serialized to a file. */ void pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data) SUGGESTION (minor comment rearranging) If the attempt to send data via shared memory times out, then we will switch to "PARTIAL_SERIALIZE mode" for the current transaction -- this means that the current data and any subsequent data for this transaction will be serialized to a file. This is done to prevent possible deadlocks with another parallel apply worker (refer to the comments atop applyparallelworker.c for details). ~ 7. + /* + * Take the stream lock to make sure that the parallel apply worker + * will wait for the leader to release the stream lock until the + * end of the transaction. + */ + pa_lock_stream(winfo->shared->xid, AccessExclusiveLock); The comment doesn't sound right. "until the end" -> "at the end" (??) ~~~ 8. pa_stream_abort @@ -1374,6 +1470,7 @@ pa_stream_abort(LogicalRepStreamAbortData *abort_da