Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Dec 19, 2023 at 12:37 PM Masahiko Sawada wrote: > > On Mon, Dec 18, 2023 at 3:41 PM John Naylor wrote: > > Let's do it in just one place. In TidStoreCreate(), do > > > > /* clamp max_bytes to at least the size of the empty tree with > > allocated blocks, so it doesn't immediately appear full */ > > ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage); > > > > Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that. > > But doesn't it mean that even if we create a shared tidstore with > small memory, say 64kB, it actually uses 1MB? This sounds like an argument for controlling the minimum DSA segment size. (I'm not really in favor of that, but open to others' opinion) I wasn't talking about that above -- I was saying we should have only one place where we clamp max_bytes so that the tree doesn't immediately appear full.
Re: Change GUC hashtable to use simplehash?
On Mon, 2023-12-18 at 13:39 +0700, John Naylor wrote: > For now just two: > v10-0002 is Jeff's change to the search path cache, but with the > chunked interface that I found to be faster. Did you consider specializing for the case of an aligned pointer? If it's a string (c string or byte string) it's almost always going to be aligned, right? I hacked up a patch (attached). I lost track of which benchmark we're using to test the performance, but when I test in a loop it seems substantially faster. It reads past the NUL byte, but only to the next alignment boundary, which I think is OK (though I think I'd need to fix the patch for when maxalign < 8). Regards, Jeff Davis From 055d5cc24404584fd98109fabdcf83348e5c49b4 Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Mon, 18 Dec 2023 16:44:27 -0800 Subject: [PATCH v10jd] Optimize hash function further. --- src/backend/catalog/namespace.c | 46 +--- src/include/common/hashfn_unstable.h | 9 ++ 2 files changed, 51 insertions(+), 4 deletions(-) diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c index c23f46aca3..368a7fabec 100644 --- a/src/backend/catalog/namespace.c +++ b/src/backend/catalog/namespace.c @@ -245,15 +245,44 @@ static bool MatchNamedCall(HeapTuple proctup, int nargs, List *argnames, * offers a more convenient API. 
*/ +/* From: https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord */ +#define haszero64(v) \ + (((v) - 0x0101010101010101UL) & ~(v) & 0x8080808080808080UL) + static inline uint32 -spcachekey_hash(SearchPathCacheKey key) +cstring_hash_aligned(const char *str, uint64 seed) +{ + const char *const start = str; + const char *buf = start; + int chunk_len = 0; + fasthash_state hs; + + fasthash_init(&hs, FH_UNKNOWN_LENGTH, seed); + + Assert(PointerIsAligned(start, uint64)); + while (!haszero64(*(uint64 *)buf)) + { + fasthash_accum64(&hs, buf); + buf += sizeof(uint64); + } + + while (buf[chunk_len] != '\0') + chunk_len++; + fasthash_accum(&hs, buf, chunk_len); + buf += chunk_len; + + return fasthash_final32(&hs, buf - start); +} + +static inline uint32 +cstring_hash_unaligned(const char *str, uint64 seed) { - const char *const start = key.searchPath; - const char *buf = key.searchPath; + const char *const start = str; + const char *buf = str; fasthash_state hs; /* WIP: maybe roleid should be mixed in normally */ - fasthash_init(&hs, FH_UNKNOWN_LENGTH, key.roleid); + fasthash_init(&hs, FH_UNKNOWN_LENGTH, seed); while (*buf) { int chunk_len = 0; @@ -269,6 +298,15 @@ spcachekey_hash(SearchPathCacheKey key) return fasthash_final32(&hs, buf - start); } +static inline uint32 +spcachekey_hash(SearchPathCacheKey key) +{ + if (PointerIsAligned(key.searchPath, uint64)) + return cstring_hash_aligned(key.searchPath, key.roleid); + else + return cstring_hash_unaligned(key.searchPath, key.roleid); +} + static inline bool spcachekey_equal(SearchPathCacheKey a, SearchPathCacheKey b) { diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h index bf1dbee28d..553fab0415 100644 --- a/src/include/common/hashfn_unstable.h +++ b/src/include/common/hashfn_unstable.h @@ -105,6 +105,15 @@ fasthash_combine(fasthash_state *hs) hs->accum = 0; } +/* Accumulate 8 bytes from aligned pointer and combine it into the hash */ +static inline void 
+fasthash_accum64(fasthash_state *hs, const char *ptr) +{ + Assert(PointerIsAligned(ptr, uint64)); + hs->accum = *(uint64 *)ptr; + fasthash_combine(hs); +} + /* Accumulate up to 8 bytes of input and combine it into the hash */ static inline void fasthash_accum(fasthash_state *hs, const char *k, int len) -- 2.34.1
Re: Proposal to add page headers to SLRU pages
> This work is being done in file.c – it seems to me the proper way to > proceed would be to continue writing on-disk upgrade logic here. > Besides that this looks good to me, would like to hear what others have to > say. Thank you, Rishu, for taking the time to review the code. I've updated the patch and moved the on-disk upgrade logic to pg_upgrade/file.c. I have also added this thread to the current Commitfest and hope this patch will be part of the 17 release. The commitfest link: https://commitfest.postgresql.org/46/4709/ Regards, Yong slur_page_header_v2.patch
Re: "pgoutput" options missing on documentation
On Tue, Dec 19, 2023 at 12:07 PM Peter Smith wrote: > > On Tue, Dec 19, 2023 at 1:35 AM Emre Hasegeli wrote: > > > > > Fair enough. I think we should push your first patch only in HEAD as > > > this is a minor improvement over the current behaviour. What do you > > > think? > > > > I agree. > > Patch 0001 > > AFAICT parse_output_parameters possible errors are never tested. For > example, there is no code coverage [1] touching any of these ereports. > > IMO there should be some simple test cases -- I am happy to create > some tests if you agree they should exist. > I don't think having tests for all sorts of error checking will add much value as compared to the overhead they bring. > ~~~ > > While looking at the function parse_output_parameters() I noticed that > if an unrecognised option is passed the function emits an elog instead > of an ereport > We don't expect an unrecognized option here, and for such a thing we use elog in the code. See the similar usage in parseCreateReplSlotOptions(). I think we should move to the 0002 patch now. For that, I suggest preparing separate back-branch patches. -- With Regards, Amit Kapila.
Re: [DOC] Introducing Quick Start Guide to PL/pgSQL and PL/Python Documentation
On Sat, 16 Dec 2023 at 18:49, Ishaan Adarsh wrote: > Hi > > I have made some documentation enhancements for PL/pgSQL and PL/Python > sections. The changes include the addition of a Quick Start Guide to > facilitate a smoother onboarding experience for users. > > Patch File Name: > 0001-plpyhton-plpgsql-docu-changes.patch > > Patch Description: > This patch introduces a Quick Start Guide to the documentation for PL/pgSQL > and PL/Python. The Quick Start Guide provides users with a step-by-step > walkthrough of setting up and using these procedural languages. The goal is > to improve user accessibility and streamline the learning process. > > Changes Made: > 1. Added a new section titled "Quick Start Guide" to both PL/pgSQL and > PL/Python documentation. > 2. Included step-by-step instructions for users to get started with these > procedural languages. > 3. Provided explanations, code snippets, and examples to illustrate key > concepts. > > Discussion Points: > I am seeking your feedback on the following aspects: > - Clarity and completeness of the Quick Start Guide > - Any additional information or examples that could enhance the guide > - Suggestions for improving the overall documentation structure > > Your insights and suggestions are highly valuable, and I appreciate your > time in reviewing this documentation enhancement. 1. It seems you are missing a tag in plpgsql "Create the Makefile": + +Create the Makefile + + + Create a Makefile in the pg_plpgsql_ext directory with the following content: + 2. We are expected to use CREATE EXTENSION to load the extension; should we add the following to extension--version.sql? -- complain if script is sourced in psql, rather than via CREATE EXTENSION \echo Use "CREATE EXTENSION pair" to load this file. \quit -- Regards, Japin Li ChengDu WenWu Information Technology Co., Ltd.
Re: "pgoutput" options missing on documentation
On Tue, Dec 19, 2023 at 1:35 AM Emre Hasegeli wrote: > > > Fair enough. I think we should push your first patch only in HEAD as > > this is a minor improvement over the current behaviour. What do you > > think? > > I agree. Patch 0001 AFAICT parse_output_parameters possible errors are never tested. For example, there is no code coverage [1] touching any of these ereports. IMO there should be some simple test cases -- I am happy to create some tests if you agree they should exist. ~~~ While looking at the function parse_output_parameters() I noticed that if an unrecognised option is passed the function emits an elog instead of an ereport -- test_pub=# SELECT * FROM pg_logical_slot_get_changes('test_slot_v1', NULL, NULL, 'banana', '1'); 2023-12-19 17:08:21.627 AEDT [8921] ERROR: unrecognized pgoutput option: banana 2023-12-19 17:08:21.627 AEDT [8921] CONTEXT: slot "test_slot_v1", output plugin "pgoutput", in the startup callback 2023-12-19 17:08:21.627 AEDT [8921] STATEMENT: SELECT * FROM pg_logical_slot_get_changes('test_slot_v1', NULL, NULL, 'banana', '1'); ERROR: unrecognized pgoutput option: banana CONTEXT: slot "test_slot_v1", output plugin "pgoutput", in the startup callback -- But that doesn't seem right. AFAIK elog messages use errmsg_internal so this message would not get translated. PSA a patch to fix that. == [1] code coverage -- https://coverage.postgresql.org/src/backend/replication/pgoutput/pgoutput.c.gcov.html Kind Regards, Peter Smith. Fujitsu Australia parse_output_parameters.diff Description: Binary data
Update the comment in nodes.h to cover Cardinality
By chance I discovered that the comment for the typedefs of "double"s does not cover Cardinality. Should we update that comment accordingly, maybe something like below? - * Typedefs for identifying qualifier selectivities and plan costs as such. - * These are just plain "double"s, but declaring a variable as Selectivity - * or Cost makes the intent more obvious. + * Typedefs for identifying qualifier selectivities, plan costs and + * estimated rows or other count as such. These are just plain "double"s, + * but declaring a variable as Selectivity, Cost or Cardinality makes the + * intent more obvious. Thanks Richard v1-0001-Add-comment-for-Cardinality-typedef.patch Description: Binary data
Re: Synchronizing slots from primary to standby
On Tue, Dec 19, 2023 at 4:51 AM Peter Smith wrote: > > > == > doc/src/sgml/system-views.sgml > > 3. > + > + The hot standby can have any of these sync_state values for the slots > but > + on a hot standby, the slots with state 'r' and 'i' can neither be used > + for logical decoding nor dropped by the user. > + The sync_state has no meaning on the primary server; the primary > + sync_state value is default 'n' for all slots but may (if leftover > + from a promoted standby) also be 'r'. > + > > I still feel we are exposing too much useless information about the > primary server values. > > Isn't it sufficient to just say "The sync_state values have no meaning > on a primary server.", and not bother to mention what those > meaningless values might be -- e.g. if they are meaningless then who > cares what they are or how they got there? > I feel it would be good to mention somewhere that the primary can have slots in the 'r' state, if not here, some other place. > > 7. > +/* > + * Exit the slot sync worker with given exit-code. > + */ > +static void > +slotsync_worker_exit(const char *msg, int code) > +{ > + ereport(LOG, errmsg("%s", msg)); > + proc_exit(code); > +} > > This could be written differently (don't pass the exit code, instead > pass a bool) like: > > static void > slotsync_worker_exit(const char *msg, bool restart_worker) > > By doing it this way, you can keep the special exit code values (0,1) > within this function where you can comment all about them instead of > having scattered comments about exit codes in the callers. > > SUGGESTION > ereport(LOG, errmsg("%s", msg)); > /* <comment about restart or not> */ > proc_exit(restart_worker ? 1 : 0); > Hmm, I don't see the need for this function in the first place. We can use proc_exit in the two callers directly. -- With Regards, Amit Kapila.
Re: Making the initial and maximum DSA segment sizes configurable
Hi, On Wed, Mar 22, 2023 at 12:15 AM Masahiko Sawada wrote: > > Hi all, > > I started this new thread from another thread[1] where we're > discussing a new storage for TIDs, TidStore, since we found a > difficulty with the memory usage limit for TidStores on DSA. > > TidStore is a new data structure to efficiently store TIDs, backed by > a radix tree. In the patch series proposed on that thread, in addition > to radix tree and TidStore, there is another patch for lazy (parallel) > vacuum to replace the array of dead tuple TIDs with a TidStore. To > support parallel vacuum, radix tree (and TidStore) can be created in > local memory as well as on DSA. Also, it has memory usage limit > functionality; we can specify the memory limit (e.g., > maintenance_work_mem) to the TidStoreCreate() function. Once the total DSA > segment size (area->control->total_segment_size) exceeds the limit, > TidStoreIsFull() returns true. The lazy vacuum can continue scanning > heap blocks to collect dead tuple TIDs until TidStoreIsFull() returns > true. Currently lazy vacuum is the sole user of TidStore, but maybe it > can be used by other code such as tidbitmap.c, where it will be limited > by work_mem. > > During the development, we found out that DSA memory growth is > unpredictable, leading to inefficient memory limitation. > > DSA is built on top of DSM segments and it manages a set of DSM > segments, adding new segments as required and detaching them when they > are no longer needed. The DSA segment starts with 1MB in size and a > new segment size is at least big enough to follow a geometric series > that approximately doubles the total storage each time we create a new > segment. Because of this fact, it's not efficient to simply compare > the memory limit to the total segment size. For example, if > maintenance_work_mem is 512MB, the total segment size will be like: > > 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) = 510MB -> less than the > limit, continue heap scan.
> > 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) + 256 = 766MB -> stop (exceeds > the limit by 254MB). > > One might think we can use dsa_set_size_limit(), but we cannot; lazy > vacuum ends up with an error. If we set DSA_ALLOC_NO_OOM, we might end > up stopping the insertion halfway. > > Besides excessively allocating memory, since the initial DSM segment > size is fixed at 1MB, memory usage of a shared TidStore will start from > 1MB+. This is higher than the minimum values of both work_mem and > maintenance_work_mem, 64kB and 1MB respectively. Increasing the > minimum m_w_m to 2MB might be acceptable but not for work_mem. > > Researching possible solutions, we found that aset.c also has a > similar characteristic; it allocates an 8K block (by default) upon the > first allocation in a context, and doubles that size for each > successive block request. But we can specify the initial block size > and max blocksize. This made me think of an idea to specify both for > DSA, with both values calculated based on m_w_m. I've attached the > patch for this idea. The changes to dsa.c are straightforward since > dsa.c already uses macros DSA_INITIAL_SEGMENT_SIZE and > DSA_MAX_SEGMENT_SIZE. I just made these values configurable. > > FYI with this patch, we can create a DSA in parallel_vacuum_init() > with initial and maximum block sizes as follows: > > initial block size = min(m_w_m / 4, 1MB) > max block size = max(m_w_m / 8, 8MB) > > In most cases, we can start with a 1MB initial segment, the same as > before. For larger memory, the heap scan stops after DSA allocates > 1.25 times more memory than m_w_m. For example, if m_w_m = 512MB, the > initial and maximum segment sizes are 1MB and 64MB respectively, > and then DSA allocates the segments as follows until heap scanning > stops: > > 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64) + (64 * 4) = 510MB -> less than the > limit, continue heap scan. > > 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64) + (64 * 5) = 574MB -> stop > (allocated additional 62MB).
> > It also works with smaller memory; if the limit is 1MB, we start with > a 256KB initial segment and heap scanning stops after DSA has allocated > 1.5MB (= 256kB + 256kB + 512kB + 512kB). > > There is room for considering better formulas for initial and maximum > block sizes, but making both values configurable is a promising idea. > And the analogous behavior to aset could be a good thing for > readability and maintainability. There is another test result where I > used this idea on top of a radix tree[2]. > > We need to consider the total number of allocated DSA segments as the > total number of DSM segments available on the system is fixed[3]. But > it does not seem problematic even with this patch since we allocate only a > few additional segments (in the above examples, 17 segs vs. 19 segs). There > was also no big difference in performance[2]. > The last time I posted this email, the timing was not good since it was close to the feature freeze, and the email was very long. The tidstore and radix tree developments are still in-progress.
Re: Add a perl function in Cluster.pm to generate WAL
On Tue, Dec 19, 2023 at 9:51 AM Michael Paquier wrote: > > On Mon, Dec 18, 2023 at 08:48:09AM -0300, Euler Taveira wrote: > > It is cheaper. > > Agreed that this could just use a set of pg_logical_emit_message() > when jumping across N segments. Thanks. I missed the point of using pg_logical_emit_message() over CREATE .. DROP TABLE to generate WAL. And, I agree that it's better and relatively cheaper in terms of amount of WAL generated. > Another thing that seems quite > important to me is to force a flush of WAL with the last segment > switch, and the new "flush" option of pg_logical_emit_message() can > be very handy for this purpose. I used pg_logical_emit_message() in non-transactional mode without needing an explicit WAL flush as the pg_switch_wal() does a WAL flush at the end [1]. Attached v4 patch. [1] /* * If this was an XLOG_SWITCH record, flush the record and the empty * padding space that fills the rest of the segment, and perform * end-of-segment actions (eg, notifying archiver). */ if (class == WALINSERT_SPECIAL_SWITCH) { TRACE_POSTGRESQL_WAL_SWITCH(); XLogFlush(EndPos); -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From b7fa7545eb983aaf92e3d7e99bdf76ef42b8e40e Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Tue, 19 Dec 2023 05:49:20 + Subject: [PATCH v4] Add a TAP test function to generate WAL This commit adds a perl function in Cluster.pm to generate WAL. Some TAP tests are now using their own way to generate WAL. Generalizing this functionality enables multiple TAP tests to reuse the functionality. 
--- src/test/perl/PostgreSQL/Test/Cluster.pm | 22 + src/test/recovery/t/001_stream_rep.pl | 6 +-- src/test/recovery/t/019_replslot_limit.pl | 48 +-- .../t/035_standby_logical_decoding.pl | 7 +-- 4 files changed, 38 insertions(+), 45 deletions(-) diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm index a020377761..ad575ed6d6 100644 --- a/src/test/perl/PostgreSQL/Test/Cluster.pm +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm @@ -3178,6 +3178,28 @@ sub create_logical_slot_on_standby =pod +=item $node->advance_wal($n) + +Advance WAL of given node by $n segments + +=cut + +sub advance_wal +{ + my ($self, $n) = @_; + + # Advance by $n segments (= (wal_segment_size * $n) bytes). + for (my $i = 0; $i < $n; $i++) + { + $self->safe_psql('postgres', qq{ + SELECT pg_logical_emit_message(false, '', 'foo'); + SELECT pg_switch_wal(); + }); + } +} + +=pod + =back =cut diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl index 95f9b0d772..f0de921b4b 100644 --- a/src/test/recovery/t/001_stream_rep.pl +++ b/src/test/recovery/t/001_stream_rep.pl @@ -522,11 +522,7 @@ $node_primary->safe_psql('postgres', my $segment_removed = $node_primary->safe_psql('postgres', 'SELECT pg_walfile_name(pg_current_wal_lsn())'); chomp($segment_removed); -$node_primary->psql( - 'postgres', " - CREATE TABLE tab_phys_slot (a int); - INSERT INTO tab_phys_slot VALUES (generate_series(1,10)); - SELECT pg_switch_wal();"); +$node_primary->advance_wal(1); my $current_lsn = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();"); chomp($current_lsn); diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl index 7d94f15778..e4b75c6545 100644 --- a/src/test/recovery/t/019_replslot_limit.pl +++ b/src/test/recovery/t/019_replslot_limit.pl @@ -59,7 +59,7 @@ $result = $node_primary->safe_psql('postgres', is($result, "reserved|t", 'check the catching-up state'); # Advance WAL by five 
segments (= 5MB) on primary -advance_wal($node_primary, 1); +$node_primary->advance_wal(1); $node_primary->safe_psql('postgres', "CHECKPOINT;"); # The slot is always "safe" when fitting max_wal_size @@ -69,7 +69,7 @@ $result = $node_primary->safe_psql('postgres', is($result, "reserved|t", 'check that it is safe if WAL fits in max_wal_size'); -advance_wal($node_primary, 4); +$node_primary->advance_wal(4); $node_primary->safe_psql('postgres', "CHECKPOINT;"); # The slot is always "safe" when max_slot_wal_keep_size is not set @@ -100,7 +100,7 @@ $result = $node_primary->safe_psql('postgres', is($result, "reserved", 'check that max_slot_wal_keep_size is working'); # Advance WAL again then checkpoint, reducing remain by 2 MB. -advance_wal($node_primary, 2); +$node_primary->advance_wal(2); $node_primary->safe_psql('postgres', "CHECKPOINT;"); # The slot is still working @@ -118,7 +118,7 @@ $node_standby->stop; $result = $node_primary->safe_psql('postgres', "ALTER SYSTEM SET wal_keep_size to '8MB'; SELECT pg_reload_conf();"); # Advance WAL again then checkpoint, reducing remain by 6 MB. -advance_wal($node_primary, 6); +$node_primary->advance_wal(6); $result = $node_primary->safe_psql('postgres',
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Dec 18, 2023 at 3:41 PM John Naylor wrote: > > On Fri, Dec 15, 2023 at 3:15 PM Masahiko Sawada wrote: > > > > On Fri, Dec 15, 2023 at 10:30 AM John Naylor > > wrote: > > > > parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes, > > > - int nrequested_workers, int max_items, > > > - int elevel, BufferAccessStrategy bstrategy) > > > + int nrequested_workers, int vac_work_mem, > > > + int max_offset, int elevel, > > > + BufferAccessStrategy bstrategy) > > > > > > It seems very strange to me that this function has to pass the > > > max_offset. In general, it's been simpler to assume we have a constant > > > max_offset, but in this case that fact is not helping. Something to > > > think about for later. > > > > max_offset was previously used in old TID encoding in tidstore. Since > > tidstore has entries for each block, I think we no longer need it. > > It's needed now to properly size the allocation of TidStoreIter which > contains... > > +/* Result struct for TidStoreIterateNext */ > +typedef struct TidStoreIterResult > +{ > + BlockNumber blkno; > + int num_offsets; > + OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER]; > +} TidStoreIterResult; > > Maybe we can palloc the offset array to "almost always" big enough, > with logic to resize if needed? If not too hard, seems worth it to > avoid churn in the parameter list. Yes, I was thinking of that. > > > > v45-0010: > > > > > > Thinking about this some more, I'm not sure we need to do anything > > > different for the *starting* segment size. (Controlling *max* size > > > does seem important, however.) For the corner case of m_w_m = 1MB, > > > it's fine if vacuum quits pruning immediately after (in effect) it > > > finds the DSA has gone to 2MB. It's not worth bothering with, IMO. If > > > the memory accounting starts >1MB because we're adding the trivial > > > size of some struct, let's just stop doing that. The segment > > > allocations are what we care about. 
> > > > IIUC it's for work_mem, whose the minimum value is 64kB. > > > > > > > > v45-0011: > > > > > > + /* > > > + * max_bytes is forced to be at least 64kB, the current minimum valid > > > + * value for the work_mem GUC. > > > + */ > > > + max_bytes = Max(64 * 1024L, max_bytes); > > > > > > Why? > > > > This is to avoid creating a radix tree within very small memory. The > > minimum work_mem value is a reasonable lower bound that PostgreSQL > > uses internally. It's actually copied from tuplesort.c. > > There is no explanation for why it should be done like tuplesort.c. Also... > > - tree->leaf_ctx = SlabContextCreate(ctx, > -"radix tree leaves", > -RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), > -sizeof(RT_VALUE_TYPE)); > + tree->leaf_ctx = SlabContextCreate(ctx, > +"radix tree leaves", > +Min(RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), > +work_mem), > +sizeof(RT_VALUE_TYPE)); > > At first, my eyes skipped over this apparent re-indent, but hidden > inside here is another (undocumented) attempt to clamp the size of > something. There are too many of these sprinkled in various places, > and they're already a maintenance hazard -- a different one was left > behind in v45-0011: > > @@ -201,6 +183,7 @@ TidStoreCreate(size_t max_bytes, int max_off, > dsa_area *area) > ts->control->max_bytes = max_bytes - (70 * 1024); > } > > Let's do it in just one place. In TidStoreCreate(), do > > /* clamp max_bytes to at least the size of the empty tree with > allocated blocks, so it doesn't immediately appear full */ > ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage); > > Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that. But doesn't it mean that even if we create a shared tidstore with small memory, say 64kB, it actually uses 1MB? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Dec 18, 2023 at 11:00 PM Robert Haas wrote: > > On Mon, Dec 18, 2023 at 12:04 PM Robert Haas wrote: > > certain sense they are competing for the same job. However, they do > > aim to alleviate different TYPES of contention: the group XID update > > stuff should be most valuable when lots of processes are trying to > > update the same page, and the banks should be most valuable when there > > is simultaneous access to a bunch of different pages. So I'm not > > convinced that this patch is a reason to remove the group XID update > > mechanism, but someone might argue otherwise. > > Hmm, but, on the other hand: > > Currently all readers and writers are competing for the same LWLock. > But with this change, the readers will (mostly) no longer be competing > with the writers. So, in theory, that might reduce lock contention > enough to make the group update mechanism pointless. Thanks for your feedback. I agree that with a bank-wise lock, we might not need group updates for some of the use cases you mention, where readers and writers are contending on the centralized lock, because in most cases readers will be distributed across different banks. OTOH there are use cases where the writer commit is the bottleneck (on the SLRU lock), like pgbench simple-update or TPC-B, and then we will still benefit from the group update. During group update testing we have seen benefits with such a scenario[1] at high client counts. So, as per my understanding, by distributing the SLRU locks there are scenarios where we will not need the group update anymore, but OTOH there are also scenarios where we will still benefit from the group update. [1] https://www.postgresql.org/message-id/CAFiTN-u-XEzhd%3DhNGW586fmQwdTy6Qy6_SXe09tNB%3DgBcVzZ_A%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Synchronizing slots from primary to standby
On Tue, Dec 19, 2023 at 6:58 AM Peter Smith wrote: > > Here are some comments for the patch v49-0002. > > (This is in addition to my review comments for v48-0002 [1]) > > == > src/backend/access/transam/xlogrecovery.c > > > 1. FinishWalRecovery > > + * > + * We do not update the sync_state from READY to NONE here, as any failed > + * update could leave some slots in the 'NONE' state, causing issues during > + * slot sync after restarting the server as a standby. While updating after > + * switching to the new timeline is an option, it does not simplify the > + * handling for both READY and NONE state slots. Therefore, we retain the > + * READY state slots after promotion as they can provide useful information > + * about their origin. > + */ > > Do you know if that wording is correct? e.g., If you were updating > from READY to NONE and there was a failed update, that would leave > some slots still in a READY state, right? So why does the comment say > "could leave some slots in the 'NONE' state"? > The comment is correct because after restart the server will start as 'standby', so 'READY' marked slots are okay but the slots that we changed to 'NONE' would now appear as user-created slots which would be wrong. -- With Regards, Amit Kapila.
Re: Clang optimiser vs preproc.c
On Tue, Dec 19, 2023 at 11:42 AM Thomas Munro wrote: > Hrmph. Well something weird is going on, but it might indeed involve > me being confused about debug options of the compiler itself. How can > one find out which build options were used for clang/llvm compiler + > libraries? My earlier reports were from a little machine at home, so > let's try again on an i9-9900 CPU @ 3.10GHz (a bit snappier) running > Debian 12, again using packages from apt.llvm.org: > > 17 ~198s > 16 ~14s > 15 ~11s And on another Debian machine (this time a VM) also using apt.llvm.org packages, the huge ~3 minute time occurs with clang-16... hrrrnnnff... seems like there must be some other variable here that I haven't spotted yet...
Re: Add a perl function in Cluster.pm to generate WAL
On Mon, Dec 18, 2023 at 08:48:09AM -0300, Euler Taveira wrote: > It is cheaper. Agreed that this could just use a set of pg_logical_emit_message() when jumping across N segments. Another thing that seems quite important to me is to force a flush of WAL with the last segment switch, and the new "flush" option of pg_logical_emit_message() can be very handy for this purpose. -- Michael
Re: WAL Insertion Lock Improvements
On Tue, Jul 25, 2023 at 04:43:16PM +0900, Michael Paquier wrote: > 0001 has been now applied. I have done more tests while looking at > this patch since yesterday and was surprised to see higher TPS numbers > on HEAD with the same tests as previously, and the patch was still > shining with more than 256 clients. I found this code when searching for callers that use atomic exchanges as atomic writes with barriers (for a separate thread [0]). Can't we use pg_atomic_write_u64() here since the locking functions that follow should serve as barriers? I've attached a patch to demonstrate what I'm thinking. This might be more performant, although maybe less so after commit 64b1fb5. Am I missing something obvious here? If not, I might rerun the benchmarks to see whether it makes any difference. [0] https://www.postgresql.org/message-id/flat/20231110205128.GB1315705%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c index 315a78cda9..b43972bd2e 100644 --- a/src/backend/storage/lmgr/lwlock.c +++ b/src/backend/storage/lmgr/lwlock.c @@ -1752,10 +1752,10 @@ LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val) PRINT_LWDEBUG("LWLockUpdateVar", lock, LW_EXCLUSIVE); /* - * Note that pg_atomic_exchange_u64 is a full barrier, so we're guaranteed - * that the variable is updated before waking up waiters. + * We rely on the barrier provided by LWLockWaitListLock() to ensure the + * variable is updated before waking up waiters. */ - pg_atomic_exchange_u64(valptr, val); + pg_atomic_write_u64(valptr, val); proclist_init(&wakeup); @@ -1881,10 +1881,10 @@ void LWLockReleaseClearVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val) { /* - * Note that pg_atomic_exchange_u64 is a full barrier, so we're guaranteed - * that the variable is updated before releasing the lock. 
+ * We rely on the barrier provided by LWLockRelease() to ensure the + * variable is updated before releasing the lock. */ - pg_atomic_exchange_u64(valptr, val); + pg_atomic_write_u64(valptr, val); LWLockRelease(lock); }
Re: Improve eviction algorithm in ReorderBuffer
On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila wrote: > > On Fri, Dec 15, 2023 at 11:29 AM Masahiko Sawada > wrote: > > > > On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila > > wrote: > > > > > > On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada > > > wrote: > > > > > > > > > > IIUC, you are giving preference to multiple list ideas as compared to > > > (a) because you don't need to adjust the list each time the > > > transaction size changes, is that right? > > > > Right. > > > > > If so, I think there is a > > > cost to keep that data structure up-to-date but it can help in > > > reducing the number of times we need to serialize. > > > > Yes, there is a trade-off. > > > > What I don't want to do is to keep all transactions ordered since it's > > too costly. The proposed idea uses multiple lists to keep all > > transactions roughly ordered. The maintenance cost would be cheap > > since each list is unordered. > > > > It might be a good idea to have a threshold to switch how to pick the > > largest transaction based on the number of transactions in the > > reorderbuffer. If there are many transactions, we can use the proposed > > algorithm to find a possibly-largest transaction, otherwise use the > > current way. > > > > Yeah, that makes sense. > > > > > > > > > > > > > 1) A scenario where suppose you have one very large transaction that > > > > > is consuming ~40% of the memory and 5-6 comparatively smaller > > > > > transactions that are just above 10% of the memory limit. And now for > > > > > coming under the memory limit instead of getting 1 large transaction > > > > > evicted out, we are evicting out multiple times. > > > > > > > > Given the large transaction list will have up to 10 transactions, I > > > > think it's cheap to pick the largest transaction among them. It's O(N) > > > > but N won't be large. 
> > > > > > > > > 2) Another scenario where all the transactions are under 10% of the > > > > > memory limit but let's say there are some transactions are consuming > > > > > around 8-9% of the memory limit each but those are not very old > > > > > transactions whereas there are certain old transactions which are > > > > > fairly small and consuming under 1% of memory limit and there are many > > > > > such transactions. So how it would affect if we frequently select > > > > > many of these transactions to come under memory limit instead of > > > > > selecting a couple of large transactions which are consuming 8-9%? > > > > > > > > Yeah, probably we can do something for small transactions (i.e. small > > > > and on-memory transactions). One idea is to pick the largest > > > > transaction among them by iterating over all of them. Given that the > > > > more transactions are evicted, the less transactions the on-memory > > > > transaction list has, unlike the current algorithm, we would still > > > > win. Or we could even split it into several sub-lists in order to > > > > reduce the number of transactions to check. For example, splitting it > > > > into two lists: transactions consuming 5% < and 5% >= of the memory > > > > limit, and checking the 5% >= list preferably. > > > > > > > > > > Which memory limit are you referring to here? Is it > > > logical_decoding_work_mem? > > > > logical_decoding_work_mem. > > > > > > > > > The cost for > > > > maintaining these lists could increase, though. > > > > > > > > > > Yeah, can't we maintain a single list of all xacts that are consuming > > > equal to or greater than the memory limit? Considering that the memory > > > limit is logical_decoding_work_mem, then I think just picking one > > > transaction to serialize would be sufficient. > > > > IIUC we serialize a transaction when the sum of all transactions' > > memory usage in the reorderbuffer exceeds logical_decoding_work_mem. 
> > In what cases are multiple transactions consuming equal to or greater > > than the logical_decoding_work_mem? > > > > The individual transactions shouldn't cross > 'logical_decoding_work_mem'. I got a bit confused by your proposal to > maintain the lists: "...splitting it into two lists: transactions > consuming 5% < and 5% >= of the memory limit, and checking the 5% >= > list preferably.". In the previous sentence, what did you mean by > transactions consuming 5% >= of the memory limit? I got the impression > that you are saying to maintain them in a separate transaction list > which doesn't seems to be the case. I wanted to mean that there are three lists in total: the first one maintain the transactions consuming more than 10% of logical_decoding_work_mem, the second one maintains other transactions consuming more than or equal to 5% of logical_decoding_work_mem, and the third one maintains other transactions consuming more than 0 and less than 5% of logical_decoding_work_mem. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
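The multi-list scheme being discussed here could be modeled roughly as follows (a hypothetical Python sketch, not the actual ReorderBuffer code; the bucket thresholds and all names are illustrative):

```python
# Hypothetical sketch of the multi-list idea: transactions are kept in
# a few unordered lists, bucketed by their share of the memory limit.
# Moving a transaction between buckets when its size crosses a threshold
# is cheap because no list is kept sorted.

WORK_MEM = 64 * 1024 * 1024   # stand-in for logical_decoding_work_mem

def bucket_for(size):
    if size >= WORK_MEM // 10:    # >= 10% of the limit
        return 0
    if size >= WORK_MEM // 20:    # >= 5% of the limit
        return 1
    return 2                      # everything smaller

class ReorderBufferModel:
    def __init__(self):
        self.buckets = [[], [], []]

    def track(self, xid, size):
        self.buckets[bucket_for(size)].append((xid, size))

    def pick_victim(self):
        # Scan only the highest non-empty bucket; an O(N) scan over a
        # short, unordered list yields a possibly-largest (not exactly
        # largest) transaction to evict.
        for lst in self.buckets:
            if lst:
                victim = max(lst, key=lambda t: t[1])
                lst.remove(victim)
                return victim
        return None
```

The trade-off matches the thread: the lists keep transactions only roughly ordered, so eviction may pick a possibly-largest rather than the exactly largest transaction, in exchange for cheap maintenance when a transaction's size changes.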
Re: add non-option reordering to in-tree getopt_long
Nathan Bossart writes: > On Mon, Dec 18, 2023 at 02:41:22PM -0500, Tom Lane wrote: >> We just had a user complaint that seems to trace to exactly this >> bogus reporting in pg_ctl [1]. Although I was originally not >> very pleased with changing our getopt_long to do switch reordering, >> I'm now wondering if we should back-patch these changes as bug >> fixes. It's probably not worth the risk, but ... > I'm not too concerned about the risks of back-patching these commits, but > if this 19-year-old bug was really first reported today, I'd agree that > fixing it in the stable branches is probably not worth it. Agreed, if it actually is 19 years old. I'm wondering a little bit if there could be some moderately-recent glibc behavior change involved. I'm not excited enough about it to go trawl their change log, but we should keep our ears cocked for similar reports. regards, tom lane
Re: Proposal to add page headers to SLRU pages
On Thu, Dec 8, 2023 at 1:36 AM Li, Yong wrote: >Given so many different approaches were discussed, I have started a >wiki to record and collaborate on all efforts towards SLRU >improvements. The wiki provides a concise overview of all the ideas >discussed and can serve as a portal for all historical >discussions. Currently, the wiki summarizes four recent threads >ranging from identifier format change to page header change, to moving >SLRU into the main buffer pool, to reducing lock contention on SLRU >latches. We can keep the patch-related discussions in this thread and >use the wiki as a live document for larger-scale collaborations. >The wiki page is >here: https://wiki.postgresql.org/wiki/SLRU_improvements >Regarding the benefits of this patch, here is a detailed explanation: 1. A checksum is added to each page, allowing us to verify whether a page has been corrupted when read from the disk. 2. The ad-hoc LSN group structure is removed from the SLRU cache control data and is replaced by the page LSN in the page header. This allows us to use the same WAL protocol as used by pages in the main buffer pool: flush all redo logs up to the page LSN before flushing the page itself. If we move SLRU caches into the main buffer pool, this change fits naturally. 3. It leaves further optimizations open. We can continue to pursue the goal of moving SLRU into the main buffer pool, or we can follow the lock partition idea. This change by itself does not conflict with either proposal. >Also, the patch is now complete and is ready for review. All check-world tests including TAP tests passed with this patch. Hi Yong, I agree we should break the effort for the SLRU optimization into smaller chunks after having worked on some of the bigger patches and faced difficulty in making progress that way. The patch looks mostly good to me, though one thing that I thought about differently with the upgrade portion is where we should keep the logic of re-writing the CLOG files.
There is a precedent, introduced back in Postgres v9.6, for making on-disk page format changes across different versions in the visibility map: [1] code comment: * In versions of PostgreSQL prior to catversion 201603011, PostgreSQL's * visibility map included one bit per heap page; it now includes two. * When upgrading a cluster from before that time to a current PostgreSQL * version, we could refuse to copy visibility maps from the old cluster * to the new cluster; the next VACUUM would recreate them, but at the * price of scanning the entire table. So, instead, we rewrite the old * visibility maps in the new format. This work is being done in file.c – it seems to me the proper way to proceed would be to continue writing the on-disk upgrade logic here. Besides that, this looks good to me; I would like to hear what others have to say. Thanks, Rishu Bagga Amazon Web Services (AWS) [1] https://github.com/postgres/postgres/commit/7087166a88fe0c04fc6636d0d6d6bea1737fc1fb
Re: Synchronizing slots from primary to standby
Here are some comments for the patch v49-0002. (This is in addition to my review comments for v48-0002 [1]) == src/backend/access/transam/xlogrecovery.c 1. FinishWalRecovery + * + * We do not update the sync_state from READY to NONE here, as any failed + * update could leave some slots in the 'NONE' state, causing issues during + * slot sync after restarting the server as a standby. While updating after + * switching to the new timeline is an option, it does not simplify the + * handling for both READY and NONE state slots. Therefore, we retain the + * READY state slots after promotion as they can provide useful information + * about their origin. + */ Do you know if that wording is correct? e.g., If you were updating from READY to NONE and there was a failed update, that would leave some slots still in a READY state, right? So why does the comment say "could leave some slots in the 'NONE' state"? == src/backend/replication/slot.c 2. ReplicationSlotAlter + /* + * Do not allow users to drop the slots which are currently being synced + * from the primary to the standby. + */ + if (RecoveryInProgress() && + MyReplicationSlot->data.sync_state != SYNCSLOT_STATE_NONE) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot alter replication slot \"%s\"", name), + errdetail("This slot is being synced from the primary server."))); + The comment looks wrong -- should say "Do not allow users to alter..." == 3. +## +# Test that synchronized slot can neither be decoded nor dropped by the user +## + 3a, /Test that synchronized slot/Test that a synchronized slot/ 3b. Isn't there a missing test? Should this part also check that it cannot ALTER the replication slot being synced? e.g. test for the new v49 error message that was added in ReplicationSlotAlter() ~~~ 4. 
+# Disable hot_standby_feedback +$standby1->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off;'); +$standby1->restart; + Can there be a comment added to explain why you are doing the 'hot_standby_feedback' toggle? ~~~ 5. +## +# Promote the standby1 to primary. Confirm that: +# a) the sync-ready('r') slot 'lsub1_slot' is retained on the new primary +# b) the initiated('i') slot 'logical_slot' is dropped on promotion +# c) logical replication for regress_mysub1 is resumed succesfully after failover +## /succesfully/successfully/ ~~~ 6. + +# Confirm that data in tab_mypub3 replicated on subscriber +is( $subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}), + "$primary_row_count", + 'data replicated from the new primary'); The comment is wrong -- it names a different table ('tab_mypub3' ?) to what the SQL says. == [1] My v48-0002 review comments. https://www.postgresql.org/message-id/CAHut%2BPsyZQZ1A4XcKw-D%3DvcTg16pN9Dw0PzE8W_X7Yz_bv00rQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Re: Transaction timeout
On Mon, 18 Dec 2023 at 17:40, Andrey M. Borodin wrote: >> On 18 Dec 2023, at 14:32, Japin Li wrote: >> >> >> Thanks for updating the patch > > Sorry for the noise, but commitfest bot found one more bug in handling > statement timeout. PFA v11. > On Windows, there is still an error: diff -w -U3 C:/cirrus/src/test/isolation/expected/timeouts.out C:/cirrus/build/testrun/isolation/isolation/results/timeouts.out --- C:/cirrus/src/test/isolation/expected/timeouts.out 2023-12-18 10:22:21.772537200 + +++ C:/cirrus/build/testrun/isolation/isolation/results/timeouts.out 2023-12-18 10:26:08.039831800 + @@ -103,24 +103,7 @@ 0 (1 row) -step stt2_check: SELECT 1; -FATAL: terminating connection due to transaction timeout -server closed the connection unexpectedly +PQconsumeInput failed: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. -step itt4_set: SET idle_in_transaction_session_timeout = '1ms'; SET statement_timeout = '10s'; SET lock_timeout = '10s'; SET transaction_timeout = '10s'; -step itt4_begin: BEGIN ISOLATION LEVEL READ COMMITTED; -step sleep_there: SELECT pg_sleep(0.01); -pg_sleep - - -(1 row) - -step stt3_check_itt4: SELECT count(*) FROM pg_stat_activity WHERE application_name = 'isolation/timeouts/itt4' -step stt3_check_itt4: <... completed> -count -- -0 -(1 row) - -- Regards, Japin Li ChengDu WenWu Information Technology Co., Ltd.
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On Mon, Dec 18, 2023 at 4:41 PM jian he wrote: > > On Mon, Dec 18, 2023 at 1:09 PM torikoshia wrote: > > > > Hi, > > > > > save the error metadata to system catalogs would be more expensive, > > > please see below explanation. > > > I have no knowledge of publications. > > > but i feel there is a feature request: publication FOR ALL TABLES > > > exclude regex_pattern. > > > Anyway, that would be another topic. > > > > I think saving error metadata to system catalog is not a good idea, too. > > And I believe Sawada-san just pointed out missing features and did not > > suggested that we use system catalog. > > > > > I don't think "specify the maximum number of errors to tolerate > > > before raising an ERROR." is very useful > > > > That may be so. > > I imagine it's useful in some use case since some loading tools have > > such options. > > Anyway I agree it's not necessary for initial patch as mentioned in [1]. > > > > > I suppose we can specify an ERRORFILE directory. similar > > > implementation [2], demo in [3] > > > it will generate 2 files, one file shows the malform line content as > > > is, another file shows the error info. > > > > That may be a good option when considering "(2) logging errors to > > somewhere". > > > > What do you think about the proposal to develop these features in > > incrementally? > > > > I am more with tom's idea [1], that is when errors happen (data type > conversion only), do not fail, AND we save the error to a table. I > guess we can implement this logic together, only with a new COPY > option. If we want only such a feature we need to implement it together (the patch could be split, though). But if some parts of the feature are useful for users as well, I'd recommend implementing it incrementally. That way, the patches can get small and it would be easy for reviewers and committers to review/commit them. 
> > imagine a case (it's not that contrived, imho), while conversion from > text to table's int, postgres isspace is different from the source > text file's isspace logic. > then all the lines are malformed. If we just say on error continue and > not save error meta info, the user is still confused which field has > the wrong data, then the user will probably try to incrementally test > which field contains malformed data. > > Since we need to save the error somewhere. > Everyone has the privilege to INSERT can do COPY. > I think we also need to handle the access privilege also. > So like I mentioned above, one copy_error error table hub, then > everyone can view/select their own copy failure record. The error table hub idea is still unclear to me. I assume that there are error tables at least on each database. And an error table can have error data that happened during COPY FROM, including malformed lines. Do the error tables grow without bounds and the users have to delete rows at some point? If so, who can do that? How can we achieve that the users can see only errored rows they generated? And the issue with logical replication also needs to be resolved. Anyway, if we go this direction, we need to discuss the overall design. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
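The incremental first step discussed in this thread — tolerate conversion errors and report only a summary, deferring the question of where to persist error details — can be modeled roughly like this (a hypothetical sketch, not the proposed patch):

```python
# Hypothetical sketch of step one: skip rows whose fields fail data-type
# conversion, keep loading, and report only a summary at the end
# (roughly what a NOTICE with the number of failed rows would convey).

def load_lines(lines, converters):
    """converters holds one conversion function per column; each raises
    ValueError on malformed input, like int() on non-numeric text."""
    loaded = []
    errors = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        try:
            loaded.append([conv(f) for conv, f in zip(converters, fields)])
        except ValueError:
            errors += 1    # on error, continue; details go nowhere yet
    if errors:
        print(f"NOTICE: {errors} row(s) could not be loaded")
    return loaded
```

As the thread notes, this "/dev/null" variant does not answer which field was malformed — that is exactly the gap the error-table (or server-log) destination in step two is meant to fill.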
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On Mon, Dec 18, 2023 at 9:16 AM jian he wrote: > > On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada wrote: > > > > Hi, > > > > I've read this thread and the latest patch. IIUC with SAVE_ERROR > > option, COPY FROM creates an error table for the target table and > > writes error information there. > > > > While I agree that the final shape of this feature would be something > > like that design, I'm concerned some features are missing in order to > > make this feature useful in practice. For instance, error logs are > > inserted to error tables without bounds, meaning that users who want > > to tolerate errors during COPY FROM will have to truncate or drop the > > error tables periodically, or the database will grow with error logs > > without limit. Ideally such maintenance work should be done by the > > database. There might be some users who want to log such conversion > > errors in server logs to avoid such maintenance work. I think we > > should provide an option for where to write, at least. Also, since the > > error tables are normal user tables internally, error logs are also > > replicated to subscribers if there is a publication FOR ALL TABLES, > > unlike system catalogs. I think some users would not like such > > behavior. > > save the error metadata to system catalogs would be more expensive, > please see below explanation. > I have no knowledge of publications. > but i feel there is a feature request: publication FOR ALL TABLES > exclude regex_pattern. > Anyway, that would be another topic. I don't think the new regex idea would be a good solution for the existing users who are using FOR ALL TABLES publication. It's not desirable that they have to change the publication because of this feature. With the current patch, a logical replication using FOR ALL TABLES publication will stop immediately after an error information is inserted into a new error table unless the same error table is created on subscribers. 
> > > Looking at SAVE_ERROR feature closely, I think it consists of two > > separate features. That is, it enables COPY FROM to load data while > > (1) tolerating errors and (2) logging errors to somewhere (i.e., an > > error table). If we implement only (1), it would be like COPY FROM > > tolerate errors infinitely and log errors to /dev/null. The user > > cannot see the error details but I guess it could still help some > > cases as Andres mentioned[1] (it might be a good idea to send the > > number of rows successfully loaded in a NOTICE message if some rows > > could not be loaded). Then with (2), COPY FROM can log error > > information to somewhere such as tables and server logs and the user > > can select it. So I'm thinking we may be able to implement this > > feature incrementally. The first step would be something like an > > option to ignore all errors or an option to specify the maximum number > > of errors to tolerate before raising an ERROR. The second step would > > be to support logging destinations such as server logs and tables. > > > > Regards, > > > > [1] > > https://www.postgresql.org/message-id/20231109002600.fuihn34bjqqgmbjm%40awork3.anarazel.de > > > > -- > > Masahiko Sawada > > Amazon Web Services: https://aws.amazon.com > > > feature incrementally. The first step would be something like an > > option to ignore all errors or an option to specify the maximum number > > of errors to tolerate before raising an ERROR. The second step would > > I don't think "specify the maximum number of errors to tolerate > before raising an ERROR." is very useful > > QUOTE from [1] > MAXERROR [AS] error_count > If the load returns the error_count number of errors or greater, the > load fails. If the load returns fewer errors, it continues and returns > an INFO message that states the number of rows that could not be > loaded. 
Use this parameter to allow loads to continue when certain > rows fail to load into the table because of formatting errors or other > inconsistencies in the data. > Set this value to 0 or 1 if you want the load to fail as soon as the > first error occurs. The AS keyword is optional. The MAXERROR default > value is 0 and the limit is 10. > The actual number of errors reported might be greater than the > specified MAXERROR because of the parallel nature of Amazon Redshift. > If any node in the Amazon Redshift cluster detects that MAXERROR has > been exceeded, each node reports all of the errors it has encountered. > END OF QUOTE > > option MAXERROR error_count. iiuc, it fails while validating line > error_count + 1, else it raises a notice, tells you how many rows have > errors. > > * case when error_count is small, and the copy fails, it only tells > you that at least the error_count line has malformed data. but what if > the actual malformed rows are very big. In this case, this failure > error message is not that helpful. > * case when error_count is very big, and the copy does not fail. then > the actual malformed data rows are very big (still les
Re: Synchronizing slots from primary to standby
Here are some review comments for v48-0002 == doc/src/sgml/config.sgml 1. + If slot synchronization is enabled then it is also necessary to + specify dbname in the + primary_conninfo string. This will only be used for + slot synchronization. It is ignored for streaming. I felt the "If slot synchronization is enabled" part should also include an xref to the enable_slotsync GUC, otherwise there is no information here about how to enable it. SUGGESTION If slot synchronization is enabled (see XXX) == doc/src/sgml/logicaldecoding.sgml 2. + + The ability to resume logical replication after failover depends upon the + pg_replication_slots.sync_state + value for the synchronized slots on the standby at the time of failover. + Only slots that have attained "ready" sync_state ('r') on the standby + before failover can be used for logical replication after failover. Slots + that have not yet reached 'r' state (they are still 'i') will be dropped, + therefore logical replication for those slots cannot be resumed. For + example, if the synchronized slot could not become sync-ready on standby + due to a disabled subscription, then the subscription cannot be resumed + after failover even when it is enabled. + If the primary is idle, then the synchronized slots on the standby may + take a noticeable time to reach the ready ('r') sync_state. This can + be sped up by calling the + pg_log_standby_snapshot function on the primary. + 2a. /sync-ready on standby/sync-ready on the standby/ ~ 2b. Should "If the primary is idle" be in a new paragraph? == doc/src/sgml/system-views.sgml 3. + + The hot standby can have any of these sync_state values for the slots but + on a hot standby, the slots with state 'r' and 'i' can neither be used + for logical decoding nor dropped by the user. + The sync_state has no meaning on the primary server; the primary + sync_state value is default 'n' for all slots but may (if leftover + from a promoted standby) also be 'r'. 
+ I still feel we are exposing too much useless information about the primary server values. Isn't it sufficient to just say "The sync_state values have no meaning on a primary server.", and not bother to mention what those meaningless values might be -- e.g. if they are meaningless then who cares what they are or how they got there? == src/backend/replication/logical/slotsync.c 4. synchronize_one_slot + /* Slot ready for sync, so sync it. */ + if (sync_state == SYNCSLOT_STATE_READY) + { + /* + * Sanity check: With hot_standby_feedback enabled and + * invalidations handled appropriately as above, this should never + * happen. + */ + if (remote_slot->restart_lsn < slot->data.restart_lsn) + elog(ERROR, + "not synchronizing local slot \"%s\" LSN(%X/%X)" + " to remote slot's LSN(%X/%X) as synchronization " + " would move it backwards", remote_slot->name, + LSN_FORMAT_ARGS(slot->data.restart_lsn), + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); + + if (remote_slot->confirmed_lsn != slot->data.confirmed_flush || + remote_slot->restart_lsn != slot->data.restart_lsn || + remote_slot->catalog_xmin != slot->data.catalog_xmin) + { + /* Update LSN of slot to remote slot's current position */ + local_slot_update(remote_slot); + ReplicationSlotSave(); + slot_updated = true; + } + } + /* Slot not ready yet, let's attempt to make it sync-ready now. */ + else if (sync_state == SYNCSLOT_STATE_INITIATED) + { + /* + * Wait for the primary server to catch-up. Refer to the comment + * atop the file for details on this wait. + */ + if (remote_slot->restart_lsn < slot->data.restart_lsn || + TransactionIdPrecedes(remote_slot->catalog_xmin, + slot->data.catalog_xmin)) + { + if (!wait_for_primary_slot_catchup(wrconn, remote_slot, NULL)) + { + ReplicationSlotRelease(); + return false; + } + } + + /* + * Wait for primary is over, update the lsns and mark the slot as + * READY for further syncs. 
+ */ + local_slot_update(remote_slot); + SpinLockAcquire(&slot->mutex); + slot->data.sync_state = SYNCSLOT_STATE_READY; + SpinLockRelease(&slot->mutex); + + /* Save the changes */ + ReplicationSlotMarkDirty(); + ReplicationSlotSave(); + slot_updated = true; + + ereport(LOG, + errmsg("newly locally created slot \"%s\" is sync-ready now", +remote_slot->name)); + } 4a. It would be more natural in the code if you do the SYNCSLOT_STATE_INITIATED logic before the SYNCSLOT_STATE_READY because that is the order those states come in. ~ 4b. I'm not sure if it is worth it, but I was thinking that some duplicate code can be avoided by doing if/if instead of if/else if (sync_state == SYNCSLOT_STATE_INITIATED) { .. } if (sync_state == SYNCSLOT_STATE_READY) { } By arranging it this way maybe the SYNCSLOT_STATE_INITIATED code block doesn't need to do anything except update the sync_state = SYNCSLOT_STATE_READY; Then it can just fall through to the S
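A toy model of the fall-through arrangement suggested in comment 4b (hypothetical Python, with the wait-for-primary step elided) might look like:

```python
# Toy model (hypothetical) of the if/if arrangement from 4b: the
# INITIATED branch only promotes the slot to READY, then execution
# falls through into the READY branch, which does the shared
# update-and-save work, avoiding duplicated code between branches.
INITIATED, READY = "i", "r"

def synchronize_one_slot(slot, remote):
    updated = False
    if slot["sync_state"] == INITIATED:
        # (wait_for_primary_slot_catchup would happen here)
        slot["sync_state"] = READY
        updated = True
    if slot["sync_state"] == READY:
        if slot["lsn"] != remote["lsn"]:
            slot["lsn"] = remote["lsn"]   # local_slot_update + save
            updated = True
    return updated
```

This only illustrates the control-flow shape being proposed; the real function also has to handle locking, sanity checks, and the early-return path when the primary has not caught up.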
Re: Clang optimiser vs preproc.c
On Sun, Dec 17, 2023 at 1:29 AM Andres Freund wrote: > On 2023-12-15 22:19:56 -0500, Tom Lane wrote: > > Thomas Munro writes: > > > On Sat, Dec 16, 2023 at 3:44 PM Tom Lane wrote: > > >> Thomas Munro writes: > > >>> FYI, it looks like there is a big jump in CPU time to compile preproc.c > > >>> at -O2: > > >>> clang15: ~16s > > >>> clang16: ~211s > > >>> clang17: ~233s > > Is this with non-assert clangs? Because I see times that seem smaller by more > than what can be explained by hardware differences: > > preproc.c: > 17 10.270s > 16 9.685s > 15 8.300s > > gram.c: > 17 1.936s > 16 2.131s > 15 2.161s > > That's still bad, but a far cry away from 233s. Hrmph. Well something weird is going on, but it might indeed involve me being confused about debug options of the compiler itself. How can one find out which build options were used for clang/llvm compiler + libraries? My earlier reports were from a little machine at home, so let's try again on an i9-9900 CPU @ 3.10GHz (a bit snappier) running Debian 12, again using packages from apt.llvm.org:

17 ~198s
16 ~14s
15 ~11s

OK so even if we ignore the wild outlier it is getting significantly slower. But... huh, there goes the big jump, but at a different version than I saw with FBSD's packages. Here's what perf says it's doing:

+   99.42%  20.12%  clang-17  libLLVM-17.so.1  [.] llvm::slpvectorizer::BoUpSLP::getTreeCost
+   96.91%   0.00%  clang-17  libLLVM-17.so.1  [.] llvm::SLPVectorizerPass::runImpl
+   96.91%   0.00%  clang-17  libLLVM-17.so.1  [.] llvm::SLPVectorizerPass::vectorizeChainsInBlock
+   96.91%   0.00%  clang-17  libLLVM-17.so.1  [.] llvm::SLPVectorizerPass::vectorizeSimpleInstructions
+   96.91%   0.00%  clang-17  libLLVM-17.so.1  [.] llvm::SLPVectorizerPass::vectorizeInsertElementInst
+   96.91%   0.00%  clang-17  libLLVM-17.so.1  [.] llvm::SLPVectorizerPass::tryToVectorizeList
+   73.79%   0.00%  clang-17  libLLVM-17.so.1  [.] 0x7fbead445cb0
+   36.88%  36.88%  clang-17  libLLVM-17.so.1  [.] 0x01e45cda
+    3.95%   3.95%  clang-17  libLLVM-17.so.1  [.] 0x01e45d11
Re: add non-option reordering to in-tree getopt_long
On Mon, Dec 18, 2023 at 02:41:22PM -0500, Tom Lane wrote: > We just had a user complaint that seems to trace to exactly this > bogus reporting in pg_ctl [1]. Although I was originally not > very pleased with changing our getopt_long to do switch reordering, > I'm now wondering if we should back-patch these changes as bug > fixes. It's probably not worth the risk, but ... I'm not too concerned about the risks of back-patching these commits, but if this 19-year-old bug was really first reported today, I'd agree that fixing it in the stable branches is probably not worth it. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Add --check option to pgindent
On Mon Dec 18, 2023 at 11:21 AM CST, Jelte Fennema-Nio wrote: On Mon, 18 Dec 2023 at 17:50, Tristan Partin wrote: > I could propose something. It would help if I had an example of such > a tool that already exists. Basically the same behaviour as what you're trying to add now for --check, only instead of printing the diff it would actually change the files just like running pgindent without a --check flag does. i.e. allow pgindent --check to not run in "dry-run" mode My pre-commit hook looks like this currently (removed boring cruft around it):

if ! src/tools/pgindent/pgindent --check $files > /dev/null; then
    exit 0
fi
echo "Running pgindent on changed files"
src/tools/pgindent/pgindent $files
echo "Commit abandoned. Rerun git commit to adopt pgindent changes"
exit 1

But I would like it to look like:

if src/tools/pgindent/pgindent --check --write $files > /dev/null; then
    exit 0
fi
echo "Commit abandoned. Rerun git commit to adopt pgindent changes"
exit 1

To me, the two options seem at odds, like you either check or write. How would you feel about just capturing the diffs that are printed and patch(1)-ing them? -- Tristan Partin Neon (https://neon.tech)
Re: Add --check option to pgindent
On Mon Dec 18, 2023 at 10:50 AM CST, Tristan Partin wrote: On Mon Dec 18, 2023 at 10:14 AM CST, Jelte Fennema-Nio wrote: > On Mon, 18 Dec 2023 at 13:42, Daniel Gustafsson wrote: > > I think this is pretty much ready to go, the attached v4 squashes the changes > > and fixes the man-page which also needed an update. The referenced Wiki page > > will need an edit or two after this goes in, but that's easy enough. > > One thing I'm wondering: When both --check and --diff are passed, > should pgindent still early exit with 2 on the first incorrectly > formatted file? Or should it show diffs for all failing files? I'm > leaning towards the latter making more sense. Makes sense. Let me iterate on the squashed version of the patch. Here is an additional patch which implements the behavior you described. The first patch is just Daniel's squashed version of my patches. -- Tristan Partin Neon (https://neon.tech) From 5cd4c5e9af0fc6e2e385b79111a6cca66df6cad9 Mon Sep 17 00:00:00 2001 From: Daniel Gustafsson Date: Mon, 18 Dec 2023 13:22:12 +0100 Subject: [PATCH v5 1/2] Rename non-destructive modes in pgindent This renames --silent-diff to --check and --show-diff to --diff, in order to make the options a little bit more self-explanatory for developers used to similar formatters/linters. --check and --diff are also allowed to be combined. 
Author: Tristan Partin Discussion: https://postgr.es/m/cxlx2xyth9s6.140sc6y61v...@neon.tech --- src/tools/pgindent/pgindent | 35 + src/tools/pgindent/pgindent.man | 12 +-- 2 files changed, 24 insertions(+), 23 deletions(-) diff --git a/src/tools/pgindent/pgindent b/src/tools/pgindent/pgindent index bce63d95da..eb2f52f4b9 100755 --- a/src/tools/pgindent/pgindent +++ b/src/tools/pgindent/pgindent @@ -22,8 +22,8 @@ my $indent_opts = my $devnull = File::Spec->devnull; my ($typedefs_file, $typedef_str, @excludes, - $indent, $build, $show_diff, - $silent_diff, $help, @commits,); + $indent, $build, $diff, + $check, $help, @commits,); $help = 0; @@ -34,15 +34,12 @@ my %options = ( "list-of-typedefs=s" => \$typedef_str, "excludes=s" => \@excludes, "indent=s" => \$indent, - "show-diff" => \$show_diff, - "silent-diff" => \$silent_diff,); + "diff" => \$diff, + "check" => \$check,); GetOptions(%options) || usage("bad command line argument"); usage() if $help; -usage("Cannot have both --silent-diff and --show-diff") - if $silent_diff && $show_diff; - usage("Cannot use --commit with command line file list") if (@commits && @ARGV); @@ -294,7 +291,7 @@ sub run_indent return $source; } -sub show_diff +sub diff { my $indented = shift; my $source_filename = shift; @@ -323,8 +320,8 @@ Options: --list-of-typedefs=STR string containing typedefs, space separated --excludes=PATH file containing list of filename patterns to ignore --indent=PATH path to pg_bsd_indent program - --show-diff show the changes that would be made - --silent-diff exit with status 2 if any changes would be made + --diff show the changes that would be made + --check exit with status 2 if any changes would be made The --excludes and --commit options can be given more than once. 
EOF if ($help) @@ -417,17 +414,21 @@ foreach my $source_filename (@files) if ($source ne $orig_source) { - if ($silent_diff) - { - exit 2; - } - elsif ($show_diff) + if (!$diff && !$check) { - print show_diff($source, $source_filename); + write_source($source, $source_filename); } else { - write_source($source, $source_filename); + if ($diff) + { +print diff($source, $source_filename); + } + + if ($check) + { +exit 2; + } } } } diff --git a/src/tools/pgindent/pgindent.man b/src/tools/pgindent/pgindent.man index fe411ee699..caab5cde91 100644 --- a/src/tools/pgindent/pgindent.man +++ b/src/tools/pgindent/pgindent.man @@ -31,13 +31,13 @@ find the file src/tools/pgindent/exclude_file_patterns. The --excludes option can be used more than once to specify multiple files containing exclusion patterns. -There are also two non-destructive modes of pgindent. If given the --show-diff +There are also two non-destructive modes of pgindent. If given the --diff option pgindent will show the changes it would make, but doesn't actually make -them. If given instead the --silent-diff option, pgindent will exit with a -status of 2 if it finds any indent changes are required, but will not -make the changes or give any other information. This mode is intended for -possible use in a git pre-commit hook. An example of its use in a git hook -can be seen at https://wiki.postgresql.org/wiki/Working_with_Git#Using_git_hooks +them. If given instead the --check option, pgindent will exit with a status of +2 if it finds any indent changes are required, but will not make the changes. +This mode is intended for possible use in a git pre-commit hook. The --
Re: index prefetching
On Sat, Dec 9, 2023 at 1:08 PM Tomas Vondra wrote: > But there's a layering problem that I don't know how to solve - I don't > see how we could make indexam.c entirely oblivious to the prefetching, > and move it entirely to the executor. Because how else would you know > what to prefetch? Yeah, that seems impossible. Some thoughts: * I think perhaps the subject line of this thread is misleading. It doesn't seem like there is any index prefetching going on here at all, and there couldn't be, unless you extended the index AM API with new methods. What you're actually doing is prefetching heap pages that will be needed by a scan of the index. I think this confusing naming has propagated itself into some parts of the patch, e.g. index_prefetch() reads *from the heap* which is not at all clear from the comment saying "Prefetch the TID, unless it's sequential or recently prefetched." You're not prefetching the TID: you're prefetching the heap tuple to which the TID points. That's not an academic distinction IMHO -- the TID would be stored in the index, so if we were prefetching the TID, we'd have to be reading index pages, not heap pages. * Regarding layering, my first thought was that the changes to index_getnext_tid() and index_getnext_slot() are sensible: read ahead by some number of TIDs, keep the TIDs you've fetched in an array someplace, use that to drive prefetching of blocks on disk, and return the previously-read TIDs from the queue without letting the caller know that the queue exists. I think that's the obvious design for a feature of this type, to the point where I don't really see that there's a viable alternative design. Driving something down into the individual index AMs would make sense if you wanted to prefetch *from the indexes*, but it's unnecessary otherwise, and best avoided. * But that said, the skip_all_visible flag passed down to index_prefetch() looks like a VERY strong sign that the layering here is not what it should be. 
Right now, when some code calls index_getnext_tid(), that function does not need to know or care whether the caller is going to fetch the heap tuple or not. But with this patch, the code does need to care. So knowledge of the executor concept of an index-only scan trickles down into indexam.c, which now has to be able to make decisions that are consistent with the ones that the executor will make. That doesn't seem good at all. * I think it might make sense to have two different prefetching schemes. Ideally they could share some structure. If a caller is using index_getnext_slot(), then it's easy for prefetching to be fully transparent. The caller can just ask for TIDs and the prefetching distance and TID queue can be fully under the control of something that is hidden from the caller. But when using index_getnext_tid(), the caller needs to have an opportunity to evaluate each TID and decide whether we even want the heap tuple. If yes, then we feed that TID to the prefetcher; if no, we don't. That way, we're not replicating executor logic in lower-level code. However, that also means that the IOS logic needs to be aware that this TID queue exists and interact with whatever controls the prefetch distance. Perhaps after calling index_getnext_tid() you call index_prefetcher_put_tid(prefetcher, tid, bool fetch_heap_tuple) and then you call index_prefetcher_get_tid() to drain the queue. Perhaps also the prefetcher has a "fill" callback that gets invoked when the TID queue isn't as full as the prefetcher wants it to be. Then index_getnext_slot() can just install a trivial fill callback that says index_prefetecher_put_tid(prefetcher, index_getnext_tid(...), true), but IOS can use a more sophisticated callback that checks the VM to determine what to pass for the third argument. 
* I realize that I'm being a little inconsistent in what I just said, because in the first bullet point I said that this wasn't really index prefetching, and now I'm proposing function names that still start with index_prefetch. It's not entirely clear to me what the best thing to do about the terminology is here -- could it be a heap prefetcher, or a TID prefetcher, or an index scan prefetcher? I don't really know, but whatever we can do to make the naming more clear seems like a really good idea. Maybe there should be a clearer separation between the queue of TIDs that we're going to return from the index and the queue of blocks that we want to prefetch to get the corresponding heap tuples -- making that separation crisper might ease some of the naming issues. * Not that I want to be critical because I think this is a great start on an important project, but it does look like there's an awful lot of stuff here that still needs to be sorted out before it would be reasonable to think of committing this, both in terms of design decisions and just general polish. There's a lot of stuff marked with XXX and I think that's great because most of those seem to be good questions but that does leave th
Re: encoding affects ICU regex character classification
On Fri, 2023-12-15 at 16:48 -0800, Jeremy Schneider wrote: > This goes back to my other thread (which sadly got very little > discussion): PostgreSQL really needs to be safe by /default/ Doesn't a built-in provider help create a safer option? The built-in provider's version of Unicode will be consistent with unicode_assigned(), which is a first step toward rejecting code points that the provider doesn't understand. And by rejecting unassigned code points, we get all kinds of Unicode compatibility guarantees that avoid the kinds of change risks that you are worried about. Regards, Jeff Davis
Fixing backslash dot for COPY FROM...CSV
Hi, PFA a patch that attempts to fix the bug that \. on a line by itself is handled incorrectly by COPY FROM ... CSV. This issue has been discussed several times previously, for instance in [1] and [2], and mentioned in the doc for \copy in commit 42d3125. There's one case that works today: when the line is part of a multi-line quoted section, and the data is read from a file, not from the client. In other situations, an error is raised or the data is cut at the point of \. without an error. The patch addresses that issue in the server and in psql, except for the case of inlined data, where \. cannot be both valid data and an EOF marker at the same time, so it keeps treating it as an EOF marker. [1] https://www.postgresql.org/message-id/10e3eff6-eb04-4b3f-aeb4-b920192b9...@manitou-mail.org [2] https://www.postgresql.org/message-id/8aeab305-5e94-4fa5-82bf-6da6baee6e05%40app.fastmail.com Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite From 868b1e065cf714f315bab65129fd05a1d60fc81b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= Date: Mon, 18 Dec 2023 20:47:02 +0100 Subject: [PATCH v1] Support backslash-dot on a line by itself as valid data in COPY FROM...CSV --- doc/src/sgml/ref/psql-ref.sgml | 4 src/backend/commands/copyfromparse.c | 13 ++- src/bin/psql/copy.c | 32 src/test/regress/expected/copy.out | 15 + src/test/regress/sql/copy.sql| 11 ++ 5 files changed, 51 insertions(+), 24 deletions(-) diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index cc7d797159..d94e3cacfc 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 'second value' \g destination, because all data must pass through the client/server connection. For large amounts of data the SQL command might be preferable. -Also, because of this pass-through method, \copy -... from in CSV mode will erroneously -treat a \. 
data value alone on a line as an -end-of-input marker. diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c index f553734582..b4c291b25b 100644 --- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c @@ -1137,7 +1137,6 @@ CopyReadLineText(CopyFromState cstate) boolresult = false; /* CSV variables */ - boolfirst_char_in_line = true; boolin_quote = false, last_was_esc = false; charquotec = '\0'; @@ -1335,10 +1334,9 @@ CopyReadLineText(CopyFromState cstate) } /* -* In CSV mode, we only recognize \. alone on a line. This is because -* \. is a valid CSV data value. +* In CSV mode, \. is a valid CSV data value anywhere in the line. */ - if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line)) + if (c == '\\' && !cstate->opts.csv_mode) { charc2; @@ -1442,14 +1440,7 @@ CopyReadLineText(CopyFromState cstate) } } - /* -* This label is for CSV cases where \. appears at the start of a -* line, but there is more text after it, meaning it was a data value. -* We are more strict for \. in CSV mode because \. could be a data -* value, while in non-CSV mode, \. cannot be a data value. -*/ not_end_of_copy: - first_char_in_line = false; } /* end of outer loop */ /* diff --git a/src/bin/psql/copy.c b/src/bin/psql/copy.c index dbbbdb8898..f31a7acbb6 100644 --- a/src/bin/psql/copy.c +++ b/src/bin/psql/copy.c @@ -620,20 +620,34 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res) /* current line is done? */ if (buf[buflen - 1] == '\n') { - /* check for EOF marker, but not on a partial line */ - if (at_line_begin) + /* +* When at the beginning of the line, check for EOF marker. +* If the marker is found and the data is inlined, +* we must stop at this point. If not, the \. line can be +* sent to the server, and we let it decide whether +* it's an E
Re: Built-in CTYPE provider
On Fri, 2023-12-15 at 16:30 -0800, Jeremy Schneider wrote: > Looking closer, patches 3 and 4 look like an incremental extension of > this earlier idea; Yes, it's essentially the same thing extended to a few more files. I don't know if "incremental" is the right word though; this is a substantial extension of the idea. > the perl scripts download data from unicode.org and > we've specifically defined Unicode version 15.1 and the scripts turn > the > data tables inside-out into C data structures optimized for lookup. > That > C code is then checked in to the PostgreSQL source code files > unicode_category.h and unicode_case_table.h - right? Yes. The standard build process shouldn't be downloading files, so the static tables are checked in. Also, seeing the diffs of the static tables improves the visibility of changes in case there's some mistake or big surprise. > Am I reading correctly that these two patches add C functions > pg_u_prop_* and pg_u_is* (patch 3) and unicode_*case (patch 4) but we > don't yet reference these functions anywhere? So this is just getting > some plumbing in place? Correct. Perhaps I should combine these into the builtin provider thread, but these are independently testable and reviewable. > > > My prediction is that updating this built-in provider eventually > won't > be any different from ICU or glibc. The built-in provider will have several advantages because it's tied to a PG major version: * A physical replica can't have different semantics than the primary. * Easier to document and test. * Changes are more transparent and can be documented in the release notes, so that administrators can understand the risks and blast radius at pg_upgrade time. > Later on down the road, from a user perspective, I think we should be > careful about confusion where providers are used inconsistently. 
It's > not great if one function follow built-in Unicode 15.1 rules but > another > function uses Unicode 13 rules because it happened to call an ICU > function or a glibc function. We could easily end up with multiple > providers processing different parts of a single SQL statement, which > could lead to strange results in some cases. The whole concept of "providers" is that they aren't consistent with each other. ICU, libc, and the builtin provider will all be based on different versions of Unicode. That's by design. The built-in provider will be a bit better in the sense that it's consistent with the normalization functions, and the other providers aren't. Regards, Jeff Davis
Re: add non-option reordering to in-tree getopt_long
Michael Paquier writes: > On Thu, Jul 13, 2023 at 09:38:42PM -0700, Nathan Bossart wrote: >> Take the following examples of client programs that accept one non-option: >> >> ~$ pg_resetwal a b c >> pg_resetwal: error: too many command-line arguments (first is "b") >> pg_resetwal: hint: Try "pg_resetwal --help" for more information. >> >> Yet pg_ctl gives: >> >> ~$ pg_ctl start a b c >> pg_ctl: too many command-line arguments (first is "start") >> Try "pg_ctl --help" for more information. >> >> In this example, isn't "a" the first extra non-option that should be >> reported? > Good point. This is interpreting "first" as being the first option > that's invalid. Here my first impression was that pg_ctl got that > right, where "first" refers to the first subcommand that would be > valid. Objection withdrawn. We just had a user complaint that seems to trace to exactly this bogus reporting in pg_ctl [1]. Although I was originally not very pleased with changing our getopt_long to do switch reordering, I'm now wondering if we should back-patch these changes as bug fixes. It's probably not worth the risk, but ... regards, tom lane [1] https://www.postgresql.org/message-id/flat/CANzqJaDCagH5wNkPQ42%3DFx3mJPR-YnB3PWFdCAYAVdb9%3DQ%2Bt-A%40mail.gmail.com
Re: trying again to get incremental backup
On Fri, Dec 15, 2023 at 6:58 AM Peter Eisentraut wrote: > A separate bikeshedding topic: The GUC "summarize_wal", could that be > "wal_something" instead? (wal_summarize? wal_summarizer?) It would be > nice if these settings names group together a bit, both with existing > wal_* ones and also with the new ones you are adding > (wal_summary_keep_time). Yeah, this is highly debatable, so bikeshed away. IMHO, the question here is whether we care more about (1) having the name of the GUC sound nice grammatically or (2) having the GUC begin with the same string as other, related GUCs. I think that Tom Lane tends to prefer the former, and probably some other people do too, while some other people tend to prefer the latter. Ideally it would be possible to satisfy both goals at once here, but everything I thought about that started with "wal" sounded too awkward for me to like it; hence the current choice of name. But if there's consensus on something else, so be it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: trying again to get incremental backup
On Fri, Dec 15, 2023 at 6:53 AM Peter Eisentraut wrote: > The first fixes up some things in nls.mk related to a file move. The > second is some cleanup because some function you are using has been > removed in the meantime; you probably found that yourself while rebasing. Incorporated these. As you guessed, MemoryContextResetAndDeleteChildren -> MemoryContextReset had already been done locally. > The pg_walsummary patch doesn't have a nls.mk, but you also comment that > it doesn't have tests yet, so I assume it's not considered complete yet > anyway. I think this was more of a case of me just not realizing that I should add that. I'll add something simple to the next version, but I'm not very good at this NLS stuff. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Dec 18, 2023 at 12:53 PM Andrey M. Borodin wrote: > One page still accommodates 32K transaction statuses under one lock. It feels > like a lot. About 1 second of transactions on a typical installation. > > When the group commit was committed did we have a benchmark to estimate > efficiency of this technology? Can we repeat that test again? I think we did, but it might take some research to find it in the archives. If we can, I agree that repeating it feels like a good idea. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 18 Dec 2023, at 22:30, Robert Haas wrote: > > On Mon, Dec 18, 2023 at 12:04 PM Robert Haas wrote: >> certain sense they are competing for the same job. However, they do >> aim to alleviate different TYPES of contention: the group XID update >> stuff should be most valuable when lots of processes are trying to >> update the same page, and the banks should be most valuable when there >> is simultaneous access to a bunch of different pages. So I'm not >> convinced that this patch is a reason to remove the group XID update >> mechanism, but someone might argue otherwise. > > Hmm, but, on the other hand: > > Currently all readers and writers are competing for the same LWLock. > But with this change, the readers will (mostly) no longer be competing > with the writers. So, in theory, that might reduce lock contention > enough to make the group update mechanism pointless. One page still accommodates 32K transaction statuses under one lock. It feels like a lot. About 1 second of transactions on a typical installation. When the group commit was committed did we have a benchmark to estimate efficiency of this technology? Can we repeat that test again? Best regards, Andrey Borodin.
Re: common signal handler protection
rebased for cfbot -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From c5fc2186960c483d53789f27fcf84771e98c5ca3 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 28 Nov 2023 14:58:20 -0600 Subject: [PATCH v5 1/3] Check that MyProcPid == getpid() in all signal handlers. In commit 97550c0711, we added a similar check to the SIGTERM handler for the startup process. This commit adds this check to all signal handlers installed with pqsignal(). This is done by using a wrapper function that performs the check before calling the actual handler. The hope is that this will offer more general protection against child processes of Postgres backends inadvertently modifying shared memory due to inherited signal handlers. Another potential follow-up improvement is to use this wrapper handler function to restore errno instead of relying on each individual handler function to do so. This commit makes the changes in commit 97550c0711 obsolete but leaves reverting it for a follow-up commit. Reviewed-by: Andres Freund Discussion: https://postgr.es/m/20231121212008.GA3742740%40nathanxps13 --- src/port/pqsignal.c | 86 +++-- 1 file changed, 84 insertions(+), 2 deletions(-) diff --git a/src/port/pqsignal.c b/src/port/pqsignal.c index 23e7f685c1..f04a7153de 100644 --- a/src/port/pqsignal.c +++ b/src/port/pqsignal.c @@ -46,23 +46,99 @@ #include "c.h" #include +#ifndef FRONTEND +#include +#endif #ifndef FRONTEND #include "libpq/pqsignal.h" +#include "miscadmin.h" +#endif + +#ifdef PG_SIGNAL_COUNT /* Windows */ +#define PG_NSIG (PG_SIGNAL_COUNT) +#elif defined(NSIG) +#define PG_NSIG (NSIG) +#else +#define PG_NSIG (64) /* XXX: wild guess */ +#endif + +/* Check a couple of common signals to make sure PG_NSIG is accurate. 
*/ +StaticAssertDecl(SIGUSR2 < PG_NSIG, "SIGUSR2 >= PG_NSIG"); +StaticAssertDecl(SIGHUP < PG_NSIG, "SIGHUP >= PG_NSIG"); +StaticAssertDecl(SIGTERM < PG_NSIG, "SIGTERM >= PG_NSIG"); +StaticAssertDecl(SIGALRM < PG_NSIG, "SIGALRM >= PG_NSIG"); + +static volatile pqsigfunc pqsignal_handlers[PG_NSIG]; + +/* + * Except when called with SIG_IGN or SIG_DFL, pqsignal() sets up this function + * as the handler for all signals. This wrapper handler function checks that + * it is called within a process that the server knows about (i.e., any process + * that has called InitProcessGlobals(), such as a client backend), and not a + * child process forked by system(3), etc. This check ensures that such child + * processes do not modify shared memory, which is often detrimental. If the + * check succeeds, the function originally provided to pqsignal() is called. + * Otherwise, the default signal handler is installed and then called. + */ +static void +wrapper_handler(SIGNAL_ARGS) +{ +#ifndef FRONTEND + + /* + * We expect processes to set MyProcPid before calling pqsignal() or + * before accepting signals. + */ + Assert(MyProcPid); + Assert(MyProcPid != PostmasterPid || !IsUnderPostmaster); + + if (unlikely(MyProcPid != (int) getpid())) + { + pqsignal(postgres_signal_arg, SIG_DFL); + raise(postgres_signal_arg); + return; + } #endif + (*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg); +} + /* * Set up a signal handler, with SA_RESTART, for signal "signo" * * Returns the previous handler. + * + * NB: If called within a signal handler, race conditions may lead to bogus + * return values. You should either avoid calling this within signal handlers + * or ignore the return value. + * + * XXX: Since no in-tree callers use the return value, and there is little + * reason to do so, it would be nice if we could convert this to a void + * function instead of providing potentially-bogus return values. 
+ * Unfortunately, that requires modifying the pqsignal() in legacy-pqsignal.c, + * which in turn requires an SONAME bump, which is probably not worth it. */ pqsigfunc pqsignal(int signo, pqsigfunc func) { + pqsigfunc orig_func = pqsignal_handlers[signo]; /* assumed atomic */ #if !(defined(WIN32) && defined(FRONTEND)) struct sigaction act, oact; +#else + pqsigfunc ret; +#endif + Assert(signo < PG_NSIG); + + if (func != SIG_IGN && func != SIG_DFL) + { + pqsignal_handlers[signo] = func; /* assumed atomic */ + func = wrapper_handler; + } + +#if !(defined(WIN32) && defined(FRONTEND)) act.sa_handler = func; sigemptyset(&act.sa_mask); act.sa_flags = SA_RESTART; @@ -72,9 +148,15 @@ pqsignal(int signo, pqsigfunc func) #endif if (sigaction(signo, &act, &oact) < 0) return SIG_ERR; - return oact.sa_handler; + else if (oact.sa_handler == wrapper_handler) + return orig_func; + else + return oact.sa_handler; #else /* Forward to Windows native signal system. */ - return signal(signo, func); + if ((ret = signal(signo, func)) == wrapper_handler) + return orig_func; + else + return ret; #endif } -- 2.25.1 >From 44bae78090d44f2ecdc9fbaea43d9adc1827c445 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Dec 18, 2023 at 12:04 PM Robert Haas wrote: > certain sense they are competing for the same job. However, they do > aim to alleviate different TYPES of contention: the group XID update > stuff should be most valuable when lots of processes are trying to > update the same page, and the banks should be most valuable when there > is simultaneous access to a bunch of different pages. So I'm not > convinced that this patch is a reason to remove the group XID update > mechanism, but someone might argue otherwise. Hmm, but, on the other hand: Currently all readers and writers are competing for the same LWLock. But with this change, the readers will (mostly) no longer be competing with the writers. So, in theory, that might reduce lock contention enough to make the group update mechanism pointless. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add --check option to pgindent
On Mon, 18 Dec 2023 at 17:50, Tristan Partin wrote: > I could propose something. It would help if I had an example of such > a tool that already exists. Basically the same behaviour as what you're trying to add now for --check, only instead of printing the diff it would actually change the files just like running pgindent without a --check flag does. i.e. allow pgindent --check to not run in "dry-run" mode My pre-commit hook looks like this currently (removed boring cruft around it):

if ! src/tools/pgindent/pgindent --check $files > /dev/null; then
    exit 0
fi
echo "Running pgindent on changed files"
src/tools/pgindent/pgindent $files
echo "Commit abandoned. Rerun git commit to adopt pgindent changes"
exit 1

But I would like it to look like:

if src/tools/pgindent/pgindent --check --write $files > /dev/null; then
    exit 0
fi
echo "Commit abandoned. Rerun git commit to adopt pgindent changes"
exit 1
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Dec 12, 2023 at 8:29 AM Alvaro Herrera wrote: > The problem I see is that the group update mechanism is designed around > contention of the global xact-SLRU control lock; it uses atomics to > coordinate a single queue when the lock is contended. So if we split up > the global SLRU control lock using banks, then multiple processes using > different bank locks might not contend. OK, this is fine, but what > happens if two separate groups of processes encounter contention on two > different bank locks? I think they will both try to update the same > queue, and coordinate access to that *using different bank locks*. I > don't see how can this work correctly. > > I suspect the first part of that algorithm, where atomics are used to > create the list without a lock, might work fine. But will each "leader" > process, each of which is potentially using a different bank lock, > coordinate correctly? Maybe this works correctly because only one > process will find the queue head not empty? If this is what happens, > then there needs to be comments about it. Without any explanation, > this seems broken and potentially dangerous, as some transaction commit > bits might become lost given high enough concurrency and bad luck. I don't want to be dismissive of this concern, but I took a look at this part of the patch set and I don't see a correctness problem. I think the idea of the mechanism is that we have a single linked list in shared memory that can accumulate those waiters. At some point a process pops the entire list of waiters, which allows a new group of waiters to begin accumulating. The process that pops the list must perform the updates for every process in the just-popped list without failing, else updates would be lost. In theory, there can be any number of lists that some process has popped and is currently working its way through at the same time, although in practice I bet it's quite rare for there to be more than one. 
The only correctness problem is if it's possible for a process that popped the list to error out before it finishes doing the updates that it "promised" to do by popping the list. Having individual bank locks doesn't really change anything here. That's just a matter of what lock has to be held in order to perform the update that was promised, and the algorithm described in the previous paragraph doesn't really care about that. Nor is there a deadlock hazard here as long as processes only take one lock at a time, which I believe is the case. The only potential issue that I see here is with performance. I've heard some questions about whether this machinery performs well even as it stands, but certainly if we divide up the lock into a bunch of bankwise locks then that should tend in the direction of making a mechanism like this less valuable, because both mechanisms are trying to alleviate contention, and so in a certain sense they are competing for the same job. However, they do aim to alleviate different TYPES of contention: the group XID update stuff should be most valuable when lots of processes are trying to update the same page, and the banks should be most valuable when there is simultaneous access to a bunch of different pages. So I'm not convinced that this patch is a reason to remove the group XID update mechanism, but someone might argue otherwise. A related concern is that, if by chance we do end up with multiple updaters from different pages in the same group, it will now be more expensive to sort that out because we'll have to potentially keep switching locks. So that could make the group XID update mechanism less performant than it is currently. I think that's probably not an issue because I think it should be a rare occurrence, as the comments say. A bit more cost in a code path that is almost never taken won't matter. But if that path is more commonly taken than I think, then maybe making it more expensive could hurt. 
It might be worth adding some debugging to see how often we actually go down that path in a highly stressed system. BTW: +* as well. The main reason for now allowing requesters of different pages now -> not -- Robert Haas EDB: http://www.enterprisedb.com
Re: optimize atomic exchanges
On Fri, Dec 15, 2023 at 04:56:27AM -0800, Andres Freund wrote: > I don't think we need the inline asm. Otherwise looks good. Committed with that change. Thanks for reviewing! I am going to watch the buildfarm especially closely for this one. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Add --check option to pgindent
On Mon Dec 18, 2023 at 10:14 AM CST, Jelte Fennema-Nio wrote: On Mon, 18 Dec 2023 at 13:42, Daniel Gustafsson wrote: > I think this is pretty much ready to go, the attached v4 squashes the changes > and fixes the man-page which also needed an update. The referenced Wiki page > will need an edit or two after this goes in, but that's easy enough. One thing I'm wondering: When both --check and --diff are passed, should pgindent still early exit with 2 on the first incorrectly formatted file? Or should it show diffs for all failing files? I'm leaning towards the latter making more sense. Makes sense. Let me iterate on the squashed version of the patch. Related (but not required for this patch): For my pre-commit hook I would very much like it if there was an option to have pgindent write the changes to disk, but still exit non-zero, e.g. a --write flag that could be combined with --check just like --diff and --check can be combined with this patch. Currently my pre-commit hook need two separate calls to pgindent to both abort the commit and write the required changes to disk. I could propose something. It would help if I had an example of such a tool that already exists. -- Tristan Partin Neon (https://neon.tech)
Re: micro-optimizing json.c
On Fri, Dec 08, 2023 at 05:56:20PM -0500, Tom Lane wrote: > Nathan Bossart writes: >> Here are a couple more easy micro-optimizations in nearby code. I've split >> them into individual patches for review, but I'll probably just combine >> them into one patch before committing. > > LGTM Committed. Thanks for reviewing! For the record, I did think about changing appendStringInfoString() into a macro or an inline function so that any calls with a string literal would benefit from this sort of optimization, but I was on-the-fence about it because it requires some special knowledge, i.e., you have to know to provide string literals to remove the runtime calls to strlen(). Perhaps this is worth further exploration... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: psql JSON output format
On Mon, 18 Dec 2023 at 16:38, Christoph Berg wrote: > We'd want both patches even if they do the same thing on two different > levels, I'd say. Makes sense. One thing I was still wondering is if it wouldn't be easier to wrap all queries in "copy (select whatever) to stdout (format json)" automatically when the -J flag is passed to psql. Because it would be nice not to have to implement this very similar logic in two places. But I guess that approach does not work for commands that don't work inside COPY, i.e. DML and DDL. I'm assuming your current patch works fine with DML/DDL. If that's indeed the case then I agree it makes sense to have this patch. And another big benefit is that it wouldn't require a new Postgres server function for the json functionality of psql.
Re: Add --check option to pgindent
On Mon, 18 Dec 2023 at 17:14, Jelte Fennema-Nio wrote: > One thing I'm wondering: When both --check and --diff are passed, > should pgindent still early exit with 2 on the first incorrectly > formatted file? Or should it show diffs for all failing files? I'm > leaning towards the latter making more sense. To be clear, in the latter case it would still exit with 2 at the end of the script, but only after showing diffs for all files.
Re: Add --check option to pgindent
On Mon, 18 Dec 2023 at 13:42, Daniel Gustafsson wrote: > I think this is pretty much ready to go, the attached v4 squashes the changes > and fixes the man-page which also needed an update. The referenced Wiki page > will need an edit or two after this goes in, but that's easy enough. One thing I'm wondering: When both --check and --diff are passed, should pgindent still early exit with 2 on the first incorrectly formatted file? Or should it show diffs for all failing files? I'm leaning towards the latter making more sense. Related (but not required for this patch): For my pre-commit hook I would very much like it if there was an option to have pgindent write the changes to disk, but still exit non-zero, e.g. a --write flag that could be combined with --check just like --diff and --check can be combined with this patch. Currently my pre-commit hook needs two separate calls to pgindent to both abort the commit and write the required changes to disk.
Re: Add --check option to pgindent
On Mon, 18 Dec 2023 at 16:45, Tristan Partin wrote: > > On Mon Dec 18, 2023 at 6:41 AM CST, Daniel Gustafsson wrote: > > > On 15 Dec 2023, at 16:43, Tristan Partin wrote: > > > > > Here is a v3. > > > > I think this is pretty much ready to go, the attached v4 squashes the > > changes > > and fixes the man-page which also needed an update. The referenced Wiki > > page > > will need an edit or two after this goes in, but that's easy enough. > > I have never edited the Wiki before. How can I do that? More than happy > to do it. As mentioned at the bottom of the main page of the wiki: NOTE: due to recent spam activity "editor" privileges are granted manually for the time being. If you just created a new community account or if your current account used to have "editor" privileges ask on either the PostgreSQL -www Mailinglist or the PostgreSQL IRC Channel (direct your request to 'pginfra', multiple individuals in the channel highlight on that string) for help. Please include your community account name in those requests. After that, it's just a case of logging in on the wiki (link top right corner, which uses the community account) and then just go on to your preferred page, click edit, and do your modifications. Don't forget to save the modifications - I'm not sure whether the wiki editor's state is locally persisted. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Add --check option to pgindent
On Mon Dec 18, 2023 at 7:56 AM CST, Euler Taveira wrote: On Mon, Dec 18, 2023, at 9:41 AM, Daniel Gustafsson wrote: > > On 15 Dec 2023, at 16:43, Tristan Partin wrote: > > > Here is a v3. > > I think this is pretty much ready to go, the attached v4 squashes the changes > and fixes the man-page which also needed an update. The referenced Wiki page > will need an edit or two after this goes in, but that's easy enough. ... and pgbuildfarm client [1] needs to be changed too. [1] https://github.com/PGBuildFarm/client-code/blob/main/PGBuild/Modules/CheckIndent.pm#L55-L90 Thanks Euler. I am on it. -- Tristan Partin Neon (https://neon.tech)
Re: Add --check option to pgindent
On Mon Dec 18, 2023 at 6:41 AM CST, Daniel Gustafsson wrote: > On 15 Dec 2023, at 16:43, Tristan Partin wrote: > Here is a v3. I think this is pretty much ready to go, the attached v4 squashes the changes and fixes the man-page which also needed an update. The referenced Wiki page will need an edit or two after this goes in, but that's easy enough. I have never edited the Wiki before. How can I do that? More than happy to do it. -- Tristan Partin Neon (https://neon.tech)
Re: psql JSON output format
Re: Jelte Fennema-Nio > This seems useful to me too, but my usecases would also be solved (and > possibly better solved) by adding JSON support to COPY as proposed > here: > https://www.postgresql.org/message-id/flat/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com Thanks for the pointer, I had not scrolled back enough to see that thread. I'm happy to see that this patch is also settling on "array of objects". > I'm wondering if your follow-up patch would be better served by that too or > not. I'd need it to work on query results. Which could of course be wrapped into "copy (select whatever) to stdout (format json)", but doing it in psql without mangling the query is cleaner. And (see the other mail), the psql format selection works nicely with existing queries like `psql -l`. And "copy format json" wouldn't support \x expanded mode. We'd want both patches even if they do the same thing on two different levels, I'd say. Christoph
Re: A wrong comment about search_indexed_tlist_for_var
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

Comment is updated correctly.
Re: psql JSON output format
On Mon, 18 Dec 2023 at 15:56, Christoph Berg wrote: > I noticed psql was lacking JSON formatting of query results which I > need for a follow-up patch. This seems useful to me too, but my use cases would also be solved (and possibly better solved) by adding JSON support to COPY as proposed here: https://www.postgresql.org/message-id/flat/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com I'm wondering if your follow-up patch would be better served by that too or not.
Re: psql JSON output format
Re: To PostgreSQL Hackers
> On the command line, the format is selected by `psql --json` and `psql -J`.

Among other uses, it enables easy post-processing of psql output using `jq`:

$ psql -lJ | jq
[
  {
    "Name": "myon",
    "Owner": "myon",
    "Encoding": "UTF8",
    "Locale Provider": "libc",
    "Collate": "de_DE.utf8",
    "Ctype": "de_DE.utf8",
    "ICU Locale": null,
    "ICU Rules": null,
    "Access privileges": null
  },
  ...
]

Christoph
psql JSON output format
I noticed psql was lacking JSON formatting of query results which I need for a follow-up patch. It also seems useful generally, so here's a patch:

postgres=# \pset format json
Output format is json.
postgres=# select * from (values ('one', 2, 'three'), ('four', 5, 'six')) as sub(a, b, c);
[
{ "a": "one", "b": "2", "c": "three" },
{ "a": "four", "b": "5", "c": "six" }
]
postgres=# \x
Expanded display is on.
postgres=# select * from (values ('one', 2, 'three'), ('four', 5, 'six')) as sub(a, b, c);
[{
  "a": "one",
  "b": "2",
  "c": "three"
},{
  "a": "four",
  "b": "5",
  "c": "six"
}]
postgres=#

Both normal and expanded output formats are optimized for readability while still saving screen space. Both formats output the same JSON structure, an array of objects. Other variants like array-of-arrays or line-separated objects ("jsonline") might be possible, but I didn't want to overengineer it. On the command line, the format is selected by `psql --json` and `psql -J`. (I'm not attached to the short option, but -J was free and it's in line with `psql -H` to select HTML.)

Christoph

>From 5de0629e3664b981b2a246c480a90636ddfc8dd7 Mon Sep 17 00:00:00 2001
From: Christoph Berg
Date: Fri, 8 Sep 2023 15:59:29 +0200
Subject: [PATCH] Add JSON output format to psql

Query results are formatted as an array of JSON objects. In non-expanded mode, one object per line is printed, and in expanded mode, one key-value pair. Use `\pset format json` to enable, or run `psql --json` or `psql -J`. NULL values are printed as `null`, independently from the psql null setting, all other values are printed as quoted strings.
--- doc/src/sgml/ref/psql-ref.sgml | 26 +- src/bin/psql/command.c | 6 +- src/bin/psql/help.c| 1 + src/bin/psql/startup.c | 6 +- src/bin/psql/tab-complete.c| 2 +- src/fe_utils/print.c | 128 +++-- src/include/fe_utils/print.h | 1 + src/test/regress/expected/psql.out | 67 +++ src/test/regress/sql/psql.sql | 25 ++ 9 files changed, 253 insertions(+), 9 deletions(-) diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index cc7d797159..f1f7eda082 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -274,6 +274,17 @@ EOF + + -J + --json + + + Switches to JSON output mode. This is + equivalent to \pset format json. + + + + -l --list @@ -2956,6 +2967,7 @@ lo_import 152801 asciidoc, csv, html, + json, latex, latex-longtable, troff-ms, unaligned, or wrapped. @@ -2972,7 +2984,7 @@ lo_import 152801 in by other programs, for example, tab-separated or comma-separated format. However, the field separator character is not treated specially if it appears in a column's value; - so CSV format may be better suited for such + the CSV or JSON formats may be better suited for such purposes. @@ -2997,6 +3009,18 @@ lo_import 152801 \pset csv_fieldsep. + json format + + JSON format + in psql + + writes output in JSON format, described in + https://www.ietf.org/rfc/rfc4627.txt";>RFC 4627. + The result is an array of JSON objects. In non-expanded mode, one + object per line is printed, while in expanded mode, each key-value + pair is a separate line. + + wrapped format is like aligned but wraps wide data values across lines to make the output fit in the target column width. 
The target width is determined as described under diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index 82cc091568..d8d11e9519 100644 --- a/src/bin/psql/command.c +++ b/src/bin/psql/command.c @@ -4318,6 +4318,9 @@ _align2string(enum printFormat in) case PRINT_HTML: return "html"; break; + case PRINT_JSON: + return "json"; + break; case PRINT_LATEX: return "latex"; break; @@ -4408,6 +4411,7 @@ do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet) {"asciidoc", PRINT_ASCIIDOC}, {"csv", PRINT_CSV}, {"html", PRINT_HTML}, + {"json", PRINT_JSON}, {"latex", PRINT_LATEX}, {"troff-ms", PRINT_TROFF_MS}, {"unaligned", PRINT_UNALIGNED}, @@ -4448,7 +4452,7 @@ do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet) } else { -pg_log_error("\\pset: allowed formats are aligned, asciidoc, csv, html, latex, latex-longtable, troff-ms, unaligned, wrapped"); +pg_log_error("\\pset: allowed formats are aligned, asciidoc, csv, html, json, latex, latex-longtable, troff-ms, unaligned, wrapped"); return false; } } diff --git a/src/bin/psql/help.c b/src/bin/psql/
Re: Detecting some cases of missing backup_label
Greetings, * Stephen Frost (sfr...@snowman.net) wrote: > * Andres Freund (and...@anarazel.de) wrote: > > I recently mentioned to Robert (and also Heikki earlier), that I think I > > see a > > way to detect an omitted backup_label in a relevant subset of the cases > > (it'd > > apply to the pg_control as well, if we moved to that). Robert encouraged me > > to share the idea, even though it does not provide complete protection. > > That would certainly be nice. > > > The subset I think we can address is the following: > > > > a) An omitted backup_label would lead to corruption, i.e. without the > >backup_label we won't start recovery at the right position. Obviously > > it'd > >be better to also catch a wrong procedure when it'd not cause corruption > > - > >perhaps my idea can be extended to handle that, with a small bit of > >overhead. > > > > b) The backup has been taken from a primary. Unfortunately that probably > > can't > >be addressed - but the vast majority of backups are taken from a primary, > >so I think it's still a worthwhile protection. > > Agreed that this is a worthwhile set to try and address, even if we > can't address other cases. > > > Here's my approach > > > > 1) We add a XLOG_BACKUP_START WAL record when starting a base backup on a > >primary, emitted just *after* the checkpoint completed > > > > 2) When replaying a base backup start record, we create a state file that > >includes the corresponding LSN in the filename > > > > 3) On the primary, the state file for XLOG_BACKUP_START is *not* created at > >that time. Instead the state file is created during pg_backup_stop(). > > > > 4) When replaying a XLOG_BACKUP_END record, we verify that the state file > >created by XLOG_BACKUP_START is present, and error out if not. Backups > >that started before the redo LSN from backup_label are ignored > >(necessitates remembering that LSN, but we've been discussing that > > anyway).
> > > > > > Because the backup state file on the primary is only created during > > pg_backup_stop(), a copy of the data directory taken between > > pg_backup_start() > > and pg_backup_stop() does *not* contain the corresponding "backup state > > file". Because of this, an omitted backup_label is detected if recovery does > > not start early enough - recovery won't encounter the XLOG_BACKUP_START > > record > > and thus would not create the state file, leading to an error in 4). > > While I see the idea here, I think, doesn't it end up being an issue if > things happen like this: > > pg_backup_start -> XLOG_BACKUP_START WAL written -> new checkpoint > happens -> pg_backup_stop -> XLOG_BACKUP_STOP WAL written -> crash > > In that scenario, we'd go back to the new checkpoint (the one *after* > the checkpoint that happened before we wrote XLOG_BACKUP_START), start > replaying, and then hit the XLOG_BACKUP_STOP and then error out, right? > Even though we're actually doing crash recovery and everything should be > fine as long as we replay all of the WAL. Andres and I discussed this in person at PGConf.eu and the idea is that if we find a XLOG_BACKUP_STOP record then we can check if the state file was written out and if so then we can conclude that we are doing crash recovery and not restoring from a backup and therefore we don't error out. This also implies that we don't consider PG to be recovered at the XLOG_BACKUP_STOP point, if the state file exists, but instead we have to be sure to replay all WAL that's been written. Perhaps we even explicitly refuse to use restore_command in this case? We do error out if we hit a XLOG_BACKUP_STOP and the state file doesn't exist, as that implies that we started replaying from a point after a XLOG_BACKUP_START record was written but are working from a copy of the data directory which didn't include the state file. 
Of course, we need to actually implement and test these different cases to make sure it all works, but I'm at least feeling better about the idea and wanted to share that here. Thanks, Stephen
Re: "pgoutput" options missing on documentation
> Fair enough. I think we should push your first patch only in HEAD as > this is a minor improvement over the current behaviour. What do you > think? I agree.
Re: Simplify newNode()
On 15/12/2023 00:44, Tom Lane wrote: Good point. Looking closer, modern compilers will actually turn the MemSetLoop() in MemoryContextAllocZeroAligned() into a call to memset() anyway! Funny. That is true for recent versions of gcc, clang, and MSVC. Not here ... Hmm, according to godbolt, the change happened in GCC version 10.1. Starting with gcc 10.1, it is turned into a memset(). On clang, the same change happened in version 3.4.1. I think we have consensus on patch v2. It's simpler and not less performant than what we have now, at least on modern compilers. Barring objections, I'll commit that. I'm not planning to spend more time on this, but there might be some room for further optimization if someone is interested to do the micro-benchmarking. The obvious thing would be to persuade modern compilers to not switch to memset() in MemoryContextAllocZeroAligned (*), making the old macro logic work the same way it used to on old compilers. Also, instead of palloc0, it might be better for newNode() to call palloc followed by memset. That would allow the compiler to partially optimize away the memset. Most callers fill at least some of the fields after calling makeNode(), so the compiler could generate code that clears only the uninitialized fields and padding bytes. (*) or rather, a new function like MemoryContextAllocZeroAligned but without the 'context' argument. We want to keep the savings in the callers from eliminating the extra argument. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Add --check option to pgindent
On Mon, Dec 18, 2023, at 9:41 AM, Daniel Gustafsson wrote: > > On 15 Dec 2023, at 16:43, Tristan Partin wrote: > > > Here is a v3. > > I think this is pretty much ready to go, the attached v4 squashes the changes > and fixes the man-page which also needed an update. The referenced Wiki page > will need an edit or two after this goes in, but that's easy enough. ... and pgbuildfarm client [1] needs to be changed too. [1] https://github.com/PGBuildFarm/client-code/blob/main/PGBuild/Modules/CheckIndent.pm#L55-L90 -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Remove MSVC scripts from the tree
On 18.12.23 11:49, vignesh C wrote: Few comments: 1) Now that the MSVC build scripts are removed, should we have the reference to "MSVC build scripts" here? ltree.h: I think this note is correct and can be kept, as it explains the historical context. 2) I had seen that if sed/gzip is not available meson build will fail: 2.a) Program gsed sed found: NO meson.build:334:6: ERROR: Program 'gsed sed' not found or not executable Yes, this would need to be improved. Currently, sed is only required if either selinux or dtrace is enabled, which isn't supported on Windows. But we should adjust the build scripts to not fail the top-level setup run unless those options are enabled. 2.b) Program gzip found: NO meson.build:337:7: ERROR: Program 'gzip' not found or not executable gzip is only required for certain test suites, so again we should adjust the build scripts to not fail the build but instead skip the tests as appropriate.
Re: Add new for_each macros for iterating over a List that do not require ListCell pointer
The more I think about it and look at the code, the more I like the usage of the loop style proposed in the previous 0003 patch (which automatically declares a loop variable for the scope of the loop using a second for loop). I did some testing on godbolt.org and both versions of the macros result in the same assembly when compiling with -O2 (and even -O1) when compiling with ancient versions of gcc (5.1) and clang (3.0): https://godbolt.org/z/WqfTbhe4e So attached is now an updated patchset that only includes these even easier to use foreach macros. I also updated some of the comments and moved modifying foreach_delete_current and foreach_current_index to their own commit. On Thu, 14 Dec 2023 at 16:54, Jelte Fennema-Nio wrote: > > On Fri, 1 Dec 2023 at 05:20, Nathan Bossart wrote: > > Could we simplify it with something like the following? > > Great suggestion! Updated the patchset accordingly. > > This made it also easy to change the final patch to include the > automatic scoped declaration logic for all of the new macros. I quite > like how the calling code changes to not have to declare the variable. > But it's definitely a larger divergence from the status quo than > without patch 0003. So I'm not sure if it's desired. > > Finally, I also renamed the functions to use foreach instead of > for_each, since based on this thread that seems to be the generally > preferred naming. v7-0001-Add-macros-for-looping-through-a-list-without-nee.patch Description: Binary data v7-0003-Use-new-foreach_xyz-macros-in-a-few-places.patch Description: Binary data v7-0002-Make-some-macros-that-used-cell-work-with-new-for.patch Description: Binary data
Re: Postgres picks suboptimal index after building of an extended statistics
Hi! I'd like to get this subject off the ground. The problem originally described in [1] obviously comes from wrong selectivity estimation. "Dependencies" extended statistics lead to significant selectivity miss 24/1000 instead of 1/1000. When the estimation is correct, the PostgreSQL optimizer is capable of choosing the appropriate unique index for the query. Tom pointed out in [2] that this might be a problem of "Dependencies" extended statistics. I've rechecked this. The dataset consists of two parts. The first part defined in the CREATE TABLE statement contains independent values. The second part defined in the INSERT statement contains dependent values. "Dependencies" extended statistics estimate dependency degree as a fraction of rows containing dependent values. According to this definition, it correctly estimates the dependency degree as about 0.5 for all the combinations. So, the error in the estimate comes from the synergy of two factors: MCV estimation of the frequency of individual values, and spreading of average dependency degree for those values, which is not relevant for them. I don't think there is a way to fix "dependencies" extended statistics because it works exactly as designed. I have to note that instead of fixing "dependencies" extended statistics you can just add multi-column MCV statistics, which would fix the problem. CREATE STATISTICS aestat(dependencies,ndistinct,mcv) ON x,y,z FROM a; Independently on how well our statistics work, it looks pitiful that we couldn't fix that using the knowledge of unique constraints on the table. Despite statistics, which give us just more or less accurate estimates, the constraint is something we really enforce and thus can rely on. The patches provided by Andrei in [1], [3], and [4] fix this issue at the index scan path level. 
As Tomas pointed out in [5], that could lead to inconsistency between the number of rows used for unique index scan estimation and the value displayed in EXPLAIN (and used for other paths). Even though this was debated in [6], I can confirm this inconsistency. Thus, with the patch published in [4], I can see the 28-row estimation with a unique index scan.

                               QUERY PLAN
-----------------------------------------------------------------------
 Index Only Scan using a_pkey on a  (cost=0.28..8.30 rows=28 width=12)
   Index Cond: ((x = 1) AND (y = 1) AND (z = 1))
(2 rows)

Also, there is a set of patches [7], [8], and [9], which makes the optimizer consider path selectivity alongside path costs during path selection. I've rechecked that none of these patches could resolve the original problem described in [1]. Also, I think they are quite tricky. The model of our optimizer assumes that paths in the list should be the different ways of getting the same result. If we choose the paths by their selectivity, that breaks this model. I'm not saying there is no way to do this. But if we do this, that would require significant rethinking of our optimizer model and possible revision of a significant part of it. Anyway, I think if there is still interest in this, that should be moved into a separate thread to keep this thread focused on the problem described in [1]. Finally, I'd like to note that the issue described in [1] is mostly a selectivity estimation problem. It could be solved by adding the multi-column MCV statistics. The patches published so far look more like hacks for particular use cases rather than appropriate solutions. It still looks promising to me to use the knowledge of unique constraints during selectivity estimation [10]. Even though it's hard to implement and possibly implies some overhead, it fits the current model. I also think unique constraints could probably be used in some way to improve estimates even when there is no full match. Links. 1.
https://www.postgresql.org/message-id/0ca4553c-1f34-12ba-9122-44199d1ced41%40postgrespro.ru
2. https://www.postgresql.org/message-id/3119052.1657231656%40sss.pgh.pa.us
3. https://www.postgresql.org/message-id/90a1d6ef-c777-b95d-9f77-0065ad4522df%40postgrespro.ru
4. https://www.postgresql.org/message-id/a5a18d86-c0e5-0ceb-9a18-be1beb2d2944%40postgrespro.ru
5. https://www.postgresql.org/message-id/f8044836-5d61-a4e0-af82-5821a2a1f0a7%40enterprisedb.com
6. https://www.postgresql.org/message-id/90a1d6ef-c777-b95d-9f77-0065ad4522df%40postgrespro.ru
7. https://www.postgresql.org/message-id/2df148b5-0bb8-f80b-ac03-251682fab585%40postgrespro.ru
8. https://www.postgresql.org/message-id/6fb43191-2df3-4791-b307-be754e648276%40postgrespro.ru
9. https://www.postgresql.org/message-id/154f786a-06a0-4fb1-b8a4-16c66149731b%40postgrespro.ru
10. https://www.postgresql.org/message-id/f8044836-5d61-a4e0-af82-5821a2a1f0a7%40enterprisedb.com
-- Regards, Alexander Korotkov
Re: "pgoutput" options missing on documentation
On Mon, Dec 18, 2023 at 1:08 PM Emre Hasegeli wrote: > > > I found the existing error code appropriate because for syntax > > specification, either we need to mandate this at the grammar level or > > at the API level. Also, I think we should give a message similar to an > > existing message: "publication_names parameter missing". For example, > > we can say, "proto_version parameter missing". BTW, I also don't like > > the other changes parse_output_parameters() done in 0001, if we want > > to improve all the similar messages there are other places in the code > > as well, so we can separately make the case for the same. > > Okay, I am changing these back. I think we should keep the word > "option". It is used on other error messages. > Fair enough. I think we should push your first patch only in HEAD as this is a minor improvement over the current behaviour. What do you think? -- With Regards, Amit Kapila.
Re: planner chooses incremental but not the best one
On 12/18/23 11:40, Richard Guo wrote: > > On Mon, Dec 18, 2023 at 7:31 AM Tomas Vondra > mailto:tomas.von...@enterprisedb.com>> > wrote: > > Oh! Now I see what you meant by using the new formula in 84f9a35e3 > depending on how we sum tuples. I agree that seems like the right thing. > > I'm not sure it'll actually help with the issue, though - if I apply the > patch, the plan does not actually change (and the cost changes just a > little bit). > > I looked at this only very briefly, but I believe it's due to the > assumption of independence I mentioned earlier - we end up using the new > formula introduced in 84f9a35e3, but it assumes the > selectivity and number of groups are independent. But that's not the > case here, because the groups are very clearly correlated (with the > condition on "b"). > > > You're right. The patch allows us to adjust the estimate of distinct > values for appendrels using the new formula introduced in 84f9a35e3. > But if the restrictions are correlated with the grouping expressions, > the new formula does not behave well. That's why the patch does not > help in case [1], where 'b' and 'c' are correlated. > > OTOH, if the restrictions are not correlated with the grouping > expressions, the new formula would perform quite well. And in this case > the patch would help a lot, as shown in [2] where estimate_num_groups() > gives a much more accurate estimate with the help of this patch. > > So this patch could be useful in certain situations. I'm wondering if > we should at least have this patch (if it is right). > I do agree the patch seems to do the right thing, and it's worth pushing on its own.
> But anyhow this is not an easy task I suppose. Yeah, if it was easy, it'd have been done in 84f9a35e3 already ;-) The challenge is where to get usable information about correlation between columns. I only have a couple of very rough ideas of what we might try. For example, if we have multi-column ndistinct statistics, we might look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from ndistinct(b,c,d) / ndistinct(b,c). If we know how many distinct values we have for the predicate column, we could then estimate the number of groups. I mean, we know that for the restriction "WHERE b = 3" we only have 1 distinct value, so we could estimate the number of groups as 1 * ndistinct(b,c). I'm well aware this is only a very trivial example, and for more complex examples it's likely way more complicated. But hopefully it illustrates the general idea. The other idea would be to maybe look at multi-column MCV, and try using it to deduce cross-column correlation too (it could be more accurate for arbitrary predicates). And finally, we might try being a bit more pessimistic and look at what the "worst case" behavior could be. So for example instead of trying to estimate the real number of groups, we'd ask "What's the minimum number of groups we're likely to get?". And use that to cost the incremental sort. But I don't think we do this elsewhere, and I'm not sure we want to start with this risk-based approach here. It might be OK, because we usually expect the incremental sort to be much cheaper, ... Is this something you would be interested in exploring? I don't have capacity to work on this myself, but I'd be available for reviews, feedback and so on. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Add --check option to pgindent
> On 15 Dec 2023, at 16:43, Tristan Partin wrote: > Here is a v3. I think this is pretty much ready to go, the attached v4 squashes the changes and fixes the man-page which also needed an update. The referenced Wiki page will need an edit or two after this goes in, but that's easy enough. -- Daniel Gustafsson v4-0001-Rename-non-destructive-modes-in-pgindent.patch Description: Binary data
RE: Synchronizing slots from primary to standby
On Monday, December 11, 2023 5:31 PM shveta malik wrote: > > On Thu, Dec 7, 2023 at 1:33 PM Peter Smith > wrote: > > > > Hi. > > > > Here are my review comments for patch v43-0002. > > > > > == > > src/backend/access/transam/xlogrecovery.c > > > > 13. > > + /* > > + * Shutdown the slot sync workers to prevent potential conflicts between > > + * user processes and slotsync workers after a promotion. Additionally, > > + * drop any slots that have initiated but not yet completed the sync > > + * process. > > + */ > > + ShutDownSlotSync(); > > + slotsync_drop_initiated_slots(); > > + > > > > Is this where maybe the 'sync_state' should also be updated for > > everything so you are not left with confusion about different states > > on a node that is no longer a standby node? > > > > yes, this is the place. But this needs more thought as it may cause > too much disk activity during promotion. so let me analyze and come > back.

Per off-list discussion with Amit, I think it's fine to keep both READY and NONE on a primary, because even if we update the sync_state from READY to NONE on promotion, it doesn't reduce the complexity for the handling of READY and NONE state. And it's not straightforward to choose the right place to update sync_state; here is the analysis:

(related steps on promotion)
1. (patch) shut down the slotsync worker
2. (patch) drop 'i' state slots
3. remove standby.signal and recovery.signal
4. switch to a new timeline and write the timeline history file
5. set SharedRecoveryState = RECOVERY_STATE_DONE, which means RecoveryInProgress() will return false

We could not update the sync_state before step 3, because if the update fails after updating some of the slots' states, then the server will be shut down leaving some 'NONE' state slots. After restarting, the server is still a standby, so the slot sync worker will fail to sync these 'NONE' state slots. We also could not update it after step 3 and before step 4.
Because if any ERROR occurs during the update, then after restarting, the server will become a primary (as standby.signal has been removed), but it can still be turned back into an active standby by re-creating a standby.signal file, because the timeline has not yet been switched. In that case, the slot sync worker will again fail to sync these 'NONE' state slots. Updating the sync_state after step 4 and before step 5 would be OK, but it still doesn't simplify the handling of both READY and NONE state slots. Therefore, I think we can retain the READY state slots after promotion, as they provide information about the slots' origin. I added some comments around the slotsync cleanup code (in FinishWalRecovery) to mention the reason. Best Regards, Hou zj
Re: [DOC] Introducing Quick Start Guide to PL/pgSQL and PL/Python Documentation
On Sat, Dec 16, 2023, at 7:49 AM, Ishaan Adarsh wrote: > I have made some documentation enhancements for PL/pgSQL and PL/Python > sections. The changes include the addition of a Quick Start Guide to > facilitate a smoother onboarding experience for users. Great! Add your patch to the next CF [1] so we don't miss it. [1] https://commitfest.postgresql.org/46/ -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Add a perl function in Cluster.pm to generate WAL
On Mon, Dec 18, 2023, at 2:39 AM, Bharath Rupireddy wrote: > Rebase needed, attached v3 patch. I think you don't understand the suggestion proposed by Michael and Kyotaro. If you do a comparison with the following SQL commands: euler=# select pg_walfile_name(pg_current_wal_lsn()); pg_walfile_name -- 00010040 (1 row) euler=# select pg_logical_emit_message(true, 'prefix', 'message4'); pg_logical_emit_message - 0/40A8 (1 row) euler=# select pg_switch_wal(); pg_switch_wal --- 0/40F0 (1 row) euler=# create table cc (b int); CREATE TABLE euler=# drop table cc; DROP TABLE euler=# select pg_switch_wal(); pg_switch_wal --- 0/41017C88 (1 row) euler=# select pg_walfile_name(pg_current_wal_lsn()); pg_walfile_name -- 00010041 (1 row) You get $ pg_waldump 00010040 rmgr: Standby len (rec/tot): 50/50, tx: 0, lsn: 0/4028, prev 0/3F0001C0, desc: RUNNING_XACTS nextXid 295180 latestCompletedXid 295179 oldestRunningXid 295180 rmgr: LogicalMessage len (rec/tot): 65/65, tx: 295180, lsn: 0/4060, prev 0/4028, desc: MESSAGE transactional, prefix "prefix"; payload (8 bytes): 6D 65 73 73 61 67 65 34 rmgr: Transaction len (rec/tot): 46/46, tx: 295180, lsn: 0/40A8, prev 0/4060, desc: COMMIT 2023-12-18 08:35:06.821322 -03 rmgr: XLOGlen (rec/tot): 24/24, tx: 0, lsn: 0/40D8, prev 0/40A8, desc: SWITCH $ pg_waldump 00010041 rmgr: Standby len (rec/tot): 50/50, tx: 0, lsn: 0/4128, prev 0/40D8, desc: RUNNING_XACTS nextXid 295181 latestCompletedXid 295180 oldestRunningXid 295181 rmgr: Storage len (rec/tot): 42/42, tx: 0, lsn: 0/4160, prev 0/4128, desc: CREATE base/33287/88102 rmgr: Heap2 len (rec/tot): 60/60, tx: 295181, lsn: 0/4190, prev 0/4160, desc: NEW_CID rel: 1663/33287/1247, tid: 14/16, cmin: 0, cmax: 4294967295, combo: 4294967295 rmgr: Heaplen (rec/tot): 54/ 3086, tx: 295181, lsn: 0/41D0, prev 0/4190, desc: INSERT off: 16, flags: 0x00, blkref #0: rel 1663/33287/1247 blk 14 FPW rmgr: Btree len (rec/tot): 53/ 5133, tx: 295181, lsn: 0/41000CE0, prev 0/41D0, desc: INSERT_LEAF off: 252, blkref #0: 
rel 1663/33287/2703 blk 2 FPW . . . rmgr: Btree len (rec/tot): 72/72, tx: 295181, lsn: 0/41016E48, prev 0/41014F00, desc: INSERT_LEAF off: 111, blkref #0: rel 1663/33287/2674 blk 7 rmgr: Heap2 len (rec/tot): 69/69, tx: 295181, lsn: 0/41016E90, prev 0/41016E48, desc: PRUNE snapshotConflictHorizon: 295177, nredirected: 0, ndead: 7, nunused: 0, redirected: [], dead: [20, 21, 22, 23, 24, 26, 27], unused: [], blkref #0: rel 1663/33287/1249 blk 17 rmgr: Transaction len (rec/tot):385/ 385, tx: 295181, lsn: 0/41016ED8, prev 0/41016E90, desc: INVALIDATION ; inval msgs: catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608 relcache 88102 rmgr: Standby len (rec/tot): 42/42, tx: 295181, lsn: 0/41017060, prev 0/41016ED8, desc: LOCK xid 295181 db 33287 rel 88102 rmgr: Transaction len (rec/tot):405/ 405, tx: 295181, lsn: 0/41017090, prev 0/41017060, desc: COMMIT 2023-12-18 08:35:22.342122 -03; inval msgs: catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608 relcache 88102 rmgr: Standby len (rec/tot): 42/42, tx: 295182, lsn: 0/41017228, prev 0/41017090, desc: LOCK xid 295182 db 33287 rel 88102 rmgr: Heap2 len (rec/tot): 61/61, tx: 295182, lsn: 0/41017258, prev 0/41017228, desc: PRUNE snapshotConflictHorizon: 295177, nredirected: 0, ndead: 3, nunused: 0, redirected: [], dead: [9, 12, 15], unused: [], blkref #0: rel 1663/33287/2608 blk 3 rmgr: Heap2 len (rec/tot): 60/60, tx: 295182, lsn: 0/41017298, prev 0/41017258, desc: NEW_CID rel: 1663/33287/1247, tid: 14/17, cmin: 4294967295, cmax: 0, combo: 4294967295 rmgr: Heaplen (rec/tot): 54/54, tx: 295182, lsn: 0/410172D8, prev 0/41017298, desc: DELETE xmax: 295182, off: 
17, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/33287/1247 blk 14 . . . rmgr: Heap2 len (rec/tot): 60/60, tx: 295182, lsn: 0/410178D8, prev 0/410178A0, desc: NEW_CID rel: 1663/33287/2608, tid: 3/24, cmin: 4294967295, cmax: 2, combo: 4294967295 rmgr: Heap len (rec/tot): 54/54, tx: 295182, lsn: 0/41017918, prev 0/410178D8, desc: DELETE
Re: Synchronizing slots from primary to standby
On Fri, Dec 15, 2023 at 11:03 AM shveta malik wrote: > > Sorry, I missed attaching the patch. PFA v48. > Few comments on v48_0002 1. +static void +slotsync_reread_config(WalReceiverConn *wrconn) { ... + pfree(old_primary_conninfo); + pfree(old_primary_slotname); + + if (restart) + { + char*msg = "slot sync worker will restart because of a parameter change"; + + /* + * The exit code 1 will make postmaster restart the slot sync worker. + */ + slotsync_worker_exit(msg, 1 /* proc_exit code */ ); + } ... I don't see the need to explicitly pfree in case we are already exiting the process because anyway the memory will be released. We can avoid using the 'restart' variable for this. Also, probably, directly exiting here makes sense and at another place where this function is used. I see that in maybe_reread_subscription(), we exit with a 0 code and still apply worker restarts, so why use a different exit code here? 2. +static void +check_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby) { ... + remote_in_recovery = DatumGetBool(slot_getattr(tupslot, 1, &isnull)); + Assert(!isnull); + + /* No need to check further, return that we are cascading standby */ + if (remote_in_recovery) + { + *am_cascading_standby = true; + CommitTransactionCommand(); + return; ... } Don't we need to clear the result and tuple in case of early return? 3. It would be a good idea to mention about requirements like a physical slot on primary, hot_standby_feedback, etc. in the commit message. 4. +static bool +wait_for_primary_slot_catchup(WalReceiverConn *wrconn, RemoteSlot *remote_slot, + bool *wait_attempts_exceeded) { ... + tupslot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple); + if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot)) + { + ereport(WARNING, + (errmsg("slot \"%s\" creation aborted", remote_slot->name), + errdetail("This slot was not found on the primary server"))); ... 
+ /* + * It is possible to get null values for LSN and Xmin if slot is + * invalidated on the primary server, so handle accordingly. + */ + new_invalidated = DatumGetBool(slot_getattr(tupslot, 1, &isnull)); + Assert(!isnull); + + new_restart_lsn = DatumGetLSN(slot_getattr(tupslot, 2, &isnull)); + if (new_invalidated || isnull) + { + ereport(WARNING, + (errmsg("slot \"%s\" creation aborted", remote_slot->name), + errdetail("This slot was invalidated on the primary server"))); ... } a. The errdetail message should end with a full stop. Please check all other errdetail messages in the patch to follow the same guideline. b. I think saying slot creation aborted is not completely correct because we would have created the slot especially when it is in 'i' state. Can we change it to something like: "aborting initial sync for slot \"%s\""? c. Also, if the remote_slot is invalidated, ideally, we can even drop the local slot but it seems that the patch will drop the same before the next-sync cycle with any other slot that needs to be dropped. If so, can we add the comment to indicate the same? 5. +static void +local_slot_update(RemoteSlot *remote_slot) +{ + Assert(MyReplicationSlot->data.invalidated == RS_INVAL_NONE); + + LogicalConfirmReceivedLocation(remote_slot->confirmed_lsn); + LogicalIncreaseXminForSlot(remote_slot->confirmed_lsn, +remote_slot->catalog_xmin); + LogicalIncreaseRestartDecodingForSlot(remote_slot->confirmed_lsn, + remote_slot->restart_lsn); + + SpinLockAcquire(&MyReplicationSlot->mutex); + MyReplicationSlot->data.invalidated = remote_slot->invalidated; + SpinLockRelease(&MyReplicationSlot->mutex); ... ... If required, the invalidated flag is updated in the caller as well, so why do we need to update it here as well? -- With Regards, Amit Kapila.
Re: Remove MSVC scripts from the tree
On Wed, 6 Dec 2023 at 12:59, Michael Paquier wrote: > > On Wed, Dec 06, 2023 at 12:15:50PM +0530, Shubham Khanna wrote: > > Patch is not applying. Please share the Rebased Version. Please find the > > error: > > Thanks. Here you go with a v6. Few comments: 1) Now that the MSVC build scripts are removed, should we have the reference to "MSVC build scripts" here? ltree.h: . /* * LOWER_NODE used to be defined in the Makefile via the compile flags. * However the MSVC build scripts neglected to do the same which resulted in * MSVC builds not using LOWER_NODE. Since then, the MSVC scripts have been * modified to look for -D compile flags in Makefiles, so here, in order to * get the historic behavior of LOWER_NODE not being defined on MSVC, we only * define it when not building in that environment. This is important as we * want to maintain the same LOWER_NODE behavior after a pg_upgrade. */ #ifndef _MSC_VER #define LOWER_NODE #endif . 2) I had seen that if sed/gzip is not available meson build will fail: 2.a) Program gsed sed found: NO meson.build:334:6: ERROR: Program 'gsed sed' not found or not executable 2.b) Program gzip found: NO meson.build:337:7: ERROR: Program 'gzip' not found or not executable Should we mention sed and gzip here? + + Bison and +Flex + + +Bison and Flex are +required. Only Bison versions 2.3 and later +will work. Flex must be version 2.5.35 or later. + Regards, Vignesh
Re: planner chooses incremental but not the best one
On Mon, Dec 18, 2023 at 7:31 AM Tomas Vondra wrote: > Oh! Now I see what you meant by using the new formula in 84f9a35e3 > depending on how we sum tuples. I agree that seems like the right thing. > > I'm not sure it'll actually help with the issue, though - if I apply the > patch, the plan does not actually change (and the cost changes just a > little bit). > > I looked at this only very briefly, but I believe it's due to the > assumption of independence I mentioned earlier - we end up using the new > formula introduced in 84f9a35e3, but it assumes the > selectivity and number of groups are independent. But that's not the > case here, because the groups are very clearly correlated (with the > condition on "b"). You're right. The patch allows us to adjust the estimate of distinct values for appendrels using the new formula introduced in 84f9a35e3. But if the restrictions are correlated with the grouping expressions, the new formula does not behave well. That's why the patch does not help in case [1], where 'b' and 'c' are correlated. OTOH, if the restrictions are not correlated with the grouping expressions, the new formula would perform quite well. And in that case the patch would help a lot, as shown in [2] where estimate_num_groups() gives a much more accurate estimate with the help of this patch. So this patch could be useful in certain situations. I'm wondering if we should at least have this patch (if it is right). > If that's the case, I'm not sure how to fix this :-( The commit message of 84f9a35e3 says This could possibly be improved upon in the future by identifying correlated restrictions and using a hybrid of the old and new formulae. Maybe this is something we can consider trying. But that is not an easy task, I suppose. 
[1] https://www.postgresql.org/message-id/CAMbWs4-FtsJ0dQUv6g%3D%3DXR_gsq%3DFj9oiydW6gbqwQ_wrbU0osw%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAMbWs4-ocromEKMtVDH3RBMuAJQaQDK0qi4k6zOuvpOnMWZauw%40mail.gmail.com Thanks Richard
Re: Move walreceiver state assignment (to WALRCV_STREAMING) in WalReceiverMain()
Hi, On 12/13/23 3:33 PM, Michael Paquier wrote: On Tue, Dec 12, 2023 at 04:54:32PM -0300, Euler Taveira wrote: Couldn't it give up before starting if you apply your patch? My main concern is that on a slow system, walrcv_connect() may take too long in WalReceiverMain() and the code above kills the walreceiver while it is in the process of starting. Since you cannot control the hardcoded WALRCV_STARTUP_TIMEOUT value, you might have issues during overload periods. Sounds like a fair point to me, Thanks for looking at it! I'm not sure about it, see my comment in [1]. Another thing that I'm a bit surprised with is why it would be OK to switch the status to STREAMING only when first_stream is set, discarding the restart case. Yeah, that looks like a miss on my side. Thanks for pointing that out! Please find attached v2 addressing this remark. [1]: https://www.postgresql.org/message-id/c76c0a65-f754-4614-b616-1d48f9195745%40gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 12da1b3c6295e2369b3a65c8f4ee40882def310f Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Mon, 11 Dec 2023 20:05:25 + Subject: [PATCH v2] move walrcv->walRcvState assignment to WALRCV_STREAMING walrcv->walRcvState = WALRCV_STREAMING was set too early. Move the assignment to a more appropriate place. 
--- src/backend/replication/walreceiver.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) 100.0% src/backend/replication/ diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 26ded928a7..5f612d354e 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -242,7 +242,6 @@ WalReceiverMain(void) } /* Advertise our PID so that the startup process can kill us */ walrcv->pid = MyProcPid; - walrcv->walRcvState = WALRCV_STREAMING; /* Fetch information required to start streaming */ walrcv->ready_to_display = false; @@ -412,6 +411,9 @@ WalReceiverMain(void) options.proto.physical.startpointTLI = startpointTLI; if (walrcv_startstreaming(wrconn, &options)) { + SpinLockAcquire(&walrcv->mutex); + walrcv->walRcvState = WALRCV_STREAMING; + SpinLockRelease(&walrcv->mutex); if (first_stream) ereport(LOG, (errmsg("started streaming WAL from primary at %X/%X on timeline %u", -- 2.34.1
Re: Move walreceiver state assignment (to WALRCV_STREAMING) in WalReceiverMain()
Hi, On 12/12/23 8:54 PM, Euler Taveira wrote: On Tue, Dec 12, 2023, at 12:58 PM, Drouvot, Bertrand wrote: Currently walrcv->walRcvState is set to WALRCV_STREAMING at the beginning of WalReceiverMain(). But it seems that after this assignment things could be wrong before the walreicever actually starts streaming (like not being able to connect to the primary). It looks to me that WALRCV_STREAMING should be set once walrcv_startstreaming() returns true: this is the proposal of this patch. Per the state name (streaming), it seems it should be set later as you proposed, Thanks for looking at it! however, I'm concerned about the previous state (WALRCV_STARTING). If I'm reading the code correctly, WALRCV_STARTING is assigned at RequestXLogStreaming(): SetInstallXLogFileSegmentActive(); RequestXLogStreaming(tli, ptr, PrimaryConnInfo, PrimarySlotName, wal_receiver_create_temp_slot); flushedUpto = 0; } /* * Check if WAL receiver is active or wait to start up. */ if (!WalRcvStreaming()) { lastSourceFailed = true; break; } After a few lines the function WalRcvStreaming() has: SpinLockRelease(&walrcv->mutex); /* * If it has taken too long for walreceiver to start up, give up. Setting * the state to STOPPED ensures that if walreceiver later does start up * after all, it will see that it's not supposed to be running and die * without doing anything. */ if (state == WALRCV_STARTING) { pg_time_t now = (pg_time_t) time(NULL); if ((now - startTime) > WALRCV_STARTUP_TIMEOUT) { bool stopped = false; SpinLockAcquire(&walrcv->mutex); if (walrcv->walRcvState == WALRCV_STARTING) { state = walrcv->walRcvState = WALRCV_STOPPED; stopped = true; } SpinLockRelease(&walrcv->mutex); if (stopped) ConditionVariableBroadcast(&walrcv->walRcvStoppedCV); } } Couldn't it give up before starting if you apply your patch? My main concern is due to a slow system, the walrcv_connect() took to long in WalReceiverMain() and the code above kills the walreceiver while in the process to start it. 
Yeah, so it looks to me that the sequence of events is: 1) The startup process sets walrcv->walRcvState = WALRCV_STARTING (in RequestXLogStreaming()) 2) The startup process sets walrcv->startTime (in RequestXLogStreaming()) 3) The startup process then asks the postmaster to start the walreceiver 4) Then the startup process checks if WalRcvStreaming() is true Note that 3) does not wait for the walreceiver to actually start: it "just" sets a flag and signals (SIGUSR1) the postmaster (in SendPostmasterSignal()). It means that if the time between 1 and 4 is <= WALRCV_STARTUP_TIMEOUT (10 seconds) then WalRcvStreaming() returns true (even if the walreceiver is not streaming yet). So it looks to me that even if the walreceiver does take time to start streaming, as long as the time between 1 and 4 is <= 10 seconds we are fine. And I think that is fine because WalRcvStreaming() does not "only" check that the walreceiver is streaming; as its comment states: " /* * Is walreceiver running and streaming (or at least attempting to connect, * or starting up)? */ " Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
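The give-up logic being discussed reduces to a tiny state machine: WALRCV_STARTING carries a start time, and WalRcvStreaming() only demotes it to WALRCV_STOPPED once it has been pending longer than WALRCV_STARTUP_TIMEOUT. A stand-alone sketch of just that decision (the state names and the 10-second constant follow the quoted code; everything else is simplified and is not the actual walreceiver code):

```c
#include <assert.h>
#include <time.h>

typedef enum { WALRCV_STOPPED, WALRCV_STARTING, WALRCV_STREAMING } WalRcvState;

#define WALRCV_STARTUP_TIMEOUT 10   /* seconds, as in WalRcvStreaming() */

/*
 * Model of the check in WalRcvStreaming(): STREAMING stays as-is,
 * STARTING survives within the grace period but is demoted to STOPPED
 * once it has been pending longer than the timeout.
 */
static WalRcvState
walrcv_check_state(WalRcvState state, time_t startTime, time_t now)
{
    if (state == WALRCV_STARTING && (now - startTime) > WALRCV_STARTUP_TIMEOUT)
        return WALRCV_STOPPED;      /* timed out: give up */
    return state;
}
```

Within the 10-second window the startup process still treats the receiver as alive even though no WAL is flowing yet, which is why delaying the WALRCV_STREAMING assignment until walrcv_startstreaming() succeeds does not trip the check in step 4.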
Re: Simplify newNode()
On Fri, Dec 15, 2023 at 5:44 AM Tom Lane wrote: > > I did check that the v1 patch successfully inlines newNode() and > reduces it to just a MemoryContextAllocZeroAligned call, so it's > correct that modern compilers do that better than whatever I tested > in 2008. But I wonder what is happening in v2 to reduce the code > size so much. MemoryContextAllocZeroAligned is not 20kB by itself. I poked at this a bit and it seems to come from what Heikki said upthread about fewer instructions before the calls: Running objdump on v1 and v2 copyfuncs.o and diff'ing shows there are fewer MOV instructions (some extraneous stuff removed): e9 da 5f 00 00 jmp<_copyReindexStmt> - 48 8b 05 00 00 00 00movrax,QWORD PTR [rip+0x0] - be 18 00 00 00 movesi,0x18 - 48 8b 38movrdi,QWORD PTR [rax] - e8 00 00 00 00 call MemoryContextAllocZeroAligned-0x4 + bf 18 00 00 00 movedi,0x18 + e8 00 00 00 00 call palloc0-0x4 That's 10 bytes savings. - 48 8b 05 00 00 00 00movrax,QWORD PTR [rip+0x0] - 48 8b 38movrdi,QWORD PTR [rax] - e8 00 00 00 00 call MemoryContextAllocZeroAligned-0x4 + e8 00 00 00 00 call palloc0-0x4 ...another 10 bytes. Over and over again. Because of the size differences, the compiler is inlining more: e.g. in v1 _copyFieldStore has 4 call sites, but in v2 it got inlined. About the patch, I'm wondering if this whitespace is intentional, but it's otherwise straightforward: --- a/src/include/nodes/nodes.h +++ b/src/include/nodes/nodes.h @@ -132,6 +132,7 @@ typedef struct Node #define nodeTag(nodeptr) (((const Node*)(nodeptr))->type) + /*
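The codegen difference described above comes from the call sequence at each site: the v1 shape must first load CurrentMemoryContext and pass it explicitly (the extra MOVs in the objdump diff), while the v2 shape is a single call taking only the size, with the context load folded into the callee. A toy model of the two shapes (calloc stands in for the allocator; none of this is the actual PostgreSQL code):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct MemoryContextData { int dummy; } *MemoryContext;

static struct MemoryContextData TopContext;
MemoryContext CurrentMemoryContext = &TopContext;

/* v1 shape: every call site loads CurrentMemoryContext itself and
 * passes it as an argument, costing extra instructions per site. */
static void *
context_alloc_zero(MemoryContext ctx, size_t size)
{
    (void) ctx;                 /* the stand-in ignores the context */
    return calloc(1, size);     /* zeroed allocation, like palloc0 */
}

/* v2 shape: the context load happens once, inside the callee, so
 * each inlined newNode() site shrinks to a single CALL instruction. */
static void *
my_palloc0(size_t size)
{
    return context_alloc_zero(CurrentMemoryContext, size);
}
```

Because the v2 call sites are smaller, the compiler is also more willing to inline their callers, matching the _copyFieldStore observation above.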
Re: remaining sql/json patches
Hi! Another minor issue I found: +SELECT pg_get_expr(adbin, adrelid) +FROM pg_attrdef +WHERE adrelid = 'test_jsonb_constraints'::regclass +ORDER BY 1; + +SELECT pg_get_expr(adbin, adrelid) FROM pg_attrdef WHERE adrelid = 'test_jsonb_constraints'::regclass; I think these two queries are the same? Why do we test it twice?
Re: Transaction timeout
> On 18 Dec 2023, at 14:32, Japin Li wrote: > > > Thanks for updating the patch Sorry for the noise, but commitfest bot found one more bug in handling statement timeout. PFA v11. Best regards, Andrey Borodin. v11-0001-Introduce-transaction_timeout.patch Description: Binary data
Re: Transaction timeout
On Mon, 18 Dec 2023 at 13:49, Andrey M. Borodin wrote: >> On 16 Dec 2023, at 05:58, Japin Li wrote: >> >> >> On Fri, 15 Dec 2023 at 17:51, Andrey M. Borodin wrote: On 8 Dec 2023, at 15:29, Japin Li wrote: Thanks for updating the patch. LGTM. >>> >>> PFA v9. Changes: >>> 1. Added tests for idle_in_transaction_timeout >>> 2. Suppress statement_timeout if it’s shorter than transaction_timeout >>> >> + if (StatementTimeout > 0 >> + && IdleInTransactionSessionTimeout < TransactionTimeout) >> ^ >> >> Should be StatementTimeout? > Yes, that’s an oversight. I’ve adjusted tests so they catch this problem. > >> Maybe we should add documentation to describe this behavior. > > I've added a paragraph about it to config.sgml, but I'm not sure about the > comprehensiveness of the wording. > Thanks for updating the patch, no objections. -- Regards, Japin Li ChengDu WenWu Information Technology Co., Ltd.
Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression
On 11.12.23 13:22, Amul Sul wrote: create table t1 (a int, b int generated always as (a + 1) stored); alter table t1 add column c int, alter column b set expression as (a + c); ERROR: 42703: column "c" does not exist I think intuitively, this ought to work. Maybe just moving the new pass after AT_PASS_ADD_COL would do it. I think we can't support that (like alter type) since we need to place this new pass before AT_PASS_OLD_INDEX & AT_PASS_OLD_CONSTR to re-add indexes and constraints for the validation. Could we have AT_PASS_ADD_COL before AT_PASS_OLD_*? So overall it would be ... AT_PASS_ALTER_TYPE, AT_PASS_ADD_COL, // moved AT_PASS_SET_EXPRESSION, // new AT_PASS_OLD_INDEX, AT_PASS_OLD_CONSTR, AT_PASS_ADD_CONSTR, ... This appears to not break any existing tests.
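The proposed ordering is easy to picture because ALTER TABLE queues work items per pass and runs the passes in enum order: anything queued under AT_PASS_ADD_COL is complete before an AT_PASS_SET_EXPRESSION item resolves column references, and the re-added indexes and constraints still come after, so they validate against the new expression. A toy model of that ordering (pass names follow the thread; the helper is invented for illustration):

```c
#include <assert.h>

/* Pass order as proposed above: ADD_COL moved before the new pass. */
typedef enum {
    PASS_ALTER_TYPE,
    PASS_ADD_COL,         /* moved earlier, per the proposal */
    PASS_SET_EXPRESSION,  /* new pass */
    PASS_OLD_INDEX,
    PASS_OLD_CONSTR,
    PASS_ADD_CONSTR
} AlterPass;

/*
 * Work items queued under pass a are fully executed before any item
 * queued under pass b whenever a sorts earlier in the enum.
 */
static int
runs_before(AlterPass a, AlterPass b)
{
    return a < b;
}
```

With this ordering, the `alter table t1 add column c int, alter column b set expression as (a + c)` example should find column "c" already in place.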
Re: trying again to get incremental backup
Another set of comments, about the patch that adds pg_combinebackup: Make sure all the options are listed in a consistent order. We have lately changed everything to be alphabetical. This includes: - reference page pg_combinebackup.sgml - long_options listing - getopt_long() argument - subsequent switch - (--help output, but it looks ok as is) Also, in pg_combinebackup.sgml, the option --sync-method is listed as if it does not take an argument, but it does.
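The convention being asked for is mechanical: keep the long_options array, the optstring, and the dispatch in the same alphabetical order, and declare --sync-method with a required argument. A stripped-down sketch of the shape (the option set here is invented for illustration and is not pg_combinebackup's real list; only --sync-method mirrors the thread):

```c
#include <assert.h>
#include <getopt.h>
#include <string.h>

/* Alphabetical order: debug, output, sync-method. */
static struct option long_options[] = {
    {"debug", no_argument, NULL, 'd'},
    {"output", required_argument, NULL, 'o'},
    {"sync-method", required_argument, NULL, 1},  /* takes an argument */
    {NULL, 0, NULL, 0}
};

static const char *
parse_sync_method(int argc, char **argv)
{
    const char *sync_method = "fsync";  /* a presumed default */
    int         c;

    optind = 1;                         /* reset getopt state */
    while ((c = getopt_long(argc, argv, "do:", long_options, NULL)) != -1)
    {
        if (c == 1)
            sync_method = optarg;       /* --sync-method ARG */
    }
    return sync_method;
}

/* Small self-checks used below. */
static int
demo(void)
{
    char *argv[] = {"pg_combinebackup", "--sync-method", "syncfs", NULL};

    return strcmp(parse_sync_method(3, argv), "syncfs") == 0;
}

static int
demo_default(void)
{
    char *argv[] = {"pg_combinebackup", NULL};

    return strcmp(parse_sync_method(1, argv), "fsync") == 0;
}
```

Because required_argument is set, `--sync-method` without a value is rejected by getopt_long itself, which is what the sgml page should also reflect.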
Re: introduce dynamic shared memory registry
Hi! This patch looks like a good solution to a real pain point, and I'm also in favor of it being committed. I have looked through the code and agree with Andrei: the code looks good. Just a suggestion - maybe it is worth adding a function for detaching the segment, for cases when we unload and/or re-load the extension? -- Regards, Nikita Malakhov Postgres Professional The Russian Postgres Company https://postgrespro.ru/
Re: introduce dynamic shared memory registry
On 18/12/2023 13:39, Andrei Lepikhov wrote: On 5/12/2023 10:46, Nathan Bossart wrote: I don't presently have any concrete plans to use this for anything, but I thought it might be useful for extensions for caching, etc. and wanted to see whether there was any interest in the feature. I am delighted that you commenced this thread. Designing extensions, every time I feel pain introducing one shared value or some global stat, the extension must be required to be loadable on startup only. It reduces the flexibility of even very lightweight extensions, which look harmful to use in a cloud. After looking into the code, I have some comments: 1. The code looks good; I didn't find possible mishaps. Some proposed changes are in the attachment. 2. I think a separate file for this feature looks too expensive. According to the gist of that code, it is a part of the DSA module. 3. The dsm_registry_init_or_attach routine allocates a DSM segment. Why not create dsa_area for a requestor and return it? -- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/storage/ipc/dsm_registry.c b/src/backend/storage/ipc/dsm_registry.c index ea80f45716..0343ce987f 100644 --- a/src/backend/storage/ipc/dsm_registry.c +++ b/src/backend/storage/ipc/dsm_registry.c @@ -45,8 +45,8 @@ static const dshash_parameters dsh_params = { LWTRANCHE_DSM_REGISTRY_HASH }; -static dsa_area *dsm_registry_dsa; -static dshash_table *dsm_registry_table; +static dsa_area *dsm_registry_dsa = NULL; +static dshash_table *dsm_registry_table = NULL; static void init_dsm_registry(void); @@ -83,13 +83,20 @@ init_dsm_registry(void) { /* Quick exit if we already did this. */ if (dsm_registry_table) + { + Assert(dsm_registry_dsa != NULL); return; + } + + Assert(dsm_registry_dsa == NULL); /* Otherwise, use a lock to ensure only one process creates the table. 
*/ LWLockAcquire(DSMRegistryLock, LW_EXCLUSIVE); if (DSMRegistryCtx->dshh == DSHASH_HANDLE_INVALID) { + Assert(DSMRegistryCtx->dsah == DSHASH_HANDLE_INVALID); + /* Initialize dynamic shared hash table for registry. */ dsm_registry_dsa = dsa_create(LWTRANCHE_DSM_REGISTRY_DSA); dsa_pin(dsm_registry_dsa); @@ -102,6 +109,8 @@ init_dsm_registry(void) } else { + Assert(DSMRegistryCtx->dsah != DSHASH_HANDLE_INVALID); + /* Attach to existing dynamic shared hash table. */ dsm_registry_dsa = dsa_attach(DSMRegistryCtx->dsah); dsa_pin_mapping(dsm_registry_dsa); diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h index 665d471418..e0e7b3b765 100644 --- a/src/include/storage/lwlock.h +++ b/src/include/storage/lwlock.h @@ -209,7 +209,7 @@ typedef enum BuiltinTrancheIds LWTRANCHE_LAUNCHER_HASH, LWTRANCHE_DSM_REGISTRY_DSA, LWTRANCHE_DSM_REGISTRY_HASH, - LWTRANCHE_FIRST_USER_DEFINED + LWTRANCHE_FIRST_USER_DEFINED, } BuiltinTrancheIds; /*
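The asserts above decorate the classic create-or-attach pattern: fast path out if this backend already attached, otherwise take the registry lock and either create the shared structure (first process) or attach to the already-published handle. A single-process model of that control flow (the lock and handle here are stand-ins, not the real LWLock/DSA APIs):

```c
#include <assert.h>
#include <stddef.h>

#define HANDLE_INVALID 0

/*
 * Stand-ins: in PostgreSQL, shared_handle lives in shared memory
 * (DSMRegistryCtx) and the lock functions are LWLockAcquire/Release
 * on DSMRegistryLock.  This model is single-process only.
 */
static unsigned long shared_handle = HANDLE_INVALID;
static void *local_attachment = NULL;   /* per-backend pointer */
static int creations = 0;               /* how many times we created */

static void lock_registry(void)   { /* LWLockAcquire(..., LW_EXCLUSIVE) */ }
static void unlock_registry(void) { /* LWLockRelease(...) */ }

static void *
registry_init_or_attach(void)
{
    /* Quick exit if this process already attached. */
    if (local_attachment)
        return local_attachment;

    lock_registry();
    if (shared_handle == HANDLE_INVALID)
    {
        /* First process: create and publish the handle. */
        shared_handle = 42;
        creations++;
    }
    /* Otherwise: attach to the existing handle. */
    local_attachment = &shared_handle;
    unlock_registry();

    return local_attachment;
}
```

The asserts Andrei adds in the patch encode the same invariant this model maintains: the per-backend pointer and the published handle are either both unset or both set, never mixed.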