Re: add AVX2 support to simd.h
On Sat, Mar 16, 2024 at 2:40 AM Nathan Bossart wrote: > > On Fri, Mar 15, 2024 at 12:41:49PM -0500, Nathan Bossart wrote: > > I've also attached the results of running this benchmark on my machine at > > HEAD, after applying 0001, and after applying both 0001 and 0002. 0001 > > appears to work pretty well. When there is a small "tail," it regresses a > > small amount, but overall, it seems to improve more cases than it harms. > > 0002 does regress searches on smaller arrays quite a bit, since it > > postpones the SIMD optimizations until the arrays are longer. It might be > > possible to mitigate by using 2 registers when the "tail" is long enough, > > but I have yet to try that. > > The attached 0003 is a sketch of what such mitigation might look like. It > appears to help with the regressions nicely. I omitted the benchmarking > patch in v3 to appease cfbot. I haven't looked at the patches, but the graphs look good.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 15, 2024 at 9:17 PM Masahiko Sawada wrote: > > On Fri, Mar 15, 2024 at 4:36 PM John Naylor wrote: > > > > On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada > > wrote: > > > Given TidStoreSetBlockOffsets() is designed to always set (i.e. > > > overwrite) the value, I think we should not expect that found is > > > always false. > > > > I find that a puzzling statement, since 1) it was designed for > > insert-only workloads, not actual overwrite IIRC and 2) the tests will > > now fail if the same block is set twice, since we just switched the > > tests to use a remnant of vacuum's old array. Having said that, I > > don't object to removing artificial barriers to using it for purposes > > not yet imagined, as long as test_tidstore.sql warns against that. > > I think that if it supports only insert-only workload and expects the > same block is set only once, it should raise an error rather than an > assertion. It's odd to me that the function fails only with an > assertion build assertions even though it actually works fine even in > that case. After thinking some more, I think you're right -- it's too heavy-handed to throw an error/assert and a public function shouldn't make assumptions about the caller. It's probably just a matter of documenting the function (and it's lack of generality), and the tests (which are based on the thing we're replacing). > As for test_tidstore you're right that the test code doesn't handle > the case where setting the same block twice. I think that there is no > problem in the fixed-TIDs tests, but we would need something for > random-TIDs tests so that we don't set the same block twice. I guess > it could be trivial since we can use SQL queries to generate TIDs. I'm > not sure how the random-TIDs tests would be like, but I think we can > use SELECT DISTINCT to eliminate the duplicates of block numbers to > use. Also, I don't think we need random blocks, since the radix tree tests excercise that heavily already. 
Random offsets is what I was thinking of (if made distinct and ordered), but even there the code is fairy trivial, so I don't have a strong feeling about it. > > Given the above two things, I think this function's comment needs > > stronger language about its limitations. Perhaps even mention that > > it's intended for, and optimized for, vacuum. You and I have long > > known that tidstore would need a separate, more complex, function to > > add or remove individual tids from existing entries, but it might be > > good to have that documented. > > Agreed. How about this: /* - * Set the given TIDs on the blkno to TidStore. + * Create or replace an entry for the given block and array of offsets * - * NB: the offset numbers in offsets must be sorted in ascending order. + * NB: This function is designed and optimized for vacuum's heap scanning + * phase, so has some limitations: + * - The offset numbers in "offsets" must be sorted in ascending order. + * - If the block number already exists, the entry will be replaced -- + * there is no way to add or remove offsets from an entry. */ void TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, I think we can stop including the debug-tid-store patch for CI now. That would allow getting rid of some unnecessary variables. More comments: + * Prepare to iterate through a TidStore. Since the radix tree is locked during + * the iteration, TidStoreEndIterate() needs to be called when finished. + * Concurrent updates during the iteration will be blocked when inserting a + * key-value to the radix tree. This is outdated. Locking is optional. The remaining real reason now is that TidStoreEndIterate needs to free memory. We probably need to say something about locking, too, but not this. + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs + * in one block. We return the block numbers in ascending order and the offset + * numbers in each result is also sorted in ascending order. 
+ */ +TidStoreIterResult * +TidStoreIterateNext(TidStoreIter *iter) The wording is a bit awkward. +/* + * Finish an iteration over TidStore. This needs to be called after finishing + * or when existing an iteration. + */ s/existing/exiting/ ? It seems to say we need to finish after finishing. Maybe more precise wording. +/* Extract TIDs from the given key-value pair */ +static void +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, BlocktableEntry *page) This is a leftover from the old encoding scheme. This should really take a "BlockNumber blockno" not a "key", and the only call site should probably cast the uint64 to BlockNumber. + * tidstore.h + * Tid storage. + * + * + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group Update year. +typedef struct BlocktableEntry +{ + uint16 nwords; + bitmapword words[FLEXIBLE_ARRAY_MEMBER]; +} BlocktableEntry; In my WIP for runtime-embeddable offsets, nwords needs to be one byte. That doesn't have any
Re: REVOKE FROM warning on grantor
=?ISO-8859-1?Q?=C9tienne?= BERSAC writes: > I'll try to patch the behaviour to ensure an error if the REVOKE is > ineffective. I think we're unlikely to accept such a patch. By my reading, the way we do it now is required by the SQL standard. The standard doesn't seem to say that in so many words; what it does say (from SQL99) is b) If the is a , then for every specified, a set of role authorization descriptors is identified. A role authorization descriptor is said to be identified if it defines the grant of any of the specified s to with grantor A. It does not say that that set must be nonempty. Admittedly it's not very clear from this one point. However, if you look around in the standard it seems clear that they expect no-op revokes to be no-ops not errors. As an example, every type of DROP command includes an explicit step to drop privileges attached to the object, with wording like (this is for ALTER TABLE DROP COLUMN): 3) Let A be the that owns T. The following is effectively executed with a current authorization identifier of "_SYSTEM" and without further Access Rule checking: REVOKE INSERT(CN), UPDATE(CN), SELECT(CN), REFERENCES(CN) ON TABLE TN FROM A CASCADE There is no special rule for the case that all (or any...) of those privileges were previously revoked; but if that case is supposed to be an error, there would have to be an exception here, or you couldn't drop such columns. Even taking the position that this is an unspecified point that we could implement how we like, I don't think there's a sufficient argument for changing behavior that's stood for a couple of decades. (The spelling of the message has changed over the years, but giving a warning not an error appears to go all the way back to 99b8f8451 where we implemented user groups.) It is certain that there are applications out there that rely on this behavior and would break. regards, tom lane
Re: Support json_errdetail in FRONTEND builds
> On 16 Mar 2024, at 00:59, Andrew Dunstan wrote: >> On Mar 16, 2024, at 8:53 AM, Daniel Gustafsson wrote: >>> On 15 Mar 2024, at 21:56, Andrew Dunstan wrote: >>> I'd like to get this done ASAP so I can rebase my incremental parse >>> patchset. Daniel, do you want to commit it? If not, I can. >> >> Sure, I can commit these patches. It won't be until tomorrow though since I >> prefer to have ample time to monitor the buildfarm, if you are in a bigger >> rush >> than that then feel free to go ahead. > > tomorrow will be fine, thanks Sorry, I ran into some unforeseen scheduling issues and had less time available than planned. I have pushed the 0001 StringInfo patch to reduce the work for tomorrow when I will work on 0002 unless beaten to it. -- Daniel Gustafsson
Re: pg_upgrade failing for 200+ million Large Objects
Laurenz Albe writes: > On Fri, 2024-03-15 at 19:18 -0400, Tom Lane wrote: >> This patch seems to have stalled out again. In hopes of getting it >> over the finish line, I've done a bit more work to address the two >> loose ends I felt were probably essential to deal with: > Applies and builds fine. > I didn't scrutinize the code, but I gave it a spin on a database with > 15 million (small) large objects. I tried pg_upgrade --link with and > without the patch on a debug build with the default configuration. Thanks for looking at it! > Without the patch: > Runtime: 74.5 minutes > With the patch: > Runtime: 70 minutes Hm, I'd have hoped for a bit more runtime improvement. But perhaps not --- most of the win we saw upthread was from parallelism, and I don't think you'd get any parallelism in a pg_upgrade with all the data in one database. (Perhaps there is more to do there later, but I'm still not clear on how this should interact with the existing cross-DB parallelism; so I'm content to leave that question for another patch.) regards, tom lane
Re: pg_upgrade failing for 200+ million Large Objects
On Fri, 2024-03-15 at 19:18 -0400, Tom Lane wrote: > This patch seems to have stalled out again. In hopes of getting it > over the finish line, I've done a bit more work to address the two > loose ends I felt were probably essential to deal with: Applies and builds fine. I didn't scrutinize the code, but I gave it a spin on a database with 15 million (small) large objects. I tried pg_upgrade --link with and without the patch on a debug build with the default configuration. Without the patch: Runtime: 74.5 minutes Memory usage: ~7GB Disk usage: an extra 5GB dump file + log file during the dump With the patch: Runtime: 70 minutes Memory usage: ~1GB Disk usage: an extra 0.5GB during the dump Memory usage stayed stable once it reached its peak, so no noticeable memory leaks. The reduced memory usage is great. I was surprised by the difference in disk usage: the lion's share is the dump file, and that got substantially smaller. But also the log file shrank considerably, because not every individual large object gets logged. I had a look at "perf top", and the profile looked pretty similar in both cases. The patch is a clear improvement. Yours, Laurenz Albe
Re: Q: Escapes in jsonpath Idents
On Mar 16, 2024, at 14:39, David E. Wheeler wrote: > I went looking for the JavaScript rules for an identifier and found this in > the MDN docs[2]: > >> In JavaScript, identifiers can contain Unicode letters, $, _, and digits >> (0-9), but may not start with a digit. An identifier differs from a string >> in that a string is data, while an identifier is part of the code. In >> JavaScript, there is no way to convert identifiers to strings, but sometimes >> it is possible to parse strings into identifiers. Coda: Dollar signs don’t work at all outside double-quoted string identifiers: david=# select '$.$foo'::jsonpath; ERROR: syntax error at or near "$foo" of jsonpath input LINE 1: select '$.$foo'::jsonpath; ^ david=# select '$.f$oo'::jsonpath; ERROR: syntax error at or near "$oo" of jsonpath input LINE 1: select '$.f$oo'::jsonpath; ^ david=# select '$."$foo"'::jsonpath; jsonpath -- $."$foo" This, too, contradicts the MDM definition an identifier (and some quick browser tests). Best, David
Re: REVOKE FROM warning on grantor
Hi David, > That should have read: the granted permission does not exist Thanks, its clear. However, I'm hitting the warning when removing a role from a group. But the membership remains after the warning. In this case, I expect an error. I'll try to patch the behaviour to ensure an error if the REVOKE is ineffective. Regards, Étienne
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/16/24 20:12, Andres Freund wrote: > Hi, > > On 2024-03-15 18:42:29 -0400, Melanie Plageman wrote: >> On Fri, Mar 15, 2024 at 5:14 PM Andres Freund wrote: >>> On 2024-03-14 17:39:30 -0400, Melanie Plageman wrote: >>> I spent a good amount of time looking into this with Melanie. After a bunch >>> of >>> wrong paths I think I found the issue: We end up prefetching blocks we have >>> already read. Notably this happens even as-is on master - just not as >>> frequently as after moving BitmapAdjustPrefetchIterator(). >>> >>> From what I can tell the prefetching in parallel bitmap heap scans is >>> thoroughly broken. I added some tracking of the last block read, the last >>> block prefetched to ParallelBitmapHeapState and found that with a small >>> effective_io_concurrency we end up with ~18% of prefetches being of blocks >>> we >>> already read! After moving the BitmapAdjustPrefetchIterator() to rises to >>> 86%, >>> no wonder it's slower... >>> >>> The race here seems fairly substantial - we're moving the two iterators >>> independently from each other, in multiple processes, without useful >>> locking. >>> >>> I'm inclined to think this is a bug we ought to fix in the backbranches. >> >> Thinking about how to fix this, perhaps we could keep the current max >> block number in the ParallelBitmapHeapState and then when prefetching, >> workers could loop calling tbm_shared_iterate() until they've found a >> block at least prefetch_pages ahead of the current block. They >> wouldn't need to read the current max value from the parallel state on >> each iteration. Even checking it once and storing that value in a >> local variable prevented prefetching blocks after reading them in my >> example repro of the issue. > > That would address some of the worst behaviour, but it doesn't really seem to > address the underlying problem of the two iterators being modified > independently. 
ISTM the proper fix would be to protect the state of the > iterators with a single lock, rather than pushing down the locking into the > bitmap code. OTOH, we'll only need one lock going forward, so being economic > in the effort of fixing this is also important. > Can you share some details about how you identified the problem, counted the prefetches that happen too late, etc? I'd like to try to reproduce this to understand the issue better. If I understand correctly, what may happen is that a worker reads blocks from the "prefetch" iterator, but before it manages to issue the posix_fadvise, some other worker already did pread. Or can the iterators get "out of sync" in a more fundamental way? If my understanding is correct, why would a single lock solve that? Yes, we'd advance the iterators at the same time, but surely we'd not issue the fadvise calls while holding the lock, and the prefetch/fadvise for a particular block could still happen in different workers. I suppose a dirty PoC fix should not be too difficult, and it'd allow us to check if it works. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: REVOKE FROM warning on grantor
On Sat, Mar 16, 2024 at 1:00 PM Étienne BERSAC wrote: > > > The choice of warning is made because after the command ends the > > grantmin question does not exist. The revoke was a no-op and the > > final state is as the user intended. > > > Sorry, can you explain me what's the grantmin question is ? > > That should have read: the granted permission does not exist David J.
Re: REVOKE FROM warning on grantor
Hi David, Thanks for your answer. > The choice of warning is made because after the command ends the > grantmin question does not exist. The revoke was a no-op and the > final state is as the user intended. Sorry, can you explain me what's the grantmin question is ? Regards, Étienne
Re: BitmapHeapScan streaming read user and prelim refactoring
Hi, On 2024-03-15 18:42:29 -0400, Melanie Plageman wrote: > On Fri, Mar 15, 2024 at 5:14 PM Andres Freund wrote: > > On 2024-03-14 17:39:30 -0400, Melanie Plageman wrote: > > I spent a good amount of time looking into this with Melanie. After a bunch > > of > > wrong paths I think I found the issue: We end up prefetching blocks we have > > already read. Notably this happens even as-is on master - just not as > > frequently as after moving BitmapAdjustPrefetchIterator(). > > > > From what I can tell the prefetching in parallel bitmap heap scans is > > thoroughly broken. I added some tracking of the last block read, the last > > block prefetched to ParallelBitmapHeapState and found that with a small > > effective_io_concurrency we end up with ~18% of prefetches being of blocks > > we > > already read! After moving the BitmapAdjustPrefetchIterator() to rises to > > 86%, > > no wonder it's slower... > > > > The race here seems fairly substantial - we're moving the two iterators > > independently from each other, in multiple processes, without useful > > locking. > > > > I'm inclined to think this is a bug we ought to fix in the backbranches. > > Thinking about how to fix this, perhaps we could keep the current max > block number in the ParallelBitmapHeapState and then when prefetching, > workers could loop calling tbm_shared_iterate() until they've found a > block at least prefetch_pages ahead of the current block. They > wouldn't need to read the current max value from the parallel state on > each iteration. Even checking it once and storing that value in a > local variable prevented prefetching blocks after reading them in my > example repro of the issue. That would address some of the worst behaviour, but it doesn't really seem to address the underlying problem of the two iterators being modified independently. ISTM the proper fix would be to protect the state of the iterators with a single lock, rather than pushing down the locking into the bitmap code. 
OTOH, we'll only need one lock going forward, so being economic in the effort of fixing this is also important. Greetings, Andres Freund
Re: Vectored I/O in bulk_write.c
Hi, On 2024-03-16 12:27:15 +1300, Thomas Munro wrote: > I canvassed Andres off-list since smgrzeroextend() is his invention, > and he wondered if it was a good idea to blur the distinction between > the different zero-extension strategies like that. Good question. My > take is that it's fine: > > mdzeroextend() already uses fallocate() only for nblocks > 8, but > otherwise writes out zeroes, because that was seen to interact better > with file system heuristics on common systems. That preserves current > behaviour for callers of plain-old sgmrextend(), which becomes a > wrapper for smgrwrite(..., 1, _ZERO | _EXTEND). If some hypothetical > future caller wants to be able to call smgrwritev(..., NULL, 9 blocks, > _ZERO | _EXTEND) directly without triggering the fallocate() strategy > for some well researched reason, we could add a new flag to say so, ie > adjust that gating. > > In other words, we have already blurred the semantics. To me, it > seems nicer to have a single high level interface for the same logical > operation, and flags to select strategies more explicitly if that is > eventually necessary. I don't think zeroextend on the one hand and and on the other hand a normal write or extend are really the same operation. In the former case the content is hard-coded in the latter it's caller provided. Sure, we can deal with that by special-casing NULL content - but at that point, what's the benefit of combinding the two operations? I'm not dead-set against this, just not really convinced that it's a good idea to combine the operations. Greetings, Andres Freund
Q: Escapes in jsonpath Idents
Hackers, The jsonpath doc[1] has an excellent description of the format of strings, but for unquoted path keys, it simply says: > Member accessor that returns an object member with the specified key. If the > key name matches some named variable starting with $ or does not meet the > JavaScript rules for an identifier, it must be enclosed in double quotes to > make it a string literal. I went looking for the JavaScript rules for an identifier and found this in the MDN docs[2]: > In JavaScript, identifiers can contain Unicode letters, $, _, and digits > (0-9), but may not start with a digit. An identifier differs from a string in > that a string is data, while an identifier is part of the code. In > JavaScript, there is no way to convert identifiers to strings, but sometimes > it is possible to parse strings into identifiers. However, the Postgres parsing of jsonpath keys appears to follow the same rules as strings, allowing backslash escapes: david=# select '$.fo\u00f8 == $x'::jsonpath; jsonpath --- ($."foø" == $"x") This would seem to contradict the documentation. Is this behavior required by the SQL standard? Do the docs need updating? Or should the code actually follow the JSON identifier behavior? Thanks, David PS: Those excellent docs on strings mentions support for \v, but the grammar in the right nav of https://www.json.org/json-en.html does not. Another bonus feature? [1]: https://www.postgresql.org/docs/16/datatype-json.html#DATATYPE-JSONPATH [2]: https://developer.mozilla.org/en-US/docs/Glossary/Identifier
Re: UUID v7
> On 15 Mar 2024, at 14:47, Aleksander Alekseev > wrote: > > +1 to the idea. I doubt that anyone will miss it. PFA v22. Changes: 1. Squashed all editorialisation by Jelte 2. Fixed my erroneous comments on using Method 2 (we are using method 1 instead) 3. Remove all traces of uuid_extract_variant() Thanks! Best regards, Andrey Borodin. v22-0001-Implement-UUID-v7.patch Description: Binary data
Re: Proposal to include --exclude-extension Flag in pg_dump
Hi, > In my opinion the order of options in pg_dump.sgml and the --help > output is fine. Keeping this new option together with -e/--extension > makes it easier to see, while otherwise it would get lost much further > down. I agree with your suggestion, so I'll maintain the original order as proposed. > The --filter option should be extended to support "exclude > extension pattern" lines in the filter file. That syntax is already > accepted, but it throws a not-supported error, but it's hopefully not > too hard to make that work now. While proposing the --exclude-extension flag in pg_dump, the --filter option wasn't there, so it got overlooked. But now that I've familiarized myself with it and have tried to enhance its functionality with exclude-extension in the provided patch. Thank you for bringing this to my attention. > It ought to have some tests in the test script. Correct, I must have added it in the first patch itself. As a newcomer to the database, I explored how testing works and tried to include test cases for the newly added flag. I've confirmed that all test cases, including the ones I added, are passing. Attached is the complete patch with all the required code changes. Looking forward to your review and feedback. On Wed, 6 Mar 2024 at 16:04, Dean Rasheed wrote: > On Mon, 1 Jan 2024 at 13:28, Ayush Vatsa wrote: > > > > According to the documentation of pg_dump when the --extension option is > not specified, all non-system extensions in the target database will get > dumped. > > > Why do we need to explicitly exclude extensions? > > Hence to include only a few we use --extension, but to exclude a few I > am proposing --exclude-extension. > > > > Thanks for working on this. It seems like a useful feature to have. > The code changes look good, and it appears to work as expected. > > In my opinion the order of options in pg_dump.sgml and the --help > output is fine. 
Keeping this new option together with -e/--extension > makes it easier to see, while otherwise it would get lost much further > down. Other opinions on that might differ though. > > There are a couple of things missing from the patch, that should be added: > > 1). The --filter option should be extended to support "exclude > extension pattern" lines in the filter file. That syntax is already > accepted, but it throws a not-supported error, but it's hopefully not > too hard to make that work now. > > 2). It ought to have some tests in the test script. > > Regards, > Dean > v2-0001-Add-support-for-exclude-extension-in-pg_dump.patch Description: Binary data
Re: Improving EXPLAIN's display of SubPlan nodes
I wrote: > Dean Rasheed writes: >> One thing that concerns me about making even greater use of "$n" is >> the potential for confusion with generic plan parameters. > True. After looking at your draft some more, it occurred to me that we're not that far from getting rid of showing "$n" entirely in this context. Instead of (subplan_name).$n, we could write something like (subplan_name).colN or (subplan_name).columnN or (subplan_name).fN, depending on your taste for verboseness. "fN" would correspond to the names we assign to columns of anonymous record types, but it hasn't got much else to recommend it. In the attached I used "colN"; "columnN" would be my next choice. You could also imagine trying to use the sub-SELECT's actual output column names, but I fear that would be ambiguous --- too often it'd be "?column?" or some other non-unique name. > Hmm. I guess what bothers me about that is that it could be read to > suggest that the initplan or subplan is evaluated again for each > output parameter. This objection seems like it could be solved through documentation, so I wrote some. The attached proof-of-concept is incomplete: it fails to replace some $n occurrences with subplan references, as is visible in some of the test cases. I believe your quick hack in get_parameter() is not correct in detail, but for the moment I didn't bother to debug it. I'm just presenting this as a POC to see if this is the direction people would like to go in. If there's not objections, I'll try to make a bulletproof implementation. 
regards, tom lane diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 58a603ac56..585ba60ae3 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -3062,10 +3062,10 @@ select exists(select 1 from pg_enum), sum(c1) from ft1; QUERY PLAN -- Foreign Scan - Output: $0, (sum(ft1.c1)) + Output: (InitPlan 1).col1, (sum(ft1.c1)) Relations: Aggregate on (public.ft1) Remote SQL: SELECT sum("C 1") FROM "S 1"."T 1" - InitPlan 1 (returns $0) + InitPlan 1 -> Seq Scan on pg_catalog.pg_enum (6 rows) @@ -3080,8 +3080,8 @@ select exists(select 1 from pg_enum), sum(c1) from ft1 group by 1; QUERY PLAN --- GroupAggregate - Output: $0, sum(ft1.c1) - InitPlan 1 (returns $0) + Output: (InitPlan 1).col1, sum(ft1.c1) + InitPlan 1 -> Seq Scan on pg_catalog.pg_enum -> Foreign Scan on public.ft1 Output: ft1.c1 @@ -3305,10 +3305,10 @@ select sum(c1) filter (where (c1 / c1) * random() <= 1) from ft1 group by c2 ord explain (verbose, costs off) select sum(c2) filter (where c2 in (select c2 from ft1 where c2 < 5)) from ft1; -QUERY PLAN + QUERY PLAN +--- Aggregate - Output: sum(ft1.c2) FILTER (WHERE (hashed SubPlan 1)) + Output: sum(ft1.c2) FILTER (WHERE (ANY (ft1.c2 = (hashed SubPlan 1).col1))) -> Foreign Scan on public.ft1 Output: ft1.c2 Remote SQL: SELECT c2 FROM "S 1"."T 1" @@ -6171,9 +6171,9 @@ UPDATE ft2 AS target SET (c2, c7) = ( Update on public.ft2 target Remote SQL: UPDATE "S 1"."T 1" SET c2 = $2, c7 = $3 WHERE ctid = $1 -> Foreign Scan on public.ft2 target - Output: $1, $2, (SubPlan 1 (returns $1,$2)), target.ctid, target.* + Output: $1, $2, (SubPlan 1), target.ctid, target.* Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8, ctid FROM "S 1"."T 1" WHERE (("C 1" > 1100)) FOR UPDATE - SubPlan 1 (returns $1,$2) + SubPlan 1 -> Foreign Scan on public.ft2 src Output: (src.c2 * 10), src.c7 Remote SQL: SELECT c2, c7 FROM "S 1"."T 1" WHERE (($1::integer = "C 1")) @@ 
-11687,7 +11687,7 @@ SELECT * FROM local_tbl t1 LEFT JOIN (SELECT *, (SELECT count(*) FROM async_pt W Nested Loop Left Join Output: t1.a, t1.b, t1.c, async_pt.a, async_pt.b, async_pt.c, ($0) Join Filter: (t1.a = async_pt.a) - InitPlan 1 (returns $0) + InitPlan 1 -> Aggregate Output: count(*) -> Append @@ -11713,7 +11713,7 @@ SELECT * FROM local_tbl t1 LEFT JOIN (SELECT *, (SELECT count(*) FROM async_pt W Nested Loop Left Join (actual rows=1 loops=1) Join Filter: (t1.a = async_pt.a) Rows Removed by Join Filter: 399 - InitPlan 1 (returns $0) + InitPlan 1 -> Aggregate (actual rows=1 loops=1) -> Append (actual rows=400 loops=1) -> Async Foreign Scan on async_p1 async_pt_4
Re: speed up a logical replica setup
On Sat, 16 Mar 2024 at 21:16, Euler Taveira wrote: > > On Sat, Mar 16, 2024, at 10:31 AM, vignesh C wrote: > > I was able to reproduce this random failure and found the following reason: > The Minimum recovery ending location 0/500 was more than the > recovery_target_lsn specified is "0/4001198". In few random cases the > standby applies a few more WAL records after the replication slot is > created; this leads to minimum recovery ending location being greater > than the recovery_target_lsn because of which the server will fail > with: > FATAL: requested recovery stop point is before consistent recovery point > > > Thanks for checking. I proposed an alternative patch for it [1]. Can you check > it? This approach looks good to me. Regards, Vignesh
Re: speed up a logical replica setup
On Sat, Mar 16, 2024, at 10:31 AM, vignesh C wrote: > I was able to reproduce this random failure and found the following reason: > The Minimum recovery ending location 0/500 was more than the > recovery_target_lsn specified is "0/4001198". In few random cases the > standby applies a few more WAL records after the replication slot is > created; this leads to minimum recovery ending location being greater > than the recovery_target_lsn because of which the server will fail > with: > FATAL: requested recovery stop point is before consistent recovery point Thanks for checking. I proposed an alternative patch for it [1]. Can you check it? [1] https://www.postgresql.org/message-id/34637e7f-0330-420d-8f45-1d022962d2fe%40app.fastmail.com -- Euler Taveira EDB https://www.enterprisedb.com/
Re: speed up a logical replica setup
On Fri, Mar 15, 2024, at 3:34 AM, Amit Kapila wrote: > Did you consider adding options for publication/subscription/slot > names as mentioned in my previous email? As discussed in a few emails > above, it would be quite confusing for users to identify the logical > replication objects once the standby is converted to subscriber. Yes. I was wondering to implement after v1 is pushed. I started to write a code for it but I wasn't sure about the UI. The best approach I came up with was multiple options in the same order. (I don't provide short options to avoid possible portability issues with the order.) It means if I have 3 databases and the following command-line: pg_createsubscriber ... --database pg1 --database pg2 --database3 --publication pubx --publication puby --publication pubz pubx, puby and pubz are created in the database pg1, pg2, and pg3 respectively. > It seems we care only for publications created on the primary. Isn't > it possible that some of the publications have been replicated to > standby by that time, for example, in case failure happens after > creating a few publications? IIUC, we don't care for standby cleanup > after failure because it can't be used for streaming replication > anymore. So, the only choice the user has is to recreate the standby > and start the pg_createsubscriber again. This sounds questionable to > me as to whether users would like this behavior. Does anyone else have > an opinion on this point? If it happens after creating a publication and before promotion, the cleanup routine will drop the publications on primary and it will eventually be applied to the standby via replication later. > I see the below note in the patch: > +If pg_createsubscriber fails while processing, > +then the data directory is likely not in a state that can be recovered. > It > +is true if the target server was promoted. In such a case, creating a new > +standby server is recommended. 
> > By reading this it is not completely clear whether the standby is not > recoverable in case of any error or only an error after the target > server is promoted. If others agree with this behavior then we should > write the detailed reason for this somewhere in the comments as well > unless it is already explained. I rewrote the sentence to make it clear that the target server ends up in a state that cannot be reused only if it was promoted. A message saying so is printed too:

pg_createsubscriber: target server reached the consistent state
pg_createsubscriber: hint: If pg_createsubscriber fails after this point, you must recreate the physical replica before continuing.

I'm attaching a new version (v30) that adds:

* 3 new options (--publication, --subscription, --replication-slot) to assign names to the objects. The --database option used to ignore duplicate names; however, since these new options rely on the number of database options matching the number of object name options, duplicates are forbidden from now on. Duplication is also forbidden for the object names, to catch errors earlier.
* A rewritten paragraph about the unusable target server after pg_createsubscriber fails.
* Vignesh reported an issue [1] related to reaching the recovery stop point before the consistent state is reached. I proposed a simple patch that fixes the issue.

[1] https://www.postgresql.org/message-id/CALDaNm3VMOi0GugGvhk3motghaFRKSWMCSE2t3YX1e%2BMttToxg%40mail.gmail.com -- Euler Taveira EDB https://www.enterprisedb.com/ v30-0001-pg_createsubscriber-creates-a-new-logical-replic.patch.gz Description: application/gzip v30-0002-Stop-the-target-server-earlier.patch.gz Description: application/gzip
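The positional pairing of repeated --database/--publication options described above can be sketched as follows. This is a hypothetical, standalone illustration of the rule (equal counts, no duplicates, pairing by position), not pg_createsubscriber's actual option-parsing code:

```python
def pair_options(databases, publications):
    """Pair repeated --database and --publication values by position:
    the N-th publication name is created in the N-th database."""
    # The counts must match, per the v30 behavior described above.
    if len(databases) != len(publications):
        raise ValueError("number of --publication options must match --database options")
    # Duplicates are forbidden for both databases and object names.
    if len(set(databases)) != len(databases) or len(set(publications)) != len(publications):
        raise ValueError("duplicate names are not allowed")
    return dict(zip(databases, publications))

# pubx, puby, pubz end up in pg1, pg2, pg3 respectively.
mapping = pair_options(["pg1", "pg2", "pg3"], ["pubx", "puby", "pubz"])
```

The long-option-only design means the pairing is unambiguous as long as the user keeps the two option lists in the same order on the command line.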
Re: [HACKERS] make async slave to wait for lsn to be replayed
On Sat, Mar 16, 2024 at 4:26 PM Amit Kapila wrote: > > On Fri, Mar 15, 2024 at 7:50 PM Alexander Korotkov > wrote: > > > > I went through this patch another time, and made some minor > > adjustments. Now it looks good, I'm going to push it if no > > objections. > > > > I have a question related to usability, if the regular reads (say a > Select statement or reads via function/procedure) need a similar > guarantee to see the changes on standby then do they also always need > to first do something like "BEGIN AFTER '0/3F0FF791' WITHIN 1000;"? Or > in other words, shouldn't we think of something for implicit > transactions? +1 to have support for implicit txns. A strawman solution I can think of is to let the primary send its current insert LSN to the standby every time it sends a batch of WAL, and have the standby automatically wait for that LSN to be replayed at the start of every implicit txn. The new BEGIN syntax requires application code changes. This led me to think about how one can achieve read-after-write consistency today in a primary-standby setup. All the logic of this patch, that is, waiting for the standby to pass a given primary LSN, needs to live in the application code (or in a proxy or load balancer?). I believe someone might be doing this already; it would be good to hear from them. > In general, it seems this patch has been stuck for a long time on the > decision to choose an appropriate UI (syntax), and we thought of > moving it further so that the other parts of the patch can be > reviewed/discussed. So, I feel before pushing this we should see > comments from a few (at least two) other senior members who earlier > shared their opinion on the syntax. I know we don't have much time > left but OTOH pushing such a change (where we didn't have a consensus > on syntax) without much discussion at this point of time could lead to > discussions after commit. +1 to gain consensus first on the syntax changes.
With this, we might be violating the SQL standard for explicit txn commands (though I stand to be corrected about the SQL standard). -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
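The application-side alternative mentioned above (waiting for the standby to pass a given primary LSN before reading) can be sketched as a client-side polling loop. The two callables below stand in for queries an application would run itself, such as pg_current_wal_insert_lsn() on the primary and pg_last_wal_replay_lsn() on the standby; this is an illustrative sketch, not driver code:

```python
import time

def wait_for_standby(get_primary_lsn, get_standby_replay_lsn, timeout_ms=1000):
    """Client-side read-after-write wait: block until the standby has
    replayed the primary's current LSN, or give up after timeout_ms
    (roughly analogous to the WITHIN clause expiring)."""
    target = get_primary_lsn()
    deadline = time.monotonic() + timeout_ms / 1000.0
    while get_standby_replay_lsn() < target:
        if time.monotonic() >= deadline:
            return False  # standby did not catch up in time
        time.sleep(0.01)  # poll interval
    return True

# Simulated demo: primary is at position 20; the mock standby
# advances by 5 on each poll, so it catches up within the timeout.
_replayed = [0]
def _standby_poll():
    _replayed[0] += 5
    return _replayed[0]

caught_up = wait_for_standby(lambda: 20, _standby_poll, timeout_ms=1000)
```

The proposed server-side syntax removes exactly this kind of polling from applications, which is the usability argument being weighed in the thread.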
Re: speed up a logical replica setup
On Mon, 11 Mar 2024 at 10:33, Hayato Kuroda (Fujitsu) wrote: > > Dear Vignesh, > > Thanks for updating the patch, but cfbot still got angry [1]. > Note that two containers (autoconf and meson) failed at different place, > so I think it is intentional one. It seems that there may be a bug related > with 32-bit build. > > We should see and fix as soon as possible. I was able to reproduce this random failure and found the following reason: the minimum recovery ending location (0/500) was greater than the specified recovery_target_lsn (0/4001198). In a few random cases the standby applies a few more WAL records after the replication slot is created; this leads to the minimum recovery ending location being greater than the recovery_target_lsn, because of which the server fails with: FATAL: requested recovery stop point is before consistent recovery point. I have fixed it by pausing the replay in the standby server before the replication slots get created. The attached v29-0002-Keep-standby-server-s-minimum-recovery-point-les.patch patch has the changes for the same. Thoughts? Regards, Vignesh

From e98e1ba0661e3d658de8d46ff9082f6cd4040b41 Mon Sep 17 00:00:00 2001
From: Vignesh C
Date: Sat, 16 Mar 2024 18:47:35 +0530
Subject: [PATCH v29 2/2] Keep standby server's minimum recovery point less than the consistent_lsn.

The standby server should not get ahead of the replication slot's LSN. If this happens, we will not be able to promote the standby server, as it will not be able to reach the consistent point: the standby server's minimum recovery ending location will be greater than the consistent_lsn. Fixed this by pausing the replay before the replication slots are created.
---
 src/bin/pg_basebackup/pg_createsubscriber.c | 93 +++++++++++++++++++-
 1 file changed, 92 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index c565f8524a..18e8757403 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -95,6 +95,8 @@ static void drop_publication(PGconn *conn, struct LogicalRepInfo *dbinfo);
 static void create_subscription(PGconn *conn, struct LogicalRepInfo *dbinfo);
 static void set_replication_progress(PGconn *conn, struct LogicalRepInfo *dbinfo,
 									 const char *lsn);
+static void pause_replay_stantby_server(struct LogicalRepInfo *dbinfo,
+										struct CreateSubscriberOptions *opt);
 static void enable_subscription(PGconn *conn, struct LogicalRepInfo *dbinfo);
 
 #define USEC_PER_SEC 100
@@ -917,7 +919,9 @@ check_subscriber(struct LogicalRepInfo *dbinfo)
 	appendPQExpBuffer(str,
 					  "SELECT pg_catalog.pg_has_role(current_user, %u, 'MEMBER'), "
 					  "pg_catalog.has_database_privilege(current_user, '%s', 'CREATE'), "
-					  "pg_catalog.has_function_privilege(current_user, 'pg_catalog.pg_replication_origin_advance(text, pg_lsn)', 'EXECUTE')",
+					  "pg_catalog.has_function_privilege(current_user, 'pg_catalog.pg_replication_origin_advance(text, pg_lsn)', 'EXECUTE'), "
+					  "pg_catalog.has_function_privilege(current_user, 'pg_catalog.pg_wal_replay_pause()', 'EXECUTE'), "
+					  "pg_catalog.has_function_privilege(current_user, 'pg_catalog.pg_get_wal_replay_pause_state()', 'EXECUTE')",
 					  ROLE_PG_CREATE_SUBSCRIPTION,
 					  dbinfo[0].dbname);
 
 	pg_log_debug("command is: %s", str->data);
@@ -949,6 +953,18 @@ check_subscriber(struct LogicalRepInfo *dbinfo)
 					 "pg_catalog.pg_replication_origin_advance(text, pg_lsn)");
 		failed = true;
 	}
+	if (strcmp(PQgetvalue(res, 0, 3), "t") != 0)
+	{
+		pg_log_error("permission denied for function \"%s\"",
+					 "pg_catalog.pg_wal_replay_pause()");
+		failed = true;
+	}
+	if (strcmp(PQgetvalue(res, 0, 4), "t") != 0)
+	{
+		pg_log_error("permission denied for function \"%s\"",
+					 "pg_catalog.pg_get_wal_replay_pause_state()");
+		failed = true;
+	}
 
 	destroyPQExpBuffer(str);
 	PQclear(res);
@@ -1026,6 +1042,72 @@ check_subscriber(struct LogicalRepInfo *dbinfo)
 		exit(1);
 }
 
+/*
+ * Pause replaying at the standby server.
+ */
+static void
+pause_replay_stantby_server(struct LogicalRepInfo *dbinfo,
+							struct CreateSubscriberOptions *opt)
+{
+	PGconn	   *conn;
+	PGresult   *res;
+
+	/* Connect to subscriber. */
+	conn = connect_database(dbinfo[0].subconninfo, true);
+
+	pg_log_info("Pausing the replay in standby server");
+	pg_log_debug("command is: SELECT pg_catalog.pg_wal_replay_pause()");
+
+	if (!dry_run)
+	{
+		int			timer = 0;
+
+		res = PQexec(conn, "SELECT pg_catalog.pg_wal_replay_pause()");
+		if (PQresultStatus(res) != PGRES_TUPLES_OK)
+		{
+			pg_log_error("could not pause replay in standby server: %s",
+						 PQresultErrorMessage(res));
+			disconnect_database(conn, true);
+		}
+		PQclear(res);
+
+		/* Wait till replay has paused */
+		for (;;)
+		{
+			pg_log_debug("command is: SELECT pg_catalog.pg_get_wal_replay_pause_state()");
+			res = PQexec(conn, "SELECT
Building with meson on NixOS/nixpkgs
To build on NixOS/nixpkgs I came up with a few small patches to meson.build. All of this works fine with Autoconf/Make already.

From 24ae72b9b0adc578c6729eff59c9038e6b4ac517 Mon Sep 17 00:00:00 2001
From: Wolfgang Walther
Date: Sat, 2 Mar 2024 17:18:38 +0100
Subject: [PATCH 1/3] Fallback to uuid for ossp-uuid with meson

The upstream name for the ossp-uuid package / pkg-config file is "uuid". Many distributions change this to be "ossp-uuid" to not conflict with e2fsprogs. This lookup fails on distributions which don't change this name, for example NixOS / nixpkgs. Both "ossp-uuid" and "uuid" are also checked in configure.ac.
---
 meson.build | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/meson.build b/meson.build
index c8fdfeb0ec3..d9ad1ce902f 100644
--- a/meson.build
+++ b/meson.build
@@ -1346,7 +1346,7 @@ if uuidopt != 'none'
     uuidfunc = 'uuid_to_string'
     uuidheader = 'uuid.h'
   elif uuidopt == 'ossp'
-    uuid = dependency('ossp-uuid', required: true)
+    uuid = dependency('ossp-uuid', 'uuid', required: true)
     uuidfunc = 'uuid_export'
     uuidheader = 'uuid.h'
   else
-- 
2.44.0

From 56e0abbcc3b950b6e93eddc6ede453ce529423ea Mon Sep 17 00:00:00 2001
From: Wolfgang Walther
Date: Sat, 2 Mar 2024 22:06:25 +0100
Subject: [PATCH 2/3] Fallback to clang in PATH with meson

Some distributions put clang into a different path than the llvm binary path. For example, this is the case on NixOS / nixpkgs, which failed to find clang with meson before this patch.
---
 meson.build | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/meson.build b/meson.build
index d9ad1ce902f..7c64fb04ea5 100644
--- a/meson.build
+++ b/meson.build
@@ -759,7 +759,7 @@ if add_languages('cpp', required: llvmopt, native: false)
     llvm_binpath = llvm.get_variable(configtool: 'bindir')
 
     ccache = find_program('ccache', native: true, required: false)
-    clang = find_program(llvm_binpath / 'clang', required: true)
+    clang = find_program(llvm_binpath / 'clang', 'clang', required: true)
   endif
 elif llvmopt.auto()
   message('llvm requires a C++ compiler')
-- 
2.44.0

From 62f10689d843227fca6d54e86462d0be5c4f434f Mon Sep 17 00:00:00 2001
From: Wolfgang Walther
Date: Mon, 11 Mar 2024 19:54:41 +0100
Subject: [PATCH 3/3] Support absolute bindir/libdir in regression tests with meson

Passing an absolute bindir/libdir will install the binaries and libraries to /tmp_install/ and /tmp_install/ respectively. This path is correctly passed to the regression test suite via configure/make, but not via meson, yet. This is because the "/" operator in the following expression throws away the whole left side when the right side is an absolute path:

  test_install_location / get_option('libdir')

This was already correctly handled for dir_prefix, which is likely absolute as well. This patch handles both bindir and libdir in the same way - prefixing absolute paths with the tmp_install path correctly.
---
 meson.build | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/meson.build b/meson.build
index 7c64fb04ea5..e8438209058 100644
--- a/meson.build
+++ b/meson.build
@@ -3044,15 +3044,17 @@ test_install_destdir = meson.build_root() / 'tmp_install/'
 if build_system != 'windows'
   # On unixoid systems this is trivial, we just prepend the destdir
   assert(dir_prefix.startswith('/')) # enforced by meson
-  test_install_location = '@0@@1@'.format(test_install_destdir, dir_prefix)
+  temp_install_bindir = '@0@@1@'.format(test_install_destdir, dir_prefix / dir_bin)
+  temp_install_libdir = '@0@@1@'.format(test_install_destdir, dir_prefix / dir_lib)
 else
   # drives, drive-relative paths, etc make this complicated on windows, call
   # into a copy of meson's logic for it
   command = [
     python, '-c',
     'import sys; from pathlib import PurePath; d1=sys.argv[1]; d2=sys.argv[2]; print(str(PurePath(d1, *PurePath(d2).parts[1:])))',
-    test_install_destdir, dir_prefix]
-  test_install_location = run_command(command, check: true).stdout().strip()
+    test_install_destdir]
+  temp_install_bindir = run_command(command, dir_prefix / dir_bin, check: true).stdout().strip()
+  temp_install_libdir = run_command(command, dir_prefix / dir_lib, check: true).stdout().strip()
 endif
 
 meson_install_args = meson_args + ['install'] + {
@@ -3089,7 +3091,6 @@ testport = 4
 
 test_env = environment()
 
-temp_install_bindir = test_install_location / get_option('bindir')
 test_initdb_template = meson.build_root() / 'tmp_install' / 'initdb-template'
 test_env.set('PG_REGRESS', pg_regress.full_path())
 test_env.set('REGRESS_SHLIB', regress_module.full_path())
@@ -3104,7 +3105,7 @@ test_env.set('PG_TEST_EXTRA', get_option('PG_TEST_EXTRA'))
 # that works (everything but windows, basically). On windows everything
 # library-like gets installed into bindir, solving that issue.
 if library_path_var != ''
-  test_env.prepend(library_path_var, test_install_location /
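The path behavior the third patch relies on can be checked in isolation. The patch's Windows branch re-roots a possibly absolute prefix under the test destdir via PurePath, because a plain join discards the left side when the right side is absolute. This sketch mirrors that one-liner (using PurePosixPath for deterministic output; illustrative paths only):

```python
from pathlib import PurePosixPath

def reroot(destdir, path):
    # Mirror of the patch's logic: strip the root from an absolute path
    # and re-anchor it under the test-install destdir, so that
    # '/usr/lib' under '/tmp/inst' becomes '/tmp/inst/usr/lib'.
    return str(PurePosixPath(destdir, *PurePosixPath(path).parts[1:]))

# A naive join with the "/" operator instead discards the left side
# entirely when the right side is absolute -- the bug being fixed:
naive = str(PurePosixPath("/tmp/inst") / "/usr/lib")
fixed = reroot("/tmp/inst", "/usr/lib")
```

This is the same reason meson's own `/` operator (like pathlib's) silently drops `test_install_location` when `libdir` is configured as an absolute path.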
Re: [HACKERS] make async slave to wait for lsn to be replayed
On Fri, Mar 15, 2024 at 7:50 PM Alexander Korotkov wrote: > > I went through this patch another time, and made some minor > adjustments. Now it looks good, I'm going to push it if no > objections. > I have a question related to usability, if the regular reads (say a Select statement or reads via function/procedure) need a similar guarantee to see the changes on standby then do they also always need to first do something like "BEGIN AFTER '0/3F0FF791' WITHIN 1000;"? Or in other words, shouldn't we think of something for implicit transactions? In general, it seems this patch has been stuck for a long time on the decision to choose an appropriate UI (syntax), and we thought of moving it further so that the other parts of the patch can be reviewed/discussed. So, I feel before pushing this we should see comments from a few (at least two) other senior members who earlier shared their opinion on the syntax. I know we don't have much time left but OTOH pushing such a change (where we didn't have a consensus on syntax) without much discussion at this point of time could lead to discussions after commit. -- With Regards, Amit Kapila.
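The pg_lsn values in the proposed syntax (e.g. '0/3F0FF791' in BEGIN AFTER) are 64-bit WAL positions printed as two slash-separated hex halves. A small standalone sketch of parsing and comparing them (hypothetical helpers, not server code):

```python
def parse_lsn(text):
    """Parse a pg_lsn string 'X/Y' (two hex fields) into a 64-bit
    integer, e.g. '0/3F0FF791' -> 0x3F0FF791 and '1/0' -> 1 << 32."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replayed_past(current_replay_lsn, target):
    # The wait condition is satisfied once the replay position has
    # reached (or passed) the target LSN.
    return parse_lsn(current_replay_lsn) >= parse_lsn(target)
```

Whether the server waits inside BEGIN or the application compares LSNs itself, this ordering on the 64-bit value is what "the standby has replayed up to the target" means.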
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Fri, Mar 15, 2024 at 10:45 AM Bharath Rupireddy wrote: > > On Wed, Mar 13, 2024 at 9:38 AM Amit Kapila wrote: > > > > BTW, does the XID based parameter 'max_slot_xid_age' not have similarity > > with 'max_slot_wal_keep_size'? I think it will impact the rows we > > remove based on xid horizons. Don't we need to consider it while > > vacuum is computing the xid horizons in ComputeXidHorizons(), similar to > > what we do for WAL w.r.t 'max_slot_wal_keep_size'? > > I'm having a hard time understanding why we'd need something up there > in ComputeXidHorizons(). Can you elaborate it a bit please? > > What's proposed with max_slot_xid_age is that during checkpoint we > look at slot's xmin and catalog_xmin, and the current system txn id. > Then, if the XID age of (xmin, catalog_xmin) and current_xid crosses > max_slot_xid_age, we invalidate the slot. > I can see that in your patch (in function InvalidatePossiblyObsoleteSlot()). As per my understanding, we need something similar for slot xids in ComputeXidHorizons() as we are doing for WAL in KeepLogSeg(). In KeepLogSeg(), we compute the minimum LSN location required by slots and then adjust it for 'max_slot_wal_keep_size'. On similar lines, currently in ComputeXidHorizons(), we compute the minimum xid required by slots (procArray->replication_slot_xmin and procArray->replication_slot_catalog_xmin) but then don't adjust it for 'max_slot_xid_age'. I could be missing something here, but it is better to keep discussing this and to move ahead with another parameter, 'inactive_replication_slot_timeout', which in my view can be kept at the slot level instead of as a GUC; OTOH, we need to see the arguments on both sides and then decide which makes more sense. -- With Regards, Amit Kapila.
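Because XIDs are 32-bit counters that wrap around, an age check like the proposed max_slot_xid_age has to use modular arithmetic, in the spirit of the server's wraparound-aware XID comparisons. A standalone sketch of that idea (assuming slot_xmin logically precedes current_xid; names are illustrative, not the patch's code):

```python
XID_MASK = 0xFFFFFFFF  # XIDs are 32 bits wide

def xid_age(current_xid, slot_xmin):
    """Distance from slot_xmin to current_xid in 32-bit modular
    arithmetic, so the result stays correct across XID wraparound."""
    return (current_xid - slot_xmin) & XID_MASK

def slot_is_too_old(current_xid, slot_xmin, max_slot_xid_age):
    # A slot whose xmin lags current_xid by more than the configured
    # age would be a candidate for invalidation at checkpoint time.
    return xid_age(current_xid, slot_xmin) > max_slot_xid_age
```

The open question in the thread is whether this age should also feed back into the xid horizon computation (analogous to how max_slot_wal_keep_size clamps the LSN kept by KeepLogSeg), not just into slot invalidation.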
Re: [HACKERS] make async slave to wait for lsn to be replayed
On Sat, Mar 16, 2024 at 2:04 AM Alexander Korotkov wrote: > > > Rebase and update patch. Thanks for working on this. I took a quick look at the v11 patch. Here are some comments:

1. +#include "utils/timestamp.h" +#include "executor/spi.h" +#include "utils/fmgrprotos.h" Please place executor/spi.h in alphabetical order. Also, look at all the other header files and keep them in order.

2. It seems like pgindent is not happy with src/backend/access/transam/xlogrecovery.c and src/backend/commands/waitlsn.c. Please run it to keep BF member koel happy post commit.

3. This patch changes the explicit SQL transaction statement syntax; is this deviation okay from an SQL standard perspective?

4. I think some more checks are needed for better input validation.

4.1 With an invalid LSN it succeeds; shouldn't it error out? Or at least add a fast path/quick exit to WaitForLSN()? BEGIN AFTER '0/0';

4.2 With an unreasonably high future LSN, the BEGIN command waits unboundedly; shouldn't we check whether the specified LSN is beyond pg_last_wal_receive_lsn() and error out? BEGIN AFTER '0/'; SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset BEGIN AFTER :'future_receive_lsn';

4.3 With an unreasonably high wait time, the BEGIN command waits unboundedly; shouldn't we restrict the wait time to some max value, say a day or so? SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset BEGIN AFTER :'future_receive_lsn' WITHIN 10;

5. +#include +#include +#include "postgres.h" +#include "pgstat.h" postgres.h must be included first, then the system header files, and then all postgres header files, just like below. See a very recent commit https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=97d85be365443eb4bf84373a7468624762382059. +#include "postgres.h" +#include +#include +#include "access/transam.h" +#include "access/xact.h" +#include "access/xlog.h"

6. +/* Set all latches in shared memory to signal that new LSN has been replayed */ +void +WaitLSNSetLatches(XLogRecPtr curLSN) +{ I see this patch wakes up all the waiters in the recovery path after applying every WAL record, which IMO is a hot path. Has the impact of this change on recovery been measured, perhaps using https://github.com/macdice/redo-bench or similar tools?

7. In continuation of comment #6, why not use condition variables instead of proc latches to sleep and wait for all the waiters in WaitLSNSetLatches? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
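One way to limit the cost of signalling waiters on every replayed record is to keep them ordered by their target LSN, so only the waiters that are already satisfied get woken. The idea can be sketched with a plain binary heap; this is a simplified standalone illustration (the actual patch keeps waiter procs in shared memory and signals them via latches, not Python objects):

```python
import heapq

class LSNWaitQueue:
    """Waiters are kept in a min-heap keyed by target LSN, so on replay
    progress we pop and wake just the prefix whose target has been
    reached, instead of scanning and waking every waiter."""

    def __init__(self):
        self._heap = []

    def add_waiter(self, target_lsn, waiter_id):
        heapq.heappush(self._heap, (target_lsn, waiter_id))

    def wake_up_to(self, replayed_lsn):
        # Only waiters with target <= replayed_lsn are removed and woken;
        # the rest stay queued, untouched, for a later replay position.
        woken = []
        while self._heap and self._heap[0][0] <= replayed_lsn:
            woken.append(heapq.heappop(self._heap)[1])
        return woken
```

With this shape, each replay-progress notification does O(k log n) work for k satisfied waiters, rather than touching all n waiters; whether latches or condition variables deliver the wakeup is then an orthogonal choice.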