Re: add AVX2 support to simd.h

2024-03-16 Thread John Naylor
On Sat, Mar 16, 2024 at 2:40 AM Nathan Bossart  wrote:
>
> On Fri, Mar 15, 2024 at 12:41:49PM -0500, Nathan Bossart wrote:
> > I've also attached the results of running this benchmark on my machine at
> > HEAD, after applying 0001, and after applying both 0001 and 0002.  0001
> > appears to work pretty well.  When there is a small "tail," it regresses a
> > small amount, but overall, it seems to improve more cases than it harms.
> > 0002 does regress searches on smaller arrays quite a bit, since it
> > postpones the SIMD optimizations until the arrays are longer.  It might be
> > possible to mitigate by using 2 registers when the "tail" is long enough,
> > but I have yet to try that.
>
> The attached 0003 is a sketch of what such mitigation might look like.  It
> appears to help with the regressions nicely.  I omitted the benchmarking
> patch in v3 to appease cfbot.

I haven't looked at the patches, but the graphs look good.
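
As a rough illustration of the tail-handling idea being discussed (this is
not the actual patch; the function name and the SSE2-only focus are
assumptions made for brevity), a common way to avoid a scalar "tail" loop is
to finish with a single overlapping vector load that ends at the last
element; the 0003 mitigation goes further by using two registers when the
tail is long enough:

#include <emmintrin.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: linear search of a uint32 array using SSE2, covering
 * the tail with one overlapping load instead of a scalar loop.  Assumes
 * nelem >= 4; shorter arrays would fall back to a scalar search.
 */
static bool
vector_search_uint32(const uint32_t *base, uint32_t nelem, uint32_t key)
{
	const __m128i keys = _mm_set1_epi32((int32_t) key);
	uint32_t	i;

	/* full 4-element chunks */
	for (i = 0; i + 4 <= nelem; i += 4)
	{
		__m128i		chunk = _mm_loadu_si128((const __m128i *) &base[i]);

		if (_mm_movemask_epi8(_mm_cmpeq_epi32(chunk, keys)))
			return true;
	}

	/* tail: re-check the last 4 elements with one (possibly overlapping) load */
	if (i < nelem)
	{
		__m128i		chunk = _mm_loadu_si128((const __m128i *) &base[nelem - 4]);

		if (_mm_movemask_epi8(_mm_cmpeq_epi32(chunk, keys)))
			return true;
	}

	return false;
}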




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-16 Thread John Naylor
On Fri, Mar 15, 2024 at 9:17 PM Masahiko Sawada  wrote:
>
> On Fri, Mar 15, 2024 at 4:36 PM John Naylor  wrote:
> >
> > On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada  
> > wrote:

> > > Given TidStoreSetBlockOffsets() is designed to always set (i.e.
> > > overwrite) the value, I think we should not expect that found is
> > > always false.
> >
> > I find that a puzzling statement, since 1) it was designed for
> > insert-only workloads, not actual overwrite IIRC and 2) the tests will
> > now fail if the same block is set twice, since we just switched the
> > tests to use a remnant of vacuum's old array. Having said that, I
> > don't object to removing artificial barriers to using it for purposes
> > not yet imagined, as long as test_tidstore.sql warns against that.
>
> I think that if it supports only an insert-only workload and expects the
> same block to be set only once, it should raise an error rather than an
> assertion. It's odd to me that the function fails only in an
> assertion build even though it actually works fine in
> that case.

After thinking some more, I think you're right -- it's too
heavy-handed to throw an error/assert and a public function shouldn't
make assumptions about the caller. It's probably just a matter of
documenting the function (and its lack of generality), and the tests
(which are based on the thing we're replacing).

> As for test_tidstore you're right that the test code doesn't handle
> the case where the same block is set twice. I think that there is no
> problem in the fixed-TIDs tests, but we would need something for the
> random-TIDs tests so that we don't set the same block twice. I guess
> it could be trivial since we can use SQL queries to generate TIDs. I'm
> not sure what the random-TIDs tests would look like, but I think we can
> use SELECT DISTINCT to eliminate duplicate block numbers.

Also, I don't think we need random blocks, since the radix tree tests
exercise that heavily already.

Random offsets are what I was thinking of (if made distinct and
ordered), but even there the code is fairly trivial, so I don't have a
strong feeling about it.

> > Given the above two things, I think this function's comment needs
> > stronger language about its limitations. Perhaps even mention that
> > it's intended for, and optimized for, vacuum. You and I have long
> > known that tidstore would need a separate, more complex, function to
> > add or remove individual tids from existing entries, but it might be
> > good to have that documented.
>
> Agreed.

How about this:

 /*
- * Set the given TIDs on the blkno to TidStore.
+ * Create or replace an entry for the given block and array of offsets
  *
- * NB: the offset numbers in offsets must be sorted in ascending order.
+ * NB: This function is designed and optimized for vacuum's heap scanning
+ * phase, so has some limitations:
+ * - The offset numbers in "offsets" must be sorted in ascending order.
+ * - If the block number already exists, the entry will be replaced --
+ *   there is no way to add or remove offsets from an entry.
  */
 void
 TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,

I think we can stop including the debug-tid-store patch for CI now.
That would allow getting rid of some unnecessary variables. More
comments:

+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, TidStoreEndIterate() needs to be called when finished.

+ * Concurrent updates during the iteration will be blocked when inserting a
+ * key-value to the radix tree.

This is outdated. Locking is optional. The remaining real reason now
is that TidStoreEndIterate needs to free memory. We probably need to
say something about locking, too, but not this.

+ * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
+ * in one block. We return the block numbers in ascending order and the offset
+ * numbers in each result is also sorted in ascending order.
+ */
+TidStoreIterResult *
+TidStoreIterateNext(TidStoreIter *iter)

The wording is a bit awkward.

+/*
+ * Finish an iteration over TidStore. This needs to be called after finishing
+ * or when existing an iteration.
+ */

s/existing/exiting/ ?

It seems to say we need to finish after finishing. Maybe more precise wording.

+/* Extract TIDs from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key,
BlocktableEntry *page)

This is a leftover from the old encoding scheme. This should really
take a "BlockNumber blockno" not a "key", and the only call site
should probably cast the uint64 to BlockNumber.

+ * tidstore.h
+ *   Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group

Update year.

+typedef struct BlocktableEntry
+{
+ uint16 nwords;
+ bitmapword words[FLEXIBLE_ARRAY_MEMBER];
+} BlocktableEntry;

In my WIP for runtime-embeddable offsets, nwords needs to be one byte.
That doesn't have any 

Re: REVOKE FROM warning on grantor

2024-03-16 Thread Tom Lane
=?ISO-8859-1?Q?=C9tienne?= BERSAC  writes:
> I'll try to patch the behaviour to ensure an error if the REVOKE is
> ineffective.

I think we're unlikely to accept such a patch.  By my reading, the way
we do it now is required by the SQL standard.  The standard doesn't
seem to say that in so many words; what it does say (from SQL99) is

b) If the  is a , then
  for every  specified, a set of role authorization
  descriptors is identified. A role authorization descriptor is
  said to be identified if it defines the grant of any of the
  specified s to  with grantor A.

It does not say that that set must be nonempty.  Admittedly it's not
very clear from this one point.  However, if you look around in the
standard it seems clear that they expect no-op revokes to be no-ops
not errors.  As an example, every type of DROP command includes an
explicit step to drop privileges attached to the object, with wording
like (this is for ALTER TABLE DROP COLUMN):

 3) Let A be the  that owns T. The
following  is effectively executed with
a current authorization identifier of "_SYSTEM" and without
further Access Rule checking:

  REVOKE INSERT(CN), UPDATE(CN), SELECT(CN), REFERENCES(CN) ON
  TABLE TN FROM A CASCADE

There is no special rule for the case that all (or any...) of those
privileges were previously revoked; but if that case is supposed to be
an error, there would have to be an exception here, or you couldn't
drop such columns.

Even taking the position that this is an unspecified point that we
could implement how we like, I don't think there's a sufficient
argument for changing behavior that's stood for a couple of decades.
(The spelling of the message has changed over the years, but giving a
warning not an error appears to go all the way back to 99b8f8451
where we implemented user groups.)  It is certain that there are
applications out there that rely on this behavior and would break.
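
For concreteness, the long-standing behavior under discussion looks like
this (the table and role names are hypothetical):

CREATE TABLE t (a int);
CREATE ROLE alice;

-- alice was never granted SELECT, so this revoke is a no-op:
REVOKE SELECT ON t FROM alice;
-- WARNING:  no privileges could be revoked for "t"
-- REVOKE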

regards, tom lane




Re: Support json_errdetail in FRONTEND builds

2024-03-16 Thread Daniel Gustafsson
> On 16 Mar 2024, at 00:59, Andrew Dunstan  wrote:
>> On Mar 16, 2024, at 8:53 AM, Daniel Gustafsson  wrote:
>>> On 15 Mar 2024, at 21:56, Andrew Dunstan  wrote:

>>> I'd like to get this done ASAP so I can rebase my incremental parse 
>>> patchset. Daniel, do you want to commit it? If not, I can.
>> 
>> Sure, I can commit these patches.  It won't be until tomorrow though since I
>> prefer to have ample time to monitor the buildfarm, if you are in a bigger 
>> rush
>> than that then feel free to go ahead.
> 
> tomorrow will be fine, thanks 

Sorry, I ran into some unforeseen scheduling issues and had less time available
than planned.  I have pushed the 0001 StringInfo patch to reduce the work for
tomorrow when I will work on 0002 unless beaten to it.

--
Daniel Gustafsson





Re: pg_upgrade failing for 200+ million Large Objects

2024-03-16 Thread Tom Lane
Laurenz Albe  writes:
> On Fri, 2024-03-15 at 19:18 -0400, Tom Lane wrote:
>> This patch seems to have stalled out again.  In hopes of getting it
>> over the finish line, I've done a bit more work to address the two
>> loose ends I felt were probably essential to deal with:

> Applies and builds fine.
> I didn't scrutinize the code, but I gave it a spin on a database with
> 15 million (small) large objects.  I tried pg_upgrade --link with and
> without the patch on a debug build with the default configuration.

Thanks for looking at it!

> Without the patch:
> Runtime: 74.5 minutes

> With the patch:
> Runtime: 70 minutes

Hm, I'd have hoped for a bit more runtime improvement.  But perhaps
not --- most of the win we saw upthread was from parallelism, and
I don't think you'd get any parallelism in a pg_upgrade with all
the data in one database.  (Perhaps there is more to do there later,
but I'm still not clear on how this should interact with the existing
cross-DB parallelism; so I'm content to leave that question for
another patch.)

regards, tom lane




Re: pg_upgrade failing for 200+ million Large Objects

2024-03-16 Thread Laurenz Albe
On Fri, 2024-03-15 at 19:18 -0400, Tom Lane wrote:
> This patch seems to have stalled out again.  In hopes of getting it
> over the finish line, I've done a bit more work to address the two
> loose ends I felt were probably essential to deal with:

Applies and builds fine.

I didn't scrutinize the code, but I gave it a spin on a database with
15 million (small) large objects.  I tried pg_upgrade --link with and
without the patch on a debug build with the default configuration.

Without the patch:

Runtime: 74.5 minutes
Memory usage: ~7GB
Disk usage: an extra 5GB dump file + log file during the dump

With the patch:

Runtime: 70 minutes
Memory usage: ~1GB
Disk usage: an extra 0.5GB during the dump

Memory usage stayed stable once it reached its peak, so no noticeable
memory leaks.

The reduced memory usage is great.  I was surprised by the difference
in disk usage: the lion's share is the dump file, and that got substantially
smaller.  But also the log file shrank considerably, because not every
individual large object gets logged.

I had a look at "perf top", and the profile looked pretty similar in
both cases.

The patch is a clear improvement.

Yours,
Laurenz Albe




Re: Q: Escapes in jsonpath Idents

2024-03-16 Thread David E. Wheeler
On Mar 16, 2024, at 14:39, David E. Wheeler  wrote:

> I went looking for the JavaScript rules for an identifier and found this in 
> the MDN docs[2]:
> 
>> In JavaScript, identifiers can contain Unicode letters, $, _, and digits 
>> (0-9), but may not start with a digit. An identifier differs from a string 
>> in that a string is data, while an identifier is part of the code. In 
>> JavaScript, there is no way to convert identifiers to strings, but sometimes 
>> it is possible to parse strings into identifiers.

Coda: Dollar signs don’t work at all outside double-quoted string identifiers:

david=# select '$.$foo'::jsonpath;
ERROR:  syntax error at or near "$foo" of jsonpath input
LINE 1: select '$.$foo'::jsonpath;
   ^

david=# select '$.f$oo'::jsonpath;
ERROR:  syntax error at or near "$oo" of jsonpath input
LINE 1: select '$.f$oo'::jsonpath;
   ^

david=# select '$."$foo"'::jsonpath;
 jsonpath 
----------
 $."$foo"

This, too, contradicts the MDN definition of an identifier (and some quick browser 
tests).

Best,

David





Re: REVOKE FROM warning on grantor

2024-03-16 Thread Étienne BERSAC
Hi David,

> That should have read:  the granted permission does not exist

Thanks, it's clear.

However, I'm hitting the warning when removing a role from a group, but
the membership remains after the warning. In this case, I expect an
error.

I'll try to patch the behaviour to ensure an error if the REVOKE is
ineffective.

Regards,
Étienne




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-16 Thread Tomas Vondra



On 3/16/24 20:12, Andres Freund wrote:
> Hi,
> 
> On 2024-03-15 18:42:29 -0400, Melanie Plageman wrote:
>> On Fri, Mar 15, 2024 at 5:14 PM Andres Freund  wrote:
>>> On 2024-03-14 17:39:30 -0400, Melanie Plageman wrote:
>>> I spent a good amount of time looking into this with Melanie. After a bunch 
>>> of
>>> wrong paths I think I found the issue: We end up prefetching blocks we have
>>> already read. Notably this happens even as-is on master - just not as
>>> frequently as after moving BitmapAdjustPrefetchIterator().
>>>
>>> From what I can tell the prefetching in parallel bitmap heap scans is
>>> thoroughly broken.  I added some tracking of the last block read, the last
>>> block prefetched to ParallelBitmapHeapState and found that with a small
>>> effective_io_concurrency we end up with ~18% of prefetches being of blocks 
>>> we
>>> already read! After moving the BitmapAdjustPrefetchIterator() it rises to 
>>> 86%,
>>> no wonder it's slower...
>>>
>>> The race here seems fairly substantial - we're moving the two iterators
>>> independently from each other, in multiple processes, without useful 
>>> locking.
>>>
>>> I'm inclined to think this is a bug we ought to fix in the backbranches.
>>
>> Thinking about how to fix this, perhaps we could keep the current max
>> block number in the ParallelBitmapHeapState and then when prefetching,
>> workers could loop calling tbm_shared_iterate() until they've found a
>> block at least prefetch_pages ahead of the current block. They
>> wouldn't need to read the current max value from the parallel state on
>> each iteration. Even checking it once and storing that value in a
>> local variable prevented prefetching blocks after reading them in my
>> example repro of the issue.
> 
> That would address some of the worst behaviour, but it doesn't really seem to
> address the underlying problem of the two iterators being modified
> independently. ISTM the proper fix would be to protect the state of the
> iterators with a single lock, rather than pushing down the locking into the
> bitmap code. OTOH, we'll only need one lock going forward, so being economic
> in the effort of fixing this is also important.
> 

Can you share some details about how you identified the problem, counted
the prefetches that happen too late, etc? I'd like to try to reproduce
this to understand the issue better.

If I understand correctly, what may happen is that a worker reads blocks
from the "prefetch" iterator, but before it manages to issue the
posix_fadvise, some other worker already did pread. Or can the iterators
get "out of sync" in a more fundamental way?

If my understanding is correct, why would a single lock solve that? Yes,
we'd advance the iterators at the same time, but surely we'd not issue
the fadvise calls while holding the lock, and the prefetch/fadvise for a
particular block could still happen in different workers.

I suppose a dirty PoC fix should not be too difficult, and it'd allow us
to check if it works.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: REVOKE FROM warning on grantor

2024-03-16 Thread David G. Johnston
On Sat, Mar 16, 2024 at 1:00 PM Étienne BERSAC 
wrote:

>
> > The choice of warning is made because after the command ends the
> > grantmin question does not exist.  The revoke was a no-op and the
> > final state is as the user intended.
>
>
> Sorry, can you explain me what's the grantmin question is ?
>
>
That should have read:  the granted permission does not exist

David J.


Re: REVOKE FROM warning on grantor

2024-03-16 Thread Étienne BERSAC
Hi David,
Thanks for your answer.



> The choice of warning is made because after the command ends the
> grantmin question does not exist.  The revoke was a no-op and the
> final state is as the user intended. 


Sorry, can you explain me what's the grantmin question is ?

Regards,
Étienne




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-16 Thread Andres Freund
Hi,

On 2024-03-15 18:42:29 -0400, Melanie Plageman wrote:
> On Fri, Mar 15, 2024 at 5:14 PM Andres Freund  wrote:
> > On 2024-03-14 17:39:30 -0400, Melanie Plageman wrote:
> > I spent a good amount of time looking into this with Melanie. After a bunch 
> > of
> > wrong paths I think I found the issue: We end up prefetching blocks we have
> > already read. Notably this happens even as-is on master - just not as
> > frequently as after moving BitmapAdjustPrefetchIterator().
> >
> > From what I can tell the prefetching in parallel bitmap heap scans is
> > thoroughly broken.  I added some tracking of the last block read, the last
> > block prefetched to ParallelBitmapHeapState and found that with a small
> > effective_io_concurrency we end up with ~18% of prefetches being of blocks 
> > we
> > already read! After moving the BitmapAdjustPrefetchIterator() it rises to 
> > 86%,
> > no wonder it's slower...
> >
> > The race here seems fairly substantial - we're moving the two iterators
> > independently from each other, in multiple processes, without useful 
> > locking.
> >
> > I'm inclined to think this is a bug we ought to fix in the backbranches.
> 
> Thinking about how to fix this, perhaps we could keep the current max
> block number in the ParallelBitmapHeapState and then when prefetching,
> workers could loop calling tbm_shared_iterate() until they've found a
> block at least prefetch_pages ahead of the current block. They
> wouldn't need to read the current max value from the parallel state on
> each iteration. Even checking it once and storing that value in a
> local variable prevented prefetching blocks after reading them in my
> example repro of the issue.

That would address some of the worst behaviour, but it doesn't really seem to
address the underlying problem of the two iterators being modified
independently. ISTM the proper fix would be to protect the state of the
iterators with a single lock, rather than pushing down the locking into the
bitmap code. OTOH, we'll only need one lock going forward, so being economic
in the effort of fixing this is also important.

Greetings,

Andres Freund




Re: Vectored I/O in bulk_write.c

2024-03-16 Thread Andres Freund
Hi,

On 2024-03-16 12:27:15 +1300, Thomas Munro wrote:
> I canvassed Andres off-list since smgrzeroextend() is his invention,
> and he wondered if it was a good idea to blur the distinction between
> the different zero-extension strategies like that.  Good question.  My
> take is that it's fine:
> 
> mdzeroextend() already uses fallocate() only for nblocks > 8, but
> otherwise writes out zeroes, because that was seen to interact better
> with file system heuristics on common systems.  That preserves current
> behaviour for callers of plain-old smgrextend(), which becomes a
> wrapper for smgrwrite(..., 1, _ZERO | _EXTEND).  If some hypothetical
> future caller wants to be able to call smgrwritev(..., NULL, 9 blocks,
> _ZERO | _EXTEND) directly without triggering the fallocate() strategy
> for some well researched reason, we could add a new flag to say so, ie
> adjust that gating.
> 
> In other words, we have already blurred the semantics.  To me, it
> seems nicer to have a single high level interface for the same logical
> operation, and flags to select strategies more explicitly if that is
> eventually necessary.

I don't think zeroextend on the one hand and a normal write or extend on the
other hand are really the same operation. In the former case the content
is hard-coded; in the latter it's caller-provided. Sure, we can deal with that
by special-casing NULL content - but at that point, what's the benefit of
combining the two operations?  I'm not dead-set against this, just not really
convinced that it's a good idea to combine the operations.

Greetings,

Andres Freund




Q: Escapes in jsonpath Idents

2024-03-16 Thread David E. Wheeler
Hackers,

The jsonpath doc[1] has an excellent description of the format of strings, but 
for unquoted path keys, it simply says:

> Member accessor that returns an object member with the specified key. If the 
> key name matches some named variable starting with $ or does not meet the 
> JavaScript rules for an identifier, it must be enclosed in double quotes to 
> make it a string literal.

I went looking for the JavaScript rules for an identifier and found this in the 
MDN docs[2]:

> In JavaScript, identifiers can contain Unicode letters, $, _, and digits 
> (0-9), but may not start with a digit. An identifier differs from a string in 
> that a string is data, while an identifier is part of the code. In 
> JavaScript, there is no way to convert identifiers to strings, but sometimes 
> it is possible to parse strings into identifiers.


However, the Postgres parsing of jsonpath keys appears to follow the same rules 
as strings, allowing backslash escapes:

david=# select '$.fo\u00f8 == $x'::jsonpath;
      jsonpath       
---------------------
 ($."foø" == $"x")

This would seem to contradict the documentation. Is this behavior required by 
the SQL standard? Do the docs need updating? Or should the code actually follow 
the JSON identifier behavior?

Thanks,

David

PS: Those excellent docs on strings mention support for \v, but the grammar in 
the right nav of https://www.json.org/json-en.html does not. Another bonus 
feature?

[1]: https://www.postgresql.org/docs/16/datatype-json.html#DATATYPE-JSONPATH
[2]: https://developer.mozilla.org/en-US/docs/Glossary/Identifier





Re: UUID v7

2024-03-16 Thread Andrey M. Borodin


> On 15 Mar 2024, at 14:47, Aleksander Alekseev  
> wrote:
> 
> +1 to the idea. I doubt that anyone will miss it.

PFA v22.

Changes:
1. Squashed all editorialisation by Jelte
2. Fixed my erroneous comments on using Method 2 (we are using method 1 instead)
3. Remove all traces of uuid_extract_variant()

Thanks!


Best regards, Andrey Borodin.


v22-0001-Implement-UUID-v7.patch
Description: Binary data


Re: Proposal to include --exclude-extension Flag in pg_dump

2024-03-16 Thread Ayush Vatsa
Hi,

> In my opinion the order of options in pg_dump.sgml and the --help
> output is fine. Keeping this new option together with -e/--extension
> makes it easier to see, while otherwise it would get lost much further
> down.

I agree with your suggestion, so I'll maintain the original order as
proposed.

> The --filter option should be extended to support "exclude
> extension pattern" lines in the filter file. That syntax is already
> accepted, but it throws a not-supported error, but it's hopefully not
> too hard to make that work now.

While proposing the --exclude-extension flag in pg_dump, the --filter
option wasn't there, so it got overlooked. But now I've familiarized
myself with it and have tried to enhance its functionality with
exclude-extension in the provided patch. Thank you for bringing this to
my attention.

> It ought to have some tests in the test script.

Correct, I should have added them in the first patch itself.
As a newcomer to the codebase, I explored how testing works and tried
to include test cases for the newly added flag. I've confirmed that all
test cases, including the ones I added, are passing.

Attached is the complete patch with all the required code changes.
Looking forward to your review and feedback.
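
As a rough sketch of the intended usage (the extension, database, and file
names here are hypothetical), the new flag and the corresponding filter-file
line would look like this:

# exclude one extension directly on the command line
pg_dump --exclude-extension=some_ext mydb > mydb.sql

# or via --filter, where filter.txt contains the line:
#   exclude extension some_ext
pg_dump --filter=filter.txt mydb > mydb.sql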


On Wed, 6 Mar 2024 at 16:04, Dean Rasheed  wrote:

> On Mon, 1 Jan 2024 at 13:28, Ayush Vatsa  wrote:
> >
> > According to the documentation of pg_dump when the --extension option is
> not specified, all non-system extensions in the target database will get
> dumped.
> > > Why do we need to explicitly exclude extensions?
> > Hence to include only a few we use --extension, but to exclude a few I
> am proposing --exclude-extension.
> >
>
> Thanks for working on this. It seems like a useful feature to have.
> The code changes look good, and it appears to work as expected.
>
> In my opinion the order of options in pg_dump.sgml and the --help
> output is fine. Keeping this new option together with -e/--extension
> makes it easier to see, while otherwise it would get lost much further
> down. Other opinions on that might differ though.
>
> There are a couple of things missing from the patch, that should be added:
>
> 1). The --filter option should be extended to support "exclude
> extension pattern" lines in the filter file. That syntax is already
> accepted, but it throws a not-supported error, but it's hopefully not
> too hard to make that work now.
>
> 2). It ought to have some tests in the test script.
>
> Regards,
> Dean
>


v2-0001-Add-support-for-exclude-extension-in-pg_dump.patch
Description: Binary data


Re: Improving EXPLAIN's display of SubPlan nodes

2024-03-16 Thread Tom Lane
I wrote:
> Dean Rasheed  writes:
>> One thing that concerns me about making even greater use of "$n" is
>> the potential for confusion with generic plan parameters.

> True.

After looking at your draft some more, it occurred to me that we're
not that far from getting rid of showing "$n" entirely in this
context.  Instead of (subplan_name).$n, we could write something like
(subplan_name).colN or (subplan_name).columnN or (subplan_name).fN,
depending on your taste for verboseness.  "fN" would correspond to the
names we assign to columns of anonymous record types, but it hasn't
got much else to recommend it.  In the attached I used "colN";
"columnN" would be my next choice.

You could also imagine trying to use the sub-SELECT's actual output
column names, but I fear that would be ambiguous --- too often it'd
be "?column?" or some other non-unique name.

> Hmm.  I guess what bothers me about that is that it could be read to
> suggest that the initplan or subplan is evaluated again for each
> output parameter.

This objection seems like it could be solved through documentation,
so I wrote some.

The attached proof-of-concept is incomplete: it fails to replace some
$n occurrences with subplan references, as is visible in some of the
test cases.  I believe your quick hack in get_parameter() is not
correct in detail, but for the moment I didn't bother to debug it.
I'm just presenting this as a POC to see if this is the direction
people would like to go in.  If there's not objections, I'll try to
make a bulletproof implementation.

regards, tom lane

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 58a603ac56..585ba60ae3 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -3062,10 +3062,10 @@ select exists(select 1 from pg_enum), sum(c1) from ft1;
 QUERY PLAN
 --
  Foreign Scan
-   Output: $0, (sum(ft1.c1))
+   Output: (InitPlan 1).col1, (sum(ft1.c1))
Relations: Aggregate on (public.ft1)
Remote SQL: SELECT sum("C 1") FROM "S 1"."T 1"
-   InitPlan 1 (returns $0)
+   InitPlan 1
  ->  Seq Scan on pg_catalog.pg_enum
 (6 rows)
 
@@ -3080,8 +3080,8 @@ select exists(select 1 from pg_enum), sum(c1) from ft1 group by 1;
 QUERY PLAN 
 ---
  GroupAggregate
-   Output: $0, sum(ft1.c1)
-   InitPlan 1 (returns $0)
+   Output: (InitPlan 1).col1, sum(ft1.c1)
+   InitPlan 1
  ->  Seq Scan on pg_catalog.pg_enum
->  Foreign Scan on public.ft1
  Output: ft1.c1
@@ -3305,10 +3305,10 @@ select sum(c1) filter (where (c1 / c1) * random() <= 1) from ft1 group by c2 ord
 
 explain (verbose, costs off)
 select sum(c2) filter (where c2 in (select c2 from ft1 where c2 < 5)) from ft1;
-QUERY PLAN 

+  QUERY PLAN   
+---
  Aggregate
-   Output: sum(ft1.c2) FILTER (WHERE (hashed SubPlan 1))
+   Output: sum(ft1.c2) FILTER (WHERE (ANY (ft1.c2 = (hashed SubPlan 1).col1)))
->  Foreign Scan on public.ft1
  Output: ft1.c2
  Remote SQL: SELECT c2 FROM "S 1"."T 1"
@@ -6171,9 +6171,9 @@ UPDATE ft2 AS target SET (c2, c7) = (
  Update on public.ft2 target
Remote SQL: UPDATE "S 1"."T 1" SET c2 = $2, c7 = $3 WHERE ctid = $1
->  Foreign Scan on public.ft2 target
- Output: $1, $2, (SubPlan 1 (returns $1,$2)), target.ctid, target.*
+ Output: $1, $2, (SubPlan 1), target.ctid, target.*
  Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8, ctid FROM "S 1"."T 1" WHERE (("C 1" > 1100)) FOR UPDATE
- SubPlan 1 (returns $1,$2)
+ SubPlan 1
->  Foreign Scan on public.ft2 src
  Output: (src.c2 * 10), src.c7
  Remote SQL: SELECT c2, c7 FROM "S 1"."T 1" WHERE (($1::integer = "C 1"))
@@ -11687,7 +11687,7 @@ SELECT * FROM local_tbl t1 LEFT JOIN (SELECT *, (SELECT count(*) FROM async_pt W
  Nested Loop Left Join
Output: t1.a, t1.b, t1.c, async_pt.a, async_pt.b, async_pt.c, ($0)
Join Filter: (t1.a = async_pt.a)
-   InitPlan 1 (returns $0)
+   InitPlan 1
  ->  Aggregate
Output: count(*)
->  Append
@@ -11713,7 +11713,7 @@ SELECT * FROM local_tbl t1 LEFT JOIN (SELECT *, (SELECT count(*) FROM async_pt W
  Nested Loop Left Join (actual rows=1 loops=1)
Join Filter: (t1.a = async_pt.a)
Rows Removed by Join Filter: 399
-   InitPlan 1 (returns $0)
+   InitPlan 1
  ->  Aggregate (actual rows=1 loops=1)
->  Append (actual rows=400 loops=1)
  ->  Async Foreign Scan on async_p1 async_pt_4 

Re: speed up a logical replica setup

2024-03-16 Thread vignesh C
On Sat, 16 Mar 2024 at 21:16, Euler Taveira  wrote:
>
> On Sat, Mar 16, 2024, at 10:31 AM, vignesh C wrote:
>
> I was able to reproduce this random failure and found the following reason:
> The Minimum recovery ending location 0/500 was greater than the
> specified recovery_target_lsn "0/4001198". In a few random cases the
> standby applies a few more WAL records after the replication slot is
> created; this leads to the minimum recovery ending location being greater
> than the recovery_target_lsn, because of which the server will fail
> with:
> FATAL:  requested recovery stop point is before consistent recovery point
>
>
> Thanks for checking. I proposed an alternative patch for it [1]. Can you check
> it?

This approach looks good to me.

Regards,
Vignesh




Re: speed up a logical replica setup

2024-03-16 Thread Euler Taveira
On Sat, Mar 16, 2024, at 10:31 AM, vignesh C wrote:
> I was able to reproduce this random failure and found the following reason:
> The Minimum recovery ending location 0/500 was greater than the
> specified recovery_target_lsn "0/4001198". In a few random cases the
> standby applies a few more WAL records after the replication slot is
> created; this leads to the minimum recovery ending location being greater
> than the recovery_target_lsn, because of which the server will fail
> with:
> FATAL:  requested recovery stop point is before consistent recovery point

Thanks for checking. I proposed an alternative patch for it [1]. Can you check
it?

[1] 
https://www.postgresql.org/message-id/34637e7f-0330-420d-8f45-1d022962d2fe%40app.fastmail.com


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: speed up a logical replica setup

2024-03-16 Thread Euler Taveira
On Fri, Mar 15, 2024, at 3:34 AM, Amit Kapila wrote:
> Did you consider adding options for publication/subscription/slot
> names as mentioned in my previous email? As discussed in a few emails
> above, it would be quite confusing for users to identify the logical
> replication objects once the standby is converted to subscriber.

Yes. I was planning to implement it after v1 is pushed. I started to write code
for it but I wasn't sure about the UI. The best approach I came up with was
multiple options in the same order. (I don't provide short options to avoid
possible portability issues with the order.) It means that if I have 3 databases
and the following command line:

pg_createsubscriber ... --database pg1 --database pg2 --database pg3 --publication
pubx --publication puby --publication pubz

pubx, puby and pubz are created in databases pg1, pg2, and pg3, respectively.

> It seems we care only for publications created on the primary. Isn't
> it possible that some of the publications have been replicated to
> standby by that time, for example, in case failure happens after
> creating a few publications? IIUC, we don't care for standby cleanup
> after failure because it can't be used for streaming replication
> anymore. So, the only choice the user has is to recreate the standby
> and start the pg_createsubscriber again. This sounds questionable to
> me as to whether users would like this behavior. Does anyone else have
> an opinion on this point?

If it happens after creating a publication and before promotion, the cleanup
routine will drop the publications on primary and it will eventually be applied
to the standby via replication later.

> I see the below note in the patch:
> +If pg_createsubscriber fails while processing,
> +then the data directory is likely not in a state that can be recovered. 
> It
> +is true if the target server was promoted. In such a case, creating a new
> +standby server is recommended.
> 
> By reading this it is not completely clear whether the standby is not
> recoverable in case of any error or only an error after the target
> server is promoted. If others agree with this behavior then we should
> write the detailed reason for this somewhere in the comments as well
> unless it is already explained.

I rewrote the sentence to make it clear that only if the server is promoted,
the target server will be in a state that cannot be reused. It provides a
message saying it too.

pg_createsubscriber: target server reached the consistent state
pg_createsubscriber: hint: If pg_createsubscriber fails after this point, you
must recreate the physical replica before continuing.

I'm attaching a new version (v30) that adds:

* 3 new options (--publication, --subscription, --replication-slot) to assign
  names to the objects. The --database option used to ignore duplicate names,
  however, since these new options rely on the number of database options to
  match the number of object name options, it is forbidden from now on. The
  duplication is also forbidden for the object names to avoid errors earlier.
* rewrite the paragraph related to the unusable target server after
  pg_createsubscriber fails.
* Vignesh reported an issue [1] related to reaching the recovery stop point
  before the consistent state is reached. I proposed a simple patch that fixes
  the issue.

[1] 
https://www.postgresql.org/message-id/CALDaNm3VMOi0GugGvhk3motghaFRKSWMCSE2t3YX1e%2BMttToxg%40mail.gmail.com


--
Euler Taveira
EDB   https://www.enterprisedb.com/


v30-0001-pg_createsubscriber-creates-a-new-logical-replic.patch.gz
Description: application/gzip


v30-0002-Stop-the-target-server-earlier.patch.gz
Description: application/gzip


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-16 Thread Bharath Rupireddy
On Sat, Mar 16, 2024 at 4:26 PM Amit Kapila  wrote:
>
> On Fri, Mar 15, 2024 at 7:50 PM Alexander Korotkov  
> wrote:
> >
> > I went through this patch another time, and made some minor
> > adjustments.  Now it looks good, I'm going to push it if no
> > objections.
> >
>
> I have a question related to usability, if the regular reads (say a
> Select statement or reads via function/procedure) need a similar
> guarantee to see the changes on standby then do they also always need
> to first do something like "BEGIN AFTER '0/3F0FF791' WITHIN 1000;"? Or
> in other words, shouldn't we think of something for implicit
> transactions?

+1 to having support for implicit txns. A strawman solution I can think
of is to let the primary send its current insert LSN to the standby every
time it sends a bunch of WAL, and the standby waits for that LSN to be
replayed at the start of every implicit txn automatically.

The new BEGIN syntax requires application code changes. This led me to
think about how one can achieve read-after-write consistency today in a
primary-standby setup. All the logic of this patch, that is,
waiting for the standby to pass a given primary LSN, needs to be done
in the application code (or in a proxy or load balancer?). I believe
there might be someone doing this already; it would be good to hear from
them.
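
For reference, a minimal sketch of that application-level approach using
existing functions (the LSN value shown is illustrative):

-- On the primary, after the write has committed:
SELECT pg_current_wal_insert_lsn();   -- suppose it returns '0/3F0FF791'

-- On the standby, before reading, poll until that LSN has been replayed:
SELECT pg_last_wal_replay_lsn() >= '0/3F0FF791'::pg_lsn AS caught_up;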

> In general, it seems this patch has been stuck for a long time on the
> decision to choose an appropriate UI (syntax), and we thought of
> moving it further so that the other parts of the patch can be
> reviewed/discussed. So, I feel before pushing this we should see
> comments from a few (at least two) other senior members who earlier
> shared their opinion on the syntax. I know we don't have much time
> left but OTOH pushing such a change (where we didn't have a consensus
> on syntax) without much discussion at this point of time could lead to
> discussions after commit.

+1 to gaining consensus first on the syntax changes. With this, we might
be violating the SQL standard for explicit txn commands (I stand to be
corrected about the SQL standard, though).

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: speed up a logical replica setup

2024-03-16 Thread vignesh C
On Mon, 11 Mar 2024 at 10:33, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Vignesh,
>
> Thanks for updating the patch, but cfbot still got angry [1].
> Note that two containers (autoconf and meson) failed at different place,
> so I think it is intentional one. It seems that there may be a bug related 
> with 32-bit build.
>
> We should see and fix as soon as possible.

I was able to reproduce this random failure and found the following reason:
The Minimum recovery ending location 0/500 was greater than the
specified recovery_target_lsn "0/4001198". In a few random cases the
standby applies a few more WAL records after the replication slot is
created; this leads to the minimum recovery ending location being greater
than the recovery_target_lsn, because of which the server will fail
with:
FATAL:  requested recovery stop point is before consistent recovery point

I have fixed it by pausing the replay in the standby server before the
replication slots get created.
The attached v29-0002-Keep-standby-server-s-minimum-recovery-point-les.patch
patch has the changes for the same.
Thoughts?

Regards,
Vignesh
From e98e1ba0661e3d658de8d46ff9082f6cd4040b41 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Sat, 16 Mar 2024 18:47:35 +0530
Subject: [PATCH v29 2/2] Keep standby server's minimum recovery point less
 than the consistent_lsn.

Standby server should not get ahead of the replication slot's lsn. If
this happens we will not be able to promote the standby server as it
will not be able to reach the consistent point because the standby
server's Minimum recovery ending location will be greater than the
consistent_lsn. Fixed this by pausing the replay before the replication
slots are created.
---
 src/bin/pg_basebackup/pg_createsubscriber.c | 93 -
 1 file changed, 92 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index c565f8524a..18e8757403 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -95,6 +95,8 @@ static void drop_publication(PGconn *conn, struct LogicalRepInfo *dbinfo);
 static void create_subscription(PGconn *conn, struct LogicalRepInfo *dbinfo);
 static void set_replication_progress(PGconn *conn, struct LogicalRepInfo *dbinfo,
 	 const char *lsn);
+static void pause_replay_stantby_server(struct LogicalRepInfo *dbinfo,
+		struct CreateSubscriberOptions *opt);
 static void enable_subscription(PGconn *conn, struct LogicalRepInfo *dbinfo);
 
 #define	USEC_PER_SEC	100
@@ -917,7 +919,9 @@ check_subscriber(struct LogicalRepInfo *dbinfo)
 	appendPQExpBuffer(str,
 	  "SELECT pg_catalog.pg_has_role(current_user, %u, 'MEMBER'), "
 	  "pg_catalog.has_database_privilege(current_user, '%s', 'CREATE'), "
-	  "pg_catalog.has_function_privilege(current_user, 'pg_catalog.pg_replication_origin_advance(text, pg_lsn)', 'EXECUTE')",
+	  "pg_catalog.has_function_privilege(current_user, 'pg_catalog.pg_replication_origin_advance(text, pg_lsn)', 'EXECUTE'), "
+	  "pg_catalog.has_function_privilege(current_user, 'pg_catalog.pg_wal_replay_pause()', 'EXECUTE'), "
+	  "pg_catalog.has_function_privilege(current_user, 'pg_catalog.pg_get_wal_replay_pause_state()', 'EXECUTE')",
 	  ROLE_PG_CREATE_SUBSCRIPTION, dbinfo[0].dbname);
 
 	pg_log_debug("command is: %s", str->data);
@@ -949,6 +953,18 @@ check_subscriber(struct LogicalRepInfo *dbinfo)
 	 "pg_catalog.pg_replication_origin_advance(text, pg_lsn)");
 		failed = true;
 	}
+	if (strcmp(PQgetvalue(res, 0, 3), "t") != 0)
+	{
+		pg_log_error("permission denied for function \"%s\"",
+	 "pg_catalog.pg_wal_replay_pause()");
+		failed = true;
+	}
+	if (strcmp(PQgetvalue(res, 0, 4), "t") != 0)
+	{
+		pg_log_error("permission denied for function \"%s\"",
+	 "pg_catalog.pg_get_wal_replay_pause_state()");
+		failed = true;
+	}
 
 	destroyPQExpBuffer(str);
 	PQclear(res);
@@ -1026,6 +1042,72 @@ check_subscriber(struct LogicalRepInfo *dbinfo)
 		exit(1);
 }
 
+/*
+ * Pause replaying at the standby server.
+ */
+static void
+pause_replay_stantby_server(struct LogicalRepInfo *dbinfo,
+			struct CreateSubscriberOptions *opt)
+{
+	PGconn	   *conn;
+	PGresult   *res;
+
+	/* Connect to subscriber. */
+	conn = connect_database(dbinfo[0].subconninfo, true);
+
+	pg_log_info("Pausing the replay in standby server");
+	pg_log_debug("command is: SELECT pg_catalog.pg_wal_replay_pause()");
+
+	if (!dry_run)
+	{
+		int			timer = 0;
+
+		res = PQexec(conn, "SELECT pg_catalog.pg_wal_replay_pause()");
+		if (PQresultStatus(res) != PGRES_TUPLES_OK)
+		{
+			pg_log_error("could not pause replay in standby server: %s",
+		 PQresultErrorMessage(res));
+			disconnect_database(conn, true);
+		}
+		PQclear(res);
+
+		/* Wait till replay has paused */
+		for(;;)
+		{
+			pg_log_debug("command is: SELECT pg_catalog.pg_get_wal_replay_pause_state()");
+			res = PQexec(conn, "SELECT 

Building with meson on NixOS/nixpkgs

2024-03-16 Thread Wolfgang Walther
To build on NixOS/nixpkgs I came up with a few small patches to 
meson.build. All of this works fine with Autoconf/Make already.

From 24ae72b9b0adc578c6729eff59c9038e6b4ac517 Mon Sep 17 00:00:00 2001
From: Wolfgang Walther 
Date: Sat, 2 Mar 2024 17:18:38 +0100
Subject: [PATCH 1/3] Fallback to uuid for ossp-uuid with meson

The upstream name for the ossp-uuid package / pkg-config file is "uuid". Many
distributions change this to be "ossp-uuid" to not conflict with e2fsprogs.

This lookup fails on distributions which don't change this name, for example
NixOS / nixpkgs. Both "ossp-uuid" and "uuid" are also checked in configure.ac.
---
 meson.build | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/meson.build b/meson.build
index c8fdfeb0ec3..d9ad1ce902f 100644
--- a/meson.build
+++ b/meson.build
@@ -1346,7 +1346,7 @@ if uuidopt != 'none'
 uuidfunc = 'uuid_to_string'
 uuidheader = 'uuid.h'
   elif uuidopt == 'ossp'
-uuid = dependency('ossp-uuid', required: true)
+uuid = dependency('ossp-uuid', 'uuid', required: true)
 uuidfunc = 'uuid_export'
 uuidheader = 'uuid.h'
   else
-- 
2.44.0

From 56e0abbcc3b950b6e93eddc6ede453ce529423ea Mon Sep 17 00:00:00 2001
From: Wolfgang Walther 
Date: Sat, 2 Mar 2024 22:06:25 +0100
Subject: [PATCH 2/3] Fallback to clang in PATH with meson

Some distributions put clang into a different path than the llvm binary path.

For example, this is the case on NixOS / nixpkgs, which failed to find clang
with meson before this patch.
---
 meson.build | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/meson.build b/meson.build
index d9ad1ce902f..7c64fb04ea5 100644
--- a/meson.build
+++ b/meson.build
@@ -759,7 +759,7 @@ if add_languages('cpp', required: llvmopt, native: false)
 llvm_binpath = llvm.get_variable(configtool: 'bindir')
 
 ccache = find_program('ccache', native: true, required: false)
-clang = find_program(llvm_binpath / 'clang', required: true)
+clang = find_program(llvm_binpath / 'clang', 'clang', required: true)
   endif
 elif llvmopt.auto()
   message('llvm requires a C++ compiler')
-- 
2.44.0

From 62f10689d843227fca6d54e86462d0be5c4f434f Mon Sep 17 00:00:00 2001
From: Wolfgang Walther 
Date: Mon, 11 Mar 2024 19:54:41 +0100
Subject: [PATCH 3/3] Support absolute bindir/libdir in regression tests with
 meson

Passing an absolute bindir/libdir will install the binaries and libraries to
/tmp_install/<bindir> and /tmp_install/<libdir> respectively.

This path is correctly passed to the regression test suite via configure/make,
but not via meson, yet. This is because the "/" operator in the following expression
throws away the whole left side when the right side is an absolute path:

  test_install_location / get_option('libdir')

This was already correctly handled for dir_prefix, which is likely absolute as well.
This patch handles both bindir and libdir in the same way - prefixing absolute paths
with the tmp_install path correctly.
---
 meson.build | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/meson.build b/meson.build
index 7c64fb04ea5..e8438209058 100644
--- a/meson.build
+++ b/meson.build
@@ -3044,15 +3044,17 @@ test_install_destdir = meson.build_root() / 'tmp_install/'
 if build_system != 'windows'
   # On unixoid systems this is trivial, we just prepend the destdir
   assert(dir_prefix.startswith('/')) # enforced by meson
-  test_install_location = '@0@@1@'.format(test_install_destdir, dir_prefix)
+  temp_install_bindir = '@0@@1@'.format(test_install_destdir, dir_prefix / dir_bin)
+  temp_install_libdir = '@0@@1@'.format(test_install_destdir, dir_prefix / dir_lib)
 else
   # drives, drive-relative paths, etc make this complicated on windows, call
   # into a copy of meson's logic for it
   command = [
 python, '-c',
 'import sys; from pathlib import PurePath; d1=sys.argv[1]; d2=sys.argv[2]; print(str(PurePath(d1, *PurePath(d2).parts[1:])))',
-test_install_destdir, dir_prefix]
-  test_install_location = run_command(command, check: true).stdout().strip()
+test_install_destdir]
+  temp_install_bindir = run_command(command, dir_prefix / dir_bin, check: true).stdout().strip()
+  temp_install_libdir = run_command(command, dir_prefix / dir_lib, check: true).stdout().strip()
 endif
 
 meson_install_args = meson_args + ['install'] + {
@@ -3089,7 +3091,6 @@ testport = 4
 
 test_env = environment()
 
-temp_install_bindir = test_install_location / get_option('bindir')
 test_initdb_template = meson.build_root() / 'tmp_install' / 'initdb-template'
 test_env.set('PG_REGRESS', pg_regress.full_path())
 test_env.set('REGRESS_SHLIB', regress_module.full_path())
@@ -3104,7 +3105,7 @@ test_env.set('PG_TEST_EXTRA', get_option('PG_TEST_EXTRA'))
 # that works (everything but windows, basically). On windows everything
 # library-like gets installed into bindir, solving that issue.
 if library_path_var != ''
-  test_env.prepend(library_path_var, test_install_location / 

Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-16 Thread Amit Kapila
On Fri, Mar 15, 2024 at 7:50 PM Alexander Korotkov  wrote:
>
> I went through this patch another time, and made some minor
> adjustments.  Now it looks good, I'm going to push it if no
> objections.
>

I have a question related to usability, if the regular reads (say a
Select statement or reads via function/procedure) need a similar
guarantee to see the changes on standby then do they also always need
to first do something like "BEGIN AFTER '0/3F0FF791' WITHIN 1000;"? Or
in other words, shouldn't we think of something for implicit
transactions?

In general, it seems this patch has been stuck for a long time on the
decision to choose an appropriate UI (syntax), and we thought of
moving it further so that the other parts of the patch can be
reviewed/discussed. So, I feel before pushing this we should see
comments from a few (at least two) other senior members who earlier
shared their opinion on the syntax. I know we don't have much time
left but OTOH pushing such a change (where we didn't have a consensus
on syntax) without much discussion at this point of time could lead to
discussions after commit.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-16 Thread Amit Kapila
On Fri, Mar 15, 2024 at 10:45 AM Bharath Rupireddy
 wrote:
>
> On Wed, Mar 13, 2024 at 9:38 AM Amit Kapila  wrote:
> >
> > BTW, is XID the based parameter 'max_slot_xid_age' not have similarity
> > with 'max_slot_wal_keep_size'? I think it will impact the rows we
> > removed based on xid horizons. Don't we need to consider it while
> > vacuum computing the xid horizons in ComputeXidHorizons() similar to
> > what we do for WAL w.r.t 'max_slot_wal_keep_size'?
>
> I'm having a hard time understanding why we'd need something up there
> in ComputeXidHorizons(). Can you elaborate it a bit please?
>
> What's proposed with max_slot_xid_age is that during checkpoint we
> look at slot's xmin and catalog_xmin, and the current system txn id.
> Then, if the XID age of (xmin, catalog_xmin) and current_xid crosses
> max_slot_xid_age, we invalidate the slot.
>

I can see that in your patch (in function
InvalidatePossiblyObsoleteSlot()). As per my understanding, we need
something similar for slot xids in ComputeXidHorizons() as we are
doing for WAL in KeepLogSeg(). In KeepLogSeg(), we compute the minimum LSN
location required by slots and then adjust it for
'max_slot_wal_keep_size'. On similar lines, currently in
ComputeXidHorizons(), we compute the minimum xid required by slots
(procArray->replication_slot_xmin and
procArray->replication_slot_catalog_xmin) but then don't adjust it for
'max_slot_xid_age'. I could be missing something in this but it is
better to keep discussing this and try to move with another parameter
'inactive_replication_slot_timeout' which according to me can be kept
at the slot level instead of a GUC, but OTOH we need to see the arguments
on both sides and then decide which makes more sense.
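
To make the analogy concrete, here is a purely hypothetical sketch of what
such an age-based clamping might look like (not taken from the patch; the
function name and its placement are assumptions, only the transam helpers
are existing APIs):

#include "postgres.h"
#include "access/transam.h"

/*
 * Clamp a slot-derived xmin so that it is never more than max_slot_xid_age
 * transactions behind the next XID, analogous to how KeepLogSeg() clamps
 * the slot-derived LSN for max_slot_wal_keep_size.
 */
static TransactionId
clamp_slot_xmin_by_age(TransactionId slot_xmin, int max_slot_xid_age)
{
	TransactionId next_xid;
	TransactionId oldest_allowed;

	if (max_slot_xid_age <= 0 || !TransactionIdIsValid(slot_xmin))
		return slot_xmin;

	next_xid = ReadNextTransactionId();
	oldest_allowed = next_xid - (TransactionId) max_slot_xid_age;

	/* step over the special XIDs if the subtraction wrapped around */
	if (!TransactionIdIsNormal(oldest_allowed))
		oldest_allowed = FirstNormalTransactionId;

	if (TransactionIdPrecedes(slot_xmin, oldest_allowed))
		return oldest_allowed;

	return slot_xmin;
}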

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-16 Thread Bharath Rupireddy
On Sat, Mar 16, 2024 at 2:04 AM Alexander Korotkov  wrote:
>
> > Rebase and update patch.

Thanks for working on this. I took a quick look at the v11 patch. Here are
some comments:

1.
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"

Please place executor/spi.h in alphabetical order. Also, look at
all the other header files and place them in order.

2. It seems like pgindent is not happy with
src/backend/access/transam/xlogrecovery.c and
src/backend/commands/waitlsn.c. Please run it to keep BF member koel
happy post commit.

3. This patch changes the SQL explicit transaction statement syntax; is
this deviation okay from a SQL standard perspective?

4. I think some more checks are needed for better input validation.

4.1 With an invalid LSN it succeeds; shouldn't it error out? Or at least,
add a fast path/quick exit to WaitForLSN()?
BEGIN AFTER '0/0';

4.2 With an unreasonably high future LSN, the BEGIN command waits
unboundedly; shouldn't we check if the specified LSN is more than
pg_last_wal_receive_lsn() and error out?
BEGIN AFTER '0/';
SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn';

4.3 With an unreasonably high wait time, the BEGIN command waits
unboundedly; shouldn't we restrict the wait time to some max value,
say a day or so?
SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn' WITHIN 10;

5.
+#include 
+#include 
+#include "postgres.h"
+#include "pgstat.h"

postgres.h must be included first, then the system header
files, and then all postgres header files, just like below. See a very
recent commit 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=97d85be365443eb4bf84373a7468624762382059.

+#include "postgres.h"

+#include 
+#include 

+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"

6.
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{

I see this patch is waking up all the waiters in the recovery path
after applying every WAL record, which IMO is a hot path. Is the
impact of this change on recovery measured, perhaps using
https://github.com/macdice/redo-bench or similar tools?

7. In continuation to comment #6, why not use Conditional Variables
instead of proc latches to sleep and wait for all the waiters in
WaitLSNSetLatches?
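
A minimal sketch of what a ConditionVariable-based version might look like
(not the patch; the struct, the function names, and the use of
PG_WAIT_EXTENSION as the wait event are placeholders):

#include "postgres.h"

#include "access/xlogdefs.h"
#include "port/atomics.h"
#include "storage/condition_variable.h"
#include "utils/wait_event.h"

typedef struct WaitLSNSharedState
{
	pg_atomic_uint64 replayedUpto;	/* latest replayed LSN */
	ConditionVariable cv;
} WaitLSNSharedState;

/* waiter side: sleep until the shared replayed LSN reaches the target */
static void
wait_for_lsn_cv(WaitLSNSharedState *state, XLogRecPtr target)
{
	ConditionVariablePrepareToSleep(&state->cv);
	while (pg_atomic_read_u64(&state->replayedUpto) < target)
		ConditionVariableSleep(&state->cv, PG_WAIT_EXTENSION);
	ConditionVariableCancelSleep();
}

/* recovery side: advertise progress and wake all waiters at once */
static void
advertise_replayed_lsn(WaitLSNSharedState *state, XLogRecPtr lsn)
{
	pg_atomic_write_u64(&state->replayedUpto, lsn);
	ConditionVariableBroadcast(&state->cv);
}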

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com