Re: speed up a logical replica setup

2024-07-30 Thread Euler Taveira
On Tue, Jul 30, 2024, at 7:35 AM, Peter Eisentraut wrote:
> On 30.07.24 02:05, Euler Taveira wrote:
> > I'm attaching a patch that implements option (c). While reading the code
> > I noticed that I left a comment that should be removed by commit
> > b9639138262. 0002 removes it.
> 
> I have committed your 0002 patch.  It looks like Amit has already 
> committed your 0001 patch.

Peter / Amit, thanks for the quick review and commit.

--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: speed up a logical replica setup

2024-07-30 Thread Peter Eisentraut

On 30.07.24 02:05, Euler Taveira wrote:

I'm attaching a patch that implements option (c). While reading the code
I noticed that I left a comment that should be removed by commit
b9639138262. 0002 removes it.


I have committed your 0002 patch.  It looks like Amit has already 
committed your 0001 patch.





RE: speed up a logical replica setup

2024-07-30 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

I applied your patch and confirmed it could fix the issue [1].

METHOD
===
1. shortened the snapshot interval to 3ms, and
```
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 3
```
2. ran test 040_pg_createsubscriber.pl many times.
   Before patching, I could reproduce the failure within an hour of runs.
   After patching, I ran for more than an hour but could not reproduce it.

Since this is a timing issue that I cannot reproduce reliably, I would like
others to test it as well.

[1]: 
https://www.postgresql.org/message-id/68de6498-0449-a113-dd03-e198dded0bac%40gmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED





Re: 040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)

2024-07-30 Thread Amit Kapila
On Tue, Jul 30, 2024 at 11:28 AM Ashutosh Bapat
 wrote:
>
> On Tue, Jul 30, 2024 at 9:25 AM Amit Kapila  wrote:
> >
> > On Tue, Jul 30, 2024 at 1:48 AM Tom Lane  wrote:
> > >
> > > Robert Haas  writes:
> > > > On Sun, Jun 30, 2024 at 2:40 PM Tom Lane  wrote:
> > > >> ... However, I added a new open item about how the
> > > >> 040_pg_createsubscriber.pl test is slow and still unstable.
> > >
> > > > But that said, I see no commits in the commit history which purport to
> > > > improve performance, so I guess the performance is probably still not
> > > > what you want, though I am not clear on the details.
> > >
> > > My concern is described at [1]:
> > >
> > > >> I have a different but possibly-related complaint: why is
> > > >> 040_pg_createsubscriber.pl so miserably slow?  On my machine it
> > > >> runs for a bit over 19 seconds, which seems completely out of line
> > > >> (for comparison, 010_pg_basebackup.pl takes 6 seconds, and the
> > > >> other test scripts in this directory take much less).  It looks
> > > >> like most of the blame falls on this step:
> > > >>
> > > >> [12:47:22.292](14.534s) ok 28 - run pg_createsubscriber on node S
> > > >>
> > > >> AFAICS the amount of data being replicated is completely trivial,
> > > >> so that it doesn't make any sense for this to take so long --- and
> > > >> if it does, that suggests that this tool will be impossibly slow
> > > >> for production use.  But I suspect there is a logic flaw causing
> > > >> this.  Speculating wildly, perhaps that is related to the failure
> > > >> Alexander spotted?
> > >
> > > The followup discussion in that thread made it sound like there's
> > > some fairly fundamental deficiency in how wait_for_end_recovery()
> > > detects end-of-recovery.  I'm not too conversant with the details
> > > though, and it's possible that pg_createsubscriber is just falling
> > > foul of a pre-existing infelicity.
> > >
> > > If the problem can be correctly described as "pg_createsubscriber
> > > takes 10 seconds or so to detect end-of-stream",
> > >
> >
> > The problem can be defined as: "pg_createsubscriber waits for an
> > additional (new) WAL record to be generated on primary before it
> > considers the standby is ready for becoming a subscriber". Now, on
> > busy systems, this shouldn't be a problem but for idle systems, the
> > time to detect end-of-stream can't be easily defined.
>
> AFAIU, the server will emit a running transactions WAL record at least
> every 15 seconds.
>

AFAICU, this is not true because the code suggests that the running
xacts record is inserted by bgwriter only when enough time has passed
and interesting records have been inserted since the last snapshot.
Please take a look at the following comment and code in bgwriter.c
"Only log if enough time has passed and interesting records have been
inserted since the last snapshot...".

-- 
With Regards,
Amit Kapila.




Re: 040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)

2024-07-29 Thread Ashutosh Bapat
On Tue, Jul 30, 2024 at 9:25 AM Amit Kapila  wrote:
>
> On Tue, Jul 30, 2024 at 1:48 AM Tom Lane  wrote:
> >
> > Robert Haas  writes:
> > > On Sun, Jun 30, 2024 at 2:40 PM Tom Lane  wrote:
> > >> ... However, I added a new open item about how the
> > >> 040_pg_createsubscriber.pl test is slow and still unstable.
> >
> > > But that said, I see no commits in the commit history which purport to
> > > improve performance, so I guess the performance is probably still not
> > > what you want, though I am not clear on the details.
> >
> > My concern is described at [1]:
> >
> > >> I have a different but possibly-related complaint: why is
> > >> 040_pg_createsubscriber.pl so miserably slow?  On my machine it
> > >> runs for a bit over 19 seconds, which seems completely out of line
> > >> (for comparison, 010_pg_basebackup.pl takes 6 seconds, and the
> > >> other test scripts in this directory take much less).  It looks
> > >> like most of the blame falls on this step:
> > >>
> > >> [12:47:22.292](14.534s) ok 28 - run pg_createsubscriber on node S
> > >>
> > >> AFAICS the amount of data being replicated is completely trivial,
> > >> so that it doesn't make any sense for this to take so long --- and
> > >> if it does, that suggests that this tool will be impossibly slow
> > >> for production use.  But I suspect there is a logic flaw causing
> > >> this.  Speculating wildly, perhaps that is related to the failure
> > >> Alexander spotted?
> >
> > The followup discussion in that thread made it sound like there's
> > some fairly fundamental deficiency in how wait_for_end_recovery()
> > detects end-of-recovery.  I'm not too conversant with the details
> > though, and it's possible that pg_createsubscriber is just falling
> > foul of a pre-existing infelicity.
> >
> > If the problem can be correctly described as "pg_createsubscriber
> > takes 10 seconds or so to detect end-of-stream",
> >
>
> The problem can be defined as: "pg_createsubscriber waits for an
> additional (new) WAL record to be generated on primary before it
> considers the standby is ready for becoming a subscriber". Now, on
> busy systems, this shouldn't be a problem but for idle systems, the
> time to detect end-of-stream can't be easily defined.

AFAIU, the server will emit a running transactions WAL record at least
every 15 seconds, so the subscriber should not have to wait longer than
15 seconds. I understand that it would be a problem for tests, but will
it be a problem for end users? Sorry for the repetition if this has been
discussed already.

-- 
Best Wishes,
Ashutosh Bapat




Re: speed up a logical replica setup

2024-07-29 Thread Amit Kapila
On Tue, Jul 30, 2024 at 9:26 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> > Hayato already mentioned one of the solution in a previous email [2].
> > AFAICS we can use any solution that creates a WAL record. I tested the
> > following options:
>
> Yes, my point was that we should add an arbitrary record after the 
> recovery_target_lsn.
>
> > (a) temporary replication slot: requires an additional replication slot.
> > small payload. it is extremely slow in comparison with the other
> > options.
> > (b) logical message: can be consumed by logical replication when/if it
> > is supported some day. big payload. fast.
> > (c) snapshot of running txn:  small payload. fast.
> > (d) named restore point: biggest payload. fast.
>
> Another PoV is whether they trigger a flush of the generated WAL record.
> You've tested pg_logical_emit_message() with flush=false, which means the
> WAL is not flushed, so wait_for_end_recovery() has to wait for a while.
> IIUC, it may wait up to 15 seconds, which is the interval at which bgwriter
> logs a standby snapshot. If flush is set to true, the execution time will
> be longer.
> pg_create_restore_point() also does not flush the generated record, so it
> may be problematic. Moreover, the restore point name might conflict with a
> name that users want to use.
>
> Overall, I could agree to use pg_log_standby_snapshot(). It flushes the WAL
> asynchronously but ensures the timing is not delayed much.
>

The other minor benefit of using pg_log_standby_snapshot() is that
after the standby is converted to subscriber, the publisher will
process this record to see if the slot machinery can be advanced. So,
overall there won't be any harm in using it. I'll check the latest
patch Euler shared.

-- 
With Regards,
Amit Kapila.




Re: 040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)

2024-07-29 Thread Amit Kapila
On Tue, Jul 30, 2024 at 9:56 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > On Tue, Jul 30, 2024 at 1:48 AM Tom Lane  wrote:
> >> If the problem can be correctly described as "pg_createsubscriber
> >> takes 10 seconds or so to detect end-of-stream",
>
> > The problem can be defined as: "pg_createsubscriber waits for an
> > additional (new) WAL record to be generated on primary before it
> > considers the standby is ready for becoming a subscriber". Now, on
> > busy systems, this shouldn't be a problem but for idle systems, the
> > time to detect end-of-stream can't be easily defined.
>
> Got it.  IMO, that absolutely will be a problem for real users,
> not only test cases.
>
> > One of the proposed solutions is that pg_createsubscriber generate a
> > dummy WAL record on the publisher/primary by using something like
> > pg_logical_emit_message(), pg_log_standby_snapshot(), etc. This will
> > fix the problem (BF failures and slow detection for end-of-stream) but
> > sounds more like a hack.
>
> It's undoubtedly a hack, but I like it anyway because it's small,
> self-contained, and easily removable once we have a better solution.
> As you say, it seems a bit late in the v17 cycle to be designing
> anything more invasive.
>

Thanks for your feedback. We will proceed in that direction and try to
close this open item.

-- 
With Regards,
Amit Kapila.




Re: 040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)

2024-07-29 Thread Tom Lane
Amit Kapila  writes:
> On Tue, Jul 30, 2024 at 1:48 AM Tom Lane  wrote:
>> If the problem can be correctly described as "pg_createsubscriber
>> takes 10 seconds or so to detect end-of-stream",

> The problem can be defined as: "pg_createsubscriber waits for an
> additional (new) WAL record to be generated on primary before it
> considers the standby is ready for becoming a subscriber". Now, on
> busy systems, this shouldn't be a problem but for idle systems, the
> time to detect end-of-stream can't be easily defined.

Got it.  IMO, that absolutely will be a problem for real users,
not only test cases.

> One of the proposed solutions is that pg_createsubscriber generate a
> dummy WAL record on the publisher/primary by using something like
> pg_logical_emit_message(), pg_log_standby_snapshot(), etc. This will
> fix the problem (BF failures and slow detection for end-of-stream) but
> sounds more like a hack.

It's undoubtedly a hack, but I like it anyway because it's small,
self-contained, and easily removable once we have a better solution.
As you say, it seems a bit late in the v17 cycle to be designing
anything more invasive.

regards, tom lane




RE: speed up a logical replica setup

2024-07-29 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

> Hayato already mentioned one of the solution in a previous email [2].
> AFAICS we can use any solution that creates a WAL record. I tested the
> following options:

Yes, my point was that we should add an arbitrary record after the 
recovery_target_lsn.

> (a) temporary replication slot: requires an additional replication slot.
> small payload. it is extremely slow in comparison with the other
> options.
> (b) logical message: can be consumed by logical replication when/if it
> is supported some day. big payload. fast.
> (c) snapshot of running txn:  small payload. fast.
> (d) named restore point: biggest payload. fast.

Another PoV is whether they trigger a flush of the generated WAL record.
You've tested pg_logical_emit_message() with flush=false, which means the WAL
is not flushed, so wait_for_end_recovery() has to wait for a while. IIUC, it
may wait up to 15 seconds, which is the interval at which bgwriter logs a
standby snapshot. If flush is set to true, the execution time will be longer.
pg_create_restore_point() also does not flush the generated record, so it may
be problematic. Moreover, the restore point name might conflict with a name
that users want to use.

Overall, I could agree to use pg_log_standby_snapshot(). It flushes the WAL
asynchronously but ensures the timing is not delayed much.
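
For illustration, here is a rough psql sketch of that flush distinction (the
prefix, payload, and restore point name are arbitrary; the optional fourth
argument of pg_logical_emit_message() is the flush flag):

```
-- returns without flushing the record (flush defaults to false)
SELECT pg_logical_emit_message(false, 'pg_createsubscriber', 'dummy');
-- additionally flushes the message record to WAL before returning
SELECT pg_logical_emit_message(false, 'pg_createsubscriber', 'dummy', true);
-- inserts a RUNNING_XACTS record; per the above, it is flushed asynchronously
SELECT pg_log_standby_snapshot();
-- the restore point record is not flushed immediately either
SELECT pg_create_restore_point('pg_createsubscriber');
```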

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: 040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)

2024-07-29 Thread Amit Kapila
On Tue, Jul 30, 2024 at 1:48 AM Tom Lane  wrote:
>
> Robert Haas  writes:
> > On Sun, Jun 30, 2024 at 2:40 PM Tom Lane  wrote:
> >> ... However, I added a new open item about how the
> >> 040_pg_createsubscriber.pl test is slow and still unstable.
>
> > But that said, I see no commits in the commit history which purport to
> > improve performance, so I guess the performance is probably still not
> > what you want, though I am not clear on the details.
>
> My concern is described at [1]:
>
> >> I have a different but possibly-related complaint: why is
> >> 040_pg_createsubscriber.pl so miserably slow?  On my machine it
> >> runs for a bit over 19 seconds, which seems completely out of line
> >> (for comparison, 010_pg_basebackup.pl takes 6 seconds, and the
> >> other test scripts in this directory take much less).  It looks
> >> like most of the blame falls on this step:
> >>
> >> [12:47:22.292](14.534s) ok 28 - run pg_createsubscriber on node S
> >>
> >> AFAICS the amount of data being replicated is completely trivial,
> >> so that it doesn't make any sense for this to take so long --- and
> >> if it does, that suggests that this tool will be impossibly slow
> >> for production use.  But I suspect there is a logic flaw causing
> >> this.  Speculating wildly, perhaps that is related to the failure
> >> Alexander spotted?
>
> The followup discussion in that thread made it sound like there's
> some fairly fundamental deficiency in how wait_for_end_recovery()
> detects end-of-recovery.  I'm not too conversant with the details
> though, and it's possible that pg_createsubscriber is just falling
> foul of a pre-existing infelicity.
>
> If the problem can be correctly described as "pg_createsubscriber
> takes 10 seconds or so to detect end-of-stream",
>

The problem can be defined as: "pg_createsubscriber waits for an
additional (new) WAL record to be generated on primary before it
considers the standby is ready for becoming a subscriber". Now, on
busy systems, this shouldn't be a problem but for idle systems, the
time to detect end-of-stream can't be easily defined.

One of the proposed solutions is that pg_createsubscriber generate a
dummy WAL record on the publisher/primary by using something like
pg_logical_emit_message(), pg_log_standby_snapshot(), etc. This will
fix the problem (BF failures and slow detection for end-of-stream) but
sounds more like a hack. The other ideas that we can consider as
mentioned in [1] require API/design change which is not preferable at
this point. So, the only way seems to be to accept the generation of
dummy WAL records to bring predictability in the tests or otherwise in
the usage of the tool.
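
To sketch the mechanism (illustrative SQL only; pg_createsubscriber itself
polls pg_is_in_recovery() on the target server rather than these statements
being run by hand):

```
-- on the target server: recovery keeps waiting, because on an idle primary
-- no WAL record at or beyond recovery_target_lsn exists yet
SELECT pg_catalog.pg_is_in_recovery();   -- stays true

-- on the primary: any new WAL record past the target unblocks it, e.g.
SELECT pg_log_standby_snapshot();
-- or: SELECT pg_logical_emit_message(false, 'pg_createsubscriber', 'dummy');

-- on the target server: recovery can now reach the target and finish
SELECT pg_catalog.pg_is_in_recovery();   -- eventually returns false
```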

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2Bp%2B7Ag6nqdFRdqowK1EmJ6bG-MtZQ_54dnFBi%3D_OO5RQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-07-29 Thread Euler Taveira
On Mon, Jul 29, 2024, at 6:11 PM, Euler Taveira wrote:
> The options are:
> 
> (a) temporary replication slot: requires an additional replication slot.
> small payload. it is extremely slow in comparison with the other
> options.
> (b) logical message: can be consumed by logical replication when/if it
> is supported some day. big payload. fast.
> (c) snapshot of running txn:  small payload. fast.
> (d) named restore point: biggest payload. fast.
> 
> I don't have a strong preference but if I need to pick one I would
> choose option (c) or option (d). The option (a) is out of the question.

I'm attaching a patch that implements option (c). While reading the code
I noticed that I left a comment that should be removed by commit
b9639138262. 0002 removes it.


--
Euler Taveira
EDB   https://www.enterprisedb.com/
From f4afe05fc7e73c5c23bcdeba4fc65a538c83b8ba Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Mon, 29 Jul 2024 19:44:16 -0300
Subject: [PATCH 1/2] pg_createsubscriber: fix slow recovery

If the primary server is idle when you are running pg_createsubscriber,
it used to take some time during recovery. The reason is that it was
using the LSN returned by pg_create_logical_replication_slot as
recovery_target_lsn. This LSN points to the next WAL record, which might
not be available in the WAL yet; hence, the recovery routine waits until
some activity writes a WAL record so that recovery can end. Inject a new
WAL record after creating the last replication slot to avoid this slowness.

Discussion: https://www.postgresql.org/message-id/2377319.1719766794%40sss.pgh.pa.us
---
 src/bin/pg_basebackup/pg_createsubscriber.c | 23 +
 1 file changed, 23 insertions(+)

diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index b02318782a6..00976c643a1 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -778,6 +778,29 @@ setup_publisher(struct LogicalRepInfo *dbinfo)
 		else
 			exit(1);
 
+		/*
+		 * An idle server might not write a new WAL record until the recovery
+		 * is about to end. Since pg_createsubscriber is using the LSN
+		 * returned by the last replication slot as recovery_target_lsn, this
+		 * LSN is ahead of the current WAL position and the recovery waits
+		 * until something writes a WAL record to reach the target and end
+		 * the recovery. To avoid this slowness during recovery, inject a
+		 * new WAL record here.
+		 */
+		if (i == num_dbs - 1 && !dry_run)
+		{
+			PGresult   *res;
+
+			res = PQexec(conn, "SELECT pg_log_standby_snapshot()");
+			if (PQresultStatus(res) != PGRES_TUPLES_OK)
+			{
+				pg_log_error("could not write an additional WAL record: %s",
+							 PQresultErrorMessage(res));
+				disconnect_database(conn, true);
+			}
+			PQclear(res);
+		}
+
 		disconnect_database(conn, false);
 	}
 
-- 
2.30.2

From 75558e8379abae3a642583f31b21e0ca5db80d2b Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Mon, 29 Jul 2024 20:59:32 -0300
Subject: [PATCH 2/2] pg_createsubscriber: remove obsolete comment

This comment should have been removed by commit b9639138262. There is no
replication slot check on primary anymore.
---
 src/bin/pg_basebackup/pg_createsubscriber.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index 00976c643a1..87668640f78 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -2209,10 +2209,7 @@ main(int argc, char **argv)
 	stop_standby_server(subscriber_dir);
 
 	/*
-	 * Create the required objects for each database on publisher. This step
-	 * is here mainly because if we stop the standby we cannot verify if the
-	 * primary slot is in use. We could use an extra connection for it but it
-	 * doesn't seem worth.
+	 * Create the required objects for each database on publisher.
 	 */
 	consistent_lsn = setup_publisher(dbinfo);
 
-- 
2.30.2



Re: speed up a logical replica setup

2024-07-29 Thread Euler Taveira
On Wed, Jul 17, 2024, at 11:37 PM, Amit Kapila wrote:
> I am thinking of transactions between restart_lsn and "consistent
> point lsn" (aka the point after which all commits will be replicated).
> Your conclusion seems correct to me that such transactions won't be
> replicated by streaming replication and would be skipped by logical
> replication. Now, if we can avoid that anyway, we can consider that.

Upon reflection, what I proposed [1] seems more complex and possibly more
error prone than the other available solutions. The recovery step is slow
if the server is idle (which is the case for the test). An idle server
usually doesn't have another WAL record after creating the replication
slots. Since pg_createsubscriber is using the LSN returned by the last
replication slot as recovery_target_lsn, this LSN is ahead of the
current WAL position and the recovery waits until something writes a
WAL record to reach the target and end the recovery.
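
For illustration, on an idle server the value returned by
pg_create_logical_replication_slot() is the slot's confirmed_flush_lsn, which
points just past the last record that actually exists in WAL (slot name made
up; the LSN matches the run shown below):

```
SELECT lsn FROM pg_create_logical_replication_slot('pg_createsubscriber_demo', 'pgoutput');
-- e.g. 0/942DD28

SELECT restart_lsn, confirmed_flush_lsn
  FROM pg_replication_slots
 WHERE slot_name = 'pg_createsubscriber_demo';
-- confirmed_flush_lsn equals the LSN returned above; no record has been
-- written at that position yet, so a recovery target set to it cannot be
-- reached until new WAL appears

SELECT pg_drop_replication_slot('pg_createsubscriber_demo');
```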

Hayato already mentioned one of the solution in a previous email [2].
AFAICS we can use any solution that creates a WAL record. I tested the
following options: 

\timing
select * from pg_create_logical_replication_slot('pg_createsubscriber', 
'pgoutput', true);
select pg_logical_emit_message(false, 'pg_createsubscriber', 'dummy');
select pg_log_standby_snapshot();
select pg_create_restore_point('pg_createsubscriber');

that results in the following output:

      slot_name      |    lsn
---------------------+-----------
 pg_createsubscriber | 0/942DD28
(1 row)

Time: 200.571 ms
 pg_logical_emit_message 
-------------------------
 0/942DD78
(1 row)

Time: 0.938 ms
 pg_log_standby_snapshot 
-------------------------
 0/942DDB0
(1 row)

Time: 0.741 ms
 pg_create_restore_point 
-------------------------
 0/942DE18
(1 row)

Time: 0.870 ms

and generates the following WAL records:

rmgr: Standby        len (rec/tot): 50/50, tx: 0, lsn: 0/0942DCF0, prev 0/0942DCB8, desc: RUNNING_XACTS nextXid 3939 latestCompletedXid 3938 oldestRunningXid 3939
rmgr: LogicalMessage len (rec/tot): 75/75, tx: 0, lsn: 0/0942DD28, prev 0/0942DCF0, desc: MESSAGE non-transactional, prefix "pg_createsubscriber"; payload (5 bytes): 64 75 6D 6D 79
rmgr: Standby        len (rec/tot): 50/50, tx: 0, lsn: 0/0942DD78, prev 0/0942DD28, desc: RUNNING_XACTS nextXid 3939 latestCompletedXid 3938 oldestRunningXid 3939
rmgr: XLOG           len (rec/tot): 98/98, tx: 0, lsn: 0/0942DDB0, prev 0/0942DD78, desc: RESTORE_POINT pg_createsubscriber

The options are:

(a) temporary replication slot: requires an additional replication slot.
small payload. it is extremely slow in comparison with the other
options.
(b) logical message: can be consumed by logical replication when/if it
is supported some day. big payload. fast.
(c) snapshot of running txn:  small payload. fast.
(d) named restore point: biggest payload. fast.

I don't have a strong preference but if I need to pick one I would
choose option (c) or option (d). The option (a) is out of the question.

Opinions?

[1] 
https://www.postgresql.org/message-id/b1f0f8c7-8f01-4950-af77-339df3dc4684%40app.fastmail.com
[2] 
https://www.postgresql.org/message-id/OSBPR01MB25521B15BF950D2523BBE143F5D32%40OSBPR01MB2552.jpnprd01.prod.outlook.com


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: 040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)

2024-07-29 Thread Tom Lane
Robert Haas  writes:
> On Sun, Jun 30, 2024 at 2:40 PM Tom Lane  wrote:
>> ... However, I added a new open item about how the
>> 040_pg_createsubscriber.pl test is slow and still unstable.

> But that said, I see no commits in the commit history which purport to
> improve performance, so I guess the performance is probably still not
> what you want, though I am not clear on the details.

My concern is described at [1]:

>> I have a different but possibly-related complaint: why is
>> 040_pg_createsubscriber.pl so miserably slow?  On my machine it
>> runs for a bit over 19 seconds, which seems completely out of line
>> (for comparison, 010_pg_basebackup.pl takes 6 seconds, and the
>> other test scripts in this directory take much less).  It looks
>> like most of the blame falls on this step:
>> 
>> [12:47:22.292](14.534s) ok 28 - run pg_createsubscriber on node S
>> 
>> AFAICS the amount of data being replicated is completely trivial,
>> so that it doesn't make any sense for this to take so long --- and
>> if it does, that suggests that this tool will be impossibly slow
>> for production use.  But I suspect there is a logic flaw causing
>> this.  Speculating wildly, perhaps that is related to the failure
>> Alexander spotted?

The followup discussion in that thread made it sound like there's
some fairly fundamental deficiency in how wait_for_end_recovery()
detects end-of-recovery.  I'm not too conversant with the details
though, and it's possible that pg_createsubscriber is just falling
foul of a pre-existing infelicity.

If the problem can be correctly described as "pg_createsubscriber
takes 10 seconds or so to detect end-of-stream", then it's probably
only an annoyance for testing and not something that would be fatal
in the real world.  I'm not quite sure if that's accurate, though.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/2377319.1719766794%40sss.pgh.pa.us#bba9f5ee0efc73151cc521a6bd5182ed




040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)

2024-07-29 Thread Robert Haas
On Sun, Jun 30, 2024 at 2:40 PM Tom Lane  wrote:
> In hopes of moving things along as we approach the v18 branch,
> I went ahead and pushed Kuroda-san's patches (with a bit of
> further editorialization).  AFAICS that allows closing out
> the concerns raised by Noah, so I've marked that open item
> done.  However, I added a new open item about how the
> 040_pg_createsubscriber.pl test is slow and still unstable.

This open item is now the only open item. It seems to have been open
for a month with no response from Peter which is, from my point of
view, far from ideal. However, another thing that is not ideal is that
we've been using the same thread to discuss every issue related to
this patch for 2.5 years. The thread spans hundreds of messages and it
is by no means obvious to what extent the messages posted after this
one addressed the underlying concern. Perhaps it would have been an
idea to start new threads when we started discussing post-commit
issues, instead of just tagging onto the same one.

But that said, I see no commits in the commit history which purport to
improve performance, so I guess the performance is probably still not
what you want, though I am not clear on the details. And as far as
stability is concerned, I peered through the dense haze of
027_stream_regress-related buildfarm failures for long enough to
discover that the stability issues with 040_pg_createsubscriber aren't
fixed either, because we have these recent buildfarm reports:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-07-26%2016%3A02%3A40
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-07-26%2009%3A20%3A15
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=canebrake&dt=2024-07-25%2002%3A39%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-07-22%2002%3A31%3A32

So, Peter, as the committer responsible for pg_createsubscriber, what
are we going to do about this?

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: speed up a logical replica setup

2024-07-17 Thread Amit Kapila
On Wed, Jul 17, 2024 at 5:28 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> > Your analysis sounds correct to me.
>
> Okay, so we could have a same picture...
>
> > > IIUC, the root cause is that pg_create_logical_replication_slot() returns 
> > > a LSN
> > > which is not generated yet. So, I think both mine [1] and Euler's 
> > > approach [2]
> > > can solve the issue. My proposal was to add an extra WAL record after the 
> > > final
> > > slot creation, and Euler's one was to use a restart_lsn as the
> > recovery_target_lsn.
> > >
> >
> > I don't think it is correct to set restart_lsn as consistent_lsn point
> > because the same is used to set replication origin progress. Later
> > when we start the subscriber, the system will use that LSN as a
> > start_decoding_at point which is the point after which all the commits
> > will be replicated. So, we will end up incorrectly using restart_lsn
> > (LSN from where we start reading the WAL) as start_decoding_at point.
> > How could that be correct?
>
> I didn't say we could use restart_lsn as consistent point of logical 
> replication,
> but I could agree the approach has issues.
>
> > Now, even if we use restart_lsn as recovery_target_lsn and the LSN
> > returned by pg_create_logical_replication_slot() as consistent LSN to
> > set replication progress, that also could lead to data loss because
> > the subscriber may never get data between restart_lsn value and
> > consistent LSN value.
>
> You considered the case, e.g., tuples were inserted just after the restart_lsn
> but before the RUNNING_XACT record?

I am thinking of transactions between restart_lsn and "consistent
point lsn" (aka the point after which all commits will be replicated).
Your conclusion seems correct to me that such transactions won't be
replicated by streaming replication and would be skipped by logical
replication. Now, if we can avoid that anyway, we can consider that.

-- 
With Regards,
Amit Kapila.




RE: speed up a logical replica setup

2024-07-17 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> Your analysis sounds correct to me.

Okay, so we could have a same picture...

> > IIUC, the root cause is that pg_create_logical_replication_slot() returns a 
> > LSN
> > which is not generated yet. So, I think both mine [1] and Euler's approach 
> > [2]
> > can solve the issue. My proposal was to add an extra WAL record after the 
> > final
> > slot creation, and Euler's one was to use a restart_lsn as the
> recovery_target_lsn.
> >
> 
> I don't think it is correct to set restart_lsn as consistent_lsn point
> because the same is used to set replication origin progress. Later
> when we start the subscriber, the system will use that LSN as a
> start_decoding_at point which is the point after which all the commits
> will be replicated. So, we will end up incorrectly using restart_lsn
> (LSN from where we start reading the WAL) as start_decoding_at point.
> How could that be correct?

I didn't say we could use restart_lsn as the consistent point of logical
replication, but I could agree the approach has issues.

> Now, even if we use restart_lsn as recovery_target_lsn and the LSN
> returned by pg_create_logical_replication_slot() as consistent LSN to
> set replication progress, that also could lead to data loss because
> the subscriber may never get data between restart_lsn value and
> consistent LSN value.

You considered the case, e.g., tuples were inserted just after the restart_lsn
but before the RUNNING_XACT record? In this case, yes, the streaming replication
finishes before replicating tuples but logical replication will skip them.
Euler's approach cannot be used as-is.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: speed up a logical replica setup

2024-07-17 Thread Amit Kapila
On Wed, Jul 17, 2024 at 1:23 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> I also analyzed this failure; let me share it. Here, I think the events
> occurred in the following order:
>
> 1. Backend created a publication on $db2,
> 2. BGWriter generated RUNNING_XACT record, then
> 3. Backend created a replication slot on $db2.
>
> In this case, the recovery_target_lsn is ahead of the RUNNING_XACT record
> generated at step 3. Also, since both bgwriter and slot creation mark the
> record as an *UNIMPORTANT* one, bgwriter won't log another snapshot even
> after LOG_SNAPSHOT_INTERVAL_MS. The rule is written in BackgroundWriterMain():
>
> ```
> /*
>  * Only log if enough time has passed and interesting records have
>  * been inserted since the last snapshot.  Have to compare with <=
>  * instead of < because GetLastImportantRecPtr() points at the
>  * start of a record, whereas last_snapshot_lsn points just past
>  * the end of the record.
>  */
> if (now >= timeout &&
>     last_snapshot_lsn <= GetLastImportantRecPtr())
> {
>     last_snapshot_lsn = LogStandbySnapshot();
>     last_snapshot_ts = now;
> }
> ```
>
> Therefore, pg_createsubscriber waited until a new record was replicated, but
> no activity was recorded, causing a timeout. Since this is a timing issue,
> Alexander could reproduce the failure with a shorter interval and parallel runs.
>

Your analysis sounds correct to me.

> IIUC, the root cause is that pg_create_logical_replication_slot() returns an
> LSN which is not generated yet. So, I think both mine [1] and Euler's approach
> [2] can solve the issue. My proposal was to add an extra WAL record after the
> final slot creation, and Euler's was to use restart_lsn as the
> recovery_target_lsn.
>

I don't think it is correct to set restart_lsn as consistent_lsn point
because the same is used to set replication origin progress. Later
when we start the subscriber, the system will use that LSN as a
start_decoding_at point which is the point after which all the commits
will be replicated. So, we will end up incorrectly using restart_lsn
(LSN from where we start reading the WAL) as start_decoding_at point.
How could that be correct?

Now, even if we use restart_lsn as recovery_target_lsn and the LSN
returned by pg_create_logical_replication_slot() as consistent LSN to
set replication progress, that also could lead to data loss because
the subscriber may never get data between restart_lsn value and
consistent LSN value.
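
As an illustrative query (not from the test), the two positions in question
can be read from pg_replication_slots; any transaction committing between
them is exactly the window at risk of being skipped:

```
SELECT slot_name, restart_lsn, confirmed_flush_lsn
  FROM pg_replication_slots
 WHERE slot_type = 'logical';
-- restart_lsn: where WAL reading for decoding would start
-- confirmed_flush_lsn: the consistent point; only commits after it are replicated
```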

-- 
With Regards,
Amit Kapila.




RE: speed up a logical replica setup

2024-07-17 Thread Hayato Kuroda (Fujitsu)
Dear Alexander, Euler, Amit,

I also analyzed this failure; let me share it. Here, I think the events
occurred in the following order:

1. Backend created a publication on $db2,
2. BGWriter generated RUNNING_XACT record, then
3. Backend created a replication slot on $db2.

In this case, the recovery_target_lsn is ahead of the RUNNING_XACT record
generated at step 3. Also, since both bgwriter and slot creation mark the
record as an *UNIMPORTANT* one, bgwriter won't log another snapshot even
after LOG_SNAPSHOT_INTERVAL_MS. The rule is written in BackgroundWriterMain():

```
			/*
			 * Only log if enough time has passed and interesting records have
			 * been inserted since the last snapshot.  Have to compare with <=
			 * instead of < because GetLastImportantRecPtr() points at the
			 * start of a record, whereas last_snapshot_lsn points just past
			 * the end of the record.
			 */
			if (now >= timeout &&
				last_snapshot_lsn <= GetLastImportantRecPtr())
			{
				last_snapshot_lsn = LogStandbySnapshot();
				last_snapshot_ts = now;
			}
```

Therefore, pg_createsubscriber waited until a new record was replicated, but no
activity was recorded, causing a timeout. Since this is a timing issue,
Alexander could reproduce the failure with a shorter interval and parallel runs.

IIUC, the root cause is that pg_create_logical_replication_slot() returns an LSN
which is not generated yet. So, I think both mine [1] and Euler's approach [2]
can solve the issue. My proposal was to add an extra WAL record after the final
slot creation, and Euler's was to use restart_lsn as the recovery_target_lsn.
In the case of the primary server, restart_lsn is set to the latest WAL insert
position and the RUNNING_XACT record is generated later.

How do you think?

[1]: 
https://www.postgresql.org/message-id/osbpr01mb25521b15bf950d2523bbe143f5...@osbpr01mb2552.jpnprd01.prod.outlook.com
[2]: 
https://www.postgresql.org/message-id/b1f0f8c7-8f01-4950-af77-339df3dc4684%40app.fastmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: speed up a logical replica setup

2024-07-15 Thread Amit Kapila
On Fri, Jul 12, 2024 at 4:54 AM Euler Taveira  wrote:
>
> On Thu, Jul 11, 2024, at 2:00 PM, Alexander Lakhin wrote:
>
> May I ask you to look at another failure of the test occurred today [1]?
>
>
> Thanks for the report!
>
> You are observing the same issue that Amit explained in [1]. The
> pg_create_logical_replication_slot returns the EndRecPtr (see
> slot->data.confirmed_flush in DecodingContextFindStartpoint()). EndRecPtr 
> points
> to the next record and it is a future position for an idle server. That's why
> the recovery takes some time to finish because it is waiting for an activity 
> to
> increase the LSN position. Since you modified LOG_SNAPSHOT_INTERVAL_MS to 
> create
> additional WAL records soon, the EndRecPtr position is reached rapidly and the
> recovery ends quickly.
>

If the recovery ends quickly (which is expected due to the reduced
LOG_SNAPSHOT_INTERVAL_MS), then why do we see "error: recovery timed
out"?

> Hayato proposes a patch [2] to create an additional WAL record that has the
> same effect as your little hack: increase the LSN position to allow the
> recovery to finish soon. I don't like the solution although it seems simple
> to implement.
> As Amit said if we know the ReadRecPtr, we could use it as consistent LSN. The
> problem is that it is used by logical decoding but it is not exposed. [reading
> the code...] When the logical replication slot is created, restart_lsn points 
> to
> the lastReplayedEndRecPtr (see ReplicationSlotReserveWal()) that is the last
> record replayed.
>

The last 'lastReplayedEndRecPtr' should be the value of restart_lsn on
standby (when RecoveryInProgress is true) but here we are creating
slots on the publisher/primary, so shouldn't restart_lsn point to
"latest WAL insert pointer"?

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-07-11 Thread Amit Kapila
On Thu, Jul 11, 2024 at 7:04 PM Euler Taveira  wrote:
>
> On Thu, Jul 11, 2024, at 8:22 AM, Zhijie Hou (Fujitsu) wrote:
>
> The analysis and suggestion look reasonable to me.
> Here is a small patch which does the same.
>
>
> LGTM. However, I would add a comment above the INSERT you moved around to 
> remind
> you that a transaction cannot be added there because it will break this test.
>

Pushed after adding a comment.

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-07-11 Thread Euler Taveira
On Thu, Jul 11, 2024, at 2:00 PM, Alexander Lakhin wrote:
> Hello Amit and Hou-San,
> 
> 11.07.2024 13:21, Amit Kapila wrote:
> >
> > We don't wait for the xmin to catch up corresponding to this insert
> > and I don't know if there is a way to do that. So, we should move this
> > Insert to after the call to pg_sync_replication_slots(). It won't
> > impact the general test of pg_createsubscriber.
> >
> > Thanks to Hou-San for helping me in the analysis of this BF failure.
> 
> Thank you for investigating that issue!
> 
> May I ask you to look at another failure of the test occurred today [1]?
> 

Thanks for the report!

You are observing the same issue that Amit explained in [1]. The
pg_create_logical_replication_slot returns the EndRecPtr (see
slot->data.confirmed_flush in DecodingContextFindStartpoint()). EndRecPtr points
to the next record and it is a future position for an idle server. That's why
the recovery takes some time to finish because it is waiting for an activity to
increase the LSN position. Since you modified LOG_SNAPSHOT_INTERVAL_MS to create
additional WAL records soon, the EndRecPtr position is reached rapidly and the
recovery ends quickly.

Hayato proposes a patch [2] to create an additional WAL record that has the same
effect as your little hack: increase the LSN position to allow the recovery to
finish soon. I don't like the solution although it seems simple to implement.
As Amit said if we know the ReadRecPtr, we could use it as consistent LSN. The
problem is that it is used by logical decoding but it is not exposed. [reading
the code...] When the logical replication slot is created, restart_lsn points to
the lastReplayedEndRecPtr (see ReplicationSlotReserveWal()) that is the last
record replayed. Since the replication slots aren't in use, we could use the
restart_lsn from the last replication slot as a consistent LSN.

I'm attaching a patch that implements it. It runs in 6s instead of 26s.


[1] 
https://www.postgresql.org/message-id/CAA4eK1%2Bp%2B7Ag6nqdFRdqowK1EmJ6bG-MtZQ_54dnFBi%3D_OO5RQ%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/OSBPR01MB25521B15BF950D2523BBE143F5D32%40OSBPR01MB2552.jpnprd01.prod.outlook.com


--
Euler Taveira
EDB   https://www.enterprisedb.com/
From a3be65c9bea08d3c7ffd575342a879d37b28b9a8 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Thu, 11 Jul 2024 19:59:56 -0300
Subject: [PATCH] pg_createsubscriber: fix slow recovery

If the primary server is idle when you are running pg_createsubscriber,
it used to take some time during recovery. The reason is that it was
using the LSN returned by pg_create_logical_replication_slot as
recovery_target_lsn. This LSN points to the "end + 1" record, which might not
be available in the WAL yet; hence, the recovery routine waits until some
activity from auxiliary processes writes WAL records and the
recovery_target_lsn position is reached.

Instead, use restart_lsn from the last replication slot. It points to
the last WAL record that was replayed. Hence, the recovery finishes
soon.

create_logical_replication_slot() does not return LSN so change its
signature.

Discussion: https://www.postgresql.org/message-id/2377319.1719766794%40sss.pgh.pa.us
Discussion: https://www.postgresql.org/message-id/68de6498-0449-a113-dd03-e198dded0bac%40gmail.com
---
 src/bin/pg_basebackup/pg_createsubscriber.c | 56 -
 1 file changed, 43 insertions(+), 13 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index 21dd50f8089..5ef3f751a5b 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -86,8 +86,8 @@ static void setup_recovery(const struct LogicalRepInfo *dbinfo, const char *data
 static void drop_primary_replication_slot(struct LogicalRepInfo *dbinfo,
 		  const char *slotname);
 static void drop_failover_replication_slots(struct LogicalRepInfo *dbinfo);
-static char *create_logical_replication_slot(PGconn *conn,
-			 struct LogicalRepInfo *dbinfo);
+static bool create_logical_replication_slot(PGconn *conn,
+			struct LogicalRepInfo *dbinfo);
 static void drop_replication_slot(PGconn *conn, struct LogicalRepInfo *dbinfo,
   const char *slot_name);
 static void pg_ctl_status(const char *pg_ctl_cmd, int rc);
@@ -769,15 +769,47 @@ setup_publisher(struct LogicalRepInfo *dbinfo)
 		create_publication(conn, &dbinfo[i]);
 
 		/* Create replication slot on publisher */
-		if (lsn)
-			pg_free(lsn);
-		lsn = create_logical_replication_slot(conn, &dbinfo[i]);
-		if (lsn != NULL || dry_run)
+		if (create_logical_replication_slot(conn, &dbinfo[i]) || dry_run)
 			pg_log_info("create replication slot \"%s\" on publisher",
 		dbinfo[i].replslotname);
 		else
 			exit(1);
 
+		/*
+		 * Get restart_lsn from the last replication slot. It points to the
+		 * last record replayed. The LSN returned by
+		 * pg_create_logical_replication_slot() should not be used because it
+		 * points to a future 

Re: speed up a logical replica setup

2024-07-11 Thread Alexander Lakhin

Hello Amit and Hou-San,

11.07.2024 13:21, Amit Kapila wrote:


We don't wait for the xmin to catch up corresponding to this insert
and I don't know if there is a way to do that. So, we should move this
Insert to after the call to pg_sync_replication_slots(). It won't
impact the general test of pg_createsubscriber.

Thanks to Hou-San for helping me in the analysis of this BF failure.


Thank you for investigating that issue!

May I ask you to look at another failure of the test occurred today [1]?
The failure log contains:
recovery_target_lsn = '0/30098D0'
pg_createsubscriber: starting the subscriber
...
pg_createsubscriber: server was started
pg_createsubscriber: waiting for the target server to reach the consistent state
...
2024-07-11 07:40:10.837 UTC [2948531][postmaster][:0] LOG:  received fast 
shutdown request

Though what I'm seeing after a successful run is:
recovery_target_lsn = '0/3009860'
pg_createsubscriber: starting the subscriber
...
pg_createsubscriber: server was started
pg_createsubscriber: waiting for the target server to reach the consistent state
...
2024-07-11 15:19:40.733 UTC [207517] 040_pg_createsubscriber.pl LOG:  
statement: SELECT pg_catalog.pg_is_in_recovery()
2024-07-11 15:19:41.635 UTC [207514] LOG:  recovery stopping after WAL location (LSN) 
"0/3009860"
2024-07-11 15:19:41.635 UTC [207514] LOG:  redo done at 0/3009860 system usage: CPU: user: 0.00 s, system: 0.00 s, 
elapsed: 21.00 s


I've managed to reproduce the failure locally. With the bgwriter mod:
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 150

and wal_debug=on, when running 5 test instances in parallel, I get the
failure with the following log:
recovery_target_lsn = '0/3009A20'
pg_createsubscriber: starting the subscriber

2024-07-11 14:40:04.551 UTC [205589:72][startup][33/0:0] LOG:  REDO @ 0/30099E8; LSN 0/3009A20: prev 0/30099B0; xid 0; 
len 24 - Standby/RUNNING_XACTS: nextXid 747 latestCompletedXid 746 oldestRunningXid 747

# ^^^ the last REDO record in the log
...
pg_createsubscriber: server was started
pg_createsubscriber: waiting for the target server to reach the consistent state
...
pg_createsubscriber: server was stopped
pg_createsubscriber: error: recovery timed out
...
[14:43:06.011](181.800s) not ok 30 - run pg_createsubscriber on node S
[14:43:06.012](0.000s)
[14:43:06.012](0.000s) #   Failed test 'run pg_createsubscriber on node S'
#   at t/040_pg_createsubscriber.pl line 356.

$ pg_waldump -p src/bin/pg_basebackup_1/tmp_check/t_040_pg_createsubscriber_node_s_data/pgdata/pg_wal/ 
00010003 00010003 | tail -2
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/030099B0, prev 0/03009948, desc: RUNNING_XACTS 
nextXid 747 latestCompletedXid 746 oldestRunningXid 747
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/030099E8, prev 0/030099B0, desc: RUNNING_XACTS 
nextXid 747 latestCompletedXid 746 oldestRunningXid 747


Whilst
$ pg_waldump -p src/bin/pg_basebackup_1/tmp_check/t_040_pg_createsubscriber_node_p_data/pgdata/pg_wal/ 
00010003 00010003 | tail
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/030099B0, prev 0/03009948, desc: RUNNING_XACTS 
nextXid 747 latestCompletedXid 746 oldestRunningXid 747
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/030099E8, prev 0/030099B0, desc: RUNNING_XACTS 
nextXid 747 latestCompletedXid 746 oldestRunningXid 747
rmgr: Heap2   len (rec/tot): 60/    60, tx:    747, lsn: 0/03009A20, prev 0/030099E8, desc: NEW_CID rel: 
1663/16384/6104, tid: 0/1, cmin: 4294967295, cmax: 0, combo: 4294967295
rmgr: Heap    len (rec/tot): 54/    54, tx:    747, lsn: 0/03009A60, prev 0/03009A20, desc: DELETE xmax: 
747, off: 1, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/16384/6104 blk 0
rmgr: Transaction len (rec/tot): 78/    78, tx:    747, lsn: 0/03009A98, prev 0/03009A60, desc: INVALIDATION ; 
inval msgs: catcache 49 catcache 46 relcache 0
rmgr: Transaction len (rec/tot): 98/    98, tx:    747, lsn: 0/03009AE8, prev 0/03009A98, desc: COMMIT 
2024-07-11 14:43:05.994561 UTC; relcache init file inval dbid 16384 tsid 1663; inval msgs: catcache 49 catcache 46 
relcache 0
rmgr: Heap2   len (rec/tot): 60/    60, tx:    748, lsn: 0/03009B50, prev 0/03009AE8, desc: NEW_CID rel: 
1663/16385/6104, tid: 0/1, cmin: 4294967295, cmax: 0, combo: 4294967295
rmgr: Heap    len (rec/tot): 54/    54, tx:    748, lsn: 0/03009B90, prev 0/03009B50, desc: DELETE xmax: 
748, off: 1, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/16385/6104 blk 0
rmgr: Transaction len (rec/tot): 78/    78, tx:    748, lsn: 0/03009BC8, prev 0/03009B90, desc: INVALIDATION ; 
inval msgs: catcache 49 catcache 46 relcache 0
rmgr: Transaction len (rec/tot): 98/    98, tx:    748, lsn: 0/03009C18, prev 0/03009BC8, desc: COMMIT 
2024-07-11 

Re: speed up a logical replica setup

2024-07-11 Thread Euler Taveira
On Thu, Jul 11, 2024, at 8:22 AM, Zhijie Hou (Fujitsu) wrote:
> The analysis and suggestion look reasonable to me.
> Here is a small patch which does the same.

LGTM. However, I would add a comment above the INSERT you moved around to remind
you that a transaction cannot be added there because it will break this test.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


RE: speed up a logical replica setup

2024-07-11 Thread Zhijie Hou (Fujitsu)
On Thursday, July 11, 2024 6:21 PM Amit Kapila  wrote:
> 
> On Tue, Jul 9, 2024 at 4:30 PM Alexander Lakhin 
> wrote:
> >
> > Please look at another failure of the test [1]:
> > [13:28:05.647](2.460s) not ok 26 - failover slot is synced
> > [13:28:05.648](0.001s) #   Failed test 'failover slot is synced'
> > #   at
> /home/bf/bf-build/skink-master/HEAD/pgsql/src/bin/pg_basebackup/t/04
> 0_pg_createsubscriber.pl line 307.
> > [13:28:05.648](0.000s) #  got: ''
> > # expected: 'failover_slot'
> >
> > with 040_pg_createsubscriber_node_s.log containing:
> > 2024-07-08 13:28:05.369 UTC [3985464][client backend][0/2:0] LOG:
> > statement: SELECT pg_sync_replication_slots()
> > 2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] LOG:
> > could not sync slot "failover_slot" as remote slot precedes local slot
> > 2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] DETAIL:
> > Remote slot has LSN 0/30047B8 and catalog xmin 743, but local slot has LSN
> 0/30047B8 and catalog xmin 744.
> >
> > I could not reproduce it locally, but I've discovered that that
> > subtest somehow depends on pg_createsubscriber executed for the
> > 'primary contains unmet conditions on node P' check. For example with
> > this test modification:
> > @@ -249,7 +249,7 @@ command_fails(
> >   $node_p->connstr($db1), '--socket-directory',
> >   $node_s->host, '--subscriber-port',
> >   $node_s->port, '--database',
> > -$db1, '--database',
> > +'XXX', '--database',
> >   $db2
> >   ],
> >   'primary contains unmet conditions on node P');
> >
> > I see the same failure:
> > 2024-07-09 10:19:43.284 UTC [938890] 040_pg_createsubscriber.pl LOG:
> > statement: SELECT pg_sync_replication_slots()
> > 2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl LOG:
> > could not sync slot "failover_slot" as remote slot precedes local slot
> > 2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl
> > DETAIL:  Remote slot has LSN 0/3004780 and catalog xmin 743, but local
> slot has LSN 0/3004780 and catalog xmin 744.
> >
> > Thus maybe even a normal pg_createsubscriber run can affect the
> > primary server (it's catalog xmin) differently?
> >
> 
> Yes, pg_createsubscriber can affect the primary server's catalog xmin because
> it starts the standby server that can send HSFeedback (See
> XLogWalRcvSendHSFeedback()), which can advance the physical slot's xmin
> corresponding to the following Insert in the test:
> 
> # Insert another row on node P and wait node S to catch up
> $node_p->safe_psql($db1, "INSERT INTO tbl1 VALUES('second row')");
> $node_p->wait_for_replay_catchup($node_s);
> 
> In the success case, pg_createsubscriber is able to send HSFeedback and in
> the failure case, it won't. We can see the following logs in
> 040_pg_createsubscriber_node_p.log:
> 
> 2024-07-08 13:28:00.872 UTC [3982331][walsender][:0] FATAL:  the database
> system is starting up
> 2024-07-08 13:28:00.875 UTC [3982328][startup][:0] LOG:  database system
> was shut down at 2024-07-08 13:28:00 UTC
> 2024-07-08 13:28:01.105 UTC [3981996][postmaster][:0] LOG:  database
> system is ready to accept connections
> 
> This shows that when the test  'primary contains unmet conditions on node P'
> starts the standby server the corresponding primary node was not ready
> because we just restarted node_p before that test and didn't ensure that the
> node_p is up and ready to accept connections before starting the
> pg_createsubscriber test.
> 
> Even in the successful cases where the standby is able to connect to primary
> for test 'primary contains unmet conditions on node P', there is no guarantee
> that xmin of the physical slot will be updated at least, we don't have 
> anything in
> the test to ensure the same.
> 
> Now as before creating logical replication, we didn't ensure that the physical
> slot's xmin has been caught up to the latest value, the test can lead to 
> failure
> like: "Remote slot has LSN 0/3004780 and catalog xmin 743, but local slot has
> LSN 0/3004780 and catalog xmin 744".
> 
> The xmin on standby could have been advanced due to the following Insert:
> # Insert another row on node P and wait node S to catch up
> $node_p->safe_psql($db1, "INSERT INTO tbl1 VALUES('second row')");
> $node_p->wait_for_replay_catchup($node_s);
> 
> We don't wait for the xmin to catch up corresponding to this insert and I 
> don't
> know if there is a way to do that. So, we should move this Insert to after 
> the call
> to pg_sync_replication_slots(). It won't impact the general test of
> pg_createsubscriber.

The analysis and suggestion look reasonable to me.
Here is a small patch which does the same.


> Thanks to Hou-San for helping me in the analysis of this BF failure.

Best Regards,
Hou zj


0001-fix-unstable-test-in-040_pg_createsubscriber.patch
Description: 0001-fix-unstable-test-in-040_pg_createsubscriber.patch


Re: speed up a logical replica setup

2024-07-11 Thread Amit Kapila
On Wed, Jul 10, 2024 at 4:51 PM Euler Taveira  wrote:
>
> On Tue, Jul 9, 2024, at 8:00 AM, Alexander Lakhin wrote:
>
> Hello Amit and Kuroda-san,
>
> 03.07.2024 14:02, Amit Kapila wrote:
> > Pushed 0002 and 0003. Let's wait for a discussion on 0001.
>
> Please look at another failure of the test [1]:
> [13:28:05.647](2.460s) not ok 26 - failover slot is synced
> [13:28:05.648](0.001s) #   Failed test 'failover slot is synced'
> #   at 
> /home/bf/bf-build/skink-master/HEAD/pgsql/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
>  line 307.
> [13:28:05.648](0.000s) #  got: ''
> # expected: 'failover_slot'
>
>
> I'm wondering if the attached patch is sufficient to move the restart_lsn
> forward.
>

This won't guarantee that xmin of the logical slot on the primary has
moved ahead or at least we haven't ensured the same. So, we should try
not to perform statements that can increase the xmin on standby before
creating a logical slot. See the analysis sent in the previous email.

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-07-11 Thread Amit Kapila
On Tue, Jul 9, 2024 at 4:30 PM Alexander Lakhin  wrote:
>
> Please look at another failure of the test [1]:
> [13:28:05.647](2.460s) not ok 26 - failover slot is synced
> [13:28:05.648](0.001s) #   Failed test 'failover slot is synced'
> #   at 
> /home/bf/bf-build/skink-master/HEAD/pgsql/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
>  line 307.
> [13:28:05.648](0.000s) #  got: ''
> # expected: 'failover_slot'
>
> with 040_pg_createsubscriber_node_s.log containing:
> 2024-07-08 13:28:05.369 UTC [3985464][client backend][0/2:0] LOG: statement: 
> SELECT pg_sync_replication_slots()
> 2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] LOG: could not 
> sync slot "failover_slot" as remote slot
> precedes local slot
> 2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] DETAIL:  Remote 
> slot has LSN 0/30047B8 and catalog xmin
> 743, but local slot has LSN 0/30047B8 and catalog xmin 744.
>
> I could not reproduce it locally, but I've discovered that that subtest
> somehow depends on pg_createsubscriber executed for the
> 'primary contains unmet conditions on node P' check. For example with this
> test modification:
> @@ -249,7 +249,7 @@ command_fails(
>   $node_p->connstr($db1), '--socket-directory',
>   $node_s->host, '--subscriber-port',
>   $node_s->port, '--database',
> -$db1, '--database',
> +'XXX', '--database',
>   $db2
>   ],
>   'primary contains unmet conditions on node P');
>
> I see the same failure:
> 2024-07-09 10:19:43.284 UTC [938890] 040_pg_createsubscriber.pl LOG:  
> statement: SELECT pg_sync_replication_slots()
> 2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl LOG:  could 
> not sync slot "failover_slot" as remote slot
> precedes local slot
> 2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl DETAIL:  
> Remote slot has LSN 0/3004780 and catalog xmin
> 743, but local slot has LSN 0/3004780 and catalog xmin 744.
>
> Thus maybe even a normal pg_createsubscriber run can affect the primary
> server (its catalog xmin) differently?
>

Yes, pg_createsubscriber can affect the primary server's catalog xmin
because it starts the standby server that can send HSFeedback (See
XLogWalRcvSendHSFeedback()), which can advance the physical slot's
xmin corresponding to the following Insert in the test:

# Insert another row on node P and wait node S to catch up
$node_p->safe_psql($db1, "INSERT INTO tbl1 VALUES('second row')");
$node_p->wait_for_replay_catchup($node_s);

In the success case, pg_createsubscriber is able to send HSFeedback
and in the failure case, it won't. We can see the following logs in
040_pg_createsubscriber_node_p.log:

2024-07-08 13:28:00.872 UTC [3982331][walsender][:0] FATAL:  the
database system is starting up
2024-07-08 13:28:00.875 UTC [3982328][startup][:0] LOG:  database
system was shut down at 2024-07-08 13:28:00 UTC
2024-07-08 13:28:01.105 UTC [3981996][postmaster][:0] LOG:  database
system is ready to accept connections

This shows that when the test 'primary contains unmet conditions on
node P' starts the standby server, the corresponding primary node was
not ready: we had just restarted node_p before that test and didn't
ensure that node_p was up and ready to accept connections before
starting the pg_createsubscriber test.

Even in the successful cases where the standby is able to connect to
the primary for the test 'primary contains unmet conditions on node P',
there is no guarantee that the xmin of the physical slot will be
updated; at least, we don't have anything in the test to ensure that.

Now, since we didn't ensure before creating the logical replication
slot that the physical slot's xmin had caught up to the latest value,
the test can fail like: "Remote slot has LSN 0/3004780 and catalog
xmin 743, but local slot has LSN 0/3004780 and catalog xmin 744".

The xmin on standby could have been advanced due to the following Insert:
# Insert another row on node P and wait node S to catch up
$node_p->safe_psql($db1, "INSERT INTO tbl1 VALUES('second row')");
$node_p->wait_for_replay_catchup($node_s);

We don't wait for the xmin to catch up for this insert, and I don't
know if there is a way to do that. So, we should move this
Insert to after the call to pg_sync_replication_slots(). It won't
impact the general test of pg_createsubscriber.
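
A minimal sketch of the suggested reordering (illustrative only; the table,
node, and database names are the ones the test already uses):

```
# Sync the failover slot first, while the physical slot's xmin on node P has
# not yet been advanced by extra activity coming from node S.
$node_s->safe_psql('postgres', 'SELECT pg_sync_replication_slots()');

# Only afterwards generate more changes and wait for node S to catch up.
$node_p->safe_psql($db1, "INSERT INTO tbl1 VALUES('second row')");
$node_p->wait_for_replay_catchup($node_s);
```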

Thanks to Hou-San for helping me in the analysis of this BF failure.

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-07-10 Thread Euler Taveira
On Tue, Jul 9, 2024, at 8:00 AM, Alexander Lakhin wrote:
> Hello Amit and Kuroda-san,
> 
> 03.07.2024 14:02, Amit Kapila wrote:
> > Pushed 0002 and 0003. Let's wait for a discussion on 0001.
> 
> Please look at another failure of the test [1]:
> [13:28:05.647](2.460s) not ok 26 - failover slot is synced
> [13:28:05.648](0.001s) #   Failed test 'failover slot is synced'
> #   at 
> /home/bf/bf-build/skink-master/HEAD/pgsql/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
>  line 307.
> [13:28:05.648](0.000s) #  got: ''
> # expected: 'failover_slot'

I'm wondering if the attached patch is sufficient to move the restart_lsn
forward. I experimented with several lightweight ideas but none worked. BTW the
steps to create the failover slot here are similar to
040_standby_failover_slots_sync.pl. I don't have a clue why it is failing for
this one.


--
Euler Taveira
EDB   https://www.enterprisedb.com/
diff --git a/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl b/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
index 74b90d9a913..232dbbbc55e 100644
--- a/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
+++ b/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
@@ -297,6 +297,17 @@ $node_p->safe_psql($db1,
 	"SELECT pg_create_logical_replication_slot('$fslotname', 'pgoutput', false, false, true)"
 );
 $node_s->start;
+$node_p->safe_psql($db1, qq(
+	SELECT pg_logical_emit_message(true, 'a', '1');
+	CHECKPOINT;
+));
+$node_p->safe_psql($db1, qq(
+	SELECT *
+	FROM pg_logical_slot_get_binary_changes('$fslotname', NULL, NULL,
+		'proto_version', '1',
+		'publication_names', 'dummy',
+		'messages', 'true');
+));
 # Wait for the standby to catch up so that the standby is not lagging behind
 # the failover slot.
 $node_p->wait_for_replay_catchup($node_s);


Re: speed up a logical replica setup

2024-07-09 Thread Alexander Lakhin

Hello Amit and Kuroda-san,

03.07.2024 14:02, Amit Kapila wrote:

Pushed 0002 and 0003. Let's wait for a discussion on 0001.


Please look at another failure of the test [1]:
[13:28:05.647](2.460s) not ok 26 - failover slot is synced
[13:28:05.648](0.001s) #   Failed test 'failover slot is synced'
#   at 
/home/bf/bf-build/skink-master/HEAD/pgsql/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
 line 307.
[13:28:05.648](0.000s) #  got: ''
# expected: 'failover_slot'

with 040_pg_createsubscriber_node_s.log containing:
2024-07-08 13:28:05.369 UTC [3985464][client backend][0/2:0] LOG: statement: 
SELECT pg_sync_replication_slots()
2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] LOG: could not sync slot "failover_slot" as remote slot 
precedes local slot
2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] DETAIL:  Remote slot has LSN 0/30047B8 and catalog xmin 
743, but local slot has LSN 0/30047B8 and catalog xmin 744.


I could not reproduce it locally, but I've discovered that that subtest
somehow depends on pg_createsubscriber executed for the
'primary contains unmet conditions on node P' check. For example with this
test modification:
@@ -249,7 +249,7 @@ command_fails(
 $node_p->connstr($db1), '--socket-directory',
 $node_s->host, '--subscriber-port',
 $node_s->port, '--database',
-    $db1, '--database',
+    'XXX', '--database',
 $db2
 ],
 'primary contains unmet conditions on node P');

I see the same failure:
2024-07-09 10:19:43.284 UTC [938890] 040_pg_createsubscriber.pl LOG:  
statement: SELECT pg_sync_replication_slots()
2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl LOG:  could not sync slot "failover_slot" as remote slot 
precedes local slot
2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl DETAIL:  Remote slot has LSN 0/3004780 and catalog xmin 
743, but local slot has LSN 0/3004780 and catalog xmin 744.


Thus maybe even a normal pg_createsubscriber run can affect the primary
server (its catalog xmin) differently?

One difference I found in the logs, is that the skink failure's
regress_log_040_pg_createsubscriber contains:
pg_createsubscriber: error: publisher requires 2 wal sender processes, but only 
1 remain

Though for a successful run I see locally (I can't find logs of
successful test runs on skink):
pg_createsubscriber: error: publisher requires 2 wal sender processes, but only 
0 remain

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink=2024-07-08%2013%3A16%3A35

Best regards,
Alexander




Re: speed up a logical replica setup

2024-07-03 Thread Amit Kapila
On Wed, Jul 3, 2024 at 12:42 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> >
> > Disabling on the primary node should be sufficient. Let's do the
> > minimum required to stabilize this test.
>
> +1, removed.
>
> PSA new version. 0001 has not been changed yet. A comment was added
> in 0002 to clarify why we must wait. For 0003, a comment was added and
> setting for standby was reverted.
>

Pushed 0002 and 0003. Let's wait for a discussion on 0001.

-- 
With Regards,
Amit Kapila.




RE: speed up a logical replica setup

2024-07-03 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> Your analysis looks correct to me. The test could fail due to
> autovacuum. See the following comment in
> 040_standby_failover_slots_sync.
> 
> # Disable autovacuum to avoid generating xid during stats update as otherwise
> # the new XID could then be replicated to standby at some random point making
> # slots at primary lag behind standby during slot sync.
> $publisher->append_conf('postgresql.conf', 'autovacuum = off');
>

Oh, I could not find the comment. I felt it should be added to
040_pg_createsubscriber.pl as well. Done.

> > # Descriptions for attached files
> >
> > An attached script can be used to reproduce the first failure without
> pg_createsubscriber.
> > It requires to modify the code like [1].
> 
> > 0003 patch disables autovacuum for node_p and node_s. I think node_p is
> enough, but did
> > like that just in case. This fixes a second failure.
> >
> 
> Disabling on the primary node should be sufficient. Let's do the
> minimum required to stabilize this test.

+1, removed.

PSA new version. 0001 has not been changed yet. A comment was added
in 0002 to clarify why we must wait. For 0003, a comment was added and
setting for standby was reverted.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v3-0001-emit-dummy-message-while-setting-up-the-publisher.patch
Description:  v3-0001-emit-dummy-message-while-setting-up-the-publisher.patch


v3-0002-wait-until-RUNNING_XACT-is-replicated.patch
Description: v3-0002-wait-until-RUNNING_XACT-is-replicated.patch


v3-0003-disable-autovacuum-while-testing.patch
Description: v3-0003-disable-autovacuum-while-testing.patch


Re: speed up a logical replica setup

2024-07-03 Thread Amit Kapila
On Wed, Jul 3, 2024 at 10:42 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Based on that, I considered a scenario why the slot could not be synchronized.
> I felt this was not caused by the pg_createsubscriber.
>
> 1. At initial stage, the xmin of the physical slot is 743, and nextXid of the
>primary is also 743.
> 2. Autovacuum worker starts a new transaction. nextXid is incremented to 744.
> 3. Tries to create a logical replication slot with failover=true *before the
>transaction at step2 is replicated to the standby*.
> 4. While creating the slot, the catalog_xmin must be determined.
>The initial candidate is nextXid (= 744), but the oldest xmin of 
> replication
>slots (=743) is used if it is older than nextXid. So 743 is chosen in this 
> case.
>This operation is done in 
> CreateInitDecodingContext()->GetOldestSafeDecodingContext().
> 5. After that, the transaction at step2 reaches the standby node and it
>updates the nextXid.
> 6. Finally runs pg_sync_replication_slots() on the standby. It finds a 
> failover
>slot on the primary and tries to create on the standby. However, the
>catalog_xmin on the primary (743) is older than the nextXid of the standby 
> (744)
>so that it skips to create a slot.
>
> To avoid the issue, we can disable the autovacuuming while testing.
>

Your analysis looks correct to me. The test could fail due to
autovacuum. See the following comment in
040_standby_failover_slots_sync.

# Disable autovacuum to avoid generating xid during stats update as otherwise
# the new XID could then be replicated to standby at some random point making
# slots at primary lag behind standby during slot sync.
$publisher->append_conf('postgresql.conf', 'autovacuum = off');

> # Descriptions for attached files
>
> An attached script can be used to reproduce the first failure without 
> pg_createsubscriber.
> It requires to modify the code like [1].

> 0003 patch disables autovacuum for node_p and node_s. I think node_p is 
> enough, but did
> like that just in case. This fixes a second failure.
>

Disabling on the primary node should be sufficient. Let's do the
minimum required to stabilize this test.

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-07-03 Thread Amit Kapila
On Wed, Jul 3, 2024 at 11:27 AM Amit Kapila  wrote:
>
> > Do you have any other idea?
> >
>
> The other idea could be that we use the minimum restart_lsn of all the
> slots created by this tool as a consistent_lsn. We can probably get
> that value by using pg_get_replication_slots() but this idea needs
> further evaluation as to whether it will lead to a consistent
> subscriber.
>

This may not work because when the confirmed_flush LSN of any slot is
ahead of the consistent_lsn value we derived above, we could miss
receiving some transactions. So, I don't have any better ideas than
what I mentioned yesterday.

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-07-02 Thread Amit Kapila
On Wed, Jul 3, 2024 at 9:21 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Amit,
>
> > IIUC, the problem is that the consistent_lsn value returned by
> > setup_publisher() is the "end +1" location of the required LSN whereas
> > the recovery_target_lsn used in wait_for_end_recovery() expects the
> > LSN value to be "start" location of required LSN.
>
> Yeah, right. It is same as my understanding.
>
> > This sounds like an ugly hack to me and don't know if we can use it.
>
> I also think it is hacky, but I could not find better solutions.
>
> > The ideal way to fix this is to get the start_lsn from the
> > create_logical_slot functionality or have some parameter like
> > recover_target_end_lsn but I don't know if this is a good time to
> > extend such a functionality.
>
> I felt that such approach might be used for HEAD, but not suitable for PG17.
> Alternative approach I came up with was to insert a tuple while waiting the
> promotion. It can generate a WAL record so that standby can finish after the
> application. But I'm not sure how do we do and it seems to lead an additional
> timing issue. Also, this does not improve the behavior of the command - normal
> user may have to wait some time by the command.
>

BTW, I think the time required by the standby to reach a consistent
state after startup is unpredictable anyway. For example, in real-world
scenarios, between the time we stopped the standby and restarted it,
there could be many transactions on the primary that need to be
replicated before we reach recovery_target_lsn.

I don't think adding additional (dummy) WAL records is a good solution,
but it is better to hear from others.

> Do you have any other idea?
>

The other idea could be that we use the minimum restart_lsn of all the
slots created by this tool as a consistent_lsn. We can probably get
that value by using pg_get_replication_slots() but this idea needs
further evaluation as to whether it will lead to a consistent
subscriber.
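
A sketch of what that would look like (the slot names here are purely
illustrative, not what the tool generates):

```
# Candidate consistent LSN: the oldest restart_lsn among the logical slots the
# tool just created on the primary.
my $candidate_lsn = $node_p->safe_psql('postgres', qq(
	SELECT min(restart_lsn)
	FROM pg_replication_slots
	WHERE slot_name IN ('sub1_slot', 'sub2_slot')));
```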

-- 
With Regards,
Amit Kapila.




RE: speed up a logical replica setup

2024-07-02 Thread Hayato Kuroda (Fujitsu)
Dear Alexander and other hackers,

> As a recent buildfarm failure [1] shows, that test addition introduced
> new instability:
> ### Starting node "node_s"
> # Running: pg_ctl -w -D
> /home/bf/bf-build/piculet/HEAD/pgsql.build/testrun/pg_basebackup/040_pg_c
> reatesubscriber/data/t_040_pg_createsubscriber_node_s_data/pgdata
> -l
> /home/bf/bf-build/piculet/HEAD/pgsql.build/testrun/pg_basebackup/040_pg_c
> reatesubscriber/log/040_pg_createsubscriber_node_s.log
> -o --cluster-name=node_s start
> waiting for server to start done
> server started
> # Postmaster PID for node "node_s" is 416482
> error running SQL: 'psql::1: ERROR:  skipping slot synchronization as 
> the
> received slot sync LSN 0/30047F0 for
> slot "failover_slot" is ahead of the standby position 0/3004708'
> while running 'psql -XAtq -d port=51506 host=/tmp/pqWohdD5Qj
> dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'SELECT
> pg_sync_replication_slots()' at
> /home/bf/bf-build/piculet/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster
> .pm line 2126.
> 
> I could reproduce this failure with:
> --- a/src/backend/replication/walreceiver.c
> +++ b/src/backend/replication/walreceiver.c
> @@ -517,6 +517,7 @@ WalReceiverMain(char *startup_data, size_t
> startup_data_len)
>* let the startup process and primary server know
> about
>* them.
>*/
> +pg_usleep(30);
>   XLogWalRcvFlush(false, startpointTLI);
> 
> make -s check -C src/bin/pg_basebackup/ PROVE_TESTS="t/040*"
> 
> # +++ tap check in src/bin/pg_basebackup +++
> t/040_pg_createsubscriber.pl .. 22/? # Tests were run but no plan was declared
> and done_testing() was not seen.
> # Looks like your test exited with 29 just after 23.
> t/040_pg_createsubscriber.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
> All 23 subtests passed
> 
> Test Summary Report
> ---
> t/040_pg_createsubscriber.pl (Wstat: 7424 Tests: 23 Failed: 0)
>Non-zero exit status: 29
>Parse errors: No plan found in TAP output
> Files=1, Tests=23,  4 wallclock secs ( 0.01 usr  0.01 sys +  0.49 cusr  0.44
> csys =  0.95 CPU)

I thought it was not related to the pg_createsubscriber feature itself; it was
related to the slot synchronization functionality. To begin with, the error
message was output with the below backtrace.

```
synchronize_one_slot
synchronize_slots
pg_sync_replication_slots
```

And I found the error could be reproduced by the attached script with the source
modification by Alexander [1]. According to my analysis, this could be caused 
when
1) a replication slot is created with failover = true and XLOG_RUNNING_XACT 
record is generated, and
2) pg_sync_replication_slots() is called *before* the record is flushed on the
standby. To stabilize the test, we can call wait_for_replay_catchup() after the 
slot
creation on the primary.
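
A rough sketch of that ordering (the slot name and database variables are the
ones the test already defines):

```
# Create the failover slot on the primary ...
$node_p->safe_psql($db1,
	"SELECT pg_create_logical_replication_slot('$fslotname', 'pgoutput', false, false, true)");

# ... and make sure the standby has replayed everything up to the slot
# creation, including its XLOG_RUNNING_XACTS record, before syncing slots.
$node_p->wait_for_replay_catchup($node_s);
$node_s->safe_psql('postgres', 'SELECT pg_sync_replication_slots()');
```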

> Moreover, this test may suffer from autovacuum:
> echo "
> autovacuum_naptime = 1
> autovacuum_analyze_threshold = 1
> " > /tmp/temp.config
> TEMP_CONFIG=/tmp/temp.config make -s check -C src/bin/pg_basebackup/
> PROVE_TESTS="t/040*"
> 
> # +++ tap check in src/bin/pg_basebackup +++
> t/040_pg_createsubscriber.pl .. 24/?
> #   Failed test 'failover slot is synced'
> #   at t/040_pg_createsubscriber.pl line 273.
> #  got: ''
> # expected: 'failover_slot'
> t/040_pg_createsubscriber.pl .. 28/? # Looks like you failed 1 test of 33.
> t/040_pg_createsubscriber.pl .. Dubious, test returned 1 (wstat 256, 0x100)
> Failed 1/33 subtests

Thanks for reporting. I could reproduce the failure, and some BF animals
reported it as well. According to their logs, the root cause seemed to be that
slot synchronization failed.

```
LOG:  statement: SELECT pg_sync_replication_slots()
LOG:  could not sync slot "failover_slot" as remote slot precedes local slot
DETAIL:  Remote slot has LSN 0/3006890 and catalog xmin 743, but local slot has 
LSN 0/3006890 and catalog xmin 744.
```

Based on that, I considered a scenario explaining why the slot could not be
synchronized. I felt this was not caused by pg_createsubscriber.

1. At initial stage, the xmin of the physical slot is 743, and nextXid of the
   primary is also 743.
2. Autovacuum worker starts a new transaction. nextXid is incremented to 744.
3. Tries to create a logical replication slot with failover=true *before the
   transaction at step2 is replicated to the standby*.
4. While creating the slot, the catalog_xmin must be determined.
   The initial candidate is nextXid (= 744), but the oldest xmin of replication
   slots (=743) is used if it is older than nextXid. So 743 is chosen in this 
case.
   This operation is done in 
CreateInitDecodingContext()->GetOldestSafeDecodingContext().
5. After that, the transaction at step2 reaches the standby node and it
   updates the nextXid.
6. Finally runs pg_sync_replication_slots() on the standby. It finds a failover
   slot on the primary and tries to create it on the standby. However, the
   catalog_xmin on the primary (743) is older than the nextXid of the standby (744),
   so it skips creating the slot.

To avoid the issue, we can disable autovacuum while testing.

RE: speed up a logical replica setup

2024-07-02 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> IIUC, the problem is that the consistent_lsn value returned by
> setup_publisher() is the "end +1" location of the required LSN whereas
> the recovery_target_lsn used in wait_for_end_recovery() expects the
> LSN value to be "start" location of required LSN.

Yeah, right. It is the same as my understanding.

> This sounds like an ugly hack to me and don't know if we can use it.

I also think it is hacky, but I could not find better solutions.

> The ideal way to fix this is to get the start_lsn from the
> create_logical_slot functionality or have some parameter like
> recover_target_end_lsn but I don't know if this is a good time to
> extend such a functionality.

I felt that such an approach might be used for HEAD, but it is not suitable for
PG17. An alternative approach I came up with was to insert a tuple while waiting
for the promotion. It would generate a WAL record so that the standby can finish
after applying it. But I'm not sure how we would do that, and it seems to lead to
an additional timing issue. Also, this does not improve the behavior of the
command - a normal user may still have to wait some time for the command.

Do you have any other idea?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: speed up a logical replica setup

2024-07-02 Thread Amit Kapila
On Mon, Jul 1, 2024 at 8:22 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> > I have a different but possibly-related complaint: why is
> > 040_pg_createsubscriber.pl so miserably slow?  On my machine it
> > runs for a bit over 19 seconds, which seems completely out of line
> > (for comparison, 010_pg_basebackup.pl takes 6 seconds, and the
> > other test scripts in this directory take much less).  It looks
> > like most of the blame falls on this step:
> >
> > [12:47:22.292](14.534s) ok 28 - run pg_createsubscriber on node S
> >
> > AFAICS the amount of data being replicated is completely trivial,
> > so that it doesn't make any sense for this to take so long --- and
> > if it does, that suggests that this tool will be impossibly slow
> > for production use.  But I suspect there is a logic flaw causing
> > this.
>
> I analyzed the issue. My elog() debugging said that wait_for_end_recovery() 
> was
> wasted some time. This was caused by the recovery target seeming 
> unsatisfactory.
>
> We are setting recovery_target_lsn by the return value of 
> pg_create_logical_replication_slot(),
> which returns the end of the RUNNING_XACT record. If we use the returned 
> value as
> recovery_target_lsn as-is, however, we must wait for additional WAL generation
> because the parameter requires that the replicated WAL overtake a certain 
> point.
> On my env, the function waited until the bgwriter emitted the 
> XLOG_RUNNING_XACTS record.
>

IIUC, the problem is that the consistent_lsn value returned by
setup_publisher() is the "end +1" location of the required LSN whereas
the recovery_target_lsn used in wait_for_end_recovery() expects the
LSN value to be "start" location of required LSN.

> One simple solution is to add an additional WAL record at the end of the 
> publisher
> setup. IIUC, an arbitrary WAL insertion can reduce the waiting time. The 
> attached
> patch inserts a small XLOG_LOGICAL_MESSAGE record, which could reduce much 
> execution
> time on my environment.
>

This sounds like an ugly hack to me and I don't know if we can use it.
The ideal way to fix this is to get the start_lsn from the
create_logical_slot functionality or have some parameter like
recover_target_end_lsn but I don't know if this is a good time to
extend such a functionality.
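
For context, what the standby ends up waiting on is just the recovery target
written by pg_createsubscriber; a simplified sketch ($consistent_lsn stands for
whatever setup_publisher() returned):

```
# wait_for_end_recovery() cannot return until replay on node S passes this LSN,
# so an "end + 1" value can leave it idling until unrelated WAL shows up.
$node_s->append_conf('postgresql.conf',
	"recovery_target_lsn = '$consistent_lsn'");
```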

-- 
With Regards,
Amit Kapila.




RE: speed up a logical replica setup

2024-07-01 Thread Hayato Kuroda (Fujitsu)
Dear Tom,

> I have a different but possibly-related complaint: why is
> 040_pg_createsubscriber.pl so miserably slow?  On my machine it
> runs for a bit over 19 seconds, which seems completely out of line
> (for comparison, 010_pg_basebackup.pl takes 6 seconds, and the
> other test scripts in this directory take much less).  It looks
> like most of the blame falls on this step:
> 
> [12:47:22.292](14.534s) ok 28 - run pg_createsubscriber on node S
> 
> AFAICS the amount of data being replicated is completely trivial,
> so that it doesn't make any sense for this to take so long --- and
> if it does, that suggests that this tool will be impossibly slow
> for production use.  But I suspect there is a logic flaw causing
> this.

I analyzed the issue. My elog() debugging showed that wait_for_end_recovery()
wasted some time. This was caused by the recovery target not being satisfied
promptly.

We are setting recovery_target_lsn by the return value of 
pg_create_logical_replication_slot(),
which returns the end of the RUNNING_XACT record. If we use the returned value 
as
recovery_target_lsn as-is, however, we must wait for additional WAL generation
because the parameter requires that the replicated WAL overtake a certain point.
On my env, the function waited until the bgwriter emitted the 
XLOG_RUNNING_XACTS record.

One simple solution is to add an additional WAL record at the end of the
publisher setup. IIUC, an arbitrary WAL insertion can reduce the waiting time.
The attached patch inserts a small XLOG_LOGICAL_MESSAGE record, which greatly
reduced the execution time in my environment.

```
BEFORE
(13.751s) ok 30 - run pg_createsubscriber on node S
AFTER
(0.749s) ok 30 - run pg_createsubscriber on node S
```
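
For reference, the idea expressed from the test side looks roughly like this
(an assumption about the approach, not the attached diff, which changes
pg_createsubscriber itself):

```
# Any small WAL record emitted on the primary after the slots are created lets
# the standby reach recovery_target_lsn without waiting for the bgwriter's next
# XLOG_RUNNING_XACTS record.
$node_p->safe_psql('postgres',
	"SELECT pg_logical_emit_message(false, 'pg_createsubscriber', 'dummy')");
```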

However, even after the modification, the reported failure [1] could not be
resolved in my environment.

What do you think?

[1]: 
https://www.postgresql.org/message-id/0dffca12-bf17-4a7a-334d-225569de5e6e%40gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


emit_dummy_message.diff
Description: emit_dummy_message.diff


Re: speed up a logical replica setup

2024-06-30 Thread Tom Lane
Noah Misch  writes:
> On Sun, Jun 30, 2024 at 09:32:57PM -0400, Tom Lane wrote:
>> Hmph ... according to [1], 545082091 was not enough to fix this.
>> I guess that old version of IPC::Run also misbehaves for cases
>> involving backslash, single quote, and/or some other special
>> characters?

> The database name has backslash toward the end, which likely creates a
> problematic backslash-double-quote sequence under win32 command line rules.

OK, I've made it strip backslashes too.  Thanks for confirming.

regards, tom lane




Re: speed up a logical replica setup

2024-06-30 Thread Noah Misch
On Sun, Jun 30, 2024 at 09:32:57PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Sun, Jun 30, 2024 at 05:01:52PM -0400, Tom Lane wrote:
> >> I'm tempted to lobotomize the new test case on Windows until
> >> we have that resolved.
> 
> > Sounds fine.  The pg_upgrade suite adequately tests appendShellString() and
> > appendConnStrVal() with the larger character repertoire, so it's enough for
> > pg_createsubscriber to test just a space or so.
> 
> Hmph ... according to [1], 545082091 was not enough to fix this.
> I guess that old version of IPC::Run also misbehaves for cases
> involving backslash, single quote, and/or some other special
> characters?

The database name has backslash toward the end, which likely creates a
problematic backslash-double-quote sequence under win32 command line rules.
You can see that in the node log, comparing the CREATE DATABASE log line
(shows dquote at end of dbname) to the FATAL log line (no dquote at end).  I'm
not aware of trouble with characters other than backslash or dquote.

> [1] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2024-07-01%2000%3A03%3A05




Re: speed up a logical replica setup

2024-06-30 Thread Tom Lane
Noah Misch  writes:
> On Sun, Jun 30, 2024 at 05:01:52PM -0400, Tom Lane wrote:
>> I'm tempted to lobotomize the new test case on Windows until
>> we have that resolved.

> Sounds fine.  The pg_upgrade suite adequately tests appendShellString() and
> appendConnStrVal() with the larger character repertoire, so it's enough for
> pg_createsubscriber to test just a space or so.

Hmph ... according to [1], 545082091 was not enough to fix this.
I guess that old version of IPC::Run also misbehaves for cases
involving backslash, single quote, and/or some other special
characters?

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2024-07-01%2000%3A03%3A05




Re: speed up a logical replica setup

2024-06-30 Thread Noah Misch
On Sun, Jun 30, 2024 at 05:01:52PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Sun, Jun 30, 2024 at 02:58:00PM -0400, Tom Lane wrote:
> >> So b3f5ccebd promptly blew up on fairywren [1]:
> 
> > It does look consistent with IPC::Run predating v20220807.0, hence lacking 
> > the
> > https://github.com/toddr/IPC-Run/issues/142 fix.  I wondered how this animal
> > was passing 002_pg_upgrade.pl, but it seems the animal has run TAP suites 
> > only
> > occasionally.  Passes in the last week were TAP-free.
> 
> Hmm, drongo just failed in the same way.  (It seems to likewise
> run TAP tests only some of the time, which I don't understand
> because the $Script_Config output looks the same between runs
> with TAP tests and runs without.)

On further reflection, 002_pg_upgrade.pl may not fail with old IPC::Run.  It
may just create a database with an unintended name, losing some test coverage.

> I'm tempted to lobotomize the new test case on Windows until
> we have that resolved.

Sounds fine.  The pg_upgrade suite adequately tests appendShellString() and
appendConnStrVal() with the larger character repertoire, so it's enough for
pg_createsubscriber to test just a space or so.
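
A sketch of that reduction (illustrative names; the committed test derives its
database names differently):

```
# On Windows, keep the special-character coverage to just an embedded space;
# the backslash/double-quote cases remain covered by the pg_upgrade suite.
my $db1 = 'regression db1';
$node_p->safe_psql('postgres', qq(CREATE DATABASE "$db1"));
```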




Re: speed up a logical replica setup

2024-06-30 Thread Tom Lane
Noah Misch  writes:
> On Sun, Jun 30, 2024 at 02:58:00PM -0400, Tom Lane wrote:
>> So b3f5ccebd promptly blew up on fairywren [1]:

> It does look consistent with IPC::Run predating v20220807.0, hence lacking the
> https://github.com/toddr/IPC-Run/issues/142 fix.  I wondered how this animal
> was passing 002_pg_upgrade.pl, but it seems the animal has run TAP suites only
> occasionally.  Passes in the last week were TAP-free.

Hmm, drongo just failed in the same way.  (It seems to likewise
run TAP tests only some of the time, which I don't understand
because the $Script_Config output looks the same between runs
with TAP tests and runs without.)

I'm tempted to lobotomize the new test case on Windows until
we have that resolved.

regards, tom lane




Re: speed up a logical replica setup

2024-06-30 Thread Noah Misch
On Sun, Jun 30, 2024 at 02:58:00PM -0400, Tom Lane wrote:
> I wrote:
> > In hopes of moving things along as we approach the v18 branch,
> > I went ahead and pushed Kuroda-san's patches (with a bit of
> > further editorialization).
> 
> So b3f5ccebd promptly blew up on fairywren [1]:
> 
> connection error: 'psql: error: unterminated quoted string in connection info 
> string'
> while running 'psql -XAtq -d port=52984 host=C:/tools/nmsys64/tmp/hHg_pngw4z 
> dbname='regression"    
> !"#$%&\\'()*+,-"' -f - -v ON_ERROR_STOP=1' at 
> C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
>  line 2124.
> 
> This shows that the command-line arguments to psql are not getting
> adequately quoted on that platform.  AFAICS, we are relying on
> IPC::Run::run to perform such quoting, and it seems to be
> falling down here.  I wonder what version of IPC::Run fairywren
> is using.

It does look consistent with IPC::Run predating v20220807.0, hence lacking the
https://github.com/toddr/IPC-Run/issues/142 fix.  I wondered how this animal
was passing 002_pg_upgrade.pl, but it seems the animal has run TAP suites only
occasionally.  Passes in the last week were TAP-free.

> [1] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2024-06-30%2018%3A03%3A06




Re: speed up a logical replica setup

2024-06-30 Thread Tom Lane
I wrote:
> In hopes of moving things along as we approach the v18 branch,
> I went ahead and pushed Kuroda-san's patches (with a bit of
> further editorialization).

So b3f5ccebd promptly blew up on fairywren [1]:

connection error: 'psql: error: unterminated quoted string in connection info 
string'
while running 'psql -XAtq -d port=52984 host=C:/tools/nmsys64/tmp/hHg_pngw4z 
dbname='regression"  
!"#$%&\\'()*+,-"' -f - -v ON_ERROR_STOP=1' at 
C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
 line 2124.

This shows that the command-line arguments to psql are not getting
adequately quoted on that platform.  AFAICS, we are relying on
IPC::Run::run to perform such quoting, and it seems to be
falling down here.  I wonder what version of IPC::Run fairywren
is using.

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2024-06-30%2018%3A03%3A06




Re: speed up a logical replica setup

2024-06-30 Thread Tom Lane
In hopes of moving things along as we approach the v18 branch,
I went ahead and pushed Kuroda-san's patches (with a bit of
further editorialization).  AFAICS that allows closing out
the concerns raised by Noah, so I've marked that open item
done.  However, I added a new open item about how the
040_pg_createsubscriber.pl test is slow and still unstable.

regards, tom lane




Re: speed up a logical replica setup

2024-06-30 Thread Tom Lane
Alexander Lakhin  writes:
> As a recent buildfarm failure [1] shows, that test addition introduced
> new instability:

I have a different but possibly-related complaint: why is
040_pg_createsubscriber.pl so miserably slow?  On my machine it
runs for a bit over 19 seconds, which seems completely out of line
(for comparison, 010_pg_basebackup.pl takes 6 seconds, and the
other test scripts in this directory take much less).  It looks
like most of the blame falls on this step:

[12:47:22.292](14.534s) ok 28 - run pg_createsubscriber on node S

AFAICS the amount of data being replicated is completely trivial,
so that it doesn't make any sense for this to take so long --- and
if it does, that suggests that this tool will be impossibly slow
for production use.  But I suspect there is a logic flaw causing
this.  Speculating wildly, perhaps that is related to the failure
Alexander spotted?

regards, tom lane




Re: speed up a logical replica setup

2024-06-30 Thread Alexander Lakhin

Hello Peter and Euler,

17.06.2024 14:04, Peter Eisentraut wrote:

On 07.06.24 05:49, Euler Taveira wrote:

Here it is a patch series to fix the issues reported in recent discussions. The
patches 0001 and 0003 aim to fix the buildfarm issues. The patch 0002 removes
synchronized failover slots on subscriber since it has no use. I also included
an optional patch 0004 that improves the usability by checking both servers if
it already failed in any subscriber check.


I have committed 0001, 0002, and 0003.  Let's keep an eye on the buildfarm to see if that stabilizes things.  So far 
it looks good.


For 0004, I suggest inverting the result values from check_publisher() and create_subscriber() so that it returns true 
if the check is ok.


As a recent buildfarm failure [1] shows, that test addition introduced
new instability:
### Starting node "node_s"
# Running: pg_ctl -w -D 
/home/bf/bf-build/piculet/HEAD/pgsql.build/testrun/pg_basebackup/040_pg_createsubscriber/data/t_040_pg_createsubscriber_node_s_data/pgdata 
-l 
/home/bf/bf-build/piculet/HEAD/pgsql.build/testrun/pg_basebackup/040_pg_createsubscriber/log/040_pg_createsubscriber_node_s.log 
-o --cluster-name=node_s start

waiting for server to start done
server started
# Postmaster PID for node "node_s" is 416482
error running SQL: 'psql::1: ERROR:  skipping slot synchronization as the received slot sync LSN 0/30047F0 for 
slot "failover_slot" is ahead of the standby position 0/3004708'
while running 'psql -XAtq -d port=51506 host=/tmp/pqWohdD5Qj dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'SELECT 
pg_sync_replication_slots()' at /home/bf/bf-build/piculet/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 2126.


I could reproduce this failure with:
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -517,6 +517,7 @@ WalReceiverMain(char *startup_data, size_t startup_data_len)
  * let the startup process and primary server know about
  * them.
  */
+pg_usleep(30);
 XLogWalRcvFlush(false, startpointTLI);

make -s check -C src/bin/pg_basebackup/ PROVE_TESTS="t/040*"

# +++ tap check in src/bin/pg_basebackup +++
t/040_pg_createsubscriber.pl .. 22/? # Tests were run but no plan was declared 
and done_testing() was not seen.
# Looks like your test exited with 29 just after 23.
t/040_pg_createsubscriber.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
All 23 subtests passed

Test Summary Report
---
t/040_pg_createsubscriber.pl (Wstat: 7424 Tests: 23 Failed: 0)
  Non-zero exit status: 29
  Parse errors: No plan found in TAP output
Files=1, Tests=23,  4 wallclock secs ( 0.01 usr  0.01 sys +  0.49 cusr  0.44 
csys =  0.95 CPU)

Moreover, this test may suffer from autovacuum:
echo "
autovacuum_naptime = 1
autovacuum_analyze_threshold = 1
" > /tmp/temp.config
TEMP_CONFIG=/tmp/temp.config make -s check -C src/bin/pg_basebackup/ 
PROVE_TESTS="t/040*"

# +++ tap check in src/bin/pg_basebackup +++
t/040_pg_createsubscriber.pl .. 24/?
#   Failed test 'failover slot is synced'
#   at t/040_pg_createsubscriber.pl line 273.
#  got: ''
# expected: 'failover_slot'
t/040_pg_createsubscriber.pl .. 28/? # Looks like you failed 1 test of 33.
t/040_pg_createsubscriber.pl .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/33 subtests

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet=2024-06-28%2004%3A42%3A48

Best regards,
Alexander




Re: speed up a logical replica setup

2024-06-27 Thread Amit Kapila
On Wed, Jun 26, 2024 at 11:35 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Thanks for giving comment! 0001 was modified per suggestions.
>
> > Yeah, it is a good idea to add a new option for two_phase but that
> > should be done in the next version. For now, I suggest updating the
> > docs and probably raising a warning (if max_prepared_transactions !=
> > 0) as suggested by Noah. This WARNING is useful because one could
> > expect that setting max_prepared_transactions != 0 means apply will
> > happen at prepare time after the subscriber is created by this tool.
> > The WARNING will be useful even if we support two_phase option as the
> > user may have set the non-zero value of max_prepared_transactions but
> > didn't use two_phase option.
>
> I also think it should be tunable, in PG18+. 0002 adds a description in doc.
> Also, this executable warns if the max_prepared_transactions != 0 on the
> publisher.
>

I have slightly adjusted the doc update and warning message for the
0002 patch. Please see attached and let me know what you think.

-- 
With Regards,
Amit Kapila.


v3-0001-pg_createsubscriber-Warn-the-two-phase-is-disable.patch
Description: Binary data


Re: speed up a logical replica setup

2024-06-26 Thread Noah Misch
On Tue, Jun 25, 2024 at 09:50:59PM -0300, Euler Taveira wrote:
> On Tue, Jun 25, 2024, at 3:24 AM, Amit Kapila wrote:
> > On Tue, Jun 25, 2024 at 3:38 AM Noah Misch  wrote:
> > > On Mon, Jun 24, 2024 at 05:20:21PM +0530, Amit Kapila wrote:
> > > > On Sun, Jun 23, 2024 at 11:52 AM Noah Misch  wrote:
> > > > > > +static void
> > > > > > +create_publication(PGconn *conn, struct LogicalRepInfo *dbinfo)
> > > > > > +{
> > > > >
> > > > > > + appendPQExpBuffer(str, "CREATE PUBLICATION %s FOR ALL TABLES",
> > > > > > +   ipubname_esc);
> > > > >
> > > > > This tool's documentation says it "guarantees that no transaction 
> > > > > will be
> > > > > lost."  I tried to determine whether achieving that will require 
> > > > > something
> > > > > like the fix from
> > > > > https://postgr.es/m/flat/de52b282-1166-1180-45a2-8d8917ca7...@enterprisedb.com.
> > > > > (Not exactly the fix from that thread, since that thread has not 
> > > > > discussed the
> > > > > FOR ALL TABLES version of its race condition.)  I don't know.  On the 
> > > > > one
> > > > > hand, pg_createsubscriber benefits from creating a logical slot after 
> > > > > creating
> > > > > the publication.  That snapbuild.c process will wait for running 
> > > > > XIDs.  On the
> > > > > other hand, an INSERT/UPDATE/DELETE acquires its RowExclusiveLock and 
> > > > > builds
> > > > > its relcache entry before assigning an XID, so perhaps the 
> > > > > snapbuild.c process
> > >
> > > Correction: it doesn't matter how the original INSERT/UPDATE/DELETE 
> > > builds its
> > > relcache entry, just how pgoutput of the change builds the relcache entry 
> > > from
> > > the historic snapshot.
> > >
> > > > > isn't enough to prevent that thread's race condition.  What do you 
> > > > > think?
> > > >
> > > > I am not able to imagine how the race condition discussed in the
> > > > thread you quoted can impact this patch. The problem discussed is
> > > > mainly the interaction when we are processing the changes in logical
> > > > decoding w.r.t concurrent DDL (Alter Publication ... Add Table). The
> > > > problem happens because we use the old cache state.
> > >
> > > Right.  Taking the example from
> > > http://postgr.es/m/20231119021830.d6t6aaxtrkpn7...@awork3.anarazel.de, 
> > > LSNs
> > > between what that mail calls 4) and 5) are not safely usable as start 
> > > points.
> > > pg_createsubscriber evades that thread's problem if the consistent_lsn it
> > > passes to pg_replication_origin_advance() can't be in a bad-start-point 
> > > LSN
> > > span.  I cautiously bet the snapbuild.c process achieves that:
> > >
> > > > I am missing your
> > > > point about the race condition mentioned in the thread you quoted with
> > > > snapbuild.c. Can you please elaborate a bit more?
> > >
> > > When pg_createsubscriber calls pg_create_logical_replication_slot(), the 
> > > key
> > > part starts at:
> > >
> > > /*
> > >  * If caller needs us to determine the decoding start point, do 
> > > so now.
> > >  * This might take a while.
> > >  */
> > > if (find_startpoint)
> > > DecodingContextFindStartpoint(ctx);
> > >
> > > Two factors protect pg_createsubscriber.  First, (a) CREATE PUBLICATION
> > > committed before pg_create_logical_replication_slot() started.  Second, 
> > > (b)
> > > DecodingContextFindStartpoint() waits for running XIDs to complete, via 
> > > the
> > > process described at the snapbuild.c "starting up in several stages" 
> > > diagram.
> > > Hence, the consistent_lsn is not in a bad-start-point LSN span.  It's fine
> > > even if the original INSERT populated all caches before CREATE PUBLICATION
> > > started and managed to assign an XID only after consistent_lsn.  From the
> > > pgoutput perspective, that's indistinguishable from the transaction 
> > > starting
> > > at its first WAL record, after consistent_lsn.  The linked "long-standing 
> > > data
> > > loss bug in initial sync of logical replication" thread doesn't have (a),
> > > hence its bug.  How close is that to accurate?
> > 
> > Yeah, this theory sounds right to me. The key point is that no DML
> > (processing of WAL corresponding to DML) before CREATE PUBLICATION ...
> > command would have reached pgoutput level because we would have waited
> > for it during snapbuild.c. Can we conclude that the race condition
> > discussed in the other thread won't impact this patch?
> 
> As Noah said, the key point is the CREATE PUBLICATION *before* creating the
> replication slots -- that waits for transactions to complete.

Let's consider the transaction loss topic concluded.  Thanks for contemplating
it.  I've added an open item for the "dbname containing a space" topic.




RE: speed up a logical replica setup

2024-06-26 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Euler,

Thanks for giving comment! 0001 was modified per suggestions.

> Yeah, it is a good idea to add a new option for two_phase but that
> should be done in the next version. For now, I suggest updating the
> docs and probably raising a warning (if max_prepared_transactions !=
> 0) as suggested by Noah. This WARNING is useful because one could
> expect that setting max_prepared_transactions != 0 means apply will
> happen at prepare time after the subscriber is created by this tool.
> The WARNING will be useful even if we support two_phase option as the
> user may have set the non-zero value of max_prepared_transactions but
> didn't use two_phase option.

I also think it should be tunable, in PG18+. 0002 adds a description in the docs.
Also, this executable warns if max_prepared_transactions != 0 on the
publisher. The subscriber side is not checked now because I don't think it is
helpful, but it can be easily added. I did not add a test for that, because the
current test code does not check outputs.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v2-0001-pg_createsubscriber-Fix-cases-which-connection-pa.patch
Description:  v2-0001-pg_createsubscriber-Fix-cases-which-connection-pa.patch


v2-0002-pg_createsubscriber-Warn-the-two-phase-is-disable.patch
Description:  v2-0002-pg_createsubscriber-Warn-the-two-phase-is-disable.patch


Re: speed up a logical replica setup

2024-06-25 Thread Amit Kapila
On Wed, Jun 26, 2024 at 7:21 AM Euler Taveira  wrote:
>
> On Mon, Jun 24, 2024, at 3:47 AM, Hayato Kuroda (Fujitsu) wrote:
>
> > pg_createsubscriber fails on a dbname containing a space.  Use
> > appendConnStrVal() here and for other params in get_sub_conninfo().  See the
> > CVE-2016-5424 commits for more background.  For one way to test this
> > scenario,
> > see generate_db() in the pg_upgrade test suite.
>
> Thanks for pointing out. I made a fix patch. Test code was also modified 
> accordingly.
>
>
> Patch looks good to me. I have one suggestion
>
> -# Mostly Ported from 002_pg_upgrade.pl, but this returns a generated dbname.
> +# Extracted from 002_pg_upgrade.pl.
>
> and 2 small fixes:
>
> -   'logical replication works on database $db1');
> +   "logical replication works on database $db1");
>
> -is($result, qq(row 1), 'logical replication works on database $db2');
> +is($result, qq(row 1), "logical replication works on database $db2");
>
> > > +static char *
> > > +create_logical_replication_slot(PGconn *conn, struct LogicalRepInfo 
> > > *dbinfo)
> > > +{
> > > + PQExpBuffer str = createPQExpBuffer();
> > > + PGresult   *res = NULL;
> > > + const char *slot_name = dbinfo->replslotname;
> > > + char*slot_name_esc;
> > > + char*lsn = NULL;
> > > +
> > > + Assert(conn != NULL);
> > > +
> > > + pg_log_info("creating the replication slot \"%s\" on database \"%s\"",
> > > + slot_name, dbinfo->dbname);
> > > +
> > > + slot_name_esc = PQescapeLiteral(conn, slot_name, strlen(slot_name));
> > > +
> > > + appendPQExpBuffer(str,
> > > +   "SELECT lsn FROM
> > pg_catalog.pg_create_logical_replication_slot(%s, 'pgoutput', false, false, 
> > false)",
> >
> > This is passing twophase=false, but the patch does not mention prepared
> > transactions.  Is the intent to not support workloads containing prepared
> > transactions?  If so, the documentation should say that, and the tool likely
> > should warn on startup if max_prepared_transactions != 0.
>
> IIUC, we decided that because it is the default behavior of logical replication. See [1].
> +1 for improving the documentation, but I'm not sure adding output is helpful.
> I want to know opinions from others.
>
>
> Documentation says
>
> When two-phase commit is enabled, prepared transactions are sent to the
> subscriber at the time of PREPARE TRANSACTION, and are processed as two-phase
> transactions on the subscriber too. Otherwise, prepared transactions are sent 
> to
> the subscriber only when committed, and are then processed immediately by the
> subscriber.
>
> Hence, the replication should be working for prepared transactions even if it
> created the slot with twophase = false. IIRC the user won't be able to change 
> it
> later. As Amit said in a previous email, once the command
> ALTER SUBSCRIPTION ... SET (two_phase = on) is supported, users can change it
> after running pg_createsubscriber. The other option is to add a command-line
> option to enable or disable it.
>

Yeah, it is a good idea to add a new option for two_phase but that
should be done in the next version. For now, I suggest updating the
docs and probably raising a warning (if max_prepared_transactions !=
0) as suggested by Noah. This WARNING is useful because one could
expect that setting max_prepared_transactions != 0 means apply will
happen at prepare time after the subscriber is created by this tool.
The WARNING will be useful even if we support two_phase option as the
user may have set the non-zero value of max_prepared_transactions but
didn't use two_phase option.

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-06-25 Thread Euler Taveira
On Mon, Jun 24, 2024, at 3:47 AM, Hayato Kuroda (Fujitsu) wrote:
> > pg_createsubscriber fails on a dbname containing a space.  Use
> > appendConnStrVal() here and for other params in get_sub_conninfo().  See the
> > CVE-2016-5424 commits for more background.  For one way to test this
> > scenario,
> > see generate_db() in the pg_upgrade test suite.
> 
> Thanks for pointing out. I made a fix patch. Test code was also modified 
> accordingly.

Patch looks good to me. I have one suggestion

-# Mostly Ported from 002_pg_upgrade.pl, but this returns a generated dbname.
+# Extracted from 002_pg_upgrade.pl.

and 2 small fixes:

-   'logical replication works on database $db1');
+   "logical replication works on database $db1");

-is($result, qq(row 1), 'logical replication works on database $db2');
+is($result, qq(row 1), "logical replication works on database $db2");

> > > +static char *
> > > +create_logical_replication_slot(PGconn *conn, struct LogicalRepInfo 
> > > *dbinfo)
> > > +{
> > > + PQExpBuffer str = createPQExpBuffer();
> > > + PGresult   *res = NULL;
> > > + const char *slot_name = dbinfo->replslotname;
> > > + char*slot_name_esc;
> > > + char*lsn = NULL;
> > > +
> > > + Assert(conn != NULL);
> > > +
> > > + pg_log_info("creating the replication slot \"%s\" on database \"%s\"",
> > > + slot_name, dbinfo->dbname);
> > > +
> > > + slot_name_esc = PQescapeLiteral(conn, slot_name, strlen(slot_name));
> > > +
> > > + appendPQExpBuffer(str,
> > > +   "SELECT lsn FROM
> > pg_catalog.pg_create_logical_replication_slot(%s, 'pgoutput', false, false, 
> > false)",
> > 
> > This is passing twophase=false, but the patch does not mention prepared
> > transactions.  Is the intent to not support workloads containing prepared
> > transactions?  If so, the documentation should say that, and the tool likely
> > should warn on startup if max_prepared_transactions != 0.
> 
> IIUC, we decided that because it is the default behavior of logical replication. See [1].
> +1 for improving the documentation, but I'm not sure adding output is helpful.
> I want to know opinions from others.

Documentation says

When two-phase commit is enabled, prepared transactions are sent to the
subscriber at the time of PREPARE TRANSACTION, and are processed as two-phase
transactions on the subscriber too. Otherwise, prepared transactions are sent to
the subscriber only when committed, and are then processed immediately by the
subscriber.

Hence, the replication should be working for prepared transactions even if it
created the slot with twophase = false. IIRC the user won't be able to change it
later. As Amit said in a previous email, once the command
ALTER SUBSCRIPTION ... SET (two_phase = on) is supported, users can change it
after running pg_createsubscriber. The other option is to add a command-line
option to enable or disable it.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: speed up a logical replica setup

2024-06-25 Thread Euler Taveira
On Tue, Jun 25, 2024, at 3:24 AM, Amit Kapila wrote:
> On Tue, Jun 25, 2024 at 3:38 AM Noah Misch  wrote:
> >
> > On Mon, Jun 24, 2024 at 05:20:21PM +0530, Amit Kapila wrote:
> > > On Sun, Jun 23, 2024 at 11:52 AM Noah Misch  wrote:
> > > >
> > > > > +static void
> > > > > +create_publication(PGconn *conn, struct LogicalRepInfo *dbinfo)
> > > > > +{
> > > >
> > > > > + appendPQExpBuffer(str, "CREATE PUBLICATION %s FOR ALL TABLES",
> > > > > +   ipubname_esc);
> > > >
> > > > This tool's documentation says it "guarantees that no transaction will 
> > > > be
> > > > lost."  I tried to determine whether achieving that will require 
> > > > something
> > > > like the fix from
> > > > https://postgr.es/m/flat/de52b282-1166-1180-45a2-8d8917ca7...@enterprisedb.com.
> > > > (Not exactly the fix from that thread, since that thread has not 
> > > > discussed the
> > > > FOR ALL TABLES version of its race condition.)  I don't know.  On the 
> > > > one
> > > > hand, pg_createsubscriber benefits from creating a logical slot after 
> > > > creating
> > > > the publication.  That snapbuild.c process will wait for running XIDs.  
> > > > On the
> > > > other hand, an INSERT/UPDATE/DELETE acquires its RowExclusiveLock and 
> > > > builds
> > > > its relcache entry before assigning an XID, so perhaps the snapbuild.c 
> > > > process
> >
> > Correction: it doesn't matter how the original INSERT/UPDATE/DELETE builds 
> > its
> > relcache entry, just how pgoutput of the change builds the relcache entry 
> > from
> > the historic snapshot.
> >
> > > > isn't enough to prevent that thread's race condition.  What do you 
> > > > think?
> > >
> > > I am not able to imagine how the race condition discussed in the
> > > thread you quoted can impact this patch. The problem discussed is
> > > mainly the interaction when we are processing the changes in logical
> > > decoding w.r.t concurrent DDL (Alter Publication ... Add Table). The
> > > problem happens because we use the old cache state.
> >
> > Right.  Taking the example from
> > http://postgr.es/m/20231119021830.d6t6aaxtrkpn7...@awork3.anarazel.de, LSNs
> > between what that mail calls 4) and 5) are not safely usable as start 
> > points.
> > pg_createsubscriber evades that thread's problem if the consistent_lsn it
> > passes to pg_replication_origin_advance() can't be in a bad-start-point LSN
> > span.  I cautiously bet the snapbuild.c process achieves that:
> >
> > > I am missing your
> > > point about the race condition mentioned in the thread you quoted with
> > > snapbuild.c. Can you please elaborate a bit more?
> >
> > When pg_createsubscriber calls pg_create_logical_replication_slot(), the key
> > part starts at:
> >
> > /*
> >  * If caller needs us to determine the decoding start point, do so 
> > now.
> >  * This might take a while.
> >  */
> > if (find_startpoint)
> > DecodingContextFindStartpoint(ctx);
> >
> > Two factors protect pg_createsubscriber.  First, (a) CREATE PUBLICATION
> > committed before pg_create_logical_replication_slot() started.  Second, (b)
> > DecodingContextFindStartpoint() waits for running XIDs to complete, via the
> > process described at the snapbuild.c "starting up in several stages" 
> > diagram.
> > Hence, the consistent_lsn is not in a bad-start-point LSN span.  It's fine
> > even if the original INSERT populated all caches before CREATE PUBLICATION
> > started and managed to assign an XID only after consistent_lsn.  From the
> > pgoutput perspective, that's indistinguishable from the transaction starting
> > at its first WAL record, after consistent_lsn.  The linked "long-standing 
> > data
> > loss bug in initial sync of logical replication" thread doesn't have (a),
> > hence its bug.  How close is that to accurate?
> >
> 
> Yeah, this theory sounds right to me. The key point is that no DML
> (processing of WAL corresponding to DML) before CREATE PUBLICATION ...
> command would have reached pgoutput level because we would have waited
> for it during snapbuild.c. Can we conclude that the race condition
> discussed in the other thread won't impact this patch?

As Noah said, the key point is the CREATE PUBLICATION *before* creating the
replication slots -- that waits for transactions to complete.
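
In TAP-test terms the protective ordering is simply the following (publication
and slot names here are illustrative, not what the tool generates):

```
# The publication is committed first ...
$node_p->safe_psql($db1, 'CREATE PUBLICATION pub_all FOR ALL TABLES');

# ... so by the time slot creation waits out the running XIDs while finding its
# decoding start point, every change decoded from that point on already sees
# the publication.
$node_p->safe_psql($db1,
	"SELECT pg_create_logical_replication_slot('sub_slot', 'pgoutput')");
```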


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: speed up a logical replica setup

2024-06-25 Thread Amit Kapila
On Tue, Jun 25, 2024 at 3:38 AM Noah Misch  wrote:
>
> On Mon, Jun 24, 2024 at 05:20:21PM +0530, Amit Kapila wrote:
> > On Sun, Jun 23, 2024 at 11:52 AM Noah Misch  wrote:
> > >
> > > > +static void
> > > > +create_publication(PGconn *conn, struct LogicalRepInfo *dbinfo)
> > > > +{
> > >
> > > > + appendPQExpBuffer(str, "CREATE PUBLICATION %s FOR ALL TABLES",
> > > > +   ipubname_esc);
> > >
> > > This tool's documentation says it "guarantees that no transaction will be
> > > lost."  I tried to determine whether achieving that will require something
> > > like the fix from
> > > https://postgr.es/m/flat/de52b282-1166-1180-45a2-8d8917ca7...@enterprisedb.com.
> > > (Not exactly the fix from that thread, since that thread has not 
> > > discussed the
> > > FOR ALL TABLES version of its race condition.)  I don't know.  On the one
> > > hand, pg_createsubscriber benefits from creating a logical slot after 
> > > creating
> > > the publication.  That snapbuild.c process will wait for running XIDs.  
> > > On the
> > > other hand, an INSERT/UPDATE/DELETE acquires its RowExclusiveLock and 
> > > builds
> > > its relcache entry before assigning an XID, so perhaps the snapbuild.c 
> > > process
>
> Correction: it doesn't matter how the original INSERT/UPDATE/DELETE builds its
> relcache entry, just how pgoutput of the change builds the relcache entry from
> the historic snapshot.
>
> > > isn't enough to prevent that thread's race condition.  What do you think?
> >
> > I am not able to imagine how the race condition discussed in the
> > thread you quoted can impact this patch. The problem discussed is
> > mainly the interaction when we are processing the changes in logical
> > decoding w.r.t concurrent DDL (Alter Publication ... Add Table). The
> > problem happens because we use the old cache state.
>
> Right.  Taking the example from
> http://postgr.es/m/20231119021830.d6t6aaxtrkpn7...@awork3.anarazel.de, LSNs
> between what that mail calls 4) and 5) are not safely usable as start points.
> pg_createsubscriber evades that thread's problem if the consistent_lsn it
> passes to pg_replication_origin_advance() can't be in a bad-start-point LSN
> span.  I cautiously bet the snapbuild.c process achieves that:
>
> > I am missing your
> > point about the race condition mentioned in the thread you quoted with
> > snapbuild.c. Can you please elaborate a bit more?
>
> When pg_createsubscriber calls pg_create_logical_replication_slot(), the key
> part starts at:
>
> /*
>  * If caller needs us to determine the decoding start point, do so 
> now.
>  * This might take a while.
>  */
> if (find_startpoint)
> DecodingContextFindStartpoint(ctx);
>
> Two factors protect pg_createsubscriber.  First, (a) CREATE PUBLICATION
> committed before pg_create_logical_replication_slot() started.  Second, (b)
> DecodingContextFindStartpoint() waits for running XIDs to complete, via the
> process described at the snapbuild.c "starting up in several stages" diagram.
> Hence, the consistent_lsn is not in a bad-start-point LSN span.  It's fine
> even if the original INSERT populated all caches before CREATE PUBLICATION
> started and managed to assign an XID only after consistent_lsn.  From the
> pgoutput perspective, that's indistinguishable from the transaction starting
> at its first WAL record, after consistent_lsn.  The linked "long-standing data
> loss bug in initial sync of logical replication" thread doesn't have (a),
> hence its bug.  How close is that to accurate?
>

Yeah, this theory sounds right to me. The key point is that no DML
(processing of WAL corresponding to DML) before CREATE PUBLICATION ...
command would have reached pgoutput level because we would have waited
for it during snapbuild.c. Can we conclude that the race condition
discussed in the other thread won't impact this patch?

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-06-24 Thread Noah Misch
On Mon, Jun 24, 2024 at 05:20:21PM +0530, Amit Kapila wrote:
> On Sun, Jun 23, 2024 at 11:52 AM Noah Misch  wrote:
> >
> > > +static void
> > > +create_publication(PGconn *conn, struct LogicalRepInfo *dbinfo)
> > > +{
> >
> > > + appendPQExpBuffer(str, "CREATE PUBLICATION %s FOR ALL TABLES",
> > > +   ipubname_esc);
> >
> > This tool's documentation says it "guarantees that no transaction will be
> > lost."  I tried to determine whether achieving that will require something
> > like the fix from
> > https://postgr.es/m/flat/de52b282-1166-1180-45a2-8d8917ca7...@enterprisedb.com.
> > (Not exactly the fix from that thread, since that thread has not discussed 
> > the
> > FOR ALL TABLES version of its race condition.)  I don't know.  On the one
> > hand, pg_createsubscriber benefits from creating a logical slot after 
> > creating
> > the publication.  That snapbuild.c process will wait for running XIDs.  On 
> > the
> > other hand, an INSERT/UPDATE/DELETE acquires its RowExclusiveLock and builds
> > its relcache entry before assigning an XID, so perhaps the snapbuild.c 
> > process

Correction: it doesn't matter how the original INSERT/UPDATE/DELETE builds its
relcache entry, just how pgoutput of the change builds the relcache entry from
the historic snapshot.

> > isn't enough to prevent that thread's race condition.  What do you think?
> 
> I am not able to imagine how the race condition discussed in the
> thread you quoted can impact this patch. The problem discussed is
> mainly the interaction when we are processing the changes in logical
> decoding w.r.t concurrent DDL (Alter Publication ... Add Table). The
> problem happens because we use the old cache state.

Right.  Taking the example from
http://postgr.es/m/20231119021830.d6t6aaxtrkpn7...@awork3.anarazel.de, LSNs
between what that mail calls 4) and 5) are not safely usable as start points.
pg_createsubscriber evades that thread's problem if the consistent_lsn it
passes to pg_replication_origin_advance() can't be in a bad-start-point LSN
span.  I cautiously bet the snapbuild.c process achieves that:

> I am missing your
> point about the race condition mentioned in the thread you quoted with
> snapbuild.c. Can you please elaborate a bit more?

When pg_createsubscriber calls pg_create_logical_replication_slot(), the key
part starts at:

/*
 * If caller needs us to determine the decoding start point, do so now.
 * This might take a while.
 */
if (find_startpoint)
DecodingContextFindStartpoint(ctx);

Two factors protect pg_createsubscriber.  First, (a) CREATE PUBLICATION
committed before pg_create_logical_replication_slot() started.  Second, (b)
DecodingContextFindStartpoint() waits for running XIDs to complete, via the
process described at the snapbuild.c "starting up in several stages" diagram.
Hence, the consistent_lsn is not in a bad-start-point LSN span.  It's fine
even if the original INSERT populated all caches before CREATE PUBLICATION
started and managed to assign an XID only after consistent_lsn.  From the
pgoutput perspective, that's indistinguishable from the transaction starting
at its first WAL record, after consistent_lsn.  The linked "long-standing data
loss bug in initial sync of logical replication" thread doesn't have (a),
hence its bug.  How close is that to accurate?

Thanks,
nm




Re: speed up a logical replica setup

2024-06-24 Thread Amit Kapila
On Sun, Jun 23, 2024 at 11:52 AM Noah Misch  wrote:
>
> > +static void
> > +create_publication(PGconn *conn, struct LogicalRepInfo *dbinfo)
> > +{
>
> > + appendPQExpBuffer(str, "CREATE PUBLICATION %s FOR ALL TABLES",
> > +   ipubname_esc);
>
> This tool's documentation says it "guarantees that no transaction will be
> lost."  I tried to determine whether achieving that will require something
> like the fix from
> https://postgr.es/m/flat/de52b282-1166-1180-45a2-8d8917ca7...@enterprisedb.com.
> (Not exactly the fix from that thread, since that thread has not discussed the
> FOR ALL TABLES version of its race condition.)  I don't know.  On the one
> hand, pg_createsubscriber benefits from creating a logical slot after creating
> the publication.  That snapbuild.c process will wait for running XIDs.  On the
> other hand, an INSERT/UPDATE/DELETE acquires its RowExclusiveLock and builds
> its relcache entry before assigning an XID, so perhaps the snapbuild.c process
> isn't enough to prevent that thread's race condition.  What do you think?
>

I am not able to imagine how the race condition discussed in the
thread you quoted can impact this patch. The problem discussed is
mainly the interaction when we are processing the changes in logical
decoding w.r.t concurrent DDL (Alter Publication ... Add Table). The
problem happens because we use the old cache state. I am missing your
point about the race condition mentioned in the thread you quoted with
snapbuild.c. Can you please elaborate a bit more?

-- 
With Regards,
Amit Kapila.




RE: speed up a logical replica setup

2024-06-24 Thread Hayato Kuroda (Fujitsu)
Dear Noah,

> pg_createsubscriber fails on a dbname containing a space.  Use
> appendConnStrVal() here and for other params in get_sub_conninfo().  See the
> CVE-2016-5424 commits for more background.  For one way to test this
> scenario,
> see generate_db() in the pg_upgrade test suite.

Thanks for pointing this out. I made a fix patch; the test code was also
modified accordingly.

> > +static char *
> > +create_logical_replication_slot(PGconn *conn, struct LogicalRepInfo 
> > *dbinfo)
> > +{
> > +   PQExpBuffer str = createPQExpBuffer();
> > +   PGresult   *res = NULL;
> > +   const char *slot_name = dbinfo->replslotname;
> > +   char   *slot_name_esc;
> > +   char   *lsn = NULL;
> > +
> > +   Assert(conn != NULL);
> > +
> > +   pg_log_info("creating the replication slot \"%s\" on database \"%s\"",
> > +   slot_name, dbinfo->dbname);
> > +
> > +   slot_name_esc = PQescapeLiteral(conn, slot_name, strlen(slot_name));
> > +
> > +   appendPQExpBuffer(str,
> > + "SELECT lsn FROM
> pg_catalog.pg_create_logical_replication_slot(%s, 'pgoutput', false, false, 
> false)",
> 
> This is passing twophase=false, but the patch does not mention prepared
> transactions.  Is the intent to not support workloads containing prepared
> transactions?  If so, the documentation should say that, and the tool likely
> should warn on startup if max_prepared_transactions != 0.

IIUC, we decided on this because it is the default behavior of logical
replication. See [1].
+1 for improving the documentation, but I am not sure that adding a warning
output is helpful. I would like to hear opinions from others.

> > +static void
> > +create_publication(PGconn *conn, struct LogicalRepInfo *dbinfo)
> > +{
> 
> > +   appendPQExpBuffer(str, "CREATE PUBLICATION %s FOR ALL TABLES",
> > + ipubname_esc);
> 
> This tool's documentation says it "guarantees that no transaction will be
> lost."  I tried to determine whether achieving that will require something
> like the fix from
> https://postgr.es/m/flat/de52b282-1166-1180-45a2-8d8917ca74c6@enterprise
> db.com.
> (Not exactly the fix from that thread, since that thread has not discussed the
> FOR ALL TABLES version of its race condition.)  I don't know.  On the one
> hand, pg_createsubscriber benefits from creating a logical slot after creating
> the publication.  That snapbuild.c process will wait for running XIDs.  On the
> other hand, an INSERT/UPDATE/DELETE acquires its RowExclusiveLock and
> builds
> its relcache entry before assigning an XID, so perhaps the snapbuild.c process
> isn't enough to prevent that thread's race condition.  What do you think?

IIUC, the documentation just intended to say that the replication will be
switched from streaming (physical) to logical at a certain point. Please give
me some time to analyze this.

[1]: 
https://www.postgresql.org/message-id/270ad9b8-9c46-40c3-b6c5-3d25b91d3a7d%40app.fastmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

 


0001-pg_createsubscriber-Fix-cases-which-connection-param.patch
Description:  0001-pg_createsubscriber-Fix-cases-which-connection-param.patch


Re: speed up a logical replica setup

2024-06-24 Thread Amit Kapila
On Sun, Jun 23, 2024 at 11:52 AM Noah Misch  wrote:
>
> > +static char *
> > +create_logical_replication_slot(PGconn *conn, struct LogicalRepInfo 
> > *dbinfo)
> > +{
> > + PQExpBuffer str = createPQExpBuffer();
> > + PGresult   *res = NULL;
> > + const char *slot_name = dbinfo->replslotname;
> > + char   *slot_name_esc;
> > + char   *lsn = NULL;
> > +
> > + Assert(conn != NULL);
> > +
> > + pg_log_info("creating the replication slot \"%s\" on database \"%s\"",
> > + slot_name, dbinfo->dbname);
> > +
> > + slot_name_esc = PQescapeLiteral(conn, slot_name, strlen(slot_name));
> > +
> > + appendPQExpBuffer(str,
> > +   "SELECT lsn FROM 
> > pg_catalog.pg_create_logical_replication_slot(%s, 'pgoutput', false, false, 
> > false)",
>
> This is passing twophase=false, but the patch does not mention prepared
> transactions.  Is the intent to not support workloads containing prepared
> transactions?  If so, the documentation should say that, and the tool likely
> should warn on startup if max_prepared_transactions != 0.
>

The other point to note in this regard is that if we don't support
two_phase initially during subscription/slot setup, users won't be able
to change it later, as we don't yet support changing it via ALTER
SUBSCRIPTION (though a patch to alter two_phase is proposed for PG18).

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-06-23 Thread Noah Misch
On Mon, Mar 25, 2024 at 12:55:39PM +0100, Peter Eisentraut wrote:
> I have committed your version v33.

> commit d44032d

> --- /dev/null
> +++ b/src/bin/pg_basebackup/pg_createsubscriber.c

> +static char *
> +concat_conninfo_dbname(const char *conninfo, const char *dbname)
> +{
> + PQExpBuffer buf = createPQExpBuffer();
> + char   *ret;
> +
> + Assert(conninfo != NULL);
> +
> + appendPQExpBufferStr(buf, conninfo);
> + appendPQExpBuffer(buf, " dbname=%s", dbname);

pg_createsubscriber fails on a dbname containing a space.  Use
appendConnStrVal() here and for other params in get_sub_conninfo().  See the
CVE-2016-5424 commits for more background.  For one way to test this scenario,
see generate_db() in the pg_upgrade test suite.
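
For illustration, one possible shape of the fix is sketched below. This is a
rough sketch, not the committed change; it assumes the existing
concat_conninfo_dbname() signature and fe_utils/string_utils.h:

#include "fe_utils/string_utils.h"

static char *
concat_conninfo_dbname(const char *conninfo, const char *dbname)
{
	PQExpBuffer buf = createPQExpBuffer();
	char	   *ret;

	Assert(conninfo != NULL);

	appendPQExpBufferStr(buf, conninfo);
	appendPQExpBufferStr(buf, " dbname=");
	/* quotes and escapes the value, so spaces and quotes are handled */
	appendConnStrVal(buf, dbname);

	ret = pg_strdup(buf->data);
	destroyPQExpBuffer(buf);
	return ret;
}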

> +static char *
> +create_logical_replication_slot(PGconn *conn, struct LogicalRepInfo *dbinfo)
> +{
> + PQExpBuffer str = createPQExpBuffer();
> + PGresult   *res = NULL;
> + const char *slot_name = dbinfo->replslotname;
> + char   *slot_name_esc;
> + char   *lsn = NULL;
> +
> + Assert(conn != NULL);
> +
> + pg_log_info("creating the replication slot \"%s\" on database \"%s\"",
> + slot_name, dbinfo->dbname);
> +
> + slot_name_esc = PQescapeLiteral(conn, slot_name, strlen(slot_name));
> +
> + appendPQExpBuffer(str,
> +   "SELECT lsn FROM 
> pg_catalog.pg_create_logical_replication_slot(%s, 'pgoutput', false, false, 
> false)",

This is passing twophase=false, but the patch does not mention prepared
transactions.  Is the intent to not support workloads containing prepared
transactions?  If so, the documentation should say that, and the tool likely
should warn on startup if max_prepared_transactions != 0.
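
As an illustration only (the query, the message wording, and the reuse of the
conn/res variables are assumptions here, not existing code), such a startup
check could look roughly like this:

	res = PQexec(conn,
				 "SELECT setting FROM pg_catalog.pg_settings "
				 "WHERE name = 'max_prepared_transactions'");
	if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1 &&
		atoi(PQgetvalue(res, 0, 0)) != 0)
		pg_log_warning("two_phase is disabled for the created subscriptions; "
					   "prepared transactions will only be replicated at COMMIT PREPARED");
	PQclear(res);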

> +static void
> +create_publication(PGconn *conn, struct LogicalRepInfo *dbinfo)
> +{

> + appendPQExpBuffer(str, "CREATE PUBLICATION %s FOR ALL TABLES",
> +   ipubname_esc);

This tool's documentation says it "guarantees that no transaction will be
lost."  I tried to determine whether achieving that will require something
like the fix from
https://postgr.es/m/flat/de52b282-1166-1180-45a2-8d8917ca7...@enterprisedb.com.
(Not exactly the fix from that thread, since that thread has not discussed the
FOR ALL TABLES version of its race condition.)  I don't know.  On the one
hand, pg_createsubscriber benefits from creating a logical slot after creating
the publication.  That snapbuild.c process will wait for running XIDs.  On the
other hand, an INSERT/UPDATE/DELETE acquires its RowExclusiveLock and builds
its relcache entry before assigning an XID, so perhaps the snapbuild.c process
isn't enough to prevent that thread's race condition.  What do you think?




Re: speed up a logical replica setup

2024-06-18 Thread Euler Taveira
On Mon, Jun 17, 2024, at 8:04 AM, Peter Eisentraut wrote:
> On 07.06.24 05:49, Euler Taveira wrote:
> > Here it is a patch series to fix the issues reported in recent 
> > discussions. The
> > patches 0001 and 0003 aim to fix the buildfarm issues. The patch 0002 
> > removes
> > synchronized failover slots on subscriber since it has no use. I also 
> > included
> > an optional patch 0004 that improves the usability by checking both 
> > servers if
> > it already failed in any subscriber check.
> 
> I have committed 0001, 0002, and 0003.  Let's keep an eye on the 
> buildfarm to see if that stabilizes things.  So far it looks good.

Thanks.

> For 0004, I suggest inverting the result values from check_publisher() 
> and create_subscriber() so that it returns true if the check is ok.
> 

Yeah, the order doesn't matter after the commit b963913826. I thought about
changing the order but I didn't; I provided a minimal patch. Since you think
it is an improvement, I'm attaching another patch with this change.


--
Euler Taveira
EDB   https://www.enterprisedb.com/
From ec96586fa35be3f855fb8844a197f374ac82a622 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Fri, 24 May 2024 14:29:06 -0300
Subject: [PATCH] Check both servers before exiting

The current code provides multiple errors for each server. If any
subscriber condition is not met, it terminates without checking the
publisher conditions.
This change allows it to check both servers before terminating if any
condition is not met. It saves at least one dry run execution.

The order in which the check functions are called doesn't matter after
commit b96391382626306c301b67cbd2d28313d96741f3 (because there is no
primary_slot_name check anymore), so check the publisher first.
---
 src/bin/pg_basebackup/pg_createsubscriber.c | 28 -
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index 5499e6d96a..26b4087ff7 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -76,9 +76,9 @@ static uint64 get_standby_sysid(const char *datadir);
 static void modify_subscriber_sysid(const struct CreateSubscriberOptions *opt);
 static bool server_is_in_recovery(PGconn *conn);
 static char *generate_object_name(PGconn *conn);
-static void check_publisher(const struct LogicalRepInfo *dbinfo);
+static bool check_publisher(const struct LogicalRepInfo *dbinfo);
 static char *setup_publisher(struct LogicalRepInfo *dbinfo);
-static void check_subscriber(const struct LogicalRepInfo *dbinfo);
+static bool check_subscriber(const struct LogicalRepInfo *dbinfo);
 static void setup_subscriber(struct LogicalRepInfo *dbinfo,
 			 const char *consistent_lsn);
 static void setup_recovery(const struct LogicalRepInfo *dbinfo, const char *datadir,
@@ -803,7 +803,7 @@ server_is_in_recovery(PGconn *conn)
  *
  * XXX Does it not allow a synchronous replica?
  */
-static void
+static bool
 check_publisher(const struct LogicalRepInfo *dbinfo)
 {
 	PGconn	   *conn;
@@ -908,8 +908,7 @@ check_publisher(const struct LogicalRepInfo *dbinfo)
 
 	pg_free(wal_level);
 
-	if (failed)
-		exit(1);
+	return failed;
 }
 
 /*
@@ -923,7 +922,7 @@ check_publisher(const struct LogicalRepInfo *dbinfo)
  * there is not a reliable way to provide a suitable error saying the server C
  * will be broken at the end of this process (due to pg_resetwal).
  */
-static void
+static bool
 check_subscriber(const struct LogicalRepInfo *dbinfo)
 {
 	PGconn	   *conn;
@@ -1014,8 +1013,7 @@ check_subscriber(const struct LogicalRepInfo *dbinfo)
 		failed = true;
 	}
 
-	if (failed)
-		exit(1);
+	return failed;
 }
 
 /*
@@ -1771,6 +1769,9 @@ main(int argc, char **argv)
 
 	char		pidfile[MAXPGPATH];
 
+	bool		pub_failed = false;
+	bool		sub_failed = false;
+
 	pg_logging_init(argv[0]);
 	pg_logging_set_level(PG_LOG_WARNING);
 	progname = get_progname(argv[0]);
@@ -2061,11 +2062,14 @@ main(int argc, char **argv)
 	pg_log_info("starting the standby with command-line options");
	start_standby_server(&opt, true);
 
-	/* Check if the standby server is ready for logical replication */
-	check_subscriber(dbinfo);
-
 	/* Check if the primary server is ready for logical replication */
-	check_publisher(dbinfo);
+	pub_failed = check_publisher(dbinfo);
+
+	/* Check if the standby server is ready for logical replication */
+	sub_failed = check_subscriber(dbinfo);
+
+	if (pub_failed || sub_failed)
+		exit(1);
 
 	/*
 	 * Stop the target server. The recovery process requires that the server
-- 
2.30.2



Re: speed up a logical replica setup

2024-06-17 Thread Peter Eisentraut

On 07.06.24 11:17, Hayato Kuroda (Fujitsu) wrote:

> Other minor comments are included by the attached diff file. It contains changes
> to follow conventions and pgindent/pgperltidy.


I have included some of your changes in the patches from Euler that I 
committed today.  The 0004 patch might get rerolled by Euler, so he 
could include your relevant changes there.


I'm not sure about moving the sync_replication_slots setting around.  It 
seems to add a lot of work for no discernible gain?


The pgperltidy changes are probably better done separately, because they 
otherwise make quite a mess of the patch.  Maybe we'll just do that once 
more before we branch.






Re: speed up a logical replica setup

2024-06-17 Thread Peter Eisentraut

On 07.06.24 05:49, Euler Taveira wrote:
> Here it is a patch series to fix the issues reported in recent discussions. The
> patches 0001 and 0003 aim to fix the buildfarm issues. The patch 0002 removes
> synchronized failover slots on subscriber since it has no use. I also included
> an optional patch 0004 that improves the usability by checking both servers if
> it already failed in any subscriber check.


I have committed 0001, 0002, and 0003.  Let's keep an eye on the 
buildfarm to see if that stabilizes things.  So far it looks good.


For 0004, I suggest inverting the result values from check_publisher() 
and create_subscriber() so that it returns true if the check is ok.






RE: speed up a logical replica setup

2024-06-07 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

Thanks for making the follow-up patch! I was looking forward to your updates.
I think this patch set is the solution for the buildfarm errors that were found.
However, there are remaining concerns raised by others. You should reply with
what you think about them. At least:

1) There are some misleading messages [1]. I think the v3-0005 patch can solve
   the issue.
2) pg_createsubscriber may fail if the primary has subscriptions [2]. IIUC the
   possible approaches are A) "keep subscriptions disabled at the end",
   B) "by default drop the pre-existing subscriptions",
   C) "do nothing, just document the risk".


> Before sending this email I realized that I did nothing about physical
> replication slots on the standby. I think we should also remove them too
> unconditionally.

I also thought about this, but it might be difficult to predict what users
expect. Can we be sure that such a slot is not an intentional one? Regarding
failover slots, removing them is OK because they are meaningful only on the
standby, but I am not sure about other slots. I personally think we can keep
the current behavior, but what do others think?


Below parts are comments for each patches.

0001
Basically LGTM. I was a bit confused because the default timeout is not set,
but it seemed to follow the suggestion by Tomas [3].

0002
If you want to improve the commit message, please add that 
sync_replication_slots
is disabled during the conversion.

0003
Confirmed it followed the discussion [4].

0004
Basically LGTM.

Other minor comments are included in the attached diff file. It contains changes
to follow conventions and pgindent/pgperltidy.

[1]: 
https://www.postgresql.org/message-id/CAA4eK1J2fAvsJ2HihbWJ_GxETd6sdqSMrZdCVJEutRZRpm1MEQ%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/CANhcyEWvimA1-f6hSrA%3D9qkfR5SonFb56b36M%2B%2BvT%3DLiFj%3D76g%40mail.gmail.com
[3]: 
https://www.postgresql.org/message-id/5d5dd4cd-6359-4109-88e8-c8e13035ae16%40enterprisedb.com
[4]: 
https://www.postgresql.org/message-id/CAA4eK1LZxYxcbeiOn3Q5hjXVtZKhJWj-fQtndAeTCvZrPev8BA%40mail.gmail.com


Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 



minor_fix_by_kuroda.diff
Description: minor_fix_by_kuroda.diff


Re: speed up a logical replica setup

2024-06-06 Thread Euler Taveira
On Wed, May 22, 2024, at 12:16 PM, Euler Taveira wrote:
> I'll summarize all issues as soon as I finish the review of sync slot 
> support. I
> think we should avoid new development if we judge that the item can be
> documented as a limitation for this version. Nevertheless, I will share 
> patches
> so you can give your opinion on whether it is an open item or new development.

Here it is a patch series to fix the issues reported in recent discussions. The
patches 0001 and 0003 aim to fix the buildfarm issues. The patch 0002 removes
synchronized failover slots on subscriber since it has no use. I also included
an optional patch 0004 that improves the usability by checking both servers if
it already failed in any subscriber check.

As I said in the previous email, I decided to remove the logic that reacts to
an issue on the primary. We can reintroduce such code later if/when we have a
better way to check the recovery progress. The tool now relies only on
recovery_timeout, and the test sets recovery_timeout to PG_TEST_TIMEOUT_DEFAULT
to let the buildfarm animals control how long it should wait for the recovery.

Since some animals reported issues in the check_publisher routine that checks
whether the primary_slot_name is in use on the primary, this logic was removed
too (patch 0003). We could introduce a way to keep retrying this test, but the
conclusion is that it is not worth it; if the primary_slot_name does not exist
(due to a setup error), pg_createsubscriber will log an error message and
continue.

Patch 0002 removes any failover slots that remain on the subscriber. Regarding
terminology, I noticed that slotsync.c uses both "logical failover slots" and
"failover logical slots"; I think the latter sounds better, but I'm not a
native speaker. I also don't know whether we should use a shorter term such as
"failover slots", "failover replication slots", or "failover logical
replication slots". IMO we can omit "logical" because "failover" implies it is
a logical replication slot. I'm less sure about omitting "replication"; that is
not descriptive enough. I prefer "failover replication slots".

Before sending this email I realized that I did nothing about physical
replication slots on the standby. I think we should remove them too,
unconditionally.


--
Euler Taveira
EDB   https://www.enterprisedb.com/
From 5d8b4781e6e9bcb00564f45c25e575a8abab6ae8 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Tue, 21 May 2024 23:04:57 -0300
Subject: [PATCH 1/4] Only the recovery_timeout controls the end of recovery
 process

It used to check whether the target server was connected to the primary
server (still receiving the required WAL) in order to react quickly when
the process cannot succeed. This code is not enough to guarantee that the
recovery process will complete. There is a window between the walreceiver
shutdown and pg_is_in_recovery() returning false that can exhaust
NUM_CONN_ATTEMPTS attempts and fail.
Instead, rely only on the recovery_timeout option to give up
after the specified number of seconds.
---
 doc/src/sgml/ref/pg_createsubscriber.sgml |  7 -
 src/bin/pg_basebackup/pg_createsubscriber.c   | 29 ++-
 .../t/040_pg_createsubscriber.pl  |  2 ++
 3 files changed, 5 insertions(+), 33 deletions(-)

diff --git a/doc/src/sgml/ref/pg_createsubscriber.sgml b/doc/src/sgml/ref/pg_createsubscriber.sgml
index 142b02..a700697f88 100644
--- a/doc/src/sgml/ref/pg_createsubscriber.sgml
+++ b/doc/src/sgml/ref/pg_createsubscriber.sgml
@@ -325,13 +325,6 @@ PostgreSQL documentation
 connections to the target server should fail.

 
-   
-During the recovery process, if the target server disconnects from the
-source server, pg_createsubscriber will check a
-few times if the connection has been reestablished to stream the required
-WAL.  After a few attempts, it terminates with an error.
-   
-

 Since DDL commands are not replicated by logical replication, avoid
 executing DDL commands that change the database schema while running
diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index 90cc580811..f62f34b1a7 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -1360,6 +1360,9 @@ stop_standby_server(const char *datadir)
  *
  * If recovery_timeout option is set, terminate abnormally without finishing
  * the recovery process. By default, it waits forever.
+ *
+ * XXX Is the recovery process still in progress? When recovery process has a
+ * better progress reporting mechanism, it should be added here.
  */
 static void
 wait_for_end_recovery(const char *conninfo, const struct CreateSubscriberOptions *opt)
@@ -1367,9 +1370,6 @@ wait_for_end_recovery(const char *conninfo, const struct CreateSubscriberOptions
 	PGconn	   *conn;
 	int			status = POSTMASTER_STILL_STARTING;
 	int			timer = 0;
-	int			count = 0;		/* number of consecutive connection attempts */
-
-#define NUM_CONN_ATTEMPTS	10
 
 	

Re: speed up a logical replica setup

2024-05-24 Thread Shlok Kyal
On Wed, 22 May 2024 at 16:50, Amit Kapila  wrote:
>
> On Mon, May 20, 2024 at 4:30 PM Shlok Kyal  wrote:
> > >
> > > I was trying to test this utility when 'sync_replication_slots' is on
> > > and it gets in an ERROR loop [1] and never finishes. Please find the
> > > postgresql.auto used on the standby attached. I think if the standby
> > > has enabled sync_slots, you need to pass dbname in
> > > GenerateRecoveryConfig(). I couldn't test it further but I wonder if
> > > there are already synced slots on the standby (either due to
> > > 'sync_replication_slots' or users have used
> > > pg_sync_replication_slots() before invoking pg_createsubscriber),
> > > those would be retained as it is on new subscriber and lead to
> > > unnecessary WAL retention and dead rows.
> > >
> > > [1]
> > > 2024-04-30 11:50:43.239 IST [12536] LOG:  slot sync worker started
> > > 2024-04-30 11:50:43.247 IST [12536] ERROR:  slot synchronization
> > > requires dbname to be specified in primary_conninfo
> >
> > Hi,
> >
> > I tested the scenario posted by Amit in [1], in which retaining synced
> > slots lead to unnecessary WAL retention and ERROR. This is raised as
> > the second open point in [2].
> > The steps to reproduce the issue:
> > (1) Setup physical replication with sync slot feature turned on by
> > setting sync_replication_slots = 'true' or using
> > pg_sync_replication_slots() on the standby node.
> > For physical replication setup, run pg_basebackup with -R  and -d option.
> > (2) Create a logical replication slot on primary node with failover
> > option as true. A corresponding slot is created on standby as part of
> > sync slot feature.
> > (3) Run pg_createsubscriber on standby node.
> > (4) On Checking for the replication slot on standby node, I noticed
> > that the logical slots created in step 2 are retained.
> >  I have attached the script to reproduce the issue.
> >
> > I and Kuroda-san worked to resolve open points. Here are patches to
> > solve the second and third point in [2].
> > Patches proposed by Euler are also attached just in case, but they
> > were not modified.
> >
> > v2-0001: not changed
> >
>
> Shouldn't we modify it as per the suggestion given in the email [1]? I
> am wondering if we can entirely get rid of checking the primary
> business and simply rely on recovery_timeout and keep checking
> server_is_in_recovery(). If so, we can modify the test to use
> non-default recovery_timeout (say 180s or something similar if we have
> used it at any other place). As an additional check we can ensure that
> constent_lsn is present on standby.
>

I have made changes as per your suggestion. I have used
pg_last_wal_receive_lsn() to get the LSN of the last WAL received on the
standby and check whether consistent_lsn is already present there.
I have only made changes in v3-0001; v3-0002, v3-0003, v3-0004 and
v3-0005 are not modified.
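
A minimal sketch of that check, reusing the conn/str/res pattern from
pg_createsubscriber.c (the exact query and names are illustrative, not
necessarily what the v3-0001 patch contains):

	appendPQExpBuffer(str,
					  "SELECT pg_catalog.pg_last_wal_receive_lsn() >= '%s'::pg_lsn",
					  consistent_lsn);
	res = PQexec(conn, str->data);
	if (PQresultStatus(res) != PGRES_TUPLES_OK ||
		strcmp(PQgetvalue(res, 0, 0), "t") != 0)
		pg_log_error("consistent LSN %s has not been received by the standby yet",
					 consistent_lsn);
	PQclear(res);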

Thanks and Regards,
Shlok Kyal


v3-0005-Avoid-outputing-some-messages-in-dry_run-mode.patch
Description: Binary data


v3-0003-Disable-slot-sync-during-the-convertion.patch
Description: Binary data


v3-0001-Improve-the-code-that-checks-if-the-recovery-is-f.patch
Description: Binary data


v3-0004-Drop-replication-slots-which-had-been-synchronize.patch
Description: Binary data


v3-0002-Improve-the-code-that-checks-if-the-primary-slot-.patch
Description: Binary data


Re: speed up a logical replica setup

2024-05-23 Thread Amit Kapila
On Thu, May 23, 2024 at 8:43 PM Euler Taveira  wrote:
>
> On Thu, May 23, 2024, at 5:54 AM, Amit Kapila wrote:
>
>
> Why in the first place do we need to ensure that primary_slot_name is
> active on the primary? You mentioned something related to WAL
> retention but I don't know how that is related to this tool's
> functionality. If at all, we are bothered about WAL retention on the
> primary that should be the WAL corresponding to consistent_lsn
> computed by setup_publisher() but this check doesn't seem to ensure
> that.
>
> Maybe it is a lot of checks. I'm afraid there isn't a simple way to get and
> make sure the replication slot is used by the physical replication. I mean if
> there is primary_slot_name = 'foo' on standby, there is no guarantee that the
> replication slot 'foo' exists on primary. The idea is to get the exact
> replication slot name used by physical replication to drop it. Once I posted a
> patch it should be clear. (Another idea is to relax this check and rely only 
> on
> primary_slot_name to drop this replication slot on primary. The replication 
> slot
> might not exist and it shouldn't return an error in this case.)
>

I think your other idea is better than what we are doing currently.
Let's ignore the ERROR even if the primary_slot_name doesn't exist on
the primary.
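
In code form, that relaxed behavior could look roughly like this (a sketch
only; it reuses the conn/str/res/slot_name_esc pattern from the existing code
and is not the actual patch):

	appendPQExpBuffer(str, "SELECT pg_catalog.pg_drop_replication_slot(%s)",
					  slot_name_esc);
	res = PQexec(conn, str->data);
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
		/* the slot might not exist; warn and keep going */
		pg_log_warning("could not drop replication slot \"%s\" on primary: %s",
					   slot_name, PQresultErrorMessage(res));
	PQclear(res);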

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-05-23 Thread Euler Taveira
On Thu, May 23, 2024, at 5:54 AM, Amit Kapila wrote:
> On Wed, May 22, 2024 at 8:46 PM Euler Taveira  wrote:
> >
> > Following the same line that simplifies the code, we can: (a) add a loop in
> > check_subscriber() that waits until walreceiver is available on subscriber 
> > or
> > (b) use a timeout. The main advantage of (a) is that the primary slot is 
> > already
> > available but I'm afraid we need a escape mechanism for the loop (timeout?).
> >
> 
> Sorry, it is not clear to me why we need any additional loop in
> check_subscriber(), aren't we speaking about the problem in
> check_publisher() function?

The idea is to use check_subscriber() to check pg_stat_wal_receiver. Once this
view returns a row and primary_slot_name is set on the standby, the referred
replication slot should be active on the primary. Hence, the query in
check_publisher() makes sure that the referred replication slot is in use on
the primary.

> Why in the first place do we need to ensure that primary_slot_name is
> active on the primary? You mentioned something related to WAL
> retention but I don't know how that is related to this tool's
> functionality. If at all, we are bothered about WAL retention on the
> primary that should be the WAL corresponding to consistent_lsn
> computed by setup_publisher() but this check doesn't seem to ensure
> that.

Maybe it is a lot of checks. I'm afraid there isn't a simple way to make sure
the replication slot is actually being used by the physical replication. I
mean, if there is primary_slot_name = 'foo' on the standby, there is no
guarantee that the replication slot 'foo' exists on the primary. The idea is to
get the exact replication slot name used by physical replication in order to
drop it. Once I post a patch it should be clearer. (Another idea is to relax
this check and rely only on primary_slot_name to drop this replication slot on
the primary. The replication slot might not exist, and that shouldn't raise an
error in this case.)


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: speed up a logical replica setup

2024-05-23 Thread Amit Kapila
On Wed, May 22, 2024 at 8:46 PM Euler Taveira  wrote:
>
> On Wed, May 22, 2024, at 8:19 AM, Amit Kapila wrote:
>
> > v2-0002: not changed
> >
>
> We have added more tries to see if the primary_slot_name becomes
> active but I think it is still fragile because it is possible on slow
> machines that the required slot didn't become active even after more
> retries. I have raised the same comment previously [2] and asked an
> additional question but didn't get any response.
>
>
> Following the same line that simplifies the code, we can: (a) add a loop in
> check_subscriber() that waits until walreceiver is available on subscriber or
> (b) use a timeout. The main advantage of (a) is that the primary slot is 
> already
> available but I'm afraid we need a escape mechanism for the loop (timeout?).
>

Sorry, it is not clear to me why we need any additional loop in
check_subscriber(); aren't we talking about the problem in the
check_publisher() function?

Why in the first place do we need to ensure that primary_slot_name is
active on the primary? You mentioned something related to WAL
retention but I don't know how that is related to this tool's
functionality. If at all, we are bothered about WAL retention on the
primary that should be the WAL corresponding to consistent_lsn
computed by setup_publisher() but this check doesn't seem to ensure
that.

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-05-22 Thread Euler Taveira
On Wed, May 22, 2024, at 8:19 AM, Amit Kapila wrote:
> >
> > v2-0001: not changed
> >
> 
> Shouldn't we modify it as per the suggestion given in the email [1]? I
> am wondering if we can entirely get rid of checking the primary
> business and simply rely on recovery_timeout and keep checking
> server_is_in_recovery(). If so, we can modify the test to use
> non-default recovery_timeout (say 180s or something similar if we have
> used it at any other place). As an additional check we can ensure that
> constent_lsn is present on standby.

That's exactly what I want to propose, as Tomas convinced me offlist that less
is better while we don't have a useful recovery progress reporting mechanism to
tell us that the server is still working on the recovery and we should keep
waiting.

> > v2-0002: not changed
> >
> 
> We have added more tries to see if the primary_slot_name becomes
> active but I think it is still fragile because it is possible on slow
> machines that the required slot didn't become active even after more
> retries. I have raised the same comment previously [2] and asked an
> additional question but didn't get any response.

Following the same line of simplifying the code, we can (a) add a loop in
check_subscriber() that waits until the walreceiver is available on the
subscriber, or (b) use a timeout. The main advantage of (a) is that the primary
slot is already available, but I'm afraid we need an escape mechanism for the
loop (a timeout?).

I'll summarize all issues as soon as I finish the review of sync slot support. I
think we should avoid new development if we judge that the item can be
documented as a limitation for this version. Nevertheless, I will share patches
so you can give your opinion on whether it is an open item or new development.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: speed up a logical replica setup

2024-05-22 Thread Amit Kapila
On Mon, May 20, 2024 at 4:30 PM Shlok Kyal  wrote:
> >
> > I was trying to test this utility when 'sync_replication_slots' is on
> > and it gets in an ERROR loop [1] and never finishes. Please find the
> > postgresql.auto used on the standby attached. I think if the standby
> > has enabled sync_slots, you need to pass dbname in
> > GenerateRecoveryConfig(). I couldn't test it further but I wonder if
> > there are already synced slots on the standby (either due to
> > 'sync_replication_slots' or users have used
> > pg_sync_replication_slots() before invoking pg_createsubscriber),
> > those would be retained as it is on new subscriber and lead to
> > unnecessary WAL retention and dead rows.
> >
> > [1]
> > 2024-04-30 11:50:43.239 IST [12536] LOG:  slot sync worker started
> > 2024-04-30 11:50:43.247 IST [12536] ERROR:  slot synchronization
> > requires dbname to be specified in primary_conninfo
>
> Hi,
>
> I tested the scenario posted by Amit in [1], in which retaining synced
> slots lead to unnecessary WAL retention and ERROR. This is raised as
> the second open point in [2].
> The steps to reproduce the issue:
> (1) Setup physical replication with sync slot feature turned on by
> setting sync_replication_slots = 'true' or using
> pg_sync_replication_slots() on the standby node.
> For physical replication setup, run pg_basebackup with -R  and -d option.
> (2) Create a logical replication slot on primary node with failover
> option as true. A corresponding slot is created on standby as part of
> sync slot feature.
> (3) Run pg_createsubscriber on standby node.
> (4) On Checking for the replication slot on standby node, I noticed
> that the logical slots created in step 2 are retained.
>  I have attached the script to reproduce the issue.
>
> I and Kuroda-san worked to resolve open points. Here are patches to
> solve the second and third point in [2].
> Patches proposed by Euler are also attached just in case, but they
> were not modified.
>
> v2-0001: not changed
>

Shouldn't we modify it as per the suggestion given in the email [1]? I
am wondering if we can entirely get rid of checking the primary and
simply rely on recovery_timeout while repeatedly checking
server_is_in_recovery(). If so, we can modify the test to use a
non-default recovery_timeout (say 180s, or something similar if we have
used it at any other place). As an additional check we can ensure that
consistent_lsn is present on the standby.

> v2-0002: not changed
>

We have added more tries to see if the primary_slot_name becomes
active but I think it is still fragile because it is possible on slow
machines that the required slot didn't become active even after more
retries. I have raised the same comment previously [2] and asked an
additional question but didn't get any response.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1JJq_ER6Kq_H%3DjKHR75QPRd8y9_D%3DRtYw%3DaPYKMfqLi9A%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CAA4eK1LT3Z13Dg6p4Z%2B4adO_EY-ow5CmWfikEmBfL%3DeVrm8CPw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-05-20 Thread Shlok Kyal
Hi,
>
> I was trying to test this utility when 'sync_replication_slots' is on
> and it gets in an ERROR loop [1] and never finishes. Please find the
> postgresql.auto used on the standby attached. I think if the standby
> has enabled sync_slots, you need to pass dbname in
> GenerateRecoveryConfig(). I couldn't test it further but I wonder if
> there are already synced slots on the standby (either due to
> 'sync_replication_slots' or users have used
> pg_sync_replication_slots() before invoking pg_createsubscriber),
> those would be retained as it is on new subscriber and lead to
> unnecessary WAL retention and dead rows.
>
> [1]
> 2024-04-30 11:50:43.239 IST [12536] LOG:  slot sync worker started
> 2024-04-30 11:50:43.247 IST [12536] ERROR:  slot synchronization
> requires dbname to be specified in primary_conninfo

Hi,

I tested the scenario posted by Amit in [1], in which retaining synced
slots leads to unnecessary WAL retention and an ERROR. This was raised
as the second open point in [2].
The steps to reproduce the issue:
(1) Setup physical replication with sync slot feature turned on by
setting sync_replication_slots = 'true' or using
pg_sync_replication_slots() on the standby node.
For physical replication setup, run pg_basebackup with -R  and -d option.
(2) Create a logical replication slot on primary node with failover
option as true. A corresponding slot is created on standby as part of
sync slot feature.
(3) Run pg_createsubscriber on standby node.
(4) On checking the replication slots on the standby node, I noticed
that the logical slots created in step 2 are retained.
I have attached the script to reproduce the issue.

Kuroda-san and I worked to resolve the open points. Here are patches to
solve the second and third points in [2].
The patches proposed by Euler are also attached just in case, but they
were not modified.

v2-0001: not changed
v2-0002: not changed
v2-0003: ensures the slot sync is disabled during the conversion. This
resolves the second point.
v2-0004: drops sync slots which may be retained after running. This
resolves the second point.
v2-0005: removes misleading output messages in dry-run. This resolves
the third point.

[1]: 
https://www.postgresql.org/message-id/CAA4eK1KdCb%2B5sjYu6qCMXXdCX1y_ihr8kFzMozq0%3DP%3DauYxgog%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/CAA4eK1J22UEfrqx222h5j9DQ7nxGrTbAa_BC%2B%3DmQXdXs-RCsew%40mail.gmail.com

Thanks and Regards,
Shlok Kyal
# cleanup
rm -rf ../primary ../standby primary.log standby.log

# setup primary node
./initdb -D ../primary
 
cat << EOF >> ../primary/postgresql.conf
wal_level = logical
EOF

./pg_ctl -D ../primary -l primary.log start
./psql -d postgres -c "CREATE table t1 (c1 int);"
./psql -d postgres -c "Insert into t1 values(1);"
./psql -d postgres -c "Insert into t1 values(2);"
./psql -d postgres -c "INSERT into t1 values(3);"
./psql -d postgres -c "INSERT into t1 values(4);"

# setup standby node with sync slot feature
./pg_basebackup -h localhost -X stream -v -W -R -S 'sb1' -C -D ../standby/ -d 'dbname=postgres' 

cat << EOF >> ../standby/postgresql.conf
sync_replication_slots = 'true'
hot_standby_feedback = 'on'
port = 9000
EOF

./pg_ctl -D ../standby -l standby.log start

# create a replication slot on primary with failover option
./psql -d postgres -c "SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_slot_2', 'test_decoding', false, false, true);"
sleep 2

# check if slots are synced on standby
./psql -d postgres -p 9000 -c "select * from pg_replication_slots;" 
./pg_ctl -D ../standby -l standby.log stop

# run pg_createsubscriber on standby
./pg_createsubscriber -D ../standby/ -p 9000 -P "host=localhost port=5432 dbname=postgres" -d postgres -v

# check if replication slots are retained after running pg_createsubscriber
./pg_ctl -D ../standby -l standby.log start
./psql -d postgres -p 9000 -c "select * from pg_replication_slots;" 

./pg_ctl -D ../standby -l standby.log stop
./pg_ctl -D ../primary -l primary.log stop


v2-0001-Improve-the-code-that-checks-if-the-recovery-is-f.patch
Description: Binary data


v2-0002-Improve-the-code-that-checks-if-the-primary-slot-.patch
Description: Binary data


v2-0003-Disable-slot-sync-during-the-convertion.patch
Description: Binary data


v2-0004-Drop-replication-slots-which-had-been-synchronize.patch
Description: Binary data


v2-0005-Avoid-outputing-some-messages-in-dry_run-mode.patch
Description: Binary data


Re: speed up a logical replica setup

2024-05-20 Thread Amit Kapila
On Sun, May 19, 2024 at 4:25 AM Thomas Munro  wrote:
>
> 040_pg_createsubscriber.pl seems to be failing occasionally on
> culicidae near "--dry-run on node S".  I couldn't immediately see why.
> That animal is using EXEC_BACKEND and I also saw a one-off failure a
> bit like that on my own local Linux + EXEC_BACKEND test run
> (sorry I didn't keep the details around).  Coincidence?
>

AFAICS, this is the same as one of the two BF failures being discussed
in this thread.

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-05-18 Thread Thomas Munro
040_pg_createsubscriber.pl seems to be failing occasionally on
culicidae near "--dry-run on node S".  I couldn't immediately see why.
That animal is using EXEC_BACKEND and I also saw a one-off failure a
bit like that on my own local Linux + EXEC_BACKEND test run
(sorry I didn't keep the details around).  Coincidence?




Re: speed up a logical replica setup

2024-04-30 Thread Amit Kapila
On Tue, Apr 30, 2024 at 12:04 PM Amit Kapila  wrote:
>
> On Mon, Apr 29, 2024 at 5:28 PM Amit Kapila  wrote:
> >
> > On Mon, Apr 29, 2024 at 5:23 PM Euler Taveira  wrote:
> >
>
> I was trying to test this utility when 'sync_replication_slots' is on
> and it gets in an ERROR loop [1] and never finishes. Please find the
> postgresql.auto used on the standby attached. I think if the standby
> has enabled sync_slots, you need to pass dbname in
> GenerateRecoveryConfig().

The other possibility is that when we start standby from
pg_createsubscriber, we specifically set 'sync_replication_slots' as
false.

>
> I couldn't test it further but I wonder if
> there are already synced slots on the standby (either due to
> 'sync_replication_slots' or users have used
> pg_sync_replication_slots() before invoking pg_createsubscriber),
> those would be retained as it is on new subscriber and lead to
> unnecessary WAL retention and dead rows.
>

This still needs some handling.

BTW, I don't see the use of the following messages in --dry-run mode;
rather, they could be misleading:
pg_createsubscriber: hint: If pg_createsubscriber fails after this
point, you must recreate the physical replica before continuing.
...
...
pg_createsubscriber: setting the replication progress (node name
"pg_0" ; LSN 0/0) on database "postgres"

Similarly, we should think if below messages are useful in --dry-run mode:
pg_createsubscriber: dropping publication
"pg_createsubscriber_5_887f7991" on database "postgres"
pg_createsubscriber: creating subscription
"pg_createsubscriber_5_887f7991" on database "postgres"
...
pg_createsubscriber: enabling subscription
"pg_createsubscriber_5_887f7991" on database "postgres"

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-04-30 Thread Amit Kapila
On Mon, Apr 29, 2024 at 5:28 PM Amit Kapila  wrote:
>
> On Mon, Apr 29, 2024 at 5:23 PM Euler Taveira  wrote:
>

I was trying to test this utility when 'sync_replication_slots' is on
and it gets in an ERROR loop [1] and never finishes. Please find the
postgresql.auto used on the standby attached. I think if the standby
has enabled sync_slots, you need to pass dbname in
GenerateRecoveryConfig(). I couldn't test it further but I wonder if
there are already synced slots on the standby (either due to
'sync_replication_slots' or users have used
pg_sync_replication_slots() before invoking pg_createsubscriber),
those would be retained as it is on new subscriber and lead to
unnecessary WAL retention and dead rows.

[1]
2024-04-30 11:50:43.239 IST [12536] LOG:  slot sync worker started
2024-04-30 11:50:43.247 IST [12536] ERROR:  slot synchronization
requires dbname to be specified in primary_conninfo

-- 
With Regards,
Amit Kapila.


postgresql.auto.standby.conf
Description: Binary data


Re: speed up a logical replica setup

2024-04-29 Thread Amit Kapila
On Mon, Apr 29, 2024 at 5:23 PM Euler Taveira  wrote:
>
> On Mon, Apr 29, 2024, at 6:56 AM, Amit Kapila wrote:
>
> On Wed, Mar 27, 2024 at 1:47 AM Euler Taveira  wrote:
> >
> > On Tue, Mar 26, 2024, at 4:12 PM, Tomas Vondra wrote:
> >
> > Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
> > Why isn't recovery_timeout enough to decide if wait_for_end_recovery()
> > waited long enough?
> >
> >
> > It was an attempt to decoupled a connection failure (that keeps streaming 
> > the
> > WAL) from recovery timeout. The NUM_CONN_ATTEMPTS guarantees that if the 
> > primary
> > is gone during the standby recovery process, there is a way to bail out.
> >
>
> I think we don't need to check primary if the WAL corresponding to
> consistent_lsn is already present on the standby. Shouldn't we first
> check that? Once we ensure that the required WAL is copied, just
> checking server_is_in_recovery() should be sufficient. I feel that
> will be a direct way of ensuring what is required rather than
> indirectly verifying the same (by checking pg_stat_wal_receiver) as we
> are doing currently.
>
>
> How would you check it? WAL file? During recovery, you are not allowed to use
> pg_current_wal_lsn.
>

How about pg_last_wal_receive_lsn()?

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-04-29 Thread Euler Taveira
On Mon, Apr 29, 2024, at 6:56 AM, Amit Kapila wrote:
> On Wed, Mar 27, 2024 at 1:47 AM Euler Taveira  wrote:
> >
> > On Tue, Mar 26, 2024, at 4:12 PM, Tomas Vondra wrote:
> >
> > Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
> > Why isn't recovery_timeout enough to decide if wait_for_end_recovery()
> > waited long enough?
> >
> >
> > It was an attempt to decoupled a connection failure (that keeps streaming 
> > the
> > WAL) from recovery timeout. The NUM_CONN_ATTEMPTS guarantees that if the 
> > primary
> > is gone during the standby recovery process, there is a way to bail out.
> >
> 
> I think we don't need to check primary if the WAL corresponding to
> consistent_lsn is already present on the standby. Shouldn't we first
> check that? Once we ensure that the required WAL is copied, just
> checking server_is_in_recovery() should be sufficient. I feel that
> will be a direct way of ensuring what is required rather than
> indirectly verifying the same (by checking pg_stat_wal_receiver) as we
> are doing currently.

How would you check it? WAL file? During recovery, you are not allowed to use
pg_current_wal_lsn.

Tomas suggested to me off-list that we should adopt a simple solution in
wait_for_end_recovery: wait for recovery_timeout without additional checks
(which means remove the pg_stat_wal_receiver logic).  When we have additional
information that we can reliably use in this function, we can add it. Hence, it
is also easy to adjust the PG_TEST_TIMEOUT_DEFAULT to have stable tests.
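
In code form, the simplified wait loop would look roughly like this (a sketch
under the assumptions above; conn, timer, and opt follow the existing
wait_for_end_recovery() code, and this is not the exact committed function):

	for (;;)
	{
		if (!server_is_in_recovery(conn))
			break;				/* recovery has finished */

		/* Bail out after recovery_timeout seconds, if set */
		if (opt->recovery_timeout > 0 && timer >= opt->recovery_timeout)
			pg_fatal("recovery did not end after %d seconds",
					 opt->recovery_timeout);

		pg_usleep(1000000);		/* wait one second and retry */
		timer++;
	}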


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: speed up a logical replica setup

2024-04-29 Thread Amit Kapila
On Tue, Mar 26, 2024 at 8:24 AM Euler Taveira  wrote:
>
> On Mon, Mar 25, 2024, at 1:06 PM, Hayato Kuroda (Fujitsu) wrote:
>
> The first patch implements a combination of (1) and (2).
>
> ## Analysis for failure 2
>
> According to [2], the physical replication slot which is specified as 
> primary_slot_name
> was not used by the walsender process. At that time walsender has not existed.
>
> ```
> ...
> pg_createsubscriber: publisher: current wal senders: 0
> pg_createsubscriber: command is: SELECT 1 FROM 
> pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
> pg_createsubscriber: error: could not obtain replication slot information: 
> got 0 rows, expected 1 row
> ...
> ```
>
> Currently standby must be stopped before the command and current code does not
> block the flow to ensure the replication is started. So there is a possibility
> that the checking is run before walsender is launched.
>
> One possible approach is to wait until the replication starts. Alternative 
> one is
> to ease the condition.
>
>
> That's my suggestion too. I reused NUM_CONN_ATTEMPTS (that was renamed to
> NUM_ATTEMPTS in the first patch). See second patch.
>

How can we guarantee that the slot becomes active within this many
attempts? It could still be possible that on some slow machines it
doesn't get activated even after NUM_ATTEMPTS. BTW, in the first place,
why do we need to ensure that the 'primary_slot_name' is active on the
primary?

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-04-29 Thread Amit Kapila
On Wed, Mar 27, 2024 at 1:47 AM Euler Taveira  wrote:
>
> On Tue, Mar 26, 2024, at 4:12 PM, Tomas Vondra wrote:
>
> Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
> Why isn't recovery_timeout enough to decide if wait_for_end_recovery()
> waited long enough?
>
>
> It was an attempt to decoupled a connection failure (that keeps streaming the
> WAL) from recovery timeout. The NUM_CONN_ATTEMPTS guarantees that if the 
> primary
> is gone during the standby recovery process, there is a way to bail out.
>

I think we don't need to check primary if the WAL corresponding to
consistent_lsn is already present on the standby. Shouldn't we first
check that? Once we ensure that the required WAL is copied, just
checking server_is_in_recovery() should be sufficient. I feel that
will be a direct way of ensuring what is required rather than
indirectly verifying the same (by checking pg_stat_wal_receiver) as we
are doing currently.

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-03-26 Thread Tomas Vondra
On 3/26/24 21:17, Euler Taveira wrote:
> On Tue, Mar 26, 2024, at 4:12 PM, Tomas Vondra wrote:
>> Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
>> Why isn't recovery_timeout enough to decide if wait_for_end_recovery()
>> waited long enough?
> 
> It was an attempt to decoupled a connection failure (that keeps streaming the
> WAL) from recovery timeout. The NUM_CONN_ATTEMPTS guarantees that if the 
> primary
> is gone during the standby recovery process, there is a way to bail out. The
> recovery-timeout is 0 (infinite) by default so you have an infinite wait 
> without
> this check. The idea behind this implementation is to avoid exiting in this
> critical code path. If it times out here you might have to rebuild the standby
> and start again.

- This seems like something that should definitely be documented in the
comment before wait_for_end_recovery(). At the moment it only talks
about timeout, and nothing about NUM_CONN_ATTEMPTS.

- The NUM_CONN_ATTEMPTS name seems rather misleading, considering it
does not really count connection attempts, but number of times we have
not seen 1 in pg_catalog.pg_stat_wal_receiver.

- Not sure I follow the logic - it tries to avoid exiting by setting an
infinite timeout, but it still exits based on NUM_CONN_ATTEMPTS. Isn't
that somewhat contradictory?

- Isn't the NUM_CONN_ATTEMPTS actually making it more fragile, i.e. more
likely to exit? For example, what if there's a short networking hiccup,
so that the standby can't connect to the primary.

- It seems a bit strange that even with the recovery timeout set, having
the limit of 10 "connection attempts" effectively establishes a separate
hard-coded limit of 10 seconds. Seems a bit surprising if I set recovery
limit to 1 minute, and it just dies after 10 seconds.

> Amit suggested [1] that we use a value as recovery-timeout but
> how high is a good value? I've already saw some long recovery process using
> pglogical equivalent that timeout out after hundreds of minutes. Maybe I'm too
> worried about a small percentage of cases and we should use 1h as default, for
> example. It would reduce the complexity since the recovery process lacks some
> progress indicators (LSN is not sufficient in this case and there isn't a
> function to provide the current state -- stop applying WAL, reach target, new
> timeline, etc).
> 
> If we remove the pg_stat_wal_receiver check, we should avoid infinite recovery
> by default otherwise we will have some reports saying the tool is hanging when
> in reality the primary has gone and WAL should be streamed.
> 

I don't think there's a default timeout value that would work for
everyone. Either it's going to be too short for some cases, or it'll
take too long for some other cases.

I think there are two obvious default values for the timeout - infinity,
and 60 seconds, which is the default we use for other CLI tools (like
pg_ctl and so on). Considering the negative impact of exiting, I'd say
it's better to default to infinity. It's always possible to Ctrl-C or
terminate the process in some other way, if needed.

As for people complaining about infinite recovery - perhaps it'd be
sufficient to mention this in the messages printed by the tool, to make
it clearer. Or maybe even print something in the loop, because right now
it's entirely silent so it's easy to believe it's stuck. Perhaps not on
every loop, but at least in verbose mode it should print something.
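
As an illustration, the wait loop could report progress every few seconds, for
example (a sketch only, reusing the tool's existing server_is_in_recovery()
helper and the recovery_timeout option, and assuming a connection "conn" to the
target server plus an integer "elapsed" counter initialized to zero):

```
for (;;)
{
	PGresult   *res;

	if (!server_is_in_recovery(conn))
		break;					/* recovery has ended */

	if (recovery_timeout > 0 && elapsed >= recovery_timeout)
		pg_fatal("recovery did not end after %d seconds", recovery_timeout);

	/* report progress periodically so the wait is not silent */
	if (elapsed % 5 == 0)
	{
		res = PQexec(conn, "SELECT pg_catalog.pg_last_wal_replay_lsn()");
		if (PQresultStatus(res) == PGRES_TUPLES_OK)
			pg_log_info("waiting for recovery to end, replay position: %s",
						PQgetvalue(res, 0, 0));
		PQclear(res);
	}

	pg_usleep(1000000);			/* sleep 1 second between checks */
	elapsed++;
}
```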

>> IMHO the test should simply pass PG_TEST_DEFAULT_TIMEOUT when calling
>> pg_createsubscriber, and that should do the trick.
> 
> That's a good idea. Tests are not exercising the recovery-timeout option.
> 
>> Increasing PG_TEST_DEFAULT_TIMEOUT is what buildfarm animals doing
>> things like ubsan/valgrind already use to deal with exactly this kind of
>> timeout problem.
>>
>> Or is there a deeper problem with deciding if the system is in recovery?
> 
> As I said with some recovery progress indicators it would be easier to make 
> some
> decisions like wait a few seconds because the WAL has already been applied and
> it is creating a new timeline. The recovery timeout decision is a shot in the
> dark because we might be aborting pg_createsubscriber when the target server 
> is
> about to set RECOVERY_STATE_DONE.
> 

Isn't it enough to check data in pg_stat_replication on the primary?
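
A rough sketch of what that check might look like, for illustration (names like
"primary_conn" and the hard-coded slot name are assumptions, not code from the
patch):

```
/* Ask the primary whether a walsender is streaming to the standby's slot. */
res = PQexec(primary_conn,
			 "SELECT 1 FROM pg_catalog.pg_stat_replication r "
			 "JOIN pg_catalog.pg_replication_slots s ON s.active_pid = r.pid "
			 "WHERE s.slot_name = 'physical_slot'");
if (PQresultStatus(res) != PGRES_TUPLES_OK)
	pg_log_error("could not query pg_stat_replication: %s",
				 PQresultErrorMessage(res));
else if (PQntuples(res) == 0)
	pg_log_warning("no walsender is streaming to the standby's replication slot");
PQclear(res);
```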


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: speed up a logical replica setup

2024-03-26 Thread Euler Taveira
On Tue, Mar 26, 2024, at 4:12 PM, Tomas Vondra wrote:
> Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
> Why isn't recovery_timeout enough to decide if wait_for_end_recovery()
> waited long enough?

It was an attempt to decouple a connection failure (that keeps streaming the
WAL) from the recovery timeout. NUM_CONN_ATTEMPTS guarantees that if the primary
is gone during the standby recovery process, there is a way to bail out. The
recovery-timeout is 0 (infinite) by default, so you have an infinite wait without
this check. The idea behind this implementation is to avoid exiting in this
critical code path. If it times out here you might have to rebuild the standby
and start again. Amit suggested [1] that we use a default value for
recovery-timeout, but how high is a good value? I've already seen some long
recovery processes using the pglogical equivalent that timed out after hundreds
of minutes. Maybe I'm too worried about a small percentage of cases and we
should use 1h as the default, for example. It would reduce the complexity, since
the recovery process lacks progress indicators (the LSN is not sufficient in
this case and there isn't a function to provide the current state -- stop
applying WAL, reach target, new timeline, etc.).

If we remove the pg_stat_wal_receiver check, we should avoid infinite recovery
by default; otherwise we will get reports saying the tool is hanging when in
reality the primary is gone and the WAL that should be streamed never arrives.

> IMHO the test should simply pass PG_TEST_DEFAULT_TIMEOUT when calling
> pg_createsubscriber, and that should do the trick.

That's a good idea. Tests are not exercising the recovery-timeout option.

> Increasing PG_TEST_DEFAULT_TIMEOUT is what buildfarm animals doing
> things like ubsan/valgrind already use to deal with exactly this kind of
> timeout problem.
> 
> Or is there a deeper problem with deciding if the system is in recovery?

As I said, with some recovery progress indicators it would be easier to make
decisions such as waiting a few more seconds because the WAL has already been
applied and the server is creating a new timeline. The recovery timeout decision
is a shot in the dark because we might abort pg_createsubscriber when the target
server is about to set RECOVERY_STATE_DONE.


[1] 
https://www.postgresql.org/message-id/CAA4eK1JRgnhv_ySzuFjN7UaX9qxz5Hqcwew7Fk%3D%2BAbJbu0Kd9w%40mail.gmail.com


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: speed up a logical replica setup

2024-03-26 Thread Tomas Vondra
On 3/26/24 03:53, Euler Taveira wrote:
> On Mon, Mar 25, 2024, at 1:06 PM, Hayato Kuroda (Fujitsu) wrote:
>> ## Analysis for failure 1
>>
>> The failure caused by a time lag between walreceiver finishes and 
>> pg_is_in_recovery()
>> returns true.
>>
>> According to the output [1], it seems that the tool failed at 
>> wait_for_end_recovery()
>> with the message "standby server disconnected from the primary". Also, lines
>> "redo done at..." and "terminating walreceiver process due to administrator 
>> command"
>> meant that walreceiver was requested to shut down by XLogShutdownWalRcv().
>>
>> According to the source, we confirm that walreceiver is shut down in
>> StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also, 
>> SharedRecoveryState
>> is changed to RECOVERY_STATE_DONE (this meant the pg_is_in_recovery() return 
>> true)
>> at the latter part of StartupXLOG().
>>
>> So, if there is a delay between FinishWalRecovery() and change the state, 
>> the check
>> in wait_for_end_recovery() would be failed during the time. Since we allow 
>> to miss
>> the walreceiver 10 times and it is checked once per second, the failure 
>> occurs if
>> the time lag is longer than 10 seconds.
>>
>> I do not have a good way to fix it. One approach is make NUM_CONN_ATTEMPTS 
>> larger,
>> but it's not a fundamental solution.
> 
> I was expecting that slow hosts might have issues in wait_for_end_recovery().
> As you said it took a lot of steps between FinishWalRecovery() (where
> walreceiver is shutdown -- XLogShutdownWalRcv) and SharedRecoveryState is set 
> to
> RECOVERY_STATE_DONE. If this window takes longer than NUM_CONN_ATTEMPTS *
> WAIT_INTERVAL (10 seconds), it aborts the execution. That's a bad decision
> because it already finished the promotion and it is just doing the final
> preparation for the host to become a primary.
> 
> /*   
>  * If it is still in recovery, make sure the target server is
>  * connected to the primary so it can receive the required WAL to
>  * finish the recovery process. If it is disconnected try
>  * NUM_CONN_ATTEMPTS in a row and bail out if not succeed.
>  */
> res = PQexec(conn,
>  "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");
> if (PQntuples(res) == 0)
> {
> if (++count > NUM_CONN_ATTEMPTS)
> {
> stop_standby_server(subscriber_dir);
> pg_log_error("standby server disconnected from the primary");
> break;
> }
> }
> else
> count = 0;  /* reset counter if it connects again */
> 
> This code was add to defend against the death/crash of the target server. 
> There
> are at least 3 options:
> 
> (1) increase NUM_CONN_ATTEMPTS * WAIT_INTERVAL seconds. We discussed this 
> constant
> and I decided to use 10 seconds because even in some slow hosts, this time
> wasn't reached during my tests. It seems I forgot to test the combination of 
> slow
> host, asserts enabled, and ubsan. I didn't notice that pg_promote() uses 60
> seconds as default wait. Maybe that's a reasonable value. I checked the
> 004_timeline_switch test and the last run took: 39.2s (serinus), 33.1s
> (culicidae), 18.31s (calliphoridae) and 27.52s (olingo).
> 
> (2) check if the primary is not running when walreceiver is not
> available on the target server. Increase the connection attempts iif the
> primary is not running.
> Hence, the described case doesn't cause an increment on the count variable.
> 
> (3) set recovery_timeout default to != 0 and remove pg_stat_wal_receiver check
> protection against the death/crash target server. I explained in a previous
> message that timeout may occur in cases that WAL replay to reach consistent
> state takes more than recovery-timeout seconds.
> 
> Option (1) is the easiest fix, however, we can have the same issue again if a
> slow host decides to be even slower, hence, we have to adjust this value 
> again.
> Option (2) interprets the walreceiver absence as a recovery end and if the
> primary server is running it can indicate that the target server is in the
> imminence of the recovery end. Option (3) is not as resilient as the other
> options.
> 
> The first patch implements a combination of (1) and (2).
> 
>> ## Analysis for failure 2
>>
>> According to [2], the physical replication slot which is specified as 
>> primary_slot_name
>> was not used by the walsender process. At that time walsender has not 
>> existed.
>>
>> ```
>> ...
>> pg_createsubscriber: publisher: current wal senders: 0
>> pg_createsubscriber: command is: SELECT 1 FROM 
>> pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
>> pg_createsubscriber: error: could not obtain replication slot information: 
>> got 0 rows, expected 1 row
>> ...
>> ```
>>
>> Currently standby must be stopped before the command and current code does 
>> not
>> block the flow to ensure the replication is started.

RE: speed up a logical replica setup

2024-03-25 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Euler,

> 
> This only drops the publications created by this tool, not the
> pre-existing ones that we discussed in the link provided.

Another concern here is the case where the primary itself subscribes to changes
from other nodes. After the conversion, the new subscriber also tries to connect
to those publishers, which may lead to conflicts. This happens because both the
launcher and the apply workers start after recovery finishes. So, based on
Ashutosh's point, should we remove such replication objects?
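
For illustration only, such a cleanup could disable the inherited subscriptions
on the converted standby before the apply workers get a chance to start; the
query and loop below are an assumption, not code from the patch (identifier
quoting is omitted for brevity):

```
/* Disable pre-existing subscriptions in the current database. */
res = PQexec(conn,
			 "SELECT subname FROM pg_catalog.pg_subscription "
			 "WHERE subdbid = (SELECT oid FROM pg_catalog.pg_database "
			 "                 WHERE datname = pg_catalog.current_database())");
if (PQresultStatus(res) == PGRES_TUPLES_OK)
{
	for (int i = 0; i < PQntuples(res); i++)
	{
		char	   *cmd = psprintf("ALTER SUBSCRIPTION %s DISABLE",
								   PQgetvalue(res, i, 0));
		PGresult   *res2 = PQexec(conn, cmd);

		if (PQresultStatus(res2) != PGRES_COMMAND_OK)
			pg_log_error("could not disable subscription \"%s\": %s",
						 PQgetvalue(res, i, 0), PQresultErrorMessage(res2));
		PQclear(res2);
		free(cmd);
	}
}
PQclear(res);
```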

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: speed up a logical replica setup

2024-03-25 Thread Amit Kapila
On Tue, Mar 26, 2024 at 8:27 AM Euler Taveira  wrote:
>
> On Mon, Mar 25, 2024, at 11:33 PM, Amit Kapila wrote:
>
> On Mon, Mar 25, 2024 at 5:25 PM Peter Eisentraut  wrote:
> >
> > I have committed your version v33.  I did another pass over the
> > identifier and literal quoting.  I added quoting for replication slot
> > names, for example, even though they can only contain a restricted set
> > of characters, but it felt better to be defensive there.
> >
> > I'm happy to entertain follow-up patches on some of the details like
> > option naming that were still being discussed.  I just wanted to get the
> > main functionality in in good time.  We can fine-tune the rest over the
> > next few weeks.
> >
>
> I was looking at prior discussions on this topic to see if there are
> any other open design points apart from this and noticed that the
> points raised/discussed in the email [1] are also not addressed. IIRC,
> the key point we discussed was that after promotion, the existing
> replication objects should be removed (either optionally or always),
> otherwise, it can lead to a new subscriber not being able to restart
> or getting some unwarranted data.
>
>
> See setup_subscriber.
>
> /*
>  * Since the publication was created before the consistent LSN, it is
>  * available on the subscriber when the physical replica is promoted.
>  * Remove publications from the subscriber because it has no use.
>  */
> drop_publication(conn, &dbinfo[i]);
>

This only drops the publications created by this tool, not the
pre-existing ones that we discussed in the link provided.

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-03-25 Thread Euler Taveira
On Mon, Mar 25, 2024, at 11:33 PM, Amit Kapila wrote:
> On Mon, Mar 25, 2024 at 5:25 PM Peter Eisentraut  wrote:
> >
> > I have committed your version v33.  I did another pass over the
> > identifier and literal quoting.  I added quoting for replication slot
> > names, for example, even though they can only contain a restricted set
> > of characters, but it felt better to be defensive there.
> >
> > I'm happy to entertain follow-up patches on some of the details like
> > option naming that were still being discussed.  I just wanted to get the
> > main functionality in in good time.  We can fine-tune the rest over the
> > next few weeks.
> >
> 
> I was looking at prior discussions on this topic to see if there are
> any other open design points apart from this and noticed that the
> points raised/discussed in the email [1] are also not addressed. IIRC,
> the key point we discussed was that after promotion, the existing
> replication objects should be removed (either optionally or always),
> otherwise, it can lead to a new subscriber not being able to restart
> or getting some unwarranted data.

See setup_subscriber.

/*
 * Since the publication was created before the consistent LSN, it is
 * available on the subscriber when the physical replica is promoted.
 * Remove publications from the subscriber because it has no use.
 */
drop_publication(conn, &dbinfo[i]);


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: speed up a logical replica setup

2024-03-25 Thread Euler Taveira
On Mon, Mar 25, 2024, at 1:06 PM, Hayato Kuroda (Fujitsu) wrote:
> ## Analysis for failure 1
> 
> The failure caused by a time lag between walreceiver finishes and 
> pg_is_in_recovery()
> returns true.
> 
> According to the output [1], it seems that the tool failed at 
> wait_for_end_recovery()
> with the message "standby server disconnected from the primary". Also, lines
> "redo done at..." and "terminating walreceiver process due to administrator 
> command"
> meant that walreceiver was requested to shut down by XLogShutdownWalRcv().
> 
> According to the source, we confirm that walreceiver is shut down in
> StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also, 
> SharedRecoveryState
> is changed to RECOVERY_STATE_DONE (this meant the pg_is_in_recovery() return 
> true)
> at the latter part of StartupXLOG().
> 
> So, if there is a delay between FinishWalRecovery() and change the state, the 
> check
> in wait_for_end_recovery() would be failed during the time. Since we allow to 
> miss
> the walreceiver 10 times and it is checked once per second, the failure 
> occurs if
> the time lag is longer than 10 seconds.
> 
> I do not have a good way to fix it. One approach is make NUM_CONN_ATTEMPTS 
> larger,
> but it's not a fundamental solution.

I was expecting that slow hosts might have issues in wait_for_end_recovery().
As you said, there are a lot of steps between FinishWalRecovery() (where the
walreceiver is shut down -- XLogShutdownWalRcv) and the point where
SharedRecoveryState is set to RECOVERY_STATE_DONE. If this window takes longer
than NUM_CONN_ATTEMPTS * WAIT_INTERVAL (10 seconds), the tool aborts the
execution. That's a bad decision because the server has already finished the
promotion and is just doing the final preparation to become a primary.

/*   
 * If it is still in recovery, make sure the target server is
 * connected to the primary so it can receive the required WAL to
 * finish the recovery process. If it is disconnected try
 * NUM_CONN_ATTEMPTS in a row and bail out if not succeed.
 */
res = PQexec(conn,
 "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");
if (PQntuples(res) == 0)
{
if (++count > NUM_CONN_ATTEMPTS)
{
stop_standby_server(subscriber_dir);
pg_log_error("standby server disconnected from the primary");
break;
}
}
else
count = 0;  /* reset counter if it connects again */

This code was added to defend against the death/crash of the target server.
There are at least 3 options:

(1) increase NUM_CONN_ATTEMPTS * WAIT_INTERVAL seconds. We discussed this 
constant
and I decided to use 10 seconds because even in some slow hosts, this time
wasn't reached during my tests. It seems I forgot to test the combination of 
slow
host, asserts enabled, and ubsan. I didn't notice that pg_promote() uses 60
seconds as default wait. Maybe that's a reasonable value. I checked the
004_timeline_switch test and the last run took: 39.2s (serinus), 33.1s
(culicidae), 18.31s (calliphoridae) and 27.52s (olingo).

(2) check if the primary is not running when the walreceiver is not available on
the target server. Increase the connection attempts counter only if the primary
is not running. Hence, the described case doesn't cause an increment of the
count variable.

(3) set the recovery_timeout default to != 0 and remove the pg_stat_wal_receiver
check that protects against the death/crash of the target server. I explained in
a previous message that a timeout may occur in cases where WAL replay to reach a
consistent state takes more than recovery-timeout seconds.

Option (1) is the easiest fix; however, we can hit the same issue again if a
slow host decides to be even slower, and then we have to adjust this value
again. Option (2) interprets the absence of the walreceiver, while the primary
server is still running, as a sign that the target server is on the verge of
finishing recovery. Option (3) is not as resilient as the other options.

The first patch implements a combination of (1) and (2).
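
For reference, a sketch of what option (2) amounts to could look like this
(illustrative only; "primary_conninfo_str" is an assumed variable, the other
names follow the snippet quoted earlier in this message):

```
if (PQntuples(res) == 0)
{
	PGconn	   *pconn = PQconnectdb(primary_conninfo_str);
	bool		primary_alive = (PQstatus(pconn) == CONNECTION_OK);

	PQfinish(pconn);

	/* Only count the missing walreceiver if the primary is unreachable. */
	if (!primary_alive && ++count > NUM_CONN_ATTEMPTS)
	{
		stop_standby_server(subscriber_dir);
		pg_log_error("standby server disconnected from the primary");
		break;
	}
	/* if the primary is alive, recovery is most likely about to finish */
}
else
	count = 0;			/* reset counter if it connects again */
```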

> ## Analysis for failure 2
> 
> According to [2], the physical replication slot which is specified as 
> primary_slot_name
> was not used by the walsender process. At that time walsender has not existed.
> 
> ```
> ...
> pg_createsubscriber: publisher: current wal senders: 0
> pg_createsubscriber: command is: SELECT 1 FROM 
> pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
> pg_createsubscriber: error: could not obtain replication slot information: 
> got 0 rows, expected 1 row
> ...
> ```
> 
> Currently standby must be stopped before the command and current code does not
> block the flow to ensure the replication is started. So there is a possibility
> that the checking is run before walsender is launched.
> 
> One possible approach is to wait until the replication starts. Alternative 
> one is
> to ease the condition.


Re: speed up a logical replica setup

2024-03-25 Thread Amit Kapila
On Mon, Mar 25, 2024 at 5:25 PM Peter Eisentraut  wrote:
>
> I have committed your version v33.  I did another pass over the
> identifier and literal quoting.  I added quoting for replication slot
> names, for example, even though they can only contain a restricted set
> of characters, but it felt better to be defensive there.
>
> I'm happy to entertain follow-up patches on some of the details like
> option naming that were still being discussed.  I just wanted to get the
> main functionality in in good time.  We can fine-tune the rest over the
> next few weeks.
>

I was looking at prior discussions on this topic to see if there are
any other open design points apart from this and noticed that the
points raised/discussed in the email [1] are also not addressed. IIRC,
the key point we discussed was that after promotion, the existing
replication objects should be removed (either optionally or always),
otherwise, it can lead to a new subscriber not being able to restart
or getting some unwarranted data.

[1] - 
https://www.postgresql.org/message-id/CAExHW5t4ew7ZrgcDdTv7YmuG7LVQT1ZaEny_EvtngHtEBNyjcQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-03-25 Thread vignesh C
On Mon, 25 Mar 2024 at 21:36, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Bharath, Peter,
>
> > Looks like BF animals aren't happy, please check -
> > > https://buildfarm.postgresql.org/cgi-bin/show_failures.pl.
> >
> > Looks like sanitizer failures.  There were a few messages about that
> > recently, but those were all just about freeing memory after use, which
> > we don't necessarily require for client programs.  So maybe something else.
>
> It seems that there are several time of failures, [1] and [2].
>
> ## Analysis for failure 1
>
> The failure caused by a time lag between walreceiver finishes and 
> pg_is_in_recovery()
> returns true.
>
> According to the output [1], it seems that the tool failed at 
> wait_for_end_recovery()
> with the message "standby server disconnected from the primary". Also, lines
> "redo done at..." and "terminating walreceiver process due to administrator 
> command"
> meant that walreceiver was requested to shut down by XLogShutdownWalRcv().
>
> According to the source, we confirm that walreceiver is shut down in
> StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also, 
> SharedRecoveryState
> is changed to RECOVERY_STATE_DONE (this meant the pg_is_in_recovery() return 
> true)
> at the latter part of StartupXLOG().
>
> So, if there is a delay between FinishWalRecovery() and change the state, the 
> check
> in wait_for_end_recovery() would be failed during the time. Since we allow to 
> miss
> the walreceiver 10 times and it is checked once per second, the failure 
> occurs if
> the time lag is longer than 10 seconds.
>
> I do not have a good way to fix it. One approach is make NUM_CONN_ATTEMPTS 
> larger,
> but it's not a fundamental solution.

I agree with your analysis; another way to fix it could be to remove the
following check, as increasing the count might still leave the race
condition:
/*
* If it is still in recovery, make sure the target server is
* connected to the primary so it can receive the required WAL to
* finish the recovery process. If it is disconnected try
* NUM_CONN_ATTEMPTS in a row and bail out if not succeed.
*/
res = PQexec(conn,
"SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");

I'm not sure whether we should worry about the condition where
recovery is not done and the walreceiver has exited, as we have the
following sanity check in check_subscriber before we wait for recovery
to finish:
/* The target server must be a standby */
if (!server_is_in_recovery(conn))
{
pg_log_error("target server must be a standby");
disconnect_database(conn, true);
}

Regards,
Vignesh




RE: speed up a logical replica setup

2024-03-25 Thread Hayato Kuroda (Fujitsu)
Dear Bharath, Peter,

> Looks like BF animals aren't happy, please check -
> > https://buildfarm.postgresql.org/cgi-bin/show_failures.pl.
> 
> Looks like sanitizer failures.  There were a few messages about that
> recently, but those were all just about freeing memory after use, which
> we don't necessarily require for client programs.  So maybe something else.

It seems that there are several types of failures, [1] and [2].

## Analysis for failure 1

The failure is caused by a time lag between when the walreceiver finishes and
when pg_is_in_recovery() reports that recovery has ended.

According to the output [1], it seems that the tool failed at 
wait_for_end_recovery()
with the message "standby server disconnected from the primary". Also, lines
"redo done at..." and "terminating walreceiver process due to administrator 
command"
meant that walreceiver was requested to shut down by XLogShutdownWalRcv().

According to the source, the walreceiver is shut down in
StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also,
SharedRecoveryState is changed to RECOVERY_STATE_DONE (which makes
pg_is_in_recovery() return false) in the latter part of StartupXLOG().

So, if there is a delay between FinishWalRecovery() and changing the state, the
check in wait_for_end_recovery() can fail during that window. Since we allow the
walreceiver to be missing 10 times and it is checked once per second, the
failure occurs if the time lag is longer than 10 seconds.

I do not have a good way to fix it. One approach is to make NUM_CONN_ATTEMPTS
larger, but it's not a fundamental solution.

## Analysis for failure 2

According to [2], the physical replication slot which is specified as
primary_slot_name was not yet used by a walsender process. At that time the
walsender did not exist yet.

```
...
pg_createsubscriber: publisher: current wal senders: 0
pg_createsubscriber: command is: SELECT 1 FROM pg_catalog.pg_replication_slots 
WHERE active AND slot_name = 'physical_slot'
pg_createsubscriber: error: could not obtain replication slot information: got 
0 rows, expected 1 row
...
```

Currently the standby must be stopped before the command is run, and the current
code does not block the flow to ensure that replication has started. So there is
a possibility that the check runs before the walsender is launched.

One possible approach is to wait until replication starts. An alternative is to
relax the condition.
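
A sketch of the first approach, just to illustrate the shape it could take (this
reuses the slot check quoted above and the existing NUM_CONN_ATTEMPTS constant;
it is not the actual fix):

```
/* Poll until a walsender has started using the physical slot. */
for (int i = 0; i < NUM_CONN_ATTEMPTS; i++)
{
	PGresult   *res = PQexec(conn,
							 "SELECT 1 FROM pg_catalog.pg_replication_slots "
							 "WHERE active AND slot_name = 'physical_slot'");

	if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
	{
		PQclear(res);
		break;				/* replication has started */
	}
	PQclear(res);
	pg_usleep(1000000);		/* wait one second and retry */
}
```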

What do you think?

[1]: 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus=2024-03-25%2013%3A03%3A07
[2]: 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae=2024-03-25%2013%3A53%3A58

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


Re: speed up a logical replica setup

2024-03-25 Thread Euler Taveira
On Mon, Mar 25, 2024, at 8:55 AM, Peter Eisentraut wrote:
> On 22.03.24 04:31, Euler Taveira wrote:
> > On Thu, Mar 21, 2024, at 6:49 AM, Shlok Kyal wrote:
> >> There is a compilation error while building postgres with the patch
> >> due to a recent commit. I have attached a top-up patch v32-0003 to
> >> resolve this compilation error.
> >> I have not updated the version of the patch as I have not made any
> >> change in v32-0001 and v32-0002 patch.
> > 
> > I'm attaching a new version (v33) to incorporate this fix (v32-0003) 
> > into the
> > main patch (v32-0001). This version also includes 2 new tests:
> > 
> > - refuse to run if the standby server is running
> > - refuse to run if the standby was promoted e.g. it is not in recovery
> > 
> > The first one exercises a recent change (standby should be stopped) and the
> > second one covers an important requirement.
> 
> I have committed your version v33.  I did another pass over the 
> identifier and literal quoting.  I added quoting for replication slot 
> names, for example, even though they can only contain a restricted set 
> of characters, but it felt better to be defensive there.
Thanks.

> I'm happy to entertain follow-up patches on some of the details like 
> option naming that were still being discussed.  I just wanted to get the 
> main functionality in in good time.  We can fine-tune the rest over the 
> next few weeks.
Agree. Let's continue the discussion about the details.

> > Based on the discussion [1] about the check functions, Vignesh suggested 
> > that it
> > should check both server before exiting. v33-0003 implements it. I don't 
> > have a
> > strong preference; feel free to apply it.
> 
> I haven't done anything about this.
... including this one.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: speed up a logical replica setup

2024-03-25 Thread Peter Eisentraut

On 25.03.24 13:36, Bharath Rupireddy wrote:

On Mon, Mar 25, 2024 at 5:25 PM Peter Eisentraut  wrote:


I have committed your version v33.


Looks like BF animals aren't happy, please check -
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl.


Looks like sanitizer failures.  There were a few messages about that 
recently, but those were all just about freeing memory after use, which 
we don't necessarily require for client programs.  So maybe something else.






Re: speed up a logical replica setup

2024-03-25 Thread Bharath Rupireddy
On Mon, Mar 25, 2024 at 5:25 PM Peter Eisentraut  wrote:
>
> I have committed your version v33.

Looks like BF animals aren't happy, please check -
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: speed up a logical replica setup

2024-03-25 Thread Peter Eisentraut

On 22.03.24 04:31, Euler Taveira wrote:

On Thu, Mar 21, 2024, at 6:49 AM, Shlok Kyal wrote:

There is a compilation error while building postgres with the patch
due to a recent commit. I have attached a top-up patch v32-0003 to
resolve this compilation error.
I have not updated the version of the patch as I have not made any
change in v32-0001 and v32-0002 patch.


I'm attaching a new version (v33) to incorporate this fix (v32-0003) 
into the

main patch (v32-0001). This version also includes 2 new tests:

- refuse to run if the standby server is running
- refuse to run if the standby was promoted e.g. it is not in recovery

The first one exercises a recent change (standby should be stopped) and the
second one covers an important requirement.


I have committed your version v33.  I did another pass over the 
identifier and literal quoting.  I added quoting for replication slot 
names, for example, even though they can only contain a restricted set 
of characters, but it felt better to be defensive there.


I'm happy to entertain follow-up patches on some of the details like 
option naming that were still being discussed.  I just wanted to get the 
main functionality in in good time.  We can fine-tune the rest over the 
next few weeks.


Based on the discussion [1] about the check functions, Vignesh suggested 
that it
should check both server before exiting. v33-0003 implements it. I don't 
have a

strong preference; feel free to apply it.


I haven't done anything about this.





RE: speed up a logical replica setup

2024-03-25 Thread Hayato Kuroda (Fujitsu)
> IIUC, added options were inspired by Tomas. And he said the limitation
> (pub/sub/slot
> name cannot be specified) was acceptable for the first version. I agree with 
> him.
> (To be honest, I feel that options should be fewer for the first release)

Just to confirm - I'm not saying it is completely unneeded. If we can agree on a
specification at some point, it's OK for me to add them. If you ask me, I still
prefer Tomas's approach.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-03-25 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Euler,

I also want to share my opinion, just in case.

> On Thu, Mar 21, 2024 at 8:00 PM Euler Taveira  wrote:
> >
> > On Thu, Mar 21, 2024, at 10:33 AM, vignesh C wrote:
> >
> > If we commit this we might not be able to change the way the option
> > behaves once the customers starts using it. How about removing these
> > options in the first version and adding it in the next version after
> > more discussion.
> >
> >
> > We don't need to redesign this one if we want to add a format string in a 
> > next
> > version. A long time ago, pg_dump started to accept pattern for tables 
> > without
> > breaking or deprecating the -t option. If you have 100 databases and you 
> > don't
> > want to specify the options or use a script to generate it for you, you also
> > have the option to let pg_createsubscriber generate the object names for 
> > you.
> > Per my experience, it will be a rare case.
> >
> 
> But, why go with some UI in the first place which we don't think is a
> good one, or at least don't have a broader agreement that it is a good
> one? The problem with self-generated names for users could be that
> they won't be able to make much out of it. If one has to always use
> those internally then probably that would be acceptable. I would
> prefer what Tomas proposed a few emails ago as compared to this one as
> that has fewer options to be provided by users but still, they can
> later identify objects. But surely, we should discuss if you or others
> have better alternatives.

IIUC, the added options were inspired by Tomas. And he said the limitation
(pub/sub/slot names cannot be specified) was acceptable for the first version. I
agree with him. (To be honest, I feel that there should be fewer options in the
first release.)

> The user choosing to convert a physical standby to a subscriber would
> in some cases be interested in converting it for all the databases
> (say for the case of upgrade [1]) and giving options for each database
> would be cumbersome for her.

+1 for the primary use case.

> > Currently dry-run will do the check and might fail on identifying a
> > few failures like after checking subscriber configurations. Then the
> > user will have to correct the configuration and re-run then fix the
> > next set of failures. Whereas the suggest-config will display all the
> > optimal configuration for both the primary and the standby in a single
> > shot. This is not a must in the first version, it can be done as a
> > subsequent enhancement.
> >
> >
> > Do you meant publisher, right? Per order, check_subscriber is done before
> > check_publisher and it checks all settings on the subscriber before 
> > exiting. In
> > v30, I changed the way it provides the required settings. In a previous 
> > version,
> > it fails when it found a wrong setting; the current version, check all 
> > settings
> > from that server before providing a suitable error.
> >
> > pg_createsubscriber: checking settings on publisher
> > pg_createsubscriber: primary has replication slot "physical_slot"
> > pg_createsubscriber: error: publisher requires wal_level >= logical
> > pg_createsubscriber: error: publisher requires 2 replication slots, but 
> > only 0
> remain
> > pg_createsubscriber: hint: Consider increasing max_replication_slots to at 
> > least
> 3.
> > pg_createsubscriber: error: publisher requires 2 wal sender processes, but 
> > only
> 0 remain
> > pg_createsubscriber: hint: Consider increasing max_wal_senders to at least 
> > 3.
> >
> > If you have such an error, you will fix them all and rerun using dry run 
> > mode
> > again to verify everything is ok. I don't have a strong preference about 
> > it. It
> > can be changed easily (unifying the check functions or providing a return 
> > for
> > each of the check functions).
> >
> 
> We can unify the checks but not sure if it is worth it. I am fine
> either way. It would have been better if we provided a way for a user
> to know the tool's requirement rather than letting him know via errors
> but I think it should be okay to extend it later as well.

Both ways are OK, but I slightly prefer unifying the checks. The number of
working modes in the same executable should be reduced as much as possible.

Also, I feel the current behavior is acceptable. pg_upgrade checks things one by
one and exits early in bad cases. If both the old and new clusters have issues,
the first run only reports the old one's issues. After the DBA fixes them and
runs again, the issues on the new cluster are reported.

[1]: 
https://www.postgresql.org/message-id/8d52c226-7e34-44f7-a919-759bf8d81541%40enterprisedb.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: speed up a logical replica setup

2024-03-22 Thread Shubham Khanna
On Fri, Mar 22, 2024 at 9:02 AM Euler Taveira  wrote:
>
> On Thu, Mar 21, 2024, at 6:49 AM, Shlok Kyal wrote:
>
> There is a compilation error while building postgres with the patch
> due to a recent commit. I have attached a top-up patch v32-0003 to
> resolve this compilation error.
> I have not updated the version of the patch as I have not made any
> change in v32-0001 and v32-0002 patch.
>
>
> I'm attaching a new version (v33) to incorporate this fix (v32-0003) into the
> main patch (v32-0001). This version also includes 2 new tests:
>
> - refuse to run if the standby server is running
> - refuse to run if the standby was promoted e.g. it is not in recovery
>
> The first one exercises a recent change (standby should be stopped) and the
> second one covers an important requirement.
>
> Based on the discussion [1] about the check functions, Vignesh suggested that 
> it
> should check both server before exiting. v33-0003 implements it. I don't have 
> a
> strong preference; feel free to apply it.
>
>
> [1] 
> https://www.postgresql.org/message-id/CALDaNm1Dg5tDRmaabk%2BZND4WF17NrNq52WZxCE%2B90-PGz5trQQ%40mail.gmail.com

I had run valgrind with pg_createsubscriber to see if there were any
issues. Valgrind reported the following issues:
==651272== LEAK SUMMARY:
==651272==definitely lost: 1,319 bytes in 18 blocks
==651272==indirectly lost: 1,280 bytes in 2 blocks
==651272==  possibly lost: 44 bytes in 3 blocks
==651272==still reachable: 3,066 bytes in 22 blocks
==651272== suppressed: 0 bytes in 0 blocks
==651272==
==651272== For lists of detected and suppressed errors, rerun with: -s
==651272== ERROR SUMMARY: 17 errors from 17 contexts (suppressed: 0 from 0)
The attached report has the details of the same.

Thanks and Regards:
Shubham Khanna.


valgrind.log
Description: Binary data


Re: speed up a logical replica setup

2024-03-22 Thread vignesh C
On Fri, 22 Mar 2024 at 09:01, Euler Taveira  wrote:
>
> On Thu, Mar 21, 2024, at 6:49 AM, Shlok Kyal wrote:
>
> There is a compilation error while building postgres with the patch
> due to a recent commit. I have attached a top-up patch v32-0003 to
> resolve this compilation error.
> I have not updated the version of the patch as I have not made any
> change in v32-0001 and v32-0002 patch.
>
>
> I'm attaching a new version (v33) to incorporate this fix (v32-0003) into the
> main patch (v32-0001). This version also includes 2 new tests:
>
> - refuse to run if the standby server is running
> - refuse to run if the standby was promoted e.g. it is not in recovery
>
> The first one exercises a recent change (standby should be stopped) and the
> second one covers an important requirement.

A few comments:
1) In the error case, PQclear and PQfinish should be called:
+   /* Secure search_path */
+   res = PQexec(conn, ALWAYS_SECURE_SEARCH_PATH_SQL);
+   if (PQresultStatus(res) != PGRES_TUPLES_OK)
+   {
+   pg_log_error("could not clear search_path: %s",
+PQresultErrorMessage(res));
+   if (exit_on_error)
+   exit(1);
+
+   return NULL;
+   }
+   PQclear(res);

2) Call fflush before calling the system command to get properly
ordered console output:
a) Call fflush:
+   int rc = system(cmd_str);
+
+   if (rc == 0)
+   pg_log_info("subscriber successfully changed
the system identifier");
+   else
+   pg_fatal("subscriber failed to change system
identifier: exit code: %d", rc);
+   }

b) similarly here:
+   pg_log_debug("pg_ctl command is: %s", pg_ctl_cmd->data);
+   rc = system(pg_ctl_cmd->data);
+   pg_ctl_status(pg_ctl_cmd->data, rc);
+   standby_running = true;

c) similarly here:
+   pg_ctl_cmd = psprintf("\"%s\" stop -D \"%s\" -s", pg_ctl_path,
+ datadir);
+   pg_log_debug("pg_ctl command is: %s", pg_ctl_cmd);
+   rc = system(pg_ctl_cmd);
+   pg_ctl_status(pg_ctl_cmd, rc);
+   standby_running = false;

3) Call PQclear in error cases:
a) Call PQclear
+   res = PQexec(conn, "SELECT system_identifier FROM
pg_catalog.pg_control_system()");
+   if (PQresultStatus(res) != PGRES_TUPLES_OK)
+   {
+   pg_log_error("could not get system identifier: %s",
+PQresultErrorMessage(res));
+   disconnect_database(conn, true);
+   }

b) similarly here
+   if (PQntuples(res) != 1)
+   {
+   pg_log_error("could not get system identifier: got %d
rows, expected %d row",
+PQntuples(res), 1);
+   disconnect_database(conn, true);
+   }
+

c) similarly here
+   res = PQexec(conn,
+"SELECT oid FROM pg_catalog.pg_database "
+"WHERE datname =
pg_catalog.current_database()");
+   if (PQresultStatus(res) != PGRES_TUPLES_OK)
+   {
+   pg_log_error("could not obtain database OID: %s",
+PQresultErrorMessage(res));
+   disconnect_database(conn, true);
+   }
+

d) There are few more places like this.

4) Since these are global variables, we might be able to remove
the initialization of many of them.
+   /* Default settings */
+   subscriber_dir = NULL;
+   opt.config_file = NULL;
+   opt.pub_conninfo_str = NULL;
+   opt.socket_dir = NULL;
+   opt.sub_port = DEFAULT_SUB_PORT;
+   opt.sub_username = NULL;
+   opt.database_names = (SimpleStringList){0};
+   opt.recovery_timeout = 0;
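
To illustrate comments 1) and 2), the corrected pattern could look roughly like
this (a sketch based on the snippets above, not the final patch):

```
	/* 1) in the error path, free the result and close the connection */
	res = PQexec(conn, ALWAYS_SECURE_SEARCH_PATH_SQL);
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
	{
		pg_log_error("could not clear search_path: %s",
					 PQresultErrorMessage(res));
		PQclear(res);
		PQfinish(conn);
		if (exit_on_error)
			exit(1);
		return NULL;
	}
	PQclear(res);

	/* 2) flush buffered output before handing the console to a child process */
	fflush(stdout);
	fflush(stderr);
	rc = system(cmd_str);
	if (rc == 0)
		pg_log_info("subscriber successfully changed the system identifier");
	else
		pg_fatal("subscriber failed to change system identifier: exit code: %d", rc);
```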

Regards,
Vignesh




Re: speed up a logical replica setup

2024-03-21 Thread Amit Kapila
On Fri, Mar 22, 2024 at 9:44 AM Fabrízio de Royes Mello
 wrote:
>
> On Fri, Mar 22, 2024 at 12:54 AM Amit Kapila  wrote:
> >
> >
> > The user choosing to convert a physical standby to a subscriber would
> > in some cases be interested in converting it for all the databases
> > (say for the case of upgrade [1]) and giving options for each database
> > would be cumbersome for her.
> >
> > ...
> >
> > [1] - This tool can be used in an upgrade where the user first
> > converts physical standby to subscriber to get incremental changes and
> > then performs an online upgrade.
> >
>
> The first use case me and Euler discussed some time ago to upstream this tool 
> was exactly what Amit described so IMHO we should make it easier for users to 
> subscribe to all existing user databases.
>

I feel that will be a good use case for this tool especially now
(with this release) when we allow upgrade of subscriptions. In this
regard, I feel if the user doesn't specify any database it should have
subscriptions for all databases and the user should have a way to
provide publication/slot/subscription names.

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-03-21 Thread Fabrízio de Royes Mello
On Fri, Mar 22, 2024 at 12:54 AM Amit Kapila 
wrote:
>
>
> The user choosing to convert a physical standby to a subscriber would
> in some cases be interested in converting it for all the databases
> (say for the case of upgrade [1]) and giving options for each database
> would be cumbersome for her.
>
> ...
>
> [1] - This tool can be used in an upgrade where the user first
> converts physical standby to subscriber to get incremental changes and
> then performs an online upgrade.
>

The first use case me and Euler discussed some time ago to upstream this
tool was exactly what Amit described so IMHO we should make it easier for
users to subscribe to all existing user databases.

Regards,

--
Fabrízio de Royes Mello


Re: speed up a logical replica setup

2024-03-21 Thread Amit Kapila
On Thu, Mar 21, 2024 at 8:00 PM Euler Taveira  wrote:
>
> On Thu, Mar 21, 2024, at 10:33 AM, vignesh C wrote:
>
> If we commit this we might not be able to change the way the option
> behaves once the customers starts using it. How about removing these
> options in the first version and adding it in the next version after
> more discussion.
>
>
> We don't need to redesign this one if we want to add a format string in a next
> version. A long time ago, pg_dump started to accept pattern for tables without
> breaking or deprecating the -t option. If you have 100 databases and you don't
> want to specify the options or use a script to generate it for you, you also
> have the option to let pg_createsubscriber generate the object names for you.
> Per my experience, it will be a rare case.
>

But, why go with some UI in the first place which we don't think is a
good one, or at least don't have a broader agreement that it is a good
one? The problem with self-generated names for users could be that
they won't be able to make much out of it. If one has to always use
those internally then probably that would be acceptable. I would
prefer what Tomas proposed a few emails ago as compared to this one as
that has fewer options to be provided by users but still, they can
later identify objects. But surely, we should discuss if you or others
have better alternatives.

The user choosing to convert a physical standby to a subscriber would
in some cases be interested in converting it for all the databases
(say for the case of upgrade [1]) and giving options for each database
would be cumbersome for her.

> Currently dry-run will do the check and might fail on identifying a
> few failures like after checking subscriber configurations. Then the
> user will have to correct the configuration and re-run then fix the
> next set of failures. Whereas the suggest-config will display all the
> optimal configuration for both the primary and the standby in a single
> shot. This is not a must in the first version, it can be done as a
> subsequent enhancement.
>
>
> Do you meant publisher, right? Per order, check_subscriber is done before
> check_publisher and it checks all settings on the subscriber before exiting. 
> In
> v30, I changed the way it provides the required settings. In a previous 
> version,
> it fails when it found a wrong setting; the current version, check all 
> settings
> from that server before providing a suitable error.
>
> pg_createsubscriber: checking settings on publisher
> pg_createsubscriber: primary has replication slot "physical_slot"
> pg_createsubscriber: error: publisher requires wal_level >= logical
> pg_createsubscriber: error: publisher requires 2 replication slots, but only 
> 0 remain
> pg_createsubscriber: hint: Consider increasing max_replication_slots to at 
> least 3.
> pg_createsubscriber: error: publisher requires 2 wal sender processes, but 
> only 0 remain
> pg_createsubscriber: hint: Consider increasing max_wal_senders to at least 3.
>
> If you have such an error, you will fix them all and rerun using dry run mode
> again to verify everything is ok. I don't have a strong preference about it. 
> It
> can be changed easily (unifying the check functions or providing a return for
> each of the check functions).
>

We can unify the checks, but I am not sure if it is worth it. I am fine
either way. It would have been better if we provided a way for a user
to know the tool's requirements rather than letting them be discovered via
errors, but I think it should be okay to extend it later as well.

[1] - This tool can be used in an upgrade where the user first
converts physical standby to subscriber to get incremental changes and
then performs an online upgrade.

-- 
With Regards,
Amit Kapila.



