Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-15 Thread Давыдов Виталий

Dear All,

On Wednesday, April 10, 2024 17:16 MSK, Давыдов Виталий wrote:
> Hi Amit, Ajin, All
>
> Thank you for the patch and the responses. I apologize for my delayed answer
> due to some circumstances.
>
> On Wednesday, April 10, 2024 14:18 MSK, Amit Kapila wrote:
> > Vitaly, does the minimal solution provided by the proposed patch (Allow to
> > alter two_phase option of a subscriber provided no uncommitted prepared
> > transactions are pending on that subscription.) address your use case?
>
> In general, the idea behind the patch seems to be suitable for my case.
> Furthermore, the case of a two_phase switch from false to true with
> uncommitted pending prepared transactions probably never happens in my case.
> The switch from false to true means that the replica has completed the catchup
> from the master and switches to the normal mode, in which it participates in
> the multi-node configuration. There should be no uncommitted pending prepared
> transactions at the moment of the switch to the normal mode.
>
> I'm going to try this patch. Please give me some time to investigate it.
> I will come back with some feedback a little later.

I looked at the patch and realized that I can't easily try it in the near
future because the solution I'm working on is based on PG16 or earlier. The
patch is not easily applicable to the older releases, so I first have to port
my solution to master, which is not done yet. I apologize for that; a lot of
work has to be done before I can apply the patch. BTW, I tested the idea of
async 2PC commit on my side and it seems to work fine in my case. Anyway, I
agree that altering the subscription seems to be the best approach, although
it is much harder to implement.

To summarise my case of a synchronous multimaster where twophase is used to
implement global transactions:
 * The replica may have prepared but not committed transactions when I toggle
   the subscription's twophase from true to false. In this case, all prepared
   transactions may be aborted on the replica before altering the subscription.
 * The replica will not have prepared transactions when the subscription is
   toggled from false to true. In this scenario, the replica completes the
   catchup (with twophase=off), becomes part of the multi-node cluster and is
   ready to accept new 2PC transactions. All new pending transactions will
   wait until the replica responds.

But it may work differently for some other solutions. In general, it would be
great to allow toggling in all scenarios.

Just out of interest, has anyone tried to reproduce the problem with slow
catchup of twophase transactions (pgbench should be used with a big number of
clients)? I haven't seen messages from anyone other than me confirming that
the problem takes place.

Thank you for your help!

With best regards,
Vitaly










 


Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-10 Thread Давыдов Виталий

Hi Amit, Ajin, All

Thank you for the patch and the responses. I apologize for my delayed answer
due to some circumstances.

On Wednesday, April 10, 2024 14:18 MSK, Amit Kapila wrote:
> Vitaly, does the minimal solution provided by the proposed patch (Allow to
> alter two_phase option of a subscriber provided no uncommitted prepared
> transactions are pending on that subscription.) address your use case?

In general, the idea behind the patch seems to be suitable for my case.
Furthermore, the case of a two_phase switch from false to true with
uncommitted pending prepared transactions probably never happens in my case.
The switch from false to true means that the replica has completed the catchup
from the master and switches to the normal mode, in which it participates in
the multi-node configuration. There should be no uncommitted pending prepared
transactions at the moment of the switch to the normal mode.

I'm going to try this patch. Please give me some time to investigate it.
I will come back with some feedback a little later.

Thank you for your help!

With best regards,
Vitaly Davydov


 


Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-03-05 Thread Давыдов Виталий

Hi Heikki,

Thank you for the reply.

On Tuesday, March 05, 2024 12:05 MSK, Heikki Linnakangas wrote:
> In a nutshell, this changes PREPARE TRANSACTION so that if
> synchronous_commit is 'off', the PREPARE TRANSACTION is not fsync'd to
> disk. So if you crash after the PREPARE TRANSACTION has returned, the
> transaction might be lost. I think that's completely unacceptable.

You are right, the prepared transaction might be lost after a crash. The same
may happen with regular transactions, which are not fsync-ed on the replica in
logical replication by default: the subscription parameter synchronous_commit
is off by default. I'm not sure whether there is some automatic recovery for
regular transactions. I think the main difference between these two cases is
how to recover manually when some PREPARE TRANSACTION or COMMIT PREPARED is
lost. For regular transactions, some updates or deletes in tables on the
replica may be enough to fix the problem. For twophase transactions, it may be
harder to fix by hand, but I believe it is possible. If you create a custom
solution based on twophase transactions (like multimaster), such recovery may
happen automatically. Another solution is to ignore errors on COMMIT PREPARED
if the corresponding prepared transaction is missing. I am not aware of other
risks of async commit of twophase transactions.

> If you're ok to lose the prepared state of twophase transactions on
> crash, why don't you create the subscription with 'two_phase=off' to
> begin with?

In normal operation, the subscription has two_phase = on. I have to change
this option at the catchup stage only, but this parameter cannot be altered.
There was a patch proposal in the past to implement altering of the two_phase
option, but it was rejected. I think recreating the subscription with
two_phase = off will not work.

I believe async commit for twophase transactions during catchup will
significantly improve the catchup performance. Such a feature is worth
considering.

P.S. We might introduce a GUC option to allow async commit for twophase
transactions. By default, sync commit will be applied for twophase
transactions, as it is now.

With best regards,
Vitaly Davydov


Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-02-29 Thread Давыдов Виталий

Dear All,

Please consider my patch for async commit of twophase transactions. It can be
applicable when catchup performance is not sufficient with the subscription
parameter two_phase = on.

The key changes are:
 * Use XLogSetAsyncXactLSN instead of XLogFlush, as is done for usual
   transactions.
 * In case of async commit only, save the 2PC state in the pg_twophase file
   (but do not fsync it) in addition to saving it in the WAL. The file is used
   as an alternative to storing the 2PC state in memory.
 * On recovery, reject pg_twophase files with future xids.

Probably, 2PC async commit should be enabled by a GUC (not implemented in the
patch).

With best regards,
Vitaly


 
From cbaaa7270d771f9ccd6def08f0f02ce61dc15ff6 Mon Sep 17 00:00:00 2001
From: Vitaly Davydov 
Date: Thu, 29 Feb 2024 18:58:13 +0300
Subject: [PATCH] Async commit support for twophase transactions

---
 src/backend/access/transam/twophase.c | 171 +-
 1 file changed, 138 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 234c8d08eb..352266be14 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -109,6 +109,8 @@
 #include "utils/memutils.h"
 #include "utils/timestamp.h"
 
+#define POSTGRESQL_TWOPHASE_SUPPORT_ASYNC_COMMIT
+
 /*
  * Directory where Two-phase commit files reside within PGDATA
  */
@@ -169,6 +171,7 @@ typedef struct GlobalTransactionData
 	BackendId	locking_backend;	/* backend currently working on the xact */
 	bool		valid;			/* true if PGPROC entry is in proc array */
 	bool		ondisk;			/* true if prepare state file is on disk */
+	bool		infile;			/* true if prepared state saved in file (but not fsync-ed) */
 	bool		inredo;			/* true if entry was added via xlog_redo */
 	char		gid[GIDSIZE];	/* The GID assigned to the prepared xact */
 }			GlobalTransactionData;
@@ -227,12 +230,14 @@ static void RemoveGXact(GlobalTransaction gxact);
 static void XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len);
 static char *ProcessTwoPhaseBuffer(TransactionId xid,
    XLogRecPtr prepare_start_lsn,
-   bool fromdisk, bool setParent, bool setNextXid);
+   bool fromdisk, bool setParent, bool setNextXid,
+   const char *filename);
 static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid,
 const char *gid, TimestampTz prepared_at, Oid owner,
 Oid databaseid);
 static void RemoveTwoPhaseFile(TransactionId xid, bool giveWarning);
-static void RecreateTwoPhaseFile(TransactionId xid, void *content, int len);
+static void RemoveTwoPhaseFileByName(const char *filename, bool giveWarning);
+static void RecreateTwoPhaseFile(TransactionId xid, void *content, int len, bool dosync);
 
 /*
  * Initialization of shared memory
@@ -427,6 +432,7 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 
 	MarkAsPreparingGuts(gxact, xid, gid, prepared_at, owner, databaseid);
 
+	gxact->infile = false;
 	gxact->ondisk = false;
 
 	/* And insert it into the active array */
@@ -1204,6 +1210,37 @@ EndPrepare(GlobalTransaction gxact)
 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
  errmsg("two-phase state file maximum length exceeded")));
 
+#ifdef POSTGRESQL_TWOPHASE_SUPPORT_ASYNC_COMMIT
+
+	Assert(gxact->infile == false);
+
+	if (synchronous_commit == SYNCHRONOUS_COMMIT_OFF)
+	{
+		char		   *buf;
+		size_t			len = 0;
+		size_t			offset = 0;
+
+		for (record = records.head; record != NULL; record = record->next)
+			len += record->len;
+
+		if (len > 0)
+		{
+			buf = palloc(len);
+
+			for (record = records.head; record != NULL; record = record->next)
+			{
+				memcpy(buf + offset, record->data, record->len);
+				offset += record->len;
+			}
+
+			RecreateTwoPhaseFile(gxact->xid, buf, len, false);
+			pfree(buf);
+			gxact->infile = true;
+		}
+	}
+
+#endif
+
 	/*
 	 * Now writing 2PC state data to WAL. We let the WAL's CRC protection
 	 * cover us, so no need to calculate a separate CRC.
@@ -1239,8 +1276,24 @@ EndPrepare(GlobalTransaction gxact)
    gxact->prepare_end_lsn);
 	}
 
+#if !defined(POSTGRESQL_TWOPHASE_SUPPORT_ASYNC_COMMIT)
+
 	XLogFlush(gxact->prepare_end_lsn);
 
+#else
+
+	if (synchronous_commit > SYNCHRONOUS_COMMIT_OFF)
+	{
+		/* Flush XLOG to disk */
+		XLogFlush(gxact->prepare_end_lsn);
+	}
+	else
+	{
+		XLogSetAsyncXactLSN(gxact->prepare_end_lsn);
+	}
+
+#endif
+
 	/* If we crash now, we have prepared: WAL replay will fix things */
 
 	/* Store record's start location to read that later on Commit */
@@ -1546,12 +1599,11 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * in WAL files if the LSN is after the last checkpoint record, or moved
 	 * to disk if for some reason they have lived for a long time.
 	 */
-	if (gxact->ondisk)
+	if (gxact->infile || gxact->ondisk)
 		buf = ReadTwoPhaseFile(xid, false);
 	else
 		XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
 
-
 	/*
 	 * Disassemble the header area
 	 */
@@

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-02-27 Thread Давыдов Виталий

Hi Amit,

On Tuesday, February 27, 2024 16:00 MSK, Amit Kapila  
wrote:
> As we do XLogFlush() at the time of prepare then why it is not available? OR
> are you talking about this state after your idea/patch where you are trying to
> make both Prepare and Commit_prepared records async?

Right, I'm talking about my patch where async commit is implemented. There is
no such problem with reading the 2PC state from unflushed WAL in vanilla
PostgreSQL because XLogFlush is called unconditionally, as you've described.
But an attempt to add async behavior leads to the problem of reading unflushed
WAL. That is why I store the 2PC state in local memory in my patch.

> It would be good if you could link those threads.

Sure, I will find and add some links to the past discussions.

Thank you!

With best regards,
Vitaly

On Tue, Feb 27, 2024 at 4:49 PM Давыдов Виталий wrote:
>
> Thank you for your interest in the discussion!
>
> On Monday, February 26, 2024 16:24 MSK, Amit Kapila  
> wrote:
>
>
> I think the reason is probably that when the WAL record for prepared is 
> already flushed then what will be the idea of async commit here?
>
> I think, the idea of async commit should be applied for both transactions: 
> PREPARE and COMMIT PREPARED, which are actually two separate local 
> transactions. For both these transactions we may call XLogSetAsyncXactLSN on 
> commit instead of XLogFlush when async commit is enabled. When I use async 
> commit, I mean to apply async commit to local transactions, not to a twophase 
> (prepared) transaction itself.
>
>
> At commit prepared, it seems we read prepare's WAL record, right? If so, it 
> is not clear to me do you see a problem with a flush of commit_prepared or 
> reading WAL for prepared or both of these.
>
> The problem with reading WAL is due to async commit of PREPARE TRANSACTION 
> which saves 2PC in the WAL. At the moment of COMMIT PREPARED the WAL with 
> PREPARE TRANSACTION 2PC state may not be XLogFlush-ed yet.
>

As we do XLogFlush() at the time of prepare then why it is not
available? OR are you talking about this state after your idea/patch
where you are trying to make both Prepare and Commit_prepared records
async?

> So, PREPARE TRANSACTION should wait until its 2PC state is flushed.
>
> I did some experiments with saving 2PC state in the local memory of logical 
> replication worker and, I think, it worked and demonstrated much better 
> performance. Logical replication worker utilized up to 100% CPU. I'm just 
> concerned about possible problems with async commit for twophase transactions.
>
> To be more specific, I've attached a patch to support async commit for 
> twophase. It is not the final patch but it is presented only for discussion 
> purposes. There were some attempts to save 2PC in memory in past but it was 
> rejected.
>

It would be good if you could link those threads.

--
With Regards,
Amit Kapila.

 

 


Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-02-27 Thread Давыдов Виталий

Hi Amit,

Thank you for your interest in the discussion!

On Monday, February 26, 2024 16:24 MSK, Amit Kapila  
wrote:
 
> I think the reason is probably that when the WAL record for prepared is already
> flushed then what will be the idea of async commit here?

I think the idea of async commit should be applied to both transactions,
PREPARE and COMMIT PREPARED, which are actually two separate local
transactions. For both of them we may call XLogSetAsyncXactLSN on commit
instead of XLogFlush when async commit is enabled. When I say async commit, I
mean applying async commit to the local transactions, not to the twophase
(prepared) transaction itself.

> At commit prepared, it seems we read prepare's WAL record, right? If so, it is
> not clear to me do you see a problem with a flush of commit_prepared or reading
> WAL for prepared or both of these.

The problem with reading WAL is due to the async commit of PREPARE TRANSACTION,
which saves the 2PC state in the WAL. At the moment of COMMIT PREPARED, the WAL
with the PREPARE TRANSACTION 2PC state may not be XLogFlush-ed yet. So,
PREPARE TRANSACTION should wait until its 2PC state is flushed.

I did some experiments with saving the 2PC state in the local memory of the
logical replication worker and, I think, it worked and demonstrated much better
performance. The logical replication worker utilized up to 100% CPU. I'm just
concerned about possible problems with async commit for twophase transactions.

To be more specific, I've attached a patch to support async commit for
twophase. It is not the final patch; it is presented only for discussion
purposes. There were some attempts to save the 2PC state in memory in the past,
but they were rejected. Now, there might be a second round to discuss it.

With best regards,
Vitaly

 
From 549f809fa122ca0842ec4bfc775afd08feee0d80 Mon Sep 17 00:00:00 2001
From: Vitaly Davydov 
Date: Tue, 27 Feb 2024 14:02:23 +0300
Subject: [PATCH] Add asynchronous commit support for 2PC

---
 src/backend/access/transam/twophase.c | 111 +-
 1 file changed, 108 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index c6af8cfd7e..52f0853db8 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -109,6 +109,8 @@
 #include "utils/memutils.h"
 #include "utils/timestamp.h"
 
+#define POSTGRESQL_TWOPHASE_SUPPORT_ASYNC_COMMIT
+
 /*
  * Directory where Two-phase commit files reside within PGDATA
  */
@@ -163,6 +165,9 @@ typedef struct GlobalTransactionData
 	 */
 	XLogRecPtr	prepare_start_lsn;	/* XLOG offset of prepare record start */
 	XLogRecPtr	prepare_end_lsn;	/* XLOG offset of prepare record end */
+	void*   prepare_2pc_mem_data;
+	size_t  prepare_2pc_mem_len;
+	pid_t   prepare_2pc_proc;
 	TransactionId xid;			/* The GXACT id */
 
 	Oid			owner;			/* ID of user that executed the xact */
@@ -427,6 +432,9 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 
 	MarkAsPreparingGuts(gxact, xid, gid, prepared_at, owner, databaseid);
 
+	Assert(gxact->prepare_2pc_mem_data == NULL);
+	Assert(gxact->prepare_2pc_proc == 0);
+
 	gxact->ondisk = false;
 
 	/* And insert it into the active array */
@@ -1129,6 +1137,8 @@ StartPrepare(GlobalTransaction gxact)
 	}
 }
 
+extern bool IsLogicalWorker(void);
+
 /*
  * Finish preparing state data and writing it to WAL.
  */
@@ -1167,6 +1177,37 @@ EndPrepare(GlobalTransaction gxact)
 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
  errmsg("two-phase state file maximum length exceeded")));
 
+	Assert(gxact->prepare_2pc_mem_data == NULL);
+	Assert(gxact->prepare_2pc_proc == 0);
+
+	if (IsLogicalWorker())
+	{
+		size_t			len = 0;
+		size_t			offset = 0;
+
+		for (record = records.head; record != NULL; record = record->next)
+			len += record->len;
+
+		if (len > 0)
+		{
+			MemoryContext	oldmemctx;
+
+			oldmemctx = MemoryContextSwitchTo(TopMemoryContext);
+
+			gxact->prepare_2pc_mem_data = palloc(len);
+			gxact->prepare_2pc_mem_len = len;
+			gxact->prepare_2pc_proc = getpid();
+
+			for (record = records.head; record != NULL; record = record->next)
+			{
+				memcpy((char *)gxact->prepare_2pc_mem_data + offset, record->data, record->len);
+				offset += record->len;
+			}
+
+			MemoryContextSwitchTo(oldmemctx);
+		}
+	}
+
 	/*
 	 * Now writing 2PC state data to WAL. We let the WAL's CRC protection
 	 * cover us, so no need to calculate a separate CRC.
@@ -1202,8 +1243,24 @@ EndPrepare(GlobalTransaction gxact)
    gxact->prepare_end_lsn);
 	}
 
+#if !defined(POSTGRESQL_TWOPHASE_SUPPORT_ASYNC_COMMIT)
+
 	XLogFlush(gxact->prepare_end_lsn);
 
+#else
+
+	if (synchronous_commit > SYNCHRONOUS_COMMIT_OFF)
+	{
+		/* Flush XLOG to disk */
+		XLogFlush(gxact->prepare_end_lsn);
+	}
+	else
+	{
+		XLogSetAsyncXactLSN(gxact->prepare_end_lsn);
+	}
+
+#endif
+
 	/* If we crash now, we have prepared: WAL replay will fix things */
 
 	/* Store record's start location to read that later on Commit */
@@ -1

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-02-23 Thread Давыдов Виталий

Hi Amit,
Amit Kapila wrote:
> I don't see we do anything specific for 2PC transactions to make them behave
> differently than regular transactions with respect to synchronous_commit
> setting. What makes you think so? Can you pin point the code you are referring
> to?

Yes, sure. The function RecordTransactionCommitPrepared (twophase.c) is called
on prepared transaction commit. It calls XLogFlush unconditionally. The
function RecordTransactionCommit (for regular transactions, xact.c) calls
XLogFlush if synchronous_commit > OFF; otherwise it calls XLogSetAsyncXactLSN.

There is a comment in RecordTransactionCommitPrepared (by Bruce Momjian)
that shows that async commit is not supported yet:

    /*
     * We don't currently try to sleep before flush here ... nor is there any
     * support for async commit of a prepared xact (the very idea is probably
     * a contradiction)
     */
    /* Flush XLOG to disk */
    XLogFlush(recptr);
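
The difference between the two commit paths can be sketched as a small
stand-alone C model. This is an illustration only, not PostgreSQL code: every
name prefixed with toy_ or TOY_ is hypothetical, and the real condition in
RecordTransactionCommit has additional terms that are left out here.

```c
#include <assert.h>

/* Hypothetical stand-ins for the synchronous_commit levels. */
typedef enum
{
    TOY_SYNC_COMMIT_OFF = 0,
    TOY_SYNC_COMMIT_LOCAL_FLUSH,
    TOY_SYNC_COMMIT_ON
} ToySyncCommitLevel;

/* What the committing process does with its WAL record. */
typedef enum
{
    TOY_WAL_FLUSH_NOW,      /* models XLogFlush(): wait for the fsync */
    TOY_WAL_FLUSH_LATER     /* models XLogSetAsyncXactLSN(): do not wait */
} ToyWalAction;

/* Regular commit: the flush is skipped when synchronous_commit is off. */
ToyWalAction
toy_regular_commit_action(ToySyncCommitLevel level)
{
    assert(level >= TOY_SYNC_COMMIT_OFF && level <= TOY_SYNC_COMMIT_ON);
    return (level > TOY_SYNC_COMMIT_OFF) ? TOY_WAL_FLUSH_NOW
                                         : TOY_WAL_FLUSH_LATER;
}

/* COMMIT PREPARED as described above: the level is ignored. */
ToyWalAction
toy_commit_prepared_action(ToySyncCommitLevel level)
{
    (void) level;           /* unused: the flush is unconditional */
    return TOY_WAL_FLUSH_NOW;
}
```

With the level set to TOY_SYNC_COMMIT_OFF, the regular path returns
TOY_WAL_FLUSH_LATER while the prepared path still returns TOY_WAL_FLUSH_NOW,
which is the asymmetry described above.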
> Right, I think for this we need to implement parallel apply.

Yes, parallel apply is a good point. But I believe it will not help if
asynchronous commit is not supported. There is only one receiver process,
which has to dispatch incoming messages to the parallel workers. I guess you
will never reach the same rate of parallel execution on the replica as on the
master with multiple backends.
 
> Can you be a bit more specific about what exactly you have in mind to achieve
> the above solutions?

My proposal is to implement async commit for 2PC transactions as it is done for
regular transactions. It should significantly speed up the catchup process.
Then we can think about how to apply in parallel, which is much more difficult
to do. The current problem is getting the 2PC state from the WAL on commit
prepared. At that moment the WAL is not flushed yet, so the commit function
waits until the WAL with the 2PC state is flushed. I tried it in my sandbox
and ran into this problem. The inability to get the 2PC state from unflushed
WAL is what stops me right now. I am thinking about possible solutions.

The idea with enableFsync is not a suitable solution in general, I think. I
only mentioned it as an alternative. You set enableFsync = false before
prepare or commit prepared and set enableFsync = true after these functions.
In this case, 2PC records will not be fsync-ed, but FlushPtr will still
advance. Thus, the 2PC state can be read from the WAL on commit prepared
without waiting. To make it work correctly, I guess we have to do some
additional work to keep more WAL on the master and filter out duplicate
transactions on the replica if the replica restarts during catchup.

With best regards,
Vitaly Davydov

 


Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-02-23 Thread Давыдов Виталий

Hi Ajin,

Thank you for your feedback. Could you please try to increase the number of 
clients (-c pgbench option) up to 20 or more? It seems, I forgot to specify it.

With best regards,
Vitaly Davydov

> On Fri, Feb 23, 2024 at 12:29 AM Давыдов Виталий wrote:
> > Dear All,
> > I'd like to present and discuss a problem where 2PC transactions are
> > applied quite slowly on a replica during logical replication. There is a
> > master and a replica with established logical replication from the master
> > to the replica with two_phase = true. At some load level on the master, the
> > replica starts to lag behind the master, and the lag keeps increasing. We
> > have to significantly decrease the load on the master to allow the replica
> > to complete the catchup. Such a problem may create significant difficulties
> > in production. The problem appears at least on the REL_16_STABLE branch.
> >
> > To reproduce the problem:
> >  * Set up logical replication from master to replica with the subscription
> >    parameter two_phase = true.
> >  * Create some moderate load on the master (use pgbench with custom SQL
> >    with prepare+commit).
> >  * Optionally switch off the replica for some time (keep the load on the
> >    master).
> >  * Switch on the replica and wait until it catches up with the master.
> >
> > The replica will never catch up with the master, even with a fairly low
> > load on the master. If the load is removed, the replica catches up, but it
> > takes much longer than expected. I tried the same with regular
> > transactions, and the problem doesn't appear even with a decent load.
>
> I tried this setup and I do see that the logical subscriber does reach the
> master in a short time. I'm not sure what I'm missing. I stopped the logical
> subscriber in between while pgbench was running and then started it again and
> ran the following:
>
> postgres=# SELECT sent_lsn, pg_current_wal_lsn() FROM pg_stat_replication;
>  sent_lsn  | pg_current_wal_lsn
> -----------+--------------------
>  0/6793FA0 | 0/6793FA0 <=== caught up
> (1 row)
>
> My pgbench command:
> pgbench postgres -p 6972 -c 2 -j 3 -f /home/ajin/test.sql -T 200 -P 5
>
> my custom sql file:
> cat test.sql
> SELECT md5(random()::text) as mygid \gset
> BEGIN;
> DELETE FROM test WHERE v = pg_backend_pid();
> INSERT INTO test(v) SELECT pg_backend_pid();
> PREPARE TRANSACTION $$:mygid$$;
> COMMIT PREPARED $$:mygid$$;
>
> regards,
> Ajin Cherian
> Fujitsu Australia

 


Slow catchup of 2PC (twophase) transactions on replica in LR

2024-02-22 Thread Давыдов Виталий

Dear All,
I'd like to present and discuss a problem where 2PC transactions are applied
quite slowly on a replica during logical replication. There is a master and a
replica with established logical replication from the master to the replica
with two_phase = true. At some load level on the master, the replica starts to
lag behind the master, and the lag keeps increasing. We have to significantly
decrease the load on the master to allow the replica to complete the catchup.
Such a problem may create significant difficulties in production. The problem
appears at least on the REL_16_STABLE branch.

To reproduce the problem:
 * Set up logical replication from master to replica with the subscription
   parameter two_phase = true.
 * Create some moderate load on the master (use pgbench with custom SQL with
   prepare+commit).
 * Optionally switch off the replica for some time (keep the load on the
   master).
 * Switch on the replica and wait until it catches up with the master.

The replica will never catch up with the master, even with a fairly low load on
the master. If the load is removed, the replica catches up, but it takes much
longer than expected. I tried the same with regular transactions, and the
problem doesn't appear even with a decent load.

I think the main cause of the poor 2PC catchup performance is the lack of
asynchronous commit support for 2PC. For regular transactions, asynchronous
commit is used on the replica by default (subscription synchronous_commit =
off). It allows the replication worker process on the replica to avoid fsync
(XLogFlush) and to utilize 100% CPU (the background WAL writer or checkpointer
will do the fsync). I agree that 2PC is mostly used in multimaster
configurations with two or more nodes that operate synchronously, but when a
node is in catchup (the node is not online in the multimaster cluster),
asynchronous commit has to be used to speed up the catchup.

There is another thing that contributes to the imbalance between master and
replica performance. When the master executes requests from multiple clients,
an fsync optimization takes place in XLogFlush. It decreases the number of
fsync calls when a number of parallel backends write to the WAL
simultaneously. The replica applies received transactions sequentially in a
single worker, so this optimization does not apply.
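
The effect of this optimization can be illustrated with a toy model in C. It
is an idealized sketch, not PostgreSQL code: the ToyWal structure and all toy_
functions are hypothetical, WAL positions are plain counters, and the
concurrent case assumes the best outcome, where every backend has inserted its
commit record before the first flush starts.

```c
#include <assert.h>

/* Toy WAL: positions are plain counters, not real LSNs. */
typedef struct
{
    unsigned long insert_pos;   /* end of the last inserted record */
    unsigned long flush_pos;    /* everything up to here is durable */
    unsigned long fsync_calls;  /* number of simulated fsync() calls */
} ToyWal;

/* Append one commit record and return its end position. */
unsigned long
toy_wal_insert(ToyWal *wal)
{
    wal->insert_pos += 1;
    return wal->insert_pos;
}

/*
 * Make the WAL durable up to 'upto'. If an earlier flush already covered
 * that position, nothing is done; otherwise one fsync covers every record
 * inserted so far, including records of other backends.
 */
void
toy_wal_flush(ToyWal *wal, unsigned long upto)
{
    assert(upto <= wal->insert_pos);
    if (wal->flush_pos >= upto)
        return;
    wal->flush_pos = wal->insert_pos;
    wal->fsync_calls += 1;
}

/* n backends insert their commit records, then each asks for a flush. */
unsigned long
toy_fsyncs_concurrent(int n)
{
    ToyWal wal = {0, 0, 0};
    unsigned long last = 0;

    for (int i = 0; i < n; i++)
        last = toy_wal_insert(&wal);
    /* the records end at positions 1..last, one per backend */
    for (unsigned long pos = 1; pos <= last; pos++)
        toy_wal_flush(&wal, pos);
    return wal.fsync_calls;
}

/* One apply worker commits n transactions one after another. */
unsigned long
toy_fsyncs_sequential(int n)
{
    ToyWal wal = {0, 0, 0};

    for (int i = 0; i < n; i++)
        toy_wal_flush(&wal, toy_wal_insert(&wal));
    return wal.fsync_calls;
}
```

In this model, 20 concurrent backends need a single fsync, while a single
worker applying the same 20 transactions sequentially needs 20. Real workloads
fall somewhere in between, but the gap is the point being made here.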

I see some possible solutions:
 * Implement asynchronous commit for 2PC transactions.
 * Do some hacking with enableFsync when it is possible.

I think asynchronous commit support for 2PC transactions should significantly
increase replica performance and help to solve this problem. I tried to
implement it (as for usual transactions) but found another problem: the 2PC
state is stored in the WAL on prepare, and on commit we have to read the 2PC
state from the WAL, but the read is delayed until the WAL is flushed by the
background WAL writer (the read LSN should be less than the flush LSN).
Storing the 2PC state in shared memory (as was proposed earlier) may help.

I used the following query to monitor the catchup progress on the master:
SELECT sent_lsn, pg_current_wal_lsn() FROM pg_stat_replication;

I used the following script for pgbench on the master:
SELECT md5(random()::text) as mygid \gset
BEGIN;
DELETE FROM test WHERE v = pg_backend_pid();
INSERT INTO test(v) SELECT pg_backend_pid();
PREPARE TRANSACTION $$:mygid$$;
COMMIT PREPARED $$:mygid$$;
 
What do you think?
 
With best regards,
Vitaly Davydov


Re: How to accurately determine when a relation should use local buffers?

2023-11-27 Thread Давыдов Виталий

Hi Aleksander,
> Well even assuming this patch will make it to the upstream some day,
> which I seriously doubt, it will take somewhere between 2 and 5 years.
> Personally I would recommend reconsidering this design.

I understand what you are saying. I have no plans to create a patch for this
issue. I would like to believe that my case will be taken into consideration
in future development. Thank you very much for your help!

With best regards,
Vitaly


Re: How to accurately determine when a relation should use local buffers?

2023-11-23 Thread Давыдов Виталий

Hi Aleksander,
> I sort of suspect that you are working on a very specific extension
> and/or feature for PG fork. Any chance you could give us more details
> about the case?

I'm trying to adapt a multimaster solution to some changes in pg16. We
replicate temp table DDL for certain reasons. Furthermore, such tables should
be accessible from processes other than the replication receiver process on a
replica, and they still should be temporary. I understand that DML replication
for temporary tables would cause a severe performance degradation, but that is
not our case.

There are some changes in the ReadBuffer logic compared with pg15. In pg15,
ReadBuffer used the SmgrIsTemp function to decide which buffers to use, so the
decision was based on the backend id of the relation. In pg16 the decision is
based on the relpersistence attribute, which caused some problems on my side.
In my opinion, we should choose local buffers based on the backend id of a
relation, not on its persistence. An additional check for relpersistence prior
to the backend id check may improve performance in some cases, I think. The
internal design may become more flexible as a result.

With best regards,
Vitaly Davydov
 


Re: How to accurately determine when a relation should use local buffers?

2023-11-22 Thread Давыдов Виталий

Hi Aleksander,

Thank you for your answers. It seems local buffers are used for temporary
relations unconditionally. In this case, we may check either relpersistence or
backend id, or both of them.

> I didn't do a deep investigation of the code in this particular aspect but that
> could be a fair point. Would you like to propose a refactoring that unifies the
> way we check if the relation is temporary?

I would propose not to associate temporary relations with local buffers. I
would say that we should choose local buffers only in a backend context. It is
the primary condition. Thus, to choose local buffers, two checks should
succeed:
 * relpersistence (RelationUsesLocalBuffers)
 * backend id (SmgrIsTemp)

I know it may be not as efficient as checking relpersistence only, but it makes
the internal architecture more flexible, I believe.
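
The proposed rule can be written down as a minimal C sketch. It does not use
PostgreSQL headers: the ToyRelation structure, the TOY_ constants and the
toy_ functions are hypothetical stand-ins that only mirror the two checks
named above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_RELPERSISTENCE_PERMANENT 'p'
#define TOY_RELPERSISTENCE_TEMP      't'
#define TOY_INVALID_BACKEND_ID       (-1)

/* Hypothetical, minimal view of a relation. */
typedef struct
{
    char relpersistence;    /* persistence recorded for the relation */
    int  backend;           /* owning backend, or TOY_INVALID_BACKEND_ID */
} ToyRelation;

/* Stand-in for RelationUsesLocalBuffers: looks at persistence only. */
bool
toy_relation_uses_local_buffers(const ToyRelation *rel)
{
    return rel->relpersistence == TOY_RELPERSISTENCE_TEMP;
}

/* Stand-in for SmgrIsTemp: looks at the backend id only. */
bool
toy_smgr_is_temp(const ToyRelation *rel)
{
    return rel->backend != TOY_INVALID_BACKEND_ID;
}

/* Proposed rule: local buffers only when both checks succeed. */
bool
toy_use_local_buffers(const ToyRelation *rel)
{
    assert(rel != NULL);
    return toy_relation_uses_local_buffers(rel) && toy_smgr_is_temp(rel);
}
```

Under this rule, a temporary relation that is not bound to a backend (for
example, one created by a replication worker in my case) would go through
shared buffers, while an ordinary session temporary table would keep using
local buffers.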

With best regards,
Vitaly Davydov



 


How to accurately determine when a relation should use local buffers?

2023-11-20 Thread Давыдов Виталий

Dear Hackers,

I would like to clarify what the correct way is to determine that a given
relation is using local buffers. Local buffers, as far as I know, are used for
temporary tables in backends. There are two functions/macros (bufmgr.c):
SmgrIsTemp and RelationUsesLocalBuffers. The first verifies that the current
process is a regular session backend, while the other verifies the relation
persistence characteristic. It seems that using either of them on its own is
not correct. I think they should be applied together to check for local
buffers use, but it seems they are used independently. This works as long as
temporary tables are allowed only in session backends.

I'm concerned about how to determine the use of local buffers in some other
theoretical cases, for example, if we decide to replicate temporary tables.
Are there other cases where local buffers can be used with relations in
vanilla PostgreSQL? Do we allow the use of relations with RELPERSISTENCE_TEMP
outside of session backends?

Thank you in advance for your help!

With best regards,
Vitaly Davydov