On Thu, Dec 22, 2022 at 6:58 PM Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
>
> Hi All,
> A customer ran a script dropping a few dozen users in a transaction.
> Before dropping each user, the script changed the ownership of the
> tables owned by that user to another user and revoked all of that
> user's accesses, all in the same transaction. There were a few thousand
> tables whose privileges and ownership were changed by this transaction.
> Since all of these changes were to catalog tables, those changes were
> filtered out in ReorderBufferProcessTXN() by the following code:
>
> if (!RelationIsLogicallyLogged(relation))
>     goto change_done;
>
> I tried to reproduce a similar situation through the attached TAP test.
> For 500 users and 1000 tables, we see that the transaction takes
> significant time but logical decoding does not take much time, so with
> the default 1 min WAL sender and receiver timeouts I could not
> reproduce the timeout. Beyond that, our TAP test itself times out.
>
> But I think there's a possibility that the logical receiver will time
> out this way when decoding a sufficiently large transaction which takes
> more than the timeout amount of time to decode. So I think we need to
> call OutputPluginUpdateProgress() after a regular interval (in terms of
> time or number of changes) to consume any feedback from the subscriber
> or send a keep-alive message.
I don't think it will be a good idea to directly call
OutputPluginUpdateProgress() from reorderbuffer.c. There is already a
patch to discuss this problem [1].

> The following commit
> ```
> commit 87c1dd246af8ace926645900f02886905b889718
> Author: Amit Kapila <akap...@postgresql.org>
> Date:   Wed May 11 10:12:23 2022 +0530
>
>     Fix the logical replication timeout during large transactions.
> ```
> fixed a similar problem when the changes were filtered by an output
> plugin, but in this case the changes are not being handed over to the
> output plugin at all. If we fix it in the core, we may not need to
> handle it in the output plugin as that commit does. The commit does not
> have a test case which I could run to reproduce the timeout.

It is not evident how to write a stable test for this because estimating
how many changes are enough for the configured wal_receiver_timeout to
pass on all the buildfarm machines is tricky. If you have good ideas
then feel free to propose a test patch.

[1] - https://www.postgresql.org/message-id/OS3PR01MB62751A8063A9A75A096000D89E3F9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.