On Fri, Dec 23, 2022 at 2:45 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Dec 22, 2022 at 6:58 PM Ashutosh Bapat > <ashutosh.ba...@enterprisedb.com> wrote: > > > > Hi All, > > A customer ran a script dropping a few dozens of users in a transaction. > Before dropping a user they change the ownership of the tables owned by > that user to another user and revoking all the accesses from that user in > the same transaction. There were a few thousand tables whose privileges and > ownership was changed by this transaction. Since all of these changes were > in catalog table, those changes were filtered out in > ReorderBufferProcessTXN() > > by the following code > > if (!RelationIsLogicallyLogged(relation)) > > goto change_done; > > > > I tried to reproduce a similar situation through the attached TAP test. > For 500 users and 1000 tables, we see that the transaction takes > significant time but logical decoding does not take much time. So with the > default 1 min WAL sender and receiver timeout I could not reproduce the > timeout. Beyond that our TAp test itself times out. > > > > But I think there's a possibility that the logical receiver will time > out this way when decoding a sufficiently large transaction which takes > more than the timeout amount of time to decode. So I think we need to call > OutputPluginUpdateProgress() after a regular interval (in terms of time or > number of changes) to consume any feedback from the subscriber or send a > keep-alive message. > > > > I don't think it will be a good idea to directly call > OutputPluginUpdateProgress() from reorderbuffer.c. There is already a > patch to discuss this problem [1]. > Yeah. I don't mean to use OutputPluginUpdateProgress() directly. The patch just showed that it helps calling it there in some way. Thanks for pointing the other thread. I have reviewed the patch on that thread and continue the discussion there. > > > Following commit > > ``` > > commit 87c1dd246af8ace926645900f02886905b889718 > > Author: Amit Kapila <akap...@postgresql.org> > > Date: Wed May 11 10:12:23 2022 +0530 > > > > Fix the logical replication timeout during large transactions. > > > > ``` > > fixed a similar problem when the changes were filtered by an output > plugin, but in this case the changes are not being handed over to the > output plugin as well. If we fix it in the core we may not need to handle > it in the output plugin as that commit does. The commit does not have a > test case which I could run to reproduce the timeout. > > > > It is not evident how to write a stable test for this because > estimating how many changes are enough for the configured > wal_receiver_timeout to > pass on all the buildfarm machines is tricky. If you have good ideas > then feel free to propose a test patch. > Will continue this on the other thread. -- Best Wishes, Ashutosh