Re: Timeout when changes are filtered out by the core during logical replication

2023-01-09 Thread Ashutosh Bapat
On Fri, Dec 23, 2022 at 2:45 PM Amit Kapila  wrote:

> On Thu, Dec 22, 2022 at 6:58 PM Ashutosh Bapat
>  wrote:
> >
> > Hi All,
> > A customer ran a script dropping a few dozens of users in a transaction.
> Before dropping a user they change the ownership of the tables owned by
> that user to another user and revoking all the accesses from that user in
> the same transaction. There were a few thousand tables whose privileges and
> ownership was changed by this transaction. Since all of these changes were
> in catalog table, those changes were filtered out in
> ReorderBufferProcessTXN()
> > by the following code
> >if (!RelationIsLogicallyLogged(relation))
> > goto change_done;
> >
> > I tried to reproduce a similar situation through the attached TAP test.
> For 500 users and 1000 tables, we see that the transaction takes
> significant time but logical decoding does not take much time. So with the
> default 1 min WAL sender and receiver timeout I could not reproduce the
> timeout. Beyond that our TAp test itself times out.
> >
> > But I think there's a possibility that the logical receiver will time
> out this way when decoding a sufficiently large transaction which takes
> more than the timeout amount of time to decode. So I think we need to call
> OutputPluginUpdateProgress() after a regular interval (in terms of time or
> number of changes) to consume any feedback from the subscriber or send a
> keep-alive message.
> >
>
> I don't think it will be a good idea to directly call
> OutputPluginUpdateProgress() from reorderbuffer.c. There is already a
> patch to discuss this problem [1].
>

Yeah. I don't mean to use OutputPluginUpdateProgress() directly. The patch
just showed that it helps calling it there in some way. Thanks for pointing
the other thread. I have reviewed the patch on that thread and continue the
discussion there.


>
> > Following commit
> > ```
> > commit 87c1dd246af8ace926645900f02886905b889718
> > Author: Amit Kapila 
> > Date:   Wed May 11 10:12:23 2022 +0530
> >
> > Fix the logical replication timeout during large transactions.
> >
> >  ```
> > fixed a similar problem when the changes were filtered by an output
> plugin, but in this case the changes are not being handed over to the
> output plugin as well. If we fix it in the core we may not need to handle
> it in the output plugin as that commit does. The commit does not have a
> test case which I could run to reproduce the timeout.
> >
>
> It is not evident how to write a stable test for this because
> estimating how many changes are enough for the configured
> wal_receiver_timeout to
> pass on all the buildfarm machines is tricky. If you have good ideas
> then feel free to propose a test patch.
>

Will continue this on the other thread.

-- 
Best Wishes,
Ashutosh


Re: Timeout when changes are filtered out by the core during logical replication

2022-12-23 Thread Amit Kapila
On Thu, Dec 22, 2022 at 6:58 PM Ashutosh Bapat
 wrote:
>
> Hi All,
> A customer ran a script dropping a few dozens of users in a transaction. 
> Before dropping a user they change the ownership of the tables owned by that 
> user to another user and revoking all the accesses from that user in the same 
> transaction. There were a few thousand tables whose privileges and ownership 
> was changed by this transaction. Since all of these changes were in catalog 
> table, those changes were filtered out in ReorderBufferProcessTXN()
> by the following code
>if (!RelationIsLogicallyLogged(relation))
> goto change_done;
>
> I tried to reproduce a similar situation through the attached TAP test. For 
> 500 users and 1000 tables, we see that the transaction takes significant time 
> but logical decoding does not take much time. So with the default 1 min WAL 
> sender and receiver timeout I could not reproduce the timeout. Beyond that 
> our TAp test itself times out.
>
> But I think there's a possibility that the logical receiver will time out 
> this way when decoding a sufficiently large transaction which takes more than 
> the timeout amount of time to decode. So I think we need to call 
> OutputPluginUpdateProgress() after a regular interval (in terms of time or 
> number of changes) to consume any feedback from the subscriber or send a 
> keep-alive message.
>

I don't think it will be a good idea to directly call
OutputPluginUpdateProgress() from reorderbuffer.c. There is already a
patch to discuss this problem [1].

> Following commit
> ```
> commit 87c1dd246af8ace926645900f02886905b889718
> Author: Amit Kapila 
> Date:   Wed May 11 10:12:23 2022 +0530
>
> Fix the logical replication timeout during large transactions.
>
>  ```
> fixed a similar problem when the changes were filtered by an output plugin, 
> but in this case the changes are not being handed over to the output plugin 
> as well. If we fix it in the core we may not need to handle it in the output 
> plugin as that commit does. The commit does not have a test case which I 
> could run to reproduce the timeout.
>

It is not evident how to write a stable test for this because
estimating how many changes are enough for the configured
wal_receiver_timeout to
pass on all the buildfarm machines is tricky. If you have good ideas
then feel free to propose a test patch.

[1] - 
https://www.postgresql.org/message-id/OS3PR01MB62751A8063A9A75A096000D89E3F9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.