On Thu, Dec 22, 2022 at 6:58 PM Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
>
> Hi All,
> A customer ran a script dropping a few dozen users in a transaction. 
> Before dropping each user, the script changed the ownership of the tables 
> owned by that user to another user and revoked all of that user's accesses, 
> all in the same transaction. There were a few thousand tables whose 
> privileges and ownership were changed by this transaction. Since all of 
> these changes were to catalog tables, they were filtered out in 
> ReorderBufferProcessTXN() by the following code:
>                    if (!RelationIsLogicallyLogged(relation))
>                         goto change_done;
>
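For each user, the script presumably ran something along these lines (an
illustrative sketch; the role names and the exact REVOKE targets are made up):

```
BEGIN;
-- Ownership changes and revokes only update catalog tables
-- (pg_class.relowner, relacl, etc.), so none of the resulting
-- changes survive the RelationIsLogicallyLogged() filter above.
REASSIGN OWNED BY doomed_user TO surviving_user;
REVOKE ALL ON ALL TABLES IN SCHEMA public FROM doomed_user;
DROP ROLE doomed_user;
COMMIT;
```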
> I tried to reproduce a similar situation through the attached TAP test. For 
> 500 users and 1000 tables, we see that the transaction takes significant 
> time but logical decoding does not. So with the default 1 min WAL sender 
> and receiver timeout I could not reproduce the timeout. Beyond that, our 
> TAP test itself times out.
>
> But I think there's a possibility that the logical receiver will time out 
> this way when decoding a sufficiently large transaction whose decoding 
> takes longer than the timeout. So I think we need to call 
> OutputPluginUpdateProgress() at a regular interval (in terms of time or 
> number of changes) to consume any feedback from the subscriber or send a 
> keep-alive message.
>

I don't think it will be a good idea to directly call
OutputPluginUpdateProgress() from reorderbuffer.c. There is already a
patch under discussion for this problem [1].

> Following commit
> ```
> commit 87c1dd246af8ace926645900f02886905b889718
> Author: Amit Kapila <akap...@postgresql.org>
> Date:   Wed May 11 10:12:23 2022 +0530
>
>     Fix the logical replication timeout during large transactions.
>
>  ```
> fixed a similar problem when the changes were filtered by an output plugin, 
> but in this case the changes are not even handed over to the output plugin. 
> If we fix it in the core, we may not need to handle it in each output 
> plugin as that commit does. The commit does not have a test case which I 
> could run to reproduce the timeout.
>

It is not evident how to write a stable test for this because it is
tricky to estimate how many changes are enough to exceed the configured
wal_receiver_timeout on all the buildfarm machines. If you have good
ideas then feel free to propose a test patch.

[1] - 
https://www.postgresql.org/message-id/OS3PR01MB62751A8063A9A75A096000D89E3F9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.

