RE: Logical replication timeout problem

[email protected] Tue, 08 Mar 2022 18:26:35 -0800

On Tue, Mar 8, 2022 at 3:52 PM Masahiko Sawada <[email protected]> wrote:
> I've looked at the patch and have a question:
Thanks for your review and comments.


> +void
> +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) {
> +        static int skipped_changes_count = 0;
> +
> +        /*
> +         * skipped_changes_count is reset when processing changes that do not
> +         * need to be skipped.
> +         */
> +        if (!skipped)
> +        {
> +                skipped_changes_count = 0;
> +                return;
> +        }
> +
> +        /*
> +         * After continuously skipping SKIPPED_CHANGES_THRESHOLD
> changes, try to send a
> +         * keepalive message.
> +         */
> +        #define SKIPPED_CHANGES_THRESHOLD 10000
> +
> +        if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD)
> +        {
> +                /* Try to send a keepalive message. */
> +                OutputPluginUpdateProgress(ctx, true);
> +
> +                /* After trying to send a keepalive message, reset the flag. 
> */
> +                skipped_changes_count = 0;
> +        }
> +}
> 
> Since we send a keepalive after continuously skipping 10000 changes, the
> originally reported issue can still occur if skipping 10000 changes took more 
> than
> the timeout and the walsender didn't send any change while that, is that 
> right?
Yes, theoretically so.
But after testing, I think this value should be conservative enough not to 
reproduce
this bug.
After the previous discussion[1], it is currently considered that it is better
to directly set a conservative threshold than to calculate the threshold based
on wal_sender_timeout.

[1] - 
https://www.postgresql.org/message-id/OS3PR01MB6275FEB9F83081F1C87539B99E019%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

RE: Logical replication timeout problem

Reply via email to