> On Apr 7, 2026, at 13:39, Fujii Masao <[email protected]> wrote:
> 
> On Tue, Apr 7, 2026 at 12:32 AM Andres Freund <[email protected]> wrote:
>> Failed on CI just now:
>> 
>> https://cirrus-ci.com/task/6745359004729344?logs=test_world#L410
>> https://api.cirrus-ci.com/v1/artifact/task/6745359004729344/testrun/build/testrun/subscription/038_walsnd_shutdown_timeout/log/regress_log_038_walsnd_shutdown_timeout
>> 
>> [14:58:26.146](0.066s) ok 3 - have walreceiver pid 13796
>> ### Stopping node "publisher" using mode fast
>> # Running: pg_ctl --pgdata 
>> /home/postgres/postgres/build/testrun/subscription/038_walsnd_shutdown_timeout/data/t_038_walsnd_shutdown_timeout_publisher_data/pgdata
>>  --mode fast stop
>> waiting for server to shut 
>> down...........................................................................................................................
>>  failed
>> pg_ctl: server does not shut down
>> # pg_ctl stop failed: 256
>> # Postmaster PID for node "publisher" is 3679
>> [15:00:38.178](132.032s) Bail out!  pg_ctl stop failed
> 
> Thanks for reporting this!
> 
> From the CI results [1], the failure in 038_walsnd_shutdown_timeout.pl appears
> to occur intermittently on FreeBSD. The failing case tests that, when both
> physical and logical replication are in use with slotsync enabled and both are
> stalled (walreceiver on the standby and the logical apply worker on
> the subscriber are blocked), shutting down the primary completes due to
> wal_sender_shutdown_timeout.
> 
> On FreeBSD, however, it seems that after the shutdown request, the physical
> walsender can occasionally keep running, preventing shutdown from completing.
> As a result, pg_ctl stop times out and the test fails.
> 
> I’ll investigate the cause. If it takes time to identify, I may temporarily
> disable just this test case so it doesn’t block other development and testing,
> then re-enable it once the issue is fixed.
> 
> Regards,
> 
> [1]
> https://cirrus-ci.com/build/5134823678803968
> https://cirrus-ci.com/build/5735329598013440
> https://cirrus-ci.com/build/5917696627310592
> https://cirrus-ci.com/build/5742460250357760
> 
> -- 
> Fujii Masao
> 
> 

Some of my CF entries have failed on this test case as well, so I looked
into the problem. Here is a finding for your reference.

With a8f45dee917, wal_sender_shutdown_timeout is only enforced while the
walsender keeps returning to WalSndCheckShutdownTimeout() in the main loops,
but there is also a path that enters WalSndDone():
```
                        /*
                         * When SIGUSR2 arrives, we send any outstanding logs up to the
                         * shutdown checkpoint record (i.e., the latest record), wait for
                         * them to be replicated to the standby, and exit. This may be a
                         * normal termination at shutdown, or a promotion, the walsender
                         * is not sure which.
                         */
                        if (got_SIGUSR2)
                                WalSndDone(send_data);
```

Once in WalSndDone(), it may call pq_flush() and get stuck:
```
        if (WalSndCaughtUp && sentPtr == replicatedPtr &&
                !pq_is_send_pending())
        {
                QueryCompletion qc;

                /* Inform the standby that XLOG streaming is done */
                SetQueryCompletion(&qc, CMDTAG_COPY, 0);
                EndCommand(&qc, DestRemote, false);
                pq_flush();

                proc_exit(0);
```

And once stuck, it will never get back to WalSndCheckShutdownTimeout(), so the 
new GUC timeout cannot rescue it.

In WalSndDoneImmediate(), pq_flush_if_writable() is used instead, and its
comment describes this possible hang:
```
                /*
                 * Note that the output buffer may be full during the forced shutdown
                 * of walsender. If pq_flush() is called at that time, the walsender
                 * process will be stuck. Therefore, call pq_flush_if_writable()
                 * instead. Successful reception of the done message with the
                 * walsender forced into a shutdown is not guaranteed.
                 */
                pq_flush_if_writable();
```

So, maybe WalSndDone() should use pq_flush_if_writable() instead of pq_flush()?
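
To make the difference concrete outside PostgreSQL, here is a minimal
standalone sketch in plain POSIX C. It is not the walsender code:
flush_if_writable() and demo_stalled_peer() are hypothetical names, a
socketpair whose peer never reads stands in for the stalled walreceiver, and
send() with MSG_DONTWAIT is only assumed to approximate what
pq_flush_if_writable() does. A blocking send() at the same point would wait
until the peer drains the buffer, which is the behavior suspected of pq_flush()
above.
```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a "flush if writable" call: push as much of buf
 * as the kernel accepts right now and never wait for the peer.
 * Returns the number of bytes still pending, or -1 on a hard error.
 */
static ssize_t
flush_if_writable(int fd, const char *buf, size_t len)
{
    size_t      sent = 0;

    while (sent < len)
    {
        ssize_t     n = send(fd, buf + sent, len - sent, MSG_DONTWAIT);

        if (n > 0)
        {
            sent += (size_t) n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;           /* interrupted, just retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            break;              /* buffer full: give up instead of blocking */
        return -1;              /* any other failure */
    }
    return (ssize_t) (len - sent);
}

/*
 * Simulate a stalled receiver: the peer end of the socketpair never reads.
 * Returns the bytes left pending by a final flush attempt, or -1 on error.
 */
static ssize_t
demo_stalled_peer(void)
{
    int         sv[2];
    char        chunk[4096] = {0};
    ssize_t     pending;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    /* Keep queueing chunks until one can no longer be fully accepted. */
    while ((pending = flush_if_writable(sv[0], chunk, sizeof(chunk))) == 0)
        ;
    assert(pending != 0);       /* loop exits only on "full" or error */

    /* The buffer is full now; another attempt must return without waiting. */
    if (pending > 0)
        pending = flush_if_writable(sv[0], chunk, sizeof(chunk));

    close(sv[0]);
    close(sv[1]);
    return pending;
}
```

Calling demo_stalled_peer() returns promptly with a positive count of bytes
still pending, so the caller gets control back and could go on to check a
timeout. That is the property WalSndDone() seems to be missing today.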

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/





