On Thu, Jan 29, 2026 at 4:03 PM Hayato Kuroda (Fujitsu) <[email protected]> wrote:
>
> Dear Fujii-san,
>
> > While reviewing the patch at [1], I noticed a case where lock waits on
> > a logical apply worker in the subscriber can cause the checkpointer on
> > the publisher to stall. This seems like unexpected behavior and may
> > need to be addressed.
> >
> > The issue can occur as follows:
> >
> > 1. A logical apply worker on the subscriber blocks waiting for a lock.
> > 2. Because the apply worker cannot receive further messages, the walsender's
> >    send buffer on the publisher becomes full.
> > 3. If the walsender then encounters a max_slot_wal_keep_size error,
> >    it attempts to send an error message to the subscriber before exiting.
> >    However, with a full send buffer, the walsender blocks while trying to
> >    send this message.
> > 4. The checkpointer on the publisher calls
> >    InvalidateObsoleteReplicationSlots()
> >    and waits for the slot to be released. Since the walsender is stuck and
> >    the slot is not released, the checkpointer also becomes stuck.
>
> I confirmed this could happen if the max_slot_wal_keep_size is enabled
> (in other words, the value is not -1).
> Per my test, wal_sender_timeout cannot work well because the process is stuck
> at the lower layer, but tcp_user_timeout can terminate the process. Can we
> mention the workaround in the doc instead of fixing the code?
>
> It won't work for a Unix domain socket connection, but it's not realistic for
> the production stage.

This approach doesn't seem helpful on platforms that don't support
TCP_USER_TIMEOUT, i.e., where tcp_user_timeout is not available, right?
If I remember correctly, Windows is one of those platforms.

Regards,

--
Fujii Masao
