Some additional information:

XLogSendLogical is stuck in the following infinite loop:
- It attempts to read the next record with XLogReadAhead + XLogDecodeNextRecord
- The page with the record header is read
- Now that it has the record header, it goes back to XLogDecodeNextRecord
- tot_len > len, so the record needs to be reassembled
- The next page, containing the rest of the record, is read with
ReadPageInternal. This fails since that page was never written.
- It jumps to the err label, where XLogReaderInvalReadState(state) is
called and resets the reader state
- It goes back to the start of WalSndLoop's loop
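
The steps above can be modelled with a small standalone sketch. This is not
the xlogreader code: the model_* names, the record start and the record
length are hypothetical, and page headers on continuation pages are ignored.
It only shows that retrying cannot make progress while the written pointer
stays at the page boundary.

```c
#include <stdbool.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192        /* assumption: default WAL block size */

typedef uint64_t XLogRecPtr;

/*
 * One decode attempt: walk the record page by page and fail as soon as
 * the bytes needed from a page lie beyond what has been written.
 */
static bool
model_try_decode(XLogRecPtr rec_start, uint32_t tot_len,
                 XLogRecPtr written_upto)
{
    XLogRecPtr  rec_end = rec_start + tot_len;
    XLogRecPtr  pos = rec_start;

    while (pos < rec_end)
    {
        XLogRecPtr  page_end = pos - pos % XLOG_BLCKSZ + XLOG_BLCKSZ;
        XLogRecPtr  need = page_end < rec_end ? page_end : rec_end;

        if (need > written_upto)
            return false;       /* page never written: the "err" path */
        pos = need;
    }
    return true;
}

/*
 * Retry from scratch after every failure, as happens once the reader
 * state has been reset. Returns the number of failed attempts.
 */
static int
model_count_failed_attempts(XLogRecPtr rec_start, uint32_t tot_len,
                            XLogRecPtr written_upto, int max_attempts)
{
    int         failed = 0;

    while (failed < max_attempts &&
           !model_try_decode(rec_start, tot_len, written_upto))
        failed++;
    return failed;
}
```

With a hypothetical record starting at 39124900 with tot_len = 3000 and the
written pointer at 39124992, every attempt fails at the second page, so
model_count_failed_attempts returns whatever max_attempts was passed.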

The walsender does make some attempts to flush the WAL using
XLogBackgroundFlush:
  /*
   * If we're shutting down, trigger pending WAL to be written out,
   * otherwise we'd possibly end up waiting for WAL that never gets
   * written, because walwriter has shut down already.
   */
  if (got_STOPPING)
    XLogBackgroundFlush();

However, XLogBackgroundFlush only writes completed blocks, or up to the
latest known async xact LSN. With the issue triggered, I have the
following state:

XLogCtl->LogwrtRqst: (Write = 39128056, Flush = 39124992)
LogwrtResult: (Write = 39124992, Flush = 39124992)
XLogCtl->asyncXactLSN: 39119776

There are 3064 bytes (39128056 - 39124992) on the next page, holding the
rest of the continuation record, that still need to be written.
However, XLogBackgroundFlush backs off to the previous page boundary:
  /* back off to last completed page boundary */
  WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;
This means WriteRqst.Write will be 39124992, which is already written and
flushed, and asyncXactLSN is behind both write and flush, so there is
nothing left for XLogBackgroundFlush to do.
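
To make the arithmetic concrete, here is a minimal standalone model of how
the write target is chosen. It is a simplification, not the xlog.c code:
model_background_flush_target is a hypothetical name, locking is omitted,
and XLOG_BLCKSZ is assumed to be the default 8192.

```c
#include <stdbool.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192        /* assumption: default WAL block size */

typedef uint64_t XLogRecPtr;

/*
 * Simplified model of the target selection: back off to the last
 * completed page boundary, fall back to the async xact LSN if that is
 * already flushed, and report whether anything is left to write.
 */
static bool
model_background_flush_target(XLogRecPtr rqst_write,
                              XLogRecPtr result_flush,
                              XLogRecPtr async_xact_lsn,
                              XLogRecPtr *target)
{
    XLogRecPtr  write = rqst_write;

    /* back off to last completed page boundary */
    write -= write % XLOG_BLCKSZ;

    /* already flushed that far: consider the async xact LSN instead */
    if (write <= result_flush)
        write = async_xact_lsn;

    /* still nothing beyond the flush pointer: nothing to do */
    if (write <= result_flush)
        return false;

    *target = write;
    return true;
}
```

Called with the values above (39128056, 39124992, 39119776), the model
returns false: 39128056 % 8192 is 3064, so the back-off lands exactly on
39124992, and the async LSN is older than that.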

So, it looks like the root issue is rather that the async LSN isn't
updated when a transaction without an xid is rolled back.
When going through CommitTransaction, such a transaction would still
go through XLogSetAsyncXactLSN.

I've updated the patch with this new approach: XLogSetAsyncXactLSN is
now called in RecordTransactionAbort even when an xid wasn't assigned.
With this, the logical walsender is able to force the flush of the
last partial page using XLogBackgroundFlush.
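
A rough model of the effect, as a sketch rather than the patch itself: the
ModelXLogCtl struct and model_* functions are hypothetical, locking is
omitted, and it assumes the abort reports an end LSN of 39128056, which is
taken here from the write request above and not stated by the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192        /* assumption: default WAL block size */

typedef uint64_t XLogRecPtr;

/* Hypothetical stand-in for the few shared fields involved. */
typedef struct ModelXLogCtl
{
    XLogRecPtr  rqst_write;     /* LogwrtRqst.Write */
    XLogRecPtr  result_flush;   /* LogwrtResult.Flush */
    XLogRecPtr  async_xact_lsn; /* asyncXactLSN */
} ModelXLogCtl;

/* The async LSN only ever moves forward. */
static void
model_set_async_xact_lsn(ModelXLogCtl *ctl, XLogRecPtr lsn)
{
    if (ctl->async_xact_lsn < lsn)
        ctl->async_xact_lsn = lsn;
}

/* Does a background flush find anything beyond the flush pointer? */
static bool
model_flush_has_work(const ModelXLogCtl *ctl)
{
    XLogRecPtr  target = ctl->rqst_write - ctl->rqst_write % XLOG_BLCKSZ;

    if (target <= ctl->result_flush)
        target = ctl->async_xact_lsn;   /* partial page reachable only here */
    return target > ctl->result_flush;
}
```

Starting from the state shown earlier, model_flush_has_work is false; after
model_set_async_xact_lsn(&ctl, 39128056) it becomes true, which is the
behaviour the walsender relies on during shutdown.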

Attachment: v2-0001-Fix-stuck-shutdown-due-to-unflushed-records.patch