On 24 February 2016 at 03:53, Oleksii Kliukin <al...@hintbits.com> wrote:


>
> I found the following issue when shutting down a master with a connected
> replica that uses a physical failover slot:
>
> 2016-02-23 20:33:42.546 CET,,,54998,,56ccb3f3.d6d6,3,,2016-02-23 20:33:07
> CET,,0,DEBUG,00000,"performing replication slot checkpoint",,,,,,,,,""
> 2016-02-23 20:33:42.594 CET,,,55002,,56ccb3f3.d6da,4,,2016-02-23 20:33:07
> CET,,0,DEBUG,00000,"archived transaction log file
> ""000000010000000000000003""",,,,,,,,,""
> 2016-02-23 20:33:42.601 CET,,,54998,,56ccb3f3.d6d6,4,,2016-02-23 20:33:07
> CET,,0,PANIC,XX000,"concurrent transaction log activity while database
> system is shutting down",,,,,,,,,""
> 2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,5,,2016-02-23 20:33:07
> CET,,0,LOG,00000,"checkpointer process (PID 54998) was terminated by signal
> 6: Abort trap",,,,,,,,,""
> 2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,6,,2016-02-23 20:33:07
> CET,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,
>
>
Odd that I didn't see that in my testing. Thanks very much for this. I
concur with your explanation.

> Basically, the issue is that CreateCheckPoint calls
> CheckpointReplicationSlots, which currently produces WAL, and this violates
> the assumption at line xlog.c:8492
>
> if (shutdown && checkPoint.redo != ProcLastRecPtr)
> ereport(PANIC,
> (errmsg("concurrent transaction log activity while database system is
> shutting down")));
>

Interesting problem.

It might be reasonably harmless to omit writing WAL for failover slots
during a shutdown checkpoint. We're using WAL to move data to the replicas
but we don't really need it for local redo and correctness on the master.
The trouble is that we do of course redo failover slot updates on the
master and we don't really want a slot to go backwards vs its on-disk state
before a crash. That's not too harmful in itself, but it could lead to us
losing a slot's catalog_xmin increase, so the slot thinks catalog rows are
still readable when they could actually have been vacuumed away.

CheckpointReplicationSlots notes that:

 * This needn't actually be part of a checkpoint, but it's a convenient
 * location.

... and I suspect the answer there is simply to move the slot checkpoint to
occur prior to the WAL checkpoint rather than during it. I'll investigate.
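
To make the ordering argument concrete, here is a toy model in C. It is
not PostgreSQL code: every name in it (ToyLsn, toy_write_wal,
toy_checkpoint_slots, toy_shutdown_checkpoint) is made up for
illustration, and toy_last_rec only stands in for ProcLastRecPtr. It
assumes the shutdown check amounts to "the redo pointer must be the start
of the checkpoint record itself", which is how I read the test quoted
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only; none of these names exist in PostgreSQL. */
typedef uint64_t ToyLsn;

static ToyLsn toy_insert_pos = 0;  /* next free position in the toy WAL */
static ToyLsn toy_last_rec = 0;    /* start of the last record written;
                                    * stands in for ProcLastRecPtr */

/* Append a record of 'len' bytes and return its start position. */
static ToyLsn
toy_write_wal(unsigned len)
{
    assert(len > 0);
    toy_last_rec = toy_insert_pos;
    toy_insert_pos += len;
    return toy_last_rec;
}

/* Stands in for a slot checkpoint that emits WAL for failover slots. */
static void
toy_checkpoint_slots(void)
{
    toy_write_wal(16);
}

/*
 * Model of a shutdown checkpoint. Returns true if the "no concurrent WAL
 * activity" check would pass, false where the real code would PANIC.
 */
static bool
toy_shutdown_checkpoint(bool slots_first)
{
    ToyLsn redo;

    if (slots_first)
        toy_checkpoint_slots();   /* proposed order: slots before redo */

    redo = toy_insert_pos;        /* redo pointer is fixed here */

    if (!slots_first)
        toy_checkpoint_slots();   /* current order: WAL written after redo */

    toy_write_wal(32);            /* the checkpoint record itself */

    return redo == toy_last_rec;
}
```

With slots_first set to true the slot record lands before the redo
pointer is taken, so the check passes; with false the slot record sits
between the redo pointer and the checkpoint record and the check fails.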


I really want to focus on the first patch, timeline following for logical
slots. That part is much less invasive and is useful stand-alone. I'll move
it to a separate CF entry and post it to a separate thread as I think it
needs consideration independently of failover slots.


(BTW, the slot docs promise that slots will replay a change exactly once,
but this is not correct: the client must keep track of its replay position.
I'll post a patch to correct that separately.)


> There are a couple of incorrect comments
>

Thanks, will amend.


-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
