Hello. My colleague found that pg_stat_replication.sync_state
shows false state for some condition.

This prevents Pacemaker from completing fail-over that could
safely be done.

The point is in walsender.c, pg_stat_get_wal_senders() below, (as
of REL9_2_1)

  1555:    if (walsnd->pid != 0)
  1556:    {
  1557:      /*
  1558:       * Treat a standby such as a pg_basebackup background process
  1559:       * which always returns an invalid flush location, as an
  1560:       * asynchronous standby.
  1561:       */
! 1562:       sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
  1563:          0 : walsnd->sync_standby_priority;

Here, XLogRecPtrIsInvalid(walsnd->flush) is defined as
(walsnd->flush.xrecoff == 0) which becomes true as usual at every
WAL 'file's (not segments) boundary. xrecoff == 0 is certainly
invalid for the start point of WAL *RECORD*, but should be
considered valid in replication stream. This check was introduced
at 9.2.0 and the version up between 9.1.4 and 9.1.5.

 | DEBUG:  write 4/0 flush 3/FEFFEC68 apply 3/FEFFEC68
 | DEBUG:  write 4/0 flush 4/0 apply 3/FEFFEC68
 | DEBUG:  HOGE: flush = 3/FEFFEC68 sync_priority[0] = 1
 | DEBUG:  write 4/111C0 flush 4/0 apply 3/FEFFECC0
!| DEBUG:  HOGE: flush = 4/0 sync_priority[0] = 0

This value zero of sync_priority[0] makes sync_status 'async'
errorneously and confuses Pacemaker.

# The log line marked with 'HOGE' above printed by applying the
# patch at the bottom of this message and invoking 'select
# sync_state from pg_stat_replication' periodically. To increase
# the chance to see the symptom, sleep 1 second for 'file'
# boundaries :-)

The Heikki's recent(?) commit
0ab9d1c4b31622e9176472b4276f3e9831e3d6ba which changes the format
of XLogRecPtr from logid:xrecoff struct to 64 bit linear address
would fix the false indication. But I suppose this patch won't be
applied to existing 9.1.x and 9.2.x because of the modification
onto streaming protocol.

As far as I see the patch, it would'nt change the meaning of
XLogRecPtr to change XLogRecPtrIsInvalid from (xrecoff == 0) to
(xrecoff == 0 && xlogid == 0). But this change affects rather
wide portion where handling WAL nevertheless what is needed here
is only to stop the false indication.

On the other hand, pg_basebackup seems return 0/0 as flush and
apply positions so it seems enough only to add xlogid == 0 into
the condition. The patch attached for REL9_2_1 does this and
yields the result following.

 | DEBUG:  write 2/FEFFFD48 flush 2/FEFFFD48 apply 2/FEFF7AB0
 | DEBUG:  write 3/0 flush 2/FEFFFD48 apply 2/FEFF7E88
 | DEBUG:  write 3/0 flush 3/0 apply 2/FEFFFD48
 | DEBUG:  HOGE: flush = 2/FEFFFD48 sync_priority[0] = 1
 | DEBUG:  write 3/E338 flush 3/0 apply 2/FEFFFF80
!| DEBUG:  HOGE: flush = 3/0 sync_priority[0] = 1

I think this patch should be applied for 9.2.2 and 9.1.7.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

============================================================
===== The patch for this test.
============================================================
diff --git a/src/backend/replication/walsender.c 
b/src/backend/replication/walsender.c
index 064ddd5..19f79d1 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -618,6 +618,10 @@ ProcessStandbyReplyMessage(void)
                 reply.flush.xlogid, reply.flush.xrecoff,
                 reply.apply.xlogid, reply.apply.xrecoff);
 
+       if (reply.write.xrecoff == 0 ||
+               reply.flush.xrecoff == 0)
+               sleep(1);
+
        /*
         * Update shared state for this WalSender process based on reply data 
from
         * standby.
@@ -1561,7 +1565,10 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
                         */
                        sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
                                0 : walsnd->sync_standby_priority;
-
+                       elog(DEBUG1, "HOGE: flush = %X/%X sync_priority[%d] = 
%d",
+                                walsnd->flush.xlogid, walsnd->flush.xrecoff, 
+                                i, sync_priority[i]);
+                       
                        if (walsnd->state == WALSNDSTATE_STREAMING &&
                                walsnd->sync_standby_priority > 0 &&
                                (priority == 0 ||
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 064ddd5..1d4cbc4 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1555,11 +1555,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		if (walsnd->pid != 0)
 		{
 			/*
-			 * Treat a standby such as a pg_basebackup background process
-			 * which always returns an invalid flush location, as an
+			 * Treat a standby such as a pg_basebackup background process which
+			 * always returns 0/0 (InvalidXLogRecPtr) as flush location, as an
 			 * asynchronous standby.
 			 */
-			sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
+			sync_priority[i] = XLByteEQ(walsnd->flush, InvalidXLogRecPtr) ?
 				0 : walsnd->sync_standby_priority;
 
 			if (walsnd->state == WALSNDSTATE_STREAMING &&
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to