Re: Running from 9.6 backups sometimes fails with fatal error

2019-02-22 Thread Sergey Burladyan
I started by debugging the backup that needs the fewest WALs to reach consistency;
it is from 'server 2' and needs only 20 WAL files.

Its error was:
ERROR:  xlog flush request 44B/7E5DAB28 is not satisfied --- flushed only to 44B/7305B560
CONTEXT:  writing block 0 of relation base/16506/16891_vm

=== backup_label ===
START WAL LOCATION: 44B/6002F280 (file 000000010000044B00000060)
CHECKPOINT LOCATION: 44B/7305B4F0
BACKUP METHOD: streamed
BACKUP FROM: standby
START TIME: 2019-02-13 03:19:54 MSK
LABEL: pg_basebackup base backup

=== controldata ===
Database cluster state:   in archive recovery
pg_control last modified: Wed 13 Feb 2019 03:19:54 AM MSK
Latest checkpoint location:   44B/7305B4F0
Latest checkpoint's REDO location:    44B/6002F280
Latest checkpoint's REDO WAL file:    000000010000044B00000060
Time of latest checkpoint:            Wed 13 Feb 2019 02:19:36 AM MSK
Minimum recovery ending location: 44B/7305B560

I scanned all *_vm files in database 16506 for pages with an LSN greater than the
'Minimum recovery ending location' and found three files (file, block, page LSN;
a sketch of the scan follows the list):
base/16506/190024_vm 000 44B/9411F5C0
base/16506/190005_vm 000 44B/94121DE8
base/16506/16891_vm 000 44B/7E5DAB28
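
The scan itself is simple: pd_lsn is the first 8 bytes of each 8 kB page header.
A minimal sketch of such a scan, assuming a little-endian machine and running from
the data directory of the restored backup:

=== scan sketch (vm-lsn-scan.sh) ===
#!/bin/sh
# Print every *_vm page whose pd_lsn is greater than the
# minimum recovery ending location (44B/7305B560 for this backup).
min_hi=$((0x44B)); min_lo=$((0x7305B560))
for f in base/16506/*_vm; do
    size=$(stat -c%s "$f"); blk=0
    while [ $((blk * 8192)) -lt "$size" ]; do
        # pd_lsn is two little-endian uint32s at the start of the page
        set -- $(od -An -tx4 -j $((blk * 8192)) -N8 "$f")
        hi=$((0x$1)); lo=$((0x$2))
        if [ "$hi" -gt "$min_hi" ] ||
           { [ "$hi" -eq "$min_hi" ] && [ "$lo" -gt "$min_lo" ]; }
        then
            printf '%s %03d %X/%X\n' "$f" "$blk" "$hi" "$lo"
        fi
        blk=$((blk + 1))
    done
done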

Then I parsed all 20 WAL files needed for consistency and searched them for these
three OIDs; the pg_xlogdump invocation is sketched below.
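
Assuming the 20 segments run from the segment named in backup_label through the one
containing the minimum recovery point (all on timeline 1), the dump would have been
produced with something like:

$ pg_xlogdump 000000010000044B00000060 000000010000044B00000073 > wals-text

In that dump, for 190024 and 190005 I found VISIBLE WAL records: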
=== $ grep 190024 wals-text | grep VISIBLE ===
rmgr: Heap2  len (rec/tot): 64/8256, tx: 0, lsn: 44B/651EDAE8, prev 44B/651EDA58, desc: VISIBLE cutoff xid 752732359 flags 1, blkref #0: rel 1663/16506/190024 fork vm blk 0 FPW, blkref #1: rel 1663/16506/190024 blk 0
rmgr: Heap2  len (rec/tot): 59/59, tx: 0, lsn: 44B/6B19A6E8, prev 44B/6B19A658, desc: VISIBLE cutoff xid 752734764 flags 1, blkref #0: rel 1663/16506/190024 fork vm blk 0, blkref #1: rel 1663/16506/190024 blk 0
rmgr: Heap2  len (rec/tot): 59/59, tx: 0, lsn: 44B/71194D40, prev 44B/71194CB0, desc: VISIBLE cutoff xid 752737874 flags 1, blkref #0: rel 1663/16506/190024 fork vm blk 0, blkref #1: rel 1663/16506/190024 blk 0
=== $ grep 190005 wals-text | grep VISIBLE ===
rmgr: Heap2  len (rec/tot): 64/8256, tx: 0, lsn: 44B/651F3728, prev 44B/651F36C8, desc: VISIBLE cutoff xid 752732355 flags 1, blkref #0: rel 1663/16506/190005 fork vm blk 0 FPW, blkref #1: rel 1663/16506/190005 blk 0
rmgr: Heap2  len (rec/tot): 59/59, tx: 0, lsn: 44B/6B19AE80, prev 44B/6B19AE20, desc: VISIBLE cutoff xid 752732667 flags 1, blkref #0: rel 1663/16506/190005 fork vm blk 0, blkref #1: rel 1663/16506/190005 blk 0
rmgr: Heap2  len (rec/tot): 59/59, tx: 0, lsn: 44B/6B19CF10, prev 44B/6B19AEC0, desc: VISIBLE cutoff xid 752734752 flags 1, blkref #0: rel 1663/16506/190005 fork vm blk 0, blkref #1: rel 1663/16506/190005 blk 1
rmgr: Heap2  len (rec/tot): 59/59, tx: 0, lsn: 44B/711954D8, prev 44B/71195478, desc: VISIBLE cutoff xid 752737869 flags 1, blkref #0: rel 1663/16506/190005 fork vm blk 0, blkref #1: rel 1663/16506/190005 blk 0

But I did not find anything in the parsed WALs about the vm fork of 16891.

Is it OK, or is something wrong here?

PS:
I tried removing all the problematic visibility maps (those with a page LSN greater
than the 'Minimum recovery ending location') and starting from the backup without
them; see the sketch below. Without them, Postgres started successfully.
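
Roughly, on the restored copy of the backup, before starting it (these are the
three files found by the scan above):

$ rm base/16506/190024_vm base/16506/190005_vm base/16506/16891_vm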

-- 
Sergey Burladyan



Running from 9.6 backups sometimes fails with fatal error

2019-02-21 Thread Sergey Burladyan
Hi!

I need help investigating what happened here. I have two different master servers,
each with a standby, version: PostgreSQL 9.6.10 on x86_64-pc-linux-gnu
(Debian 9.6.10-1.pgdg90+1), compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516,
64-bit.

I take base backups with pg_basebackup from their standbys; a typical invocation
(host, user, and path here are hypothetical) looks something like:
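
$ pg_basebackup -h standby-host -U replication-user -D /path/to/backup

(WAL segments are not kept in the backup itself; they are fetched at restore time
via the restore_command shown below.)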

Sometimes, however, a backup fails its check: the restored server simply cannot
start. For example:

server 1:
2019-02-18 14:18:51 MSK [11901-2] LOG:  checkpoint starting: end-of-recovery immediate wait
2019-02-18 14:18:51 MSK [11901-3] DEBUG:  performing replication slot checkpoint
2019-02-18 14:18:51 MSK [11901-4] LOG:  request to flush past end of generated WAL; request 44B/7E5DAB28, currpos 44B/7305B560
2019-02-18 14:18:51 MSK [11901-5] CONTEXT:  writing block 0 of relation base/16506/16891_vm
2019-02-18 14:18:51 MSK [11901-6] ERROR:  xlog flush request 44B/7E5DAB28 is not satisfied --- flushed only to 44B/7305B560
2019-02-18 14:18:51 MSK [11901-7] CONTEXT:  writing block 0 of relation base/16506/16891_vm
2019-02-18 14:18:51 MSK [11858-110] FATAL:  checkpoint request failed

server 2:
2019-02-07 16:07:40 MSK [7911-37] LOG:  checkpoint starting: end-of-recovery immediate wait
2019-02-07 16:07:40 MSK [7911-38] DEBUG:  performing replication slot checkpoint
2019-02-07 16:07:41 MSK [7911-39] LOG:  request to flush past end of generated WAL; request 1CB/8E94EE90, currpos 1CB/8A195210
2019-02-07 16:07:41 MSK [7911-40] CONTEXT:  writing block 0 of relation base/18413/18529_vm
2019-02-07 16:07:41 MSK [7911-41] ERROR:  xlog flush request 1CB/8E94EE90 is not satisfied --- flushed only to 1CB/8A195210
2019-02-07 16:07:41 MSK [7911-42] CONTEXT:  writing block 0 of relation base/18413/18529_vm
2019-02-07 16:07:41 MSK [7868-665] FATAL:  checkpoint request failed

The backup was checked with: PostgreSQL 9.6.11 on x86_64-pc-linux-gnu
(Debian 9.6.11-1.pgdg90+1), compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516,
64-bit.

recovery.conf:
restore_command = '/usr/local/bin/restore_cmd /mnt/wals %f %p'
recovery_target = 'immediate'
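
(/usr/local/bin/restore_cmd is a local wrapper script; in essence it does something
like this hypothetical sketch:)

=== restore_cmd sketch ===
#!/bin/sh
# hypothetical restore_cmd: copy WAL segment $2 from archive dir $1 to $3
# invoked by PostgreSQL as: restore_cmd /mnt/wals %f %p
exec cp "$1/$2" "$3"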

postgresql.conf:
wal_log_hints = on
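
The check itself is, roughly: restore the base backup, put the recovery.conf above
in place, and start the server (the data directory path here is hypothetical):

$ pg_ctl -D /path/to/restored-backup -w start

With recovery_target = 'immediate', recovery should stop as soon as the backup
reaches a consistent state; instead it fails as shown above.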

How can I debug this?

-- 
Sergey Burladyan