Re: [HACKERS] Bug in visibility map WAL-logging

2014-01-08 Thread Matheus de Oliveira
On Tue, Jan 7, 2014 at 10:42 PM, Matheus de Oliveira 
matioli.math...@gmail.com wrote:

 How did you set up the standby? Did you initialize it from an offline
 backup of the master's data directory, perhaps? The log shows that the
 startup took the "crash recovery first, then start archive recovery"
 path, because there was no backup label file. In that mode, the standby
 assumes that the system is consistent after replaying all the WAL in
 pg_xlog, which is correct if you initialize from an offline backup or an
 atomic filesystem snapshot, for example. But "WAL contains references to
 invalid pages" could also be a symptom of an inconsistent base backup,
 caused by an incorrect backup procedure. In particular, I have to ask because
 I've seen it before: you didn't delete backup_label from the backup, did
 you?


 Well, I cannot answer this right now, but it makes sense and is quite possible.


I've just confirmed. That was indeed the case: the script was removing the
backup_label file. I've removed that line and synced the standby again, and it
has been running nicely (for the past hour at least).
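
In case it helps anyone else, a sanity check along the following lines (a
minimal sketch in plain Python; the backup directory path is just a
placeholder for whatever the script produces) would have caught the problem
before the standby was ever started:

#!/usr/bin/env python
# Hypothetical sanity check for a pg_start_backup + rsync + pg_stop_backup
# style script: refuse to hand the copy over to the standby if backup_label
# did not survive the sync.
import os
import sys

BACKUP_DIR = "/var/lib/pgsql/standby_seed"  # placeholder for the rsync target

label = os.path.join(BACKUP_DIR, "backup_label")
if not os.path.isfile(label):
    sys.exit("ERROR: backup_label is missing from %s; the standby would go "
             "through plain crash recovery and might never reach a "
             "consistent state" % BACKUP_DIR)
print("backup_label present; backup looks usable for archive recovery")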

Thank you guys for all your help, and sorry for all the confusion I caused.

Best regards,
-- 
Matheus de Oliveira
Analista de Banco de Dados
Dextra Sistemas - MPS.Br nível F!
www.dextra.com.br/postgres


Re: [HACKERS] Bug in visibility map WAL-logging

2014-01-08 Thread Heikki Linnakangas

On 01/08/2014 02:32 PM, Matheus de Oliveira wrote:

On Tue, Jan 7, 2014 at 10:42 PM, Matheus de Oliveira 
matioli.math...@gmail.com wrote:


How did you set up the standby? Did you initialize it from an offline
backup of the master's data directory, perhaps? The log shows that the
startup took the "crash recovery first, then start archive recovery"
path, because there was no backup label file. In that mode, the standby
assumes that the system is consistent after replaying all the WAL in
pg_xlog, which is correct if you initialize from an offline backup or an
atomic filesystem snapshot, for example. But "WAL contains references to
invalid pages" could also be a symptom of an inconsistent base backup,
caused by an incorrect backup procedure. In particular, I have to ask because
I've seen it before: you didn't delete backup_label from the backup, did
you?


Well, I cannot answer this right now, but it makes sense and is quite possible.


I've just confirmed. That was indeed the case: the script was removing the
backup_label file. I've removed that line and synced the standby again, and it
has been running nicely (for the past hour at least).


A-ha! ;-)


Thank you guys for all your help, and sorry for all the confusion I caused.


That seems to be a very common mistake to make. I wish we could do
something about it. Do you think it would've helped in your case if
there was a big fat warning at the beginning of backup_label, along the
lines of "# DO NOT REMOVE THIS FILE FROM A BACKUP"? Any other ideas for
how we could've made it more obvious to the script author not to remove it?
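
For reference, backup_label is a tiny text file, so the change would amount to
prepending a comment line to something like the following (the locations,
timestamps and label below are made up for illustration):

# DO NOT REMOVE THIS FILE FROM A BACKUP
START WAL LOCATION: 2BC/46000028 (file 00000001000002BC00000046)
CHECKPOINT LOCATION: 2BC/46000060
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2014-01-07 12:38:00 BRST
LABEL: standby base backup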


- Heikki




Re: [HACKERS] Bug in visibility map WAL-logging

2014-01-08 Thread Andres Freund
On 2014-01-08 14:37:34 +0200, Heikki Linnakangas wrote:
 That seems to be a very common mistake to make. I wish we could do something
 about it. Do you think it would've helped in your case if there was a big
 fat warning at the beginning of backup_label, along the lines of "# DO NOT
 REMOVE THIS FILE FROM A BACKUP"? Any other ideas for how we could've made it
 more obvious to the script author not to remove it?

I've been wondering about the possibility of setting a boolean in
checkpoint records indicating that a backup label needs to be used when
starting from that checkpoint. That boolean would only get checked when
using a recovery.conf and we've started with pg_control indicating that
it was written by a primary (i.e. state = DB_SHUTDOWNING).
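
As a toy sketch of that check (plain Python, not the real structs; the
parameter names are made up to illustrate the idea):

# Toy sketch of the proposed startup-time check; not PostgreSQL code.
def check_backup_label_needed(requires_backup_label,      # proposed checkpoint flag
                              have_recovery_conf,         # recovery.conf present?
                              control_written_by_primary, # e.g. pg_control state
                              have_backup_label):         # backup_label on disk?
    if not have_recovery_conf:
        return  # plain crash recovery: nothing to check
    if not control_written_by_primary:
        return  # pg_control already says we start from a base backup / standby
    if requires_backup_label and not have_backup_label:
        raise RuntimeError(
            "the starting checkpoint was taken by an online base backup; "
            "backup_label is required to reach a consistent state")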

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Bug in visibility map WAL-logging

2014-01-08 Thread Andres Freund
On 2014-01-07 22:42:59 -0200, Matheus de Oliveira wrote:
 @andres, if it is really removing backup_label it could also cause that
 other issue we saw on Monday, right? (yes I did run the same script).

It might be in your case, since that's an easy way to generate that
situation, but there have been several other reports of that bug, so
it's good that we've discussed it ;)

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Bug in visibility map WAL-logging

2014-01-08 Thread Heikki Linnakangas

On 01/08/2014 07:29 AM, Greg Stark wrote:

On Tue, Jan 7, 2014 at 11:36 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

Hmm. The xlogdump indeed shows that the order of 'clean' and 'visible' is
incorrect, but I don't immediately see how that could cause the PANIC. Why
is the page uninitialized in the standby? If VACUUM is removing some dead
tuples from it, it certainly should exist and be correctly initialized.


Unless the vacuum subsequently truncated the file to be shorter and
the backup was taken after that?


In that case WAL replay should also see the truncation record before 
reaching consistency. We only PANIC on an uninitialized/missing page 
after reaching consistency; before that, a missing page is indeed normal 
if the file was later truncated or deleted.
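
As a toy illustration of that rule (plain Python, entirely outside the real
xlog code): redo remembers references to missing pages and only turns them
into an error once the consistency point has been passed, because a later
truncation record may explain them away.

# Toy model of the "invalid pages" bookkeeping during redo; not the real code.
invalid_pages = set()          # (relation, block) references we could not resolve
reached_consistency = False

def redo_page_reference(rel, blk, page_exists):
    if page_exists:
        return
    if reached_consistency:
        raise RuntimeError("WAL contains references to invalid pages")
    invalid_pages.add((rel, blk))   # may be resolved by a later truncation

def redo_truncate(rel, new_nblocks):
    # forget complaints about blocks beyond the new end of the relation
    for r, blk in list(invalid_pages):
        if r == rel and blk >= new_nblocks:
            invalid_pages.discard((r, blk))

def reach_consistency():
    global reached_consistency
    reached_consistency = True
    if invalid_pages:
        raise RuntimeError("WAL contains references to invalid pages")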


- Heikki




Re: [HACKERS] Bug in visibility map WAL-logging

2014-01-07 Thread Matheus de Oliveira
Hi folks,


On Fri, Dec 13, 2013 at 9:47 AM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 lazy_vacuum_page() does this:

 1. START_CRIT_SECTION()
 2. Remove dead tuples from the page, marking the itemids unused.
 3. MarkBufferDirty()
 4. If all remaining tuples on the page are visible to everyone, set the
 all-visible flag on the page, and call visibilitymap_set() to set the VM
 bit.
 5. visibilitymap_set() writes a WAL record about setting the bit in the
 visibility map.
 6. Write the WAL record of removing the dead tuples.
 7. END_CRIT_SECTION()

 See the problem? Setting the VM bit is WAL-logged first, before the
 removal of the tuples. If you stop recovery between the two WAL records,
 the page's bit in the visibility map will be set, but the dead tuples are
 still on the page.

 This bug was introduced in Feb 2013, by commit
 fdf9e21196a6f58c6021c967dc5776a16190f295, so it's present in 9.3 and master.

 The fix seems quite straightforward: we just have to change the order of
 those two records. I'll go commit that.
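
To make the failure mode concrete, here is a toy model of it (plain Python,
nothing to do with the real heap or visibility map code): replaying only the
first of the two records in the buggy order leaves the all-visible bit set
while a dead tuple is still on the page.

# Toy model of the record-ordering bug; a "page" holds tuples and an
# all-visible flag, the "VM" is a single bit, and recovery is just replaying
# a prefix of the WAL record list.
def replay(records, stop_after):
    page = {"tuples": ["live", "dead"], "all_visible": False}
    vm_bit = False
    for rec in records[:stop_after]:
        if rec == "visible":      # like xl_heap_visible: set the flag + VM bit
            page["all_visible"] = True
            vm_bit = True
        elif rec == "clean":      # like xl_heap_clean: drop the dead tuples
            page["tuples"] = [t for t in page["tuples"] if t == "live"]
    return page, vm_bit

# Buggy order (visible logged before clean), recovery stops in between:
print(replay(["visible", "clean"], stop_after=1))
# -> ({'tuples': ['live', 'dead'], 'all_visible': True}, True)

# Fixed order (clean logged first): stopping in between is merely conservative:
print(replay(["clean", "visible"], stop_after=1))
# -> ({'tuples': ['live'], 'all_visible': False}, False)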



With a lot of help from Andres on IRC (thanks a lot, man), I think (he
thinks, actually, =P ) I may have run into the issue this bug can cause. I'm
sending this e-mail to (1) confirm whether my issue is really caused by that
bug and, if that is the case, (2) let you guys know the problems it can cause.

Scenario:

Master: 9.3.1 (I know, trying to go to 9.3.2)
Slave: 9.3.2

The slave was running with hot_standby=off (because of another bug Andres is
working on), but it stopped working with the following log lines:

2014-01-07 12:38:04 BRST [22668]: [7-1] user=,db= WARNING:  page 1076 of
relation base/883916/151040222 is uninitialized
2014-01-07 12:38:04 BRST [22668]: [8-1] user=,db= CONTEXT:  xlog redo
visible: rel 1663/883916/151040222; blk 1076
2014-01-07 12:38:04 BRST [22668]: [9-1] user=,db= PANIC:  WAL contains
references to invalid pages
2014-01-07 12:38:04 BRST [22668]: [10-1] user=,db= CONTEXT:  xlog redo
visible: rel 1663/883916/151040222; blk 1076
2014-01-07 12:38:04 BRST [22665]: [3-1] user=,db= LOG:  startup process
(PID 22668) was terminated by signal 6: Aborted

(Complete log at https://pgsql.privatepaste.com/343f3190fe).

I used pg_xlogdump to (try to) find the block the error relates to:

$ pg_xlogdump pg_xlog/000102BC002B 000102BC007F |
grep -n -C5 '883916/151040222; blk 1076'

2088220 ... Heap2 ...  24/   192, ... lsn: 2BC/46AB8B20 ... desc: clean:
rel 1663/883916/151040222; blk 1073 remxid 107409880
2088221 ... Heap2 ...  20/52, ... lsn: 2BC/46AB8BE0 ... desc: visible:
rel 1663/883916/151040222; blk 1074
2088222 ... Heap2 ...  24/   128, ... lsn: 2BC/46AB8C18 ... desc: clean:
rel 1663/883916/151040222; blk 1074 remxid 107409880
2088223 ... Heap2 ...  20/52, ... lsn: 2BC/46AB8C98 ... desc: visible:
rel 1663/883916/151040222; blk 1075
2088224 ... Heap2 ...  32/64, ... lsn: 2BC/46AB8CD0 ... desc: clean:
rel 1663/883916/151040222; blk 1075 remxid 107409880
== the two lines that matched are below ==
2088225 ... Heap2 ...  20/52, ... lsn: 2BC/46AB8D10 ... desc: visible:
rel 1663/883916/151040222; blk 1076
2088226 ... Heap2 ...  24/   164, ... lsn: 2BC/46AB8D48 ... desc: clean:
rel 1663/883916/151040222; blk 1076 remxid 107409880
2088227 ... Heap2 ...  20/52, ... lsn: 2BC/46AB8DF0 ... desc: visible:
rel 1663/883916/151040222; blk 1077
2088228 ... Heap2 ...  24/   128, ... lsn: 2BC/46AB8E28 ... desc: clean:
rel 1663/883916/151040222; blk 1077 remxid 107409880
2088229 ... Heap2 ...  20/52, ... lsn: 2BC/46AB8EA8 ... desc: visible:
rel 1663/883916/151040222; blk 1078
2088230 ... Heap2 ...  24/  5748, ... lsn: 2BC/46AB8EE0 ... desc: clean:
rel 1663/883916/151040222; blk 1078 remxid 107409880
2088231 ... Heap2 ...  20/52, ... lsn: 2BC/46ABA570 ... desc: visible:
rel 1663/883916/151040222; blk 1079


I cropped some columns to make it easier to read (the entire result is
attached); if you guys need more information, I can send the xlogdump of
all the WALs (or any other information). I have also attached the
controldata output, if needed.


Thanks a lot.

Best regards,
-- 
Matheus de Oliveira
Analista de Banco de Dados
Dextra Sistemas - MPS.Br nível F!
www.dextra.com.br/postgres
2088220-rmgr: Heap2   len (rec/tot): 24/   192, tx:  0, lsn: 2BC/46AB8B20, prev 2BC/46AB8AE8, bkp: 1000, desc: clean: rel 1663/883916/151040222; blk 1073 remxid 107409880
2088221-rmgr: Heap2   len (rec/tot): 20/52, tx:  0, lsn: 2BC/46AB8BE0, prev 2BC/46AB8B20, bkp: , desc: visible: rel 1663/883916/151040222; blk 1074
2088222-rmgr: Heap2   len (rec/tot): 24/   128, tx:  0, lsn: 2BC/46AB8C18, prev 2BC/46AB8BE0, bkp: 1000, desc: clean: rel 1663/883916/151040222; blk 1074 remxid 107409880
2088223-rmgr: Heap2   len (rec/tot): 20/52, tx:  0, lsn: 2BC/46AB8C98, prev 2BC/46AB8C18, bkp: , desc: visible: rel 1663/883916/151040222; blk 1075
2088224-rmgr: Heap2   len (rec/tot): 32/64, tx:  0, lsn: 

Re: [HACKERS] Bug in visibility map WAL-logging

2014-01-07 Thread Heikki Linnakangas

On 01/07/2014 07:15 PM, Matheus de Oliveira wrote:

Hi folks,


On Fri, Dec 13, 2013 at 9:47 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:



lazy_vacuum_page() does this:

1. START_CRIT_SECTION()
2. Remove dead tuples from the page, marking the itemids unused.
3. MarkBufferDirty()
4. If all remaining tuples on the page are visible to everyone, set the
all-visible flag on the page, and call visibilitymap_set() to set the VM
bit.
5. visibilitymap_set() writes a WAL record about setting the bit in the
visibility map.
6. Write the WAL record of removing the dead tuples.
7. END_CRIT_SECTION()

See the problem? Setting the VM bit is WAL-logged first, before the
removal of the tuples. If you stop recovery between the two WAL records,
the page's bit in the visibility map will be set, but the dead tuples are
still on the page.

This bug was introduced in Feb 2013, by commit
fdf9e21196a6f58c6021c967dc5776a16190f295, so it's present in 9.3 and master.

The fix seems quite straightforward: we just have to change the order of
those two records. I'll go commit that.




With a lot of help from Andres on IRC (thanks a lot, man), I think (he
thinks, actually, =P ) I may have run into the issue this bug can cause. I'm
sending this e-mail to (1) confirm whether my issue is really caused by that
bug and, if that is the case, (2) let you guys know the problems it can cause.

Scenario:

Master: 9.3.1 (I know, trying to go to 9.3.2)
Slave: 9.3.2

The slave was running with hot_standby=off (because of another bug Andres is
working on), but it stopped working with the following log lines:

2014-01-07 12:38:04 BRST [22668]: [7-1] user=,db= WARNING:  page 1076 of
relation base/883916/151040222 is uninitialized
2014-01-07 12:38:04 BRST [22668]: [8-1] user=,db= CONTEXT:  xlog redo
visible: rel 1663/883916/151040222; blk 1076
2014-01-07 12:38:04 BRST [22668]: [9-1] user=,db= PANIC:  WAL contains
references to invalid pages
2014-01-07 12:38:04 BRST [22668]: [10-1] user=,db= CONTEXT:  xlog redo
visible: rel 1663/883916/151040222; blk 1076
2014-01-07 12:38:04 BRST [22665]: [3-1] user=,db= LOG:  startup process
(PID 22668) was terminated by signal 6: Aborted

(Complete log at https://pgsql.privatepaste.com/343f3190fe).

I used pg_xlogdump to (try to) find the block the error relates to:

$ pg_xlogdump pg_xlog/000102BC002B 000102BC007F |
grep -n -C5 '883916/151040222; blk 1076'

2088220 ... Heap2 ...  24/   192, ... lsn: 2BC/46AB8B20 ... desc: clean:
rel 1663/883916/151040222; blk 1073 remxid 107409880
2088221 ... Heap2 ...  20/52, ... lsn: 2BC/46AB8BE0 ... desc: visible:
rel 1663/883916/151040222; blk 1074
2088222 ... Heap2 ...  24/   128, ... lsn: 2BC/46AB8C18 ... desc: clean:
rel 1663/883916/151040222; blk 1074 remxid 107409880

 ...

Hmm. The xlogdump indeed shows that the order of 'clean' and 'visible' 
is incorrect, but I don't immediately see how that could cause the 
PANIC. Why is the page uninitialized in the standby? If VACUUM is 
removing some dead tuples from it, it certainly should exist and be 
correctly initialized.


How did you set up the standby? Did you initialize it from an offline
backup of the master's data directory, perhaps? The log shows that the
startup took the "crash recovery first, then start archive recovery"
path, because there was no backup label file. In that mode, the standby
assumes that the system is consistent after replaying all the WAL in
pg_xlog, which is correct if you initialize from an offline backup or an
atomic filesystem snapshot, for example. But "WAL contains references to
invalid pages" could also be a symptom of an inconsistent base backup,
caused by an incorrect backup procedure. In particular, I have to ask
because I've seen it before: you didn't delete backup_label from the
backup, did you?


- Heikki




Re: [HACKERS] Bug in visibility map WAL-logging

2014-01-07 Thread Andres Freund
On 2014-01-07 21:36:31 +0200, Heikki Linnakangas wrote:
 2088220 ... Heap2 ...  24/   192, ... lsn: 2BC/46AB8B20 ... desc: clean:
 rel 1663/883916/151040222; blk 1073 remxid 107409880
 2088221 ... Heap2 ...  20/52, ... lsn: 2BC/46AB8BE0 ... desc: visible:
 rel 1663/883916/151040222; blk 1074
 2088222 ... Heap2 ...  24/   128, ... lsn: 2BC/46AB8C18 ... desc: clean:
 rel 1663/883916/151040222; blk 1074 remxid 107409880
  ...
 
 Hmm. The xlogdump indeed shows that the order of 'clean' and 'visible' is
 incorrect, but I don't immediately see how that could cause the PANIC. Why
 is the page uninitialized in the standby? If VACUUM is removing some dead
 tuples from it, it certainly should exist and be correctly initialized.

Yea, that's strange. I first thought it might be the PageIsNew() branch
in lazy_scan_heap(). That initializes the page without WAL logging which
would explain it being uninitialized on the standby. But that wouldn't
explain why we found something to clean on the primary while the page is
still empty on the standby...

 How did you set up the standby? Did you initialize it from an offline backup
 of the master's data directory, perhaps? The log shows that the startup took
 the "crash recovery first, then start archive recovery" path, because
 there was no backup label file

Good point.

In particular, I have to ask because I've seen it before: you
 didn't delete backup_label from the backup, did you?

I have seen that repeatedly too.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Bug in visibility map WAL-logging

2014-01-07 Thread Matheus de Oliveira
On Tue, Jan 7, 2014 at 5:36 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

 On 01/07/2014 07:15 PM, Matheus de Oliveira wrote:

 Hi folks,


 On Fri, Dec 13, 2013 at 9:47 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:


 lazy_vacuum_page() does this:

 1. START_CRIT_SECTION()
 2. Remove dead tuples from the page, marking the itemids unused.
 3. MarkBufferDirty()
 4. If all remaining tuples on the page are visible to everyone, set the
 all-visible flag on the page, and call visibilitymap_set() to set the VM
 bit.
 5. visibilitymap_set() writes a WAL record about setting the bit in the
 visibility map.
 6. Write the WAL record of removing the dead tuples.
 7. END_CRIT_SECTION()

 See the problem? Setting the VM bit is WAL-logged first, before the
 removal of the tuples. If you stop recovery between the two WAL records,
 the page's bit in the visibility map will be set, but the dead tuples are
 still on the page.

 This bug was introduced in Feb 2013, by commit
 fdf9e21196a6f58c6021c967dc5776a16190f295, so it's present in 9.3 and
 master.

 The fix seems quite straightforward: we just have to change the order of
 those two records. I'll go commit that.



 With a lot of help from Andres on IRC (thanks a lot, man), I think (he
 thinks, actually, =P ) I may have run into the issue this bug can cause. I'm
 sending this e-mail to (1) confirm whether my issue is really caused by that
 bug and, if that is the case, (2) let you guys know the problems it can
 cause.

 Scenario:

 Master: 9.3.1 (I know, trying to go to 9.3.2)
 Slave: 9.3.2

 The slave was running with hot_standby=off (because of another bug Andres is
 working on), but it stopped working with the following log lines:

 2014-01-07 12:38:04 BRST [22668]: [7-1] user=,db= WARNING:  page 1076 of
 relation base/883916/151040222 is uninitialized
 2014-01-07 12:38:04 BRST [22668]: [8-1] user=,db= CONTEXT:  xlog redo
 visible: rel 1663/883916/151040222; blk 1076
 2014-01-07 12:38:04 BRST [22668]: [9-1] user=,db= PANIC:  WAL contains
 references to invalid pages
 2014-01-07 12:38:04 BRST [22668]: [10-1] user=,db= CONTEXT:  xlog redo
 visible: rel 1663/883916/151040222; blk 1076
 2014-01-07 12:38:04 BRST [22665]: [3-1] user=,db= LOG:  startup process
 (PID 22668) was terminated by signal 6: Aborted

 (Complete log at https://pgsql.privatepaste.com/343f3190fe).

 I used pg_xlogdump to (try to) find the block the error relates to:

 $ pg_xlogdump pg_xlog/000102BC002B 000102BC007F |
 grep -n -C5 '883916/151040222; blk 1076'

 2088220 ... Heap2 ...  24/   192, ... lsn: 2BC/46AB8B20 ... desc: clean:
 rel 1663/883916/151040222; blk 1073 remxid 107409880
 2088221 ... Heap2 ...  20/52, ... lsn: 2BC/46AB8BE0 ... desc: visible:
 rel 1663/883916/151040222; blk 1074
 2088222 ... Heap2 ...  24/   128, ... lsn: 2BC/46AB8C18 ... desc: clean:
 rel 1663/883916/151040222; blk 1074 remxid 107409880

  ...

 Hmm. The xlogdump indeed shows that the order of 'clean' and 'visible' is
 incorrect, but I don't immediately see how that could cause the PANIC.


Well... I also didn't understand how it could cause the PANIC. If I got it
right, it could only leave a visibility map bit with the wrong value (e.g.
the bit gets set but the dead tuples are not yet removed, in case of a failure
in between - which does not seem to be what happened). Is that right?


 Why is the page uninitialized in the standby? If VACUUM is removing some
 dead tuples from it, it certainly should exist and be correctly initialized.


That I don't know for sure...


 How did you set up the standby? Did you initialize it from an offline
 backup of the master's data directory, perhaps? The log shows that the
 startup took the "crash recovery first, then start archive recovery"
 path, because there was no backup label file. In that mode, the standby
 assumes that the system is consistent after replaying all the WAL in
 pg_xlog, which is correct if you initialize from an offline backup or an
 atomic filesystem snapshot, for example. But "WAL contains references to
 invalid pages" could also be a symptom of an inconsistent base backup,
 caused by an incorrect backup procedure. In particular, I have to ask because
 I've seen it before: you didn't delete backup_label from the backup, did
 you?


Well, I cannot answer this right now, but it makes sense and is quite possible.

I used a script created by someone else that does pg_start_backup + rsync +
pg_stop_backup, but in fact I didn't look at this script to see if it is
doing something nasty, such as removing backup_label. I'll be able to check it
tomorrow, and then I'll come back with a definitive answer.

@andres, if it is really removing backup_label it could also cause that
other issue we saw on Monday, right? (yes I did run the same script).

BTW, is there any situation in which backup_label should be removed? I see no
reason for doing so, and I had also never seen anyone do it, so I didn't even
think that could be the cause.

Thank you guys for your attention.

Best regards,
-- 

Re: [HACKERS] Bug in visibility map WAL-logging

2014-01-07 Thread Greg Stark
On Tue, Jan 7, 2014 at 11:36 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Hmm. The xlogdump indeed shows that the order of 'clean' and 'visible' is
 incorrect, but I don't immediately see how that could cause the PANIC. Why
 is the page uninitialized in the standby? If VACUUM is removing some dead
 tuples from it, it certainly should exist and be correctly initialized.


Unless the vacuum subsequently truncated the file to be shorter and
the backup was taken after that?

-- 
greg

