Re: [GENERAL] Streaming replication slave crash

2013-11-25 Thread Mahlon E. Smith
On Tue, Sep 10, 2013, Mahlon E. Smith wrote:
 On Mon, Sep 09, 2013, Jeff Davis wrote:
  
  You may have seen only partial information about that bug and the fix.
 
 Yep, I totally glazed over the REINDEX.  Giving it a go -- thank you!

As a followup for anyone else landing on this thread, the REINDEX was
(unsurprisingly) the solution.  Been runnin' like a champ for the last
2.5 months.

--
Mahlon E. Smith  
http://www.martini.nu/contact.html




Re: [GENERAL] Streaming replication slave crash

2013-09-10 Thread Mahlon E. Smith
On Mon, Sep 09, 2013, Jeff Davis wrote:
 
 You may have seen only partial information about that bug and the fix.

Yep, I totally glazed over the REINDEX.  Giving it a go -- thank you!

--
Mahlon E. Smith  
http://www.martini.nu/contact.html




Re: [GENERAL] Streaming replication slave crash

2013-09-09 Thread Mahlon E. Smith

[piggybackin' on an older (seemingly very similar) thread...]

On Fri, Mar 29, 2013, Quentin Hartman wrote:
 Yesterday morning, one of my streaming replication slaves running 9.2.3
 crashed with the following in the log file:
 
 2013-03-28 12:49:30 GMT WARNING:  page 1441792 of relation base/63229/63370
 does not exist
 2013-03-28 12:49:30 GMT CONTEXT:  xlog redo delete: index
 1663/63229/109956; iblk 303, heap 1663/63229/63370;
 2013-03-28 12:49:30 GMT PANIC:  WAL contains references to invalid pages
 2013-03-28 12:49:30 GMT CONTEXT:  xlog redo delete: index


I've got a pair of 9.2.4 DBs here in a streaming + log shipping setup.
Everything was moving like clockwork until a couple of weeks ago, when my
slave died with the 'WAL contains references to invalid pages' error.

After some wild googlin' research, I saw the index visibility map fix
for 9.2.1.  We did pg_upgrade between versions, but just to be sure I
wasn't somehow carrying corrupt data across versions (?), I went ahead
and VACUUMed everything with vacuum_freeze_table_age set to 0, and
went on with my life, hoping I had removed whatever demons were running
around in there.
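
(For anyone following along, that boils down to roughly this in each
database -- just a sketch; the session-level SET is only one way to do it:)

-- With vacuum_freeze_table_age = 0, a plain VACUUM scans each table in
-- full rather than skipping pages marked all-visible in the visibility map.
SET vacuum_freeze_table_age = 0;
VACUUM VERBOSE;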

Slave died again this weekend:

[2013-09-08 02:29:14.712 GMT] 55340:  : WARNING:  page 2424 of relation 
base/16411/2529375 is uninitialized
[2013-09-08 02:29:14.712 GMT] 55340:  : CONTEXT:  xlog redo vacuum: rel 
1663/16411/2529375; blk 26937, lastBlockVacuumed 0
[2013-09-08 02:29:14.712 GMT] 55340:  : PANIC:  WAL contains references 
to invalid pages
[2013-09-08 02:29:14.712 GMT] 55340:  : CONTEXT:  xlog redo vacuum: rel 
1663/16411/2529375; blk 26937, lastBlockVacuumed 0
[2013-09-08 02:29:14.761 GMT] 55338:  : LOG:  startup process (PID 55340) 
was terminated by signal 6: Abort trap
[2013-09-08 02:29:14.761 GMT] 55338:  : LOG:  terminating any other 
active server processes


Unlike in Quentin's case, simply restarting the slave didn't bring it
back to life.  I had to pg_start_backup/rsync again from the master, at
which point:

cp: /usr/local/pgsql/wal/0003.history: No such file or directory
[2013-09-09 03:58:29.235 GMT] 96710:  : LOG:  entering standby mode
[2013-09-09 03:58:29.324 GMT] 96710:  : LOG:  restored log file 
00030E1E004B from archive
[...]
[2013-09-09 03:59:05.151 GMT] 96710:  : LOG:  consistent recovery state 
reached at E1E/89BE3DD8
[2013-09-09 03:59:05.152 GMT] 96708:  : LOG:  database system is ready to 
accept read only connections
cp: /usr/local/pgsql/wal/00030E1E008A: No such file or directory
[2013-09-09 03:59:05.154 GMT] 96710:  : LOG:  unexpected pageaddr 
E1D/FB00 in log file 3614, segment 138, offset 0
cp: /usr/local/pgsql/wal/00030E1E008A: No such file or directory
[2013-09-09 03:59:05.209 GMT] 96786:  : LOG:  streaming replication 
successfully connected to primary
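
(The re-sync itself was just the usual low-level base backup dance;
roughly the following on the master, with the label and rsync details
made up for illustration:)

-- Sketch of the master-side calls that bracket the rsync.
SELECT pg_start_backup('standby-resync', true);   -- second arg requests a fast checkpoint
-- ... rsync the data directory (minus pg_xlog and postmaster.pid) to the slave ...
SELECT pg_stop_backup();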


Seemingly okay again for now, though I'm anticipating similar failures in
the future.  The relation it originally had a problem with is an index on
one of the child partition tables:


[pgsql ~]$ oid2name -d db01 -o 2529375
From database db01:
  Filenode                Table Name
--------------------------------------
   2529375  syslog_p_2013_32_seq_key

We partition on year/week_of_year, so that table definitely shouldn't have
been receiving any writes at the time of the failure.  Also interesting:
both failures were on a Saturday, around the time a base backup is
performed (though it failed afterwards in both cases, so this may be a
coincidence).

Is there any other information I can gather that can help shed light on
what is going on?  Any recommendations for where to go from here?  My
master has been fine during all of this, but I'm concerned I've got some
hidden corruption that will bite me when I least expect it.


pgsql=# select version();
                                                  version
--------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.2.4 on amd64-portbld-freebsd9.0, compiled by cc (GCC) 4.2.1 20070831 patched [FreeBSD], 64-bit


--
Mahlon E. Smith  
http://www.martini.nu/contact.html




Re: [GENERAL] Streaming replication slave crash

2013-09-09 Thread Jeff Davis

On Mon, 2013-09-09 at 13:04 -0700, Mahlon E. Smith wrote:
 After some wild googlin' research, I saw the index visibility map fix
 for 9.2.1.  We did pg_upgrade between versions, but just to be sure I
 wasn't somehow carrying corrupt data across versions (?), I went ahead
 and VACUUMed everything with vacuum_freeze_table_age set to 0, and
 went on with my life, hoping I had removed whatever demons were running
 around in there.

You may have seen only partial information about that bug and the fix.
See the first item in the release notes here:

http://www.postgresql.org/docs/current/static/release-9-2-1.html

And the actual fix here:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=beb850e1d873f8920a78b9b9ee27e9f87c95592f

To be rid of the daemons entirely, after upgrading to 9.2.1+ you must
REINDEX all btree and GIN indexes that may have been touched by 9.2.0.
Since the actual problem you see is on an index (as you say later in your
report), I would suggest that you do that.
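
For example, one rough way to generate the statements -- this treats every
btree or GIN index outside the system schemas as a candidate, so trim the
list as appropriate:

-- Sketch: emit a REINDEX statement for each user-schema btree/GIN index.
SELECT 'REINDEX INDEX ' || quote_ident(n.nspname) || '.' || quote_ident(c.relname) || ';'
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_am am ON am.oid = c.relam
WHERE c.relkind = 'i'
  AND am.amname IN ('btree', 'gin')
  AND n.nspname NOT IN ('pg_catalog', 'pg_toast', 'information_schema');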

Not 100% sure this is the root cause of your problem, of course, but the
symptoms seem to line up.

 Unlike Quentin's original message, simply restarting the slave didn't
 bring it back to life.  I had to pg_start_backup/rsync again from the
 master, at which point:

It looks OK to me, so I think you are in the clear.

If you are particularly unlucky, your master server crashed (while still
on 9.2.0) without writing the data and left your master copy of the
index corrupt. If you are worried about that, you can do another re-sync
after you finish the REINDEXing. This is not necessary unless you
experienced at least one crash on 9.2.0.

Regards,
Jeff Davis



[GENERAL] Streaming replication slave crash

2013-03-29 Thread Quentin Hartman
Yesterday morning, one of my streaming replication slaves running 9.2.3
crashed with the following in the log file:

2013-03-28 12:49:30 GMT WARNING:  page 1441792 of relation base/63229/63370
does not exist
2013-03-28 12:49:30 GMT CONTEXT:  xlog redo delete: index
1663/63229/109956; iblk 303, heap 1663/63229/63370;
2013-03-28 12:49:30 GMT PANIC:  WAL contains references to invalid pages
2013-03-28 12:49:30 GMT CONTEXT:  xlog redo delete: index
1663/63229/109956; iblk 303, heap 1663/63229/63370;
2013-03-28 12:49:31 GMT LOG:  startup process (PID 22941) was terminated by
signal 6: Aborted
2013-03-28 12:49:31 GMT LOG:  terminating any other active server processes
2013-03-28 12:49:31 GMT WARNING:  terminating connection because of crash
of another server process
2013-03-28 12:49:31 GMT DETAIL:  The postmaster has commanded this server
process to roll back the current transaction and exit, because another
server process exited abnormally and possibly corrupted shared memory.
2013-03-28 12:49:31 GMT HINT:  In a moment you should be able to reconnect
to the database and repeat your command.
2013-03-28 12:57:44 GMT LOG:  database system was interrupted while in
recovery at log time 2013-03-28 12:37:42 GMT
2013-03-28 12:57:44 GMT HINT:  If this has occurred more than once some
data might be corrupted and you might need to choose an earlier recovery
target.
2013-03-28 12:57:44 GMT LOG:  entering standby mode
2013-03-28 12:57:44 GMT LOG:  redo starts at 19/2367CE30
2013-03-28 12:57:44 GMT LOG:  incomplete startup packet
2013-03-28 12:57:44 GMT LOG:  consistent recovery state reached at
19/241835B0
2013-03-28 12:57:44 GMT LOG:  database system is ready to accept read only
connections
2013-03-28 12:57:44 GMT LOG:  invalid record length at 19/2419EE38
2013-03-28 12:57:44 GMT LOG:  streaming replication successfully connected
to primary

As you can see I was able to restart it and it picked up and synchronized
right away, but this crash still concerns me.

The DB has about 75GB of data in it, and it is almost entirely write
traffic. It's essentially a log aggregator. I believe it was doing a
pg_dump backup at the time of the crash. It has hot_standby_feedback on to
allow that process to complete.
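
(Both standby settings are easy to double-check, for what it's worth --
just a sanity query:)

SELECT name, setting
FROM pg_settings
WHERE name IN ('hot_standby', 'hot_standby_feedback');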

Any insights into this, or advice on figuring out the root of it would be
appreciated. So far all the things I've found like this are bugs that
should be fixed in this version, or the internet equivalent of a shrug.

Thanks!

QH


Re: [GENERAL] Streaming replication slave crash

2013-03-29 Thread Lonni J Friedman
Looks like you've got some form of corruption:
page 1441792 of relation base/63229/63370 does not exist


The question is whether it was corrupted on the master and then
replicated to the slave, or if it was corrupted on the slave.  I'd
guess that the pg_dump tried to read from that page and barfed.  It
would be interesting to try re-running the pg_dump again to see if
this crash can be replicated.  If so, does it also replicate if you
run pg_dump against the master?  If not, then the corruption is
isolated to the slave, and you might have a hardware problem which is
causing the data to get corrupted.

On Fri, Mar 29, 2013 at 9:19 AM, Quentin Hartman
qhart...@direwolfdigital.com wrote:
 Yesterday morning, one of my streaming replication slaves running 9.2.3
 crashed with the following in the log file:

 2013-03-28 12:49:30 GMT WARNING:  page 1441792 of relation base/63229/63370
 does not exist
 2013-03-28 12:49:30 GMT CONTEXT:  xlog redo delete: index 1663/63229/109956;
 iblk 303, heap 1663/63229/63370;
 2013-03-28 12:49:30 GMT PANIC:  WAL contains references to invalid pages
 2013-03-28 12:49:30 GMT CONTEXT:  xlog redo delete: index 1663/63229/109956;
 iblk 303, heap 1663/63229/63370;
 2013-03-28 12:49:31 GMT LOG:  startup process (PID 22941) was terminated by
 signal 6: Aborted
 2013-03-28 12:49:31 GMT LOG:  terminating any other active server processes
 2013-03-28 12:49:31 GMT WARNING:  terminating connection because of crash of
 another server process
 2013-03-28 12:49:31 GMT DETAIL:  The postmaster has commanded this server
 process to roll back the current transaction and exit, because another
 server process exited abnormally and possibly corrupted shared memory.
 2013-03-28 12:49:31 GMT HINT:  In a moment you should be able to reconnect
 to the database and repeat your command.
 2013-03-28 12:57:44 GMT LOG:  database system was interrupted while in
 recovery at log time 2013-03-28 12:37:42 GMT
 2013-03-28 12:57:44 GMT HINT:  If this has occurred more than once some data
 might be corrupted and you might need to choose an earlier recovery target.
 2013-03-28 12:57:44 GMT LOG:  entering standby mode
 2013-03-28 12:57:44 GMT LOG:  redo starts at 19/2367CE30
 2013-03-28 12:57:44 GMT LOG:  incomplete startup packet
 2013-03-28 12:57:44 GMT LOG:  consistent recovery state reached at
 19/241835B0
 2013-03-28 12:57:44 GMT LOG:  database system is ready to accept read only
 connections
 2013-03-28 12:57:44 GMT LOG:  invalid record length at 19/2419EE38
 2013-03-28 12:57:44 GMT LOG:  streaming replication successfully connected
 to primary

 As you can see I was able to restart it and it picked up and synchronized
 right away, but this crash still concerns me.

 The DB has about 75GB of data in it, and it is almost entirely write
 traffic. It's essentially a log aggregator. I believe it was doing a pg_dump
 backup at the time of the crash. It has hot_standby_feedback on to allow
 that process to complete.

 Any insights into this, or advice on figuring out the root of it would be
 appreciated. So far all the things I've found like this are bugs that should
 be fixed in this version, or the internet equivalent of a shrug.

 Thanks!

 QH



-- 
~
L. Friedman netll...@gmail.com
LlamaLand   https://netllama.linux-sxs.org




Re: [GENERAL] Streaming replication slave crash

2013-03-29 Thread Quentin Hartman
On Fri, Mar 29, 2013 at 10:23 AM, Lonni J Friedman netll...@gmail.com wrote:

 Looks like you've got some form of corruption:
 page 1441792 of relation base/63229/63370 does not exist


Thanks for the insight. I thought that might be it, but never having seen
this before I'm glad to have some confirmation.

The question is whether it was corrupted on the master and then
 replicated to the slave, or if it was corrupted on the slave.  I'd
 guess that the pg_dump tried to read from that page and barfed.  It
 would be interesting to try re-running the pg_dump again to see if
 this crash can be replicated.  If so, does it also replicate if you
 run pg_dump against the master?  If not, then the corruption is
 isolated to the slave, and you might have a hardware problem which is
 causing the data to get corrupted.


Yes, we've gotten several clean dumps from the slave since then w/o
crashing. We're running these machines on EC2 so we sadly have no control
over the hardware. With your confirmation, and an apparently clean state
now, I'm inclined to chalk this up to an EC2 hiccup getting caught by
Postgres and get on with life.

Thanks!

QH


Re: [GENERAL] Streaming replication slave crash

2013-03-29 Thread Tom Lane
Quentin Hartman qhart...@direwolfdigital.com writes:
 Yesterday morning, one of my streaming replication slaves running 9.2.3
 crashed with the following in the log file:

What process did you use for setting up the slave?

There's a fix awaiting release in 9.2.4 that might explain data
corruption on a slave, depending on how it was set up:

* Ensure we do crash recovery before entering archive recovery,
if the database was not stopped cleanly and a recovery.conf file
is present (Heikki Linnakangas, Kyotaro Horiguchi, Mitsumasa Kondo)

This is needed to ensure that the database is consistent in certain
scenarios, such as initializing a standby server with a filesystem
snapshot from a running server.

This theory would be more probable if it's a relatively new slave, since
any corruption would have been there in the slave's initial state, just
waiting for the replay to run into it.

regards, tom lane




Re: [GENERAL] Streaming replication slave crash

2013-03-29 Thread Quentin Hartman
On Fri, Mar 29, 2013 at 10:37 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Quentin Hartman qhart...@direwolfdigital.com writes:
  Yesterday morning, one of my streaming replication slaves running 9.2.3
  crashed with the following in the log file:

 What process did you use for setting up the slave?


I used an rsync from the master while both were stopped.


 This theory would be more probable if it's a relatively new slave, since
 any corruption would have been there in the slave's initial state, just
 waiting for the replay to run into it.


It's newish. I upgraded this pair from 9.1.x to 9.2.3 a little over a week
ago, so I did a dump/reload and resynced at that point. It was running
happily between then and now.

Thanks!

QH


Re: [GENERAL] Streaming replication slave crash

2013-03-29 Thread Tom Lane
Quentin Hartman qhart...@direwolfdigital.com writes:
 On Fri, Mar 29, 2013 at 10:37 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 What process did you use for setting up the slave?

 I used an rsync from the master while both were stopped.

If the master was shut down cleanly (not -m immediate) then the bug fix
I was thinking about wouldn't explain this.  The fact that the panic
didn't recur after restarting seems to void that theory as well.  I'm
not sure what to make of that angle.

Can you determine which table is being complained of in the failure
message, ie, what has relfilenode 63370 in database 63229?  If so it
would be interesting to know what was being done to that table on the
master.
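
Something along these lines should do it (pg_relation_filenode() has been
available since 9.0):

-- Find which database has OID 63229, then connect to it and map the filenode.
SELECT datname FROM pg_database WHERE oid = 63229;

-- In that database:
SELECT c.oid::regclass AS relation
FROM pg_class c
WHERE pg_relation_filenode(c.oid) = 63370;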

regards, tom lane




Re: [GENERAL] Streaming replication slave crash

2013-03-29 Thread Quentin Hartman
On Fri, Mar 29, 2013 at 10:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Quentin Hartman qhart...@direwolfdigital.com writes:
  On Fri, Mar 29, 2013 at 10:37 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  What process did you use for setting up the slave?

  I used an rsync from the master while both were stopped.

 If the master was shut down cleanly (not -m immediate) then the bug fix
 I was thinking about wouldn't explain this.  The fact that the panic
 didn't recur after restarting seems to void that theory as well.  I'm
 not sure what to make of that angle.


Yes, it was shut down cleanly. A good thought, but I don't think it's
relevant in this case.


 Can you determine which table is being complained of in the failure
 message, ie, what has relfilenode 63370 in database 63229?  If so it
 would be interesting to know what was being done to that table on the
 master.


Good point! Looking deeper into that, it's actually one of our smaller
tables, and it doesn't seem to have any corruption on either server. I was
able to select all the records from it and the content seems sane. The only
thing that would have been happening on that table is an INSERT or UPDATE.

I think I'm going to run with the spurious EC2 hiccup explanation. I'm
comfortable with that given the extra due diligence I've done with your
(and Lonni's) guidance.

Thanks!

QH