[BUGS] BUG #7500: hot-standby replica crash after an initial rsync

2012-08-21 Thread maxim.boguk
The following bug has been logged on the website:

Bug reference:  7500
Logged by:  Maxim Boguk
Email address:  maxim.bo...@gmail.com
PostgreSQL version: 9.0.8
Operating system:   FreeBSD
Description:

Hi,

Reinitializing the replica after a failover is a procedure I have performed
many times on this installation, and I have never had any issues. But today,
after following the same pg_start_backup/rsync procedure, the replica does
not start, failing with the following error:

2012-08-21 11:17:07.777 MSK 34329 @ from  [vxid: txid:0] []LOG:  entering
standby mode
2012-08-21 11:17:07.780 MSK 34329 @ from  [vxid:1/0 txid:0] []LOG:  redo
starts at 925/2C0004C8
2012-08-21 11:17:07.783 MSK 34329 @ from  [vxid:1/0 txid:0] []FATAL:  could
not access status of transaction 361788644
2012-08-21 11:17:07.783 MSK 34329 @ from  [vxid:1/0 txid:0] []DETAIL:  Could
not read from file "pg_multixact/offsets/1590" at offset 114688: Unknown
error: 0.
2012-08-21 11:17:07.783 MSK 34329 @ from  [vxid:1/0 txid:0] []CONTEXT:  xlog
redo create multixact 361788644 offset 822580965: 1768508061 1768508078
2012-08-21 11:17:07.791 MSK 34327 @ from  [vxid: txid:0] []LOG:  startup
process (PID 34329) exited with exit code 1
2012-08-21 11:17:07.791 MSK 34327 @ from  [vxid: txid:0] []LOG:  terminating
any other active server processes


I have kept all the database files for future investigation.

What should I look into first?






Re: [BUGS] BUG #7500: hot-standby replica crash after an initial rsync

2012-08-21 Thread Andres Freund
On Tuesday, August 21, 2012 10:08:11 AM maxim.bo...@gmail.com wrote:
> The following bug has been logged on the website:
> 
> Bug reference:  7500
> Logged by:  Maxim Boguk
> PostgreSQL version: 9.0.8
> Operating system:   FreeBSD
> 
> [rest of the report and log snipped]
> 
> I have kept all the database files for future investigation.
> 
> What should I look into first?
Could you reproduce the error with log_error_verbosity=verbose? Or even better 
provide a backtrace with gdb?
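For reference, a minimal postgresql.conf fragment for that (both settings
exist in 9.0; log_min_messages only matters if DEBUG-level detail is wanted
as well):

    log_error_verbosity = verbose   # add SQLSTATE codes and source locations
    log_min_messages = debug2      # optional: also emit DEBUG-level lines

Both are picked up on a configuration reload (pg_ctl reload).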

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [BUGS] BUG #7500: hot-standby replica crash after an initial rsync

2012-08-21 Thread Maxim Boguk
> >
> >
> > I have kept all the database files for future investigation.
> >
> > What should I look into first?
> Could you reproduce the error with log_error_verbosity=verbose? Or even
> better
> provide a backtrace with gdb?
>
>
Here is the log with log_error_verbosity=verbose:


2012-08-21 14:04:18.973 MSK 33950 @ from  [vxid: txid:0] []LOG:  0:
database system was interrupted while in recovery at log time 2012-08-21
08:21:42 MSK
2012-08-21 14:04:18.973 MSK 33950 @ from  [vxid: txid:0] []HINT:  If this
has occurred more than once some data might be corrupted and you might need
to choose an earlier recovery target.
2012-08-21 14:04:18.973 MSK 33950 @ from  [vxid: txid:0] []LOCATION:
StartupXLOG, xlog.c:5757
2012-08-21 14:04:18.973 MSK 33950 @ from  [vxid: txid:0] []LOG:  0:
creating missing WAL directory "pg_xlog/archive_status"
2012-08-21 14:04:18.973 MSK 33950 @ from  [vxid: txid:0] []LOCATION:
ValidateXLOGDirectoryStructure, xlog.c:3452
2012-08-21 14:04:19.009 MSK 33950 @ from  [vxid: txid:0] []LOG:  0:
entering standby mode
2012-08-21 14:04:19.009 MSK 33950 @ from  [vxid: txid:0] []LOCATION:
StartupXLOG, xlog.c:5827
2012-08-21 14:04:19.041 MSK 33951 @ from  [vxid: txid:0] []LOG:  0:
streaming replication successfully connected to primary
2012-08-21 14:04:19.041 MSK 33951 @ from  [vxid: txid:0] []LOCATION:
libpqrcv_connect, libpqwalreceiver.c:169
2012-08-21 14:04:20.140 MSK 33950 @ from  [vxid:1/0 txid:0] []LOG:  0:
redo starts at 925/2C0004C8
2012-08-21 14:04:20.140 MSK 33950 @ from  [vxid:1/0 txid:0] []LOCATION:
StartupXLOG, xlog.c:6199
2012-08-21 14:04:20.160 MSK 33950 @ from  [vxid:1/0 txid:0] []FATAL:
XX000: could not access status of transaction 361788644
2012-08-21 14:04:20.161 MSK 33950 @ from  [vxid:1/0 txid:0] []DETAIL:
Could not read from file "pg_multixact/offsets/1590" at offset 114688:
Unknown error: 0.
2012-08-21 14:04:20.161 MSK 33950 @ from  [vxid:1/0 txid:0] []CONTEXT:
xlog redo create multixact 361788644 offset 822580965: 1768508061 1768508078
2012-08-21 14:04:20.161 MSK 33950 @ from  [vxid:1/0 txid:0] []LOCATION:
SlruReportIOError, slru.c:840
2012-08-21 14:04:20.169 MSK 33948 @ from  [vxid: txid:0] []LOG:  0:
startup process (PID 33950) exited with exit code 1
2012-08-21 14:04:20.169 MSK 33948 @ from  [vxid: txid:0] []LOCATION:
LogChildExit, postmaster.c:2840
2012-08-21 14:04:20.169 MSK 33948 @ from  [vxid: txid:0] []LOG:  0:
terminating any other active server processes
2012-08-21 14:04:20.169 MSK 33948 @ from  [vxid: txid:0] []LOCATION:
HandleChildCrash, postmaster.c:2674


Kind Regards,
Maksym


Re: [BUGS] BUG #7500: hot-standby replica crash after an initial rsync

2012-08-21 Thread Maxim Boguk
On Tue, Aug 21, 2012 at 8:08 PM, Maxim Boguk  wrote:

> [previous message and verbose log snipped]

I played a bit with more detailed debug settings and got a more detailed log:

LOG:  0: database system was interrupted while in recovery at log time
2012-08-21 08:21:42 MSK
HINT:  If this has occurred more than once some data might be corrupted and
you might need to choose an earlier recovery target.
LOCATION:  StartupXLOG, xlog.c:5757
DEBUG:  0: standby_mode = 'on'
LOCATION:  readRecoveryCommandFile, xlog.c:5269
DEBUG:  0: primary_conninfo = 'host=10.12.22.19 port=5433 user=replica
password=acilper'
LOCATION:  readRecoveryCommandFile, xlog.c:5276
DEBUG:  0: trigger_file = '/db/data/failover'
LOCATION:  readRecoveryCommandFile, xlog.c:5283
LOG:  0: entering standby mode
LOCATION:  StartupXLOG, xlog.c:5827
DEBUG:  0: checkpoint record is at 925/2D41CAA0
LOCATION:  StartupXLOG, xlog.c:5894
DEBUG:  0: redo record is at 925/2C0004C8; shutdown FALSE
LOCATION:  StartupXLOG, xlog.c:5929
DEBUG:  0: next transaction ID: 0/1768511856; next OID: 59880938
LOCATION:  StartupXLOG, xlog.c:5933
DEBUG:  0: next MultiXactId: 361788644; next MultiXactOffset: 822580965
LOCATION:  StartupXLOG, xlog.c:5936
DEBUG:  0: oldest unfrozen transaction ID: 1569182525, in database 16424
LOCATION:  StartupXLOG, xlog.c:5939
DEBUG:  0: transaction ID wrap limit is 371172, limited by database
with OID 16424
LOCATION:  SetTransactionIdLimit, varsup.c:334
DEBUG:  0: initializing for hot standby
LOCATION:  StartupXLOG, xlog.c:6070
DEBUG:  0: my backend id is 1
LOCATION:  SharedInvalBackendInit, sinvaladt.c:326
LOG:  0: redo starts at 925/2C0004C8
LOCATION:  StartupXLOG, xlog.c:6199
DEBUG:  0: record known xact 1768508078 latestObservedXid 0
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.079655+04
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  0: record known xact 1768508078 latestObservedXid 0
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.079655+04
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  0: remove KnownAssignedXid 1768508078
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.079655+04
LOCATION:  KnownAssignedXidsRemove, procarray.c:2869
DEBUG:  0: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo exclusive_lock: xid 17685080

Re: [BUGS] BUG #7500: hot-standby replica crash after an initial rsync

2012-08-21 Thread Andres Freund
Hi,


On Tuesday, August 21, 2012 02:28:54 PM Maxim Boguk wrote:
> On Tue, Aug 21, 2012 at 8:08 PM, Maxim Boguk  wrote:
> >> > I have kept all the database files for future investigation.
> >> > 
> >> > What should I look into first?
> >> 
> >> Could you reproduce the error with log_error_verbosity=verbose? Or even
> >> better
> >> provide a backtrace with gdb?
> > 
> > Here is the log with log_error_verbosity=verbose:
> LOG:  0: database system was interrupted while in recovery at log time
> 2012-08-21 08:21:42 MSK
> HINT:  If this has occurred more than once some data might be corrupted and
> you might need to choose an earlier recovery target.
> LOCATION:  StartupXLOG, xlog.c:5757
> DEBUG:  0: standby_mode = 'on'
> LOCATION:  readRecoveryCommandFile, xlog.c:5269
> DEBUG:  0: primary_conninfo = 'host=10.12.22.19 port=5433 user=replica
> password=acilper'
> LOCATION:  readRecoveryCommandFile, xlog.c:5276
> DEBUG:  0: trigger_file = '/db/data/failover'
> LOCATION:  readRecoveryCommandFile, xlog.c:5283
> LOG:  0: entering standby mode
> LOCATION:  StartupXLOG, xlog.c:5827
> DEBUG:  0: checkpoint record is at 925/2D41CAA0
> LOCATION:  StartupXLOG, xlog.c:5894
> DEBUG:  0: redo record is at 925/2C0004C8; shutdown FALSE
> LOCATION:  StartupXLOG, xlog.c:5929
> DEBUG:  0: next transaction ID: 0/1768511856; next OID: 59880938
> LOCATION:  StartupXLOG, xlog.c:5933
> DEBUG:  0: next MultiXactId: 361788644; next MultiXactOffset: 822580965
> LOCATION:  StartupXLOG, xlog.c:5936
> DEBUG:  0: oldest unfrozen transaction ID: 1569182525, in database
> 16424 LOCATION:  StartupXLOG, xlog.c:5939
> DEBUG:  0: transaction ID wrap limit is 371172, limited by database
> with OID 16424
> LOCATION:  SetTransactionIdLimit, varsup.c:334
> DEBUG:  0: initializing for hot standby
> LOCATION:  StartupXLOG, xlog.c:6070
> DEBUG:  0: my backend id is 1
> LOCATION:  SharedInvalBackendInit, sinvaladt.c:326
> LOG:  0: redo starts at 925/2C0004C8
> LOCATION:  StartupXLOG, xlog.c:6199
> DEBUG:  0: record known xact 1768508078 latestObservedXid 0
> CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.079655+04
> LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
> DEBUG:  0: record known xact 1768508078 latestObservedXid 0
> CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.079655+04
> LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
> DEBUG:  0: remove KnownAssignedXid 1768508078
> CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.079655+04
> LOCATION:  KnownAssignedXidsRemove, procarray.c:2869
> DEBUG:  0: record known xact 1768508080 latestObservedXid 0
> CONTEXT:  xlog redo exclusive_lock: xid 1768508080 rel 1663/16424/17282;
> tid 725829/8
> LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
> DEBUG:  0: record known xact 1768508080 latestObservedXid 0
> CONTEXT:  xlog redo update: rel 1663/16424/17282; tid 725829/8; new
> 726017/7 LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
> DEBUG:  0: record known xact 1768508076 latestObservedXid 0
> CONTEXT:  xlog redo exclusive_lock: xid 1768508076 rel 1663/16424/47760635;
> tid 3075/136
> LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
> DEBUG:  0: record known xact 1768508080 latestObservedXid 0
> CONTEXT:  xlog redo insert: rel 1663/16424/47747; tid 43866/197
> LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
> DEBUG:  0: record known xact 1768508080 latestObservedXid 0
> CONTEXT:  xlog redo insert: rel 1663/16424/52546872; tid 44700/276
> LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
> DEBUG:  0: record known xact 1768508061 latestObservedXid 0
> CONTEXT:  xlog redo log: rel 1663/16424/19211
> LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
> DEBUG:  0: record known xact 1768508082 latestObservedXid 0
> CONTEXT:  xlog redo exclusive_lock: xid 1768508082 rel 1663/16424/4759107;
> tid 1098/7
> LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
> DEBUG:  0: record known xact 1768508076 latestObservedXid 0
> CONTEXT:  xlog redo hot_update: rel 1663/16424/47760635; tid 3075/136; new
> 3075/140
> LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
> DEBUG:  0: record known xact 1768508080 latestObservedXid 0
> CONTEXT:  xlog redo exclusive_lock: xid 1768508080 rel 1663/16424/17282;
> tid 725918/2
> LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
> DEBUG:  0: record known xact 1768508082 latestObservedXid 0
> CONTEXT:  xlog redo hot_update: rel 1663/16424/4759107; tid 1098/7; new
> 1098/39
> LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
> DEBUG:  0: record known xact 1768508080 latestObservedXid 0
> CONTEXT:  xlog redo update: rel 1663/16424/17282; tid 725918/2; new
> 725918/6 LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
> DEBUG:  0: record known xact 1768508080 latestObservedXid 0
> CONTEXT:  xlog redo inse

Re: [BUGS] BUG #7500: hot-standby replica crash after an initial rsync

2012-08-21 Thread Maxim Boguk
Hi Andres,


> > I have some problems debugging the startup process with gdb...
> > I followed this sequence of commands (and got no useful results):
> You're debugging the postmaster that way. The easiest way would be to just
> attach to the startup process with gdb -p. Not sure if you can manage that
> timingwise without manually adding sleeps.
>

It seems impossible without an added sleep; it happens too fast for me.
Where should I add the sleep call in the PostgreSQL source? (The process
management part of PostgreSQL is not familiar to me at all.)


>
> Otherwise you probably will have to work with "set follow-fork-mode ask" in
> gdb. Getting to the right child isn't easy though.
>

Unfortunately, gdb under FreeBSD doesn't support "set follow-fork-mode".

Kind Regards,
Maksym


[BUGS] Database crash in Postgres 8.4.3

2012-08-21 Thread Freddie Burgess
Late yesterday afternoon our DB server went down hard. We tried to restart,
and it went into recovery mode to recover transaction history and failed.
The notable error was:

 

   FATAL: failed to re-find parent key in index "257969064" for split pages
8366/12375

 

If you look this error up, it indicates issues with the transaction logs
and an inability to recover due to corrupt or missing transaction logs.

 

The solution is to:

 

1. Back up the DB files

2. Run pg_resetxlog (this might produce corruption due to inconsistent data)

3. Dump the DB

4. Reload the DB

 

Unfortunately this solution is not practical in our case for multiple
reasons.

- We do not have the space for steps 1 & 3

- The time required for steps 1, 3, and 4 is approximately 1 week per step
(3 weeks total) since our database size is approximately 5.4TB

 

If the data is in an inconsistent state, are there alternative solutions,
such as finding the index specified in the FATAL error and somehow
dropping it?

 

And does anyone know what circumstances/conditions might corrupt a
transaction log or cause one to go missing?

 

Thanks

 



Re: [BUGS] BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

2012-08-21 Thread Robert Haas
On Tue, Aug 7, 2012 at 2:22 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> We just had a customer hit a very similar problem on 9.1.3, running on
>> Windows Server 2008 SP2. ...
>> The customer finds that they can reproduce this on a variety of
>> systems under heavy load.
>
>> Now, it looks to me like for this stack trace to happen,
>> PgstatCollectorMain() has got to call pgwin32_waitforsinglesocket (at
>> line 3002), and that function has to return true, so that got_data
>> gets set to true.  Then PgstatCollectorMain() will call recv(), which
>> on Windows will really be pgwin32_recv, which will call
>> pgwin32_waitforsinglesocket, which must now hang.  The fact that the
>> first pgwin32_waitforsinglesocket call returned true should mean that
>> the stats collector socket is ready for read, while the fact that the
>> second one did not return seems to imply that it's not ready for read,
>> close, or accept.  So it almost looks like Windows can change its mind
>> about whether the socket is readable.
>
>> Or maybe we're telling it to change its mind.  This sounds an awful
>> lot like something that could have been caused by the oversights fixed
>> in commit b85427f2276d02756b558c0024949305ea65aca5.  Was there a
>> reason we didn't back-patch that?
>
> Sure: it was unproven that that fixed anything at all, much less that it
> was bug-free enough to be safe to backpatch.  Neither of those things
> has changed since May.  If you want you can try making up a 9.1 with
> those changes and giving it to this customer to see if it fixes their
> problems --- but without some field testing of the sort, I'm pretty
> hesitant to put it into back branches.

Well, we had the customer try out 9.2beta, and they were unable to
reproduce the issue there.  Woo-hoo.  Does that qualify as sufficient
evidence for back-patching this?

(BTW, I think commit 9b63e9869ffaa4d6d3e8bf45086a765d8f310f1c contains
a thinko in one of the comments: shouldn't "a crock of the first
water" be "a crock of the first order"?)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [BUGS] BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

2012-08-21 Thread Alvaro Herrera
Excerpts from Robert Haas's message of Tue Aug 21 14:54:28 -0400 2012:

> (BTW, I think commit 9b63e9869ffaa4d6d3e8bf45086a765d8f310f1c contains
> a thinko in one of the comments: shouldn't "a crock of the first
> water" be "a crock of the first order"?)

I was interested by that phrase when I saw the commit, and as usual when
I see a phrase from Tom that looks like it might mean something
interesting, I looked it up.  Apparently "of the first water" means "of
the highest quality" or "of the most extreme kind":
http://idioms.thefreedictionary.com/of+the+first+water
(Curiously enough, the example usage for the first sense in that site
mentions some gentleman by name of Tom.)

I have learned a few neat English language tricks by paying careful
attention to Tom's choice of words.  There don't seem to be many
people on these lists who use English in just the same way.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [BUGS] BUG #7500: hot-standby replica crash after an initial rsync

2012-08-21 Thread Andres Freund
On Tuesday, August 21, 2012 03:30:44 PM Maxim Boguk wrote:
> Hi Andres,
> 
> > > I have some problems with debug startup process with gdb...
> > 
> > > I following next sequence of commands (and got no useful results):
> > You're debugging the postmaster that way. The easiest way would be to just
> > attach to the startup process with gdb -p. Not sure if you can manage
> > that timingwise without manually adding sleeps.
> 
> It seems impossible without an added sleep; it happens too fast for me.
> Where should I add the sleep call in the PostgreSQL source? (The process
> management part of PostgreSQL is not familiar to me at all.)
I would suggest adding it to StartupXLOG() in xlog.c just before the
/* REDO */ comment or similar.

I would add something akin to

elog(WARNING, "pid of startup is: %d, sleeping for 10s", getpid());
sleep(10);
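
As a rough placement sketch (the surrounding code is approximate; exact
lines differ between branches):

/* in StartupXLOG(), src/backend/access/transam/xlog.c */
elog(WARNING, "pid of startup is: %d, sleeping for 10s",
     (int) getpid());
sleep(10);          /* gives a window in which to attach: gdb -p <pid> */

/* REDO */
if (InRecovery)
{
    /* ... existing redo loop ... */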

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [BUGS] BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

2012-08-21 Thread Tom Lane
Alvaro Herrera  writes:
> Excerpts from Robert Haas's message of Tue Aug 21 14:54:28 -0400 2012:
>> (BTW, I think commit 9b63e9869ffaa4d6d3e8bf45086a765d8f310f1c contains
>> a thinko in one of the comments: shouldn't "a crock of the first
>> water" be "a crock of the first order"?)

> I was interested by that phrase when I saw the commit, and as usual when
> I see a phrase from Tom that looks like it might mean something
> interesting, I looked it up.  Apparently "of the first water" means "of
> the highest quality" or "of the most extreme kind":
> http://idioms.thefreedictionary.com/of+the+first+water

Yeah, that was intentional, not a typo.  I think modern usage of this
phrase is largely confined to gemstones though, as in "diamond of the
first water".
http://www.phrases.org.uk/meanings/137550.html

Meanwhile, back at the ranch: I'm fine with applying that patch now that
it's had some field testing.

regards, tom lane




Re: [BUGS] Database crash in Postgres 8.4.3

2012-08-21 Thread Tom Lane
"Freddie Burgess"  writes:
> Late yesterday afternoon our DB server went down hard. We tried to restart,
> and it went into recovery mode to recover transaction history and failed.

Oh dear.  I'll spare you the lecture about running up-to-date minor
releases, because this doesn't seem to exactly match any known-and-fixed
bug in 8.4.x, but you will reconsider that policy, no?

> Notable error was:
>FATAL: failed to re-find parent key in index "257969064" for split pages
> 8366/12375
> If you look this error up, it indicates issues with the transaction logs and
> the inability to recover due to corrupt or missing transaction logs.  

Don't know where you read that, but this has nothing to do with
transaction logs.  It appears to indicate pre-existing corruption in a
btree index, specifically that there is no downlink in the next index
level up pointing to an index page that just got split according to the
WAL event stream.  Unfortunately the WAL replay environment is too
limited to give you a very good fix on *which* index, unless you have
some other way of knowing what relfilenode 257969064 corresponds to.

What exactly happened when your DB server "went down hard"?  It seems
somewhat likely that that involved filesystem corruption that WAL replay
was unable to fix.  Since the replay got as far as here, the next index
level up doesn't contain obvious corruption, it just lacks an entry that
should be there.  The most likely inference is that there's a stale page
where the downlink entry should have been, which suggests strongly that
fsync failed to write a page when it was supposed to (leading to
Postgres supposing that the index page was safely down on disk and could
be forgotten about, when it was not).

That's all a rather roundabout way of saying that you'd be well advised
to take a hard look at your filesystem configuration.  These symptoms
are very consistent with the theory that the filesystem is not honoring
fsync requests properly.

> If the data is in an inconsistent state, are there other alternative
> solutions, such as finding the index specified in the FATAL error and
> somehow dropping it?  

You could try modifying the source code to reduce that error to a
WARNING temporarily, but I'm not sure how far it will get you.  The
concern here is that if one page wasn't written when it was supposed to
be, there are probably others too.  And most likely not all of them
are index pages.

Anyway, if it appears that you're not far away from the end of the xlog,
yeah I'd try disabling error tests until it gets to the end of xlog.
Once the DB is up you can determine which indexes were mentioned in the
messages (look at pg_class.relfilenode) and drop or reindex them.

A preferable solution, if you still had the WAL logs from the previous
checkpoint cycle, would be to back up the WAL pointer to the preceding
checkpoint and replay forward from there.  (This is essentially betting
that all the corruption stems from writes that failed to get to disk
just before the last checkpoint.)  However, unless you've been running a
PITR archiving setup, you probably haven't got that option.

regards, tom lane




Re: [BUGS] BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

2012-08-21 Thread Robert Haas
On Tue, Aug 21, 2012 at 4:30 PM, Tom Lane  wrote:
> Meanwhile, back at the ranch: I'm fine with applying that patch now that
> it's had some field testing.

Attached is a version that applies OK to 9.1, after resolving some
conflicts.  Review appreciated.  It wasn't exactly clear to me how to
handle the ereport/elog calls.

What to do pre-9.1 is a bit less clear to me, as the latch machinery
doesn't exist at all in those versions.  Some of the fixes can
probably still be pulled out and applied, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


socketfix-91.patch



Re: [BUGS] BUG #7500: hot-standby replica crash after an initial rsync

2012-08-21 Thread Maxim Boguk
On Wed, Aug 22, 2012 at 6:08 AM, Andres Freund wrote:

> On Tuesday, August 21, 2012 03:30:44 PM Maxim Boguk wrote:
> > Hi Andres,
>
> I would add something akin to
>
> elog(WARNING, "pid of startup is: %d, sleeping for 10s", getpid());
> sleep(10);
>

Hi Andres,

Here it is:

gdb backtrace:

#0  0x000800bd1ab0 in exit () from /lib/libc.so.7
#1  0x005d77f6 in proc_exit (code=1) at ipc.c:137
#2  0x00697948 in errfinish (dummy=) at elog.c:501
#3  0x00488956 in SlruReportIOError (ctl=,
pageno=, xid=361788644) at slru.c:857
#4  0x0048956a in SimpleLruReadPage (ctl=0x95bf80, pageno=176654,
write_ok=1 '\001', xid=361788644) at slru.c:427
#5  0x00489f7f in RecordNewMultiXact (multi=361788644,
offset=822580965, nxids=2, xids=0x800e5302c) at multixact.c:764
#6  0x0048a126 in multixact_redo (lsn=..., record=0x800e53000) at
multixact.c:2017
#7  0x00484bea in StartupXLOG () at xlog.c:6277
#8  0x004870a8 in StartupProcessMain () at xlog.c:9335
#9  0x0049cab2 in AuxiliaryProcessMain (argc=2,
argv=0x7fffe450) at bootstrap.c:412
#10 0x005ba926 in StartChildProcess (type=StartupProcess) at
postmaster.c:4442
#11 0x005bccb2 in PostmasterMain (argc=5, argv=0x7fffea38) at
postmaster.c:1093
#12 0x00566f3b in main (argc=5, argv=0x7fffea38) at main.c:188

relevant database log:
2012-08-22 07:47:21.330 MSK 18155 @ from  [vxid:1/0 txid:0] []LOG:  0:
redo starts at 925/2C0004C8
2012-08-22 07:47:21.330 MSK 18155 @ from  [vxid:1/0 txid:0] []LOCATION:
StartupXLOG, xlog.c:6202
2012-08-22 07:47:21.384 MSK 18155 @ from  [vxid:1/0 txid:0] []FATAL:
XX000: could not access status of transaction 361788644
2012-08-22 07:47:21.384 MSK 18155 @ from  [vxid:1/0 txid:0] []DETAIL:
Could not read from file "pg_multixact/offsets/1590" at offset 114688:
Unknown error: 0.
2012-08-22 07:47:21.384 MSK 18155 @ from  [vxid:1/0 txid:0] []CONTEXT:
xlog redo create multixact 361788644 offset 822580965: 1768508061 1768508078
2012-08-22 07:47:21.384 MSK 18155 @ from  [vxid:1/0 txid:0] []LOCATION:
SlruReportIOError, slru.c:840


Kind Regards,
Maksym


Re: [BUGS] BUG #7500: hot-standby replica crash after an initial rsync

2012-08-21 Thread Maxim Boguk
On Wed, Aug 22, 2012 at 1:50 PM, Maxim Boguk  wrote:

> [previous message, gdb backtrace, and log snipped]

And finally, a gdb backtrace on a build without optimization
(./configure --prefix=/usr/local/pgsql/tmp/postgresql/ --enable-debug
--enable-cassert CFLAGS='-O0'):

#0  0x000800d42ab0 in exit () from /lib/libc.so.7
#1  0x006aafc6 in proc_exit (code=1) at ipc.c:137
#2  0x007bcbb1 in errfinish (dummy=0) at elog.c:501
#3  0x004b7065 in SlruReportIOError (ctl=0xad9740, pageno=176654,
xid=361788644) at slru.c:836
#4  0x004b6353 in SimpleLruReadPage (ctl=0xad9740, pageno=176654,
write_ok=1 '\001', xid=361788644) at slru.c:427
#5  0x004b85dd in RecordNewMultiXact (multi=361788644,
offset=822580965, nxids=2, xids=0x800f5302c) at multixact.c:764
#6  0x004b9dd1 in multixact_redo (lsn=..., record=0x800f53000) at
multixact.c:2017
#7  0x004aef0c in StartupXLOG () at xlog.c:6277
#8  0x004b446d in StartupProcessMain () at xlog.c:9335
#9  0x004d9e36 in AuxiliaryProcessMain (argc=2,
argv=0x7fffe8b0) at bootstrap.c:412
#10 0x00683398 in StartChildProcess (type=StartupProcess) at
postmaster.c:4442
#11 0x0067ef60 in PostmasterMain (argc=5, argv=0x7fffea38) at
postmaster.c:1093
#12 0x005fbf5a in main (argc=5, argv=0x7fffea38) at main.c:188

It seems to be failing in the SlruPhysicalReadPage function, in this block:

errno = 0;
if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
{
    slru_errcause = SLRU_READ_FAILED;
    slru_errno = errno;
    close(fd);
    return false;
}

It seems that it read fewer than BLCKSZ bytes, which is not an error
condition for the read call (from man 2 read: "Note that reading less
than size bytes is not an error.").
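
That would also explain the odd "Unknown error: 0" in the DETAIL lines: a
short read is reported through the return value, not through errno, so
errno is still 0 when the message is formatted via strerror(). A minimal
illustration of the distinction (hypothetical code with placeholder helper
names, not the actual slru.c logic):

    ssize_t nread;

    errno = 0;
    nread = read(fd, buffer, BLCKSZ);
    if (nread < 0)
        handle_io_error(errno);     /* a real I/O error: errno is set */
    else if (nread != BLCKSZ)
        handle_short_read(nread);   /* EOF or partial page: errno stays 0,
                                     * hence strerror(0) = "Unknown error: 0" */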

The actual size of the pg_multixact/offsets/1590 file is:
[pgsql@db1 ~/tmp/postgresql-9.0.8]$ ls -la
/db/data_back/pg_multixact/offsets/1590
-rw---  1 pgsql  pgsql  81920 Aug 21 07:46
/db/data_back/pg_multixact/offsets/1590

Now it seems that the lseek(fd, (off_t) offset, SEEK_SET) call moves the
seek position beyond the actual end of the file, and after that the read
call can only read 0 bytes.
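
The numbers line up with a read past end-of-file (assuming the stock 9.0
constants SLRU_PAGES_PER_SEGMENT = 32 and BLCKSZ = 8192):

    segment = pageno / 32 = 176654 / 32 = 5520 = 0x1590   -> file "1590"
    offset  = (pageno % 32) * BLCKSZ = 14 * 8192 = 114688

    81920 bytes = 10 pages, so the segment holds only pages 0..9; page 14
    lies beyond EOF, lseek() succeeds anyway, and read() returns 0 bytes
    with errno still 0.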

Does that mean the file was damaged during the rsync?


What should I examine next?

Kind Regards,
Maksym