On 05/12/2010 05:28 PM, Simon Riggs wrote:
On Wed, 2010-05-12 at 14:18 +0100, Simon Riggs wrote:
On Wed, 2010-05-12 at 08:52 -0400, Robert Haas wrote:
On Wed, May 12, 2010 at 7:26 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
On Wed, 2010-05-12 at 07:10 -0400, Robert Haas wrote:

I'm not sure what to make of this.  Sometimes not shutting down
doesn't sound like a feature to me.

It acts exactly the same in recovery as in normal running. It is not a
special feature of recovery at all, bug or otherwise.

Simon, that doesn't make any sense.  We are talking about a backend
getting stuck forever on an exclusive lock that is held by the startup
process and which will never be released (for example, because the
master has shut down and no more WAL can be obtained for replay).  The
startup process does not hold locks in normal operation.

When I test it, startup process holding a lock does not prevent shutdown
of a standby.

I'd be happy to see your test case showing a bug exists and that the
behaviour differs from normal running.

Let me put this differently: I accept that Stefan has reported a
problem. Neither Tom nor myself can reproduce the problem. I've re-run
Stefan's test case and restarted the server more than 400 times now
without any issue.

I re-read your post where you gave what you yourself called "uninformed
speculation". There's no real polite way to say it, but yes, your
speculation does appear to be uninformed, since it is incorrect. Not
least because Stefan's tests don't actually send any locks to the
standby anyway (!), and even if they did, your speculation as to the
cause is still wrong, as explained.

There is no evidence to link this behaviour with HS, as yet, and you
should be considering the possibility the problem lies elsewhere,
especially since it could be code you committed that is at fault.

Well, I'm not sure why people seem to have such a hard time reproducing this issue - it seems that I can provoke it really trivially (in this case no loops, no pgbench, no tricks). A few minutes ago I logged into my test standby (which is idle except for the odd connection to template1 caused by nagios - the master is idle as well and has been for days):

postg...@soldata005:~$ psql
psql (9.0beta1)
Type "help" for help.

postgres=# select 1;
 ?column?
----------
        1
(1 row)

postgres=# \q
postg...@soldata005:~$ pg_ctl -D /var/lib/postgresql/9.0b1/main/ restart
waiting for server to shut down.... done
server stopped
server starting
postg...@soldata005:~$ pg_ctl -D /var/lib/postgresql/9.0b1/main/ restart
waiting for server to shut down.... done
server stopped
server starting
postg...@soldata005:~$ pg_ctl -D /var/lib/postgresql/9.0b1/main/ restart
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down


the server log for that is as follows:

<2010-05-12 20:36:18.166 CEST,,,> LOG:  received smart shutdown request
<2010-05-12 20:36:18.167 CEST,,,> FATAL:  terminating walreceiver process due to administrator command
<2010-05-12 20:36:18.174 CEST,,,> LOG:  shutting down
<2010-05-12 20:36:18.251 CEST,,,> LOG:  database system is shut down
<2010-05-12 20:36:19.706 CEST,,,> LOG:  database system was interrupted while in recovery at log time 2010-05-06 17:36:05 CEST
<2010-05-12 20:36:19.706 CEST,,,> HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
<2010-05-12 20:36:19.706 CEST,,,> LOG:  entering standby mode
<2010-05-12 20:36:19.721 CEST,,,> LOG:  consistent recovery state reached at 1/12000078
<2010-05-12 20:36:19.721 CEST,,,> LOG:  invalid record length at 1/12000078
<2010-05-12 20:36:19.723 CEST,,,> LOG:  database system is ready to accept read only connections
<2010-05-12 20:36:19.737 CEST,,,> LOG:  streaming replication successfully connected to primary
<2010-05-12 20:36:19.918 CEST,,,> LOG:  received smart shutdown request
<2010-05-12 20:36:19.919 CEST,,,> FATAL:  terminating walreceiver process due to administrator command
<2010-05-12 20:36:19.922 CEST,,,> LOG:  shutting down
<2010-05-12 20:36:19.937 CEST,,,> LOG:  database system is shut down
<2010-05-12 20:36:21.433 CEST,,,> LOG:  database system was interrupted while in recovery at log time 2010-05-06 17:36:05 CEST
<2010-05-12 20:36:21.433 CEST,,,> HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
<2010-05-12 20:36:21.433 CEST,,,> LOG:  entering standby mode
<2010-05-12 20:36:21.482 CEST,,,> LOG:  received smart shutdown request
<2010-05-12 20:36:21.504 CEST,,,> LOG:  consistent recovery state reached at 1/12000078
<2010-05-12 20:36:21.504 CEST,,,> LOG:  invalid record length at 1/12000078
<2010-05-12 20:36:21.505 CEST,,,> LOG:  database system is ready to accept read only connections
<2010-05-12 20:36:21.516 CEST,,,> LOG:  streaming replication successfully connected to primary

So it restarted successfully two times - but on the third attempt, one can see that the server received the smart shutdown request BEFORE it reached a consistent recovery state, yet it went on to enable hot standby and to reconnect streaming replication anyway.

The database is now sitting there doing nothing and is more or less broken, because you cannot connect to it in its current state:

~$ psql
psql: FATAL:  the database system is shutting down

The startup process has the following backtrace:

(gdb) bt
#0  0x00007fbe24cb2c83 in select () from /lib/libc.so.6
#1  0x00000000006e811a in pg_usleep ()
#2  0x000000000048c333 in XLogPageRead ()
#3  0x000000000048c967 in ReadRecord ()
#4  0x0000000000493ab6 in StartupXLOG ()
#5  0x0000000000495a88 in StartupProcessMain ()
#6  0x00000000004ab25e in AuxiliaryProcessMain ()
#7  0x00000000005d4a7d in StartChildProcess ()
#8  0x00000000005d70c2 in PostmasterMain ()
#9  0x000000000057d898 in main ()


Stefan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers