[HACKERS] Crash on promotion when recovery.conf is renamed

Magnus Hagander Thu, 15 Dec 2016 00:45:03 -0800

I had a system where the recovery.conf file was renamed "out of the way" at
some point, and then the system was promoted. This is obviously operator
error, but it seems like something we should handle.


What happens now is that the non-existence of recovery.conf is a FATAL
error. I wonder if it should just be a WARNING, at least in the case of
ENOENT?

What happens is this.

Log output:
2016-12-15 09:36:46.265 CET [25437] LOG:  received promote request
2016-12-15 09:36:46.265 CET [25438] FATAL:  terminating walreceiver process due to administrator command
2016-12-15 09:36:46.265 CET [25437] LOG:  invalid record length at 0/5015168: wanted 24, got 0
2016-12-15 09:36:46.265 CET [25437] LOG:  redo done at 0/5015130
2016-12-15 09:36:46.265 CET [25437] LOG:  last completed transaction was at log time 2016-12-15 09:36:19.27125+01
2016-12-15 09:36:46.276 CET [25437] LOG:  selected new timeline ID: 2
2016-12-15 09:36:46.429 CET [25437] FATAL:  could not open file "recovery.conf": No such file or directory
2016-12-15 09:36:46.429 CET [25436] LOG:  startup process (PID 25437) exited with exit code 1
2016-12-15 09:36:46.429 CET [25436] LOG:  terminating any other active server processes
2016-12-15 09:36:46.429 CET [25456] WARNING:  terminating connection because of crash of another server process
2016-12-15 09:36:46.429 CET [25456] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2016-12-15 09:36:46.429 CET [25456] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2016-12-15 09:36:46.431 CET [25436] LOG:  database system is shut down


So we can see it switches to timeline 2. Looking in pg_wal (or pg_xlog --
customer system was on 9.5, but this is reproducible in HEAD):

-rw------- 1 mha mha 16777216 Dec 15 09:36 000000010000000000000004
-rw------- 1 mha mha 16777216 Dec 15 09:36 000000010000000000000005
-rw------- 1 mha mha 16777216 Dec 15 09:36 000000020000000000000005
-rw------- 1 mha mha       41 Dec 15 09:36 00000002.history

However, according to pg_controldata, we are still on timeline 1:
Latest checkpoint location:           0/4000060
Prior checkpoint location:            0/4000060
Latest checkpoint's REDO location:    0/4000028
Latest checkpoint's REDO WAL file:    000000010000000000000004
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
..
Minimum recovery ending location:     0/5015168
Min recovery ending loc's timeline:   1


But since we have a history file for timeline 2 in the data directory (and
neatly archived), the data directory is inconsistent with the control file.
This means, for example, that any other standbys you try to connect to this
cluster will simply fail, because they try to join on timeline 2, which
doesn't actually exist.
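
To make the mismatch concrete, here is a minimal sketch of how one could
read the timeline out of a WAL segment file name. It assumes the usual
24-hex-digit segment naming, where the first 8 digits are the timeline ID.
wal_segment_timeline is a hypothetical helper for illustration, not a
function that exists in PostgreSQL.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of PostgreSQL): return the timeline ID
 * encoded in the first 8 hex digits of a 24-character WAL segment file
 * name, or 0 if the name does not look like a WAL segment. Timeline IDs
 * start at 1, so 0 is free to use as the "not a segment" value.
 */
unsigned long
wal_segment_timeline(const char *fname)
{
    char        tli[9];
    size_t      i;

    assert(fname != NULL);

    /* Segment names are exactly 24 hex digits; reject anything else. */
    if (strlen(fname) != 24)
        return 0;
    for (i = 0; i < 24; i++)
    {
        if (!isxdigit((unsigned char) fname[i]))
            return 0;
    }

    /* Copy the leading 8 digits and parse them as a hex number. */
    memcpy(tli, fname, 8);
    tli[8] = '\0';
    return strtoul(tli, NULL, 16);
}
```

Applied to the listing above, 000000020000000000000005 gives timeline 2,
while pg_controldata still reports timeline 1.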


I wonder if there might be more corner cases like this, but in this
particular one it seems easy enough to just say that failing to rename
recovery.conf because it didn't exist is safe.

But if the rename of recovery.conf fails for some other reason, for example
a permissions error, we don't want to ignore it. But we also really don't
want to end up with this kind of inconsistent data directory IMO. I don't
know that code well enough to suggest how to fix it though -- hoping for
input from someone who knows it better?
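
As a rough illustration of the distinction I mean (this is a standalone
sketch using plain rename() and errno, not the actual backend code, and
rename_if_present and RenameResult are made-up names): a missing source
file is downgraded to a warning, anything else stays fatal. It says
nothing about where in the promotion sequence the rename should happen,
which is the part I'm unsure about.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result codes for this sketch. */
typedef enum
{
    RENAME_OK,                  /* file was renamed */
    RENAME_MISSING,             /* ENOENT: warn and carry on */
    RENAME_FAILED               /* any other error: caller treats as fatal */
} RenameResult;

/*
 * Rename "from" to "to", tolerating a missing file.
 *
 * Note that rename() also reports ENOENT when a directory component of
 * "to" does not exist; this sketch does not distinguish that case.
 */
RenameResult
rename_if_present(const char *from, const char *to)
{
    assert(from != NULL && to != NULL);

    if (rename(from, to) == 0)
        return RENAME_OK;

    if (errno == ENOENT)
    {
        fprintf(stderr, "WARNING:  could not rename file \"%s\": %s\n",
                from, strerror(errno));
        return RENAME_MISSING;
    }

    /* Permission problems, I/O errors, etc. must not be ignored. */
    fprintf(stderr, "FATAL:  could not rename file \"%s\" to \"%s\": %s\n",
            from, to, strerror(errno));
    return RENAME_FAILED;
}
```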

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
