I had a system where the recovery.conf file was renamed "out of the way" at some point, and then the system was promoted. This is obviously operator error, but it seems like something we should handle.
What happens now is that the nonexistence of recovery.conf is a FATAL error. I wonder if it should just be a WARNING, at least in the case of ENOENT?

Here is what happens. Log output:

2016-12-15 09:36:46.265 CET [25437] LOG: received promote request
2016-12-15 09:36:46.265 CET [25438] FATAL: terminating walreceiver process due to administrator command
mha@mha-laptop:~/postgresql/inst/head$ 2016-12-15 09:36:46.265 CET [25437] LOG: invalid record length at 0/5015168: wanted 24, got 0
2016-12-15 09:36:46.265 CET [25437] LOG: redo done at 0/5015130
2016-12-15 09:36:46.265 CET [25437] LOG: last completed transaction was at log time 2016-12-15 09:36:19.27125+01
2016-12-15 09:36:46.276 CET [25437] LOG: selected new timeline ID: 2
2016-12-15 09:36:46.429 CET [25437] FATAL: could not open file "recovery.conf": No such file or directory
2016-12-15 09:36:46.429 CET [25436] LOG: startup process (PID 25437) exited with exit code 1
2016-12-15 09:36:46.429 CET [25436] LOG: terminating any other active server processes
2016-12-15 09:36:46.429 CET [25456] WARNING: terminating connection because of crash of another server process
2016-12-15 09:36:46.429 CET [25456] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2016-12-15 09:36:46.429 CET [25456] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2016-12-15 09:36:46.431 CET [25436] LOG: database system is shut down

So we can see it switches to timeline 2.
Looking in pg_wal (or pg_xlog -- customer system was on 9.5, but this is reproducible in HEAD):

-rw------- 1 mha mha 16777216 Dec 15 09:36 000000010000000000000004
-rw------- 1 mha mha 16777216 Dec 15 09:36 000000010000000000000005
-rw------- 1 mha mha 16777216 Dec 15 09:36 000000020000000000000005
-rw------- 1 mha mha       41 Dec 15 09:36 00000002.history

However, according to pg_controldata, we are still on timeline 1:

Latest checkpoint location:           0/4000060
Prior checkpoint location:            0/4000060
Latest checkpoint's REDO location:    0/4000028
Latest checkpoint's REDO WAL file:    000000010000000000000004
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
..
Minimum recovery ending location:     0/5015168
Min recovery ending loc's timeline:   1

But since we have a history file for timeline 2 in the data directory (and neatly archived), the data directory isn't consistent with what the control file says. This means, for example, that any other standby you try to connect to this cluster will simply fail, because it tries to join on timeline 2, which doesn't actually exist.

I wonder if there might be more corner cases like this, but in this particular one it seems easy enough to just say that failing to rename recovery.conf because it didn't exist is safe. In the case of failing to rename recovery.conf for some other reason, for example a permissions error, we don't want to ignore it. But we also really don't want to end up with this kind of inconsistent data directory, IMO.

I don't know that code well enough to suggest how to fix it, though -- hoping for input from someone who knows it better.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
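
For what it's worth, the distinction I have in mind between "file was already gone" and "rename really failed" looks roughly like the sketch below. This is standalone POSIX C, not PostgreSQL source: the function name rename_recovery_conf and the rename_outcome enum are made up for illustration, and the messages go to stderr rather than through the server's error reporting. It only shows the errno split, not where in the promotion sequence the rename should happen.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical outcome codes, for illustration only. */
typedef enum
{
    RENAME_OK,              /* file was renamed */
    RENAME_MISSING_WARNED,  /* rename failed with ENOENT: warn and carry on */
    RENAME_FAILED_FATAL     /* any other failure: caller should stop */
} rename_outcome;

/*
 * Try to rename "from" to "to". ENOENT is reported as a WARNING and
 * treated as non-fatal; every other error (EACCES, EPERM, ...) is
 * reported as FATAL. Note that rename() also sets ENOENT when the
 * directory part of "to" does not exist, so a real fix would need to
 * tell those two cases apart.
 */
rename_outcome
rename_recovery_conf(const char *from, const char *to)
{
    int         save_errno;

    assert(from != NULL && to != NULL);

    if (rename(from, to) == 0)
        return RENAME_OK;

    /* Save errno before calling anything that might overwrite it. */
    save_errno = errno;

    if (save_errno == ENOENT)
    {
        fprintf(stderr, "WARNING: could not rename file \"%s\": %s\n",
                from, strerror(save_errno));
        return RENAME_MISSING_WARNED;
    }

    fprintf(stderr, "FATAL: could not rename file \"%s\" to \"%s\": %s\n",
            from, to, strerror(save_errno));
    return RENAME_FAILED_FATAL;
}
```

A caller would do something like rename_recovery_conf("recovery.conf", "recovery.done") and abort only on RENAME_FAILED_FATAL. Whether ignoring the missing file is actually safe at that point in promotion is exactly the question above.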