Re: [HACKERS] Crash on promotion when recovery.conf is renamed

Alexander Korotkov Mon, 27 Mar 2017 08:04:21 -0700

On Mon, Mar 27, 2017 at 11:26 AM, Tsunakawa, Takayuki <
[email protected]> wrote:


> From: [email protected]
> > [mailto:[email protected]] On Behalf Of Michael Paquier
> > Moved to CF 2017-03. Both patches still apply.
>
> Sorry to be late for reviewing this, but done now.  The patch applied,
> make check passed, and the code looks almost good.  I could successfully
> perform a simple archive recovery.  Finally, I broke the 2pc state file
> while the server is down, and could confirm that the server failed to start
> as expected, emitting a FATAL message.  Worked nicely.
>

I've tested moving recovery.conf away case which was originally reported by
Magnus.

When I'm trying to promote standby when recovery.conf is moved away, I get
FATAL error.

waiting for server to promote....2017-03-27 17:12:38.225 MSK [30307] LOG:
 received promote request
2017-03-27 17:12:38.225 MSK [30311] FATAL:  terminating walreceiver process
due to administrator command
2017-03-27 17:12:38.225 MSK [30307] LOG:  redo done at 0/3000028
2017-03-27 17:12:38.226 MSK [30307] LOG:  selected new timeline ID: 2
2017-03-27 17:12:38.253 MSK [30307] FATAL:  could not open file
"recovery.conf": No such file or directory
2017-03-27 17:12:38.253 MSK [30306] LOG:  startup process (PID 30307)
exited with exit code 1
2017-03-27 17:12:38.253 MSK [30306] LOG:  terminating any other active
server processes
2017-03-27 17:12:38.256 MSK [30306] LOG:  database system is shut down
........................................................... stopped waiting
server is still promoting

If I try to start it after crash, it becomes a master on timeline 1.  Just
like in case we deleted recovery.conf while server was shut down.

waiting for server to start....2017-03-27 17:16:54.186 MSK [30413] LOG:
 listening on IPv6 address "::1", port 5430
2017-03-27 17:16:54.186 MSK [30413] LOG:  listening on IPv6 address
"fe80::1%lo0", port 5430
2017-03-27 17:16:54.186 MSK [30413] LOG:  listening on IPv4 address
"127.0.0.1", port 5430
2017-03-27 17:16:54.187 MSK [30413] LOG:  listening on Unix socket
"/tmp/.s.PGSQL.5430"
2017-03-27 17:16:54.195 MSK [30414] LOG:  database system was interrupted
while in recovery at log time 2017-03-27 17:10:23 MSK
2017-03-27 17:16:54.195 MSK [30414] HINT:  If this has occurred more than
once some data might be corrupted and you might need to choose an earlier
recovery target.
2017-03-27 17:16:54.218 MSK [30414] LOG:  database system was not properly
shut down; automatic recovery in progress
2017-03-27 17:16:54.219 MSK [30414] LOG:  redo starts at 0/2000060
2017-03-27 17:16:54.219 MSK [30414] LOG:  consistent recovery state reached
at 0/3000060
2017-03-27 17:16:54.219 MSK [30414] LOG:  invalid record length at
0/3000060: wanted 24, got 0
2017-03-27 17:16:54.219 MSK [30414] LOG:  redo done at 0/3000028
2017-03-27 17:16:54.224 MSK [30413] LOG:  database system is ready to
accept connections
 done
server started

# select pg_is_in_recovery();
 pg_is_in_recovery
───────────────────
 f
(1 row)

# select pg_walfile_name(pg_current_wal_location());
     pg_walfile_name
──────────────────────────
 000000010000000000000003
(1 row)

If instead I put recovery back and start server, then streaming replication
continues normally.

waiting for server to start....2017-03-27 17:32:16.887 MSK [30783] LOG:
 listening on IPv6 address "::1", port 5430
2017-03-27 17:32:16.887 MSK [30783] LOG:  listening on IPv6 address
"fe80::1%lo0", port 5430
2017-03-27 17:32:16.887 MSK [30783] LOG:  listening on IPv4 address
"127.0.0.1", port 5430
2017-03-27 17:32:16.888 MSK [30783] LOG:  listening on Unix socket
"/tmp/.s.PGSQL.5430"
2017-03-27 17:32:16.897 MSK [30784] LOG:  database system was interrupted
while in recovery at log time 2017-03-27 17:28:05 MSK
2017-03-27 17:32:16.897 MSK [30784] HINT:  If this has occurred more than
once some data might be corrupted and you might need to choose an earlier
recovery target.
2017-03-27 17:32:16.914 MSK [30784] LOG:  entering standby mode
2017-03-27 17:32:16.916 MSK [30784] LOG:  redo starts at 0/8000028
2017-03-27 17:32:16.916 MSK [30784] LOG:  consistent recovery state reached
at 0/9000060
2017-03-27 17:32:16.916 MSK [30784] LOG:  invalid record length at
0/9000060: wanted 24, got 0
2017-03-27 17:32:16.916 MSK [30783] LOG:  database system is ready to
accept read only connections
2017-03-27 17:32:16.920 MSK [30788] LOG:  started streaming WAL from
primary at 0/9000000 on timeline 1
 done
server started

# select pg_is_in_recovery();
 pg_is_in_recovery
───────────────────
 t
(1 row)

Thus, I think patch is working as expected in this case.

Also, I'd like to notice that it passes check-world without warnings on my
laptop.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Crash on promotion when recovery.conf is renamed

Reply via email to