On Sun, May 28, 2017 at 6:54 AM, Michael Paquier <michael.paqu...@gmail.com> wrote:
> On Sat, May 27, 2017 at 2:40 PM, Ludovic Vaugeois-Pepin > <ludovi...@gmail.com> wrote: > > Say, with 9.6.2, a hot_standby fails to connect to a replication slot: > > FATAL: could not start WAL streaming: ERROR: replication slot > "test3" > > does not exist > > or > > FATAL: could not connect to the primary server: FATAL: the database > > system is starting up > > > > Is there a way to reduce the time it takes until the next attempt? I > > assumed, wrongly I think, that this would be wal_retrieve_retry_interval, > > but it seems that it won't make a difference. I tried setting it to 3s, > but > > it seems to take 15s still. Here are two log samples: > > Could you double-check your configuration? If I set > wal_retrieve_retry_interval to 1s on a standby, I am able to see a > connection attempt from a WAL receiver happening with this interval of > time in the case of repetitive failures. > It really is set at 3s on all servers (master and standbies) earlier in the "deployment" process at the same time "listen_addresses", "hot_standby", and others are set. This doesn't seem to happen every time I run tests. I increased logging to DEBUG1. This is what I got when the problem occurred again. Note that the empty line is there in the log file. < 2017-05-28 09:29:36.127 CEST > LOG: database system was shut down in recovery at 2017-05-28 09:29:03 CEST < 2017-05-28 09:29:36.127 CEST > LOG: entering standby mode < 2017-05-28 09:29:36.127 CEST > DEBUG: checkpoint record is at 0/80002B8 < 2017-05-28 09:29:36.128 CEST > DEBUG: redo record is at 0/80002B8; shutdown TRUE < 2017-05-28 09:29:36.128 CEST > DEBUG: next transaction ID: 0:2535; next OID: 18660 < 2017-05-28 09:29:36.128 CEST > DEBUG: next MultiXactId: 1; next MultiXactOffset: 0 < 2017-05-28 09:29:36.128 CEST > DEBUG: oldest unfrozen transaction ID: 1750, in database 1 < 2017-05-28 09:29:36.128 CEST > DEBUG: oldest MultiXactId: 1, in database 1 < 2017-05-28 09:29:36.128 CEST > DEBUG: commit timestamp Xid oldest/newest: 0/0 < 2017-05-28 09:29:36.128 CEST > DEBUG: transaction ID wrap limit is 2147485397 <(214)%20748-5397>, limited by database with OID 1 < 2017-05-28 09:29:36.128 CEST > DEBUG: MultiXactId wrap limit is 2147483648 <(214)%20748-3648>, limited by database with OID 1 < 2017-05-28 09:29:36.128 CEST > DEBUG: starting up replication slots < 2017-05-28 09:29:36.128 CEST > DEBUG: resetting unlogged relations: cleanup 1 init 0 < 2017-05-28 09:29:36.129 CEST > DEBUG: initializing for hot standby < 2017-05-28 09:29:36.129 CEST > DEBUG: recovery snapshots are now enabled < 2017-05-28 09:29:36.129 CEST > LOG: consistent recovery state reached at 0/8000328 < 2017-05-28 09:29:36.129 CEST > LOG: invalid record length at 0/8000328: wanted 24, got 0 < 2017-05-28 09:29:36.129 CEST > LOG: database system is ready to accept read only connections < 2017-05-28 09:29:36.135 CEST > FATAL: could not connect to the primary server: FATAL: the database system is starting up < 2017-05-28 09:29:36.135 CEST > DEBUG: invalid record length at 0/8000328: wanted 24, got 0 < 2017-05-28 09:29:51.153 CEST > DEBUG: invalid record length at 0/8000328: wanted 24, got 0 < 2017-05-28 09:29:51.158 CEST > LOG: fetching timeline history file for timeline 4 from primary server < 2017-05-28 09:29:51.160 CEST > LOG: started streaming WAL from primary at 0/8000000 on timeline 3 < 2017-05-28 09:29:51.160 CEST > LOG: replication terminated by primary server < 2017-05-28 09:29:51.160 CEST > DETAIL: End of WAL reached on timeline 3 at 0/8000328. < 2017-05-28 09:29:51.161 CEST > DEBUG: walreceiver ended streaming and awaits new instructions < 2017-05-28 09:29:51.161 CEST > LOG: new target timeline is 4 < 2017-05-28 09:29:51.161 CEST > DEBUG: invalid record length at 0/8000328: wanted 24, got 0 < 2017-05-28 09:29:51.161 CEST > LOG: restarted WAL streaming at 0/8000000 on timeline 4 < 2017-05-28 09:29:51.191 CEST > LOG: redo starts at 0/8000328 > -- > Michael > -- Ludovic Vaugeois-Pepin -- Ludovic Vaugeois-Pepin