On Fri, May 22, 2015 at 5:52 AM, Piotr Gasidło <qua...@barbara.eu.org> wrote: > Got strange problem. Unable to repeat, but got logs. > > Simple master-slave using streaming replication. > Master is running. Slave is down. > Segment 0000000400004C4D00000090 was successfully archived and send > from master to slave. > > Now I've started slave, and: > > ay 21 21:23:37 d8 postgres[50645]: [3-1] 2015-05-21 21:23:37.033 CEST > @ 50645 LOG: database system was shut down in recovery at > 2015-05-21 21:22:03 CEST > May 21 21:23:37 d8 postgres[50645]: [4-1] 2015-05-21 21:23:37.034 CEST > @ 50645 LOG: entering standby mode > May 21 21:23:37 d8 postgres[50645]: [5-1] 2015-05-21 21:23:37.058 CEST > @ 50645 LOG: restored log file "0000000400004C4D00000088" from > archive > May 21 21:23:37 d8 postgres[50645]: [6-1] 2015-05-21 21:23:37.120 CEST > @ 50645 LOG: redo starts at 4C4D/88493B50 > May 21 21:23:37 d8 postgres[50645]: [7-1] 2015-05-21 21:23:37.226 CEST > @ 50645 LOG: restored log file "0000000400004C4D00000089" from > archive > May 21 21:23:37 d8 postgres[50645]: [8-1] 2015-05-21 21:23:37.426 CEST > @ 50645 LOG: restored log file "0000000400004C4D0000008A" from > archive > May 21 21:23:37 d8 postgres[50645]: [9-1] 2015-05-21 21:23:37.750 CEST > @ 50645 LOG: restored log file "0000000400004C4D0000008B" from > archive > May 21 21:23:38 d8 postgres[50645]: [10-1] 2015-05-21 21:23:38.376 > CEST @ 50645 LOG: restored log file "0000000400004C4D0000008C" from > archive > May 21 21:23:38 d8 postgres[50645]: [11-1] 2015-05-21 21:23:38.690 > CEST @ 50645 LOG: restored log file "0000000400004C4D0000008D" from > archive > May 21 21:23:38 d8 postgres[50645]: [12-1] 2015-05-21 21:23:38.855 > CEST @ 50645 LOG: restored log file "0000000400004C4D0000008E" from > archive > May 21 21:23:39 d8 postgres[50645]: [13-1] 2015-05-21 21:23:39.275 > CEST @ 50645 LOG: restored log file "0000000400004C4D0000008F" from > archive > May 21 21:23:39 d8 postgres[50645]: [14-1] 2015-05-21 21:23:39.654 > CEST @ 50645 LOG: restored log file "0000000400004C4D00000090" from > archive > May 21 21:23:40 d8 postgres[50645]: [15-1] 2015-05-21 21:23:40.222 > CEST @ 50645 LOG: consistent recovery state reached at > 4C4D/90FFF9C8 > May 21 21:23:40 d8 postgres[50644]: [3-1] 2015-05-21 21:23:40.222 CEST > @ 50644 LOG: database system is ready to accept read only > connections > May 21 21:23:40 d8 postgres[50645]: [16-1] 2015-05-21 21:23:40.223 > CEST @ 50645 LOG: unexpected pageaddr 4C46/E000000 in log segment > 0000000400004C4D00000091, offset 0 > May 21 21:23:40 d8 postgres[50699]: [4-1] 2015-05-21 21:23:40.232 CEST > @ 50699 LOG: started streaming WAL from primary at 4C4D/90000000 on > timeline 4 > May 21 21:23:40 d8 postgres[50699]: [5-1] 2015-05-21 21:23:40.232 CEST > @ 50699 FATAL: could not receive data from WAL stream: ERROR: > requested WAL segment 0000000400004C4D00000090 has already been > removed > May 21 21:23:40 d8 postgres[50645]: [17-1] 2015-05-21 21:23:40.255 > CEST @ 50645 LOG: restored log file "0000000400004C4D00000090" from > archive > May 21 21:23:40 d8 postgres[50703]: [4-1] 2015-05-21 21:23:40.268 CEST > @ 50703 LOG: started streaming WAL from primary at 4C4D/90000000 on > timeline 4 > May 21 21:23:40 d8 postgres[50703]: [5-1] 2015-05-21 21:23:40.268 CEST > @ 50703 FATAL: could not receive data from WAL stream: ERROR: > requested WAL segment 0000000400004C4D00000090 has already been > removed > May 21 21:23:40 d8 postgres[50703]: [5-2] > ... > (and so on) > > So, as I understand: > - slave was started and entered restore, > - slave restored 0000000400004C4D00000090 from archive, reached > consistent recovery state > - now, it connected to master and noticed, that, it has > 0000000400004C4D00000091 segment uncomplete > - and then, started yelling about missing segment (master deleted it > already after archiving) on master > > Why?
Thanks for the report! This seems to be a bug. This problem happens when WAL record is stored in separate two WAL files and there is no valid latter WAL file in the standby. In your case, the former file is 0000000400004C4D00000090 and the latter is 0000000400004C4D00000091. In this case, the first half of WAL record can be read from the former WAL file, but the remaining half not because no valid latter file exists in the standby. Then the standby tries to retrieve the latter WAL file via replication. The problem here is that the standby tries to start the replication from the starting point of WAL record, i.e., that's the location of the former WAL file. So the already-read WAL file is requested via replication. To address this problem, maybe we can just use the location where the remaining half of WAL record starts from as the replication starting location. We would need to look at carefully the side effect of that change, though. Regards, -- Fujii Masao -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general