ello,
On 2013-04-24 17:43:39 +0900, KONDO Mitsumasa wrote:
> Hi,
>
> I find problem about failing start-up achive recovery at Standby mode in
> PG9.2.4 streaming replication.
> I test same problem in PG9.2.3. But it is not occerd...
> > cp: cannot stat `../arc/000000030000000000000013': そのようなファイルやディレクトリはありません
> > [Standby] 2013-04-22 01:27:25 EDTLOG: 00000: restored log file
> > "000000020000000000000013" from archive
I can't read the error message here, but this looks suspicious. The
recovery command seems to be returning success although it prints such
an error message?
> > cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> [* This point is illegal work]> [Standby] 2013-04-22 01:27:25 EDTLOG: 00000:
> unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
> > [Standby] 2013-04-22 01:27:25 EDTLOCATION: ValidXLOGHeader, xlog.c:4395
> > cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> > cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> > [Master] 2013-04-22 01:27:25 EDTDEBUG: 00000: received replication
> > command: IDENTIFY_SYSTEM
> > [Master] 2013-04-22 01:27:25 EDTLOCATION: HandleReplicationCommand,
> > walsender.c:449
> > [Master] 2013-04-22 01:27:25 EDTDEBUG: 00000: received replication
> > command: START_REPLICATION 0/14000000
> > [Master] 2013-04-22 01:27:25 EDTLOCATION: HandleReplicationCommand,
> > walsender.c:449
> > [Master] 2013-04-22 01:27:25 EDTFATAL: 58P01: requested WAL segment
> > 000000030000000000000014 has already been removed
> > [Master] 2013-04-22 01:27:25 EDTLOCATION: XLogRead, walsender.c:1023
> > [Standby] 2013-04-22 01:27:25 EDTLOG: 00000: streaming replication
> > successfully connected to primary
> > [Standby] 2013-04-22 01:27:25 EDTLOCATION: libpqrcv_connect,
> > libpqwalreceiver.c:171
> > [Standby] 2013-04-22 01:27:25 EDTFATAL: XX000: could not receive data from
> > WAL stream: FATAL: requested WAL segment 000000030000000000000014 has
> > already been removed
> >
> > [Standby] 2013-04-22 01:27:25 EDTLOCATION: libpqrcv_receive,
> > libpqwalreceiver.c:389
> >
> > [mitsu-ko@localhost postgresql-9.2.4]$ bincp: cannot stat
> > `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> > [Standby] 2013-04-22 01:27:30 EDTDEBUG: 00000: unexpected pageaddr
> > 0/6000000 in log file 0, segment 20, offset 0
> > [Standby] 2013-04-22 01:27:30 EDTLOCATION: ValidXLOGHeader, xlog.c:4395
> > cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> > cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> > [Master] 2013-04-22 01:27:30 EDTDEBUG: 00000: received replication
> > command: IDENTIFY_SYSTEM
> > [Master] 2013-04-22 01:27:30 EDTLOCATION: HandleReplicationCommand,
> > walsender.c:449
> > [Master] 2013-04-22 01:27:30 EDTDEBUG: 00000: received replication
> > command: START_REPLICATION 0/14000000
> > [Master] 2013-04-22 01:27:30 EDTLOCATION: HandleReplicationCommand,
> > walsender.c:449
> > [Master] 2013-04-22 01:27:30 EDTFATAL: 58P01: requested WAL segment
> > 000000030000000000000014 has already been removed
> > [Master] 2013-04-22 01:27:30 EDTLOCATION: XLogRead, walsender.c:1023
> > [Standby] 2013-04-22 01:27:30 EDTLOG: 00000: streaming replication
> > successfully connected to primary
> > [Standby] 2013-04-22 01:27:30 EDTLOCATION: libpqrcv_connect,
> > libpqwalreceiver.c:171
> > [Standby] 2013-04-22 01:27:30 EDTFATAL: XX000: could not receive data from
> > WAL stream: FATAL: requested WAL segment 000000030000000000000014 has
> > already been removed
And this seems to be the second problem. You probably need to configure
a higher wal_keep_segments on the primary or you need to fix your
recovery_command.
> We try to fix this problem with my company's colleaguea,
> but it is too difficult and complicated to WAL contorol and validation
> method...
>
> This problem is occerd in xlog.c:10692.
> > 10692 if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
> > 10693 goto next_record_is_invalid;
>
> And problem is occerd this checking in ValidXLOGHeader().
> > 4390 if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
> > 4391 {
> > 4392 ereport(emode_for_corrupt_record(emode, recaddr),
> > 4393 (errmsg("unexpected pageaddr %X/%X in
> > log file %u, segment %u, offset %u",
> > 4394
> > hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
> > 4395 readId, readSeg,
> > readOff)));
> > 4396 return false;
> > 4397 }
>
> We think hdr->xlp_pageaddr.xrecoff has wrong value. This is very rare cace!
> We cannot analyze why this value is uncorrect!
I think the tests is just noticing that we don't have valid WAL - we are
probably reading old preallocated wal files from before starting the
node as a standby. Normally it will get the correct data from the
primary after that (you can see the START_REPLICATION commands in the
log), but that failed.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers