ello,

On 2013-04-24 17:43:39 +0900, KONDO Mitsumasa wrote:
> Hi,
> 
> I find problem about failing start-up achive recovery at Standby mode in 
> PG9.2.4 streaming replication.
> I test same problem in PG9.2.3. But it is not occerd...

> > cp: cannot stat `../arc/000000030000000000000013': そのようなファイルやディレクトリはありません
> > [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: restored log file 
> > "000000020000000000000013" from archive

I can't read the error message here, but this looks suspicious. The
recovery command seems to be returning success although it prints such
an error message?

> > cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> [* This point is illegal work]> [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: 
> unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
> > [Standby] 2013-04-22 01:27:25 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
> > cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> > cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> > [Master] 2013-04-22 01:27:25 EDTDEBUG:  00000: received replication 
> > command: IDENTIFY_SYSTEM
> > [Master] 2013-04-22 01:27:25 EDTLOCATION:  HandleReplicationCommand, 
> > walsender.c:449
> > [Master] 2013-04-22 01:27:25 EDTDEBUG:  00000: received replication 
> > command: START_REPLICATION 0/14000000
> > [Master] 2013-04-22 01:27:25 EDTLOCATION:  HandleReplicationCommand, 
> > walsender.c:449
> > [Master] 2013-04-22 01:27:25 EDTFATAL:  58P01: requested WAL segment 
> > 000000030000000000000014 has already been removed
> > [Master] 2013-04-22 01:27:25 EDTLOCATION:  XLogRead, walsender.c:1023
> > [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: streaming replication 
> > successfully connected to primary
> > [Standby] 2013-04-22 01:27:25 EDTLOCATION:  libpqrcv_connect, 
> > libpqwalreceiver.c:171
> > [Standby] 2013-04-22 01:27:25 EDTFATAL:  XX000: could not receive data from 
> > WAL stream: FATAL:  requested WAL segment 000000030000000000000014 has 
> > already been removed
> >     
> > [Standby] 2013-04-22 01:27:25 EDTLOCATION:  libpqrcv_receive, 
> > libpqwalreceiver.c:389
> > 
> > [mitsu-ko@localhost postgresql-9.2.4]$ bincp: cannot stat 
> > `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> > [Standby] 2013-04-22 01:27:30 EDTDEBUG:  00000: unexpected pageaddr 
> > 0/6000000 in log file 0, segment 20, offset 0
> > [Standby] 2013-04-22 01:27:30 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
> > cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> > cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> > [Master] 2013-04-22 01:27:30 EDTDEBUG:  00000: received replication 
> > command: IDENTIFY_SYSTEM
> > [Master] 2013-04-22 01:27:30 EDTLOCATION:  HandleReplicationCommand, 
> > walsender.c:449
> > [Master] 2013-04-22 01:27:30 EDTDEBUG:  00000: received replication 
> > command: START_REPLICATION 0/14000000
> > [Master] 2013-04-22 01:27:30 EDTLOCATION:  HandleReplicationCommand, 
> > walsender.c:449
> > [Master] 2013-04-22 01:27:30 EDTFATAL:  58P01: requested WAL segment 
> > 000000030000000000000014 has already been removed
> > [Master] 2013-04-22 01:27:30 EDTLOCATION:  XLogRead, walsender.c:1023
> > [Standby] 2013-04-22 01:27:30 EDTLOG:  00000: streaming replication 
> > successfully connected to primary
> > [Standby] 2013-04-22 01:27:30 EDTLOCATION:  libpqrcv_connect, 
> > libpqwalreceiver.c:171
> > [Standby] 2013-04-22 01:27:30 EDTFATAL:  XX000: could not receive data from 
> > WAL stream: FATAL:  requested WAL segment 000000030000000000000014 has 
> > already been removed

And this seems to be the second problem. You probably need to configure
a higher wal_keep_segments on the primary or you need to fix your
recovery_command.


> We try to fix this problem with my company's colleaguea,
> but it is too difficult and complicated to WAL contorol and validation 
> method...
> 
> This problem is occerd in xlog.c:10692.
> > 10692         if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
> > 10693                 goto next_record_is_invalid;
> 
> And problem is occerd this checking in ValidXLOGHeader().
> >  4390         if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
> >  4391         {
> >  4392                 ereport(emode_for_corrupt_record(emode, recaddr),
> >  4393                                 (errmsg("unexpected pageaddr %X/%X in 
> > log file %u, segment %u, offset %u",
> >  4394                                                 
> > hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
> >  4395                                                 readId, readSeg, 
> > readOff)));
> >  4396                 return false;
> >  4397         }
> 
> We think hdr->xlp_pageaddr.xrecoff has wrong value. This is very rare cace!
> We cannot analyze why this value is uncorrect!

I think the tests is just noticing that we don't have valid WAL - we are
probably reading old preallocated wal files from before starting the
node as a standby. Normally it will get the correct data from the
primary after that (you can see the START_REPLICATION commands in the
log), but that failed.


Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to