Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-05-07 Thread Heikki Linnakangas

On 26.04.2013 11:51, KONDO Mitsumasa wrote:

Hi,

I discavered the problem cause. I think taht horiguchi's discovery is
another problem...
Problem has CreateRestartPoint. In recovery mode, PG should not WAL record.
Because PG does not know latest WAL file location.
But in this problem case, PG(standby) write WAL file at RestartPoint in
archive recovery.
In recovery mode, I think that RestartPoint can write only
MinRecoveryPoint.

Here is Standby's pg_xlog directory in problem caused.

[mitsu-ko@localhost postgresql-9.2.4-c]$ ls Standby/pg_xlog/
00020003 00020007
0002000B 0003.history
00020004 00020008
0002000C 0003000E
00020005 00020009
0002000D 0003000F
00020006 0002000A
0002000E archive_status


This problem case is here.

[Standby] 2013-04-26 04:26:44 EDT DEBUG: 0: attempting to remove
WAL segments older than log file 00030002
[Standby] 2013-04-26 04:26:44 EDT LOCATION: RemoveOldXlogFiles,
xlog.c:3568
[Standby] 2013-04-26 04:26:44 EDT DEBUG: 0: recycled transaction
log file 00010002
[Standby] 2013-04-26 04:26:44 EDT LOCATION: RemoveOldXlogFiles,
xlog.c:3607
[Standby] 2013-04-26 04:26:44 EDT DEBUG: 0: recycled transaction
log file 00020002
[Standby] 2013-04-26 04:26:44 EDT LOCATION: RemoveOldXlogFiles,
xlog.c:3607
[Standby] 2013-04-26 04:26:44 EDT LOG: 0: restartpoint complete:
wrote 9 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 2
recycled; write=0.601 s, sync=1.178 s, total=1.781 s; sync files=3,
longest=1.176 s, average=0.392 s
[Standby] 2013-04-26 04:26:44 EDT LOCATION: LogCheckpointEnd, xlog.c:7893
[Standby] 2013-04-26 04:26:44 EDT LOG: 0: recovery restart point
at 0/90FE448
[Standby] 2013-04-26 04:26:44 EDT DETAIL: last completed transaction
was at log time 2013-04-26 04:25:53.203725-04
[Standby] 2013-04-26 04:26:44 EDT LOCATION: CreateRestartPoint,
xlog.c:8601
[Standby] 2013-04-26 04:26:44 EDT LOG: 0: restartpoint starting: xlog
[Standby] 2013-04-26 04:26:44 EDT LOCATION: LogCheckpointStart,
xlog.c:7821
cp: cannot stat `../arc/0003000F': そのようなファイル
やディレクトリはありません
[Standby] 2013-04-26 04:26:44 EDT DEBUG: 0: could not restore file
0003000F from archive: return code 256
[Standby] 2013-04-26 04:26:44 EDT LOCATION: RestoreArchivedFile,
xlog.c:3323
[Standby] 2013-04-26 04:26:44 EDT LOG: 0: unexpected pageaddr
0/200 in log file 0, segment 15, offset 0
[Standby] 2013-04-26 04:26:44 EDT LOCATION: ValidXLOGHeader, xlog.c:4395
cp: cannot stat `../arc/0003000F': そのようなファイル
やディレクトリはありません
[Standby] 2013-04-26 04:26:44 EDT DEBUG: 0: could not restore file
0003000F from archive: return code 256


In recovery, pg normary search WAL file at archive recovery.
If propery WAL file is nothing(archive command is failed), next search
pg_xlog directory.
Normary, propety WAL file is nothing in pg_xlog.
But this case has propety name's WAL file(But it's mistaken and illegal)
in pg_xlog.
So recovery is failed and broken Standby.

So I fix CreateRestartPoint at branching point of executing
MinRecoveryPoint.
It seems to fix this problem, but attached patch is plain.


I didn't understand this. I committed a fix for the issue where recycled 
WAL files get in the way of recovery, but I'm not sure if you're 
describing the same issue or something else. But before we dig any 
deeper into this, can you verify if you're still seeing any issues on 
9.3beta1?


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-05-07 Thread KONDO Mitsumasa

(2013/05/07 22:40), Heikki Linnakangas wrote:

On 26.04.2013 11:51, KONDO Mitsumasa wrote:

So I fix CreateRestartPoint at branching point of executing
MinRecoveryPoint.
It seems to fix this problem, but attached patch is plain.


I didn't understand this. I committed a fix for the issue where recycled WAL
files get in the way of recovery, but I'm not sure if you're describing the same
issue or something else. But before we dig any deeper into this, can you verify
if you're still seeing any issues on 9.3beta1?
Yes, my fix was very plain and mistake fix point. I see your patch, your fix is 
right. I posted another problem that is cannot promote and PITR. It might be 
different problem, uut I could not reproduce it...
Please wait for more! I try to analyze and test in 9.2.4 with your patch and 
9.3beta1.


Best regards,
--
NTT Open Source Software Center
Mitsumasa KONDO


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-26 Thread Heikki Linnakangas

On 26.04.2013 07:02, Kyotaro HORIGUCHI wrote:

I am uncertain a bit weather it is  necessary to move curFileTLI to
anywhere randomly read . On a short glance, the random access occurs also
for reading checkpoint-related records.


I didn't understand that.


Also I don't have clear distinction between lastSegmentTLI and curFileTLI
after the patch applied.


In short, lastSegmentTLI is the TLI in first page of the last opened 
segment, while curFileTLI is the TLI in the filename of the last opened 
segment. Usually they're the same, but in a segment that contains a 
timeline switch, they differ. For example, if you perform PITR to a 
point in the middle of segment 00010062, you end up with 
two segments:


00010062
00020062

The first half of those files are identical, but the latter contains a 
time-line changing checkpoint record in the middle, and from that point 
on the contents are different. When we open file 
00020062, lastSegmentTLI is 1, because the first half of 
that segment contains WAL from timeline 1, but curFileTLI is 2, because 
that's the TLI in the filename.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-26 Thread Amit Langote
What would happen if a recycled segment gets assigned a newer timeline than
the one we are currently recovering from? In the reported erroneous
behavior, that is what happens causing XLogFileReadAnyTLI() to return such
bogus segment causing the error. Since, expectedTLIs contains a newer
timeline at its head, and if such a wrongly recycled (that is one bearing a
newer timeline than curFileTLI) segment exists in pg_xlog,
XLogFileReadAnyTLI() would return fd of that segment itself causing the
error in question. In next retry, the same thing would happen and prevent
standby startup from proceeding any further.

So, should we also look at the segment recycling code to see how it names
the recycled segments (what timelineID does it use) ? For example, in
CreateRestartPoint() where segments are recycled? 



--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Failing-start-up-archive-recovery-at-Standby-mode-in-PG9-2-4-tp5753110p5753358.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-26 Thread Heikki Linnakangas

On 26.04.2013 07:47, Amit Langote wrote:

  How would code after applying this patch behave if a recycled segment gets
renamed using the newest timeline (say 3) while we are still recovering from
a lower timeline (say 2)? In that case, since XLogFileReadAnyTLI returns
that recycled segment as the next segment to recover from, we get the error.
And since XLogFileReadAnyTLI iterates over expectedTLIs (whose head seems to
be recoveryTargetTLI at all times, is that right?), it will return that
wrong (recycled segment) in the first iteration itself.


As long as the right segment is present in the archive, that's OK. Even 
if a recycled segment with higher TLI is in pg_xlog, with the patch 
we'll still read the segment with lower TLI from the archive. But there 
is a corner-case where a recycled segment with a higher TLI masks a 
segment with lower TLI in pg_xlog. For example, if you try to recover by 
copying all the required WAL files directly into pg_xlog, without using 
restore_command, you could run into problems.


So yeah, I think you're right and we need to rethink the recycling. The 
first question is, do we have to recycle WAL segments during recovery at 
all? It's pointless when we're restoring from archive with 
restore_command; the recycled files will just get replaced with files 
from the archive. It does help when walreceiver is active, but I wonder 
how significant it is in practice.


I guess the safest, smallest change is to use a lower TLI when 
installing the recycled files. So, instead of using the current recovery 
target timeline, use the ID of the timeline we're currently recovering. 
That way the reycycled segments will never have a higher TLI than other 
files that recovery will try to replay. See attached patch.


- Heikki
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 30d877b..cfd8a34 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8459,10 +8459,15 @@ CreateRestartPoint(int flags)
 		PrevLogSeg(_logId, _logSeg);
 
 		/*
-		 * Update ThisTimeLineID to the recovery target timeline, so that
-		 * we install any recycled segments on the correct timeline.
+		 * Update ThisTimeLineID to the timeline we're currently recovering,
+		 * so that we install any recycled segments on the correct timeline.
+		 * (This might be higher than the TLI of the restartpoint we just
+		 * made, if a timeline switch was replayed while we were performing
+		 * the restartpoint.)
 		 */
-		ThisTimeLineID = GetRecoveryTargetTLI();
+		SpinLockAcquire(xlogctl-info_lck);
+		ThisTimeLineID = XLogCtl-lastCheckPoint.ThisTimeLineID;
+		SpinLockRelease(xlogctl-info_lck);
 
 		RemoveOldXlogFiles(_logId, _logSeg, endptr);
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-26 Thread KONDO Mitsumasa

Hi,

I discavered the problem cause. I think taht horiguchi's discovery is another 
problem...
Problem has CreateRestartPoint. In recovery mode, PG should not WAL record.
Because PG does not know latest WAL file location.
But in this problem case, PG(standby) write WAL file at RestartPoint in archive 
recovery.
In recovery mode, I think that RestartPoint can write only MinRecoveryPoint.

Here is Standby's pg_xlog directory in problem caused.

[mitsu-ko@localhost postgresql-9.2.4-c]$ ls Standby/pg_xlog/
00020003  00020007  0002000B  
0003.history
00020004  00020008  0002000C  
0003000E
00020005  00020009  0002000D  
0003000F
00020006  0002000A  0002000E  
archive_status


This problem case is here.

[Standby] 2013-04-26 04:26:44 EDT DEBUG:  0: attempting to remove WAL 
segments older than log file 00030002
[Standby] 2013-04-26 04:26:44 EDT LOCATION:  RemoveOldXlogFiles, xlog.c:3568
[Standby] 2013-04-26 04:26:44 EDT DEBUG:  0: recycled transaction log file 
00010002
[Standby] 2013-04-26 04:26:44 EDT LOCATION:  RemoveOldXlogFiles, xlog.c:3607
[Standby] 2013-04-26 04:26:44 EDT DEBUG:  0: recycled transaction log file 
00020002
[Standby] 2013-04-26 04:26:44 EDT LOCATION:  RemoveOldXlogFiles, xlog.c:3607
[Standby] 2013-04-26 04:26:44 EDT LOG:  0: restartpoint complete: wrote 9 
buffers (0.2%); 0 transaction log file(s) added, 0 removed, 2 recycled; 
write=0.601 s, sync=1.178 s, total=1.781 s; sync files=3, longest=1.176 s, 
average=0.392 s
[Standby] 2013-04-26 04:26:44 EDT LOCATION:  LogCheckpointEnd, xlog.c:7893
[Standby] 2013-04-26 04:26:44 EDT LOG:  0: recovery restart point at 
0/90FE448
[Standby] 2013-04-26 04:26:44 EDT DETAIL:  last completed transaction was at 
log time 2013-04-26 04:25:53.203725-04
[Standby] 2013-04-26 04:26:44 EDT LOCATION:  CreateRestartPoint, xlog.c:8601
[Standby] 2013-04-26 04:26:44 EDT LOG:  0: restartpoint starting: xlog
[Standby] 2013-04-26 04:26:44 EDT LOCATION:  LogCheckpointStart, xlog.c:7821
cp: cannot stat `../arc/0003000F': そのようなファイルやディレクトリはありません
[Standby] 2013-04-26 04:26:44 EDT DEBUG:  0: could not restore file 
0003000F from archive: return code 256
[Standby] 2013-04-26 04:26:44 EDT LOCATION:  RestoreArchivedFile, xlog.c:3323
[Standby] 2013-04-26 04:26:44 EDT LOG:  0: unexpected pageaddr 0/200 in 
log file 0, segment 15, offset 0
[Standby] 2013-04-26 04:26:44 EDT LOCATION:  ValidXLOGHeader, xlog.c:4395
cp: cannot stat `../arc/0003000F': そのようなファイルやディレクトリはありません
[Standby] 2013-04-26 04:26:44 EDT DEBUG:  0: could not restore file 
0003000F from archive: return code 256


In recovery, pg normary search WAL file at archive recovery.
If propery WAL file is nothing(archive command is failed), next search pg_xlog 
directory.
Normary, propety WAL file is nothing in pg_xlog.
But this case has propety name's WAL file(But it's mistaken and illegal) in 
pg_xlog.
So recovery is failed and broken Standby.

So I fix CreateRestartPoint at branching point of executing MinRecoveryPoint.
It seems to fix this problem, but attached patch is plain.


Best Regard,
--
NTT Open Source Software Center
Mitsumasa KONDO
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5452ae1..ae4bcd8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7371,7 +7371,8 @@ CreateRestartPoint(int flags)
 * restartpoint. It's assumed that flushing the buffers will do that as a
 * side-effect.
 */
-   if (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
+   if (( ControlFile-state == DB_IN_ARCHIVE_RECOVERY  RecoveryInProgress()) ||
+   XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
lastCheckPoint.redo = ControlFile-checkPointCopy.redo)
{
ereport(DEBUG2,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-26 Thread Mitsumasa KONDO
I explain more detail about this problem.

This problem was occurred by RestartPoint create illegal WAL file in during
archive recovery. But I cannot recognize why illegal WAL file was created
in CreateRestartPoint(). My attached patch is really plain…

In problem case at XLogFileReadAnyTLI(),  first check WAL file does not get
fd. Because it does not exists property WAL File in archive directory.

XLogFileReadAnyTLI()
 if (sources  XLOG_FROM_ARCHIVE)
 {
   fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_ARCHIVE, true);
if (fd != -1)
{
   elog(DEBUG1, got WAL segment from archive);
   return fd;
}
 }

Next search WAL file in pg_xlog. There are illegal WAL File in pg_xlog. And
return illegal WAL File’s fd.

XLogFileReadAnyTLI()
  if (sources  XLOG_FROM_PG_XLOG)
  {
 fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_PG_XLOG, true);
 if (fd != -1)
return fd;
  }

Returned fd is be readFile value. Of cource readFile value is over 0. So
out of for-loop.

XLogPageRead
  readFile = XLogFileReadAnyTLI(readId, readSeg, DEBUG2,
  sources);
   switched_segment = true;
   if (readFile = 0)
  break;

Next, problem function point. Illegal WAL file was read, and error.

XLogPageRead
   if (lseek(readFile, (off_t) readOff, SEEK_SET)  0)
  {
  ereport(emode_for_corrupt_record(emode, *RecPtr),
(errcode_for_file_access(),
   errmsg(could not seek in log file %u, segment %u to offset %u: %m,
readId, readSeg, readOff)));
  goto next_record_is_invalid;
   }
   if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
   {
  ereport(emode_for_corrupt_record(emode, *RecPtr),
(errcode_for_file_access(),
   errmsg(could not read from log file %u, segment %u, offset %u: %m,
readId, readSeg, readOff)));
  goto next_record_is_invalid;
   }
   if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
  goto next_record_is_invalid;


I think that horiguchi's discovery point is after this point.
We must fix that CreateRestartPoint() does not create illegal WAL File.

Best regards,

--
Mitsumasa KONDO


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-25 Thread Amit Langote
I also had a similar observation when I could reproduce this.
I tried to find why restartpoint causes the recycled segment to be named
after timeline 3, but have not been able to determine that.

When I looked at the source, I found that, the function XLogFileReadAnyTLI
which returns a segment file for reading a XLog page iterates over a list
expectedTLIs which starts with 3 in such a case (that is, in case where this
error happens).  XLogFileReadAnyTLI checks the segment in both archive and
pg_xlog. So, even if such a segment (that is with timeline 3) may not be in
the archive , it may be in pg_xlog, due to recycling as we have observed.
So, such a recycled segment may be returned by XLogFileReadAnyTLI as though
it were the next segment to recover from, resulting in the unexpected
pageaddr ... error. 

I could not understand (in case this error happens) how expectedTLIs list
had 3 at its head (for example, when XLogFileReadAnyTLI used it as we
observed) whereas at other times (when this error does not happen), it has 2
at its head until all the segments of timeline 2 are recovered from?
Also, how does recycled segment gets timeline 3 whereas 2 is expected in
this case?

Is this the right way to look at the problem and its possible fix?

I have tried going through the source regarding this but have not been able
to determine where this could accidentally happen, partly because I do not
understand recovery process (and its code) very well. Will post if find
anything useful. 

regards,
Amit Langote





--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Failing-start-up-archive-recovery-at-Standby-mode-in-PG9-2-4-tp5753110p5753221.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-25 Thread Kyotaro HORIGUCHI
Hmm. I think that I caught the tail of the problem..

Script with small change reproduced the situation for me.

The latest standby uses 3 as its TLI after the history file
0..3.history which could get from the archive. So the WAL files
recycled on this standby will have the TLI=3.
Nevertheless the LSN of the segment recycled on standby is on the
TLI=2 in the master, the standby makes the first request for each
segment with that LSN but TLI = 3 to the master because the standby
runs on recoveryTargetTLI=3. The master reasonably doesn't have it and
finally the standby finds that wrong WAL file in its pg_xlog directory
before the second request with TLI=2 would be made.

In conclusion, the standby should name the recycled WAL segment using
the same TLI for the LSN on the master. Or should never recycle WAL
files..

regards,

--
Kyotaro Horiguchi


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-25 Thread Heikki Linnakangas

On 25.04.2013 17:55, Kyotaro HORIGUCHI wrote:

Hmm. I think that I caught the tail of the problem..

Script with small change reproduced the situation for me.


Can you share the modified script, please?


The latest standby uses 3 as its TLI after the history file
0..3.history which could get from the archive. So the WAL files
recycled on this standby will have the TLI=3.
Nevertheless the LSN of the segment recycled on standby is on the
TLI=2 in the master, the standby makes the first request for each
segment with that LSN but TLI = 3 to the master because the standby
runs on recoveryTargetTLI=3. The master reasonably doesn't have it and
finally the standby finds that wrong WAL file in its pg_xlog directory
before the second request with TLI=2 would be made.


I'm not sure I understand what the problem is, though. When the standby 
opens the bogus, recycled, file in pg_xlog, it will notice that the 
header is incorrect, and retry reading the file from the archive.



In conclusion, the standby should name the recycled WAL segment using
the same TLI for the LSN on the master. Or should never recycle WAL
files..


Perhaps, but it should nevertheless not get confused by recycled segments.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-25 Thread Kyotaro HORIGUCHI
I forgot it.

 In conclusion, the standby should name the recycled WAL segment using
the same TLI for the LSN on the master. Or should never recycle WAL
files.

Or the standby should make the request with correct TLI at first
consulting the timeline history.
Or the standby should make retry with the more small TLIs  after it
gets broken segments.

regards,

--
Kyotaro Horiguchi


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-25 Thread Kyotaro HORIGUCHI
 Can you share the modified script, please?

Please find the attached files:
  test.sh : test script. most significant change is the load.
   I used simple insert instead of pgbench.
   It might need some more adjustment for other environment
   as my usual.
  xlog.c.diff : Additional log output I thought to be useful to diagnose.

 I'm not sure I understand what the problem is, though. When the standby
 opens the bogus, recycled, file in pg_xlog, it will notice that the header
 is incorrect, and retry reading the file from the archive.

It looks the corrupted header only once. It continues to make retry
without looking there after that.

 Perhaps, but it should nevertheless not get confused by recycled segments.

regards, and good night.
--
Kyotaro Horiguchi


promtest_20130426.tar.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-25 Thread Heikki Linnakangas

On 25.04.2013 18:56, Kyotaro HORIGUCHI wrote:

Can you share the modified script, please?


Please find the attached files:
   test.sh : test script. most significant change is the load.
I used simple insert instead of pgbench.
It might need some more adjustment for other environment
as my usual.
   xlog.c.diff : Additional log output I thought to be useful to diagnose.


Ok, thanks, I see what's going on now. The problem is that once 
XLogFileRead() finds a file with tli X, it immediately sets curFileTLI = 
X. XLogFileReadAnyTLI() never tries to read files with tli  curFileTLI. 
So, if recovery finds a file with the right filename, e.g 
00030008, it never tries to open 
00020008 anymore, even if the contents of 
00030008 later turn out to be bogus.


One idea to fix this is to not set curFileTLI, until the page header on 
the just-opened file has been verified. Another idea is to change the 
check in XLogFileReadAnyTLI() that currently forbids curFileTLI from 
moving backwards. We could allow curFileTLI to move backwards, as long 
as the tli is = ThisTimeLineID (ThisTimeLineID is the current timeline 
we're recovering records from).


Attached is a patch for the 2nd approach. With the patch, the test 
script works for me. Thoughts?


PS. This wasn't caused by the 9.2.4 change to do crash recovery before 
entering archive recovery. The test script fails in the same way with 
9.2.3 as well.


- Heikki
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 690077c..4ca75d7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -251,8 +251,7 @@ static bool recoveryStopAfter;
  * curFileTLI: the TLI appearing in the name of the current input WAL file.
  * (This is not necessarily the same as ThisTimeLineID, because we could
  * be scanning data that was copied from an ancestor timeline when the current
- * file was created.)  During a sequential scan we do not allow this value
- * to decrease.
+ * file was created.)
  */
 static TimeLineID recoveryTargetTLI;
 static bool recoveryTargetIsLatest = false;
@@ -657,7 +656,7 @@ static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
 static int XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
 			 int source, bool notexistOk);
 static int XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode,
-   int sources);
+   int sources, bool randAccess);
 static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
 			 bool randAccess);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
@@ -2899,7 +2898,8 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
  * This version searches for the segment with any TLI listed in expectedTLIs.
  */
 static int
-XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, int sources)
+XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, int sources,
+   bool randAccess)
 {
 	char		path[MAXPGPATH];
 	ListCell   *cell;
@@ -2909,17 +2909,16 @@ XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, int sources)
 	 * Loop looking for a suitable timeline ID: we might need to read any of
 	 * the timelines listed in expectedTLIs.
 	 *
-	 * We expect curFileTLI on entry to be the TLI of the preceding file in
-	 * sequence, or 0 if there was no predecessor.	We do not allow curFileTLI
-	 * to go backwards; this prevents us from picking up the wrong file when a
-	 * parent timeline extends to higher segment numbers than the child we
-	 * want to read.
+	 * During a sequential read, do not check TLIs smaller than the timeline
+	 * we're currently recovering (ThisTimeLineID); this prevents us from
+	 * picking up the wrong file when a parent timeline extends to higher
+	 * segment numbers than the child we want to read.
 	 */
 	foreach(cell, expectedTLIs)
 	{
 		TimeLineID	tli = (TimeLineID) lfirst_int(cell);
 
-		if (tli  curFileTLI)
+		if (!randAccess  tli  ThisTimeLineID)
 			break;/* don't bother looking at too-old TLIs */
 
 		if (sources  XLOG_FROM_ARCHIVE)
@@ -10510,9 +10509,6 @@ retry:
 		close(readFile);
 		readFile = -1;
 	}
-	/* Reset curFileTLI if random fetch. */
-	if (randAccess)
-		curFileTLI = 0;
 
 	/*
 	 * Try to restore the file from archive, or read an
@@ -10573,7 +10569,7 @@ retry:
 	/* Don't try to read from a source that just failed */
 	sources = ~failedSources;
 	readFile = XLogFileReadAnyTLI(readId, readSeg, DEBUG2,
-  sources);
+  sources, randAccess);
 	switched_segment = true;
 	if (readFile = 0)
 		break;
@@ -10607,16 +10603,12 @@ retry:
 			{
 int			sources;
 
-/* Reset curFileTLI if random fetch. */
-if (randAccess)
-	curFileTLI = 0;
-
 sources = XLOG_FROM_PG_XLOG;
 if (InArchiveRecovery)
 	sources |= XLOG_FROM_ARCHIVE;
 
 readFile = 

Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-25 Thread Kyotaro HORIGUCHI
Thank you for the patch.

The test script finishes in success with that. And looks reasonable on a
short glance.

On Fri, Apr 26, 2013 at 4:34 AM, Heikki Linnakangas hlinnakan...@vmware.com
wrote:
 One idea to fix this is to not set curFileTLI, until the page header on
the
 just-opened file has been verified. Another idea is to change the check in
 XLogFileReadAnyTLI() that currently forbids curFileTLI from moving
 backwards. We could allow curFileTLI to move backwards, as long as the
 tli is = ThisTimeLineID (ThisTimeLineID is the current timeline we're
 recovering records from).

 Attached is a patch for the 2nd approach. With the patch, the test script
 works for me. Thoughts?

I am uncertain a bit weather it is  necessary to move curFileTLI to
anywhere randomly read . On a short glance, the random access occurs also
for reading checkpoint-related records.
Also I don't have clear distinction between lastSegmentTLI and curFileTLI
after the patch applied. Although , I need look closer around them to
understand.

 PS. This wasn't caused by the 9.2.4 change to do crash recovery before
 entering archive recovery. The test script fails in the same way with
9.2.3
 as well.

--
Kyotaro Horiguchi


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-25 Thread Amit Langote
 How would code after applying this patch behave if a recycled segment gets
renamed using the newest timeline (say 3) while we are still recovering from
a lower timeline (say 2)? In that case, since XLogFileReadAnyTLI returns
that recycled segment as the next segment to recover from, we get the error.
And since XLogFileReadAnyTLI iterates over expectedTLIs (whose head seems to
be recoveryTargetTLI at all times, is that right?), it will return that
wrong (recycled segment) in the first iteration itself. 

The code for renaming the recycling segment remains unaffected by this
patch, right? That is, the code which assigns timelineID to newly created /
recycled segments is not changed. 

(a snippet from CreateRestartPoint() )

/*
 * Update ThisTimeLineID to the recovery target timeline, so that
 * we install any recycled segments on the correct timeline.
 */

 ThisTimeLineID = GetRecoveryTargetTLI();   

RemoveOldXlogFiles(_logId, _logSeg, endptr);

/*
 * Make more log segments if needed.  (Do this after recycling old log
 * segments, since that may supply some of the needed files.)
 */
PreallocXlogFiles(endptr);


So, would we still get a similar behavior (wherein the wrongly named
recycled segments are produced) and cause the error to happen again? Do we
need to look here?

What do you think? 



--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Failing-start-up-archive-recovery-at-Standby-mode-in-PG9-2-4-tp5753110p5753352.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-25 Thread Amit Langote
 How would code after applying this patch behave if a recycled segment gets
renamed using the newest timeline (say 3) while we are still recovering from
a lower timeline (say 2)? In that case, since XLogFileReadAnyTLI returns
that recycled segment as the next segment to recover from, we get the error.
And since XLogFileReadAnyTLI iterates over expectedTLIs (whose head seems to
be recoveryTargetTLI at all times, is that right?), it will return that
wrong (recycled segment) in the first iteration itself. 

The code for renaming the recycling segment remains unaffected by this
patch, right? That is, the code which assigns timelineID to newly created /
recycled segments is not changed. 

(a snippet from CreateRestartPoint() ) 

/* 
 * Update ThisTimeLineID to the recovery target timeline, so that 
 * we install any recycled segments on the correct timeline. 
 */ 

 ThisTimeLineID = GetRecoveryTargetTLI();   

RemoveOldXlogFiles(_logId, _logSeg, endptr); 

/* 
 * Make more log segments if needed.  (Do this after recycling old log 
 * segments, since that may supply some of the needed files.) 
 */ 
PreallocXlogFiles(endptr); 


So, would we still get a similar behavior (wherein the wrongly named
recycled segments are produced) and cause the error to happen again? Do we
need to look here? 

What do you think?



--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Failing-start-up-archive-recovery-at-Standby-mode-in-PG9-2-4-tp5753110p5753353.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-24 Thread Andres Freund
ello,

On 2013-04-24 17:43:39 +0900, KONDO Mitsumasa wrote:
 Hi,
 
 I find problem about failing start-up achive recovery at Standby mode in 
 PG9.2.4 streaming replication.
 I test same problem in PG9.2.3. But it is not occerd...

  cp: cannot stat `../arc/00030013': そのようなファイルやディレクトリはありません
  [Standby] 2013-04-22 01:27:25 EDTLOG:  0: restored log file 
  00020013 from archive

I can't read the error message here, but this looks suspicious. The
recovery command seems to be returning success although it prints such
an error message?

  cp: cannot stat `../arc/00030014': そのようなファイルやディレクトリはありません
 [* This point is illegal work] [Standby] 2013-04-22 01:27:25 EDTLOG:  0: 
 unexpected pageaddr 0/600 in log file 0, segment 20, offset 0
  [Standby] 2013-04-22 01:27:25 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
  cp: cannot stat `../arc/00030014': そのようなファイルやディレクトリはありません
  cp: cannot stat `../arc/0004.history': そのようなファイルやディレクトリはありません
  [Master] 2013-04-22 01:27:25 EDTDEBUG:  0: received replication 
  command: IDENTIFY_SYSTEM
  [Master] 2013-04-22 01:27:25 EDTLOCATION:  HandleReplicationCommand, 
  walsender.c:449
  [Master] 2013-04-22 01:27:25 EDTDEBUG:  0: received replication 
  command: START_REPLICATION 0/1400
  [Master] 2013-04-22 01:27:25 EDTLOCATION:  HandleReplicationCommand, 
  walsender.c:449
  [Master] 2013-04-22 01:27:25 EDTFATAL:  58P01: requested WAL segment 
  00030014 has already been removed
  [Master] 2013-04-22 01:27:25 EDTLOCATION:  XLogRead, walsender.c:1023
  [Standby] 2013-04-22 01:27:25 EDTLOG:  0: streaming replication 
  successfully connected to primary
  [Standby] 2013-04-22 01:27:25 EDTLOCATION:  libpqrcv_connect, 
  libpqwalreceiver.c:171
  [Standby] 2013-04-22 01:27:25 EDTFATAL:  XX000: could not receive data from 
  WAL stream: FATAL:  requested WAL segment 00030014 has 
  already been removed
  
  [Standby] 2013-04-22 01:27:25 EDTLOCATION:  libpqrcv_receive, 
  libpqwalreceiver.c:389
  
  [mitsu-ko@localhost postgresql-9.2.4]$ bincp: cannot stat 
  `../arc/00030014': そのようなファイルやディレクトリはありません
  [Standby] 2013-04-22 01:27:30 EDTDEBUG:  0: unexpected pageaddr 
  0/600 in log file 0, segment 20, offset 0
  [Standby] 2013-04-22 01:27:30 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
  cp: cannot stat `../arc/00030014': そのようなファイルやディレクトリはありません
  cp: cannot stat `../arc/0004.history': そのようなファイルやディレクトリはありません
  [Master] 2013-04-22 01:27:30 EDTDEBUG:  0: received replication 
  command: IDENTIFY_SYSTEM
  [Master] 2013-04-22 01:27:30 EDTLOCATION:  HandleReplicationCommand, 
  walsender.c:449
  [Master] 2013-04-22 01:27:30 EDTDEBUG:  0: received replication 
  command: START_REPLICATION 0/1400
  [Master] 2013-04-22 01:27:30 EDTLOCATION:  HandleReplicationCommand, 
  walsender.c:449
  [Master] 2013-04-22 01:27:30 EDTFATAL:  58P01: requested WAL segment 
  00030014 has already been removed
  [Master] 2013-04-22 01:27:30 EDTLOCATION:  XLogRead, walsender.c:1023
  [Standby] 2013-04-22 01:27:30 EDTLOG:  0: streaming replication 
  successfully connected to primary
  [Standby] 2013-04-22 01:27:30 EDTLOCATION:  libpqrcv_connect, 
  libpqwalreceiver.c:171
  [Standby] 2013-04-22 01:27:30 EDTFATAL:  XX000: could not receive data from 
  WAL stream: FATAL:  requested WAL segment 00030014 has 
  already been removed

And this seems to be the second problem. You probably need to configure
a higher wal_keep_segments on the primary or you need to fix your
recovery_command.


 We try to fix this problem with my company's colleaguea,
 but it is too difficult and complicated to WAL contorol and validation 
 method...
 
 This problem is occerd in xlog.c:10692.
  10692 if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
  10693 goto next_record_is_invalid;
 
 And problem is occerd this checking in ValidXLOGHeader().
   4390 if (!XLByteEQ(hdr-xlp_pageaddr, recaddr))
   4391 {
   4392 ereport(emode_for_corrupt_record(emode, recaddr),
   4393 (errmsg(unexpected pageaddr %X/%X in 
  log file %u, segment %u, offset %u,
   4394 
  hdr-xlp_pageaddr.xlogid, hdr-xlp_pageaddr.xrecoff,
   4395 readId, readSeg, 
  readOff)));
   4396 return false;
   4397 }
 
 We think hdr-xlp_pageaddr.xrecoff has wrong value. This is very rare cace!
 We cannot analyze why this value is uncorrect!

I think the tests is just noticing that we don't have valid WAL - we are
probably reading old preallocated wal files from before starting the
node as a standby. Normally it will get the correct data from the
primary after that (you can see the START_REPLICATION commands in the
log), but that failed.


Greetings,

Andres 

Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-24 Thread Kyotaro HORIGUCHI
Hello,

  cp: cannot stat `../arc/00030013': そのようなファイルやディレクトリはありません
  [Standby] 2013-04-22 01:27:25 EDTLOG:  0: restored log file 
  00020013 from archive

 I can't read the error message here, but this looks suspicious.

The error message is No such file or directory in Japanese :-).
Perhaps restore_command is 'cp ../arc/%f %p'.

 The recovery command seems to be returning success although it prints such
an error message?

It seems that, on 9.2.4, RestoreArchivedFile returns expected xlog
filename even if the restore command ends up in failure. And the
caller XLogFileOpen successfully ets and returns fd for the filename
if the file exists since before the restoration attempt. It does not
matter when the existing file and the file expected to be restored are
identical in content.

If the segment includes the timeline increment point, something might
happen, I suppose..

regards,

-- 
Kyotaro Horiguchi


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-24 Thread Kyotaro HORIGUCHI
Sorry,

 caller XLogFileOpen successfully ets and returns fd for the filename

The caller is XLogFileRead in this case.

# and 'ets' is gets, of course.

regards,
-- 
Kyotaro Horiguchi


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-24 Thread Andres Freund
On 2013-04-24 19:16:12 +0900, Kyotaro HORIGUCHI wrote:
 Hello,
 
   cp: cannot stat `../arc/00030013': そのようなファイルやディレクトリはありません
   [Standby] 2013-04-22 01:27:25 EDTLOG:  0: restored log file 
   00020013 from archive
 
  I can't read the error message here, but this looks suspicious.
 
 The error message is No such file or directory in Japanese :-).
 Perhaps restore_command is 'cp ../arc/%f %p'.
 
  The recovery command seems to be returning success although it prints such
 an error message?
 
 It seems that, on 9.2.4, RestoreArchivedFile returns expected xlog
 filename even if the restore command ends up in failure. And the
 caller XLogFileOpen successfully ets and returns fd for the filename
 if the file exists since before the restoration attempt. It does not
 matter when the existing file and the file expected to be restored are
 identical in content.

But thats not what happening here, afaics the restore log file ...
message is only printed if the returncode is 0.

And the return value from RestoreArchivedFile() would be different in
the case you describe which would make XLogFileRead() fail, so it should
result in different errors.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-24 Thread Kyotaro HORIGUCHI
Oops,

 But thats not what happening here, afaics the restore log file ...
 message is only printed if the returncode is 0.

You're right. 'cp nonexistent somewhere' exits with the status
code 1 (or 256?).

The quoted log lines simply show that segment for tli=3 did not exist
and that for tli=2 had been gotten finally. It's quite normal and
irrelevant to the trouble mentioned. Sorry for the confusion.

Unfortunately, the script attached didn't reproduce the situation for
me. But in the log attached,

 [Standby] 2013-04-22 01:27:25 EDTFATAL:  XX000: could not receive data from 
 WAL stream: FATAL:  requested WAL segment 00030014 has 
 already been removed
 cp: cannot stat `../arc/00030014': No such file or directory.
 [Standby] 2013-04-22 01:27:30 EDTDEBUG:  0: unexpected pageaddr 0/600 
 in log file 0, segment 20, offset 0

seems showing that the standby received negative ack for the request
for 20th segment, and 5 seconds later, it tried to get that from
archive and somehow thought that it'd gotten but the header is of 6th
segment.

regards,

-- 
Kyotaro Horiguchi


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-24 Thread Kyotaro HORIGUCHI
I had a bit look on it and came up with an hypothesis.. umm or a
scenario.

==

 Just after restartpoint, too old xlog files are recycled but its
page header has old content, specifically, xlogid and xrecoff.

Plus, if the master's LSN is at the head of new segment file, the
file for the segment may have not been created and the WAL send
request for that LSN from the standby might fail as requested
WAL segment %s has already been removed, in spite of the segment
is not yet created...

 So the standby received nack for the request and looks for
archive but the file is properly not existent. But the file of
that segment is certainly found in pg_xlog directory. So
XLogFileRead grabs the old-in-content file and...

 seems showing that the standby received negative ack for the request
 for 20th segment, and 5 seconds later, it tried to get that from
 archive and somehow thought that it'd gotten but the header is of 6th
 segment.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing start-up archive recovery at Standby mode in PG9.2.4

2013-04-24 Thread KONDO Mitsumasa

Hi,


I find problem about failing start-up achive recovery at Standby mode in 
PG9.2.4 streaming replication.
I test same problem in PG9.2.3. But it is not occerd...



cp: cannot stat `../arc/00030013': そのようなファイルやディレクトリはありません
[Standby] 2013-04-22 01:27:25 EDTLOG:  0: restored log file 
00020013 from archive


I can't read the error message here, but this looks suspicious. The
recovery command seems to be returning success although it prints such
an error message?

Sorry.. This suspicious massage is written in japanese.
It means missing file or directory. Beacause starting point of timeline 3 WAL 
file is 0003002A.
 

cp: cannot stat `../arc/00030014': そのようなファイルやディレクトリはありません

[* This point is illegal work] [Standby] 2013-04-22 01:27:25 EDTLOG:  0: 
unexpected pageaddr 0/600 in log file 0, segment 20, offset 0

[Standby] 2013-04-22 01:27:25 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
cp: cannot stat `../arc/00030014': そのようなファイルやディレクトリはありません
cp: cannot stat `../arc/0004.history': そのようなファイルやディレクトリはありません
[Master] 2013-04-22 01:27:25 EDTDEBUG:  0: received replication command: 
IDENTIFY_SYSTEM
[Master] 2013-04-22 01:27:25 EDTLOCATION:  HandleReplicationCommand, 
walsender.c:449
[Master] 2013-04-22 01:27:25 EDTDEBUG:  0: received replication command: 
START_REPLICATION 0/1400
[Master] 2013-04-22 01:27:25 EDTLOCATION:  HandleReplicationCommand, 
walsender.c:449
[Master] 2013-04-22 01:27:25 EDTFATAL:  58P01: requested WAL segment 
00030014 has already been removed
[Master] 2013-04-22 01:27:25 EDTLOCATION:  XLogRead, walsender.c:1023
[Standby] 2013-04-22 01:27:25 EDTLOG:  0: streaming replication 
successfully connected to primary
[Standby] 2013-04-22 01:27:25 EDTLOCATION:  libpqrcv_connect, 
libpqwalreceiver.c:171
[Standby] 2013-04-22 01:27:25 EDTFATAL:  XX000: could not receive data from WAL 
stream: FATAL:  requested WAL segment 00030014 has already been 
removed

[Standby] 2013-04-22 01:27:25 EDTLOCATION:  libpqrcv_receive, 
libpqwalreceiver.c:389

[mitsu-ko@localhost postgresql-9.2.4]$ bincp: cannot stat 
`../arc/00030014': そのようなファイルやディレクトリはありません
[Standby] 2013-04-22 01:27:30 EDTDEBUG:  0: unexpected pageaddr 0/600 
in log file 0, segment 20, offset 0
[Standby] 2013-04-22 01:27:30 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
cp: cannot stat `../arc/00030014': そのようなファイルやディレクトリはありません
cp: cannot stat `../arc/0004.history': そのようなファイルやディレクトリはありません
[Master] 2013-04-22 01:27:30 EDTDEBUG:  0: received replication command: 
IDENTIFY_SYSTEM
[Master] 2013-04-22 01:27:30 EDTLOCATION:  HandleReplicationCommand, 
walsender.c:449
[Master] 2013-04-22 01:27:30 EDTDEBUG:  0: received replication command: 
START_REPLICATION 0/1400
[Master] 2013-04-22 01:27:30 EDTLOCATION:  HandleReplicationCommand, 
walsender.c:449
[Master] 2013-04-22 01:27:30 EDTFATAL:  58P01: requested WAL segment 
00030014 has already been removed
[Master] 2013-04-22 01:27:30 EDTLOCATION:  XLogRead, walsender.c:1023
[Standby] 2013-04-22 01:27:30 EDTLOG:  0: streaming replication 
successfully connected to primary
[Standby] 2013-04-22 01:27:30 EDTLOCATION:  libpqrcv_connect, 
libpqwalreceiver.c:171
[Standby] 2013-04-22 01:27:30 EDTFATAL:  XX000: could not receive data from WAL 
stream: FATAL:  requested WAL segment 00030014 has already been 
removed


And this seems to be the second problem. You probably need to configure
a higher wal_keep_segments on the primary or you need to fix your
recovery_command.

In this situation, I try to up timelineID by archive recovery, not also recover 
WAL files.
If achive recovery works correctly in Standby, recovery done point is 
00020029 in timelineID 2.
Please see detail in pg9.2.4.log. Situation of archive file is here.
 

We try to fix this problem with my company's colleaguea,
but it is too difficult and complicated to WAL contorol and validation method...

This problem is occerd in xlog.c:10692.

10692 if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
10693 goto next_record_is_invalid;


And problem is occerd this checking in ValidXLOGHeader().

  4390 if (!XLByteEQ(hdr-xlp_pageaddr, recaddr))
  4391 {
  4392 ereport(emode_for_corrupt_record(emode, recaddr),
  4393 (errmsg(unexpected pageaddr %X/%X in log 
file %u, segment %u, offset %u,
  4394 hdr-xlp_pageaddr.xlogid, 
hdr-xlp_pageaddr.xrecoff,
  4395 readId, readSeg, 
readOff)));
  4396 return false;
  4397 }


We think hdr-xlp_pageaddr.xrecoff has wrong value. This is very rare cace!
We cannot analyze why this value is uncorrect!


I think the tests is just noticing that we don't have valid WAL - we are