So I think I've come up with a scenario that could cause this. I don't think it's exactly what happened here but maybe something analogous happened with our base backup restore.
On the primary you extend a table a bunch, including adding new segments, but crash before committing (or checkpointing). Then some of the blocks, but not all, may have been written to disk. Assume they were all written except for the last block of the first file. So what you have is a 0.999G file followed by, say, nine 1G files. (Or maybe the hot backup process could just catch the files in this state if a table is growing rapidly and it doesn't take care to avoid picking up new files that appear after it starts?)

smgrnblocks() stops at the first segment shorter than 1GB and ignores the rest. This code in xlog uses it to calculate how many blocks to add, but it only calls it once and then doesn't recheck where it is as it extends the relation. As soon as it adds that one missing block, the remaining files become visible. P_NEW always recalculates the position based on smgrnblocks() each time (which sounds pretty inefficient but anyways....) so it will add the rest of the requested blocks at the new end.

Now this isn't enough to explain things, since surely the extension records would be in the xlog in physical order. But this could all have happened after an earlier vacuum truncated the relation, and we could be replaying records that predate that truncation.

So in short, if you have a 10G table and want to overwrite the last block but the first segment is one block short, then xlog will add 9G to the end and write the block there. That sounds like what we've seen.

I think the easy fix is to change the code in xlogutils to be more defensive and stop as soon as it finds BufferGetBlockNumber(buffer) == blkno (which is what it has in the assert already).

--
greg

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
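
PS: to convince myself the arithmetic in the scenario above works out, here is a toy model in Python. It is only a sketch and not PostgreSQL code: segments hold 4 blocks instead of 1GB, and nblocks(), extend_one(), replay_naive() and replay_recheck() are names made up for the illustration. They stand in for smgrnblocks(), a P_NEW read, the replay loop as I described it, and one possible "more defensive" loop that rechecks the length on every pass. That last one is an assumption about what such a fix could look like, not a patch.

```python
SEG_BLOCKS = 4  # toy segment size in blocks; stands in for a 1GB segment


def nblocks(segments):
    """Toy smgrnblocks(): count blocks, stopping at the first short segment."""
    total = 0
    for seg in segments:
        total += seg
        if seg < SEG_BLOCKS:
            break  # segments after a short one are never looked at
    return total


def extend_one(segments):
    """Toy P_NEW: add one block at the end as nblocks() sees it right now."""
    blkno = nblocks(segments)
    segno = blkno // SEG_BLOCKS
    if segno == len(segments):
        segments.append(0)  # start a new segment file
    segments[segno] += 1
    return blkno


def replay_naive(segments, blkno):
    """Compute the gap once, then extend that many times without rechecking."""
    lastblock = nblocks(segments)
    got = blkno
    while blkno >= lastblock:
        got = extend_one(segments)
        lastblock += 1
    return got  # block number the last extension actually produced


def replay_recheck(segments, blkno):
    """Hypothetical defensive variant: recheck the length before each extension."""
    while blkno >= nblocks(segments):
        extend_one(segments)
    return blkno


if __name__ == "__main__":
    target = 10 * SEG_BLOCKS - 1  # last block of a 10-segment table

    segs = [SEG_BLOCKS - 1] + [SEG_BLOCKS] * 9  # first segment one block short
    print("naive:", replay_naive(segs, target), "segments:", len(segs))

    segs = [SEG_BLOCKS - 1] + [SEG_BLOCKS] * 9
    print("recheck:", replay_recheck(segs, target), "segments:", len(segs))
```

In the toy, the naive loop fills the one-block hole and then keeps appending at the new end, so the table grows by nine extra segments and the block it hands back is not the one that was asked for. The rechecking loop stops after filling the hole.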