I wrote:
> Does that index contain any sensitive data, and if not could I trouble
> you for a copy?  I'm still not clear on the mechanism by which the
> indexes got corrupted like this.

Oh, never mind ... I've sussed it.  nbtxlog.c's forget_matching_split()
assumes it can look into the page that was just updated to get the block
number associated with a non-leaf insertion.  This is OK *only if the page
has exactly its state at the time of the WAL record*.  However,
btree_xlog_insert() is coded to do nothing if the page has an LSN larger
than the WAL record's LSN --- that is, if the page reflects a state *later
than* this insertion.  So if the page is newer than that --- say, there
were some subsequent insertions at earlier positions in the page ---
forget_matching_split() would pick up the wrong downlink and hence fail to
erase the pending split it should have erased.

I believe this bug is only latent whenever full_page_writes = on, because
in that situation the first touch of any index page after a checkpoint
will rewrite the whole page, and so we'll never be looking at an index
page state newer than the WAL record.  That explains why no one has
tripped over it before.

The particular case we are looking at in Panel_pkey seems to require some
additional assumptions to explain the state of the index, but I've got no
doubt this is the core of the problem.

Since we're not going to support full_page_writes = off in 8.1.*, there's
no need for a back-patched fix, but I'll see about making it safer in HEAD.

			regards, tom lane
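
For readers following along, here is a minimal toy model of the failure mode described above. It is only a sketch and not PostgreSQL code: the ToyPage and ToyInsertRecord types and every toy_* function are hypothetical names invented for illustration, and the real page layout and WAL record format are far more involved. It models the two behaviors the message names: redo skips a record when the page LSN is not older than the record, and the downlink lookup reads the page at the record's offset. Taking the child block number from the record itself is shown as one possible safer alternative; that is an assumption, not necessarily the fix that went into HEAD.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_ITEMS 8

/* Hypothetical, heavily simplified non-leaf index page. */
typedef struct {
    uint64_t lsn;                  /* LSN of the last change applied */
    int      nitems;               /* number of downlinks on the page */
    uint32_t downlink[MAX_ITEMS];  /* child block numbers, in page order */
} ToyPage;

/* Hypothetical, simplified WAL record for a non-leaf insertion. */
typedef struct {
    uint64_t lsn;     /* LSN of this record */
    int      offset;  /* position at which the downlink was inserted */
    uint32_t child;   /* child block number carried by the inserted item */
} ToyInsertRecord;

/* Insert a downlink at 'offset', shifting later entries to the right. */
static void toy_page_insert(ToyPage *page, int offset, uint32_t child)
{
    assert(page->nitems < MAX_ITEMS);
    assert(offset >= 0 && offset <= page->nitems);
    memmove(&page->downlink[offset + 1], &page->downlink[offset],
            (size_t) (page->nitems - offset) * sizeof(uint32_t));
    page->downlink[offset] = child;
    page->nitems++;
}

/*
 * Redo an insertion.  As in the behavior described in the message, do
 * nothing when the page already reflects this record or a later state.
 * Returns true if the record was actually applied.
 */
static bool toy_redo_insert(ToyPage *page, const ToyInsertRecord *rec)
{
    if (page->lsn >= rec->lsn)
        return false;
    toy_page_insert(page, rec->offset, rec->child);
    page->lsn = rec->lsn;
    return true;
}

/*
 * Fragile lookup: trusts that the page is in exactly the state produced
 * by 'rec'.  This is wrong if later insertions moved the item.
 */
static uint32_t toy_child_from_page(const ToyPage *page,
                                    const ToyInsertRecord *rec)
{
    return page->downlink[rec->offset];
}

/* Assumed safer variant: take the block number from the record itself. */
static uint32_t toy_child_from_record(const ToyInsertRecord *rec)
{
    return rec->child;
}
```

To exercise the sketch, apply one record at LSN 100 that inserts child 42 at offset 0, then a second at LSN 200 that inserts child 7 at offset 0, and treat the resulting page as what reached disk before the crash. Replaying the first record is then skipped, toy_child_from_page() reports 7 for it, and toy_child_from_record() still reports 42, which is the mismatch that would leave a pending split unerased.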