The comments for bufmgr.c's BufferSync routine point out that it's a bad thing for some other backend to be modifying a page while it is written out. The following text has gone unchanged since Postgres95: * Also, we need to be sure that no other transaction is * modifying the page as we flush it. This is only a problem for objects * that use a non-two-phase locking protocol, like btree indices. For * those objects, we would like to set a write lock for the duration of * our IO. Another possibility is to code updates to btree pages * carefully, so that writing them out out of order cannot cause * any unrecoverable errors. * * I don't want to think hard about this right now, so I will try * to come back to it later. Unfortunately, the comment is wrong about this being a problem only for indexes. It's possible for an invalid state of a heap page to be written out, as well. PageAddItem() sets the item pointer for an added tuple before it copies the tuple onto the page, so if it is recycling an existing item pointer slot, there is a state where a valid item pointer is pointing at a tuple that's wholly or partly not valid. This doesn't matter as far as active backends are concerned because we should be holding BUFFER_LOCK_EXCLUSIVE on the page while modifying it. But some other backend could be in process of writing out the page (if it had previously dirtied the page), and so it's possible for this invalid state to reach disk. If the database is shut down before the new update of the page can be written out, then we have a problem. Normally, the new page state will be written out at transaction commit, but what happens if the current transaction aborts? In that case, the dirty page just sits in shared memory. It will get written the next time a transaction modifies the page (and commits), or when some backend decides to recycle the buffer to hold another page. But if the postmaster gets shut down before that happens, we lose; the dirty page is never written at all, and when it's re-read after database restart, the corrupted page state becomes visible. The window of vulnerability is considerably wider in 7.0 than in prior releases, because in prior releases *any* transaction commit will write all dirty pages. In 7.0 the dirtied page will not get written out until we commit a transaction that modified that particular page (or decide to recycle the buffer). The odds of seeing a problem are still pretty small, but the risk is definitely there. I believe the correct fix for this problem is for bufmgr.c to grab a read lock (BUFFER_LOCK_SHARED) on any page that it is writing out. A read lock is sufficient since there's no need to prevent other backends from reading the page, we just need to prevent them from changing it during the I/O. Comments anyone? regards, tom lane