Whilst looking around for stuff that could be deleted thanks to removing old-style VACUUM FULL, I came across some code in btree that seems rather seriously buggy.

For reasons explained in nbtree/README, we can't physically recycle a "deleted" btree index page until all transactions open at the time of deletion are gone --- otherwise we might re-use a page that an existing scan is about to land on, and confuse that scan. (This condition is overly strong, of course, but it's what's designed in at the moment.) The way this is implemented is to label a freshly-deleted page with the current value of ReadNewTransactionId(). Once that value is older than RecentXmin, the page is presumed recyclable.
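For reference, the logic in question amounts to roughly this (paraphrased from memory of nbtpage.c, not the exact source):

    /* _bt_pagedel: stamp the page at deletion time */
    opaque->btpo.xact = ReadNewTransactionId();

    /* _bt_page_recyclable: presume it safe once that XID is old enough */
    if (P_ISDELETED(opaque) &&
        TransactionIdPrecedes(opaque->btpo.xact, RecentXmin))
        return true;    /* assume no xact open at deletion time remains */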
I think this was all right when it was designed, but isn't it rather badly broken by our subsequent changes to have transactions not take out an XID until/unless they write something? A read-only transaction could easily be much older than RecentXmin, no?

The odds of an actual problem seem not very high, since to be affected a scan would have to be already "in flight" to the problem page when the deletion occurs. By the time RecentXmin advances and we feed the page to the FSM and get it back, the scan is almost surely going to have arrived. And I think the logic is such that the page can't actually be recycled before the next VACUUM in any case. Still, it seems pretty bogus.

Another issue is that it's not clear what happens in a Hot Standby slave --- it doesn't look like Simon put any interlocking in this area to protect slave queries against having the page disappear from under them. The odds of an actual problem are probably a good bit higher in an HS slave.

And there's another problem: _bt_pagedel is designed to recurse in certain improbable cases, but I think that is flat out wrong when doing WAL replay --- if the original process did recurse, then it will have emitted a WAL record for each deleted page, meaning replay would try to delete each of those pages twice.

That last problem is easy to fix (a sketch follows below), but I'm not at all sure what to do about the scan interlock problem. Thoughts?
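For the recursion issue, the obvious sort of guard would be something like this --- treat it as pseudocode against _bt_pagedel's current structure, with the variable names recalled from memory, not as a tested patch:

    /* at the point where _bt_pagedel would recurse to remove the parent */
    if (parent_half_dead)
    {
        if (InRecovery)
        {
            /*
             * In WAL replay, the original deletion already emitted its
             * own record for the parent page; recursing here would
             * delete that page a second time.  Just release the buffer
             * and let replay of the parent's own record do the work.
             */
            _bt_relbuf(rel, pbuf);
        }
        else
            result = _bt_pagedel(rel, pbuf, stack->bts_parent) + 1;
    }

			regards, tom lane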