ITAGAKI Takahiro <[EMAIL PROTECTED]> writes:
> I think we can resurrect his idea because we will scan btree pages
> at-a-time now; the missing-restarting-point problem went away.
> Have I missed something? Comments welcome.

I was thinking for awhile just now that this would break the interlock
that guarantees VACUUM can't delete a heap tuple that an indexscanning
process is about to visit.  After further thought, it doesn't, but it's
non-obvious.  I've added the attached commentary to nbtree/README:

On-the-fly deletion of index tuples
-----------------------------------

If a process visits a heap tuple and finds that it's dead and removable
(ie, dead to all open transactions, not only that process), then we can
return to the index and mark the corresponding index entry "known dead",
allowing subsequent index scans to skip visiting the heap tuple.  The
"known dead" marking uses the LP_DELETE bit in ItemIds.  This is currently
only done in plain indexscans, not bitmap scans, because only plain scans
visit the heap and index "in sync" and so there's not a convenient way
to do it for bitmap scans.

Once an index tuple has been marked LP_DELETE it can actually be removed
from the index immediately; since index scans only stop "between" pages,
no scan can lose its place from such a deletion.  We separate the steps
because we allow LP_DELETE to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock.  In the current code we try to remove LP_DELETE tuples when
we are otherwise faced with having to split a page to do an insertion (and
hence have exclusive lock on it already).

This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap.  This is not a problem for the current
implementation of VACUUM, but it could be a problem for anything that
explicitly tries to find index entries for dead tuples.  (However, the
same situation is created by REINDEX, since it doesn't enter dead tuples
into the index.)

It's sufficient to have an exclusive lock on the index page, not a
super-exclusive lock, to do deletion of LP_DELETE items.  It might seem
that this breaks the interlock between VACUUM and indexscans, but that is
not so: as long as an indexscanning process has a pin on the page where
the index item used to be, VACUUM cannot complete its btbulkdelete scan
and so cannot remove the heap tuple.  This is another reason why
btbulkdelete has to get super-exclusive lock on every leaf page, not only
the ones where it actually sees items to delete.

			regards, tom lane
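
To make the two-step protocol above concrete, here is a minimal,
self-contained C sketch of the idea: a scan only sets a "known dead"
flag (analogous to LP_DELETE) under a share lock, and the insertion
path, which already holds exclusive lock because it would otherwise
have to split the page, physically reclaims the flagged slots.  The
type and function names below (FakeItemId, FLAG_DEAD, and so on) are
invented for illustration and are not the actual nbtree data structures.

/*
 * Sketch only -- not PostgreSQL source.  Models the "mark dead under
 * share lock, remove under exclusive lock" protocol described above.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_ITEMS 8
#define FLAG_DEAD 0x01          /* plays the role of LP_DELETE */

typedef struct
{
    int     heap_tid;           /* stand-in for the heap tuple pointer */
    int     flags;
    bool    in_use;
} FakeItemId;

typedef struct
{
    FakeItemId items[MAX_ITEMS];
} FakeIndexPage;

/*
 * Step 1: an index scan that found the heap tuple dead to all open
 * transactions marks the entry.  Like a heap hint bit, this only sets
 * a flag, so a share lock on the page is enough.
 */
static void
mark_known_dead(FakeIndexPage *page, int slot)
{
    page->items[slot].flags |= FLAG_DEAD;
}

/*
 * Step 2: the insertion path, already holding exclusive lock because it
 * is about to split the page, physically removes the marked items to
 * make room.  Since index scans only stop between pages, removing items
 * here cannot make a concurrent scan lose its place.
 */
static int
remove_known_dead(FakeIndexPage *page)
{
    int     removed = 0;

    for (int i = 0; i < MAX_ITEMS; i++)
    {
        if (page->items[i].in_use && (page->items[i].flags & FLAG_DEAD))
        {
            page->items[i].in_use = false;
            removed++;
        }
    }
    return removed;
}

int
main(void)
{
    FakeIndexPage page = {{{101, 0, true}, {102, 0, true}, {103, 0, true}}};

    mark_known_dead(&page, 1);      /* scan found heap tuple 102 dead */
    printf("freed %d slot(s)\n", remove_known_dead(&page));
    return 0;
}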