On Thu, Dec 13, 2018 at 5:42 PM Andres Freund <and...@anarazel.de> wrote:
> This leads me to think that we possibly should move computation of the
> last removed xid from recovery to the primary, during the generation of
> the xl_btree_delete WAL record.
For the record, I share your sense that this isn't a great design for
microvacuuming. It would be better if it happened on the primary, where we
generally have a lot more control/context. Of course, the devil is in the
details.

> Talking with Peter Geoghegan we were pondering a few ideas to address
> that:
> - we could stop doing page level deletion after the tuples on the first
>   heap page, if that frees sufficient space on the index page, to stave
>   off a page split for longer. That has the downside that it's possible
>   that we'd end up WAL logging more, because we'd repeatedly emit
>   xl_btree_delete records.
> - We could use the visits to the heap pages to be *more* aggressive
>   about deleting the heap page. If there's some tuples on a heap page
>   that are deletable, and we know about that due to the kill_prior_tuple
>   logic, it's fairly likely that other pointers to the same page
>   actually are also dead: We could re-check that for all index entries
>   pointing to pages that we'd visit anyway.

Right. As we discussed, B-Tree is concerned solely with delaying, and
perhaps even preventing, a page split in its _bt_findsplitloc() path. It's
far from obvious that an eventual page split is inevitable, since of course
the range of values that belong on the leaf page is always constrained. We
can probably afford to be lazy about the things we actually kill, since
we're operating on the primary at a critical moment: the point where we
need some exact amount of free space to delay a page split. We can free
only as much space as is needed, balancing the cost of accessing heap pages
against the immediate need for free space on the leaf page.

Now, I admit that I haven't thought through the implications for WAL volume
at all. However, it's possible that we might actually manage to *decrease*
the WAL volume in many cases by never freeing space that isn't truly
needed.
It's not impossible that we'd regularly come out ahead there, since a split
is not inevitable just because we're running low on free space at this
instant. And I suspect that WAL volume is the only major additional
overhead that comes from being more discerning about how many tuples are
actually killed. Eagerly freeing space to prevent a page split is a high
priority for many B-Tree implementations; the additional fragmentation of
the leaf page caused by a more incremental approach to deletion isn't
likely to be a problem.

Once you establish the principle that you can be lazy here, you also
establish the principle that you can be more eager, finding additional
things to kill that didn't already have their LP_DEAD bits set, based on
heuristics (e.g. "I'm visiting the same heap page anyway, might as well
look for extra stuff"). It isn't necessary to have any of that fancy stuff
in an initial version, though.

--
Peter Geoghegan