On Thu, Dec 13, 2018 at 5:42 PM Andres Freund <and...@anarazel.de> wrote:
> This leads me to think that we possibly should move computation of the
> last removed xid from recovery to the primary, during the generation of
> the xl_btree_delete WAL record.

For the record, I share your sense that this isn't a great design for
microvacuuming. It would be better if it happened on the primary,
where we generally have a lot more control/context. Of course, the
devil is in the details.

> Talking with Peter Geoghegan we were pondering a few ideas to address
> that:
> - we could stop doing page level deletion after the tuples on the first
>   heap page, if that frees sufficient space on the index page, to stave
>   of a page split for longer.  That has the downside that it's possible
>   that we'd end up WAL logging more, because we'd repeatedly emit
>   xl_btree_delete records.
> - We could use the visits to the heap pages to be *more* aggressive
>   about deleting the heap page. If there's some tuples on a heap page
>   that are deletable, and we know about that due to the kill_prior_tuple
>   logic, it's fairly likely that other pointers to the same page
>   actually are also dead: We could re-check that for all index entries
>   pointing to pages that we'd visit anyway.

Right. As we discussed, B-Tree is concerned solely with delaying and
probably even preventing a page split in its _bt_findsplitloc() path.
It's far from obvious that an eventual page split is inevitable, since
of course the range of values that belong on the leaf page are always
constrained. We can probably afford to be lazy about the things we
actually kill, since we're operating on the primary at a critical
moment: the point that we need some exact amount of free space to
delay a page split. We can free only as much space as needed,
balancing the need to access heap pages against the immediate need for
free space on the leaf page.

Now, I admit that I haven't thought through the implications for WAL
volume at all. However, it's possible that we might actually manage to
*decrease* the WAL volume in many cases by never freeing space that
isn't truly needed. It's not impossible that we'd regularly come out
ahead there, since a split is not inevitable just because we're
running low on free space this instant. And, I suspect that WAL volume
is the only major additional overhead that comes from being more
discerning about how many tuples are actually killed. Eagerly freeing
space to prevent a page split is a high priority for many B-Tree
implementations; the additional fragmentation of the leaf page caused
by a more incremental approach to deletion isn't likely to be a
problem.

Once you establish the principle that you can be lazy here, you also
establish the principle that you can be more eager, finding additional
things to kill that didn't already have their LP_DEAD bits set based
on heuristics (e.g. "I'm visiting the same heap page anyway, might as
well look for extra stuff"). It isn't necessary to actually have any
of that fancy stuff in your initial version, though.

-- 
Peter Geoghegan

Reply via email to