On Sat, Feb 19, 2022 at 4:22 PM Peter Geoghegan <p...@bowt.ie> wrote:
> This very much looks like a bug in pg_surgery itself now -- attached
> is a draft fix.

Wait, that's not it either. I jumped the gun -- the draft fix isn't
sufficient (though the patch I posted might not be a bad idea anyway).

Looks like pg_surgery isn't processing HOT chains as whole units, which
it really should (at least in the context of killing items via the
heap_force_kill() function). Killing a root item in a HOT chain is just
hazardous -- disconnected/orphaned heap-only tuples are liable to cause
chaos, and should be avoided everywhere (including during pruning, and
within pg_surgery).

It's likely that the hardening I already planned on adding to pruning
[1] (as follow-up work to recent bugfix commit 18b87b201f) will prevent
lazy_scan_prune from getting stuck like this, whatever the cause
happens to be. The actual page image I see lazy_scan_prune choke on
(i.e. exhibit the same infinite loop unpleasantness we've seen before
on) is not in a consistent state at all: all of its tuples belong to a
single HOT chain, and that HOT chain is totally inconsistent on account
of having an LP_DEAD line pointer as its root item.

pg_surgery could in principle do the right thing here by always
treating HOT chains as whole units. Leaving behind
disconnected/orphaned heap-only tuples is pretty much pointless anyway,
since they'll never be accessible by index scans. That holds even after
a REINDEX, since there is no root item on the heap page for an index
entry to point to. (A dump and restore might work better, though.)

[1] https://postgr.es/m/cah2-wzmnk6v6tqzuuabxoxm8hjrawu6h12toas-bqycliht...@mail.gmail.com
--
Peter Geoghegan
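
To make the "whole units" point concrete, here is a toy Python model of a heap page holding one HOT chain. It is only a sketch and not PostgreSQL code: the Item class and the helpers make_page, chain_members, find_root, kill_item_only, kill_whole_chain and orphans are all hypothetical names invented for illustration, and the model ignores LP_REDIRECT root items, xmin/xmax matching, and everything else real pruning has to check. It assumes that marking a line pointer dead discards the tuple's link to its successor.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

LP_NORMAL, LP_DEAD = "LP_NORMAL", "LP_DEAD"


@dataclass
class Item:
    """Toy line pointer plus tuple header (hypothetical, heavily simplified)."""
    state: str = LP_NORMAL
    heap_only: bool = False          # True for non-root members of a HOT chain
    next_off: Optional[int] = None   # offset of the next chain member, if any


Page = Dict[int, Item]


def make_page() -> Page:
    """One HOT chain: root item 1 -> heap-only 2 -> heap-only 3."""
    return {
        1: Item(next_off=2),
        2: Item(heap_only=True, next_off=3),
        3: Item(heap_only=True),
    }


def chain_members(page: Page, start: int) -> List[int]:
    """Follow next_off links from start, guarding against cycles."""
    members, seen, off = [], set(), start
    while off is not None and off in page and off not in seen:
        seen.add(off)
        members.append(off)
        off = page[off].next_off
    return members


def find_root(page: Page, off: int) -> int:
    """Return the non-heap-only item whose chain contains off."""
    for cand, item in page.items():
        if not item.heap_only and off in chain_members(page, cand):
            return cand
    raise ValueError(f"item {off} has no root item on this page")


def kill_item_only(page: Page, off: int) -> None:
    """Kill a single item, ignoring the chain it belongs to (the hazard)."""
    # Assumption: a dead line pointer keeps no tuple, so the link is lost.
    page[off] = Item(state=LP_DEAD)


def kill_whole_chain(page: Page, off: int) -> List[int]:
    """Kill every member of the HOT chain that contains off."""
    killed = chain_members(page, find_root(page, off))
    for member in killed:
        page[member] = Item(state=LP_DEAD)
    return killed


def orphans(page: Page) -> List[int]:
    """Live heap-only tuples that no live root item can reach."""
    reachable = set()
    for off, item in page.items():
        if item.state == LP_NORMAL and not item.heap_only:
            reachable.update(chain_members(page, off))
    return sorted(
        off for off, item in page.items()
        if item.state == LP_NORMAL and item.heap_only and off not in reachable
    )
```

In this model, calling kill_item_only(page, 1) on a fresh page leaves items 2 and 3 live but unreachable from any root, which is the kind of orphaned state described above. Calling kill_whole_chain(page, 2) instead resolves item 2 to its root and kills all three members, so nothing is left dangling.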