On Tue, Jun 10, 2025 at 3:00 PM Peter Geoghegan <p...@bowt.ie> wrote: > I have confirmed that this flavor of the problem has existed for a > long time. We'll need to backpatch the fix to all supported branches.
> Current plan: Commit 0001 on all supported branches in the next couple > of days. Unless I hear objections. Pushed 0001 earlier today, backpatching to all supported releases. Thanks for the report, Alexander! I should acknowledge that, on further reflection, I have significant doubts about whether it's actually possible for the bug to cause data corruption on earlier releases (by incorrectly setting LP_DEAD bits). The reasons for my doubts are complicated (much more complicated than the fix itself). I think that backpatching was still the right call -- I just don't want to be needlessly alarmist here. When I ran Alexander's test case on an earlier version of Postgres (anything before my recent commit e6eed40e), I did indeed observe that there were 2 _bt_killitems calls, the first of which behaved like what we now call a !so->dropPin scan, and the second of which (against the same page/scan position following restore of a mark) behaved like a so->dropPin scan. This was obviously never intended by the 2015 commit where the problem originated (commit 2ed5b87f), and seems wrong on its face. We're not doing what we're supposed to be doing during the second _bt_killitems call, which I imagined opened up a window for data corruption. However, it has since occured to me that the first _bt_killitems is only able to spuriously set so->curPoss.buf and confuse its _bt_steppage *when it succeeded* in setting at least one LP_DEAD bit. And in order to succeed, it will have to have correctly checked so->currPos.lsn against the page's LSN at that time first. The LSN can't have changed at that point -- if it did then there'd be no pin set in so->currPos.buf (just because that's how that aspect was handled), and so no confusion, and so no second confused call to _bt_killitems later on. The pin held between the first and second _bt_killitems calls *is* sufficient as an interlock against VACUUM, I think, since we *also* checked the LSN during the first _bt_killitems. As Tom would say, it accidentally failed to fail. There is no reason to allow it, especially given that it can be fully avoided by making dropped-pin calls to _bt_killitems consistently release both their lock and their pin (which, as I touched upon, is how _bt_killitems always did it in cases where the page LSN changed since _bt_readpage was called). > I will wait until Postgres 18 branches from master before committing > 0002. It's important refactoring work, but on reflection I find it > hard to justify not just waiting. I will start a new, dedicated thread for this. -- Peter Geoghegan