On Tue, Jun 10, 2025 at 3:00 PM Peter Geoghegan <p...@bowt.ie> wrote:
> I have confirmed that this flavor of the problem has existed for a
> long time. We'll need to backpatch the fix to all supported branches.

> Current plan: Commit 0001 on all supported branches in the next couple
> of days. Unless I hear objections.

Pushed 0001 earlier today, backpatching to all supported releases.
Thanks for the report, Alexander!

I should acknowledge that, on further reflection, I have significant
doubts about whether it's actually possible for the bug to cause data
corruption on earlier releases (by incorrectly setting LP_DEAD bits).
The reasons for my doubts are complicated (much more complicated than
the fix itself). I think that backpatching was still the right call --
I just don't want to be needlessly alarmist here.

When I ran Alexander's test case on an earlier version of Postgres
(anything before my recent commit e6eed40e), I did indeed observe that
there were 2 _bt_killitems calls, the first of which behaved like what
we now call a !so->dropPin scan, and the second of which (against the
same page/scan position following restore of a mark) behaved like a
so->dropPin scan. This was obviously never intended by the 2015 commit
where the problem originated (commit 2ed5b87f), and seems wrong on its
face. We're not doing what we're supposed to be doing during the
second _bt_killitems call, which I imagined opened up a window for
data corruption.

However, it has since occured to me that the first _bt_killitems is
only able to spuriously set so->curPoss.buf and confuse its
_bt_steppage *when it succeeded* in setting at least one LP_DEAD bit.
And in order to succeed, it will have to have correctly checked
so->currPos.lsn against the page's LSN at that time first. The LSN
can't have changed at that point -- if it did then there'd be no pin
set in so->currPos.buf (just because that's how that aspect was
handled), and so no confusion, and so no second confused call to
_bt_killitems later on.

The pin held between the first and second _bt_killitems calls *is*
sufficient as an interlock against VACUUM, I think, since we *also*
checked the LSN during the first _bt_killitems.

As Tom would say, it accidentally failed to fail. There is no reason
to allow it, especially given that it can be fully avoided by making
dropped-pin calls to _bt_killitems consistently release both their
lock and their pin (which, as I touched upon, is how _bt_killitems
always did it in cases where the page LSN changed since _bt_readpage
was called).

> I will wait until Postgres 18 branches from master before committing
> 0002. It's important refactoring work, but on reflection I find it
> hard to justify not just waiting.

I will start a new, dedicated thread for this.

-- 
Peter Geoghegan


Reply via email to