On 26/11/2025 17:50, Heikki Linnakangas wrote:
On 26/11/2025 17:23, Maxim Orlov wrote:
On Tue, 25 Nov 2025 at 13:07, Heikki Linnakangas <[email protected]
<mailto:[email protected]>> wrote:
GetOldMultiXactIdSingleMember() currently asserts that the offset is
never zero, but it should try to do something sensible in that case
instead of just failing.
Correct me if I'm wrong, but we added the assertion that offsets are
never 0, based on the idea that case #2 will never take place during an
update. If this isn't the case, this assertion could be removed.
The rest of the function appears to work correctly.
I even think that, as an experiment, we could randomly reset some of the
offsets to zero and nothing would happen, except that some data would
be lost.
+1
The most sensible thing we can do is give the user a warning, right?
Something like, "During the update, we encountered some weird offset
that shouldn't have been there, but there's nothing we can do about it,
just take note."
Yep, makes sense.
I read through the SLRU reading codepath, looking for all the things
that could go wrong (not sure I got them all):
1. An SLRU file does not exist
2. An SLRU file is too short, i.e. a page does not exist
3. The offset in 'offsets' page is 0
4. The offset in 'offsets' page looks invalid, i.e. it's greater than
nextOffset or smaller than oldestOffset.
5. The offset is out of order compared to its neighbors
6. The multixid has no members
7. The multixid has an invalid (0) member
8. A multixid has more than one updating member
Some of those situations are theoretically are possible if there was a
crash. We don't follow the WAL-before-data rule for these SLRUs.
Instead, we piggyback on the WAL-before-data of the heap page that would
reference the multixid. In other words, we rely on the fact that if a
multixid write is missed or torn because of a crash, that multixid will
not be referenced from anywhwere and will never be read.
However, that doesn't hold for pg_upgrade. pg_upgrade will try to read
all the multixids. So we need to make the multixact reading code
tolerant of the situations that could be present after a crash. I think
the right philosophy here is that we try to read all the old multixids,
and do our best to interpret them the same way that the old server
would. For those situations that can legitimately be present if the old
server crashed at some point, be silent. For cases that should not
happen, even if there was a crash, print a warning. For example, I think
an SLRU file should never be missing (1) or truncated (2). But the zero
offset (3), and (6) can happen.
Perhaps we should check that all the files exist and have the correct
sizes in the pre-check stage, and abort the upgrade early if anything is
missing. That would be pretty cheap to check.
- Heikki