I'd like to discuss how we should've implemented the infamous 9.3
multixid/row-locking stuff, and perhaps still should in 9.6. Hindsight
is always 20/20 - I'll readily admit that I didn't understand the
problems until well after the release - so this isn't meant to bash
what's been done. Rather, let's think of the future.
The main problem with the infamous multixid changes was that they made
pg_multixact a permanent, critical piece of data. Without it, you
cannot decipher whether some rows have been deleted or not. The 9.3
changes uncovered pre-existing issues with vacuuming and wraparound, but
the fact that multixids are now critical turned those otherwise
relatively harmless bugs into data loss.
We have pg_clog, which is a similarly critical data structure. That's a
pain too - you need VACUUM, and you can't easily move tables from one
cluster to another, for example - but we've learned to live with it. But
we certainly don't need any more such data structures.
So the lesson here is that having a permanent pg_multixact is not nice,
and we should get rid of it. Here's how to do that:
Looking at the tuple header, the CID and CTID fields are only needed
when either xmin or xmax is running. Almost: in a HOT-updated tuple,
CTID is required even after xmax has committed, but since it's a HOT
update, the new tuple is always on the same page, so you only need the
OffsetNumber part. That leaves us with 8 bytes that are always available
for storing "ephemeral" information. By ephemeral, I mean that it is
only needed while xmin or xmax is in progress. After that, e.g. after a
shutdown, it's never looked at.
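
Just to spell out the byte-counting, here's a rough sketch (made-up
names; the real HeapTupleHeaderData is laid out differently):

    #include <stdint.h>

    typedef uint32_t TransactionId;
    typedef uint32_t CommandId;
    typedef uint32_t BlockNumber;
    typedef uint16_t OffsetNumber;

    typedef struct EphemeralBytesSketch
    {
        TransactionId t_xmin;       /* inserting XID */
        TransactionId t_xmax;       /* deleting/locking XID */

        /*
         * Ephemeral: only meaningful while xmin or xmax is in progress.
         * CID (4 bytes) + the block-number half of CTID (4 bytes) = 8 bytes.
         */
        CommandId     t_cid;        /* cmin/cmax */
        BlockNumber   ctid_blk;     /* block part of forward CTID */

        /* Must survive for HOT chains, so not reusable. */
        OffsetNumber  ctid_off;     /* offset part of forward CTID */
    } EphemeralBytesSketch;
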
Let's add a new SLRU, called Tuple Ephemeral Data (TED). It is addressed
by a 64-bit pointer, which means that it never wraps around. That 64-bit
pointer is stored in the tuple header, in those 8 ephemeral bytes
currently used for CID and CTID. Whenever a tuple is deleted/updated and
locked at the same time, a TED entry is created for it, in the new SLRU,
and the pointer to the entry is put on the tuple. In the TED entry, we
can use as many bytes as we need to store the ephemeral data. It would
include the CID (or possibly both CMIN and CMAX separately, now that we
have the space), CTID, and the locking XIDs. The list of locking XIDs
could be stored there directly, replacing multixids completely, or we
could store a multixid there, and use the current pg_multixact system to
decode them. Or we could store the multixact offset in the TED,
replacing the multixact offset SLRU, but keep the multixact member SLRU
as is.
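
In terms of the sketch above, the 64-bit TED pointer would simply
overlay the t_cid and ctid_blk fields. A TED entry itself could look
something like this, taking the "store the lockers directly" option
(again, all names are made up):

    typedef struct TedEntrySketch
    {
        CommandId     cmin;         /* inserting command id */
        CommandId     cmax;         /* deleting/updating command id */
        BlockNumber   ctid_blk;     /* forward CTID, block part */
        OffsetNumber  ctid_off;     /* forward CTID, offset part */
        uint16_t      nlockers;     /* number of locking XIDs that follow */
        TransactionId lockers[];    /* the locking XIDs themselves */
    } TedEntrySketch;

Entries would be variable-length, so the 64-bit pointer would presumably
be more like a byte position in the TED than a fixed-size slot number,
but that doesn't seem like a problem.
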
The XMAX stored on the tuple header would always be a real transaction
ID, not a multixid. Hence locked-only tuples don't need to be frozen
afterwards.
The beauty of this would be that the TED entries can be zapped at
restart, just like pg_subtrans, and pg_multixact before 9.3. It doesn't
need to be WAL-logged, and we are free to change its on-disk layout even
in a minor release.
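
Just to make the point concrete, the startup-time handling could be as
dumb as this toy (the real thing would of course go through the SLRU
machinery, along the lines of StartupSUBTRANS(); "pg_ted" is a made-up
directory name):

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /*
     * Toy illustration only: at startup, everything under a hypothetical
     * pg_ted directory can simply be thrown away, because no transaction
     * from before the restart can still be in progress.
     */
    void
    zap_ted_at_startup(const char *datadir)
    {
        char        dirpath[1024];
        DIR        *dir;
        struct dirent *de;

        snprintf(dirpath, sizeof(dirpath), "%s/pg_ted", datadir);
        dir = opendir(dirpath);
        if (dir == NULL)
            return;             /* nothing to zap */

        while ((de = readdir(dir)) != NULL)
        {
            char        filepath[2048];

            if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
                continue;
            snprintf(filepath, sizeof(filepath), "%s/%s", dirpath, de->d_name);
            unlink(filepath);
        }
        closedir(dir);
    }
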
Further optimizations are possible. If the TED entry fits in 8 bytes, it
can be stored directly in the tuple header. Like today, if a tuple is
locked but not deleted/updated, you only need the locker XID, and you
can store it directly on the tuple. Or if it's deleted and locked, CTID
is not needed, only CID and locker XID, so you can store those directly
on the tuple. Plus some spare bits to indicate what is stored. And if
XMIN is older than the global xmin, you could also steal the XMIN field
for storing TED data, making it possible to store 12 bytes directly in
the tuple header. Plus some spare bits, again, to indicate that you've
done that.
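
The spare bits could be infomask-style flags saying how to interpret the
in-header bytes, something like this (invented names and values, and I
haven't checked how many free infomask bits we actually have left):

    #define TED_EPH_POINTER     0x0001  /* a 64-bit TED pointer */
    #define TED_EPH_LOCKER_XID  0x0002  /* locked only: the locker XID,
                                         * stored directly */
    #define TED_EPH_CID_LOCKER  0x0004  /* deleted and locked, CTID not
                                         * needed: CID + locker XID stored
                                         * directly */
    #define TED_EPH_XMIN_REUSED 0x0008  /* xmin older than global xmin, so
                                         * its 4 bytes hold TED data too
                                         * (12 inline bytes in total) */
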
Now, given where we are, how do we get there? Upgrade is a pain, because
even if we no longer generate any new multixids, we'll have to be able
to decode them after pg_upgrade. Perhaps we could condense pg_multixact
into a simpler pg_clog-style bitmap at pg_upgrade, to make it small and
simple to read, but it would nevertheless be a fair amount of code just
to deal with pg_upgraded databases.
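
For example, two bits per old multixid might be enough: once every
pre-upgrade transaction has finished, all we still need to know about an
old multixid is whether it contained an update, and whether that update
committed. Something along these lines (made-up names, reusing the
<stdint.h> types from the earlier sketches, and I haven't thought hard
about the edge cases):

    /* Possible 2-bit statuses for a condensed pre-upgrade multixid. */
    #define CONDENSED_MXID_LOCK_ONLY        0   /* no update member */
    #define CONDENSED_MXID_UPDATE_COMMITTED 1   /* update member committed */
    #define CONDENSED_MXID_UPDATE_ABORTED   2   /* update member aborted */

    #define CONDENSED_MXIDS_PER_BYTE 4

    /* Look up the condensed status of multixid 'mxid' in the bitmap. */
    static inline int
    condensed_mxid_status(const uint8_t *bitmap, uint32_t mxid)
    {
        uint8_t     b = bitmap[mxid / CONDENSED_MXIDS_PER_BYTE];
        int         shift = (mxid % CONDENSED_MXIDS_PER_BYTE) * 2;

        return (b >> shift) & 0x03;
    }
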
I think this is worth doing, even after we've fixed all the acute
multixid bugs, because it would be more robust in the long run. It would
also remove the need for anti-wraparound multixid vacuums, along with
the newly-added tuning knobs related to them.
- Heikki