I'd like to discuss how we should've implemented the infamous 9.3
multixid/row-locking stuff, and perhaps still should in 9.6. Hindsight
is always 20/20 - I'll readily admit that I didn't understand the
problems until well after the release - so this isn't meant to bash
what's been done. Rather, let's think of the future.
The main problem with the infamous multixid changes was that they made
pg_multixact a permanent, critical piece of data. Without it, you
cannot decipher whether some rows have been deleted or not. The 9.3
changes uncovered pre-existing issues with vacuuming and wraparound, but
the fact that multixids are now critical turned those otherwise
relatively harmless bugs into data loss.
We have pg_clog, which is a similarly critical data structure. That's a
pain too - you need VACUUM, and you can't easily move tables from one
cluster to another, for example - but we've learned to live with it. But
we certainly don't need any more such data structures.
So the lesson here is that having a permanent pg_multixact is not nice,
and we should get rid of it. Here's how to do that:
Looking at the tuple header, the CID and CTID fields are only needed
when either xmin or xmax is running. Almost: in a HOT-updated tuple,
CTID is required even after xmax has committed, but since it's a HOT
update, the new tuple is always on the same page, so you only need the
OffsetNumber part. That leaves us with 8 bytes that are always available
for storing "ephemeral" information. By ephemeral, I mean that it is
only needed while xmin or xmax is in progress. After that, e.g. after a
shutdown, it's never looked at.
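
Just to spell out the byte-counting, here's a rough sketch (made-up
names; the real HeapTupleHeaderData is laid out differently):

    #include <stdint.h>

    typedef uint32_t TransactionId;
    typedef uint32_t CommandId;
    typedef uint32_t BlockNumber;
    typedef uint16_t OffsetNumber;

    typedef struct EphemeralBytesSketch
    {
        TransactionId t_xmin;       /* inserting XID */
        TransactionId t_xmax;       /* deleting/locking XID */

        /*
         * Ephemeral: only meaningful while xmin or xmax is in progress.
         * CID (4 bytes) + the block-number half of CTID (4 bytes) = 8 bytes.
         */
        CommandId     t_cid;        /* cmin/cmax */
        BlockNumber   ctid_blk;     /* block part of forward CTID */

        /* Must survive for HOT chains, so not reusable. */
        OffsetNumber  ctid_off;     /* offset part of forward CTID */
    } EphemeralBytesSketch;
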
Let's add a new SLRU, called Tuple Ephemeral Data (TED). It is addressed
by a 64-bit pointer, which means that it never wraps around. That 64-bit
pointer is stored in the tuple header, in those 8 ephemeral bytes
currently used for CID and CTID. Whenever a tuple is deleted/updated and
locked at the same time, a TED entry is created for it, in the new SLRU,
and the pointer to the entry is put on the tuple. In the TED entry, we
can use as many bytes as we need to store the ephemeral data. It would
include the CID (or possibly both CMIN and CMAX separately, now that we
have the space), CTID, and the locking XIDs. The list of locking XIDs
could be stored there directly, replacing multixids completely, or we
could store a multixid there, and use the current pg_multixact system to
decode them. Or we could store the multixact offset in the TED,
replacing the multixact offset SLRU, but keep the multixact member SLRU
as is.
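
In terms of the sketch above, the 64-bit TED pointer would simply
overlay the t_cid and ctid_blk fields. A TED entry itself could look
something like this, taking the "store the lockers directly" option
(again, all names are made up):

    typedef struct TedEntrySketch
    {
        CommandId     cmin;         /* inserting command id */
        CommandId     cmax;         /* deleting/updating command id */
        BlockNumber   ctid_blk;     /* forward CTID, block part */
        OffsetNumber  ctid_off;     /* forward CTID, offset part */
        uint16_t      nlockers;     /* number of locking XIDs that follow */
        TransactionId lockers[];    /* the locking XIDs themselves */
    } TedEntrySketch;

Entries would be variable-length, so the 64-bit pointer would presumably
be more like a byte position in the TED than a fixed-size slot number,
but that doesn't seem like a problem.
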
The XMAX stored on the tuple header would always be a real transaction
ID, not a multixid. Hence locked-only tuples don't need to be frozen
afterwards.
The beauty of this would be that the TED entries can be zapped at
restart, just like pg_subtrans, and pg_multixact before 9.3. It doesn't
need to be WAL-logged, and we are free to change its on-disk layout even
in a minor release.
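
Just to make the point concrete, the startup-time handling could be as
dumb as this toy (the real thing would of course go through the SLRU
machinery, along the lines of StartupSUBTRANS(); "pg_ted" is a made-up
directory name):

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /*
     * Toy illustration only: at startup, everything under a hypothetical
     * pg_ted directory can simply be thrown away, because no transaction
     * from before the restart can still be in progress.
     */
    void
    zap_ted_at_startup(const char *datadir)
    {
        char        dirpath[1024];
        DIR        *dir;
        struct dirent *de;

        snprintf(dirpath, sizeof(dirpath), "%s/pg_ted", datadir);
        dir = opendir(dirpath);
        if (dir == NULL)
            return;             /* nothing to zap */

        while ((de = readdir(dir)) != NULL)
        {
            char        filepath[2048];

            if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
                continue;
            snprintf(filepath, sizeof(filepath), "%s/%s", dirpath, de->d_name);
            unlink(filepath);
        }
        closedir(dir);
    }
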
Further optimizations are possible. If the TED entry fits in 8 bytes, it
can be stored directly in the tuple header. Like today, if a tuple is
locked but not deleted/updated, you only need the locker XID, and you
can store it directly on the tuple. Or if it's deleted and locked, CTID
is not needed, only CID and locker XID, so you can store those directly
on the tuple. Plus some spare bits to indicate what is stored. And if
XMIN is older than the global xmin, you could also steal the XMIN field
for storing TED data, making it possible to store 12 bytes directly in
the tuple header. Plus some spare bits, again, to indicate that you've
done that.
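
The spare bits could be infomask-style flags saying how to interpret the
in-header bytes, something like this (invented names and values, and I
haven't checked how many free infomask bits we actually have left):

    #define TED_EPH_POINTER     0x0001  /* a 64-bit TED pointer */
    #define TED_EPH_LOCKER_XID  0x0002  /* locked only: the locker XID,
                                         * stored directly */
    #define TED_EPH_CID_LOCKER  0x0004  /* deleted and locked, CTID not
                                         * needed: CID + locker XID stored
                                         * directly */
    #define TED_EPH_XMIN_REUSED 0x0008  /* xmin older than global xmin, so
                                         * its 4 bytes hold TED data too
                                         * (12 inline bytes in total) */
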
Now, given where we are, how do we get there? Upgrade is a pain, because
even if we no longer generate any new multixids, we'll have to be able
to decode them after pg_upgrade. Perhaps we could condense pg_multixact
into a simpler pg_clog-style bitmap at pg_upgrade, to make it small and
simple to read, but it would nevertheless be a fair amount of code just
to deal with pg_upgraded databases.
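
For example, two bits per old multixid might be enough: once every
pre-upgrade transaction has finished, all we still need to know about an
old multixid is whether it contained an update, and whether that update
committed. Something along these lines (made-up names, reusing the
<stdint.h> types from the earlier sketches, and I haven't thought hard
about the edge cases):

    /* Possible 2-bit statuses for a condensed pre-upgrade multixid. */
    #define CONDENSED_MXID_LOCK_ONLY        0   /* no update member */
    #define CONDENSED_MXID_UPDATE_COMMITTED 1   /* update member committed */
    #define CONDENSED_MXID_UPDATE_ABORTED   2   /* update member aborted */

    #define CONDENSED_MXIDS_PER_BYTE 4

    /* Look up the condensed status of multixid 'mxid' in the bitmap. */
    static inline int
    condensed_mxid_status(const uint8_t *bitmap, uint32_t mxid)
    {
        uint8_t     b = bitmap[mxid / CONDENSED_MXIDS_PER_BYTE];
        int         shift = (mxid % CONDENSED_MXIDS_PER_BYTE) * 2;

        return (b >> shift) & 0x03;
    }
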
I think this is worth doing, even after we've fixed all the acute
multixid bugs, because it would be more robust in the long run. It would
also remove the need for anti-wraparound multixid vacuums, along with
the newly-added tuning knobs related to them.
- Heikki