As we get faster, we wrap the xid counter more and more quickly. At current full speed, we could wrap it every 72 hours.
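For anyone who wants to check the arithmetic behind that figure, here is a trivial stand-alone calc. The implied transaction rate is my own back-of-envelope, assuming roughly 2^31 xids consumed per wrap cycle; it is not a measurement.

/*
 * Back-of-envelope check of the 72-hour wrap figure:
 * one wrap cycle consumes about 2^31 xids.
 */
#include <stdio.h>

int
main(void)
{
    const double xids_per_cycle = 2147483648.0;   /* 2^31 */
    const double cycle_seconds = 72.0 * 3600.0;   /* 72 hours */

    /* prints roughly 8286 xid-consuming transactions per second */
    printf("implied rate: %.0f xacts/sec\n", xids_per_cycle / cycle_seconds);
    return 0;
}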
This is a concern for very large tables, since currently we have to rescan the whole table to freeze it. We could address the rescan itself with a freeze map, and that has my interest for the next release. But avoiding wraparound vacuums will get harder and harder for us, and an approach that is only a minor improvement on existing freezing won't help for long. People often ask directly about 8-byte xids, to which I reply: been discussed, major downsides. So I've been trying to rethink this from first principles to come up with a better idea.

A much better idea is to hold the xmin epoch on the tuple, in addition to the xid, if only there were a good place to hold it. CommandId is 32 bits, but very few transactions need 2 billion commands. So the suggestion is that we optionally store the xid epoch in 16 bits of the cmdid field. We would still allow very large commandIds and combocids, but we optimise the common usage. By doing this, we can reduce the need for wraparound freeze scans from once every 72 hours to once every 538 years, at current transaction rates, in most cases. In much less busy databases, this drops to essentially never.

In detail: whenever we write a tuple with a t_cid of less than 65536, we set a new infomask2 flag, HEAP_HAS_EPOCH, and write the xid epoch (or at least, the epoch modulo 65536) into the remaining 16 bits. A stand-alone sketch of this packing is appended at the end of this mail. This approach is binary compatible with the current storage format (and even though I was happy to break the format if necessary, I don't think that would be helpful).

We hold relnextfreezeepoch in pg_class; it starts at the epoch in which the table is created minus 1 (or 65535 at bootstrap), i.e. the next freeze scan is due 65536 epochs from now. Any time we write a t_cid that is >65535, we reset relnextfreezeepoch for the table to currentepoch+1 using a non-transactional update. Autovacuum looks at relnextfreezeepoch when deciding whether to kick off a wraparound freeze vacuum.

Setting relnextfreezeepoch needs good interlocks to avoid resetting it while a long-running transaction is in progress. Various designs are possible, but I suggest something similar to the vacuum cycleid for the case where a long-running transaction and a wraparound freeze vacuum occur concurrently. That way we only need to take special care when a very long transaction is running.

Very long transactions, that is transactions with >65536 commands, will need freezing within one cycle just as we currently do, but only for the changes made in the later part of the transaction. However, those are typically pg_dump reloads, and we should be able to use COPY FREEZE with those so they get loaded frozen and don't need later freezing at all.

Hang on! Which xid is the epoch for? Well, for xmin, of course. If xmax is set, it's either a lock or a delete, and either way we get to clean things up with regular vacuums, which we would still need to do each cycle. It's only xmin that is annoying, because there isn't much other need to revisit those data blocks. If we don't like that, we could use a few extra bits of the epoch field to say which xid it applies to, but I think that's too much.

This seemed a bit radical when I first thought of it, but it still seems solid now.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
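PS: to make the bit packing concrete, here is the stand-alone sketch promised above. It is not PostgreSQL source: apart from HEAP_HAS_EPOCH and the t_cid/infomask2 fields discussed in the mail, the struct, the helper names and the flag's bit value are mine, chosen only for illustration and untested.

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

#define HEAP_HAS_EPOCH  0x1000      /* bit value chosen arbitrarily for this sketch */

typedef struct
{
    uint32_t    xmin;
    uint32_t    xmax;
    uint32_t    cid_or_epoch;       /* models t_cid: cmdid, optionally plus epoch */
    uint16_t    infomask2;
} TupleHeaderSketch;

/* Store the inserting cid, and the xmin epoch too when the cid is small. */
static void
tuple_set_cmin(TupleHeaderSketch *tup, uint32_t cid, uint32_t xid_epoch)
{
    if (cid < 65536)
    {
        /* common case: pack (epoch mod 65536) into the upper 16 bits */
        tup->cid_or_epoch = cid | ((xid_epoch % 65536) << 16);
        tup->infomask2 |= HEAP_HAS_EPOCH;
    }
    else
    {
        /* rare case: huge command counter, fall back to plain cid */
        tup->cid_or_epoch = cid;
        tup->infomask2 &= (uint16_t) ~HEAP_HAS_EPOCH;
    }
}

/* Recover the (truncated) xmin epoch, if one was stored. */
static bool
tuple_get_xmin_epoch(const TupleHeaderSketch *tup, uint16_t *epoch)
{
    if ((tup->infomask2 & HEAP_HAS_EPOCH) == 0)
        return false;       /* no epoch: must freeze within one cycle, as today */
    *epoch = (uint16_t) (tup->cid_or_epoch >> 16);
    return true;
}

int
main(void)
{
    /* 65536 epochs x 72 hours/epoch ~= 538 years between forced freeze scans */
    TupleHeaderSketch tup = {0, 0, 0, 0};
    uint16_t    epoch;

    tuple_set_cmin(&tup, 7, 3);                 /* common case: small cid */
    if (tuple_get_xmin_epoch(&tup, &epoch))
        printf("cid=%u, stored epoch=%u\n",
               (unsigned) (tup.cid_or_epoch & 0xFFFF), (unsigned) epoch);

    tuple_set_cmin(&tup, 100000, 3);            /* >65535 commands: no epoch */
    printf("epoch stored? %s\n",
           tuple_get_xmin_epoch(&tup, &epoch) ? "yes" : "no");
    return 0;
}

The property the sketch is meant to show is that a tuple without HEAP_HAS_EPOCH simply behaves as it does today, which is what makes the change binary compatible with the current storage format.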