On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
> and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
> that would waste a lot of space. The trick is to add a field to the page
> header indicating the 'epoch' of the XID, while keeping the XIDs in the
> tuple header 32 bits wide (*).

Check.
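Just to make sure I'm picturing the reference-point variant from your (*)
footnote below correctly, here's a rough sketch of the decode step; the
struct and function names (PageXidReference, tuple_xid_to_xid64) are made up
for illustration, not actual PostgreSQL code:

#include <stdint.h>

/*
 * Hypothetical sketch only, not the real page layout: each page carries one
 * full 64-bit "reference" XID, and tuple headers keep their normal 32-bit
 * XIDs, which are read back as the 64-bit value closest to the reference
 * point.  That way XIDs on either side of an epoch boundary can coexist on
 * the same page.
 */
typedef struct PageXidReference
{
    uint64_t    ref_xid;    /* 64-bit reference XID in the page header */
} PageXidReference;

static inline uint64_t
tuple_xid_to_xid64(const PageXidReference *ref, uint32_t tuple_xid)
{
    /*
     * Signed 32-bit distance from the reference point; this yields the
     * unique 64-bit XID within 2^31 of ref_xid whose low 32 bits are
     * tuple_xid.  Special XIDs (Bootstrap/Frozen) would need separate
     * handling, which is omitted here.
     */
    int32_t     diff = (int32_t) (tuple_xid - (uint32_t) ref->ref_xid);

    return ref->ref_xid + (int64_t) diff;
}

The working assumption is that all XIDs on a page stay within 2^31 of the
reference point, which is what makes the 32-bit values unambiguous.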
> The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we
> wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we
> could implicitly treat anything older than relfrozenxid as frozen.

Check.

> That's the basic idea. Vacuum freeze only needs to remove dead tuples, but
> doesn't need to dirty pages that contain no dead tuples.

Check.

> Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
> replace the XIDs with FrozenXid whenever the difference between the smallest
> and largest XID on a page exceeds 2^31. But that would only happen when
> you're updating the page, in which case the page is dirtied anyway, so it
> wouldn't cause any extra I/O.

It would cause some extra WAL activity, but it wouldn't dirty the page an
extra time.

> This would also be the first step in allowing the clog to grow larger than 2
> billion transactions, eliminating the need for anti-wraparound freezing
> altogether. You'd still want to truncate the clog eventually, but it would
> be nice to not be pressed against the wall with "run vacuum freeze now, or
> the system will shut down".

Interesting. That seems like a major advantage.

> (*) "Adding an epoch" is inaccurate, but I like to use that as my mental
> model. If you just add a 32-bit epoch field, then you cannot have xids from
> different epochs on the page, which would be a problem. In reality, you
> would store one 64-bit XID value in the page header, and use that as the
> "reference point" for all the 32-bit XIDs on the tuples. See the existing
> convert_txid() function for how that works. Another method is to store the
> 32-bit xid values in tuple headers as offsets from the per-page 64-bit
> value, but then you'd always need to have the 64-bit value at hand when
> interpreting the XIDs, even if they're all recent.

As I see it, the main downsides of this approach are:

(1) It breaks binary compatibility (unless you do something to provide for
it, like put the epoch in the special space).

(2) It consumes 8 bytes per page. I think it would be possible to get this
down to, say, 5 bytes per page pretty easily; we'd simply decide that the
low-order 3 bytes of the reference XID must always be 0. Possibly you could
even make do with 4 bytes, or 4 bytes plus some number of extra bits.

(3) You still need to periodically scan the entire relation, or else have a
freeze map, as Simon and Josh suggested.

The upsides of this approach, as compared with what Andres and I are
proposing, are:

(1) It provides a stepping stone towards allowing indefinite expansion of
CLOG, which is quite appealing as an alternative to a hard shutdown.

(2) It doesn't place any particular requirements on PD_ALL_VISIBLE. I don't
personally find this much of a benefit, as I want to keep PD_ALL_VISIBLE,
but I know Jeff and perhaps others disagree.

Random thought: Could you compute the reference XID based on the page LSN?
That would eliminate the storage overhead.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company