On Tue, Jun 28, 2022 at 7:45 AM Simon Riggs
<simon.ri...@enterprisedb.com> wrote:
> Another approach would be to condense spcOid and dbOid into a single
> 4-byte Oid-like number, since in most cases they are associated with
> each other, and not often many of them anyway. So this new number
> would indicate both the database and the tablespace. I know that we
> want to be able to make file changes without doing catalog lookups,
> but since the number of combinations is usually 1, but even then, low,
> it can be cached easily in a smgr array and included in the checkpoint
> record (or nearby) for ease of use.
>
> typedef struct buftag
> {
>      Oid     db_spcOid;
>      ForkNumber  uint32;
>      RelFileNumber   uint64;
> } BufferTag;
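
To make the quoted idea concrete, here is a rough sketch of what a small in-memory table of <dbOid, spcOid> pairs might look like, with the array index (plus one) serving as the combined 4-byte number. Everything in it is hypothetical: the names DbSpcCache, db_spc_lookup_or_assign and db_spc_resolve, the fixed bound of 64 pairs, and the linear search are assumptions made for illustration, not existing PostgreSQL code and not necessarily what was proposed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch only; none of these names exist in PostgreSQL. */
typedef uint32_t Oid;

#define MAX_DB_SPC_PAIRS 64     /* arbitrary bound chosen for the sketch */
#define INVALID_DB_SPC   0      /* alias 0 is reserved to mean "no mapping" */

typedef struct DbSpcPair
{
    Oid         dbOid;
    Oid         spcOid;
} DbSpcPair;

typedef struct DbSpcCache
{
    uint32_t    npairs;                     /* number of slots in use */
    DbSpcPair   pairs[MAX_DB_SPC_PAIRS];    /* slot i holds alias i + 1 */
} DbSpcCache;

/*
 * Return the alias for (dbOid, spcOid), assigning the next free one if the
 * pair has not been seen before.  Returns INVALID_DB_SPC if the table is full.
 */
uint32_t
db_spc_lookup_or_assign(DbSpcCache *cache, Oid dbOid, Oid spcOid)
{
    assert(cache != NULL);

    /* Linear search is adequate while the number of pairs stays small. */
    for (uint32_t i = 0; i < cache->npairs; i++)
    {
        if (cache->pairs[i].dbOid == dbOid &&
            cache->pairs[i].spcOid == spcOid)
            return i + 1;
    }

    if (cache->npairs >= MAX_DB_SPC_PAIRS)
        return INVALID_DB_SPC;

    cache->pairs[cache->npairs].dbOid = dbOid;
    cache->pairs[cache->npairs].spcOid = spcOid;
    cache->npairs++;
    return cache->npairs;
}

/*
 * Translate an alias back into its (dbOid, spcOid) pair.
 * Returns 1 on success, 0 if the alias is not currently assigned.
 */
int
db_spc_resolve(const DbSpcCache *cache, uint32_t alias,
               Oid *dbOid, Oid *spcOid)
{
    assert(cache != NULL);

    if (alias == INVALID_DB_SPC || alias > cache->npairs)
        return 0;

    *dbOid = cache->pairs[alias - 1].dbOid;
    *spcOid = cache->pairs[alias - 1].spcOid;
    return 1;
}
```

The sketch sidesteps the hard parts: it never frees an alias, it is not shared-memory or crash safe, and the fixed bound simply refuses new pairs once the table is full.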

I've thought about this before too, because it does seem like the DB OID and tablespace OID are a poor use of bit space. You might not even need to keep the db_spcOid value in any persistent place, because it could just be an alias for buffer mapping lookups that might change on every restart. That does have the problem that you now need a secondary hash table - in theory of unbounded size - to store mappings from <dboid,tsoid> to db_spcOid, and that seems complicated and hard to get right. It might be possible, though.

Alternatively, you could imagine a durable mapping that also affects the on-disk structure, but I don't quite see how to make that work: for example, pg_basebackup wants to produce a tar file for each tablespace directory, and if the pathnames no longer contain the tablespace OID but only the db_spcOid, then that doesn't work any more.

But the primary problem we're trying to solve here is that right now we sometimes reuse the same filename for a whole new file, and that results in bugs that only manifest themselves in obscure circumstances, e.g. see 4eb2176318d0561846c1f9fb3c68bede799d640f. There are residual failure modes even now related to the "tombstone" files that are created when you drop a relation: remove everything but the first file from the main fork but then keep that file (only) around until after the next checkpoint. OID wraparound is another annoyance that has influenced the design of quite a bit of code over the years and where we probably still have bugs. If we don't reuse relfilenodes, we can avoid a lot of that pain. Combining the DB OID and TS OID fields doesn't solve that problem.

> That way we could just have a simple 64-bit RelFileNumber, without
> restriction, and probably some spare bytes on the ForkNumber, if we
> needed them later.

In my personal opinion, the ForkNumber system is an ugly wart which has nothing to recommend it except that the VM and FSM forks are awesome.
But if we could have those things without needing forks, I think that would be way better. Forks add code complexity in tons of places, and it's barely possible to scale it to the 4 forks we have already, let alone any larger number. Furthermore, there are really negative performance effects from creating 3 files per small relation rather than 1, and we sure can't afford to have that number get any bigger. I'd rather kill the ForkNumber system with fire than expand it further, but even if we do expand it, we're not close to being able to cope with more than 256 forks per relation.

--
Robert Haas
EDB: http://www.enterprisedb.com