Re: making relfilenodes 56 bits

Robert Haas Tue, 28 Jun 2022 08:26:26 -0700

On Tue, Jun 28, 2022 at 7:45 AM Simon Riggs
<[email protected]> wrote:
> Another approach would be to condense spcOid and dbOid into a single
> 4-byte Oid-like number, since in most cases they are associated with
> each other, and not often many of them anyway. So this new number
> would indicate both the database and the tablespace. I know that we
> want to be able to make file changes without doing catalog lookups,
> but since the number of combinations is usually 1, but even then, low,
> it can be cached easily in a smgr array and included in the checkpoint
> record (or nearby) for ease of use.
>
> typedef struct buftag
> {
>      Oid     db_spcOid;
>      ForkNumber  uint32;
>      RelFileNumber   uint64;
> } BufferTag;


I've thought about this before too, because it does seem like the DB
OID and tablespace OID are a poor use of bit space. You might not even
need to keep the db_spcOid value in any persistent place, because it
could just be an alias for buffer mapping lookups that might change on
every restart. That does have the problem that you now need a
secondary hash table - in theory of unbounded size - to store mappings
from <dboid,tsoid> to db_spcOid, and that seems complicated and hard
to get right. It might be possible, though. Alternatively, you could
imagine a durable mapping that also affects the on-disk structure, but
I don't quite see how to make that work: for example, pg_basebackup
wants to produce a tar file for each tablespace directory, and if the
pathnames no longer contain the tablespace OID but only the db_spcOid,
then that doesn't work any more.
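
To make the non-persistent variant concrete, here is a minimal sketch of
a lookup that hands out a db_spcOid-style alias the first time a
<dboid,tsoid> pair is seen. It is illustrative only, not PostgreSQL
code: every name in it (DbSpcEntry, DbSpcMap, dbspc_lookup_or_assign)
is hypothetical, and it uses a growable array with linear search rather
than a real hash table, on the assumption that the number of distinct
pairs is small. It ignores shared memory, locking, and eviction, which
are where the hard parts would be.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t Oid;           /* stand-in for PostgreSQL's Oid */

/* Hypothetical: one entry per distinct <dboid, tsoid> pair seen so far. */
typedef struct DbSpcEntry
{
    Oid         dboid;
    Oid         tsoid;
} DbSpcEntry;

/* Hypothetical: the alias for a pair is simply its index in entries[]. */
typedef struct DbSpcMap
{
    DbSpcEntry *entries;
    uint32_t    nused;
    uint32_t    nalloc;
} DbSpcMap;

/*
 * Look up the alias for <dboid, tsoid>, assigning the next free alias if
 * the pair has not been seen before.  Returns 0 on success and stores the
 * alias in *alias; returns -1 if memory could not be allocated.
 */
static int
dbspc_lookup_or_assign(DbSpcMap *map, Oid dboid, Oid tsoid, uint32_t *alias)
{
    assert(map != NULL && alias != NULL);

    /* Linear search: acceptable only while the number of pairs is small. */
    for (uint32_t i = 0; i < map->nused; i++)
    {
        if (map->entries[i].dboid == dboid && map->entries[i].tsoid == tsoid)
        {
            *alias = i;
            return 0;
        }
    }

    /* Not found: grow the array if it is full, then append a new entry. */
    if (map->nused == map->nalloc)
    {
        uint32_t    newalloc = map->nalloc ? map->nalloc * 2 : 8;
        DbSpcEntry *p = realloc(map->entries, newalloc * sizeof(DbSpcEntry));

        if (p == NULL)
            return -1;
        map->entries = p;
        map->nalloc = newalloc;
    }

    map->entries[map->nused].dboid = dboid;
    map->entries[map->nused].tsoid = tsoid;
    *alias = map->nused++;
    return 0;
}
```

Because aliases are assigned in first-seen order, they are only
meaningful within one server lifetime, which matches the "might change
on every restart" idea above; nothing here bounds how large the array
can grow.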

But the primary problem we're trying to solve here is that right now
we sometimes reuse the same filename for a whole new file, and that
results in bugs that only manifest themselves in obscure
circumstances, e.g. see 4eb2176318d0561846c1f9fb3c68bede799d640f.
There are residual failure modes even now related to the "tombstone"
files that are created when you drop a relation: remove everything but
the first file from the main fork but then keep that file (only)
around until after the next checkpoint. OID wraparound is another
annoyance that has influenced the design of quite a bit of code over
the years and where we probably still have bugs. If we don't reuse
relfilenodes, we can avoid a lot of that pain. Combining the DB OID
and TS OID fields doesn't solve that problem.

> That way we could just have a simple 64-bit RelFileNumber, without
> restriction, and probably some spare bytes on the ForkNumber, if we
> needed them later.

In my personal opinion, the ForkNumber system is an ugly wart which
has nothing to recommend it except that the VM and FSM forks are
awesome. But if we could have those things without needing forks, I
think that would be way better. Forks add code complexity in tons of
places, and it's barely possible to scale it to the 4 forks we have
already, let alone any larger number. Furthermore, there are really
negative performance effects from creating 3 files per small relation
rather than 1, and we sure can't afford to have that number get any
bigger. I'd rather kill the ForkNumber system with fire than expand it
further, but even if we do expand it, we're not close to being able to
cope with more than 256 forks per relation.
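
For reference, 256 forks is what fits in an 8-bit fork number, and 8 + 56
= 64. The sketch below shows one way an 8-bit fork number and a 56-bit
relfilenumber could share a single 64-bit word. This is purely an
illustration of the bit arithmetic: the layout, the macro names, and the
function names are all assumptions made for this example and are not
taken from any patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: high 8 bits = fork number, low 56 bits = relfilenumber. */
#define RELNUMBER_BITS  56
#define RELNUMBER_MASK  ((UINT64_C(1) << RELNUMBER_BITS) - 1)

/* Combine a fork number (0..255) and a 56-bit relfilenumber into one word. */
static inline uint64_t
pack_fork_and_relnumber(uint8_t fork, uint64_t relnumber)
{
    /* The relfilenumber must fit in 56 bits. */
    assert(relnumber <= RELNUMBER_MASK);
    return ((uint64_t) fork << RELNUMBER_BITS) | relnumber;
}

/* Extract the fork number from the high 8 bits. */
static inline uint8_t
unpack_fork(uint64_t packed)
{
    return (uint8_t) (packed >> RELNUMBER_BITS);
}

/* Extract the relfilenumber from the low 56 bits. */
static inline uint64_t
unpack_relnumber(uint64_t packed)
{
    return packed & RELNUMBER_MASK;
}
```

With a layout like this there are no spare bits left over: any growth
in the fork number would have to come out of the relfilenumber.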

-- 
Robert Haas
EDB: http://www.enterprisedb.com
