On Sep 13, 2007 17:48 -0700, Bill Moore wrote:
> I think there are a couple of issues here. The first one is to allow
> each dataset to have its own dnode size. While conceptually not all
> that hard, it would take some re-jiggering of the code to make most of
> the #defines turn into per-dataset variables. But it should be pretty
> straightforward, and probably not a bad idea in general.
Agreed.

> The other issue is a little more sticky. My understanding is that
> Lustre-on-DMU plans to use the same data structures as the ZPL. That
> way, you can mount the Lustre metadata or object stores as a regular
> filesystem. Given this, the question is what changes, if any, should be
> made to the ZPL to accommodate. Allowing the ZPL to deal with
> non-512-byte dnodes is probably not that bad. The question is whether
> or not the ZPL should be made to understand the extended attributes (or
> whatever) that is stored in the rest of the bonus buffer.

There are a couple of approaches I can propose, but since I'm only at the level of ZFS code newbie I can't really weigh how easy or hard they would be to implement. This is really just at the brainstorming stage for many of them, and we may want to split details into separate threads.

For reference, the current on-disk structures look like:

typedef struct dnode_phys {
        uint8_t   dn_type;
        uint8_t   dn_indblkshift;
        uint8_t   dn_nlevels;           /* = 3 */
        uint8_t   dn_nblkptr;           /* = 3 */
        uint8_t   dn_bonustype;
        uint8_t   dn_checksum;
        uint8_t   dn_compress;
        uint8_t   dn_pad[1];
        uint16_t  dn_datablkszsec;
        uint16_t  dn_bonuslen;
        uint8_t   dn_pad2[4];
        uint64_t  dn_maxblkid;
        uint64_t  dn_secphys;
        uint64_t  dn_pad3[4];
        blkptr_t  dn_blkptr[dn_nblkptr];
        uint8_t   dn_bonus[BONUSLEN];
} dnode_phys_t;

typedef struct znode_phys {
        uint64_t  zp_atime[2];
        uint64_t  zp_mtime[2];
        uint64_t  zp_ctime[2];
        uint64_t  zp_crtime[2];
        uint64_t  zp_gen;
        uint64_t  zp_mode;
        uint64_t  zp_size;
        uint64_t  zp_parent;
        uint64_t  zp_links;
        uint64_t  zp_xattr;
        uint64_t  zp_rdev;
        uint64_t  zp_flags;
        uint64_t  zp_uid;
        uint64_t  zp_gid;
        uint64_t  zp_pad[4];
        zfs_znode_acl_t zp_acl;
} znode_phys_t;

There are several issues that I think should be addressed with a single design, since they are closely related:

0) versioning of the filesystem
1) variable dnode_phys_t size (per dataset, to start with at least)
2) fast small files (per dnode)
3) variable znode_phys_t size (per dnode)
4) fast extended attributes (per dnode)

Lustre doesn't really care about (3) per se, and not very much about (2) right now, but we may as well address it at the same time as the others.

Versioning of the filesystem
============================
0.a If we are changing the on-disk layout we have to pay attention to on-disk compatibility and ensure older ZFS code does not fail badly. I don't think it is possible to make all of the changes being proposed here in a way that is compatible with existing code, so we need to version the changes in some manner.

0.b The ext2/3/4 format has a very clever (IMHO) versioning mechanism that is superior to just incrementing a version number and forcing all implementations to support every previous version's features. See http://www.mjmwired.net/kernel/Documentation/filesystems/ext2.txt#224 for a detailed description of how the features work. The gist is that instead of the "version" being an incrementing digit it is instead a bitmask of features.

0.c It would be possible to modify ZFS to use ext2-like feature flags. We would have to special-case the bits 0x00000001 and 0x00000002 that represent the different features of ZFS_VERSION_3 currently. All new features would still increment the "version number" (which would become the "INCOMPAT" version field) so old code would still refuse to mount it, but instead of being sequential versions we now get power-of-two jumps in the version number.

It would no longer be required that ZFS support a strict superset of all changes that the Lustre ZFS code implements immediately; it would be possible to develop and support these changes in parallel, and land them in a safe, piecewise manner (or never, as sometimes happens with features that die off).
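Just to make 0.b/0.c a bit more concrete, a rough sketch of what ext2-style feature checking could look like if ZFS grew separate COMPAT/RO_COMPAT/INCOMPAT masks. Every name below is invented for illustration (none of this exists in ZFS today); only the masking logic matters:

#include <errno.h>
#include <stdint.h>

/* Hypothetical supported-feature words, modelled on ext2/3/4. */
#define ZFS_FEATURE_COMPAT_SUPP    0x0000000000000003ULL /* e.g. today's two ZFS_VERSION_3 bits */
#define ZFS_FEATURE_RO_COMPAT_SUPP 0x0000000000000000ULL
#define ZFS_FEATURE_INCOMPAT_SUPP  0x0000000000000000ULL

/* Hypothetical on-disk feature fields (per pool or per dataset). */
typedef struct zfs_features {
        uint64_t f_compat;      /* unknown bits: still safe to read and write */
        uint64_t f_ro_compat;   /* unknown bits: safe to read, not to write */
        uint64_t f_incompat;    /* unknown bits: refuse to use at all */
} zfs_features_t;

static int
zfs_check_features(const zfs_features_t *f, int readonly)
{
        if (f->f_incompat & ~ZFS_FEATURE_INCOMPAT_SUPP)
                return (ENOTSUP);       /* unknown incompatible feature */
        if (!readonly && (f->f_ro_compat & ~ZFS_FEATURE_RO_COMPAT_SUPP))
                return (EROFS);         /* could still be mounted read-only */
        /* Unknown COMPAT bits are deliberately ignored. */
        return (0);
}

Old code that only understands a single version number would effectively be comparing against the INCOMPAT word, which is where the power-of-two jumps in the "version" come from.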
Variable dnode_phys_t size
==========================
1.a) I think everyone agrees that for a per-dataset fixed value this is "just" a matter of changing all the code in a mechanical fashion. I'll ignore the issue of being able to increase this in an existing dataset for now.

1.b) My understanding is that dn_bonuslen covers ALL of the ZPL-accessible data (i.e. it is a layering violation to try to access anything beyond dn_bonuslen, and in fact the buffer may not contain any valid data beyond that point, or conceivably might even segfault). That means any data used by the ZPL (and by extension Lustre, which wants to maintain format compatibility) needs to live inside dn_bonuslen.

1.c) With a larger dnode it is possible to have more elements in dn_blkptr[] on a per-dnode basis. I have no feeling for the relative performance gains of storing 5 or 12 block pointers in the dnode, but it can't hurt, I think. Avoiding a seek for files < 10*128kB is still good. It seems that dnode_allocate() already takes this into account, based on the bonuslen at the time of dnode creation.

1.d) It currently doesn't seem possible to change dn_bonuslen on an existing object (dnode_reallocate() will truncate all the file data in that case?), so we'd need some mechanism to push data blocks out to an external blkptr in that case (hopefully not impossible, given that the pointer to the bonus buffer might change?).

1.e) For a Lustre metadata server (which never stores file data) it may even be useful to allow dn_nblkptr = 0 to reclaim the 128-byte blkptr for EAs. That is a relatively minor improvement, and it seems the DMU would currently not be very happy with that.

Fast small files
================
2.a This means storing small files within the dnode itself. Since (AFAICS) the ZPL code is correctly layered atop the DMU, it has no idea how or where the data for a file is actually stored. This leaves the possibility of storing small file data within the dn_blkptr[] array, which at 128 bytes/blkptr is fairly significant (larger than the shrinking symlink space), especially if we have a larger dnode which may have a bunch of free space in it. For a 1024-byte dnode+znode we would have 760 bytes of contiguous space, and that covers 1/3 of the files in my /etc, /bin, /lib, /usr/bin, /usr/lib, and /var.

2.b The DMU of course assumes the dn_blkptr contents are valid (after verifying the checksums), so we'd need a mechanism (dn_flag, dn_type, dn_compress, dn_datablkszsec?) that indicates whether this is "packed inline" data or blkptr_t data. At first glance I like "dn_compress" the best, but there would still have to be some special casing to avoid handling the "blkptr" in the normal way. A rough sketch of what I mean is below.
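Purely to illustrate 2.b (nothing here exists in the DMU; ZIO_COMPRESS_INLINE and dnode_read_inline() are made-up names, and this assumes the dnode_phys_t/blkptr_t layout quoted above), the read side might conceptually look like:

#include <errno.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical marker value for dn_compress meaning "the dn_blkptr[]
 * area holds packed file data, not block pointers".
 */
#define ZIO_COMPRESS_INLINE     0x80

static int
dnode_read_inline(const dnode_phys_t *dnp, void *buf, uint64_t size)
{
        const uint8_t *data = (const uint8_t *)dnp->dn_blkptr;
        uint64_t maxsize = dnp->dn_nblkptr * sizeof (blkptr_t);

        if (dnp->dn_compress != ZIO_COMPRESS_INLINE)
                return (ENOTSUP);       /* take the normal blkptr path */
        if (size > maxsize)
                return (EOVERFLOW);     /* too big to have been packed inline */

        memcpy(buf, data, size);        /* no extra block read, no extra seek */
        return (0);
}

The write side would need the matching special-casing, plus a way to spill out to real block pointers once a file outgrows the inline space, which ties back to (1.d).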
Variable znode_phys_t size
==========================
3.a) I initially thought that we don't have to store any extra information to have a variable znode_phys_t size, because dn_bonuslen holds this information. However, for symlinks ZFS checks essentially "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a fast or slow symlink. That implies that if sizeof(znode_phys_t) changes, old symlinks on disk will be accessed incorrectly unless we have some extra information about the size of znode_phys_t in each dnode.

3.b) We can call this "zp_extra_znsize". If we declare the current znode_phys_t as znode_phys_v0_t, then zp_extra_znsize is the amount of extra space beyond sizeof(znode_phys_v0_t), so 0 for current filesystems.

3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere. There is lots of unused space in some of the 64-bit fields, but I don't know how you feel about hacks for this. Possibilities include some bits in zp_flags, zp_pad, high bits in the zp_*time nanoseconds, etc. It probably only needs to be 8 bits or so (it seems unlikely you will more than double the number of fixed fields in struct znode_phys_t).

3.d) We might consider some symlink-specific mechanism to indicate fast/slow symlinks (e.g. a flag) instead of depending on sizes, which I always found fragile in ext3 as well, and which was the source of several bugs there.

3.e) We may instead consider (2.a) for symlinks at that point, since there is no reason to fear writing 60-byte files anymore (same performance, different (larger!) location for the symlink data).

3.f) When ZFS code accesses new fields declared in znode_phys_t it has to check whether they lie beyond dn_bonuslen and zp_extra_znsize, to know whether those fields are actually valid on disk.

Finally,

Fast extended attributes
========================
4.a) Unfortunately, due to (1.b), I don't think we can just store the EA in the dnode after the bonus buffer.

4.b) Putting the EA in the bonus buffer requires (3.a, 3.b) to be addressed. At that point (symlinks possibly excepted, depending on whether 3.e is used) the EA space would be:

        (dn_bonuslen - sizeof(znode_phys_v0_t) - zp_extra_znsize)

For existing symlinks we'd also have to reduce this by zp_size.

4.c) It would be best to have some kind of ZAP to store the fast EA data. Ideally it is a very simple kind of ZAP (single buffer), but the microzap format is too restrictive with only a 64-bit value. One of the other Lustre desires is to store additional information in each directory entry (in addition to the object number), like a file type and a remote server identifier, so having a single ZAP type that is useful for small entries would be good. Is it possible to go straight to a zap_leaf_phys_t without having a corresponding zap_phys_t first? If yes, then this would be quite useful; otherwise a fat ZAP is too fat to be useful for storing fast EA data and the extended directory info.

Apologies for the long email, but I think all of these issues are related and best addressed with a single design, even if they are implemented in a piecemeal fashion. None of these features are blockers for a Lustre implementation atop ZFS/DMU, but nobody wants the performance to be bad.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.