> There are several issues that I think should be addressed with a single
> design, since they are closely related:
> 0) versioning of the filesystem
> 1) variable dnode_phys_t size (per dataset, to start with at least)
> 2) fast small files (per dnode)
> 3) variable znode_phys_t size (per dnode)
> 4) fast extended attributes (per dnode)
> 
> Lustre doesn't really care about (3) per se, and not very much about (2)
> right now, but we may as well address it at the same time as the others.
> 
> Versioning of the filesystem
> ============================
> 0.a If we are changing the on-disk layout, we have to pay attention to
>    on-disk compatibility and ensure that older ZFS code does not fail
>    badly.  I don't think it is possible to make all of the changes being
>    proposed here in a way that is compatible with existing code, so we
>    need to version the changes in some manner.
> 
> 0.b The ext2/3/4 format has a very clever (IMHO) versioning mechanism
>    that is superior to just incrementing a version number and forcing
>    all implementations to support every previous version's features.  See
>    http://www.mjmwired.net/kernel/Documentation/filesystems/ext2.txt#224
>    for a detailed description of how the features work.  The gist is
>    that instead of the "version" being an incrementing number, it is a
>    bitmask of features, split into three classes: COMPAT (safe to mount
>    read/write without understanding the feature), RO_COMPAT (safe to
>    mount read-only), and INCOMPAT (must not be mounted at all by code
>    that does not understand it).
>
> 0.c It would be possible to modify ZFS to use ext2-like feature flags.
>    We would have to special-case the bits 0x00000001 and 0x00000002
>    that represent the different features of ZFS_VERSION_3 currently.
>    All new features would still increment the "version number" (which
>    would become the "INCOMPAT" version field) so old code would still
>    refuse to mount it, but instead of sequential versions we would now
>    get power-of-two jumps in the version number.  ZFS would no longer
>    need to immediately support a strict superset of all changes that
>    the Lustre ZFS code implements; the two could be developed and
>    supported in parallel, and landed in a safe, piecewise manner (or
>    never, as sometimes happens with features that die off).  A rough
>    sketch of the mount-time check follows.
> 
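> A rough sketch of what such a mount-time check could look like (all
>    field and flag names here are made up for illustration; none of this
>    is existing ZFS code):
>
>    /* INCOMPAT features this implementation understands */
>    #define ZFS_INCOMPAT_SUPP    (ZFS_FEATURE_V3_A | ZFS_FEATURE_V3_B)
>    /* RO_COMPAT features this implementation understands */
>    #define ZFS_RO_COMPAT_SUPP   (0ULL)
>
>    if (osp->os_incompat_features & ~ZFS_INCOMPAT_SUPP)
>            return (ENOTSUP);    /* refuse to mount at all */
>    if (!readonly &&
>        (osp->os_ro_compat_features & ~ZFS_RO_COMPAT_SUPP))
>            return (EROFS);      /* offer a read-only mount instead */
>    /* unknown COMPAT features are simply ignored */
>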

While not entirely the same thing, we will soon have a VFS feature
registration mechanism in Nevada.  Basically, a file system registers
the features it supports.  Initially these will be things such as "case
insensitivity", "acl on create", and "extended vattr_t".
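
A sketch of how that might be used once it lands (the names below are
my guess at the eventual interface, not a commitment):

    /* at mount time, the file system declares what it supports */
    vfs_set_feature(vfsp, VFSFT_CASEINSENSITIVE);
    vfs_set_feature(vfsp, VFSFT_ACLONCREATE);
    vfs_set_feature(vfsp, VFSFT_XVATTR);

    /* callers can then test for a feature before relying on it */
    if (vfs_has_feature(vfsp, VFSFT_XVATTR)) {
            /* safe to hand an xvattr_t to VOP_GETATTR/VOP_SETATTR */
    }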


> Variable znode_phys_t size
> ==========================
> 3.a) I initially thought that we don't have to store any extra
>    information to have a variable znode_phys_t size, because dn_bonuslen
>    holds this information.  However, for symlinks ZFS essentially checks
>    "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a
>    fast or slow symlink.  That implies that if sizeof(znode_phys_t)
>    changes, old symlinks on disk will be accessed incorrectly unless we
>    keep some extra information about the size of znode_phys_t in each
>    dnode.
> 
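> A sketch of the failure mode (paraphrasing the existing test, not
>    quoting it literally):
>
>    if (zp->zp_size + sizeof (znode_phys_t) < dn_bonuslen) {
>            /* fast symlink: copy target from the bonus buffer */
>    } else {
>            /* slow symlink: read target from the first data block */
>    }
>
>    If sizeof(znode_phys_t) grows, the same on-disk dn_bonuslen makes an
>    old fast symlink fail this test and be read as a slow one.
>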

There is an existing bug to create symlinks with their own object type.

> 3.b)  We can call this "zp_extra_znsize".  If we declare the current
>    znode_phys_t as znode_phys_v0_t then zp_extra_znsize is the amount of
>    extra space beyond sizeof(znode_phys_v0_t), so 0 for current filesystems.

This would also require creating a new DMU_OT_ZNODE2 or something
similarly named.

> 
> 3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere.
>    There is lots of unused space in some of the 64-bit fields, but I
>    don't know how you feel about hacks like this.  Possibilities include
>    some bits in zp_flags, zp_pad, the high bits in the zp_*time
>    nanoseconds, etc.  It probably only needs to be 8 bits or so (it
>    seems unlikely the number of fixed fields in struct znode_phys_t
>    will ever more than double).
> 

The zp_flags field is off limits.  It is going to be used for storing 
additional file attributes such as immutable, nounlink,...

I don't want to see us overload other fields.  We already have several 
pad fields within the znode that could be used.
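
For example (purely a sketch; the exact slot and width would have to be
agreed on), the extension size could live in the low bits of one of the
existing pad fields:

    /*
     * Hypothetical: keep the number of fixed-field bytes beyond
     * znode_phys_v0_t in the low 8 bits of the first pad field.
     * BF64_GET()/BF64_SET() are the existing ZFS bitfield macros.
     */
    #define ZP_EXTRA_ZNSIZE(zp)        BF64_GET((zp)->zp_pad[0], 0, 8)
    #define ZP_EXTRA_ZNSIZE_SET(zp, x) BF64_SET((zp)->zp_pad[0], 0, 8, x)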

> 3.d) We might consider some symlink-specific mechanism to indicate
>    fast/slow symlinks (e.g. a flag) instead of depending on sizes,
>    which I always found fragile in ext3 as well, and which was the
>    source of several bugs.
>
> 3.e) We may instead consider (2.a) for symlinks at that point, since
>    there is no reason to fear writing 60-byte files anymore (same
>    performance, different (larger!) location for the symlink data).
> 
> 3.f) When ZFS code accesses new fields declared in znode_phys_t, it has
>    to check them against dn_bonuslen and zp_extra_znsize to know whether
>    those fields are actually valid on disk.  A sketch of such a check
>    follows.
> 
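> The kind of check I mean (a hypothetical helper, assuming some
>    accessor for zp_extra_znsize per 3.c):
>
>    /* is the fixed field at [off, off + len) actually present? */
>    static boolean_t
>    zfs_znode_field_valid(znode_phys_t *zp, size_t bonuslen,
>        size_t off, size_t len)
>    {
>            size_t znsize = sizeof (znode_phys_v0_t) +
>                zp_extra_znsize(zp);
>
>            return (off + len <= znsize && off + len <= bonuslen);
>    }
>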
> Finally,
> 
> Fast extended attributes
> ========================
> 4.a) Unfortunately, due to (1.b), I don't think we can just store the
>    EA in the dnode after the bonus buffer.
> 
> 4.b) Putting the EA in the bonus buffer requires (3.a, 3.b) to be addressed.
>    At that point (symlinks possibly excepted, depending on whether 3.e
>    is used) the EA space would be:
>    
>    (dn_bonuslen - sizeof(znode_phys_v0_t) - zp_extra_znsize)
> 
>    For existing symlinks we'd have to also reduce this by zp_size.
> 
> 4.c) It would be best to have some kind of ZAP to store the fast EA data.
>    Ideally it is a very simple kind of ZAP (single buffer), but the
>    microzap format is too restrictive with only a 64-bit value (see the
>    entry layout sketched below).  One of the other Lustre desires is to
>    store additional information in each directory entry (in addition to
>    the object number), like the file type and a remote server identifier,
>    so having a single ZAP type that is useful for small entries would be
>    good.  Is it possible to go straight to a zap_leaf_phys_t without
>    having a corresponding zap_phys_t first?  If so, this would be quite
>    useful; otherwise a fat ZAP is too fat to be useful for storing fast
>    EA data and the extended directory info.
> 
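> To illustrate the restriction, the current microzap entry layout
>    (from zap_impl.h, quoted from memory, so check the header):
>
>    typedef struct mzap_ent_phys {
>            uint64_t mze_value;             /* the only payload: 64 bits */
>            uint32_t mze_cd;
>            uint16_t mze_pad;
>            char mze_name[MZAP_NAME_LEN];   /* 50 bytes */
>    } mzap_ent_phys_t;
>
>    A variable-length value (EA data, or objnum + type + server id for
>    directory entries) has nowhere to go in this format.
>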

Can you provide a list of what attributes you want to store in the znode 
and what their sizes are?  Do you expect ZFS to do anything special with 
these attributes?  Should these attributes be exposed to applications?

Usually, we only embed attributes in the znode if the file system has 
some sort of semantics associated with them.

One of the original plans, from several years ago, was to create a
zp_zap field in the znode that would be used for storing additional
file attributes.  We never actually did that, and the field was turned
into one of the pad fields in the znode.

If the attribute will be needed for every file then it should probably
be in the znode, but if it is an optional attribute, or too big, then
maybe it should be in some sort of overflow object.


   -Mark
> 
> Apologies for the long email, but I think all of these issues are
> related and best addressed with a single design, even if they are
> implemented in a piecemeal fashion.  None of these features is a
> blocker for a Lustre implementation atop ZFS/DMU, but nobody wants
> the performance to be bad.
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
> 