On Sep 13, 2007 17:48 -0700, Bill Moore wrote:
> I think there are a couple of issues here. The first one is to allow
> each dataset to have its own dnode size. While conceptually not all
> that hard, it would take some re-jiggering of the code to make most of
> the #defines turn into per-dataset variables. But it should be pretty
> straightforward, and probably not a bad idea in general.
Agreed.

> The other issue is a little more sticky. My understanding is that
> Lustre-on-DMU plans to use the same data structures as the ZPL. That
> way, you can mount the Lustre metadata or object stores as a regular
> filesystem. Given this, the question is what changes, if any, should be
> made to the ZPL to accommodate. Allowing the ZPL to deal with
> non-512-byte dnodes is probably not that bad. The question is whether
> or not the ZPL should be made to understand the extended attributes (or
> whatever) that is stored in the rest of the bonus buffer.

There are a couple of approaches I can propose, but since I'm only at the level of ZFS code newbie I can't really weigh how easy or hard they would be to implement. This is really just at the brainstorming stage for many of them, and we may want to split details into separate threads.

For reference, the current on-disk structures look like:

typedef struct dnode_phys {
        uint8_t   dn_type;
        uint8_t   dn_indblkshift;
        uint8_t   dn_nlevels;           /* = 3 */
        uint8_t   dn_nblkptr;           /* = 3 */
        uint8_t   dn_bonustype;
        uint8_t   dn_checksum;
        uint8_t   dn_compress;
        uint8_t   dn_pad[1];
        uint16_t  dn_datablkszsec;
        uint16_t  dn_bonuslen;
        uint8_t   dn_pad2[4];
        uint64_t  dn_maxblkid;
        uint64_t  dn_secphys;
        uint64_t  dn_pad3[4];
        blkptr_t  dn_blkptr[dn_nblkptr];
        uint8_t   dn_bonus[BONUSLEN];
} dnode_phys_t;

typedef struct znode_phys {
        uint64_t  zp_atime[2];
        uint64_t  zp_mtime[2];
        uint64_t  zp_ctime[2];
        uint64_t  zp_crtime[2];
        uint64_t  zp_gen;
        uint64_t  zp_mode;
        uint64_t  zp_size;
        uint64_t  zp_parent;
        uint64_t  zp_links;
        uint64_t  zp_xattr;
        uint64_t  zp_rdev;
        uint64_t  zp_flags;
        uint64_t  zp_uid;
        uint64_t  zp_gid;
        uint64_t  zp_pad[4];
        zfs_znode_acl_t zp_acl;
} znode_phys_t;

There are several issues that I think should be addressed with a single design, since they are closely related:

0) versioning of the filesystem
1) variable dnode_phys_t size (per dataset, to start with at least)
2) fast small files (per dnode)
3) variable znode_phys_t size (per dnode)
4) fast extended attributes (per dnode)

Lustre doesn't really care about (3) per se, and not very much about (2) right now, but we may as well address it at the same time as the others.

Versioning of the filesystem
============================
0.a If we are changing the on-disk layout we have to pay attention to on-disk compatibility and ensure older ZFS code does not fail badly. I don't think it is possible to make all of the changes being proposed here in a way that is compatible with existing code, so we need to version the changes in some manner.

0.b The ext2/3/4 format has a very clever (IMHO) versioning mechanism that is superior to just incrementing a version number and forcing all implementations to support every previous version's features. See http://www.mjmwired.net/kernel/Documentation/filesystems/ext2.txt#224 for a detailed description of how the features work. The gist is that instead of the "version" being an incrementing digit it is instead a bitmask of features.

0.c It would be possible to modify ZFS to use ext2-like feature flags. We would have to special-case the bits 0x00000001 and 0x00000002 that represent the different features of ZFS_VERSION_3 currently. All new features would still increment the "version number" (which would become the "INCOMPAT" version field) so old code would still refuse to mount it, but instead of being sequential versions we now get power-of-two jumps in the version number.

It would no longer be required that ZFS support a strict superset of all changes that the Lustre ZFS code implements immediately; it would be possible to develop and support these changes in parallel, and land them in a safe, piecewise manner (or never, as sometimes happens with features that die off).
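Just to make 0.b/0.c a bit more concrete, a rough sketch of what ext2-style feature checking could look like if ZFS grew separate COMPAT/RO_COMPAT/INCOMPAT masks. Every name below is invented for illustration (none of this exists in ZFS today); only the masking logic matters:

#include <errno.h>
#include <stdint.h>

/* Hypothetical supported-feature words, modelled on ext2/3/4. */
#define ZFS_FEATURE_COMPAT_SUPP    0x0000000000000003ULL /* e.g. today's two ZFS_VERSION_3 bits */
#define ZFS_FEATURE_RO_COMPAT_SUPP 0x0000000000000000ULL
#define ZFS_FEATURE_INCOMPAT_SUPP  0x0000000000000000ULL

/* Hypothetical on-disk feature fields (per pool or per dataset). */
typedef struct zfs_features {
        uint64_t f_compat;      /* unknown bits: still safe to read and write */
        uint64_t f_ro_compat;   /* unknown bits: safe to read, not to write */
        uint64_t f_incompat;    /* unknown bits: refuse to use at all */
} zfs_features_t;

static int
zfs_check_features(const zfs_features_t *f, int readonly)
{
        if (f->f_incompat & ~ZFS_FEATURE_INCOMPAT_SUPP)
                return (ENOTSUP);       /* unknown incompatible feature */
        if (!readonly && (f->f_ro_compat & ~ZFS_FEATURE_RO_COMPAT_SUPP))
                return (EROFS);         /* could still be mounted read-only */
        /* Unknown COMPAT bits are deliberately ignored. */
        return (0);
}

Old code that only understands a single version number would effectively be comparing against the INCOMPAT word, which is where the power-of-two jumps in the "version" come from.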
Variable dnode_phys_t size
==========================
1.a) I think everyone agrees that for a per-dataset fixed value this is "just" a matter of changing all the code in a mechanical fashion. I'll ignore the issue of being able to increase this in an existing dataset for now.

1.b) My understanding is that dn_bonuslen covers ALL of the ZPL-accessible data (i.e. it is a layering violation to try to access anything beyond dn_bonuslen, and in fact the buffer may not contain any valid data beyond that point, or conceivably might even segfault). That means any data used by the ZPL (and by extension Lustre, which wants to maintain format compatibility) needs to live inside dn_bonuslen.

1.c) With a larger dnode it is possible to have more elements in dn_blkptr[] on a per-dnode basis. I have no feeling for the relative performance gains of storing 5 or 12 block pointers in the dnode, but it can't hurt, I think. Avoiding a seek for files < 10*128kB is still good. It seems that dnode_allocate() already takes this into account, based on the bonuslen at the time of dnode creation.

1.d) It currently doesn't seem possible to change dn_bonuslen on an existing object (dnode_reallocate() will truncate all the file data in that case?), so we'd need some mechanism to push data blocks out to an external blkptr in that case (hopefully not impossible, given that the pointer to the bonus buffer might change?).

1.e) For a Lustre metadata server (which never stores file data) it may even be useful to allow dn_nblkptr = 0 to reclaim the 128-byte blkptr for EAs. That is a relatively minor improvement, and it seems the DMU would currently not be very happy with that.

Fast small files
================
2.a This means storing small files within the dnode itself. Since (AFAICS) the ZPL code is correctly layered atop the DMU, it has no idea how or where the data for a file is actually stored. This leaves the possibility of storing small file data within the dn_blkptr[] array, which at 128 bytes/blkptr is fairly significant (larger than the shrinking symlink space), especially if we have a larger dnode which may have a bunch of free space in it. For a 1024-byte dnode+znode we would have 760 bytes of contiguous space, and that covers 1/3 of the files in my /etc, /bin, /lib, /usr/bin, /usr/lib, and /var.

2.b The DMU of course assumes the dn_blkptr contents are valid (after verifying the checksums), so we'd need a mechanism (dn_flag, dn_type, dn_compress, dn_datablkszsec?) that indicates whether this is "packed inline" data or blkptr_t data. At first glance I like "dn_compress" the best, but there would still have to be some special casing to avoid handling the "blkptr" in the normal way. A rough sketch of what I mean is below.
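Purely to illustrate 2.b (nothing here exists in the DMU; ZIO_COMPRESS_INLINE and dnode_read_inline() are made-up names, and this assumes the dnode_phys_t/blkptr_t layout quoted above), the read side might conceptually look like:

#include <errno.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical marker value for dn_compress meaning "the dn_blkptr[]
 * area holds packed file data, not block pointers".
 */
#define ZIO_COMPRESS_INLINE     0x80

static int
dnode_read_inline(const dnode_phys_t *dnp, void *buf, uint64_t size)
{
        const uint8_t *data = (const uint8_t *)dnp->dn_blkptr;
        uint64_t maxsize = dnp->dn_nblkptr * sizeof (blkptr_t);

        if (dnp->dn_compress != ZIO_COMPRESS_INLINE)
                return (ENOTSUP);       /* take the normal blkptr path */
        if (size > maxsize)
                return (EOVERFLOW);     /* too big to have been packed inline */

        memcpy(buf, data, size);        /* no extra block read, no extra seek */
        return (0);
}

The write side would need the matching special-casing, plus a way to spill out to real block pointers once a file outgrows the inline space, which ties back to (1.d).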
Variable znode_phys_t size
==========================
3.a) I initially thought that we don't have to store any extra information to have a variable znode_phys_t size, because dn_bonuslen holds this information. However, for symlinks ZFS checks essentially "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a fast or slow symlink. That implies that if sizeof(znode_phys_t) changes, old symlinks on disk will be accessed incorrectly unless we have some extra information about the size of znode_phys_t in each dnode.

3.b) We can call this "zp_extra_znsize". If we declare the current znode_phys_t as znode_phys_v0_t, then zp_extra_znsize is the amount of extra space beyond sizeof(znode_phys_v0_t), so 0 for current filesystems.

3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere. There is lots of unused space in some of the 64-bit fields, but I don't know how you feel about hacks for this. Possibilities include some bits in zp_flags, zp_pad, high bits in the zp_*time nanoseconds, etc. It probably only needs to be 8 bits or so (it seems unlikely you will more than double the number of fixed fields in struct znode_phys_t).

3.d) We might consider some symlink-specific mechanism to indicate fast/slow symlinks (e.g. a flag) instead of depending on sizes, which I always found fragile in ext3 as well, and which was the source of several bugs there.

3.e) We may instead consider (2.a) for symlinks at that point, since there is no reason to fear writing 60-byte files anymore (same performance, different (larger!) location for the symlink data).

3.f) When ZFS code accesses new fields declared in znode_phys_t it has to check whether they lie beyond dn_bonuslen and zp_extra_znsize, to know whether those fields are actually valid on disk.

Finally,

Fast extended attributes
========================
4.a) Unfortunately, due to (1.b), I don't think we can just store the EA in the dnode after the bonus buffer.

4.b) Putting the EA in the bonus buffer requires (3.a, 3.b) to be addressed. At that point (symlinks possibly excepted, depending on whether 3.e is used) the EA space would be:

        (dn_bonuslen - sizeof(znode_phys_v0_t) - zp_extra_znsize)

For existing symlinks we'd also have to reduce this by zp_size.

4.c) It would be best to have some kind of ZAP to store the fast EA data. Ideally it is a very simple kind of ZAP (single buffer), but the microzap format is too restrictive with only a 64-bit value. One of the other Lustre desires is to store additional information in each directory entry (in addition to the object number), like a file type and a remote server identifier, so having a single ZAP type that is useful for small entries would be good. Is it possible to go straight to a zap_leaf_phys_t without having a corresponding zap_phys_t first? If yes, then this would be quite useful; otherwise a fat ZAP is too fat to be useful for storing fast EA data and the extended directory info.

Apologies for the long email, but I think all of these issues are related and best addressed with a single design, even if they are implemented in a piecemeal fashion. None of these features are blockers for a Lustre implementation atop ZFS/DMU, but nobody wants the performance to be bad.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.