Andreas Dilger wrote:
> On Sep 13, 2007 15:27 -0600, Mark Maybee wrote:
>> We have explored the idea of increasing the dnode size in the past
>> and discovered that a larger dnode size has a significant negative
>> performance impact on the ZPL (at least with our current caching
>> and read-ahead policies). So we don't have any plans to increase
>> its size generically anytime soon.
>
> I'm sure it depends a lot on the workload. I don't know the details
> of how the ZFS allocators work, so it seems possible they always
> allocate the modified dnode and the corresponding EAs in a contiguous
> chunk initially, but I suspect that keeping this true over the life
> of the dnode would put an added burden on the allocator (to know this)
> or on the ZPL (to always mark them dirty to force colocation even if
> not modified).
>
> I'd also heard that the 48 (or so) bytes that remain in the bonus buffer
> for ZFS are potentially going to be used up soon, so there would be a
> desire to have a generic solution to this issue.

You seem to have a line on a lot of internal development details :-).
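
For anyone who has not looked at the dnode internals, the bonus buffer in
question is the tail of the fixed-size on-disk dnode. The following is a
simplified sketch with placeholder names (not the actual dnode_phys_t
definition from the ZFS headers), just to show where the usable bonus bytes
come from in a 512-byte dnode:

#include <stdint.h>

#define DNODE_SIZE      512     /* current fixed on-disk dnode size      */
#define DNODE_CORE_SIZE 64      /* fixed DMU bookkeeping fields (sketch) */
#define BLKPTR_SIZE     128     /* one block pointer                     */
#define MAX_BONUS_LEN   (DNODE_SIZE - DNODE_CORE_SIZE - BLKPTR_SIZE) /* 320 */

/* Layout sketch only; field names are placeholders, not the real
 * dnode_phys_t members. */
typedef struct sketch_dnode {
        uint8_t dn_core[DNODE_CORE_SIZE];  /* type, levels, nblkptr, ...  */
        uint8_t dn_blkptr[BLKPTR_SIZE];    /* first (often only) blkptr   */
        uint8_t dn_bonus[MAX_BONUS_LEN];   /* bonus area for application  */
} sketch_dnode_t;

/* Compile-time check that the pieces add back up to one 512-byte dnode. */
typedef char dnode_size_check[(sizeof(sketch_dnode_t) == DNODE_SIZE) ? 1 : -1];

The ZPL's own znode data already occupies most of that bonus area, which is
why only a few dozen bytes are said to remain.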
> One of the reasons the large inode patch made it into the Linux
> kernel quickly was because it made a big difference for Samba
> (in addition to Lustre):
>
> http://lwn.net/Articles/112571/
>
>> However, given that the ZPL isn't the only consumer of datasets,
>> and that Lustre may benefit from a larger dnode size, it may be
>> worth investigating the possibility of supporting multiple dnode
>> sizes within a single pool (this is currently not supported).
>
> Without knowing the details, it would seem at first glance that
> having variable dnode size would be fairly complex. Aren't the
> dnodes just stored in a single sparse object and accessed by
> dnode_size * objid? This does seem desirable from the POV that
> if you have an existing fs with the current dnode size you don't
> want to need a reformat in order to use the larger size.

I was referring here to supporting multiple dnode sizes within a
*pool*, but the size would still remain fixed for a given dataset
(see Bill's mail). This is a much simpler concept to implement.

>> Also, note that dnodes already have the notion of "fixed" DMU-
>> specific data and "variable" application-used data (the bonus
>> area). So even in the current code, Lustre has the ability to
>> use 320 bytes of bonus space however it wants.
>
> That is true, and we discussed this internally, but one of the internal
> requirements we have for DMU usage is that it create an on-disk layout
> that matches ZFS, so that it is possible to mount a Lustre filesystem
> via ZFS or ZFS-FUSE (and potentially the reverse in the future).
> This will allow us to do problem diagnosis and also leverage any ZFS
> scanning/verification tools that may be developed.

Ah, interesting, I was not aware of this requirement. It would not
be difficult to allow the ZPL to work with a larger dnode size (in
fact it's pretty much a no-op as long as the ZPL is not trying to use
any of the extra space in the dnode).

>> Andreas Dilger wrote:
>>> Lustre is a fairly heavy user of extended attributes on the metadata
>>> target (MDT) to record virtual file->object mappings, and we'll also
>>> begin using EAs more heavily on the object store (OST) in the near
>>> future (reverse object->file mappings, for example).
>>>
>>> One of the performance improvements we developed early on with ext3 is
>>> moving the EA into the inode to avoid seeking and full block writes for
>>> small amounts of EA data. The same could also be done to improve small
>>> file performance (though we didn't implement that). For ext3 this meant
>>> increasing the inode size from 128 bytes to a format-time constant size
>>> of 256 - 4096 bytes (chosen based on the default Lustre EA size for
>>> that fs).
>>>
>>> My understanding from brief conversations with some of the ZFS
>>> developers is that there are already some plans to enlarge the dnode
>>> because the dnode bonus buffer is getting close to being full for ZFS.
>>> Are there any details of this plan that I could read, or has it been
>>> discussed before? Due to the generality of the terms I wasn't able to
>>> find anything by search.
>>>
>>> I wanted to get the ball rolling on the large dnode discussion (which
>>> you may have already had internally, I don't know), and start a fast EA
>>> discussion in a separate thread.
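
To make the dnode_size * objid point above concrete, here is a rough sketch
(placeholder names, not DMU interfaces) of why a per-dataset, rather than
per-dnode, size keeps object lookup trivial:

#include <stdint.h>

/* Sketch only: this struct is a stand-in, not a real DMU type. */
struct sketch_dataset {
        uint32_t ds_dnode_size;  /* fixed at dataset creation, e.g. 512 or 1024 */
};

/* With a fixed per-dataset size, finding object N in the sparse
 * meta-dnode object is still just a multiply. */
static inline uint64_t
sketch_dnode_offset(const struct sketch_dataset *ds, uint64_t objid)
{
        return objid * (uint64_t)ds->ds_dnode_size;
}

If the size could instead vary per dnode, that multiply would have to become
some form of lookup, which is presumably the complexity being avoided.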
>>>
>>> One of the important design decisions made with the ext3 "large inode"
>>> space (beyond the end of the regular inode) was that there is a marker
>>> in each inode which records how much of that space was used for "fixed"
>>> fields (e.g. nanosecond timestamps, creation time, inode version) at
>>> the time the inode was last written. The space beyond "i_extra_isize"
>>> is used for extended attribute storage. If an inode is modified and the
>>> kernel code wants to store additional "fixed" fields in the inode, it
>>> will push the EAs out to external blocks to make room if there isn't
>>> enough in-inode space.
>>>
>>> By having i_extra_isize stored in each inode (actually the first 16-bit
>>> field in large inodes) we are at liberty to add new fields to the inode
>>> itself without having to do a scan/update operation on existing inodes
>>> (definitely desirable for ZFS also), and we don't have to waste a lot
>>> of "reserved" space for potential future expansion or for fields at the
>>> end that are not being used (e.g. inode version is only useful for NFSv4
>>> and Lustre). None of the "extra" fields are critical to correct
>>> operation by definition, since the code has existed until now without
>>> them... Conversely, we don't force EAs to start at a fixed offset and
>>> then use inefficient EA wrapping for small 32- or 64-bit fields.
>>>
>>> We also _discussed_ storing ext3 small file data in an EA on an
>>> opportunistic basis, along with more extent data (a la XFS). Are there
>>> plans to allow the dn_blkptr[] array to grow on a per-dnode basis to
>>> avoid spilling out to an external block for files that are smaller
>>> and/or have little/no EA data? Alternatively, it would be interesting
>>> to store file data in the (enlarged) dn_blkptr[] array for small files
>>> to avoid fragmenting the free space within the dnode.
>>>
>>> Cheers, Andreas
>>> --
>>> Andreas Dilger
>>> Principal Software Engineer
>>> Cluster File Systems, Inc.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
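
As a footnote on the i_extra_isize scheme described above, here is a minimal
sketch (placeholder names, not the actual ext3 structures) of how the
large-inode space splits between "fixed" fields and in-inode EA storage:

#include <stdint.h>

#define OLD_INODE_SIZE 128   /* the original ext3 on-disk inode */

/* Simplified sketch of a "large" ext3 on-disk inode; not the real
 * struct ext3_inode. The full inode size (256-4096 bytes) is a
 * format-time constant. */
struct sketch_large_inode {
        uint8_t  i_old[OLD_INODE_SIZE];  /* classic 128-byte inode fields    */
        uint16_t i_extra_isize;          /* bytes of extra space holding     */
                                         /* "fixed" fields (nsec times, etc.)*/
        /* ... fixed extensions, then in-inode EA storage, up to the
         * format-time inode size ... */
};

/* EAs begin right after the fixed fields, so adding a new fixed field
 * later only means bumping i_extra_isize on the next write (and pushing
 * EAs out to an external block if in-inode space runs out). */
static inline unsigned int
sketch_ea_start(const struct sketch_large_inode *inode)
{
        return OLD_INODE_SIZE + inode->i_extra_isize;
}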