Andreas Dilger wrote:
> On Sep 13, 2007  15:27 -0600, Mark Maybee wrote:
>> We have explored the idea of increasing the dnode size in the past
>> and discovered that a larger dnode size has a significant negative
>> performance impact on the ZPL (at least with our current caching
>> and read-ahead policies).  So we don't have any plans to increase
>> its size generically anytime soon.
> 
> I'm sure it depends a lot on the workload.  I don't know the details
> of how the ZFS allocators work; it seems plausible that they initially
> allocate the modified dnode and its corresponding EAs in a contiguous
> chunk, but I suspect that keeping this true over the life of the dnode
> would put an added burden on either the allocator (to know about the
> relationship) or the ZPL (to always mark both dirty to force colocation
> even if not modified).
> 
> I'd also heard that the 48 (or so) bytes that remain in the bonus buffer
> for ZFS are potentially going to be used up soon, so there would be a
> desire for a generic solution to this issue.
> 
You seem to have a line on a lot of internal development details :-).

> One of the reasons the large inode patch made it into the Linux
> kernel quickly was because it made a big difference for Samba
> (in addition to Lustre):
> 
>       http://lwn.net/Articles/112571/
> 
>> However, given that the ZPL isn't the only consumer of datasets,
>> and that Lustre may benefit from a larger dnode size, it may be
>> worth investigating the possibility of supporting multiple dnode
>> sizes within a single pool (this is currently not supported).
> 
> Without knowing the details, it would seem at first glance that
> supporting a variable dnode size would be fairly complex.  Aren't the
> dnodes just stored in a single sparse object and accessed at offset
> dnode_size * objid?  That said, variable sizing does seem desirable
> from the POV that an existing fs with the current dnode size shouldn't
> need a reformat in order to use the larger size.
> 
I was referring here to supporting multiple dnode sizes within a
*pool*, but the size would still remain fixed for a given dataset
(see Bill's mail).  This is a much simpler concept to implement.
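
For what it's worth, with a per-dataset fixed dnode size the lookup you
describe above stays a single multiply; only the per-dataset constant
changes.  A minimal sketch (hypothetical names, not the actual DMU code):

    #include <stdint.h>

    struct objset_sketch {
            uint32_t os_dnodesize;  /* fixed per dataset, e.g. 512 or 1024 */
    };

    /* Byte offset of dnode 'objid' within the dataset's metadnode object. */
    static inline uint64_t
    dnode_offset(const struct objset_sketch *os, uint64_t objid)
    {
            return (objid * (uint64_t)os->os_dnodesize);
    }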

>> Also, note that dnodes already have the notion of "fixed" DMU-
>> specific data and "variable" application-used data (the bonus
>> area).  So even in the current code, Lustre has the ability to
>> use 320 bytes of bonus space however it wants.
> 
> That is true, and we discussed this internally, but one of the internal
> requirements we have for DMU usage is that it create an on-disk layout
> that matches ZFS so that it is possible to mount a Lustre filesystem
> via ZFS or ZFS-FUSE (and potentially the reverse in the future).
> This will allow us to do problem diagnosis and also leverage any ZFS
> scanning/verification tools that may be developed.
> 
Ah, interesting, I was not aware of this requirement.  It would not be
difficult to allow the ZPL to work with a larger dnode size (in fact
it's pretty much a no-op as long as the ZPL is not trying to use any of
the extra space in the dnode).
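
For reference, the fixed/bonus split looks roughly like this in the
512-byte on-disk dnode (a simplified sketch with abbreviated field names,
not the authoritative dnode_phys_t):

    #include <stdint.h>

    #define DNODE_SIZE      512
    #define DNODE_CORE      64      /* fixed, DMU-managed header */
    #define BLKPTR_SIZE     128     /* one embedded block pointer */
    #define MAX_BONUSLEN    (DNODE_SIZE - DNODE_CORE - BLKPTR_SIZE) /* 320 */

    struct dnode_sketch {
            uint8_t  dn_type;                  /* DMU object type */
            uint8_t  dn_bonustype;             /* what the bonus area holds */
            uint16_t dn_bonuslen;              /* bonus bytes actually used */
            uint8_t  dn_core[DNODE_CORE - 4];  /* rest of the fixed header */
            uint8_t  dn_blkptr[BLKPTR_SIZE];   /* first block pointer */
            uint8_t  dn_bonus[MAX_BONUSLEN];   /* application data: ZPL znode,
                                                  or Lustre's 320 bytes */
    };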

>> Andreas Dilger wrote:
>>> Lustre is a fairly heavy user of extended attributes on the metadata target
>>> (MDT) to record virtual file->object mappings, and we'll also begin using
>>> EAs more heavily on the object store (OST) in the near future (reverse
>>> object->file mappings for example).
>>>
>>> One of the performance improvements we developed early on with ext3 is
>>> moving the EA into the inode to avoid seeking and full block writes for
>>> small amounts of EA data.  The same could also be done to improve small
>>> file performance (though we didn't implement that).  For ext3 this meant
>>> increasing the inode size from 128 bytes to a format-time constant size of
>>> 256 - 4096 bytes (chosen based on the default Lustre EA size for that fs).
>>>
>>> My understanding from brief conversations with some of the ZFS developers
>>> is that there are already some plans to enlarge the dnode, because the
>>> dnode bonus buffer is getting close to being full for ZFS.  Are there
>>> any details of this plan that I could read, or has it been discussed
>>> before?  Due to the generality of the terms I wasn't able to find
>>> anything by searching.  I wanted to get the ball rolling on the large
>>> dnode discussion (which you may have already had internally, I don't
>>> know), and start a fast EA discussion in a separate thread.
>>>
>>> One of the important design decisions made with the ext3 "large inode"
>>> space (beyond the end of the regular inode) was that there was a marker
>>> in each inode which records how much of that space was used for "fixed"
>>> fields (e.g. nanosecond timestamps, creation time, inode version) at the
>>> time the inode was last written.  The space beyond "i_extra_isize" is
>>> used for extended attribute storage.  If an inode is modified and the
>>> kernel code wants to store additional "fixed" fields in the inode, it
>>> will push the EAs out to external blocks to make room if there isn't
>>> enough in-inode space.
>>>
>>> By having i_extra_isize stored in each inode (actually the first 16-bit
>>> field in large inodes) we are at liberty to add new fields to the inode
>>> itself without having to do a scan/update operation on existing inodes
>>> (definitely desirable for ZFS also) and we don't have to waste a lot
>>> of "reserved" space for potential future expansion or for fields at the
>>> end that are not being used (e.g. inode version is only useful for NFSv4
>>> and Lustre).  None of the "extra" fields are critical to correct operation
>>> by definition, since the code has existed until now without them...
>>> Conversely, we don't force EAs to start at a fixed offset and then use
>>> inefficient EA wrapping for small 32- or 64-bit fields.
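
A rough C sketch of the large-inode tail described above (illustrative
field names only; the authoritative layout is in the ext3 headers):

    #include <stdint.h>

    /* Bytes 128..inode_size-1 of an ext3 "large" on-disk inode. */
    struct ext3_inode_tail_sketch {
            uint16_t i_extra_isize;   /* bytes of "fixed" fields below that
                                         were in use when the inode was
                                         last written */
            uint16_t i_pad1;
            /* ... newer "fixed" fields (nanosecond timestamps, creation
               time, inode version) get appended here over time ... */
    };

    /* In-inode EA storage begins right after the fixed fields in use,
       i.e. at offset 128 + i_extra_isize, and runs to the end of the
       on-disk inode. */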
>>>
>>> We also _discussed_ storing ext3 small file data in an EA on an
>>> opportunistic basis, along with more extent data (a la XFS).  Are there
>>> plans to allow the dn_blkptr[] array to grow on a per-dnode basis to
>>> avoid spilling out to an external block for files that are smaller and/or
>>> have little or no EA data?  Alternatively, it would be interesting to
>>> store file data in the (enlarged) dn_blkptr[] array for small files to
>>> avoid fragmenting the free space within the dnode.
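
To make the trade-off concrete, here is a hypothetical policy sketch for
the idea above; none of these names reflect actual DMU interfaces:

    #include <stddef.h>

    #define DNODE_SPARE_SPACE 320   /* e.g. the bonus area, or room gained
                                       by growing dn_blkptr[] per dnode */

    enum placement { EMBED_IN_DNODE, EXTERNAL_BLOCK };

    /* fixed_bytes: space already claimed by fixed per-file state (znode,
       Lustre EA header, etc.); embed only when everything fits. */
    static enum placement
    choose_placement(size_t fixed_bytes, size_t ea_bytes, size_t file_bytes)
    {
            if (fixed_bytes + ea_bytes + file_bytes <= DNODE_SPARE_SPACE)
                    return (EMBED_IN_DNODE);
            return (EXTERNAL_BLOCK);
    }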
>>>
>>>
>>> Cheers, Andreas
>>> --
>>> Andreas Dilger
>>> Principal Software Engineer
>>> Cluster File Systems, Inc.
>>>
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
> 
