On Sep 13, 2007  15:27 -0600, Mark Maybee wrote:
> We have explored the idea of increasing the dnode size in the past
> and discovered that a larger dnode size has a significant negative
> performance impact on the ZPL (at least with our current caching
> and read-ahead policies).  So we don't have any plans to increase
> its size generically anytime soon.

I'm sure it depends a lot on the workload.  I don't know the details
of how the ZFS allocators work; it seems possible that they always
allocate the modified dnode and the corresponding EAs in a contiguous
chunk initially, but I suspect that keeping this true over the life
of the dnode would put an added burden on either the allocator (which
would have to know about the relationship) or the ZPL (which would have
to mark both dirty to force colocation even when nothing was modified).

I'd also heard that the 48 (or so) bytes remaining in the ZFS bonus
buffer may be used up soon, so there should be some desire for a
generic solution to this issue.

One of the reasons the large inode patch made it into the Linux
kernel quickly was because it made a big difference for Samba
(in addition to Lustre):

        http://lwn.net/Articles/112571/

> However, given that the ZPL isn't the only consumer of datasets,
> and that Lustre may benefit from a larger dnode size, it may be
> worth investigating the possibility of supporting multiple dnode
> sizes within a single pool (this is currently not supported).

Without knowing the details, it would seem at first glance that
having a variable dnode size would be fairly complex.  Aren't the
dnodes just stored in a single sparse object and accessed at offset
dnode_size * objid?  That said, supporting a larger dnode size does
seem desirable from the POV that an existing fs with the current
dnode size shouldn't need a reformat in order to use the larger size.
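
To make the addressing question concrete, here is roughly what I am
assuming (this is not the actual DMU code, and the per-dataset dnode
size parameter is purely hypothetical):

        #include <stdint.h>

        /* current fixed on-disk dnode size */
        #define DNODE_SIZE              512

        /* byte offset of object "objid" within the metadnode object */
        static uint64_t
        dnode_offset(uint64_t objid)
        {
                return (objid * DNODE_SIZE);
        }

        /*
         * If the dnode size were instead recorded once per dataset (a
         * hypothetical per-objset "dnode size" property), the lookup
         * stays O(1) and existing datasets keep their current size:
         */
        static uint64_t
        dnode_offset_var(uint64_t objid, uint32_t dset_dnodesize)
        {
                return (objid * (uint64_t)dset_dnodesize);
        }

Fully per-dnode variable sizes, on the other hand, would break this
direct indexing, which is why a per-dataset (or per-pool) size looks
like the more tractable option to me.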

> Also, note that dnodes already have the notion of "fixed" DMU-
> specific data and "variable" application-used data (the bonus
> area).  So even in the current code, Lustre has the ability to
> use 320 bytes of bonus space however it wants.

That is true, and we discussed this internally, but one of our
requirements for DMU usage is that it create an on-disk layout
that matches ZFS so that it is possible to mount a Lustre filesystem
via ZFS or ZFS-FUSE (and potentially the reverse in the future).
This will allow us to do problem diagnosis and also leverage any ZFS
scanning/verification tools that may be developed.

> Andreas Dilger wrote:
> >Lustre is a fairly heavy user of extended attributes on the metadata target
> >(MDT) to record virtual file->object mappings, and we'll also begin using
> >EAs more heavily on the object store (OST) in the near future (reverse
> >object->file mappings for example).
> >
> >One of the performance improvements we developed early on with ext3 was
> >moving the EA into the inode to avoid seeking and full-block writes for
> >small amounts of EA data.  The same could also be done to improve small
> >file performance (though we didn't implement that).  For ext3 this meant
> >increasing the inode size from 128 bytes to a format-time constant size of
> >256-4096 bytes (chosen based on the default Lustre EA size for that fs).
> >
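[ Interjecting for context: the ext3 inode size is a format-time choice,
  so an MDT formatted for a larger in-inode EA area would be created with
  something like the following -- the 1024 is only an illustrative value,
  and the device name is a placeholder: ]

        mke2fs -j -I 1024 /dev/XXX
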
> >My understanding from brief conversations with some of the ZFS developers
> >is that there are already some plans to enlarge the dnode because the
> >dnode bonus buffer is getting close to being full for ZFS.  Are there any
> >details of this plan that I could read, or has it been discussed before?
> >Due to the generality of the terms I wasn't able to find anything by
> >searching.  I wanted to get the ball rolling on the large dnode discussion
> >(which you may have already had internally, I don't know), and start a
> >fast EA discussion in a separate thread.
> >
> >One of the important design decisions made with the ext3 "large inode"
> >space (beyond the end of the regular inode) was to keep a marker in each
> >inode which records how much of that space was used for "fixed" fields
> >(e.g. nanosecond timestamps, creation time, inode version) at the time the
> >inode was last written.  The space beyond "i_extra_isize" is used for
> >extended attribute storage.  If an inode is modified and the kernel code
> >wants to store additional "fixed" fields in the inode, it will push the EAs
> >out to external blocks to make room if there isn't enough in-inode space.
> >
> >By having i_extra_isize stored in each inode (it is actually the first
> >16-bit field in large inodes), we are at liberty to add new fields to the
> >inode itself without having to do a scan/update operation on existing
> >inodes (definitely desirable for ZFS also), and we don't have to waste a
> >lot of "reserved" space for potential future expansion or for fields at
> >the end that are not being used (e.g. the inode version is only useful for
> >NFSv4 and Lustre).  None of the "extra" fields are critical to correct
> >operation by definition, since the code has existed until now without
> >them...  Conversely, we don't force EAs to start at a fixed offset and
> >then use inefficient EA wrapping for small 32- or 64-bit fields.
> >
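[ To make the layout described above concrete, a simplified sketch --
  the constant below matches the kernel's name, but the helper function
  is only an illustration, not the actual ext3 code: ]

        /*
         * Large-inode layout, roughly:
         *
         *   0                  128    128 + i_extra_isize        inode_size
         *   +-------------------+--------------------------+---------------+
         *   | original 128-byte | "fixed" extra fields in  | in-inode EA   |
         *   | inode             | use at last write        | storage       |
         *   +-------------------+--------------------------+---------------+
         */
        #define EXT3_GOOD_OLD_INODE_SIZE        128

        /* start of the in-inode EA area for a given on-disk inode */
        static char *
        in_inode_ea_start(char *raw_inode, unsigned short i_extra_isize)
        {
                return (raw_inode + EXT3_GOOD_OLD_INODE_SIZE + i_extra_isize);
        }
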
> >We also _discussed_ storing ext3 small-file data in an EA on an
> >opportunistic basis along with more extent data (a la XFS).  Are there
> >plans to allow the dn_blkptr[] array to grow on a per-dnode basis to
> >avoid spilling out to an external block for smaller files and/or files
> >with little or no EA data?  Alternatively, it would be interesting to
> >store file data in the (enlarged) dn_blkptr[] array for small files to
> >avoid fragmenting the free space within the dnode.
> >
> >
> >Cheers, Andreas
> >--
> >Andreas Dilger
> >Principal Software Engineer
> >Cluster File Systems, Inc.
> >

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

