The performance benchmarks that Mark refers to are valid for our current
ZPL implementation.  That is, the bonus buffer only contains the znode
and symlink contents.  If, however, we had an application that always
had an extended attribute, and that extended attribute was frequently
accessed, then I think there would be (as Andreas points out) a
significant performance advantage to having the XATTR in the dnode
somewhere.

I think there are a couple of issues here.  The first one is to allow
each dataset to have its own dnode size.  While conceptually not all
that hard, it would take some re-jiggering of the code to turn most of
the #defines into per-dataset variables.  But it should be pretty
straightforward, and probably not a bad idea in general.
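
Just to make the idea concrete, here is a rough sketch of the kind of
change I mean; the os_dnodesize field and dmu_objset_dnode_size()
helper are made-up names for illustration, not existing DMU interfaces:

    #include <sys/types.h>

    /*
     * Sketch only: today the dnode size is a compile-time constant
     * (DNODE_SHIFT/DNODE_SIZE).  Making it per-dataset would mean
     * recording the size in the objset and replacing most uses of the
     * #define with a lookup along these lines.
     */
    typedef struct objset_extra {           /* hypothetical */
            uint64_t os_dnodesize;          /* bytes per dnode, set at create */
    } objset_extra_t;

    static uint64_t
    dmu_objset_dnode_size(const objset_extra_t *ose)    /* hypothetical */
    {
            /* fall back to the traditional 512-byte dnode */
            return (ose->os_dnodesize != 0 ? ose->os_dnodesize : 512ULL);
    }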

The other issue is a little more sticky.  My understanding is that
Lustre-on-DMU plans to use the same data structures as the ZPL.  That
way, you can mount the Lustre metadata or object stores as a regular
filesystem.  Given this, the question is what changes, if any, should be
made to the ZPL to accommodate this.  Allowing the ZPL to deal with
non-512-byte dnodes is probably not that bad.  The stickier question is
whether or not the ZPL should be made to understand the extended
attributes (or whatever else) that are stored in the rest of the bonus
buffer.
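
For what I'm picturing, think of something like ext3's i_extra_isize
trick: a small, self-describing header after the znode so the ZPL can
step over data it doesn't recognize.  The names below (bonus_extra_hdr_t
and its fields) are invented for illustration only:

    #include <sys/types.h>

    /*
     * Illustrative layout for an enlarged bonus buffer:
     *
     *   +--------------+------------------+------------------------+
     *   | znode_phys_t | bonus_extra_hdr  | payload (EA, LOV, ...) |
     *   +--------------+------------------+------------------------+
     *
     * The ZPL only needs to know how to skip records whose type it
     * doesn't understand; Lustre (or pNFS, or CIFS) interprets its own.
     */
    typedef struct bonus_extra_hdr {        /* hypothetical */
            uint16_t be_type;               /* what the payload is */
            uint16_t be_length;             /* payload bytes that follow */
    } bonus_extra_hdr_t;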

While the Lustre guys may be the first to venture into this area, it
will come up anyway with pNFS or the CIFS server, so we should probably
spend some brain cycles thinking about the best way to have extra data
(of various sorts) in larger-than-normal dnodes that the ZPL can deal
with.

A simple plan may be that the first extended attribute is stored in the
bonus buffer (if it fits).  I don't know if this would require the same
logic we used to have that placed small file contents in the bonus
buffer.  Unfortunately, that code was *way* complicated and was ripped
out some time ago.
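
The fit check itself ought to be simple, though; something like the
following (hypothetical helper and size names, assuming a small
per-record header like the one sketched above):

    #include <sys/types.h>

    #define BONUS_XATTR_HDR_LEN     4       /* hypothetical record header */

    /*
     * Store the first EA in the bonus buffer only if the whole
     * name + value fits after the znode; otherwise fall back to the
     * existing external xattr directory.  No wrapping or splitting.
     */
    static boolean_t
    xattr_fits_in_bonus(size_t bonus_len, size_t znode_len,
        size_t name_len, size_t value_len)
    {
            size_t need = BONUS_XATTR_HDR_LEN + name_len + 1 + value_len;

            if (bonus_len <= znode_len)
                    return (B_FALSE);
            return (need <= bonus_len - znode_len ? B_TRUE : B_FALSE);
    }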

If storing an extended attribute in the bonus buffer won't work, the
question becomes how to put the Lustre LOV data into the dnode/znode so
we get the performance benefits, but using an implementation that we can
live with.  Of course, one option would be to give up on the Lustre/ZPL
compatibility, but I don't think that's such a good plan.  As I
mentioned earlier, I think that pNFS and CIFS will wind up running into
similar issues, so we'll have to deal with such a thing sooner or later.
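
For reference, the LOV data we're talking about is a small, fixed-format
striping descriptor; simplified (and loosely modeled on Lustre's
lov_mds_md, so don't take the field list as exact), it looks roughly
like this:

    #include <sys/types.h>

    /*
     * Simplified illustration of the striping descriptor Lustre keeps
     * in its LOV EA (loosely modeled on lov_mds_md; not the exact
     * on-disk layout).  For small stripe counts this is only a few
     * tens of bytes, which is why keeping it in the dnode, rather
     * than in a separate block, is worth the trouble.
     */
    struct lov_stripe_desc {                /* hypothetical/simplified */
            uint32_t lsd_magic;             /* layout magic/version */
            uint32_t lsd_stripe_count;      /* number of OST objects */
            uint32_t lsd_stripe_size;       /* bytes per stripe */
            struct {
                    uint64_t lso_object_id; /* object id on the OST */
                    uint32_t lso_ost_idx;   /* which OST holds it */
            } lsd_objects[1];               /* lsd_stripe_count entries */
    };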

Ideas?


--Bill

On Thu, Sep 13, 2007 at 03:27:24PM -0600, Mark Maybee wrote:
> Andreas,
> 
> We have explored the idea of increasing the dnode size in the past
> and discovered that a larger dnode size has a significant negative
> performance impact on the ZPL (at least with our current caching
> and read-ahead policies).  So we don't have any plans to increase
> its size generically anytime soon.
> 
> However, given that the ZPL isn't the only consumer of datasets,
> and that Lustre may benefit from a larger dnode size, it may be
> worth investigating the possibility of supporting multiple dnode
> sizes within a single pool (this is currently not supported).
> 
> Also, note that dnodes already have the notion of "fixed" DMU-
> specific data and "variable" application-used data (the bonus
> area).  So even in the current code, Lustre has the ability to
> use 320 bytes of bonus space however it wants.
> 
> -Mark
> 
> Andreas Dilger wrote:
> > Hello,
> > as a brief introduction, I'm one of the developers of Lustre
> > (www.lustre.org) at CFS and we are porting over Lustre to use ZFS (well,
> > technically just the DMU) for back-end storage of Lustre.  We currently
> > use a modified ext3/4 filesystem for the back-end storage (both data and
> > metadata) fairly successfully (single filesystems of up to 2PB with up
> > to 500 back-end ext3 file stores and getting 50GB/s aggregate throughput
> > in some installations).
> > 
> > Lustre is a fairly heavy user of extended attributes on the metadata target
> > (MDT) to record virtual file->object mappings, and we'll also begin using
> > EAs more heavily on the object store (OST) in the near future (reverse
> > object->file mappings for example).
> > 
> > One of the performance improvements we developed early on with ext3 is
> > moving the EA into the inode to avoid seeking and full block writes for
> > small amounts of EA data.  The same could also be done to improve small
> > file performance (though we didn't implement that).  For ext3 this meant
> > increasing the inode size from 128 bytes to a format-time constant size of
> > 256 - 4096 bytes (chosen based on the default Lustre EA size for that fs).
> > 
> > My understanding from brief conversations with some of the ZFS developers
> > is that there are already some plans to enlarge the dnode because
> > the dnode bonus buffer is getting close to being full for ZFS.  Are there
> > any details of this plan that I could read, or has it been discussed before?
> > Due to the generality of the terms I wasn't able to find anything by search.
> > I wanted to get the ball rolling on the large dnode discussion (which
> > you may have already had internally, I don't know), and start a fast EA
> > discussion in a separate thread.
> > 
> > 
> > 
> > One of the important design decisions made with the ext3 "large inode" space
> > (beyond the end of the regular inode) was that there was a marker in each
> > inode which records how much of that space was used for "fixed" fields
> > (e.g. nanosecond timestamps, creation time, inode version) at the time the
> > inode was last written.  The space beyond "i_extra_isize" is used for
> > extended attribute storage.  If an inode is modified and the kernel code
> > wants to store additional "fixed" fields in the inode it will push the EAs
> > out to external blocks to make room if there isn't enough in-inode space.
> > 
> > By having i_extra_isize stored in each inode (actually the first 16-bit
> > field in large inodes) we are at liberty to add new fields to the inode
> > itself without having to do a scan/update operation on existing inodes
> > (definitely desirable for ZFS also) and we don't have to waste a lot
> > of "reserved" space for potential future expansion or for fields at the
> > end that are not being used (e.g. inode version is only useful for NFSv4
> > and Lustre).  None of the "extra" fields are critical to correct operation
> > by definition, since the code has existed until now without them...
> > Conversely, we don't force EAs to start at a fixed offset and then use
> > inefficient EA wrapping for small 32- or 64-bit fields.
> > 
> > We also _discussed_ storing ext3 small file data in an EA on an
> > opportunistic basis along with more extent data (ala XFS).  Are there
> > plans to allow the dn_blkptr[] array to grow on a per-dnode basis to
> > avoid spilling out to an external block for files that are smaller and/or
> > have little/no EA data?  Alternately, it would be interesting to store
> > file data in the (enlarged) dn_blkptr[] array for small files to avoid
> > fragmenting the free space within the dnode.
> > 
> > 
> > Cheers, Andreas
> > --
> > Andreas Dilger
> > Principal Software Engineer
> > Cluster File Systems, Inc.
> > 
