On Sep 13, 2007, at 5:48 PM, Bill Moore wrote:

> The performance benchmarks that Mark refers to are valid for our current
> ZPL implementation.  That is, the bonus buffer only contains the znode
> and symlink contents.  If, however, we had an application that always
> had an extended attribute, and that extended attribute was frequently
> accessed, then I think there would be (as Andreas points out) a
> significant performance advantage to having the XATTR in the dnode
> somewhere.
>
> I think there are a couple of issues here.  The first one is to allow
> each dataset to have its own dnode size.  While conceptually not all
> that hard, it would take some re-jiggering of the code to make most of
> the #defines turn into per-dataset variables.  But it should be pretty
> straightforward, and probably not a bad idea in general.
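>
> Roughly the kind of change I have in mind, sketched with made-up names
> (this is not the actual ZFS code, just an illustration):
>
> #include <stdint.h>
>
> /*
>  * Today the dnode size is a compile-time constant, essentially
>  * DNODE_SIZE == 1 << DNODE_SHIFT == 512 bytes, and it is used all
>  * over the place.  Per-dataset, it would become a field carried on
>  * the objset and consulted instead of the #define.
>  */
> typedef struct objset_sketch {
>         uint32_t os_dnodesize;  /* chosen at dataset creation time */
> } objset_sketch_t;
>
> static inline uint32_t
> dnode_size(const objset_sketch_t *os)
> {
>         /* Existing datasets keep the historical 512-byte dnodes. */
>         return (os->os_dnodesize != 0 ? os->os_dnodesize : 512);
> }
>
> Most of the work would be plumbing something like that pointer into the
> places that currently use the #defines directly.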
>
> The other issue is a little more sticky.  My understanding is that
> Lustre-on-DMU plans to use the same data structures as the ZPL.  That
> way, you can mount the Lustre metadata or object stores as a regular
> filesystem.  Given this, the question is what changes, if any, should
> be made to the ZPL to accommodate that.  Allowing the ZPL to deal with
> non-512-byte dnodes is probably not that bad.  The question is whether
> or not the ZPL should be made to understand the extended attributes (or
> whatever else) that are stored in the rest of the bonus buffer.
>
> While the Lustre guys may be the first to venture into this area, it
> will come up anyway with pNFS or the CIFS server, so we should probably
> spend some brain cycles thinking about the best way to have extra data
> (of various sorts) in larger-than-normal dnodes that the ZPL can deal
> with.

Yeah, the pNFS metadata server is going to use EAs for the layout
information.  The pNFS data server is bypassing the ZPL and going
directly to the DMU.

For the pNFS people, do you have any feeling for how big an EA you will
need for the layout information?  Are you planning on using just one EA?
I'm wondering if the bonus buffer of 320 bytes would suffice.

eric

>
> A simple plan may be that the first extended attribute is stored in the
> bonus buffer (if it fits).  I don't know if this would require the same
> logic we used to have that placed small file contents in the bonus
> buffer.  Unfortunately, that code was *way* complicated and was ripped
> out some time ago.
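>
> To illustrate the idea (the record format here is hypothetical, just to
> show the shape of it, not a proposal for the actual on-disk layout):
>
> #include <stdint.h>
> #include <string.h>
>
> #define BONUS_LEN       320     /* current ZPL bonus buffer size */
>
> /* Hypothetical packed EA record appended after the znode data. */
> typedef struct bonus_xattr {
>         uint8_t  bx_namelen;    /* length of the attribute name */
>         uint16_t bx_valuelen;   /* length of the attribute value */
>         char     bx_data[];     /* name bytes followed by value bytes */
> } bonus_xattr_t;
>
> /*
>  * Store the first EA in the bonus tail if it fits after 'znode_used'
>  * bytes of znode data; return 0 if it has to spill out to the external
>  * xattr directory the way it does today.  (A real version would also
>  * have to worry about alignment and endianness.)
>  */
> static int
> bonus_store_first_ea(uint8_t *bonus, size_t znode_used,
>     const char *name, const void *value, uint16_t valuelen)
> {
>         size_t namelen = strlen(name);
>         size_t need = sizeof (bonus_xattr_t) + namelen + valuelen;
>
>         if (namelen > 255 || znode_used + need > BONUS_LEN)
>                 return (0);
>
>         bonus_xattr_t *bx = (bonus_xattr_t *)(bonus + znode_used);
>         bx->bx_namelen = (uint8_t)namelen;
>         bx->bx_valuelen = valuelen;
>         memcpy(bx->bx_data, name, namelen);
>         memcpy(bx->bx_data + namelen, value, valuelen);
>         return (1);
> }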
>
> If the bonus buffer containing an extended attribute won't work, the
> question becomes how to put the Lustre LOV data into the dnode/znode so
> we get the performance benefits, but using an implementation that we
> can live with.  Of course, one option would be to give up on the
> Lustre/ZPL compatibility, but I don't think that's such a good plan.
> Like I mentioned earlier, I think that pNFS and CIFS will wind up
> running into similar issues, so we'll have to deal with such a thing
> sooner or later.
>
> Ideas?
>
>
> --Bill
>
> On Thu, Sep 13, 2007 at 03:27:24PM -0600, Mark Maybee wrote:
>> Andreas,
>>
>> We have explored the idea of increasing the dnode size in the past
>> and discovered that a larger dnode size has a significant negative
>> performance impact on the ZPL (at least with our current caching
>> and read-ahead policies).  So we don't have any plans to increase
>> its size generically anytime soon.
>>
>> However, given that the ZPL isn't the only consumer of datasets,
>> and that Lustre may benefit from a larger dnode size, it may be
>> worth investigating the possibility of supporting multiple dnode
>> sizes within a single pool (this is currently not supported).
>>
>> Also, note that dnodes already have the notion of "fixed" DMU-
>> specific data and "variable" application-used data (the bonus
>> area).  So even in the current code, Lustre has the ability to
>> use 320 bytes of bonus space however it wants.
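>>
>> To be concrete about where those 320 bytes live (this is a simplified
>> sketch, not the real dnode_phys_t, which has more fixed fields):
>>
>> #include <stdint.h>
>>
>> #define MAX_BONUSLEN    320     /* DN_MAX_BONUSLEN in the source */
>>
>> /*
>>  * Tail of a 512-byte dnode: the DMU only interprets dn_bonustype and
>>  * dn_bonuslen; the bonus bytes themselves are an opaque blob owned by
>>  * the consumer.  The ZPL stores a znode_phys_t (plus symlink contents)
>>  * there; Lustre could store its own structure instead.
>>  */
>> typedef struct dnode_tail_sketch {
>>         uint8_t  dn_bonustype;            /* which consumer format is inside */
>>         uint16_t dn_bonuslen;             /* bytes of bonus actually used */
>>         uint8_t  dn_blkptr[128];          /* one 128-byte block pointer */
>>         uint8_t  dn_bonus[MAX_BONUSLEN];  /* application-defined data */
>> } dnode_tail_sketch_t;
>>
>> A DMU consumer gets at this area through the bonus buffer interfaces
>> (dmu_bonus_hold() and friends) and simply casts the bonus data to its
>> own structure.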
>>
>> -Mark
>>
>> Andreas Dilger wrote:
>>> Hello,
>>> as a brief introduction, I'm one of the developers of Lustre
>>> (www.lustre.org) at CFS, and we are porting Lustre over to use ZFS
>>> (well, technically just the DMU) for back-end storage.  We currently
>>> use a modified ext3/4 filesystem for the back-end storage (both data
>>> and metadata) fairly successfully (single filesystems of up to 2PB
>>> with up to 500 back-end ext3 file stores and getting 50GB/s aggregate
>>> throughput in some installations).
>>>
>>> Lustre is a fairly heavy user of extended attributes on the metadata
>>> target (MDT) to record virtual file->object mappings, and we'll also
>>> begin using EAs more heavily on the object store (OST) in the near
>>> future (reverse object->file mappings, for example).
>>>
>>> One of the performance improvements we developed early on with ext3
>>> is moving the EA into the inode to avoid seeking and full block
>>> writes for small amounts of EA data.  The same could also be done to
>>> improve small file performance (though we didn't implement that).
>>> For ext3 this meant increasing the inode size from 128 bytes to a
>>> format-time constant size of 256-4096 bytes (chosen based on the
>>> default Lustre EA size for that fs).
>>>
>>> My understanding from brief conversations with some of the ZFS
>>> developers is that there are already some plans to enlarge the dnode,
>>> because the dnode bonus buffer is getting close to being full for
>>> ZFS.  Are there any details of this plan that I could read, or has it
>>> been discussed before?  Due to the generality of the terms I wasn't
>>> able to find anything by searching.  I wanted to get the ball rolling
>>> on the large dnode discussion (which you may have already had
>>> internally, I don't know), and start a fast-EA discussion in a
>>> separate thread.
>>>
>>>
>>>
>>> One of the important design decisions made with the ext3 "large
>>> inode" space (beyond the end of the regular inode) was that there was
>>> a marker in each inode which records how much of that space was used
>>> for "fixed" fields (e.g. nanosecond timestamps, creation time, inode
>>> version) at the time the inode was last written.  The space beyond
>>> "i_extra_isize" is used for extended attribute storage.  If an inode
>>> is modified and the kernel code wants to store additional "fixed"
>>> fields in the inode, it will push the EAs out to external blocks to
>>> make room if there isn't enough in-inode space.
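>>>
>>> Schematically (simplified; the real struct ext3_inode has many more
>>> fields, but the offsets match the scheme described above):
>>>
>>> #include <stdint.h>
>>>
>>> #define EXT3_GOOD_OLD_INODE_SIZE  128   /* the classic ext3 inode */
>>>
>>> struct large_inode_sketch {
>>>         uint8_t  base[EXT3_GOOD_OLD_INODE_SIZE]; /* original 128-byte inode */
>>>         uint16_t i_extra_isize; /* first field past the old inode: how many
>>>                                  * bytes of extra "fixed" fields (nanosecond
>>>                                  * timestamps, inode version, ...) follow */
>>>         /*
>>>          * ... i_extra_isize - 2 further bytes of fixed fields here ...
>>>          *
>>>          * Everything from offset 128 + i_extra_isize to the end of
>>>          * the (256-4096 byte) inode holds in-inode EAs.  New fixed
>>>          * fields are added by growing i_extra_isize, pushing the EAs
>>>          * out to an external block if there is no room.
>>>          */
>>> };
>>>
>>> /* Byte offset at which in-inode EA storage begins. */
>>> static inline unsigned int
>>> in_inode_ea_offset(const struct large_inode_sketch *inode)
>>> {
>>>         return (EXT3_GOOD_OLD_INODE_SIZE + inode->i_extra_isize);
>>> }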
>>>
>>> By having i_extra_isize stored in each inode (actually the first
>>> 16-bit field in large inodes) we are at liberty to add new fields to
>>> the inode itself without having to do a scan/update operation on
>>> existing inodes (definitely desirable for ZFS also), and we don't
>>> have to waste a lot of "reserved" space for potential future
>>> expansion or for fields at the end that are not being used (e.g.
>>> inode version is only useful for NFSv4 and Lustre).  None of the
>>> "extra" fields are critical to correct operation by definition, since
>>> the code has existed until now without them...  Conversely, we don't
>>> force EAs to start at a fixed offset and then use inefficient EA
>>> wrapping for small 32- or 64-bit fields.
>>>
>>> We also _discussed_ storing ext3 small file data in an EA on an
>>> opportunistic basis, along with more extent data (a la XFS).  Are
>>> there plans to allow the dn_blkptr[] array to grow on a per-dnode
>>> basis to avoid spilling out to an external block for files that are
>>> smaller and/or have little/no EA data?  Alternately, it would be
>>> interesting to store file data in the (enlarged) dn_blkptr[] array
>>> for small files to avoid fragmenting the free space within the dnode.
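>>>
>>> As I understand the current layout (please correct me if I have this
>>> wrong), the dn_blkptr[] array and the bonus buffer already share the
>>> 448-byte tail of the 512-byte dnode, so growing one shrinks the other:
>>>
>>> #include <assert.h>
>>>
>>> #define DNODE_SIZE      512
>>> #define DNODE_CORE      64      /* fixed DMU-managed fields */
>>> #define BLKPTR_SIZE     128     /* sizeof (blkptr_t) */
>>> #define MAX_NBLKPTR     3       /* at most 3 block pointers fit */
>>>
>>> static unsigned int
>>> bonus_space_left(unsigned int nblkptr)
>>> {
>>>         assert(nblkptr >= 1 && nblkptr <= MAX_NBLKPTR);
>>>         return (DNODE_SIZE - DNODE_CORE - nblkptr * BLKPTR_SIZE);
>>> }
>>>
>>> int
>>> main(void)
>>> {
>>>         assert(bonus_space_left(1) == 320);  /* what the ZPL uses today */
>>>         assert(bonus_space_left(3) == 64);   /* more blkptrs, less bonus */
>>>         return (0);
>>> }
>>>
>>> With a larger dnode the same trade-off would apply, just with more
>>> room to play with on both sides.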
>>>
>>>
>>> Cheers, Andreas
>>> --
>>> Andreas Dilger
>>> Principal Software Engineer
>>> Cluster File Systems, Inc.
>>>

