On Feb 11, 2008, at 8:42 PM, Andreas Dilger wrote:

> On Feb 09, 2008  08:10 -0700, Mark Maybee wrote:
>>> Hi Matthew,
>>> There's also the question of the attribute size limit. From
>>> conversations I had with Andreas, I got the feeling that the Lustre
>>> layout information could be quite big when a file is striped across
>>> all OSTs, and I would imagine this would become an even bigger concern
>>> in the future if we intend to continue scaling horizontally.
>>
>> By their very nature, these attributes will have a size limitation.  I
>> don't really see the point in allowing a size larger than the supported
>> block size.  64K seems reasonable.  I think that very large values
>> could possibly be supported using some form of indirection: storing a
>> block pointer for the value, or storing an object ID for values that
>> span multiple blocks.
>
> For the most common case in Lustre, the striping attribute will be
> relatively small (in the range of 80 - 128 bytes).  In other cases
> (less common, but still present) the current ext3 size limit of 4096
> bytes is already a limiting factor on the striping of a file, directly
> affecting the total bandwidth that can be allocated to a single file.
> It is reasonable to have Lustre striping attributes up to the 16kB -
> 24kB range, at which point we will have a different (more efficient)
> mechanism for storing large stripes, but it needs some upcoming
> infrastructure first.  There is very little use in the middle range.

Another data point: I would expect the pNFS striping information to
generally be in the 128 byte range, with some growth to 256 bytes in
the common case.

Spencer

>
> It seems possible that we may need to have two separate mechanisms for
> storing the small attributes and storing the large ones.  The small
> Lustre SAs will be stored in the dnode, and the large ones in the
> existing xattr mechanism.  Given that some applications (ZFS/OSX/pNFS)
> need to be able to fall back to looking in an xattr for the data they
> need for compatibility, this isn't any extra overhead.
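
The two-tier scheme described above (small attributes in the dnode, large ones in the existing xattr mechanism, with a fallback on lookup) might be sketched roughly as follows. This is a toy model in Python for brevity, not ZFS or Lustre code: `FileObject`, `set_attr`, `get_attr` and the 256-byte `INLINE_LIMIT` are all hypothetical names and values chosen for illustration.

```python
INLINE_LIMIT = 256  # hypothetical per-attribute budget for in-dnode storage, in bytes


class FileObject:
    """Toy stand-in for a file with two attribute stores."""

    def __init__(self):
        self.dnode_attrs = {}  # small and fast: kept with the dnode
        self.xattrs = {}       # large and slower: the existing xattr mechanism


def set_attr(obj, name, value):
    """Store small values in the dnode and large ones in the xattr store."""
    # Drop any stale copy so the value lives in exactly one place.
    obj.dnode_attrs.pop(name, None)
    obj.xattrs.pop(name, None)
    if len(value) <= INLINE_LIMIT:
        obj.dnode_attrs[name] = value
        return "dnode"
    obj.xattrs[name] = value
    return "xattr"


def get_attr(obj, name):
    """Look in the dnode first, then fall back to the xattr store."""
    if name in obj.dnode_attrs:
        return obj.dnode_attrs[name]
    return obj.xattrs.get(name)
```

Under these assumptions a typical 80 - 128 byte striping attribute stays in the dnode, while a 16kB one lands in the xattr store; the reader does the same two-step lookup either way.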
>
>>> Anyway, your proposal is interesting, but there's also one thing I
>>> would like to add:
>>>
>>> Could we have a special integer value that would essentially mean
>>> "this is an unknown, name-value type of attribute", which would be
>>> used to store additional, perhaps user-specified attributes?
>>> In the space of the attribute value, we could store the name of the
>>> attribute and the value itself (perhaps with 1 or 2 additional bytes
>>> for the name length of the attribute).
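
One possible encoding of such a name-value attribute, sketched in Python for brevity. Everything here is an assumption made for illustration: the reserved type value `SA_TYPE_NAMED`, the 2-byte little-endian fields, and the helper names are hypothetical, and the value length is assumed to be implied by the size of the attribute slot rather than stored.

```python
import struct

SA_TYPE_NAMED = 0xFFFF  # hypothetical reserved integer meaning "name-value attribute"
HEADER = struct.Struct("<HH")  # 2-byte type, 2-byte name length (little-endian)


def pack_named_attr(name, value):
    """Encode as: type, name length, name bytes, then the raw value."""
    name_bytes = name.encode("utf-8")
    if len(name_bytes) > 0xFFFF:
        raise ValueError("attribute name too long for a 2-byte length field")
    return HEADER.pack(SA_TYPE_NAMED, len(name_bytes)) + name_bytes + value


def unpack_named_attr(buf):
    """Decode a buffer produced by pack_named_attr into (name, value)."""
    sa_type, name_len = HEADER.unpack_from(buf, 0)
    if sa_type != SA_TYPE_NAMED:
        raise ValueError("not a name-value attribute")
    name_end = HEADER.size + name_len
    if name_end > len(buf):
        raise ValueError("truncated attribute")
    # Whatever follows the name is the value.
    return buf[HEADER.size:name_end].decode("utf-8"), buf[name_end:]
```

With this layout the per-attribute overhead is 4 bytes plus the name itself, which is the cost of carrying user-specified attributes alongside the integer-keyed system ones.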
>
> To be clear - as yet Lustre has a fairly limited set of "system
> attributes" that are needed for high performance operation.  There is
> the ability to store "user attributes" on a file, and while good
> performance is desirable this is not a widely-used feature and falls
> into the "nice to have" category.  In ext3 there is no separation of
> system attributes and user attributes, so they all benefit from the
> [di]node local storage optimization.
>
>>> I also think that instead of having an additional block pointer
>>> (block pointers are huge) in the dnode for "spillage", we should have
>>> something like a "uint64_t dn_spillobj" which would be an object id
>>> of a "spillage" object.  An object id is much more space-efficient
>>> and, like Andreas mentioned, allows for an unlimited number of
>>> attributes/attribute sizes.
>>
>> I don't quite understand this.  The whole point of these attributes is
>> to make them fast to access... using an object ID is going to be far
>> more expensive than a block pointer to access.  Matt's model can also
>> support unlimited numbers of attributes if we allow the blocks to be
>> chained.
>
> While I agree with your point, I think part of the issue is that as
> soon as we store a blkptr_t in the dnode this will consume some
> significant chunk of the SA space and push attributes out of the dnode.
> The other tradeoff is one of complexity.  You know the code better than
> I, but it seems cleaner to have a dnode reference as a container for a
> bunch of SA blocks rather than having another block tree attached to
> the same dnode.
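
To put rough numbers on the space tradeoff above, here is a back-of-the-envelope sketch in Python. The sizes are assumptions about the ZFS on-disk format of the time (a 512-byte dnode, a 64-byte fixed header, a 128-byte blkptr_t) and `sa_space` is a hypothetical helper, not anything from the code base.

```python
DNODE_SIZE = 512      # assumed on-disk dnode size, in bytes
DNODE_CORE_SIZE = 64  # assumed fixed dnode header, in bytes
BLKPTR_SIZE = 128     # assumed sizeof(blkptr_t)
OBJID_SIZE = 8        # sizeof(uint64_t), e.g. a dn_spillobj field


def sa_space(spill_ref_size, data_blkptrs=1):
    """Bytes left for in-dnode attributes after the header, the data
    block pointers, and the spill reference have been accounted for."""
    used = DNODE_CORE_SIZE + data_blkptrs * BLKPTR_SIZE + spill_ref_size
    return DNODE_SIZE - used


no_spill = sa_space(0)              # 320 bytes with no spill reference
with_blkptr = sa_space(BLKPTR_SIZE)  # 192 bytes with a spill blkptr_t
with_objid = sa_space(OBJID_SIZE)    # 312 bytes with a spill object id
```

Under these assumptions a spill block pointer costs 128 of the 320 bytes otherwise available, while an object id costs 8; the price of the object id is the extra lookup Mark describes.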
>
> That said, if you don't think there is a lot of added complexity to
> have chained blocks for the SAs, I'll defer to your experience.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
> _______________________________________________
> zfs-code mailing list
> zfs-code at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-code

