On Feb 11, 2008, at 8:42 PM, Andreas Dilger wrote:
> On Feb 09, 2008 08:10 -0700, Mark Maybee wrote:
>>> Hi Matthew,
>>> There's also the question of the attribute size limit. From
>>> conversations I had with Andreas, I got the feeling that the Lustre
>>> layout information could be quite big when a file is striped through
>>> all OSTs, and I would imagine this would become an even bigger
>>> concern in the future if we intend to continue scaling horizontally.
>>
>> By their very nature, these attributes will have a size limitation. I
>> don't really see the point in allowing a size larger than the
>> supported block size. 64K seems reasonable. I think that very large
>> values could possibly be supported using some form of indirection:
>> storing a block pointer for the value, or storing an object ID for
>> values that span multiple blocks.
>
> For the most common case in Lustre, the striping attribute will be
> relatively small (in the range of 80 - 128 bytes). In other cases
> (less common, but still present) the current ext3 size limit of 4096
> bytes is already a limiting factor on the striping of a file -
> directly affecting the total bandwidth that can be allocated to a
> single file. It is reasonable to have Lustre striping attributes up
> to the 16kB - 24kB range, at which point we will have a different
> (more efficient) mechanism for storing large stripes, but it needs
> some upcoming infrastructure first. There is very little use in the
> middle range.

Another data point; I would expect the pNFS striping information to
generally be in the 128 byte range with some growth to 256 bytes in the
common case.

Spencer

>
> It seems possible that we may need to have two separate mechanisms for
> storing the small attributes and storing the large ones. The small
> Lustre SAs will be stored in the dnode, and the large ones in the
> existing xattr mechanism. Given that some applications (ZFS/OSX/pNFS)
> need to be able to fall back to looking in an xattr for the data they
> need for compatibility, this isn't any extra overhead.
>
>>> Anyway, your proposal is interesting, but there's also one thing I
>>> would like to add:
>>>
>>> Could we have a special integer value that would essentially mean
>>> "this is an unknown, name-value type of attribute", which would be
>>> used to store additional, perhaps user-specified attributes?
>>> In the space of the attribute value, we could store the name of the
>>> attribute and the value itself (perhaps with 1 or 2 additional bytes
>>> for the name length of the attribute).
>
> To be clear - as yet Lustre has a fairly limited set of "system
> attributes" that are needed for high performance operation. There is
> the ability to store "user attributes" on a file, and while good
> performance is desirable this is not a widely-used feature and falls
> into the "nice to have" category. In ext3 there is no separation of
> system attributes and user attributes, so they all benefit from the
> [di]node local storage optimization.
>
>>> I also think that instead of having an additional block pointer
>>> (which are huge) in the dnode for "spillage", we should have
>>> something like a "uint64_t dn_spillobj" which would be an object id
>>> of a "spillage" object. An object id is much more space-efficient
>>> and, like Andreas mentioned, allows for an unlimited number of
>>> attributes/attribute sizes.
>>
>> I don't quite understand this.
>> The whole point of these attributes is to make them fast to access...
>> using an object ID is going to be far more expensive than a block
>> pointer to access. Matt's model can also support unlimited numbers of
>> attributes if we allow the blocks to be chained.
>
> While I agree with your point, I think part of the issue is that as
> soon as we store a blkptr_t in the dnode this will consume some
> significant chunk of the SA space and push attributes out of the
> dnode. The other tradeoff is one of complexity. You know the code
> better than I, but it seems cleaner to have a dnode reference as a
> container for a bunch of SA blocks rather than having another block
> tree attached to the same dnode.
>
> That said, if you don't think there is a lot of added complexity to
> have chained blocks for the SAs, I'll defer to your experience.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
> _______________________________________________
> zfs-code mailing list
> zfs-code at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-code
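
For illustration only, the generic name-value attribute suggested earlier
in the thread (a special integer type whose value area holds the name
length, the name, and then the value) could be sketched in C roughly as
follows. Nothing here is from the ZFS or Lustre sources: the tag
SA_TYPE_NAMEVALUE, the helpers sa_nv_pack() and sa_nv_unpack(), and the
choice of a single length byte (the thread says "1 or 2") are all
assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type tag meaning "unknown, name-value attribute". */
#define SA_TYPE_NAMEVALUE 0xFFFFu

/*
 * Pack one attribute into buf as: [name_len: 1 byte][name][value].
 * The name is stored without a terminating NUL.
 * Returns the number of bytes written, or 0 if the name is empty,
 * longer than 255 bytes, or the record does not fit in buflen.
 */
static size_t
sa_nv_pack(uint8_t *buf, size_t buflen, const char *name,
    const void *value, size_t value_len)
{
	assert(buf != NULL && name != NULL);

	size_t name_len = strlen(name);
	if (name_len == 0 || name_len > UINT8_MAX)
		return 0;
	/* Checking value_len first keeps the sum below from overflowing. */
	if (value_len > buflen || 1 + name_len + value_len > buflen)
		return 0;

	buf[0] = (uint8_t)name_len;
	memcpy(buf + 1, name, name_len);
	if (value_len > 0)
		memcpy(buf + 1 + name_len, value, value_len);
	return 1 + name_len + value_len;
}

/*
 * Split a packed record of len bytes back into name and value.
 * The returned pointers alias buf; the name is NOT NUL-terminated.
 * Returns 0 on success, -1 if the record is malformed.
 */
static int
sa_nv_unpack(const uint8_t *buf, size_t len, const char **name,
    size_t *name_len, const uint8_t **value, size_t *value_len)
{
	if (len < 1 || buf[0] == 0 || (size_t)buf[0] + 1 > len)
		return -1;

	*name_len = buf[0];
	*name = (const char *)(buf + 1);
	*value = buf + 1 + *name_len;
	*value_len = len - 1 - *name_len;
	return 0;
}
```

With this layout the per-attribute overhead is one byte plus the name,
so a 10-byte name with a 4-byte value occupies 15 bytes of the SA space.
Whether such records belong in the dnode or only in the spill area is
exactly the tradeoff discussed above and is not settled by this sketch.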