Mark Shellenbaum wrote:
> Girish Shilamkar wrote:
>> Hi,
>>      With reference to my previous mail ([zfs-code] Nvpair for storing EAs
>> in dnode), which expressed our intent to use libnvpair for EAs in the
>> dnode, I am now sending a high-level design document for the same.
>>
>> Any suggestions/comments are most welcome.
>>
>> Thanks & Regards,
>> Girish
> 
> General comment:
> 
> I'm not sure we should be using nvlists to store these attributes. 
> While it will be quick to implement, I'm concerned about the constant 
> packing/unpacking and searching linked lists in the nvlist code.

Agreed.  Here is one proposal for how this could be done efficiently and 
scalably (both in space and runtime):

First, note that the problem is quite constrained:

There is a fixed set of attribute names (for any given software version),
since these are interpreted system attributes.  Therefore each attribute name
can be a small integer rather than a string -- ie, we have an enum that
defines the attributes.  We (ie, OpenSolaris) can control the allocation of
new enum values in this namespace.

Many attributes will always have a fixed size (eg, scanstamp, finder info).

Removing attributes or changing their sizes will be extremely rare, so we can 
make this code path slow.  For example, it would not be prohibitive to repack 
& rewrite all attrs.

Here is an idea for an on-disk format which takes advantage of these 
constraints to achieve good performance:

(sa == System Attribute)

enum sa_type {
        ST_ATIME,
        ST_ACL,
        ST_FINDERINFO,
        ST_SCANSTAMP,
        ST_LUSTRE_LAYOUT,
        ...
        ST_NUMTYPES
};

const struct sa_info {
        uint8_t sai_intlen;
} sainfo[ST_NUMTYPES] = {
        { 8 }, { 2 }, { 8 }, { 8 }, { 1 }, ...
};

on disk:
header:
struct sa_phys {
        uint16_t sa_numattrs;
        struct {
                uint16_t sa_type;       /* enum sa_type */
                uint16_t sa_length;     /* in sainfo[sa_type] chunks */
        } sa_attr[];
};

followed by the data in the order specified by the header.  8-byte alignment 
will be enforced for all attribute starting offsets.
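
For example (using the sainfo values above), a hypothetical file with just an
atime (one 8-byte chunk) and an ACL of six 2-byte chunks would have a 10-byte
header (2 + 2*4 bytes), so the atime would start at offset 16 (the header
size rounded up to 8-byte alignment), the ACL at offset 24, and the whole SA
region would occupy 36 bytes.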

Note that the software (not the on-disk layout) defines the set of possible 
attributes, and their byte sizes (ie, is the value a string, vs a series of 
8-byte integers, etc).  Alternatively, if we are confident that we won't need 
anything but 8-byte integers (or arrays thereof), we can simplify the design 
further.

When we first access a SA block, we read in the header, and construct
an array indexed by enum sa_type that tells us the position and length
of each attribute.  That way, access and modification will be blazing
fast[*].  If the size of an attribute is changed, we rewrite & repack
everything, so there is never any unused space.  (Aside from the padding 
necessary to enforce 8-byte alignment of each SA.  It is assumed that
attribute size changing will be "rare enough" that this performance is
acceptable.)

Thus the space overhead would be only 4 bytes per attribute (compared to
a static struct layout), and runtime overhead would be minimal.  So I think 
we could justify using it for much of the ZPL metadata.  In fact, we could
make this SA stuff be a DMU service, and remove support for bonus
buffers entirely.  The set of ZPL metadata that is always needed could
be in one sa, and other pieces elsewhere.  The DMU would use what was
the bonus buffer, plus an additional BP only if necessary, to implement
the SA.  Ie, if the space needed for the SA's was <= 320 bytes (the
current bonus size limit), then it would all be in the dnode.
Otherwise, the first 192 (320-sizeof(blkptr_t)) would be in the dnode,
and the rest in a block pointed to by a BP stored in the dnode.  In addition, 
we could expand the dnode to 1k so that more SA's could be stored there.
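
To make the split concrete, here is a minimal sketch of the sizing logic,
assuming the current 320-byte bonus limit and a 128-byte blkptr_t (the names
below are hypothetical):

#define SA_BONUS_MAX            320     /* current bonus size limit */
#define SA_BONUS_WITH_SPILL     (SA_BONUS_MAX - 128)    /* minus sizeof (blkptr_t) */

/* How many SA bytes would have to spill into the external block, if any. */
static int
sa_spill_bytes(int sa_total_len)
{
        if (sa_total_len <= SA_BONUS_MAX)
                return (0);             /* everything fits in the dnode */
        return (sa_total_len - SA_BONUS_WITH_SPILL);
}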

--matt

[*] For example:

struct sa_layout {
        struct {
                int sal_offset; /* in bytes */
                int sal_length; /* in sainfo[sa_type] chunks */
        } sal_ary[ST_NUMTYPES];
};

sa_init would construct an sa_layout given the on-disk sa_phys.
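
A rough sketch of what sa_init might look like (assuming the header has
already been byteswapped when read off disk, and using -1 to mark attributes
not present in this object; P2ROUNDUP rounds up to the given alignment):

void sa_init(struct sa_layout *sal, const struct sa_phys *sap)
{
        int i, off;

        for (i = 0; i < ST_NUMTYPES; i++)
                sal->sal_ary[i].sal_offset = -1;        /* not present */

        /* data starts after the header, rounded up to 8-byte alignment */
        off = P2ROUNDUP(sizeof (uint16_t) +
            sap->sa_numattrs * sizeof (sap->sa_attr[0]), 8);

        for (i = 0; i < sap->sa_numattrs; i++) {
                uint16_t type = sap->sa_attr[i].sa_type;
                uint16_t len = sap->sa_attr[i].sa_length;

                ASSERT(type < ST_NUMTYPES);
                sal->sal_ary[type].sal_offset = off;
                sal->sal_ary[type].sal_length = len;
                off = P2ROUNDUP(off + len * sainfo[type].sai_intlen, 8);
        }
}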

void sa_read(struct sa_layout *sal, void *sa_phys, enum sa_type sat,
     void *buf, int buflen)
{
        int len;

        ASSERT(sat < ST_NUMTYPES);
        len = sal->sal_ary[sat].sal_length * sainfo[sat].sai_intlen;
        ASSERT(buflen >= len);
        /* Assumes that it was already byteswapped when read off disk */
        bcopy((char *)sa_phys + sal->sal_ary[sat].sal_offset, buf, len);
}

sa_write would be similar, unless the sal_length was different from what we 
are trying to write, in which case we would rewrite the whole thing to insert 
the new (or resized) attribute.  The software would define a priority order 
for the SA's, which defines what order to lay them out on disk (and which 
spill out of the dnode into the accessory block first).
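
The fast path might look something like this (sa_repack is a hypothetical
helper for the rare rewrite-everything-in-priority-order case described
above):

void sa_write(struct sa_layout *sal, void *sa_phys, enum sa_type sat,
     const void *buf, int buflen)
{
        int len;

        ASSERT(sat < ST_NUMTYPES);
        len = sal->sal_ary[sat].sal_length * sainfo[sat].sai_intlen;

        if (sal->sal_ary[sat].sal_offset != -1 && buflen == len) {
                /* same size as before: overwrite in place */
                bcopy(buf, (char *)sa_phys + sal->sal_ary[sat].sal_offset,
                    len);
        } else {
                /* new or resized attribute: rare, slow path */
                sa_repack(sal, sa_phys, sat, buf, buflen);
        }
}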
