Mark Shellenbaum wrote:
> Girish Shilamkar wrote:
>> Hi,
>> With reference to my previous mail ([zfs-code] Nvpair for storing EAs
>> in dnode), which expressed our intent to use libnvpair for EAs in the
>> dnode, I am now sending a high-level design document for the same.
>>
>> Any suggestions/comments are most welcome.
>>
>> Thanks & Regards,
>> Girish
>
> General comment:
>
> I'm not sure we should be using nvlists to store these attributes.
> While it will be quick to implement, I'm concerned about the constant
> packing/unpacking and searching of linked lists in the nvlist code.
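(To illustrate the concern: with a packed nvlist, reading even a single
attribute involves unpacking the whole buffer and then a linear search
of the resulting pair list.  A sketch, with hypothetical names -- the
libnvpair calls are real, but ea_get_atime and the "atime" key are
made up for illustration:)

#include <libnvpair.h>

/* Hypothetical example; ea_get_atime and "atime" are made-up names. */
static int
ea_get_atime(char *packed, size_t packed_size, uint64_t *atime)
{
	nvlist_t *nvl;
	int err;

	/* Unpack the entire list just to read one value... */
	if ((err = nvlist_unpack(packed, packed_size, &nvl, 0)) != 0)
		return (err);
	/* ...then walk the nvlist's internal pair list to find it. */
	err = nvlist_lookup_uint64(nvl, "atime", atime);
	nvlist_free(nvl);	/* the next access must unpack again */
	return (err);
}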
Agreed.  Here is one proposal for how this could be done efficiently
and scalably (both in space and runtime).

First, note that the problem is quite constrained:

There are a fixed number of attribute names (for any given software
version), since these are interpreted system attributes.  Therefore
each attribute name can be a small number rather than a string -- ie,
we have an enum that defines the attributes.  We (ie, OpenSolaris) can
control the allocation of new enum values in this namespace.

Many attributes will always have a fixed size (eg, scanstamp, finder
info).

Removing attributes or changing their sizes will be extremely rare, so
we can make this code path slow.  For example, it would not be
prohibitive to repack & rewrite all attrs.

Here is an idea for an on-disk format which takes advantage of these
constraints to achieve good performance (sa == System Attribute):

enum sa_type {
	ST_ATIME,
	ST_ACL,
	ST_FINDERINFO,
	ST_SCANSTAMP,
	ST_LUSTRE_LAYOUT,
	...
	ST_NUMTYPES
};

const struct sa_info {
	uint8_t sai_intlen;
} sainfo[ST_NUMTYPES] = { 8, 2, 8, 8, 1, ... };

On disk, there is a header:

struct sa_phys {
	uint16_t sa_numattrs;
	struct {
		uint16_t sa_type;	/* enum sa_type */
		uint16_t sa_length;	/* in sainfo[sa_type] chunks */
	} sa_attr[];
};

followed by the data in the order specified by the header.  8-byte
alignment will be enforced for all attribute starting offsets.

Note that the software (not the on-disk layout) defines the set of
possible attributes and their byte sizes (ie, whether the value is a
string vs. a series of 8-byte integers, etc).  Alternatively, if we
are confident that we won't need anything but 8-byte integers (or
arrays thereof), we can simplify the design further.

When we first access a SA block, we read in the header and construct
an array, indexed by enum sa_type, that tells us the position and
length of each attribute.  That way, access and modification will be
blazing fast[*].  If the size of an attribute is changed, we rewrite &
repack everything, so there is never any unused space (aside from the
padding necessary to enforce 8-byte alignment of each SA).  It is
assumed that attribute size changes will be rare enough that this
performance is acceptable.

Thus the space overhead would be only 4 bytes per attribute (compared
to a static struct layout), and the runtime overhead would be minimal.
So I think we could justify using it for much of the ZPL metadata.

In fact, we could make this SA stuff a DMU service and remove support
for bonus buffers entirely.  The set of ZPL metadata that is always
needed could be in one SA, and other pieces elsewhere.  The DMU would
use what was the bonus buffer, plus an additional BP only if
necessary, to implement the SAs.  Ie, if the space needed for the SAs
was <= 320 bytes (the current bonus size limit), then it would all be
in the dnode.  Otherwise, the first 192 bytes (320 - sizeof
(blkptr_t)) would be in the dnode, and the rest would go in a block
pointed to by a BP stored in the dnode.  In addition, we could expand
the dnode to 1k so that more SAs could be stored there.

--matt

[*] For example:

struct sa_layout {
	struct {
		int sal_offset;	/* in bytes */
		int sal_length;	/* in sainfo[sa_type] chunks */
	} sal_ary[ST_NUMTYPES];
};

sa_init would construct an sa_layout given the on-disk sa_phys.
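(A hedged sketch, not part of the proposal: sa_init might look
something like the following, assuming the structures above, the
P2ROUNDUP macro from <sys/sysmacros.h>, and that the data area begins
at the first 8-byte boundary after the header.)

/*
 * Sketch: walk the on-disk header once and record each attribute's
 * byte offset and length, so later lookups are a single array index.
 */
int
sa_init(struct sa_phys *hdr, struct sa_layout *sal)
{
	int i, offset;

	/* sal_length == 0 marks an attribute as not present. */
	for (i = 0; i < ST_NUMTYPES; i++)
		sal->sal_ary[i].sal_length = 0;

	/* The header is 2 bytes plus 4 bytes per attribute entry. */
	offset = P2ROUNDUP(sizeof (uint16_t) +
	    hdr->sa_numattrs * sizeof (hdr->sa_attr[0]), 8);

	for (i = 0; i < hdr->sa_numattrs; i++) {
		uint16_t type = hdr->sa_attr[i].sa_type;

		if (type >= ST_NUMTYPES)
			return (EINVAL);  /* written by newer software */
		sal->sal_ary[type].sal_offset = offset;
		sal->sal_ary[type].sal_length = hdr->sa_attr[i].sa_length;
		/* Each attribute's starting offset is 8-byte aligned. */
		offset = P2ROUNDUP(offset +
		    hdr->sa_attr[i].sa_length * sainfo[type].sai_intlen, 8);
	}
	return (0);
}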
void
sa_read(struct sa_layout *sal, void *sa_phys, enum sa_type sat,
    void *buf, int buflen)
{
	int len;

	ASSERT(sat < ST_NUMTYPES);
	len = sal->sal_ary[sat].sal_length * sainfo[sat].sai_intlen;
	ASSERT(buflen >= len);
	/* Assumes the data was already byteswapped when read off disk. */
	bcopy((char *)sa_phys + sal->sal_ary[sat].sal_offset, buf, len);
}

sa_write would be similar, unless sal_length differed from what we are
trying to write, in which case we would rewrite the whole thing to
insert the new (or resized) attribute.

The software would define a priority order for the SAs, which
determines the order in which they are laid out on disk (and which
ones spill out of the dnode into the accessory block first).
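(For concreteness, a hedged sketch of that sa_write split, under the
same assumptions as sa_read; the sa_rewrite() helper is hypothetical,
standing in for the slow repack path described above:)

/*
 * If the attribute already exists at the same length, overwrite it
 * in place; otherwise rewrite and repack the whole region.
 */
void
sa_write(struct sa_layout *sal, void *sa_phys, enum sa_type sat,
    void *buf, int buflen)
{
	int len;

	ASSERT(sat < ST_NUMTYPES);
	len = sal->sal_ary[sat].sal_length * sainfo[sat].sai_intlen;

	if (len == buflen) {
		/* Fast path: same size, copy over the old value. */
		bcopy(buf, (char *)sa_phys + sal->sal_ary[sat].sal_offset,
		    len);
	} else {
		/* Slow path: insert or resize forces a full repack. */
		sa_rewrite(sal, sa_phys, sat, buf, buflen);
	}
}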