On Feb 24, 2008  17:20 +0530, Kalpak Shah wrote:
> On Thu, 2008-02-07 at 12:48 -0800, Matthew Ahrens wrote:
> > Mark Shellenbaum wrote:
> > > General comment:
> > > 
> > > I'm not sure we should be using nvlists to store these attributes. 
> > > While it will be quick to implement, I'm concerned about the constant 
> > > packing/unpacking and searching linked lists in the nvlist code.
> > 
> > Agreed.  Here is one proposal for how this could be done efficiently and 
> > scalably (both in space and runtime):
> > 
> > First, note that the problem is quite constrained:
> > 
> > There are a fixed number of attribute names (for any given software
> > version) since these are interpreted system attributes.  Therefore the
> > attribute names can be a small number, rather than a string -- ie, we
> > have an enum that defines the attributes.  We (ie, OpenSolaris) can
> > control the allocation of new enum values in this namespace.
> 
> We can have a 32-bit type and a 32-bit subtype, so that the ZFS team
> only needs to assign the 32-bit type, and the subtype can be used by
> Lustre or OS X without requiring ZFS team approval. For Lustre the
> subtype will be a 3-4 character EA name, but it can be used differently
> as well.
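
Just to illustrate that encoding (the helper below is made up, not part
of any proposal), a short EA name could be packed into the 32-bit
subtype like:

/*
 * Pack up to 4 characters of an EA name into a 32-bit subtype.
 * Unused trailing bytes stay zero.  Illustrative sketch only.
 */
static uint32_t
sa_subtype_from_name(const char *name)
{
    uint32_t subtype = 0;
    int i;

    for (i = 0; i < 4 && name[i] != '\0'; i++)
        subtype |= (uint32_t)(unsigned char)name[i] << (8 * i);

    return (subtype);
}

/* e.g. sa_subtype_from_name("lov") for a Lustre layout EA */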

Something else I was thinking would be very useful is to allow setting a
rank/priority for each SA.  This would give the DMU code the ability
to prioritize which SAs are stored at the beginning of the list (both
to be quickly accessible without much scanning, and to ensure they
remain in the dnode in case of overflow).  In particular, to avoid
serious performance degradation, any znode data that is moved to an SA,
and any SA that points to an external block, should be kept in the
dnode itself.

I'm undecided whether the ranking should be the "sa_type" (which might
put some assigned types at a disadvantage by carrying "high priority"
SAs, but ensures that critical SAs can never be pushed out of the
dnode) or the "sa_subtype".

> The kernel would not care about the SA names, but the subtype can be
> expanded to com.apple.finder or com.lustre.lov, etc. if it needs to be
> printed.
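
And that expansion could live entirely in a userland table, e.g.
(illustrative only, nothing here is a real assignment):

/*
 * Hypothetical userland-only table for pretty-printing SAs; the kernel
 * never looks at these strings.
 */
struct sa_name_map {
    uint32_t    snm_type;
    uint32_t    snm_subtype;
    const char  *snm_name;
};

static const struct sa_name_map sa_names[] = {
    { ST_FINDERINFO,    0,  "com.apple.finder" },
    { ST_LUSTRE_LAYOUT, 0,  "com.lustre.lov" },
};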
> 
> > Many attributes will always have a fixed size (eg, scanstamp, finder info)
> 
> I think we can let sai_intlen == 0 indicate that sa_length is the real
> length of the SA value (and not in sainfo[sa_type] chunks) to allow
> arbitrary-sized values.
> 
> So here is a slightly modified on-disk format from what you proposed:
> 
> struct sa_phys {
>       struct sa_header {
>               uint32_t magic;
>               uint16_t sa_numattrs;
>               uint16_t pad;
>       } sa_hdr;
>       struct sa_entry {
>               uint32_t sa_type;
>               uint32_t sa_subtype;
>               uint32_t sa_value_offset;
>               uint32_t sa_value_size;
>       } sa_entry[];
> };
> 
> The sa_value_offset will point to the location of the value in the
> dnode's bonus buffer or in the external block, and 8-byte alignment
> will require padding of both entries and values.
> 
> sa_type + sa_subtype == 0 would terminate the list, and as an
> additional check we could store 0x00000000 to indicate the end of the
> sa_entry array. So during modification or removal of entries we only
> need to repack the trailing entries.
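
For reference, a lookup against that layout would just walk sa_entry[]
up to sa_numattrs (or the zero terminator); roughly like this, assuming
sa_value_offset is relative to the start of the buffer (sketch only):

/*
 * Find the value for (type, subtype) in a packed sa_phys buffer.
 * Purely illustrative; assumes the buffer was byteswapped on read.
 */
static void *
sa_find(struct sa_phys *sp, uint32_t type, uint32_t subtype,
    uint32_t *sizep)
{
    int i;

    for (i = 0; i < sp->sa_hdr.sa_numattrs; i++) {
        struct sa_entry *se = &sp->sa_entry[i];

        if (se->sa_type == 0 && se->sa_subtype == 0)
            break;              /* zero terminator */
        if (se->sa_type != type || se->sa_subtype != subtype)
            continue;
        *sizep = se->sa_value_size;
        return ((char *)sp + se->sa_value_offset);
    }
    return (NULL);
}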
> 
> Thanks,
> Kalpak.
> 
> > 
> > Removing attributes or changing their sizes will be extremely rare,
> > so we can make this code path slow.  For example, it would not be
> > prohibitive to repack & rewrite all attrs.
> > 
> > Here is an idea for an on-disk format which takes advantage of these 
> > constraints to achieve good performance:
> > 
> > (sa == System Attribute)
> > 
> > enum sa_type {
> >     ST_ATIME,
> >     ST_ACL,
> >     ST_FINDERINFO,
> >     ST_SCANSTAMP,
> >     ST_LUSTRE_LAYOUT,
> >     ...
> >     ST_NUMTYPES
> > };
> > 
> > const struct sa_info {
> >     uint8_t sai_intlen;     /* size of one "chunk" of the value, in bytes */
> > } sainfo[ST_NUMTYPES] = {
> >     { 8 }, { 2 }, { 8 }, { 8 }, { 1 }, ...
> > };
> > 
> > on disk:
> > header:
> > struct sa_phys {
> >     uint16_t sa_numattrs;
> >     struct {
> >             uint16_t sa_type;     /* enum sa_type */
> >             uint16_t sa_length;   /* in sainfo[sa_type] chunks */
> >     } sa_attr[];
> > };
> > 
> > followed by the data in the order specified by the header.  8-byte
> > alignment will be enforced for all attribute starting offsets.
> > 
> > Note that the software (not the on-disk layout) defines the set of
> > possible attributes, and their byte sizes (ie, is the value a string,
> > vs a series of 8-byte integers, etc).  Alternatively, if we are
> > confident that we won't need anything but 8-byte integers (or arrays
> > thereof), we can simplify the design further.
> > 
> > When we first access a SA block, we read in the header, and construct
> > an array indexed by enum sa_type that tells us the position and length
> > of each attribute.  That way, access and modification will be blazing
> > fast[*].  If the size of an attribute is changed, we rewrite & repack
> > everything, so there is never any unused space.  (Aside from the padding 
> > necessary to enforce 8-byte alignment of each SA.  It is assumed that
> > attribute size changing will be "rare enough" that this performance is
> > acceptable.)
> > 
> > Thus the space overhead would be only 4 bytes per attribute (compared to
> > a static struct layout), and runtime overhead would be minimal.  So I think 
> > we could justify using it for much of the ZPL metadata.  In fact, we could
> > make this SA stuff be a DMU service, and remove support for bonus
> > buffers entirely.  The set of ZPL metadata that is always needed could
> > be in one sa, and other pieces elsewhere.  The DMU would use what was
> > the bonus buffer, plus an additional BP only if necessary, to implement
> > the SA.  Ie, if the space needed for the SA's was <= 320 bytes (the
> > current bonus size limit), then it would all be in the dnode.
> > Otherwise, the first 192 (320-sizeof(blkptr_t)) would be in the dnode,
> > and the rest in a block pointed to by a BP stored in the dnode.  In
> > addition, we could expand the dnode to 1k so that more SA's could be
> > stored there.
> > 
> > --matt
> > 
> > [*] For example:
> > 
> > struct sa_layout {
> >     struct {
> >             int sal_offset; /* in bytes */
> >             int sal_length; /* in sainfo[sa_type] chunks */
> >     } sal_ary[ST_NUMTYPES];
> > };
> > 
> > sa_init would construct an sa_layout given the on-disk sa_phys.
> > 
> > void sa_read(struct sa_layout *sal, void *sa_phys, enum sa_type sat,
> >      void *buf, int buflen)
> > {
> >     int len;
> > 
> >     ASSERT(sat < ST_NUMTYPES);
> >     len = sal->sal_ary[sat].sal_length * sainfo[sat].sai_intlen;
> >     ASSERT(buflen >= len);
> >     /* Assumes that it was already byteswapped when read off disk */
> >     bcopy((char *)sa_phys + sal->sal_ary[sat].sal_offset, buf, len);
> > }
> > 
> > sa_write would be similar, unless the sal_length was different from
> > what we are trying to write, in which case we would rewrite the whole
> > thing to insert the new (or resized) attribute.  The software would
> > define a priority order for the SA's, which defines what order to lay
> > them out on disk (and which spill out of the dnode into the accessory
> > block first).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

