On Sep 17, 2007  08:31 -0600, Mark Shellenbaum wrote:
> While not entirely the same thing we will soon have a VFS feature 
> registration mechanism in Nevada.  Basically, a file system registers 
> what features it supports.  Initially this will be things such as "case 
> insensitivity", "acl on create", "extended vattr_t".

It's hard for me to comment on this without more information.  I just
suggested the ext3 mechanism because what I see so far (many features
being tied to ZFS_VERSION_3, and checking for version >= ZFS_VERSION_3)
makes it really hard to do parallel development of features and to
ensure that the code is actually safe to access the filesystem.

For example, if we start developing large dnode + fast EA code, we might
want to ship it sooner than it can go into a Solaris release.  We need to
make sure that no stock Solaris code tries to mount such a filesystem,
since it would assert (I think), so we would have to version the fs as v4.

However, maybe Solaris needs some other changes that would require a v4
that does not include large dnode + fast EA support (for whatever reason)
so now we have 2 incompatible codebases that support "v4"...

Do you have a pointer to the upcoming versioning mechanism?

> >3.a) I initially thought that we don't have to store any extra
> >   information to have a variable znode_phys_t size, because dn_bonuslen
> >   holds this information.  However, for symlinks ZFS checks essentially
> >   "zp_size + sizeof(znode_phys_t) < dn_bonuslen" to see if it is a
> >   fast or slow symlink.  That implies if sizeof(znode_phys_t) changes
> >   old symlinks on disk will be accessed incorrectly if we don't have
> >   some extra information about the size of znode_phys_t in each dnode.
> >
> 
> There is an existing bug to create symlinks with their own object type.

I don't think that will help unless there is an extra mechanism to detect
whether the symlink is fast or slow, instead of just using the dn_bonuslen.
Is it possible to store XATTR data on symlinks in Solaris?

> >3.b)  We can call this "zp_extra_znsize".  If we declare the current
> >   znode_phys_t as znode_phys_v0_t then zp_extra_znsize is the amount of
> >   extra space beyond sizeof(znode_phys_v0_t), so 0 for current 
> >   filesystems.
> 
> This would also require creating a new DMU_OT_ZNODE2 or something
> similarly named.

Sure.  Is it possible to change the DMU_OT type on an existing object?

> >3.c) zp_extra_znsize would need to be stored in znode_phys_t somewhere.
> >   There is lots of unused space in some of the 64-bit fields, but I
> >   don't know how you feel about hacks for this.  Possibilities include
> >   some bits in zp_flags, zp_pad, high bits in zp_*time nanoseconds, etc.
> >   It probably only needs to be 8 bytes or so (seems unlikely you will
> >   more than double the number of fixed fields in struct znode_phys_t).
> >
> 
> The zp_flags field is off limits.  It is going to be used for storing 
> additional file attributes such as immutable, nounlink,...

Ah, OK.  I was wondering about that also, but it isn't in the top 10
priorities yet.

> I don't want to see us overload other fields.  We already have several 
> pad fields within the znode that could be used.

OK, I wasn't sure about what is spoken for already.  Is it ZFS policy to
always have 64-bit member fields?  Some of the fields (e.g. nanoseconds)
don't really make sense as 64-bit values, and it would probably be a
waste to have a 64-bit value for zp_extra_znsize.

> >4.c) It would be best to have some kind of ZAP to store the fast EA data.
> >   Ideally it is a very simple kind of ZAP (single buffer), but the
> >   microzap format is too restrictive with only a 64-bit value.
> >   One of the other Lustre desires is to store additional information in
> >   each directory entry (in addition to the object number) like file type
> >   and a remote server identifier, and having a single ZAP type that is
> >   useful for small entries would be good.  Is it possible to go straight
> >   to a zap_leaf_phys_t without having a corresponding zap_phys_t first?
> >   If yes, then this would be quite useful, otherwise a fat ZAP is too fat
> >   to be useful for storing fast EA data and the extended directory info.
> 
> Can you provide a list of what attributes you want to store in the znode 
> and what their sizes are?  Do you expect ZFS to do anything special with 
> these attributes?  Should these attributes be exposed to applications?

The main one is the Lustre logical object volume (LOV) extended attribute
data.  This ranges from (commonly) 64 bytes, to as much as 4096 bytes (or
possibly larger once on ZFS).  This HAS to be accessed to do anything with
the znode, even stat currently, since the size of a file is distributed
over potentially many servers, so avoiding overhead here is critical.

In addition to that, there will be similar smallish attributes stored with
each znode like back-pointers from the storage znodes to the metadata znode.
These are on the order of 64 bytes as well.

> Usually, we only embed attributes in the znode if the file system has 
> some sort of semantics associated with them.

The issue I think is that this data is only useful for Lustre, so reserving
dedicated space for it in a znode is no good.  Also, the LOV XATTR might be
very large, so any dedicated space would be wasted.  Having a generic and
fast XATTR storage in the znode would help a variety of applications.

> One of the original plans, from several years ago was to create a zp_zap 
> field in the znode that would be used for storing additional file 
> attributes.  We never actually did that and the field was turned into 
> one of the pad fields in the znode.

Maybe "file attributes" is the wrong term.  These are really XATTRs in the
ZFS sense, so I'll refer to them as such in the future.

> If the attribute will be needed for every file then it should probably 
> be in the znode, but if it is an optional attribute or too big then
> maybe it should be in some sort of overflow object.

This is what I'm proposing.  For small XATTRs they would live in the znode,
and large ones would be stored using the normal ZFS XATTR mechanism (which
is infinitely flexible).  Since the Lustre LOV XATTR data is created when
the znode is first allocated, it will always get first crack at using the
fast XATTR space, which is fine since it is right up with the znode data in
importance.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

