On Fri, Nov 27, 2009 at 09:12:33AM -0800, Frank Batschulat wrote:
> Hey Ed, I want to comment on the NFS aspects involved here.
>
> On Thu, May 21, 2009 at 3:55 AM, Edward Pilatowicz wrote:
> >
> > well, it all depends on what nfs shares are actually being exported.
>
> I definitely think we want to abstain from too much programmatic
> guessing inside the Zones framework about what an NFS server exports,
> how the NFS server's exported namespace may look, and how the NFS
> client (which runs the Zone) handles those exports upon access as
> opposed to explicit mounting.
>
> That is just about okay in the NFS v2/v3 (and their helper) protocols
> world, but it is not always adequate for the V4 protocol and all the
> work/features in V4 and V4.1 towards a unified, global namespace.
>
> I'll show why in the context of V4 on the examples you
> mentioned below.
>
> > if the nfs server has the following share(s) exported:
> >
> > nfsserver:/vol
> >
> > then you would have the following mount(s):
> > /var/zones/nfsmount/zone1/nfsserver/vol
> > /var/zones/nfsmount/zone2/nfsserver/vol
> > /var/zones/nfsmount/zone3/nfsserver/vol
> >
> > if the nfs server has the following share(s) exported:
> >
> > nfsserver:/vol/zones
> >
> > then you would have the following mount(s):
> > /var/zones/nfsmount/zone1/nfsserver/vol/zones
> > /var/zones/nfsmount/zone2/nfsserver/vol/zones
> > /var/zones/nfsmount/zone3/nfsserver/vol/zones
>
> in those 2 examples, we'd have to consider how the V4 server
> constructs its pseudo namespace starting at the server's root,
> including what we call pseudo exports that build the bridge to the
> real exported share points at the server, and how the V4 client may
> handle this.
>
> for instance, on the V4 server the export:
>
> /vol
>
> may (and probably will) have different ZFS datasets
> that host our zones underneath /vol, e.g.:
>
> /vol/zone1
> /vol/zone2
> /vol/zone3
>
> since they are separate ZFS datasets, we would cross file system
> boundaries while traversing from the exported server's root / over
> the share point /vol down to the (also presumably exported, otherwise
> it wouldn't be useful in our context anyway) share points
> zone1/zone2/zone3.
>
> We distinguish between the different file systems based on the FSID
> attribute; if it changes, we cross a server file system boundary.
>
> With V2/V3 that would stop us: the client cannot travel into the new
> file system below the initial mount, and a separate mount would have
> to be performed (unless we've explicitly mounted the entire path, of
> course).
>
> However, with V4 the client has the (in our implementation) so-called
> Mirror Mount feature. That allows the client to transparently mount
> those new file systems on access below the starting share point /vol
> (provided they are shared as well) and make them immediately visible
> without requiring the user to perform any additional mounts.
>
> Those mirror mounts are done automatically by our V4 client in the
> kernel as it detects that it would cross server-side file system
> boundaries (based on the FSID) on any access other than
> VOP_LOOKUP() or VOP_GETATTR().
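(Illustration, not from the original thread: the server-side layout being
described, with each zone in its own ZFS dataset, could be set up roughly
as below; the pool name 'tank' is made up.)

    # on the NFS server: /vol and each zone dataset are separate ZFS
    # file systems, so each one has its own FSID ('tank' is hypothetical)
    zfs create -o mountpoint=/vol -o sharenfs=on tank/vol
    zfs create tank/vol/zone1     # children inherit sharenfs=on and mount
    zfs create tank/vol/zone2     # at /vol/zone1, /vol/zone2, /vol/zone3
    zfs create tank/vol/zone3
    share                         # verify all four file systems are shared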
> I.e., if the global zone already had mounted
>
> server:/vol
>
> an attempt by the zone utilities to access (as opposed to
> explicitly mount)
>
> server:/vol/zone1
>
> will automatically mount server:/vol/zone1 into the client's
> namespace, and you'd see on the client (nfsstat -m) 2 mounts:
>
> server:/vol (the already existing regular mount)
> server:/vol/zone1 (the mirror mount done by the client)
>
> if we really performed a mount instead, that would just induce
> the mount of
>
> server:/vol/zone1
>
> into the namespace of the client running the zone.
>
> With the advent of the upcoming NFS v4 Referrals support in the V4
> server and V4 client, another 'automatism' in the client can possibly
> change our observation of the mounted server exports on the client
> running the zone.
>
> On the V4 server (that is hosting our zone image) the administrator
> might decide to relocate the export to a different server and then
> might establish a so-called 'reparse point' (in essence a symlink
> containing special information) that will redirect a client to a
> different server hosting this export.
>
> NB: other vendors' NFS servers might hand out referrals too
> NB2: the same feature will be supported by our CIFS client
>
> The V4 client can get a specific referral event (NFS4ERR_MOVED) on
> VOP_LOOKUP(), VOP_GETATTR() and during initial mount processing by
> observing the NFS4ERR_MOVED error, and it'll fetch the new location
> information from the server via the 'fs_locations' attribute.
> Our client will then go off and automatically mount the file system
> from the different server it had been referred to by the initial
> server. As with mirror mounts, this is done transparently for the
> user and inside the kernel.
>
> The minor but important quirk involved here, as far as our
> observation from the Zone NFS client is concerned, is that for our
> mount attempt on (or access to)
>
> server_A:/vol/zones1
>
> we might get a mount established instead for
>
> server_B:/vol/zones1
>
> It is planned to eventually provide our V2/V3 clients with Referral
> support when talking to our NFS servers, although the implementation
> will differ slightly, and I'm not yet sure how that V2/V3 client's
> referral mount will be observed on the NFS client.
>
> While this (Referrals) currently only affects initial access and
> mounting, in the future, with Migration and Replication support being
> implemented, literally every NFS v4 OTW OP may get a 'migration
> event', aka receive NFS4ERR_MOVED.
>
> This is still in the early design stages, but we have to expect,
> from the Zones NFS client's observability standpoint, that the
> 'nfsserver' portion of the mounted export may silently be
> 're-written' behind the scenes instead of a separate 2nd mount
> being done, i.e.:
>
> our initial zone-initiated access/mount:
>
> server_OLD:/vol/zone1
>
> Oops, a migration event happens; to the client, this will now
> silently become:
>
> server_NEW:/vol/zone1
>
> this will be reflected in things like nfsstat(1M) output as well.
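(Illustration only; the host name 'server', the mount point, and the image
file name are hypothetical. From the global zone, the mirror-mount
behaviour described above would be observed roughly like this:)

    mkdir -p /var/zones/nfsmount/zone1/server/vol
    mount -F nfs -o vers=4 server:/vol /var/zones/nfsmount/zone1/server/vol

    # an access below the share point that is more than a bare
    # lookup/getattr, e.g. reading a (hypothetical) image file ...
    cat /var/zones/nfsmount/zone1/server/vol/zone1/root.img > /dev/null

    # ... triggers a mirror mount of server:/vol/zone1; nfsstat -m
    # now reports both server:/vol and server:/vol/zone1
    nfsstat -m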
first, one thing to keep in mind: all these mounts are being set up by
the global zone to access encapsulated zones (ie, zones stored in files,
vdisks, etc.). these nfs filesystems won't be visible from within a zone.
everything we're describing here happens in the global zone.

> > if the nfs server has the following share(s) exported:
> >
> > nfsserver:/vol/zones/zone1
> > nfsserver:/vol/zones/zone2
> > nfsserver:/vol/zones/zone3
> >
> > then you would have the following mount(s):
> >
> > /var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1
> > /var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2
> > /var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3
>
> as I tried to explain above, the 'nfsserver' part can be a moving
> target as far as our observability from the Zone NFS client is
> concerned.

as i mentioned, none of these mounts will be visible from within any
non-global zone.

> > afaik, determining the mount point should be pretty straightforward.
> > i was planning to get a list of all the shares exported by the specified
> > nfs server, and then do a strncmp() of all the exported shares against
> > the specified path. the longest matching share name is the mount path.
>
> Well, that in turn is anything but straightforward and almost
> impossible for NFS v4 servers.
>
> For the V2/V3 clients, which do use the mount protocol to instantiate
> a mount, the mountd(1M) of a V2/V3 server can be asked by the client,
> using the MOUNTPROC_EXPORT/MOUNTPROC3_EXPORT RPC procedure, to return
> a list of exported file systems.
>
> This is what commands like showmount(1M) or dfshares(1M) use to list
> a server's exported file systems; however, there's no API available
> to do that other than writing your own RPC-aware application doing
> essentially rpc_clnt_calls(3NSL) talking to a remote V2/V3 server's
> mountd(1M).
>
> But the V4 protocol does not use the mount protocol at all anymore,
> so there's no real programmatic way to retrieve a list of exported
> file systems from a V4 server. This would not make much sense in the
> context of the V4 protocol anyway, because of the way the V4 server
> constructs its pseudo namespace starting from the server's root /,
> potentially involving pseudo export nodes that eventually bridge to
> the real share points.
>
> You may be lucky and the exported file systems are shared for V3 and
> V4, in which case you can at least make an educated guess.
>
> > for example. if we have:
> > nfs://jurassic/a/b/c/d/file
> >
> > and jurassic is exporting:
> > jurassic:/a
> > jurassic:/a/b
> > jurassic:/a/b/c
> >
> > then our mount path will be:
> > /var/zones/nfsmount/jurassic/a/b/c
> >
> > and our encapsulated zvol will be accessible at:
> > /var/zones/nfsmount/jurassic/a/b/c/d/file
> >
> > afaik, this is actually the only way that this could be implemented.
>
> for the above reasons I'd rather stay away from implementing logic to
> figure out what to mount based on a potential list of exported file
> systems from the server, and rather stick with some basics configured
> via zonecfg along the lines of:
>
> NFS path = 'nfs://<host>[:port]/<export>'
> Zone image = '<[dir to]filename>'

i don't like the idea of having multiple objects that need to be
specified. it requires the addition of an extra variable that is only
needed for nfs uris.

> that way we avoid the problem of having to parse the entire current
> proposed SO-URI like:
>
> 'nfs://<host>[:port]/<file-absolute>'
>
> and probe what part of that pathname may be suitable as a mount.
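(Illustration only: the V3 'educated guess' mentioned above, using
showmount(1M) plus the longest-prefix match, could be scripted roughly
like this. The server name 'jurassic' and the path are made up, and this
does not work against a V4-only pseudo namespace.)

    path=/a/b/c/d/file
    best=""
    # showmount(1M) prints a header line, then one exported path per line
    for share in $(showmount -e jurassic | awk 'NR > 1 { print $1 }'); do
            case $path in
            $share/*) [ ${#share} -gt ${#best} ] && best=$share ;;
            esac
    done
    echo "longest matching export: $best"    # expected here: /a/b/c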
> Of course we could always say that anything before the image file
> name itself shall be in essence an exported path suitable for
> performing a mount.

so due to redirects and v4, we can't really do a "showmounts" and scan
for what's available. ok. we also can't do a top-down probe (ie, first
probe /a/b/c/d, then probe /a/b/c) due to redirects. ok. that means if
we want to support the current uri format (with arbitrary export + file
path combinations) our only option is a bottom-up probe. ie:

  - attempt to mount jurassic:/a
    if fail, error
  - attempt to mount jurassic:/a/b
    if fail, attempt to access path b/c/d/file
    if fail, return error
  (a rough sketch of this probe follows at the end of this message)

it's not exactly elegant, but it only needs to be done once during zone
boot, so assuming it would work, it would probably be ok.

> Also, when talking to V4 servers, we could always just mount the
> server's root / and then any access to the <file-absolute> path will
> trigger a mirror mount; this does not work for V2/V3 servers though.

i'm less concerned about v2/v3 servers. v4 has been the default since
s10. i expect people to use it. that means i think it'd be ok to design
an initial solution that works for v4, and if we get actual customer
requests to support v2/v3 then we can do that as a follow-on RFE.

> I think we may want to elaborate a bit more on the use of the current
> proposed NFS SO-URI of:
>
> 'nfs://<host>[:port]/<file-absolute>'
>
> and its use from Zone land to perform mounts and access the zone
> image.

first off, i'm not sure where "zone land" is. ;) that said, i'd be ok
with doing any of the following:

1) do a bottom-up probe (as described above)

2) change the uri format to add a separator between the export and the
   mount path, say nfs://<host>[:port]/<export>?path=<file-absolute>

3) as you suggested above, require "that anything before the image file
   name itself shall be in essence an exported path suitable for
   performing a mount."

although the last idea seems a bit restricting.

ed
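(A minimal sketch of the bottom-up probe described above, assuming a
hypothetical server 'jurassic', the SO-URI nfs://jurassic/a/b/c/d/file,
and a plain NFSv4 mount(1M); this only illustrates the idea and is not
the eventual zones implementation.)

    #!/bin/sh
    server=jurassic
    file=/a/b/c/d/file                  # <file-absolute> part of the SO-URI
    mnt=/var/zones/nfsmount/zone1/$server

    mkdir -p "$mnt" || exit 1
    probe=""
    for comp in $(echo "$file" | tr '/' ' '); do
            probe="$probe/$comp"
            # try successively longer prefixes of the path until one of
            # them turns out to be a mountable export
            if mount -F nfs -o vers=4 "$server:$probe" "$mnt" 2>/dev/null; then
                    rest=${file#"$probe"}   # image path below the export
                    if [ -f "$mnt$rest" ]; then
                            echo "zone image accessible at $mnt$rest"
                            exit 0
                    fi
                    umount "$mnt"           # image not visible; keep probing
            fi
    done
    echo "could not locate an export covering $server:$file" >&2
    exit 1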