+gluster-devel

----- Original Message -----
> From: "Dan Lambright" <dlamb...@redhat.com>
> To: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> Cc: "Shyam" <srang...@redhat.com>, "Nithya Balachandran" 
> <nbala...@redhat.com>, "Sakshi Bansal" <saban...@redhat.com>
> Sent: Monday, July 20, 2015 8:23:16 AM
> Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> 
> 
> I am posting another version of the patch to discuss. Here is a summary in
> its simplest form:
> 
> The fix tries to address problems we have with tiered volumes and fix-layout.
> 
> If we try to use both the hot and cold tier before fix-layout has completed,
> we get many "stale file" errors; the new hot tier does not have layouts for
> the inodes.
> 
> To avoid such problems, we only use the cold tier until fix-layout is done
> (subvolume count = 1).
> 
> When we detect that fix-layout is done, we will do a graph switch, which will
> create new layouts on demand. We would like to switch to using both tiers
> (subvolume_cnt = 2) only once the graph switch is done.
> 
> There is a hole in that solution. If we make a directory after fix-layout has
> passed the parent (of the new directory), fix-layout will not copy the new
> directory to the new tier.
> 
> If we try to access such directories, the code fails (dht_access does not
> have a cached sub volume).
> 
> So, we detect such directories when we do a lookup/revalidate/discover, and
> store their peculiar state in the layout if they are only accessible on the
> cold tier. Eventually a self heal will happen, and this state will age out.
> 
> I have a unit test and system test for this.
> 
> Basically my questions are:
> - what is the cleanest way to invoke the graph switch from using just the
> cold tier to using both the cold and hot tiers?

This is a long-standing problem which badly needs a fix. I think the 
client/mount cannot rely on the rebalance/tier process for directory creation, 
since I/O on the client is independent and there is no way to synchronize it 
with the rebalance directory heal. The culprit here is the lack of hierarchical 
named lookups from the root down to that directory after a graph switch in the 
mount process. If named lookups are sent, dht is quite capable of creating 
directories on newly added subvols. So, I am proposing some solutions below.

Interface layers (fuse-bridge, gfapi, nfs etc.) should make sure that the 
entire directory hierarchy up to the root is looked up at least once before 
sending fops on an inode after a graph switch. For dht, it is sufficient if 
only the inodes associated with directories are looked up in this fashion. 
However, non-directory inodes might also benefit from this, since VFS 
essentially would have done a hierarchical lookup before doing fops. It is only 
glusterfs which has introduced nameless lookups, but much of the logic is 
designed around named hierarchical lookups. Now, to address the question of 
whether it is possible for interface layers to figure out the ancestry of an 
inode:

    * With fuse-bridge, the entire dentry structure is preserved (at least in 
the first graph which witnessed named lookups from the kernel, and we can 
migrate this structure to newer graphs too). We can use the dentry structure 
from the older graph to send these named lookups and build a similar dentry 
structure in the newer graph as well. This resolution is still on-demand, when 
a fop is sent on an inode (like the existing code, the change being that 
instead of one nameless lookup on the inode, we do named lookups of the parents 
and then of the inode itself in the newer graph). So, named lookups can be sent 
for all inodes, irrespective of whether the inode corresponds to a directory or 
a non-directory. A rough sketch of this parent-first resolution follows the 
list below.

    * I am assuming gfapi is similar to fuse-bridge. We would need verification 
from the people maintaining gfapi as to whether this assumption is correct.

    * The NFS-v3 server allows the client to just pass a file-handle and can 
construct the relevant state to access the files (one of the reasons why 
nameless lookups were introduced in the first place). Since it relies heavily 
on nameless lookups, the dentry structure need not always be present in the NFS 
server process. However, we can borrow some ideas from [1]. If maintaining the 
list of parents of a file in xattrs seems like overkill (basically we would be 
constructing a reverse dentry tree), then at least for the problems faced by 
dht/tier it is good enough if we get this hierarchy for directory inodes. With 
the gfid-based backend, we can always get the path/hierarchy for a directory 
from the gfid of its inode using the .glusterfs directory (within .glusterfs 
there is a symbolic link named after the gfid whose contents can get us the 
ancestry up to the root; see the second sketch below). This solution works for 
_all_ interface layers.
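
Not glusterfs code, just to make the fuse-bridge idea concrete: below is a 
rough, self-contained C sketch of the parent-first named resolution after a 
graph switch. The toy_inode structure and named_lookup() are hypothetical 
stand-ins for the old graph's dentry chain and for a named lookup in the new 
graph.

/* Sketch only: NOT glusterfs code. Illustrates resolving ancestors from the
 * root downwards before the inode itself, so every fop after a graph switch
 * finds a fully built dentry chain in the new graph. */
#include <stdio.h>
#include <stdlib.h>

struct toy_inode {
        struct toy_inode *parent;   /* NULL for root */
        const char       *name;     /* basename under parent */
        int               resolved; /* already looked up in the new graph? */
};

/* Hypothetical stand-in: issue a named lookup of (parent, name) in the new
 * graph. In dht, this is what lets missing directories be self-healed on a
 * newly added (hot) subvolume. */
static int
named_lookup(struct toy_inode *parent, const char *name)
{
        printf("LOOKUP %s under %s\n", name, parent ? parent->name : "/");
        return 0;
}

/* Resolve ancestors first, then the inode itself. */
static int
resolve_in_new_graph(struct toy_inode *inode)
{
        if (inode == NULL || inode->resolved)
                return 0;
        if (resolve_in_new_graph(inode->parent) != 0)
                return -1;
        if (inode->parent && named_lookup(inode->parent, inode->name) != 0)
                return -1;
        inode->resolved = 1;
        return 0;
}

int
main(void)
{
        /* Mirrors the /mnt/z/z1/z2 hierarchy from the reproduction steps. */
        struct toy_inode root = { NULL,  "/",  1 };
        struct toy_inode z    = { &root, "z",  0 };
        struct toy_inode z1   = { &z,    "z1", 0 };
        struct toy_inode z2   = { &z1,   "z2", 0 };

        return resolve_in_new_graph(&z2) ? EXIT_FAILURE : EXIT_SUCCESS;
}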

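A minimal sketch of the .glusterfs ancestry walk mentioned in the NFS point, 
assuming the usual gfid-based backend layout where a directory's gfid entry is 
a symlink of the form ../../<aa>/<bb>/<parent-gfid>/<name>; the program is 
purely illustrative and not existing code.

/* Sketch: walk a directory's ancestry on a brick via .glusterfs symlinks.
 * usage: ./ancestry <brick-path> <dir-gfid> */
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define ROOT_GFID "00000000-0000-0000-0000-000000000001"

int
main(int argc, char *argv[])
{
        char link[PATH_MAX], target[PATH_MAX], gfid[64];
        ssize_t len;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <brick-path> <dir-gfid>\n", argv[0]);
                return 1;
        }

        snprintf(gfid, sizeof(gfid), "%s", argv[2]);

        /* Follow parent links until we reach the root gfid. */
        while (strcmp(gfid, ROOT_GFID) != 0) {
                snprintf(link, sizeof(link), "%s/.glusterfs/%.2s/%.2s/%s",
                         argv[1], gfid, gfid + 2, gfid);

                len = readlink(link, target, sizeof(target) - 1);
                if (len < 0) {
                        perror(link);
                        return 1;
                }
                target[len] = '\0';

                /* Assumed target form: ../../<aa>/<bb>/<parent-gfid>/<name> */
                if (len < 12 + 36) {
                        fprintf(stderr, "unexpected link target: %s\n", target);
                        return 1;
                }
                printf("%s -> %s\n", gfid, target);

                /* Parent gfid is the 36 chars after "../../aa/bb/". */
                snprintf(gfid, sizeof(gfid), "%.36s", target + 12);
        }

        printf("reached root (%s)\n", ROOT_GFID);
        return 0;
}
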
I suspect it is not just dht, but also other cluster xlators like EC and afr, 
and non-cluster entities like quota and geo-rep, which face this issue. I am 
aware of at least one problem in afr: difficulty in identifying a gfid mismatch 
of an entry across subvols after a graph switch. Geo-replication too uses some 
form of gfid-to-path conversion. So, comments from other maintainers/developers 
are highly appreciated.

[1] http://review.gluster.org/5951

> - is the mechanism I am using (hook to dht_get_cached_subvol, state in the
> layout structure) sufficient to prevent access to the hot tier before a
> self-heal happens?
> 
> ----- Original Message -----
> > From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> > To: "Dan Lambright" <dlamb...@redhat.com>
> > Cc: "Shyam" <srang...@redhat.com>, "Nithya Balachandran"
> > <nbala...@redhat.com>, "Sakshi Bansal" <saban...@redhat.com>
> > Sent: Friday, July 17, 2015 9:50:27 PM
> > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > 
> > Sure. I am fine with it. We'll have a google hangout then.
> > 
> > ----- Original Message -----
> > > From: "Dan Lambright" <dlamb...@redhat.com>
> > > To: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> > > Cc: "Shyam" <srang...@redhat.com>, "Nithya Balachandran"
> > > <nbala...@redhat.com>, "Sakshi Bansal" <saban...@redhat.com>
> > > Sent: Friday, July 17, 2015 10:44:47 PM
> > > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > > 
> > > Hi Du,
> > > 
> > > 7:30 PM IST Monday? Just like last time.
> > > 
> > > Dan
> > > 
> > > ----- Original Message -----
> > > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> > > > To: "Dan Lambright" <dlamb...@redhat.com>
> > > > Cc: "Shyam" <srang...@redhat.com>, "Nithya Balachandran"
> > > > <nbala...@redhat.com>, "Sakshi Bansal" <saban...@redhat.com>
> > > > Sent: Friday, July 17, 2015 12:47:21 PM
> > > > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > > > 
> > > > Hi Dan,
> > > > 
> > > > Monday is fine with me. Let us know the time you'll be available.
> > > > 
> > > > regards,
> > > > Raghavendra.
> > > > 
> > > > ----- Original Message -----
> > > > > From: "Dan Lambright" <dlamb...@redhat.com>
> > > > > To: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> > > > > Cc: "Shyam" <srang...@redhat.com>, "Nithya Balachandran"
> > > > > <nbala...@redhat.com>, "Sakshi Bansal" <saban...@redhat.com>
> > > > > Sent: Friday, July 17, 2015 6:48:16 PM
> > > > > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > > > > 
> > > > > Du, Shyam,
> > > > > 
> > > > > > Let's follow up with a meeting. Is today or Monday possible?
> > > > > 
> > > > > Dan
> > > > > 
> > > > > ----- Original Message -----
> > > > > > From: "Dan Lambright" <dlamb...@redhat.com>
> > > > > > To: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> > > > > > Cc: "Shyam" <srang...@redhat.com>, "Nithya Balachandran"
> > > > > > <nbala...@redhat.com>, "Susant Palai" <spa...@redhat.com>,
> > > > > > "Sakshi Bansal" <saban...@redhat.com>
> > > > > > Sent: Wednesday, July 15, 2015 9:44:51 PM
> > > > > > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > > > > > 
> > > > > > Du,
> > > > > > 
> > > > > > Per our discussion today- here is a bit more info on the problem.
> > > > > > 
> > > > > > In QE, they untar a large file, and while that happens, attach a
> > > > > > tier. This causes us to use the hot subvolume (the hashed
> > > > > > subvolume) before fix-layout has finished. This leads to stale
> > > > > > file handle errors.
> > > > > > 
> > > > > > I can recreate this per steps below.
> > > > > > 
> > > > > > 1. create a dist rep volume.
> > > > > > 2. mount it over FUSE.
> > > > > > 3. mkdir -p /mnt/z/z1/z2/z3
> > > > > > 4. cd /mnt/z/z1
> > > > > > 
> > > > > > # the next steps force it so that fix-layout is NOT done; we do
> > > > > > not start the rebalance daemon.
> > > > > > 
> > > > > > 5. stop volume
> > > > > > 6. attach tier
> > > > > > 7. start volume
> > > > > > 
> > > > > > 8.example1: stat z2/z3
> > > > > > 
> > > > > > 8.example2: mkdir z2/newdir
> > > > > > 
> > > > > > Either example1 or example2 produces the problem. We can end up in
> > > > > > the underlying hot DHT translator, in
> > > > > > dht_log_new_layout_for_dir_selfheal(). But no directories have
> > > > > > been created on the hot subvolume. It cannot heal anything, and
> > > > > > returns stale.
> > > > > > 
> > > > > > The flow for example 2 is:
> > > > > > 
> > > > > > tier DHT: fresh lookup / hashed subvol is cold, calls lookup on
> > > > > > the cold tier
> > > > > > 
> > > > > > tier DHT: lookup_cbk calls dht_lookup_directory on both the hot
> > > > > > AND cold subvolumes
> > > > > > 
> > > > > > cold DHT: as revalidate is true, this works
> > > > > > 
> > > > > > hot DHT: fresh lookup / no hashed subvol
> > > > > > 
> > > > > > hot DHT: lookup dir cbk gets a -1 / 116 error for each subvol
> > > > > > 
> > > > > > tier DHT: lookup dir cbk gets -1 / 116 from the hot tier
> > > > > > 
> > > > > > tier DHT: lookup dir cbk gets 0 / 117 from the cold tier (ok)
> > > > > > 
> > > > > > tier DHT: then goes to self-heal - dht_selfheal_directory
> > > > > > 
> > > > > > tier DHT: dht_selfheal_dir_makedir is called; returns 0;
> > > > > > missing_dirs = 0.
> > > > > > 
> > > > > > fuse apparently retries this, and the process repeats a few times
> > > > > > before failing to the user.
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > ----- Original Message -----
> > > > > > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> > > > > > > To: "Shyam" <srang...@redhat.com>
> > > > > > > Cc: "Nithya Balachandran" <nbala...@redhat.com>, "Dan Lambright"
> > > > > > > <dlamb...@redhat.com>, "Susant Palai"
> > > > > > > <spa...@redhat.com>, "Sakshi Bansal" <saban...@redhat.com>
> > > > > > > Sent: Wednesday, July 15, 2015 4:39:56 AM
> > > > > > > Subject: Re: Discuss: http://review.gluster.org/#/c/11368/
> > > > > > > 
> > > > > > > If possible, can we start at 7:00 PM IST? I have to leave by
> > > > > > > 8:15 PM. The discussion might not be over if we start at
> > > > > > > 7:30 PM.
> > > > > > > 
> > > > > > > ----- Original Message -----
> > > > > > > > From: "Shyam" <srang...@redhat.com>
> > > > > > > > To: "Raghavendra Gowdappa" <rgowd...@redhat.com>, "Nithya
> > > > > > > > Balachandran"
> > > > > > > > <nbala...@redhat.com>, "Dan Lambright"
> > > > > > > > <dlamb...@redhat.com>, "Susant Palai" <spa...@redhat.com>
> > > > > > > > Sent: Tuesday, July 14, 2015 11:04:01 PM
> > > > > > > > Subject: Discuss: http://review.gluster.org/#/c/11368/
> > > > > > > > 
> > > > > > > > Tier xlator needs some discussion on this change:
> > > > > > > > http://review.gluster.org/#/c/11368/
> > > > > > > > 
> > > > > > > > I think we can leverage tomorrow's Team on-demand meeting for
> > > > > > > > the same.
> > > > > > > > 
> > > > > > > > So I request that we convene at 7:30 PM IST for this tomorrow;
> > > > > > > > do let us know if you cannot make it.
> > > > > > > > 
> > > > > > > > If there is a better time, let us know.
> > > > > > > > 
> > > > > > > > Shyam
> > > > > > > > 
> > > > > > >
> > > > > 
> > > > 
> > > 
> > 
> 
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
