Re: [Gluster-devel] Single layout at root (Was EHT / DHT)

2014-11-25 Thread Anand Avati
On Tue Nov 25 2014 at 1:28:59 PM Shyam srang...@redhat.com wrote:

 On 11/12/2014 01:55 AM, Anand Avati wrote:
 
 
   On Tue, Nov 11, 2014 at 1:56 PM, Jeff Darcy jda...@redhat.com wrote:
 
(Personally I would have
  done this by mixing in the parent GFID to the hash calculation, but
  that alternative was ignored.)
 
 
  Actually, when DHT was implemented, the concept of a GFID did not (yet)
  exist. For backward compatibility it has simply remained this way ever
  since. Including the GFID in the hash has benefits.

 I am curious here, as this is interesting.

 So the layout's starting-subvol assignment for a directory being based
 on its GFID was provided so that files with the same name distribute
 better, rather than all ending up on the same bricks, right?


Right; for example, we wouldn't want all the README.txt files in the
various directories of a volume to end up on the same server. The way it
is achieved today is that the per-server hash-range assignment is rotated
by a certain amount (how much it is rotated is determined by a separate
hash on the directory path) at the time of mkdir.
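
To make the rotation concrete, here is a minimal toy sketch (the hash
function, subvol count, and names are illustrative stand-ins, not the
actual DHT code; real DHT computes the rotated ranges at mkdir and stores
them in the directory's layout xattrs, this just shows the effect):

  #include <stdint.h>

  #define SUBVOL_CNT 4
  /* +1 so that hashval / RANGE_SZ always lands in 0..SUBVOL_CNT-1 */
  #define RANGE_SZ   (UINT32_MAX / SUBVOL_CNT + 1)

  /* toy stand-in for the real name/path hash */
  static uint32_t toy_hash (const char *s)
  {
          uint32_t h = 5381;
          while (*s)
                  h = h * 33 + (unsigned char) *s++;
          return h;
  }

  /* same name, different parent path => (usually) different subvol */
  static int hashed_subvol (const char *name, const char *parent_path)
  {
          uint32_t hashval  = toy_hash (name);
          uint32_t rotation = toy_hash (parent_path) % SUBVOL_CNT;

          return (int) ((hashval / RANGE_SZ + rotation) % SUBVOL_CNT);
  }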


 Instead, as we _now_ have GFIDs, we could use the GFID together with the
 name to get a similar or better distribution, i.e. GFID+name to
 determine the hashed subvol.


What we could do now is include the parent directory's GFID as an input
to the DHT hash function.

Today, we do approximately:
  int hashval = dm_hash ("readme.txt");         /* name-only hash */
  hash_ranges[] = inode_ctx_get (parent_dir);   /* per-directory layout */
  subvol = find_subvol (hash_ranges, hashval);

Instead, we could do:
  int hashval = new_hash ("readme.txt", parent_dir.gfid);
  hash_ranges[] = global_value;                 /* one volume-wide layout */
  subvol = find_subvol (hash_ranges, hashval);
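
A minimal sketch of what such a new_hash() could look like (purely
illustrative: the FNV-1a mixing used here is an assumption chosen for
brevity, not a proposal for the actual hash function):

  #include <stdint.h>

  typedef unsigned char gfid_t[16];      /* GFIDs are 16-byte UUIDs */

  static uint32_t new_hash (const char *name, const gfid_t parent_gfid)
  {
          uint32_t h = 2166136261u;      /* FNV-1a offset basis */
          int      i;

          /* fold the parent GFID into the seed ... */
          for (i = 0; i < 16; i++)
                  h = (h ^ parent_gfid[i]) * 16777619u;

          /* ... then hash the entry name */
          for (; *name; name++)
                  h = (h ^ (unsigned char) *name) * 16777619u;

          return h;
  }

With the parent GFID folded in first, the same name hashes differently
under different parents, preserving the distribution property that the
per-directory rotation provides today.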

 The idea here would be that on dentry creates we would need to generate
 the GFID on the client, rather than letting the bricks generate it, so
 that we can choose the subvol to wind the FOP to.


The GFID used would be that of the parent (as an entry name is always in
the context of a parent directory/inode). Also, the GFID for a new entry
is already generated by the client; the brick does not generate a GFID.


 This eliminates the need for a layout per sub-directory, and all the
 (interesting) problems that come with it, replacing them with a single
 layout at the root. I am not sure if it handles all the use cases and
 paths that we have now (which needs more understanding).

 I do understand there is a backward compatibility issue here, but other
 than that, this sounds better than the current scheme, as there is a
 single layout to read/optimize/stash/etc. across clients.

 Can I understand the rationale for this better, i.e. what you folks are
 thinking? Am I missing something, or over-reading the benefits that this
 can provide?


I think you understand it right. The benefit is that one could have a
single hash layout for the entire volume, with the directory-specificness
implemented by including the directory GFID in the hash function. The way
I see it, the compromise would be something like:

Pro per-directory ranges: by having per-directory hash ranges, we can do
easier incremental rebalance. Partial progress is well tolerated and does
not impact the entire volume. While a given directory is undergoing
rebalance, we need to enter "unhashed lookup" mode for that directory
alone, and only for that period of time.

Con per-directory ranges: just the new hash assignment phase (which
affects the placement of new files/data, without moving old data) is
itself an extended process, crawling the entire volume with complex
per-directory operations. The number of points in the system where things
can break (i.e., result in overlaps and holes in the ranges) is high.
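
To illustrate the kind of invariant that breaks, here is a hypothetical
checker (struct and function names are made up for the example): the
ranges across a directory's subvols must tile the full 32-bit hash space
with no holes and no overlaps:

  #include <stdbool.h>
  #include <stdint.h>

  struct range {
          uint32_t start;
          uint32_t stop;     /* inclusive */
  };

  /* assumes 'r' is sorted by start */
  static bool layout_is_sane (const struct range *r, int cnt)
  {
          int i;

          if (cnt == 0 || r[0].start != 0)
                  return false;                    /* hole at the bottom */

          for (i = 1; i < cnt; i++) {
                  if (r[i - 1].stop == UINT32_MAX)
                          return false;            /* range past the top */
                  if (r[i].start != r[i - 1].stop + 1)
                          return false;            /* hole or overlap */
          }
          return r[cnt - 1].stop == UINT32_MAX;    /* covers the top */
  }

With one such layout per directory, every per-directory assignment or
fix-up is another chance for this invariant to be violated somewhere in
the volume.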

Pro single layout with dir GFID in hash: avoids the numerous moving
parts (per-directory hash ranges) which can potentially break.

Con single layout with dir GFID in hash: rebalance phase 1 (assigning the
new layout) is atomic for the entire volume - unhashed lookup has to be
on for all directories for the entire period. To mitigate this, we could
explore versioning the centralized hash ranges and storing the version
used by each directory in its xattrs (updating the version as the
rebalance progresses). But then we have more centralized metadata (which
may or may not be a worthy compromise - not sure).
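
A rough sketch of how that versioned mitigation could look (all names,
structures, and the xattr key are assumptions made up for illustration,
not an existing design):

  #include <stdbool.h>
  #include <stdint.h>

  struct volume_layout {
          uint64_t version;   /* bumped whenever ranges are reassigned */
          /* ... the centralized hash ranges for all subvols ... */
  };

  /* per directory, e.g. in a hypothetical
   * trusted.glusterfs.dht.layout-version xattr */
  struct dir_layout_state {
          uint64_t version;   /* layout version this dir was fixed up to */
  };

  /* unhashed (fan-out) lookup is needed only for directories that
   * still lag behind the current centralized layout */
  static bool need_unhashed_lookup (const struct volume_layout *vol,
                                    const struct dir_layout_state *dir)
  {
          return dir->version < vol->version;
  }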

In summary, including the GFID in the hash calculation does open up
interesting possibilities and is worthy of serious consideration.

HTH,
Avati
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Single layout at root (Was EHT / DHT)

2014-11-25 Thread Shyam

On 11/25/2014 05:03 PM, Anand Avati wrote:

[... full quote of the earlier message trimmed; replying to the final
point ...]
Con single layout with dir GFID in hash: rebalance phase 1 (assigning the
new layout) is atomic for the entire volume - unhashed lookup has to be
on for all directories for the entire period. To mitigate this, we could
explore versioning the centralized hash ranges and storing the version
used by each directory in its xattrs (updating the version as the
rebalance progresses). But then we have more centralized metadata (which
may or may not be a worthy compromise - not sure).


Agreed, auto-unhashed would have to wait longer before being re-armed.

Just throwing out some more thoughts on the same:

Unhashed-auto can also benefit from just linkto creation, rather than
requiring a data rebalance (i.e., actual movement of data). So in phase 0
we could just create the linkto files and then turn auto-unhashed back
on, as lookups would then find the (linkto) file on the hashed subvol.
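
For illustration, a minimal sketch of creating such a linkto file on a
brick (the zero-byte, sticky-bit file carrying a linkto xattr that names
the data's subvol follows DHT's convention, but this helper and its use
in a hypothetical phase 0 are assumptions):

  #include <fcntl.h>
  #include <string.h>
  #include <sys/stat.h>
  #include <sys/xattr.h>
  #include <unistd.h>

  /* create a zero-byte linkto file at 'path' on the new hashed subvol,
   * pointing lookups at the subvol that still holds the data */
  static int create_linkto (const char *path, const char *data_subvol)
  {
          int fd = open (path, O_CREAT | O_EXCL | O_WRONLY, S_ISVTX);
          if (fd < 0)
                  return -1;

          if (fsetxattr (fd, "trusted.glusterfs.dht.linkto", data_subvol,
                         strlen (data_subvol) + 1, 0) < 0) {
                  close (fd);
                  unlink (path);
                  return -1;
          }
          return close (fd);
  }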


Other abilities, like giving directories weighted layout ranges based on
the size of bricks, could be affected, i.e., forcing a rebalance when a
brick size is