Re: [Gluster-devel] Single layout at root (Was EHT / DHT)

2014-11-25 Thread Anand Avati
On Tue Nov 25 2014 at 1:28:59 PM Shyam srang...@redhat.com wrote:

 On 11/12/2014 01:55 AM, Anand Avati wrote:
 
 
   On Tue, Nov 11, 2014 at 1:56 PM, Jeff Darcy jda...@redhat.com wrote:
 
(Personally I would have
  done this by mixing in the parent GFID to the hash calculation, but
  that alternative was ignored.)
 
 
  Actually, when DHT was implemented, the concept of a GFID did not (yet)
  exist. For backward compatibility it has simply remained this way ever
  since. Including the GFID in the hash has benefits.

 I am curious here, as this is interesting.

 So the layout's starting-subvol assignment for a directory being based
 on its GFID was provided so that files with the same name distribute
 better, rather than all ending up on the same bricks, right?


Right; for example, we wouldn't want all the README.txt files in the
various directories of a volume to end up on the same server. The way it
is achieved today is that the per-server hash-range assignment is rotated
by a certain amount (how much it is rotated is determined by a separate
hash on the directory path) at the time of mkdir.
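
To make the rotation concrete, here is a minimal toy sketch (the hash
function, subvol count, and names are illustrative stand-ins, not the
actual DHT code; real DHT computes the rotated ranges at mkdir and stores
them in the directory's layout xattrs, this just shows the effect):

  #include <stdint.h>

  #define SUBVOL_CNT 4
  /* +1 so that hashval / RANGE_SZ always lands in 0..SUBVOL_CNT-1 */
  #define RANGE_SZ   (UINT32_MAX / SUBVOL_CNT + 1)

  /* toy stand-in for the real name/path hash */
  static uint32_t toy_hash (const char *s)
  {
          uint32_t h = 5381;
          while (*s)
                  h = h * 33 + (unsigned char) *s++;
          return h;
  }

  /* same name, different parent path => (usually) different subvol */
  static int hashed_subvol (const char *name, const char *parent_path)
  {
          uint32_t hashval  = toy_hash (name);
          uint32_t rotation = toy_hash (parent_path) % SUBVOL_CNT;

          return (int) ((hashval / RANGE_SZ + rotation) % SUBVOL_CNT);
  }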


 Instead, as we _now_ have GFIDs, we could use the GFID together with the
 name to get a similar or better distribution, i.e. GFID+name to
 determine the hashed subvol.


What we could do now is include the parent directory's GFID as an input
to the DHT hash function.

Today, we do approximately:
  int hashval = dm_hash ("readme.txt");         /* name-only hash */
  hash_ranges[] = inode_ctx_get (parent_dir);   /* per-directory layout */
  subvol = find_subvol (hash_ranges, hashval);

Instead, we could do:
  int hashval = new_hash ("readme.txt", parent_dir.gfid);
  hash_ranges[] = global_value;                 /* one volume-wide layout */
  subvol = find_subvol (hash_ranges, hashval);
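
A minimal sketch of what such a new_hash() could look like (purely
illustrative: the FNV-1a mixing used here is an assumption chosen for
brevity, not a proposal for the actual hash function):

  #include <stdint.h>

  typedef unsigned char gfid_t[16];      /* GFIDs are 16-byte UUIDs */

  static uint32_t new_hash (const char *name, const gfid_t parent_gfid)
  {
          uint32_t h = 2166136261u;      /* FNV-1a offset basis */
          int      i;

          /* fold the parent GFID into the seed ... */
          for (i = 0; i < 16; i++)
                  h = (h ^ parent_gfid[i]) * 16777619u;

          /* ... then hash the entry name */
          for (; *name; name++)
                  h = (h ^ (unsigned char) *name) * 16777619u;

          return h;
  }

With the parent GFID folded in first, the same name hashes differently
under different parents, preserving the distribution property that the
per-directory rotation provides today.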

 The idea here would be that on dentry creates we would need to generate
 the GFID on the client, rather than letting the bricks generate it, so
 that we can choose the subvol to wind the FOP to.


The GFID used would be that of the parent (as an entry name is always in
the context of a parent directory/inode). Also, the GFID for a new entry
is already generated by the client; the brick does not generate a GFID.


 This eliminates the need for a layout per sub-directory, and all the
 (interesting) problems that come with it, replacing them with a single
 layout at the root. I am not sure if it handles all the use cases and
 paths that we have now (which needs more understanding).

 I do understand there is a backward compatibility issue here, but other
 than that, this sounds better than the current scheme, as there is a
 single layout to read/optimize/stash/etc. across clients.

 Can I understand the rationale for this better, i.e. what you folks are
 thinking? Am I missing something, or over-reading the benefits that this
 can provide?


I think you understand it right. The benefit is that one could have a
single hash layout for the entire volume, with the directory-specificness
implemented by including the directory GFID in the hash function. The way
I see it, the compromise would be something like:

Pro per-directory ranges: by having per-directory hash ranges, we can do
easier incremental rebalance. Partial progress is well tolerated and does
not impact the entire volume. While a given directory is undergoing
rebalance, we need to enter "unhashed lookup" mode for that directory
alone, and only for that period of time.

Con per-directory ranges: just the new hash assignment phase (which
affects the placement of new files/data, without moving old data) is
itself an extended process, crawling the entire volume with complex
per-directory operations. The number of points in the system where things
can break (i.e., result in overlaps and holes in the ranges) is high.
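
To illustrate the kind of invariant that breaks, here is a hypothetical
checker (struct and function names are made up for the example): the
ranges across a directory's subvols must tile the full 32-bit hash space
with no holes and no overlaps:

  #include <stdbool.h>
  #include <stdint.h>

  struct range {
          uint32_t start;
          uint32_t stop;     /* inclusive */
  };

  /* assumes 'r' is sorted by start */
  static bool layout_is_sane (const struct range *r, int cnt)
  {
          int i;

          if (cnt == 0 || r[0].start != 0)
                  return false;                    /* hole at the bottom */

          for (i = 1; i < cnt; i++) {
                  if (r[i - 1].stop == UINT32_MAX)
                          return false;            /* range past the top */
                  if (r[i].start != r[i - 1].stop + 1)
                          return false;            /* hole or overlap */
          }
          return r[cnt - 1].stop == UINT32_MAX;    /* covers the top */
  }

With one such layout per directory, every per-directory assignment or
fix-up is another chance for this invariant to be violated somewhere in
the volume.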

Pro single layout with dir GFID in hash: avoids the numerous moving
parts (per-directory hash ranges) which can potentially break.

Con single layout with dir GFID in hash: rebalance phase 1 (assigning the
new layout) is atomic for the entire volume - unhashed lookup has to be
on for all directories for the entire period. To mitigate this, we could
explore versioning the centralized hash ranges and storing the version
used by each directory in its xattrs (updating the version as the
rebalance progresses). But then we have more centralized metadata (which
may or may not be a worthy compromise - not sure).
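
A rough sketch of how that versioned mitigation could look (all names,
structures, and the xattr key are assumptions made up for illustration,
not an existing design):

  #include <stdbool.h>
  #include <stdint.h>

  struct volume_layout {
          uint64_t version;   /* bumped whenever ranges are reassigned */
          /* ... the centralized hash ranges for all subvols ... */
  };

  /* per directory, e.g. in a hypothetical
   * trusted.glusterfs.dht.layout-version xattr */
  struct dir_layout_state {
          uint64_t version;   /* layout version this dir was fixed up to */
  };

  /* unhashed (fan-out) lookup is needed only for directories that
   * still lag behind the current centralized layout */
  static bool need_unhashed_lookup (const struct volume_layout *vol,
                                    const struct dir_layout_state *dir)
  {
          return dir->version < vol->version;
  }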

In summary, including the GFID in the hash calculation does open up
interesting possibilities and is worthy of serious consideration.

HTH,
Avati
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Single layout at root (Was EHT / DHT)

2014-11-25 Thread Shyam

On 11/25/2014 05:03 PM, Anand Avati wrote:

[... full quote of the earlier message trimmed; replying to the final
point ...]
Con single layout with dir GFID in hash: rebalance phase 1 (assigning the
new layout) is atomic for the entire volume - unhashed lookup has to be
on for all directories for the entire period. To mitigate this, we could
explore versioning the centralized hash ranges and storing the version
used by each directory in its xattrs (updating the version as the
rebalance progresses). But then we have more centralized metadata (which
may or may not be a worthy compromise - not sure).


Agreed, auto-unhashed would have to wait longer before being re-armed.

Just throwing out some more thoughts on the same:

Unhashed-auto can also benefit from just linkto creation, rather than
requiring a data rebalance (i.e., actual movement of data). So in phase 0
we could just create the linkto files and then turn auto-unhashed back
on, as lookups would then find the (linkto) file on the hashed subvol.
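
For illustration, a minimal sketch of creating such a linkto file on a
brick (the zero-byte, sticky-bit file carrying a linkto xattr that names
the data's subvol follows DHT's convention, but this helper and its use
in a hypothetical phase 0 are assumptions):

  #include <fcntl.h>
  #include <string.h>
  #include <sys/stat.h>
  #include <sys/xattr.h>
  #include <unistd.h>

  /* create a zero-byte linkto file at 'path' on the new hashed subvol,
   * pointing lookups at the subvol that still holds the data */
  static int create_linkto (const char *path, const char *data_subvol)
  {
          int fd = open (path, O_CREAT | O_EXCL | O_WRONLY, S_ISVTX);
          if (fd < 0)
                  return -1;

          if (fsetxattr (fd, "trusted.glusterfs.dht.linkto", data_subvol,
                         strlen (data_subvol) + 1, 0) < 0) {
                  close (fd);
                  unlink (path);
                  return -1;
          }
          return close (fd);
  }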


Other abilities, like giving directories weighted layout ranges based on
the size of bricks, could be affected, i.e., forcing a rebalance when a
brick size is