Re: [Gluster-devel] Feature review: Improved rebalance performance
On Tuesday 01 July 2014 10:59:08 Shyamsundar Ranganathan wrote:
> > As I see it, rebalance on access should be a complement to normal
> > rebalance to keep the volume _more_ balanced (keep accessed files on the
> > right brick to avoid unnecessary delays due to global lookups or link
> > file redirections), but it cannot assure that the volume is fully
> > rebalanced.
>
> True, except in the case where we ensure link files are created during
> rebalance _changed_.

I think we are talking about slightly different things. It seems that you
consider a file that is not placed in the right brick, but has a link file
in the right brick, to be already balanced. I consider that this is not
really balanced. A link file is good to avoid unnecessary global lookups,
but it still incurs a performance hit, and a future rebalance will have
more work than expected if those files are not placed where they should be.

I think it's important to keep as few of these badly located files as
possible at all times. This is the reason I'm proposing to use an index to
track small changes (i.e. renames) so that they can be identified and moved
very fast. This is mainly useful when the volume is considered stable (i.e.
no brick is being added or removed and the last full rebalance has
finished).

Xavi
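A minimal sketch of such an index: record the gfid of each badly placed
file (e.g. after a rename) as an empty entry in a flat per-brick index
directory, so a light rebalance can enumerate candidates without crawling
the volume. The path and helper below are hypothetical, not the index
xlator's actual on-disk format:

    #include <errno.h>
    #include <fcntl.h>
    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BADLY_PLACED_DIR "/bricks/brick1/.glusterfs/indices/badly-placed"

    static int index_badly_placed(const char *gfid_str)
    {
        char path[PATH_MAX];
        snprintf(path, sizeof(path), "%s/%s", BADLY_PLACED_DIR, gfid_str);

        /* One empty file per gfid; creation is idempotent, so repeated
         * renames of the same file cost at most one failed open. */
        int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
        if (fd < 0)
            return (errno == EEXIST) ? 0 : -1;
        close(fd);
        return 0;
    }

A later "light" rebalance would read this directory, revalidate each entry
against the current layout, and migrate only the files that are still
misplaced.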
Re: [Gluster-devel] Feature review: Improved rebalance performance
On 07/01/2014 11:15 AM, Harshavardhana wrote:
> > Besides bandwidth limits, there also need to be monitors on brick
> > latency. We don't want so many queued iops that operating performance
> > is impacted.
>
> AFAIK - rebalance and self-heal threads run in the low-priority queue in
> io-threads by default.

No, they don't. We tried doing that, but based on experiences from users we
disabled that in io-threads.

Pranith
Re: [Gluster-devel] Feature review: Improved rebalance performance
On Monday 30 June 2014 16:18:09 Shyamsundar Ranganathan wrote:
> > Will this rebalance on access feature be enabled always or only during
> > a brick addition/removal to move files that do not go to the affected
> > brick while the main rebalance is populating or removing files from the
> > brick ?
>
> The rebalance on access, in my head, stands as follows (a little more
> detailed than what is in the feature page):
>
> Step 1: Initiation of the process
> - Admin chooses to rebalance _changed_ bricks
> - This could mean added/removed/changed size bricks
> [3]- Rebalance on access is triggered, so as to move files when they are
> accessed, but asynchronously
> [1]- Background rebalance acts only to (re)move data (from)to these bricks
> [2]- This would also change the layout for all directories, to include
> the new configuration of the cluster, so that newer data is placed in the
> correct bricks
>
> Step 2: Completion of background rebalance
> - Once background rebalance is complete, the rebalance status is noted as
> success/failure based on what the background rebalance process did
> - This will not stop the on access rebalance, as data is still all over
> the place, and enhancements like lookup-unhashed=auto will have trouble

I don't see why stopping rebalance on access would be a problem when
lookup-unhashed=auto is in use. If I understand
http://review.gluster.org/7702/ correctly, when the directory commit hash
does not match that of the volume root, a global lookup will be made. If we
change the layout in [3], it will also change (or it should) the commit
hash of the directory. This means that even if files of that directory are
not rebalanced yet, they will be found regardless of whether on access
rebalance is enabled or not. Am I missing something ?

> Step 3: Admin can initiate a full rebalance
> - When this is complete then the on access rebalance would be turned off,
> as the cluster is rebalanced!
>
> Step 2.5/4: Choosing to stop the on access rebalance
> - This can be initiated by the admin, post 3, which is more logical, or
> between 2 and 3, in which case lookup everywhere for files etc. cannot be
> avoided due to [2] above

Having the possibility for admins to enable/disable this feature seems
interesting. However, I also think it should be forcibly enabled when
rebalancing _changed_ bricks.

> Issues and possible solutions:
> [4] One other thought is to create link files, as a part of [1], for
> files that do not belong to the right bricks but are _not_ going to be
> rebalanced as their source/destination is not a changed brick. This
> _should_ be faster than moving data around and rebalancing these files.
> It should also avoid the problem that, post a rebalance _changed_
> command, the cluster may have files in the wrong place based on the
> layout, as the link files would be present to correct the situation. In
> this situation the rebalance on access can be left on indefinitely and
> turning it off does not serve much purpose.

I think that creating link files is a cheap task, especially if rebalance
handles files in parallel. However I'm not sure this will make any
measurable difference in performance on future accesses (in theory it
should avoid a global lookup once). This would need to be tested to decide.

> Enabling rebalance on access always is fine, but I am not sure it buys us
> gluster states that mean the cluster is in a balanced situation, for
> other actions like the lookup-unhashed change mentioned, which may need
> more than just the link files in place. Examples could be mismatched or
> overly space-committed bricks with old, unaccessed data, etc., but I do
> not have a clear example yet.
As I see it, rebalance on access should be a complement to normal rebalance
to keep the volume _more_ balanced (keep accessed files on the right brick
to avoid unnecessary delays due to global lookups or link file
redirections), but it cannot assure that the volume is fully rebalanced.

> Just stating, the core intention of rebalance _changed_ is to create
> space in existing bricks when the cluster grows faster, or be able to
> remove bricks from the cluster faster.

That is a very important feature. I've missed it several times when
expanding a volume. In fact we needed to write some scripts to do something
similar before launching a full rebalance.

> Redoing a rebalance _changed_ again due to a gluster configuration
> change, i.e. expanding the cluster again say, needs some thought. It does
> not impact whether rebalance on access is running or not; the only thing
> it may impact is the choice of files that are already put into the on
> access queue based on the older layout, due to the older cluster
> configuration. Just noting this here.

This will need to be thought through more deeply, but if we only have a
queue of files that *may* need migration, and we really check the target
volume at the time of migration, I think this won't pose much of a problem
in case of successive rebalances.

> In short if we do [4] then we can leave rebalance on access turned on
> always, unless we have some other counter examples or use cases that are
> not thought of.
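A minimal sketch of that migration-time check, with hypothetical stand-ins
(subvol_t, layout_hash_subvol, file_cached_subvol) for DHT's internal
layout lookup:

    /* Re-check a queued entry against the layout in force *now*, so queues
     * built under an older layout stay harmless across successive
     * rebalances. All types and helpers are illustrative, not DHT's API. */
    typedef struct { int id; } subvol_t;

    typedef enum { MIG_SKIP, MIG_MOVE } mig_decision_t;

    extern subvol_t *layout_hash_subvol(const char *name); /* current layout */
    extern subvol_t *file_cached_subvol(const char *name); /* where data lives */

    static mig_decision_t revalidate(const char *name)
    {
        subvol_t *hashed = layout_hash_subvol(name);
        subvol_t *cached = file_cached_subvol(name);

        /* The queue entry may predate the latest layout change; move the
         * file only if it is still misplaced under the current layout. */
        return (hashed->id == cached->id) ? MIG_SKIP : MIG_MOVE;
    }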
Re: [Gluster-devel] Feature review: Improved rebalance performance
- Original Message -
From: Shyamsundar Ranganathan srang...@redhat.com
To: Xavier Hernandez xhernan...@datalab.es
Cc: gluster-devel@gluster.org
Sent: Tuesday, July 1, 2014 1:48:09 AM
Subject: Re: [Gluster-devel] Feature review: Improved rebalance performance

> From: Xavier Hernandez xhernan...@datalab.es
>
> Hi Shyam,
>
> On Thursday 26 June 2014 14:41:13 Shyamsundar Ranganathan wrote:
> > It also touches upon a rebalance on access like mechanism where we
> > could potentially move data out of existing bricks to a newer brick
> > faster, in the case of brick addition, and vice versa for brick
> > removal, and heal the rest of the data on access.
>
> Will this rebalance on access feature be enabled always or only during a
> brick addition/removal to move files that do not go to the affected brick
> while the main rebalance is populating or removing files from the brick ?

The rebalance on access, in my head, stands as follows (a little more
detailed than what is in the feature page):

Step 1: Initiation of the process
- Admin chooses to rebalance _changed_ bricks
- This could mean added/removed/changed size bricks
[3]- Rebalance on access is triggered, so as to move files when they are
accessed, but asynchronously
[1]- Background rebalance acts only to (re)move data (from)to these bricks
[2]- This would also change the layout for all directories, to include the
new configuration of the cluster, so that newer data is placed in the
correct bricks

Step 2: Completion of background rebalance
- Once background rebalance is complete, the rebalance status is noted as
success/failure based on what the background rebalance process did
- This will not stop the on access rebalance, as data is still all over the
place, and enhancements like lookup-unhashed=auto will have trouble

Step 3: Admin can initiate a full rebalance
- When this is complete then the on access rebalance would be turned off,
as the cluster is rebalanced!

Step 2.5/4: Choosing to stop the on access rebalance
- This can be initiated by the admin, post 3, which is more logical, or
between 2 and 3, in which case lookup everywhere for files etc. cannot be
avoided due to [2] above

Issues and possible solutions:
[4] One other thought is to create link files, as a part of [1], for files
that do not belong to the right bricks but are _not_ going to be rebalanced
as their source/destination is not a changed brick. This _should_ be faster
than moving data around and rebalancing these files. It should also avoid
the problem that, post a rebalance _changed_ command, the cluster may have
files in the wrong place based on the layout, as the link files would be
present to correct the situation. In this situation the rebalance on access
can be left on indefinitely and turning it off does not serve much purpose.

Enabling rebalance on access always is fine, but I am not sure it buys us
gluster states that mean the cluster is in a balanced situation, for other
actions like the lookup-unhashed change mentioned, which may need more than
just the link files in place. Examples could be mismatched or overly
space-committed bricks with old, unaccessed data, etc., but I do not have a
clear example yet.

Just stating, the core intention of rebalance _changed_ is to create space
in existing bricks when the cluster grows faster, or be able to remove
bricks from the cluster faster.

Redoing a rebalance _changed_ again due to a gluster configuration change,
i.e. expanding the cluster again say, needs some thought.
It does not impact whether rebalance on access is running or not; the only
thing it may impact is the choice of files that are already put into the on
access queue based on the older layout, due to the older cluster
configuration. Just noting this here.

In short if we do [4] then we can leave rebalance on access turned on
always, unless we have some other counter examples or use cases that are
not thought of.

Doing [4] seems logical, so I would state that we should, but from a
performance angle of improving rebalance, we need to weigh its worth
against the cost to IO access paths of not having [4] (again, considering
the improvement that lookup-unhashed brings, it may be obvious that [4]
should be done).

A note on [3]: the intention is to start an asynchronous sync task that
rebalances the file on access, and not impact the IO path. So if a file is
chosen by the IO path as needing a rebalance, then a sync task with the
required xattr to trigger a file move is set up, and setxattr is called;
that should take care of the file migration and let the IO path progress
as is.

Reading through your mail, a better way of doing this, by sharing the load,
would be to use an index, so that each node in the cluster has a list of
accessed files that need a rebalance. The above method for [3] would be
client heavy and would incur a network read and write, whereas the index
manner of doing
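A minimal sketch of that setxattr-based trigger from a client-side task;
the key mirrors the virtual xattr the rebalance crawler uses to make DHT
migrate a file, but treat the exact key and value here as assumptions:

    /* Nudge DHT to migrate one file, asynchronously to the original fop.
     * DHT intercepts this setxattr and performs the move itself; no file
     * data passes through this caller. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/xattr.h>

    static int trigger_migration(const char *fuse_path)
    {
        const char *key = "distribute.migrate-data"; /* assumed key */
        const char *val = "force";                   /* assumed value */

        if (setxattr(fuse_path, key, val, strlen(val), 0) != 0) {
            perror("migrate trigger");
            return -1;
        }
        return 0;
    }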
Re: [Gluster-devel] Feature review: Improved rebalance performance
- Original Message -
From: Xavier Hernandez xhernan...@datalab.es
To: Raghavendra Gowdappa rgowd...@redhat.com
Cc: Shyamsundar Ranganathan srang...@redhat.com, gluster-devel@gluster.org
Sent: Tuesday, July 1, 2014 3:10:29 PM
Subject: Re: [Gluster-devel] Feature review: Improved rebalance performance

> On Tuesday 01 July 2014 02:37:34 Raghavendra Gowdappa wrote:
> > > Another thing to consider for future versions is to modify the
> > > current DHT to use consistent hashing, and even change the hash value
> > > (using the gfid instead of a hash of the name would solve the rename
> > > problem). Consistent hashing would drastically reduce the number of
> > > files that need to be moved and already solves some of the current
> > > problems. This change needs a lot of thinking though.
> >
> > The problem with using gfid for hashing instead of name is that we run
> > into a chicken and egg problem. Before lookup, we cannot know the gfid
> > of the file, and to look up the file, we need the gfid to find out the
> > node on which the file resides. Of course, this problem would go away
> > if we look up (maybe just during fresh lookups) on all the nodes, but
> > that slows down fresh lookups and may not be acceptable.
>
> I think it's not so problematic, and the benefits would be considerable.
> The gfid of the root directory is always known. This means that we could
> always do a lookup on root by gfid. I haven't tested it, but as I
> understand it, when you want to do a getxattr on a file inside a
> subdirectory, for example, the kernel will issue lookups on all
> intermediate directories to check,

Yes, but how does dht handle these lookups? Are you suggesting that we wind
the lookup call to all subvolumes (since we don't know which subvolume the
file is present on, for lack of a gfid)?

> at least, the access rights before finally reading the xattr of the file.
> This means that we can get and cache the gfids of all intermediate
> directories in the process. Even if there's some operation that does not
> issue a previous lookup, we could do that lookup if it's not cached. Of
> course, if there were many more operations not issuing a previous lookup,
> this solution won't be good, but I think this is not the case.
>
> I'll try to do some tests to see if this is correct.
>
> Xavi
Re: [Gluster-devel] Feature review: Improved rebalance performance
On Tuesday 01 July 2014 05:55:51 Raghavendra Gowdappa wrote:
> - Original Message -
> > Another thing to consider for future versions is to modify the current
> > DHT to use consistent hashing, and even change the hash value (using
> > the gfid instead of a hash of the name would solve the rename problem).
> > Consistent hashing would drastically reduce the number of files that
> > need to be moved and already solves some of the current problems. This
> > change needs a lot of thinking though.
>
> The problem with using gfid for hashing instead of name is that we run
> into a chicken and egg problem. Before lookup, we cannot know the gfid of
> the file, and to look up the file, we need the gfid to find out the node
> on which the file resides. Of course, this problem would go away if we
> look up (maybe just during fresh lookups) on all the nodes, but that
> slows down fresh lookups and may not be acceptable.
>
> > I think it's not so problematic, and the benefits would be
> > considerable. The gfid of the root directory is always known. This
> > means that we could always do a lookup on root by gfid. I haven't
> > tested it, but as I understand it, when you want to do a getxattr on a
> > file inside a subdirectory, for example, the kernel will issue lookups
> > on all intermediate directories to check,
>
> Yes, but how does dht handle these lookups? Are you suggesting that we
> wind the lookup call to all subvolumes (since we don't know which
> subvolume the file is present on, for lack of a gfid)?

Oops, that's true. It only works combined with another idea we had about
storing directories as special files (using the same redundancy as normal
files). This way a lookup for an entry would be translated into a special
lookup on the parent directory (we know where it is and its gfid) asking
for a specific entry, which would return its gfid (and probably some other
info). Of course this has more implications, like the bricks not being able
to maintain a (partial) view of the file system as they do now.

Right now, using the gfid as the hash key is not possible because this
would need asking each subvolume on lookups, as you say, and this is not
efficient. The solution I commented on would need some important
architectural changes. It could be an option to consider for 4.0.

Xavi
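A minimal sketch of that directory-as-file resolution, walking a path one
component at a time starting from the well-known root gfid (in Gluster the
root gfid is the fixed 00000000-0000-0000-0000-000000000001); every type
and helper here is hypothetical:

    #include <stdint.h>
    #include <string.h>

    typedef struct { uint8_t bytes[16]; } gfid_t;

    extern int subvol_for_gfid(gfid_t gfid);            /* gfid-based hash */
    extern int dir_entry_lookup(int subvol, gfid_t dir, /* one network hop */
                                const char *name, gfid_t *out);

    /* Resolve "a/b/c" from the root gfid: one directed lookup per
     * component, never a broadcast to every subvolume. */
    static int resolve(gfid_t root, const char *path, gfid_t *out)
    {
        gfid_t cur = root;
        char buf[4096], *save, *comp;

        strncpy(buf, path, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';
        for (comp = strtok_r(buf, "/", &save); comp;
             comp = strtok_r(NULL, "/", &save)) {
            if (dir_entry_lookup(subvol_for_gfid(cur), cur, comp, &cur) != 0)
                return -1;
        }
        *out = cur;
        return 0;
    }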
Re: [Gluster-devel] Feature review: Improved rebalance performance
> From: Xavier Hernandez xhernan...@datalab.es
>
> On Monday 30 June 2014 16:18:09 Shyamsundar Ranganathan wrote:
> > > Will this rebalance on access feature be enabled always or only
> > > during a brick addition/removal to move files that do not go to the
> > > affected brick while the main rebalance is populating or removing
> > > files from the brick ?
> >
> > The rebalance on access, in my head, stands as follows (a little more
> > detailed than what is in the feature page):
> >
> > Step 1: Initiation of the process
> > - Admin chooses to rebalance _changed_ bricks
> > - This could mean added/removed/changed size bricks
> > [3]- Rebalance on access is triggered, so as to move files when they
> > are accessed, but asynchronously
> > [1]- Background rebalance acts only to (re)move data (from)to these
> > bricks
> > [2]- This would also change the layout for all directories, to include
> > the new configuration of the cluster, so that newer data is placed in
> > the correct bricks
> >
> > Step 2: Completion of background rebalance
> > - Once background rebalance is complete, the rebalance status is noted
> > as success/failure based on what the background rebalance process did
> > - This will not stop the on access rebalance, as data is still all over
> > the place, and enhancements like lookup-unhashed=auto will have trouble
>
> I don't see why stopping rebalance on access would be a problem when
> lookup-unhashed=auto is in use. If I understand
> http://review.gluster.org/7702/ correctly, when the directory commit hash
> does not match that of the volume root, a global lookup will be made. If
> we change the layout in [3], it will also change (or it should) the
> commit hash of the directory. This means that even if files of that
> directory are not rebalanced yet, they will be found regardless of
> whether on access rebalance is enabled or not. Am I missing something ?

The comment was more to state that the speed up gained by lookup-unhashed
would be lost for the time that the cluster is not rebalanced completely,
or has not noted all redirections as link files. The feature will work, but
sub-optimally, and we need to consider/reduce the time for which this
sub-optimal behavior is in effect.

> > Step 3: Admin can initiate a full rebalance
> > - When this is complete then the on access rebalance would be turned
> > off, as the cluster is rebalanced!
> >
> > Step 2.5/4: Choosing to stop the on access rebalance
> > - This can be initiated by the admin, post 3, which is more logical, or
> > between 2 and 3, in which case lookup everywhere for files etc. cannot
> > be avoided due to [2] above
>
> Having the possibility for admins to enable/disable this feature seems
> interesting. However, I also think it should be forcibly enabled when
> rebalancing _changed_ bricks.

Yes, when rebalance _changed_ is in effect the rebalance on access is also
in effect, as noted in Step 1 of the elaboration above.

> > Issues and possible solutions:
> > [4] One other thought is to create link files, as a part of [1], for
> > files that do not belong to the right bricks but are _not_ going to be
> > rebalanced as their source/destination is not a changed brick. This
> > _should_ be faster than moving data around and rebalancing these files.
> > It should also avoid the problem that, post a rebalance _changed_
> > command, the cluster may have files in the wrong place based on the
> > layout, as the link files would be present to correct the situation. In
> > this situation the rebalance on access can be left on indefinitely and
> > turning it off does not serve much purpose.
>
> I think that creating link files is a cheap task, especially if rebalance
> handles files in parallel.
> However I'm not sure if this will make any measurable difference in
> performance on future accesses (in theory it should avoid a global lookup
> once). This would need to be tested to decide.

It would also avoid the global lookup on creation of new files when
lookup-unhashed=auto is in force, so you find the file in the hashed subvol
or not during creates, to report EEXIST errors (as needed). For an existing
file lookup, yes, the link file creation is triggered on the first lookup,
which would do a global lookup, as opposed to the rebalance process
ensuring these link files are present up front. Overall, my thought is that
it is better to have the link files created, so that creates and
existing-file lookups do not suffer the time and resource penalties.

> > Enabling rebalance on access always is fine, but I am not sure it buys
> > us gluster states that mean the cluster is in a balanced situation, for
> > other actions like the lookup-unhashed change mentioned, which may need
> > more than just the link files in place. Examples could be mismatched or
> > overly space-committed bricks with old, unaccessed data, etc., but I do
> > not have a clear example yet.
>
> As I see it, rebalance on access should be a complement to normal
> rebalance to keep the volume _more_ balanced (keep accessed files on the
> right brick to avoid unnecessary delays due to global lookups or link
> file redirections), but it cannot assure that the volume is fully
> rebalanced.
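A minimal sketch of the lookup-unhashed=auto decision being discussed
(after http://review.gluster.org/7702/): each directory layout carries the
volume commit hash in effect when it was last written, and a miss on the
hashed subvolume is only authoritative while the two hashes match. Field
names here are illustrative, not DHT's actual structures:

    #include <stdbool.h>
    #include <stdint.h>

    struct dht_dir {
        uint32_t layout_commit_hash; /* stamped at last layout (re)write */
    };

    struct dht_volume {
        uint32_t commit_hash;        /* bumped by add/remove-brick events */
    };

    static bool need_global_lookup(const struct dht_volume *vol,
                                   const struct dht_dir *dir,
                                   bool found_on_hashed)
    {
        if (found_on_hashed)
            return false;            /* hit on the hashed subvol: done */

        /* A stale commit hash means files may still sit on non-hashed
         * subvols (or link files are missing), so fall back to asking
         * every subvolume. */
        return dir->layout_commit_hash != vol->commit_hash;
    }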
Re: [Gluster-devel] Feature review: Improved rebalance performance
Hi Shyam,

On Thursday 26 June 2014 14:41:13 Shyamsundar Ranganathan wrote:
> It also touches upon a rebalance on access like mechanism where we could
> potentially move data out of existing bricks to a newer brick faster, in
> the case of brick addition, and vice versa for brick removal, and heal
> the rest of the data on access.

Will this rebalance on access feature be enabled always, or only during a
brick addition/removal, to move files that do not go to the affected brick
while the main rebalance is populating or removing files from the brick ?

I like all the proposed ideas. I think they would improve the performance
of the rebalance operation considerably. Probably we will need to define
some policies to limit the amount of bandwidth that rebalance is allowed to
use, and at which hours, but this can be determined later.

I would also consider using the index or changelog xlators to track renames
and let rebalance consume that information. Currently a file or directory
rename means that files correctly placed in the right brick need to be
moved to another brick. A full rebalance crawling the whole file system
seems too expensive for this kind of local change (its effects are orders
of magnitude smaller than adding or removing a brick). Having a way to list
pending moves due to renames without scanning the whole file system would
be great.

Another thing to consider for future versions is to modify the current DHT
to use consistent hashing, and even change the hash value (using the gfid
instead of a hash of the name would solve the rename problem). Consistent
hashing would drastically reduce the number of files that need to be moved
and already solves some of the current problems. This change needs a lot of
thinking though.

Xavi
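A minimal sketch of what consistent hashing buys here: bricks own points on
a 32-bit ring, a file maps to the first brick point at or after its hash,
and adding a brick only claims slices of the ring, so only the files
hashing into those slices move. The hash ring below is illustrative, not
DHT code:

    #include <stddef.h>
    #include <stdint.h>

    struct ring_point { uint32_t pos; int brick; };

    /* points[] must be sorted by pos; each brick would typically
     * contribute many virtual points for smoother balance. */
    static int ring_locate(const struct ring_point *points, size_t n,
                           uint32_t hash)
    {
        size_t lo = 0, hi = n;       /* binary search: first pos >= hash */
        while (lo < hi) {
            size_t mid = (lo + hi) / 2;
            if (points[mid].pos < hash)
                lo = mid + 1;
            else
                hi = mid;
        }
        return points[lo == n ? 0 : lo].brick; /* wrap around the ring */
    }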
Re: [Gluster-devel] Feature review: Improved rebalance performance
On 06/30/2014 02:00 AM, Xavier Hernandez wrote:
> Hi Shyam,
>
> On Thursday 26 June 2014 14:41:13 Shyamsundar Ranganathan wrote:
> > It also touches upon a rebalance on access like mechanism where we
> > could potentially move data out of existing bricks to a newer brick
> > faster, in the case of brick addition, and vice versa for brick
> > removal, and heal the rest of the data on access.
>
> Will this rebalance on access feature be enabled always, or only during a
> brick addition/removal, to move files that do not go to the affected
> brick while the main rebalance is populating or removing files from the
> brick ?
>
> I like all the proposed ideas. I think they would improve the performance
> of the rebalance operation considerably. Probably we will need to define
> some policies to limit the amount of bandwidth that rebalance is allowed
> to use, and at which hours, but this can be determined later.
>
> I would also consider using the index or changelog xlators to track
> renames and let rebalance consume that information. Currently a file or
> directory rename means that files correctly placed in the right brick
> need to be moved to another brick. A full rebalance crawling the whole
> file system seems too expensive for this kind of local change (its
> effects are orders of magnitude smaller than adding or removing a brick).
> Having a way to list pending moves due to renames without scanning the
> whole file system would be great.
>
> Another thing to consider for future versions is to modify the current
> DHT to use consistent hashing, and even change the hash value (using the
> gfid instead of a hash of the name would solve the rename problem).
> Consistent hashing would drastically reduce the number of files that need
> to be moved and already solves some of the current problems. This change
> needs a lot of thinking though.

Besides bandwidth limits, there also need to be monitors on brick latency.
We don't want so many queued iops that operating performance is impacted.
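A minimal sketch of such a latency guard, complementing any bandwidth cap:
pause the rebalance worker while a brick's observed fop latency is above a
ceiling. The threshold and sampling helper are hypothetical, not existing
tunables:

    #include <unistd.h>

    #define LATENCY_CEILING_US 20000 /* assumed ceiling, not a real option */

    extern double brick_avg_fop_latency_us(int brick_id); /* rolling average */

    static void throttle_rebalance(int brick_id)
    {
        /* Back off before queuing the next file migration so client iops
         * on a busy brick are not starved by rebalance traffic. */
        while (brick_avg_fop_latency_us(brick_id) > LATENCY_CEILING_US)
            sleep(1);
    }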
Re: [Gluster-devel] Feature review: Improved rebalance performance
> Besides bandwidth limits, there also need to be monitors on brick
> latency. We don't want so many queued iops that operating performance is
> impacted.

AFAIK - rebalance and self-heal threads run in the low-priority queue in
io-threads by default.

--
Religious confuse piety with mere ritual, the virtuous confuse regulation
with outcomes