Re: [Gluster-devel] [Gluster-users] Quick update on glusterd's volume scalability improvements
On Fri, Mar 29, 2019 at 6:42 AM Atin Mukherjee wrote:

> All,
>
> As many of you already know, the design logic with which GlusterD (from
> here on referred to as GD1) was implemented has some fundamental
> scalability bottlenecks at the design level, especially around the way it
> handshakes configuration metadata and replicates it across all the peers.
> The initial design assumed that GD1 would only have to deal with a few
> tens of nodes/peers and volumes, so the magnitude of the scaling
> bottleneck this design could introduce was never realized or estimated.
>
> Ever since Gluster was adopted in the container storage world as one of
> the storage backends, the business needs have changed. From tens of
> volumes, the requirements have grown to hundreds and now to thousands. We
> introduced brick multiplexing, which gave some relief by providing better
> control over the memory footprint when a node hosts a large number of
> bricks/volumes, but this wasn't enough. In one of our (I represent Red
> Hat) customers' deployments, on a 3-node cluster, we saw that whenever
> the number of volumes goes beyond ~1500 and one of the storage pods gets
> rebooted for some reason, the overall handshake (not only a factor of the
> n x n peer handshaking, but also the number of volume iterations,
> building up the dictionary and sending it over the wire) takes so long
> that the hard timeout of an RPC request, which is 10 minutes, expires,
> and we see the cluster going into a state where none of the CLI commands
> go through and they get stuck.
>
> With such a problem around and more demand for volume scalability, we
> started looking into these areas of GD1, focusing on improving (a) volume
> scalability and (b) node scalability. While (b) is a separate topic for
> some other day, we're going to focus on (a) today.
> Taking a deep dive into this volume scalability problem, we realized that
> most of the bottleneck causing the overall delay in the friend handshake
> and the exchange of handshake packets between peers in the cluster was
> iterating over the in-memory data structures of the volumes and putting
> them into the dictionary sequentially. With around 2k volumes, the
> function glusterd_add_volumes_to_export_dict () was quite costly and the
> most time consuming. From pstack output taken when a glusterd instance
> was restarted in one of the pods, we could always see control iterating
> in this function. Based on our testing on a 3-node cluster with 16 vCPUs
> and 32 GB RAM per node, this function alone took almost *7.5 minutes*.
> The bottleneck is primarily the sequential iteration over the volumes,
> sequentially updating the dictionary with a lot of (un)necessary keys.
>
> So what we tried first was making this loop work on a worker-thread
> model, so that multiple threads could each process a range of the volume
> list rather than all of it, giving us more parallelism within glusterd.
> But with that we still didn't see any improvement, and the primary reason
> was that our dictionary APIs need locking. So the next idea was to have
> the threads work on multiple dictionaries and, once all the volumes are
> iterated, merge the resulting dictionaries into a single one. Along with
> these changes there are a few other improvements: skipping the comparison
> of snapshots if no snapshot is available, and excluding tiering keys if
> the volume type is not tier. With this enhancement [1], the overall time
> it takes to build the dictionary from the in-memory structures is
> *2 minutes 18 seconds*, which is close to a *~3x* improvement. We firmly
> believe that with this improvement we should be able to scale up to 2000
> volumes on a 3-node cluster, which would benefit our users by supporting
> more PVCs/volumes.
> Patch [1] is still in testing and might undergo a few minor changes. But
> we welcome you to review and comment on it. We plan to get this work
> completed, tested and released in glusterfs-7.
>
> Last but not least, I'd like to give a shout-out to Mohit Agrawal (in CC)
> for all the work done on this over the last few days. Thank you Mohit!

This sounds good! Thank you for the update on this work.

Did you ever consider using etcd with GD1 (as it is used with GD2)? Having etcd as a backing store for configuration could remove the expensive handshaking as well as the need to persist the configuration on every node. I am interested in understanding whether you are aware of any drawbacks to that approach. If there haven't been any thoughts in that direction, it might be a fun experiment to try.

Thanks,
Vijay

___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Issue with posix locks
Hi all,

There is one potential problem with posix locks when used in a replicated or dispersed volume.

Some background: POSIX locks allow any process to lock a region of a file multiple times, but a single unlock on a given region will release all previous locks. Locked regions can be different for each lock request and they can overlap. The resulting lock will cover the union of all locked regions. A single unlock (the region doesn't necessarily need to match any of the ranges used for locking) will create a "hole" in the currently locked region, independently of how many times a lock request covered that region. For this reason, the locks xlator simply combines the locked regions that are requested, but it doesn't track each individual lock range.

Under normal circumstances this works fine. But there are some cases where this behavior is not sufficient. For example, suppose we have a replica 3 volume with quorum = 2. Given the special nature of posix locks, AFR sends the lock request sequentially to each of the bricks, to avoid a situation where conflicting lock requests from other clients force an unlock of an already locked region on a client that has not yet got enough successful locks (i.e. quorum). An unlock here would not only cancel the current lock request; it would also cancel any previously acquired lock.

However, when something goes wrong (a brick dies during a lock request, or there's a network partition or some other weird situation), it can happen that, even using sequential locking, only one brick succeeds in the lock request. In this case AFR should cancel the previous lock (and it does), but this also cancels any previously acquired lock on that region, which is not good. A similar thing can happen if we try to recover (heal) posix locks that were active after a brick has been disconnected (for any reason) and then reconnected.

To fix all these situations we need to change the way posix locks are managed by the locks xlator.
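To make the background above concrete, here is a minimal sketch (hypothetical helper names, not the actual locks xlator code) of the merge-on-lock / hole-on-unlock semantics: locks combine into a union of regions, and a single unlock punches a hole regardless of how many lock requests covered that region.

```python
# Sketch of POSIX region-lock semantics: regions are half-open [start, end)
# tuples; locking merges overlapping regions, unlocking punches a hole.

def lock(regions, start, end):
    """Add [start, end) and merge it with any overlapping regions."""
    merged_start, merged_end = start, end
    remaining = []
    for s, e in regions:
        if e < merged_start or s > merged_end:  # disjoint: keep as-is
            remaining.append((s, e))
        else:                                   # overlaps/touches: absorb
            merged_start = min(merged_start, s)
            merged_end = max(merged_end, e)
    remaining.append((merged_start, merged_end))
    return sorted(remaining)

def unlock(regions, start, end):
    """Remove [start, end) from every region, splitting where necessary."""
    result = []
    for s, e in regions:
        if e <= start or s >= end:      # untouched
            result.append((s, e))
        else:
            if s < start:               # left remainder survives
                result.append((s, start))
            if e > end:                 # right remainder survives
                result.append((end, e))
    return result

# Two overlapping lock requests become a single region...
r = lock(lock([], 0, 100), 50, 150)     # -> [(0, 150)]
# ...and one unlock releases the overlap entirely, leaving only a hole:
r = unlock(r, 40, 60)                   # -> [(0, 40), (60, 150)]
```

This is exactly why the individual lock ranges are lost: once merged, nothing remembers that two separate requests covered bytes 50-100.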
One possibility would be to embed the lock request inside an inode transaction using inodelk. Since inodelks do not suffer from this problem, the following posix lock could be sent safely. However, this implies an additional network request, which could cause some performance impact. Eager-locking could minimize the impact in some cases, but this approach won't work for lock recovery after a disconnect.

Another possibility is to send a special partial posix lock request which won't be immediately merged with already existing locks once granted. An additional confirmation request for the partial posix lock would be required to fully grant the current lock and merge it with the existing ones. This also requires a new network request, which will add latency, and it makes everything more complex, since there would be more combinations of states in which something could fail.

So I think one possible solution would be the following:

1. Keep each posix lock as an independent object in the locks xlator. This will make it possible to "invalidate" any already granted lock without affecting already established locks.

2. Additionally, we'll keep a sorted list of non-overlapping segments of locked regions, and we'll count, for each segment, how many locks are referencing it. One lock can reference multiple segments, and each segment can be referenced by multiple locks.

3. An additional lock request that overlaps with an existing segment can cause this segment to be split to satisfy the non-overlapping property.

4. When an unlock request is received, all segments intersecting with the region are eliminated (this may require some segment splits on the edges), and the unlocked region is subtracted from each lock associated with the segment. If a lock ends up with an empty region, it's removed.

5. We'll create a special "remove lock" request that doesn't unlock a region but removes an already granted lock. This will decrease the reference count of each of the segments this lock was covering.
If a segment's count reaches 0, it's removed; otherwise it remains. This special request will only be used internally to cancel already acquired locks that cannot be fully granted due to quorum issues or any other problem.

In some weird cases the list of segments can become huge (many locks overlapping only on a single byte, so each segment represents just one byte). We can try to find some smarter structure that minimizes this problem, or limit the number of segments (for example, returning ENOLCK when there are too many).

What do you think?

Xavi
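The segment/refcount bookkeeping from points 1-5 could be sketched roughly as follows (a toy model with hypothetical names, not the actual xlator design; the unlock-with-subtraction path from point 4 is omitted for brevity):

```python
# Toy model of the proposal: locks are independent objects, and a sorted
# list of non-overlapping [start, end, refcount] segments tracks how many
# locks cover each region.

class SegmentTable:
    def __init__(self):
        self.segments = []  # sorted, non-overlapping [start, end, refcount]

    def _split_at(self, pos):
        # Split any segment straddling pos so boundaries align (point 3).
        for i, (s, e, r) in enumerate(self.segments):
            if s < pos < e:
                self.segments[i:i + 1] = [[s, pos, r], [pos, e, r]]
                return

    def lock(self, start, end):
        # Grant a lock: bump refcounts on covered segments and create new
        # segments (refcount 1) for any uncovered gaps (point 2).
        self._split_at(start)
        self._split_at(end)
        pos, new = start, []
        for seg in self.segments:
            s, e, _ = seg
            if e <= start or s >= end:
                continue
            if pos < s:                 # gap before this segment
                new.append([pos, s, 1])
            seg[2] += 1
            pos = e
        if pos < end:                   # trailing gap
            new.append([pos, end, 1])
        self.segments = sorted(self.segments + new)
        return (start, end)             # the independent lock object

    def remove_lock(self, lock_obj):
        # "Remove lock" request (point 5): drop refcounts only; segments
        # reaching 0 disappear, other locks keep their coverage intact.
        start, end = lock_obj
        kept = []
        for s, e, r in self.segments:
            if s >= start and e <= end:
                r -= 1
            if r > 0:
                kept.append([s, e, r])
        self.segments = kept
```

The key property: cancelling a partially granted lock via `remove_lock` leaves every other lock's region fully covered, unlike a plain unlock, which would punch a hole through all of them.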
[Gluster-devel] Quick update on glusterd's volume scalability improvements
All,

As many of you already know, the design logic with which GlusterD (from here on referred to as GD1) was implemented has some fundamental scalability bottlenecks at the design level, especially around the way it handshakes configuration metadata and replicates it across all the peers. The initial design assumed that GD1 would only have to deal with a few tens of nodes/peers and volumes, so the magnitude of the scaling bottleneck this design could introduce was never realized or estimated.

Ever since Gluster was adopted in the container storage world as one of the storage backends, the business needs have changed. From tens of volumes, the requirements have grown to hundreds and now to thousands. We introduced brick multiplexing, which gave some relief by providing better control over the memory footprint when a node hosts a large number of bricks/volumes, but this wasn't enough. In one of our (I represent Red Hat) customers' deployments, on a 3-node cluster, we saw that whenever the number of volumes goes beyond ~1500 and one of the storage pods gets rebooted for some reason, the overall handshake (not only a factor of the n x n peer handshaking, but also the number of volume iterations, building up the dictionary and sending it over the wire) takes so long that the hard timeout of an RPC request, which is 10 minutes, expires, and we see the cluster going into a state where none of the CLI commands go through and they get stuck.

With such a problem around and more demand for volume scalability, we started looking into these areas of GD1, focusing on improving (a) volume scalability and (b) node scalability. While (b) is a separate topic for some other day, we're going to focus on (a) today.
Taking a deep dive into this volume scalability problem, we realized that most of the bottleneck causing the overall delay in the friend handshake and the exchange of handshake packets between peers in the cluster was iterating over the in-memory data structures of the volumes and putting them into the dictionary sequentially. With around 2k volumes, the function glusterd_add_volumes_to_export_dict () was quite costly and the most time consuming. From pstack output taken when a glusterd instance was restarted in one of the pods, we could always see control iterating in this function. Based on our testing on a 3-node cluster with 16 vCPUs and 32 GB RAM per node, this function alone took almost *7.5 minutes*. The bottleneck is primarily the sequential iteration over the volumes, sequentially updating the dictionary with a lot of (un)necessary keys.

So what we tried first was making this loop work on a worker-thread model, so that multiple threads could each process a range of the volume list rather than all of it, giving us more parallelism within glusterd. But with that we still didn't see any improvement, and the primary reason was that our dictionary APIs need locking. So the next idea was to have the threads work on multiple dictionaries and, once all the volumes are iterated, merge the resulting dictionaries into a single one. Along with these changes there are a few other improvements: skipping the comparison of snapshots if no snapshot is available, and excluding tiering keys if the volume type is not tier. With this enhancement [1], the overall time it takes to build the dictionary from the in-memory structures is *2 minutes 18 seconds*, which is close to a *~3x* improvement. We firmly believe that with this improvement we should be able to scale up to 2000 volumes on a 3-node cluster, which would benefit our users by supporting more PVCs/volumes.

Patch [1] is still in testing and might undergo a few minor changes.
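The per-thread-dictionary idea described above can be sketched as follows (a hypothetical illustration, not the actual glusterd patch, which is in C; `export_volume` and the key names are made up for the example). Each worker serializes its own range of the volume list into a private dict, so no locking is needed while building, and the per-thread dicts are merged in a single pass at the end:

```python
import threading

def export_volume(vol):
    # Stand-in for building the per-volume keys that go into the dict.
    return {f"volume{vol['id']}.name": vol["name"],
            f"volume{vol['id']}.type": vol["type"]}

def worker(volumes, out):
    # Each thread fills its own private dict: no lock contention.
    for vol in volumes:
        out.update(export_volume(vol))

def build_export_dict(volumes, nthreads=4):
    # Partition the volume list across workers.
    chunks = [volumes[i::nthreads] for i in range(nthreads)]
    partials = [{} for _ in range(nthreads)]
    threads = [threading.Thread(target=worker, args=(c, d))
               for c, d in zip(chunks, partials)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    merged = {}                 # single merge step once all threads finish
    for d in partials:
        merged.update(d)
    return merged
```

The merge is cheap because each volume's keys land in exactly one partial dict, so there are no conflicting writes to reconcile.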
But we welcome you to review and comment on it. We plan to get this work completed, tested and released in glusterfs-7.

Last but not least, I'd like to give a shout-out to Mohit Agrawal (in CC) for all the work done on this over the last few days. Thank you Mohit!

[1] https://review.gluster.org/#/c/glusterfs/+/22445/
Re: [Gluster-devel] Upgrade testing to gluster 6
Hi,

I have added some more info that was missed earlier.

The disconnect issue, being minor, is one we are working on with a lower priority. But yes, it will be fixed soon. The bug to track this is: https://bugzilla.redhat.com/show_bug.cgi?id=1694010

The workaround to get past this, if it happens, is to upgrade the nodes one after the other to the latest version. Once the upgrade is done:
1) kill only the glusterd process on all the nodes using the command "pkill glusterd"
2) then do an "iptables -F" to flush the iptables
3) start glusterd using "glusterd"

Note: users can use the systemctl stop/start glusterd.service commands instead of the above to kill and start glusterd.

On Fri, Mar 29, 2019 at 11:42 AM Hari Gowtham wrote:

> Hello Gluster users,
>
> As you are all aware, glusterfs-6 is out. We would like to inform you
> that we have spent a significant amount of time testing glusterfs-6 in
> upgrade scenarios. We have done upgrade testing to glusterfs-6 from
> various releases like 3.12, 4.1 and 5.3.
>
> As glusterfs-6 has got in a lot of changes, we wanted to test those
> portions. There were xlators (and respective options to enable/disable
> them) added and deprecated in glusterfs-6 from various versions [1].
>
> We had to check the following upgrade scenarios for all such options
> identified in [1]:
> 1) option never enabled and upgraded
> 2) option enabled and then upgraded
> 3) option enabled, then disabled, and then upgraded
>
> We weren't able to manually check all the combinations for all the
> options, so the options involving enabling and disabling xlators were
> prioritized. Below are the results of the ones tested.
>
> Never enabled and upgraded:
> Checked upgrades from 3.12, 4.1 and 5.3 to 6; the upgrade works.
>
> Enabled and upgraded:
> Tested for tier, which is deprecated. This is not a recommended upgrade.
> As expected, the volume won't be consumable and will have a few more
> issues as well. Tested with 3.12, 4.1 and 5.3 to 6 upgrades.
> Enabled, disabled before upgrade:
> Tested for tier with 3.12 and the upgrade went fine.
>
> There is one common issue to note in every upgrade: the node being
> upgraded goes into a disconnected state. You have to flush the iptables
> and then restart glusterd on all nodes to fix this.
>
> The testing for enabling new options is still pending. The new options
> won't cause as many issues as the deprecated ones, so this was put at
> the end of the priority list. It would be nice to get contributions
> for this.
>
> For the disable testing, tier was used as it covers most of the xlators
> that were removed. All of these tests were done on a replica 3 volume.
>
> Note: This is only upgrade testing of the newly added and removed
> xlators; it does not involve the normal tests for the xlators.
>
> If you have any questions, please feel free to reach out to us.
>
> [1]
> https://docs.google.com/spreadsheets/d/1nh7T5AXaV6kc5KgILOy2pEqjzC3t_R47f1XUXSVFetI/edit?usp=sharing
>
> Regards,
> Hari and Sanju.

--
Regards,
Hari Gowtham.
[Gluster-devel] FUSE client work on Windows
Hi,

Is there any FUSE client work happening on Windows? I would like to do this port with Crossmeta FUSE: github.com/crossmeta/cxfuse

I am looking for the relevant source directories that are lean and mean for the FUSE client.

Thanks in advance,
Sam
[Gluster-devel] Upgrade testing to gluster 6
Hello Gluster users,

As you are all aware, glusterfs-6 is out. We would like to inform you that we have spent a significant amount of time testing glusterfs-6 in upgrade scenarios. We have done upgrade testing to glusterfs-6 from various releases like 3.12, 4.1 and 5.3.

As glusterfs-6 has got in a lot of changes, we wanted to test those portions. There were xlators (and respective options to enable/disable them) added and deprecated in glusterfs-6 from various versions [1].

We had to check the following upgrade scenarios for all such options identified in [1]:
1) option never enabled and upgraded
2) option enabled and then upgraded
3) option enabled, then disabled, and then upgraded

We weren't able to manually check all the combinations for all the options, so the options involving enabling and disabling xlators were prioritized. Below are the results of the ones tested.

Never enabled and upgraded:
Checked upgrades from 3.12, 4.1 and 5.3 to 6; the upgrade works.

Enabled and upgraded:
Tested for tier, which is deprecated. This is not a recommended upgrade. As expected, the volume won't be consumable and will have a few more issues as well. Tested with 3.12, 4.1 and 5.3 to 6 upgrades.

Enabled, disabled before upgrade:
Tested for tier with 3.12 and the upgrade went fine.

There is one common issue to note in every upgrade: the node being upgraded goes into a disconnected state. You have to flush the iptables and then restart glusterd on all nodes to fix this.

The testing for enabling new options is still pending. The new options won't cause as many issues as the deprecated ones, so this was put at the end of the priority list. It would be nice to get contributions for this.

For the disable testing, tier was used as it covers most of the xlators that were removed. All of these tests were done on a replica 3 volume.

Note: This is only upgrade testing of the newly added and removed xlators; it does not involve the normal tests for the xlators.
If you have any questions, please feel free to reach out to us.

[1] https://docs.google.com/spreadsheets/d/1nh7T5AXaV6kc5KgILOy2pEqjzC3t_R47f1XUXSVFetI/edit?usp=sharing

Regards,
Hari and Sanju.