On Fri, Mar 29, 2019 at 6:42 AM Atin Mukherjee <amukh...@redhat.com> wrote:
> All, > > As many of you already know that the design logic with which GlusterD > (here on to be referred as GD1) was implemented has some fundamental > scalability bottlenecks at design level, especially around it's way of > handshaking configuration meta data and replicating them across all the > peers. While the initial design was adopted with a factor in mind that GD1 > will have to deal with just few tens of nodes/peers and volumes, the > magnitude of the scaling bottleneck this design can bring in was never > realized and estimated. > > Ever since Gluster has been adopted in container storage land as one of > the storage backends, the business needs have changed. From tens of > volumes, the requirements have translated to hundreds and now to thousands. > We introduced brick multiplexing which had given some relief to have a > better control on the memory footprint when having many number of > bricks/volumes hosted in the node, but this wasn't enough. In one of our (I > represent Red Hat) customer's deployment we had seen on a 3 nodes cluster, > whenever the number of volumes go beyond ~1500 and for some reason if one > of the storage pods get rebooted, the overall time it takes to complete the > overall handshaking (not only in a factor of n X n peer handshaking but > also the number of volume iterations, building up the dictionary and > sending it over the write) consumes a huge time as part of the handshaking > process, the hard timeout of an rpc request which is 10 minutes gets > expired and we see cluster going into a state where none of the cli > commands go through and get stuck. > > With such problem being around and more demand of volume scalability, we > started looking into these areas in GD1 to focus on improving (a) volume > scalability (b) node scalability. While (b) is a separate topic for some > other day we're going to focus on more on (a) today. > > While looking into this volume scalability problem with a deep dive, we > realized that most of the bottleneck which was causing the overall delay in > the friend handshaking and exchanging handshake packets between peers in > the cluster was iterating over the in-memory data structures of the > volumes, putting them into the dictionary sequentially. With 2k like > volumes the function glusterd_add_volumes_to_export_dict () was quite > costly and most time consuming. From pstack output when glusterd instance > was restarted in one of the pods, we could always see that control was > iterating in this function. Based on our testing on a 16 vCPU, 32 GB RAM 3 > nodes cluster, this function itself took almost *7.5 minutes . *The > bottleneck is primarily because of sequential iteration of volumes, > sequentially updating the dictionary with lot of (un)necessary keys. > > So what we tried out was making this loop to work on a worker thread model > so that multiple threads can process a range of volume list and not all of > them so that we can get more parallelism within glusterd. But with that we > still didn't see any improvement and the primary reason for that was our > dictionary APIs need locking. So the next idea was to actually make threads > work on multiple dictionaries and then once all the volumes are iterated > the subsequent dictionaries to be merged into a single one. Along with > these changes there are few other improvements done on skipping comparison > of snapshots if there's no snap available, excluding tiering keys if the > volume type is not tier. With this enhancement [1] we see the overall time > it took to complete building up the dictionary from the in-memory structure > is *2 minutes 18 seconds* which is close* ~3x* improvement. We firmly > believe that with this improvement, we should be able to scale up to 2000 > volumes on a 3 node cluster and that'd help our users to get benefited with > supporting more PVCs/volumes. > > Patch [1] is still in testing and might undergo few minor changes. But we > welcome you for review and comment on it. We plan to get this work > completed, tested and release in glusterfs-7. > > Last but not the least, I'd like to give a shout to Mohit Agrawal (In cc) > for all the work done on this for last few days. Thank you Mohit! > > This sounds good! Thank you for the update on this work. Did you ever consider using etcd with GD1 (like as it is used with GD2)? Having etcd as a backing store for configuration could remove expensive handshaking as well as persistence of configuration on every node. I am interested in understanding if you are aware of any drawbacks with that approach. If there haven't been any thoughts in that direction, it might be a fun experiment to try. Thanks, Vijay
_______________________________________________ Gluster-devel mailing list Gluster-devel@gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel