Re: [Gluster-devel] [Gluster-users] Quick update on glusterd's volume scalability improvements

2019-03-29 Thread Vijay Bellur
On Fri, Mar 29, 2019 at 6:42 AM Atin Mukherjee  wrote:

> All,
>
> As many of you already know, the design with which GlusterD (here on to
> be referred to as GD1) was implemented has some fundamental scalability
> bottlenecks at the design level, especially around the way it handshakes
> configuration metadata and replicates it across all the peers. The
> initial design was adopted with the assumption that GD1 would only have
> to deal with a few tens of nodes/peers and volumes, so the magnitude of
> the scaling bottleneck this design could introduce was never realized or
> estimated.
>
> Ever since Gluster was adopted in the container storage world as one of
> the storage backends, the business needs have changed. From tens of
> volumes, the requirements have grown to hundreds and now to thousands.
> We introduced brick multiplexing, which gave some relief by providing
> better control over the memory footprint when a node hosts a large
> number of bricks/volumes, but this wasn't enough. In one of our (I
> represent Red Hat) customers' deployments we saw that on a 3-node
> cluster, whenever the number of volumes goes beyond ~1500 and one of the
> storage pods gets rebooted for some reason, the handshaking takes so
> long (not only because of the n x n peer handshaking but also because of
> the number of volume iterations, building up the dictionary and sending
> it over the wire) that the hard timeout of an RPC request, which is 10
> minutes, expires and the cluster goes into a state where none of the CLI
> commands go through; they simply get stuck.
>
> With this problem around and more demand for volume scalability, we
> started looking into these areas of GD1 to focus on improving (a) volume
> scalability and (b) node scalability. While (b) is a separate topic for
> another day, we're going to focus on (a) today.
>
> After a deep dive into this volume scalability problem, we realized that
> most of the bottleneck causing the overall delay in the friend handshake
> and the exchange of handshake packets between peers in the cluster was
> iterating over the in-memory data structures of the volumes and putting
> them into the dictionary sequentially. With around 2k volumes, the
> function glusterd_add_volumes_to_export_dict() was quite costly and the
> most time consuming. From pstack output taken when the glusterd instance
> was restarted in one of the pods, we could always see that control was
> iterating in this function. Based on our testing on a 3-node cluster
> (16 vCPU, 32 GB RAM), this function alone took almost *7.5 minutes*. The
> bottleneck is primarily the sequential iteration over the volumes,
> sequentially updating the dictionary with lots of (un)necessary keys.
>
> So what we first tried was converting this loop to a worker-thread
> model, so that each of several threads processes only a range of the
> volume list rather than all of it, giving us more parallelism within
> glusterd. But we still didn't see any improvement, and the primary
> reason was that our dictionary APIs need locking. The next idea was to
> make the threads work on separate dictionaries and, once all the volumes
> have been iterated, merge those dictionaries into a single one. Along
> with these changes there are a few other improvements: skipping the
> comparison of snapshots if no snapshots are available, and excluding
> tiering keys if the volume type is not tier. With this enhancement [1]
> the overall time it takes to build up the dictionary from the in-memory
> structures is *2 minutes 18 seconds*, which is close to a *~3x*
> improvement. We firmly believe that with this improvement we should be
> able to scale up to 2000 volumes on a 3-node cluster, which would let
> our users benefit from support for more PVCs/volumes.
>
> Patch [1] is still in testing and might undergo a few minor changes, but
> we welcome your reviews and comments on it. We plan to get this work
> completed, tested and released in glusterfs-7.
>
> Last but not least, I'd like to give a shout-out to Mohit Agrawal (in
> cc) for all the work done on this over the last few days. Thank you
> Mohit!
>
>

This sounds good! Thank you for the update on this work.

Did you ever consider using etcd with GD1 (as it is used with GD2)?
Having etcd as a backing store for configuration could remove the
expensive handshaking as well as the need to persist the configuration on
every node. I am interested in understanding whether you are aware of any
drawbacks with that approach. If there haven't been any thoughts in that
direction, it might be a fun experiment to try.

Thanks,
Vijay

[Gluster-devel] Issue with posix locks

2019-03-29 Thread Xavi Hernandez
Hi all,

There is one potential problem with posix locks when they are used in a
replicated or dispersed volume.

Some background:

Posix locks allow any process to lock a region of a file multiple times,
but a single unlock on a given region will release all previous locks.
Locked regions can be different for each lock request and they can overlap.
The resulting lock will cover the union of all locked regions. A single
unlock (the region doesn't necessarily need to match any of the ranges used
for locking) will create a "hole" in the currently locked region,
independently of how many times a lock request covered that region.

For this reason, the locks xlator simply combines the locked regions that
are requested, but it doesn't track each individual lock range.
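
To illustrate these semantics, here is a small self-contained C example
(plain libc, not Gluster code; the file path and byte ranges are
arbitrary): two overlapping lock requests from the same process end up as
one merged lock, and a single unlock punches a hole in it regardless of
how many requests covered those bytes.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Apply a POSIX record lock (or unlock) on [start, start + len). */
static void set_lock(int fd, short type, off_t start, off_t len)
{
    struct flock fl = {
        .l_type   = type,      /* F_WRLCK or F_UNLCK */
        .l_whence = SEEK_SET,
        .l_start  = start,
        .l_len    = len,
    };
    if (fcntl(fd, F_SETLK, &fl) == -1) {
        perror("fcntl");
        exit(EXIT_FAILURE);
    }
}

int main(void)
{
    int fd = open("/tmp/lock-demo", O_RDWR | O_CREAT, 0600);
    if (fd == -1) {
        perror("open");
        return EXIT_FAILURE;
    }

    /* Two lock requests with overlapping ranges... */
    set_lock(fd, F_WRLCK, 0, 100);   /* locks [0, 100)  */
    set_lock(fd, F_WRLCK, 50, 100);  /* locks [50, 150) */
    /* ...the kernel now holds a single merged lock on [0, 150). */

    /* One unlock over [25, 50) punches a hole in the merged region,
     * no matter how many lock requests covered those bytes. */
    set_lock(fd, F_UNLCK, 25, 25);
    /* Remaining locked regions: [0, 25) and [50, 150). */

    close(fd);
    return EXIT_SUCCESS;
}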

Under normal circumstances this works fine, but there are some cases where
this behavior is not sufficient. For example, suppose we have a replica 3
volume with quorum = 2. Given the special nature of posix locks, AFR sends
the lock request sequentially to each of the bricks, to avoid a situation
where conflicting lock requests from other clients would force the client
that has not yet got enough successful locks (i.e. quorum) to unlock an
already locked region. Such an unlock would not only cancel the current
lock request; it would also cancel any previously acquired lock.

However, when something goes wrong (a brick dies during a lock request,
there's a network partition, or some other unusual situation), it can
happen that, even with sequential locking, only one brick succeeds in the
lock request. In this case AFR should cancel that lock (and it does), but
this also cancels any previously acquired lock on that region, which is
not good.

A similar thing can happen if we try to recover (heal) posix locks that
were active after a brick has been disconnected (for any reason) and then
reconnected.

To fix all these situations we need to change the way posix locks are
managed by the locks xlator. One possibility would be to embed the lock
request inside an inode transaction using inodelk. Since inodelks do not
suffer from this problem, the following posix lock could be sent safely.
However, this implies an additional network request, which could have some
performance impact. Eager-locking could minimize the impact in some cases,
but this approach won't work for lock recovery after a disconnect.

Another possibility is to send a special partial posix lock request which
won't be immediately merged with already existing locks once granted. An
additional confirmation request for the partial posix lock would then be
required to fully grant the current lock and merge it with the existing
ones. This also requires a new network request, which adds latency, and it
makes everything more complex since there would be more combinations of
states in which something could fail.

So I think one possible solution would be the following:

1. Keep each posix lock as an independent object in the locks xlator. This
will make it possible to "invalidate" any already granted lock without
affecting the other established locks.

2. Additionally, we'll keep a sorted list of non-overlapping segments of
locked regions, and we'll count, for each segment, how many locks are
referencing it. One lock can reference multiple segments, and each segment
can be referenced by multiple locks.

3. An additional lock request that overlaps with an existing segment can
cause that segment to be split to preserve the non-overlapping property.

4. When an unlock request is received, all segments intersecting with the
region are eliminated (this may require some segment splits at the edges),
and the unlocked region is subtracted from each lock associated with those
segments. If a lock ends up with an empty region, it's removed.

5. We'll create a special "remove lock" request that doesn't unlock a
region but removes an already granted lock. This will decrease the number
of references on each of the segments the lock was covering. If a
segment's count reaches 0, it's removed; otherwise it remains. This
special request will only be used internally to cancel already acquired
locks that cannot be fully granted due to quorum issues or any other
problem.

In some pathological cases the list of segments could become huge (many
locks overlapping only by a single byte, so that each segment represents
just one byte). We can try to find a smarter structure that minimizes this
problem, or limit the number of segments (for example, returning ENOLCK
when there are too many).
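
To make the bookkeeping in points 1-5 above a bit more concrete, here is a
minimal C sketch of the idea (illustrative only, not the actual locks
xlator structures; the type and function names are made up). It shows
locks kept as independent objects, non-overlapping segments with reference
counts, and the "remove lock" operation from point 5, which drops
references without creating a hole for other locks.

#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/* One granted posix lock, kept as an independent object (point 1), so it
 * can be invalidated without touching other locks. */
typedef struct lock_obj {
    pid_t owner;
    off_t start;               /* requested range [start, end)           */
    off_t end;
} lock_obj_t;

/* One non-overlapping segment of the locked region (point 2). Segments
 * are kept in a list sorted by offset; lock requests that overlap a
 * segment split it (point 3), so a segment is always fully inside or
 * fully outside any given lock's range. */
typedef struct segment {
    struct segment *next;
    off_t           start;
    off_t           end;
    uint32_t        refcount;  /* number of locks covering this range    */
} segment_t;

/* Point 5: the special "remove lock" request drops one reference from
 * every segment the lock was covering. Segments whose count reaches 0
 * disappear; the rest stay because other locks still need them, so no
 * hole is created in regions locked by other requests. */
void remove_lock(segment_t **head, const lock_obj_t *lk)
{
    segment_t **pp = head;

    while (*pp != NULL) {
        segment_t *seg = *pp;

        if (seg->start >= lk->start && seg->end <= lk->end) {
            if (--seg->refcount == 0) {
                *pp = seg->next;   /* unlink and free the empty segment */
                free(seg);
                continue;
            }
        }
        pp = &seg->next;
    }
}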

What do you think?

Xavi

[Gluster-devel] Quick update on glusterd's volume scalability improvements

2019-03-29 Thread Atin Mukherjee
All,

As many of you already know, the design with which GlusterD (here on to be
referred to as GD1) was implemented has some fundamental scalability
bottlenecks at the design level, especially around the way it handshakes
configuration metadata and replicates it across all the peers. The initial
design was adopted with the assumption that GD1 would only have to deal
with a few tens of nodes/peers and volumes, so the magnitude of the
scaling bottleneck this design could introduce was never realized or
estimated.

Ever since Gluster was adopted in the container storage world as one of
the storage backends, the business needs have changed. From tens of
volumes, the requirements have grown to hundreds and now to thousands. We
introduced brick multiplexing, which gave some relief by providing better
control over the memory footprint when a node hosts a large number of
bricks/volumes, but this wasn't enough. In one of our (I represent Red
Hat) customers' deployments we saw that on a 3-node cluster, whenever the
number of volumes goes beyond ~1500 and one of the storage pods gets
rebooted for some reason, the handshaking takes so long (not only because
of the n x n peer handshaking but also because of the number of volume
iterations, building up the dictionary and sending it over the wire) that
the hard timeout of an RPC request, which is 10 minutes, expires and the
cluster goes into a state where none of the CLI commands go through; they
simply get stuck.

With this problem around and more demand for volume scalability, we
started looking into these areas of GD1 to focus on improving (a) volume
scalability and (b) node scalability. While (b) is a separate topic for
another day, we're going to focus on (a) today.

After a deep dive into this volume scalability problem, we realized that
most of the bottleneck causing the overall delay in the friend handshake
and the exchange of handshake packets between peers in the cluster was
iterating over the in-memory data structures of the volumes and putting
them into the dictionary sequentially. With around 2k volumes, the
function glusterd_add_volumes_to_export_dict() was quite costly and the
most time consuming. From pstack output taken when the glusterd instance
was restarted in one of the pods, we could always see that control was
iterating in this function. Based on our testing on a 3-node cluster
(16 vCPU, 32 GB RAM), this function alone took almost *7.5 minutes*. The
bottleneck is primarily the sequential iteration over the volumes,
sequentially updating the dictionary with lots of (un)necessary keys.

So what we first tried was converting this loop to a worker-thread model,
so that each of several threads processes only a range of the volume list
rather than all of it, giving us more parallelism within glusterd. But we
still didn't see any improvement, and the primary reason was that our
dictionary APIs need locking. The next idea was to make the threads work
on separate dictionaries and, once all the volumes have been iterated,
merge those dictionaries into a single one. Along with these changes there
are a few other improvements: skipping the comparison of snapshots if no
snapshots are available, and excluding tiering keys if the volume type is
not tier. With this enhancement [1] the overall time it takes to build up
the dictionary from the in-memory structures is *2 minutes 18 seconds*,
which is close to a *~3x* improvement. We firmly believe that with this
improvement we should be able to scale up to 2000 volumes on a 3-node
cluster, which would let our users benefit from support for more
PVCs/volumes.
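
As a rough illustration of this approach (not the actual patch [1]; the
dict, volume and worker types below are simplified stand-ins invented for
the example), each worker thread fills its own private dictionary for a
slice of the volume list, so no locking is needed during the build phase,
and the per-thread dictionaries are merged once at the end:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define NUM_VOLUMES 2000
#define MAX_KEYS    (NUM_VOLUMES * 2)

/* Simplified stand-ins for glusterd's volume info and dict_t. */
typedef struct { char name[32]; } volinfo_t;
typedef struct { int count; char keys[MAX_KEYS][64]; } mydict_t;

typedef struct {
    volinfo_t *vols;
    int        start, end;  /* slice of the volume list for this thread */
    mydict_t   dict;        /* private dictionary: no locking needed    */
} worker_t;

static void dict_add(mydict_t *d, const char *key)
{
    snprintf(d->keys[d->count++], sizeof(d->keys[0]), "%s", key);
}

/* Each worker serializes only its own slice of the volume list. */
static void *build_slice(void *arg)
{
    worker_t *w = arg;
    for (int i = w->start; i < w->end; i++) {
        char key[64];
        snprintf(key, sizeof(key), "volume%d.name=%s", i, w->vols[i].name);
        dict_add(&w->dict, key);
        /* ...the real code adds many more per-volume keys here... */
    }
    return NULL;
}

/* After all threads finish, merge the per-thread dictionaries once. */
static void merge(mydict_t *dst, const mydict_t *src)
{
    for (int i = 0; i < src->count; i++)
        dict_add(dst, src->keys[i]);
}

int main(void)
{
    static volinfo_t vols[NUM_VOLUMES];
    static worker_t  workers[NUM_THREADS];
    static mydict_t  final_dict;
    pthread_t        tids[NUM_THREADS];
    int              per = NUM_VOLUMES / NUM_THREADS;

    for (int i = 0; i < NUM_VOLUMES; i++)
        snprintf(vols[i].name, sizeof(vols[i].name), "vol-%d", i);

    for (int t = 0; t < NUM_THREADS; t++) {
        workers[t].vols  = vols;
        workers[t].start = t * per;
        workers[t].end   = (t == NUM_THREADS - 1) ? NUM_VOLUMES
                                                  : (t + 1) * per;
        pthread_create(&tids[t], NULL, build_slice, &workers[t]);
    }

    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(tids[t], NULL);
        merge(&final_dict, &workers[t].dict);
    }

    printf("built final dictionary with %d keys\n", final_dict.count);
    return 0;
}

Keeping each partial dictionary private to its thread is what avoids the
locking in the dictionary APIs during the build phase; the only serialized
step left is the final merge.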

Patch [1] is still in testing and might undergo a few minor changes, but
we welcome your reviews and comments on it. We plan to get this work
completed, tested and released in glusterfs-7.

Last but not least, I'd like to give a shout-out to Mohit Agrawal (in cc)
for all the work done on this over the last few days. Thank you Mohit!

[1] https://review.gluster.org/#/c/glusterfs/+/22445/

Re: [Gluster-devel] Upgrade testing to gluster 6

2019-03-29 Thread Hari Gowtham
Hi,
I have added some more information that was missed earlier.

Since the disconnect issue is minor, we are working on it with a lower
priority. But yes, it will be fixed soon.

The bug to track this is: https://bugzilla.redhat.com/show_bug.cgi?id=1694010

The workaround, if this happens, is to upgrade the nodes one after the
other to the latest version. Once the upgrade is done:
1) kill only the glusterd process on all the nodes
using the command "pkill glusterd"
2) then run "iptables -F" to flush the iptables rules
3) start glusterd using "glusterd"

Note: users can use the systemctl stop/start glusterd.service commands
instead of the above to kill and start glusterd.

On Fri, Mar 29, 2019 at 11:42 AM Hari Gowtham  wrote:
>
> Hello Gluster users,
>
> As you are all aware, glusterfs-6 is out. We would like to inform you
> that we have spent a significant amount of time testing glusterfs-6 in
> upgrade scenarios. We have done upgrade testing to glusterfs-6 from
> various releases such as 3.12, 4.1 and 5.3.
>
> As glusterfs-6 includes a lot of changes, we wanted to test those
> portions. There are xlators (and the respective options to
> enable/disable them) that were added or deprecated in glusterfs-6
> relative to various versions [1].
>
> We had to check the following upgrade scenarios for all such options
> identified in [1]:
> 1) option never enabled and upgraded
> 2) option enabled and then upgraded
> 3) option enabled and then disabled and then upgraded
>
> We weren't able to manually check all the combinations for all the
> options, so the options involving enabling and disabling xlators were
> prioritized. Below are the results of the ones tested.
>
> Never enabled and upgraded:
> Checked from 3.12, 4.1 and 5.3 to 6; the upgrade works.
>
> Enabled and upgraded:
> Tested for tier, which is deprecated; this is not a recommended upgrade.
> As expected, the volume won't be consumable and will have a few more
> issues as well.
> Tested with 3.12, 4.1 and 5.3 to 6 upgrades.
>
> Enabled, then disabled before upgrade:
> Tested for tier with 3.12, and the upgrade went fine.
>
> There is one common issue to note in every upgrade: the node being
> upgraded goes into a disconnected state. You have to flush the iptables
> rules and then restart glusterd on all nodes to fix this.
>
> The testing for enabling new options is still pending. The new options
> won't cause as many issues as the deprecated ones, so this was put at
> the end of the priority list. It would be nice to get contributions
> for this.
>
> For the disable testing, tier was used, as it covers most of the xlators
> that were removed. All of these tests were done on a replica 3 volume.
>
> Note: this covers only upgrade testing of the newly added and removed
> xlators; it does not include the normal tests for each xlator.
>
> If you have any questions, please feel free to reach us.
>
> [1] 
> https://docs.google.com/spreadsheets/d/1nh7T5AXaV6kc5KgILOy2pEqjzC3t_R47f1XUXSVFetI/edit?usp=sharing
>
> Regards,
> Hari and Sanju.



-- 
Regards,
Hari Gowtham.


[Gluster-devel] FUSE client work on Windows

2019-03-29 Thread Supra Sammandam
Hi,

Is there any client work happening on Windows? I would like to do this
port with Crossmeta FUSE (github.com/crossmeta/cxfuse).

I am looking for the relevant source directories that are lean and mean
for the FUSE client.

Thanks in advance
Sam

[Gluster-devel] Upgrade testing to gluster 6

2019-03-29 Thread Hari Gowtham
Hello Gluster users,

As you are all aware, glusterfs-6 is out. We would like to inform you that
we have spent a significant amount of time testing glusterfs-6 in upgrade
scenarios. We have done upgrade testing to glusterfs-6 from various
releases such as 3.12, 4.1 and 5.3.

As glusterfs-6 includes a lot of changes, we wanted to test those
portions. There are xlators (and the respective options to enable/disable
them) that were added or deprecated in glusterfs-6 relative to various
versions [1].

We had to check the following upgrade scenarios for all such options
identified in [1]:
1) option never enabled and upgraded
2) option enabled and then upgraded
3) option enabled and then disabled and then upgraded

We weren't able to manually check all the combinations for all the
options, so the options involving enabling and disabling xlators were
prioritized. Below are the results of the ones tested.

Never enabled and upgraded:
Checked from 3.12, 4.1 and 5.3 to 6; the upgrade works.

Enabled and upgraded:
Tested for tier, which is deprecated; this is not a recommended upgrade.
As expected, the volume won't be consumable and will have a few more
issues as well.
Tested with 3.12, 4.1 and 5.3 to 6 upgrades.

Enabled, then disabled before upgrade:
Tested for tier with 3.12, and the upgrade went fine.

There is one common issue to note in every upgrade: the node being
upgraded goes into a disconnected state. You have to flush the iptables
rules and then restart glusterd on all nodes to fix this.

The testing for enabling new options is still pending. The new options
won't cause as many issues as the deprecated ones, so this was put at
the end of the priority list. It would be nice to get contributions
for this.

For the disable testing, tier was used, as it covers most of the xlators
that were removed. All of these tests were done on a replica 3 volume.

Note: this covers only upgrade testing of the newly added and removed
xlators; it does not include the normal tests for each xlator.

If you have any questions, please feel free to reach us.

[1] 
https://docs.google.com/spreadsheets/d/1nh7T5AXaV6kc5KgILOy2pEqjzC3t_R47f1XUXSVFetI/edit?usp=sharing

Regards,
Hari and Sanju.