Re: [Gluster-users] gluster forcing IPV6 on our IPV4 servers, glusterd fails (was gluster update question regarding new DNS resolution requirement)

2021-09-21 Thread Erik Jacobson
On Tue, Sep 21, 2021 at 04:18:10PM +, Strahil Nikolov wrote:
> As far as I know a fix was introduced recently, so even missing to run the
> script won't be so critical - you can run it afterwards.
> I would use Ansible to roll out such updates on a set of nodes - this will
> prevent human errors and will give the opportunity to run such tiny details
> like geo-rep modifying script.
> 
> P.S.: Out of curiosity, are you using distributed-replicated or
> distributed-dispersed volumes ?


Distributed-Replicated, with different volume configurations per use
case and one sharded.

PS: I am HOPING to take another crack at Ganesha tomorrow to try to "get
off our dependence on gnfs" but we'll see how things go with the crisis
of the day always blocking progress. I hope to deprecate the use of
expanded NFS trees (ie compute node root filesystems that are
file-by-file served by the NFS server) in favor of image objects
(squashfs images sitting in sharded volumes). I think what caused us
trouble with ganesha a couple years ago was the huge metadata load which
should be greatly reduced. We will see!




Output from one test system if you're curious:


[root@leader1 ~]# gluster volume info

Volume Name: cm_logs
Type: Distributed-Replicate
Volume ID: 27ffa15b-9fed-4322-b591-225270ca9de5
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 3 = 18
Transport-type: tcp
Bricks:
Brick1: 172.23.0.3:/data/brick_cm_logs
Brick2: 172.23.0.2:/data/brick_cm_logs
Brick3: 172.23.0.4:/data/brick_cm_logs
Brick4: 172.23.0.5:/data/brick_cm_logs
Brick5: 172.23.0.6:/data/brick_cm_logs
Brick6: 172.23.0.7:/data/brick_cm_logs
Brick7: 172.23.0.8:/data/brick_cm_logs
Brick8: 172.23.0.9:/data/brick_cm_logs
Brick9: 172.23.0.10:/data/brick_cm_logs
Brick10: 172.23.0.11:/data/brick_cm_logs
Brick11: 172.23.0.12:/data/brick_cm_logs
Brick12: 172.23.0.13:/data/brick_cm_logs
Brick13: 172.23.0.14:/data/brick_cm_logs
Brick14: 172.23.0.15:/data/brick_cm_logs
Brick15: 172.23.0.16:/data/brick_cm_logs
Brick16: 172.23.0.17:/data/brick_cm_logs
Brick17: 172.23.0.18:/data/brick_cm_logs
Brick18: 172.23.0.19:/data/brick_cm_logs
Options Reconfigured:
nfs.auth-cache-ttl-sec: 360
nfs.auth-refresh-interval-sec: 360
nfs.mount-rmtab: /-
nfs.exports-auth-enable: on
nfs.export-dirs: on
nfs.export-volumes: on
nfs.nlm: off
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off

Volume Name: cm_obj_sharded
Type: Distributed-Replicate
Volume ID: 311bee36-09af-4d68-9180-b34b45e3c10b
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 3 = 18
Transport-type: tcp
Bricks:
Brick1: 172.23.0.3:/data/brick_cm_obj_sharded
Brick2: 172.23.0.2:/data/brick_cm_obj_sharded
Brick3: 172.23.0.4:/data/brick_cm_obj_sharded
Brick4: 172.23.0.5:/data/brick_cm_obj_sharded
Brick5: 172.23.0.6:/data/brick_cm_obj_sharded
Brick6: 172.23.0.7:/data/brick_cm_obj_sharded
Brick7: 172.23.0.8:/data/brick_cm_obj_sharded
Brick8: 172.23.0.9:/data/brick_cm_obj_sharded
Brick9: 172.23.0.10:/data/brick_cm_obj_sharded
Brick10: 172.23.0.11:/data/brick_cm_obj_sharded
Brick11: 172.23.0.12:/data/brick_cm_obj_sharded
Brick12: 172.23.0.13:/data/brick_cm_obj_sharded
Brick13: 172.23.0.14:/data/brick_cm_obj_sharded
Brick14: 172.23.0.15:/data/brick_cm_obj_sharded
Brick15: 172.23.0.16:/data/brick_cm_obj_sharded
Brick16: 172.23.0.17:/data/brick_cm_obj_sharded
Brick17: 172.23.0.18:/data/brick_cm_obj_sharded
Brick18: 172.23.0.19:/data/brick_cm_obj_sharded
Options Reconfigured:
features.shard: on
nfs.auth-cache-ttl-sec: 360
nfs.auth-refresh-interval-sec: 360
server.event-threads: 32
performance.io-thread-count: 32
nfs.mount-rmtab: /-
transport.listen-backlog: 16384
nfs.exports-auth-enable: on
nfs.export-dirs: on
nfs.export-volumes: on
nfs.nlm: off
performance.nfs.io-cache: on
performance.cache-refresh-timeout: 60
performance.flush-behind: on
performance.cache-size: 8GB
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: off
performance.client-io-threads: on

Volume Name: cm_shared
Type: Distributed-Replicate
Volume ID: 38093b8e-e668-4542-bc5e-34ffc491311a
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 3 = 18
Transport-type: tcp
Bricks:
Brick1: 172.23.0.3:/data/brick_cm_shared
Brick2: 172.23.0.2:/data/brick_cm_shared
Brick3: 172.23.0.4:/data/brick_cm_shared
Brick4: 172.23.0.5:/data/brick_cm_shared
Brick5: 172.23.0.6:/data/brick_cm_shared
Brick6: 172.23.0.7:/data/brick_cm_shared
Brick7: 172.23.0.8:/data/brick_cm_shared
Brick8: 172.23.0.9:/data/brick_cm_shared
Brick9: 172.23.0.10:/data/brick_cm_shared
Brick10: 172.23.0.11:/data/brick_cm_shared
Brick11: 172.23.0.12:/data/brick_cm_shared
Brick12: 172.23.0.13:/data/brick_cm_shared
Brick13: 172.23.0.14:/data/brick_cm_shared
Brick14: 172.23.0.15:/data/brick_cm_shared
Brick15: 172.23.0.16:/data/brick_cm_shared
Brick16: 172.23.0.17:/data/brick_cm_shared
Brick17: 172.23.0.18:/data/brick_cm_shared
Brick18: 172.23.0.19:/data/brick_cm_shared
Options Reconfigured:
performance.client-io-threads: on
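
For anyone curious how options like these get applied: they are ordinary
per-volume settings done with the gluster CLI, roughly like this (volume name
taken from the output above; the exact options you want will differ, and some,
like transport.address-family, may need a volume restart to take effect):

gluster volume get cm_obj_sharded all | grep -E 'shard|address-family'
gluster volume set cm_obj_sharded transport.address-family inet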

Re: [Gluster-users] gluster update question regarding new DNS resolution requirement

2021-09-21 Thread Erik Jacobson
There is a discussion in -devel as well. I came at this just thinking
"an update should work" and did take a quick look at release notes for
9.0 and 9.3. Come to think of it, I didn't read the Gluster8 relnotes
so maybe that's why I missed this. We were at 7.9 and I read 9.0 and
9.3.

We can't really disable IPV6 100% here. Well we could today but we'd
have to open it again in a couple months. Our main head node already
needs to talk to some IPV6-only stuff while also talking to IPV4 stuff.
These leaders (gluster servers) will need to speak IPV6 very soon at least
minimally. Some controllers are starting to appear, which these 'leader'
nodes need to talk to, that are IPV6-only.

It sounds like what you wrote is true though: if there is any IPV6
around, that function assumes IPV6 is what you want. A couple of
private replies (thank you!!) also mentioned this.

Maybe we'll have to make a more formal version of the patch rather than
just force-setting IPV4 (for our internal use) later on.
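
One configuration-item route I still want to rule out (I have not confirmed it
covers the peer-connection path that was failing for us) is pinning the address
family in glusterd's own volfile instead of patching name.c: add or uncomment
"option transport.address-family inet" inside the "volume management" block of
/etc/glusterfs/glusterd.vol on each server, then restart glusterd. Roughly:

grep -n address-family /etc/glusterfs/glusterd.vol
# make sure the management block contains:  option transport.address-family inet
systemctl restart glusterd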

Basically, I am in the "once in a year" window where I can update
gluster and get complete testing to be sure we don't have regressions so
we'll keep moving forward with 9.3 with the ipv4 hack in place for now.

This helps me get the context. Thank you for this note!!

Erik

On Tue, Sep 21, 2021 at 02:44:36PM +, Strahil Nikolov wrote:
> As gf_resolve_ip6 fails, I guess you can disable ipv6 on the host (if not 
> using
> the protocol) and check if it will workaround the problem till it's solved.
> 
> For RH you can check https://access.redhat.com/solutions/8709 (use RH dev
> subscription to read it, or ping me directly and I will try to summarize it 
> for
> your OS version).
> 
> 
> Best Regards,
> Strahil Nikolov
> 
> 
> On Mon, Sep 20, 2021 at 19:35, Erik Jacobson
>  wrote:
> I missed the other important log snip:
> 
> The message "E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6]
> 0-resolver: error in getaddrinfo [{family=10}, {ret=Address family for
> hostname not supported}]" repeated 620 times between [2021-09-20
> 15:49:23.720633 +] and [2021-09-20 15:50:41.731542 +]
> 
>     So I will dig in to the code some here.
> 
> 
> On Mon, Sep 20, 2021 at 10:59:30AM -0500, Erik Jacobson wrote:
> > Hello all! I hope you are well.
> >
> > We are starting a new software release cycle and I am trying to find a
> > way to upgrade customers from our build of gluster 7.9 to our build of
> > gluster 9.3
> >
> > When we deploy gluster, we forcibly remove all references to any host
> > names and use only IP addresses. This is because, if for any reason a
> > DNS server is unreachable, even if the peer files have IPs and DNS, it
> > causes glusterd to be unable to reach peers properly. We can't really
> > rely on /etc/hosts either because customers take artistic license with
> > their /etc/hosts files and don't realize the problems that can cause.
> >
> > So our deployed peer files look something like this:
> >
> > uuid=46a4b506-029d-4750-acfb-894501a88977
> > state=3
> > hostname1=172.23.0.16
> >
> > That is, with full intention, we avoid host names.
> >
> > When we upgrade to gluster 9.3, we fall over with these errors and
> > gluster is now partitioned and the updated gluster servers can't reach
> > anybody:
> >
> > [2021-09-20 15:50:41.731543 +] E
> [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
> resolution failed on host 172.23.0.16
> >
> >
> > As you can see, we have defined on purpose everything using IPs but in
> > 9.3 it appears this method fails. Are there any suggestions short of
> > putting real host names in peer files?
> >
> >
> >
> > FYI
> >
> > This supercomputer will be using gluster for part of its system
> > management. It is how we deploy the Image Objects (squashfs images)
> > hosted on NFS today and served by gluster leader nodes and also store
> > system logs, console logs, and other data.
> >
> > https://www.olcf.ornl.gov/frontier/
> >
> >
> > Erik
> > 
> >
> >
> >
> > Community Meeting Calendar:
> >
> > Schedule -
> > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> > Bridge: https://meet.google.com/cpu-eiue-hvk
> > Gluster-users mailing list
> > Gluster-users@gluster.org
> > https://lists.gluster.org/mailman/listinfo/gluster-users
>   

Re: [Gluster-users] gluster forcing IPV6 on our IPV4 servers, glusterd fails (was gluster update question regarding new DNS resolution requirement)

2021-09-21 Thread Erik Jacobson
> Don't forget to run the geo-replication fix script , if you missed to do it
> before the upgrade.

We don't use geo-replication YET but thank you for this thoughtful
reminder.

Just a note on things like this -- we really try to do everything in a
package update because that's how we'd have to deploy to customers in an
automated way. So having to run a script as part of the upgrade would be
very hard in a package-based workflow for a packaged solution.

I'm not complaining (I love gluster), but this is just food for thought.

I can hardly even say that with a straight face because we suffer from
similar issues on the cluster management side - updating from one CM
release to the next is harder than it should be, so I'm certainly not
judging. Updating is always painful.

I LOVE that slowly updating our gluster servers is "Just working".

This will allow a supercomputer to slowly update its infrastructure
while taking no compute nodes (which use nfs-hosted squashfs images for
root) down. It's really remarkable since it's a big jump, too (7.9 to
9.3); I am impressed by this part. It's a huge relief that I didn't have
to do an intermediate jump to gluster8 in the middle, as that would have
been nearly impossible for us to get right.

Thank you all!!

PS: Frontier will have 21 leader nodes running gluster servers.
Distributed/replicate in groups of 3 hosting nfs-exported squashfs image
objects for compute node root filesystems. Many thousands of nodes.

> 
> Best Regards,
> Strahil Nikolov
> 
> 
> On Tue, Sep 21, 2021 at 0:46, Erik Jacobson
>  wrote:
> I pretended I'm a low-level C programmer with network and filesystem
> experience for a few hours.
> 
> I'm not sure what the right solution is but what was happening was the
> code was trying to treat our IPV4 hosts as AF_INET6 and the family was
> incompatible with our IPV4 IP addresses. Yes, we need to move to IPV6
> but we're hoping to do that on our own time (~50 years like everybody
> else :)
> 
> I found a chunk of the code that seemed to be force-setting us to
> AF_INET6.
> 
> While I'm sure it is not 100% the correct patch, the patch attached and
> pasted below is working for me so I'll integrate it with our internal
> build to continue testing.
> 
> Please let me know if there is a configuration item I missed or a
> different way to do this. I added -devel to this email.
> 
> In the previous thread, you would have seen that we're testing a
> hopeful change that will upgrade our deployed customers from gluster
> 7.9 to gluster 9.3.
> 
> Thank you!! Advice on next steps would be appreciated !!
> 
> 
> diff -Narup glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c
> glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c
> --- glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c  2021-06-29
> 00:27:44.381408294 -0500
> +++ glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c  2021-09-20
> 16:34:28.969425361 -0500
> @@ -252,9 +252,16 @@ af_inet_client_get_remote_sockaddr(rpc_t
> /* Need to update transport-address family if address-family is not
> provided
> to command-line arguments
> */
> +/* HPE This is forcing our IPV4 servers in to to an IPV6 address
> +* family that is not compatible with IPV4. For now we will just set 
> it
> +* to AF_INET.
> +*/
> +/*
> if (inet_pton(AF_INET6, remote_host, )) {
> sockaddr->sa_family = AF_INET6;
> }
> +*/
> +sockaddr->sa_family = AF_INET;
> 
> /* TODO: gf_resolve is a blocking call. kick in some
> non blocking dns techniques */
> 
>
> On Mon, Sep 20, 2021 at 11:35:35AM -0500, Erik Jacobson wrote:
> > I missed the other important log snip:
> >
> > The message "E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6]
> 0-resolver: error in getaddrinfo [{family=10}, {ret=Address family for
> hostname not supported}]" repeated 620 times between [2021-09-20
> 15:49:23.720633 +] and [2021-09-20 15:50:41.731542 +]
> >
> > So I will dig in to the code some here.
> >
> >
> > On Mon, Sep 20, 2021 at 10:59:30AM -0500, Erik Jacobson wrote:
> > > Hello all! I hope you are well.
> > >
> > > We are starting a new software release cycle and I am trying to find a
> > > way to upgrade customers from our build of gluster 7.9 to our build of
> > > gluster 9.3
> > >
> > > When we deploy gluster, we forcibly remove all references to any host
> > > names and use only IP addresses. This is

[Gluster-users] gluster forcing IPV6 on our IPV4 servers, glusterd fails (was gluster update question regarding new DNS resolution requirement)

2021-09-20 Thread Erik Jacobson
I pretended I'm a low-level C programmer with network and filesystem
experience for a few hours.

I'm not sure what the right solution is but what was happening was the
code was trying to treat our IPV4 hosts as AF_INET6 and the family was
incompatible with our IPV4 IP addresses. Yes, we need to move to IPV6
but we're hoping to do that on our own time (~50 years like everybody
else :)

I found a chunk of the code that seemed to be force-setting us to
AF_INET6.

While I'm sure it is not 100% the correct patch, the patch attached and
pasted below is working for me so I'll integrate it with our internal
build to continue testing.

Please let me know if there is a configuration item I missed or a
different way to do this. I added -devel to this email.

In the previous thread, you would have seen that we're testing a
hopeful change that will upgrade our deployed customers from gluster
7.9 to gluster 9.3.

Thank you!! Advice on next steps would be appreciated !!


diff -Narup glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c 
glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c
--- glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c  2021-06-29 
00:27:44.381408294 -0500
+++ glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c   2021-09-20 
16:34:28.969425361 -0500
@@ -252,9 +252,16 @@ af_inet_client_get_remote_sockaddr(rpc_t
 /* Need to update transport-address family if address-family is not 
provided
to command-line arguments
 */
+/* HPE This is forcing our IPV4 servers in to to an IPV6 address
+ * family that is not compatible with IPV4. For now we will just set it
+ * to AF_INET.
+ */
+/*
 if (inet_pton(AF_INET6, remote_host, )) {
 sockaddr->sa_family = AF_INET6;
 }
+*/
+sockaddr->sa_family = AF_INET;
 
 /* TODO: gf_resolve is a blocking call. kick in some
non blocking dns techniques */


On Mon, Sep 20, 2021 at 11:35:35AM -0500, Erik Jacobson wrote:
> I missed the other important log snip:
> 
> The message "E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6] 
> 0-resolver: error in getaddrinfo [{family=10}, {ret=Address family for 
> hostname not supported}]" repeated 620 times between [2021-09-20 
> 15:49:23.720633 +] and [2021-09-20 15:50:41.731542 +]
> 
> So I will dig in to the code some here.
> 
> 
> On Mon, Sep 20, 2021 at 10:59:30AM -0500, Erik Jacobson wrote:
> > Hello all! I hope you are well.
> > 
> > We are starting a new software release cycle and I am trying to find a
> > way to upgrade customers from our build of gluster 7.9 to our build of
> > gluster 9.3
> > 
> > When we deploy gluster, we forcibly remove all references to any host
> > names and use only IP addresses. This is because, if for any reason a
> > DNS server is unreachable, even if the peer files have IPs and DNS, it
> > causes glusterd to be unable to reach peers properly. We can't really
> > rely on /etc/hosts either because customers take artistic license with
> > their /etc/hosts files and don't realize the problems that can cause.
> > 
> > So our deployed peer files look something like this:
> > 
> > uuid=46a4b506-029d-4750-acfb-894501a88977
> > state=3
> > hostname1=172.23.0.16
> > 
> > That is, with full intention, we avoid host names.
> > 
> > When we upgrade to gluster 9.3, we fall over with these errors and
> > gluster is now partitioned and the updated gluster servers can't reach
> > anybody:
> > 
> > [2021-09-20 15:50:41.731543 +] E 
> > [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS 
> > resolution failed on host 172.23.0.16
> > 
> > 
> > As you can see, we have defined on purpose everything using IPs but in
> > 9.3 it appears this method fails. Are there any suggestions short of
> > putting real host names in peer files?
> > 
> > 
> > 
> > FYI
> > 
> > This supercomputer will be using gluster for part of its system
> > management. It is how we deploy the Image Objects (squashfs images)
> > hosted on NFS today and served by gluster leader nodes and also store
> > system logs, console logs, and other data.
> > 
> > https://www.olcf.ornl.gov/frontier/  
> > 
> > 
> > Erik
> > 
> > 
> > 
> > 
> > Community Meeting Calendar:
> > 
> > Schedule -
> > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> > Bridge: https://meet.google.com/cpu-eiue-hvk  
> > Gluster-users mailing list
> > Gluster-users@gluster.org
> > https://lists.gluster.org/mailman/listinfo/gluster-users  
> 
> 
> 
> 
> Community Meeting Calendar:
> 
> Schedule -
> Every 2nd and 4th Tuesday at 14:30

Re: [Gluster-users] gluster update question regarding new DNS resolution requirement

2021-09-20 Thread Erik Jacobson
I missed the other important log snip:

The message "E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6] 0-resolver: 
error in getaddrinfo [{family=10}, {ret=Address family for hostname not 
supported}]" repeated 620 times between [2021-09-20 15:49:23.720633 +] and 
[2021-09-20 15:50:41.731542 +]

So I will dig in to the code some here.


On Mon, Sep 20, 2021 at 10:59:30AM -0500, Erik Jacobson wrote:
> Hello all! I hope you are well.
> 
> We are starting a new software release cycle and I am trying to find a
> way to upgrade customers from our build of gluster 7.9 to our build of
> gluster 9.3
> 
> When we deploy gluster, we forcibly remove all references to any host
> names and use only IP addresses. This is because, if for any reason a
> DNS server is unreachable, even if the peer files have IPs and DNS, it
> causes glusterd to be unable to reach peers properly. We can't really
> rely on /etc/hosts either because customers take artistic license with
> their /etc/hosts files and don't realize the problems that can cause.
> 
> So our deployed peer files look something like this:
> 
> uuid=46a4b506-029d-4750-acfb-894501a88977
> state=3
> hostname1=172.23.0.16
> 
> That is, with full intention, we avoid host names.
> 
> When we upgrade to gluster 9.3, we fall over with these errors and
> gluster is now partitioned and the updated gluster servers can't reach
> anybody:
> 
> [2021-09-20 15:50:41.731543 +] E 
> [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution 
> failed on host 172.23.0.16
> 
> 
> As you can see, we have defined on purpose everything using IPs but in
> 9.3 it appears this method fails. Are there any suggestions short of
> putting real host names in peer files?
> 
> 
> 
> FYI
> 
> This supercomputer will be using gluster for part of its system
> management. It is how we deploy the Image Objects (squashfs images)
> hosted on NFS today and served by gluster leader nodes and also store
> system logs, console logs, and other data.
> 
> https://www.olcf.ornl.gov/frontier/ 
> 
> 
> Erik
> 
> 
> 
> 
> Community Meeting Calendar:
> 
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://meet.google.com/cpu-eiue-hvk 
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users 






[Gluster-users] gluster update question regarding new DNS resolution requirement

2021-09-20 Thread Erik Jacobson
Hello all! I hope you are well.

We are starting a new software release cycle and I am trying to find a
way to upgrade customers from our build of gluster 7.9 to our build of
gluster 9.3

When we deploy gluster, we forcibly remove all references to any host
names and use only IP addresses. This is because, if for any reason a
DNS server is unreachable, even if the peer files have IPs and DNS, it
causes glusterd to be unable to reach peers properly. We can't really
rely on /etc/hosts either because customers take artistic license with
their /etc/hosts files and don't realize the problems that can cause.

So our deployed peer files look something like this:

uuid=46a4b506-029d-4750-acfb-894501a88977
state=3
hostname1=172.23.0.16

That is, with full intention, we avoid host names.
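
These peer definitions are the files under /var/lib/glusterd/peers/, one per
peer UUID. A quick check that no resolvable names have crept back in after an
upgrade or a probe:

grep -H hostname /var/lib/glusterd/peers/*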

When we upgrade to gluster 9.3, we fall over with these errors and
gluster is now partitioned and the updated gluster servers can't reach
anybody:

[2021-09-20 15:50:41.731543 +] E 
[name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution 
failed on host 172.23.0.16


As you can see, we have defined on purpose everything using IPs but in
9.3 it appears this method fails. Are there any suggestions short of
putting real host names in peer files?



FYI

This supercomputer will be using gluster for part of its system
management. It is how we deploy the Image Objects (squashfs images)
hosted on NFS today and served by gluster leader nodes and also store
system logs, console logs, and other data.

https://www.olcf.ornl.gov/frontier/


Erik






Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-23 Thread Erik Jacobson
> I still have to grasp the "leader node" concept.
> Weren't gluster nodes "peers"? Or by "leader" you mean that it's
> mentioned in the fstab entry like
> /l1,l2,l3:gv0 /mnt/gv0 glusterfs defaults 0 0
> while the peer list includes l1,l2,l3 and a bunch of other nodes?

Right, it's a list of 24 peers. The 24 peers are split into an 8 x 3
replicated/distributed setup for the volumes. They also have entries
for themselves as clients in /etc/fstab. I'll dump some volume info
at the end of this.
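
For reference, a gluster fuse client entry in /etc/fstab looks roughly like
this (the addresses, mount point, and the backup-volfile-servers option here
are illustrative, not lifted from a real config):

10.1.0.5:/cm_shared  /cm_shared  glusterfs  defaults,_netdev,backup-volfile-servers=10.1.0.6:10.1.0.7  0 0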


> > So we would have 24 leader nodes, each leader would have a disk serving
> > 4 bricks (one of which is simply a lock FS for CTDB, one is sharded,
> > one is for logs, and one is heavily optimized for non-object expanded
> > tree NFS). The term "disk" is loose.
> That's a system way bigger than ours (3 nodes, replica3arbiter1, up to
> 36 bricks per node).

I have one dedicated "disk" (could be disk, raid lun, single ssd) and
4 directories for volumes ("bricks"). Of course, the "ctdb" volume is just
for the lock and has a single file.
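
In case it helps anyone building something similar: that single file is just
CTDB's recovery lock, so the lock volume only needs to be fuse-mounted
somewhere every leader can see and referenced from the CTDB config. A sketch,
assuming the lock volume is mounted at /gluster/ctdb and a CTDB recent enough
to read /etc/ctdb/ctdb.conf (older versions use CTDB_RECOVERY_LOCK in the
sysconfig file instead):

# /etc/ctdb/ctdb.conf
[cluster]
    recovery lock = /gluster/ctdb/.ctdb.lock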

> 
> > Specs of a leader node at a customer site:
> >  * 256G RAM
> Glip! 256G for 4 bricks... No wonder I have had troubles running 26
> bricks in 64GB RAM... :)

I'm not an expert in memory pools or how they would be impacted by more
peers. I had to do a little research, and I think what you're after is
whether I can run "gluster volume status cm_shared mem" on a real cluster
that has a decent node count. I will see if I can do that.


TEST ENV INFO for those who care

Here is some info on my own test environment, which you can skip.

I have the environment duplicated on my desktop using virtual machines and it
runs fine (slow but fine). It's a 3x1. I take out my giant 8GB cache
from the optimized volumes but other than that it is fine. In my
development environment, the gluster disk is a 40G qcow2 image.

Cache sizes changed from 8G to 100M to fit in the VM.

XML snips for memory, cpus:

  <name>cm-leader1</name>
  <uuid>99d5a8fc-a32c-b181-2f1a-2929b29c3953</uuid>
  <memory>3268608</memory>
  <currentMemory>3268608</currentMemory>
  <vcpu>2</vcpu>
  ...


I have 1 admin (head) node VM, 3 VM leader nodes like above, and one test
compute node for my development environment.

My desktop where I test this cluster stack is a beefy but not brand new
desktop:

Architecture:x86_64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:  Little Endian
Address sizes:   46 bits physical, 48 bits virtual
CPU(s):  16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):   1
NUMA node(s):1
Vendor ID:   GenuineIntel
CPU family:  6
Model:   79
Model name:  Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Stepping:1
CPU MHz: 2594.333
CPU max MHz: 3000.
CPU min MHz: 1200.
BogoMIPS:4190.22
Virtualization:  VT-x
L1d cache:   32K
L1i cache:   32K
L2 cache:256K
L3 cache:20480K
NUMA node0 CPU(s):   0-15



(Not that it matters, but this is an HP Z640 Workstation)

128G memory (good for a desktop I know, but I think 64G would work since
I also run windows10 vm environment for unrelated reasons)

I was able to find a MegaRAID in the lab a few years ago and so I have 4
drives in a MegaRAID and carve off a separate volume for the VM disk
images. It has a cache. So that's also more beefy than a normal desktop.
(on the other hand, I have no SSDs. May experiment with that some day
but things work so well now I'm tempted to leave it until something
croaks :)

I keep all VMs for the test cluster with "Unsafe cache mode" since there
is no true data to worry about and it makes the test cases faster.

So I am able to test a complete cluster management stack including
3-leader-gluster servers, an admin, and compute all on my desktop using
virtual machines and shared networks within libvirt/qemu.

It is so much easier to do development when you don't have to reserve
scarce test clusters and compete with people. I can do 90% of my cluster
development work this way. Things fall over when I need to care about
BMCs/ILOs or need to do performance testing of course. Then I move to
real hardware and play the hunger-games-of-internal-test-resources :) :)

I mention all this just to show that beefy servers are not needed, nor
is the memory usage high. I'm not continually swapping or anything like
that.




Configuration Info from Real Machine


Some info on an active 3x3 cluster. 2738 compute nodes.

The most active volume here is "cm_obj_sharded". It is where the image
objects live and this cluster uses image objects for compute node root
filesystems. I changed the IP addresses by hand (in case I made an
error doing that).


Memory status for volume : cm_obj_sharded
--
Brick : 10.1.0.5:/data/brick_cm_obj_sharded
Mallinfo

Arena: 20676608
Ordblks  : 2077
Smblks   : 518
Hblks: 17
Hblkhd   : 

Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-22 Thread Erik Jacobson
> > The stuff I work on doesn't use containers much (unlike a different
> > system also at HPE).
> By "pods" I meant "glusterd instance", a server hosting a collection of
> bricks.

Oh ok. The term is overloaded in my world.

> > I don't have a recipe, they've just always been beefy enough for
> > gluster. Sorry I don't have a more scientific answer.
> Seems that 64GB RAM are not enough for a pod with 26 glusterfsd
> instances and no other services (except sshd for management). What do
> you mean by "beefy enough"? 128GB RAM or 1TB?

We are currently using replica-3 but may also support replica-5 in the
future.

So if you had 24 leaders like HLRS, there would be 8 replica-3 at the
bottom layer, and then distributed across. (replicated/distributed
volumes)

So we would have 24 leader nodes, each leader would have a disk serving
4 bricks (one of which is simply a lock FS for CTDB, one is sharded,
one is for logs, and one is heavily optimized for non-object expanded
tree NFS). The term "disk" is loose.

So each SU Leader (or gluster server) serving the 4 volumes in the 8x3
configuration has, in our world, some differences in CPU type, memory,
and storage depending on order, preferences, and timing (things always
move forward).

On an SU Leader, we typically do 2 RAID10 volumes with a RAID
controller including cache. However, we have moved to RAID1 in some cases with
better disks. Leaders store a lot of non-gluster stuff on "root" and
then gluster has a dedicated disk/LUN. We have been trying to improve
our helper tools to 100% wheel out a bad leader (say it melted in to the
floor) and replace it. Once we have that solid, and because our
monitoring data on the "root" drive is already redundant, we plan to
move newer servers to two NVME drives without RAID. One for gluster and
one for OS. If a leader melts in to the floor, we have a procedure to
discover a new node for that, install the base OS including
gluster/CTDB/etc, and then run a tool to re-integrate it in to the
cluster as an SU Leader node again and do the healing. Separately,
monitoring data outside of gluster will heal.

PS: I will note that I have a mini-SU-leader cluster on my desktop
(qemu/libvirt) for development. It is a 1x3 set of SU Leaders, one head node,
and one compute node. I make an adjustment to reduce the gluster cache to fit
in the memory space. Works fine. Not real fast but good enough for development.


Specs of a leader node at a customer site:
 * 256G RAM
 * Storage: 
   - MR9361-8i controller
   - 7681GB root LUN (RAID1)
   - 15.4 TB for gluster bricks (RAID10)
   - 6 SATA SSD MZ7LH7T6HMLA-5
 * AMD EPYC 7702 64-Core Processor
   - CPU(s):  128
   - On-line CPU(s) list: 0-127
   - Thread(s) per core:  2
   - Core(s) per socket:  64
   - Socket(s):   1
   - NUMA node(s):4
 * Management Ethernet
   - Gluster and cluster management co-mingled
   - 2x40G (but 2x10G would be fine)






Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-22 Thread Erik Jacobson
The stuff I work on doesn't use containers much (unlike a different
system also at HPE).

Leaders are over-sized but the sizing largely is associated with all the
other stuff leaders do, not just for gluster. That said, my gluster
settings for the expanded nfs tree (as opposed to squashfs image files on
nfs) method use heavy caching; I believe the max was 8G.

I don't have a recipe, they've just always been beefy enough for
gluster. Sorry I don't have a more scientific answer.

On Mon, Mar 22, 2021 at 02:24:17PM +0100, Diego Zuccato wrote:
> Il 19/03/2021 16:03, Erik Jacobson ha scritto:
> 
> > A while back I was asked to make a blog or something similar to discuss
> > the use cases the team I work on (HPCM cluster management) at HPE.
> Tks for the article.
> 
> I just miss a bit of information: how are you sizing CPU/RAM for pods?
> 
> -- 
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
> 
> 
> 
> 
> Community Meeting Calendar:
> 
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://meet.google.com/cpu-eiue-hvk
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users






Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-19 Thread Erik Jacobson
> But I've also tested using tmpfs (allocating half of the RAM per compute node)
> and exporting that as a distributed striped GlusterFS volume over NFS over
> RDMA to the 100 Gbps IB network so that the "ramdrives" can be used as a high
> speed "scratch disk space" that doesn't have the write endurance limits that
> NAND based flash memory SSDs have.

In my world, we leave the high speed networks to jobs so I don't have
much to offer. In our test SU Leader setup where we may not have disks,
we do carve gluster bricks out of TMPFS mounts. However, in that test
case, designed to test the tooling and not the workload, I use iscsi to
emulate disks to test the true solution.

I will just mention that the cluster manager use of squashfs image
objects sitting on NFS mounts is very fast even on top of 20G (2x10G)
mgmt infrastructure. If you combine it with a TMPFS overlay, which is
our default, you will have a writable area in TMPFS that doesn't
persist. You will have low memory usage.

For a 4-node cluster, you probably don't need to bother with squashfs
even and just mount the directory tree for the image at the right time.

By using tmpfs overlay and some post-boot configuration, you can perhaps
avoid the memory usage of what you are doing. As long as you don't need
to beat the crap out of root, an NFS root is fine and using gluster
backed disks is fine. Note that if you use exported trees with gnfs
instead of image objects, there are lots of volume tweaks you can make
to push efficiency up. For squashfs, I used a sharded volume.
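
If you want to experiment with the overlay idea, the core of it is just
squashfs-over-NFS plus a tmpfs-backed overlayfs upper layer. A rough sketch
(the server address, export path, image name, and mount points are made up;
in our case the real flow happens in the miniroot):

mkdir -p /mnt/images /mnt/ro /mnt/ovl /mnt/newroot
mount -t nfs -o ro,nolock 10.1.0.5:/cm_obj_sharded /mnt/images
mount -t squashfs -o loop,ro /mnt/images/compute-root.squash /mnt/ro
mount -t tmpfs tmpfs /mnt/ovl
mkdir -p /mnt/ovl/upper /mnt/ovl/work
mount -t overlay overlay \
    -o lowerdir=/mnt/ro,upperdir=/mnt/ovl/upper,workdir=/mnt/ovl/work \
    /mnt/newroot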

It's easy for me to write this since we have the install environment.
While nothing is "Hard" in there, it's a bunch of code developed over
time. That said, if you wanted to experiment, I can share some pieces of
what we do. I just fear it's too complicated.

I will note that some customers advocate for a tiny root - say 1.5G --
that could fit in TMPFS easily and then attach in workloads (other
filesystems with development environments over the network, or container
environments, etc). That would be another way to keep memory use low for
a diskless cluster.

(we use gnfs because we're not ready to switch to ganesha yet. It's on
our list to move if we can get it working for our load).

> Yes, it isn't as reliable or certainly not high availability (power goes down,
> and the battery backup is exhausted, then the data is lost because it sat in
> RAM), but it's to solve the problems of mechanically rotating hard drives are
> too slow, NAND flash based SSDs has finite write endurance limits, and RAM
> drives, whilst in theory, faster, is also the most expensive in a $/GB basis
> compared to the other storage solutions.
> 
> It's rather unfortunately that you have these different "tiers" of storage, 
> and
> there's really nothing else in between that can help address all of these
> issues simultaneously.
> 
> Thank you for sharing your thoughts.
> 
> Sincerely,
> 
> Ewen Chan
> 
> ━━━
> From: gluster-users-boun...@gluster.org  on
> behalf of Erik Jacobson 
> Sent: March 19, 2021 11:03 AM
> To: gluster-users@gluster.org 
> Subject: [Gluster-users] Gluster usage scenarios in HPC cluster management
>  
> A while back I was asked to make a blog or something similar to discuss
> the use cases the team I work on (HPCM cluster management) at HPE.
> 
> If you are not interested in reading about what I'm up to, just delete
> this and move on.
> 
> I really don't have a public blogging mechanism so I'll just describe
> what we're up to here. Some of this was posted in some form in the past.
> Since this contains the raw materials, I could make a wiki-ized version
> if there were a public place to put it.
> 
> 
> 
> We currently use gluster in two parts of cluster management.
> 
> In fact, gluster in our management node infrastructure is helping us to
> provide scaling and consistency to some of the largest clusters in the
> world, clusters in the TOP100 list. While I can get in to trouble by
> sharing too much, I will just say that trends are continuing and the
> future may have some exciting announcements on where on TOP100 certain
> new giant systems may end up in the coming 1-2 years.
> 
> At HPE, HPCM is the "traditional cluster manager." There is another team
> that develops a more cloud-like solution and I am not discussing that
> solution here.
> 
> 
> Use Case #1: Leader Nodes and Scale Out
> --
> - Why?
>   * Scale out
>   * Redundancy (combined with CTDB, any leader can fail)
>   * Consistency (All servers and compute agree on what the content is)
> 
> - Cluster manager has an admin or head no

Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-19 Thread Erik Jacobson
> - Gluster sizing
>   * We typically state compute nodes per leader but this is not for
> gluster per-se. Squashfs image objects are very efficient and
> probably would be fine for 2k nodes per leader. Leader nodes provide
> other services including console logs, system logs, and monitoring
> services.

I tried to avoid typos and mistakes but I missed something above. Argues
for wiki right? :)  I missed "512" :)

  * We typically state 512 compute nodes per leader but this is not for
gluster per-se. Squashfs image objects are very efficient and
probably would be fine for 2k nodes per leader. Leader nodes provide
other services including console logs, system logs, and monitoring
services.







[Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-19 Thread Erik Jacobson
A while back I was asked to make a blog or something similar to discuss
the use cases the team I work on (HPCM cluster management) at HPE.

If you are not interested in reading about what I'm up to, just delete
this and move on.

I really don't have a public blogging mechanism so I'll just describe
what we're up to here. Some of this was posted in some form in the past.
Since this contains the raw materials, I could make a wiki-ized version
if there were a public place to put it.



We currently use gluster in two parts of cluster management.

In fact, gluster in our management node infrastructure is helping us to
provide scaling and consistency to some of the largest clusters in the
world, clusters in the TOP100 list. While I can get in to trouble by
sharing too much, I will just say that trends are continuing and the
future may have some exciting announcements on where on TOP100 certain
new giant systems may end up in the coming 1-2 years.

At HPE, HPCM is the "traditional cluster manager." There is another team
that develops a more cloud-like solution and I am not discussing that
solution here.


Use Case #1: Leader Nodes and Scale Out
--
- Why?
  * Scale out
  * Redundancy (combined with CTDB, any leader can fail)
  * Consistency (All servers and compute agree on what the content is)

- Cluster manager has an admin or head node and zero or more leader nodes

- Leader nodes are provisioned in groups of 3 to use distributed
  replica-3 volumes (although at least one customer has interest
  in replica-5)

- We configure a few different volumes for different use cases

- We use Gluster NFS still because, over a year ago, Ganesha was not
  working with our workload and we haven't had time to re-test and
  engage with the community. No blame - we would also owe making sure
  our settings are right.

- We use CTDB for a measure of HA and IP alias management. We use this
  instead of pacemaker to reduce complexity.

- The volume use cases are:
  * Image sharing for diskless compute nodes (sometimes 6,000 nodes)
-> Normally squashFS image files for speed/efficiency exported NFS
-> Expanded ("chrootable") traditional NFS trees for people who
   prefer that, but they don't scale as well and are slower to boot
-> Squashfs images sit on a sharded volume while traditional gluster
   used for expanded tree.
  * TFTP/HTTP for network boot/PXE including miniroot
-> Spread across leaders too so that one node is not saturated with
   PXE/DHCP requests
-> Miniroot is a "fatter initrd" that has our CM toolchain
  * Logs/consoles
-> For traditional logs and consoles (HCPM also uses
   elasticsearch/kafka/friends but we don't put that in gluster)
-> Separate volume to have more non-cached friendly settings
  * 4 total volumes used (one sharded, one heavily optimized for
caching, one for ctdb lock, and one traditional for logging/etc)

- Leader Setup
  * Admin node installs the leaders like any other compute nodes
  * A setup tool operates that configures gluster volumes and CTDB
  * When ready, an admin/head node can be engaged with the leaders
  * At that point, certain paths on the admin become gluster fuse mounts
and bind mounts to gluster fuse mounts.

- How images are deployed (squashfs mode)
  * User creates an image using image creation tools that make a
chrootable tree style image on the admin/head node
  * mksquashfs will generate a squashfs image file on to a shared
storage gluster mount
  * Nodes will mount the filesystem with the squashfs images and then
loop mount the squashfs as part of the boot process.

- How are compute nodes tied to leaders
  * We simply have a variable in our database where human or automated
discovery tools can assign a given node to a given IP alias. This
works better for us than trying to play routing tricks or load
balance tricks
  * When compute nodes PXE boot, the DHCP response includes next-server and
the compute node uses the leader IP alias for tftp/http to fetch the
boot loader. DHCP config files are on shared storage to facilitate
future scaling of DHCP services.
  * ipxe or grub2 network config files then fetch the kernel, initrd
  * initrd has a small update to load a miniroot (install environment)
 which has more tooling
  * Node is installed (for nodes with root disks) or does a network boot
cycle.

- Gluster sizing
  * We typically state compute nodes per leader but this is not for
gluster per-se. Squashfs image objects are very efficient and
probably would be fine for 2k nodes per leader. Leader nodes provide
other services including console logs, system logs, and monitoring
services.
  * Our biggest deployment at a customer site right now has 24 leader
nodes. Bigger systems are coming.

- Startup scripts - Getting all the gluster mounts and many bind mounts
  used in the solution, as well 

Re: [Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM

2021-02-01 Thread Erik Jacobson
We think this fixed it. While there is random chance in there, we can't
repeat it in 7.9. So I'll close this thread out for now.

We'll ask for help again if needed. Thanks for all the kind responses,

Erik

On Fri, Jan 29, 2021 at 02:20:56PM -0600, Erik Jacobson wrote:
> I updated to 7.9, rebooted everything, and it started working.
> 
> I will have QE try to break it again and report back. I couldn't break
> it but they're better at breaking things (which is hard to imagine :)
> 
> 
> On Fri, Jan 29, 2021 at 01:11:50PM -0600, Erik Jacobson wrote:
> > Thank you.
> > 
> > We reproduced the problem after force-killing one of the 3 physical
> > nodes 6 times in a row.
> > 
> > At that point, the grub2 loaded off the qemu virtual hard drive, but
> > could not find partitions. Since there is random luck involved, we don't
> > actually know if it was the force-killing that caused it to stop
> > working.
> > 
> > When I start the VM with the image in this state, there is nothing
> > interesting in the fuse log for the volume in /var/log/glusterfs on the
> > node hosting the image.
> > 
> > No pending heals (all servers report 0 entries to heal).
> > 
> > The same VM behavior happens on all the physical nodes when I try to
> > start with the same VM image.
> > 
> > Something from the gluster fuse mount log from earlier shows:
> > 
> > [2021-01-28 21:24:40.814227] I [MSGID: 114018] 
> > [client.c:2347:client_rpc_notify] 0-adminvm-client-0: disconnected from 
> > adminvm-client-0. Client process will keep trying to connect to glusterd 
> > until brick's port is available
> > [2021-01-28 21:24:43.815120] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 
> > 0-adminvm-client-0: changing port to 49152 (from 0)
> > [2021-01-28 21:24:43.815833] I [MSGID: 114057] 
> > [client-handshake.c:1376:select_server_supported_programs] 
> > 0-adminvm-client-0: Using Program GlusterFS 4.x v1, Num (1298437), Version 
> > (400)
> > [2021-01-28 21:24:43.817682] I [MSGID: 114046] 
> > [client-handshake.c:1106:client_setvolume_cbk] 0-adminvm-client-0: 
> > Connected to adminvm-client-0, attached to remote volume 
> > '/data/brick_adminvm'.
> > [2021-01-28 21:24:43.817709] I [MSGID: 114042] 
> > [client-handshake.c:930:client_post_handshake] 0-adminvm-client-0: 1 fds 
> > open - Delaying child_up until they are re-opened
> > [2021-01-28 21:24:43.895163] I [MSGID: 114041] 
> > [client-handshake.c:318:client_child_up_reopen_done] 0-adminvm-client-0: 
> > last fd open'd/lock-self-heal'd - notifying CHILD-UP
> > The message "W [MSGID: 114061] [client-common.c:2893:client_pre_lk_v2] 
> > 0-adminvm-client-0:  (94695bdb-06b4-4105-9bc8-b8207270c941) remote_fd is 
> > -1. EBADFD [File descriptor in bad state]" repeated 6 times between 
> > [2021-01-28 21:23:54.395811] and [2021-01-28 21:23:54.811640]
> > 
> > 
> > But that was a long time ago.
> > 
> > Brick logs have an entry from when I first started the vm today (the
> > problem was reproduced yesterday) all brick logs have something similar.
> > Nothing appeared on the several other startup attempts of the VM:
> > 
> > [2021-01-28 21:24:45.460147] I [MSGID: 115029] 
> > [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client 
> > from 
> > CTX_ID:613f0d91-34e6-4495-859f-bca1c9f7af01-GRAPH_ID:0-PID:6287-HOST:nano-1-PC_NAME:adminvm-client-2-RECON_NO:-0
> >  (version: 7.2) with subvol /data/brick_adminvm
> > [2021-01-29 18:54:45.48] I [addr.c:54:compare_addr_and_update] 
> > 0-/data/brick_adminvm: allowed = "*", received addr = "172.23.255.153"
> > [2021-01-29 18:54:45.455802] I [login.c:110:gf_auth] 0-auth/login: allowed 
> > user names: 3b66cfab-00d5-4b13-a103-93b4cf95e144
> > [2021-01-29 18:54:45.455815] I [MSGID: 115029] 
> > [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client 
> > from 
> > CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0
> >  (version: 7.2) with subvol /data/brick_adminvm
> > [2021-01-29 18:54:45.494950] W [socket.c:774:__socket_rwv] 
> > 0-tcp.adminvm-server: readv on 172.23.255.153:48551 failed (No data 
> > available)
> > [2021-01-29 18:54:45.494994] I [MSGID: 115036] 
> > [server.c:501:server_rpc_notify] 0-adminvm-server: disconnecting connection 
> > from 
> > CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0
> > [2021-01-29 18:54:45.495091] I [MSGID: 101055] 
> > [client_t.c:436:gf_client_unref] 0-a

[Gluster-users] gnfs exports netmask handling can incorrectly deny access to clients

2021-01-30 Thread Erik Jacobson
Hello team -

First, I wish to state that I know we are supposed to move to Ganesha.
We had a lot of trouble with Ganesha in the past with our workload and
we still owe trying the very latest version and working with the
community. Some of our use cases are complicated and require very large
clusters to test. Therefore, switching has remained elusive.
We still rely on Gluster NFS.

Gluster is now used as part of the solution in some of the largest
supercomputers in the world.

We encountered a problem with Gluster NFS handling of the exports file
in relation to how it computes access rights.

We have patched our build of Gluster with this fix. I'm not sure what
the final fix would be like, but I'm hoping what I paste below will
enable us to get a final fix in to the community.

This analysis and patch were developed by Dick Riegner when I asked for
his help on this problem. There were several others involved.

What follows is his analysis. I will then paste the patch we're using
now. We would be happy to test a new version of the fix if you like (so we
can remove our patch when we upgrade). What follows are Dick's words.


ANALYSIS
==
Here is my Gluster debug output from its nfs.log file.  The working case is
from a compute node client using the IP address 10.31.128.16, and the
failing case is from a client using the IP address 10.31.133.16.


Working case

RJR01: gf_is_ip_in_net() Entered network is 10.31.128.0/18, ip_str is 
10.31.128.16
RJR20: gf_is_ip_in_net() subnet is 18, net_str is 10.31.128.0, net_ip is 
10.31.128.0
RJR40: gf_is_ip_in_net() Host byte order subnet_mask is 0003, ip_buf is 
10801f0a, net_ip_buf is 00801f0a
RJR42: gf_is_ip_in_net() Network byte order subnet_mask is 0300, ip_buf is 
0a1f8010, net_ip_buf is 0a1f8000
RJR44: gf_is_ip_in_net() Network byte order shifted 14 host bits, ip_buf is 
287e, net_ip_buf is 287e
RJR46: gf_is_ip_in_net() My result is 1
RJR99: gf_is_ip_in_net() Exiting result is 1


Failing Case

RJR01: gf_is_ip_in_net() Entered network is 10.31.128.0/18, ip_str is 
10.31.133.16
RJR20: gf_is_ip_in_net() subnet is 18, net_str is 10.31.128.0, net_ip is 
10.31.128.0
RJR40: gf_is_ip_in_net() Host byte order subnet_mask is 0003, ip_buf is 
10851f0a, net_ip_buf is 00801f0a
RJR42: gf_is_ip_in_net() Network byte order subnet_mask is 0300, ip_buf is 
0a1f8510, net_ip_buf is 0a1f8000
RJR44: gf_is_ip_in_net() Network byte order shifted 14 host bits, ip_buf is 
287e, net_ip_buf is 287e
RJR46: gf_is_ip_in_net() My result is 1
RJR99: gf_is_ip_in_net() Exiting result is 0


Gluster function gf_is_ip_in_net() verifies a client's authorization to mount
an export by comparing the subnet address of the client with an allowed
subnet address.  The comparison is made by masking the client IP
address and the allowed subnet address and permitting access when the
resulting subnets are equal.

The mask is an all-ones bit-string the length of the subnet.  In this case,
the subnet is 18 bits and the subnet mask of 0x3 is in Little Endian
ordering used by the Intel x86_64 processor.


1)  Analysis of the working case from client IP address 10.31.128.16

These addresses are in Little Endian order on an Intel x86_64 processor.

Client IP  Subnet
AddressMask Subnet
0x10801f0a  &  0x3  =>  0x01f0a

Allowed
Subnet Subnet
AddressMask Subnet
0x00801f0a  &  0x3  =>  0x01f0a

The resulting subnets are equal so Gluster allows the client to mount its
exports.


2)  Analysis of the failing case from client IP address 10.31.133.16

These addresses are in Little Endian order on an Intel x86_64 processor.

Client IP  Subnet
AddressMask Subnet
0x10851f0a  &  0x3  =>  0x11f0a

Allowed
Subnet Subnet
AddressMask Subnet
0x00801f0a  &  0x3  =>  0x01f0a

The resulting subnets are not equal so Gluster does not allow the client to
mount its exports.

The comparison is incorrectly including the two lower-order bits from part
of the host portion of the client IP address (0x85) as part of the subnet.
The subnet comparison fails and the client is incorrectly denied access to
the Gluster exports.
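
As a sanity check, the comparison these clients should pass can be reproduced
in shell using 32-bit integers built from the dotted quads (which sidesteps
the byte-order problem entirely); both client addresses above land inside
10.31.128.0/18:

# toy illustration of a correct /18 comparison -- not gluster code
ip_to_int() { local IFS=.; set -- $1; echo $(( ($1<<24) + ($2<<16) + ($3<<8) + $4 )); }
mask=$(( (0xffffffff << (32 - 18)) & 0xffffffff ))
for client in 10.31.128.16 10.31.133.16; do
    [ $(( $(ip_to_int $client) & mask )) -eq $(( $(ip_to_int 10.31.128.0) & mask )) ] \
        && echo "$client is inside 10.31.128.0/18"
done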




PROPOSED FIX DESCRIPTION
==
The fix for the incorrect access denied errors is to convert the client and
allowed subnet IP addresses from Host Byte Order (Little Endian)
format to Network Byte Order (Big Endian) format and then isolate their
subnets.  This will ensure that the subnet and host parts of their IP
addresses do not overlap.  Once their subnets are properly isolated,
the subnets can be properly compared.

The conversion from Host Byte Order to Network Byte Order is done by
calling the htonl() function.

A subnet mask is no longer used, but the subnet bit length is used to
isolate the subnet address.  Once the 

Re: [Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM

2021-01-29 Thread Erik Jacobson
I updated to 7.9, rebooted everything, and it started working.

I will have QE try to break it again and report back. I couldn't break
it but they're better at breaking things (which is hard to imagine :)


On Fri, Jan 29, 2021 at 01:11:50PM -0600, Erik Jacobson wrote:
> Thank you.
> 
> We reproduced the problem after force-killing one of the 3 physical
> nodes 6 times in a row.
> 
> At that point, the grub2 loaded off the qemu virtual hard drive, but
> could not find partitions. Since there is random luck involved, we don't
> actually know if it was the force-killing that caused it to stop
> working.
> 
> When I start the VM with the image in this state, there is nothing
> interesting in the fuse log for the volume in /var/log/glusterfs on the
> node hosting the image.
> 
> No pending heals (all servers report 0 entries to heal).
> 
> The same VM behavior happens on all the physical nodes when I try to
> start with the same VM image.
> 
> Something from the gluster fuse mount log from earlier shows:
> 
> [2021-01-28 21:24:40.814227] I [MSGID: 114018] 
> [client.c:2347:client_rpc_notify] 0-adminvm-client-0: disconnected from 
> adminvm-client-0. Client process will keep trying to connect to glusterd 
> until brick's port is available
> [2021-01-28 21:24:43.815120] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 
> 0-adminvm-client-0: changing port to 49152 (from 0)
> [2021-01-28 21:24:43.815833] I [MSGID: 114057] 
> [client-handshake.c:1376:select_server_supported_programs] 
> 0-adminvm-client-0: Using Program GlusterFS 4.x v1, Num (1298437), Version 
> (400)
> [2021-01-28 21:24:43.817682] I [MSGID: 114046] 
> [client-handshake.c:1106:client_setvolume_cbk] 0-adminvm-client-0: Connected 
> to adminvm-client-0, attached to remote volume '/data/brick_adminvm'.
> [2021-01-28 21:24:43.817709] I [MSGID: 114042] 
> [client-handshake.c:930:client_post_handshake] 0-adminvm-client-0: 1 fds open 
> - Delaying child_up until they are re-opened
> [2021-01-28 21:24:43.895163] I [MSGID: 114041] 
> [client-handshake.c:318:client_child_up_reopen_done] 0-adminvm-client-0: last 
> fd open'd/lock-self-heal'd - notifying CHILD-UP
> The message "W [MSGID: 114061] [client-common.c:2893:client_pre_lk_v2] 
> 0-adminvm-client-0:  (94695bdb-06b4-4105-9bc8-b8207270c941) remote_fd is -1. 
> EBADFD [File descriptor in bad state]" repeated 6 times between [2021-01-28 
> 21:23:54.395811] and [2021-01-28 21:23:54.811640]
> 
> 
> But that was a long time ago.
> 
> Brick logs have an entry from when I first started the vm today (the
> problem was reproduced yesterday) all brick logs have something similar.
> Nothing appeared on the several other startup attempts of the VM:
> 
> [2021-01-28 21:24:45.460147] I [MSGID: 115029] 
> [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client 
> from 
> CTX_ID:613f0d91-34e6-4495-859f-bca1c9f7af01-GRAPH_ID:0-PID:6287-HOST:nano-1-PC_NAME:adminvm-client-2-RECON_NO:-0
>  (version: 7.2) with subvol /data/brick_adminvm
> [2021-01-29 18:54:45.48] I [addr.c:54:compare_addr_and_update] 
> 0-/data/brick_adminvm: allowed = "*", received addr = "172.23.255.153"
> [2021-01-29 18:54:45.455802] I [login.c:110:gf_auth] 0-auth/login: allowed 
> user names: 3b66cfab-00d5-4b13-a103-93b4cf95e144
> [2021-01-29 18:54:45.455815] I [MSGID: 115029] 
> [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client 
> from 
> CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0
>  (version: 7.2) with subvol /data/brick_adminvm
> [2021-01-29 18:54:45.494950] W [socket.c:774:__socket_rwv] 
> 0-tcp.adminvm-server: readv on 172.23.255.153:48551 failed (No data available)
> [2021-01-29 18:54:45.494994] I [MSGID: 115036] 
> [server.c:501:server_rpc_notify] 0-adminvm-server: disconnecting connection 
> from 
> CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0
> [2021-01-29 18:54:45.495091] I [MSGID: 101055] 
> [client_t.c:436:gf_client_unref] 0-adminvm-server: Shutting down connection 
> CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0
> 
> 
> 
> Like before, if I halt the VM, kpartx the image, mount the giant root
> within the image, then unmount, unkpartx, and start the VM - it works:
> 
> nano-2:/var/log/glusterfs # kpartx -a /adminvm/images/adminvm.img
> nano-2:/var/log/glusterfs # mount /dev/mapper/loop0p31 /mnt
> nano-2:/var/log/glusterfs # dmesg|tail -3
> [85528.602570] loop: module loaded
> [85535.975623] EXT4-fs (dm-3): recovery complete
> [85535.979663] EXT4-fs (dm-3): mounted filesystem with ordered data mode. 
>

Re: [Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM

2021-01-27 Thread Erik Jacobson
> Also, I would like to point that I have VMs with large disks 1TB and 2TB, and
> have no issues. definitely would upgrade Gluster version like let's say at
> least 7.9.

Great! Thank you! We can update but it's very sensitive due to the
workload. I can't officially update our gluster until we have a cluster
with a couple thousand nodes to test with. However, for this problem,
this is on my list for the test machine. I'm hoping I can reproduce it. So far
no luck making it happen again. Once I hit it, I will try to collect more data
and, at the end, update gluster.

What do you think about the suggestion to increase the shard size? Are
you using the default size on your 1TB and 2TB images?
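
For reference, this is the kind of change I have in mind for the next attempt.
A sketch only, and only on a freshly created, still-empty volume, since my
understanding is the shard size should not be changed after data has been written:

# on a new, empty volume (adminvm here), before qemu-img writes anything
gluster volume set adminvm features.shard-block-size 512MB
# confirm what is actually in effect
gluster volume get adminvm features.shard-block-size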

> Amar also asked a question regarding enabling Sharding in the volume after
> creating the VMs disks, which would certainly mess up the volume if that what
> happened.

Oh, I missed this question. I basically scripted it quickly since I was
doing it so often. I have a similar script that takes it away to start
over (a rough sketch of that teardown follows the setup script below).

set -x
pdsh -g gluster mkdir /data/brick_adminvm/
gluster volume create adminvm replica 3 transport tcp \
    172.23.255.151:/data/brick_adminvm \
    172.23.255.152:/data/brick_adminvm \
    172.23.255.153:/data/brick_adminvm
gluster volume set adminvm group virt
gluster volume set adminvm granular-entry-heal enable
gluster volume set adminvm storage.owner-uid 439
gluster volume set adminvm storage.owner-gid 443
gluster volume start adminvm

pdsh -g gluster mount /adminvm

echo -n "press enter to continue for restore tarball"

pushd /adminvm
tar xvf /root/backup.tar
popd

echo -n "press enter to continue for qemu-img"

pushd /adminvm
qemu-img create -f raw -o preallocation=falloc /adminvm/images/adminvm.img 5T
popd
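
The matching teardown is essentially the reverse. This is a rough sketch of
the idea rather than the actual script:

set -x
# unmount the fuse mount on all three gluster servers
pdsh -g gluster umount /adminvm
# stop and delete the volume (--mode=script answers the confirmation prompts)
gluster --mode=script volume stop adminvm
gluster --mode=script volume delete adminvm
# wipe the brick directories so they can be reused for the next attempt
pdsh -g gluster rm -rf /data/brick_adminvm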


Thanks again for the kind responses,

Erik

> 
> On Wed, Jan 27, 2021 at 5:28 PM Erik Jacobson  wrote:
> 
> > > Shortly after the sharded volume is made, there are some fuse mount
> > > messages. I'm not 100% sure if this was just before or during the
> > > big qemu-img command to make the 5T image
> > > (qemu-img create -f raw -o preallocation=falloc
> > > /adminvm/images/adminvm.img 5T)
> > Any reason to have a single disk with this size ?
> 
> > Usually in any
> > virtualization I have used , it is always recommended to keep it lower.
> > Have you thought about multiple disks with smaller size ?
> 
> Yes, because the actual virtual machine is an admin node/head node cluster
> manager for a supercomputer that hosts big OS images and drives
> multi-thousand-node-clusters (boot, monitoring, image creation,
> distribution, sometimes NFS roots, etc.). So this VM is a biggie.
> 
> We could make multiple smaller images but it would be very painful since
> it differs from the normal non-VM setup.
> 
> So unlike many solutions where you have lots of small VMs with their
> own small images, this solution is one giant VM with one giant image.
> We're essentially using gluster in this use case (as opposed to others I
> have posted about in the past) for head node failover (combined with
> pacemaker).
> 
> > Also worth
> > noting is that RHII is supported only when the shard size is  512MB, so
> > it's worth trying bigger shard size .
> 
> I have put larger shard size and newer gluster version on the list to
> try. Thank you! Hoping to get it failing again to try these things!
> 
> 
> 
> --
> Respectfully
> Mahdi




Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM

2021-01-27 Thread Erik Jacobson
> > Shortly after the sharded volume is made, there are some fuse mount
> > messages. I'm not 100% sure if this was just before or during the
> > big qemu-img command to make the 5T image
> > (qemu-img create -f raw -o preallocation=falloc
> > /adminvm/images/adminvm.img 5T)
> Any reason to have a single disk with this size ?

> Usually in any
> virtualization I have used , it is always recommended to keep it lower.
> Have you thought about multiple disks with smaller size ?

Yes, because the actual virtual machine is an admin node/head node cluster
manager for a supercomputer that hosts big OS images and drives
multi-thousand-node-clusters (boot, monitoring, image creation,
distribution, sometimes NFS roots, etc.). So this VM is a biggie.

We could make multiple smaller images but it would be very painful since
it differs from the normal non-VM setup.

So unlike many solutions where you have lots of small VMs with their
own small images, this solution is one giant VM with one giant image.
We're essentially using gluster in this use case (as opposed to others I
have posted about in the past) for head node failover (combined with
pacemaker).

> Also worth
> noting is that RHII is supported only when the shard size is  512MB, so
> it's worth trying bigger shard size .

I have put larger shard size and newer gluster version on the list to
try. Thank you! Hoping to get it failing again to try these things!






Re: [Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM

2021-01-27 Thread Erik Jacobson
> Are you sure that there is no heals pending at the time of the power up

I was watching heals when the problem was persisting and it was all
clear. This was a great suggestion though.
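
For anyone following along, the checks I was running were just the standard
ones, roughly:

# per-brick list of entries needing heal for the VM image volume
gluster volume heal adminvm info
# per-brick counts, including entries in split-brain
gluster volume heal adminvm info summary
# and the split-brain entries specifically
gluster volume heal adminvm info split-brain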

> I checked my oVirt-based gluster and the only difference is:
> cluster.granular-entry-heal: enable
> The options seem fine.

> > libglusterfs0-7.2-4723.1520.210122T1700.a.sles15sp2hpe.x86_64
> > glusterfs-7.2-4723.1520.210122T1700.a.sles15sp2hpe.x86_64
> > python3-gluster-7.2-4723.1520.210122T1700.a.sles15sp2hpe.noarch
> This one is quite old although it never caused any troubles with my
> oVirt VMs. Either try with latest v7 or even v8.3 .


I can try a newer version. The issue is we have to do massive testing
with thousands of nodes to validate function and that isn't always
available. So we tend to latch on to a good one and stage an upgrade
when we have a system big enough in the factory. In this case though,
the use case is a single VM. If I could find a way to reproduce the
problem, I would be able to tell whether upgrading helped. These
hard-to-reproduce problems are painful!! We keep hitting it in places,
but triggering it has been elusive.

THANK YOU for replying back. I will continue to try to reproduce the
problem. If I get it back to consistent fail, I'll try updating gluster
then and take another closer look at the logs and post them.

Erik






Re: [Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM

2021-01-26 Thread Erik Jacobson
 in the field.




On Tue, Jan 26, 2021 at 07:40:19AM -0600, Erik Jacobson wrote:
> Thank you so much for responding! More below.
> 
> 
> >  Anything in the logs of the fuse mount? can you stat the file from the 
> > mount?
> > also, the report of an image is only 64M makes me think about Sharding as 
> > the
> > default value of Shard size is 64M.
> > Do you have any clues on when this issue start to happen? was there any
> > operation done to the Gluster cluster?
> 
> 
> - I had just created the gluster volumes within an hour of the problem
>   to test the very problem I reported. So it was a "fresh start".
> 
> - It booted one or two times, then stopped booting. Once it couldn't
>   boot, all 3 nodes were the same in that grub2 couldn't boot in the VM
>   image.
> 
> As for the fuse log, I did see a couple of these before it happened the
> first time. I'm not sure if it's a clue or not.
> 
> [2021-01-25 22:48:19.310467] I [fuse-bridge.c:5777:fuse_graph_sync] 0-fuse: 
> switched to graph 0
> [2021-01-25 22:50:09.693958] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> 
> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17a)[0x7f914e346faa] (--> 
> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x874a)[0x7f914a3d374a] (--> 
> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x91cb)[0x7f914a3d41cb] (--> 
> /lib64/libpthread.so.0(+0x84f9)[0x7f914cf184f9] (--> 
> /lib64/libc.so.6(clone+0x3f)[0x7f914c76afbf] ) 0-glusterfs-fuse: writing 
> to fuse device failed: No such file or directory
> [2021-01-25 22:50:09.694462] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> 
> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17a)[0x7f914e346faa] (--> 
> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x874a)[0x7f914a3d374a] (--> 
> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x91cb)[0x7f914a3d41cb] (--> 
> /lib64/libpthread.so.0(+0x84f9)[0x7f914cf184f9] (--> 
> /lib64/libc.so.6(clone+0x3f)[0x7f914c76afbf] ) 0-glusterfs-fuse: writing 
> to fuse device failed: No such file or directory
> 
> 
> 
> I have reserved the test system again. My plans today are:
>  - Start over with the gluster volume on the machine with sles15sp2
>updates
> 
>  - Learn if there are modifications to the image (besides
>mounting/umounting filesystems with the image using kpartx to map
>them to force it to work). What if I add/remove a byte from the end
>of the image file for example.
> 
>  - Revert the setup to sles15sp2 with no updates. My theory is the
>updates are not making a difference and it's just random chance.
>(re-making the gluster volume in the process)
> 
>  - The 64MB shard size made me think too!!
> 
>  - If the team feels it is worth it, I could try a newer gluster. We're
>using the versions we've validated at scale when we have large
>clusters in the factory but if the team thinks I should try something
>else I'm happy to re-build it!!!  We are @ 7.2 plus afr-event-gen-changes
>patch.
> 
> I will keep a better eye on the fuse log to tie an error to the problem
> starting.
> 
> 
> THANKS AGAIN for responding and let me know if you have any more
> clues!
> 
> Erik
> 
> 
> > 
> > On Tue, Jan 26, 2021 at 2:40 AM Erik Jacobson  wrote:
> > 
> > Hello all. Thanks again for gluster. We're having a strange problem
> > getting virtual machines started that are hosted on a gluster volume.
> > 
> > One of the ways we use gluster now is to make a HA-ish cluster head
> > node. A virtual machine runs in the shared storage and is backed up by 3
> > physical servers that contribute to the gluster storage share.
> > 
> > We're using sharding in this volume. The VM image file is around 5T and
> > we use qemu-img with falloc to get all the blocks allocated in advance.
> > 
> > We are not using gfapi largely because it would mean we have to build
> > our own libvirt and qemu and we'd prefer not to do that. So we're using
> > a glusterfs fuse mount to host the image. The virtual machine is using
> > virtio disks but we had similar trouble using scsi emulation.
> > 
> > The issue: - all seems well, the VM head node installs, boots, etc.
> > 
> > However, at some point, it stops being able to boot! grub2 acts like it
> > cannot find /boot. At the grub2 prompt, it can see the partitions, but
> > reports no filesystem found where there are indeed filesystems.
> > 
> > If we switch qemu to use "direct kernel load" (bypass grub2), this often
> > works around the problem but in one case Linux gave us a clue. Linu

Re: [Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM

2021-01-26 Thread Erik Jacobson
Thank you so much for responding! More below.


>  Anything in the logs of the fuse mount? can you stat the file from the mount?
> also, the report of an image is only 64M makes me think about Sharding as the
> default value of Shard size is 64M.
> Do you have any clues on when this issue start to happen? was there any
> operation done to the Gluster cluster?


- I had just created the gluster volumes within an hour of the problem
  to test the very problem I reported. So it was a "fresh start".

- It booted one or two times, then stopped booting. Once it couldn't
  boot, all 3 nodes were the same in that grub2 couldn't boot in the VM
  image.

As for the fuse log, I did see a couple of these before it happened the
first time. I'm not sure if it's a clue or not.

[2021-01-25 22:48:19.310467] I [fuse-bridge.c:5777:fuse_graph_sync] 0-fuse: 
switched to graph 0
[2021-01-25 22:50:09.693958] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> 
/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17a)[0x7f914e346faa] (--> 
/usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x874a)[0x7f914a3d374a] (--> 
/usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x91cb)[0x7f914a3d41cb] (--> 
/lib64/libpthread.so.0(+0x84f9)[0x7f914cf184f9] (--> 
/lib64/libc.so.6(clone+0x3f)[0x7f914c76afbf] ) 0-glusterfs-fuse: writing to 
fuse device failed: No such file or directory
[2021-01-25 22:50:09.694462] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> 
/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17a)[0x7f914e346faa] (--> 
/usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x874a)[0x7f914a3d374a] (--> 
/usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x91cb)[0x7f914a3d41cb] (--> 
/lib64/libpthread.so.0(+0x84f9)[0x7f914cf184f9] (--> 
/lib64/libc.so.6(clone+0x3f)[0x7f914c76afbf] ) 0-glusterfs-fuse: writing to 
fuse device failed: No such file or directory



I have reserved the test system again. My plans today are:
 - Start over with the gluster volume on the machine with sles15sp2
   updates

 - Learn if there are modifications to the image (besides
   mounting/umounting filesystems with the image using kpartx to map
   them to force it to work). What if I add/remove a byte from the end
   of the image file for example.

 - Revert the setup to sles15sp2 with no updates. My theory is the
   updates are not making a difference and it's just random chance.
   (re-making the gluster volume in the process)

 - The 64MB shard size made me think too!!

 - If the team feels it is worth it, I could try a newer gluster. We're
   using the versions we've validated at scale when we have large
   clusters in the factory but if the team thinks I should try something
   else I'm happy to re-build it!!!  We are @ 7.2 plus afr-event-gen-changes
   patch.

I will keep a better eye on the fuse log to tie an error to the problem
starting.


THANKS AGAIN for responding and let me know if you have any more
clues!

Erik


> 
> On Tue, Jan 26, 2021 at 2:40 AM Erik Jacobson  wrote:
> 
> Hello all. Thanks again for gluster. We're having a strange problem
> getting virtual machines started that are hosted on a gluster volume.
> 
> One of the ways we use gluster now is to make a HA-ish cluster head
> node. A virtual machine runs in the shared storage and is backed up by 3
> physical servers that contribute to the gluster storage share.
> 
> We're using sharding in this volume. The VM image file is around 5T and
> we use qemu-img with falloc to get all the blocks allocated in advance.
> 
> We are not using gfapi largely because it would mean we have to build
> our own libvirt and qemu and we'd prefer not to do that. So we're using
> a glusterfs fuse mount to host the image. The virtual machine is using
> virtio disks but we had similar trouble using scsi emulation.
> 
> The issue: - all seems well, the VM head node installs, boots, etc.
> 
> However, at some point, it stops being able to boot! grub2 acts like it
> cannot find /boot. At the grub2 prompt, it can see the partitions, but
> reports no filesystem found where there are indeed filesystems.
> 
> If we switch qemu to use "direct kernel load" (bypass grub2), this often
> works around the problem but in one case Linux gave us a clue. Linux
> reported /dev/vda as only being 64 megabytes, which would explain a lot.
> This means the virtual machine's Linux thought the disk supplied by the
> disk image was tiny! 64M instead of 5T
> 
> We are using sles15sp2 and hit the problem more often with updates
> applied than without. I'm in the process of trying to isolate if there
> is a sles15sp2 update causing this, or if we're within "random chance".
> 
> On one of the physical nodes, if it is in the failure mode, if I use
> 'kpartx' to create the

[Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM

2021-01-25 Thread Erik Jacobson
Hello all. Thanks again for gluster. We're having a strange problem
getting virtual machines started that are hosted on a gluster volume.

One of the ways we use gluster now is to make a HA-ish cluster head
node. A virtual machine runs in the shared storage and is backed up by 3
physical servers that contribute to the gluster storage share.

We're using sharding in this volume. The VM image file is around 5T and
we use qemu-img with falloc to get all the blocks allocated in advance.

We are not using gfapi largely because it would mean we have to build
our own libvirt and qemu and we'd prefer not to do that. So we're using
a glusterfs fuse mount to host the image. The virtual machine is using
virtio disks but we had similar trouble using scsi emulation.

The issue: - all seems well, the VM head node installs, boots, etc.

However, at some point, it stops being able to boot! grub2 acts like it
cannot find /boot. At the grub2 prompt, it can see the partitions, but
reports no filesystem found where there are indeed filesystems.

If we switch qemu to use "direct kernel load" (bypass grub2), this often
works around the problem but in one case Linux gave us a clue. Linux
reported /dev/vda as only being 64 megabytes, which would explain a lot.
This means the virtual machine's Linux thought the disk supplied by the
disk image was tiny! 64M instead of 5T

We are using sles15sp2 and hit the problem more often with updates
applied than without. I'm in the process of trying to isolate if there
is a sles15sp2 update causing this, or if we're within "random chance".

On one of the physical nodes, if it is in the failure mode, if I use
'kpartx' to create the partitions from the image file, then mount the
giant root filesystem (ie mount /dev/mapper/loop0p31 /mnt) and then
umount /mnt, then that physical node starts the VM fine, grub2 loads,
the virtual machine is fully happy!  Until I try to shut it down and
start it up again, at which point it sticks at grub2 again! What about
mounting the image file makes it so qemu sees the whole disk?
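
Spelled out, the workaround cycle on a physical host looks roughly like this
(loop0p31 is the big root partition inside our image; the VM name at the end
is just illustrative):

# map the partitions inside the raw image to loop devices
kpartx -a /adminvm/images/adminvm.img
# mount and immediately unmount the big root filesystem
mount /dev/mapper/loop0p31 /mnt
umount /mnt
# tear the mappings back down
kpartx -d /adminvm/images/adminvm.img
# and now the VM boots normally (until the next shutdown)
virsh start adminvm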

The problem doesn't always happen but once it starts, the same VM image has
trouble starting on any of the 3 physical nodes sharing the storage.
But if I use the trick to force-mount the root within the image with
kpartx, the machine can come up. My only guess is this changes the
file just a tiny bit in the middle of the image.

Once the problem starts, it keeps happening except temporarily working
when I do the loop mount trick on the physical admin.


Here is some info about what I have in place:


nano-1:/adminvm/images # gluster volume info

Volume Name: adminvm
Type: Replicate
Volume ID: 67de902c-8c00-4dc9-8b69-60b93b5f6104
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.23.255.151:/data/brick_adminvm
Brick2: 172.23.255.152:/data/brick_adminvm
Brick3: 172.23.255.153:/data/brick_adminvm
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: enable
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 1
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
cluster.granular-entry-heal: enable
storage.owner-uid: 439
storage.owner-gid: 443




libglusterfs0-7.2-4723.1520.210122T1700.a.sles15sp2hpe.x86_64
glusterfs-7.2-4723.1520.210122T1700.a.sles15sp2hpe.x86_64
python3-gluster-7.2-4723.1520.210122T1700.a.sles15sp2hpe.noarch



nano-1:/adminvm/images # uname -a
Linux nano-1 5.3.18-24.46-default #1 SMP Tue Jan 5 16:11:50 UTC 2021 (4ff469b) 
x86_64 x86_64 x86_64 GNU/Linux
nano-1:/adminvm/images # rpm -qa | grep qemu-4
qemu-4.2.0-9.4.x86_64



Would love any advice


Erik






Re: [Gluster-users] State of Gluster project

2020-06-22 Thread Erik Jacobson
> For NVMe/SSD  - raid controller is pointless ,  so JBOD makes  most sense.

I am game for an education lesson here. We're still using spinning drives
with big RAID caches, but we keep discussing SSD in the context of RAID. I
have read that for many real-world workloads, RAID0 makes no sense with
modern SSDs. I get that part. But if your concern is reliability and
reducing the need to mess with Gluster to recover from a drive failure,
a RAID1 or RAID10 (or some other level with redundancy) would seem to at
least make sense from that perspective.

Was your answer a performance answer? Or am I missing something about
RAIDs for redundancy and SSDs being a bad choice?

Thanks again as always,

Erik






Re: [Gluster-users] State of Gluster project

2020-06-21 Thread Erik Jacobson
I agree with this assessment for the most part. I'll just add that,
during development of Gluster based solutions, we had internal use of
Redhat Gluster. This was over a year and a half ago when we started.
For my perhaps non-mainstream use cases, I found the latest versions of
gluster 7 actually fixed several of my issues. Now, I did not try to
work with RedHat when I hit problems as it was only "non-shippable
support" - we could install it but not deliver it. Since it didn't work
well for our strange use cases, we moved on to building our own Gluster
instead of working to have customers buy the Red Hat one.
(We also support sles12, sles15, rhel7, rhel8 - so having Red Hat's
version of Gluster sort of wouldn't have worked out for us anyway).

However, I also found that it is quite easy for my use case to hit new bugs.
When we go from gluster72 to one of the newer ones, little things might
happen (and did happen). I don't complain because I get free support
from you and I do my best to fix them if I have time and access to a
failing system.

A tricky thing in my world is we will sell a cluster with 5,000 nodes to
boot and my test cluster may have 3 nodes. I can get time on up to 128
nodes on one test system. But I only get short-term access to bigger systems
at the factory. So being able to change from one Gluster version to another is
a real challenge for us because there simply is no way for us to test
very often and, as is normal in HPC, problems only show at scale.
hahaa :) :)

This is also why we are still using Gluster NFS. We know we need to work
with the community on fixing some Ganesha issues, but the amount of time
we get on a large machine that exhibits the problem is short and we must
prioritize. This is why I'm careful to never "blame Ganesha" but rather
point out that we haven't had time to track the issues down with the
Ganesha community. Meanwhile we hope we can keep building Gluster NFS :)

When I next do a version-change of Gluster or try Ganesha again, it will be
when I have sustained access to at least a 1024 node cluster to boot with
3 or 6 Gluster servers to really work out any issues.

I consider this "a cost of doing business in the world I work in" but it
is a real challenge indeed. I assume Gluster developers face a parallel
challenge: "works fine on my limited hardware or virtual machines".

Erik

> With  every community project ,  you are in the position  of a Betta  Tester  
> - no matter Fedora,  Gluster  or CEPH. So far  ,  I had  issues with upstream 
>  projects only diring and immediately after patching  - but this is properly 
> mitigated  with a  reasonable patching strategy (patch  test environment and 
> several months later  patch prod with the same repos).
> Enterprise  Linux breaks (and alot) having 10-times more  users and use  
> cases,  so you cannot expect to start to use  Gluster  and assume that a  
> free  peoject won't break at all.
> Our part in this project is to help the devs to create a test case for our 
> workload ,  so  regressions will be reduced to minimum.
> 
> In the past 2  years,  we  got 2  major  issues with VMware VSAN and 1  major 
>  issue  with  a Enterprise Storage cluster (both solutions are quite  
> expensive)  - so  I always recommend proper  testing  of your  software .
> 
> 
> >> That's  true,  but  you  could  also  use  NFS Ganesha,  which  is
> >> more  performant  than FUSE and also as  reliable  as  it.
> >
> >From this very list I read about many users with various problems when 
> >using NFS Ganesha. Is that a wrong impression?
> 
> >From my observations,  almost nobody  is complaining about Ganesha in the 
> >mailing list -> 50% are  having issues  with geo replication,20%  are  
> >having issues with small file performance and the rest have issues with very 
> >old version of gluster  -> v5 or older.
> 
> >> It's  not so hard to  do it  -  just  use  either  'reset-brick' or
> >> 'replace-brick' .
> >
> >Sure - the command itself is simple enough. The point it that each 
> >reconstruction is quite more "riskier" than a simple RAID 
> >reconstruction. Do you run a full Gluster SDS, skipping RAID? How do
> >you 
> >found this setup?
> 
> I  can't say that a  replace-brick  on a 'replica  3' volume is more  riskier 
>  than a rebuild  of a raid,  but I have noticed that nobody is  following Red 
> Hat's  guide  to use  either:
> -  a  Raid6  of 12  Disks (2-3  TB  big)
> -  a Raid10  of  12  Disks (2-3  TB big)
> -  JBOD disks in 'replica  3' mode (i'm not sure about the size  RH 
> recommends,  most probably 2-3 TB)
>  So far,  I didn' have the opportunity to run on JBODs.
> 
> 
> >Thanks.
> 
> 
> 
> 
> Community Meeting Calendar:
> 
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://bluejeans.com/441850968 
> 
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users 





Re: [Gluster-users] State of Gluster project

2020-06-17 Thread Erik Jacobson
> It is very hard to compare them because they are structurally very different. 
> For example, GlusterFS performance will depend *a lot* on the underlying file 
> system performance. Ceph eliminated that factor by using Bluestore.
> Ceph is very well performing for VM storage, since it's block based and as 
> such optimized for that. I haven't tested CephFS a lot (I used it but only 
> for very small storage) so I cannot speak for its performance, but I am 
> guessing it's not ideal. For large amount of files thus GlusterFS is still a 
> good choice.


Was your experience above based on using a sharded volume or a normal
one? When we worked with virtual machine images, we followed the volume
sharding advice. I don't have a comparison for Ceph handy. I was just
curious. It worked so well for us (but maybe our storage is "too good")
that we found it hard to imagine it could be improved much. This was a
simple case though of a single VM, 3 gluster servers, a sharded volume,
and a raw virtual machine image. Probably a simpler case than yours.

Thank you for writing this and take care,

Erik

> 
> One *MAJOR* advantage of Ceph over GlusterFS is tooling. Ceph's 
> self-analytics, status reporting and problem fixing toolset is just so far 
> beyond GlusterFS that it's really hard for me to recommend GlusterFS for any 
> but the most experienced sysadmins. It does come with the type of 
> implementation Ceph has chosen that they have to have such good tooling 
> (because honestly, poking around in binary data structures really wouldn't be 
> practical for most users), but whenever I had a problem with Ceph the 
> solution was just a couple of command line commands (even if it meant to 
> remove a storage device, wipe it and add it back), where with GlusterFS it 
> means poking around in the .glusterfs directory, looking up inode numbers, 
> extended attributes etc. which is a real pain if you have a 
> multi-million-file filesystem to work on. And that's not even with sharding 
> or distributed volumes.
> 
> Also, Ceph has been a lot more stable that GlusterFS for us. The amount of 
> hand-holding GlusterFS needs is crazy. With Ceph, there is this one bug (I 
> think in certain Linux kernel versions) where it sometimes reads only zeroes 
> from disk and complains about that and then you have to restart that OSD to 
> not run into problems, but that's one "swatch" process on each machine that 
> will do that automatically for us. I have run some Ceph clusters for several 
> years now and only once or twice I had to deal with problems. The several 
> GlusterFS clusters we operate constantly run into troubles. We now shut down 
> all GlusterFS clients before we reboot any GlusterFS node because it was near 
> impossible to reboot a single node without running into unrecoverable 
> troubles (heal entries that will not heal etc.). With Ceph we can achieve 
> 100% uptime, we regularly reboot our hosts one by one and some minutes later 
> the Ceph cluster is clean again.
> 
> If others have more insights I'd be very happy to hear them.
> 
> Stefan
> 
> 
> - Original Message -
> > Date: Tue, 16 Jun 2020 20:30:34 -0700
> > From: Artem Russakovskii 
> > To: Strahil Nikolov 
> > Cc: gluster-users 
> > Subject: Re: [Gluster-users] State of Gluster project
> > 
> > Has anyone tried to pit Ceph against gluster? I'm curious what the ups and
> > downs are.
> > 
> > On Tue, Jun 16, 2020, 4:32 PM Strahil Nikolov  wrote:
> > 
> >> Hey Mahdi,
> >>
> >> For me it looks like Red Hat are focusing more  on CEPH  than on Gluster.
> >> I hope the project remains active, cause it's very difficult to find a
> >> Software-defined Storage as easy and as scalable as Gluster.
> >>
> >> Best Regards,
> >> Strahil Nikolov
> >>
> >> On 17 June 2020 at 0:06:33 GMT+03:00, Mahdi Adnan wrote:
> >> >Hello,
> >> >
> >> > I'm wondering what's the current and future plan for Gluster project
> >> >overall, I see that the project is not as busy as it was before "at
> >> >least
> >> >this is what I'm seeing" Like there are fewer blogs about what the
> >> >roadmap
> >> >or future plans of the project, the deprecation of Glusterd2, even Red
> >> >Hat
> >> >Openshift storage switched to Ceph.
> >> >As the community of this project, do you feel the same? Is the
> >> >deprecation
> >> >of Glusterd2 concerning? Do you feel that the project is slowing down
> >> >somehow? Do you think Red Hat is abandoning the project or giving fewer
> >> >resources to Gluster?
> >> 
> >>
> >>
> >>
> >> Community Meeting Calendar:
> >>
> >> Schedule -
> >> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> >> Bridge: https://bluejeans.com/441850968 
> >>
> >> Gluster-users mailing list
> >> Gluster-users@gluster.org
> >> https://lists.gluster.org/mailman/listinfo/gluster-users 
> >>

Re: [Gluster-users] State of Gluster project

2020-06-17 Thread Erik Jacobson
We never ran tests with Ceph mostly due to time constraints in
engineering. We also liked that, at least when I started as a novice,
gluster seemed easier to set up. We use the solution in automated
setup scripts for maintaining very large clusters. Simplicity in
automated setup is critical here for us including automated installation
of supercomputers in QE and near-automation at customer sites.

We have been happy with our performance using gluster and gluster NFS
for root filesystems when using squashfs object files for the NFS roots
as opposed to expanded files (on a sharded volume). For writable NFS, we
use XFS filesystem images on gluster NFS instead of expanded trees (in
this case, not on a sharded volume).
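
To make the image-object idea concrete, here is a rough sketch of the
pattern; paths, names, and sizes are made up for illustration:

# on a server: build one squashfs object per OS image and put it on the gluster volume
mksquashfs /var/lib/images/compute-sles15 /cm_shared/images/compute-sles15.squashfs -comp xz
# also on a server: a sparse XFS image for per-node writable data
truncate -s 20G /cm_shared/images/node042-rw.img
mkfs.xfs /cm_shared/images/node042-rw.img

# on a compute node: NFS-mount the image directory, then loop-mount the objects
mkdir -p /images /rootfs.ro /rootfs.rw
mount -t nfs leader1:/cm_shared/images /images
mount -o loop,ro /images/compute-sles15.squashfs /rootfs.ro
mount -o loop /images/node042-rw.img /rootfs.rw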

We have systems running as large as 3072 nodes with 16 gluster servers
(subvolumes of 3, distributed/replicate).

We will have 5k nodes in production soon and will need to support 10k
nodes in a year or so. So far we use CTDB for "ha-like" functionality as
pacemaker is scary to us.
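
For the curious, the CTDB piece is mostly just two small config files on each
gluster/NFS server; a sketch with made-up addresses and interface name:

# /etc/ctdb/nodes: the internal IP of every CTDB node, same file and order everywhere
cat > /etc/ctdb/nodes <<'EOF'
172.23.0.3
172.23.0.4
172.23.0.5
EOF

# /etc/ctdb/public_addresses: floating IPs that CTDB keeps on healthy nodes;
# compute nodes mount NFS against these
cat > /etc/ctdb/public_addresses <<'EOF'
172.23.0.50/16 bond0
172.23.0.51/16 bond0
172.23.0.52/16 bond0
EOF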


We also have designed a second solution around gluster for
high-availability head nodes (aka admin nodes). The old solution used two
admin nodes, pacemaker, and external shared storage to host a VM that would
start on the 2nd server if the first server died. As we know, 2-node HA
is not optimal. We designed a new 3-server HA solution that eliminates
the external shared storage (which was expensive) and instead uses
gluster, a sharded volume, and a qemu raw image hosted in the shared
storage to host the virtual admin node. We use a 4-disk RAID10 per
server for gluster use in this. We have been happy with the performance
of this. It's only a little slower than the external shared filesystem
solution (we tended to use GFS2 or OCFS or whatever it is called in the
past solution). We did need to use pacemaker for this one as virtual
machine availability isn't suitable for CTDB (or less natural anyway).
One highlight of this solution is it allows a customer to put each of
the 3 servers in a separate firewalled vault or room to keep the head node
alive even if there were a fire that destroyed one server.
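
In pacemaker terms the virtual admin node is a single resource; a
stripped-down sketch of the idea (not our actual cluster configuration, and
the names and paths are made up):

# the admin VM as a pacemaker-managed libvirt guest; pacemaker starts it on one
# of the three gluster servers and restarts it on another server if that one dies
pcs resource create adminvm ocf:heartbeat:VirtualDomain \
    config=/adminvm/adminvm.xml hypervisor=qemu:///system \
    op start timeout=120s op stop timeout=120s op monitor interval=30s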

A key to our use of gluster and not suffering from poor performance in
our root-filesystem-workloads is encapsulating filesystems in image
files instead of using expanded trees of small files.

So far we have relied on gluster NFS for the boot servers as Ganesha
would crash. We haven't re-tried in several months though and owe
debugging on that front. We have not had resources to put in to
debugging Ganesha just yet.
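
For anyone wondering what "keep building Gluster NFS" means day to day, on
the volume side it is only a toggle (a sketch; the volume name is
illustrative, and gnfs itself has to be compiled in at build time):

# enable the built-in gluster NFS server for a volume
gluster volume set cm_shared nfs.disable off
# check that the NFS server process is up for the volume and see the exports
gluster volume status cm_shared nfs
showmount -e localhost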

I sure hope Gluster stays healthy and active. It is good to have
multiple solutions with various strengths out there. I like choice.
Plus, choice lets us learn from each other. I hope project sponsors see
that too.

Erik

> On 17.06.2020 08:59, Artem Russakovskii wrote:
> > It may be stable, but it still suffers from performance issues, which
> > the team is working on. But nevertheless, I'm curious if maybe Ceph has
> > those problem sorted by now.
> 
> 
> Dunno, we run gluster on small clusters, kvm and gluster on the same hosts.
> 
> There were plans to use ceph on dedicated server next year, but budget cut
> because you don't want to buy our oil for $120 ;-)
> 
> Anyway, in our tests ceph is faster, this is why we wanted to use it, but
> not migrate from gluster.
> 
> 
> 
> 
> 
> 
> Community Meeting Calendar:
> 
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://bluejeans.com/441850968
> 
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users






Re: [Gluster-users] MTU 9000 question

2020-05-06 Thread Erik Jacobson
Thank you !!!

We are going to try to run some experiments as well in the coming weeks.
Assuming I don't get re-routed, which often happens, I'll share if we
notice anything in our workload.

On Wed, May 06, 2020 at 07:41:56PM +0400, Dmitry Melekhov wrote:
> 
> 06.05.2020 19:15, Erik Jacobson пишет:
> > > It's been working pretty
> > > well at 1500 MTU so far. If the only issue is less throughput, that may
> > > be a price we can pay since we're not bandwidth bound right now.
> > > 
> 
> I think that fragmentation offload on nics makes jumbo frames not very
> useful.
> 
> As I said we see no difference in our workload, we switched nics to mtu 9000
> just because we can, but not from start,
> 
> and we did not see any improvements.
> 
> Gluster never saturates our teamed in two 10Gb connections nor with mtu 9000
> nor with 1500 and there is no visible latency difference.
> 
> 
> 
> 
> 







Re: [Gluster-users] MTU 9000 question

2020-05-06 Thread Erik Jacobson
> On the other side allow jumbo frames and change mtu on even hundreds on
> nodes is extremely simple,
> 
> you can just test it. I don't see "bunch of extra work" here, just use ssh
> and some scripting or something like ansible...

Our issue is we decided to simplify the configuration in our cluster
manager so that cluster management traffic, NFS, and gluster are
co-mingled. Works great. However, we often need to talk to BMCs on that
same network, and many BMCs don't handle MTU 9K correctly. Often a BMC
will seem to work but if you send something big like firmware flash to
it, it never completes the transfer due to the MTU mismatch. So the
"hard part" is due to our own stuff. 

We have a method in the cluster manager to put BMCs in a separate
network but that isn't a common choice.

We are investigating using MTU size-by-path but that gets complicated to
test. Therefore, we are looking to understand the real-world problem with
a 1500 MTU on 2x bonded 10G networks with gluster to decide if we want to
put time and resource to solve the problem. It's been working pretty
well at 1500 MTU so far. If the only issue is less throughput, that may
be a price we can pay since we're not bandwidth bound right now.
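
The "MTU size-by-path" idea we are weighing is basically route-level MTU
clamping, which only really helps if the BMCs at least occupy their own
address range; a rough sketch, with made-up interface and subnet names:

# jumbo frames on the bond for gluster/NFS traffic
ip link set dev bond0 mtu 9000
# clamp the path MTU toward the BMC range so things like firmware pushes still work
ip route replace 172.24.0.0/16 dev bond0 mtu lock 1500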

Erik






[Gluster-users] MTU 9000 question

2020-05-06 Thread Erik Jacobson
It is inconvenient for us to use MTU 9K for our gluster servers for
various reasons. We typically have bonded 10G interfaces.

We use distribute/replicate and gluster NFS for compute nodes.

My understanding is that the main downside to using a 1500 MTU is just less
efficient use of the network. Are there other concerns? We don't
currently have network saturation problems.

We are trying to decide whether we need to do a bunch of extra
work to switch to a 9K MTU and whether it is worth the benefit.

Does the community have any suggestions?

Erik






Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-17 Thread Erik Jacobson
Amar, Ravi -

> This thread has been one of the largest effort to stabilize the systems in
> recent times.

Well thanks to you guys too. It would have been easy to stop replying
when things got hard. I understand best effort community support and
appreciate you sticking with us.

The test system I had is disappearing on Monday. However, a larger test
system will be less booked after a release finalizes. So I have a
test platform through early next week, and will again have something in a
couple weeks. I also may have a window at 1k nodes at a customer site
during a maintenance window... And we should have a couple big ones
going through the factory in the comings weeks in the 1k size. At 1k
nodes, we have 3 gluster servers.

THANKS AGAIN. Wow what a relief.

Let me get these changes checked in so I can get it to some customers and
then look at getting a new thread going on the thread hangs.

Erik


> 
> Thanks for patience and number of retries you did, Erik!
> 
> Thanks indeed! Once https://review.gluster.org/#/c/glusterfs/+/24316/ gets
> merged on master, I will back port it to the release branches.
> 
> 
> We surely need to get to the glitch you found with the 7.4 version, as 
> with
> every higher version, we expect more stability!
> 
> True, maybe we should start a separate thread...
> 
> Regards,
> Ravi
> 
> Regards,
> Amar
> 
> On Fri, Apr 17, 2020 at 2:46 AM Erik Jacobson 
> wrote:
> 
> I have some news.
> 
> After many many many trials, reboots of gluster servers, reboots of
> nodes...
> in what should have reproduced the issue several times. I think we're
> stable.
> 
> It appears this glusterfs nfs daemon hang only happens in glusterfs74
> and not 72.
> 
> So
> 1) Your split-brain patch
> 2) performance.parallel-readdir off
> 3) glusterfs72
> 
> I declare it stable. I can't make it fail: split-brain, hang, nor seg
> fault
> with one leader down.
> 
> I'm working on putting this in to a SW update.
> 
> We are going to test if performance.parallel-readdir off impacts
> booting
> at scale but we don't have a system to try it on at this time.
> 
> THANK YOU!
> 
> I may have access to the 57 node test system if there is something
> you'd
> like me to try with regards to why glusterfs74 is unstable in this
> situation. Just let me know.
> 
> Erik
> 
> On Thu, Apr 16, 2020 at 12:03:33PM -0500, Erik Jacobson wrote:
> > So in my test runs since making that change, we have a different odd
> > behavior now. As you recall, this is with your patch -- still not
> > split-brain -- and now with performance.parallel-readdir off
> >
> > The NFS server grinds to a halt after a few test runs. It does not core
> > dump.
> >
> > All that shows up in the log is:
> >
> > "pending frames:" with nothing after it and no date stamp.
> >
> > I will start looking for interesting break points I guess.
> >
> >
> > The glusterfs for nfs is still alive:
> >
> > root 30541 1 42 09:57 ?00:51:06 /usr/sbin/glusterfs
> -s localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid
> -l /var/log/glusterfs/nfs.log -S /var/run/gluster/
> 9ddb5561058ff543.socket
> >
> >
> >
> > [root@leader3 ~]# strace -f  -p 30541
> > strace: Process 30541 attached with 40 threads
> > [pid 30580] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30579] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30578] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30577] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30576] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30575] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30574] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30573] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30572] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30571] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> 
> > [pid 30570] futex(0x7f8904035f

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-16 Thread Erik Jacobson
I have some news.

After many many many trials, reboots of gluster servers, reboots of nodes...
in what should have reproduced the issue several times. I think we're
stable.

It appears this glusterfs nfs daemon hang only happens in glusterfs74
and not 72.

So
1) Your split-brain patch
2) performance.parallel-readdir off
3) glusterfs72

I declare it stable. I can't make it fail: split-brain, hang, nor seg fault
with one leader down.

I'm working on putting this in to a SW update.

We are going to test if performance.parallel-readdir off impacts booting
at scale but we don't have a system to try it on at this time.

THANK YOU!

I may have access to the 57 node test system if there is something you'd
like me to try with regards to why glusterfs74 is unstable in this
situation. Just let me know.

Erik

On Thu, Apr 16, 2020 at 12:03:33PM -0500, Erik Jacobson wrote:
> So in my test runs since making that change, we have a different odd
> behavior now. As you recall, this is with your patch -- still not
> split-brain -- and now with performance.parallel-readdir off
> 
> The NFS server grinds to a halt after a few test runs. It does not core
> dump.
> 
> All that shows up in the log is:
> 
> "pending frames:" with nothing after it and no date stamp.
> 
> I will start looking for interesting break points I guess.
> 
> 
> The glusterfs for nfs is still alive:
> 
> root 30541 1 42 09:57 ?00:51:06 /usr/sbin/glusterfs -s 
> localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid -l 
> /var/log/glusterfs/nfs.log -S /var/run/gluster/9ddb5561058ff543.socket
> 
> 
> 
> [root@leader3 ~]# strace -f  -p 30541
> strace: Process 30541 attached with 40 threads
> [pid 30580] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30579] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30578] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30577] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30576] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30575] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30574] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30573] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30572] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30571] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30570] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30569] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30568] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30567] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30566] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30565] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30564] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30563] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30562] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30561] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30560] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30559] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30558] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30557] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30556] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30555] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30554] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30553] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30552] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30551] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30550] restart_syscall(<... resuming interrupted restart_syscall ...> 
> 
> [pid 30549] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30548] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=243775} 
> 
> [pid 30546] restart_syscall(<... resuming interrupted restart_syscall ...> 
> 
> [pid 30545] restart_syscall(<... resuming interrupted restart_syscall ...> 
> 
> [pid 30544] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30543] rt_sigtimedwait([HUP INT USR1 USR2 TERM],  
> [pid 30542] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL 
> [pid 30541] futex(0x7f890c3a39d0, FUTEX_WAIT, 30548, NULL 
> [pid 30547] <... select resumed> )  = 0 (Timeout)
> [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
> [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
> [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
> [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
> [pid 30547] select(

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-16 Thread Erik Jacobson
557 detached
strace: Process 30558 detached
strace: Process 30559 detached
strace: Process 30560 detached
strace: Process 30561 detached
strace: Process 30562 detached
strace: Process 30563 detached
strace: Process 30564 detached
strace: Process 30565 detached
strace: Process 30566 detached
strace: Process 30567 detached
strace: Process 30568 detached
strace: Process 30569 detached
strace: Process 30570 detached
strace: Process 30571 detached
strace: Process 30572 detached
strace: Process 30573 detached
strace: Process 30574 detached
strace: Process 30575 detached
strace: Process 30576 detached
strace: Process 30577 detached
strace: Process 30578 detached
strace: Process 30579 detached
strace: Process 30580 detached




> On 16/04/20 8:04 pm, Erik Jacobson wrote:
> > Quick update just on how this got set.
> > 
> > gluster volume set cm_shared performance.parallel-readdir on
> > 
> > Is something we did turn on, thinking it might make our NFS services
> > faster and not knowing about it possibly being negative.
> > 
> > Below is a diff of the nfs volume file ON vs OFF. So I will simply turn
> > this OFF and do a test run.
> Yes,that should do it. I am not sure if performance.parallel-readdir was
> intentionally made to have an effect on gnfs volfiles. Usually, for other
> performance xlators, `gluster volume set` only changes the fuse volfile.






Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-16 Thread Erik Jacobson
Quick update just on how this got set.

gluster volume set cm_shared performance.parallel-readdir on

Is something we did turn on, thinking it might make our NFS services
faster and not knowing about it possibly being negative.

Below is a diff of the nfs volume file ON vs OFF. So I will simply turn
this OFF and do a test run. Does this look correct? I will start testing
with this turned OFF. Thank you!

[root@leader1 nfs]# diff -u /tmp/nfs-server.vol-ORIG nfs-server.vol
--- /tmp/nfs-server.vol-ORIG2020-04-16 09:28:56.855309870 -0500
+++ nfs-server.vol  2020-04-16 09:29:14.267289600 -0500
@@ -60,21 +60,13 @@
 subvolumes cm_shared-client-0 cm_shared-client-1 cm_shared-client-2
 end-volume

-volume cm_shared-readdir-ahead-0
-type performance/readdir-ahead
-option rda-cache-limit 10MB
-option rda-request-size 131072
-option parallel-readdir on
-subvolumes cm_shared-replicate-0
-end-volume
-
 volume cm_shared-dht
 type cluster/distribute
 option force-migration off
 option lock-migration off
 option lookup-optimize on
 option lookup-unhashed auto
-subvolumes cm_shared-readdir-ahead-0
+subvolumes cm_shared-replicate-0
 end-volume

 volume cm_shared-utime
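
For the archives, the toggle itself, plus a quick way to confirm the gnfs
volfile was regenerated without the readdir-ahead xlator:

gluster volume set cm_shared performance.parallel-readdir off
# glusterd rewrites the volfile; this should report 0 matches afterwards
grep -c readdir-ahead /var/lib/glusterd/nfs/nfs-server.vol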


On Thu, Apr 16, 2020 at 06:58:01PM +0530, Ravishankar N wrote:
> 
> On 16/04/20 6:54 pm, Erik Jacobson wrote:
> > > The patch by itself is only making changes specific to AFR, so it should 
> > > not
> > > affect other translators. But I wonder how readdir-ahead is enabled in 
> > > your
> > > gnfs stack. All performance xlators are turned off in gnfs except
> > > write-behind and AFAIK, there is no way to enable them via the CLI. Did 
> > > you
> > > custom edit your gnfs volfile to add readdir-ahead? If yes, does the crash
> > > go-away if you remove the xlator from the nfs volfile?
> > thank you. A quick reply. I will then go research how to do this,
> > I've never hand edited a volume before. I've never even really looked at
> > the gnfs volfile before.
> > 
> > There are no custom code changes or hand edits.
> > 
> > More soon.
> > 
> Okay, /var/lib/glusterd/nfs/nfs-server.vol is the file you want to look at
> if you are using gnfs.
> 
> -Ravi








Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-16 Thread Erik Jacobson
> The patch by itself is only making changes specific to AFR, so it should not
> affect other translators. But I wonder how readdir-ahead is enabled in your
> gnfs stack. All performance xlators are turned off in gnfs except
> write-behind and AFAIK, there is no way to enable them via the CLI. Did you
> custom edit your gnfs volfile to add readdir-ahead? If yes, does the crash
> go-away if you remove the xlator from the nfs volfile?

thank you. A quick reply. I will then go research how to do this,
I've never hand edited a volume before. I've never even really looked at
the gnfs volfile before.

There are no custom code changes or hand edits.

More soon.






Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Erik Jacobson
ock = 0, __count = 0, __owner = 1586972324, __nusers = 0,
     __kind = 210092664, __spins = 0, __elision = 0, __list = 
{__prev = 0x0, __next = 0x0}},
   __size = 
"\000\000\000\000\000\000\000\000\244F\227^\000\000\000\000x\302\205\f", 
'\000' , __align = 0}},
   cookie = 0x0, complete = false, op = GF_FOP_NULL, begin = {tv_sec = 
0, tv_nsec = 0}, end = {tv_sec = 0, tv_nsec = 0}, wind_from = 0x0,
   wind_to = 0x0, unwind_from = 0x0, unwind_to = 0x0}
(gdb) print {call_frame_t}0x7fe5ac096288
$36 = {root = 0x7fe5ac378860, parent = 0x7fe5acf18eb8, frames = {next = 
0x7fe5acf18ec8, prev = 0x7fe5ac6d6cf0}, local = 0x0,
   this = 0x7fe63c014000, ret = 0x7fe63bb5d350 , 
ref_count = 0, lock = {spinlock = 0, mutex = {__data = {__lock = 0,
     __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 
0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
   __size = '\000' , __align = 0}}, cookie = 
0x7fe5ac096288, complete = true, op = GF_FOP_READDIRP, begin = {
     tv_sec = 4234, tv_nsec = 637078816}, end = {tv_sec = 4234, tv_nsec 
= 803882755},
   wind_from = 0x7fe63bb5e8c0 <__FUNCTION__.6> "rda_fill_fd", 
wind_to = 0x7fe63bb5e3f0 "(this->children->xlator)->fops->readdirp",
   unwind_from = 0x7fe63bdd8a80 <__FUNCTION__.20442> "afr_readdir_cbk", 
unwind_to = 0x7fe63bb5dfbb "rda_fill_fd_cbk"}



On 4/15/20 8:14 AM, Erik Jacobson wrote:
> Scott - I was going to start with gluster74 since that is what he
> started at but it applies well to gluster72 so I'll start there.
>
> Getting ready to go. The patch detail is interesting. This is probably
> why it took him a bit longer. It wasn't a trivial patch.



On Wed, Apr 15, 2020 at 12:45:57PM -0500, Erik Jacobson wrote:
> > The new split-brain issue is much harder to reproduce, but after several
> 
> (correcting to say new seg fault issue, the split brain is gone!!)
> 
> > intense runs, it usually hits once.
> > 
> > We switched to pure gluster74 plus your patch so we're apples to apples
> > now.
> > 
> > I'm going to see if Scott can help debug it.
> > 
> > Here is the back trace info from the core dump:
> > 
> > -rw-r-  1 root root 1.9G Apr 15 12:40 
> > core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.158697232400
> > -rw-r-  1 root root 221M Apr 15 12:40 
> > core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.158697232400.lz4
> > drwxrwxrwt  9 root root  20K Apr 15 12:40 .
> > [root@leader3 tmp]#
> > [root@leader3 tmp]#
> > [root@leader3 tmp]# gdb 
> > core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.158697232400
> > GNU gdb (GDB) Red Hat Enterprise Linux 8.2-5.el8
> > Copyright (C) 2018 Free Software Foundation, Inc.
> > License GPLv3+: GNU GPL version 3 or later 
> > <http://gnu.org/licenses/gpl.html>
> > This is free software: you are free to change and redistribute it.
> > There is NO WARRANTY, to the extent permitted by law.
> > Type "show copying" and "show warranty" for details.
> > This GDB was configured as "x86_64-redhat-linux-gnu".
> > Type "show configuration" for configuration details.
> > For bug reporting instructions, please see:
> > <http://www.gnu.org/software/gdb/bugs/>.
> > Find the GDB manual and other documentation resources online at:
> > <http://www.gnu.org/software/gdb/documentation/>.
> > 
> > For help, type "help".
> > Type "apropos word" to search for commands related to "word"...
> > [New LWP 61102]
> > [New LWP 61085]
> > [New LWP 61087]
> > [New LWP 61117]
> > [New LWP 61086]
> > [New LWP 61108]
> > [New LWP 61089]
> > [New LWP 61090]
> > [New LWP 61121]
> > [New LWP 61088]
> > [New LWP 61091]
> > [New LWP 61093]
> > [New LWP 61095]
> > [New LWP 61092]
> > [New LWP 61094]
> > [New LWP 61098]
> > [New LWP 61096]
> > [New LWP 61097]
> > [New LWP 61084]
> > [New LWP 61100]
> > [New LWP 61103]
> > [New LWP 61104]
> > [New LWP 61099]
> > [New LWP 61105]
> > [New LWP 61101]
> > [New LWP 61106]
> > [New LWP 61109]
> > [New LWP 61107]
> > [New LWP 61112]
> > [New LWP 61119]
> > [New LWP 61110]
> > [New LWP 6]
> > [New LWP 61118]
> > [New LWP 61123]
> > [New LWP 61122]
> > [New LWP 61113]
> > [New LWP 61114]
> > [New LWP 61120]
> > [New LWP 61116]
> > [New LWP 61115]
> > 
> > warning: core file may not match specified executable file.
> > Reading symbols from /usr/sbin/glusterfsd...Reading symbols from 
> > /

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Erik Jacobson
7fe617fff700 (LWP 61102))]
> Missing separate debuginfos, use: dnf debuginfo-install 
> glibc-2.28-42.el8.x86_64 keyutils-libs-1.5.10-6.el8.x86_64 
> krb5-libs-1.16.1-22.el8.x86_64 libacl-2.2.53-1.el8.x86_64 
> libattr-2.4.48-3.el8.x86_64 libcom_err-1.44.3-2.el8.x86_64 
> libgcc-8.2.1-3.5.el8.x86_64 libselinux-2.8-6.el8.x86_64 
> libtirpc-1.1.4-3.el8.x86_64 libuuid-2.32.1-8.el8.x86_64 
> openssl-libs-1.1.1-8.el8.x86_64 pcre2-10.32-1.el8.x86_64 
> zlib-1.2.11-10.el8.x86_64
> (gdb) bt
> #0  0x7fe63bb5d7bb in FRAME_DESTROY (frame=0x7fe5ac096288)
> at ../../../../libglusterfs/src/glusterfs/stack.h:193
> #1  STACK_DESTROY (stack=0x7fe5ac6d65f8)
> at ../../../../libglusterfs/src/glusterfs/stack.h:193
> #2  rda_fill_fd_cbk (frame=0x7fe5acf18eb8, cookie=,
> this=0x7fe63c0162b0, op_ret=3, op_errno=0, entries=,
> xdata=0x0) at readdir-ahead.c:623
> #3  0x7fe63bd6c3aa in afr_readdir_cbk (frame=,
> cookie=, this=, op_ret=,
> op_errno=, subvol_entries=, xdata=0x0)
> at afr-dir-read.c:234
> #4  0x7fe6400a1e07 in client4_0_readdirp_cbk (req=,
> iov=, count=, myframe=0x7fe5ace0eda8)
> at client-rpc-fops_v2.c:2338
> #5  0x7fe6479ca115 in rpc_clnt_handle_reply (
> clnt=clnt@entry=0x7fe63c0663f0, pollin=pollin@entry=0x7fe60c1737a0)
> at rpc-clnt.c:764
> #6  0x7fe6479ca4b3 in rpc_clnt_notify (trans=0x7fe63c066780,
> mydata=0x7fe63c066420, event=, data=0x7fe60c1737a0)
> at rpc-clnt.c:931
> #7  0x7fe6479c707b in rpc_transport_notify (
> this=this@entry=0x7fe63c066780,
> event=event@entry=RPC_TRANSPORT_MSG_RECEIVED,
> data=data@entry=0x7fe60c1737a0) at rpc-transport.c:545
> #8  0x7fe640da893c in socket_event_poll_in_async (xl=,
> async=0x7fe60c1738c8) at socket.c:2601
> #9  0x7fe640db03dc in gf_async (
> cbk=0x7fe640da8910 , xl=,
> async=0x7fe60c1738c8) at 
> ../../../../libglusterfs/src/glusterfs/async.h:189
> #10 socket_event_poll_in (notify_handled=true, this=0x7fe63c066780)
> at socket.c:2642
> #11 socket_event_handler (fd=fd@entry=19, idx=idx@entry=10, gen=gen@entry=1,
> data=data@entry=0x7fe63c066780, poll_in=,
> poll_out=, poll_err=0, event_thread_died=0 '\000')
> at socket.c:3040
> #12 0x7fe647c84a5b in event_dispatch_epoll_handler (event=0x7fe617ffe014,
> event_pool=0x563f5a98c750) at event-epoll.c:650
> #13 event_dispatch_epoll_worker (data=0x7fe63c063b60) at event-epoll.c:763
> #14 0x7fe6467a72de in start_thread () from /lib64/libpthread.so.0
> #15 0x7fe645fffa63 in clone () from /lib64/libc.so.6
> 
> 
> 
> On Wed, Apr 15, 2020 at 10:39:34AM -0500, Erik Jacobson wrote:
> > After several successful runs of the test case, we thought we were
> > solved. Indeed, split-brain is gone.
> > 
> > But we're triggering a seg fault now, even in a less loaded case.
> > 
> > We're going to switch to gluster74, which was your intention, and report
> > back.
> > 
> > On Wed, Apr 15, 2020 at 10:33:01AM -0500, Erik Jacobson wrote:
> > > > Attached the wrong patch by mistake in my previous mail. Sending the 
> > > > correct
> > > > one now.
> > > 
> > > Early results look GREAT !!
> > >
> > > We'll keep beating on it. We applied it to gluster72 as that is what we
> > > have to ship with. It applied fine with some line moves.
> > > 
> > > If you would like us to also run a test with gluster74 so that you can
> > > say that's tested, we can run that test. I can do a special build.
> > > 
> > > THANK YOU!!
> > > 
> > > > 
> > > > 
> > > > -Ravi
> > > > 
> > > > 
> > > > On 15/04/20 2:05 pm, Ravishankar N wrote:
> > > > 
> > > > 
> > > > On 10/04/20 2:06 am, Erik Jacobson wrote:
> > > > 
> > > > Once again thanks for sticking with us. Here is a reply from 
> > > > Scott
> > > > Titus. If you have something for us to try, we'd love it. The 
> > > > code had
> > > > your patch applied when gdb was run:
> > > > 
> > > > 
> > > > Here is the addr2line output for those addresses.  Very 
> > > > interesting
> > > > command, of
> > > > which I was not aware.
> > > > 
> > > > [root@leader3 ~]# addr2line -f 
> > > > -e/usr/lib64/glusterfs/7.2/xlator/
> > > > cluster/
> > > > afr.so 0x6f735
> > > > afr_lookup_metadata_heal_check
> > > > afr-c

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Erik Jacobson
ck.h:193
#2  rda_fill_fd_cbk (frame=0x7fe5acf18eb8, cookie=,
this=0x7fe63c0162b0, op_ret=3, op_errno=0, entries=,
xdata=0x0) at readdir-ahead.c:623
#3  0x7fe63bd6c3aa in afr_readdir_cbk (frame=,
cookie=, this=, op_ret=,
op_errno=, subvol_entries=, xdata=0x0)
at afr-dir-read.c:234
#4  0x7fe6400a1e07 in client4_0_readdirp_cbk (req=,
iov=, count=, myframe=0x7fe5ace0eda8)
at client-rpc-fops_v2.c:2338
#5  0x7fe6479ca115 in rpc_clnt_handle_reply (
clnt=clnt@entry=0x7fe63c0663f0, pollin=pollin@entry=0x7fe60c1737a0)
at rpc-clnt.c:764
#6  0x7fe6479ca4b3 in rpc_clnt_notify (trans=0x7fe63c066780,
mydata=0x7fe63c066420, event=, data=0x7fe60c1737a0)
at rpc-clnt.c:931
#7  0x7fe6479c707b in rpc_transport_notify (
this=this@entry=0x7fe63c066780,
event=event@entry=RPC_TRANSPORT_MSG_RECEIVED,
data=data@entry=0x7fe60c1737a0) at rpc-transport.c:545
#8  0x7fe640da893c in socket_event_poll_in_async (xl=,
async=0x7fe60c1738c8) at socket.c:2601
#9  0x7fe640db03dc in gf_async (
cbk=0x7fe640da8910 , xl=,
async=0x7fe60c1738c8) at ../../../../libglusterfs/src/glusterfs/async.h:189
#10 socket_event_poll_in (notify_handled=true, this=0x7fe63c066780)
at socket.c:2642
#11 socket_event_handler (fd=fd@entry=19, idx=idx@entry=10, gen=gen@entry=1,
data=data@entry=0x7fe63c066780, poll_in=,
poll_out=, poll_err=0, event_thread_died=0 '\000')
at socket.c:3040
#12 0x7fe647c84a5b in event_dispatch_epoll_handler (event=0x7fe617ffe014,
event_pool=0x563f5a98c750) at event-epoll.c:650
#13 event_dispatch_epoll_worker (data=0x7fe63c063b60) at event-epoll.c:763
#14 0x7fe6467a72de in start_thread () from /lib64/libpthread.so.0
#15 0x7fe645fffa63 in clone () from /lib64/libc.so.6



On Wed, Apr 15, 2020 at 10:39:34AM -0500, Erik Jacobson wrote:
> After several successful runs of the test case, we thought we were
> solved. Indeed, split-brain is gone.
> 
> But we're triggering a seg fault now, even in a less loaded case.
> 
> We're going to switch to gluster74, which was your intention, and report
> back.
> 
> On Wed, Apr 15, 2020 at 10:33:01AM -0500, Erik Jacobson wrote:
> > > Attached the wrong patch by mistake in my previous mail. Sending the 
> > > correct
> > > one now.
> > 
> > Early results look GREAT !!
> >
> > We'll keep beating on it. We applied it to gluster72 as that is what we
> > have to ship with. It applied fine with some line moves.
> > 
> > If you would like us to also run a test with gluster74 so that you can
> > say that's tested, we can run that test. I can do a special build.
> > 
> > THANK YOU!!
> > 
> > > 
> > > 
> > > -Ravi
> > > 
> > > 
> > > On 15/04/20 2:05 pm, Ravishankar N wrote:
> > > 
> > > 
> > > On 10/04/20 2:06 am, Erik Jacobson wrote:
> > > 
> > > Once again thanks for sticking with us. Here is a reply from Scott
> > > Titus. If you have something for us to try, we'd love it. The 
> > > code had
> > > your patch applied when gdb was run:
> > > 
> > > 
> > > Here is the addr2line output for those addresses.  Very 
> > > interesting
> > > command, of
> > > which I was not aware.
> > > 
> > > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> > > cluster/
> > > afr.so 0x6f735
> > > afr_lookup_metadata_heal_check
> > > afr-common.c:2803
> > > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> > > cluster/
> > > afr.so 0x6f0b9
> > > afr_lookup_done
> > > afr-common.c:2455
> > > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> > > cluster/
> > > afr.so 0x5c701
> > > afr_inode_event_gen_reset
> > > afr-common.c:755
> > > 
> > > 
> > > Right, so afr_lookup_done() is resetting the event gen to zero. This 
> > > looks
> > > like a race between lookup and inode refresh code paths. We made some
> > > changes to the event generation logic in AFR. Can you apply the 
> > > attached
> > > patch and see if it fixes the split-brain issue? It should apply 
> > > cleanly on
> > > glusterfs-7.4.
> > > 
> > > Thanks,
> > > Ravi
> > > 
> > >
> > > 
> > > 
> > > 
> > > 
> > > Community Meeting Calendar:
> > > 
> > > 

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Erik Jacobson
After several successful runs of the test case, we thought we were
solved. Indeed, split-brain is gone.

But we're triggering a seg fault now, even in a less loaded case.

We're going to switch to gluster74, which was your intention, and report
back.

On Wed, Apr 15, 2020 at 10:33:01AM -0500, Erik Jacobson wrote:
> > Attached the wrong patch by mistake in my previous mail. Sending the correct
> > one now.
> 
> Early results look GREAT !!
> 
> We'll keep beating on it. We applied it to gluster72 as that is what we
> have to ship with. It applied fine with some line moves.
> 
> If you would like us to also run a test with gluster74 so that you can
> say that's tested, we can run that test. I can do a special build.
> 
> THANK YOU!!
> 
> > 
> > 
> > -Ravi
> > 
> > 
> > On 15/04/20 2:05 pm, Ravishankar N wrote:
> > 
> > 
> > On 10/04/20 2:06 am, Erik Jacobson wrote:
> > 
> > Once again thanks for sticking with us. Here is a reply from Scott
> > Titus. If you have something for us to try, we'd love it. The code 
> > had
> > your patch applied when gdb was run:
> > 
> > 
> > Here is the addr2line output for those addresses.  Very interesting
> > command, of
> > which I was not aware.
> > 
> > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> > cluster/
> > afr.so 0x6f735
> > afr_lookup_metadata_heal_check
> > afr-common.c:2803
> > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> > cluster/
> > afr.so 0x6f0b9
> > afr_lookup_done
> > afr-common.c:2455
> > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> > cluster/
> > afr.so 0x5c701
> > afr_inode_event_gen_reset
> > afr-common.c:755
> > 
> > 
> > Right, so afr_lookup_done() is resetting the event gen to zero. This 
> > looks
> > like a race between lookup and inode refresh code paths. We made some
> > changes to the event generation logic in AFR. Can you apply the attached
> > patch and see if it fixes the split-brain issue? It should apply 
> > cleanly on
> > glusterfs-7.4.
> > 
> > Thanks,
> > Ravi
> > 
> >
> > 
> > 
> > 
> > 
> > Community Meeting Calendar:
> > 
> > Schedule -
> > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> > Bridge: https://bluejeans.com/441850968
> > 
> > Gluster-users mailing list
> > Gluster-users@gluster.org
> > https://lists.gluster.org/mailman/listinfo/gluster-users
> > 
> 
> > >From 11601e709a97ce7c40078866bf5d24b486f39454 Mon Sep 17 00:00:00 2001
> > From: Ravishankar N 
> > Date: Wed, 15 Apr 2020 13:53:26 +0530
> > Subject: [PATCH] afr: event gen changes
> > 
> > The general idea of the changes is to prevent resetting event generation
> > to zero in the inode ctx, since event gen is something that should
> > follow 'causal order'.
> > 
> > Change #1:
> > For a read txn, in inode refresh cbk, if event_generation is
> > found zero, we are failing the read fop. This is not needed
> > because change in event gen is only a marker for the next inode refresh to
> > happen and should not be taken into account by the current read txn.
> > 
> > Change #2:
> > The event gen being zero above can happen if there is a racing lookup,
> > which resets even get (in afr_lookup_done) if there are non zero afr
> > xattrs. The resetting is done only to trigger an inode refresh and a
> > possible client side heal on the next lookup. That can be acheived by
> > setting the need_refresh flag in the inode ctx. So replaced all
> > occurences of resetting even gen to zero with a call to
> > afr_inode_need_refresh_set().
> > 
> > Change #3:
> > In both lookup and discover path, we are doing an inode refresh which is
> > not required since all 3 essentially do the same thing- update the inode
> > ctx with the good/bad copies from the brick replies. Inode refresh also
> > triggers background heals, but I think it is okay to do it when we call
> > refresh during the read and write txns and not in the lookup path.
> > 
> > Change-Id: Id0600dd34b144b4ae7a3bf3c397551adf7e402f1
> > Signed-off-by: Ravishankar N 
> > ---
> >  ...ismatch-resolution-with-fav-child-policy.t |  8 +-
> >  xlators/cluster/afr/src/afr-common.c   

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-15 Thread Erik Jacobson
> Attached the wrong patch by mistake in my previous mail. Sending the correct
> one now.

Early results look GREAT !!

We'll keep beating on it. We applied it to gluster72 as that is what we
have to ship with. It applied fine with some line moves.

If you would like us to also run a test with gluster74 so that you can
say that's tested, we can run that test. I can do a special build.

THANK YOU!!

> 
> 
> -Ravi
> 
> 
> On 15/04/20 2:05 pm, Ravishankar N wrote:
> 
> 
> On 10/04/20 2:06 am, Erik Jacobson wrote:
> 
> Once again thanks for sticking with us. Here is a reply from Scott
> Titus. If you have something for us to try, we'd love it. The code had
> your patch applied when gdb was run:
> 
> 
> Here is the addr2line output for those addresses.  Very interesting
> command, of
> which I was not aware.
> 
> [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> cluster/
> afr.so 0x6f735
> afr_lookup_metadata_heal_check
> afr-common.c:2803
> [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> cluster/
> afr.so 0x6f0b9
> afr_lookup_done
> afr-common.c:2455
> [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> cluster/
> afr.so 0x5c701
> afr_inode_event_gen_reset
> afr-common.c:755
> 
> 
> Right, so afr_lookup_done() is resetting the event gen to zero. This looks
> like a race between lookup and inode refresh code paths. We made some
> changes to the event generation logic in AFR. Can you apply the attached
> patch and see if it fixes the split-brain issue? It should apply cleanly 
> on
> glusterfs-7.4.
> 
> Thanks,
> Ravi
> 
>
> 
> 
> 
> 
> Community Meeting Calendar:
> 
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://bluejeans.com/441850968
> 
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
> 

> >From 11601e709a97ce7c40078866bf5d24b486f39454 Mon Sep 17 00:00:00 2001
> From: Ravishankar N 
> Date: Wed, 15 Apr 2020 13:53:26 +0530
> Subject: [PATCH] afr: event gen changes
> 
> The general idea of the changes is to prevent resetting event generation
> to zero in the inode ctx, since event gen is something that should
> follow 'causal order'.
> 
> Change #1:
> For a read txn, in inode refresh cbk, if event_generation is
> found zero, we are failing the read fop. This is not needed
> because change in event gen is only a marker for the next inode refresh to
> happen and should not be taken into account by the current read txn.
> 
> Change #2:
> The event gen being zero above can happen if there is a racing lookup,
> which resets even get (in afr_lookup_done) if there are non zero afr
> xattrs. The resetting is done only to trigger an inode refresh and a
> possible client side heal on the next lookup. That can be acheived by
> setting the need_refresh flag in the inode ctx. So replaced all
> occurences of resetting even gen to zero with a call to
> afr_inode_need_refresh_set().
> 
> Change #3:
> In both lookup and discover path, we are doing an inode refresh which is
> not required since all 3 essentially do the same thing- update the inode
> ctx with the good/bad copies from the brick replies. Inode refresh also
> triggers background heals, but I think it is okay to do it when we call
> refresh during the read and write txns and not in the lookup path.
> 
> Change-Id: Id0600dd34b144b4ae7a3bf3c397551adf7e402f1
> Signed-off-by: Ravishankar N 
> ---
>  ...ismatch-resolution-with-fav-child-policy.t |  8 +-
>  xlators/cluster/afr/src/afr-common.c  | 92 ---
>  xlators/cluster/afr/src/afr-dir-write.c   |  6 +-
>  xlators/cluster/afr/src/afr.h |  5 +-
>  4 files changed, 29 insertions(+), 82 deletions(-)
> 
> diff --git a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t 
> b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> index f4aa351e4..12af0c854 100644
> --- a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> +++ b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
> @@ -168,8 +168,8 @@ TEST [ "$gfid_1" != "$gfid_2" ]
>  #We know that second brick has the bigger size file
>  BIGGER_FILE_MD5=$(md5sum $B0/${V0}1/f3 | cut -d\  -f1)
>  
> -TEST ls $M0/f3
> -TEST cat $M0/f3
> +TEST ls $M0 #Trigger entry heal via readdir inode refresh
> +TEST cat $M0/f3 #Trigger 

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-09 Thread Erik Jacobson
Once again thanks for sticking with us. Here is a reply from Scott
Titus. If you have something for us to try, we'd love it. The code had
your patch applied when gdb was run:


Here is the addr2line output for those addresses.  Very interesting command, of
which I was not aware.

[root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/
afr.so 0x6f735
afr_lookup_metadata_heal_check
afr-common.c:2803
[root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/
afr.so 0x6f0b9
afr_lookup_done
afr-common.c:2455
[root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/
afr.so 0x5c701
afr_inode_event_gen_reset
afr-common.c:755

Thanks
-Scott


On Thu, Apr 09, 2020 at 11:38:04AM +0530, Ravishankar N wrote:
> 
> On 08/04/20 9:55 pm, Erik Jacobson wrote:
> > 9439138:[2020-04-08 15:48:44.737590] E 
> > [afr-common.c:754:afr_inode_event_gen_reset]
> > (-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) 
> > [0x7fa4fb1cb735]
> > -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) 
> > [0x7fa4fb1cb0b9]
> > -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) 
> > [0x7fa4fb1b8701] )
> > 0-cm_shared-replicate-0: Resetting event gen for 
> > f2d7abf0-5444-48d6-863d-4b128502daf9
> > 
> Could you print the function/line no. of each of these 3 functions in the
> backtrace and see who calls afr_inode_event_gen_reset? `addr2line` should
> give you that info:
>  addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so
> 0x6f735
>  addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so
> 0x6f0b9
>  addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so
> 0x5c701
> 
> 
> I think it is likely called from afr_lookup_done, which I don't think is
> necessary. I will send a patch for review. Once reviews are over, I will
> share it with you and if it fixes the issue in your testing, we can merge it
> with confidence.
> 
> Thanks,
> Ravi







[Gluster-users] Impressive boot times for big clusters: NFS, Image Objects, and Sharding

2020-04-08 Thread Erik Jacobson
I wanted to share some positive news with the group here.

Summary: Using sharding and squashfs image files instead of expanded
directory trees for RO NFS OS images has led to impressive boot times for
2k diskless node clusters using 12 servers for gluster+tftp+etc+etc.

Details:

As you may have seen in some of my other posts, we have been using
gluster to boot giant clusters, some of which are in the top500 list of
HPC resources. The compute nodes are diskless.

Up until now, we have done this by pushing an operating system from our
head node to the storage cluster, which is made up of one or more
3-server/(3-brick) subvolumes in a distributed/replicate configuration.
The servers are also PXE-boot and tftpboot servers and also serve the
"miniroot" (basically a fat initrd with a cluster manager toolchain).
We also locate other management functions there unrelated to boot and
root.

This copy of the operating system is simply a directory tree
representing the whole operating system image. You could 'chroot' into
it, for example.

So this operating system is a read-only NFS mount point used as a base
by all compute nodes to use as their root filesystem.

This has been working well, getting us boot times (not including BIOS
startup) of between 10 and 15 minutes for a 2,000 node cluster. Typically a
cluster like this would have 12 gluster/nfs servers in 3 subvolumes. On simple
RHEL8 images without much customization, I tend to get 10 minutes.

We have observed some slow-downs with certain job-launch workloads for
customers whose job launch is very metadata intensive. The metadata load
of such an operation is heavy, with giant loads being observed
on the gluster servers.

We recently started supporting RW NFS as opposed to TMPFS for this
solution for the writable components of root. Our customers tend to prefer
to keep every byte of memory for jobs. We came up with a solution of hosting
the RW NFS sparse files with XFS filesystems on top from a writable area in
gluster for NFS. This makes the RW NFS solution very fast because it reduces
RW NFS metadata per-node. Boot times didn't go up significantly (but our
first attempt with just using a directory tree was a slow disaster, hitting
the worst case of lots of small file writes + lots of metadata work load). So we
solved that problem with XFS FS images on RW NFS.
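To make that concrete, here is a rough sketch of the per-node writable
setup. The server name, export, image size, and paths are invented for
illustration and are not our actual tooling:

# hypothetical sketch -- names, paths, and sizes are placeholders
mount -t nfs leader1:/cm_rw /mnt/rwnfs
truncate -s 10G /mnt/rwnfs/$(hostname).img    # sparse backing file, one per node
mkfs.xfs /mnt/rwnfs/$(hostname).img           # put an XFS filesystem inside it
mount -o loop /mnt/rwnfs/$(hostname).img /writable

The win is that the NFS server then sees one big file per node rather
than a storm of small writes and metadata operations.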

Building on that idea, we have in our development branch, a version of the
solution that changes the RO NFS image to a squashfs file on a sharding
volume. That is, instead of each operating system being many thousands
of files and being (slowly) synced to the gluster servers, the head node
makes a squashfs file out of the image and pushes that. Then all the
compute nodes mount the squashfs image from the NFS mount.
  (mount RO NFS mount, loop-mount squashfs image).
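From the compute node's point of view that is roughly the following
(server, export, and image names are only illustrative):

# illustrative sketch -- names and paths are placeholders
mount -t nfs -o ro leader1:/cm_images /mnt/images
mount -o loop,ro /mnt/images/rhel8.0.squashfs /rootfs.ro
# writable pieces (/etc, /var, ...) then come from the RW NFS + XFS
# image approach above, or from an overlay on top of /rootfs.ro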

On a 2,000 node cluster I had access to for a time, our prototype got us
boot times of 5 minutes -- including RO NFS with squashfs and the RW NFS
for writable areas like /etc, /var, etc (on an XFS image file).
  * We also tried RW NFS with OVERLAY and no problem there

I expect that, for people who prefer the non-expanded squashfs format, we can
reduce the leader-per-compute density.

Now, not all customers will want squashfs. Some want to be able to edit
a file and see it instantly on all nodes. However, customers looking for
fast boot times or who are suffering slowness on metadata intensive
job launch work loads, will have a new fast option.

Therefore, it's very important we still solve the bug we're working on
in another thread. But I wanted to share something positive.

So now I've said something positive instead of only asking for help :)
:)

Erik






Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-08 Thread Erik Jacobson
Thank you again for the help so far. Here is what Scott Titus came up
with. Let us know if you have suggestions for next steps.



We never hit the "Event gen is zero" message, so it appears that 
afr_access() never has a zero event_gen to begin with.

However, the "Resetting event gen" message was just a bit chatty, 
growing our nfs.log to >2.4GB.  Many were against a gfid populated
with zeros.
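(Something like the following pulls the matching lines out; the log
path is our gnfs log and the gfid is the one in the messages below, but
the exact command is only an illustration.)

grep -n 'Resetting event gen' /var/log/glusterfs/nfs.log | grep f2d7abf0-5444-48d6-863d-4b128502daf9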

Around each split brain log, we did find the "Resetting event gen" 
messages containing a matching gfid:

9439138:[2020-04-08 15:48:44.737590] E 
[afr-common.c:754:afr_inode_event_gen_reset] 
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) 
[0x7fa4fb1cb735] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) 
[0x7fa4fb1cb0b9] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) 
[0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for 
f2d7abf0-5444-48d6-863d-4b128502daf9
9439139:[2020-04-08 15:48:44.737636] E 
[afr-common.c:754:afr_inode_event_gen_reset] 
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) 
[0x7fa4fb1cb735] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) 
[0x7fa4fb1cb0b9] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) 
[0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for 
f2d7abf0-5444-48d6-863d-4b128502daf9
9439140:[2020-04-08 15:48:44.737663] E [MSGID: 108008] 
[afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: 
Failing ACCESS on gfid f2d7abf0-5444-48d6-863d-4b128502daf9: split-brain 
observed. [Input/output error]
9439143:[2020-04-08 15:48:44.737801] E 
[afr-common.c:754:afr_inode_event_gen_reset] 
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) 
[0x7fa4fb1cb735] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) 
[0x7fa4fb1cb0b9] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) 
[0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for 
f2d7abf0-5444-48d6-863d-4b128502daf9
9439145:[2020-04-08 15:48:44.737861] E 
[afr-common.c:754:afr_inode_event_gen_reset] 
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) 
[0x7fa4fb1cb735] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) 
[0x7fa4fb1cb0b9] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) 
[0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for 
f2d7abf0-5444-48d6-863d-4b128502daf9
9439148:[2020-04-08 15:48:44.738125] E 
[afr-common.c:754:afr_inode_event_gen_reset] 
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) 
[0x7fa4fb1cb735] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) 
[0x7fa4fb1cb0b9] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) 
[0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for 
f2d7abf0-5444-48d6-863d-4b128502daf9
9439225:[2020-04-08 15:48:44.749920] E 
[afr-common.c:754:afr_inode_event_gen_reset] 
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) 
[0x7fa4fb1cb735] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) 
[0x7fa4fb1cb0b9] 
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) 
[0x7fa4fb1b8701] ) 0-cm_shared-replicate-0: Resetting event gen for 
f2d7abf0-5444-48d6-863d-4b128502daf9

Thanks,
-Scott

On 4/8/20 8:31 AM, Erik Jacobson wrote:
> Hi team -
>
> We got an update to try more stuff from the community.
>
> I feel like I've been "given an inch but am taking a mile" but if we
> do happen to have time on orbit41 again, we'll do the next round of
> debugging.
>
> Erik


On Wed, Apr 08, 2020 at 01:53:00PM +0530, Ravishankar N wrote:
> On 08/04/20 4:59 am, Erik Jacobson wrote:
> > Apologies for misinterpreting the backtrace.
> > 
> > #0  afr_read_txn_refresh_done (frame=0x7ffcf4146478,
> > this=0x7fff64013720, err=5) at afr-read-txn.c:312
> > #1  0x7fff68938d2b in afr_txn_refresh_done
> > (frame=frame@entry=0x7ffcf4146478, this=this@entry=0x7fff64013720,
> > err=5, err@entry=0)
> >       at afr-common.c:1222
> Sorry, I missed this too.
> > (gdb) print event_generation
> > $3 = 0
> > 
> > (gdb) print priv->fav_child_policy
> > $4 = AFR_FAV_CHILD_NONE
> > 
> > I am not sure what this signifies though.  It appears to be a read
> > transaction with no event generation and no favorite child policy.
> > 
> > Feel free to ask for clarification in case my thought process went awry
> > somewhere.
> 
> Favorite child policy is only for automatically resolving split-brains and
> is 0 unless that volume option is set. The problem is indeed that
> event_generation is zero. Could you try to apply this logging patch and see
> if afr_inode_event_gen_reset() for that gfid is hit

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-07 Thread Erik Jacobson
My co-worker prefers I keep driving the discussion since this isn't his
normal area. But he's far better at digging in to these low level calls
than I am. So I am passing along his analysis so far. We are wondering
if we have enough information yet to turn on any light bulbs in terms of
volume settings, system settings, or a code change... Or a suggested
path for further debug.

Recall from earlier in the thread, this a 3-way replicate single
subvolume gluster fileystem that gets split-brain errors under heavy
gnfs load when one of the three servers has gone down, representing a
customer-reported problem. 

Scott's analysis is below. Next steps truly appreciated !!



Apologies for misinterpreting the backtrace.

#0  afr_read_txn_refresh_done (frame=0x7ffcf4146478, 
this=0x7fff64013720, err=5) at afr-read-txn.c:312
#1  0x7fff68938d2b in afr_txn_refresh_done 
(frame=frame@entry=0x7ffcf4146478, this=this@entry=0x7fff64013720, 
err=5, err@entry=0)
     at afr-common.c:1222
#2  0x7fff68939003 in afr_inode_refresh_done 
(frame=frame@entry=0x7ffcf4146478, this=this@entry=0x7fff64013720, error=0)
     at afr-common.c:1294

instead of the #1/#2 above calling the functions afr_txn_refresh_done 
and afr_inode_refresh_done respectively, they are calling a function 
within afr_txn_refresh_done and afr_inode_refresh_done respectively.

So, afr_txn_refresh_done (frame=frame@entry=0x7ffcf4146478,
this=this@entry=0x7fff64013720, err=5, err@entry=0) at
afr-common.c:1222 calls a function at line number 1222 in afr-common.c
within the function afr_txn_refresh_done:

1163: int
1164: afr_txn_refresh_done(call_frame_t *frame, xlator_t *this, int err)
1165: {
1166:     call_frame_t *heal_frame = NULL;
1167:     afr_local_t *heal_local = NULL;
1168:     afr_local_t *local = NULL;
1169:     afr_private_t *priv = NULL;
1170:     inode_t *inode = NULL;
1171:     int event_generation = 0;
1172:     int read_subvol = -1;
1173:     int ret = 0;
1174:
1175:     local = frame->local;
1176:     inode = local->inode;
1177:     priv = this->private;
1178:
1179:     if (err)
1180:     goto refresh_done;
1181:
1182:     if (local->op == GF_FOP_LOOKUP)
1183:     goto refresh_done;
1184:
1185:     ret = afr_inode_get_readable(frame, inode, this, local->readable,
1186:         &event_generation, local->transaction.type);
1187:
1188:     if (ret == -EIO || (local->is_read_txn && !event_generation)) {
1189:     /* No readable subvolume even after refresh ==> splitbrain.*/
1190:     if (!priv->fav_child_policy) {
1191:         err = EIO;
1192:         goto refresh_done;
1193:     }
1194:     read_subvol = afr_sh_get_fav_by_policy(this, 
local->replies, inode,
1195: NULL);
1196:     if (read_subvol == -1) {
1197:     err = EIO;
1198:     goto refresh_done;
1199:     }
1200:
1201:     heal_frame = afr_frame_create(this, NULL);
1202:     if (!heal_frame) {
1203:     err = EIO;
1204:     goto refresh_done;
1205:     }
1206:     heal_local = heal_frame->local;
1207:     heal_local->xdata_req = dict_new();
1208:     if (!heal_local->xdata_req) {
1209:     err = EIO;
1210: AFR_STACK_DESTROY(heal_frame);
1211:     goto refresh_done;
1212:     }
1213:     heal_local->heal_frame = frame;
1214:     ret = synctask_new(this->ctx->env, 
afr_fav_child_reset_sink_xattrs,
1215: afr_fav_child_reset_sink_xattrs_cbk, heal_frame,
1216: heal_frame);
1217:     return 0;
1218:    }
1219:
1220: refresh_done:
1221:     afr_local_replies_wipe(local, this->private);
1222:     local->refreshfn(frame, this, err);
1223:
1224:     return 0;
1225: }

So, backtrace #1 represents the following function call 
local->refreshfn(frame=frame@entry=0x7ffcf4146478, 
this=this@entry=0x7fff64013720, err=5, err@entry=0)
This is the 1st example of EIO being set.

Setting a breakpoint at 1190 ("if (!priv->fav_child_policy) {")
reveals that ret is not set, but local->is_read_txn is set and 
event_generation is zero (xlators/cluster/afr/src/afr.h:108), so the 
conditional at 1188 is true.  Furthermore, priv->fav_child_policy is set 
to AFR_FAV_CHILD_NONE which is zero, so we found where the error is set 
to EIO, line 1191.

The following is the gdb output:

(gdb) print ret
$1 = 0
(gdb) print local->is_read_txn
$2 = true
(gdb) print event_generation
$3 = 0

(gdb) print priv->fav_child_policy
$4 = AFR_FAV_CHILD_NONE

I am not sure what this signifies though.  It appears to be a read 
transaction with no event generation and no favorite child policy.

Feel free to ask for clarification in case my thought process went awry 
somewhere.

Thanks,
-Scott



On Thu, Apr 02, 2020 at 02:02:46AM -0500, Erik Jacobson wrote:
> > Hmm, afr_inode_refresh_done() is called with error=0 and by the time we
> > reach afr_txn_refresh_done(), it becomes 5(i.e. EIO).
> >

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-05 Thread Erik Jacobson
During the problem case, near as I can tell, afr_final_errno(),
in the loop where tmp_errno = local->replies[i].op_errno is set,
the errno is always "2" when it gets to that point on server 3 (where
the NFS load is).

I never see a value other than 2.

I later simply put the print at the end of the function too, to double
verify non-zero exit codes. There are thousands of non-zero return
codes, all 2 when not zero. Here is an example flow right before a
split-brain. I do not wish to imply the split-brain is related, it's
just an example log snip:


[2020-04-06 00:54:21.125373] E [MSGID: 0] [afr-common.c:2546:afr_final_errno] 
0-erikj-afr_final_errno: erikj dbg afr_final_errno() errno from loop before 
afr_higher_errno was: 2
[2020-04-06 00:54:21.125374] E [MSGID: 0] [afr-common.c:2551:afr_final_errno] 
0-erikj-afr_final_errno: erikj dbg returning non-zero: 2
[2020-04-06 00:54:23.315397] E [MSGID: 0] 
[afr-read-txn.c:283:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: erikj 
dbg crapola 1st if in afr_read_txn_refresh_done() !priv->thin_arbiter_count -- 
goto to readfn
[2020-04-06 00:54:23.315432] E [MSGID: 108008] 
[afr-read-txn.c:314:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing 
READLINK on gfid 57f269ef-919d-40ec-b7fc-a7906fee648b: split-brain observed. 
[Input/output error]
[2020-04-06 00:54:23.315450] W [MSGID: 112199] 
[nfs3-helpers.c:3327:nfs3_log_readlink_res] 0-nfs-nfsv3: 
/image/images_ro_nfs/rhel8.0/usr/lib64/libmlx5.so.1 => (XID: 1fdba2bc, 
READLINK: NFS: 5(I/O error), POSIX: 5(Input/output error)) target: (null)


I am missing something. I will see if Scott and I can work together
tomorrow. Happy for any more ideas, Thank you!!


On Sun, Apr 05, 2020 at 06:49:56PM -0500, Erik Jacobson wrote:
> First, it's possible our analysis is off somewhere. I never get to your
> print message. I put a debug statement at the start of the function so I
> know we get there (just to verify my print statements were taking
> effect).
> 
> I put a print statement for the if (call_count == 0) { call there, right
> after the if. I ran some tests.
> 
> I suspect that isn't a problem area. There were some interesting results
> with an NFS stale file handle error going through that path. Otherwise
> it's always errno=0 even in the heavy test case. I'm not concerned about
> a stale NFS file handle this moment. That print was also hit heavily when
> one server was down (which surprised me but I don't know the internals).
> 
> I'm trying to re-read and work through Scott's message to see if any
> other print statements might be helpful.
> 
> Thank you for your help so far. I will reply back if I find something.
> Otherwise suggestions welcome!
> 
> The MFG system I can access got smaller this weekend but is still large
> enough to reproduce the error.
> 
> As you can tell, I work mostly at a level well above filesystem code so
> thank you for staying with me as I struggle through this.
> 
> Erik
> 
> > After we hear from all children, afr_inode_refresh_subvol_cbk() then calls 
> > afr_inode_refresh_done()-->afr_txn_refresh_done()-->afr_read_txn_refresh_done().
> > But you already know this flow now.
> 
> > diff --git a/xlators/cluster/afr/src/afr-common.c 
> > b/xlators/cluster/afr/src/afr-common.c
> > index 4bfaef9e8..096ce06f0 100644
> > --- a/xlators/cluster/afr/src/afr-common.c
> > +++ b/xlators/cluster/afr/src/afr-common.c
> > @@ -1318,6 +1318,12 @@ afr_inode_refresh_subvol_cbk(call_frame_t *frame, 
> > void *cookie, xlator_t *this,
> >  if (xdata)
> >  local->replies[call_child].xdata = dict_ref(xdata);
> >  }
> > +if (op_ret == -1)
> > +gf_msg_callingfn(
> > +this->name, GF_LOG_ERROR, op_errno, AFR_MSG_SPLIT_BRAIN,
> > +"Inode refresh on child:%d failed with errno:%d for %s(%s) ",
> > +call_child, op_errno, local->loc.name,
> > +uuid_utoa(local->loc.inode->gfid));
> >  if (xdata) {
> >  ret = dict_get_int8(xdata, "link-count", _heal);
> >  local->replies[call_child].need_heal = need_heal;








Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-05 Thread Erik Jacobson
First, it's possible our analysis is off somewhere. I never get to your
print message. I put a debug statement at the start of the function so I
know we get there (just to verify my print statements were taking
effect).

I put a print statement for the if (call_count == 0) { call there, right
after the if. I ran some tests.

I suspect that isn't a problem area. There were some interesting results
with an NFS stale file handle error going through that path. Otherwise
it's always errno=0 even in the heavy test case. I'm not concerned about
a stale NFS file handle this moment. That print was also hit heavily when
one server was down (which surprised me but I don't know the internals).

I'm trying to re-read and work through Scott's message to see if any
other print statements might be helpful.

Thank you for your help so far. I will reply back if I find something.
Otherwise suggestions welcome!

The MFG system I can access got smaller this weekend but is still large
enough to reproduce the error.

As you can tell, I work mostly at a level well above filesystem code so
thank you for staying with me as I struggle through this.

Erik

> After we hear from all children, afr_inode_refresh_subvol_cbk() then calls 
> afr_inode_refresh_done()-->afr_txn_refresh_done()-->afr_read_txn_refresh_done().
> But you already know this flow now.

> diff --git a/xlators/cluster/afr/src/afr-common.c 
> b/xlators/cluster/afr/src/afr-common.c
> index 4bfaef9e8..096ce06f0 100644
> --- a/xlators/cluster/afr/src/afr-common.c
> +++ b/xlators/cluster/afr/src/afr-common.c
> @@ -1318,6 +1318,12 @@ afr_inode_refresh_subvol_cbk(call_frame_t *frame, void 
> *cookie, xlator_t *this,
>  if (xdata)
>  local->replies[call_child].xdata = dict_ref(xdata);
>  }
> +if (op_ret == -1)
> +gf_msg_callingfn(
> +this->name, GF_LOG_ERROR, op_errno, AFR_MSG_SPLIT_BRAIN,
> +"Inode refresh on child:%d failed with errno:%d for %s(%s) ",
> +call_child, op_errno, local->loc.name,
> +uuid_utoa(local->loc.inode->gfid));
>  if (xdata) {
>  ret = dict_get_int8(xdata, "link-count", _heal);
>  local->replies[call_child].need_heal = need_heal;







Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-04 Thread Erik Jacobson
I had a co-worker look through this with me (Scott Titus). He has a more
analytical mind than I do. Here is what he said with some edits by me.
My edits were formatting and adjusting some words. So we were hoping
that, given this analysis, the community could let us know if it raises
any red flags that would lead to a solution to the problem (whether it
be setup, settings, or code). If needed, I can get Scott to work with me
and dig further but it was starting to get painful where Scott stopped.

Scott's words (edited):

(all backtraces match - at least up to the point I'm concerned with at this
time)

Error was passed from afr_inode_refresh_done() into afr_txn_refresh_done() as
afr_inode_refresh_done()'s call frame has 'error=0'
while afr_txn_refresh_done() has 'err=5' in the call frame.


#0  afr_read_txn_refresh_done (frame=0x7ffc949cf7c8, this=0x7fff640137b0,
err=5) at afr-read-txn.c:281
#1  0x7fff68901fdb in afr_txn_refresh_done (
frame=frame@entry=0x7ffc949cf7c8, this=this@entry=0x7fff640137b0,
err=5,
err@entry=0) at afr-common.c:1223
#2  0x7fff689022b3 in afr_inode_refresh_done (
frame=frame@entry=0x7ffc949cf7c8, this=this@entry=0x7fff640137b0,
error=0)
at afr-common.c:1295
#3  0x7fff6890f3fb in afr_inode_refresh_subvol_cbk (frame=0x7ffc949cf7c8,
cookie=, this=0x7fff640137b0, op_ret=,
op_errno=, buf=buf@entry=0x7ffd53ffdaa0,
xdata=0x7ffd3c6764f8, par=0x7ffd53ffdb40) at afr-common.c:1333


Within afr_inode_refresh_done(), the only two ways it can generate an error
within is via setting it to EINVAL or resulting from a failure status from
afr_has_quorum().  Since EINVAL is 22, not 5, the quorum test failed.

Within the afr_has_quorum() conditional, an error could be set
from afr_final_errno() or afr_quorum_errno().  Digging reveals
afr_quorum_errno() just returns ENOTCONN which is 107, so that is not it.
This leaves us with afr_quorum_errno() returning the error.

(Scott provided me with source code with pieces bolded but I don't think
you need that).

afr_final_errno() iterates through the 'children', looking for
valid errors within the replies for the transaction (refresh transaction?).
The function returns the highest valued error, which must be EIO (value of 5)
in this case.

I have not looked into how or what would set the error value in the
replies array, as this being a distributed system the error could have been
generated on another server. Unless this path needs to be investigated, I'd
rather not get mired into finding which iteration (value of 'i') has the error
and what system? thread?  added the error to the reply unless it is
information that is required.



Any suggested next steps?

> 
> On 01/04/20 8:57 am, Erik Jacobson wrote:
> > Here are some back traces. They make my head hurt. Maybe you can suggest
> > something else to try next? In the morning I'll try to unwind this
> > myself too in the source code but I suspect it will be tough for me.
> > 
> > 
> > (gdb) break xlators/cluster/afr/src/afr-read-txn.c:280 if err == 5
> > Breakpoint 1 at 0x7fff688e057b: file afr-read-txn.c, line 281.
> > (gdb) continue
> > Continuing.
> > [Switching to Thread 0x7ffec700 (LWP 50175)]
> > 
> > Thread 15 "glfs_epoll007" hit Breakpoint 1, afr_read_txn_refresh_done (
> >  frame=0x7fff48325d78, this=0x7fff640137b0, err=5) at afr-read-txn.c:281
> > 281 if (err) {
> > (gdb) bt
> > #0  afr_read_txn_refresh_done (frame=0x7fff48325d78, this=0x7fff640137b0,
> >  err=5) at afr-read-txn.c:281
> > #1  0x7fff68901fdb in afr_txn_refresh_done (
> >  frame=frame@entry=0x7fff48325d78, this=this@entry=0x7fff640137b0, 
> > err=5,
> >  err@entry=0) at afr-common.c:1223
> > #2  0x7fff689022b3 in afr_inode_refresh_done (
> >  frame=frame@entry=0x7fff48325d78, this=this@entry=0x7fff640137b0, 
> > error=0)
> >  at afr-common.c:1295
> Hmm, afr_inode_refresh_done() is called with error=0 and by the time we
> reach afr_txn_refresh_done(), it becomes 5(i.e. EIO).
> So afr_inode_refresh_done() is changing it to 5. Maybe you can put
> breakpoints/ log messages in afr_inode_refresh_done() at the places where
> error is getting changed and see where the assignment happens.
> 
> 
> Regards,
> Ravi







Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-02 Thread Erik Jacobson
> Hmm, afr_inode_refresh_done() is called with error=0 and by the time we
> reach afr_txn_refresh_done(), it becomes 5(i.e. EIO).
> So afr_inode_refresh_done() is changing it to 5. Maybe you can put
> breakpoints/ log messages in afr_inode_refresh_done() at the places where
> error is getting changed and see where the assignment happens.

I had a lot of struggles tonight getting the system ready to go.  I had
seg11's in glusterfs(nfs) but I think it was related to not all brick
processes stopping with glusterd. I also re-installed and/or the print
statements. I'm not sure. I'm not used to seeing that.

I put print statements everywhere I thought error could change and got
no printed log messages.

I put break points where error would change and we didn't hit them.

I then put a breakpoint at

break xlators/cluster/afr/src/afr-common.c:1298 if error != 0

---> refresh_done:
afr_txn_refresh_done(frame, this, error);

And it never triggered (despite split-brain messages and my crapola
message).

So I'm unable to explain this transition. I'm also not a gdb expert.
I still see the same back trace though.

#1  0x7fff68938d7b in afr_txn_refresh_done (
frame=frame@entry=0x7ffd540ed8e8, this=this@entry=0x7fff64013720, err=5,
err@entry=0) at afr-common.c:1222
#2  0x7fff689391f0 in afr_inode_refresh_done (
frame=frame@entry=0x7ffd540ed8e8, this=this@entry=0x7fff64013720, error=0)
at afr-common.c:1299

Is there other advice you might have for me to try?

I'm very eager to solve this problem, which is why I'm staying up late
to get machine time. I must go to bed now. I look forward to another
shot tomorrow night if you have more ideas to try.

Erik






Re: [Gluster-users] 回复: Re: Cann't mount NFS,please help!

2020-04-01 Thread Erik Jacobson
> Thanks everyone!
> 
> You mean that Ganesha is the newer NFS server solution compared to gNFS,
> and in new versions gNFS is not the suggested component,
> but if I want to use an NFS server I should install and configure Ganesha
> separately, is that right?

I would phrase it this way:
- The community is moving to Ganesha to provide NFS services. Ganesha
  supports several storage solutions, including gluster

- Therefore, distros and packages tend to disable the gNFS support in
  gluster since they assume people are moving to Ganesha. It would
  otherwise be a competing solution for NFS.

- Some people still prefer gNFS and do not want to use Ganesha yet, and
  those people need to re-build their package in some cases like was
  outlined in the thread. This then provides the necessary libraries and
  config files to run gNFS

- gNFS still works well if you build it as far as I have found

- For my use, Ganesha crashes with my "not normal" workload and
  so I can't switch to it yet. I worked with the community some but ran
  out of system time and had to drop the thread. I would like to revisit
  so that I can run Ganesha too some day. My work load is very far away
  from typical.

Erik


> 
> 
> 
> ━━━
> sz_cui...@163.com
> 
>  
> From: Strahil Nikolov
> Date: 2020-04-02 00:58
> To: Erik Jacobson; sz_cui...@163.com
> CC: gluster-users
> Subject: Re: [Gluster-users] Cann't mount NFS,please help!
> On April 1, 2020 3:37:35 PM GMT+03:00, Erik Jacobson
>  wrote:
> >If you are like me and cannot yet switch to Ganesha (it doesn't work in
> >our workload yet; I need to get back to working with the community on
> >that...)
> >
> >What I would have expected in the process list was a glusterfs process
> >with
> >"nfs" in the name.
> >
> >here it is from one of my systems:
> >
> >root 57927 1  0 Mar31 ?00:00:00 /usr/sbin/glusterfs -s
> >localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid -l
> >/var/log/glusterfs/nfs.log -S /var/run/gluster/933ab0ad241fab5f.socket
> >
> >
> >My guess - but you'd have to confirm this with the logs - is your
> >gluster
> >build does not have gnfs built in. Since they wish us to move to
> >Ganesha, it is often off by default. For my own builds, I enable it in
> >the spec file.
> >
> >So you should have this installed:
> >
> >/usr/lib64/glusterfs/7.2/xlator/nfs/server.so
> >
> >If that isn't there, you likely need to adjust your spec file and
> >rebuild.
> >
> >As others mentioned, the suggestion is to use Ganesha if possible,
> >which is a separate project.
> >
> >I hope this helps!
> >
> >PS here is a snip from the spec file I use, with an erikj comment for
> >what I adjusted:
> >
> ># gnfs
> ># if you wish to compile an rpm with the legacy gNFS server xlator
> ># rpmbuild -ta @PACKAGE_NAME@-@package_vers...@.tar.gz --with gnfs
> >%{?_without_gnfs:%global _with_gnfs --disable-gnfs}
> >
> ># erikj force enable
> >%global _with_gnfs --enable-gnfs
> ># end erikj
> >
> >
> >On Wed, Apr 01, 2020 at 11:57:16AM +0800, sz_cui...@163.com wrote:
> >> 1.The gluster server has set volume option nfs.disable to: off
> >>
> >> Volume Name: gv0
> >> Type: Disperse
> >> Volume ID: 429100e4-f56d-4e28-96d0-ee837386aa84
> >> Status: Started
> >> Snapshot Count: 0
> >> Number of Bricks: 1 x (2 + 1) = 3
> >> Transport-type: tcp
> >> Bricks:
> >> Brick1: gfs1:/brick1/gv0
> >> Brick2: gfs2:/brick1/gv0
> >> Brick3: gfs3:/brick1/gv0
> >> Options Reconfigured:
> >> transport.address-family: inet
> >> storage.fips-mode-rchecksum: on
> >> nfs.disable: off
> >>
> >> 2. The process has start.
> >>
> >> [root@gfs1 ~]# ps -ef | grep glustershd
> >> root   1117  1  0 10:12 ?00:00:00 /usr/sbin/glusterfs
> >-s
> >> localhost --volfile-id shd/gv0 -p
> >/var/run/gluster/shd/gv0/gv0-shd.pid -l /var/
> >> log/glusterfs/glustershd.log -S
> >/var/run/gluster/ca97b99a29c04606.socket
> >> --xlator-option
> >*replicate*.node-uuid=323075ea-2b38-427c-a9aa-70ce18e94208
> &g

Re: [Gluster-users] Cann't mount NFS,please help!

2020-04-01 Thread Erik Jacobson
If you are like me and cannot yet switch to Ganesha (it doesn't work in
our workload yet; I need to get back to working with the community on
that...)

What I would have expected in the process list was a glusterfs process with
"nfs" in the name.

here it is from one of my systems:

root 57927 1  0 Mar31 ?00:00:00 /usr/sbin/glusterfs -s 
localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid -l 
/var/log/glusterfs/nfs.log -S /var/run/gluster/933ab0ad241fab5f.socket


My guess - but you'd have to confirm this with the logs - is your gluster
build does not have gnfs built in. Since they wish us to move to
Ganesha, it is often off by default. For my own builds, I enable it in
the spec file.

So you should have this installed:

/usr/lib64/glusterfs/7.2/xlator/nfs/server.so

If that isn't there, you likely need to adjust your spec file and
rebuild.

As others mentioned, the suggestion is to use Ganesha if possible,
which is a separate project.

I hope this helps!

PS here is a snip from the spec file I use, with an erikj comment for
what I adjusted:

# gnfs
# if you wish to compile an rpm with the legacy gNFS server xlator
# rpmbuild -ta @PACKAGE_NAME@-@package_vers...@.tar.gz --with gnfs
%{?_without_gnfs:%global _with_gnfs --disable-gnfs}

# erikj force enable
%global _with_gnfs --enable-gnfs
# end erikj
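For completeness, a rebuild plus sanity check could look roughly like
this (the tarball name/version is only an example; use whatever your
build produces):

# example only -- adjust the tarball name/version for your build
rpmbuild -ta glusterfs-7.2.tar.gz --with gnfs
# after installing the resulting RPMs, confirm the gnfs xlator landed:
ls /usr/lib64/glusterfs/7.2/xlator/nfs/server.so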


On Wed, Apr 01, 2020 at 11:57:16AM +0800, sz_cui...@163.com wrote:
> 1.The gluster server has set volume option nfs.disable to: off
> 
> Volume Name: gv0
> Type: Disperse
> Volume ID: 429100e4-f56d-4e28-96d0-ee837386aa84
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: gfs1:/brick1/gv0
> Brick2: gfs2:/brick1/gv0
> Brick3: gfs3:/brick1/gv0
> Options Reconfigured:
> transport.address-family: inet
> storage.fips-mode-rchecksum: on
> nfs.disable: off
> 
> 2. The process has start.
> 
> [root@gfs1 ~]# ps -ef | grep glustershd
> root   1117  1  0 10:12 ?00:00:00 /usr/sbin/glusterfs -s
> localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l 
> /var/
> log/glusterfs/glustershd.log -S /var/run/gluster/ca97b99a29c04606.socket
> --xlator-option *replicate*.node-uuid=323075ea-2b38-427c-a9aa-70ce18e94208
> --process-name glustershd --client-pid=-6
> 
> 
> 3.But the status of gv0 is not correct,for it's status of NFS Server is not
> online.
> 
> [root@gfs1 ~]# gluster volume status gv0
> Status of volume: gv0
> Gluster process TCP Port  RDMA Port  Online  Pid
> --
> Brick gfs1:/brick1/gv0  49154 0  Y   4180
> Brick gfs2:/brick1/gv0  49154 0  Y   1222
> Brick gfs3:/brick1/gv0  49154 0  Y   1216
> Self-heal Daemon on localhost   N/A   N/AY   1117
> NFS Server on localhost N/A   N/AN   N/A
> Self-heal Daemon on gfs2N/A   N/AY   1138
> NFS Server on gfs2  N/A   N/AN   N/A
> Self-heal Daemon on gfs3N/A   N/AY   1131
> NFS Server on gfs3  N/A   N/AN   N/A
> 
> Task Status of Volume gv0
> --
> There are no active volume tasks
> 
> 4.So, I cann't mount the gv0 on my client.
> 
> [root@kvms1 ~]# mount -t nfs  gfs1:/gv0 /mnt/test
> mount.nfs: Connection refused
> 
> 
> Please Help!
> Thanks!
> 
> 
> 
> 
> 
> ━━━
> sz_cui...@163.com

> 
> 
> 
> 
> Community Meeting Calendar:
> 
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://bluejeans.com/441850968 
> 
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users 



Erik Jacobson
Software Engineer

erik.jacob...@hpe.com
+1 612 851 0550 Office

Eagan, MN
hpe.com






Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-31 Thread Erik Jacobson
THANK YOU for the hints. Very happy to have the help.

I'll reply to a couple things then dig in:

On Tue, Mar 31, 2020 at 03:27:59PM +0530, Ravishankar N wrote:
> From your reply in the other thread, I'm assuming that the file/gfid in
> question is not in genuine split-brain or needing heal. i.e. for example

Right, they were not tagged split-brain either, just healing needed,
which is expected for those 76 files.

> with that 1 brick down and 2 bricks up test case, if you tried to read the
> file from say a temporary fuse mount (which is also now connected to only to
> 2 bricks since the 3rd one is down) it works fine and there is no EIO
> error...

Looking at the heal info, all files are the files I expected to have
write changes and I think* are outside the scope of this issue. To
close the loop, I ran a 'strings' on the top of one of the files to confirm
from a fuse mount and had no trouble.
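
(For anyone who wants to repeat that spot check, it was along these
lines; the mount point and file path are just placeholders for one of
the heal-listed files:)

mkdir -p /mnt/fusecheck
mount -t glusterfs localhost:/cm_shared /mnt/fusecheck
strings /mnt/fusecheck/path/to/heal-listed-file | head
umount /mnt/fusecheck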

> ...which means that what you have observed is true, i.e.
> afr_read_txn_refresh_done() is called with err=EIO. You can add logs to see
> at what point it is EIO set. The call graph is like this: 
> afr_inode_refresh_done()-->afr_txn_refresh_done()-->afr_read_txn_refresh_done().
> 
> Maybe 
> https://github.com/gluster/glusterfs/blob/v7.4/xlators/cluster/afr/src/afr-common.c#L1188
> in afr_txn_refresh_done() is causing it either due to ret being -EIO or
> event_generation being zero.
> 
> If you are comfortable with gdb, you an put a conditional break point in
> afr_read_txn_refresh_done() at 
> https://github.com/gluster/glusterfs/blob/v7.4/xlators/cluster/afr/src/afr-read-txn.c#L283
> when err=EIO and then check the backtrace for who is setting err to EIO.

Ok so the main event! :)

I'm not a gdb expert but I think I figured it out well enough to paste
some back traces. However, I'm having trouble interpreting them exactly.
It looks to me to be the "event" case.

(I got permission to use this MFG system at night for a couple more
nights; avoiding the 24-hour-reserved internal larger system we have).

here is what I did, feel free to suggest something better.

- I am using an RPM build so I changed the spec file to create debuginfo
  packages. I'm on rhel8.1
- I installed the updated packages and debuginfo packages
- When glusterd started the nfs glusterfs, I killed it.
- I ran this:
gdb -d /root/rpmbuild/BUILD/glusterfs-7.2 -d 
/root/rpmbuild/BUILD/glusterfs-7.2/xlators/cluster/afr/src/ /usr/sbin/glusterfs

- Then from GDB, I ran this:
(gdb) run -s localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid 
-l /var/log/glusterfs/nfs.log -S /var/run/gluster/9ddb5561058ff543.socket -N

- I hit ctrl-c, then set the break point:
(gdb) break xlators/cluster/afr/src/afr-read-txn.c:280 if err == 5
- I have some debugging statements but gluster 7.2 line 280 is this:
-->  line 280 (I think gdb changed it to 281 internally)
if (err) {
if (!priv->thin_arbiter_count) {

- continue

- Then I ran the test case.


Here are some back traces. They make my head hurt. Maybe you can suggest
something else to try next? In the morning I'll try to unwind this
myself too in the source code but I suspect it will be tough for me.


(gdb) break xlators/cluster/afr/src/afr-read-txn.c:280 if err == 5
Breakpoint 1 at 0x7fff688e057b: file afr-read-txn.c, line 281.
(gdb) continue
Continuing.
[Switching to Thread 0x7ffec700 (LWP 50175)]

Thread 15 "glfs_epoll007" hit Breakpoint 1, afr_read_txn_refresh_done (
frame=0x7fff48325d78, this=0x7fff640137b0, err=5) at afr-read-txn.c:281
281 if (err) {
(gdb) bt
#0  afr_read_txn_refresh_done (frame=0x7fff48325d78, this=0x7fff640137b0, 
err=5) at afr-read-txn.c:281
#1  0x7fff68901fdb in afr_txn_refresh_done (
frame=frame@entry=0x7fff48325d78, this=this@entry=0x7fff640137b0, err=5, 
err@entry=0) at afr-common.c:1223
#2  0x7fff689022b3 in afr_inode_refresh_done (
frame=frame@entry=0x7fff48325d78, this=this@entry=0x7fff640137b0, error=0)
at afr-common.c:1295
#3  0x7fff6890f3fb in afr_inode_refresh_subvol_cbk (frame=0x7fff48325d78, 
cookie=, this=0x7fff640137b0, op_ret=, 
op_errno=, buf=buf@entry=0x7ffecfffdaa0, 
xdata=0x7ffeb806ef08, par=0x7ffecfffdb40) at afr-common.c:1333
#4  0x7fff6890f42a in afr_inode_refresh_subvol_with_lookup_cbk (
frame=, cookie=, this=, 
op_ret=, op_errno=, inode=, 
buf=0x7ffecfffdaa0, xdata=0x7ffeb806ef08, par=0x7ffecfffdb40)
at afr-common.c:1344
#5  0x7fff68b8e96f in client4_0_lookup_cbk (req=, 
iov=, count=, myframe=0x7fff483147b8)
at client-rpc-fops_v2.c:2640
#6  0x7fffed293115 in rpc_clnt_handle_reply (
clnt=clnt@entry=0x7fff640671b0, pollin=pollin@entry=0x7ffeb81aa110)
at rpc-clnt.c:764
#7  0x7fffed2934b3 in rpc_clnt_notify (trans=0x7fff64067540, 
mydata=0x7fff640671e0, event=, data=0x7ffeb81aa110)
at rpc-clnt.c:931
#8  0x7fffed29007b in rpc_transport_notify (
this=this@entry=0x7fff64067540, 

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-31 Thread Erik Jacobson
I note that this part of  afr_read_txn() gets triggered a lot.

if (afr_is_inode_refresh_reqd(inode, this, local->event_generation,
  event_generation)) {

Maybe that's normal when one of the three servers is down (but why
isn't it using its local copy by default?)

The comment in that if block is:
/* servers have disconnected / reconnected, and possibly
   rebooted, very likely changing the state of freshness
   of copies */

But we have one server consistently down, not a changing situation.

Digging, digging, digging seemed to show this is related to cache
invalidation, because the paths seemed to suggest the inode needed
refreshing, and that seems to be handled by a case statement named
GF_UPCALL_CACHE_INVALIDATION.

However, that must have been a wrong turn since turning off
cache invalidation didn't help.

I'm struggling to wrap my head around the code base and without the
background in these concepts it's a tough hill to climb.

I am going to have to try this again some day with fresh eyes and go to
bed; the machine I have easy access to is going away in the morning.
Now I'll have to reserve time on a contended one but I will do that and
continue digging.

Any suggestions would be greatly appreciated as I think I'm starting to
tip over here on this one.


On Mon, Mar 30, 2020 at 04:04:39PM -0500, Erik Jacobson wrote:
> > Sadly I am not a  developer,  so I can't answer your questions.
> 
> I'm not a FS or network developer either. I think there is a joke about
> playing one on TV but maybe it's netflix now.
> 
> Enabling certain debug options made too much information for me to watch
> personally (but an expert could probably get through it).
> 
> So I started putting targeted 'print' (gf_msg) statements in the code to
> see how it got its way to split-brain. Maybe this will ring a bell
> for someone.
> 
> I can tell the only way we enter the split-brain path is through the
> first if statement of afr_read_txn_refresh_done().
> 
> This means afr_read_txn_refresh_done() itself was passed "err" and
> that it appears thin_arbiter_count was not set (which makes sense,
> I'm using 1x3, not a thin arbiter).
> 
> So we jump to the readfn label, and read_subvol() should still be -1.
> If I read right, it must mean that this if didn't return true because
> my print statement didn't appear:
> if ((ret == 0) && spb_choice >= 0) {
> 
> So we're still with the original read_subvol == -1,
> which gets us to the split_brain message.
> 
> So now I will try to learn why afr_read_txn_refresh_done() would have
> 'err' set in the first place. I will also learn about
> afr_inode_split_brain_choice_get(). Those seem to be the two methods to
> have avoided falling in to the split brain hole here.
> 
> 
> I put debug statements in these locations. I will mark with !! what
> I see:
> 
> 
> 
> diff -Narup glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c 
> glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c
> --- glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c 2020-01-15 
> 11:43:53.887894293 -0600
> +++ glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c  2020-03-30 
> 15:45:02.917104321 -0500
> @@ -279,10 +279,14 @@ afr_read_txn_refresh_done(call_frame_t *
>  priv = this->private;
> 
>  if (err) {
> -if (!priv->thin_arbiter_count)
> +if (!priv->thin_arbiter_count) {
> +gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg crapola 1st if in 
> afr_read_txn_refresh_done() !priv->thin_arbiter_count -- goto to readfn");
> !!
> We hit this error condition and jump to readfn below
> !!!
>  goto readfn;
> -if (err != EINVAL)
> +}
> +if (err != EINVAL) {
> +gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj 2nd if in 
> afr_read_txn_refresh_done() err != EINVAL, goto readfn");
>  goto readfn;
> +}
>  /* We need to query the good bricks and/or thin-arbiter.*/
>  afr_ta_read_txn_synctask(frame, this);
>  return 0;
> @@ -291,6 +295,8 @@ afr_read_txn_refresh_done(call_frame_t *
>  read_subvol = afr_read_subvol_select_by_policy(inode, this, 
> local->readable,
> NULL);
>  if (read_subvol == -1) {
> +gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg whoops read_subvol 
> returned -1, going to readfn");
> +
>  err = EIO;
>  goto readfn;
>  }
> @@ -304,11 +310,15 @@ afr_read_txn_refresh_done(call_frame_t *
>  readfn:
>  if (read_subvol == -1) {
>  ret = afr_inode_split_brain_choice_get(

Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-30 Thread Erik Jacobson
> Sadly I am not a  developer,  so I can't answer your questions.

I'm not a FS or network developer either. I think there is a joke about
playing one on TV but maybe it's netflix now.

Enabling certain debug options made too much information for me to watch
personally (but an expert could probably get through it).

So I started putting targeted 'print' (gf_msg) statements in the code to
see how it got its way to split-brain. Maybe this will ring a bell
for someone.

I can tell the only way we enter the split-brain path is through the
first if statement of afr_read_txn_refresh_done().

This means afr_read_txn_refresh_done() itself was passed "err" and
that it appears thin_arbiter_count was not set (which makes sense,
I'm using 1x3, not a thin arbiter).

So we jump to the readfn label, and read_subvol() should still be -1.
If I read right, it must mean that this if didn't return true because
my print statement didn't appear:
if ((ret == 0) && spb_choice >= 0) {

So we're still with the original read_subvol == -1,
which gets us to the split_brain message.

So now I will try to learn why afr_read_txn_refresh_done() would have
'err' set in the first place. I will also learn about
afr_inode_split_brain_choice_get(). Those seem to be the two methods to
have avoided falling in to the split brain hole here.


I put debug statements in these locations. I will mark with !! what
I see:



diff -Narup glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c 
glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c
--- glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c   2020-01-15 
11:43:53.887894293 -0600
+++ glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c2020-03-30 
15:45:02.917104321 -0500
@@ -279,10 +279,14 @@ afr_read_txn_refresh_done(call_frame_t *
 priv = this->private;

 if (err) {
-if (!priv->thin_arbiter_count)
+if (!priv->thin_arbiter_count) {
+gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg crapola 1st if in 
afr_read_txn_refresh_done() !priv->thin_arbiter_count -- goto to readfn");
!!
We hit this error condition and jump to readfn below
!!!
 goto readfn;
-if (err != EINVAL)
+}
+if (err != EINVAL) {
+gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj 2nd if in 
afr_read_txn_refresh_done() err != EINVAL, goto readfn");
 goto readfn;
+}
 /* We need to query the good bricks and/or thin-arbiter.*/
 afr_ta_read_txn_synctask(frame, this);
 return 0;
@@ -291,6 +295,8 @@ afr_read_txn_refresh_done(call_frame_t *
 read_subvol = afr_read_subvol_select_by_policy(inode, this, 
local->readable,
NULL);
 if (read_subvol == -1) {
+gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg whoops read_subvol 
returned -1, going to readfn");
+
 err = EIO;
 goto readfn;
 }
@@ -304,11 +310,15 @@ afr_read_txn_refresh_done(call_frame_t *
 readfn:
 if (read_subvol == -1) {
 ret = afr_inode_split_brain_choice_get(inode, this, &spb_choice);
-if ((ret == 0) && spb_choice >= 0)
+if ((ret == 0) && spb_choice >= 0) {
!!
We never get here, afr_inode_split_brain_choice_get() must not have
returned what was needed to enter.
!!
+gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg read_subvol was -1 
to begin with split brain choice found: %d", spb_choice);
 read_subvol = spb_choice;
+}
 }

 if (read_subvol == -1) {
+   gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg verify this shows up 
above split-brain error");
!!
We hit here. Game over player.
!!
+
 AFR_SET_ERROR_AND_CHECK_SPLIT_BRAIN(-1, err);
 }
 afr_read_txn_wind(frame, this, read_subvol);







Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-30 Thread Erik Jacobson
> Hi Erik,
> Sadly I didn't have the time to take a look in your logs, but I would like to 
> ask you whether you have statistics of the network bandwidth usage.
> Could it be possible that the gNFS server is starved for bandwidth and fails
> to reach all bricks, leading to 'split-brain' errors?
> 

I understand. I doubt there is a bandwidth issue but I'll add this to my
checks. We have 288 nodes per server normally and they run fine with all
servers up. The 76 number is just what we happened to have access to on
an internal system.

Question: What you mentioned above, and a feeling I have too personally
is -- is the split-brain error actually a generic catch-all error for
not being able to get access to a file? So when it says "split-brain"
could it really mean any type of access error? Could it also be given
when there is an IO timeout or something?

I'm starting to break open the source code to look around but I think my
head will explode before I understand it enough. I will still give it a
shot.

I have access to this system until later tonight. Then it goes away. We
have duplicated it on another system that stays, but the machine
internally is so contended for that I wouldn't get a time slot until
later in the week anyway. Trying to make as much use of this "gift"
machine as I can :) :)

Thanks again for the replies so far.

Erik






Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-30 Thread Erik Jacobson
Thank you so much for replying --

> > [2020-03-29 03:42:52.295532] E [MSGID: 108008] 
> > [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: 
> > Failing ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain 
> > observed. [Input/output error]


> Since you say that the errors go away when all 3 bricks (which I guess is
> what you refer to as 'leaders') of the replica are up, it could be possible

Yes, leaders == gluster+gnfs servers for this. We use 'leader' internally
to mean servers that help manage compute nodes. I try to convert it to
'server' in my writing but 'leader' slips out sometimes.

> that the brick you brought down had the only good copy. In such cases, even
> though you have the other 2 bricks of the replica up, they both are bad

I think all 3 copies are good. That is because the same exact files are
accessed the same way when nodes boot. With one server down, 76 nodes
normally boot with no errors. Once in a while one fails with split brain
errors in the log. The more load I put in, the more likely a split brain
when one server is down. So that's why my test case is so weird looking.
It has to generate a bunch of extra load and then try to access root
filesystem files using our tools to trigger the split brain. The test
is good in that it produces at least a couple split-brain errors every
time. I'm actually very happy to have a test case. We've been dealing
with reports for some time.

The healing errors seen are explained by the writable XFS image files in
gluster -- one per node -- that the nodes use for their /etc, /var, and
so on. So the 76 healing messages were expected. If it would help to
reduce confusion, I can repeat the test with using TMPFS for the
writable areas so that the healing list is clear.

> copies waiting to be healed and hence all operations on those files will
> fail with EIO. Since you say this occurs under high load only. I suspect

To be clear, with one server down, operations work like 99.9% of the time.
Same operations on every node. It's only when we bring the load up
(maybe heavy metadata related?) that we get split-brain errors with one
server down.

It is a strange problem but I don't believe there is a problem with any
copy of any file. Never say never and nothing would make me happier than
being wrong and solving the problem.

I want to thank you so much for writing back. I'm willing to try any
suggestions we come up with.

Erik






Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-29 Thread Erik Jacobson
Thank you for replying!! Responses below...

I have attached the volume def (meant to before).
I have attached a couple logs from one of the leaders.

> That's  odd.
> As far as I know, the clients are accessing one of the gluster nodes
> that serves as NFS server and then syncs data across the peers, right?

Correct, although in this case, with a 1x3, all of them should have
local copies. Our first reports came in from 3x3 (9 server) systems but
we have been able to duplicate on 1x3 thankfully in house. This is a
huge step forward as I had no reproducer previously.

> What happens when the virtual IP(s) are  failed  over to the other gluster 
> node? Is the issue resolved?

While we do use CTDB for managing the IP aliases, I don't start the test until
the IP is stabilized. I put all 76 nodes on one IP alias to make a more
similar load to what we have in the field.

I think it is important to point out that if I reduce the load, all is
well. For example, if the test were just booting -- where the initial
reports were seen -- just 1 or 2 nodes out of 1,000 would have an issue
each cycle. They all boot the same way and are all using the same IP
alias for NFS in my test case. So I think the split-brain messages are maybe
a symptom of some sort of timeout ??? (making stuff up here).

> Also, what kind of  load balancing are you using ?
[I moved this question up because the below answer has too much
output]

We are doing very simple balancing - manual balancing. As we add compute
nodes to the cluster, a couple racks are assigned to IP alias #1, the
next couple to IP alias #2, and so on. I'm happy to not have the
complexity of a real load balancer right now.


> Do you get any split brain entries via 'gluster volume heal info'?

I ran two trials for the 'gluster volume heal ...'

Trial 1 - with all 3 servers up and while running the load:
[root@leader2 ~]# gluster volume heal cm_shared info
Brick 172.23.0.4:/data/brick_cm_shared
Status: Connected
Number of entries: 0

Brick 172.23.0.5:/data/brick_cm_shared
Status: Connected
Number of entries: 0

Brick 172.23.0.6:/data/brick_cm_shared
Status: Connected
Number of entries: 0


Trial 2 - with 1 server down (stopped glusterd on 1 server) - and
without doing any testing yet -- I see this. Let me explain though:
although it is not in the error path, I am using RW NFS filesystem image
blobs on this same volume for the writable areas of the nodes. In the
field, we duplicate the problem using TMPFS for that writable area. I am
happy to re-do the test with RO NFS and TMPFS for the writable area; my
GUESS is that the healing messages would then go away. Would that help?
If you look at the heal count -- 76 -- it equals the node count: the
number of writable XFS image files used for writing, one per node.

[root@leader2 ~]# gluster volume heal cm_shared info
Brick 172.23.0.4:/data/brick_cm_shared
Status: Transport endpoint is not connected
Number of entries: -

Brick 172.23.0.5:/data/brick_cm_shared
Status: Connected
Number of entries: 8

Brick 172.23.0.6:/data/brick_cm_shared
Status: Connected
Number of entries: 8



Trial 3 - ran the heal command around the time the split-brain errors
were being reported


[root@leader2 glusterfs]# gluster volume heal cm_shared info
Brick 172.23.0.4:/data/brick_cm_shared
Status: Transport endpoint is not connected
Number of entries: -

Brick 172.23.0.5:/data/brick_cm_shared
Status: Connected
Number of entries: 76

Brick 172.23.0.6:/data/brick_cm_shared
Status: Connected
Number of entries: 76

 
Volume Name: cm_shared
Type: Replicate
Volume ID: f6175f56-8422-4056-9891-f9ba84756b87
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.23.0.4:/data/brick_cm_shared
Brick2: 172.23.0.5:/data/brick_cm_shared
Brick3: 172.23.0.6:/data/brick_cm_shared
Options Reconfigured:
nfs.event-threads: 3
config.brick-threads: 16
config.client-threads: 16
performance.iot-pass-through: false
config.global-threading: off
performance.client-io-threads: on
nfs.disable: off
storage.fips-mode-rchecksum: on
transport.address-family: inet
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
cluster.lookup-optimize: on
client.event-threads: 32
server.event-threads: 32
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 100
performance.io-thread-count: 32
performance.cache-size: 8GB
performance.parallel-readdir: on
cluster.lookup-unhashed: auto
performance.flush-behind: on
performance.aggregate-size: 2048KB
performance.write-behind-trickling-writes: off
transport.listen-backlog: 16384
performance.write-behind-window-size: 1024MB
server.outstanding-rpc-limit: 1024
nfs.outstanding-rpc-limit: 1024
nfs.acl: on
storage.max-hardlinks: 0
performance.cache-refresh-timeout: 60

[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-03-28 Thread Erik Jacobson
Hello all,

I am getting split-brain errors in the gnfs nfs.log when 1 gluster
server is down in a 3-brick/3-node gluster volume. It only happens under
intense load.

I reported this a few months ago but didn't have a repeatable test case.
Since then, we got reports from the field and I was able to make a test case
with 3 gluster servers and 76 NFS clients/compute nodes. I point all 76
nodes to one gnfs server to make the problem more likely to happen with the
limited nodes we have in-house.

We are using gluster nfs (ganesha is not yet reliable for our workload)
to export an NFS filesystem that is used for a read-only root filesystem
for NFS clients. The largest client count we have is 2592 across 9
leaders (3 replicated subvolumes) - out in the field. This is where
the problem was first reported.

In the lab, I have a test case that can repeat the problem on a single
subvolume cluster.

Please forgive how ugly the test case is. I'm sure an IO test person can
make it pretty. It basically runs a bunch of cluster-manager NFS-intensive
operations while also producing other load. If one leader is down,
nfs.log reports some split-brain errors. For real-world customers, the
symptom is "some nodes failing to boot" in various ways or "jobs failing
to launch due to permissions or file read problems (like a library not
being readable on one node)". If all leaders are up, we see no errors.

As an attachment, I will include volume settings.

Here are example nfs.log errors:


[2020-03-29 03:42:52.295532] E [MSGID: 108008] 
[afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing 
ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain observed. 
[Input/output error]
[2020-03-29 03:42:52.295583] W [MSGID: 112199] 
[nfs3-helpers.c:3308:nfs3_log_common_res] 0-nfs-nfsv3: 
/bin/whoami => (XID: 19fb1558, 
ACCESS: NFS: 5(I/O error), POSIX: 5(Input/output error))
[2020-03-29 03:43:03.600023] E [MSGID: 108008] 
[afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing 
ACCESS on gfid 77614c4f-1ac4-448d-8fc2-8aedc9b30868: split-brain observed. 
[Input/output error]
[2020-03-29 03:43:03.600075] W [MSGID: 112199] 
[nfs3-helpers.c:3308:nfs3_log_common_res] 0-nfs-nfsv3: 
/lib64/perl5/vendor_perl/XML/LibXML/Literal.pm
 => (XID: 9a851abc, ACCESS: NFS: 5(I/O error), POSIX: 5(Input/output error))
[2020-03-29 03:43:07.681294] E [MSGID: 108008] 
[afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing 
READLINK on gfid 36134289-cb2d-43d9-bd50-60e23d7fa69b: split-brain observed. 
[Input/output error]
[2020-03-29 03:43:07.681339] W [MSGID: 112199] 
[nfs3-helpers.c:3327:nfs3_log_readlink_res] 0-nfs-nfsv3: 
/lib64/.libhogweed.so.4.hmac => 
(XID: 5c29744f, READLINK: NFS: 5(I/O error), POSIX: 5(Input/output error)) 
target: (null)


The brick log isn't very interesting during the failure. There are some
ACL errors that don't seem to directly relate to the issue at hand.
(I can attach if requested!)

This is glusterfs 7.2 (although we originally hit it with 4.1.6).
I'm using rhel8 (although field reports are from rhel 7.6).

If there is anything the community can suggest to help me with this, it
would really be appreciated. I'm getting unhappy reports from the field
that the failover doesn't work as expected.

I've tried tweaking several things from various threading settings to
enabling md-cache-statfs to mem-factor to listen backlogs. I even tried
adjusting the cluster.read-hash-mode and choose-local settings.
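
(For the curious, those tweaks were plain 'gluster volume set' calls;
the values below are only examples of the sort of thing I tried, not a
recommendation:)

gluster volume set cm_shared cluster.read-hash-mode 1
gluster volume set cm_shared cluster.choose-local off
gluster volume set cm_shared performance.md-cache-statfs on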

"cluster-configuration" in the script initiates a bunch of operations on the
node that results in reading many files and doing some database queries. I
used it in my test case as it is a common failure point when nodes are
booting. This test case, although ugly, fails 100% if one server is down and
works 100% if all servers are up.


#! /bin/bash

#
# Test case:
#
# in a 1x3 Gluster Replicated setup with the HPCM volume settings..
#
# On a cluster with 76 nodes (maybe can be replicated with less we don't
# know)
#
# When all the nodes are assigned to one IP alias to get the load in to
# one leader node
#
# This test case will produce split-brain errors in the nfs.log file
# when 1 leader is down, but will run clean when all 3 are up.
#
# It is not necessary to power off the leader you wish to disable. Simply
# running 'systemctl stop glusterd' is sufficient.
#
# We will use this script to try to resolve the issue with split-brain
# under stress when one leader is down.
#

# (compute group is 76 compute nodes)
echo "killing any node find or node tar commands..."
pdsh -f 500 -g compute killall find
pdsh -f 500 -g compute killall tar

# (in this test, leader1 is known to have glusterd stopped for the test case)
echo "stop, start glusterd, drop caches, sleep 15"
set -x
pdsh -w leader2,leader3 systemctl stop glusterd
sleep 3
pdsh -w leader2,leader3 "echo 3 > /proc/sys/vm/drop_caches"
pdsh -w leader2,leader3 systemctl start glusterd
set +x
sleep 15

echo "drop 

Re: [Gluster-users] gluster NFS hang observed mounting or umounting at scale

2020-02-13 Thread Erik Jacobson
While it's still early, our testing is showing this issue fixed in
glusterfs 7.2 (we were at 4.1.6).

Closing the loop in case people search for this.

Erik

On Sun, Jan 26, 2020 at 12:04:00PM -0600, Erik Jacobson wrote:
> > One last reply to myself.
> 
> One of the test cases my test scripts triggered turned out to actually
> be due to my NFS RW mount options.
> 
> OLD RW NFS mount options:
> "rw,noatime,nocto,actimeo=3600,lookupcache=all,nolock,tcp,vers=3"
> 
> NEW options that work better
> "rw,noatime,nolock,tcp,vers=3"
> 
> I had copied the RO NFS options we use which try to be aggressive about
> caching. The RO root image doesn't change much and we want it as fast
> as possible. The options are not appropriate for RW areas that change.
> (Even though it's a single image file we care about).
> 
> So now my test scripts run clean but since what we see on larger systems
> is right after reboot, the caching shouldn't matter. In the real problem
> case, the RW stuff is done once after reboot.
> 
> FWIW I attached my current test scripts, my last batch had some errors.
> 
> The search continues for the actual problem, which I'm struggling to
> reproduce @ 366 NFS clients.
> 
> I believe yesterday, when I posted about actual HANGS, that is the real
> problem we're tracking. I hit that once in my test scripts - only once.
> My script was otherwise hitting a "file doesn't really exist even though
> cached" issue and it was tricking my scripts.
> 
> In any case, I'm changing the RW NFS options we use regardless.
> 
> Erik







Re: [Gluster-users] GlusterFS problems & alternatives

2020-02-11 Thread Erik Jacobson
> looking through the last couple of week on this mailing list and reflecting 
> our own experiences, I have to ask: what is the status of GlusterFS? So many 
> people here reporting bugs and no solutions are in sight. GlusterFS clusters 
> break left and right, reboots of a node have become a warrant for instability 
> and broken clusters, no way to fix broken clusters. And all of that with 
> recommended settings, and in our case, enterprise hardware underneath.


I have been one of the people asking questions. I sometimes get an
answer, which I appreciate. Other times not. But I'm not paying for
support in this forum so I appreciate what I can get. My questions
are sometimes very hard to summarize and I can't say I've been offering
help as much as I ask. I think I will try to do better.


Just to counter with something cool
As we speak now, I'm working on a 2,000 node cluster that will soon be a
5120 node cluster. We're validating it with the newest version of our
cluster manager.

It has 12 leader nodes (soon to have 24) that are gluster servers and
gnfs servers.

I am validating Gluster 7.2 (updating from 4.1.6). Things are looking very
good. 5120 nodes using RO NFS root with RW NFS overmounts (for things
like /var, /etc, ...)...
- boot 1 (where each node creates a RW XFS image on top of NFS for its
  writable area then syncs /var, /etc, etc) -- full boot is 15-16
  minutes for 2007 nodes.
- boot 2 (where the writable area pre-exists and is reused, just
  re-rsynced) -- 8-9 minutes to boot 2007 nodes.
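
(For context, the "boot 1" writable-area step above is conceptually
something like the following on each node; the size, paths, and mount
point are made up for illustration and are not our exact tooling:)

# sparse image file on the node's RW NFS area
truncate -s 2G /rw_nfs/$(hostname)/xfs.img
mkfs.xfs -f /rw_nfs/$(hostname)/xfs.img
# loop-mount it and seed the writable directories
mount -o loop /rw_nfs/$(hostname)/xfs.img /writable
rsync -a /etc/ /writable/etc/
rsync -a /var/ /writable/var/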

This is similar to gluster 4, but I think it's saying something to not
have had any failures in this setup on the bleeding edge release level.

We also use a different volume shared between the leaders and the head
node for shared-storage consoles and system logs. It's working great.

I haven't had time to test other solutions. Our old solution from SGI
days (ICE, ICE X, etc) was a different model where each leader served
a set of nodes and NFS-booted 288 or so. No shared storage.

Like you, I've wondered if something else matches this solution. We like
the shared storage and the ability for a leader to drop and not take
288 nodes with it.

(All nodes running RHEL8.0, Glusterfs 7.2, CTDB 4.9.1)



So we can say gluster is now providing the network boot solution for two
supercomputers.



Erik




[Gluster-users] question on rebalance errors gluster 7.2 (adding to distributed/replicated)

2020-02-10 Thread Erik Jacobson
My question: Are the errors and anomalies below something I need to
investigate? Or should I not be worried?


I installed a test cluster to gluster 7.2 to run some tests, preparing
to see if we gain confidence to put this on the 5,120 node
supercomputer instead of gluster 4.1.6.

I started with a 3x2 volume with heavy optimizations for writes and NFS.
(6 nodes, distribute/replicate).

I booted my NFS-root clients and maintained them online.

I then performed an add-brick operation to make it a 3x3 instead of a
3x2 (so 9 servers instead of 6).

The rebalance went much better for me than with gluster 4.1.6. However, I saw
some errors. We noted them first here -- 14 errors on leader8, and a few
on the others. These are the NEW nodes so the data flow was from the old
nodes to these three that at least have one error:

[root@leader8 glusterfs]# gluster volume rebalance cm_shared status
Node Rebalanced-files  size   
scanned  failures   skipped   status  run time in h:m:s
   -  ---   ---   
---   ---   ---  --
  leader1.head.cm.eag.rdlabs.hpecorp.net18933   596.4MB
181780 0  3760completed0:41:39
  172.23.0.418960 1.2GB
181831 0  3766completed0:41:39
  172.23.0.518691 1.2GB
181826 0  3716completed0:41:39
  172.23.0.614917   618.8MB
175758 0  3869completed0:35:40
  172.23.0.715114   573.5MB
175728 0  3853completed0:35:41
  172.23.0.814864   459.2MB
175742 0  3951completed0:35:40
  172.23.0.900Bytes 
   11 3 0completed0:08:26
 172.23.0.1100Bytes 
  242 1 0completed0:08:25
   localhost00Bytes 
514 0completed0:08:26
volume rebalance: cm_shared: success



My rebalance log is like 32M and I find it's hard for people to help me
when I post that much data. So I've tried to filter some of the data
here. Two classes -- anomalies and errors.


Errors (14 reported on this node):

[root@leader8 glusterfs]# grep -i "error from gf_defrag_get_entry" 
cm_shared-rebalance.log
[2020-02-10 23:23:55.286830] W [dht-rebalance.c:3439:gf_defrag_process_dir] 
0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:24:12.903496] W [dht-rebalance.c:3439:gf_defrag_process_dir] 
0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:24:15.226948] W [dht-rebalance.c:3439:gf_defrag_process_dir] 
0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:24:15.259480] W [dht-rebalance.c:3439:gf_defrag_process_dir] 
0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:24:15.398784] W [dht-rebalance.c:3439:gf_defrag_process_dir] 
0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:24:16.633033] W [dht-rebalance.c:3439:gf_defrag_process_dir] 
0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:24:16.645847] W [dht-rebalance.c:3439:gf_defrag_process_dir] 
0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:24:21.783528] W [dht-rebalance.c:3439:gf_defrag_process_dir] 
0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:24:22.307464] W [dht-rebalance.c:3439:gf_defrag_process_dir] 
0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:25:23.391256] W [dht-rebalance.c:3439:gf_defrag_process_dir] 
0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:26:34.203129] W [dht-rebalance.c:3439:gf_defrag_process_dir] 
0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:26:39.669243] W [dht-rebalance.c:3439:gf_defrag_process_dir] 
0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:27:42.615081] W [dht-rebalance.c:3439:gf_defrag_process_dir] 
0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:28:53.942158] W [dht-rebalance.c:3439:gf_defrag_process_dir] 
0-cm_shared-dht: Found error from gf_defrag_get_entry


Brick log errors around 23:23:55 (to match the first error above):

[2020-02-10 23:23:54.605681] W [MSGID: 113096] 
[posix-handle.c:834:posix_handle_soft] 0-cm_shared-posix: symlink 
../../a4/3e/a43ef7fd-08eb-434c-8168-96a92059d186/LC_MESSAGES -> 

Re: [Gluster-users] NFS clients show missing files while gluster volume rebalanced

2020-02-10 Thread Erik Jacobson
Closing the loop in case someone does a search on this...

I have an update. I am getting some time on a 1,000 node system soon so I have
started to validate jumping to gluster 7.2 on my small lab machine.

I switched the packages to my own build of gluster 7.2 with gnfs.
I re-installed my leader node (gluster/gnfs servers) and created
the volumes the same way as before. This includes heavy cache
optimization for the NFS services volume.

I can no longer duplicate this problem on gluster 7.2. I was able to
duplicate rebalance troubles on NFS clients every time on gluster
4.1.6.

I do have a couple questions on some rebalance errors, which I will send
in a separate email.

Erik

On Wed, Jan 29, 2020 at 06:20:34PM -0600, Erik Jacobson wrote:
> We are using gluster 4.1.6. We are using gluster NFS (not ganesha).
> 
> Distributed/replicated with subvolume size 3 (6 total servers, 2
> subvols).
> 
> The NFS clients use this for their root filesystem.
> 
> When I add 3 more gluster servers to add one more subvolume to the
> storage volumes (so now subvolume size 3, 9 total servers, 3 total
> subvolumes), the process gets started. 
> 
> ssh leader1 gluster volume add-brick cm_shared 
> 172.23.0.9://data/brick_cm_shared 172.23.0.10://data/brick_cm_shared 
> 172.23.0.11://data/brick_cm_shared
> 
> then
> 
> ssh leader1 gluster volume rebalance cm_shared start
> 
> The re-balance works. 'gluster volume status' shows re-balance in
> progress.
> 
> However, existing gluster-NFS clients now show missing files and I can
> no longer log into them (since NFS is their root). If you are logged in,
> you can find that libraries are missing and general unhappiness with
> random files now missing.
> 
> Is accessing a volume that is in the process of being re-balanced not
> supported from a gluster NFS client? Or have I made an error?
> 
> Thank you for any help,
> 
> Erik





[Gluster-users] NFS clients show missing files while gluster volume rebalanced

2020-01-29 Thread Erik Jacobson
We are using gluster 4.1.6. We are using gluster NFS (not ganesha).

Distributed/replicated with subvolume size 3 (6 total servers, 2
subvols).

The NFS clients use this for their root filesystem.

When I add 3 more gluster servers to add one more subvolume to the
storage volumes (so now subvolume size 3, 9 total servers, 3 total
subvolumes), the process gets started. 

ssh leader1 gluster volume add-brick cm_shared 
172.23.0.9://data/brick_cm_shared 172.23.0.10://data/brick_cm_shared 
172.23.0.11://data/brick_cm_shared

then

ssh leader1 gluster volume rebalance cm_shared start

The re-balance works. 'gluster volume status' shows re-balance in
progress.

However, existing gluster-NFS clients now show missing files and I can
no longer log into them (since NFS is their root). If you are logged in,
you can find that libraries are missing and general unhappiness with
random files now missing.

Is accessing a volume that is in the process of being re-balanced not
supported from a gluster NFS client? Or have I made an error?

Thank you for any help,

Erik




Re: [Gluster-users] gluster NFS hang observed mounting or umounting at scale

2020-01-26 Thread Erik Jacobson
> One last reply to myself.

One of the test cases my test scripts triggered turned out to actually
be due to my NFS RW mount options.

OLD RW NFS mount options:
"rw,noatime,nocto,actimeo=3600,lookupcache=all,nolock,tcp,vers=3"

NEW options that work better
"rw,noatime,nolock,tcp,vers=3"

I had copied the RO NFS options we use which try to be aggressive about
caching. The RO root image doesn't change much and we want it as fast
as possible. The options are not appropriate for RW areas that change.
(Even though it's a single image file we care about).
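
(In mount form, the difference is just the option string; the server
and paths below are placeholders, not our real config:)

# old, RO-root style caching -- wrong for a RW area:
mount -t nfs -o rw,noatime,nocto,actimeo=3600,lookupcache=all,nolock,tcp,vers=3 server:/export /writable

# new:
mount -t nfs -o rw,noatime,nolock,tcp,vers=3 server:/export /writable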

So now my test scripts run clean but since what we see on larger systems
is right after reboot, the caching shouldn't matter. In the real problem
case, the RW stuff is done once after reboot.

FWIW I attached my current test scripts, my last batch had some errors.

The search continues for the actual problem, which I'm struggling to
reproduce @ 366 NFS clients.

I believe yesterday, when I posted about actual HANGS, that is the real
problem we're tracking. I hit that once in my test scripts - only once.
My script was otherwise hitting a "file doesn't really exist even though
cached" issue and it was tricking my scripts.

In any case, I'm changing the RW NFS options we use regardless.

Erik


nfs-issues.tar.xz
Description: application/xz




Re: [Gluster-users] gluster NFS hang observed mounting or umounting at scale

2020-01-25 Thread Erik Jacobson
e handle), POSIX: 116(Stale file handle)), count: 0, 
> STABLE,wverf: 1579664973
> [2020-01-26 02:42:43.908045] W [MSGID: 112199] 
> [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: 
> /image/images_rw_nfs/r17c3t4n1/rhel8.0/xfs.img => (XID: a87e7e7d, WRITE: NFS: 
> 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, 
> STABLE,wverf: 1579664973
> [2020-01-26 02:42:43.908194] W [MSGID: 112199] 
> [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: 
> /image/images_rw_nfs/r17c3t4n1/rhel8.0/xfs.img => (XID: a67e7e7d, WRITE: NFS: 
> 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, 
> STABLE,wverf: 1579664973
> 
> 
> 
> 
> 


Erik Jacobson
Software Engineer

erik.jacob...@hpe.com
+1 612 851 0550 Office

Eagan, MN
hpe.com




Re: [Gluster-users] gluster NFS hang observed mounting or umounting at scale

2020-01-25 Thread Erik Jacobson
> The gluster NFS log has this entry:
> [2020-01-25 19:07:33.085806] E [MSGID: 109040] 
> [dht-helper.c:1388:dht_migration_complete_check_task] 0-cm_shared-dht: 
> 19bd72f0-6863-4f1d-80dc-a426db8670b8: failed to lookup the file on 
> cm_shared-dht [Stale file handle]
> [2020-01-25 19:07:33.085848] W [MSGID: 112199] 
> [nfs3-helpers.c:3578:nfs3_log_commit_res] 0-nfs-nfsv3: 
> /image/images_rw_nfs/r41c4t1n1/rhel8.0/xfs-test.img => (XID: cb501b58, 
> COMMIT: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), wverf: 
> 1579988225
> 

I've done more digging. I have access to an actual system that is
failing (instead of my test case) above. It appears to be the same
issue so that's good. (My access goes away in a couple hours).

The nodes don't hang at the mount, but rather, at a check in the code
for the existence of the image file. I'm not sure if the "holes" message
I share below is a problem or not; the file indeed does start sparse.

Restarting 'glusterd' on the problem server allows the node to boot.
However, it does seem like the problem image file disappears from the
face of the earth as far as I can tell (it doesn't exist in the gluster
mount to the same path).

Searching for all messages in nfs.log related to r17c3t6n3 (the problem
node with the problem nfs.img file), I see:

[root@leader1 glusterfs]# grep r17c3t6n3 nfs.log
[2020-01-24 12:29:42.412019] W [MSGID: 112199] 
[nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: 
/image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: ca68a5fc, WRITE: NFS: 
70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, 
STABLE,wverf: 1579664973
[2020-01-25 04:57:10.199988] W [MSGID: 112199] 
[nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: 
/image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: 1ec43ce0, WRITE: NFS: 
70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, 
STABLE,wverf: 1579664973 [Invalid argument]
[2020-01-25 04:57:10.200431] W [MSGID: 112199] 
[nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: 
/image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: 20c43ce0, WRITE: NFS: 
70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, 
STABLE,wverf: 1579664973
[2020-01-25 04:57:10.200695] W [MSGID: 112199] 
[nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: 
/image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: 21c43ce0, WRITE: NFS: 
70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, 
STABLE,wverf: 1579664973
[2020-01-25 04:57:10.200827] W [MSGID: 112199] 
[nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: 
/image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: 1fc43ce0, WRITE: NFS: 
70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, 
STABLE,wverf: 1579664973
[2020-01-25 04:57:10.201808] W [MSGID: 112199] 
[nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: 
/image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: 22c43ce0, WRITE: NFS: 
70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, 
STABLE,wverf: 1579664973 [Invalid argument]
[2020-01-25 23:32:09.629807] I [MSGID: 109063] 
[dht-layout.c:693:dht_layout_normalize] 0-cm_shared-dht: Found anomalies in 
/image/images_rw_nfs/r17c3t6n3/rhel8.0 (gfid = 
----). Holes=1 overlaps=0
[2020-01-26 02:42:33.712684] W [MSGID: 112199] 
[nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: 
/image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: a0ca8fc3, WRITE: NFS: 
70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, 
STABLE,wverf: 1579664973


r17c3t4n1 is another case:



[2020-01-25 23:19:46.729427] I [MSGID: 109063] 
[dht-layout.c:693:dht_layout_normalize] 0-cm_shared-dht: Found anomalies in 
/image/images_rw_nfs/r17c3t4n1/rhel8.0 (gfid = 
----). Holes=1 overlaps=0
[2020-01-26 02:42:43.907163] W [MSGID: 112199] 
[nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: 
/image/images_rw_nfs/r17c3t4n1/rhel8.0/xfs.img => (XID: a77e7e7d, WRITE: NFS: 
70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, 
STABLE,wverf: 1579664973
[2020-01-26 02:42:43.908045] W [MSGID: 112199] 
[nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: 
/image/images_rw_nfs/r17c3t4n1/rhel8.0/xfs.img => (XID: a87e7e7d, WRITE: NFS: 
70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, 
STABLE,wverf: 1579664973
[2020-01-26 02:42:43.908194] W [MSGID: 112199] 
[nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: 
/image/images_rw_nfs/r17c3t4n1/rhel8.0/xfs.img => (XID: a67e7e7d, WRITE: NFS: 
70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, 
STABLE,wverf: 1579664973







Re: [Gluster-users] No possible to mount a gluster volume via /etc/fstab?

2020-01-25 Thread Erik Jacobson
> yes I know but I already tried that and failed at implementing it. 
> I'm now even suspecting gluster to have some kind of bug.
> 
> Could you show me how to do it correctly? Which services go into After=?
> Do you have example unit files for mounting gluster volumes?

I have had some struggles with this, in the depths of systemd.

I ended up making a oneshot systemd service and a helper script.
I have one helper script for my gluster server/nfs server nodes
that tries to carefully not mount gluster paths until gluster is
actually started. It also ensures ctdb is started only after the gluster
lock is actually available.

Your case seems to be more like gluster-client-only, which I have a
simpler helper script for. Note that ideas for this came from this
very mailing list as I recall. So I'm not taking credit for the whole
idea. Now this is very specific to my situation but maybe you can
get some ideas. Otherwise, trash this email :)

systemd service:

# This cluster manager service ensures
# - shared storage is mounted
# - bind mounts are mounted
# - Works around distro problems (like RHEL8.0) that ignore _netdev
#   and try to mount network filesystems before the network is up
# - Also helps handle the case where the whole cluster is powered up and
#   the admin won't be able to mount shared storage until SU leaders are up.

[Unit]
Description=CM ADMIN Service to ensure mounts are good
After=network-online.target time-sync.target

[Service]
Type=oneshot
RemainAfterExit=yes
User=root
ExecStart=/opt/clmgr/lib/cm-admin-mounts


[Install]
WantedBy=multi-user.target





And the helper:


#! /bin/bash

# Copyright (c) 2019 Hewlett Packard Enterprise Development LP
# All rights reserved.

#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  You should have received a copy of the GNU General Public License
#  along with this program; if not, write to the Free Software
#  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307 USA


#
# This script handles ensuring:
# * Shared storage is actually mounted
# * bind mounts are sourced by shared storage and not by local directories
#
# This script solves two problems. One is a bug in RHEL 8.0 where systemd
# ignores _netdev in fstab and tries to mount network storage before the
# network is up. Additionally, this script is useful in all scenarios to
# handle the data-center-power-outage use case. In this case, SU leaders may
# take a while to get up and running -- longer than systemd might wait for
# mounts.
#
# In all cases, if systemd fails to mount the shared storage, it may
# ignore the dependencies and do the bind mounts any, which could
# incorrectly point to local directories instead of shared storage.
#
me=$(basename $0)


#
# Safety. Don't run on wrong node type.
#
if ! grep -q -P '^NODETYPE="admin"' /etc/opt/sgi/cminfo; then
echo "$me: Error: This script is only to be run on admin nodes." > 
/dev/stderr
logger "$me: Error: This script is only to be run on admin nodes." > 
/dev/stderr
exit 1
fi

if [ ! -r /opt/clmgr/lib/su-leader-functions.sh ]; then
echo "$me: Error: /opt/clmgr/lib/su-leader-functions.sh not found." > 
/dev/stderr
logger "$me: Error: /opt/clmgr/lib/su-leader-functions.sh not found." > 
/dev/stderr
exit 1
fi
source /opt/clmgr/lib/su-leader-functions.sh

#
# enable-su-leader would have placed a shared_storage entry in fstab.
# If that is not present, this admin may have been de-coupled from the
# leaders. Exit in that case.
#
if ! grep -P -q "\d+\.\d+\.\d+\.\d+:/cm_shared\s+" /etc/fstab; then
logger "$me: Shared storage not enabled. Exiting."
exit 0
fi

logger "$me: Unmount temporarily any bind mounts"
umount_bind_mounts_local

logger "$me: Keep trying to mount shared storage..."
while true; do
umount /opt/clmgr/shared_storage &> /dev/null

mount /opt/clmgr/shared_storage/
if [ $? -ne 0 ]; then
logger "$me: /opt/clmgr/shared_storage mount failed. Will 
re-try."
umount /opt/clmgr/shared_storage/
sleep 3
continue
fi
logger "$me: Mount command reports gluster mount success. Verifying."
if ! cat /proc/mounts | grep -q -P 
"\d+\.\d+\.\d+\.\d+:\S+\s+/opt/clmgr/shared_storage\s+fuse.glusterfs"; then
logger "$me: Verification. /opt/clmgr/shared_storage not in 
/proc/mounts as glusterfs. Retry"
sleep 3
continue
fi
logger "$me: Gluster mounts look correct in 

Re: [Gluster-users] hook script question related to ctdb, shared storage, and bind mounts

2019-11-09 Thread Erik Jacobson
> Here is what was the setup :

I thought I'd share an update in case it helps others. Your ideas
inspired me to try a different approach.

We support 4 main distros (and 2 variants of some). We try not to
provide our own versions of distro-supported packages like CTDB where
possible. So a concern for me in modifying services is that they could
be replaced in package updates. There are ways to mitigate that but
that thought combined with your ideas led me to try this:

- Be sure ctdb service is disabled
- Added a systemd serivce of my own, oneshot, that runs a helper script
- The helper script first ensures the gluster volumes show up
  (I use localhost in my case and besides, in our environment, we don't
  want CTDB to have a public IP anyway until NFS can be served so this
  helps there too)
- Even with the gluster volume showing good, during init startup, first
  attempts to mount gluster volumes fail. So the helper script keeps
  looping until they work. It seems they work on the 2nd try (after a 3s
  sleep at failure).
- Once the mounts are confirmed working and mounted, then my helper
  starts the ctdb service.
- Awkward CTDB problems (where the lock check sometimes fails to detect
  a lock problem) are avoided since we won't start CTDB until we're 100%
  sure the gluster lock is mounted and pointing at gluster.

The above is working in prototype form so I'm going to start adding
my bind mounts to the equation.
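
In sketch form, the ctdb side of the helper is roughly the following
(greatly simplified; "ctdb" is our gluster lock volume mounted at
/gluster/lock, and it assumes an fstab entry exists for that mount):

#! /bin/bash
# wait until glusterd is up and knows about the lock volume
until gluster volume status ctdb &> /dev/null; do
    sleep 3
done

# keep trying until the lock path really is a glusterfs mount
until grep -q -P "/gluster/lock\s+fuse.glusterfs" /proc/mounts; do
    umount /gluster/lock 2> /dev/null
    mount /gluster/lock || sleep 3
done

# only now is it safe to start ctdb (and let it bring up the IPs)
systemctl start ctdb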

I think I have a solution that will work now and I thank you so much for
the ideas.

I'm taking things from prototype form now on to something we can provide
people.


With regards to pacemaker: there are a few pacemaker solutions that I've
touched, and one I even helped implement. Now, it could be that I'm not
an expert at writing rules, but pacemaker seems to have often given us
more trouble than the problem it solves. I believe this is due to the
complexity of the software and the power of it. I am not knocking
pacemaker. However, a person really has to be a pacemaker expert
to not make a mistake that could cause a down time. So I have attempted
to avoid pacemaker in the new solution. I know there are down sides --
fencing is there for a reason -- but as far as I can tell the decision
has been right for us. CTDB is less complicated even if it does not provide
100% true full HA abilities. That said, in the solution, I've been
careful to future-proof a move to pacemaker. For example, on the gluster
servers/NFS servers, I bring up IP aliases (interfaces) on the network the
BMCs reside so we're seamlessly able to switch to pacemaker with
IPMI/BMC/redfish fencing later if needed without causing too much pain in
the field with deployed servers.

I do realize there are tools to help configure pacemaker for you. Some
that I've tried have given me mixed results, perhaps due to the
complexity of networking setup in the solutions we have.

As we start to deploy this to more locations, I'll gain a feel for if
a move to pacemaker is right or not. I just share this in the interest
of learning. I'm always willing to learn and improve if I've overlooked
something.

Erik


Community Meeting Calendar:

APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/118564314

NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/118564314

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] hook script question related to ctdb, shared storage, and bind mounts

2019-11-05 Thread Erik Jacobson
On Tue, Nov 05, 2019 at 05:05:08AM +0200, Strahil wrote:
> Sure,
> 
> Here is what was the setup :

Thank you! You're very kind to send me this. I will verify it with my
setup soon. Hoping to rid myself of these dependency problems. Thank you!!!

Erik




Re: [Gluster-users] hook script question related to ctdb, shared storage, and bind mounts

2019-11-04 Thread Erik Jacobson
Thank you! I am very interested. I hadn't considered the automounter
idea.

Also, your fstab takes a different dependency approach than mine does.

If you happen to have the examples handy, I'll give them a shot here.

I'm looking forward to emerging from this dark place of dependencies not
working!!

Thank you so much for writing back,

Erik

On Mon, Nov 04, 2019 at 06:59:10AM +0200, Strahil wrote:
> Hi Erik,
> 
> I took another approach.
> 
> 1.  I got a systemd mount unit for my ctdb lock volume's brick:
> [root@ovirt1 system]# grep var /etc/fstab
> gluster1:/gluster_shared_storage /var/run/gluster/shared_storage/ glusterfs defaults,x-systemd.requires=glusterd.service,x-systemd.automount 0 0
> 
> As you can see - it is an automounter, because sometimes it fails to mount on 
> time
> 
> 2.  I got custom systemd services for glusterd,ctdb and vdo -  as I need to 
> 'put' dependencies for each of those.
> 
> Now, I'm no longer using ctdb & NFS Ganesha (as my version of ctdb cannot use 
> hostnames and my environment is a little bit crazy), but I can still provide 
> hints how I did it.
> 
> Best Regards,
> Strahil Nikolov
>
> On Nov 3, 2019 22:46, Erik Jacobson wrote:
> >
> > So, I have a solution I have written about in the based that is based on 
> > gluster with CTDB for IP and a level of redundancy. 
> >
> > It's been working fine except for a few quirks I need to work out on 
> > giant clusters when I get access. 
> >
> > I have 3x9 gluster volume, each are also NFS servers, using gluster 
> > NFS (ganesha isn't reliable for my workload yet). There are 9 IP 
> > aliases spread across 9 servers. 
> >
> > I also have many bind mounts that point to the shared storage as a 
> > source, and the /gluster/lock volume ("ctdb") of course. 
> >
> > glusterfs 4.1.6 (rhel8 today, but I use rhel7, rhel8, sles12, and 
> > sles15) 
> >
> > Things work well when everything is up and running. IP failover works 
> > well when one of the servers goes down. My issue is when that server 
> > comes back up. Despite my best efforts with systemd fstab dependencies, 
> > the shared storage areas including the gluster lock for CTDB do not 
> > always get mounted before CTDB starts. This causes trouble for CTDB 
> > correctly joining the collective. I also have problems where my 
> > bind mounts can happen before the shared storage is mounted, despite my 
> > attempts at preventing this with dependencies in fstab. 
> >
> > I decided a better approach would be to use a gluster hook and just 
> > mount everything I need as I need it, and start up ctdb when I know and 
> > verify that /gluster/lock is really gluster and not a local disk. 
> >
> > I started down a road of doing this with a start host hook and after 
> > spending a while at it, I realized my logic error. This will only fire 
> > when the volume is *started*, not when a server that was down re-joins. 
> >
> > I took a look at the code, glusterd-hooks.c, and found that support 
> > for "brick start" is not in place for a hook script but it's nearly 
> > there: 
> >
> >     [GD_OP_START_BRICK] = EMPTY, 
> > ... 
> >
> > and no entry in glusterd_hooks_add_op_args() yet. 
> >
> >
> > Before I make a patch for my own use, I wanted to do a sanity check and 
> > find out if others have solved this better than the road I'm heading 
> > down. 
> >
> > What I was thinking of doing is enabling a brick start hook, and 
> > do my processing for volumes being mounted from there. However, I 
> > suppose brick start is a bad choice for the case of simply stopping and 
> > starting the volume, because my processing would try to complete before 
> > the gluster volume was fully started. It would probably work for a brick 
> > "coming back and joining" but not "stop volume/start volume". 
> >
> > Any suggestions? 
> >
> > My end goal is: 
> > - mount shared storage every boot 
> > - only attempt to mount when gluster is available (_netdev doesn't seem 
> >    to be enough) 
> > - never start ctdb unless /gluster/lock is a shared storage and not a 
> >    directory. 
> > - only do my bind mounts from shared storage in to the rest of the 
> >    layout when we are sure the shared storage is mounted (don't 
> >    bind-mount using an empty directory as a source by accident!) 
> >
> > Thanks so much for reading my question, 
> >
> > Erik 

[Gluster-users] hook script question related to ctdb, shared storage, and bind mounts

2019-11-03 Thread Erik Jacobson
So, I have a solution I have written about in the past that is based on
gluster with CTDB for IP alias management and a level of redundancy.

It's been working fine except for a few quirks I need to work out on
giant clusters when I get access.

I have a 3x9 gluster volume; each server is also an NFS server, using
Gluster NFS (Ganesha isn't reliable for my workload yet). There are 9 IP
aliases spread across the 9 servers.

I also have many bind mounts that point to the shared storage as a
source, and the /gluster/lock volume ("ctdb") of course.

glusterfs 4.1.6 (rhel8 today, but I use rhel7, rhel8, sles12, and
sles15)

Things work well when everything is up and running. IP failover works
well when one of the servers goes down. My issue is when that server
comes back up. Despite my best efforts with systemd fstab dependencies,
the shared storage areas including the gluster lock for CTDB do not
always get mounted before CTDB starts. This causes trouble for CTDB
correctly joining the collective. I also have problems where my
bind mounts can happen before the shared storage is mounted, despite my
attempts at preventing this with dependencies in fstab.
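
Just to illustrate the style of fstab dependency I mean (the volume
names and mount points here are placeholders, not my exact entries):

localhost:/ctdb /gluster/lock glusterfs defaults,_netdev,x-systemd.requires=glusterd.service 0 0
localhost:/cm_shared /cm_shared glusterfs defaults,_netdev,x-systemd.requires=glusterd.service 0 0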

I decided a better approach would be to use a gluster hook and just
mount everything I need as I need it, and start up ctdb when I know and
verify that /gluster/lock is really gluster and not a local disk.
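
The verification piece would be something simple along these lines (a
sketch, not final code):

# only start ctdb if /gluster/lock shows up as a fuse.glusterfs mount
if grep -q -P '\S+\s+/gluster/lock\s+fuse\.glusterfs' /proc/mounts; then
    systemctl start ctdb
else
    logger "not starting ctdb: /gluster/lock is not a glusterfs mount yet"
fi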

I started down a road of doing this with a start host hook and after
spending a while at it, I realized my logic error. This will only fire
when the volume is *started*, not when a server that was down re-joins.

I took a look at the code, glusterd-hooks.c, and found that support
for "brick start" is not in place for a hook script but it's nearly
there:

[GD_OP_START_BRICK] = EMPTY,
...

and no entry in glusterd_hooks_add_op_args() yet.


Before I make a patch for my own use, I wanted to do a sanity check and
find out if others have solved this better than the road I'm heading
down.

What I was thinking of doing is enabling a brick start hook, and
do my processing for volumes being mounted from there. However, I
suppose brick start is a bad choice for the case of simply stopping and
starting the volume, because my processing would try to complete before
the gluster volume was fully started. It would probably work for a brick
"coming back and joining" but not "stop volume/start volume".

Any suggestions?

My end goal is:
 - mount shared storage every boot
 - only attempt to mount when gluster is available (_netdev doesn't seem
   to be enough)
 - never start ctdb unless /gluster/lock is shared storage and not a
   local directory.
 - only do my bind mounts from shared storage into the rest of the
   layout when we are sure the shared storage is mounted (don't
   bind-mount using an empty directory as a source by accident! - a
   small sketch of that check follows this list)
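
A minimal version of that bind-mount check might look like this (the
source and target paths are placeholders, not my real layout):

# only bind-mount if the shared storage source is actually mounted
src=/gluster/shared/image
dst=/image
if mountpoint -q /gluster/shared; then
    mount --bind "$src" "$dst"
else
    logger "skipping bind mount of $dst: shared storage not mounted yet"
fi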

Thanks so much for reading my question,

Erik




Re: [Gluster-users] split-brain errors under heavy load when one brick down

2019-09-18 Thread Erik Jacobson
Thank you for replying!

> Okay so 0-cm_shared-replicate-1 means these 3 bricks:
> 
> Brick4: 172.23.0.6:/data/brick_cm_shared
> Brick5: 172.23.0.7:/data/brick_cm_shared
> Brick6: 172.23.0.8:/data/brick_cm_shared

The above is correct.


> Were there any pending self-heals for this volume? Is it possible that the
> server (one of Brick 4, 5 or 6 ) that is down had the only good copy and the
> other 2 online bricks had a bad copy (needing heal)? Clients can get EIO in
> that case.

So I did check for heals and saw nothing. The storage at this time was in a
read-only use case. What I mean by that is the NFS clients mount it read only
and there were no write activities going to shared storage anyway at that
time.  So it was not surprising that no heals were listed.

I did inspect both remaining bricks for several of the example problem files
and found them with matching md5sums.
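
The check was roughly along these lines (a sketch; the file is one of
the example files from the nfs.log errors, the hosts are two of the
replicate-1 bricks quoted above, and I'm assuming the file sits under
the brick path the same way it appears in the volume):

gluster volume heal cm_shared info
ssh 172.23.0.7 md5sum /data/brick_cm_shared/image/images_ro_nfs/toss-20190730/usr/lib64/libslurm.so.32
ssh 172.23.0.8 md5sum /data/brick_cm_shared/image/images_ro_nfs/toss-20190730/usr/lib64/libslurm.so.32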

The strange thing, as I mentioned, is it only happened under the job
launch workload. The nfs boot workload, which is also very stressful,
ran clean with one brick down.

> When you say accessing the file from the compute nodes afterwards works
> fine, it is still with that one server (brick) down?

I can no longer check this system personally but as I recall when we
fixed the ethernet problem, all seemed well. I don't have a better
answer for this one than that. I am starting a document of things to try
when we have a large system in the factory to run on. I'll put this in
there.

> 
> There was a case of AFR reporting spurious split-brain errors but that was
> fixed long back (http://review.gluster.org/16362
> ) and seems to be present in glusterfs-4.1.6.


So I brought this up. In my case, we know the files on the NFS client
side really were missing because we saw errors on the clients. That is
to say, the above bug seems to mean that split-brain was reported in
error with no other impacts. However, in my case, the error resulted in
actual problems accessing the files on the NFS clients.

> Side note: Why are you using replica 9 for the ctdb volume? All
> development/tests are usually done on (distributed) replica 3 setup.

I am happy to change this. Whatever guide I used to set this up
suggested replica 9. I don't even know which resource was incorrect as
it was so long ago. I have no other reason.

I'm filing an incident now to change our setup tools to use replica-3 for
CTDB for new setups.
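
For new setups that would be a create along these lines (the addresses
and brick path here are placeholders styled after my test system):

gluster volume create ctdb replica 3 \
    172.23.0.2:/data/brick_ctdb \
    172.23.0.3:/data/brick_ctdb \
    172.23.0.4:/data/brick_ctdb
gluster volume start ctdb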

Again, I appreciate that you followed up with me. Thank you,

Erik




[Gluster-users] split-brain errors under heavy load when one brick down

2019-09-16 Thread Erik Jacobson
Hello all. I'm new to the list but not to gluster.

We are using gluster to service NFS boot on a top500 cluster. It is a
Distributed-Replicate volume 3x9.

We are having a problem where, when one server in a subvolume goes down,
we get random missing files and split-brain errors in the nfs.log file.

We are using Gluster NFS (We are interested in switching to Ganesha but
this workload presents problems there that we need to work through yet).

Unfortunately, as with many such large systems, I am unable to take much
of the system out for debugging and unable to take the system down to
test this very often. However, my hope is to be well prepared when the
next large system comes through the factory so I can try to reproduce
this issue, or at least have some things to try.

In the lab, I have a test system that is also a 3x9 setup like at the
customer site, but with only 3 compute nodes instead of 2,592 compute
nodes. We use CTDB for IP alias management - the compute nodes connect
to NFS with the alias.
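
CTDB hands those aliases out via its public addresses file, something
like this (the addresses and interface are made-up examples, not our
real ones):

# /etc/ctdb/public_addresses (illustrative only)
172.23.255.241/16 eth0
172.23.255.242/16 eth0
172.23.255.243/16 eth0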

Here is the issue we are having:
- 2592 nodes all PXE-booting at once and using the Gluster servers as
  their NFS root is working great. This includes when one subvolume is
  degraded due to the loss of a server. No issues at boot, no split-brain
  messages in the log.
- The problem comes in when we do an intensive job launch. This launch
  uses SLURM and then loads hundreds of shared libraries over NFS across
  all 2592 nodes.
- When all servers in the 3x9 pool are up, we're in good shape - no
  issues on the compute nodes, no split-brain messages in the log.
- When one subvolume has one missing server (its ethernet adapters
  died), while we boot fine, the SLURM launch has random missing files.
  Gluster nfs.log shows split-brain messages and ACCESS I/O errors.
- Taking an example failed file and accessing it across all compute nodes
  always works afterwards; the issue is transient.
- The missing file is always found on the other bricks in the subvolume by
  searching there as well.
- No FS/disk IO errors in the logs or dmesg and the files are accessible
  before and after the transient error (and from the bricks themselves as I
  said).
- The customer's jobs therefore fail to launch when we are degraded. They
  fail with library read errors, missing config files, etc.


What is perplexing is that the huge load of 2592 nodes with NFS roots
PXE-booting does not trigger the issue when one subvolume is degraded.

Thank you for reading this far and thanks to the community for
making Gluster!!

Example errors:

ex1

[2019-09-06 18:26:42.665050] E [MSGID: 108008]
[afr-read-txn.c:123:afr_read_txn_refresh_done] 0-cm_shared-replicate-1: Failing
ACCESS on gfid ee3f5646-9368-4151-92a3-5b8e7db1fbf9: split-brain observed.
[Input/output error]

ex2

[2019-09-06 18:26:55.359272] E [MSGID: 108008]
[afr-read-txn.c:123:afr_read_txn_refresh_done] 0-cm_shared-replicate-1: Failing
READLINK on gfid f2be38c2-1cd1-486b-acad-17f2321a18b3: split-brain observed.
[Input/output error]
[2019-09-06 18:26:55.359367] W [MSGID: 112199]
[nfs3-helpers.c:3435:nfs3_log_readlink_res] 0-nfs-nfsv3:
/image/images_ro_nfs/toss-20190730/usr/lib64/libslurm.so.32 => (XID: 88651c80,
READLINK: NFS: 5(I/O error), POSIX: 5(Input/output error)) target: (null)



The errors seem to happen only on the 'replicate' subvolume where one
server is down (of course, any of the NFS servers can trigger that when
it accesses files on the degraded subvolume).



Now, I am no longer able to access this customer system and it is moving
to more secret work so I can't easily run tests on such a big system
until we have something come through the factory. However, I'm desperate
for help and would like a bag of tricks to attack this with next time I
can hit it. Having the HA stuff fail when needed has given me a bit of a
black eye on the solution. I learned a lesson about thoroughly testing
the HA solution: I had tested full system boot many times but didn't
think to do job launch tests while degraded. That pain
will haunt me but also make me better.



Info on the volumes:
 - RHEL 7.6 x86_64 Gluster/GNFS servers
 - gluster version 4.1.6, I set up the build
 - Clients are AArch64 NFS 3 clients, configured with a read-only NFS
   root (using a version of Linux somewhat like CentOS 7.6).
 - The base filesystems for bricks are XFS and NO LVM layer.


What follows is the volume info from my test system in the lab, which
has the same versions and setup. I cannot get this info from the
customer without an approval process but the same scripts and tools set
up my test system so I'm confident the settings are the same.


[root@leader1 ~]# gluster volume info

Volume Name: cm_shared
Type: Distributed-Replicate
Volume ID: e7f2796b-7a94-41ab-a07d-bdce4900c731
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 172.23.0.3:/data/brick_cm_shared
Brick2: 172.23.0.4:/data/brick_cm_shared
Brick3: 172.23.0.5:/data/brick_cm_shared
Brick4: