Re: [Gluster-users] gluster forcing IPV6 on our IPV4 servers, glusterd fails (was gluster update question regarding new DNS resolution requirement)
On Tue, Sep 21, 2021 at 04:18:10PM +0000, Strahil Nikolov wrote:
> As far as I know a fix was introduced recently, so even missing to run the
> script won't be so critical - you can run it afterwards.
> I would use Ansible to roll out such updates on a set of nodes - this will
> prevent human errors and will give the opportunity to run such tiny details
> like the geo-rep modifying script.
>
> P.S.: Out of curiosity, are you using distributed-replicated or
> distributed-dispersed volumes ?

Distributed-replicated, with different volume configurations per use case, one of them sharded.

PS: I am HOPING to take another crack at Ganesha tomorrow to try to "get off our dependence on gnfs", but we'll see how things go with the crisis of the day always blocking progress. I hope to deprecate the use of expanded NFS trees (i.e., compute node root filesystems that are served file-by-file by the NFS server) in favor of image objects (squashfs images sitting in sharded volumes). I think what caused us trouble with Ganesha a couple of years ago was the huge metadata load, which should be greatly reduced. We will see!
Output from one test system if you're curious:

[root@leader1 ~]# gluster volume info

Volume Name: cm_logs
Type: Distributed-Replicate
Volume ID: 27ffa15b-9fed-4322-b591-225270ca9de5
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 3 = 18
Transport-type: tcp
Bricks:
Brick1: 172.23.0.3:/data/brick_cm_logs
Brick2: 172.23.0.2:/data/brick_cm_logs
Brick3: 172.23.0.4:/data/brick_cm_logs
Brick4: 172.23.0.5:/data/brick_cm_logs
Brick5: 172.23.0.6:/data/brick_cm_logs
Brick6: 172.23.0.7:/data/brick_cm_logs
Brick7: 172.23.0.8:/data/brick_cm_logs
Brick8: 172.23.0.9:/data/brick_cm_logs
Brick9: 172.23.0.10:/data/brick_cm_logs
Brick10: 172.23.0.11:/data/brick_cm_logs
Brick11: 172.23.0.12:/data/brick_cm_logs
Brick12: 172.23.0.13:/data/brick_cm_logs
Brick13: 172.23.0.14:/data/brick_cm_logs
Brick14: 172.23.0.15:/data/brick_cm_logs
Brick15: 172.23.0.16:/data/brick_cm_logs
Brick16: 172.23.0.17:/data/brick_cm_logs
Brick17: 172.23.0.18:/data/brick_cm_logs
Brick18: 172.23.0.19:/data/brick_cm_logs
Options Reconfigured:
nfs.auth-cache-ttl-sec: 360
nfs.auth-refresh-interval-sec: 360
nfs.mount-rmtab: /-
nfs.exports-auth-enable: on
nfs.export-dirs: on
nfs.export-volumes: on
nfs.nlm: off
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off

Volume Name: cm_obj_sharded
Type: Distributed-Replicate
Volume ID: 311bee36-09af-4d68-9180-b34b45e3c10b
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 3 = 18
Transport-type: tcp
Bricks:
Brick1: 172.23.0.3:/data/brick_cm_obj_sharded
Brick2: 172.23.0.2:/data/brick_cm_obj_sharded
Brick3: 172.23.0.4:/data/brick_cm_obj_sharded
Brick4: 172.23.0.5:/data/brick_cm_obj_sharded
Brick5: 172.23.0.6:/data/brick_cm_obj_sharded
Brick6: 172.23.0.7:/data/brick_cm_obj_sharded
Brick7: 172.23.0.8:/data/brick_cm_obj_sharded
Brick8: 172.23.0.9:/data/brick_cm_obj_sharded
Brick9: 172.23.0.10:/data/brick_cm_obj_sharded
Brick10: 172.23.0.11:/data/brick_cm_obj_sharded
Brick11: 172.23.0.12:/data/brick_cm_obj_sharded
Brick12: 172.23.0.13:/data/brick_cm_obj_sharded
Brick13: 172.23.0.14:/data/brick_cm_obj_sharded
Brick14: 172.23.0.15:/data/brick_cm_obj_sharded
Brick15: 172.23.0.16:/data/brick_cm_obj_sharded
Brick16: 172.23.0.17:/data/brick_cm_obj_sharded
Brick17: 172.23.0.18:/data/brick_cm_obj_sharded
Brick18: 172.23.0.19:/data/brick_cm_obj_sharded
Options Reconfigured:
features.shard: on
nfs.auth-cache-ttl-sec: 360
nfs.auth-refresh-interval-sec: 360
server.event-threads: 32
performance.io-thread-count: 32
nfs.mount-rmtab: /-
transport.listen-backlog: 16384
nfs.exports-auth-enable: on
nfs.export-dirs: on
nfs.export-volumes: on
nfs.nlm: off
performance.nfs.io-cache: on
performance.cache-refresh-timeout: 60
performance.flush-behind: on
performance.cache-size: 8GB
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: off
performance.client-io-threads: on

Volume Name: cm_shared
Type: Distributed-Replicate
Volume ID: 38093b8e-e668-4542-bc5e-34ffc491311a
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 3 = 18
Transport-type: tcp
Bricks:
Brick1: 172.23.0.3:/data/brick_cm_shared
Brick2: 172.23.0.2:/data/brick_cm_shared
Brick3: 172.23.0.4:/data/brick_cm_shared
Brick4: 172.23.0.5:/data/brick_cm_shared
Brick5: 172.23.0.6:/data/brick_cm_shared
Brick6: 172.23.0.7:/data/brick_cm_shared
Brick7: 172.23.0.8:/data/brick_cm_shared
Brick8: 172.23.0.9:/data/brick_cm_shared
Brick9: 172.23.0.10:/data/brick_cm_shared
Brick10: 172.23.0.11:/data/brick_cm_shared
Brick11: 172.23.0.12:/data/brick_cm_shared
Brick12: 172.23.0.13:/data/brick_cm_shared
Brick13: 172.23.0.14:/data/brick_cm_shared
Brick14: 172.23.0.15:/data/brick_cm_shared
Brick15: 172.23.0.16:/data/brick_cm_shared
Brick16: 172.23.0.17:/data/brick_cm_shared
Brick17: 172.23.0.18:/data/brick_cm_shared
Brick18: 172.23.0.19:/data/brick_cm_shared
Options Reconfigured:
performance.client-io-threads: on
Re: [Gluster-users] gluster update question regarding new DNS resolution requirement
There is a discussion in -devel as well. I came at this just thinking "an update should work" and did take a quick look at the release notes for 9.0 and 9.3. Come to think of it, I didn't read the Gluster 8 release notes, so maybe that's why I missed this. We were at 7.9 and I read 9.0 and 9.3.

We can't really disable IPV6 100% here. Well, we could today, but we'd have to open it again in a couple of months. Our main head node already needs to talk to some IPV6-only stuff while also talking to IPV4 stuff. These leaders (gluster servers) will need to speak IPV6 very soon, at least minimally. Some controllers are starting to appear, which these 'leader' nodes need to talk to, that are IPV6-only.

It sounds like what you wrote is true, though: if there is any IPV6 around, that function assumes IPV6 is what you want. A couple of private replies (thank you!!) also mentioned this. Maybe we'll have to make a more formal version of the patch, rather than just force-setting IPV4 (for our internal use), later on.

Basically, I am in the "once in a year" window where I can update gluster and get complete testing to be sure we don't have regressions, so we'll keep moving forward with 9.3 with the IPV4 hack in place for now.

This helps me get the context, thank you for this note!!

Erik

On Tue, Sep 21, 2021 at 02:44:36PM +0000, Strahil Nikolov wrote:
> As gf_resolve_ip6 fails, I guess you can disable ipv6 on the host (if not
> using the protocol) and check if it will workaround the problem till it's
> solved.
>
> For RH you can check https://access.redhat.com/solutions/8709 (use RH dev
> subscription to read it, or ping me directly and I will try to summarize it
> for your OS version).
>
> Best Regards,
> Strahil Nikolov
>
> On Mon, Sep 20, 2021 at 19:35, Erik Jacobson wrote:
> I missed the other important log snip:
>
> The message "E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6]
> 0-resolver: error in getaddrinfo [{family=10}, {ret=Address family for
> hostname not supported}]" repeated 620 times between [2021-09-20
> 15:49:23.720633 +0000] and [2021-09-20 15:50:41.731542 +0000]
>
> So I will dig in to the code some here.
>
> On Mon, Sep 20, 2021 at 10:59:30AM -0500, Erik Jacobson wrote:
> > Hello all! I hope you are well.
> >
> > We are starting a new software release cycle and I am trying to find a
> > way to upgrade customers from our build of gluster 7.9 to our build of
> > gluster 9.3.
> >
> > When we deploy gluster, we forcibly remove all references to any host
> > names and use only IP addresses. This is because, if for any reason a
> > DNS server is unreachable, even if the peer files have IPs and DNS, it
> > causes glusterd to be unable to reach peers properly. We can't really
> > rely on /etc/hosts either, because customers take artistic license with
> > their /etc/hosts files and don't realize the problems that can cause.
> >
> > So our deployed peer files look something like this:
> >
> > uuid=46a4b506-029d-4750-acfb-894501a88977
> > state=3
> > hostname1=172.23.0.16
> >
> > That is, with full intention, we avoid host names.
> >
> > When we upgrade to gluster 9.3, we fall over with these errors and
> > gluster is now partitioned and the updated gluster servers can't reach
> > anybody:
> >
> > [2021-09-20 15:50:41.731543 +0000] E
> > [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
> > resolution failed on host 172.23.0.16
> >
> > As you can see, we have defined everything using IPs on purpose, but in
> > 9.3 it appears this method fails. Are there any suggestions short of
> > putting real host names in peer files?
> >
> > FYI
> >
> > This supercomputer will be using gluster for part of its system
> > management. It is how we deploy the Image Objects (squashfs images)
> > hosted on NFS today and served by gluster leader nodes, and also store
> > system logs, console logs, and other data.
> >
> > https://www.olcf.ornl.gov/frontier/
> >
> > Erik
> >
> > Community Meeting Calendar:
> >
> > Schedule -
> > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> > Bridge: https://meet.google.com/cpu-eiue-hvk
> > Gluster-users mailing list
> > Gluster-users@gluster.org
> > https://lists.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] gluster forcing IPV6 on our IPV4 servers, glusterd fails (was gluster update question regarding new DNS resolution requirement)
> Don't forget to run the geo-replication fix script, if you missed to do it
> before the upgrade.

We don't use geo-replication YET, but thank you for this thoughtful reminder.

Just a note on things like this: we really try to do everything in a package update, because that's how we'd have to deploy to customers in an automated way. So having to run a script as part of the upgrade would be very hard in a package-based workflow for a packaged solution. I'm not complaining, I love gluster, but this is just food for thought. I can hardly even say it with a straight face, because we suffer from similar issues on the cluster management side: updating one CM to the next is harder than it should be, so I'm certainly not judging. Updating is always painful.

I LOVE that slowly updating our gluster servers is "just working". This will allow a supercomputer to slowly update its infrastructure while taking no compute nodes (using nfs-hosted squashfs images for root) down. It's really remarkable since it's a big jump too, 7.9 to 9.3. I am impressed by this part. It's a huge relief that I didn't have to do an intermediate jump to gluster 8 in the middle, as that would have been nearly impossible for us to get right.

Thank you all!!

PS: Frontier will have 21 leader nodes running gluster servers, distributed/replicate in groups of 3, hosting nfs-exported squashfs image objects for compute node root filesystems. Many thousands of nodes.

> Best Regards,
> Strahil Nikolov
>
> On Tue, Sep 21, 2021 at 0:46, Erik Jacobson wrote:
> I pretended I'm a low-level C programmer with network and filesystem
> experience for a few hours.
>
> I'm not sure what the right solution is, but what was happening was that
> the code was trying to treat our IPV4 hosts as AF_INET6, and the family
> was incompatible with our IPV4 IP addresses. Yes, we need to move to
> IPV6, but we're hoping to do that on our own time (~50 years like
> everybody else :)
>
> I found a chunk of the code that seemed to be force-setting us to
> AF_INET6.
>
> While I'm sure it is not 100% the correct patch, the patch attached and
> pasted below is working for me, so I'll integrate it with our internal
> build to continue testing.
>
> Please let me know if there is a configuration item I missed or a
> different way to do this. I added -devel to this email.
>
> In the previous thread, you would have seen that we're testing a
> hopeful change that will upgrade our deployed customers from gluster
> 7.9 to gluster 9.3.
>
> Thank you!! Advice on next steps would be appreciated!!
>
> diff -Narup glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c
> --- glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c	2021-06-29 00:27:44.381408294 -0500
> +++ glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c	2021-09-20 16:34:28.969425361 -0500
> @@ -252,9 +252,16 @@ af_inet_client_get_remote_sockaddr(rpc_t
>      /* Need to update transport-address family if address-family is not
>         provided to command-line arguments
>      */
> +    /* HPE: This is forcing our IPV4 servers into an IPV6 address
> +     * family that is not compatible with IPV4. For now we will just
> +     * set it to AF_INET.
> +     */
> +    /*
>      if (inet_pton(AF_INET6, remote_host, )) {
>          sockaddr->sa_family = AF_INET6;
>      }
> +    */
> +    sockaddr->sa_family = AF_INET;
>
>      /* TODO: gf_resolve is a blocking call. kick in some
>         non blocking dns techniques */
>
> On Mon, Sep 20, 2021 at 11:35:35AM -0500, Erik Jacobson wrote:
> > I missed the other important log snip:
> >
> > The message "E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6]
> > 0-resolver: error in getaddrinfo [{family=10}, {ret=Address family for
> > hostname not supported}]" repeated 620 times between [2021-09-20
> > 15:49:23.720633 +0000] and [2021-09-20 15:50:41.731542 +0000]
> >
> > So I will dig in to the code some here.
> >
> > On Mon, Sep 20, 2021 at 10:59:30AM -0500, Erik Jacobson wrote:
> > > Hello all! I hope you are well.
> > >
> > > We are starting a new software release cycle and I am trying to find a
> > > way to upgrade customers from our build of gluster 7.9 to our build of
> > > gluster 9.3.
> > >
> > > When we deploy gluster, we forcibly remove all references to any host
> > > names and use only IP addresses. This is
[Gluster-users] gluster forcing IPV6 on our IPV4 servers, glusterd fails (was gluster update question regarding new DNS resolution requirement)
I pretended I'm a low-level C programmer with network and filesystem experience for a few hours.

I'm not sure what the right solution is, but what was happening was that the code was trying to treat our IPV4 hosts as AF_INET6, and the family was incompatible with our IPV4 IP addresses. Yes, we need to move to IPV6, but we're hoping to do that on our own time (~50 years like everybody else :)

I found a chunk of the code that seemed to be force-setting us to AF_INET6.

While I'm sure it is not 100% the correct patch, the patch attached and pasted below is working for me, so I'll integrate it with our internal build to continue testing.

Please let me know if there is a configuration item I missed or a different way to do this. I added -devel to this email.

In the previous thread, you would have seen that we're testing a hopeful change that will upgrade our deployed customers from gluster 7.9 to gluster 9.3.

Thank you!! Advice on next steps would be appreciated!!

diff -Narup glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c
--- glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c	2021-06-29 00:27:44.381408294 -0500
+++ glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c	2021-09-20 16:34:28.969425361 -0500
@@ -252,9 +252,16 @@ af_inet_client_get_remote_sockaddr(rpc_t
     /* Need to update transport-address family if address-family is not
        provided to command-line arguments
     */
+    /* HPE: This is forcing our IPV4 servers into an IPV6 address
+     * family that is not compatible with IPV4. For now we will just
+     * set it to AF_INET.
+     */
+    /*
     if (inet_pton(AF_INET6, remote_host, )) {
         sockaddr->sa_family = AF_INET6;
     }
+    */
+    sockaddr->sa_family = AF_INET;

     /* TODO: gf_resolve is a blocking call. kick in some
        non blocking dns techniques */

On Mon, Sep 20, 2021 at 11:35:35AM -0500, Erik Jacobson wrote:
> I missed the other important log snip:
>
> The message "E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6]
> 0-resolver: error in getaddrinfo [{family=10}, {ret=Address family for
> hostname not supported}]" repeated 620 times between [2021-09-20
> 15:49:23.720633 +0000] and [2021-09-20 15:50:41.731542 +0000]
>
> So I will dig in to the code some here.
>
> On Mon, Sep 20, 2021 at 10:59:30AM -0500, Erik Jacobson wrote:
> > Hello all! I hope you are well.
> >
> > We are starting a new software release cycle and I am trying to find a
> > way to upgrade customers from our build of gluster 7.9 to our build of
> > gluster 9.3.
> >
> > When we deploy gluster, we forcibly remove all references to any host
> > names and use only IP addresses. This is because, if for any reason a
> > DNS server is unreachable, even if the peer files have IPs and DNS, it
> > causes glusterd to be unable to reach peers properly. We can't really
> > rely on /etc/hosts either, because customers take artistic license with
> > their /etc/hosts files and don't realize the problems that can cause.
> >
> > So our deployed peer files look something like this:
> >
> > uuid=46a4b506-029d-4750-acfb-894501a88977
> > state=3
> > hostname1=172.23.0.16
> >
> > That is, with full intention, we avoid host names.
> >
> > When we upgrade to gluster 9.3, we fall over with these errors and
> > gluster is now partitioned and the updated gluster servers can't reach
> > anybody:
> >
> > [2021-09-20 15:50:41.731543 +0000] E
> > [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
> > resolution failed on host 172.23.0.16
> >
> > As you can see, we have defined everything using IPs on purpose, but in
> > 9.3 it appears this method fails. Are there any suggestions short of
> > putting real host names in peer files?
> >
> > FYI
> >
> > This supercomputer will be using gluster for part of its system
> > management. It is how we deploy the Image Objects (squashfs images)
> > hosted on NFS today and served by gluster leader nodes, and also store
> > system logs, console logs, and other data.
> >
> > https://www.olcf.ornl.gov/frontier/
> >
> > Erik
Re: [Gluster-users] gluster update question regarding new DNS resolution requirement
I missed the other important log snip:

The message "E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=10}, {ret=Address family for hostname not supported}]" repeated 620 times between [2021-09-20 15:49:23.720633 +0000] and [2021-09-20 15:50:41.731542 +0000]

So I will dig into the code some here.

On Mon, Sep 20, 2021 at 10:59:30AM -0500, Erik Jacobson wrote:
> Hello all! I hope you are well.
>
> We are starting a new software release cycle and I am trying to find a
> way to upgrade customers from our build of gluster 7.9 to our build of
> gluster 9.3.
>
> When we deploy gluster, we forcibly remove all references to any host
> names and use only IP addresses. This is because, if for any reason a
> DNS server is unreachable, even if the peer files have IPs and DNS, it
> causes glusterd to be unable to reach peers properly. We can't really
> rely on /etc/hosts either, because customers take artistic license with
> their /etc/hosts files and don't realize the problems that can cause.
>
> So our deployed peer files look something like this:
>
> uuid=46a4b506-029d-4750-acfb-894501a88977
> state=3
> hostname1=172.23.0.16
>
> That is, with full intention, we avoid host names.
>
> When we upgrade to gluster 9.3, we fall over with these errors and
> gluster is now partitioned and the updated gluster servers can't reach
> anybody:
>
> [2021-09-20 15:50:41.731543 +0000] E
> [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
> resolution failed on host 172.23.0.16
>
> As you can see, we have defined everything using IPs on purpose, but in
> 9.3 it appears this method fails. Are there any suggestions short of
> putting real host names in peer files?
>
> FYI
>
> This supercomputer will be using gluster for part of its system
> management. It is how we deploy the Image Objects (squashfs images)
> hosted on NFS today and served by gluster leader nodes, and also store
> system logs, console logs, and other data.
>
> https://www.olcf.ornl.gov/frontier/
>
> Erik
[Gluster-users] gluster update question regarding new DNS resolution requirement
Hello all! I hope you are well.

We are starting a new software release cycle and I am trying to find a way to upgrade customers from our build of gluster 7.9 to our build of gluster 9.3.

When we deploy gluster, we forcibly remove all references to any host names and use only IP addresses. This is because, if for any reason a DNS server is unreachable, even if the peer files have IPs and DNS, it causes glusterd to be unable to reach peers properly. We can't really rely on /etc/hosts either, because customers take artistic license with their /etc/hosts files and don't realize the problems that can cause.

So our deployed peer files look something like this:

uuid=46a4b506-029d-4750-acfb-894501a88977
state=3
hostname1=172.23.0.16

That is, with full intention, we avoid host names.

When we upgrade to gluster 9.3, we fall over with these errors, and gluster is now partitioned and the updated gluster servers can't reach anybody:

[2021-09-20 15:50:41.731543 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host 172.23.0.16

As you can see, we have defined everything using IPs on purpose, but in 9.3 it appears this method fails. Are there any suggestions short of putting real host names in peer files?

FYI

This supercomputer will be using gluster for part of its system management. It is how we deploy the Image Objects (squashfs images) hosted on NFS today and served by gluster leader nodes, and also store system logs, console logs, and other data.

https://www.olcf.ornl.gov/frontier/

Erik
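For anyone who wants to reproduce the failure mode outside of gluster: here is a small standalone demo (my own sketch, not gluster code; 172.23.0.16 is one of the peer addresses from the report). It shows that getaddrinfo() asked for family AF_INET6 (family=10 on Linux) rejects an IPv4 literal, while AF_UNSPEC with AI_NUMERICHOST handles the literal locally, with no DNS involved, and reports the correct family; resolving that way, rather than hard-coding a family, would be one hypothetical direction for a more formal patch:

```c
/* Standalone sketch reproducing the "Address family for hostname not
 * supported" failure and showing the family-agnostic alternative. */
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <string.h>
#include <assert.h>
#include <stdio.h>

int main(void)
{
    struct addrinfo hints, *res = NULL;

    /* Forcing AF_INET6 on an IPv4 literal fails, as in the glusterd log. */
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET6;
    hints.ai_flags = AI_NUMERICHOST;
    assert(getaddrinfo("172.23.0.16", NULL, &hints, &res) != 0);

    /* AF_UNSPEC lets the literal pick its own family: AF_INET here.
     * AI_NUMERICHOST guarantees DNS is never consulted for literals,
     * which matches the "IPs only, no DNS dependency" deployment goal. */
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_flags = AI_NUMERICHOST;
    assert(getaddrinfo("172.23.0.16", NULL, &hints, &res) == 0);
    assert(res->ai_family == AF_INET);
    freeaddrinfo(res);

    printf("ok\n");
    return 0;
}
```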
Re: [Gluster-users] Gluster usage scenarios in HPC cluster management
> I still have to grasp the "leader node" concept.
> Weren't gluster nodes "peers"? Or by "leader" you mean that it's
> mentioned in the fstab entry like
> /l1,l2,l3:gv0 /mnt/gv0 glusterfs defaults 0 0
> while the peer list includes l1,l2,l3 and a bunch of other nodes?

Right, it's a list of 24 peers. The 24 peers are split into an 8 x 3 replicated/distributed setup for the volumes. They also have entries for themselves as clients in /etc/fstab. I'll dump some volume info at the end of this.

> > So we would have 24 leader nodes, each leader would have a disk serving
> > 4 bricks (one of which is simply a lock FS for CTDB, one is sharded,
> > one is for logs, and one is heavily optimized for non-object expanded
> > tree NFS). The term "disk" is loose.
> That's a system way bigger than ours (3 nodes, replica 3 arbiter 1, up to
> 36 bricks per node).

I have one dedicated "disk" (could be a disk, a RAID LUN, or a single SSD) and 4 directories for volumes ("bricks"). Of course, the "ctdb" volume is just for the lock and has a single file.

> > Specs of a leader node at a customer site:
> > * 256G RAM
> Glip! 256G for 4 bricks... No wonder I have had troubles running 26
> bricks in 64GB RAM... :)

I'm not an expert in memory pools or how they would be impacted by more peers. I had to do a little research, and I think what you're after is whether I can run "gluster volume status cm_shared mem" on a real cluster that has a decent node count. I will see if I can do that.

TEST ENV INFO for those who care

Here is some info on my own test environment, which you can skip. I have the environment duplicated on my desktop using virtual machines and it runs fine (slow but fine). It's a 3x1. I take out my giant 8GB cache from the optimized volumes, but other than that it is fine. In my development environment, the gluster disk is a 40G qcow2 image. Cache sizes changed from 8G to 100M to fit in the VM.

XML snips for memory, cpus:

<name>cm-leader1</name>
<uuid>99d5a8fc-a32c-b181-2f1a-2929b29c3953</uuid>
<memory>3268608</memory>
<currentMemory>3268608</currentMemory>
<vcpu>2</vcpu>
..
I have 1 admin (head) node VM, 3 VM leader nodes like above, and one test compute node for my development environment.

My desktop where I test this cluster stack is a beefy but not brand new desktop:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Stepping:            1
CPU MHz:             2594.333
CPU max MHz:         3000.
CPU min MHz:         1200.
BogoMIPS:            4190.22
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0-15

(Not that it matters, but this is an HP Z640 Workstation.)

128G memory (good for a desktop, I know, but I think 64G would work since I also run a Windows 10 VM environment for unrelated reasons).

I was able to find a MegaRAID in the lab a few years ago, so I have 4 drives in a MegaRAID and carve off a separate volume for the VM disk images. It has a cache. So that's also more beefy than a normal desktop. (On the other hand, I have no SSDs. I may experiment with that some day, but things work so well now I'm tempted to leave it until something croaks :)

I keep all VMs for the test cluster in "unsafe" cache mode since there is no true data to worry about and it makes the test cases faster.

So I am able to test a complete cluster management stack, including 3 gluster leader servers, an admin node, and a compute node, all on my desktop using virtual machines and shared networks within libvirt/qemu. It is so much easier to do development when you don't have to reserve scarce test clusters and compete with people. I can do 90% of my cluster development work this way. Things fall over when I need to care about BMCs/iLOs or need to do performance testing, of course.
Then I move to real hardware and play the hunger-games-of-internal-test-resources :) :)

I mention all this just to show that beefy servers are not needed and the memory usage is not high. I'm not continually swapping or anything like that.

Configuration Info from Real Machine

Some info on an active 3x3 cluster. 2738 compute nodes. The most active volume here is "cm_obj_sharded". It is where the image objects live, and this cluster uses image objects for compute node root filesystems. I changed the IP addresses by hand (in case I made an error doing that).

Memory status for volume : cm_obj_sharded
--
Brick : 10.1.0.5:/data/brick_cm_obj_sharded
Mallinfo
Arena    : 20676608
Ordblks  : 2077
Smblks   : 518
Hblks    : 17
Hblkhd   :
Re: [Gluster-users] Gluster usage scenarios in HPC cluster management
> > The stuff I work on doesn't use containers much (unlike a different
> > system also at HPE).
> By "pods" I meant "glusterd instance", a server hosting a collection of
> bricks.

Oh, OK. The term is overloaded in my world.

> > I don't have a recipe, they've just always been beefy enough for
> > gluster. Sorry I don't have a more scientific answer.
> Seems that 64GB RAM are not enough for a pod with 26 glusterfsd
> instances and no other services (except sshd for management). What do
> you mean by "beefy enough"? 128GB RAM or 1TB?

We are currently using replica-3 but may also support replica-5 in the future. So if you had 24 leaders like HLRS, there would be 8 replica-3 groups at the bottom layer, and then distribution across them (replicated/distributed volumes).

So we would have 24 leader nodes; each leader would have a disk serving 4 bricks (one of which is simply a lock FS for CTDB, one is sharded, one is for logs, and one is heavily optimized for non-object expanded-tree NFS). The term "disk" is loose.

So each SU Leader (or gluster server) serving the 4 volumes in the 8x3 configuration, in our world, has some differences in CPU type, memory, and storage depending on order, preferences, and timing (things always move forward).

On an SU Leader, we typically do 2 RAID10 volumes with a RAID controller including cache. However, we have moved to RAID1 in some cases with better disks. Leaders store a lot of non-gluster stuff on "root", and then gluster has a dedicated disk/LUN.

We have been trying to improve our helper tools to 100% wheel out a bad leader (say it melted into the floor) and replace it. Once we have that solid, and because our monitoring data on the "root" drive is already redundant, we plan to move newer servers to two NVMe drives without RAID: one for gluster and one for the OS.
If a leader melts into the floor, we have a procedure to discover a new node for it, install the base OS including gluster/CTDB/etc., and then run a tool to re-integrate it into the cluster as an SU Leader node again and do the healing. Separately, monitoring data outside of gluster will heal.

PS: I will note that I have a mini-SU-leader cluster on my desktop (qemu/libvirt) for development. It is a 1x3 set of SU Leaders, one head node, and one compute node. I make an adjustment to reduce the gluster cache to fit in the memory space. Works fine. Not real fast, but good enough for development.

Specs of a leader node at a customer site:
* 256G RAM
* Storage:
  - MR9361-8i controller
  - 7681GB root LUN (RAID1)
  - 15.4 TB for gluster bricks (RAID10)
  - 6 SATA SSD MZ7LH7T6HMLA-5
* AMD EPYC 7702 64-Core Processor
  - CPU(s): 128
  - On-line CPU(s) list: 0-127
  - Thread(s) per core: 2
  - Core(s) per socket: 64
  - Socket(s): 1
  - NUMA node(s): 4
* Management Ethernet
  - Gluster and cluster management co-mingled
  - 2x40G (but 2x10G would be fine)
Re: [Gluster-users] Gluster usage scenarios in HPC cluster management
The stuff I work on doesn't use containers much (unlike a different system also at HPE).

Leaders are over-sized, but the sizing is largely associated with all the other stuff leaders do, not just gluster. That said, my gluster settings for the expanded NFS tree (as opposed to squashfs image files on NFS) method use heavy caching; I believe the max was 8G.

I don't have a recipe, they've just always been beefy enough for gluster. Sorry I don't have a more scientific answer.

On Mon, Mar 22, 2021 at 02:24:17PM +0100, Diego Zuccato wrote:
> On 19/03/2021 16:03, Erik Jacobson wrote:
> > A while back I was asked to make a blog or something similar to discuss
> > the use cases the team I work on (HPCM cluster management) at HPE.
> Thanks for the article.
>
> I just miss a bit of information: how are you sizing CPU/RAM for pods?
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
Re: [Gluster-users] Gluster usage scenarios in HPC cluster management
> But I've also tested using tmpfs (allocating half of the RAM per compute node)
> and exporting that as a distributed stripped GlusterFS volume over NFS over
> RDMA to the 100 Gbps IB network so that the "ramdrives" can be used as a high
> speed "scratch disk space" that doesn't have the write endurance limits that
> NAND based flash memory SSDs have.

In my world, we leave the high speed networks to jobs, so I don't have much to offer. In our test SU Leader setup where we may not have disks, we do carve gluster bricks out of TMPFS mounts. However, in that test case, designed to test the tooling and not the workload, I use iscsi to emulate disks to test the true solution.

I will just mention that the cluster manager use of squashfs image objects sitting on NFS mounts is very fast even on top of 20G (2x10G) mgmt infrastructure. If you combine it with a TMPFS overlay, which is our default, you will have a writable area in TMPFS that doesn't persist. You will have low memory usage.

For a 4-node cluster, you probably don't even need to bother with squashfs; just mount the directory tree for the image at the right time. By using a tmpfs overlay and some post-boot configuration, you can perhaps avoid the memory usage of what you are doing. As long as you don't need to beat the crap out of root, an NFS root is fine and using gluster-backed disks is fine.

Note that if you use exported trees with gnfs instead of image objects, there are lots of volume tweaks you can make to push efficiency up. For squashfs, I used a sharded volume.

It's easy for me to write this since we have the install environment. While nothing is "hard" in there, it's a bunch of code developed over time. That said, if you wanted to experiment, I can share some pieces of what we do. I just fear it's too complicated.
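The squashfs-plus-TMPFS-overlay combination described above can be sketched roughly as follows. This is an illustrative sequence, not our actual tooling; the NFS export name and all paths are made up:

```shell
#!/bin/bash
# Read-only root from a squashfs object on an NFS-exported gluster volume,
# with a non-persistent tmpfs writable layer merged in via overlayfs.
# All paths/export names below are hypothetical examples.

mount -t nfs leader-alias:/images /mnt/images

mkdir -p /rootfs.ro /rootfs.rw /rootfs
mount -t squashfs -o loop /mnt/images/compute.squashfs /rootfs.ro

# Writable layer lives in tmpfs: lost at reboot, and memory is consumed
# only by what actually gets written.
mount -t tmpfs -o size=2g tmpfs /rootfs.rw
mkdir -p /rootfs.rw/upper /rootfs.rw/work
mount -t overlay overlay \
    -o lowerdir=/rootfs.ro,upperdir=/rootfs.rw/upper,workdir=/rootfs.rw/work \
    /rootfs
```

The overlay root at /rootfs then behaves like a normal writable filesystem while the squashfs image stays read-only and shared.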
I will note that some customers advocate for a tiny root - say 1.5G - that could fit in TMPFS easily, and then attach in workloads (other filesystems with development environments over the network, or container environments, etc). That would be another way to keep memory use low for a diskless cluster.

(We use gnfs because we're not ready to switch to ganesha yet. It's on our list to move if we can get it working for our load.)

> Yes, it isn't as reliable or certainly not high availability (power goes down,
> and the battery backup is exhausted, then the data is lost because it sat in
> RAM), but it's to solve the problems that mechanically rotating hard drives are
> too slow, NAND flash based SSDs have finite write endurance limits, and RAM
> drives, whilst in theory faster, are also the most expensive on a $/GB basis
> compared to the other storage solutions.
>
> It's rather unfortunate that you have these different "tiers" of storage, and
> there's really nothing else in between that can help address all of these
> issues simultaneously.
>
> Thank you for sharing your thoughts.
>
> Sincerely,
>
> Ewen Chan
>
> From: gluster-users-boun...@gluster.org on behalf of Erik Jacobson
> Sent: March 19, 2021 11:03 AM
> To: gluster-users@gluster.org
> Subject: [Gluster-users] Gluster usage scenarios in HPC cluster management
>
> A while back I was asked to make a blog or something similar to discuss
> the use cases the team I work on (HPCM cluster management) at HPE.
>
> If you are not interested in reading about what I'm up to, just delete
> this and move on.
>
> I really don't have a public blogging mechanism so I'll just describe
> what we're up to here. Some of this was posted in some form in the past.
> Since this contains the raw materials, I could make a wiki-ized version
> if there were a public place to put it.
>
> We currently use gluster in two parts of cluster management.
> > In fact, gluster in our management node infrastructure is helping us to > provide scaling and consistency to some of the largest clusters in the > world, clusters in the TOP100 list. While I can get in to trouble by > sharing too much, I will just say that trends are continuing and the > future may have some exciting announcements on where on TOP100 certain > new giant systems may end up in the coming 1-2 years. > > At HPE, HPCM is the "traditional cluster manager." There is another team > that develops a more cloud-like solution and I am not discussing that > solution here. > > > Use Case #1: Leader Nodes and Scale Out > -- > - Why? > * Scale out > * Redundancy (combined with CTDB, any leader can fail) > * Consistency (All servers and compute agree on what the content is) > > - Cluster manager has an admin or head no
Re: [Gluster-users] Gluster usage scenarios in HPC cluster management
> - Gluster sizing
>   * We typically state compute nodes per leader but this is not for
>     gluster per-se. Squashfs image objects are very efficient and
>     probably would be fine for 2k nodes per leader. Leader nodes provide
>     other services including console logs, system logs, and monitoring
>     services.

I tried to avoid typos and mistakes but I missed something above. Argues for a wiki, right? :) I missed "512" :)

* We typically state 512 compute nodes per leader but this is not for
  gluster per-se. Squashfs image objects are very efficient and
  probably would be fine for 2k nodes per leader. Leader nodes provide
  other services including console logs, system logs, and monitoring
  services.
[Gluster-users] Gluster usage scenarios in HPC cluster management
A while back I was asked to make a blog or something similar to discuss the use cases the team I work on (HPCM cluster management) at HPE.

If you are not interested in reading about what I'm up to, just delete this and move on.

I really don't have a public blogging mechanism so I'll just describe what we're up to here. Some of this was posted in some form in the past. Since this contains the raw materials, I could make a wiki-ized version if there were a public place to put it.

We currently use gluster in two parts of cluster management.

In fact, gluster in our management node infrastructure is helping us to provide scaling and consistency to some of the largest clusters in the world, clusters in the TOP100 list. While I can get into trouble by sharing too much, I will just say that trends are continuing and the future may have some exciting announcements on where on TOP100 certain new giant systems may end up in the coming 1-2 years.

At HPE, HPCM is the "traditional cluster manager." There is another team that develops a more cloud-like solution and I am not discussing that solution here.

Use Case #1: Leader Nodes and Scale Out
--
- Why?
  * Scale out
  * Redundancy (combined with CTDB, any leader can fail)
  * Consistency (all servers and compute agree on what the content is)
- Cluster manager has an admin or head node and zero or more leader nodes
- Leader nodes are provisioned in groups of 3 to use distributed replica-3
  volumes (although at least one customer has interest in replica-5)
- We configure a few different volumes for different use cases
- We use Gluster NFS still because, over a year ago, Ganesha was not working
  with our workload and we haven't had time to re-test and engage with the
  community. No blame - we would also owe making sure our settings are right.
- We use CTDB for a measure of HA and IP alias management. We use this
  instead of pacemaker to reduce complexity.
- The volume use cases are:
  * Image sharing for diskless compute nodes (sometimes 6,000 nodes)
    -> Normally squashfs image files for speed/efficiency, exported over NFS
    -> Expanded ("chrootable") traditional NFS trees for people who prefer
       that, but they don't scale as well and are slower to boot
    -> Squashfs images sit on a sharded volume, while traditional gluster is
       used for the expanded tree.
  * TFTP/HTTP for network boot/PXE including miniroot
    -> Spread across leaders too, so one node is not saturated with PXE/DHCP
       requests
    -> Miniroot is a "fatter initrd" that has our CM toolchain
  * Logs/consoles
    -> For traditional logs and consoles (HPCM also uses
       elasticsearch/kafka/friends but we don't put that in gluster)
    -> Separate volume to have more non-cache-friendly settings
  * 4 total volumes used (one sharded, one heavily optimized for caching,
    one for the ctdb lock, and one traditional for logging/etc)
- Leader setup
  * Admin node installs the leaders like any other compute nodes
  * A setup tool runs that configures gluster volumes and CTDB
  * When ready, an admin/head node can be engaged with the leaders
  * At that point, certain paths on the admin become gluster fuse mounts and
    bind mounts to gluster fuse mounts.
- How images are deployed (squashfs mode)
  * User creates an image using image creation tools that make a chrootable
    tree style image on the admin/head node
  * mksquashfs generates a squashfs image file onto a shared storage gluster
    mount
  * Nodes mount the filesystem with the squashfs images and then loop mount
    the squashfs as part of the boot process.
- How compute nodes are tied to leaders
  * We simply have a variable in our database where human or automated
    discovery tools can assign a given node to a given IP alias.
    This works better for us than trying to play routing tricks or load
    balance tricks.
  * When leaders PXE, the DHCP response includes next-server and the compute
    node uses the leader IP alias for the tftp/http for getting the boot
    loader. DHCP config files are on shared storage to facilitate future
    scaling of DHCP services.
  * ipxe or grub2 network config files then fetch the kernel, initrd
  * initrd has a small update to load a miniroot (install environment) which
    has more tooling
  * Node is installed (for nodes with root disks) or does a network boot
    cycle.
- Gluster sizing
  * We typically state compute nodes per leader but this is not for gluster
    per-se. Squashfs image objects are very efficient and probably would be
    fine for 2k nodes per leader. Leader nodes provide other services
    including console logs, system logs, and monitoring services.
  * Our biggest deployment at a customer site right now has 24 leader nodes.
    Bigger systems are coming.
- Startup scripts
  - Getting all the gluster mounts and many bind mounts used in the solution, as well
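The squashfs deployment flow above (build an image object on the admin node, then loop mount it at boot) might look roughly like this. Every path and export name here is a made-up example, not the actual HPCM tool paths:

```shell
#!/bin/bash
# Illustrative sketch of the squashfs deployment steps; paths and names
# are hypothetical examples, not the real cluster manager layout.

# On the admin/head node: turn a chroot-style image tree into a squashfs
# object on the shared (sharded) gluster volume.
mksquashfs /var/lib/images/compute-sles15sp2 \
    /gluster/shared/images/compute-sles15sp2.squashfs -noappend

# On a booting compute node (inside the miniroot): mount the share via the
# leader IP alias assigned to this node, then loop mount the image.
mount -t nfs leader-alias:/shared /mnt/shared
mount -t squashfs -o loop \
    /mnt/shared/images/compute-sles15sp2.squashfs /sysroot
```

From there the boot process pivots into /sysroot (optionally with a tmpfs overlay on top for a writable layer).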
Re: [Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM
We think this fixed it. While there is random chance in there, we can't repeat it in 7.9. So I'll close this thread out for now. We'll ask for help again if needed. Thanks for all the kind responses, Erik On Fri, Jan 29, 2021 at 02:20:56PM -0600, Erik Jacobson wrote: > I updated to 7.9, rebooted everything, and it started working. > > I will have QE try to break it again and report back. I couldn't break > it but they're better at breaking things (which is hard to imagine :) > > > On Fri, Jan 29, 2021 at 01:11:50PM -0600, Erik Jacobson wrote: > > Thank you. > > > > We reproduced the problem after force-killing one of the 3 physical > > nodes 6 times in a row. > > > > At that point, the grub2 loaded off the qemu virtual hard drive, but > > could not find partitions. Since there is random luck involved, we don't > > actually know if it was the force-killing that caused it to stop > > working. > > > > When I start the VM with the image in this state, there is nothing > > interesting in the fuse log for the volume in /var/log/glusterfs on the > > node hosting the image. > > > > No pending heals (all servers report 0 entries to heal). > > > > The same VM behavior happens on all the physical nodes when I try to > > start with the same VM image. > > > > Something from the gluster fuse mount log from earlier shows: > > > > [2021-01-28 21:24:40.814227] I [MSGID: 114018] > > [client.c:2347:client_rpc_notify] 0-adminvm-client-0: disconnected from > > adminvm-client-0. 
Client process will keep trying to connect to glusterd > > until brick's port is available > > [2021-01-28 21:24:43.815120] I [rpc-clnt.c:1963:rpc_clnt_reconfig] > > 0-adminvm-client-0: changing port to 49152 (from 0) > > [2021-01-28 21:24:43.815833] I [MSGID: 114057] > > [client-handshake.c:1376:select_server_supported_programs] > > 0-adminvm-client-0: Using Program GlusterFS 4.x v1, Num (1298437), Version > > (400) > > [2021-01-28 21:24:43.817682] I [MSGID: 114046] > > [client-handshake.c:1106:client_setvolume_cbk] 0-adminvm-client-0: > > Connected to adminvm-client-0, attached to remote volume > > '/data/brick_adminvm'. > > [2021-01-28 21:24:43.817709] I [MSGID: 114042] > > [client-handshake.c:930:client_post_handshake] 0-adminvm-client-0: 1 fds > > open - Delaying child_up until they are re-opened > > [2021-01-28 21:24:43.895163] I [MSGID: 114041] > > [client-handshake.c:318:client_child_up_reopen_done] 0-adminvm-client-0: > > last fd open'd/lock-self-heal'd - notifying CHILD-UP > > The message "W [MSGID: 114061] [client-common.c:2893:client_pre_lk_v2] > > 0-adminvm-client-0: (94695bdb-06b4-4105-9bc8-b8207270c941) remote_fd is > > -1. EBADFD [File descriptor in bad state]" repeated 6 times between > > [2021-01-28 21:23:54.395811] and [2021-01-28 21:23:54.811640] > > > > > > But that was a long time ago. > > > > Brick logs have an entry from when I first started the vm today (the > > problem was reproduced yesterday) all brick logs have something similar. 
> > Nothing appeared on the several other startup attempts of the VM: > > > > [2021-01-28 21:24:45.460147] I [MSGID: 115029] > > [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client > > from > > CTX_ID:613f0d91-34e6-4495-859f-bca1c9f7af01-GRAPH_ID:0-PID:6287-HOST:nano-1-PC_NAME:adminvm-client-2-RECON_NO:-0 > > (version: 7.2) with subvol /data/brick_adminvm > > [2021-01-29 18:54:45.48] I [addr.c:54:compare_addr_and_update] > > 0-/data/brick_adminvm: allowed = "*", received addr = "172.23.255.153" > > [2021-01-29 18:54:45.455802] I [login.c:110:gf_auth] 0-auth/login: allowed > > user names: 3b66cfab-00d5-4b13-a103-93b4cf95e144 > > [2021-01-29 18:54:45.455815] I [MSGID: 115029] > > [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client > > from > > CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0 > > (version: 7.2) with subvol /data/brick_adminvm > > [2021-01-29 18:54:45.494950] W [socket.c:774:__socket_rwv] > > 0-tcp.adminvm-server: readv on 172.23.255.153:48551 failed (No data > > available) > > [2021-01-29 18:54:45.494994] I [MSGID: 115036] > > [server.c:501:server_rpc_notify] 0-adminvm-server: disconnecting connection > > from > > CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0 > > [2021-01-29 18:54:45.495091] I [MSGID: 101055] > > [client_t.c:436:gf_client_unref] 0-a
[Gluster-users] gnfs exports netmask handling can incorrectly deny access to clients
Hello team -

First, I wish to state that I know we are supposed to move to Ganesha. We had a lot of trouble with Ganesha in the past with our workload and we still owe trying the very latest version and working with the community. Some of our use cases are complicated and require very large clusters to test. Therefore, switching has remained elusive. We still rely on Gluster NFS. Gluster is now used as part of the solution in some of the largest supercomputers in the world.

We encountered a problem with Gluster NFS handling of the exports file in relation to how it computes access rights. We have patched our build of Gluster with this fix. I'm not sure what the final fix would be like, but I'm hoping what I paste below will enable us to get a final fix into the community.

This analysis and patch were developed by Dick Riegner when I asked for his help on this problem. There were several others involved. What follows is his analysis. I will then paste the patch we're using now. We would be happy to test a new version of the fix if you like (so we can remove our patch when we upgrade).

What follows are Dick's words.

ANALYSIS
==

Here is my Gluster debug output from its nfs.log file. The working case is from a compute node client using the IP address 10.31.128.16, and the failing case is from a client using the IP address 10.31.133.16.
Working case

RJR01: gf_is_ip_in_net() Entered network is 10.31.128.0/18, ip_str is 10.31.128.16
RJR20: gf_is_ip_in_net() subnet is 18, net_str is 10.31.128.0, net_ip is 10.31.128.0
RJR40: gf_is_ip_in_net() Host byte order subnet_mask is 0003ffff, ip_buf is 10801f0a, net_ip_buf is 00801f0a
RJR42: gf_is_ip_in_net() Network byte order subnet_mask is ffff0300, ip_buf is 0a1f8010, net_ip_buf is 0a1f8000
RJR44: gf_is_ip_in_net() Network byte order shifted 14 host bits, ip_buf is 287e, net_ip_buf is 287e
RJR46: gf_is_ip_in_net() My result is 1
RJR99: gf_is_ip_in_net() Exiting result is 1

Failing case

RJR01: gf_is_ip_in_net() Entered network is 10.31.128.0/18, ip_str is 10.31.133.16
RJR20: gf_is_ip_in_net() subnet is 18, net_str is 10.31.128.0, net_ip is 10.31.128.0
RJR40: gf_is_ip_in_net() Host byte order subnet_mask is 0003ffff, ip_buf is 10851f0a, net_ip_buf is 00801f0a
RJR42: gf_is_ip_in_net() Network byte order subnet_mask is ffff0300, ip_buf is 0a1f8510, net_ip_buf is 0a1f8000
RJR44: gf_is_ip_in_net() Network byte order shifted 14 host bits, ip_buf is 287e, net_ip_buf is 287e
RJR46: gf_is_ip_in_net() My result is 1
RJR99: gf_is_ip_in_net() Exiting result is 0

Gluster function gf_is_ip_in_net() verifies a client's authorization to mount an export by comparing the subnet address of the client with an allowed subnet address. The comparison is made by masking the client IP address and the allowed subnet address and permitting access when the resulting subnets are equal. The mask is an all-ones bit-string the length of the subnet. In this case, the subnet is 18 bits and the subnet mask of 0x3ffff is in the Little Endian ordering used by the Intel x86_64 processor.

1) Analysis of the working case from client IP address 10.31.128.16

These addresses are in Little Endian order on an Intel x86_64 processor.
                        Address      Mask       Subnet
Client IP Subnet        0x10801f0a & 0x3ffff => 0x01f0a
Allowed Subnet          0x00801f0a & 0x3ffff => 0x01f0a

The resulting subnets are equal, so Gluster allows the client to mount its exports.

2) Analysis of the failing case from client IP address 10.31.133.16

These addresses are in Little Endian order on an Intel x86_64 processor.

                        Address      Mask       Subnet
Client IP Subnet        0x10851f0a & 0x3ffff => 0x11f0a
Allowed Subnet          0x00801f0a & 0x3ffff => 0x01f0a

The resulting subnets are not equal, so Gluster does not allow the client to mount its exports.

The comparison is incorrectly including the two lower-order bits from part of the host portion of the client IP address (0x85) as part of the subnet. The subnet comparison fails and the client is incorrectly denied access to the Gluster exports.

PROPOSED FIX DESCRIPTION
==

The fix for the incorrect access denied errors is to convert the client and allowed subnet IP addresses from Host Byte Order (Little Endian) format to Network Byte Order (Big Endian) format and then isolate their subnets. This will ensure that the subnet and host parts of their IP addresses do not overlap. Once their subnets are properly isolated, the subnets can be properly compared.

The conversion from Host Byte Order to Network Byte Order is done by calling the htonl() function. A subnet mask is no longer used, but the subnet bit length is used to isolate the subnet address. Once the
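The corrected logic can be demonstrated in a few lines of shell (this is a sketch of the idea, not the actual C patch; `ip2int` and `in_subnet` are made-up helper names). The key point is that the address is assembled in network bit order before the prefix mask is applied, so host bits can never leak into the subnet comparison:

```shell
#!/bin/bash
# Sketch of a network-bit-order subnet membership test, mirroring the
# proposed gf_is_ip_in_net() fix. Helper names are illustrative.

# Convert a dotted-quad IPv4 address to a 32-bit integer with the
# first octet in the most significant byte (network bit order).
ip2int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# in_subnet IP NETWORK/PREFIX -> success (0) if IP is inside the subnet.
# The mask keeps only the top PREFIX bits, so the host part is ignored.
in_subnet() {
    local ip=$1 net=${2%/*} prefix=${2#*/}
    local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    [ $(( $(ip2int "$ip") & mask )) -eq $(( $(ip2int "$net") & mask )) ]
}

in_subnet 10.31.128.16 10.31.128.0/18 && echo "10.31.128.16: allowed"
in_subnet 10.31.133.16 10.31.128.0/18 && echo "10.31.133.16: allowed"
in_subnet 10.31.200.1  10.31.128.0/18 || echo "10.31.200.1: denied"
```

With this ordering, 10.31.133.16 is correctly accepted for 10.31.128.0/18 (it was the failing case above), while an address genuinely outside the /18, such as 10.31.200.1, is still rejected.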
Re: [Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM
I updated to 7.9, rebooted everything, and it started working. I will have QE try to break it again and report back. I couldn't break it but they're better at breaking things (which is hard to imagine :) On Fri, Jan 29, 2021 at 01:11:50PM -0600, Erik Jacobson wrote: > Thank you. > > We reproduced the problem after force-killing one of the 3 physical > nodes 6 times in a row. > > At that point, the grub2 loaded off the qemu virtual hard drive, but > could not find partitions. Since there is random luck involved, we don't > actually know if it was the force-killing that caused it to stop > working. > > When I start the VM with the image in this state, there is nothing > interesting in the fuse log for the volume in /var/log/glusterfs on the > node hosting the image. > > No pending heals (all servers report 0 entries to heal). > > The same VM behavior happens on all the physical nodes when I try to > start with the same VM image. > > Something from the gluster fuse mount log from earlier shows: > > [2021-01-28 21:24:40.814227] I [MSGID: 114018] > [client.c:2347:client_rpc_notify] 0-adminvm-client-0: disconnected from > adminvm-client-0. Client process will keep trying to connect to glusterd > until brick's port is available > [2021-01-28 21:24:43.815120] I [rpc-clnt.c:1963:rpc_clnt_reconfig] > 0-adminvm-client-0: changing port to 49152 (from 0) > [2021-01-28 21:24:43.815833] I [MSGID: 114057] > [client-handshake.c:1376:select_server_supported_programs] > 0-adminvm-client-0: Using Program GlusterFS 4.x v1, Num (1298437), Version > (400) > [2021-01-28 21:24:43.817682] I [MSGID: 114046] > [client-handshake.c:1106:client_setvolume_cbk] 0-adminvm-client-0: Connected > to adminvm-client-0, attached to remote volume '/data/brick_adminvm'. 
> [2021-01-28 21:24:43.817709] I [MSGID: 114042] > [client-handshake.c:930:client_post_handshake] 0-adminvm-client-0: 1 fds open > - Delaying child_up until they are re-opened > [2021-01-28 21:24:43.895163] I [MSGID: 114041] > [client-handshake.c:318:client_child_up_reopen_done] 0-adminvm-client-0: last > fd open'd/lock-self-heal'd - notifying CHILD-UP > The message "W [MSGID: 114061] [client-common.c:2893:client_pre_lk_v2] > 0-adminvm-client-0: (94695bdb-06b4-4105-9bc8-b8207270c941) remote_fd is -1. > EBADFD [File descriptor in bad state]" repeated 6 times between [2021-01-28 > 21:23:54.395811] and [2021-01-28 21:23:54.811640] > > > But that was a long time ago. > > Brick logs have an entry from when I first started the vm today (the > problem was reproduced yesterday) all brick logs have something similar. > Nothing appeared on the several other startup attempts of the VM: > > [2021-01-28 21:24:45.460147] I [MSGID: 115029] > [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client > from > CTX_ID:613f0d91-34e6-4495-859f-bca1c9f7af01-GRAPH_ID:0-PID:6287-HOST:nano-1-PC_NAME:adminvm-client-2-RECON_NO:-0 > (version: 7.2) with subvol /data/brick_adminvm > [2021-01-29 18:54:45.48] I [addr.c:54:compare_addr_and_update] > 0-/data/brick_adminvm: allowed = "*", received addr = "172.23.255.153" > [2021-01-29 18:54:45.455802] I [login.c:110:gf_auth] 0-auth/login: allowed > user names: 3b66cfab-00d5-4b13-a103-93b4cf95e144 > [2021-01-29 18:54:45.455815] I [MSGID: 115029] > [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client > from > CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0 > (version: 7.2) with subvol /data/brick_adminvm > [2021-01-29 18:54:45.494950] W [socket.c:774:__socket_rwv] > 0-tcp.adminvm-server: readv on 172.23.255.153:48551 failed (No data available) > [2021-01-29 18:54:45.494994] I [MSGID: 115036] > [server.c:501:server_rpc_notify] 0-adminvm-server: 
disconnecting connection > from > CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0 > [2021-01-29 18:54:45.495091] I [MSGID: 101055] > [client_t.c:436:gf_client_unref] 0-adminvm-server: Shutting down connection > CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0 > > > > Like before, if I halt the VM, kpartx the image, mount the giant root > within the image, then unmount, unkpartx, and start the VM - it works: > > nano-2:/var/log/glusterfs # kpartx -a /adminvm/images/adminvm.img > nano-2:/var/log/glusterfs # mount /dev/mapper/loop0p31 /mnt > nano-2:/var/log/glusterfs # dmesg|tail -3 > [85528.602570] loop: module loaded > [85535.975623] EXT4-fs (dm-3): recovery complete > [85535.979663] EXT4-fs (dm-3): mounted filesystem with ordered data mode. >
Re: [Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM
> Also, I would like to point that I have VMs with large disks 1TB and 2TB, and
> have no issues. definitely would upgrade Gluster version like let's say at
> least 7.9.

Great! Thank you! We can update but it's very sensitive due to the workload. I can't officially update our gluster until we have a cluster with a couple thousand nodes to test with. However, for this problem, this is on my list on the test machine. I'm hoping I can reproduce it. So far no luck making it happen again. Once I hit it, I will try to collect more data and at the end update gluster.

What do you think about the suggestion to increase the shard size? Are you using the default size on your 1TB and 2TB images?

> Amar also asked a question regarding enabling Sharding in the volume after
> creating the VMs disks, which would certainly mess up the volume if that what
> happened.

Oh, I missed this question. I basically scripted it quick since I was doing it so often. I have a similar script that takes it away to start over.

set -x
pdsh -g gluster mkdir /data/brick_adminvm/
gluster volume create adminvm replica 3 transport tcp 172.23.255.151:/data/brick_adminvm 172.23.255.152:/data/brick_adminvm 172.23.255.153:/data/brick_adminvm
gluster volume set adminvm group virt
gluster volume set adminvm granular-entry-heal enable
gluster volume set adminvm storage.owner-uid 439
gluster volume set adminvm storage.owner-gid 443
gluster volume start adminvm
pdsh -g gluster mount /adminvm
echo -n "press enter to continue for restore tarball"
pushd /adminvm
tar xvf /root/backup.tar
popd
echo -n "press enter to continue for qemu-img"
pushd /adminvm
qemu-img create -f raw -o preallocation=falloc /adminvm/images/adminvm.img 5T
popd

Thanks again for the kind responses,

Erik

> > On Wed, Jan 27, 2021 at 5:28 PM Erik Jacobson wrote:
> > > Shortly after the sharded volume is made, there are some fuse mount
> > > messages.
I'm not 100% sure if this was just before or during the > > > big qemu-img command to make the 5T image > > > (qemu-img create -f raw -o preallocation=falloc > > > /adminvm/images/adminvm.img 5T) > > Any reason to have a single disk with this size ? > > > Usually in any > > virtualization I have used , it is always recommended to keep it lower. > > Have you thought about multiple disks with smaller size ? > > Yes, because the actual virtual machine is an admin node/head node cluster > manager for a supercomputer that hosts big OS images and drives > multi-thousand-node-clusters (boot, monitoring, image creation, > distribution, sometimes NFS roots, etc) . So this VM is a biggie. > > We could make multiple smaller images but it would be very painful since > it differs from the normal non-VM setup. > > So unlike many solutions where you have lots of small VMs with their > images small images, this solution is one giant VM with one giant image. > We're essentially using gluster in this use case (as opposed to others I > have posted about in the past) for head node failover (combined with > pacemaker). > > > Also worth > > noting is that RHII is supported only when the shard size is 512MB, so > > it's worth trying bigger shard size . > > I have put larger shard size and newer gluster version on the list to > try. Thank you! Hoping to get it failing again to try these things! > > > > -- > Respectfully > Mahdi
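On the shard-size suggestion: my understanding is that the shard size has to be set before the image file is written, since it only applies to files created after the change. A rough sketch against the volume-create script quoted above (the 512MB value is the suggestion from the thread, and `features.shard-block-size` is the gluster option I believe controls this):

```shell
#!/bin/bash
# Hypothetical sketch: raise the shard size on the "adminvm" volume
# BEFORE preallocating the VM image, so every shard is created at the
# new size. Changing it under an existing image would be unsafe.
gluster volume set adminvm features.shard-block-size 512MB

# Confirm the setting took effect
gluster volume get adminvm features.shard-block-size

# Now create the image as in the original script
qemu-img create -f raw -o preallocation=falloc /adminvm/images/adminvm.img 5T
```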
Re: [Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM
> > Shortly after the sharded volume is made, there are some fuse mount > > messages. I'm not 100% sure if this was just before or during the > > big qemu-img command to make the 5T image > > (qemu-img create -f raw -o preallocation=falloc > > /adminvm/images/adminvm.img 5T) > Any reason to have a single disk with this size ? > Usually in any > virtualization I have used , it is always recommended to keep it lower. > Have you thought about multiple disks with smaller size ? Yes, because the actual virtual machine is an admin node/head node cluster manager for a supercomputer that hosts big OS images and drives multi-thousand-node-clusters (boot, monitoring, image creation, distribution, sometimes NFS roots, etc) . So this VM is a biggie. We could make multiple smaller images but it would be very painful since it differs from the normal non-VM setup. So unlike many solutions where you have lots of small VMs with their images small images, this solution is one giant VM with one giant image. We're essentially using gluster in this use case (as opposed to others I have posted about in the past) for head node failover (combined with pacemaker). > Also worth > noting is that RHII is supported only when the shard size is 512MB, so > it's worth trying bigger shard size . I have put larger shard size and newer gluster version on the list to try. Thank you! Hoping to get it failing again to try these things!
Re: [Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM
> Are you sure that there is no heals pending at the time of the power up I was watching heals when the problem was persisting and it was all clear. This was a great suggestion though. > I checked my oVirt-based gluster and the only difference is: > cluster.gra > nular-entry-heal: enable > The options seem fine. > > libglusterfs0-7.2-4723.1520.210122T1700.a.sles15sp2hpe.x86_64 > > glusterfs-7.2-4723.1520.210122T1700.a.sles15sp2hpe.x86_64 > > python3-gluster-7.2-4723.1520.210122T1700.a.sles15sp2hpe.noarch > This one is quite old although it never caused any troubles with my > oVirt VMs. Either try with latest v7 or even v8.3 . I can try a newer version. The issue is we have to do massive testing with thousands of nodes to validate function and that isn't always available. So we tend to latch on to a good one and stage an upgrade when we have a system big enough in the factory. In this case though, the use case is a single VM. If I could find a way to reproduce the problem I would be able to know if upgrading helped. These hard to reproduce problems are painful!! We keep hitting it in places but triggering has been elusive. THANK YOU for replying back. I will continue to try to reproduce the problem. If I get it back to consistent fail, I'll try updating gluster then and take another closer look at the logs and post them. Erik
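For what it's worth, the heal watching mentioned above is just the standard heal info output; one way to script "wait until no heals are pending" before starting the VM might be (volume name taken from earlier in the thread):

```shell
#!/bin/bash
# Poll heal status for the "adminvm" volume until every brick reports
# "Number of entries: 0", then proceed to start the VM.
while gluster volume heal adminvm info | grep -q '^Number of entries: [1-9]'; do
    echo "heals pending, waiting..."
    sleep 10
done
echo "no pending heals on adminvm"
```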
Re: [Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM
Thank you so much for responding! More below.

> Anything in the logs of the fuse mount? can you stat the file from the mount?
> also, the report of an image is only 64M makes me think about Sharding as the
> default value of Shard size is 64M.
> Do you have any clues on when this issue start to happen? was there any
> operation done to the Gluster cluster?

- I had just created the gluster volumes within an hour of the problem to test the very problem I reported. So it was a "fresh start".

- It booted one or two times, then stopped booting. Once it couldn't boot, all 3 nodes were the same in that grub2 couldn't boot in the VM image.

As for the fuse log, I did see a couple of these before it happened the first time. I'm not sure if it's a clue or not.

[2021-01-25 22:48:19.310467] I [fuse-bridge.c:5777:fuse_graph_sync] 0-fuse: switched to graph 0
[2021-01-25 22:50:09.693958] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17a)[0x7f914e346faa] (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x874a)[0x7f914a3d374a] (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x91cb)[0x7f914a3d41cb] (--> /lib64/libpthread.so.0(+0x84f9)[0x7f914cf184f9] (--> /lib64/libc.so.6(clone+0x3f)[0x7f914c76afbf] ) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
[2021-01-25 22:50:09.694462] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17a)[0x7f914e346faa] (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x874a)[0x7f914a3d374a] (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x91cb)[0x7f914a3d41cb] (--> /lib64/libpthread.so.0(+0x84f9)[0x7f914cf184f9] (--> /lib64/libc.so.6(clone+0x3f)[0x7f914c76afbf] ) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory

I have reserved the test system again.
My plans today are:

- Start over with the gluster volume on the machine with sles15sp2 updates

- Learn if there are modifications to the image (besides mounting/umounting filesystems with the image using kpartx to map them to force it to work). What if I add/remove a byte from the end of the image file for example.

- Revert the setup to sles15sp2 with no updates. My theory is the updates are not making a difference and it's just random chance. (re-making the gluster volume in the process)

- The 64MB shard size made me think too!!

- If the team feels it is worth it, I could try a newer gluster. We're using the versions we've validated at scale when we have large clusters in the factory but if the team thinks I should try something else I'm happy to re-build it!!! We are @ 7.2 plus afr-event-gen-changes patch.

I will keep a better eye on the fuse log to tie an error to the problem starting.

THANKS AGAIN for responding and let me know if you have any more clues!

Erik

> On Tue, Jan 26, 2021 at 2:40 AM Erik Jacobson wrote:
>
> Hello all. Thanks again for gluster. We're having a strange problem getting virtual machines started that are hosted on a gluster volume.
>
> One of the ways we use gluster now is to make a HA-ish cluster head node. A virtual machine runs in the shared storage and is backed up by 3 physical servers that contribute to the gluster storage share.
>
> We're using sharding in this volume. The VM image file is around 5T and we use qemu-img with falloc to get all the blocks allocated in advance.
>
> We are not using gfapi largely because it would mean we have to build our own libvirt and qemu and we'd prefer not to do that. So we're using a glusterfs fuse mount to host the image. The virtual machine is using virtio disks but we had similar trouble using scsi emulation.
>
> The issue: all seems well, the VM head node installs, boots, etc.
>
> However, at some point, it stops being able to boot! grub2 acts like it cannot find /boot.
> At the grub2 prompt, it can see the partitions, but reports no filesystem found where there are indeed filesystems.
>
> If we switch qemu to use "direct kernel load" (bypass grub2), this often works around the problem but in one case Linux gave us a clue. Linux reported /dev/vda as only being 64 megabytes, which would explain a lot. This means the virtual machine Linux thought the disk supplied by the disk image was tiny! 64M instead of 5T.
>
> We are using sles15sp2 and hit the problem more often with updates applied than without. I'm in the process of trying to isolate if there is a sles15sp2 update causing this, or if we're within "random chance".
>
> On one of the physical nodes, if it is in the failure mode, if I use 'kpartx' to create the
[Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM
Hello all. Thanks again for gluster. We're having a strange problem getting virtual machines started that are hosted on a gluster volume.

One of the ways we use gluster now is to make a HA-ish cluster head node. A virtual machine runs in the shared storage and is backed up by 3 physical servers that contribute to the gluster storage share.

We're using sharding in this volume. The VM image file is around 5T and we use qemu-img with falloc to get all the blocks allocated in advance.

We are not using gfapi largely because it would mean we have to build our own libvirt and qemu and we'd prefer not to do that. So we're using a glusterfs fuse mount to host the image. The virtual machine is using virtio disks but we had similar trouble using scsi emulation.

The issue: all seems well, the VM head node installs, boots, etc.

However, at some point, it stops being able to boot! grub2 acts like it cannot find /boot. At the grub2 prompt, it can see the partitions, but reports no filesystem found where there are indeed filesystems.

If we switch qemu to use "direct kernel load" (bypass grub2), this often works around the problem, but in one case Linux gave us a clue. Linux reported /dev/vda as only being 64 megabytes, which would explain a lot. This means the virtual machine Linux thought the disk supplied by the disk image was tiny! 64M instead of 5T.

We are using sles15sp2 and hit the problem more often with updates applied than without. I'm in the process of trying to isolate if there is a sles15sp2 update causing this, or if we're within "random chance".

On one of the physical nodes, if it is in the failure mode, if I use 'kpartx' to create the partitions from the image file, then mount the giant root filesystem (ie mount /dev/mapper/loop0p31 /mnt) and then umount /mnt, then that physical node starts the VM fine, grub2 loads, the virtual machine is fully happy! Until I try to shut it down and start it up again, at which point it sticks at grub2 again!
What about mounting the image file makes it so qemu sees the whole disk?

The problem doesn't always happen but once it starts, the same VM image has trouble starting on any of the 3 physical nodes sharing the storage. But using the trick to force-mount the root within the image with kpartx, then the machine can come up. My only guess is this changes the file just a tiny bit in the middle of the image. Once the problem starts, it keeps happening except temporarily working when I do the loop mount trick on the physical admin.

Here is some info about what I have in place:

nano-1:/adminvm/images # gluster volume info

Volume Name: adminvm
Type: Replicate
Volume ID: 67de902c-8c00-4dc9-8b69-60b93b5f6104
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.23.255.151:/data/brick_adminvm
Brick2: 172.23.255.152:/data/brick_adminvm
Brick3: 172.23.255.153:/data/brick_adminvm
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: enable
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 1
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
cluster.granular-entry-heal: enable
storage.owner-uid: 439
storage.owner-gid: 443

libglusterfs0-7.2-4723.1520.210122T1700.a.sles15sp2hpe.x86_64
glusterfs-7.2-4723.1520.210122T1700.a.sles15sp2hpe.x86_64
python3-gluster-7.2-4723.1520.210122T1700.a.sles15sp2hpe.noarch

nano-1:/adminvm/images # uname -a
Linux nano-1 5.3.18-24.46-default #1 SMP Tue Jan 5 16:11:50 UTC 2021 (4ff469b) x86_64 x86_64 x86_64 GNU/Linux

nano-1:/adminvm/images # rpm -qa | grep qemu-4
qemu-4.2.0-9.4.x86_64

Would love any advice,
Erik

Community Meeting Calendar:
Schedule - Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
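A side note on the 64M coincidence mentioned above: with features.shard on and gluster's default 64 MiB shard-block-size, only the first 64 MiB of a file lives at its visible path; the rest is split into numbered pieces under the volume's hidden .shard directory. A rough sketch of the offset-to-shard arithmetic (my own illustration for reasoning about the symptom, not gluster code):

```python
SHARD_SIZE = 64 * 1024 * 1024          # gluster's default shard-block-size (64 MiB)

def shard_index(offset):
    """Which shard piece a byte offset of the image falls into.

    Index 0 is the base file itself; index N >= 1 is stored as a separate
    <gfid>.N file under the volume's hidden .shard directory.
    """
    return offset // SHARD_SIZE

image_size = 5 * 1024 ** 4             # the ~5T raw VM image from the report

# Total pieces needed for the whole image:
pieces = (image_size + SHARD_SIZE - 1) // SHARD_SIZE
print(pieces)                          # 81920 shards for 5 TiB

# A partition-table read at offset 0 hits the base file...
print(shard_index(0))                  # 0
# ...but /boot somewhere past 64 MiB requires the sharded pieces:
print(shard_index(200 * 1024 * 1024))  # 3
```

If qemu were, for whatever reason, only seeing the base file rather than the reassembled shards, the guest would report exactly a 64 MB disk, which matches the kernel clue in the message above.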
Re: [Gluster-users] State of Gluster project
> For NVMe/SSD - raid controller is pointless, so JBOD makes most sense.

I am game for an education lesson here. We're still using spinning drives with big RAID caches, but we keep discussing SSD in the context of RAID. I have read that for many real-world workloads, RAID0 makes no sense with modern SSDs. I get that part. But if your concern is reliability and reducing the need to mess with Gluster to recover from a drive failure, a RAID1 or RAID10 (or some other level with redundancy) would seem to at least make sense from that perspective.

Was your answer a performance answer? Or am I missing something about RAID for redundancy and SSDs being a bad choice?

Thanks again as always,

Erik
Re: [Gluster-users] State of Gluster project
I agree with this assessment for the most part. I'll just add that, during development of Gluster based solutions, we had internal use of Red Hat Gluster. This was over a year and a half ago when we started. For my perhaps non-mainstream use cases, I found the latest versions of gluster 7 actually fixed several of my issues.

Now, I did not try to work with Red Hat when I hit problems as it was only "non-shippable support" - we could install it but not deliver it. Since it didn't work well for our strange use cases, we moved on to building our own Gluster instead of working to have customers buy the Red Hat one. (We also support sles12, sles15, rhel7, rhel8 - so having Red Hat's version of Gluster sort of wouldn't have worked out for us anyway.)

However, I also found that it is quite easy for my use case to hit new bugs. When we go from gluster72 to one of the newer ones, little things might happen (and did happen). I don't complain because I get free support from you and I do my best to fix them if I have time and access to a failing system.

A tricky thing in my world is we will sell a cluster with 5,000 nodes to boot and my test cluster may have 3 nodes. I can get time up to 128 nodes on one test system. But I only get short-term access to bigger systems at the factory. So being able to change from one Gluster version to another is a real challenge for us because there simply is no way for us to test very often and, like is normal in HPC, problems only show at scale. hahaa :) :)

This is also why we are still using Gluster NFS. We know we need to work with the community on fixing some Ganesha issues, but the amount of time we get on a large machine that exhibits the problem is short and we must prioritize. This is why I'm careful to never "blame Ganesha" but rather point out that we haven't had time to track the issues down with the Ganesha community.
Meanwhile we hope we can keep building Gluster NFS :)

When I next do a version-change of Gluster or try Ganesha again, it will be when I have sustained access to at least a 1024 node cluster to boot, with 3 or 6 Gluster servers, to really work out any issues. I consider this "a cost of doing business in the world I work in" but it is a real challenge indeed. I assume some challenges parallel Gluster developers' "works fine on my limited hardware or virtual machines".

Erik

> With every community project, you are in the position of a Beta Tester - no matter Fedora, Gluster or CEPH. So far, I had issues with upstream projects only during and immediately after patching - but this is properly mitigated with a reasonable patching strategy (patch test environment and several months later patch prod with the same repos). Enterprise Linux breaks (and a lot), having 10-times more users and use cases, so you cannot expect to start to use Gluster and assume that a free project won't break at all. Our part in this project is to help the devs create a test case for our workload, so regressions will be reduced to a minimum.
>
> In the past 2 years, we got 2 major issues with VMware VSAN and 1 major issue with an Enterprise Storage cluster (both solutions are quite expensive) - so I always recommend proper testing of your software.
>
> >> That's true, but you could also use NFS Ganesha, which is more performant than FUSE and also as reliable as it.
> >
> > From this very list I read about many users with various problems when using NFS Ganesha. Is that a wrong impression?
>
> From my observations, almost nobody is complaining about Ganesha in the mailing list -> 50% are having issues with geo replication, 20% are having issues with small file performance and the rest have issues with very old versions of gluster -> v5 or older.
>
> >> It's not so hard to do it - just use either 'reset-brick' or 'replace-brick'.
> > Sure - the command itself is simple enough. The point is that each reconstruction is quite a bit riskier than a simple RAID reconstruction. Do you run a full Gluster SDS, skipping RAID? How did you find this setup?
>
> I can't say that a replace-brick on a 'replica 3' volume is riskier than a rebuild of a raid, but I have noticed that nobody is following Red Hat's guide to use either:
> - a Raid6 of 12 Disks (2-3 TB big)
> - a Raid10 of 12 Disks (2-3 TB big)
> - JBOD disks in 'replica 3' mode (I'm not sure about the size RH recommends, most probably 2-3 TB)
> So far, I didn't have the opportunity to run on JBODs.
>
> > Thanks.
Re: [Gluster-users] State of Gluster project
> It is very hard to compare them because they are structurally very different. For example, GlusterFS performance will depend *a lot* on the underlying file system performance. Ceph eliminated that factor by using Bluestore.
>
> Ceph is very well performing for VM storage, since it's block based and as such optimized for that. I haven't tested CephFS a lot (I used it but only for very small storage) so I cannot speak for its performance, but I am guessing it's not ideal. For large amounts of files, GlusterFS is thus still a good choice.

Was your experience above based on using a sharded volume or a normal one? When we worked with virtual machine images, we followed the volume sharding advice. I don't have a comparison for Ceph handy. I was just curious. It worked so well for us (but maybe our storage is "too good") that we found it hard to imagine it could be improved much. This was a simple case though of a single VM, 3 gluster servers, a sharded volume, and a raw virtual machine image. Probably a simpler case than yours.

Thank you for writing this and take care,

Erik

> One *MAJOR* advantage of Ceph over GlusterFS is tooling. Ceph's self-analytics, status reporting and problem fixing toolset is just so far beyond GlusterFS that it's really hard for me to recommend GlusterFS for any but the most experienced sysadmins. It does come with the type of implementation Ceph has chosen that they have to have such good tooling (because honestly, poking around in binary data structures really wouldn't be practical for most users), but whenever I had a problem with Ceph the solution was just a couple of command line commands (even if it meant to remove a storage device, wipe it and add it back), where with GlusterFS it means poking around in the .glusterfs directory, looking up inode numbers, extended attributes etc., which is a real pain if you have a multi-million-file filesystem to work on.
> And that's not even with sharding or distributed volumes.
>
> Also, Ceph has been a lot more stable than GlusterFS for us. The amount of hand-holding GlusterFS needs is crazy. With Ceph, there is this one bug (I think in certain Linux kernel versions) where it sometimes reads only zeroes from disk and complains about that, and then you have to restart that OSD to not run into problems, but that's one "swatch" process on each machine that will do that automatically for us. I have run some Ceph clusters for several years now and only once or twice I had to deal with problems. The several GlusterFS clusters we operate constantly run into troubles. We now shut down all GlusterFS clients before we reboot any GlusterFS node because it was near impossible to reboot a single node without running into unrecoverable troubles (heal entries that will not heal etc.). With Ceph we can achieve 100% uptime; we regularly reboot our hosts one by one and some minutes later the Ceph cluster is clean again.
>
> If others have more insights I'd be very happy to hear them.
>
> Stefan
>
> ----- Original Message -----
> > Date: Tue, 16 Jun 2020 20:30:34 -0700
> > From: Artem Russakovskii
> > To: Strahil Nikolov
> > Cc: gluster-users
> > Subject: Re: [Gluster-users] State of Gluster project
> >
> > Has anyone tried to pit Ceph against gluster? I'm curious what the ups and downs are.
> >
> > On Tue, Jun 16, 2020, 4:32 PM Strahil Nikolov wrote:
> >
> >> Hey Mahdi,
> >>
> >> For me it looks like Red Hat are focusing more on CEPH than on Gluster. I hope the project remains active, cause it's very difficult to find a Software-defined Storage as easy and as scalable as Gluster.
> >>
> >> Best Regards,
> >> Strahil Nikolov
> >>
> >> On 17 June 2020 at 0:06:33 GMT+03:00, Mahdi Adnan wrote:
> >> > Hello,
> >> >
> >> > I'm wondering what's the current and future plan for the Gluster project overall. I see that the project is not as busy as it was before, "at least this is what I'm seeing". Like, there are fewer blogs about the roadmap or future plans of the project, the deprecation of Glusterd2, even Red Hat Openshift storage switched to Ceph.
> >> > As the community of this project, do you feel the same? Is the deprecation of Glusterd2 concerning? Do you feel that the project is slowing down somehow? Do you think Red Hat is abandoning the project or giving fewer resources to Gluster?
Re: [Gluster-users] State of Gluster project
We never ran tests with Ceph, mostly due to time constraints in engineering. We also liked that, at least when I started as a novice, gluster seemed easier to set up. We use the solution in automated setup scripts for maintaining very large clusters. Simplicity in automated setup is critical here for us, including automated installation of supercomputers in QE and near-automation at customer sites.

We have been happy with our performance using gluster and gluster NFS for root filesystems when using squashfs object files for the NFS roots as opposed to expanded files (on a sharded volume). For writable NFS, we use XFS filesystem images on gluster NFS instead of expanded trees (in this case, not on a sharded volume). We have systems running as large as 3072 nodes with 16 gluster servers (subvolumes of 3, distributed/replicate). We will have 5k nodes in production soon and will need to support 10k nodes in a year or so. So far we use CTDB for "ha-like" functionality as pacemaker is scary to us.

We also have designed a second solution around gluster for high-availability head nodes (aka admin nodes). The old solution used two admin nodes, pacemaker, and external shared storage to host a VM that would start on the 2nd server if the first server died. As we know, 2-node HA is not optimal. We designed a new 3-server HA solution that eliminates the external shared storage (which was expensive) and instead uses gluster, a sharded volume, and a qemu raw image hosted in the shared storage to host the virtual admin node. We use RAID10 with 4 disks per server for gluster use in this. We have been happy with the performance of this. It's only a little slower than the external shared filesystem solution (we tended to use GFS2 or OCFS or whatever it is called in the past solution). We did need to use pacemaker for this one as virtual machine availability isn't suitable for CTDB (or less natural anyway).
One highlight of this solution is it allows a customer to put each of the 3 servers in a separate firewalled vault or room to keep the head alive even if there were a fire that destroyed one server.

A key to our use of gluster and not suffering from poor performance in our root-filesystem workloads is encapsulating filesystems in image files instead of using expanded trees of small files. So far we have relied on gluster NFS for the boot servers as Ganesha would crash. We haven't re-tried in several months though and owe debugging on that front. We have not had resources to put into debugging Ganesha just yet.

I sure hope Gluster stays healthy and active. It is good to have multiple solutions with various strengths out there. I like choice. Plus, choice lets us learn from each other. I hope project sponsors see that too.

Erik

> On 17.06.2020 08:59, Artem Russakovskii wrote:
> > It may be stable, but it still suffers from performance issues, which the team is working on. But nevertheless, I'm curious if maybe Ceph has those problems sorted by now.
>
> Dunno, we run gluster on small clusters, kvm and gluster on the same hosts. There were plans to use ceph on dedicated servers next year, but budget cut because you don't want to buy our oil for $120 ;-)
>
> Anyway, in our tests ceph is faster, this is why we wanted to use it, but not migrate from gluster.
Re: [Gluster-users] MTU 9000 question
Thank you!!! We are going to try to run some experiments as well in the coming weeks. Assuming I don't get re-routed, which often happens, I'll share if we notice anything in our workload.

On Wed, May 06, 2020 at 07:41:56PM +0400, Dmitry Melekhov wrote:
> On 06.05.2020 19:15, Erik Jacobson wrote:
> > It's been working pretty well at 1500 MTU so far. If the only issue is less throughput, that may be a price we can pay since we're not bandwidth bound right now.
>
> I think that fragmentation offload on nics makes jumbo frames not very useful.
>
> As I said, we see no difference in our workload. We switched nics to mtu 9000 just because we can, but not from the start, and we did not see any improvements.
>
> Gluster never saturates our teamed two 10Gb connections, neither with mtu 9000 nor with 1500, and there is no visible latency difference.
Re: [Gluster-users] MTU 9000 question
> On the other side, allowing jumbo frames and changing mtu on even hundreds of nodes is extremely simple; you can just test it. I don't see "bunch of extra work" here, just use ssh and some scripting or something like ansible...

Our issue is we decided to simplify the configuration in our cluster manager so that cluster management traffic, NFS, and gluster are co-mingled. Works great. However, we often need to talk to BMCs on that same network, and many BMCs don't handle MTU 9K correctly. Often a BMC will seem to work, but if you send something big like a firmware flash to it, it never completes the transfer due to the MTU mismatch.

So the "hard part" is due to our own stuff. We have a method in the cluster manager to put BMCs in a separate network but that isn't a common choice. We are investigating using MTU size-by-path but that gets complicated to test.

Therefore, we are looking to understand the real-world problem with a 1500 MTU on 2x bonded 10G networks with gluster to decide if we want to put time and resources into solving the problem. It's been working pretty well at 1500 MTU so far. If the only issue is less throughput, that may be a price we can pay since we're not bandwidth bound right now.

Erik
[Gluster-users] MTU 9000 question
It is inconvenient for us to use MTU 9K for our gluster servers for various reasons. We typically have bonded 10G interfaces. We use distribute/replicate and gluster NFS for compute nodes.

My understanding is the negative to using 1500 MTU is just less efficient use of the network. Are there other concerns? We don't currently have network saturation problems. We are trying to decide whether we need to do a bunch of extra work to switch to 9K MTU and if it is worth the benefit.

Does the community have any suggestions?

Erik
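To put a number on "less efficient": assuming a plain TCP/IPv4 stream over Ethernet (40 bytes of L3/L4 headers per packet, plus 38 bytes of on-wire framing: preamble, Ethernet header, FCS, and inter-frame gap), the best-case goodput fraction at each MTU works out as below. This is my own back-of-envelope arithmetic, not a gluster benchmark:

```python
def goodput_fraction(mtu, tcpip_headers=40, wire_overhead=38):
    """Best-case TCP payload bytes as a fraction of total on-wire bytes.

    wire_overhead = 8 (preamble) + 14 (Ethernet header) + 4 (FCS)
                  + 12 (inter-frame gap); tcpip_headers assumes
    IPv4 + TCP with no options.
    """
    return (mtu - tcpip_headers) / (mtu + wire_overhead)

print(f"MTU 1500: {goodput_fraction(1500):.2%}")   # ~94.9%
print(f"MTU 9000: {goodput_fraction(9000):.2%}")   # ~99.1%
```

So jumbo frames buy roughly 4-5% of theoretical bandwidth, plus fewer packets (and thus fewer interrupts) per byte moved. On a link that is not saturated, as described above, the practical gain can easily be negligible.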
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
Amar, Ravi -

> This thread has been one of the largest efforts to stabilize the systems in recent times.

Well, thanks to you guys too. It would have been easy to stop replying when things got hard. I understand best-effort community support and appreciate you sticking with us.

The test system I had is disappearing on Monday. However, a larger test system will be less booked after a release finalizes. So I have a test platform through early next week, and will again have something in a couple weeks. I also may have a window at 1k nodes at a customer site during a maintenance window... And we should have a couple big ones going through the factory in the coming weeks in the 1k size. At 1k nodes, we have 3 gluster servers.

THANKS AGAIN. Wow, what a relief. Let me get these changes checked in so I can get it to some customers and then look at getting a new thread going on the thread hangs.

Erik

> > Thanks for patience and number of retries you did, Erik!
>
> Thanks indeed! Once https://review.gluster.org/#/c/glusterfs/+/24316/ gets merged on master, I will backport it to the release branches.
>
> > We surely need to get to the glitch you found with the 7.4 version, as with every higher version, we expect more stability!
>
> True, maybe we should start a separate thread...
>
> Regards,
> Ravi
>
> Regards,
> Amar
>
> On Fri, Apr 17, 2020 at 2:46 AM Erik Jacobson wrote:
>
> I have some news.
>
> After many many many trials, reboots of gluster servers, reboots of nodes... in what should have reproduced the issue several times. I think we're stable.
>
> It appears this glusterfs nfs daemon hang only happens in glusterfs74 and not 72.
>
> So:
> 1) Your split-brain patch
> 2) performance.parallel-readdir off
> 3) glusterfs72
>
> I declare it stable. I can't make it fail: split-brain, hang, nor seg fault with one leader down.
>
> I'm working on putting this into a SW update.
> We are going to test if performance.parallel-readdir off impacts booting at scale but we don't have a system to try it on at this time.
>
> THANK YOU!
>
> I may have access to the 57 node test system if there is something you'd like me to try with regards to why glusterfs74 is unstable in this situation. Just let me know.
>
> Erik
>
> On Thu, Apr 16, 2020 at 12:03:33PM -0500, Erik Jacobson wrote:
> > So in my test runs since making that change, we have a different odd behavior now. As you recall, this is with your patch -- still not split-brain -- and now with performance.parallel-readdir off.
> >
> > The NFS server grinds to a halt after a few test runs. It does not core dump.
> >
> > All that shows up in the log is:
> >
> > "pending frames:" with nothing after it and no date stamp.
> >
> > I will start looking for interesting break points I guess.
> >
> > The glusterfs for nfs is still alive:
> >
> > root 30541 1 42 09:57 ? 00:51:06 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/gluster/9ddb5561058ff543.socket
> >
> > [root@leader3 ~]# strace -f -p 30541
> > strace: Process 30541 attached with 40 threads
> > [pid 30580] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> > [pid 30579] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> > [pid 30578] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> > [pid 30577] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> > [pid 30576] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> > [pid 30575] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> > [pid 30574] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> > [pid 30573] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> > [pid 30572] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> > [pid 30571] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL
> > [pid 30570] futex(0x7f8904035f
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
I have some news. After many, many trials (reboots of gluster servers, reboots of nodes) in what should have reproduced the issue several times, I think we're stable. It appears this glusterfs nfs daemon hang only happens in glusterfs74 and not 72. So 1) Your split-brain patch 2) performance.parallel-readdir off 3) glusterfs72 I declare it stable. I can't make it fail: split-brain, hang, nor seg fault with one leader down. I'm working on putting this into a SW update. We are going to test if performance.parallel-readdir off impacts booting at scale, but we don't have a system to try it on at this time. THANK YOU! I may have access to the 57-node test system if there is something you'd like me to try with regard to why glusterfs74 is unstable in this situation. Just let me know. Erik On Thu, Apr 16, 2020 at 12:03:33PM -0500, Erik Jacobson wrote: > So in my test runs since making that change, we have a different odd > behavior now. As you recall, this is with your patch -- still not > split-brain -- and now with performance.parallel-readdir off. > > The NFS server grinds to a halt after a few test runs. It does not core > dump. > > All that shows up in the log is: > > "pending frames:" with nothing after it and no date stamp. > > I will start looking for interesting break points I guess. 
> > > The glusterfs for nfs is still alive: > > root 30541 1 42 09:57 ?00:51:06 /usr/sbin/glusterfs -s > localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid -l > /var/log/glusterfs/nfs.log -S /var/run/gluster/9ddb5561058ff543.socket > > > > [root@leader3 ~]# strace -f -p 30541 > strace: Process 30541 attached with 40 threads > [pid 30580] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30579] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30578] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30577] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30576] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30575] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30574] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30573] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30572] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30571] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30570] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30569] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30568] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30567] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30566] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30565] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30564] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30563] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30562] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30561] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30560] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30559] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30558] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30557] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30556] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30555] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, 
NULL > [pid 30554] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30553] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30552] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30551] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30550] restart_syscall(<... resuming interrupted restart_syscall ...> > > [pid 30549] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30548] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=243775} > > [pid 30546] restart_syscall(<... resuming interrupted restart_syscall ...> > > [pid 30545] restart_syscall(<... resuming interrupted restart_syscall ...> > > [pid 30544] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30543] rt_sigtimedwait([HUP INT USR1 USR2 TERM], > [pid 30542] futex(0x7f88b820, FUTEX_WAIT_PRIVATE, 2, NULL > [pid 30541] futex(0x7f890c3a39d0, FUTEX_WAIT, 30548, NULL > [pid 30547] <... select resumed> ) = 0 (Timeout) > [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout) > [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout) > [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout) > [pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout) > [pid 30547] select(
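For what it's worth, a trace like the one above (dozens of threads parked in FUTEX_WAIT on the same address) usually points at one contended or dead lock rather than normal idle waiting. A tiny triage helper to tally waiters per futex address from a saved strace capture (illustrative only, not a gluster tool; the capture filename is an assumption):

```shell
# Tally how many threads are blocked on each futex address in an
# `strace -f -p <pid>` capture.  Many waiters on a single address
# suggests one mutex everyone is stuck behind.
count_futex_waiters() {
    grep -o 'futex(0x[0-9a-f]*' "$1" | sort | uniq -c | sort -rn
}

# Usage (hypothetical capture of the hung gnfs daemon):
#   timeout 5 strace -f -p 30541 -o /tmp/nfs.strace
#   count_futex_waiters /tmp/nfs.strace
```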
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
557 detached strace: Process 30558 detached strace: Process 30559 detached strace: Process 30560 detached strace: Process 30561 detached strace: Process 30562 detached strace: Process 30563 detached strace: Process 30564 detached strace: Process 30565 detached strace: Process 30566 detached strace: Process 30567 detached strace: Process 30568 detached strace: Process 30569 detached strace: Process 30570 detached strace: Process 30571 detached strace: Process 30572 detached strace: Process 30573 detached strace: Process 30574 detached strace: Process 30575 detached strace: Process 30576 detached strace: Process 30577 detached strace: Process 30578 detached strace: Process 30579 detached strace: Process 30580 detached > On 16/04/20 8:04 pm, Erik Jacobson wrote: > > Quick update just on how this got set. > > > > gluster volume set cm_shared performance.parallel-readdir on > > > > is something we turned on, thinking it might make our NFS services > > faster, not knowing it could be harmful. > > > > Below is a diff of the nfs volume file ON vs OFF. So I will simply turn > > this OFF and do a test run. > Yes, that should do it. I am not sure if performance.parallel-readdir was > intentionally made to have an effect on gnfs volfiles. Usually, for other > performance xlators, `gluster volume set` only changes the fuse volfile. Community Meeting Calendar: Schedule - Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC Bridge: https://bluejeans.com/441850968 Gluster-users mailing list Gluster-users@gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
Quick update just on how this got set. gluster volume set cm_shared performance.parallel-readdir on is something we turned on, thinking it might make our NFS services faster, not knowing it could be harmful. Below is a diff of the nfs volume file ON vs OFF. So I will simply turn this OFF and do a test run. Does this look correct? I will start testing with this turned OFF. Thank you! [root@leader1 nfs]# diff -u /tmp/nfs-server.vol-ORIG nfs-server.vol --- /tmp/nfs-server.vol-ORIG 2020-04-16 09:28:56.855309870 -0500 +++ nfs-server.vol 2020-04-16 09:29:14.267289600 -0500 @@ -60,21 +60,13 @@ subvolumes cm_shared-client-0 cm_shared-client-1 cm_shared-client-2 end-volume -volume cm_shared-readdir-ahead-0 -type performance/readdir-ahead -option rda-cache-limit 10MB -option rda-request-size 131072 -option parallel-readdir on -subvolumes cm_shared-replicate-0 -end-volume - volume cm_shared-dht type cluster/distribute option force-migration off option lock-migration off option lookup-optimize on option lookup-unhashed auto -subvolumes cm_shared-readdir-ahead-0 +subvolumes cm_shared-replicate-0 end-volume volume cm_shared-utime On Thu, Apr 16, 2020 at 06:58:01PM +0530, Ravishankar N wrote: > > On 16/04/20 6:54 pm, Erik Jacobson wrote: > > > The patch by itself is only making changes specific to AFR, so it should > > > not > > > affect other translators. But I wonder how readdir-ahead is enabled in > > > your > > > gnfs stack. All performance xlators are turned off in gnfs except > > > write-behind and AFAIK, there is no way to enable them via the CLI. Did > > > you > > > custom edit your gnfs volfile to add readdir-ahead? If yes, does the crash > > > go away if you remove the xlator from the nfs volfile? > > thank you. A quick reply. I will then go research how to do this; > > I've never hand-edited a volume before. I've never even really looked at > > the gnfs volfile before. > > > > There are no custom code changes or hand edits. > > > > More soon. 
> > > Okay, /var/lib/glusterd/nfs/nfs-server.vol is the file you want to look at > if you are using gnfs. > > -Ravi
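If you want to verify that the option change actually took effect in the generated gnfs graph, a quick check against that file can help (a sketch; the path is per Ravi's note, and it assumes `gluster volume set` regenerates the volfile as shown in the diff above):

```shell
# Return success if the readdir-ahead xlator is still wired into a volfile.
has_readdir_ahead() {
    grep -q 'type performance/readdir-ahead' "$1"
}

# Usage on a gnfs server:
#   if has_readdir_ahead /var/lib/glusterd/nfs/nfs-server.vol; then
#       echo "readdir-ahead still in the gnfs graph"
#   fi
```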
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
> The patch by itself is only making changes specific to AFR, so it should not > affect other translators. But I wonder how readdir-ahead is enabled in your > gnfs stack. All performance xlators are turned off in gnfs except > write-behind and AFAIK, there is no way to enable them via the CLI. Did you > custom edit your gnfs volfile to add readdir-ahead? If yes, does the crash > go away if you remove the xlator from the nfs volfile? thank you. A quick reply. I will then go research how to do this; I've never hand-edited a volume before. I've never even really looked at the gnfs volfile before. There are no custom code changes or hand edits. More soon. 
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
ock = 0, __count = 0, __owner = 1586972324, __nusers = 0, __kind = 210092664, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\000\000\000\000\000\000\000\000\244F\227^\000\000\000\000x\302\205\f", '\000' , __align = 0}}, cookie = 0x0, complete = false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_nsec = 0}, end = {tv_sec = 0, tv_nsec = 0}, wind_from = 0x0, wind_to = 0x0, unwind_from = 0x0, unwind_to = 0x0} (gdb) print {call_frame_t}0x7fe5ac096288 $36 = {root = 0x7fe5ac378860, parent = 0x7fe5acf18eb8, frames = {next = 0x7fe5acf18ec8, prev = 0x7fe5ac6d6cf0}, local = 0x0, this = 0x7fe63c014000, ret = 0x7fe63bb5d350 , ref_count = 0, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' , __align = 0}}, cookie = 0x7fe5ac096288, complete = true, op = GF_FOP_READDIRP, begin = { tv_sec = 4234, tv_nsec = 637078816}, end = {tv_sec = 4234, tv_nsec = 803882755}, wind_from = 0x7fe63bb5e8c0 <__FUNCTION__.6> "rda_fill_fd", wind_to = 0x7fe63bb5e3f0 "(this->children->xlator)->fops->readdirp", unwind_from = 0x7fe63bdd8a80 <__FUNCTION__.20442> "afr_readdir_cbk", unwind_to = 0x7fe63bb5dfbb "rda_fill_fd_cbk"} On 4/15/20 8:14 AM, Erik Jacobson wrote: > Scott - I was going to start with gluster74 since that is what he > started at, but it applies well to gluster72, so I'll start there. > > Getting ready to go. The patch detail is interesting. This is probably > why it took him a bit longer. It wasn't a trivial patch. On Wed, Apr 15, 2020 at 12:45:57PM -0500, Erik Jacobson wrote: > > The new split-brain issue is much harder to reproduce, but after several > > (correcting to say new seg fault issue, the split brain is gone!!) > > > intense runs, it usually hits once. > > > > We switched to pure gluster74 plus your patch so we're apples to apples > > now. > > > > I'm going to see if Scott can help debug it. 
> > > > Here is the back trace info from the core dump: > > > > -rw-r- 1 root root 1.9G Apr 15 12:40 > > core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.158697232400 > > -rw-r- 1 root root 221M Apr 15 12:40 > > core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.158697232400.lz4 > > drwxrwxrwt 9 root root 20K Apr 15 12:40 . > > [root@leader3 tmp]# > > [root@leader3 tmp]# > > [root@leader3 tmp]# gdb > > core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.158697232400 > > GNU gdb (GDB) Red Hat Enterprise Linux 8.2-5.el8 > > Copyright (C) 2018 Free Software Foundation, Inc. > > License GPLv3+: GNU GPL version 3 or later > > <http://gnu.org/licenses/gpl.html> > > This is free software: you are free to change and redistribute it. > > There is NO WARRANTY, to the extent permitted by law. > > Type "show copying" and "show warranty" for details. > > This GDB was configured as "x86_64-redhat-linux-gnu". > > Type "show configuration" for configuration details. > > For bug reporting instructions, please see: > > <http://www.gnu.org/software/gdb/bugs/>. > > Find the GDB manual and other documentation resources online at: > > <http://www.gnu.org/software/gdb/documentation/>. > > > > For help, type "help". > > Type "apropos word" to search for commands related to "word"... 
> > [New LWP 61102] > > [New LWP 61085] > > [New LWP 61087] > > [New LWP 61117] > > [New LWP 61086] > > [New LWP 61108] > > [New LWP 61089] > > [New LWP 61090] > > [New LWP 61121] > > [New LWP 61088] > > [New LWP 61091] > > [New LWP 61093] > > [New LWP 61095] > > [New LWP 61092] > > [New LWP 61094] > > [New LWP 61098] > > [New LWP 61096] > > [New LWP 61097] > > [New LWP 61084] > > [New LWP 61100] > > [New LWP 61103] > > [New LWP 61104] > > [New LWP 61099] > > [New LWP 61105] > > [New LWP 61101] > > [New LWP 61106] > > [New LWP 61109] > > [New LWP 61107] > > [New LWP 61112] > > [New LWP 61119] > > [New LWP 61110] > > [New LWP 6] > > [New LWP 61118] > > [New LWP 61123] > > [New LWP 61122] > > [New LWP 61113] > > [New LWP 61114] > > [New LWP 61120] > > [New LWP 61116] > > [New LWP 61115] > > > > warning: core file may not match specified executable file. > > Reading symbols from /usr/sbin/glusterfsd...Reading symbols from > > /
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
7fe617fff700 (LWP 61102))] > Missing separate debuginfos, use: dnf debuginfo-install > glibc-2.28-42.el8.x86_64 keyutils-libs-1.5.10-6.el8.x86_64 > krb5-libs-1.16.1-22.el8.x86_64 libacl-2.2.53-1.el8.x86_64 > libattr-2.4.48-3.el8.x86_64 libcom_err-1.44.3-2.el8.x86_64 > libgcc-8.2.1-3.5.el8.x86_64 libselinux-2.8-6.el8.x86_64 > libtirpc-1.1.4-3.el8.x86_64 libuuid-2.32.1-8.el8.x86_64 > openssl-libs-1.1.1-8.el8.x86_64 pcre2-10.32-1.el8.x86_64 > zlib-1.2.11-10.el8.x86_64 > (gdb) bt > #0 0x7fe63bb5d7bb in FRAME_DESTROY (frame=0x7fe5ac096288) > at ../../../../libglusterfs/src/glusterfs/stack.h:193 > #1 STACK_DESTROY (stack=0x7fe5ac6d65f8) > at ../../../../libglusterfs/src/glusterfs/stack.h:193 > #2 rda_fill_fd_cbk (frame=0x7fe5acf18eb8, cookie=, > this=0x7fe63c0162b0, op_ret=3, op_errno=0, entries=, > xdata=0x0) at readdir-ahead.c:623 > #3 0x7fe63bd6c3aa in afr_readdir_cbk (frame=, > cookie=, this=, op_ret=, > op_errno=, subvol_entries=, xdata=0x0) > at afr-dir-read.c:234 > #4 0x7fe6400a1e07 in client4_0_readdirp_cbk (req=, > iov=, count=, myframe=0x7fe5ace0eda8) > at client-rpc-fops_v2.c:2338 > #5 0x7fe6479ca115 in rpc_clnt_handle_reply ( > clnt=clnt@entry=0x7fe63c0663f0, pollin=pollin@entry=0x7fe60c1737a0) > at rpc-clnt.c:764 > #6 0x7fe6479ca4b3 in rpc_clnt_notify (trans=0x7fe63c066780, > mydata=0x7fe63c066420, event=, data=0x7fe60c1737a0) > at rpc-clnt.c:931 > #7 0x7fe6479c707b in rpc_transport_notify ( > this=this@entry=0x7fe63c066780, > event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, > data=data@entry=0x7fe60c1737a0) at rpc-transport.c:545 > #8 0x7fe640da893c in socket_event_poll_in_async (xl=, > async=0x7fe60c1738c8) at socket.c:2601 > #9 0x7fe640db03dc in gf_async ( > cbk=0x7fe640da8910 , xl=, > async=0x7fe60c1738c8) at > ../../../../libglusterfs/src/glusterfs/async.h:189 > #10 socket_event_poll_in (notify_handled=true, this=0x7fe63c066780) > at socket.c:2642 > #11 socket_event_handler (fd=fd@entry=19, idx=idx@entry=10, gen=gen@entry=1, > 
data=data@entry=0x7fe63c066780, poll_in=, > poll_out=, poll_err=0, event_thread_died=0 '\000') > at socket.c:3040 > #12 0x7fe647c84a5b in event_dispatch_epoll_handler (event=0x7fe617ffe014, > event_pool=0x563f5a98c750) at event-epoll.c:650 > #13 event_dispatch_epoll_worker (data=0x7fe63c063b60) at event-epoll.c:763 > #14 0x7fe6467a72de in start_thread () from /lib64/libpthread.so.0 > #15 0x7fe645fffa63 in clone () from /lib64/libc.so.6 > > > > On Wed, Apr 15, 2020 at 10:39:34AM -0500, Erik Jacobson wrote: > > After several successful runs of the test case, we thought we were > > solved. Indeed, split-brain is gone. > > > > But we're triggering a seg fault now, even in a less loaded case. > > > > We're going to switch to gluster74, which was your intention, and report > > back. > > > > On Wed, Apr 15, 2020 at 10:33:01AM -0500, Erik Jacobson wrote: > > > > Attached the wrong patch by mistake in my previous mail. Sending the > > > > correct > > > > one now. > > > > > > Early results look GREAT!! > > > > > > We'll keep beating on it. We applied it to gluster72 as that is what we > > > have to ship with. It applied fine with some line moves. > > > > > > If you would like us to also run a test with gluster74 so that you can > > > say that's tested, we can run that test. I can do a special build. > > > > > > THANK YOU!! > > > > > > > > > > > > > > > -Ravi > > > > > > > > > > > > On 15/04/20 2:05 pm, Ravishankar N wrote: > > > > > > > > > > > > On 10/04/20 2:06 am, Erik Jacobson wrote: > > > > > > > > Once again thanks for sticking with us. Here is a reply from > > > > Scott > > > > Titus. If you have something for us to try, we'd love it. The > > > > code had > > > > your patch applied when gdb was run: > > > > > > > > > > > > Here is the addr2line output for those addresses. Very > > > > interesting > > > > command, of > > > > which I was not aware. 
> > > > > > > > [root@leader3 ~]# addr2line -f > > > > -e/usr/lib64/glusterfs/7.2/xlator/ > > > > cluster/ > > > > afr.so 0x6f735 > > > > afr_lookup_metadata_heal_check > > > > afr-c
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
ck.h:193 #2 rda_fill_fd_cbk (frame=0x7fe5acf18eb8, cookie=, this=0x7fe63c0162b0, op_ret=3, op_errno=0, entries=, xdata=0x0) at readdir-ahead.c:623 #3 0x7fe63bd6c3aa in afr_readdir_cbk (frame=, cookie=, this=, op_ret=, op_errno=, subvol_entries=, xdata=0x0) at afr-dir-read.c:234 #4 0x7fe6400a1e07 in client4_0_readdirp_cbk (req=, iov=, count=, myframe=0x7fe5ace0eda8) at client-rpc-fops_v2.c:2338 #5 0x7fe6479ca115 in rpc_clnt_handle_reply ( clnt=clnt@entry=0x7fe63c0663f0, pollin=pollin@entry=0x7fe60c1737a0) at rpc-clnt.c:764 #6 0x7fe6479ca4b3 in rpc_clnt_notify (trans=0x7fe63c066780, mydata=0x7fe63c066420, event=, data=0x7fe60c1737a0) at rpc-clnt.c:931 #7 0x7fe6479c707b in rpc_transport_notify ( this=this@entry=0x7fe63c066780, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fe60c1737a0) at rpc-transport.c:545 #8 0x7fe640da893c in socket_event_poll_in_async (xl=, async=0x7fe60c1738c8) at socket.c:2601 #9 0x7fe640db03dc in gf_async ( cbk=0x7fe640da8910 , xl=, async=0x7fe60c1738c8) at ../../../../libglusterfs/src/glusterfs/async.h:189 #10 socket_event_poll_in (notify_handled=true, this=0x7fe63c066780) at socket.c:2642 #11 socket_event_handler (fd=fd@entry=19, idx=idx@entry=10, gen=gen@entry=1, data=data@entry=0x7fe63c066780, poll_in=, poll_out=, poll_err=0, event_thread_died=0 '\000') at socket.c:3040 #12 0x7fe647c84a5b in event_dispatch_epoll_handler (event=0x7fe617ffe014, event_pool=0x563f5a98c750) at event-epoll.c:650 #13 event_dispatch_epoll_worker (data=0x7fe63c063b60) at event-epoll.c:763 #14 0x7fe6467a72de in start_thread () from /lib64/libpthread.so.0 #15 0x7fe645fffa63 in clone () from /lib64/libc.so.6 On Wed, Apr 15, 2020 at 10:39:34AM -0500, Erik Jacobson wrote: > After several successful runs of the test case, we thought we were > solved. Indeed, split-brain is gone. > > But we're triggering a seg fault now, even in a less loaded case. > > We're going to switch to gluster74, which was your intention, and report > back. 
> > On Wed, Apr 15, 2020 at 10:33:01AM -0500, Erik Jacobson wrote: > > > Attached the wrong patch by mistake in my previous mail. Sending the > > > correct > > > one now. > > > > Early results look GREAT!! > > > > We'll keep beating on it. We applied it to gluster72 as that is what we > > have to ship with. It applied fine with some line moves. > > > > If you would like us to also run a test with gluster74 so that you can > > say that's tested, we can run that test. I can do a special build. > > > > THANK YOU!! > > > > > > > > > > > -Ravi > > > > > > > > > On 15/04/20 2:05 pm, Ravishankar N wrote: > > > > > > > > > On 10/04/20 2:06 am, Erik Jacobson wrote: > > > > > > Once again thanks for sticking with us. Here is a reply from Scott > > > Titus. If you have something for us to try, we'd love it. The > > > code had > > > your patch applied when gdb was run: > > > > > > > > > Here is the addr2line output for those addresses. Very > > > interesting > > > command, of > > > which I was not aware. > > > > > > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/ > > > cluster/ > > > afr.so 0x6f735 > > > afr_lookup_metadata_heal_check > > > afr-common.c:2803 > > > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/ > > > cluster/ > > > afr.so 0x6f0b9 > > > afr_lookup_done > > > afr-common.c:2455 > > > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/ > > > cluster/ > > > afr.so 0x5c701 > > > afr_inode_event_gen_reset > > > afr-common.c:755 > > > > > > > > > Right, so afr_lookup_done() is resetting the event gen to zero. This > > > looks > > > like a race between lookup and inode refresh code paths. We made some > > > changes to the event generation logic in AFR. Can you apply the > > > attached > > > patch and see if it fixes the split-brain issue? It should apply > > > cleanly on > > > glusterfs-7.4. > > > > > > Thanks, > > > Ravi
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
After several successful runs of the test case, we thought we were solved. Indeed, split-brain is gone. But we're triggering a seg fault now, even in a less loaded case. We're going to switch to gluster74, which was your intention, and report back. On Wed, Apr 15, 2020 at 10:33:01AM -0500, Erik Jacobson wrote: > > Attached the wrong patch by mistake in my previous mail. Sending the correct > > one now. > > Early results look GREAT!! > > We'll keep beating on it. We applied it to gluster72 as that is what we > have to ship with. It applied fine with some line moves. > > If you would like us to also run a test with gluster74 so that you can > say that's tested, we can run that test. I can do a special build. > > THANK YOU!! > > > > > > > -Ravi > > > > > > On 15/04/20 2:05 pm, Ravishankar N wrote: > > > > > > On 10/04/20 2:06 am, Erik Jacobson wrote: > > > > Once again thanks for sticking with us. Here is a reply from Scott > > Titus. If you have something for us to try, we'd love it. The code > > had > > your patch applied when gdb was run: > > > > > > Here is the addr2line output for those addresses. Very interesting > > command, of > > which I was not aware. > > > > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/ > > cluster/ > > afr.so 0x6f735 > > afr_lookup_metadata_heal_check > > afr-common.c:2803 > > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/ > > cluster/ > > afr.so 0x6f0b9 > > afr_lookup_done > > afr-common.c:2455 > > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/ > > cluster/ > > afr.so 0x5c701 > > afr_inode_event_gen_reset > > afr-common.c:755 > > > > > > Right, so afr_lookup_done() is resetting the event gen to zero. This looks > > like a race between lookup and inode refresh code paths. We made some > > changes to the event generation logic in AFR. Can you apply the attached > > patch and see if it fixes the split-brain issue? It should apply cleanly > > on > > glusterfs-7.4. 
> > > > Thanks, > > Ravi > > > >From 11601e709a97ce7c40078866bf5d24b486f39454 Mon Sep 17 00:00:00 2001 > > From: Ravishankar N > > Date: Wed, 15 Apr 2020 13:53:26 +0530 > > Subject: [PATCH] afr: event gen changes > > > > The general idea of the changes is to prevent resetting event generation > > to zero in the inode ctx, since event gen is something that should > > follow 'causal order'. > > > > Change #1: > > For a read txn, in inode refresh cbk, if event_generation is > > found zero, we are failing the read fop. This is not needed > > because change in event gen is only a marker for the next inode refresh to > > happen and should not be taken into account by the current read txn. > > > > Change #2: > > The event gen being zero above can happen if there is a racing lookup, > > which resets event gen (in afr_lookup_done) if there are non-zero afr > > xattrs. The resetting is done only to trigger an inode refresh and a > > possible client side heal on the next lookup. That can be achieved by > > setting the need_refresh flag in the inode ctx. So replaced all > > occurrences of resetting event gen to zero with a call to > > afr_inode_need_refresh_set(). > > > > Change #3: > > In both lookup and discover path, we are doing an inode refresh which is > > not required since all 3 essentially do the same thing- update the inode > > ctx with the good/bad copies from the brick replies. Inode refresh also > > triggers background heals, but I think it is okay to do it when we call > > refresh during the read and write txns and not in the lookup path. 
> > > > Change-Id: Id0600dd34b144b4ae7a3bf3c397551adf7e402f1 > > Signed-off-by: Ravishankar N > > --- > > ...ismatch-resolution-with-fav-child-policy.t | 8 +- > > xlators/cluster/afr/src/afr-common.c
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
> Attached the wrong patch by mistake in my previous mail. Sending the correct > one now. Early results look GREAT!! We'll keep beating on it. We applied it to gluster72 as that is what we have to ship with. It applied fine with some line moves. If you would like us to also run a test with gluster74 so that you can say that's tested, we can run that test. I can do a special build. THANK YOU!! > > > -Ravi > > > On 15/04/20 2:05 pm, Ravishankar N wrote: > > > On 10/04/20 2:06 am, Erik Jacobson wrote: > > Once again thanks for sticking with us. Here is a reply from Scott > Titus. If you have something for us to try, we'd love it. The code had > your patch applied when gdb was run: > > > Here is the addr2line output for those addresses. Very interesting > command, of > which I was not aware. > > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/ > cluster/ > afr.so 0x6f735 > afr_lookup_metadata_heal_check > afr-common.c:2803 > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/ > cluster/ > afr.so 0x6f0b9 > afr_lookup_done > afr-common.c:2455 > [root@leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/ > cluster/ > afr.so 0x5c701 > afr_inode_event_gen_reset > afr-common.c:755 > > > Right, so afr_lookup_done() is resetting the event gen to zero. This looks > like a race between lookup and inode refresh code paths. We made some > changes to the event generation logic in AFR. Can you apply the attached > patch and see if it fixes the split-brain issue? It should apply cleanly > on > glusterfs-7.4. 
> > Thanks, > Ravi > >From 11601e709a97ce7c40078866bf5d24b486f39454 Mon Sep 17 00:00:00 2001 > From: Ravishankar N > Date: Wed, 15 Apr 2020 13:53:26 +0530 > Subject: [PATCH] afr: event gen changes > > The general idea of the changes is to prevent resetting event generation > to zero in the inode ctx, since event gen is something that should > follow 'causal order'. > > Change #1: > For a read txn, in inode refresh cbk, if event_generation is > found zero, we are failing the read fop. This is not needed > because change in event gen is only a marker for the next inode refresh to > happen and should not be taken into account by the current read txn. > > Change #2: > The event gen being zero above can happen if there is a racing lookup, > which resets event gen (in afr_lookup_done) if there are non-zero afr > xattrs. The resetting is done only to trigger an inode refresh and a > possible client side heal on the next lookup. That can be achieved by > setting the need_refresh flag in the inode ctx. So replaced all > occurrences of resetting event gen to zero with a call to > afr_inode_need_refresh_set(). > > Change #3: > In both lookup and discover path, we are doing an inode refresh which is > not required since all 3 essentially do the same thing- update the inode > ctx with the good/bad copies from the brick replies. Inode refresh also > triggers background heals, but I think it is okay to do it when we call > refresh during the read and write txns and not in the lookup path. 
> > Change-Id: Id0600dd34b144b4ae7a3bf3c397551adf7e402f1 > Signed-off-by: Ravishankar N > --- > ...ismatch-resolution-with-fav-child-policy.t | 8 +- > xlators/cluster/afr/src/afr-common.c | 92 --- > xlators/cluster/afr/src/afr-dir-write.c | 6 +- > xlators/cluster/afr/src/afr.h | 5 +- > 4 files changed, 29 insertions(+), 82 deletions(-) > > diff --git a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t > b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t > index f4aa351e4..12af0c854 100644 > --- a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t > +++ b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t > @@ -168,8 +168,8 @@ TEST [ "$gfid_1" != "$gfid_2" ] > #We know that second brick has the bigger size file > BIGGER_FILE_MD5=$(md5sum $B0/${V0}1/f3 | cut -d\ -f1) > > -TEST ls $M0/f3 > -TEST cat $M0/f3 > +TEST ls $M0 #Trigger entry heal via readdir inode refresh > +TEST cat $M0/f3 #Trigger
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
Once again thanks for sticking with us. Here is a reply from Scott Titus. If you have something for us to try, we'd love it. The code had your patch applied when gdb was run: Here is the addr2line output for those addresses. Very interesting command, of which I was not aware. [root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/ afr.so 0x6f735 afr_lookup_metadata_heal_check afr-common.c:2803 [root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/ afr.so 0x6f0b9 afr_lookup_done afr-common.c:2455 [root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/ afr.so 0x5c701 afr_inode_event_gen_reset afr-common.c:755 Thanks -Scott On Thu, Apr 09, 2020 at 11:38:04AM +0530, Ravishankar N wrote: > > On 08/04/20 9:55 pm, Erik Jacobson wrote: > > 9439138:[2020-04-08 15:48:44.737590] E > > [afr-common.c:754:afr_inode_event_gen_reset] > > (-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) > > [0x7fa4fb1cb735] > > -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) > > [0x7fa4fb1cb0b9] > > -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) > > [0x7fa4fb1b8701] ) > > 0-cm_shared-replicate-0: Resetting event gen for > > f2d7abf0-5444-48d6-863d-4b128502daf9 > > > Could you print the function/line no. of each of these 3 functions in the > backtrace and see who calls afr_inode_event_gen_reset? `addr2line` should > give you that info: > addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so > 0x6f735 > addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so > 0x6f0b9 > addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so > 0x5c701 > > > I think it is likely called from afr_lookup_done, which I don't think is > necessary. I will send a patch for review. Once reviews are over, I will > share it with you and if it fixes the issue in your testing, we can merge it > with confidence. 
> Thanks,
> Ravi

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
[Gluster-users] Impressive boot times for big clusters: NFS, Image Objects, and Sharding
I wanted to share some positive news with the group here.

Summary: Using sharding and squashfs image files instead of expanded directory trees for RO NFS OS images has led to impressive boot times for 2k-node diskless clusters using 12 servers for gluster+tftp+etc+etc.

Details: As you may have seen in some of my other posts, we have been using gluster to boot giant clusters, some of which are in the top500 list of HPC resources. The compute nodes are diskless. Up until now, we have done this by pushing an operating system from our head node to the storage cluster, which is made up of one or more 3-server/(3-brick) subvolumes in a distributed/replicate configuration. The servers are also PXE-boot and tftpboot servers and also serve the "miniroot" (basically a fat initrd with a cluster manager toolchain). We also locate other management functions there unrelated to boot and root.

This copy of the operating system is simply a directory tree representing the whole operating system image. You could 'chroot' into it, for example. So this operating system is a read-only NFS mount point used by all compute nodes as the base of their root filesystem. This has been working well, getting us boot times (not including BIOS startup) of between 10 and 15 minutes for a 2,000 node cluster. Typically a cluster like this would have 12 gluster/NFS servers in 3 subvolumes. On simple RHEL8 images without much customization, I tend to get 10 minutes.

We have observed some slow-downs for customers whose job launch work loads are very metadata intensive, with giant loads observed on the gluster servers. We recently started supporting RW NFS, as opposed to TMPFS, for the writable components of root. Our customers tend to prefer to keep every byte of memory for jobs.
We came up with a solution of hosting RW NFS sparse files, with XFS filesystems on top, in a writable area in gluster served over NFS. This makes the RW NFS solution very fast because it reduces the per-node RW NFS metadata. Boot times didn't go up significantly (but our first attempt, just using a directory tree, was a slow disaster, hitting the worst-case lots-of-small-file-writes plus lots-of-metadata work load). So we solved that problem with XFS filesystem images on RW NFS.

Building on that idea, we have in our development branch a version of the solution that changes the RO NFS image to a squashfs file on a sharding volume. That is, instead of each operating system being many thousands of files that are (slowly) synced to the gluster servers, the head node makes a squashfs file out of the image and pushes that. Then all the compute nodes mount the squashfs image from the NFS mount (mount the RO NFS mount, then loop-mount the squashfs image). On a 2,000 node cluster I had access to for a time, our prototype got us boot times of 5 minutes -- including RO NFS with squashfs and the RW NFS for writable areas like /etc, /var, etc. (on an XFS image file).

* We also tried RW NFS with OVERLAY, and no problem there.

I expect that, for people who prefer the squashfs non-expanded format, we can reduce the leader-per-compute density. Now, not all customers will want squashfs. Some want to be able to edit a file and see it instantly on all nodes. However, customers looking for fast boot times, or who are suffering slowness on metadata intensive job launch work loads, will have a new fast option. Therefore, it's very important we still solve the bug we're working on in another thread. But I wanted to share something positive.
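To make the squashfs flow concrete, here is a rough sketch of the two halves described above. The paths, hostnames, and image names are illustrative only, not our actual tooling; `DRY_RUN=1` (the default) just prints the commands so the flow is visible without root or a real cluster:

```shell
#!/bin/sh
# Sketch of the squashfs image flow: build one image object on the head
# node, then loop-mount it on each compute node over RO NFS.
# All paths/names here are placeholders. DRY_RUN=1 prints instead of runs.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

IMAGE_TREE=/var/lib/images/rhel8.0              # expanded OS tree on the head node
SQUASH=/gluster/cm_obj_sharded/rhel8.0.squash   # one big file on the sharded volume

# Head node: build a single squashfs object instead of rsyncing
# thousands of files (the slow, metadata-heavy path).
run mksquashfs "$IMAGE_TREE" "$SQUASH" -comp xz -noappend

# Compute node (e.g. from the miniroot/initrd): mount the RO NFS export,
# then loop-mount the squashfs image to use as the root image.
run mount -o ro,nolock leader:/cm_obj_sharded /mnt/images
run mount -o loop,ro /mnt/images/rhel8.0.squash /mnt/rootfs
```

The design point is that the gluster servers then see one large sharded file per OS image rather than per-file metadata traffic for every lookup on every node.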
So now I've said something positive instead of only asking for help :) :)

Erik
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
Thank you again for the help so far. Here is what Scott Titus came up with. Let us know if you have suggestions for next steps.

We never hit the "Event gen is zero" message, so it appears that afr_access() never has a zero event_gen to begin with. However, the "Resetting event gen" message was just a bit chatty, growing our nfs.log to >2.4GB. Many were against a gfid populated with zeros. Around each split brain log, we did find "Resetting event gen" messages containing a matching gfid:

9439138:[2020-04-08 15:48:44.737590] E [afr-common.c:754:afr_inode_event_gen_reset]
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] )
0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9

9439139:[2020-04-08 15:48:44.737636] E [afr-common.c:754:afr_inode_event_gen_reset]
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] )
0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9

9439140:[2020-04-08 15:48:44.737663] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done]
0-cm_shared-replicate-0: Failing ACCESS on gfid f2d7abf0-5444-48d6-863d-4b128502daf9: split-brain observed.
[Input/output error]

9439143:[2020-04-08 15:48:44.737801] E [afr-common.c:754:afr_inode_event_gen_reset]
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] )
0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9

9439145:[2020-04-08 15:48:44.737861] E [afr-common.c:754:afr_inode_event_gen_reset]
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] )
0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9

9439148:[2020-04-08 15:48:44.738125] E [afr-common.c:754:afr_inode_event_gen_reset]
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] )
0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9

9439225:[2020-04-08 15:48:44.749920] E [afr-common.c:754:afr_inode_event_gen_reset]
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] )
0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9

Thanks,
-Scott

On 4/8/20 8:31 AM, Erik Jacobson wrote:
> Hi team -
>
> We got an update to try more stuff from the community.
>
> I feel like I've been "given an inch but am taking a mile" but if we
> do happen to have time on orbit41 again, we'll do the next round of
> debugging.
> Erik

On Wed, Apr 08, 2020 at 01:53:00PM +0530, Ravishankar N wrote:
> On 08/04/20 4:59 am, Erik Jacobson wrote:
> > Apologies for misinterpreting the backtrace.
> >
> > #0 afr_read_txn_refresh_done (frame=0x7ffcf4146478,
> > this=0x7fff64013720, err=5) at afr-read-txn.c:312
> > #1 0x7fff68938d2b in afr_txn_refresh_done
> > (frame=frame@entry=0x7ffcf4146478, this=this@entry=0x7fff64013720,
> > err=5, err@entry=0)
> > at afr-common.c:1222
> Sorry, I missed this too.
> > (gdb) print event_generation
> > $3 = 0
> >
> > (gdb) print priv->fav_child_policy
> > $4 = AFR_FAV_CHILD_NONE
> >
> > I am not sure what this signifies though. It appears to be a read
> > transaction with no event generation and no favorite child policy.
> >
> > Feel free to ask for clarification in case my thought process went awry
> > somewhere.
> Favorite child policy is only for automatically resolving split-brains and
> is 0 unless that volume option is set. The problem is indeed that
> event_generation is zero. Could you try to apply this logging patch and see
> if afr_inode_event_gen_reset() for that gfid is hit
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
My co-worker prefers I keep driving the discussion since this isn't his normal area, but he's far better at digging into these low-level calls than I am, so I am passing along his analysis so far. We are wondering if we have enough information yet to turn on any light bulbs in terms of volume settings, system settings, or a code change... or a suggested path for further debug.

Recall from earlier in the thread: this is a 3-way replicate, single-subvolume gluster filesystem that gets split-brain errors under heavy gnfs load when one of the three servers has gone down, representing a customer-reported problem. Scott's analysis is below. Next steps truly appreciated!!

Apologies for misinterpreting the backtrace.

#0 afr_read_txn_refresh_done (frame=0x7ffcf4146478, this=0x7fff64013720, err=5) at afr-read-txn.c:312
#1 0x7fff68938d2b in afr_txn_refresh_done (frame=frame@entry=0x7ffcf4146478, this=this@entry=0x7fff64013720, err=5, err@entry=0) at afr-common.c:1222
#2 0x7fff68939003 in afr_inode_refresh_done (frame=frame@entry=0x7ffcf4146478, this=this@entry=0x7fff64013720, error=0) at afr-common.c:1294

That is, #1/#2 above are not calls to afr_txn_refresh_done and afr_inode_refresh_done themselves; each frame is a call made from within afr_txn_refresh_done and afr_inode_refresh_done, respectively.
So, afr_txn_refresh_done(frame=frame@entry=0x7ffcf4146478, this=this@entry=0x7fff64013720, err=5, err@entry=0) at afr-common.c:1222 calls a function at line number 1222 in afr-common.c, within the function afr_txn_refresh_done:

1163: int
1164: afr_txn_refresh_done(call_frame_t *frame, xlator_t *this, int err)
1165: {
1166:     call_frame_t *heal_frame = NULL;
1167:     afr_local_t *heal_local = NULL;
1168:     afr_local_t *local = NULL;
1169:     afr_private_t *priv = NULL;
1170:     inode_t *inode = NULL;
1171:     int event_generation = 0;
1172:     int read_subvol = -1;
1173:     int ret = 0;
1174:
1175:     local = frame->local;
1176:     inode = local->inode;
1177:     priv = this->private;
1178:
1179:     if (err)
1180:         goto refresh_done;
1181:
1182:     if (local->op == GF_FOP_LOOKUP)
1183:         goto refresh_done;
1184:
1185:     ret = afr_inode_get_readable(frame, inode, this, local->readable,
1186:                                  &event_generation, local->transaction.type);
1187:
1188:     if (ret == -EIO || (local->is_read_txn && !event_generation)) {
1189:         /* No readable subvolume even after refresh ==> splitbrain.*/
1190:         if (!priv->fav_child_policy) {
1191:             err = EIO;
1192:             goto refresh_done;
1193:         }
1194:         read_subvol = afr_sh_get_fav_by_policy(this, local->replies, inode,
1195:                                                NULL);
1196:         if (read_subvol == -1) {
1197:             err = EIO;
1198:             goto refresh_done;
1199:         }
1200:
1201:         heal_frame = afr_frame_create(this, NULL);
1202:         if (!heal_frame) {
1203:             err = EIO;
1204:             goto refresh_done;
1205:         }
1206:         heal_local = heal_frame->local;
1207:         heal_local->xdata_req = dict_new();
1208:         if (!heal_local->xdata_req) {
1209:             err = EIO;
1210:             AFR_STACK_DESTROY(heal_frame);
1211:             goto refresh_done;
1212:         }
1213:         heal_local->heal_frame = frame;
1214:         ret = synctask_new(this->ctx->env, afr_fav_child_reset_sink_xattrs,
1215:                            afr_fav_child_reset_sink_xattrs_cbk, heal_frame,
1216:                            heal_frame);
1217:         return 0;
1218:     }
1219:
1220: refresh_done:
1221:     afr_local_replies_wipe(local, this->private);
1222:     local->refreshfn(frame, this, err);
1223:
1224:     return 0;
1225: }

So, backtrace #1 represents the following function call:

local->refreshfn(frame=frame@entry=0x7ffcf4146478, this=this@entry=0x7fff64013720, err=5, err@entry=0)

This is the 1st example of EIO being set. Setting a breakpoint at line 1190 (if (!priv->fav_child_policy) {) reveals that ret is not set, but local->is_read_txn is set and event_generation is zero (xlators/cluster/afr/src/afr.h:108), so the conditional at 1188 is true. Furthermore, priv->fav_child_policy is set to AFR_FAV_CHILD_NONE, which is zero, so we found where the error is set to EIO: line 1191. The following is the gdb output:

(gdb) print ret
$1 = 0
(gdb) print local->is_read_txn
$2 = true
(gdb) print event_generation
$3 = 0
(gdb) print priv->fav_child_policy
$4 = AFR_FAV_CHILD_NONE

I am not sure what this signifies though. It appears to be a read transaction with no event generation and no favorite child policy. Feel free to ask for clarification in case my thought process went awry somewhere.

Thanks,
-Scott

On Thu, Apr 02, 2020 at 02:02:46AM -0500, Erik Jacobson wrote:
> > Hmm, afr_inode_refresh_done() is called with error=0 and by the time we
> > reach afr_txn_refresh_done(), it becomes 5(i.e. EIO).
> >
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
During the problem case, near as I can tell, in afr_final_errno(), in the loop where tmp_errno = local->replies[i].op_errno is set, the errno is always "2" when it gets to that point on server 3 (where the NFS load is). I never see a value other than 2. I later simply put the print at the end of the function too, to double-verify non-zero exit codes. There are thousands of non-zero return codes, all 2 when not zero. Here is an example flow right before a split-brain. I do not wish to imply the split-brain is related; it's just an example log snip:

[2020-04-06 00:54:21.125373] E [MSGID: 0] [afr-common.c:2546:afr_final_errno] 0-erikj-afr_final_errno: erikj dbg afr_final_errno() errno from loop before afr_higher_errno was: 2
[2020-04-06 00:54:21.125374] E [MSGID: 0] [afr-common.c:2551:afr_final_errno] 0-erikj-afr_final_errno: erikj dbg returning non-zero: 2
[2020-04-06 00:54:23.315397] E [MSGID: 0] [afr-read-txn.c:283:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: erikj dbg crapola 1st if in afr_read_txn_refresh_done() !priv->thin_arbiter_count -- goto to readfn
[2020-04-06 00:54:23.315432] E [MSGID: 108008] [afr-read-txn.c:314:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing READLINK on gfid 57f269ef-919d-40ec-b7fc-a7906fee648b: split-brain observed. [Input/output error]
[2020-04-06 00:54:23.315450] W [MSGID: 112199] [nfs3-helpers.c:3327:nfs3_log_readlink_res] 0-nfs-nfsv3: /image/images_ro_nfs/rhel8.0/usr/lib64/libmlx5.so.1 => (XID: 1fdba2bc, READLINK: NFS: 5(I/O error), POSIX: 5(Input/output error)) target: (null)

I am missing something. I will see if Scott and I can work together tomorrow. Happy for any more ideas. Thank you!!

On Sun, Apr 05, 2020 at 06:49:56PM -0500, Erik Jacobson wrote:
> First, it's possible our analysis is off somewhere. I never get to your
> print message. I put a debug statement at the start of the function so I
> know we get there (just to verify my print statements were taking
> effect).
> I put a print statement at the if (call_count == 0) { call there, right
> after the if. I ran some tests.
>
> I suspect that isn't a problem area. There were some interesting results
> with an NFS stale file handle error going through that path. Otherwise
> it's always errno=0 even in the heavy test case. I'm not concerned about
> a stale NFS file handle this moment. That print was also hit heavily when
> one server was down (which surprised me but I don't know the internals).
>
> I'm trying to re-read and work through Scott's message to see if any
> other print statements might be helpful.
>
> Thank you for your help so far. I will reply back if I find something.
> Otherwise suggestions welcome!
>
> The MFG system I can access got smaller this weekend but is still large
> enough to reproduce the error.
>
> As you can tell, I work mostly at a level well above filesystem code so
> thank you for staying with me as I struggle through this.
>
> Erik
>
> > After we hear from all children, afr_inode_refresh_subvol_cbk() then calls
> > afr_inode_refresh_done()-->afr_txn_refresh_done()-->afr_read_txn_refresh_done().
> > But you already know this flow now.
> > diff --git a/xlators/cluster/afr/src/afr-common.c
> > b/xlators/cluster/afr/src/afr-common.c
> > index 4bfaef9e8..096ce06f0 100644
> > --- a/xlators/cluster/afr/src/afr-common.c
> > +++ b/xlators/cluster/afr/src/afr-common.c
> > @@ -1318,6 +1318,12 @@ afr_inode_refresh_subvol_cbk(call_frame_t *frame,
> > void *cookie, xlator_t *this,
> >          if (xdata)
> >              local->replies[call_child].xdata = dict_ref(xdata);
> >      }
> > +    if (op_ret == -1)
> > +        gf_msg_callingfn(
> > +            this->name, GF_LOG_ERROR, op_errno, AFR_MSG_SPLIT_BRAIN,
> > +            "Inode refresh on child:%d failed with errno:%d for %s(%s) ",
> > +            call_child, op_errno, local->loc.name,
> > +            uuid_utoa(local->loc.inode->gfid));
> >      if (xdata) {
> >          ret = dict_get_int8(xdata, "link-count", &need_heal);
> >          local->replies[call_child].need_heal = need_heal;
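An aside for anyone chasing similar logs: with multi-GB nfs.log files, aggregating the debug prints beats eyeballing them. A small sketch (the "returning non-zero" marker matches the debug print discussed above; adjust the pattern to whatever message text your own patch emits):

```shell
#!/bin/sh
# Count how often each errno value shows up in afr_final_errno debug prints.
# Reads a log on stdin; in practice: count_errnos < /var/log/glusterfs/nfs.log
count_errnos() {
    awk '/afr_final_errno.*returning non-zero:/ {
        count[$NF]++            # errno is the last field of the debug line
    }
    END { for (n in count) print n, count[n] }'
}

# Tiny stand-in sample so the sketch runs anywhere:
count_errnos <<'EOF'
[2020-04-06 00:54:21.125374] E [afr-common.c:2551:afr_final_errno] 0-x: dbg returning non-zero: 2
[2020-04-06 00:54:21.125900] E [afr-common.c:2551:afr_final_errno] 0-x: dbg returning non-zero: 2
[2020-04-06 00:54:22.000000] E [afr-common.c:2551:afr_final_errno] 0-x: dbg returning non-zero: 107
EOF
```

A skewed histogram (e.g. all 2/ENOENT, as reported above) is itself a useful data point to share with the list.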
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
First, it's possible our analysis is off somewhere. I never get to your print message. I put a debug statement at the start of the function so I know we get there (just to verify my print statements were taking effect).

I put a print statement at the if (call_count == 0) { call there, right after the if. I ran some tests. I suspect that isn't a problem area. There were some interesting results with an NFS stale file handle error going through that path. Otherwise it's always errno=0, even in the heavy test case. I'm not concerned about a stale NFS file handle at this moment. That print was also hit heavily when one server was down (which surprised me, but I don't know the internals).

I'm trying to re-read and work through Scott's message to see if any other print statements might be helpful.

Thank you for your help so far. I will reply back if I find something. Otherwise, suggestions welcome!

The MFG system I can access got smaller this weekend but is still large enough to reproduce the error.

As you can tell, I work mostly at a level well above filesystem code, so thank you for staying with me as I struggle through this.

Erik

> After we hear from all children, afr_inode_refresh_subvol_cbk() then calls
> afr_inode_refresh_done()-->afr_txn_refresh_done()-->afr_read_txn_refresh_done().
> But you already know this flow now.
> diff --git a/xlators/cluster/afr/src/afr-common.c
> b/xlators/cluster/afr/src/afr-common.c
> index 4bfaef9e8..096ce06f0 100644
> --- a/xlators/cluster/afr/src/afr-common.c
> +++ b/xlators/cluster/afr/src/afr-common.c
> @@ -1318,6 +1318,12 @@ afr_inode_refresh_subvol_cbk(call_frame_t *frame, void
> *cookie, xlator_t *this,
>          if (xdata)
>              local->replies[call_child].xdata = dict_ref(xdata);
>      }
> +    if (op_ret == -1)
> +        gf_msg_callingfn(
> +            this->name, GF_LOG_ERROR, op_errno, AFR_MSG_SPLIT_BRAIN,
> +            "Inode refresh on child:%d failed with errno:%d for %s(%s) ",
> +            call_child, op_errno, local->loc.name,
> +            uuid_utoa(local->loc.inode->gfid));
>      if (xdata) {
>          ret = dict_get_int8(xdata, "link-count", &need_heal);
>          local->replies[call_child].need_heal = need_heal;
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
I had a co-worker look through this with me (Scott Titus). He has a more analytical mind than I do. Here is what he said, with some edits by me (formatting and adjusting some words). We were hoping that, given this analysis, the community could let us know if it raises any red flags that would lead to a solution to the problem (whether it be setup, settings, or code). If needed, I can get Scott to work with me and dig further, but it was starting to get painful where Scott stopped.

Scott's words (edited):

(all backtraces match - at least up to the point I'm concerned with at this time)

Error was passed from afr_inode_refresh_done() into afr_txn_refresh_done(), as afr_inode_refresh_done()'s call frame has 'error=0' while afr_txn_refresh_done() has 'err=5' in the call frame.

#0 afr_read_txn_refresh_done (frame=0x7ffc949cf7c8, this=0x7fff640137b0, err=5) at afr-read-txn.c:281
#1 0x7fff68901fdb in afr_txn_refresh_done (frame=frame@entry=0x7ffc949cf7c8, this=this@entry=0x7fff640137b0, err=5, err@entry=0) at afr-common.c:1223
#2 0x7fff689022b3 in afr_inode_refresh_done (frame=frame@entry=0x7ffc949cf7c8, this=this@entry=0x7fff640137b0, error=0) at afr-common.c:1295
#3 0x7fff6890f3fb in afr_inode_refresh_subvol_cbk (frame=0x7ffc949cf7c8, cookie=<optimized out>, this=0x7fff640137b0, op_ret=<optimized out>, op_errno=<optimized out>, buf=buf@entry=0x7ffd53ffdaa0, xdata=0x7ffd3c6764f8, par=0x7ffd53ffdb40) at afr-common.c:1333

Within afr_inode_refresh_done(), the only two ways it can generate an error are by setting it to EINVAL or as a result of a failure status from afr_has_quorum(). Since EINVAL is 22, not 5, the quorum test failed. Within the afr_has_quorum() conditional, an error could be set from afr_final_errno() or afr_quorum_errno(). Digging reveals afr_quorum_errno() just returns ENOTCONN, which is 107, so that is not it. This leaves us with afr_final_errno() returning the error. (Scott provided me with source code with pieces bolded, but I don't think you need that.)
afr_final_errno() iterates through the 'children', looking for valid errors within the replies for the transaction (refresh transaction?). The function returns the highest-valued error, which must be EIO (value of 5) in this case. I have not looked into how or what would set the error value in the replies array; as this is a distributed system, the error could have been generated on another server. Unless this path needs to be investigated, I'd rather not get mired in finding which iteration (value of 'i') has the error and what system/thread added the error to the reply, unless that information is required.

Any suggested next steps?

> > On 01/04/20 8:57 am, Erik Jacobson wrote:
> > Here are some back traces. They make my head hurt. Maybe you can suggest
> > something else to try next? In the morning I'll try to unwind this
> > myself too in the source code but I suspect it will be tough for me.
> >
> > (gdb) break xlators/cluster/afr/src/afr-read-txn.c:280 if err == 5
> > Breakpoint 1 at 0x7fff688e057b: file afr-read-txn.c, line 281.
> > (gdb) continue
> > Continuing.
> > [Switching to Thread 0x7ffec700 (LWP 50175)]
> >
> > Thread 15 "glfs_epoll007" hit Breakpoint 1, afr_read_txn_refresh_done (
> > frame=0x7fff48325d78, this=0x7fff640137b0, err=5) at afr-read-txn.c:281
> > 281 if (err) {
> > (gdb) bt
> > #0 afr_read_txn_refresh_done (frame=0x7fff48325d78, this=0x7fff640137b0,
> > err=5) at afr-read-txn.c:281
> > #1 0x7fff68901fdb in afr_txn_refresh_done (
> > frame=frame@entry=0x7fff48325d78, this=this@entry=0x7fff640137b0,
> > err=5,
> > err@entry=0) at afr-common.c:1223
> > #2 0x7fff689022b3 in afr_inode_refresh_done (
> > frame=frame@entry=0x7fff48325d78, this=this@entry=0x7fff640137b0,
> > error=0)
> > at afr-common.c:1295
> Hmm, afr_inode_refresh_done() is called with error=0 and by the time we
> reach afr_txn_refresh_done(), it becomes 5(i.e. EIO).
> So afr_inode_refresh_done() is changing it to 5.
> Maybe you can put
> breakpoints/log messages in afr_inode_refresh_done() at the places where
> error is getting changed and see where the assignment happens.
>
> Regards,
> Ravi
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
> Hmm, afr_inode_refresh_done() is called with error=0 and by the time we
> reach afr_txn_refresh_done(), it becomes 5(i.e. EIO).
> So afr_inode_refresh_done() is changing it to 5. Maybe you can put
> breakpoints/log messages in afr_inode_refresh_done() at the places where
> error is getting changed and see where the assignment happens.

I had a lot of struggles tonight getting the system ready to go. I had seg11's in glusterfs(nfs), but I think it was related to not all brick processes stopping with glusterd. I also re-installed and/or the print statements. I'm not sure. I'm not used to seeing that.

I put print statements everywhere I thought error could change and got no printed log messages. I put breakpoints where error would change and we didn't hit them.

I then put a breakpoint at:

break xlators/cluster/afr/src/afr-common.c:1298 if error != 0

---> refresh_done: afr_txn_refresh_done(frame, this, error);

And it never triggered (despite split-brain messages and my crapola message). So I'm unable to explain this transition. I'm also not a gdb expert. I still see the same back trace though:

#1 0x7fff68938d7b in afr_txn_refresh_done (
frame=frame@entry=0x7ffd540ed8e8, this=this@entry=0x7fff64013720, err=5,
err@entry=0) at afr-common.c:1222
#2 0x7fff689391f0 in afr_inode_refresh_done (
frame=frame@entry=0x7ffd540ed8e8, this=this@entry=0x7fff64013720,
error=0) at afr-common.c:1299

Is there other advice you might have for me to try? I'm very eager to solve this problem, which is why I'm staying up late to get machine time. I must go to bed now. I look forward to another shot tomorrow night if you have more ideas to try.

Erik
Re: [Gluster-users] Re: Re: Cann't mount NFS,please help!
> Thanks everyone!
>
> You mean that: Ganesha is the newer NFS server solution compared to gNFS,
> and in new versions gNFS is not the suggested component;
> but if I want an NFS server, I should install and configure Ganesha
> separately. Is that it?

I would phrase it this way:

- The community is moving to Ganesha to provide NFS services. Ganesha supports several storage solutions, including gluster.
- Therefore, distros and packages tend to disable the gNFS support in gluster, since they assume people are moving to Ganesha. It would otherwise be a competing solution for NFS.
- Some people still prefer gNFS and do not want to use Ganesha yet, and those people need to re-build their package in some cases, as was outlined in the thread. This then provides the necessary libraries and config files to run gNFS.
- gNFS still works well if you build it, as far as I have found.
- For my use, Ganesha crashes with my "not normal" workload, so I can't switch to it yet. I worked with the community some but ran out of system time and had to drop the thread. I would like to revisit it so that I can run Ganesha too some day. My work load is very far away from typical.

Erik

> sz_cui...@163.com
>
> From: Strahil Nikolov
> Date: 2020-04-02 00:58
> To: Erik Jacobson; sz_cui...@163.com
> CC: gluster-users
> Subject: Re: [Gluster-users] Cann't mount NFS,please help!
> On April 1, 2020 3:37:35 PM GMT+03:00, Erik Jacobson wrote:
> >If you are like me and cannot yet switch to Ganesha (it doesn't work in
> >our workload yet; I need to get back to working with the community on
> >that...)
> >
> >What I would have expected in the process list was a glusterfs process
> >with
> >"nfs" in the name.
> > > >here it is from one of my systems: > > > >root 57927 1 0 Mar31 ?00:00:00 /usr/sbin/glusterfs -s > >localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid -l > >/var/log/glusterfs/nfs.log -S /var/run/gluster/933ab0ad241fab5f.socket > > > > > >My guess - but you'd have to confirm this with the logs - is your > >gluster > >build does not have gnfs built in. Since they wish us to move to > >Ganesha, it is often off by default. For my own builds, I enable it in > >the spec file. > > > >So you should have this installed: > > > >/usr/lib64/glusterfs/7.2/xlator/nfs/server.so > > > >If that isn't there, you likely need to adjust your spec file and > >rebuild. > > > >As others mentioned, the suggestion is to use Ganesha if possible, > >which is a separate project. > > > >I hope this helps! > > > >PS here is a sniip from the spec file I use, with an erikj comment for > >what I adjusted: > > > ># gnfs > ># if you wish to compile an rpm with the legacy gNFS server xlator > ># rpmbuild -ta @PACKAGE_NAME@-@package_vers...@.tar.gz --with gnfs > >%{?_without_gnfs:%global _with_gnfs --disable-gnfs} > > > ># erikj force enable > >%global _with_gnfs --enable-gnfs > ># end erikj > > > > > >On Wed, Apr 01, 2020 at 11:57:16AM +0800, sz_cui...@163.com wrote: > >> 1.The gluster server has set volume option nfs.disable to: off > >> > >> Volume Name: gv0 > >> Type: Disperse > >> Volume ID: 429100e4-f56d-4e28-96d0-ee837386aa84 > >> Status: Started > >> Snapshot Count: 0 > >> Number of Bricks: 1 x (2 + 1) = 3 > >> Transport-type: tcp > >> Bricks: > >> Brick1: gfs1:/brick1/gv0 > >> Brick2: gfs2:/brick1/gv0 > >> Brick3: gfs3:/brick1/gv0 > >> Options Reconfigured: > >> transport.address-family: inet > >> storage.fips-mode-rchecksum: on > >> nfs.disable: off > >> > >> 2. The process has start. 
> >> > >> [root@gfs1 ~]# ps -ef | grep glustershd > >> root 1117 1 0 10:12 ?00:00:00 /usr/sbin/glusterfs > >-s > >> localhost --volfile-id shd/gv0 -p > >/var/run/gluster/shd/gv0/gv0-shd.pid -l /var/ > >> log/glusterfs/glustershd.log -S > >/var/run/gluster/ca97b99a29c04606.socket > >> --xlator-option > >*replicate*.node-uuid=323075ea-2b38-427c-a9aa-70ce18e94208 > &g
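For completeness, the rebuild step mentioned in the bullets above looks roughly like this. The tarball name is a placeholder for whatever source tarball your version uses; the `--with gnfs` conditional is what the spec-file snippet quoted in this thread keys on:

```shell
#!/bin/sh
# Sketch: rebuild the gluster RPMs with the legacy gNFS xlator enabled.
# TARBALL is a placeholder; substitute your actual glusterfs source tarball.
TARBALL=${TARBALL:-glusterfs-7.2.tar.gz}

build_cmd="rpmbuild -ta $TARBALL --with gnfs"
echo "$build_cmd"

# Only attempt the build if the tarball and rpmbuild are actually present:
if [ -e "$TARBALL" ] && command -v rpmbuild >/dev/null 2>&1; then
    $build_cmd
fi
```

After installing the rebuilt packages, the xlator at /usr/lib64/glusterfs/<version>/xlator/nfs/server.so should exist and `nfs.disable: off` volumes should start a gnfs process.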
Re: [Gluster-users] Cann't mount NFS,please help!
If you are like me and cannot yet switch to Ganesha (it doesn't work in our workload yet; I need to get back to working with the community on that...):

What I would have expected in the process list was a glusterfs process with "nfs" in the name. Here it is from one of my systems:

root 57927 1 0 Mar31 ? 00:00:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/gluster/933ab0ad241fab5f.socket

My guess - but you'd have to confirm this with the logs - is your gluster build does not have gnfs built in. Since they wish us to move to Ganesha, it is often off by default. For my own builds, I enable it in the spec file.

So you should have this installed:

/usr/lib64/glusterfs/7.2/xlator/nfs/server.so

If that isn't there, you likely need to adjust your spec file and rebuild.

As others mentioned, the suggestion is to use Ganesha if possible, which is a separate project.

I hope this helps!

PS: here is a snip from the spec file I use, with an erikj comment for what I adjusted:

# gnfs
# if you wish to compile an rpm with the legacy gNFS server xlator
# rpmbuild -ta @PACKAGE_NAME@-@package_vers...@.tar.gz --with gnfs
%{?_without_gnfs:%global _with_gnfs --disable-gnfs}

# erikj force enable
%global _with_gnfs --enable-gnfs
# end erikj

On Wed, Apr 01, 2020 at 11:57:16AM +0800, sz_cui...@163.com wrote:
> 1.The gluster server has set volume option nfs.disable to: off
>
> Volume Name: gv0
> Type: Disperse
> Volume ID: 429100e4-f56d-4e28-96d0-ee837386aa84
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: gfs1:/brick1/gv0
> Brick2: gfs2:/brick1/gv0
> Brick3: gfs3:/brick1/gv0
> Options Reconfigured:
> transport.address-family: inet
> storage.fips-mode-rchecksum: on
> nfs.disable: off
>
> 2. The process has start.
> [root@gfs1 ~]# ps -ef | grep glustershd
> root 1117 1 0 10:12 ? 00:00:00 /usr/sbin/glusterfs -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/ca97b99a29c04606.socket --xlator-option *replicate*.node-uuid=323075ea-2b38-427c-a9aa-70ce18e94208 --process-name glustershd --client-pid=-6
>
> 3. But the status of gv0 is not correct, for its NFS Server is not online.
>
> [root@gfs1 ~]# gluster volume status gv0
> Status of volume: gv0
> Gluster process                  TCP Port  RDMA Port  Online  Pid
> -----------------------------------------------------------------
> Brick gfs1:/brick1/gv0           49154     0          Y       4180
> Brick gfs2:/brick1/gv0           49154     0          Y       1222
> Brick gfs3:/brick1/gv0           49154     0          Y       1216
> Self-heal Daemon on localhost    N/A       N/A        Y       1117
> NFS Server on localhost          N/A       N/A        N       N/A
> Self-heal Daemon on gfs2         N/A       N/A        Y       1138
> NFS Server on gfs2               N/A       N/A        N       N/A
> Self-heal Daemon on gfs3         N/A       N/A        Y       1131
> NFS Server on gfs3               N/A       N/A        N       N/A
>
> Task Status of Volume gv0
> -------------------------
> There are no active volume tasks
>
> 4. So, I can't mount gv0 on my client.
>
> [root@kvms1 ~]# mount -t nfs gfs1:/gv0 /mnt/test
> mount.nfs: Connection refused
>
> Please help!
> Thanks!
>
> sz_cui...@163.com

Erik Jacobson
Software Engineer
erik.jacob...@hpe.com
+1 612 851 0550 Office
Eagan, MN
hpe.com

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
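If you want to script the check described above, here is a minimal sketch. The xlator path is the glusterfs 7.2 one quoted in this thread, and the tarball name in the rebuild hint is illustrative; adjust both for your installed release.

```shell
#!/bin/sh
# Check whether the legacy gnfs server xlator was built and installed.
# Path is the one from this thread (glusterfs 7.2); adjust the version dir.
XLATOR=/usr/lib64/glusterfs/7.2/xlator/nfs/server.so

if [ -e "$XLATOR" ]; then
    STATUS="gnfs xlator present"
else
    # Rebuild hint per the spec-file snippet above; tarball name is illustrative.
    STATUS="gnfs xlator missing; rebuild with: rpmbuild -ta glusterfs-7.2.tar.gz --with gnfs"
fi
echo "$STATUS"
```

If the xlator is present but "NFS Server" still shows N in `gluster volume status`, the nfs.log is the next place to look.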
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
THANK YOU for the hints. Very happy to have the help. I'll reply to a couple things, then dig in:

On Tue, Mar 31, 2020 at 03:27:59PM +0530, Ravishankar N wrote:
> From your reply in the other thread, I'm assuming that the file/gfid in
> question is not in genuine split-brain or needing heal. i.e. for example

Right, they were not tagged split-brain either, just needing heal, which is expected for those 76 files.

> with that 1 brick down and 2 bricks up test case, if you tried to read the
> file from say a temporary fuse mount (which is also now connected only to
> 2 bricks since the 3rd one is down) it works fine and there is no EIO
> error...

Looking at the heal info, all files are the files I expected to have write changes, and I think they are outside the scope of this issue. To close the loop, I ran 'strings' on the top of one of the files from a fuse mount to confirm and had no trouble.

> ...which means that what you have observed is true, i.e.
> afr_read_txn_refresh_done() is called with err=EIO. You can add logs to see
> at what point EIO is set. The call graph is like this:
> afr_inode_refresh_done()-->afr_txn_refresh_done()-->afr_read_txn_refresh_done().
>
> Maybe
> https://github.com/gluster/glusterfs/blob/v7.4/xlators/cluster/afr/src/afr-common.c#L1188
> in afr_txn_refresh_done() is causing it either due to ret being -EIO or
> event_generation being zero.
>
> If you are comfortable with gdb, you can put a conditional break point in
> afr_read_txn_refresh_done() at
> https://github.com/gluster/glusterfs/blob/v7.4/xlators/cluster/afr/src/afr-read-txn.c#L283
> when err=EIO and then check the backtrace for who is setting err to EIO.

Ok, so the main event! :) I'm not a gdb expert, but I think I figured it out well enough to paste some backtraces. However, I'm having trouble interpreting them exactly. It looks to me to be the "event" case.
(I got permission to use this MFG system at night for a couple more nights, avoiding the 24-hour-reserved internal larger system we have.)

Here is what I did; feel free to suggest something better:

- I am using an RPM build, so I changed the spec file to create debuginfo packages. I'm on RHEL 8.1.
- I installed the updated packages and debuginfo packages.
- When glusterd started the nfs glusterfs, I killed it.
- I ran this:
  gdb -d /root/rpmbuild/BUILD/glusterfs-7.2 -d /root/rpmbuild/BUILD/glusterfs-7.2/xlators/cluster/afr/src/ /usr/sbin/glusterfs
- Then from gdb, I ran this:
  (gdb) run -s localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/gluster/9ddb5561058ff543.socket -N
- I hit ctrl-c, then set the break point:
  (gdb) break xlators/cluster/afr/src/afr-read-txn.c:280 if err == 5
- I have some debugging statements, but gluster 7.2 line 280 is this (I think gdb changed it to 281 internally):
      if (err) {
          if (!priv->thin_arbiter_count) {
- Then I continued and ran the test case.

Here are some backtraces. They make my head hurt. Maybe you can suggest something else to try next? In the morning I'll try to unwind this myself in the source code too, but I suspect it will be tough for me.

(gdb) break xlators/cluster/afr/src/afr-read-txn.c:280 if err == 5
Breakpoint 1 at 0x7fff688e057b: file afr-read-txn.c, line 281.
(gdb) continue
Continuing.
[Switching to Thread 0x7ffec700 (LWP 50175)]
Thread 15 "glfs_epoll007" hit Breakpoint 1, afr_read_txn_refresh_done (frame=0x7fff48325d78, this=0x7fff640137b0, err=5) at afr-read-txn.c:281
281         if (err) {
(gdb) bt
#0  afr_read_txn_refresh_done (frame=0x7fff48325d78, this=0x7fff640137b0, err=5) at afr-read-txn.c:281
#1  0x7fff68901fdb in afr_txn_refresh_done (frame=frame@entry=0x7fff48325d78, this=this@entry=0x7fff640137b0, err=5, err@entry=0) at afr-common.c:1223
#2  0x7fff689022b3 in afr_inode_refresh_done (frame=frame@entry=0x7fff48325d78, this=this@entry=0x7fff640137b0, error=0) at afr-common.c:1295
#3  0x7fff6890f3fb in afr_inode_refresh_subvol_cbk (frame=0x7fff48325d78, cookie=, this=0x7fff640137b0, op_ret=, op_errno=, buf=buf@entry=0x7ffecfffdaa0, xdata=0x7ffeb806ef08, par=0x7ffecfffdb40) at afr-common.c:1333
#4  0x7fff6890f42a in afr_inode_refresh_subvol_with_lookup_cbk (frame=, cookie=, this=, op_ret=, op_errno=, inode=, buf=0x7ffecfffdaa0, xdata=0x7ffeb806ef08, par=0x7ffecfffdb40) at afr-common.c:1344
#5  0x7fff68b8e96f in client4_0_lookup_cbk (req=, iov=, count=, myframe=0x7fff483147b8) at client-rpc-fops_v2.c:2640
#6  0x7fffed293115 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fff640671b0, pollin=pollin@entry=0x7ffeb81aa110) at rpc-clnt.c:764
#7  0x7fffed2934b3 in rpc_clnt_notify (trans=0x7fff64067540, mydata=0x7fff640671e0, event=, data=0x7ffeb81aa110) at rpc-clnt.c:931
#8  0x7fffed29007b in rpc_transport_notify (this=this@entry=0x7fff64067540,
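The interactive gdb session above can also be driven from a command file, which makes it easier to capture every hit overnight. This is a sketch: the breakpoint, run arguments, and socket name are the ones quoted in this thread; adjust them for your own build tree and run directory.

```shell
#!/bin/sh
# Sketch: capture backtraces non-interactively instead of typing into gdb.
# The breakpoint condition and run arguments are taken from this thread;
# the socket name changes per run, so adjust before using.
cat > /tmp/afr-bt.gdb <<'EOF'
break afr-read-txn.c:280 if err == 5
commands 1
  backtrace
  continue
end
run -s localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/gluster/9ddb5561058ff543.socket -N
EOF
echo "wrote /tmp/afr-bt.gdb"
# Then run something like:
#   gdb -d /root/rpmbuild/BUILD/glusterfs-7.2 \
#       -d /root/rpmbuild/BUILD/glusterfs-7.2/xlators/cluster/afr/src \
#       -x /tmp/afr-bt.gdb /usr/sbin/glusterfs 2>&1 | tee /tmp/afr-bt.log
```

With `commands ... continue`, gdb logs a backtrace at every hit and lets the process keep running, so the test load does not stall at the breakpoint.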
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
I note that this part of afr_read_txn() gets triggered a lot:

    if (afr_is_inode_refresh_reqd(inode, this, local->event_generation,
                                  event_generation)) {

Maybe that's normal when one of the three servers is down (but why isn't it using its local copy by default?). The comment in that if block is:

    /* servers have disconnected / reconnected, and possibly
       rebooted, very likely changing the state of freshness
       of copies */

But we have one server consistently down, not a changing situation.

Digging, digging, digging seemed to show this is related to cache invalidation, because the paths seemed to suggest the inode needed refreshing, and that seems handled by a case statement named GF_UPCALL_CACHE_INVALIDATION. However, that must have been a wrong turn, since turning off cache invalidation didn't help.

I'm struggling to wrap my head around the code base, and without the background in these concepts it's a tough hill to climb. I am going to have to try this again some day with fresh eyes and go to bed; the machine I have easy access to is going away in the morning. Now I'll have to reserve time on a contended one, but I will do that and continue digging. Any suggestions would be greatly appreciated, as I think I'm starting to tip over here on this one.

On Mon, Mar 30, 2020 at 04:04:39PM -0500, Erik Jacobson wrote:
> > Sadly I am not a developer, so I can't answer your questions.
>
> I'm not a FS or network developer either. I think there is a joke about
> playing one on TV, but maybe it's Netflix now.
>
> Enabling certain debug options made too much information for me to watch
> personally (but an expert could probably get through it).
>
> So I started putting targeted 'print' (gf_msg) statements in the code to
> see how it got its way to split-brain. Maybe this will ring a bell
> for someone.
>
> I can tell the only way we enter the split-brain path is through the
> first if statement of afr_read_txn_refresh_done().
> This means afr_read_txn_refresh_done() itself was passed "err" and
> it appears thin_arbiter_count was not set (which makes sense;
> I'm using 1x3, not a thin arbiter).
>
> So we jump to the readfn label, and read_subvol() should still be -1.
> If I read right, it must mean that this if didn't return true, because
> my print statement didn't appear:
>     if ((ret == 0) && spb_choice >= 0) {
>
> So we're still with the original read_subvol == -1,
> which gets us to the split_brain message.
>
> So now I will try to learn why afr_read_txn_refresh_done() would have
> 'err' set in the first place. I will also learn about
> afr_inode_split_brain_choice_get(). Those seem to be the two ways to
> have avoided falling into the split-brain hole here.
>
> I put debug statements in these locations. I will mark with !! what
> I see:
>
> diff -Narup glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c
> --- glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c 2020-01-15 11:43:53.887894293 -0600
> +++ glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c 2020-03-30 15:45:02.917104321 -0500
> @@ -279,10 +279,14 @@ afr_read_txn_refresh_done(call_frame_t *
>      priv = this->private;
>
>      if (err) {
> -        if (!priv->thin_arbiter_count)
> +        if (!priv->thin_arbiter_count) {
> +            gf_msg(this->name, GF_LOG_ERROR, 0, 0, "erikj dbg crapola 1st if in afr_read_txn_refresh_done() !priv->thin_arbiter_count -- goto to readfn");
>
> !! We hit this error condition and jump to readfn below !!
>
>              goto readfn;
> -        if (err != EINVAL)
> +        }
> +        if (err != EINVAL) {
> +            gf_msg(this->name, GF_LOG_ERROR, 0, 0, "erikj 2nd if in afr_read_txn_refresh_done() err != EINVAL, goto readfn");
>              goto readfn;
> +        }
>          /* We need to query the good bricks and/or thin-arbiter.*/
>          afr_ta_read_txn_synctask(frame, this);
>          return 0;
> @@ -291,6 +295,8 @@ afr_read_txn_refresh_done(call_frame_t *
>      read_subvol = afr_read_subvol_select_by_policy(inode, this, local->readable, NULL);
>      if (read_subvol == -1) {
> +        gf_msg(this->name, GF_LOG_ERROR, 0, 0, "erikj dbg whoops read_subvol returned -1, going to readfn");
> +
>          err = EIO;
>          goto readfn;
>      }
> @@ -304,11 +310,15 @@ afr_read_txn_refresh_done(call_frame_t *
>  readfn:
>      if (read_subvol == -1) {
>          ret = afr_inode_split_brain_choice_get(
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
> Sadly I am not a developer, so I can't answer your questions.

I'm not a FS or network developer either. I think there is a joke about playing one on TV, but maybe it's Netflix now.

Enabling certain debug options made too much information for me to watch personally (but an expert could probably get through it).

So I started putting targeted 'print' (gf_msg) statements in the code to see how it got its way to split-brain. Maybe this will ring a bell for someone.

I can tell the only way we enter the split-brain path is through the first if statement of afr_read_txn_refresh_done().

This means afr_read_txn_refresh_done() itself was passed "err" and it appears thin_arbiter_count was not set (which makes sense; I'm using 1x3, not a thin arbiter).

So we jump to the readfn label, and read_subvol() should still be -1. If I read right, it must mean that this if didn't return true, because my print statement didn't appear:

    if ((ret == 0) && spb_choice >= 0) {

So we're still with the original read_subvol == -1, which gets us to the split_brain message.

So now I will try to learn why afr_read_txn_refresh_done() would have 'err' set in the first place. I will also learn about afr_inode_split_brain_choice_get(). Those seem to be the two ways to have avoided falling into the split-brain hole here.

I put debug statements in these locations. I will mark with !! what I see:

diff -Narup glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c
--- glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c 2020-01-15 11:43:53.887894293 -0600
+++ glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c 2020-03-30 15:45:02.917104321 -0500
@@ -279,10 +279,14 @@ afr_read_txn_refresh_done(call_frame_t *
     priv = this->private;
 
     if (err) {
-        if (!priv->thin_arbiter_count)
+        if (!priv->thin_arbiter_count) {
+            gf_msg(this->name, GF_LOG_ERROR, 0, 0, "erikj dbg crapola 1st if in afr_read_txn_refresh_done() !priv->thin_arbiter_count -- goto to readfn");

!! We hit this error condition and jump to readfn below !!

             goto readfn;
-        if (err != EINVAL)
+        }
+        if (err != EINVAL) {
+            gf_msg(this->name, GF_LOG_ERROR, 0, 0, "erikj 2nd if in afr_read_txn_refresh_done() err != EINVAL, goto readfn");
             goto readfn;
+        }
         /* We need to query the good bricks and/or thin-arbiter.*/
         afr_ta_read_txn_synctask(frame, this);
         return 0;
@@ -291,6 +295,8 @@ afr_read_txn_refresh_done(call_frame_t *
     read_subvol = afr_read_subvol_select_by_policy(inode, this, local->readable, NULL);
     if (read_subvol == -1) {
+        gf_msg(this->name, GF_LOG_ERROR, 0, 0, "erikj dbg whoops read_subvol returned -1, going to readfn");
+
         err = EIO;
         goto readfn;
     }
@@ -304,11 +310,15 @@ afr_read_txn_refresh_done(call_frame_t *
 readfn:
     if (read_subvol == -1) {
         ret = afr_inode_split_brain_choice_get(inode, this, &spb_choice);
-        if ((ret == 0) && spb_choice >= 0)
+        if ((ret == 0) && spb_choice >= 0) {

!! We never get here; afr_inode_split_brain_choice_get() must not have returned what was needed to enter. !!

+            gf_msg(this->name, GF_LOG_ERROR, 0, 0, "erikj dbg read_subvol was -1 to begin with split brain choice found: %d", spb_choice);
             read_subvol = spb_choice;
+        }
     }
     if (read_subvol == -1) {
+        gf_msg(this->name, GF_LOG_ERROR, 0, 0, "erikj dbg verify this shows up above split-brain error");

!! We hit here. Game over, player. !!

         AFR_SET_ERROR_AND_CHECK_SPLIT_BRAIN(-1, err);
     }
     afr_read_txn_wind(frame, this, read_subvol);
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
> Hi Erik,
> Sadly I didn't have the time to take a look in your logs, but I would like to
> ask you whether you have statistics of the network bandwidth usage.
> Could it be possible that the gNFS server is starved for bandwidth and fails
> to reach all bricks, leading to 'split-brain' errors?

I understand. I doubt there is a bandwidth issue, but I'll add this to my checks. We have 288 nodes per server normally and they run fine with all servers up. The 76 number is just what we happened to have access to on an internal system.

Question: what you mentioned above, and a feeling I have too, is: is the split-brain error actually a generic catch-all for not being able to get access to a file? So when it says "split-brain", could it really mean any type of access error? Could it also be given when there is an IO timeout or something?

I'm starting to break open the source code to look around, but I think my head will explode before I understand it enough. I will still give it a shot.

I have access to this system until later tonight. Then it goes away. We have duplicated the problem on another system that stays, but that machine is so contended internally that I wouldn't get a time slot until later in the week anyway. Trying to make as much use of this "gift" machine as I can :) :)

Thanks again for the replies so far.

Erik
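To rule the bandwidth theory in or out, one quick approach is to diff the interface byte counters in /proc/net/dev across an interval during the test. A minimal sketch; the `eth0` interface name and the snapshot lines below are examples, not from this cluster:

```shell
#!/bin/sh
# Estimate RX throughput from two /proc/net/dev snapshots taken N seconds
# apart. After the "iface:" prefix, field 1 is RX bytes (field 9 is TX bytes).
rx_bytes() { sed -n "s/^ *$1: *//p" | awk '{print $1}'; }

# Demo with two fake snapshots "taken" 5 seconds apart for a hypothetical eth0.
before=$(echo "  eth0: 1000 10 0 0 0 0 0 0 9000 90 0 0 0 0 0 0" | rx_bytes eth0)
after=$(echo  "  eth0: 6000 60 0 0 0 0 0 0 9900 99 0 0 0 0 0 0" | rx_bytes eth0)
echo "rx rate: $(( (after - before) / 5 )) B/s"

# Live use on the gnfs server would be roughly:
#   rx_bytes eth0 < /proc/net/dev; sleep 5; rx_bytes eth0 < /proc/net/dev
```

If the observed rate during a failing run sits far below the NIC's line rate, bandwidth starvation is an unlikely cause.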
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
Thank you so much for replying --

> > [2020-03-29 03:42:52.295532] E [MSGID: 108008]
> > [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0:
> > Failing ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain
> > observed. [Input/output error]
>
> Since you say that the errors go away when all 3 bricks (which I guess is
> what you refer to as 'leaders') of the replica are up, it could be possible

Yes, leaders == gluster+gnfs servers here. We use 'leader' internally to mean servers that help manage compute nodes. I try to convert it to 'server' in my writing, but 'leader' slips out sometimes.

> that the brick you brought down had the only good copy. In such cases, even
> though you have the other 2 bricks of the replica up, they both are bad

I think all 3 copies are good. That is because the same exact files are accessed the same way when nodes boot. With one server down, 76 nodes normally boot with no errors. Once in a while one fails with split-brain errors in the log. The more load I put on, the more likely a split-brain when one server is down.

So that's why my test case is so weird looking. It has to generate a bunch of extra load and then try to access root filesystem files using our tools to trigger the split-brain. The test is good in that it produces at least a couple split-brain errors every time. I'm actually very happy to have a test case; we've been dealing with reports for some time.

The healing errors seen are explained by the writable XFS image files in gluster -- one per node -- that the nodes use for their /etc, /var, and so on. So the 76 healing messages were expected. If it would help reduce confusion, I can repeat the test using TMPFS for the writable areas so that the healing list is clear.

> copies waiting to be healed and hence all operations on those files will
> fail with EIO. Since you say this occurs under high load only. I suspect

To be clear, with one server down, operations work like 99.9% of the time.
Same operations on every node. It's only when we bring the load up (maybe heavy metadata related?) that we get split-brain errors with one server down. It is a strange problem, but I don't believe there is a problem with any copy of any file. Never say never, and nothing would make me happier than being wrong and solving the problem.

I want to thank you so much for writing back. I'm willing to try any suggestions we come up with.

Erik
Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
Thank you for replying!! Responses below... I have attached the volume definition (meant to before). I have attached a couple logs from one of the leaders.

> That's odd.
> As far as I know, the clients are accessing one of the gluster nodes
> that serves as NFS server, which then syncs data across the peers, right?

Correct, although in this case, with a 1x3, all of them should have local copies. Our first reports came in from 3x3 (9 server) systems, but we have thankfully been able to duplicate it on 1x3 in house. This is a huge step forward, as I had no reproducer previously.

> What happens when the virtual IP(s) are failed over to the other gluster
> node? Is the issue resolved?

While we do use CTDB for managing the IP aliases, I don't start the test until the IP is stabilized. I put all 76 nodes on one IP alias to make a load more similar to what we have in the field.

I think it is important to point out that if I reduce the load, all is well. For example, if the test were just booting -- where the initial reports were seen -- just 1 or 2 nodes out of 1,000 would have an issue each cycle. They all boot the same way and are all using the same IP alias for NFS in my test case. So I think the split-brain messages are maybe a symptom of some sort of timeout??? (making stuff up here).

> Also, what kind of load balancing are you using?

[I moved this question up because the answer below has too much output]

We are doing very simple, manual balancing. As we add compute nodes to the cluster, a couple racks are assigned to IP alias #1, the next couple to IP alias #2, and so on. I'm happy to not have the complexity of a real load balancer right now.

> Do you get any split brain entries via 'gluster volume heal info'?

I ran a few trials of 'gluster volume heal ...'
Trial 1 - with all 3 servers up and while running the load:

[root@leader2 ~]# gluster volume heal cm_shared info
Brick 172.23.0.4:/data/brick_cm_shared
Status: Connected
Number of entries: 0

Brick 172.23.0.5:/data/brick_cm_shared
Status: Connected
Number of entries: 0

Brick 172.23.0.6:/data/brick_cm_shared
Status: Connected
Number of entries: 0

Trial 2 - with 1 server down (stopped glusterd on 1 server), and without doing any testing yet, I see this. Let me explain, though: while not in the error path, I am using RW NFS filesystem image blobs on this same volume for the writable areas of the nodes. In the field, we duplicate the problem even when using TMPFS for that writable area. I am happy to re-do the test with RO NFS and TMPFS for the writable area, which my GUESS says would make the healing messages go away. Would that help? If you look at the heal count in trial 3 -- 76 -- it equals the node count: one writable XFS image file being written per node.

[root@leader2 ~]# gluster volume heal cm_shared info
Brick 172.23.0.4:/data/brick_cm_shared
Status: Transport endpoint is not connected
Number of entries: -

Brick 172.23.0.5:/data/brick_cm_shared
Status: Connected
Number of entries: 8

Brick 172.23.0.6:/data/brick_cm_shared
Status: Connected
Number of entries: 8

Trial 3 - ran the heal command around the time the split-brain errors were being reported:

[root@leader2 glusterfs]# gluster volume heal cm_shared info
Brick 172.23.0.4:/data/brick_cm_shared
Status: Transport endpoint is not connected
Number of entries: -

Brick 172.23.0.5:/data/brick_cm_shared
Status: Connected
Number of entries: 76

Brick 172.23.0.6:/data/brick_cm_shared
Status: Connected
Number of entries: 76

Volume Name: cm_shared
Type: Replicate
Volume ID: f6175f56-8422-4056-9891-f9ba84756b87
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.23.0.4:/data/brick_cm_shared
Brick2: 172.23.0.5:/data/brick_cm_shared
Brick3: 172.23.0.6:/data/brick_cm_shared
Options Reconfigured:
nfs.event-threads: 3
config.brick-threads: 16
config.client-threads: 16
performance.iot-pass-through: false
config.global-threading: off
performance.client-io-threads: on
nfs.disable: off
storage.fips-mode-rchecksum: on
transport.address-family: inet
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
cluster.lookup-optimize: on
client.event-threads: 32
server.event-threads: 32
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 100
performance.io-thread-count: 32
performance.cache-size: 8GB
performance.parallel-readdir: on
cluster.lookup-unhashed: auto
performance.flush-behind: on
performance.aggregate-size: 2048KB
performance.write-behind-trickling-writes: off
transport.listen-backlog: 16384
performance.write-behind-window-size: 1024MB
server.outstanding-rpc-limit: 1024
nfs.outstanding-rpc-limit: 1024
nfs.acl: on
storage.max-hardlinks: 0
performance.cache-refresh-timeout: 60
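For watching heal counts like the trials above converge, a small awk helper can total the per-brick "Number of entries" lines from `gluster volume heal <vol> info`. This is a sketch; the demo feeds it output captured in this thread, and the cm_shared volume name in the comment is this cluster's:

```shell
#!/bin/sh
# Sum "Number of entries:" across bricks of `gluster volume heal <vol> info`.
# Bricks reporting "-" (transport not connected) are skipped.
heal_total() {
    awk -F': ' '/^Number of entries:/ && $2 ~ /^[0-9]+$/ { sum += $2 }
                END { print sum + 0 }'
}

# Demo with output captured in this thread. Live use would be:
#   gluster volume heal cm_shared info | heal_total
heal_total <<'EOF'
Brick 172.23.0.4:/data/brick_cm_shared
Status: Transport endpoint is not connected
Number of entries: -
Brick 172.23.0.5:/data/brick_cm_shared
Status: Connected
Number of entries: 76
Brick 172.23.0.6:/data/brick_cm_shared
Status: Connected
Number of entries: 76
EOF
```

Run in a watch loop, a steadily shrinking total after the downed server returns distinguishes normal heal backlog from entries that are actually stuck.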
[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
Hello all,

I am getting split-brain errors in the gnfs nfs.log when 1 gluster server is down in a 3-brick/3-node gluster volume. It only happens under intense load. I reported this a few months ago but didn't have a repeatable test case. Since then, we got reports from the field and I was able to make a test case with 3 gluster servers and 76 NFS clients/compute nodes. I point all 76 nodes at one gnfs server to make the problem more likely to happen with the limited nodes we have in-house.

We are using gluster NFS (Ganesha is not yet reliable for our workload) to export an NFS filesystem that is used as a read-only root filesystem for NFS clients. The largest client count we have is 2592 across 9 leaders (3 replicated subvolumes), out in the field. This is where the problem was first reported. In the lab, I have a test case that can repeat the problem on a single-subvolume cluster.

Please forgive how ugly the test case is. I'm sure an IO test person could make it pretty. It basically runs a bunch of cluster-manager NFS-intensive operations while also producing other load. If one leader is down, nfs.log reports some split-brain errors. For real-world customers, the symptom is "some nodes failing to boot" in various ways, or "jobs failing to launch due to permissions or file read problems" (like a library not being readable on one node). If all leaders are up, we see no errors. As an attachment, I will include volume settings.

Here are example nfs.log errors:

[2020-03-29 03:42:52.295532] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain observed. [Input/output error]
[2020-03-29 03:42:52.295583] W [MSGID: 112199] [nfs3-helpers.c:3308:nfs3_log_common_res] 0-nfs-nfsv3: /bin/whoami => (XID: 19fb1558, ACCESS: NFS: 5(I/O error), POSIX: 5(Input/output error))
[2020-03-29 03:43:03.600023] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing ACCESS on gfid 77614c4f-1ac4-448d-8fc2-8aedc9b30868: split-brain observed. [Input/output error]
[2020-03-29 03:43:03.600075] W [MSGID: 112199] [nfs3-helpers.c:3308:nfs3_log_common_res] 0-nfs-nfsv3: /lib64/perl5/vendor_perl/XML/LibXML/Literal.pm => (XID: 9a851abc, ACCESS: NFS: 5(I/O error), POSIX: 5(Input/output error))
[2020-03-29 03:43:07.681294] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing READLINK on gfid 36134289-cb2d-43d9-bd50-60e23d7fa69b: split-brain observed. [Input/output error]
[2020-03-29 03:43:07.681339] W [MSGID: 112199] [nfs3-helpers.c:3327:nfs3_log_readlink_res] 0-nfs-nfsv3: /lib64/.libhogweed.so.4.hmac => (XID: 5c29744f, READLINK: NFS: 5(I/O error), POSIX: 5(Input/output error)) target: (null)

The brick log isn't very interesting during the failure. There are some ACL errors that don't seem to directly relate to the issue at hand. (I can attach them if requested!)

This is glusterfs 7.2 (although we originally hit it with 4.1.6). I'm using RHEL 8 (although field reports are from RHEL 7.6). If there is anything the community can suggest to help me with this, it would really be appreciated. I'm getting unhappy reports from the field that the failover doesn't work as expected.

I've tried tweaking several things, from various threading settings to enabling md-cache-statfs to mem-factor to listen backlogs. I even tried adjusting the cluster.read-hash-mode and choose-local settings.

"cluster-configuration" in the script initiates a bunch of operations on the node that result in reading many files and doing some database queries.
I used it in my test case as it is a common failure point when nodes are booting. This test case, although ugly, fails 100% of the time if one server is down and works 100% of the time if all servers are up.

#! /bin/bash
#
# Test case:
#
# in a 1x3 Gluster Replicated setup with the HPCM volume settings..
#
# On a cluster with 76 nodes (maybe can be replicated with less, we don't
# know)
#
# When all the nodes are assigned to one IP alias to get the load in to
# one leader node
#
# This test case will produce split-brain errors in the nfs.log file
# when 1 leader is down, but will run clean when all 3 are up.
#
# It is not necessary to power off the leader you wish to disable. Simply
# running 'systemctl stop glusterd' is sufficient.
#
# We will use this script to try to resolve the issue with split-brain
# under stress when one leader is down.
#
# (compute group is 76 compute nodes)

echo "killing any node find or node tar commands..."
pdsh -f 500 -g compute killall find
pdsh -f 500 -g compute killall tar

# (in this test, leader1 is known to have glusterd stopped for the test case)
echo "stop, start glusterd, drop caches, sleep 15"
set -x
pdsh -w leader2,leader3 systemctl stop glusterd
sleep 3
pdsh -w leader2,leader3 "echo 3 > /proc/sys/vm/drop_caches"
pdsh -w leader2,leader3 systemctl start glusterd
set +x
sleep 15
echo "drop
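When comparing runs of a test case like this, it helps to reduce the nfs.log to a per-gfid tally of the split-brain errors. A sketch; the demo input is taken from the log lines quoted earlier in this thread, and the log path in the comment is the usual gnfs location:

```shell
#!/bin/sh
# Count "split-brain observed" occurrences per gfid in a gnfs nfs.log,
# so separate test runs can be compared at a glance.
sb_by_gfid() {
    grep 'split-brain observed' |
        sed -n 's/.*gfid \([0-9a-f-]*\):.*/\1/p' | sort | uniq -c
}

# Demo with lines from this thread. Live use would be roughly:
#   sb_by_gfid < /var/log/glusterfs/nfs.log
sb_by_gfid <<'EOF'
[2020-03-29 03:42:52.295532] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing ACCESS on gfid 8eed77d3-b4fa-4beb-a0e7-e46c2b71ffe1: split-brain observed. [Input/output error]
[2020-03-29 03:43:07.681294] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing READLINK on gfid 36134289-cb2d-43d9-bd50-60e23d7fa69b: split-brain observed. [Input/output error]
EOF
```

If the same gfids repeat across runs, that points at specific files to inspect with `getfattr` on the bricks; if the gfids are always different, it looks more like a load-dependent timing issue.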
Re: [Gluster-users] gluster NFS hang observed mounting or umounting at scale
While it's still early, our testing is showing this issue fixed in glusterfs 7.2 (we were at 4.1.6). Closing the loop in case people search for this.

Erik

On Sun, Jan 26, 2020 at 12:04:00PM -0600, Erik Jacobson wrote:
> One last reply to myself.
>
> One of the test cases my test scripts triggered turned out to actually
> be due to my NFS RW mount options.
>
> OLD RW NFS mount options:
> "rw,noatime,nocto,actimeo=3600,lookupcache=all,nolock,tcp,vers=3"
>
> NEW options that work better:
> "rw,noatime,nolock,tcp,vers=3"
>
> I had copied the RO NFS options we use, which try to be aggressive about
> caching. The RO root image doesn't change much and we want it as fast
> as possible. The options are not appropriate for RW areas that change
> (even though it's a single image file we care about).
>
> So now my test scripts run clean, but since what we see on larger systems
> is right after reboot, the caching shouldn't matter. In the real problem
> case, the RW stuff is done once after reboot.
>
> FWIW I attached my current test scripts; my last batch had some errors.
>
> The search continues for the actual problem, which I'm struggling to
> reproduce at 366 NFS clients.
>
> I believe yesterday, when I posted about actual HANGS, that was the real
> problem we're tracking. I hit that once in my test scripts - only once.
> My script was otherwise hitting a "file doesn't really exist even though
> cached" issue and it was tricking my scripts.
>
> In any case, I'm changing the RW NFS options we use regardless.
>
> Erik

Community Meeting Calendar:
APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/441850968
NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/441850968
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] GlusterFS problems & alternatives
> looking through the last couple of weeks on this mailing list and reflecting
> on our own experiences, I have to ask: what is the status of GlusterFS? So many
> people here reporting bugs and no solutions in sight. GlusterFS clusters
> break left and right, reboots of a node have become a warrant for instability
> and broken clusters, no way to fix broken clusters. And all of that with
> recommended settings, and in our case, enterprise hardware underneath.

I have been one of the people asking questions. I sometimes get an answer, which I appreciate. Other times not. But I'm not paying for support in this forum, so I appreciate what I can get. My questions are sometimes very hard to summarize, and I can't say I've been offering help as much as I ask. I think I will try to do better.

Just to counter with something cool: as we speak, I'm working on a 2,000 node cluster that will soon be a 5,120 node cluster. We're validating it with the newest version of our cluster manager. It has 12 leader nodes (soon to have 24) that are gluster servers and gnfs servers. I am validating Gluster 7.2 (updating from 4.1.6). Things are looking very good.

5,120 nodes using RO NFS root with RW NFS overmounts (for things like /var, /etc, ...):

- Boot 1 (where each node creates a RW XFS image on top of NFS for its writable area, then syncs /var, /etc, etc.): full boot is 15-16 minutes for 2007 nodes.
- Boot 2 (where the writable area pre-exists and is reused, just re-rsynced): 8-9 minutes to boot 2007 nodes.

This is similar to gluster 4, but I think it's saying something to not have had any failures in this setup on the bleeding-edge release level.

We also use a different volume, shared between the leaders and the head node, for shared-storage consoles and system logs. It's working great.

I haven't had time to test other solutions. Our old solution from SGI days (ICE, ICE X, etc.) was a different model, where each leader served a set of nodes and NFS-booted 288 or so. No shared storage.
Like you, I've wondered if something else matches this solution. We like the shared storage and the ability for a leader to drop without taking 288 nodes with it. (All nodes running RHEL 8.0, GlusterFS 7.2, CTDB 4.9.1.) So we can say gluster is providing the network boot solution for, now, two supercomputers.

Erik

Community Meeting Calendar:
APAC Schedule - Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/441850968
NA/EMEA Schedule - Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/441850968
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
[Gluster-users] question on rebalance errors gluster 7.2 (adding to distributed/replicated)
My question: Are the errors and anomalies below something I need to investigate? Or should I not be worried?

I installed a test cluster with gluster 7.2 to run some tests, preparing to see if we gain confidence to put this on the 5,120 node supercomputer instead of gluster 4.1.6.

I started with a 3x2 volume with heavy optimizations for writes and NFS (6 nodes, distribute/replicate). I booted my NFS-root clients and kept them online. I then performed an add-brick operation to make it a 3x3 instead of a 3x2 (so 9 servers instead of 6).

The rebalance went much better for me than gluster 4.1.6. However, I saw some errors. We noted them first here -- 14 errors on leader8, and a few on the others. These are the NEW nodes, so the data flow was from the old nodes to these three that each have at least one error:

[root@leader8 glusterfs]# gluster volume rebalance cm_shared status
Node                                    Rebalanced-files  size     scanned  failures  skipped  status     run time in h:m:s
--------------------------------------- ----------------  -------  -------  --------  -------  ---------  -----------------
leader1.head.cm.eag.rdlabs.hpecorp.net  18933             596.4MB  181780   0         3760     completed  0:41:39
172.23.0.4                              18960             1.2GB    181831   0         3766     completed  0:41:39
172.23.0.5                              18691             1.2GB    181826   0         3716     completed  0:41:39
172.23.0.6                              14917             618.8MB  175758   0         3869     completed  0:35:40
172.23.0.7                              15114             573.5MB  175728   0         3853     completed  0:35:41
172.23.0.8                              14864             459.2MB  175742   0         3951     completed  0:35:40
172.23.0.9                              0                 0Bytes   11       3         0        completed  0:08:26
172.23.0.11                             0                 0Bytes   242      1         0        completed  0:08:25
localhost                               0                 0Bytes   5        14        0        completed  0:08:26
volume rebalance: cm_shared: success

My rebalance log is about 32M and I find it's hard for people to help me when I post that much data, so I've tried to filter some of the data here. Two classes -- anomalies and errors.
Errors (14 reported on this node):

[root@leader8 glusterfs]# grep -i "error from gf_defrag_get_entry" cm_shared-rebalance.log
[2020-02-10 23:23:55.286830] W [dht-rebalance.c:3439:gf_defrag_process_dir] 0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:24:12.903496] W [dht-rebalance.c:3439:gf_defrag_process_dir] 0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:24:15.226948] W [dht-rebalance.c:3439:gf_defrag_process_dir] 0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:24:15.259480] W [dht-rebalance.c:3439:gf_defrag_process_dir] 0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:24:15.398784] W [dht-rebalance.c:3439:gf_defrag_process_dir] 0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:24:16.633033] W [dht-rebalance.c:3439:gf_defrag_process_dir] 0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:24:16.645847] W [dht-rebalance.c:3439:gf_defrag_process_dir] 0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:24:21.783528] W [dht-rebalance.c:3439:gf_defrag_process_dir] 0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:24:22.307464] W [dht-rebalance.c:3439:gf_defrag_process_dir] 0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:25:23.391256] W [dht-rebalance.c:3439:gf_defrag_process_dir] 0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:26:34.203129] W [dht-rebalance.c:3439:gf_defrag_process_dir] 0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:26:39.669243] W [dht-rebalance.c:3439:gf_defrag_process_dir] 0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:27:42.615081] W [dht-rebalance.c:3439:gf_defrag_process_dir] 0-cm_shared-dht: Found error from gf_defrag_get_entry
[2020-02-10 23:28:53.942158] W [dht-rebalance.c:3439:gf_defrag_process_dir] 0-cm_shared-dht: Found error from gf_defrag_get_entry

Brick log errors around 23:23:55 (to match the first error above):

[2020-02-10 23:23:54.605681] W [MSGID: 113096] [posix-handle.c:834:posix_handle_soft] 0-cm_shared-posix: symlink ../../a4/3e/a43ef7fd-08eb-434c-8168-96a92059d186/LC_MESSAGES ->
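For what it's worth, the kind of filtering used above (counting and bucketing warnings rather than posting a 32M log) can be done with a short pipeline. This is a generic sketch of my own; only the log file name and message text come from the output above, and the sample log is fabricated here just so the pipeline has something to chew on:

```shell
# Count the gf_defrag_get_entry warnings in a rebalance log, then bucket
# them per minute to see when they clustered (sample log generated inline).
log=cm_shared-rebalance.log
printf '%s\n' \
  '[2020-02-10 23:23:55.286830] W [dht-rebalance.c:3439:gf_defrag_process_dir] 0-cm_shared-dht: Found error from gf_defrag_get_entry' \
  '[2020-02-10 23:24:12.903496] W [dht-rebalance.c:3439:gf_defrag_process_dir] 0-cm_shared-dht: Found error from gf_defrag_get_entry' \
  > "$log"

# total count of this warning
grep -c 'Found error from gf_defrag_get_entry' "$log"

# per-minute histogram: chars 2-17 of each line are "YYYY-MM-DD HH:MM"
grep 'Found error from gf_defrag_get_entry' "$log" | cut -c2-17 | sort | uniq -c
```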
Re: [Gluster-users] NFS clients show missing files while gluster volume rebalanced
Closing the loop in case someone does a search on this... I have an update. I am getting some time on a 1,000 node system soon, so I have started to validate jumping to gluster 7.2 on my small lab machine. I switched the packages to my own build of gluster 7.2 with gnfs. I re-installed my leader node (gluster/gnfs servers) and created the volumes the same way as before. This includes heavy cache optimization for the NFS services volume.

I can no longer duplicate this problem on gluster 7.2. I was able to duplicate rebalance troubles on NFS clients every time on gluster 4.1.6. I do have a couple of questions on some rebalance errors, which I will send in a separate email.

Erik

On Wed, Jan 29, 2020 at 06:20:34PM -0600, Erik Jacobson wrote:
> We are using gluster 4.1.6. We are using gluster NFS (not ganesha).
>
> Distributed/replicated with subvolume size 3 (6 total servers, 2
> subvols).
>
> The NFS clients use this for their root filesystem.
>
> When I add 3 more gluster servers to add one more subvolume to the
> storage volumes (so now subvolume size 3, 9 total servers, 3 total
> subvolumes), the process gets started.
>
> ssh leader1 gluster volume add-brick cm_shared
> 172.23.0.9://data/brick_cm_shared 172.23.0.10://data/brick_cm_shared
> 172.23.0.11://data/brick_cm_shared
>
> then
>
> ssh leader1 gluster volume rebalance cm_shared start
>
> The re-balance works. 'gluster volume status' shows re-balance in
> progress.
>
> However, existing gluster-NFS clients now show missing files and I can
> no longer log into them (since NFS is their root). If you are logged in,
> you can find that libraries are missing and general unhappiness with
> random files now missing.
>
> Is accessing a volume that is in the process of being re-balanced not
> supported from a gluster NFS client? Or have I made an error?
>
> Thank you for any help,
>
> Erik
[Gluster-users] NFS clients show missing files while gluster volume rebalanced
We are using gluster 4.1.6. We are using gluster NFS (not ganesha). Distributed/replicated with subvolume size 3 (6 total servers, 2 subvols). The NFS clients use this for their root filesystem.

When I add 3 more gluster servers to add one more subvolume to the storage volumes (so now subvolume size 3, 9 total servers, 3 total subvolumes), the process gets started:

ssh leader1 gluster volume add-brick cm_shared 172.23.0.9://data/brick_cm_shared 172.23.0.10://data/brick_cm_shared 172.23.0.11://data/brick_cm_shared

then

ssh leader1 gluster volume rebalance cm_shared start

The re-balance works. 'gluster volume status' shows re-balance in progress. However, existing gluster-NFS clients now show missing files and I can no longer log into them (since NFS is their root). If you are logged in, you can find that libraries are missing and general unhappiness with random files now missing.

Is accessing a volume that is in the process of being re-balanced not supported from a gluster NFS client? Or have I made an error?

Thank you for any help,

Erik
Re: [Gluster-users] gluster NFS hang observed mounting or umounting at scale
> One last reply to myself.

One of the test cases my test scripts triggered turned out to actually be due to my NFS RW mount options.

OLD RW NFS mount options:
"rw,noatime,nocto,actimeo=3600,lookupcache=all,nolock,tcp,vers=3"

NEW options that work better:
"rw,noatime,nolock,tcp,vers=3"

I had copied the RO NFS options we use, which try to be aggressive about caching. The RO root image doesn't change much and we want it as fast as possible. Those options are not appropriate for RW areas that change (even though it's a single image file we care about).

So now my test scripts run clean, but since what we see on larger systems is right after reboot, the caching shouldn't matter; in the real problem case, the RW stuff is done once after reboot. FWIW I attached my current test scripts; my last batch had some errors.

The search continues for the actual problem, which I'm struggling to reproduce at 366 NFS clients. I believe yesterday, when I posted about actual HANGS, that is the real problem we're tracking. I hit that once in my test scripts - only once. My script was otherwise hitting a "file doesn't really exist even though cached" issue and it was tricking my scripts.

In any case, I'm changing the RW NFS options we use regardless.

Erik

nfs-issues.tar.xz
Description: application/xz
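For reference, as fstab-style entries the before/after looks roughly like this. Only the option strings come from the message; the server path and mount point are made-up placeholders:

```shell
# OLD RW mount (aggressive caching, copied from the RO root-image mounts):
# server:/cm_shared/images_rw  /mnt/rw  nfs  rw,noatime,nocto,actimeo=3600,lookupcache=all,nolock,tcp,vers=3  0 0

# NEW RW mount (drop nocto / actimeo / lookupcache for data that changes):
# server:/cm_shared/images_rw  /mnt/rw  nfs  rw,noatime,nolock,tcp,vers=3  0 0
```

The dropped options (nocto, actimeo=3600, lookupcache=all) all weaken close-to-open consistency or pin cached attributes/lookups, which is fine for a nearly read-only tree but risky for files being actively rewritten.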
Re: [Gluster-users] gluster NFS hang observed mounting or umounting at scale
e handle), POSIX: 116(Stale file handle)), count: 0,
> STABLE,wverf: 1579664973
> [2020-01-26 02:42:43.908045] W [MSGID: 112199]
> [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3:
> /image/images_rw_nfs/r17c3t4n1/rhel8.0/xfs.img => (XID: a87e7e7d, WRITE: NFS:
> 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0,
> STABLE,wverf: 1579664973
> [2020-01-26 02:42:43.908194] W [MSGID: 112199]
> [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3:
> /image/images_rw_nfs/r17c3t4n1/rhel8.0/xfs.img => (XID: a67e7e7d, WRITE: NFS:
> 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0,
> STABLE,wverf: 1579664973

Erik Jacobson
Software Engineer
erik.jacob...@hpe.com
+1 612 851 0550 Office
Eagan, MN
hpe.com
Re: [Gluster-users] gluster NFS hang observed mounting or umounting at scale
> The gluster NFS log has this entry:
> [2020-01-25 19:07:33.085806] E [MSGID: 109040]
> [dht-helper.c:1388:dht_migration_complete_check_task] 0-cm_shared-dht:
> 19bd72f0-6863-4f1d-80dc-a426db8670b8: failed to lookup the file on
> cm_shared-dht [Stale file handle]
> [2020-01-25 19:07:33.085848] W [MSGID: 112199]
> [nfs3-helpers.c:3578:nfs3_log_commit_res] 0-nfs-nfsv3:
> /image/images_rw_nfs/r41c4t1n1/rhel8.0/xfs-test.img => (XID: cb501b58,
> COMMIT: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), wverf:
> 1579988225

I've done more digging. I have access to an actual system that is failing (instead of my test case above). It appears to be the same issue, so that's good. (My access goes away in a couple of hours.)

The nodes don't hang at the mount, but rather at a check in the code for the existence of the image file. I'm not sure if the "holes" message I share below is a problem or not; the file indeed does start sparse. Restarting 'glusterd' on the problem server allows the node to boot. However, it does seem like the problem image file disappears from the face of the earth as far as I can tell (it doesn't exist in the gluster mount at the same path).
Searching for all messages in nfs.log related to r17c3t6n3 (the problem node with the problem xfs.img file), I see:

[root@leader1 glusterfs]# grep r17c3t6n3 nfs.log
[2020-01-24 12:29:42.412019] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: ca68a5fc, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973
[2020-01-25 04:57:10.199988] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: 1ec43ce0, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973 [Invalid argument]
[2020-01-25 04:57:10.200431] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: 20c43ce0, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973
[2020-01-25 04:57:10.200695] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: 21c43ce0, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973
[2020-01-25 04:57:10.200827] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: 1fc43ce0, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973
[2020-01-25 04:57:10.201808] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: 22c43ce0, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973 [Invalid argument]
[2020-01-25 23:32:09.629807] I [MSGID: 109063] [dht-layout.c:693:dht_layout_normalize] 0-cm_shared-dht: Found anomalies in /image/images_rw_nfs/r17c3t6n3/rhel8.0 (gfid = ----). Holes=1 overlaps=0
[2020-01-26 02:42:33.712684] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: a0ca8fc3, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973

r17c3t4n1 is another case:

[2020-01-25 23:19:46.729427] I [MSGID: 109063] [dht-layout.c:693:dht_layout_normalize] 0-cm_shared-dht: Found anomalies in /image/images_rw_nfs/r17c3t4n1/rhel8.0 (gfid = ----). Holes=1 overlaps=0
[2020-01-26 02:42:43.907163] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t4n1/rhel8.0/xfs.img => (XID: a77e7e7d, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973
[2020-01-26 02:42:43.908045] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t4n1/rhel8.0/xfs.img => (XID: a87e7e7d, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973
[2020-01-26 02:42:43.908194] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t4n1/rhel8.0/xfs.img => (XID: a67e7e7d, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973
Re: [Gluster-users] No possible to mount a gluster volume via /etc/fstab?
> yes I know but I already tried that and failed at implementing it.
> I'm now even suspecting gluster to have some kind of bug.
>
> Could you show me how to do it correctly? Which services go into After=?
> Do you have example unit files for mounting gluster volumes?

I have had some struggles with this, in the depths of systemd. I ended up making a oneshot systemd service and a helper script. I have one helper script for my gluster server/NFS server nodes that tries carefully not to mount gluster paths until gluster is actually started. It also ensures ctdb is started only after the gluster lock is actually available. Your case seems to be more like gluster-client-only, which I have a simpler helper script for.

Note that ideas for this came from this very mailing list, as I recall, so I'm not taking credit for the whole idea. Now, this is very specific to my situation, but maybe you can get some ideas. Otherwise, trash this email :)

systemd service:

# This cluster manager service ensures
# - shared storage is mounted
# - bind mounts are mounted
# - Works around distro problems (like RHEL8.0) that ignore _netdev
#   and try to mount network filesystems before the network is up
# - Also helps handle the case where the whole cluster is powered up and
#   the admin won't be able to mount shared storage until SU leaders are up.
[Unit]
Description=CM ADMIN Service to ensure mounts are good
After=network-online.target time-sync.target

[Service]
Type=oneshot
RemainAfterExit=yes
User=root
ExecStart=/opt/clmgr/lib/cm-admin-mounts

[Install]
WantedBy=multi-user.target

And the helper:

#! /bin/bash
# Copyright (c) 2019 Hewlett Packard Enterprise Development LP
# All rights reserved.
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#
# This script handles ensuring:
# * Shared storage is actually mounted
# * bind mounts are sourced by shared storage and not by local directories
#
# This script solves two problems. One is a bug in RHEL 8.0 where systemd
# ignores _netdev in fstab and tries to mount network storage before the
# network is up. Additionally, this script is useful in all scenarios to
# handle the data-center-power-outage use case. In this case, SU leaders may
# take a while to get up and running -- longer than systemd might wait for
# mounts.
#
# In all cases, if systemd fails to mount the shared storage, it may
# ignore the dependencies and do the bind mounts anyway, which could
# incorrectly point to local directories instead of shared storage.
#

me=$(basename $0)

#
# Safety. Don't run on wrong node type.
#
if ! grep -q -P '^NODETYPE="admin"' /etc/opt/sgi/cminfo; then
    echo "$me: Error: This script is only to be run on admin nodes." > /dev/stderr
    logger "$me: Error: This script is only to be run on admin nodes."
    exit 1
fi

if [ ! -r /opt/clmgr/lib/su-leader-functions.sh ]; then
    echo "$me: Error: /opt/clmgr/lib/su-leader-functions.sh not found." > /dev/stderr
    logger "$me: Error: /opt/clmgr/lib/su-leader-functions.sh not found."
    exit 1
fi
source /opt/clmgr/lib/su-leader-functions.sh

#
# enable-su-leader would have placed a shared_storage entry in fstab.
# If that is not present, this admin may have been de-coupled from the
# leaders. Exit in that case.
#
if ! grep -P -q "\d+\.\d+\.\d+\.\d+:/cm_shared\s+" /etc/fstab; then
    logger "$me: Shared storage not enabled. Exiting."
    exit 0
fi

logger "$me: Unmount temporarily any bind mounts"
umount_bind_mounts_local

logger "$me: Keep trying to mount shared storage..."
while true; do
    umount /opt/clmgr/shared_storage &> /dev/null
    mount /opt/clmgr/shared_storage/
    if [ $? -ne 0 ]; then
        logger "$me: /opt/clmgr/shared_storage mount failed. Will re-try."
        umount /opt/clmgr/shared_storage/
        sleep 3
        continue
    fi
    logger "$me: Mount command reports gluster mount success. Verifying."
    if ! grep -q -P "\d+\.\d+\.\d+\.\d+:\S+\s+/opt/clmgr/shared_storage\s+fuse.glusterfs" /proc/mounts; then
        logger "$me: Verification. /opt/clmgr/shared_storage not in /proc/mounts as glusterfs. Retry"
        sleep 3
        continue
    fi
    logger "$me: Gluster mounts look correct in
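The verification step the script loops on boils down to a single /proc/mounts check. Here is a standalone sketch of just that check; the function name and the optional mounts-table argument are my additions (the real script greps /proc/mounts directly), added so the logic can be exercised against sample data:

```shell
# Succeed only if MOUNTPOINT appears in the mounts table (default
# /proc/mounts) as a fuse.glusterfs mount sourced from an IP:/volume.
is_gluster_mounted() {
    mnt="$1"
    tab="${2:-/proc/mounts}"
    grep -q -P "\d+\.\d+\.\d+\.\d+:\S+\s+${mnt}\s+fuse\.glusterfs\s" "$tab"
}

# Example against a fabricated mounts table:
printf '172.23.0.2:/cm_shared /opt/clmgr/shared_storage fuse.glusterfs rw 0 0\n' > ./mounts.sample
is_gluster_mounted /opt/clmgr/shared_storage ./mounts.sample && echo mounted
```

The point of checking /proc/mounts rather than trusting mount's exit status is exactly the failure mode the email describes: mount can "succeed" while the path is still a plain local directory.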
Re: [Gluster-users] hook script question related to ctdb, shared storage, and bind mounts
> Here is what was the setup :

I thought I'd share an update in case it helps others. Your ideas inspired me to try a different approach.

We support 4 main distros (and 2 variants of some). We try not to provide our own versions of distro-supported packages like CTDB where possible, so a concern for me in modifying services is that they could be replaced in package updates. There are ways to mitigate that, but that thought combined with your ideas led me to try this:

- Be sure the ctdb service is disabled.
- Added a systemd service of my own, oneshot, that runs a helper script.
- The helper script first ensures the gluster volumes show up. (I use localhost in my case and besides, in our environment, we don't want CTDB to have a public IP anyway until NFS can be served, so this helps there too.)
- Even with the gluster volume showing good, during init startup the first attempts to mount gluster volumes fail. So the helper script keeps looping until they work. It seems they work on the 2nd try (after a 3s sleep at failure).
- Once the mounts are confirmed working and mounted, my helper then starts the ctdb service.
- Awkward CTDB problems (where the lock check sometimes fails to detect a lock problem) are avoided, since we won't start CTDB until we're 100% sure the gluster lock is mounted and pointing at gluster.

The above is working in prototype form, so I'm going to start adding my bind mounts to the equation. I think I have a solution that will work now, and I thank you so much for the ideas. I'm taking things from prototype form on to something we can provide people.

With regards to pacemaker: there are a few pacemaker solutions that I've touched, and one I even helped implement. Now, it could be that I'm not an expert at writing rules, but pacemaker seems to have often given us more trouble than the problems it solves. I believe this is due to the complexity and power of the software. I am not knocking pacemaker.
However, a person really has to be a pacemaker expert to not make a mistake that could cause a downtime. So I have attempted to avoid pacemaker in the new solution. I know there are downsides -- fencing is there for a reason -- but as far as I can tell the decision has been right for us. CTDB is less complicated, even if it does not provide 100% true full HA abilities.

That said, in the solution I've been careful to future-proof a move to pacemaker. For example, on the gluster servers/NFS servers I bring up IP aliases (interfaces) on the network where the BMCs reside, so we're seamlessly able to switch to pacemaker with IPMI/BMC/redfish fencing later if needed, without causing too much pain in the field with deployed servers.

I do realize there are tools to help configure pacemaker for you. Some that I've tried have given me mixed results, perhaps due to the complexity of networking setup in the solutions we have. As we start to deploy this to more locations, I'll get a feel for whether a move to pacemaker is right or not.

I just share this in the interest of learning. I'm always willing to learn and improve if I've overlooked something.

Erik
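The ordering logic described above (keep retrying the mount until it verifies, then start ctdb) is a simple retry pattern. Below is a generic, hedged sketch with the commands parameterized; every name here is mine, not the actual helper, and the demo at the bottom uses a plain file as a stand-in for the mount so the pattern can be exercised without root:

```shell
# Keep running mount_cmd until check_cmd succeeds (or give up after
# max_tries attempts), then run start_cmd -- e.g. "systemctl start ctdb".
retry_then_start() {
    mount_cmd="$1"; check_cmd="$2"; start_cmd="$3"; max_tries="${4:-10}"
    tries=0
    until eval "$check_cmd"; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$max_tries" ]; then
            return 1            # never start ctdb on an unverified mount
        fi
        eval "$mount_cmd" || true
        sleep 0.2               # the real helper sleeps 3s between attempts
    done
    eval "$start_cmd"
}

# Demo with stand-in commands (a flag file plays the role of the mount):
retry_then_start 'touch ./mounted.flag' '[ -e ./mounted.flag ]' 'echo ctdb-started'
```

The key property matching the email: the start command only ever runs after the check has independently confirmed the mount, so a mount command that "succeeds" without actually mounting cannot trigger a premature CTDB start.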
Re: [Gluster-users] hook script question related to ctdb, shared storage, and bind mounts
On Tue, Nov 05, 2019 at 05:05:08AM +0200, Strahil wrote:
> Sure,
>
> Here is what was the setup :

Thank you! You're very kind to send me this. I will verify it with my setup soon. Hoping to rid myself of these dep problems. Thank you!!!

Erik
Re: [Gluster-users] hook script question related to ctdb, shared storage, and bind mounts
Thank you! I am very interested. I hadn't considered the automounter idea. Also, your fstab has a different dependency approach than mine otherwise as well. If you happen to have the examples handy, I'll give them a shot here. I'm looking forward to emerging from this dark place of dependencies not working!!

Thank you so much for writing back,
Erik

On Mon, Nov 04, 2019 at 06:59:10AM +0200, Strahil wrote:
> Hi Erik,
>
> I took another approach.
>
> 1. I got a systemd mount unit for my ctdb lock volume's brick:
> [root@ovirt1 system]# grep var /etc/fstab
> gluster1:/gluster_shared_storage /var/run/gluster/shared_storage/ glusterfs
> defaults,x-systemd.requires=glusterd.service,x-systemd.automount 0 0
>
> As you can see - it is an automounter, because sometimes it fails to mount on
> time.
>
> 2. I got custom systemd services for glusterd, ctdb and vdo - as I need to
> 'put' dependencies for each of those.
>
> Now, I'm no longer using ctdb & NFS Ganesha (as my version of ctdb cannot use
> hostnames and my environment is a little bit crazy), but I can still provide
> hints how I did it.
>
> Best Regards,
> Strahil Nikolov

On Nov 3, 2019 22:46, Erik Jacobson wrote:
> > So, I have a solution I have written about in the past that is based on
> > gluster with CTDB for IP and a level of redundancy.
> >
> > It's been working fine except for a few quirks I need to work out on
> > giant clusters when I get access.
> >
> > I have 3x9 gluster volume, each are also NFS servers, using gluster
> > NFS (ganesha isn't reliable for my workload yet). There are 9 IP
> > aliases spread across 9 servers.
> >
> > I also have many bind mounts that point to the shared storage as a
> > source, and the /gluster/lock volume ("ctdb") of course.
> >
> > glusterfs 4.1.6 (rhel8 today, but I use rhel7, rhel8, sles12, and
> > sles15)
> >
> > Things work well when everything is up and running. IP failover works
> > well when one of the servers goes down.
> > My issue is when that server comes back up. Despite my best efforts
> > with systemd fstab dependencies, the shared storage areas including
> > the gluster lock for CTDB do not always get mounted before CTDB
> > starts. This causes trouble for CTDB correctly joining the collective.
> > I also have problems where my bind mounts can happen before the shared
> > storage is mounted, despite my attempts at preventing this with
> > dependencies in fstab.
> >
> > I decided a better approach would be to use a gluster hook and just
> > mount everything I need as I need it, and start up ctdb when I know and
> > verify that /gluster/lock is really gluster and not a local disk.
> >
> > I started down a road of doing this with a start host hook and after
> > spending a while at it, I realized my logic error. This will only fire
> > when the volume is *started*, not when a server that was down re-joins.
> >
> > I took a look at the code, glusterd-hooks.c, and found that support
> > for "brick start" is not in place for a hook script but it's nearly
> > there:
> >
> > [GD_OP_START_BRICK] = EMPTY,
> > ...
> >
> > and no entry in glusterd_hooks_add_op_args() yet.
> >
> > Before I make a patch for my own use, I wanted to do a sanity check and
> > find out if others have solved this better than the road I'm heading
> > down.
> >
> > What I was thinking of doing is enabling a brick start hook, and
> > do my processing for volumes being mounted from there. However, I
> > suppose brick start is a bad choice for the case of simply stopping and
> > starting the volume, because my processing would try to complete before
> > the gluster volume was fully started. It would probably work for a brick
> > "coming back and joining" but not "stop volume/start volume".
> >
> > Any suggestions?
> >
> > My end goal is:
> > - mount shared storage every boot
> > - only attempt to mount when gluster is available (_netdev doesn't seem
> >   to be enough)
> > - never start ctdb unless /gluster/lock is shared storage and not a
> >   directory
> > - only do my bind mounts from shared storage into the rest of the
> >   layout when we are sure the shared storage is mounted (don't
> >   bind-mount using an empty directory as a source by accident!)
> >
> > Thanks so much for reading my question,
> >
> > Erik
[Gluster-users] hook script question related to ctdb, shared storage, and bind mounts
So, I have a solution I have written about in the past that is based on gluster with CTDB for IP failover and a level of redundancy. It's been working fine except for a few quirks I need to work out on giant clusters when I get access.

I have a 3x9 gluster volume; the servers are also NFS servers, using gluster NFS (ganesha isn't reliable for my workload yet). There are 9 IP aliases spread across 9 servers. I also have many bind mounts that point to the shared storage as a source, and the /gluster/lock volume ("ctdb") of course.

glusterfs 4.1.6 (rhel8 today, but I use rhel7, rhel8, sles12, and sles15)

Things work well when everything is up and running. IP failover works well when one of the servers goes down. My issue is when that server comes back up. Despite my best efforts with systemd fstab dependencies, the shared storage areas including the gluster lock for CTDB do not always get mounted before CTDB starts. This causes trouble for CTDB correctly joining the collective. I also have problems where my bind mounts can happen before the shared storage is mounted, despite my attempts at preventing this with dependencies in fstab.

I decided a better approach would be to use a gluster hook and just mount everything I need as I need it, and start up ctdb when I know and verify that /gluster/lock is really gluster and not a local disk.

I started down a road of doing this with a start host hook and after spending a while at it, I realized my logic error. This will only fire when the volume is *started*, not when a server that was down re-joins.

I took a look at the code, glusterd-hooks.c, and found that support for "brick start" is not in place for a hook script but it's nearly there:

[GD_OP_START_BRICK] = EMPTY,
...

and no entry in glusterd_hooks_add_op_args() yet.

Before I make a patch for my own use, I wanted to do a sanity check and find out if others have solved this better than the road I'm heading down.
What I was thinking of doing is enabling a brick start hook and doing my processing for volumes being mounted from there. However, I suppose brick start is a bad choice for the case of simply stopping and starting the volume, because my processing would try to complete before the gluster volume was fully started. It would probably work for a brick "coming back and joining" but not "stop volume/start volume".

Any suggestions?

My end goal is:
- mount shared storage every boot
- only attempt to mount when gluster is available (_netdev doesn't seem to be enough)
- never start ctdb unless /gluster/lock is shared storage and not a directory
- only do my bind mounts from shared storage into the rest of the layout when we are sure the shared storage is mounted (don't bind-mount using an empty directory as a source by accident!)

Thanks so much for reading my question,

Erik
Re: [Gluster-users] split-brain errors under heavy load when one brick down
Thank you for replying!

> Okay so 0-cm_shared-replicate-1 means these 3 bricks:
>
> Brick4: 172.23.0.6:/data/brick_cm_shared
> Brick5: 172.23.0.7:/data/brick_cm_shared
> Brick6: 172.23.0.8:/data/brick_cm_shared

The above is correct.

> Were there any pending self-heals for this volume? Is it possible that the
> server (one of Brick 4, 5 or 6) that is down had the only good copy and the
> other 2 online bricks had a bad copy (needing heal)? Clients can get EIO in
> that case.

So I did check for heals and saw nothing. The storage at this time was in a read-only use case: the NFS clients mount it read-only, and there were no write activities going to shared storage at that time, so it was not surprising that no heals were listed. I did inspect both remaining bricks for several of the example problem files and found them with matching md5sums.

The strange thing, as I mentioned, is that it only happened under the job launch workload. The NFS boot workload, which is also very stressful, ran clean with one brick down.

> When you say accessing the file from the compute nodes afterwards works
> fine, it is still with that one server (brick) down?

I can no longer check this system personally, but as I recall, when we fixed the ethernet problem all seemed well. I don't have a better answer for this one than that. I am starting a document of things to try when we have a large system in the factory to run on. I'll put this in there.

> There was a case of AFR reporting spurious split-brain errors but that was
> fixed long back (http://review.gluster.org/16362) and seems to be present in
> glusterfs-4.1.6. So I brought this up.

In my case, we know the files on the NFS client side really were missing because we saw errors on the clients. That is to say, the above bug seems to mean that split-brain was reported in error with no other impact; in my case, however, the error resulted in actual problems accessing the files on the NFS clients.
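The brick-side check described above (confirming the surviving bricks hold identical copies via md5sums) can be sketched as below. The paths are local stand-ins; in practice each copy lives on a different server and would be read over ssh:

```shell
#!/bin/sh
# Sketch: compare two on-brick copies of the same file by md5sum.

same_md5() {
    a=$(md5sum "$1" | awk '{ print $1 }')
    b=$(md5sum "$2" | awk '{ print $1 }')
    [ "$a" = "$b" ]
}

# Hypothetical usage against the two surviving bricks of replicate-1:
# same_md5 /data/brick_cm_shared/some/lib.so \
#          /mnt/brick5_copy/some/lib.so && echo "copies match"
```

Matching checksums on the surviving bricks (as Erik found) argue against a stale-copy explanation for the EIO errors.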
> Side note: Why are you using replica 9 for the ctdb volume? All
> development/tests are usually done on a (distributed) replica 3 setup.

I am happy to change this. Whatever guide I used to set this up suggested replica 9; I don't even know which resource was incorrect, as it was so long ago. I have no other reason. I'm filing an incident now to change our setup tools to use replica 3 for CTDB for new setups.

Again, I appreciate that you followed up with me.

Thank you,

Erik
[Gluster-users] split-brain errors under heavy load when one brick down
Hello all. I'm new to the list but not to gluster. We are using gluster to service NFS boot on a top500 cluster. It is a 3x9 Distributed-Replicate volume. We are having a problem when one server in a subvolume goes down: we get random missing files and split-brain errors in the nfs.log file. We are using Gluster NFS (we are interested in switching to Ganesha, but this workload presents problems there that we need to work through yet).

Unfortunately, like many such large systems, I am unable to take much out of the system for debugging and unable to take the system down to test this very often. However, my hope is to be well prepared when the next large system comes through the factory so I can try to reproduce this issue or have some things to try. In the lab, I have a test system that is also a 3x9 setup like at the customer site, but with only 3 compute nodes instead of 2592. We use CTDB for IP alias management; the compute nodes connect to NFS with the alias.

Here is the issue we are having:
- 2592 nodes all PXE-booting at once and using the Gluster servers as their NFS root is working great. This includes when one subvolume is degraded due to the loss of a server. No issues at boot, no split-brain messages in the log.
- The problem comes in when we do an intensive job launch. This launch uses SLURM and then loads hundreds of shared libraries over NFS across all 2592 nodes.
- When all servers in the 3x9 pool are up, we're in good shape: no issues on the compute nodes, no split-brain messages in the log.
- When one subvolume has one missing server (its ethernet adapters died), we boot fine, but the SLURM launch has random missing files. Gluster nfs.log shows split-brain messages and ACCESS I/O errors.
- Taking an example failed file and accessing it across all compute nodes always works afterwards; the issue is transient.
- The missing file is always found on the other bricks in the subvolume by searching there as well.
- No FS/disk I/O errors in the logs or dmesg, and the files are accessible before and after the transient error (and from the bricks themselves, as I said).
- The customer jobs fail to launch, then, if we are degraded. They fail with library read errors, missing config files, etc.

What is perplexing is that the huge load of 2592 nodes with NFS roots PXE-booting does not trigger the issue when one subvolume is degraded.

Thank you for reading this far and thanks to the community for making Gluster!!

Example errors:

ex1
[2019-09-06 18:26:42.665050] E [MSGID: 108008] [afr-read-txn.c:123:afr_read_txn_refresh_done] 0-cm_shared-replicate-1: Failing ACCESS on gfid ee3f5646-9368-4151-92a3-5b8e7db1fbf9: split-brain observed. [Input/output error]

ex2
[2019-09-06 18:26:55.359272] E [MSGID: 108008] [afr-read-txn.c:123:afr_read_txn_refresh_done] 0-cm_shared-replicate-1: Failing READLINK on gfid f2be38c2-1cd1-486b-acad-17f2321a18b3: split-brain observed. [Input/output error]
[2019-09-06 18:26:55.359367] W [MSGID: 112199] [nfs3-helpers.c:3435:nfs3_log_readlink_res] 0-nfs-nfsv3: /image/images_ro_nfs/toss-20190730/usr/lib64/libslurm.so.32 => (XID: 88651c80, READLINK: NFS: 5(I/O error), POSIX: 5(Input/output error)) target: (null)

The errors seem to happen only on the 'replicate' subvolume where one server is down (of course, any NFS server will trigger that when it accesses files on the degraded subvolume). Now, I am no longer able to access this customer system, and it is moving to more secret work, so I can't easily run tests on such a big system until we have something come through the factory. However, I'm desperate for help and would like a bag of tricks to attack this with next time I can hit it. Having the HA stuff fail when needed has given me a bit of a black eye on the solution. I had a lesson learned in being sure to test the HA solution.
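An aside for the next debugging session: glusterfs keeps a hardlink to every file under .glusterfs/<first two hex chars of the gfid>/<next two>/<full gfid> on each brick, so the gfids in errors like ex1 can be mapped to on-brick paths and inspected directly. A small sketch (the brick path is taken from the lab volume info, not the customer system):

```shell
#!/bin/sh
# Sketch: build the on-brick .glusterfs path for a gfid, so a gfid
# reported in nfs.log can be stat'ed or checksummed on each brick.

gfid_path() {
    brick=$1 gfid=$2
    p1=$(printf '%s' "$gfid" | cut -c1-2)
    p2=$(printf '%s' "$gfid" | cut -c3-4)
    printf '%s/.glusterfs/%s/%s/%s\n' "$brick" "$p1" "$p2" "$gfid"
}

# e.g. the gfid from the failing ACCESS in ex1:
gfid_path /data/brick_cm_shared ee3f5646-9368-4151-92a3-5b8e7db1fbf9
# -> /data/brick_cm_shared/.glusterfs/ee/3f/ee3f5646-9368-4151-92a3-5b8e7db1fbf9
```

For a regular file this path is a hardlink, so the link count and xattrs (trusted.afr.*) on each brick can be compared when a split-brain is reported.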
I had tested many times at full system boot but didn't think to do job launch tests while degraded in my testing. That pain will haunt me but also make me better.

Info on the volumes:
- RHEL 7.6 x86_64 Gluster/GNFS servers
- gluster version 4.1.6; I set up the build
- Clients are AArch64 NFSv3 clients, technically configured with RO NFS (using a version of Linux somewhat like CentOS 7.6)
- The base filesystems for bricks are XFS, with NO LVM layer

What follows is the volume info from my test system in the lab, which has the same versions and setup. I cannot get this info from the customer without an approval process, but the same scripts and tools set up my test system, so I'm confident the settings are the same.

[root@leader1 ~]# gluster volume info

Volume Name: cm_shared
Type: Distributed-Replicate
Volume ID: e7f2796b-7a94-41ab-a07d-bdce4900c731
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 172.23.0.3:/data/brick_cm_shared
Brick2: 172.23.0.4:/data/brick_cm_shared
Brick3: 172.23.0.5:/data/brick_cm_shared
Brick4: