Re: [Gluster-users] Big problems after update to 9.6

2023-02-24 Thread David Cunningham
Hello,

After verifying that our backup was good we tried upgrading the other
server to version 9.6, and then it worked fine. So it appears that
versions 9.1 and 9.6 couldn't talk to each other.

Is this expected? I had thought that nodes with the same major version
number would be able to communicate without problem.
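
In case it helps anyone who hits the same thing, one way to compare what
the cluster has negotiated against what each node supports is to check the
operating version. This is only a sketch of the checks I mean, and it
assumes the op-version options are exposed by `gluster volume get` on 9.x:

# cluster-wide operating version currently in effect
gluster volume get all cluster.op-version
# highest op-version this node could support
gluster volume get all cluster.max-op-version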

Thanks.


On Fri, 24 Feb 2023 at 21:30, Anant Saraswat 
wrote:

> Hi David,
>
> It seems like a network issue to me, as it's unable to connect to the
> other node and is getting a timeout.
>
> A few things you can check:
>
>- Check the /etc/hosts file on both the servers and make sure it has
>the correct IP of the other node.
>- Are you binding Gluster to a specific IP address that changed after
>your update?
>- Check if you can access port 24007 from the other host.
>
> If all the above checks are fine, then you can try stopping glusterd on
> both nodes (sg first and then br) and make sure no gluster-related
> processes are left. Then start glusterd on br *first*, check `gluster
> peer status`, and then start glusterd on sg (if you can take the
> downtime 🙂).
>
> Thanks,
> Anant
> --
> *From:* Gluster-users  on behalf of
> David Cunningham 
> *Sent:* 23 February 2023 9:56 PM
> *To:* gluster-users 
> *Subject:* Re: [Gluster-users] Big problems after update to 9.6
>
>
> Is it possible that versions 9.1 and 9.6 can't talk to each other? My
> understanding was that they should be able to.
>
>
> On Fri, 24 Feb 2023 at 10:36, David Cunningham 
> wrote:
>
> We've tried to remove "sg" from the cluster so we can re-install the
> GlusterFS node on it, but the following command run on "br" also gives a
> timeout error:
>
> gluster volume remove-brick gvol0 replica 1
> sg:/nodirectwritedata/gluster/gvol0 force
>
> How can we tell "br" to just remove "sg" without trying to contact it?
>
>
> On Fri, 24 Feb 2023 at 10:31, David Cunningham 
> wrote:
>
> Hello,
>
> We have a cluster with two nodes, "sg" and "br", which were running
> GlusterFS 9.1, installed via the Ubuntu package manager. We updated the
> Ubuntu packages on "sg" to version 9.6, and now have big problems. The "br"
> node is still on version 9.1.
>
> Running "gluster volume status" on either host gives "Error : Request
> timed out". On "sg" not all processes are running, compared to "br", as
> below. Restarting the services on "sg" doesn't help. Can anyone advise how
> we should proceed? This is a production system.
>
> root@sg:~# ps -ef | grep gluster
> root 15196 1  0 22:37 ?        00:00:00 /usr/sbin/glusterd -p
> /var/run/glusterd.pid --log-level INFO
> root 15426 1  0 22:39 ?        00:00:00 /usr/bin/python3
> /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
> root 15457 15426  0 22:39 ?        00:00:00 /usr/bin/python3
> /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
> root 19341 13695  0 23:24 pts/1    00:00:00 grep --color=auto gluster
>
> root@br:~# ps -ef | grep gluster
> root  2052 1  0  2022 ?        00:00:00 /usr/bin/python3
> /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
> root  2062 1  3  2022 ?        10-11:57:16 /usr/sbin/glusterfs
> --fuse-mountopts=noatime --process-name fuse --volfile-server=br
> --volfile-server=sg --volfile-id=/gvol0 --fuse-mountopts=noatime
> /mnt/glusterfs
> root  2379  2052  0  2022 ?        00:00:00 /usr/bin/python3
> /usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
> root  5884 1  5  2022 ?        18-16:08:53 /usr/sbin/glusterfsd -s
> br --volfile-id gvol0.br.nodirectwritedata-gluster-gvol0 -p
> /var/run/gluster/vols/gvol0/br-nodirectwritedata-gluster-gvol0.pid -S
> /var/run/gluster/61df1d4e1c65300e.socket --brick-name
> /nodirectwritedata/gluster/gvol0 -l
> /var/log/glusterfs/bricks/nodirectwritedata-gluster-gvol0.log
> --xlator-option *-posix.glusterd-uuid=11e528b0-8c69-4b5d-82ed-c41dd25536d6
> --process-name brick --brick-port 49152 --xlator-option
> gvol0-server.listen-port=49152
> root 10463 18747  0 23:24 pts/1    00:00:00 grep --color=auto gluster
> root 27744 1  0  2022 ?        03:55:10 /usr/sbin/glusterfsd -s br
> --volfile-id gvol0.br.nodirectwritedata-gluster-gvol0 -p
> /var/run/gluster/vols/gvol0/br-nodirectwritedata-gluster-gvol0.pid -S
> /var/run/gluster/61df1d4e1c65300e.socket --brick-name
> /nodirectwritedata/gluster/gvol0 -l
> /var/log/glusterfs/bricks/nodirectwritedata-gluster-gvol0.log
> --xlator-option *-posix.glusterd-uuid=11e528b0-8c69-4b5d-82ed-c41dd25536d6
> --process-name brick --brick-port 49153 --xlator-option
> gvol0-server.listen-port=49153
> root 48227 1  0 Feb17 ?        00:00:26 /usr/sbin/glusterd -p
> /var/run/glusterd.pid --log-level INFO
>
> On "sg" in glusterd.log we're seeing:
>
> [2023-02-23 20:26:57.619318 +0000] E [rpc-clnt.c:181:call_bail]
> 0-management: bailing out frame type(glusterd mgmt v3), op(--(6)), xid =
> 0x11, unique = 27, sent = 2023-02-23 20:16:50.596447 +0000, timeout = 600
> for 10.20.20.11:24007

Re: [Gluster-users] [EXTERNAL] Re: New Gluster volume (10.3) not healing symlinks after brick offline

2023-02-24 Thread Matt Rubright
Hi Eli,

Thanks for the response. I had hoped for a simple fix here, but I think
perhaps there isn't one. I have built this as a part of a new environment,
eventually replacing a much older system built with Gluster 3.10 (yes -
that old). I appreciate the warning about 10.3 and will run some
comparative load testing against both it and 9.5.
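
For the comparison I have in mind something simple along these lines, run
against a fuse mount of each version. fio isn't mentioned anywhere in this
thread, so treat the tool choice and the exact flags as my own assumption
rather than a recommendation:

# hypothetical comparative load test against the fuse mount (adjust path and sizes)
fio --name=smallfile-randwrite --directory=/mnt/cwsvol01 \
    --rw=randwrite --bs=16k --size=512M --numjobs=8 \
    --time_based --runtime=300 --group_reporting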

- Matt

On Fri, Feb 24, 2023 at 8:46 AM Eli V  wrote:

> I've seen issues with symlinks failing to heal as well. I never found
> a good solution on the glusterfs side of things. The most reliable fix I
> found is just to rm and recreate the symlink in the fuse volume itself.
> Also, I'd strongly suggest heavy load testing before upgrading to 10.3
> in production; after upgrading from 9.5 -> 10.3 I've seen frequent
> brick process crashes (glusterfsd), whereas 9.5 was quite stable.
>
> On Mon, Jan 23, 2023 at 3:58 PM Matt Rubright  wrote:
> >
> > Hi friends,
> >
> > I have recently built a new replica 3 arbiter 1 volume on 10.3 servers
> and have been putting it through its paces before getting it ready for
> production use. The volume will ultimately contain about 200G of web
> content files shared among multiple frontends. Each will use the gluster
> fuse client to connect.
> >
> > What I am experiencing sounds very much like this post from 9 years ago:
> https://lists.gnu.org/archive/html/gluster-devel/2013-12/msg00103.html
> >
> > In short, if I perform these steps I can reliably end up with symlinks
> on the volume which will not heal either by initiating a 'full heal' from
> the cluster or using a fuse client to read each file:
> >
> > 1) Verify that all nodes are healthy, the volume is healthy, and there
> are no items needing to be healed
> > 2) Cleanly shut down one server hosting a brick
> > 3) Copy data, including some symlinks, from a fuse client to the volume
> > 4) Bring the brick back online and observe the number and type of items
> needing to be healed
> > 5) Initiate a full heal from one of the nodes
> > 6) Confirm that while files and directories are healed, symlinks are not
> >
> > Please help me determine if I have improper expectations here. I have
> some basic knowledge of managing gluster volumes, but I may be
> misunderstanding intended behavior.
> >
> > Here is the volume info and heal data at each step of the way:
> >
> > *** Verify that all nodes are healthy, the volume is healthy, and there
> are no items needing to be healed ***
> >
> > # gluster vol info cwsvol01
> >
> > Volume Name: cwsvol01
> > Type: Replicate
> > Volume ID: 7b28e6e6-4a73-41b7-83fe-863a45fd27fc
> > Status: Started
> > Snapshot Count: 0
> > Number of Bricks: 1 x (2 + 1) = 3
> > Transport-type: tcp
> > Bricks:
> > Brick1: glfs02-172-20-1:/data/brick01/cwsvol01
> > Brick2: glfs01-172-20-1:/data/brick01/cwsvol01
> > Brick3: glfsarb01-172-20-1:/data/arb01/cwsvol01 (arbiter)
> > Options Reconfigured:
> > performance.client-io-threads: off
> > nfs.disable: on
> > transport.address-family: inet
> > storage.fips-mode-rchecksum: on
> > cluster.granular-entry-heal: on
> >
> > # gluster vol status
> > Status of volume: cwsvol01
> > Gluster process                             TCP Port  RDMA Port  Online  Pid
> > ------------------------------------------------------------------------------
> > Brick glfs02-172-20-1:/data/brick01/cwsvol0
> > 1                                           50253     0          Y       1397
> > Brick glfs01-172-20-1:/data/brick01/cwsvol0
> > 1                                           56111     0          Y       1089
> > Brick glfsarb01-172-20-1:/data/arb01/cwsvol
> > 01                                          54517     0          Y       118704
> > Self-heal Daemon on localhost               N/A       N/A        Y       1413
> > Self-heal Daemon on glfs01-172-20-1         N/A       N/A        Y       3490
> > Self-heal Daemon on glfsarb01-172-20-1      N/A       N/A        Y       118720
> >
> > Task Status of Volume cwsvol01
> > ------------------------------------------------------------------------------
> > There are no active volume tasks
> >
> > # gluster vol heal cwsvol01 info summary
> > Brick glfs02-172-20-1:/data/brick01/cwsvol01
> > Status: Connected
> > Total Number of entries: 0
> > Number of entries in heal pending: 0
> > Number of entries in split-brain: 0
> > Number of entries possibly healing: 0
> >
> > Brick glfs01-172-20-1:/data/brick01/cwsvol01
> > Status: Connected
> > Total Number of entries: 0
> > Number of entries in heal pending: 0
> > Number of entries in split-brain: 0
> > Number of entries possibly healing: 0
> >
> > Brick glfsarb01-172-20-1:/data/arb01/cwsvol01
> > Status: Connected
> > Total Number of entries: 0
> > Number of entries in heal pending: 0
> > Number of entries in split-brain: 0
> > Number of entries possibly healing: 0
> >
> > *** Cleanly shut down one server hosting a brick ***
> >
> > *** Copy data, including some symlinks, from a fuse client to the volume
> ***
> >
> > # gluster vol heal cwsvol01 info summary
> > Brick glfs02-17

Re: [Gluster-users] New Gluster volume (10.3) not healing symlinks after brick offline

2023-02-24 Thread Eli V
I've seen issues with symlinks failing to heal as well. I never found
a good solution on the glusterfs side of things. The most reliable fix I
found is just to rm and recreate the symlink in the fuse volume itself.
Also, I'd strongly suggest heavy load testing before upgrading to 10.3
in production; after upgrading from 9.5 -> 10.3 I've seen frequent
brick process crashes (glusterfsd), whereas 9.5 was quite stable.
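
For the record, the workaround I mean is roughly the sequence below, done
through a fuse mount of the volume (never directly on the bricks). The
mount point and path are placeholders, and it assumes you already know
which symlinks failed to heal:

# note the target, then remove and recreate the symlink via the fuse mount
target=$(readlink /mnt/volume/path/to/broken-link)
rm /mnt/volume/path/to/broken-link
ln -s "$target" /mnt/volume/path/to/broken-link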

On Mon, Jan 23, 2023 at 3:58 PM Matt Rubright  wrote:
>
> Hi friends,
>
> I have recently built a new replica 3 arbiter 1 volume on 10.3 servers and 
> have been putting it through its paces before getting it ready for production 
> use. The volume will ultimately contain about 200G of web content files 
> shared among multiple frontends. Each will use the gluster fuse client to 
> connect.
>
> What I am experiencing sounds very much like this post from 9 years ago: 
> https://lists.gnu.org/archive/html/gluster-devel/2013-12/msg00103.html
>
> In short, if I perform these steps I can reliably end up with symlinks on the 
> volume which will not heal either by initiating a 'full heal' from the 
> cluster or using a fuse client to read each file:
>
> 1) Verify that all nodes are healthy, the volume is healthy, and there are no 
> items needing to be healed
> 2) Cleanly shut down one server hosting a brick
> 3) Copy data, including some symlinks, from a fuse client to the volume
> 4) Bring the brick back online and observe the number and type of items 
> needing to be healed
> 5) Initiate a full heal from one of the nodes
> 6) Confirm that while files and directories are healed, symlinks are not
>
> Please help me determine if I have improper expectations here. I have some 
> basic knowledge of managing gluster volumes, but I may be misunderstanding 
> intended behavior.
>
> Here is the volume info and heal data at each step of the way:
>
> *** Verify that all nodes are healthy, the volume is healthy, and there are 
> no items needing to be healed ***
>
> # gluster vol info cwsvol01
>
> Volume Name: cwsvol01
> Type: Replicate
> Volume ID: 7b28e6e6-4a73-41b7-83fe-863a45fd27fc
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: glfs02-172-20-1:/data/brick01/cwsvol01
> Brick2: glfs01-172-20-1:/data/brick01/cwsvol01
> Brick3: glfsarb01-172-20-1:/data/arb01/cwsvol01 (arbiter)
> Options Reconfigured:
> performance.client-io-threads: off
> nfs.disable: on
> transport.address-family: inet
> storage.fips-mode-rchecksum: on
> cluster.granular-entry-heal: on
>
> # gluster vol status
> Status of volume: cwsvol01
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick glfs02-172-20-1:/data/brick01/cwsvol0
> 1                                           50253     0          Y       1397
> Brick glfs01-172-20-1:/data/brick01/cwsvol0
> 1                                           56111     0          Y       1089
> Brick glfsarb01-172-20-1:/data/arb01/cwsvol
> 01                                          54517     0          Y       118704
> Self-heal Daemon on localhost               N/A       N/A        Y       1413
> Self-heal Daemon on glfs01-172-20-1         N/A       N/A        Y       3490
> Self-heal Daemon on glfsarb01-172-20-1      N/A       N/A        Y       118720
>
> Task Status of Volume cwsvol01
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> # gluster vol heal cwsvol01 info summary
> Brick glfs02-172-20-1:/data/brick01/cwsvol01
> Status: Connected
> Total Number of entries: 0
> Number of entries in heal pending: 0
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick glfs01-172-20-1:/data/brick01/cwsvol01
> Status: Connected
> Total Number of entries: 0
> Number of entries in heal pending: 0
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick glfsarb01-172-20-1:/data/arb01/cwsvol01
> Status: Connected
> Total Number of entries: 0
> Number of entries in heal pending: 0
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> *** Cleanly shut down one server hosting a brick ***
>
> *** Copy data, including some symlinks, from a fuse client to the volume ***
>
> # gluster vol heal cwsvol01 info summary
> Brick glfs02-172-20-1:/data/brick01/cwsvol01
> Status: Transport endpoint is not connected
> Total Number of entries: -
> Number of entries in heal pending: -
> Number of entries in split-brain: -
> Number of entries possibly healing: -
>
> Brick glfs01-172-20-1:/data/brick01/cwsvol01
> Status: Connected
> Total Number of entries: 810
> Number of entries in heal pending: 810
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick glfsarb01-172-20-1:/data/arb01/cwsvol01
> Status: Connected
> Total Number of entries: 810
> Number of entries

[Gluster-users] Gluster 11.0 upgrade report

2023-02-24 Thread Eli V
Just upgraded my test 3 node distributed-replica 9x2 glusterfs to 11.0
and it was a bit rough. After upgrading the 1st node, gluster volume
status showed only the bricks on node 1, and gluster peer status
showed node1 rejecting node 2 & 3. After upgrading node2, and then
node3, node 3 remained rejected. I followed the docs for resolving the
rejected peer, i.e. cleaning out /var/lib/glusterd apart from the .info
file, and was able to peer probe and get node 3 back into the cluster.
However, the fuse glusterfs client is now oddly reporting the volume as
only 1.1TB, versus the 2.5TB before (9 x 280GB disks). Also, glusterfsd
processes seem to crash under load testing just as much as on 10, and it
created unhealable files, which I'd never seen on 10; I only resolved that
with rm -rf on the whole testing directory tree.
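
For anyone searching later, the cleanup I followed is roughly the sequence
below. It's a sketch of the documented "peer rejected" recovery rather
than a copy of it, the hostname is a placeholder, and it's worth
re-checking the current docs before running it:

# on the rejected node (node3 in my case)
systemctl stop glusterd
cd /var/lib/glusterd
# keep glusterd.info, clear everything else
find . -mindepth 1 -maxdepth 1 ! -name glusterd.info -exec rm -rf {} +
systemctl start glusterd
gluster peer probe node1      # probe a healthy peer
gluster peer status
systemctl restart glusterd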






Re: [Gluster-users] Big problems after update to 9.6

2023-02-24 Thread Anant Saraswat
Hi David,

It seems like a network issue to me, as it's unable to connect to the other
node and is getting a timeout.

A few things you can check (a rough command sketch follows the list):

  *   Check the /etc/hosts file on both the servers and make sure it has the 
correct IP of the other node.
  *   Are you binding Gluster to a specific IP address that changed after your
update?
  *   Check if you can access port 24007 from the other host.
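
A rough sketch of those checks, to be run from each node against the other
(hostnames are placeholders, nc may need to be installed, and the
bind-address line will only exist if you explicitly set one):

# does the peer's name resolve to the IP you expect?
getent hosts sg
grep -E 'sg|br' /etc/hosts
# any explicit bind address configured for glusterd?
grep -i bind-address /etc/glusterfs/glusterd.vol
# can we reach glusterd on the peer?
nc -zv sg 24007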

If all the above checks are fine, then you can try stopping glusterd on both
nodes (sg first and then br) and make sure no gluster-related processes are
left. Then start glusterd on br first, check `gluster peer status`, and then
start glusterd on sg (if you can take the downtime 🙂).
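
As a sketch of that order (the systemd unit name is my assumption; brick
and fuse processes keep running unless you stop them as well):

# on sg, then on br
systemctl stop glusterd
pgrep -af gluster      # confirm nothing gluster-related is left over

# on br first
systemctl start glusterd
gluster peer status

# then on sg
systemctl start glusterd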

Thanks,
Anant

From: Gluster-users  on behalf of David 
Cunningham 
Sent: 23 February 2023 9:56 PM
To: gluster-users 
Subject: Re: [Gluster-users] Big problems after update to 9.6



Is it possible that versions 9.1 and 9.6 can't talk to each other? My
understanding was that they should be able to.


On Fri, 24 Feb 2023 at 10:36, David Cunningham <dcunning...@voisonics.com>
wrote:
We've tried to remove "sg" from the cluster so we can re-install the GlusterFS 
node on it, but the following command run on "br" also gives a timeout error:

gluster volume remove-brick gvol0 replica 1 sg:/nodirectwritedata/gluster/gvol0 
force

How can we tell "br" to just remove "sg" without trying to contact it?
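
One possible follow-up, noted purely as an untested sketch (we don't know
whether it will succeed while sg is unreachable), is the force variant of
peer detach:

gluster peer detach sg force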


On Fri, 24 Feb 2023 at 10:31, David Cunningham <dcunning...@voisonics.com>
wrote:
Hello,

We have a cluster with two nodes, "sg" and "br", which were running GlusterFS 
9.1, installed via the Ubuntu package manager. We updated the Ubuntu packages 
on "sg" to version 9.6, and now have big problems. The "br" node is still on 
version 9.1.

Running "gluster volume status" on either host gives "Error : Request timed 
out". On "sg" not all processes are running, compared to "br", as below. 
Restarting the services on "sg" doesn't help. Can anyone advise how we should 
proceed? This is a production system.

root@sg:~# ps -ef | grep gluster
root 15196 1  0 22:37 ?        00:00:00 /usr/sbin/glusterd -p
/var/run/glusterd.pid --log-level INFO
root 15426 1  0 22:39 ?        00:00:00 /usr/bin/python3
/usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
root 15457 15426  0 22:39 ?        00:00:00 /usr/bin/python3
/usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
root 19341 13695  0 23:24 pts/1    00:00:00 grep --color=auto gluster

root@br:~# ps -ef | grep gluster
root  2052 1  0  2022 ?        00:00:00 /usr/bin/python3
/usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
root  2062 1  3  2022 ?        10-11:57:16 /usr/sbin/glusterfs
--fuse-mountopts=noatime --process-name fuse --volfile-server=br 
--volfile-server=sg --volfile-id=/gvol0 --fuse-mountopts=noatime /mnt/glusterfs
root  2379  2052  0  2022 ?        00:00:00 /usr/bin/python3
/usr/sbin/glustereventsd --pid-file /var/run/glustereventsd.pid
root  5884 1  5  2022 ?        18-16:08:53 /usr/sbin/glusterfsd -s br
--volfile-id gvol0.br.nodirectwritedata-gluster-gvol0 -p 
/var/run/gluster/vols/gvol0/br-nodirectwritedata-gluster-gvol0.pid -S 
/var/run/gluster/61df1d4e1c65300e.socket --brick-name 
/nodirectwritedata/gluster/gvol0 -l 
/var/log/glusterfs/bricks/nodirectwritedata-gluster-gvol0.log --xlator-option 
*-posix.glusterd-uuid=11e528b0-8c69-4b5d-82ed-c41dd25536d6 --process-name brick 
--brick-port 49152 --xlator-option gvol0-server.listen-port=49152
root 10463 18747  0 23:24 pts/1    00:00:00 grep --color=auto gluster
root 27744 1  0  2022 ?        03:55:10 /usr/sbin/glusterfsd -s br
--volfile-id gvol0.br.nodirectwritedata-gluster-gvol0 -p 
/var/run/gluster/vols/gvol0/br-nodirectwritedata-gluster-gvol0.pid -S 
/var/run/gluster/61df1d4e1c65300e.socket --brick-name 
/nodirectwritedata/gluster/gvol0 -l 
/var/log/glusterfs/bricks/nodirectwritedata-gluster-gvol0.log --xlator-option 
*-posix.glusterd-uuid=11e528b0-8c69-4b5d-82ed-c41dd25536d6 --process-name brick 
--brick-port 49153 --xlator-option gvol0-server.listen-port=49153
root 48227 1  0 Feb17 ?        00:00:26 /usr/sbin/glusterd -p
/var/run/glusterd.pid --log-level INFO

On "sg" in glusterd.log we're seeing:

[2023-02-23 20:26:57.619318 +0000] E [rpc-clnt.c:181:call_bail] 0-management:
bailing out frame type(glusterd mgmt v3), op(--(6)), xid = 0x11, unique = 27,
sent = 2023-02-23 20:16:50.596447 +0000, timeout = 600 for
10.20.20.11:24007
[2023-02-23 20:26:57.619425 +0000] E [MSGID: 106115]
[glusterd-mgmt.c:122:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed
on br. Please check log file for details.
[2023-02-23 20: