Hi Strahil,
hm, I'm not sure what the clients have to do with the situation. "gluster
volume status workdata clients" lists all clients with their IP
addresses.

"gluster peer status" and "gluster volume status" are ok, the latter
one says that all bricks are online, have a port etc. The network is
okay, ping works etc. Well, made a check on one client: umount gluster
volume, remount, now the client appears in the list. Yeah... but why
now? Will try a few more... not that easy as most of these systems are
in production...
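
For completeness, the remount was nothing special - roughly this on the
test client (mount point as used further down in this thread, adjust to
your setup):

umount /mnt/workdata
mount -t glusterfs glusterpub1:/workdata /mnt/workdata

Afterwards the client shows up again in "gluster volume status workdata clients".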

I had enabled the 3 self-heal values, but that didn't have any effect
back then. And, honestly, I won't do it now, because if the heal
started now, it would probably slow down the live system (with the
clients). I'll try it when the cluster isn't used anymore.
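
(For reference, enabling them back then was just the standard volume-set
calls, something like:

gluster volume set workdata cluster.data-self-heal on
gluster volume set workdata cluster.entry-self-heal on
gluster volume set workdata cluster.metadata-self-heal on

so nothing exotic on that front.)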

Interesting - new messages incoming on the "bad" server:

[2024-01-30 14:15:11,820] INFO [utils - 67:log_event] - {'nodeid': '8ea1e6b4-9c77-4390-96a7-8724c3f9dc0f', 'ts': 1706620511, 'event': 'AFR_SPLIT_BRAIN', 'message': {'client-pid': '-6', 'subvol': 'workdata-replicate-2', 'type': 'gfid', 'file': '<gfid:46abf819-8d38-4fa9-848d-a1a2eccafe97>/756>', 'count': '2', 'child-2': 'workdata-client-8', 'gfid-2': '39807be6-b7de-4a82-8a22-cf61b1415208', 'child-0': 'workdata-client-6', 'gfid-0': 'bb4a12ec-f9b7-46bc-9fb3-c57730f1fc49'}}
[2024-01-30 14:15:17,028] INFO [utils - 67:log_event] - {'nodeid': '8ea1e6b4-9c77-4390-96a7-8724c3f9dc0f', 'ts': 1706620517, 'event': 'AFR_SPLIT_BRAIN', 'message': {'client-pid': '-6', 'subvol': 'workdata-replicate-4', 'type': 'gfid', 'file': '<gfid:2f5ef9f3-ddae-4fca-b739-6ff3795bcc0c>/94259611>', 'count': '2', 'child-2': 'workdata-client-14', 'gfid-2': '01234675-17b9-4523-a598-5e331a72c4fa', 'child-0': 'workdata-client-12', 'gfid-0': 'b11140bd-355b-4583-9a85-5d0608589f97'}}

These didn't appear in the beginning. Looks like a funny state this
volume is in :D
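
If it helps, I can also dump what AFR currently flags as split brain; I
guess the standard command would be something like:

gluster volume heal workdata info split-brain

but given that heal-count already doesn't finish, I'm not sure how far
that will get.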


Thx & best regards,

Hubert

Am Di., 30. Jan. 2024 um 07:14 Uhr schrieb Strahil Nikolov
<hunter86...@yahoo.com>:
>
> This is your problem: the bad server has only 3 clients.
>
> I remember there is another gluster volume command that lists the IPs of the
> clients. Find it and run it to see which clients are actually OK (those 3)
> and which of the remaining 17 are not.
>
> Then try to remount those 17 clients, and if the situation persists, work
> with your network team to identify why those 17 clients can't reach the brick.
>
> Do you have selfheal enabled?
>
> cluster.data-self-heal
> cluster.entry-self-heal
> cluster.metadata-self-heal
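>
> You can check the current values with something like (volume name taken
> from your output, just as an example):
>
> gluster volume get workdata cluster.data-self-heal
> gluster volume get workdata cluster.entry-self-heal
> gluster volume get workdata cluster.metadata-self-heal
>
> and turn them on with 'gluster volume set workdata <option> on'.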
>
>
> Best Regards,
>
> Strahil Nikolov
>
> On Mon, Jan 29, 2024 at 10:26, Hu Bert
> <revi...@googlemail.com> wrote:
> Hi,
> not sure what you mean with "clients" - do you mean the clients that
> mount the volume?
>
> gluster volume status workdata clients
> ----------------------------------------------
> Brick : glusterpub2:/gluster/md3/workdata
> Clients connected : 20
> Hostname                 BytesRead     BytesWritten   OpVersion
> --------                 ---------     ------------   ---------
> 192.168.0.222:49140      43698212      41152108       110000
> [...shortened...]
> 192.168.0.126:49123      8362352021    16445401205    110000
> ----------------------------------------------
> Brick : glusterpub3:/gluster/md3/workdata
> Clients connected : 3
> Hostname                 BytesRead     BytesWritten   OpVersion
> --------                 ---------     ------------   ---------
> 192.168.0.44:49150       5855740279    63649538575    110000
> 192.168.0.44:49137       308958200     319216608      110000
> 192.168.0.126:49120      7524915770    15489813449    110000
>
> 192.168.0.44 (glusterpub3) is the "bad" server. Not sure what you mean
> by "old" - probably not the age of the server, but rather the gluster
> version. op-version is 110000 on all servers+clients, upgraded from
> 10.4 -> 11.1
>
> "Have you checked if a client is not allowed to update all 3 copies ?"
> -> are there special log messages for that?
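>
> I guess I could grep the fuse mount logs on the clients for disconnects,
> something like this (the log file name depends on the mount point, so
> this is just an example):
>
> grep -iE "disconnected from|EBADFD" /var/log/glusterfs/mnt-workdata.log
>
> - not sure if that's the kind of message you mean.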
>
> "If it's only 1 system, you can remove the brick, reinitialize it and
> then bring it back for a full sync."
> -> 
> https://docs.gluster.org/en/v3/Administrator%20Guide/Managing%20Volumes/#replace-brick
> -> Replacing bricks in Replicate/Distributed Replicate volumes
>
> this part, right? Well, I can't do this right now, as there are ~33TB of
> data (many small files) to copy, and that would slow down the servers /
> the volume. But once the replacement is running I could do it
> afterwards, just to see what happens.
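>
> If I read the docs right, re-initializing a brick in place would be
> roughly this (untested here, brick path just as an example):
>
> gluster volume reset-brick workdata glusterpub3:/gluster/md3/workdata start
> (wipe and re-create the brick filesystem)
> gluster volume reset-brick workdata glusterpub3:/gluster/md3/workdata glusterpub3:/gluster/md3/workdata commit force
>
> or replace-brick with a new brick path, as described on that page.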
>
>
> Hubert
>
> Am Mo., 29. Jan. 2024 um 08:21 Uhr schrieb Strahil Nikolov
> <hunter86...@yahoo.com>:
> >
> > 2800 is too much. Most probably you are affected by a bug. How old are the 
> > clients ? Is only 1 server affected ?
> > Have you checked if a client is not allowed to update all 3 copies ?
> >
> > If it's only 1 system, you can remove the brick, reinitialize it and then 
> > bring it back for a full sync.
> >
> > Best Regards,
> > Strahil Nikolov
> >
> > On Mon, Jan 29, 2024 at 8:44, Hu Bert
> > <revi...@googlemail.com> wrote:
> > Morning,
> > a few bad apples - but which ones? I checked glustershd.log on the "bad"
> > server and counted today's "gfid mismatch" entries (2800 in total):
> >
> >    44 <gfid:faeea007-2f41-4a72-959f-e9e14e6a9ea4>/212>,
> >    44 <gfid:faeea007-2f41-4a72-959f-e9e14e6a9ea4>/174>,
> >    44 <gfid:d5c6d7b9-f217-4cc9-a664-448d034e74c2>/94037803>,
> >    44 <gfid:d263ecc2-9c21-455c-9ba9-5a999c03adce>/94066216>,
> >    44 <gfid:cbfd5d46-d580-4845-a544-e46fd82c1758>/249771609>,
> >    44 <gfid:aecf217a-0797-43d1-9481-422a8ac8a5d0>/64235523>,
> >    44 <gfid:a701d47b-b3fb-4e7e-bbfb-bc3e19632867>/185>,
> >
> > etc. But as I said, these are pretty new and didn't appear when the
> > volume/servers started misbehaving. Are there scripts/snippets
> > available for how one could handle this?
> >
> > Healing would be very painful for the running system (still connected,
> > but not for much longer), as there surely are 4-5 million entries to
> > be healed. I can't do this now - maybe, when the replacement is in
> > production, one could give it a try.
> >
> > Thx,
> > Hubert
> >
> > Am So., 28. Jan. 2024 um 23:12 Uhr schrieb Strahil Nikolov
> > <hunter86...@yahoo.com>:
> > >
> > > For a gfid mismatch a manual effort is needed, but you can script it.
> > > I think that a few bad "apples" can break the healing, and if you fix them
> > > the healing might recover.
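> > >
> > > For example, once you have a list of the affected paths, something like
> > > this could drive the resolution (just a sketch - check which copy is the
> > > good one before picking a policy):
> > >
> > > while read -r f; do
> > >   gluster volume heal workdata split-brain latest-mtime "$f"
> > > done < files-in-split-brain.txt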
> > >
> > > Also, check why the client is not updating all copies. Most probably you 
> > > have a client that is not able to connect to a brick.
> > >
> > > gluster volume status VOLUME_NAME clients
> > >
> > > Best Regards,
> > > Strahil Nikolov
> > >
> > > On Sun, Jan 28, 2024 at 20:55, Hu Bert
> > > <revi...@googlemail.com> wrote:
> > > Hi Strahil,
> > > there's no arbiter: 3 servers with 5 bricks each.
> > >
> > > Volume Name: workdata
> > > Type: Distributed-Replicate
> > > Volume ID: 7d1e23e5-0308-4443-a832-d36f85ff7959
> > > Status: Started
> > > Snapshot Count: 0
> > > Number of Bricks: 5 x 3 = 15
> > >
> > > The "problem" is: the number of files/entries to-be-healed has
> > > continuously grown since the beginning, and now we're talking about
> > > way too many files to do this manually. Last time i checked: 700K per
> > > brick, should be >900K at the moment. The command 'gluster volume heal
> > > workdata statistics heal-count' is unable to finish. Doesn't look that
> > > good :D
> > >
> > > Interesting, the glustershd.log on the "bad" server now shows errors like 
> > > these:
> > >
> > > [2024-01-28 18:48:33.734053 +0000] E [MSGID: 108008]
> > > [afr-self-heal-common.c:399:afr_gfid_split_brain_source]
> > > 0-workdata-replicate-3: Gfid mismatch detected for
> > > <gfid:70ab3d57-bd82-4932-86bf-d613db32c1ab>/803620716>,
> > > 82d7939a-8919-40ea-9459-7b8af23d3b72 on workdata-client-11 and
> > > bb9399a3-0a5c-4cd1-b2b1-3ee787ec835a on workdata-client-9
> > >
> > > Shouldn't the heals happen on the 2 "good" servers?
> > >
> > > Anyway... we're currently preparing a different solution for our data
> > > and we'll throw away this gluster volume - no critical data will be
> > > lost, as these files are derived from source data (on a different volume
> > > on different servers). It will be a hard time (calculating tons of data),
> > > but the chosen solution should have way better performance.
> > >
> > > Well... thx to all for your efforts, really appreciate that :-)
> > >
> > >
> > > Hubert
> > >
> > > Am So., 28. Jan. 2024 um 08:35 Uhr schrieb Strahil Nikolov
> > > <hunter86...@yahoo.com>:
> > > >
> > > > What about the arbiter node ?
> > > > Actually, check on all nodes and script it - you might need it in the 
> > > > future.
> > > >
> > > > Simplest way to resolve it is to make the file disappear (rename it to
> > > > something else and then rename it back). Another easy trick is to read
> > > > the whole file: dd if=file of=/dev/null status=progress
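> > > >
> > > > In shell terms, via the fuse mount, roughly (path is just an example):
> > > >
> > > > mv /mnt/workdata/some/file /mnt/workdata/some/file.tmp
> > > > mv /mnt/workdata/some/file.tmp /mnt/workdata/some/file
> > > > dd if=/mnt/workdata/some/file of=/dev/null status=progress
> > > >
> > > > The access through the mount is what triggers the heal.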
> > > >
> > > > Best Regards,
> > > > Strahil Nikolov
> > > >
> > > > On Sat, Jan 27, 2024 at 8:24, Hu Bert
> > > > <revi...@googlemail.com> wrote:
> > > > Morning,
> > > >
> > > > gfid1:
> > > > getfattr -d -e hex -m.
> > > > /gluster/md{3,4,5,6,7}/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> > > >
> > > > glusterpub1 (good one):
> > > > getfattr: Removing leading '/' from absolute path names
> > > > # file: 
> > > > gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> > > > trusted.afr.dirty=0x000000000000000000000000
> > > > trusted.afr.workdata-client-11=0x000000020000000100000000
> > > > trusted.gfid=0xfaf5956610f54ddd8b0ca87bc6a334fb
> > > > trusted.gfid2path.c2845024cc9b402e=0x38633139626234612d396236382d343532652d623434652d3664616331666434616465652f31323878313238732e6a7067
> > > > trusted.glusterfs.mdata=0x0100000000000000000000000065aaecff000000002695ebb70000000065aaecff000000002695ebb70000000065aaecff000000002533f110
> > > >
> > > > glusterpub3 (bad one):
> > > > getfattr: 
> > > > /gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb:
> > > > No such file or directory
> > > >
> > > > gfid 2:
> > > > getfattr -d -e hex -m.
> > > > /gluster/md{3,4,5,6,7}/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
> > > >
> > > > glusterpub1 (good one):
> > > > getfattr: Removing leading '/' from absolute path names
> > > > # file: 
> > > > gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
> > > > trusted.afr.dirty=0x000000000000000000000000
> > > > trusted.afr.workdata-client-8=0x000000020000000100000000
> > > > trusted.gfid=0x604657235dc04ebeaced9f2c12e52642
> > > > trusted.gfid2path.ac4669e3c4faf926=0x33366463366137392d666135642d343238652d613738642d6234376230616662316562642f31323878313238732e6a7067
> > > > trusted.glusterfs.mdata=0x0100000000000000000000000065aaecfe000000000c5403bd0000000065aaecfe000000000c5403bd0000000065aaecfe000000000ad61ee4
> > > >
> > > > glusterpub3 (bad one):
> > > > getfattr: 
> > > > /gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642:
> > > > No such file or directory
> > > >
> > > > thx,
> > > > Hubert
> > > >
> > > > Am Sa., 27. Jan. 2024 um 06:13 Uhr schrieb Strahil Nikolov
> > > > <hunter86...@yahoo.com>:
> > > > >
> > > > > You don't need to mount it.
> > > > > Like this :
> > > > > # getfattr -d -e hex -m. 
> > > > > /path/to/brick/.glusterfs/00/46/00462be8-3e61-4931-8bda-dae1645c639e
> > > > > # file: 00/46/00462be8-3e61-4931-8bda-dae1645c639e
> > > > > trusted.gfid=0x00462be83e6149318bdadae1645c639e
> > > > > trusted.gfid2path.05fcbdafdeea18ab=0x30326333373930632d386637622d346436652d393464362d3936393132313930643131312f66696c656c6f636b696e672e7079
> > > > > trusted.glusterfs.mdata=0x010000000000000000000000006170340c0000000025b6a745000000006170340c0000000020efb577000000006170340c0000000020d42b07
> > > > > trusted.glusterfs.shard.block-size=0x0000000004000000
> > > > > trusted.glusterfs.shard.file-size=0x00000000000000cd000000000000000000000000000000010000000000000000
> > > > >
> > > > >
> > > > > Best Regards,
> > > > > Strahil Nikolov
> > > > >
> > > > >
> > > > >
> > > > > В четвъртък, 25 януари 2024 г. в 09:42:46 ч. Гринуич+2, Hu Bert 
> > > > > <revi...@googlemail.com> написа:
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Good morning,
> > > > >
> > > > > hope I got it right... using:
> > > > > https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3.1/html/administration_guide/ch27s02
> > > > >
> > > > > mount -t glusterfs -o aux-gfid-mount glusterpub1:/workdata 
> > > > > /mnt/workdata
> > > > >
> > > > > gfid 1:
> > > > > getfattr -n trusted.glusterfs.pathinfo -e text
> > > > > /mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> > > > > getfattr: Removing leading '/' from absolute path names
> > > > > # file: mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> > > > > trusted.glusterfs.pathinfo="(<DISTRIBUTE:workdata-dht>
> > > > > (<REPLICATE:workdata-replicate-3>
> > > > > <POSIX(/gluster/md6/workdata):glusterpub1:/gluster/md6/workdata/images/133/283/13328349/128x128s.jpg>
> > > > > <POSIX(/gluster/md6/workdata):glusterpub2:/gluster/md6/workdata/images/133/283/13328349/128x128s.jpg>))"
> > > > >
> > > > > gfid 2:
> > > > > getfattr -n trusted.glusterfs.pathinfo -e text
> > > > > /mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642
> > > > > getfattr: Removing leading '/' from absolute path names
> > > > > # file: mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642
> > > > > trusted.glusterfs.pathinfo="(<DISTRIBUTE:workdata-dht>
> > > > > (<REPLICATE:workdata-replicate-2>
> > > > > <POSIX(/gluster/md5/workdata):glusterpub2:/gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642>
> > > > > <POSIX(/gluster/md5/workdata):glusterpub1:/gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642>))"
> > > > >
> > > > > glusterpub1 + glusterpub2 are the good ones, glusterpub3 is the
> > > > > misbehaving (not healing) one.
> > > > >
> > > > > The file with gfid 1 is available under
> > > > > /gluster/md6/workdata/images/133/283/13328349/ on glusterpub1+2
> > > > > bricks, but missing on glusterpub3 brick.
> > > > >
> > > > > gfid 2: 
> > > > > /gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
> > > > > is present on glusterpub1+2, but not on glusterpub3.
> > > > >
> > > > >
> > > > > Thx,
> > > > > Hubert
> > > > >
> > > > > Am Mi., 24. Jan. 2024 um 17:36 Uhr schrieb Strahil Nikolov
> > > > > <hunter86...@yahoo.com>:
> > > > >
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Can you find and check the files with gfids:
> > > > > > 60465723-5dc0-4ebe-aced-9f2c12e52642
> > > > > > faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> > > > > >
> > > > > > Use 'getfattr -d -e hex -m. ' command from 
> > > > > > https://docs.gluster.org/en/main/Troubleshooting/resolving-splitbrain/#analysis-of-the-output
> > > > > >  .
> > > > > >
> > > > > > Best Regards,
> > > > > > Strahil Nikolov
> > > > > >
> > > > > > On Sat, Jan 20, 2024 at 9:44, Hu Bert
> > > > > > <revi...@googlemail.com> wrote:
> > > > > > Good morning,
> > > > > >
> > > > > > thx Gilberto, I did the first three (set to WARNING), but the last one
> > > > > > doesn't work. Anyway, after setting these three, some new messages
> > > > > > appear:
> > > > > >
> > > > > > [2024-01-20 07:23:58.561106 +0000] W [MSGID: 114061] [client-common.c:796:client_pre_lk_v2] 0-workdata-client-11: remote_fd is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb}, {errno=77}, {error=File descriptor in bad state}]
> > > > > > [2024-01-20 07:23:58.561177 +0000] E [MSGID: 108028] [afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-3: Failed getlk for faf59566-10f5-4ddd-8b0c-a87bc6a334fb [File descriptor in bad state]
> > > > > > [2024-01-20 07:23:58.562151 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-11: remote operation failed. [{path=<gfid:faf59566-10f5-4ddd-8b0c-a87bc6a334fb>}, {gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb}, {errno=2}, {error=No such file or directory}]
> > > > > > [2024-01-20 07:23:58.562296 +0000] W [MSGID: 114061] [client-common.c:530:client_pre_flush_v2] 0-workdata-client-11: remote_fd is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb}, {errno=77}, {error=File descriptor in bad state}]
> > > > > > [2024-01-20 07:23:58.860552 +0000] W [MSGID: 114061] [client-common.c:796:client_pre_lk_v2] 0-workdata-client-8: remote_fd is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642}, {errno=77}, {error=File descriptor in bad state}]
> > > > > > [2024-01-20 07:23:58.860608 +0000] E [MSGID: 108028] [afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-2: Failed getlk for 60465723-5dc0-4ebe-aced-9f2c12e52642 [File descriptor in bad state]
> > > > > > [2024-01-20 07:23:58.861520 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-8: remote operation failed. [{path=<gfid:60465723-5dc0-4ebe-aced-9f2c12e52642>}, {gfid=60465723-5dc0-4ebe-aced-9f2c12e52642}, {errno=2}, {error=No such file or directory}]
> > > > > > [2024-01-20 07:23:58.861640 +0000] W [MSGID: 114061] [client-common.c:530:client_pre_flush_v2] 0-workdata-client-8: remote_fd is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642}, {errno=77}, {error=File descriptor in bad state}]
> > > > > >
> > > > > > Not many log entries appear, only a few. Has someone seen error
> > > > > > messages like these? Setting diagnostics.brick-sys-log-level to 
> > > > > > DEBUG
> > > > > > shows way more log entries, uploaded it to:
> > > > > > https://file.io/spLhlcbMCzr8 - not sure if that helps.
> > > > > >
> > > > > >
> > > > > > Thx,
> > > > > > Hubert
> > > > > >
> > > > > > Am Fr., 19. Jan. 2024 um 16:24 Uhr schrieb Gilberto Ferreira
> > > > > > <gilberto.nune...@gmail.com>:
> > > > > >
> > > > > > >
> > > > > > > gluster volume set testvol diagnostics.brick-log-level WARNING
> > > > > > > gluster volume set testvol diagnostics.brick-sys-log-level WARNING
> > > > > > > gluster volume set testvol diagnostics.client-log-level ERROR
> > > > > > > gluster --log-level=ERROR volume status
> > > > > > >
> > > > > > > ---
> > > > > > > Gilberto Nunes Ferreira
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Em sex., 19 de jan. de 2024 às 05:49, Hu Bert 
> > > > > > > <revi...@googlemail.com> escreveu:
> > > > > > >>
> > > > > > >> Hi Strahil,
> > > > > > >> hm, don't get me wrong, it may sound a bit stupid, but... where do I
> > > > > > >> set the log level? Using Debian...
> > > > > > >>
> > > > > > >> https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level
> > > > > > >>
> > > > > > >> ls /etc/glusterfs/
> > > > > > >> eventsconfig.json     glusterfs-georep-logrotate  gluster-rsyslog-5.8.conf
> > > > > > >> glusterd.vol          glusterfs-logrotate         gluster-rsyslog-7.2.conf
> > > > > > >> group-db-workload     group-distributed-virt      group-gluster-block
> > > > > > >> group-metadata-cache  group-nl-cache              group-samba
> > > > > > >> group-virt.example    gsyncd.conf                 logger.conf.example
> > > > > > >> thin-arbiter.vol
> > > > > > >>
> > > > > > >> checked: /etc/glusterfs/logger.conf.example
> > > > > > >>
> > > > > > >> # To enable enhanced logging capabilities,
> > > > > > >> #
> > > > > > >> # 1. rename this file to /etc/glusterfs/logger.conf
> > > > > > >> #
> > > > > > >> # 2. rename /etc/rsyslog.d/gluster.conf.example to
> > > > > > >> #    /etc/rsyslog.d/gluster.conf
> > > > > > >> #
> > > > > > >> # This change requires restart of all gluster services/volumes 
> > > > > > >> and
> > > > > > >> # rsyslog.
> > > > > > >>
> > > > > > >> tried (to test): /etc/glusterfs/logger.conf with " 
> > > > > > >> LOG_LEVEL='WARNING' "
> > > > > > >>
> > > > > > >> restarted glusterd on that node, but this doesn't work; the log level
> > > > > > >> stays at INFO. /etc/rsyslog.d/gluster.conf.example does not exist -
> > > > > > >> probably /etc/rsyslog.conf on Debian. But first it would be better to know
> > > > > > >> where to set the log level for glusterd.
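> > > > > > >>
> > > > > > >> Maybe the daemon option itself would do, something like (untested here):
> > > > > > >>
> > > > > > >> glusterd --log-level DEBUG
> > > > > > >>
> > > > > > >> e.g. via the systemd unit - but that feels a bit crude.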
> > > > > > >>
> > > > > > >> Depending on how much the DEBUG log level talks ;-) I could assign up
> > > > > > >> to 100G to /var
> > > > > > >>
> > > > > > >>
> > > > > > >> Thx & best regards,
> > > > > > >> Hubert
> > > > > > >>
> > > > > > >>
> > > > > > >> Am Do., 18. Jan. 2024 um 22:58 Uhr schrieb Strahil Nikolov
> > > > > > >> <hunter86...@yahoo.com>:
> > > > > > >> >
> > > > > > >> > Are you able to set the logs to debug level ?
> > > > > > >> > It might provide a clue what it is going on.
> > > > > > >> >
> > > > > > >> > Best Regards,
> > > > > > >> > Strahil Nikolov
> > > > > > >> >
> > > > > > >> > On Thu, Jan 18, 2024 at 13:08, Diego Zuccato
> > > > > > >> > <diego.zucc...@unibo.it> wrote:
> > > > > > >> > That's the same kind of errors I keep seeing on my 2 clusters,
> > > > > > >> > regenerated some months ago. Seems a pseudo-split-brain that 
> > > > > > >> > should be
> > > > > > >> > impossible on a replica 3 cluster but keeps happening.
> > > > > > >> > Sadly going to ditch Gluster ASAP.
> > > > > > >> >
> > > > > > >> > Diego
> > > > > > >> >
> > > > > > >> > Il 18/01/2024 07:11, Hu Bert ha scritto:
> > > > > > >> > > Good morning,
> > > > > > >> > > heal still not running. Pending heals now sum up to 60K per 
> > > > > > >> > > brick.
> > > > > > >> > > Heal was starting instantly e.g. after server reboot with 
> > > > > > >> > > version
> > > > > > >> > > 10.4, but doesn't with version 11. What could be wrong?
> > > > > > >> > >
> > > > > > >> > > I only see these errors on one of the "good" servers in 
> > > > > > >> > > glustershd.log:
> > > > > > >> > >
> > > > > > >> > > [2024-01-18 06:08:57.328480 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0: remote operation failed. [{path=<gfid:cb39a1e4-2a4c-4727-861d-3ed9ef00681b>}, {gfid=cb39a1e4-2a4c-4727-861d-3ed9ef00681b}, {errno=2}, {error=No such file or directory}]
> > > > > > >> > > [2024-01-18 06:08:57.594051 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1: remote operation failed. [{path=<gfid:3e9b178c-ae1f-4d85-ae47-fc539d94dd11>}, {gfid=3e9b178c-ae1f-4d85-ae47-fc539d94dd11}, {errno=2}, {error=No such file or directory}]
> > > > > > >> > >
> > > > > > >> > > About 7K today. Any ideas? Someone?
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > Best regards,
> > > > > > >> > > Hubert
> > > > > > >> > >
> > > > > > >> > > Am Mi., 17. Jan. 2024 um 11:24 Uhr schrieb Hu Bert 
> > > > > > >> > > <revi...@googlemail.com>:
> > > > > > >> > >>
> > > > > > >> > >> OK, finally managed to get all servers, volumes etc. running, but it
> > > > > > >> > >> took a couple of restarts, cksum checks etc.
> > > > > > >> > >>
> > > > > > >> > >> One problem: a volume doesn't heal automatically or doesn't 
> > > > > > >> > >> heal at all.
> > > > > > >> > >>
> > > > > > >> > >> gluster volume status
> > > > > > >> > >> Status of volume: workdata
> > > > > > >> > >> Gluster process                            TCP Port  RDMA Port  Online  Pid
> > > > > > >> > >> ------------------------------------------------------------------------------
> > > > > > >> > >> Brick glusterpub1:/gluster/md3/workdata    58832     0          Y       3436
> > > > > > >> > >> Brick glusterpub2:/gluster/md3/workdata    59315     0          Y       1526
> > > > > > >> > >> Brick glusterpub3:/gluster/md3/workdata    56917     0          Y       1952
> > > > > > >> > >> Brick glusterpub1:/gluster/md4/workdata    59688     0          Y       3755
> > > > > > >> > >> Brick glusterpub2:/gluster/md4/workdata    60271     0          Y       2271
> > > > > > >> > >> Brick glusterpub3:/gluster/md4/workdata    49461     0          Y       2399
> > > > > > >> > >> Brick glusterpub1:/gluster/md5/workdata    54651     0          Y       4208
> > > > > > >> > >> Brick glusterpub2:/gluster/md5/workdata    49685     0          Y       2751
> > > > > > >> > >> Brick glusterpub3:/gluster/md5/workdata    59202     0          Y       2803
> > > > > > >> > >> Brick glusterpub1:/gluster/md6/workdata    55829     0          Y       4583
> > > > > > >> > >> Brick glusterpub2:/gluster/md6/workdata    50455     0          Y       3296
> > > > > > >> > >> Brick glusterpub3:/gluster/md6/workdata    50262     0          Y       3237
> > > > > > >> > >> Brick glusterpub1:/gluster/md7/workdata    52238     0          Y       5014
> > > > > > >> > >> Brick glusterpub2:/gluster/md7/workdata    52474     0          Y       3673
> > > > > > >> > >> Brick glusterpub3:/gluster/md7/workdata    57966     0          Y       3653
> > > > > > >> > >> Self-heal Daemon on localhost              N/A       N/A        Y       4141
> > > > > > >> > >> Self-heal Daemon on glusterpub1            N/A       N/A        Y       5570
> > > > > > >> > >> Self-heal Daemon on glusterpub2            N/A       N/A        Y       4139
> > > > > > >> > >>
> > > > > > >> > >> "gluster volume heal workdata info" lists a lot of files 
> > > > > > >> > >> per brick.
> > > > > > >> > >> "gluster volume heal workdata statistics heal-count" shows 
> > > > > > >> > >> thousands
> > > > > > >> > >> of files per brick.
> > > > > > >> > >> "gluster volume heal workdata enable" has no effect.
> > > > > > >> > >>
> > > > > > >> > >> gluster volume heal workdata full
> > > > > > >> > >> Launching heal operation to perform full self heal on 
> > > > > > >> > >> volume workdata
> > > > > > >> > >> has been successful
> > > > > > >> > >> Use heal info commands to check status.
> > > > > > >> > >>
> > > > > > >> > >> -> not doing anything at all. And nothing is happening on the 2 "good"
> > > > > > >> > >> servers in e.g. glustershd.log. Heal was working as expected on
> > > > > > >> > >> version 10.4, but here... silence. Does anyone have an idea?
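> > > > > > >> > >>
> > > > > > >> > >> For what it's worth, to see whether anything moves at all, I guess one
> > > > > > >> > >> could also watch the summary (if that command variant works here):
> > > > > > >> > >>
> > > > > > >> > >> gluster volume heal workdata info summary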
> > > > > > >> > >>
> > > > > > >> > >>
> > > > > > >> > >> Best regards,
> > > > > > >> > >> Hubert
> > > > > > >> > >>
> > > > > > >> > >> Am Di., 16. Jan. 2024 um 13:44 Uhr schrieb Gilberto Ferreira
> > > > > > >> > >> <gilberto.nune...@gmail.com>:
> > > > > > >> > >>>
> > > > > > >> > >>> Ah! Indeed! You need to perform an upgrade in the clients 
> > > > > > >> > >>> as well.
> > > > > > >> > >>>
> > > > > > >> > >>>
> > > > > > >> > >>>
> > > > > > >> > >>>
> > > > > > >> > >>>
> > > > > > >> > >>>
> > > > > > >> > >>>
> > > > > > >> > >>>
> > > > > > >> > >>> Em ter., 16 de jan. de 2024 às 03:12, Hu Bert 
> > > > > > >> > >>> <revi...@googlemail.com> escreveu:
> > > > > > >> > >>>>
> > > > > > >> > >>>> morning to those still reading :-)
> > > > > > >> > >>>>
> > > > > > >> > >>>> I found this:
> > > > > > >> > >>>> https://docs.gluster.org/en/main/Troubleshooting/troubleshooting-glusterd/#common-issues-and-how-to-resolve-them
> > > > > > >> > >>>>
> > > > > > >> > >>>> there's a paragraph about "peer rejected" with the same error message,
> > > > > > >> > >>>> telling me: "Update the cluster.op-version" - I had only updated the
> > > > > > >> > >>>> server nodes, but not the clients, so upgrading the cluster.op-version
> > > > > > >> > >>>> wasn't possible at this time. So... upgrading the clients to version
> > > > > > >> > >>>> 11.1 and then the op-version should solve the problem?
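> > > > > > >> > >>>>
> > > > > > >> > >>>> If I got that right, after the client upgrade the bump itself should
> > > > > > >> > >>>> just be something like (sketch):
> > > > > > >> > >>>>
> > > > > > >> > >>>> gluster volume get all cluster.max-op-version
> > > > > > >> > >>>> gluster volume set all cluster.op-version 110000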
> > > > > > >> > >>>>
> > > > > > >> > >>>>
> > > > > > >> > >>>> Thx,
> > > > > > >> > >>>> Hubert
> > > > > > >> > >>>>
> > > > > > >> > >>>> Am Mo., 15. Jan. 2024 um 09:16 Uhr schrieb Hu Bert 
> > > > > > >> > >>>> <revi...@googlemail.com>:
> > > > > > >> > >>>>>
> > > > > > >> > >>>>> Hi,
> > > > > > >> > >>>>> just upgraded some gluster servers from version 10.4 to 
> > > > > > >> > >>>>> version 11.1.
> > > > > > >> > >>>>> Debian bullseye & bookworm. When only installing the 
> > > > > > >> > >>>>> packages: good,
> > > > > > >> > >>>>> servers, volumes etc. work as expected.
> > > > > > >> > >>>>>
> > > > > > >> > >>>>> But one needs to test if the systems work after a daemon 
> > > > > > >> > >>>>> and/or server
> > > > > > >> > >>>>> restart. Well, did a reboot, and after that the 
> > > > > > >> > >>>>> rebooted/restarted
> > > > > > >> > >>>>> system is "out". Log message from working node:
> > > > > > >> > >>>>>
> > > > > > >> > >>>>> [2024-01-15 08:02:21.585694 +0000] I [MSGID: 106163]
> > > > > > >> > >>>>> [glusterd-handshake.c:1501:__glusterd_mgmt_hndsk_versions_ack]
> > > > > > >> > >>>>> 0-management: using the op-version 100000
> > > > > > >> > >>>>> [2024-01-15 08:02:21.589601 +0000] I [MSGID: 106490]
> > > > > > >> > >>>>> [glusterd-handler.c:2546:__glusterd_handle_incoming_friend_req]
> > > > > > >> > >>>>> 0-glusterd: Received probe from uuid:
> > > > > > >> > >>>>> b71401c3-512a-47cb-ac18-473c4ba7776e
> > > > > > >> > >>>>> [2024-01-15 08:02:23.608349 +0000] E [MSGID: 106010]
> > > > > > >> > >>>>> [glusterd-utils.c:3824:glusterd_compare_friend_volume] 
> > > > > > >> > >>>>> 0-management:
> > > > > > >> > >>>>> Version of Cksums sourceimages differ. local cksum = 
> > > > > > >> > >>>>> 2204642525,
> > > > > > >> > >>>>> remote cksum = 1931483801 on peer gluster190
> > > > > > >> > >>>>> [2024-01-15 08:02:23.608584 +0000] I [MSGID: 106493]
> > > > > > >> > >>>>> [glusterd-handler.c:3819:glusterd_xfer_friend_add_resp] 
> > > > > > >> > >>>>> 0-glusterd:
> > > > > > >> > >>>>> Responded to gluster190 (0), ret: 0, op_ret: -1
> > > > > > >> > >>>>> [2024-01-15 08:02:23.613553 +0000] I [MSGID: 106493]
> > > > > > >> > >>>>> [glusterd-rpc-ops.c:467:__glusterd_friend_add_cbk] 
> > > > > > >> > >>>>> 0-glusterd:
> > > > > > >> > >>>>> Received RJT from uuid: 
> > > > > > >> > >>>>> b71401c3-512a-47cb-ac18-473c4ba7776e, host:
> > > > > > >> > >>>>> gluster190, port: 0
> > > > > > >> > >>>>>
> > > > > > >> > >>>>> peer status from rebooted node:
> > > > > > >> > >>>>>
> > > > > > >> > >>>>> root@gluster190 ~ # gluster peer status
> > > > > > >> > >>>>> Number of Peers: 2
> > > > > > >> > >>>>>
> > > > > > >> > >>>>> Hostname: gluster189
> > > > > > >> > >>>>> Uuid: 50dc8288-aa49-4ea8-9c6c-9a9a926c67a7
> > > > > > >> > >>>>> State: Peer Rejected (Connected)
> > > > > > >> > >>>>>
> > > > > > >> > >>>>> Hostname: gluster188
> > > > > > >> > >>>>> Uuid: e15a33fe-e2f7-47cf-ac53-a3b34136555d
> > > > > > >> > >>>>> State: Peer Rejected (Connected)
> > > > > > >> > >>>>>
> > > > > > >> > >>>>> So the rebooted gluster190 is not accepted anymore. And 
> > > > > > >> > >>>>> thus does not
> > > > > > >> > >>>>> appear in "gluster volume status". I then followed this 
> > > > > > >> > >>>>> guide:
> > > > > > >> > >>>>>
> > > > > > >> > >>>>> https://gluster-documentations.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/
> > > > > > >> > >>>>>
> > > > > > >> > >>>>> Remove everything under /var/lib/glusterd/ (except glusterd.info) and
> > > > > > >> > >>>>> restart the glusterd service etc. Data gets copied from the other nodes,
> > > > > > >> > >>>>> 'gluster peer status' is OK again - but the volume info is missing,
> > > > > > >> > >>>>> /var/lib/glusterd/vols is empty. When syncing this dir from another
> > > > > > >> > >>>>> node, the volume is available again and heals start etc.
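> > > > > > >> > >>>>>
> > > > > > >> > >>>>> (In commands that was roughly - following the guide, not verbatim:
> > > > > > >> > >>>>>
> > > > > > >> > >>>>> systemctl stop glusterd
> > > > > > >> > >>>>> cd /var/lib/glusterd && find . -mindepth 1 -maxdepth 1 ! -name glusterd.info -exec rm -rf {} +
> > > > > > >> > >>>>> systemctl start glusterd
> > > > > > >> > >>>>> gluster peer probe gluster189
> > > > > > >> > >>>>> systemctl restart glusterd
> > > > > > >> > >>>>>
> > > > > > >> > >>>>> plus manually syncing /var/lib/glusterd/vols from another node afterwards.)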
> > > > > > >> > >>>>>
> > > > > > >> > >>>>> Well, and just to be sure that everything's working as 
> > > > > > >> > >>>>> it should,
> > > > > > >> > >>>>> rebooted that node again - the rebooted node is kicked 
> > > > > > >> > >>>>> out again, and
> > > > > > >> > >>>>> you have to restart bringing it back again.
> > > > > > >> > >>>>>
> > > > > > >> > >>>>> Sorry, but did I miss anything? Has anyone experienced similar
> > > > > > >> > >>>>> problems? I'll probably downgrade to 10.4 again; that version was
> > > > > > >> > >>>>> working...
> > > > > > >> > >>>>>
> > > > > > >> > >>>>>
> > > > > > >> > >>>>> Thx,
> > > > > > >> > >>>>> Hubert
> > > > > > >> >
> > > > > > >> > --
> > > > > > >> > Diego Zuccato
> > > > > > >> > DIFA - Dip. di Fisica e Astronomia
> > > > > > >> > Servizi Informatici
> > > > > > >> > Alma Mater Studiorum - Università di Bologna
> > > > > > >> > V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> > > > > > >> > tel.: +39 051 20 95786
> > > > > > >> >
> > > > > > >> >
________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
