Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
Hello,

just to bring this to an end: the servers and the volume are "out of service", so I tried to repair it:

- umounted all related mounts
- rebooted the misbehaving server
- mounted the volume on all clients

Well, no healing happens. 'gluster volume status workdata clients' looks good, by the way.

gluster volume heal workdata statistics heal-count: empty.
gluster volume heal workdata info: lists lots of files.

glustershd on the "good" servers:

[2024-02-15 09:31:32.427779 +] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-13: remote operation failed. [{path=}, {gfid=2a9dfe1d-c617-4ca5-9842-52675581c880}, {errno=2}, {error=No such file or directory}]

glustershd on the "bad" server:

[2024-02-15 09:32:18.613343 +] E [MSGID: 108008] [afr-self-heal-common.c:399:afr_gfid_split_brain_source] 0-workdata-replicate-2: Gfid mismatch detected for /854>, bb8e53c7-0446-4f82-bd23-12253e8484db on workdata-client-8 and a42769e2-f6ba-44b0-ad8c-1e451ba943a6 on workdata-client-6.
[2024-02-15 09:32:18.613550 +] E [MSGID: 108008] [afr-self-heal-entry.c:465:afr_selfheal_detect_gfid_and_type_mismatch] 0-workdata-replicate-2: Skipping conservative merge on the file.

Well, I won't put any more work into this. The volume is screwed up and has been replaced by a different solution. The servers will be decommissioned soon.

Thx for all your efforts,
Hubert

On Wed, 31 Jan 2024 at 17:10, Strahil Nikolov wrote:
> Hi,
>
> This is a simplified description, see the links below for a more detailed one.
> [...]
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
Hi,

This is a simplified description; see the links below for a more detailed one.

When a client makes a change to a file, it commits that change to all bricks simultaneously, and if the change succeeds on a quorate number of bricks (in your case 2 out of 3 is enough) it is treated as successful. During that phase the 2 bricks that completed the task successfully will mark the 3rd brick as 'dirty', and you will see that in the heal report. Only when the heal daemon syncs the file to the final brick will that heal entry be cleared from the remaining bricks.

If a client has only 2 out of 3 bricks connected, it will constantly create new files for healing (as it can't save them on all 3), and this gets worse as the number of clients that fail to connect to the 3rd brick increases.

Check that all clients' IPs are connected to all bricks, and remount the volume on those that are not. After remounting, the behavior should not persist. If it does, check with your network/firewall team to troubleshoot the problem.

You can use 'gluster volume status all client-list' and 'gluster volume status all clients' (where 'all' can be replaced by the volume name) to find more details on that side.

You can find a more detailed explanation of the whole process in this blog:
https://ravispeaks.wordpress.com/2019/04/05/glusterfs-afr-the-complete-guide/
https://ravispeaks.wordpress.com/2019/04/15/gluster-afr-the-complete-guide-part-2/
https://ravispeaks.wordpress.com/2019/05/14/gluster-afr-the-complete-guide-part-3/

Best Regards,
Strahil Nikolov

On Tue, Jan 30, 2024 at 15:26, Hu Bert wrote:
> Hi Strahil,
> hm, not sure what the clients have to do with the situation.
> [...]
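The quorum rule described in this message can be illustrated with a few lines of shell. This is only a sketch, not GlusterFS code: it assumes that "quorate" means a strict majority of the replica bricks, and the function name write_outcome is made up for the example.

```shell
#!/bin/sh
# Illustration only, not GlusterFS code.
# Assumption: quorum is a strict majority of the replica bricks.

# write_outcome ACKED TOTAL
#   ACKED = number of bricks that accepted the write
#   TOTAL = number of bricks in the replica set
write_outcome() {
    acked=$1
    total=$2
    if [ $((acked * 2)) -gt "$total" ]; then
        # The write counts as successful; every brick that missed it
        # is marked for healing by the bricks that accepted it.
        echo "ok, bricks left to heal: $((total - acked))"
    else
        echo "failed, quorum not met"
    fi
}

write_outcome 3 3   # prints: ok, bricks left to heal: 0
write_outcome 2 3   # prints: ok, bricks left to heal: 1
write_outcome 1 3   # prints: failed, quorum not met
```

The second case is the one discussed in this thread: a client that reaches only 2 of 3 bricks still writes successfully, but each write leaves one more entry for the self-heal daemon.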
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
Hi Strahil,

hm, not sure what the clients have to do with the situation. "gluster volume status workdata clients" lists all clients with their IP addresses.

"gluster peer status" and "gluster volume status" are OK; the latter says that all bricks are online, have a port etc. The network is okay, ping works etc. Well, I made a check on one client: umount the gluster volume, remount, and now the client appears in the list. Yeah... but why now? Will try a few more... not that easy, as most of these systems are in production...

I had enabled the 3 self-heal values, but that didn't have any effect back then. And, honestly, I won't do it now, because if the heal started now it would probably slow down the live system (with the clients). I'll try it when the cluster isn't used anymore.

Interesting: new messages are coming in on the "bad" server:

[2024-01-30 14:15:11,820] INFO [utils - 67:log_event] - {'nodeid': '8ea1e6b4-9c77-4390-96a7-8724c3f9dc0f', 'ts': 1706620511, 'event': 'AFR_SPLIT_BRAIN', 'message': {'client-pid': '-6', 'subvol': 'workdata-replicate-2', 'type': 'gfid', 'file': '/756>', 'count': '2', 'child-2': 'workdata-client-8', 'gfid-2': '39807be6-b7de-4a82-8a22-cf61b1415208', 'child-0': 'workdata-client-6', 'gfid-0': 'bb4a12ec-f9b7-46bc-9fb3-c57730f1fc49'}}
[2024-01-30 14:15:17,028] INFO [utils - 67:log_event] - {'nodeid': '8ea1e6b4-9c77-4390-96a7-8724c3f9dc0f', 'ts': 1706620517, 'event': 'AFR_SPLIT_BRAIN', 'message': {'client-pid': '-6', 'subvol': 'workdata-replicate-4', 'type': 'gfid', 'file': '/94259611>', 'count': '2', 'child-2': 'workdata-client-14', 'gfid-2': '01234675-17b9-4523-a598-5e331a72c4fa', 'child-0': 'workdata-client-12', 'gfid-0': 'b11140bd-355b-4583-9a85-5d0608589f97'}}

They didn't appear in the beginning. Looks like a funny state that this volume is in :D

Thx & best regards,
Hubert

On Tue, 30 Jan 2024 at 07:14, Strahil Nikolov wrote:
> This is your problem: the bad server has only 3 clients.
> [...]
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
This is your problem: the bad server has only 3 clients.

I remember there is another gluster volume command to list the IPs of the clients. Find it and run it to find out which clients are actually OK (those 3) and which are not (the remaining 17).

Then try to remount those 17 clients, and if the situation persists, work with your network team to identify why the 17 clients can't reach the brick.

Do you have self-heal enabled?

cluster.data-self-heal
cluster.entry-self-heal
cluster.metadata-self-heal

Best Regards,
Strahil Nikolov

On Mon, Jan 29, 2024 at 10:26, Hu Bert wrote:
> Hi,
> not sure what you mean with "clients" - do you mean the clients that mount the volume?
> [...]
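Spotting a brick with too few clients, as suggested in this message, can be scripted. The following is a rough sketch that assumes the output of 'gluster volume status VOLUME clients' contains "Brick : ..." and "Clients connected : ..." lines in the form quoted in this thread; the function name clients_per_brick is hypothetical.

```shell
#!/bin/sh
# Sketch: summarise "Clients connected" per brick.
# Assumption: stdin has the line format shown in this thread, e.g.
#   Brick : glusterpub2:/gluster/md3/workdata
#   Clients connected : 20

clients_per_brick() {
    awk '
        { sub(/[ \t]+$/, "") }                     # strip trailing whitespace
        /^Brick[ \t]*:/ {
            sub(/^Brick[ \t]*:[ \t]*/, "")         # keep only host:/path
            brick = $0
            next
        }
        /^Clients connected[ \t]*:/ {
            sub(/^Clients connected[ \t]*:[ \t]*/, "")
            print $0 "\t" brick                    # count, tab, brick
        }
    '
}
```

On a server it could be fed with "gluster volume status workdata clients | clients_per_brick | sort -n", which puts the bricks with the fewest connected clients first.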
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
Hi,

not sure what you mean by "clients": do you mean the clients that mount the volume?

gluster volume status workdata clients
----------------------------------------------
Brick : glusterpub2:/gluster/md3/workdata
Clients connected : 20
Hostname               BytesRead     BytesWritten   OpVersion
--------               ---------     ------------   ---------
192.168.0.222:49140    43698212      41152108       11
[...shortened...]
192.168.0.126:49123    8362352021    16445401205    11
----------------------------------------------
Brick : glusterpub3:/gluster/md3/workdata
Clients connected : 3
Hostname               BytesRead     BytesWritten   OpVersion
--------               ---------     ------------   ---------
192.168.0.44:49150     5855740279    63649538575    11
192.168.0.44:49137     308958200     319216608      11
192.168.0.126:49120    7524915770    15489813449    11

192.168.0.44 (glusterpub3) is the "bad" server. Not sure what you mean by "old": probably not the age of the server, but rather the gluster version. op-version is 11 on all servers and clients, upgraded from 10.4 -> 11.1.

"Have you checked if a client is not allowed to update all 3 copies?"
-> Are there special log messages for that?

"If it's only 1 system, you can remove the brick, reinitialize it and then bring it back for a full sync."
-> https://docs.gluster.org/en/v3/Administrator%20Guide/Managing%20Volumes/#replace-brick
-> Replacing bricks in Replicate/Distributed Replicate volumes

This part, right? Well, I can't do this right now, as there are ~33TB of data (many small files) to copy, and that would slow down the servers / the volume. But once the replacement is running I could do it afterwards, just to see what happens.

Hubert

On Mon, 29 Jan 2024 at 08:21, Strahil Nikolov wrote:
> 2800 is too much. Most probably you are affected by a bug.
> [...]
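To find out which client IPs are connected to a healthy brick but missing on the "bad" one, the status output in this message can be compared per brick. A minimal sketch, assuming client rows start with IP:PORT as shown in this thread; missing_clients is a hypothetical helper name, and both brick names are passed exactly as they appear after "Brick :".

```shell
#!/bin/sh
# Sketch: list client IPs seen on GOOD_BRICK but not on BAD_BRICK.
# Assumption: stdin is "gluster volume status VOLUME clients" output in
# the format quoted in this thread (client rows begin with IP:PORT).

# missing_clients GOOD_BRICK BAD_BRICK
missing_clients() {
    awk -v good="$1" -v bad="$2" '
        { sub(/[ \t]+$/, "") }                     # strip trailing whitespace
        /^Brick[ \t]*:/ {
            sub(/^Brick[ \t]*:[ \t]*/, "")
            brick = $0                             # remember current brick
            next
        }
        $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+$/ {
            split($1, hp, ":")                     # hp[1] = IP, hp[2] = port
            if (brick == good) in_good[hp[1]] = 1
            if (brick == bad)  in_bad[hp[1]] = 1
        }
        END {
            for (ip in in_good)
                if (!(ip in in_bad)) print ip
        }
    ' | sort
}
```

For the volume in this thread the call might look like "gluster volume status workdata clients | missing_clients glusterpub2:/gluster/md3/workdata glusterpub3:/gluster/md3/workdata"; the IPs it prints are the clients worth remounting first.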
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
2800 is too much. Most probably you are affected by a bug. How old are the clients? Is only 1 server affected?
Have you checked if a client is not allowed to update all 3 copies?

If it's only 1 system, you can remove the brick, reinitialize it and then bring it back for a full sync.

Best Regards,
Strahil Nikolov

On Mon, Jan 29, 2024 at 8:44, Hu Bert wrote:
> Morning,
> a few bad apples - but which ones?
> [...]
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
Morning,

a few bad apples, but which ones? I checked glustershd.log on the "bad" server and counted today's "gfid mismatch" entries (2800 in total):

    44 /212>,
    44 /174>,
    44 /94037803>,
    44 /94066216>,
    44 /249771609>,
    44 /64235523>,
    44 /185>,

etc. But as I said, these are pretty new and didn't appear when the volume/servers started misbehaving. Are there scripts/snippets available for handling this?

Healing would be very painful for the running system (still connected, but not for much longer), as there surely are 4-5 million entries to be healed. I can't do this now; maybe once the replacement is in production, one could give it a try.

Thx,
Hubert

On Sun, 28 Jan 2024 at 23:12, Strahil Nikolov wrote:
> For a gfid mismatch, manual effort is needed, but you can script it.
> I think that a few bad "apples" can break the healing, and if you fix them
> the healing might recover.
>
> Also, check why the client is not updating all copies. Most probably you have
> a client that is not able to connect to a brick.
>
> gluster volume status VOLUME_NAME clients
>
> Best Regards,
> Strahil Nikolov
>
> On Sun, Jan 28, 2024 at 20:55, Hu Bert wrote:
> > Hi Strahil,
> > there's no arbiter: 3 servers with 5 bricks each.
> > [...]
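A per-entry count like the one in this message can be produced with a short filter over glustershd.log. This is a sketch of one way to do it, not necessarily how the numbers above were obtained; it assumes the log lines contain the text "Gfid mismatch detected for ENTRY, ..." as quoted in this thread, and count_gfid_mismatches is a hypothetical name.

```shell
#!/bin/sh
# Sketch: count "Gfid mismatch detected for ENTRY," log lines per ENTRY.
# Assumption: the message text matches the glustershd.log lines quoted
# in this thread; ENTRY is everything up to the first comma.

count_gfid_mismatches() {
    awk -F'Gfid mismatch detected for ' '
        NF > 1 {
            split($2, parts, ",")      # parts[1] = entry before the comma
            seen[parts[1]]++
        }
        END {
            for (entry in seen) print seen[entry], entry
        }
    ' | sort -rn                       # most frequent entries first
}
```

A typical call would be "count_gfid_mismatches < glustershd.log" (the log location depends on the installation). The entries at the top of the list are the first candidates for a manual fix.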
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
Hi Strahil,
there's no arbiter: 3 servers with 5 bricks each.

Volume Name: workdata
Type: Distributed-Replicate
Volume ID: 7d1e23e5-0308-4443-a832-d36f85ff7959
Status: Started
Snapshot Count: 0
Number of Bricks: 5 x 3 = 15

The "problem" is: the number of files/entries to-be-healed has
continuously grown since the beginning, and now we're talking about
way too many files to do this manually. Last time i checked: 700K per
brick, should be >900K at the moment. The command 'gluster volume heal
workdata statistics heal-count' is unable to finish. Doesn't look that
good :D

Interesting, the glustershd.log on the "bad" server now shows errors
like these:

[2024-01-28 18:48:33.734053 +] E [MSGID: 108008] [afr-self-heal-common.c:399:afr_gfid_split_brain_source] 0-workdata-replicate-3: Gfid mismatch detected for /803620716>, 82d7939a-8919-40ea-9459-7b8af23d3b72 on workdata-client-11 and bb9399a3-0a5c-4cd1-b2b1-3ee787ec835a on workdata-client-9

Shouldn't the heals happen on the 2 "good" servers?

Anyway... we're currently preparing a different solution for our data
and we'll throw away this gluster volume - no critical data will be
lost, as these are derived from source data (on a different volume on
different servers). Will be a hard time (calculating tons of data),
but the chosen solution should have a way better performance.

Well... thx to all for your efforts, really appreciate that :-)

Hubert

On Sun, 28 Jan 2024 at 08:35, Strahil Nikolov wrote:
>
> What about the arbiter node ?
> Actually, check on all nodes and script it - you might need it in the
> future.
>
> Simplest way to resolve is to make the file disappear (rename to
> something else and then rename it back). Another easy trick is to read
> the whole file:
> dd if=file of=/dev/null status=progress
>
> Best Regards,
> Strahil Nikolov
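A side note on the log line in the message above: it names workdata-client-11 and workdata-client-9 under workdata-replicate-3. With "5 x 3 = 15" bricks, this fits the assumption that the client translators are numbered in brick order and that every three consecutive bricks form one replica set. That assumption is not stated anywhere in this thread, and the function name below is made up for illustration. Under it, the replica set of a client index is a plain integer division:

```shell
#!/bin/sh
# Map a client index (the N in workdata-client-N) to its replica set index.
# Assumption: bricks are grouped consecutively, replica_count per replica set.
replica_set_of_client() {
    client_index=$1
    replica_count=${2:-3}   # replica 3 in this thread
    echo $(( client_index / replica_count ))
}

replica_set_of_client 11   # prints 3
replica_set_of_client 9    # prints 3
replica_set_of_client 8    # prints 2
```

The results agree with the pairs that show up in the logs of this thread (client-11 and client-9 under replicate-3, client-8 and client-6 under replicate-2), which is the only evidence for the assumption here.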
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
What about the arbiter node ?
Actually, check on all nodes and script it - you might need it in the future.

Simplest way to resolve is to make the file disappear (rename it to something else and then rename it back). Another easy trick is to read the whole file:

dd if=file of=/dev/null status=progress

Best Regards,
Strahil Nikolov

On Sat, Jan 27, 2024 at 8:24, Hu Bert wrote:
> Morning,
>
> gfid1:
> getfattr -d -e hex -m. /gluster/md{3,4,5,6,7}/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
>
> glusterpub1 (good one):
> getfattr: Removing leading '/' from absolute path names
> # file: gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> trusted.afr.dirty=0x
> trusted.afr.workdata-client-11=0x00020001
> trusted.gfid=0xfaf5956610f54ddd8b0ca87bc6a334fb
> trusted.gfid2path.c2845024cc9b402e=0x38633139626234612d396236382d343532652d623434652d3664616331666434616465652f31323878313238732e6a7067
> trusted.glusterfs.mdata=0x0165aaecff2695ebb765aaecff2695ebb765aaecff2533f110
>
> glusterpub3 (bad one):
> getfattr: /gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb: No such file or directory
>
> gfid 2:
> getfattr -d -e hex -m. /gluster/md{3,4,5,6,7}/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
>
> glusterpub1 (good one):
> getfattr: Removing leading '/' from absolute path names
> # file: gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
> trusted.afr.dirty=0x
> trusted.afr.workdata-client-8=0x00020001
> trusted.gfid=0x604657235dc04ebeaced9f2c12e52642
> trusted.gfid2path.ac4669e3c4faf926=0x33366463366137392d666135642d343238652d613738642d6234376230616662316562642f31323878313238732e6a7067
> trusted.glusterfs.mdata=0x0165aaecfe0c5403bd65aaecfe0c5403bd65aaecfe0ad61ee4
>
> glusterpub3 (bad one):
> getfattr: /gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642: No such file or directory
>
> thx,
> Hubert
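The "read the whole file" trick from the message above can be scripted for a list of files. This is only a sketch: the function name is invented, the paths are assumed to be on a client mount of the volume (not on a brick), and status=progress is left out because not every dd accepts it. Whether the reads actually trigger a heal on a given setup is not something this snippet can show.

```shell
#!/bin/sh
# Read every file named on stdin (one path per line) to the end and
# discard the data; report per file whether the read succeeded.
read_through() {
    while IFS= read -r path; do
        if dd if="$path" of=/dev/null bs=1048576 2>/dev/null; then
            printf 'read: %s\n' "$path"
        else
            printf 'failed: %s\n' "$path"
        fi
    done
}
```

A possible call, with the mount point taken from elsewhere in this thread, would be: find /mnt/workdata/images/133 -type f | read_through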
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
Morning,

gfid1:
getfattr -d -e hex -m. /gluster/md{3,4,5,6,7}/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb

glusterpub1 (good one):
getfattr: Removing leading '/' from absolute path names
# file: gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
trusted.afr.dirty=0x
trusted.afr.workdata-client-11=0x00020001
trusted.gfid=0xfaf5956610f54ddd8b0ca87bc6a334fb
trusted.gfid2path.c2845024cc9b402e=0x38633139626234612d396236382d343532652d623434652d3664616331666434616465652f31323878313238732e6a7067
trusted.glusterfs.mdata=0x0165aaecff2695ebb765aaecff2695ebb765aaecff2533f110

glusterpub3 (bad one):
getfattr: /gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb: No such file or directory

gfid 2:
getfattr -d -e hex -m. /gluster/md{3,4,5,6,7}/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642

glusterpub1 (good one):
getfattr: Removing leading '/' from absolute path names
# file: gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
trusted.afr.dirty=0x
trusted.afr.workdata-client-8=0x00020001
trusted.gfid=0x604657235dc04ebeaced9f2c12e52642
trusted.gfid2path.ac4669e3c4faf926=0x33366463366137392d666135642d343238652d613738642d6234376230616662316562642f31323878313238732e6a7067
trusted.glusterfs.mdata=0x0165aaecfe0c5403bd65aaecfe0c5403bd65aaecfe0ad61ee4

glusterpub3 (bad one):
getfattr: /gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642: No such file or directory

thx,
Hubert

On Sat, 27 Jan 2024 at 06:13, Strahil Nikolov wrote:
>
> You don't need to mount it.
> Like this :
> # getfattr -d -e hex -m. /path/to/brick/.glusterfs/00/46/00462be8-3e61-4931-8bda-dae1645c639e
> # file: 00/46/00462be8-3e61-4931-8bda-dae1645c639e
> trusted.gfid=0x00462be83e6149318bdadae1645c639e
> trusted.gfid2path.05fcbdafdeea18ab=0x3032673930632d386637622d346436652d393464362d3936393132313930643131312f66696c656c6f636b696e672e7079
> trusted.glusterfs.mdata=0x016170340c25b6a7456170340c20efb5776170340c20d42b07
> trusted.glusterfs.shard.block-size=0x0400
> trusted.glusterfs.shard.file-size=0x00cd0001
>
> Best Regards,
> Strahil Nikolov
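The trusted.gfid2path.* values in the getfattr output above are hex-encoded text. Decoding the first one gives a parent-directory gfid followed by a file name, which matches the 128x128s.jpg file discussed in this thread. A small sketch for decoding such a value in plain POSIX shell; the function name is invented for illustration and it assumes the value has an even number of hex digits:

```shell
#!/bin/sh
# Decode a hex xattr value (with or without leading 0x) into text.
# Works byte by byte: hex pair -> octal escape -> character.
# Note: a trailing newline byte would be dropped by the command substitution.
gfid2path_decode() {
    hex=${1#0x}
    out=
    while [ -n "$hex" ]; do
        rest=${hex#??}                  # everything after the first two digits
        [ "$rest" = "$hex" ] && break   # odd leftover digit: stop
        byte=${hex%"$rest"}             # the first two digits
        out="$out$(printf "\\$(printf '%03o' "0x$byte")")"
        hex=$rest
    done
    printf '%s\n' "$out"
}

gfid2path_decode 0x38633139626234612d396236382d343532652d623434652d3664616331666434616465652f31323878313238732e6a7067
# prints 8c19bb4a-9b68-452e-b44e-6dac1fd4adee/128x128s.jpg
```

Where xxd is installed, piping the hex digits through xxd -r -p should give the same result with less code.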
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
You don't need to mount it.
Like this :

# getfattr -d -e hex -m. /path/to/brick/.glusterfs/00/46/00462be8-3e61-4931-8bda-dae1645c639e
# file: 00/46/00462be8-3e61-4931-8bda-dae1645c639e
trusted.gfid=0x00462be83e6149318bdadae1645c639e
trusted.gfid2path.05fcbdafdeea18ab=0x3032673930632d386637622d346436652d393464362d3936393132313930643131312f66696c656c6f636b696e672e7079
trusted.glusterfs.mdata=0x016170340c25b6a7456170340c20efb5776170340c20d42b07
trusted.glusterfs.shard.block-size=0x0400
trusted.glusterfs.shard.file-size=0x00cd0001

Best Regards,
Strahil Nikolov

On Thursday, 25 January 2024 at 09:42:46 GMT+2, Hu Bert wrote:
> Good morning,
>
> hope i got it right... using:
> https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3.1/html/administration_guide/ch27s02
>
> mount -t glusterfs -o aux-gfid-mount glusterpub1:/workdata /mnt/workdata
>
> gfid 1:
> getfattr -n trusted.glusterfs.pathinfo -e text /mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> trusted.glusterfs.pathinfo="( ( ))"
>
> gfid 2:
> getfattr -n trusted.glusterfs.pathinfo -e text /mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642
> trusted.glusterfs.pathinfo="( ( ))"
>
> glusterpub1 + glusterpub2 are the good ones, glusterpub3 is the
> misbehaving (not healing) one.
>
> The file with gfid 1 is available under
> /gluster/md6/workdata/images/133/283/13328349/ on glusterpub1+2
> bricks, but missing on glusterpub3 brick.
>
> gfid 2:
> /gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
> is present on glusterpub1+2, but not on glusterpub3.
>
> Thx,
> Hubert
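The brick-side paths in this thread all follow the same pattern: the first two characters of the gfid, then the next two, then the full gfid, under .glusterfs on the brick. A minimal sketch that builds such a path so getfattr can be pointed at it on every brick; the function name is hypothetical, and the brick directories are the ones mentioned in the thread:

```shell
#!/bin/sh
# Build <brick>/.glusterfs/<aa>/<bb>/<gfid> from a brick directory and a gfid.
gfid_backend_path() {
    brick=$1
    gfid=$2
    rest=${gfid#??}             # gfid without its first two characters
    first=${gfid%"$rest"}       # characters one and two
    rest2=${rest#??}
    second=${rest%"$rest2"}     # characters three and four
    printf '%s/.glusterfs/%s/%s/%s\n' "$brick" "$first" "$second" "$gfid"
}

# One candidate path per brick on a server (md3..md7 as used in this thread).
for n in 3 4 5 6 7; do
    gfid_backend_path "/gluster/md$n/workdata" faf59566-10f5-4ddd-8b0c-a87bc6a334fb
done
```

Each printed path could then be handed to getfattr -d -e hex -m. on the server that holds the brick; only the brick that actually stores the file will have the entry.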
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
Good morning,

hope i got it right... using:
https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3.1/html/administration_guide/ch27s02

mount -t glusterfs -o aux-gfid-mount glusterpub1:/workdata /mnt/workdata

gfid 1:
getfattr -n trusted.glusterfs.pathinfo -e text /mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
getfattr: Removing leading '/' from absolute path names
# file: mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
trusted.glusterfs.pathinfo="( ( ))"

gfid 2:
getfattr -n trusted.glusterfs.pathinfo -e text /mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642
getfattr: Removing leading '/' from absolute path names
# file: mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642
trusted.glusterfs.pathinfo="( ( ))"

glusterpub1 + glusterpub2 are the good ones, glusterpub3 is the
misbehaving (not healing) one.

The file with gfid 1 is available under
/gluster/md6/workdata/images/133/283/13328349/ on glusterpub1+2
bricks, but missing on glusterpub3 brick.

gfid 2:
/gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
is present on glusterpub1+2, but not on glusterpub3.

Thx,
Hubert

On Wed, 24 Jan 2024 at 17:36, Strahil Nikolov wrote:
>
> Hi,
>
> Can you find and check the files with gfids:
> 60465723-5dc0-4ebe-aced-9f2c12e52642
> faf59566-10f5-4ddd-8b0c-a87bc6a334fb
>
> Use 'getfattr -d -e hex -m. ' command from
> https://docs.gluster.org/en/main/Troubleshooting/resolving-splitbrain/#analysis-of-the-output
> .
>
> Best Regards,
> Strahil Nikolov
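The .gfid/ paths on the aux-gfid-mount take the gfid in its dashed form, while getfattr -e hex on a brick reports trusted.gfid as one run of 32 hex digits (see the getfattr outputs elsewhere in this thread). A small sketch for converting between the two; the function name is made up, and the mount point /mnt/workdata is the one from the mount command above:

```shell
#!/bin/sh
# Turn a trusted.gfid hex value (with or without leading 0x) into the
# dashed 8-4-4-4-12 form. Input that is not exactly 32 characters is
# printed unchanged.
gfid_hex_to_uuid() {
    printf '%s\n' "${1#0x}" |
        sed 's/^\(.\{8\}\)\(.\{4\}\)\(.\{4\}\)\(.\{4\}\)\(.\{12\}\)$/\1-\2-\3-\4-\5/'
}

printf '/mnt/workdata/.gfid/%s\n' "$(gfid_hex_to_uuid 0xfaf5956610f54ddd8b0ca87bc6a334fb)"
# prints /mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
```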
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
Hi,

Can you find and check the files with gfids:
60465723-5dc0-4ebe-aced-9f2c12e52642
faf59566-10f5-4ddd-8b0c-a87bc6a334fb

Use 'getfattr -d -e hex -m. ' command from
https://docs.gluster.org/en/main/Troubleshooting/resolving-splitbrain/#analysis-of-the-output .

Best Regards,
Strahil Nikolov

On Sat, Jan 20, 2024 at 9:44, Hu Bert wrote:
> Good morning,
>
> thx Gilberto, did the first three (set to WARNING), but the last one
> doesn't work. Anyway, with setting these three some new messages
> appear:
>
> [2024-01-20 07:23:58.561106 +] W [MSGID: 114061] [client-common.c:796:client_pre_lk_v2] 0-workdata-client-11: remote_fd is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb}, {errno=77}, {error=File descriptor in bad state}]
> [2024-01-20 07:23:58.561177 +] E [MSGID: 108028] [afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-3: Failed getlk for faf59566-10f5-4ddd-8b0c-a87bc6a334fb [File descriptor in bad state]
> [2024-01-20 07:23:58.562151 +] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-11: remote operation failed. [{path=}, {gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb}, {errno=2}, {error=No such file or directory}]
> [2024-01-20 07:23:58.562296 +] W [MSGID: 114061] [client-common.c:530:client_pre_flush_v2] 0-workdata-client-11: remote_fd is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb}, {errno=77}, {error=File descriptor in bad state}]
> [2024-01-20 07:23:58.860552 +] W [MSGID: 114061] [client-common.c:796:client_pre_lk_v2] 0-workdata-client-8: remote_fd is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642}, {errno=77}, {error=File descriptor in bad state}]
> [2024-01-20 07:23:58.860608 +] E [MSGID: 108028] [afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-2: Failed getlk for 60465723-5dc0-4ebe-aced-9f2c12e52642 [File descriptor in bad state]
> [2024-01-20 07:23:58.861520 +] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-8: remote operation failed. [{path=}, {gfid=60465723-5dc0-4ebe-aced-9f2c12e52642}, {errno=2}, {error=No such file or directory}]
> [2024-01-20 07:23:58.861640 +] W [MSGID: 114061] [client-common.c:530:client_pre_flush_v2] 0-workdata-client-8: remote_fd is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642}, {errno=77}, {error=File descriptor in bad state}]
>
> Not many log entries appear, only a few. Has someone seen error
> messages like these? Setting diagnostics.brick-sys-log-level to DEBUG
> shows way more log entries, uploaded it to:
> https://file.io/spLhlcbMCzr8 - not sure if that helps.
>
> Thx,
> Hubert
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
Good morning,
thx Gilberto, did the first three (set to WARNING), but the last one doesn't work. Anyway, with these three set, some new messages appear:

[2024-01-20 07:23:58.561106 +] W [MSGID: 114061] [client-common.c:796:client_pre_lk_v2] 0-workdata-client-11: remote_fd is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb}, {errno=77}, {error=File descriptor in bad state}]
[2024-01-20 07:23:58.561177 +] E [MSGID: 108028] [afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-3: Failed getlk for faf59566-10f5-4ddd-8b0c-a87bc6a334fb [File descriptor in bad state]
[2024-01-20 07:23:58.562151 +] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-11: remote operation failed. [{path=}, {gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb}, {errno=2}, {error=No such file or directory}]
[2024-01-20 07:23:58.562296 +] W [MSGID: 114061] [client-common.c:530:client_pre_flush_v2] 0-workdata-client-11: remote_fd is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb}, {errno=77}, {error=File descriptor in bad state}]
[2024-01-20 07:23:58.860552 +] W [MSGID: 114061] [client-common.c:796:client_pre_lk_v2] 0-workdata-client-8: remote_fd is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642}, {errno=77}, {error=File descriptor in bad state}]
[2024-01-20 07:23:58.860608 +] E [MSGID: 108028] [afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-2: Failed getlk for 60465723-5dc0-4ebe-aced-9f2c12e52642 [File descriptor in bad state]
[2024-01-20 07:23:58.861520 +] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-8: remote operation failed. [{path=}, {gfid=60465723-5dc0-4ebe-aced-9f2c12e52642}, {errno=2}, {error=No such file or directory}]
[2024-01-20 07:23:58.861640 +] W [MSGID: 114061] [client-common.c:530:client_pre_flush_v2] 0-workdata-client-8: remote_fd is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642}, {errno=77}, {error=File descriptor in bad state}]

Not many log entries appear, only a few. Has anyone seen error messages like these? Setting diagnostics.brick-sys-log-level to DEBUG shows way more log entries; I uploaded them to https://file.io/spLhlcbMCzr8 - not sure if that helps.

Thx,
Hubert

On Fri, 19 Jan 2024 at 16:24, Gilberto Ferreira wrote:
> gluster volume set testvol diagnostics.brick-log-level WARNING
> gluster volume set testvol diagnostics.brick-sys-log-level WARNING
> gluster volume set testvol diagnostics.client-log-level ERROR
> gluster --log-level=ERROR volume status
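Editorial sketch: with only a handful of warnings per second it can help to count them per translator, MSGID and errno before reading individual entries. The Python below is a minimal, unofficial sketch that assumes the log line layout seen in the messages of this thread (timestamp, level letter, MSGID, source location, translator name, free text). The names parse_line and tally are made up for this illustration and are not part of any Gluster tooling.

```python
import re
from collections import Counter

# Layout assumed from the log lines quoted in this thread:
# [timestamp ...] LEVEL [MSGID: n] [file:line:function] translator: message
LOG_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)[^\]]*\]\s+"
    r"(?P<level>[A-Z])\s+"
    r"\[MSGID:\s*(?P<msgid>\d+)\]\s+"
    r"\[(?P<src>[^\]]+)\]\s+"
    r"(?P<xlator>\S+):\s+"
    r"(?P<msg>.*)"
)
ERRNO_RE = re.compile(r"\{errno=(\d+)\}")


def parse_line(line):
    """Return a dict for one log line, or None if the line does not match."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    errno_match = ERRNO_RE.search(record["msg"])
    # Not every entry carries an {errno=...} field; keep None in that case.
    record["errno"] = int(errno_match.group(1)) if errno_match else None
    return record


def tally(lines):
    """Count entries per (translator, MSGID, errno); unparsable lines are skipped."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[(record["xlator"], record["msgid"], record["errno"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[2024-01-20 07:23:58.561106 +] W [MSGID: 114061] "
        "[client-common.c:796:client_pre_lk_v2] 0-workdata-client-11: "
        "remote_fd is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb}, "
        "{errno=77}, {error=File descriptor in bad state}]",
    ]
    for key, count in tally(sample).items():
        print(key, count)
```

Feeding it the lines of a glustershd.log (for example via open(path) as the lines argument) would give a per-translator overview; the sketch makes no attempt to interpret what the counts mean.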
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
gluster volume set testvol diagnostics.brick-log-level WARNING
gluster volume set testvol diagnostics.brick-sys-log-level WARNING
gluster volume set testvol diagnostics.client-log-level ERROR
gluster --log-level=ERROR volume status

---
Gilberto Nunes Ferreira
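Editorial sketch: when the same log-level options have to be applied to more than one volume, it may be convenient to generate the command lines instead of typing them. The Python below only builds and prints the three 'gluster volume set' commands named in this thread; it does not execute anything. The helper name build_set_commands is invented for this example, and the set of accepted level names is an assumption that should be checked against the installed Gluster version.

```python
import shlex

# The three volume options and values named in this thread.
LOG_OPTIONS = {
    "diagnostics.brick-log-level": "WARNING",
    "diagnostics.brick-sys-log-level": "WARNING",
    "diagnostics.client-log-level": "ERROR",
}

# Assumption: the level names Gluster accepts for these options.
ASSUMED_LEVELS = {"TRACE", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL", "NONE"}


def build_set_commands(volume, options=None):
    """Return one argv list per option, in the order the options were given."""
    options = LOG_OPTIONS if options is None else options
    commands = []
    for key, value in options.items():
        if value not in ASSUMED_LEVELS:
            raise ValueError(f"unexpected log level {value!r} for {key}")
        commands.append(["gluster", "volume", "set", volume, key, value])
    return commands


if __name__ == "__main__":
    # Print the commands for review; running them is left to the admin.
    for argv in build_set_commands("workdata"):
        print(shlex.join(argv))
```

Printing rather than executing keeps the sketch safe to run on any machine; each argv list could be handed to subprocess.run on a node where the gluster CLI is installed.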
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
Hi Strahil,
hm, don't get me wrong, it may sound a bit stupid, but... where do I set the log level? Using Debian...

https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level

ls /etc/glusterfs/
eventsconfig.json  glusterfs-georep-logrotate  gluster-rsyslog-5.8.conf  group-db-workload  group-gluster-block  group-nl-cache  group-virt.example  logger.conf.example
glusterd.vol  glusterfs-logrotate  gluster-rsyslog-7.2.conf  group-distributed-virt  group-metadata-cache  group-samba  gsyncd.conf  thin-arbiter.vol

checked: /etc/glusterfs/logger.conf.example

# To enable enhanced logging capabilities,
#
# 1. rename this file to /etc/glusterfs/logger.conf
#
# 2. rename /etc/rsyslog.d/gluster.conf.example to
#    /etc/rsyslog.d/gluster.conf
#
# This change requires restart of all gluster services/volumes and
# rsyslog.

tried (to test): /etc/glusterfs/logger.conf with " LOG_LEVEL='WARNING' "

Restarted glusterd on that node, but this doesn't work; the log level stays on INFO. /etc/rsyslog.d/gluster.conf.example does not exist. Probably /etc/rsyslog.conf on Debian. But first it would be better to know where to set the log level for glusterd.

Depending on how much the DEBUG log level talks ;-) I could assign up to 100G to /var

Thx & best regards,
Hubert

On Thu, 18 Jan 2024 at 22:58, Strahil Nikolov wrote:
> Are you able to set the logs to debug level?
> It might provide a clue about what is going on.
>
> Best Regards,
> Strahil Nikolov
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
I don't want to hijack the thread. And in my case setting logs to debug would fill my /var partitions in no time. Maybe the OP can.

Diego

On 18/01/2024 22:58, Strahil Nikolov wrote:
> Are you able to set the logs to debug level?
> It might provide a clue about what is going on.
>
> Best Regards,
> Strahil Nikolov
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
Are you able to set the logs to debug level?
It might provide a clue about what is going on.

Best Regards,
Strahil Nikolov

On Thu, Jan 18, 2024 at 13:08, Diego Zuccato wrote:
> That's the same kind of error I keep seeing on my 2 clusters,
> regenerated some months ago. It seems to be a pseudo-split-brain that
> should be impossible on a replica 3 cluster but keeps happening.
> Sadly going to ditch Gluster ASAP.
>
> Diego
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
Thx for your answer. We don't have that much data (but still 33 TB), but millions of files in total, on normal SATA disks. So copying stuff away and back, possibly with downtime, is not manageable. The good thing is: the data can be re-calculated, as it is derived from source data. But one needs some new hardware for that. And maybe/probably think of a new solution, as we all know about the state of the gluster project.

Thx,
Hubert
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
Since glusterd does not consider it a split brain, you can't solve it with the standard split-brain tools.
I've found no way to resolve it except by manually handling one file at a time: completely unmanageable with thousands of files and having to juggle between the actual path on the brick and the metadata files!
Previously I "fixed" it by:
1) moving all the data from the volume to a temp space
2) recovering from the bricks what was inaccessible from the mountpoint (keeping different file revisions for the conflicting ones)
3) destroying and recreating the volume
4) copying back the data from the backup

When gluster gets used because you need lots of space (we had more than 400TB on 3 nodes with 30x12TB SAS disks in "replica 3 arbiter 1"), where do you park the data? Is the official solution "just have a second cluster idle for when you need to fix errors"?
It took more than a month of downtime this summer, and after less than 6 months I'd have to repeat it? Users are rightly quite upset...

Diego

On 18/01/2024 09:17, Hu Bert wrote:
> were you able to solve the problem? Can it be treated like a "normal"
> split brain? 'gluster peer status' and 'gluster volume status' are ok,
> so kinda looks like "pseudo"...
>
> hubert
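Editorial sketch: the juggling between the real path on a brick and the metadata files mentioned above comes from the way each brick keeps a per-gfid entry under its .glusterfs directory, in two subdirectories named after the first and second pair of hex digits of the gfid. The Python below is a small sketch of that mapping. The function name gfid_backend_path is invented here, and pairing this particular gfid with the md3 brick is purely illustrative: the thread does not say which brick holds which gfid.

```python
import uuid
from pathlib import PurePosixPath


def gfid_backend_path(brick_root, gfid):
    """Map a gfid to its entry below <brick>/.glusterfs/.

    uuid.UUID validates the string and normalises it to the lowercase,
    hyphenated form; a malformed gfid raises ValueError.
    """
    canonical = str(uuid.UUID(gfid))
    return (
        PurePosixPath(brick_root)
        / ".glusterfs"
        / canonical[:2]
        / canonical[2:4]
        / canonical
    )


if __name__ == "__main__":
    # Brick path and gfid are taken from this thread; the pairing is an assumption.
    print(gfid_backend_path("/gluster/md3/workdata",
                            "faf59566-10f5-4ddd-8b0c-a87bc6a334fb"))
```

The sketch only computes a path; whether the entry exists, and whether it is a hard link (regular file) or a symlink (directory), still has to be checked on each brick.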
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
Were you able to solve the problem? Can it be treated like a "normal" split brain? 'gluster peer status' and 'gluster volume status' are ok, so it kinda looks like "pseudo"...

hubert

On Thu, 18 Jan 2024 at 08:28, Diego Zuccato wrote:
> That's the same kind of error I keep seeing on my 2 clusters,
> regenerated some months ago. It seems to be a pseudo-split-brain that
> should be impossible on a replica 3 cluster but keeps happening.
> Sadly going to ditch Gluster ASAP.
>
> Diego
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
That's the same kind of error I keep seeing on my 2 clusters, which were regenerated some months ago. It seems to be a pseudo-split-brain that should be impossible on a replica 3 cluster, but it keeps happening.
Sadly, I'm going to ditch Gluster ASAP.

Diego

On 18/01/2024 07:11, Hu Bert wrote:
> Good morning,
> heal still not running. Pending heals now sum up to 60K per brick.
> Heal was starting instantly e.g. after server reboot with version
> 10.4, but doesn't with version 11. What could be wrong?
>
> I only see these errors on one of the "good" servers in glustershd.log:
>
> [2024-01-18 06:08:57.328480 +] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0:
> remote operation failed. [{path=},
> {gfid=cb39a1e4-2a4c-4727-861d-3ed9ef00681b}, {errno=2},
> {error=No such file or directory}]
> [2024-01-18 06:08:57.594051 +] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1:
> remote operation failed. [{path=},
> {gfid=3e9b178c-ae1f-4d85-ae47-fc539d94dd11}, {errno=2},
> {error=No such file or directory}]
>
> About 7K today. Any ideas? Someone?
>
> Best regards,
> Hubert
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
Good morning,
heal still not running. Pending heals now sum up to 60K per brick.
Heal was starting instantly e.g. after server reboot with version
10.4, but doesn't with version 11. What could be wrong?

I only see these errors on one of the "good" servers in glustershd.log:

[2024-01-18 06:08:57.328480 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0:
remote operation failed. [{path=},
{gfid=cb39a1e4-2a4c-4727-861d-3ed9ef00681b}, {errno=2},
{error=No such file or directory}]
[2024-01-18 06:08:57.594051 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1:
remote operation failed. [{path=},
{gfid=3e9b178c-ae1f-4d85-ae47-fc539d94dd11}, {errno=2},
{error=No such file or directory}]

About 7K today. Any ideas? Someone?

Best regards,
Hubert
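A minimal sketch for tallying these lookup warnings per client subvolume, assuming each warning sits on a single line of glustershd.log in the layout quoted in this thread. The function name and the regular expression are illustrative only and are not part of any Gluster tooling.

```python
import re
from collections import Counter

# Assumption: MSGID 114031 warnings look like the lines quoted in this thread,
# with the subvolume name (e.g. "0-workdata-client-0") right before
# ": remote operation failed".
LOOKUP_FAILED = re.compile(
    r"\[MSGID: 114031\].* 0-(?P<client>\S+): remote operation failed"
)


def count_failed_lookups(lines):
    """Return a Counter mapping client subvolume name to warning count."""
    counts = Counter()
    for line in lines:
        match = LOOKUP_FAILED.search(line)
        if match:
            counts[match.group("client")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: path as on a typical install, adjust as needed.
    with open("/var/log/glusterfs/glustershd.log") as log:
        for client, n in count_failed_lookups(log).most_common():
            print(client, n)
```

Grouping by client shows whether the failed lookups all point at bricks on one server or are spread across the replica set.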
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
hm, i only see such messages in glustershd.log on the 2 good servers:

[2024-01-17 12:18:48.912952 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-6:
remote operation failed. [{path=},
{gfid=ee28b56c-e352-48f8-bbb5-dbf31babe073}, {errno=2},
{error=No such file or directory}]
[2024-01-17 12:18:48.913015 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-7:
remote operation failed. [{path=},
{gfid=ee28b56c-e352-48f8-bbb5-dbf31babe073}, {errno=2},
{error=No such file or directory}]
[2024-01-17 12:19:09.450335 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-10:
remote operation failed. [{path=},
{gfid=ea4a63e3-1470-40a5-8a7e-2a1061a8fcb0}, {errno=2},
{error=No such file or directory}]
[2024-01-17 12:19:09.450771 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-9:
remote operation failed. [{path=},
{gfid=ea4a63e3-1470-40a5-8a7e-2a1061a8fcb0}, {errno=2},
{error=No such file or directory}]

not sure if this is important.
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
ok, finally managed to get all servers, volumes etc. running, but it took
a couple of restarts, cksum checks etc.

One problem: a volume doesn't heal automatically or doesn't heal at all.

gluster volume status
Status of volume: workdata
Gluster process                           TCP Port  RDMA Port  Online  Pid
---------------------------------------------------------------------------
Brick glusterpub1:/gluster/md3/workdata   58832     0          Y       3436
Brick glusterpub2:/gluster/md3/workdata   59315     0          Y       1526
Brick glusterpub3:/gluster/md3/workdata   56917     0          Y       1952
Brick glusterpub1:/gluster/md4/workdata   59688     0          Y       3755
Brick glusterpub2:/gluster/md4/workdata   60271     0          Y       2271
Brick glusterpub3:/gluster/md4/workdata   49461     0          Y       2399
Brick glusterpub1:/gluster/md5/workdata   54651     0          Y       4208
Brick glusterpub2:/gluster/md5/workdata   49685     0          Y       2751
Brick glusterpub3:/gluster/md5/workdata   59202     0          Y       2803
Brick glusterpub1:/gluster/md6/workdata   55829     0          Y       4583
Brick glusterpub2:/gluster/md6/workdata   50455     0          Y       3296
Brick glusterpub3:/gluster/md6/workdata   50262     0          Y       3237
Brick glusterpub1:/gluster/md7/workdata   52238     0          Y       5014
Brick glusterpub2:/gluster/md7/workdata   52474     0          Y       3673
Brick glusterpub3:/gluster/md7/workdata   57966     0          Y       3653
Self-heal Daemon on localhost             N/A       N/A        Y       4141
Self-heal Daemon on glusterpub1           N/A       N/A        Y       5570
Self-heal Daemon on glusterpub2           N/A       N/A        Y       4139

"gluster volume heal workdata info" lists a lot of files per brick.
"gluster volume heal workdata statistics heal-count" shows thousands
of files per brick.
"gluster volume heal workdata enable" has no effect.

gluster volume heal workdata full
Launching heal operation to perform full self heal on volume workdata
has been successful
Use heal info commands to check status.

-> not doing anything at all. And nothing happening on the 2 "good"
servers in e.g. glustershd.log. Heal was working as expected on
version 10.4, but here... silence. Does anyone have an idea?

Best regards,
Hubert
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
Ah! Indeed! You need to upgrade the clients as well.

On Tue, 16 Jan 2024 at 03:12, Hu Bert wrote:
> morning to those still reading :-)
>
> i found this:
> https://docs.gluster.org/en/main/Troubleshooting/troubleshooting-glusterd/#common-issues-and-how-to-resolve-them
>
> there's a paragraph about "peer rejected" with the same error message,
> telling me: "Update the cluster.op-version" - i had only updated the
> server nodes, but not the clients. So upgrading the cluster.op-version
> wasn't possible at this time. So... upgrading the clients to version
> 11.1 and then the op-version should solve the problem?
>
> Thx,
> Hubert

Community Meeting Calendar:
Schedule - Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
morning to those still reading :-)

i found this:
https://docs.gluster.org/en/main/Troubleshooting/troubleshooting-glusterd/#common-issues-and-how-to-resolve-them

there's a paragraph about "peer rejected" with the same error message,
telling me: "Update the cluster.op-version" - i had only updated the
server nodes, but not the clients. So upgrading the cluster.op-version
wasn't possible at this time. So... upgrading the clients to version
11.1 and then the op-version should solve the problem?

Thx,
Hubert

On Mon, 15 Jan 2024 at 09:16, Hu Bert wrote:
>
> Hi,
> just upgraded some gluster servers from version 10.4 to version 11.1.
> Debian bullseye & bookworm. When only installing the packages: good,
> servers, volumes etc. work as expected.
>
> But one needs to test if the systems work after a daemon and/or server
> restart. Well, did a reboot, and after that the rebooted/restarted
> system is "out". Log message from working node:
>
> [2024-01-15 08:02:21.585694 +] I [MSGID: 106163]
> [glusterd-handshake.c:1501:__glusterd_mgmt_hndsk_versions_ack]
> 0-management: using the op-version 10
> [2024-01-15 08:02:21.589601 +] I [MSGID: 106490]
> [glusterd-handler.c:2546:__glusterd_handle_incoming_friend_req]
> 0-glusterd: Received probe from uuid:
> b71401c3-512a-47cb-ac18-473c4ba7776e
> [2024-01-15 08:02:23.608349 +] E [MSGID: 106010]
> [glusterd-utils.c:3824:glusterd_compare_friend_volume] 0-management:
> Version of Cksums sourceimages differ. local cksum = 2204642525,
> remote cksum = 1931483801 on peer gluster190
> [2024-01-15 08:02:23.608584 +] I [MSGID: 106493]
> [glusterd-handler.c:3819:glusterd_xfer_friend_add_resp] 0-glusterd:
> Responded to gluster190 (0), ret: 0, op_ret: -1
> [2024-01-15 08:02:23.613553 +] I [MSGID: 106493]
> [glusterd-rpc-ops.c:467:__glusterd_friend_add_cbk] 0-glusterd:
> Received RJT from uuid: b71401c3-512a-47cb-ac18-473c4ba7776e, host:
> gluster190, port: 0
>
> peer status from rebooted node:
>
> root@gluster190 ~ # gluster peer status
> Number of Peers: 2
>
> Hostname: gluster189
> Uuid: 50dc8288-aa49-4ea8-9c6c-9a9a926c67a7
> State: Peer Rejected (Connected)
>
> Hostname: gluster188
> Uuid: e15a33fe-e2f7-47cf-ac53-a3b34136555d
> State: Peer Rejected (Connected)
>
> So the rebooted gluster190 is not accepted anymore. And thus does not
> appear in "gluster volume status". I then followed this guide:
>
> https://gluster-documentations.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/
>
> Remove everything under /var/lib/glusterd/ (except glusterd.info) and
> restart glusterd service etc. Data get copied from other nodes,
> 'gluster peer status' is ok again - but the volume info is missing,
> /var/lib/glusterd/vols is empty. When syncing this dir from another
> node, the volume then is available again, heals start etc.
>
> Well, and just to be sure that everything's working as it should,
> rebooted that node again - the rebooted node is kicked out again, and
> you have to restart bringing it back again.
>
> Sry, but did i miss anything? Has someone experienced similar
> problems? I'll probably downgrade to 10.4 again, that version was
> working...
>
>
> Thx,
> Hubert
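A minimal sketch for spotting rejected peers in `gluster peer status` output, assuming the Hostname / Uuid / State block layout shown in the quoted message. The function name is illustrative and the parsing is deliberately naive.

```python
def rejected_peers(peer_status_output):
    """Return hostnames whose State line contains 'Peer Rejected'.

    Assumption: each peer is printed as a block of
    'Hostname:', 'Uuid:' and 'State:' lines, in that order.
    """
    rejected = []
    hostname = None
    for raw in peer_status_output.splitlines():
        line = raw.strip()
        if line.startswith("Hostname:"):
            hostname = line.split(":", 1)[1].strip()
        elif line.startswith("State:") and "Peer Rejected" in line:
            if hostname is not None:
                rejected.append(hostname)
    return rejected
```

Running a check like this after every reboot of an upgraded node would show right away whether the node was kicked out again, as described for gluster190.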
Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems
just downgraded one node to 10.4, did a reboot - same result: cksum
error. i'm able to bring it back in again, but if that error persists
when downgrading all servers...
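A short sketch for pulling the cksum mismatch details out of a glusterd log, assuming the MSGID 106010 message appears on one line in the wording seen in this thread ("Version of Cksums <volume> differ. local cksum = ..., remote cksum = ... on peer ..."). The names used here are hypothetical.

```python
import re

# Assumption: message wording as logged by glusterd in this thread.
CKSUM_MISMATCH = re.compile(
    r"Version of Cksums (?P<volume>\S+) differ\. "
    r"local cksum = (?P<local>\d+), "
    r"remote cksum = (?P<remote>\d+) on peer (?P<peer>\S+)"
)


def find_cksum_mismatches(lines):
    """Yield (volume, local_cksum, remote_cksum, peer) for each mismatch line."""
    for line in lines:
        match = CKSUM_MISMATCH.search(line)
        if match:
            yield (
                match.group("volume"),
                int(match.group("local")),
                int(match.group("remote")),
                match.group("peer"),
            )
```

Comparing the reported volume and peer across reboots, and across the 10.4 and 11.1 packages, helps to tell whether the mismatch always concerns the same volume definition.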