Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-02-15 Thread Hu Bert
Hello,
just to bring this to an end... the servers and the volume are "out of
service", so i tried one last repair.

- umount all related mounts
- rebooted misbehaving server
- mounted volume on all clients

Well, no healing happens. 'gluster volume status workdata clients'
looks good btw.

gluster volume heal workdata statistics heal-count: empty.
gluster volume heal workdata info: lists lots of files
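
For reference, the heal-inspection commands used here, plus the
split-brain listing (a minimal sketch, nothing volume-specific beyond
the name):

# entries pending heal, counted per brick
gluster volume heal workdata statistics heal-count
# list the pending entries (paths/gfids) per brick
gluster volume heal workdata info
# only the entries gluster itself classifies as split-brain
gluster volume heal workdata info split-brain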

glustershd on the "good" servers:
[2024-02-15 09:31:32.427779 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-13:
remote operation failed.
[{path=}, {gfid=2a9dfe1d-c617-4ca5-9842-52675581c880}, {errno=2},
{error=No such file or directory}]

glustershd on the "bad" server:
[2024-02-15 09:32:18.613343 +] E [MSGID: 108008]
[afr-self-heal-common.c:399:afr_gfid_split_brain_source]
0-workdata-replicate-2: Gfid mismatch detected for
/854>, bb8e53c7-0446-4f82-bd23-12253e8484db on workdata-client-8 and
a42769e2-f6ba-44b0-ad8c-1e451ba943a6 on workdata-client-6.
[2024-02-15 09:32:18.613550 +] E [MSGID: 108008]
[afr-self-heal-entry.c:465:afr_selfheal_detect_gfid_and_type_mismatch]
0-workdata-replicate-2: Skipping conservative merge on the file.
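
For anyone who lands here with the same errors: a heavily hedged sketch
of the usual manual cleanup for such a gfid mismatch, as described in
the split-brain documentation linked later in this thread - pick the
copy you trust, remove the other copy plus its .glusterfs hard link on
the brick you don't trust, and let the self-heal daemon recreate it.
Brick path, file path and gfid below are placeholders, not values from
this volume:

# run on the server whose brick holds the copy you decided to discard
BRICK=/gluster/mdX/workdata                  # placeholder brick path
BAD="$BRICK/path/on/brick/854"               # placeholder file path on that brick
GFID=aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee    # gfid of that copy (getfattr -n trusted.gfid)
# for a regular file the gfid entry under .glusterfs is a hard link to the file
rm -f "$BAD" "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
# then trigger a heal so the surviving copy gets recreated on this brick
gluster volume heal workdata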

Well, i won't put any more work into this. The volume is screwed up
and has been replaced by a different solution. The servers will be
decommissioned soon.


Thx for all your efforts,

Hubert

Am Mi., 31. Jan. 2024 um 17:10 Uhr schrieb Strahil Nikolov
:
>
> Hi,
>
> This is a simplified description, see the links below for a more detailed one.
> When a client makes a change to a file - it  commits that change to all 
> bricks simultaneously and if the change passes on a quorate number of bricks 
> (in your case 2 out of 3 is enough) it is treated as successful.
> During that phase the 2 bricks, that successfully have completed the task, 
> will mark the 3rd brick as 'dirty' and you will see that in the heal report.
> Only when the heal daemon syncs the file to the final brick, that heal will 
> be cleaned from the remaining bricks.
>
>
> If a client has only 2 out of 3 bricks connected, it will constantly create 
> new files for healing (as it can't save it on all 3) and this can even get 
> worse with the increase of the number of clients that fail to connect to the 
> 3rd brick.
>
> Check that all client's IPs are connected to all bricks and those that are 
> not - remount the volume. After remounting the behavior should not persist. 
> If it does - check with your network/firewall team for troubleshooting the 
> problem.
>
> You can use 'gluster volume status all client-list'  and 'gluster volume 
> status all clients' (where 'all' can be replaced by the volume name) to find 
> more details on that side.
>
> You can find a more detailed explanation of the whole process at this blog:
> https://ravispeaks.wordpress.com/2019/04/05/glusterfs-afr-the-complete-guide/
>
> https://ravispeaks.wordpress.com/2019/04/15/gluster-afr-the-complete-guide-part-2/
>
> https://ravispeaks.wordpress.com/2019/05/14/gluster-afr-the-complete-guide-part-3/
>
>
> Best Regards,
> Strahil Nikolov
>
>
>
> On Tue, Jan 30, 2024 at 15:26, Hu Bert
>  wrote:
> Hi Strahil,
> hm, not sure what the clients have to do with the situation. "gluster
> volume status workdata clients" - lists all clients with their IP
> addresses.
>
> "gluster peer status" and "gluster volume status" are ok, the latter
> one says that all bricks are online, have a port etc. The network is
> okay, ping works etc. Well, made a check on one client: umount gluster
> volume, remount, now the client appears in the list. Yeah... but why
> now? Will try a few more... not that easy as most of these systems are
> in production...
>
> I had enabled the 3 self-heal values, but that didn't have any effect
> back then. And, honestly, i won't do it now, because: if the heal
> started now that would probably slow down the live system (with the
> clients). I'll try it when the cluster isn't used anymore.
>
> Interesting - new messages incoming on the "bad" server:
>
> [2024-01-30 14:15:11,820] INFO [utils - 67:log_event] - {'nodeid':
> '8ea1e6b4-9c77-4390-96a7-8724c3f9dc0f', 'ts': 1706620511, 'event':
> 'AFR_SPLIT_BRAIN', 'message': {'client-pid': '-6', 'subvol':
> 'workdata-replicate-2', 'type': 'gfid', '
> file': '/756>', 'count':
> '2', 'child-2': 'workdata-client-8', 'gfid-2':
> '39807be6-b7de-4a82-8a22-cf61b1415208', 'child-0':
> 'workdata-client-6', 'gfid-0': 'bb4a12ec-f9b7-46bc-9fb3-c57730f1fc49'}
> }
> [2024-01-30 14:15:17,028] INFO [utils - 67:log_event] - {'nodeid':
> '8ea1e6b4-9c77-4390-96a7-8724c3f9dc0f', 'ts': 1706620517, 'event':
> 'AFR_SPLIT_BRAIN', 'message': {'client-pid': '-6', 'subvol':
> 'workdata-replicate-4', 'type': 'gfid', '
> file': '/94259611>',
> 'count': '2', 'child-2': 'workdata-client-14', 'gfid-2':
> '01234675-17b9-4523-a598-5e331a72c4fa', 'child-0':
> 'workdata-client-12', 'gfid-0': 'b11140bd-355b-4583-9a85-5d06085
> 89f97'}}
>
> They didn't appear in the beginning. Looks like a funny state that
> this volume is in :D

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-31 Thread Strahil Nikolov
Hi,
This is a simplified description; see the links below for a more detailed one.
When a client makes a change to a file, it commits that change to all
bricks simultaneously, and if the change succeeds on a quorate number of
bricks (in your case 2 out of 3 is enough) it is treated as successful.
During that phase the 2 bricks that successfully completed the task will
mark the 3rd brick as 'dirty', and you will see that in the heal report.
Only when the heal daemon syncs the file to the final brick will that
heal entry be cleared from the remaining bricks.
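
Those 'dirty'/pending markers are visible as extended attributes on the
brick copies; a minimal sketch (brick path and gfid are placeholders,
the attribute names match the getfattr output further down this thread):

getfattr -d -e hex -m. /gluster/mdX/workdata/.glusterfs/aa/bb/aabbccdd-0000-0000-0000-000000000000
# trusted.afr.dirty              - non-zero while a transaction has not settled yet
# trusted.afr.workdata-client-N  - pending data/metadata/entry changes recorded
#                                  against brick N, i.e. that brick still needs healing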

If a client has only 2 out of 3 bricks connected, it will constantly create new
files that need healing (as it can't write them to all 3), and this gets worse
as the number of clients that fail to connect to the 3rd brick increases.
Check that all clients' IPs are connected to all bricks, and remount the volume
on those that are not. After remounting the behaviour should not persist; if it
does, check with your network/firewall team to troubleshoot the problem.
You can use 'gluster volume status all client-list'  and 'gluster volume status 
all clients' (where 'all' can be replaced by the volume name) to find more 
details on that side.
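
A minimal sketch of that check and of the remount, using the volume,
server and mount point names from this thread (adapt to your setup):

# on any server: which client IPs does each brick currently see?
gluster volume status workdata client-list
gluster volume status workdata clients
# on a client that is missing from the "bad" brick's list: remount the volume
umount /mnt/workdata
mount -t glusterfs glusterpub1:/workdata /mnt/workdata
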
You can find a more detailed explanation of the whole process at this blog:
https://ravispeaks.wordpress.com/2019/04/05/glusterfs-afr-the-complete-guide/

https://ravispeaks.wordpress.com/2019/04/15/gluster-afr-the-complete-guide-part-2/

https://ravispeaks.wordpress.com/2019/05/14/gluster-afr-the-complete-guide-part-3/


Best Regards,
Strahil Nikolov

 
 
On Tue, Jan 30, 2024 at 15:26, Hu Bert wrote:
Hi Strahil,
hm, not sure what the clients have to do with the situation. "gluster
volume status workdata clients" - lists all clients with their IP
addresses.

"gluster peer status" and "gluster volume status" are ok, the latter
one says that all bricks are online, have a port etc. The network is
okay, ping works etc. Well, made a check on one client: umount gluster
volume, remount, now the client appears in the list. Yeah... but why
now? Will try a few more... not that easy as most of these systems are
in production...

I had enabled the 3 self-heal values, but that didn't have any effect
back then. And, honestly, i won't do it now, because: if the heal
started now that would probably slow down the live system (with the
clients). I'll try it when the cluster isn't used anymore.

Interesting - new messages incoming on the "bad" server:

[2024-01-30 14:15:11,820] INFO [utils - 67:log_event] - {'nodeid':
'8ea1e6b4-9c77-4390-96a7-8724c3f9dc0f', 'ts': 1706620511, 'event':
'AFR_SPLIT_BRAIN', 'message': {'client-pid': '-6', 'subvol':
'workdata-replicate-2', 'type': 'gfid', 'file': '/756>', 'count':
'2', 'child-2': 'workdata-client-8', 'gfid-2':
'39807be6-b7de-4a82-8a22-cf61b1415208', 'child-0':
'workdata-client-6', 'gfid-0': 'bb4a12ec-f9b7-46bc-9fb3-c57730f1fc49'}}
[2024-01-30 14:15:17,028] INFO [utils - 67:log_event] - {'nodeid':
'8ea1e6b4-9c77-4390-96a7-8724c3f9dc0f', 'ts': 1706620517, 'event':
'AFR_SPLIT_BRAIN', 'message': {'client-pid': '-6', 'subvol':
'workdata-replicate-4', 'type': 'gfid', 'file': '/94259611>',
'count': '2', 'child-2': 'workdata-client-14', 'gfid-2':
'01234675-17b9-4523-a598-5e331a72c4fa', 'child-0':
'workdata-client-12', 'gfid-0': 'b11140bd-355b-4583-9a85-5d0608589f97'}}

They didn't appear in the beginning. Looks like a funny state that
this volume is in :D


Thx & best regards,

Hubert

Am Di., 30. Jan. 2024 um 07:14 Uhr schrieb Strahil Nikolov
:
>
> This is your problem : bad server has only 3 clients.
>
> I remember there is another gluster volume command to list the IPs of the 
> clients. Find it and run it to find which clients are actually OK (those 3) 
> and the remaining 17 are not.
>
> Then try to remount those 17 clients and if the situation persists - work 
> with your Network Team to identify why the 17 clients can't reach the brick.
>
> Do you have selfheal enabled?
>
> cluster.data-self-heal
> cluster.entry-self-heal
> cluster.metadata-self-heal
>
>
> Best Regards,
>
> Strahil Nikolov
>
> On Mon, Jan 29, 2024 at 10:26, Hu Bert
>  wrote:
> Hi,
> not sure what you mean with "clients" - do you mean the clients that
> mount the volume?
>
> gluster volume status workdata clients
> --
> Brick : glusterpub2:/gluster/md3/workdata
> Clients connected : 20
> Hostname                                              BytesRead
> BytesWritten      OpVersion
>                                               -
>       -
> 192.168.0.222:49140                                    43698212
> 41152108          11
> [...shortened...]
> 192.168.0.126:49123                                  8362352021
> 16445401205          11
> --
> Brick : glusterpub3:/gluster/md3/workdata
> Clients connected : 3
> Hostname                                              BytesRead
> 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-30 Thread Hu Bert
Hi Strahil,
hm, not sure what the clients have to do with the situation. "gluster
volume status workdata clients" - lists all clients with their IP
addresses.

"gluster peer status" and "gluster volume status" are ok, the latter
one says that all bricks are online, have a port etc. The network is
okay, ping works etc. Well, made a check on one client: umount gluster
volume, remount, now the client appears in the list. Yeah... but why
now? Will try a few more... not that easy as most of these systems are
in production...

I had enabled the 3 self-heal values, but that didn't have any effect
back then. And, honestly, i won't do it now, because: if the heal
started now that would probably slow down the live system (with the
clients). I'll try it when the cluster isn't used anymore.

Interesting - new messages incoming on the "bad" server:

[2024-01-30 14:15:11,820] INFO [utils - 67:log_event] - {'nodeid':
'8ea1e6b4-9c77-4390-96a7-8724c3f9dc0f', 'ts': 1706620511, 'event':
'AFR_SPLIT_BRAIN', 'message': {'client-pid': '-6', 'subvol':
'workdata-replicate-2', 'type': 'gfid', 'file': '/756>', 'count':
'2', 'child-2': 'workdata-client-8', 'gfid-2':
'39807be6-b7de-4a82-8a22-cf61b1415208', 'child-0':
'workdata-client-6', 'gfid-0': 'bb4a12ec-f9b7-46bc-9fb3-c57730f1fc49'}}
[2024-01-30 14:15:17,028] INFO [utils - 67:log_event] - {'nodeid':
'8ea1e6b4-9c77-4390-96a7-8724c3f9dc0f', 'ts': 1706620517, 'event':
'AFR_SPLIT_BRAIN', 'message': {'client-pid': '-6', 'subvol':
'workdata-replicate-4', 'type': 'gfid', 'file': '/94259611>',
'count': '2', 'child-2': 'workdata-client-14', 'gfid-2':
'01234675-17b9-4523-a598-5e331a72c4fa', 'child-0':
'workdata-client-12', 'gfid-0': 'b11140bd-355b-4583-9a85-5d0608589f97'}}

They didn't appear in the beginning. Looks like a funny state that
this volume is in :D


Thx & best regards,

Hubert

Am Di., 30. Jan. 2024 um 07:14 Uhr schrieb Strahil Nikolov
:
>
> This is your problem : bad server has only 3 clients.
>
> I remember there is another gluster volume command to list the IPs of the 
> clients. Find it and run it to find which clients are actually OK (those 3) 
> and the remaining 17 are not.
>
> Then try to remount those 17 clients and if the situation persists - work 
> with your Network Team to identify why the 17 clients can't reach the brick.
>
> Do you have selfheal enabled?
>
> cluster.data-self-heal
> cluster.entry-self-heal
> cluster.metadata-self-heal
>
>
> Best Regards,
>
> Strahil Nikolov
>
> On Mon, Jan 29, 2024 at 10:26, Hu Bert
>  wrote:
> Hi,
> not sure what you mean with "clients" - do you mean the clients that
> mount the volume?
>
> gluster volume status workdata clients
> --
> Brick : glusterpub2:/gluster/md3/workdata
> Clients connected : 20
> Hostname  BytesRead
> BytesWritten  OpVersion
>   -
>   -
> 192.168.0.222:4914043698212
> 41152108  11
> [...shortened...]
> 192.168.0.126:49123  8362352021
> 16445401205  11
> --
> Brick : glusterpub3:/gluster/md3/workdata
> Clients connected : 3
> Hostname  BytesRead
> BytesWritten  OpVersion
>   -
>   -
> 192.168.0.44:49150  5855740279
> 63649538575  11
> 192.168.0.44:49137  308958200
> 319216608  11
> 192.168.0.126:49120  7524915770
> 15489813449  11
>
> 192.168.0.44 (glusterpub3) is the "bad" server. Not sure what you mean
> by "old" - probably not the age of the server, but rather the gluster
> version. op-version is 11 on all servers+clients, upgraded from
> 10.4 -> 11.1
>
> "Have you checked if a client is not allowed to update all 3 copies ?"
> -> are there special log messages for that?
>
> "If it's only 1 system, you can remove the brick, reinitialize it and
> then bring it back for a full sync."
> -> 
> https://docs.gluster.org/en/v3/Administrator%20Guide/Managing%20Volumes/#replace-brick
> -> Replacing bricks in Replicate/Distributed Replicate volumes
>
> this part, right? Well, can't do this right now, as there are ~33TB of
> data (many small files) to copy, that would slow down the servers /
> the volume. But if the replacement is running i could do it
> afterwards, just to see what happens.
>
>
> Hubert
>
> Am Mo., 29. Jan. 2024 um 08:21 Uhr schrieb Strahil Nikolov
> :
> >
> > 2800 is too much. Most probably you are affected by a bug. How old are the 
> > clients ? Is only 1 server affected ?
> > Have you checked if a client is not allowed to update all 3 copies ?
> >
> > If it's only 1 system, you can remove the brick, reinitialize it and 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-29 Thread Strahil Nikolov
This is your problem : bad server has only 3 clients.
I remember there is another gluster volume command to list the IPs of the 
clients. Find it and run it to find which clients are actually OK (those 3) and 
the remaining 17 are not. 
Then try to remount those 17 clients and if the situation persists - work with
your Network Team to identify why the 17 clients can't reach the brick.
Do you have selfheal enabled?
cluster.data-self-heal
cluster.entry-self-heal
cluster.metadata-self-heal
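
If they are off, a minimal sketch of turning them on for the volume from
this thread:

gluster volume set workdata cluster.data-self-heal on
gluster volume set workdata cluster.metadata-self-heal on
gluster volume set workdata cluster.entry-self-heal on
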
Best Regards,
Strahil Nikolov
On Mon, Jan 29, 2024 at 10:26, Hu Bert wrote:
Hi,
not sure what you mean with "clients" - do you mean the clients that
mount the volume?

gluster volume status workdata clients
--
Brick : glusterpub2:/gluster/md3/workdata
Clients connected : 20
Hostname              BytesRead     BytesWritten   OpVersion
--------              ---------     ------------   ---------
192.168.0.222:49140   43698212      41152108       11
[...shortened...]
192.168.0.126:49123   8362352021    16445401205    11
--
Brick : glusterpub3:/gluster/md3/workdata
Clients connected : 3
Hostname              BytesRead     BytesWritten   OpVersion
--------              ---------     ------------   ---------
192.168.0.44:49150    5855740279    63649538575    11
192.168.0.44:49137    308958200     319216608      11
192.168.0.126:49120   7524915770    15489813449    11

192.168.0.44 (glusterpub3) is the "bad" server. Not sure what you mean
by "old" - probably not the age of the server, but rather the gluster
version. op-version is 11 on all servers+clients, upgraded from
10.4 -> 11.1

"Have you checked if a client is not allowed to update all 3 copies ?"
-> are there special log messages for that?

"If it's only 1 system, you can remove the brick, reinitialize it and
then bring it back for a full sync."
-> 
https://docs.gluster.org/en/v3/Administrator%20Guide/Managing%20Volumes/#replace-brick
-> Replacing bricks in Replicate/Distributed Replicate volumes

this part, right? Well, can't do this right now, as there are ~33TB of
data (many small files) to copy, that would slow down the servers /
the volume. But if the replacement is running i could do it
afterwards, just to see what happens.


Hubert

Am Mo., 29. Jan. 2024 um 08:21 Uhr schrieb Strahil Nikolov
:
>
> 2800 is too much. Most probably you are affected by a bug. How old are the 
> clients ? Is only 1 server affected ?
> Have you checked if a client is not allowed to update all 3 copies ?
>
> If it's only 1 system, you can remove the brick, reinitialize it and then 
> bring it back for a full sync.
>
> Best Regards,
> Strahil Nikolov
>
> On Mon, Jan 29, 2024 at 8:44, Hu Bert
>  wrote:
> Morning,
> a few bad apples - but which ones? Checked glustershd.log on the "bad"
> server and counted todays "gfid mismatch" entries (2800 in total):
>
>    44 /212>,
>    44 /174>,
>    44 /94037803>,
>    44 /94066216>,
>    44 /249771609>,
>    44 /64235523>,
>    44 /185>,
>
> etc. But as i said, these are pretty new and didn't appear when the
> volume/servers started missbehaving. Are there scripts/snippets
> available how one could handle this?
>
> Healing would be very painful for the running system (still connected,
> but not very long anymore), as there surely are 4-5 million entries to
> be healed. I can't do this now - maybe, when the replacement is in
> productive state, one could give it a try.
>
> Thx,
> Hubert
>
> Am So., 28. Jan. 2024 um 23:12 Uhr schrieb Strahil Nikolov
> :
> >
> > From gfid mismatch a manual effort is needed but you can script it.
> > I think that a few bad "apples" can break the healing and if you fix them 
> > the healing might be recovered.
> >
> > Also, check why the client is not updating all copies. Most probably you 
> > have a client that is not able to connect to a brick.
> >
> > gluster volume status VOLUME_NAME clients
> >
> > Best Regards,
> > Strahil Nikolov
> >
> > On Sun, Jan 28, 2024 at 20:55, Hu Bert
> >  wrote:
> > Hi Strahil,
> > there's no arbiter: 3 servers with 5 bricks each.
> >
> > Volume Name: workdata
> > Type: Distributed-Replicate
> > Volume ID: 7d1e23e5-0308-4443-a832-d36f85ff7959
> > Status: Started
> > Snapshot Count: 0
> > Number of Bricks: 5 x 3 = 15
> >
> > The "problem" is: the number of files/entries to-be-healed has
> > continuously grown since the beginning, and now we're talking about
> > way too many files to do this manually. Last time i checked: 700K per
> > brick, should be >900K at the moment. The command 'gluster volume heal
> > workdata statistics heal-count' is unable to finish. Doesn't look that
> > good :D
> >
> 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-29 Thread Hu Bert
Hi,
not sure what you mean by "clients" - do you mean the clients that
mount the volume?

gluster volume status workdata clients
--
Brick : glusterpub2:/gluster/md3/workdata
Clients connected : 20
Hostname              BytesRead     BytesWritten   OpVersion
--------              ---------     ------------   ---------
192.168.0.222:49140   43698212      41152108       11
[...shortened...]
192.168.0.126:49123   8362352021    16445401205    11
--
Brick : glusterpub3:/gluster/md3/workdata
Clients connected : 3
Hostname              BytesRead     BytesWritten   OpVersion
--------              ---------     ------------   ---------
192.168.0.44:49150    5855740279    63649538575    11
192.168.0.44:49137    308958200     319216608      11
192.168.0.126:49120   7524915770    15489813449    11

192.168.0.44 (glusterpub3) is the "bad" server. Not sure what you mean
by "old" - probably not the age of the server, but rather the gluster
version. op-version is 11 on all servers+clients, upgraded from
10.4 -> 11.1

"Have you checked if a client is not allowed to update all 3 copies ?"
-> are there special log messages for that?

"If it's only 1 system, you can remove the brick, reinitialize it and
then bring it back for a full sync."
-> 
https://docs.gluster.org/en/v3/Administrator%20Guide/Managing%20Volumes/#replace-brick
-> Replacing bricks in Replicate/Distributed Replicate volumes

this part, right? Well, can't do this right now, as there are ~33TB of
data (many small files) to copy, that would slow down the servers /
the volume. But if the replacement is running i could do it
afterwards, just to see what happens.
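
For the record, a minimal sketch of both ways to reset a single brick,
using the "bad" brick from this thread as an example (the _new suffix is
a placeholder; treat this as a sketch, not a tested procedure):

# variant 1: replace the brick with a fresh directory on the same server
gluster volume replace-brick workdata \
  glusterpub3:/gluster/md3/workdata glusterpub3:/gluster/md3/workdata_new \
  commit force
# variant 2: reset the existing brick path in place
gluster volume reset-brick workdata glusterpub3:/gluster/md3/workdata start
# ... reinitialize / empty the brick directory here ...
gluster volume reset-brick workdata glusterpub3:/gluster/md3/workdata \
  glusterpub3:/gluster/md3/workdata commit force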


Hubert

Am Mo., 29. Jan. 2024 um 08:21 Uhr schrieb Strahil Nikolov
:
>
> 2800 is too much. Most probably you are affected by a bug. How old are the 
> clients ? Is only 1 server affected ?
> Have you checked if a client is not allowed to update all 3 copies ?
>
> If it's only 1 system, you can remove the brick, reinitialize it and then 
> bring it back for a full sync.
>
> Best Regards,
> Strahil Nikolov
>
> On Mon, Jan 29, 2024 at 8:44, Hu Bert
>  wrote:
> Morning,
> a few bad apples - but which ones? Checked glustershd.log on the "bad"
> server and counted todays "gfid mismatch" entries (2800 in total):
>
> 44 /212>,
> 44 /174>,
> 44 /94037803>,
> 44 /94066216>,
> 44 /249771609>,
> 44 /64235523>,
> 44 /185>,
>
> etc. But as i said, these are pretty new and didn't appear when the
> volume/servers started misbehaving. Are there scripts/snippets 
> available how one could handle this?
>
> Healing would be very painful for the running system (still connected,
> but not very long anymore), as there surely are 4-5 million entries to
> be healed. I can't do this now - maybe, when the replacement is in
> productive state, one could give it a try.
>
> Thx,
> Hubert
>
> Am So., 28. Jan. 2024 um 23:12 Uhr schrieb Strahil Nikolov
> :
> >
> > From gfid mismatch a manual effort is needed but you can script it.
> > I think that a few bad "apples" can break the healing and if you fix them 
> > the healing might be recovered.
> >
> > Also, check why the client is not updating all copies. Most probably you 
> > have a client that is not able to connect to a brick.
> >
> > gluster volume status VOLUME_NAME clients
> >
> > Best Regards,
> > Strahil Nikolov
> >
> > On Sun, Jan 28, 2024 at 20:55, Hu Bert
> >  wrote:
> > Hi Strahil,
> > there's no arbiter: 3 servers with 5 bricks each.
> >
> > Volume Name: workdata
> > Type: Distributed-Replicate
> > Volume ID: 7d1e23e5-0308-4443-a832-d36f85ff7959
> > Status: Started
> > Snapshot Count: 0
> > Number of Bricks: 5 x 3 = 15
> >
> > The "problem" is: the number of files/entries to-be-healed has
> > continuously grown since the beginning, and now we're talking about
> > way too many files to do this manually. Last time i checked: 700K per
> > brick, should be >900K at the moment. The command 'gluster volume heal
> > workdata statistics heal-count' is unable to finish. Doesn't look that
> > good :D
> >
> > Interesting, the glustershd.log on the "bad" server now shows errors like 
> > these:
> >
> > [2024-01-28 18:48:33.734053 +] E [MSGID: 108008]
> > [afr-self-heal-common.c:399:afr_gfid_split_brain_source]
> > 0-workdata-replicate-3: Gfid mismatch detected for
> > /803620716>,
> > 82d7939a-8919-40ea-
> > 9459-7b8af23d3b72 on workdata-client-11 and
> > bb9399a3-0a5c-4cd1-b2b1-3ee787ec835a on workdata-client-9
> >
> > Shouldn't the heals happen on the 2 "good" servers?
> >
> > Anyway... we're currently preparing a different solution for our data

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-28 Thread Strahil Nikolov
2800 is too much. Most probably you are affected by a bug. How old are the
clients? Is only 1 server affected?
Have you checked if a client is not allowed to update all 3 copies?
If it's only 1 system, you can remove the brick, reinitialize it and then bring
it back for a full sync.
Best Regards,
Strahil Nikolov
 
 
On Mon, Jan 29, 2024 at 8:44, Hu Bert wrote:
Morning,
a few bad apples - but which ones? Checked glustershd.log on the "bad"
server and counted today's "gfid mismatch" entries (2800 in total):

    44 /212>,
    44 /174>,
    44 /94037803>,
    44 /94066216>,
    44 /249771609>,
    44 /64235523>,
    44 /185>,

etc. But as i said, these are pretty new and didn't appear when the
volume/servers started misbehaving. Are there scripts/snippets
available how one could handle this?

Healing would be very painful for the running system (still connected,
but not very long anymore), as there surely are 4-5 million entries to
be healed. I can't do this now - maybe, when the replacement is in
productive state, one could give it a try.

Thx,
Hubert

Am So., 28. Jan. 2024 um 23:12 Uhr schrieb Strahil Nikolov
:
>
> From gfid mismatch a manual effort is needed but you can script it.
> I think that a few bad "apples" can break the healing and if you fix them the 
> healing might be recovered.
>
> Also, check why the client is not updating all copies. Most probably you have 
> a client that is not able to connect to a brick.
>
> gluster volume status VOLUME_NAME clients
>
> Best Regards,
> Strahil Nikolov
>
> On Sun, Jan 28, 2024 at 20:55, Hu Bert
>  wrote:
> Hi Strahil,
> there's no arbiter: 3 servers with 5 bricks each.
>
> Volume Name: workdata
> Type: Distributed-Replicate
> Volume ID: 7d1e23e5-0308-4443-a832-d36f85ff7959
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 5 x 3 = 15
>
> The "problem" is: the number of files/entries to-be-healed has
> continuously grown since the beginning, and now we're talking about
> way too many files to do this manually. Last time i checked: 700K per
> brick, should be >900K at the moment. The command 'gluster volume heal
> workdata statistics heal-count' is unable to finish. Doesn't look that
> good :D
>
> Interesting, the glustershd.log on the "bad" server now shows errors like 
> these:
>
> [2024-01-28 18:48:33.734053 +] E [MSGID: 108008]
> [afr-self-heal-common.c:399:afr_gfid_split_brain_source]
> 0-workdata-replicate-3: Gfid mismatch detected for
> /803620716>,
> 82d7939a-8919-40ea-
> 9459-7b8af23d3b72 on workdata-client-11 and
> bb9399a3-0a5c-4cd1-b2b1-3ee787ec835a on workdata-client-9
>
> Shouldn't the heals happen on the 2 "good" servers?
>
> Anyway... we're currently preparing a different solution for our data
> and we'll throw away this gluster volume - no critical data will be
> lost, as these are derived from source data (on a different volume on
> different servers). Will be a hard time (calculating tons of data),
> but the chosen solution should have a way better performance.
>
> Well... thx to all for your efforts, really appreciate that :-)
>
>
> Hubert
>
> Am So., 28. Jan. 2024 um 08:35 Uhr schrieb Strahil Nikolov
> :
> >
> > What about the arbiter node ?
> > Actually, check on all nodes and script it - you might need it in the 
> > future.
> >
> > Simplest way to resolve is to make the file disappear (rename to something 
> > else and then rename it back). Another easy trick is to read the whole 
> > file: dd if=file of=/dev/null status=progress
> >
> > Best Regards,
> > Strahil Nikolov
> >
> > On Sat, Jan 27, 2024 at 8:24, Hu Bert
> >  wrote:
> > Morning,
> >
> > gfid1:
> > getfattr -d -e hex -m.
> > /gluster/md{3,4,5,6,7}/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> >
> > glusterpub1 (good one):
> > getfattr: Removing leading '/' from absolute path names
> > # file: 
> > gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> > trusted.afr.dirty=0x
> > trusted.afr.workdata-client-11=0x00020001
> > trusted.gfid=0xfaf5956610f54ddd8b0ca87bc6a334fb
> > trusted.gfid2path.c2845024cc9b402e=0x38633139626234612d396236382d343532652d623434652d3664616331666434616465652f31323878313238732e6a7067
> > trusted.glusterfs.mdata=0x0165aaecff2695ebb765aaecff2695ebb765aaecff2533f110
> >
> > glusterpub3 (bad one):
> > getfattr: 
> > /gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb:
> > No such file or directory
> >
> > gfid 2:
> > getfattr -d -e hex -m.
> > /gluster/md{3,4,5,6,7}/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
> >
> > glusterpub1 (good one):
> > getfattr: Removing leading '/' from absolute path names
> > # file: 
> > gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
> > trusted.afr.dirty=0x
> > trusted.afr.workdata-client-8=0x00020001
> > trusted.gfid=0x604657235dc04ebeaced9f2c12e52642
> 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-28 Thread Hu Bert
Morning,
a few bad apples - but which ones? Checked glustershd.log on the "bad"
server and counted today's "gfid mismatch" entries (2800 in total):

44 /212>,
44 /174>,
44 /94037803>,
44 /94066216>,
44 /249771609>,
44 /64235523>,
44 /185>,

etc. But as i said, these are pretty new and didn't appear when the
volume/servers started misbehaving. Are there scripts/snippets
available how one could handle this?
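
A minimal sketch of how such a per-directory count can be produced
(default log location assumed; the field parsing is a guess based on the
message format shown above):

grep 'Gfid mismatch detected' /var/log/glusterfs/glustershd.log \
  | grep "$(date -u +%Y-%m-%d)" \
  | sed 's/.*Gfid mismatch detected for //' \
  | awk '{print $1}' \
  | sort | uniq -c | sort -rn | head -20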

Healing would be very painful for the running system (still connected,
but not very long anymore), as there surely are 4-5 million entries to
be healed. I can't do this now - maybe, when the replacement is in
productive state, one could give it a try.

Thx,
Hubert

Am So., 28. Jan. 2024 um 23:12 Uhr schrieb Strahil Nikolov
:
>
> From gfid mismatch a manual effort is needed but you can script it.
> I think that a few bad "apples" can break the healing and if you fix them the 
> healing might be recovered.
>
> Also, check why the client is not updating all copies. Most probably you have 
> a client that is not able to connect to a brick.
>
> gluster volume status VOLUME_NAME clients
>
> Best Regards,
> Strahil Nikolov
>
> On Sun, Jan 28, 2024 at 20:55, Hu Bert
>  wrote:
> Hi Strahil,
> there's no arbiter: 3 servers with 5 bricks each.
>
> Volume Name: workdata
> Type: Distributed-Replicate
> Volume ID: 7d1e23e5-0308-4443-a832-d36f85ff7959
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 5 x 3 = 15
>
> The "problem" is: the number of files/entries to-be-healed has
> continuously grown since the beginning, and now we're talking about
> way too many files to do this manually. Last time i checked: 700K per
> brick, should be >900K at the moment. The command 'gluster volume heal
> workdata statistics heal-count' is unable to finish. Doesn't look that
> good :D
>
> Interesting, the glustershd.log on the "bad" server now shows errors like 
> these:
>
> [2024-01-28 18:48:33.734053 +] E [MSGID: 108008]
> [afr-self-heal-common.c:399:afr_gfid_split_brain_source]
> 0-workdata-replicate-3: Gfid mismatch detected for
> /803620716>,
> 82d7939a-8919-40ea-
> 9459-7b8af23d3b72 on workdata-client-11 and
> bb9399a3-0a5c-4cd1-b2b1-3ee787ec835a on workdata-client-9
>
> Shouldn't the heals happen on the 2 "good" servers?
>
> Anyway... we're currently preparing a different solution for our data
> and we'll throw away this gluster volume - no critical data will be
> lost, as these are derived from source data (on a different volume on
> different servers). Will be a hard time (calculating tons of data),
> but the chosen solution should have a way better performance.
>
> Well... thx to all for your efforts, really appreciate that :-)
>
>
> Hubert
>
> Am So., 28. Jan. 2024 um 08:35 Uhr schrieb Strahil Nikolov
> :
> >
> > What about the arbiter node ?
> > Actually, check on all nodes and script it - you might need it in the 
> > future.
> >
> > Simplest way to resolve is to make the file disappear (rename to something 
> > else and then rename it back). Another easy trick is to read the whole 
> > file: dd if=file of=/dev/null status=progress
> >
> > Best Regards,
> > Strahil Nikolov
> >
> > On Sat, Jan 27, 2024 at 8:24, Hu Bert
> >  wrote:
> > Morning,
> >
> > gfid1:
> > getfattr -d -e hex -m.
> > /gluster/md{3,4,5,6,7}/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> >
> > glusterpub1 (good one):
> > getfattr: Removing leading '/' from absolute path names
> > # file: 
> > gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> > trusted.afr.dirty=0x
> > trusted.afr.workdata-client-11=0x00020001
> > trusted.gfid=0xfaf5956610f54ddd8b0ca87bc6a334fb
> > trusted.gfid2path.c2845024cc9b402e=0x38633139626234612d396236382d343532652d623434652d3664616331666434616465652f31323878313238732e6a7067
> > trusted.glusterfs.mdata=0x0165aaecff2695ebb765aaecff2695ebb765aaecff2533f110
> >
> > glusterpub3 (bad one):
> > getfattr: 
> > /gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb:
> > No such file or directory
> >
> > gfid 2:
> > getfattr -d -e hex -m.
> > /gluster/md{3,4,5,6,7}/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
> >
> > glusterpub1 (good one):
> > getfattr: Removing leading '/' from absolute path names
> > # file: 
> > gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
> > trusted.afr.dirty=0x
> > trusted.afr.workdata-client-8=0x00020001
> > trusted.gfid=0x604657235dc04ebeaced9f2c12e52642
> > trusted.gfid2path.ac4669e3c4faf926=0x33366463366137392d666135642d343238652d613738642d6234376230616662316562642f31323878313238732e6a7067
> > trusted.glusterfs.mdata=0x0165aaecfe0c5403bd65aaecfe0c5403bd65aaecfe0ad61ee4
> >
> > glusterpub3 (bad one):
> > getfattr: 
> > 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-28 Thread Hu Bert
Hi Strahil,
there's no arbiter: 3 servers with 5 bricks each.

Volume Name: workdata
Type: Distributed-Replicate
Volume ID: 7d1e23e5-0308-4443-a832-d36f85ff7959
Status: Started
Snapshot Count: 0
Number of Bricks: 5 x 3 = 15

The "problem" is: the number of files/entries to-be-healed has
continuously grown since the beginning, and now we're talking about
way too many files to do this manually. Last time i checked: 700K per
brick, should be >900K at the moment. The command 'gluster volume heal
workdata statistics heal-count' is unable to finish. Doesn't look that
good :D

Interesting, the glustershd.log on the "bad" server now shows errors like these:

[2024-01-28 18:48:33.734053 +] E [MSGID: 108008]
[afr-self-heal-common.c:399:afr_gfid_split_brain_source]
0-workdata-replicate-3: Gfid mismatch detected for
/803620716>, 82d7939a-8919-40ea-9459-7b8af23d3b72 on workdata-client-11 and
bb9399a3-0a5c-4cd1-b2b1-3ee787ec835a on workdata-client-9

Shouldn't the heals happen on the 2 "good" servers?

Anyway... we're currently preparing a different solution for our data
and we'll throw away this gluster volume - no critical data will be
lost, as these are derived from source data (on a different volume on
different servers). Will be a hard time (calculating tons of data),
but the chosen solution should have a way better performance.

Well... thx to all for your efforts, really appreciate that :-)


Hubert

Am So., 28. Jan. 2024 um 08:35 Uhr schrieb Strahil Nikolov
:
>
> What about the arbiter node ?
> Actually, check on all nodes and script it - you might need it in the future.
>
> Simplest way to resolve is to make the file disappear (rename to something 
> else and then rename it back). Another easy trick is to read the whole file: 
> dd if=file of=/dev/null status=progress
>
> Best Regards,
> Strahil Nikolov
>
> On Sat, Jan 27, 2024 at 8:24, Hu Bert
>  wrote:
> Morning,
>
> gfid1:
> getfattr -d -e hex -m.
> /gluster/md{3,4,5,6,7}/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
>
> glusterpub1 (good one):
> getfattr: Removing leading '/' from absolute path names
> # file: 
> gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> trusted.afr.dirty=0x
> trusted.afr.workdata-client-11=0x00020001
> trusted.gfid=0xfaf5956610f54ddd8b0ca87bc6a334fb
> trusted.gfid2path.c2845024cc9b402e=0x38633139626234612d396236382d343532652d623434652d3664616331666434616465652f31323878313238732e6a7067
> trusted.glusterfs.mdata=0x0165aaecff2695ebb765aaecff2695ebb765aaecff2533f110
>
> glusterpub3 (bad one):
> getfattr: 
> /gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb:
> No such file or directory
>
> gfid 2:
> getfattr -d -e hex -m.
> /gluster/md{3,4,5,6,7}/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
>
> glusterpub1 (good one):
> getfattr: Removing leading '/' from absolute path names
> # file: 
> gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
> trusted.afr.dirty=0x
> trusted.afr.workdata-client-8=0x00020001
> trusted.gfid=0x604657235dc04ebeaced9f2c12e52642
> trusted.gfid2path.ac4669e3c4faf926=0x33366463366137392d666135642d343238652d613738642d6234376230616662316562642f31323878313238732e6a7067
> trusted.glusterfs.mdata=0x0165aaecfe0c5403bd65aaecfe0c5403bd65aaecfe0ad61ee4
>
> glusterpub3 (bad one):
> getfattr: 
> /gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642:
> No such file or directory
>
> thx,
> Hubert
>
> Am Sa., 27. Jan. 2024 um 06:13 Uhr schrieb Strahil Nikolov
> :
> >
> > You don't need to mount it.
> > Like this :
> > # getfattr -d -e hex -m. 
> > /path/to/brick/.glusterfs/00/46/00462be8-3e61-4931-8bda-dae1645c639e
> > # file: 00/46/00462be8-3e61-4931-8bda-dae1645c639e
> > trusted.gfid=0x00462be83e6149318bdadae1645c639e
> > trusted.gfid2path.05fcbdafdeea18ab=0x3032673930632d386637622d346436652d393464362d3936393132313930643131312f66696c656c6f636b696e672e7079
> > trusted.glusterfs.mdata=0x016170340c25b6a7456170340c20efb5776170340c20d42b07
> > trusted.glusterfs.shard.block-size=0x0400
> > trusted.glusterfs.shard.file-size=0x00cd0001
> >
> >
> > Best Regards,
> > Strahil Nikolov
> >
> >
> >
> > On Thursday, January 25, 2024 at 09:42:46 GMT+2, Hu Bert 
> >  wrote:
> >
> >
> >
> >
> >
> > Good morning,
> >
> > hope i got it right... using:
> > https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3.1/html/administration_guide/ch27s02
> >
> > mount -t glusterfs -o aux-gfid-mount glusterpub1:/workdata /mnt/workdata
> >
> > gfid 1:
> > getfattr -n trusted.glusterfs.pathinfo -e text
> > 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-27 Thread Strahil Nikolov
What about the arbiter node?
Actually, check on all nodes and script it - you might need it in the future.
Simplest way to resolve is to make the file disappear (rename it to something
else and then rename it back). Another easy trick is to read the whole file:
dd if=file of=/dev/null status=progress
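
A minimal sketch of both tricks, run on a client against the fuse mount
(paths are placeholders):

# 1) rename away and back to force a fresh lookup (and with it a heal attempt)
mv /mnt/workdata/some/dir/file.jpg /mnt/workdata/some/dir/file.jpg.tmp
mv /mnt/workdata/some/dir/file.jpg.tmp /mnt/workdata/some/dir/file.jpg
# 2) read the whole file through the mount to trigger a heal
dd if=/mnt/workdata/some/dir/file.jpg of=/dev/null status=progress
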
Best Regards,
Strahil Nikolov
 
On Sat, Jan 27, 2024 at 8:24, Hu Bert wrote:
Morning,

gfid1:
getfattr -d -e hex -m.
/gluster/md{3,4,5,6,7}/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb

glusterpub1 (good one):
getfattr: Removing leading '/' from absolute path names
# file: 
gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
trusted.afr.dirty=0x
trusted.afr.workdata-client-11=0x00020001
trusted.gfid=0xfaf5956610f54ddd8b0ca87bc6a334fb
trusted.gfid2path.c2845024cc9b402e=0x38633139626234612d396236382d343532652d623434652d3664616331666434616465652f31323878313238732e6a7067
trusted.glusterfs.mdata=0x0165aaecff2695ebb765aaecff2695ebb765aaecff2533f110

glusterpub3 (bad one):
getfattr: 
/gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb:
No such file or directory

gfid 2:
getfattr -d -e hex -m.
/gluster/md{3,4,5,6,7}/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642

glusterpub1 (good one):
getfattr: Removing leading '/' from absolute path names
# file: 
gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
trusted.afr.dirty=0x
trusted.afr.workdata-client-8=0x00020001
trusted.gfid=0x604657235dc04ebeaced9f2c12e52642
trusted.gfid2path.ac4669e3c4faf926=0x33366463366137392d666135642d343238652d613738642d6234376230616662316562642f31323878313238732e6a7067
trusted.glusterfs.mdata=0x0165aaecfe0c5403bd65aaecfe0c5403bd65aaecfe0ad61ee4

glusterpub3 (bad one):
getfattr: 
/gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642:
No such file or directory

thx,
Hubert

Am Sa., 27. Jan. 2024 um 06:13 Uhr schrieb Strahil Nikolov
:
>
> You don't need to mount it.
> Like this :
> # getfattr -d -e hex -m. 
> /path/to/brick/.glusterfs/00/46/00462be8-3e61-4931-8bda-dae1645c639e
> # file: 00/46/00462be8-3e61-4931-8bda-dae1645c639e
> trusted.gfid=0x00462be83e6149318bdadae1645c639e
> trusted.gfid2path.05fcbdafdeea18ab=0x3032673930632d386637622d346436652d393464362d3936393132313930643131312f66696c656c6f636b696e672e7079
> trusted.glusterfs.mdata=0x016170340c25b6a7456170340c20efb5776170340c20d42b07
> trusted.glusterfs.shard.block-size=0x0400
> trusted.glusterfs.shard.file-size=0x00cd0001
>
>
> Best Regards,
> Strahil Nikolov
>
>
>
> On Thursday, January 25, 2024 at 09:42:46 GMT+2, Hu Bert 
>  wrote:
>
>
>
>
>
> Good morning,
>
> hope i got it right... using:
> https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3.1/html/administration_guide/ch27s02
>
> mount -t glusterfs -o aux-gfid-mount glusterpub1:/workdata /mnt/workdata
>
> gfid 1:
> getfattr -n trusted.glusterfs.pathinfo -e text
> /mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> trusted.glusterfs.pathinfo="(
> (
> 
>  uster/md6/workdata/images/133/283/13328349/128x128s.jpg>))"
>
> gfid 2:
> getfattr -n trusted.glusterfs.pathinfo -e text
> /mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642
> trusted.glusterfs.pathinfo="(
> (
> 
>  ):glusterpub1:/gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642>))"
>
> glusterpub1 + glusterpub2 are the good ones, glusterpub3 is the
> misbehaving (not healing) one.
>
> The file with gfid 1 is available under
> /gluster/md6/workdata/images/133/283/13328349/ on glusterpub1+2
> bricks, but missing on glusterpub3 brick.
>
> gfid 2: 
> /gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
> is present on glusterpub1+2, but not on glusterpub3.
>
>
> Thx,
> Hubert
>
> Am Mi., 24. Jan. 2024 um 17:36 Uhr schrieb Strahil Nikolov
> :
>
> >
> > Hi,
> >
> > Can you find and check the files with gfids:
> > 60465723-5dc0-4ebe-aced-9f2c12e52642
> > faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> >
> > Use 'getfattr -d -e hex -m. ' command from 
> > https://docs.gluster.org/en/main/Troubleshooting/resolving-splitbrain/#analysis-of-the-output
> >  .
> >
> > Best Regards,
> > Strahil Nikolov
> >
> > On Sat, Jan 20, 2024 at 9:44, Hu Bert
> >  wrote:
> > Good morning,
> >
> > thx Gilberto, did the first three (set to WARNING), but the last one
> > doesn't work. Anyway, with setting these three some 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-26 Thread Hu Bert
Morning,

gfid1:
getfattr -d -e hex -m.
/gluster/md{3,4,5,6,7}/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb

glusterpub1 (good one):
getfattr: Removing leading '/' from absolute path names
# file: 
gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
trusted.afr.dirty=0x
trusted.afr.workdata-client-11=0x00020001
trusted.gfid=0xfaf5956610f54ddd8b0ca87bc6a334fb
trusted.gfid2path.c2845024cc9b402e=0x38633139626234612d396236382d343532652d623434652d3664616331666434616465652f31323878313238732e6a7067
trusted.glusterfs.mdata=0x0165aaecff2695ebb765aaecff2695ebb765aaecff2533f110

glusterpub3 (bad one):
getfattr: 
/gluster/md6/workdata/.glusterfs/fa/f5/faf59566-10f5-4ddd-8b0c-a87bc6a334fb:
No such file or directory

gfid 2:
getfattr -d -e hex -m.
/gluster/md{3,4,5,6,7}/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642

glusterpub1 (good one):
getfattr: Removing leading '/' from absolute path names
# file: 
gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
trusted.afr.dirty=0x
trusted.afr.workdata-client-8=0x00020001
trusted.gfid=0x604657235dc04ebeaced9f2c12e52642
trusted.gfid2path.ac4669e3c4faf926=0x33366463366137392d666135642d343238652d613738642d6234376230616662316562642f31323878313238732e6a7067
trusted.glusterfs.mdata=0x0165aaecfe0c5403bd65aaecfe0c5403bd65aaecfe0ad61ee4

glusterpub3 (bad one):
getfattr: 
/gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642:
No such file or directory

thx,
Hubert

Am Sa., 27. Jan. 2024 um 06:13 Uhr schrieb Strahil Nikolov
:
>
> You don't need to mount it.
> Like this :
> # getfattr -d -e hex -m. 
> /path/to/brick/.glusterfs/00/46/00462be8-3e61-4931-8bda-dae1645c639e
> # file: 00/46/00462be8-3e61-4931-8bda-dae1645c639e
> trusted.gfid=0x00462be83e6149318bdadae1645c639e
> trusted.gfid2path.05fcbdafdeea18ab=0x3032673930632d386637622d346436652d393464362d3936393132313930643131312f66696c656c6f636b696e672e7079
> trusted.glusterfs.mdata=0x016170340c25b6a7456170340c20efb5776170340c20d42b07
> trusted.glusterfs.shard.block-size=0x0400
> trusted.glusterfs.shard.file-size=0x00cd0001
>
>
> Best Regards,
> Strahil Nikolov
>
>
>
> On Thursday, January 25, 2024 at 09:42:46 GMT+2, Hu Bert 
>  wrote:
>
>
>
>
>
> Good morning,
>
> hope i got it right... using:
> https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3.1/html/administration_guide/ch27s02
>
> mount -t glusterfs -o aux-gfid-mount glusterpub1:/workdata /mnt/workdata
>
> gfid 1:
> getfattr -n trusted.glusterfs.pathinfo -e text
> /mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> trusted.glusterfs.pathinfo="(
> (
> 
>  uster/md6/workdata/images/133/283/13328349/128x128s.jpg>))"
>
> gfid 2:
> getfattr -n trusted.glusterfs.pathinfo -e text
> /mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642
> trusted.glusterfs.pathinfo="(
> (
> 
>  ):glusterpub1:/gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642>))"
>
> glusterpub1 + glusterpub2 are the good ones, glusterpub3 is the
> misbehaving (not healing) one.
>
> The file with gfid 1 is available under
> /gluster/md6/workdata/images/133/283/13328349/ on glusterpub1+2
> bricks, but missing on glusterpub3 brick.
>
> gfid 2: 
> /gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
> is present on glusterpub1+2, but not on glusterpub3.
>
>
> Thx,
> Hubert
>
> Am Mi., 24. Jan. 2024 um 17:36 Uhr schrieb Strahil Nikolov
> :
>
> >
> > Hi,
> >
> > Can you find and check the files with gfids:
> > 60465723-5dc0-4ebe-aced-9f2c12e52642
> > faf59566-10f5-4ddd-8b0c-a87bc6a334fb
> >
> > Use 'getfattr -d -e hex -m. ' command from 
> > https://docs.gluster.org/en/main/Troubleshooting/resolving-splitbrain/#analysis-of-the-output
> >  .
> >
> > Best Regards,
> > Strahil Nikolov
> >
> > On Sat, Jan 20, 2024 at 9:44, Hu Bert
> >  wrote:
> > Good morning,
> >
> > thx Gilberto, did the first three (set to WARNING), but the last one
> > doesn't work. Anyway, with setting these three some new messages
> > appear:
> >
> > [2024-01-20 07:23:58.561106 +] W [MSGID: 114061]
> > [client-common.c:796:client_pre_lk_v2] 0-workdata-client-11: remote_fd
> > is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb},
> > {errno=77}, {error=File descriptor in bad state}]
> > [2024-01-20 07:23:58.561177 +] E [MSGID: 108028]
> > [afr-open.c:361:afr_is_reopen_allowed_cbk] 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-26 Thread Strahil Nikolov
You don't need to mount it.
Like this :
# getfattr -d -e hex -m. 
/path/to/brick/.glusterfs/00/46/00462be8-3e61-4931-8bda-dae1645c639e
# file: 00/46/00462be8-3e61-4931-8bda-dae1645c639e
trusted.gfid=0x00462be83e6149318bdadae1645c639e
trusted.gfid2path.05fcbdafdeea18ab=0x3032673930632d386637622d346436652d393464362d3936393132313930643131312f66696c656c6f636b696e672e7079
trusted.glusterfs.mdata=0x016170340c25b6a7456170340c20efb5776170340c20d42b07
trusted.glusterfs.shard.block-size=0x0400
trusted.glusterfs.shard.file-size=0x00cd0001
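
Side note: the trusted.gfid2path value is just the hex-encoded
"parent-gfid/basename" string; a minimal sketch of decoding it (assuming
xxd is installed):

echo 3032673930632d386637622d346436652d393464362d3936393132313930643131312f66696c656c6f636b696e672e7079 \
  | xxd -r -p; echo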


Best Regards,
Strahil Nikolov



On Thursday, January 25, 2024 at 09:42:46 GMT+2, Hu Bert 
 wrote: 





Good morning,

hope i got it right... using:
https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3.1/html/administration_guide/ch27s02

mount -t glusterfs -o aux-gfid-mount glusterpub1:/workdata /mnt/workdata

gfid 1:
getfattr -n trusted.glusterfs.pathinfo -e text
/mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
getfattr: Removing leading '/' from absolute path names
# file: mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
trusted.glusterfs.pathinfo="(
(

))"

gfid 2:
getfattr -n trusted.glusterfs.pathinfo -e text
/mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642
getfattr: Removing leading '/' from absolute path names
# file: mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642
trusted.glusterfs.pathinfo="(
(

))"

glusterpub1 + glusterpub2 are the good ones, glusterpub3 is the
misbehaving (not healing) one.

The file with gfid 1 is available under
/gluster/md6/workdata/images/133/283/13328349/ on glusterpub1+2
bricks, but missing on glusterpub3 brick.

gfid 2: 
/gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
is present on glusterpub1+2, but not on glusterpub3.


Thx,
Hubert

Am Mi., 24. Jan. 2024 um 17:36 Uhr schrieb Strahil Nikolov
:

>
> Hi,
>
> Can you find and check the files with gfids:
> 60465723-5dc0-4ebe-aced-9f2c12e52642
> faf59566-10f5-4ddd-8b0c-a87bc6a334fb
>
> Use 'getfattr -d -e hex -m. ' command from 
> https://docs.gluster.org/en/main/Troubleshooting/resolving-splitbrain/#analysis-of-the-output
>  .
>
> Best Regards,
> Strahil Nikolov
>
> On Sat, Jan 20, 2024 at 9:44, Hu Bert
>  wrote:
> Good morning,
>
> thx Gilberto, did the first three (set to WARNING), but the last one
> doesn't work. Anyway, with setting these three some new messages
> appear:
>
> [2024-01-20 07:23:58.561106 +] W [MSGID: 114061]
> [client-common.c:796:client_pre_lk_v2] 0-workdata-client-11: remote_fd
> is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb},
> {errno=77}, {error=File descriptor in bad state}]
> [2024-01-20 07:23:58.561177 +] E [MSGID: 108028]
> [afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-3:
> Failed getlk for faf59566-10f5-4ddd-8b0c-a87bc6a334fb [File descriptor
> in bad state]
> [2024-01-20 07:23:58.562151 +] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-11:
> remote operation failed.
> [{path=},
> {gfid=faf59566-10f5-4ddd-8b0c-a87b
> c6a334fb}, {errno=2}, {error=No such file or directory}]
> [2024-01-20 07:23:58.562296 +] W [MSGID: 114061]
> [client-common.c:530:client_pre_flush_v2] 0-workdata-client-11:
> remote_fd is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb},
> {errno=77}, {error=File descriptor in bad state}]
> [2024-01-20 07:23:58.860552 +] W [MSGID: 114061]
> [client-common.c:796:client_pre_lk_v2] 0-workdata-client-8: remote_fd
> is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642},
> {errno=77}, {error=File descriptor in bad state}]
> [2024-01-20 07:23:58.860608 +] E [MSGID: 108028]
> [afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-2:
> Failed getlk for 60465723-5dc0-4ebe-aced-9f2c12e52642 [File descriptor
> in bad state]
> [2024-01-20 07:23:58.861520 +] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-8:
> remote operation failed.
> [{path=},
> {gfid=60465723-5dc0-4ebe-aced-9f2c1
> 2e52642}, {errno=2}, {error=No such file or directory}]
> [2024-01-20 07:23:58.861640 +] W [MSGID: 114061]
> [client-common.c:530:client_pre_flush_v2] 0-workdata-client-8:
> remote_fd is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642},
> {errno=77}, {error=File descriptor in bad state}]
>
> Not many log entries appear, only a few. Has someone seen error
> messages like these? Setting diagnostics.brick-sys-log-level to DEBUG
> shows way more log entries, uploaded it to:
> https://file.io/spLhlcbMCzr8 - not sure if that helps.
>
>
> Thx,
> Hubert
>
> Am Fr., 19. Jan. 2024 um 16:24 Uhr schrieb Gilberto Ferreira
> :
>
> >
> > gluster volume set testvol diagnostics.brick-log-level WARNING
> > gluster volume set testvol diagnostics.brick-sys-log-level WARNING
> > gluster volume set testvol 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-24 Thread Hu Bert
Good morning,

hope i got it right... using:
https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3.1/html/administration_guide/ch27s02

mount -t glusterfs -o aux-gfid-mount glusterpub1:/workdata /mnt/workdata

gfid 1:
getfattr -n trusted.glusterfs.pathinfo -e text
/mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
getfattr: Removing leading '/' from absolute path names
# file: mnt/workdata/.gfid/faf59566-10f5-4ddd-8b0c-a87bc6a334fb
trusted.glusterfs.pathinfo="(
(

))"

gfid 2:
getfattr -n trusted.glusterfs.pathinfo -e text
/mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642
getfattr: Removing leading '/' from absolute path names
# file: mnt/workdata/.gfid/60465723-5dc0-4ebe-aced-9f2c12e52642
trusted.glusterfs.pathinfo="(
(

))"

glusterpub1 + glusterpub2 are the good ones, glusterpub3 is the
misbehaving (not healing) one.

The file with gfid 1 is available under
/gluster/md6/workdata/images/133/283/13328349/ on glusterpub1+2
bricks, but missing on glusterpub3 brick.

gfid 2: 
/gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
is present on glusterpub1+2, but not on glusterpub3.
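
(Side note: for regular files the gfid entry under .glusterfs is a hard
link to the real file, so on a good brick the human-readable path can
also be recovered without the aux-gfid mount - a minimal sketch, keeping
in mind that a find over a full brick can take a while:)

find /gluster/md5/workdata -samefile \
  /gluster/md5/workdata/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642 \
  -not -path '*/.glusterfs/*'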


Thx,
Hubert

Am Mi., 24. Jan. 2024 um 17:36 Uhr schrieb Strahil Nikolov
:
>
> Hi,
>
> Can you find and check the files with gfids:
> 60465723-5dc0-4ebe-aced-9f2c12e52642
> faf59566-10f5-4ddd-8b0c-a87bc6a334fb
>
> Use 'getfattr -d -e hex -m. ' command from 
> https://docs.gluster.org/en/main/Troubleshooting/resolving-splitbrain/#analysis-of-the-output
>  .
>
> Best Regards,
> Strahil Nikolov
>
> On Sat, Jan 20, 2024 at 9:44, Hu Bert
>  wrote:
> Good morning,
>
> thx Gilberto, did the first three (set to WARNING), but the last one
> doesn't work. Anyway, with setting these three some new messages
> appear:
>
> [2024-01-20 07:23:58.561106 +] W [MSGID: 114061]
> [client-common.c:796:client_pre_lk_v2] 0-workdata-client-11: remote_fd
> is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb},
> {errno=77}, {error=File descriptor in bad state}]
> [2024-01-20 07:23:58.561177 +] E [MSGID: 108028]
> [afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-3:
> Failed getlk for faf59566-10f5-4ddd-8b0c-a87bc6a334fb [File descriptor
> in bad state]
> [2024-01-20 07:23:58.562151 +] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-11:
> remote operation failed.
> [{path=},
> {gfid=faf59566-10f5-4ddd-8b0c-a87b
> c6a334fb}, {errno=2}, {error=No such file or directory}]
> [2024-01-20 07:23:58.562296 +] W [MSGID: 114061]
> [client-common.c:530:client_pre_flush_v2] 0-workdata-client-11:
> remote_fd is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb},
> {errno=77}, {error=File descriptor in bad state}]
> [2024-01-20 07:23:58.860552 +] W [MSGID: 114061]
> [client-common.c:796:client_pre_lk_v2] 0-workdata-client-8: remote_fd
> is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642},
> {errno=77}, {error=File descriptor in bad state}]
> [2024-01-20 07:23:58.860608 +] E [MSGID: 108028]
> [afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-2:
> Failed getlk for 60465723-5dc0-4ebe-aced-9f2c12e52642 [File descriptor
> in bad state]
> [2024-01-20 07:23:58.861520 +] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-8:
> remote operation failed.
> [{path=},
> {gfid=60465723-5dc0-4ebe-aced-9f2c1
> 2e52642}, {errno=2}, {error=No such file or directory}]
> [2024-01-20 07:23:58.861640 +] W [MSGID: 114061]
> [client-common.c:530:client_pre_flush_v2] 0-workdata-client-8:
> remote_fd is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642},
> {errno=77}, {error=File descriptor in bad state}]
>
> Not many log entries appear, only a few. Has someone seen error
> messages like these? Setting diagnostics.brick-sys-log-level to DEBUG
> shows way more log entries, uploaded it to:
> https://file.io/spLhlcbMCzr8 - not sure if that helps.
>
>
> Thx,
> Hubert
>
> Am Fr., 19. Jan. 2024 um 16:24 Uhr schrieb Gilberto Ferreira
> :
>
> >
> > gluster volume set testvol diagnostics.brick-log-level WARNING
> > gluster volume set testvol diagnostics.brick-sys-log-level WARNING
> > gluster volume set testvol diagnostics.client-log-level ERROR
> > gluster --log-level=ERROR volume status
> >
> > ---
> > Gilberto Nunes Ferreira
> >
> >
> >
> >
> >
> >
> > Em sex., 19 de jan. de 2024 às 05:49, Hu Bert  
> > escreveu:
> >>
> >> Hi Strahil,
> >> hm, don't get me wrong, it may sound a bit stupid, but... where do i
> >> set the log level? Using debian...
> >>
> >> https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level
> >>
> >> ls /etc/glusterfs/
> >> eventsconfig.json  glusterfs-georep-logrotate
> >> gluster-rsyslog-5.8.conf  group-db-workload  group-gluster-block
> >>  group-nl-cache  group-virt.example  logger.conf.example
> >> glusterd.vol  glusterfs-logrotate
> >> gluster-rsyslog-7.2.conf  group-distributed-virt  

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-24 Thread Strahil Nikolov
Hi,
Can you find and check the files with gfids:
60465723-5dc0-4ebe-aced-9f2c12e52642
faf59566-10f5-4ddd-8b0c-a87bc6a334fb
Use 'getfattr -d -e hex -m. ' command from 
https://docs.gluster.org/en/main/Troubleshooting/resolving-splitbrain/#analysis-of-the-output .
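
For example, run against the file's location on each brick (not on the
fuse mount) and compare the trusted.gfid and trusted.afr.workdata-client-*
values across the bricks; the path below is only an illustration of where
the gfid hardlink lives on a brick:

getfattr -d -m . -e hex /path/to/brick/.glusterfs/60/46/60465723-5dc0-4ebe-aced-9f2c12e52642
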
Best Regards,
Strahil Nikolov

On Sat, Jan 20, 2024 at 9:44, Hu Bert wrote:
Good morning,

thx Gilberto, did the first three (set to WARNING), but the last one
doesn't work. Anyway, with setting these three some new messages
appear:

[2024-01-20 07:23:58.561106 +] W [MSGID: 114061]
[client-common.c:796:client_pre_lk_v2] 0-workdata-client-11: remote_fd
is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb},
{errno=77}, {error=File descriptor in bad state}]
[2024-01-20 07:23:58.561177 +] E [MSGID: 108028]
[afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-3:
Failed getlk for faf59566-10f5-4ddd-8b0c-a87bc6a334fb [File descriptor
in bad state]
[2024-01-20 07:23:58.562151 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-11:
remote operation failed.
[{path=},
{gfid=faf59566-10f5-4ddd-8b0c-a87b
c6a334fb}, {errno=2}, {error=No such file or directory}]
[2024-01-20 07:23:58.562296 +] W [MSGID: 114061]
[client-common.c:530:client_pre_flush_v2] 0-workdata-client-11:
remote_fd is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb},
{errno=77}, {error=File descriptor in bad state}]
[2024-01-20 07:23:58.860552 +] W [MSGID: 114061]
[client-common.c:796:client_pre_lk_v2] 0-workdata-client-8: remote_fd
is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642},
{errno=77}, {error=File descriptor in bad state}]
[2024-01-20 07:23:58.860608 +] E [MSGID: 108028]
[afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-2:
Failed getlk for 60465723-5dc0-4ebe-aced-9f2c12e52642 [File descriptor
in bad state]
[2024-01-20 07:23:58.861520 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-8:
remote operation failed.
[{path=},
{gfid=60465723-5dc0-4ebe-aced-9f2c1
2e52642}, {errno=2}, {error=No such file or directory}]
[2024-01-20 07:23:58.861640 +] W [MSGID: 114061]
[client-common.c:530:client_pre_flush_v2] 0-workdata-client-8:
remote_fd is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642},
{errno=77}, {error=File descriptor in bad state}]

Not many log entries appear, only a few. Has someone seen error
messages like these? Setting diagnostics.brick-sys-log-level to DEBUG
shows way more log entries, uploaded it to:
https://file.io/spLhlcbMCzr8 - not sure if that helps.


Thx,
Hubert

Am Fr., 19. Jan. 2024 um 16:24 Uhr schrieb Gilberto Ferreira
:
>
> gluster volume set testvol diagnostics.brick-log-level WARNING
> gluster volume set testvol diagnostics.brick-sys-log-level WARNING
> gluster volume set testvol diagnostics.client-log-level ERROR
> gluster --log-level=ERROR volume status
>
> ---
> Gilberto Nunes Ferreira
>
>
>
>
>
>
> Em sex., 19 de jan. de 2024 às 05:49, Hu Bert  
> escreveu:
>>
>> Hi Strahil,
>> hm, don't get me wrong, it may sound a bit stupid, but... where do i
>> set the log level? Using debian...
>>
>> https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level
>>
>> ls /etc/glusterfs/
>> eventsconfig.json  glusterfs-georep-logrotate
>> gluster-rsyslog-5.8.conf  group-db-workload      group-gluster-block
>>  group-nl-cache  group-virt.example  logger.conf.example
>> glusterd.vol      glusterfs-logrotate
>> gluster-rsyslog-7.2.conf  group-distributed-virt  group-metadata-cache
>>  group-samba    gsyncd.conf        thin-arbiter.vol
>>
>> checked: /etc/glusterfs/logger.conf.example
>>
>> # To enable enhanced logging capabilities,
>> #
>> # 1. rename this file to /etc/glusterfs/logger.conf
>> #
>> # 2. rename /etc/rsyslog.d/gluster.conf.example to
>> #    /etc/rsyslog.d/gluster.conf
>> #
>> # This change requires restart of all gluster services/volumes and
>> # rsyslog.
>>
>> tried (to test): /etc/glusterfs/logger.conf with " LOG_LEVEL='WARNING' "
>>
>> restart glusterd on that node, but this doesn't work, log-level stays
>> on INFO. /etc/rsyslog.d/gluster.conf.example does not exist. Probably
>> /etc/rsyslog.conf on debian. But first it would be better to know
>> where to set the log-level for glusterd.
>>
>> Depending on how much the DEBUG log-level talks ;-) i could assign up
>> to 100G to /var
>>
>>
>> Thx & best regards,
>> Hubert
>>
>>
>> Am Do., 18. Jan. 2024 um 22:58 Uhr schrieb Strahil Nikolov
>> :
>> >
>> > Are you able to set the logs to debug level ?
>> > It might provide a clue what it is going on.
>> >
>> > Best Regards,
>> > Strahil Nikolov
>> >
>> > On Thu, Jan 18, 2024 at 13:08, Diego Zuccato
>> >  wrote:
>> > That's the same kind of errors I keep seeing on my 2 clusters,
>> > regenerated some months ago. Seems a pseudo-split-brain that should be
>> > impossible on a replica 3 cluster but keeps happening.
>> > Sadly going to ditch Gluster 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-19 Thread Hu Bert
Good morning,

thx Gilberto, did the first three (set to WARNING), but the last one
doesn't work. Anyway, with setting these three some new messages
appear:

[2024-01-20 07:23:58.561106 +] W [MSGID: 114061]
[client-common.c:796:client_pre_lk_v2] 0-workdata-client-11: remote_fd
is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb},
{errno=77}, {error=File descriptor in bad state}]
[2024-01-20 07:23:58.561177 +] E [MSGID: 108028]
[afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-3:
Failed getlk for faf59566-10f5-4ddd-8b0c-a87bc6a334fb [File descriptor
in bad state]
[2024-01-20 07:23:58.562151 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-11:
remote operation failed.
[{path=},
{gfid=faf59566-10f5-4ddd-8b0c-a87b
c6a334fb}, {errno=2}, {error=No such file or directory}]
[2024-01-20 07:23:58.562296 +] W [MSGID: 114061]
[client-common.c:530:client_pre_flush_v2] 0-workdata-client-11:
remote_fd is -1. EBADFD [{gfid=faf59566-10f5-4ddd-8b0c-a87bc6a334fb},
{errno=77}, {error=File descriptor in bad state}]
[2024-01-20 07:23:58.860552 +] W [MSGID: 114061]
[client-common.c:796:client_pre_lk_v2] 0-workdata-client-8: remote_fd
is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642},
{errno=77}, {error=File descriptor in bad state}]
[2024-01-20 07:23:58.860608 +] E [MSGID: 108028]
[afr-open.c:361:afr_is_reopen_allowed_cbk] 0-workdata-replicate-2:
Failed getlk for 60465723-5dc0-4ebe-aced-9f2c12e52642 [File descriptor
in bad state]
[2024-01-20 07:23:58.861520 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-8:
remote operation failed.
[{path=},
{gfid=60465723-5dc0-4ebe-aced-9f2c1
2e52642}, {errno=2}, {error=No such file or directory}]
[2024-01-20 07:23:58.861640 +] W [MSGID: 114061]
[client-common.c:530:client_pre_flush_v2] 0-workdata-client-8:
remote_fd is -1. EBADFD [{gfid=60465723-5dc0-4ebe-aced-9f2c12e52642},
{errno=77}, {error=File descriptor in bad state}]

Not many log entries appear, only a few. Has someone seen error
messages like these? Setting diagnostics.brick-sys-log-level to DEBUG
shows way more log entries, uploaded it to:
https://file.io/spLhlcbMCzr8 - not sure if that helps.
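
Side note: the current value can be checked, and the DEBUG setting
reverted afterwards, with (if i remember the syntax right):

gluster volume get workdata diagnostics.brick-sys-log-level
gluster volume reset workdata diagnostics.brick-sys-log-level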


Thx,
Hubert

Am Fr., 19. Jan. 2024 um 16:24 Uhr schrieb Gilberto Ferreira
:
>
> gluster volume set testvol diagnostics.brick-log-level WARNING
> gluster volume set testvol diagnostics.brick-sys-log-level WARNING
> gluster volume set testvol diagnostics.client-log-level ERROR
> gluster --log-level=ERROR volume status
>
> ---
> Gilberto Nunes Ferreira
>
>
>
>
>
>
> Em sex., 19 de jan. de 2024 às 05:49, Hu Bert  
> escreveu:
>>
>> Hi Strahil,
>> hm, don't get me wrong, it may sound a bit stupid, but... where do i
>> set the log level? Using debian...
>>
>> https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level
>>
>> ls /etc/glusterfs/
>> eventsconfig.json  glusterfs-georep-logrotate
>> gluster-rsyslog-5.8.conf  group-db-workload   group-gluster-block
>>  group-nl-cache  group-virt.example  logger.conf.example
>> glusterd.vol   glusterfs-logrotate
>> gluster-rsyslog-7.2.conf  group-distributed-virt  group-metadata-cache
>>  group-samba gsyncd.conf thin-arbiter.vol
>>
>> checked: /etc/glusterfs/logger.conf.example
>>
>> # To enable enhanced logging capabilities,
>> #
>> # 1. rename this file to /etc/glusterfs/logger.conf
>> #
>> # 2. rename /etc/rsyslog.d/gluster.conf.example to
>> #/etc/rsyslog.d/gluster.conf
>> #
>> # This change requires restart of all gluster services/volumes and
>> # rsyslog.
>>
>> tried (to test): /etc/glusterfs/logger.conf with " LOG_LEVEL='WARNING' "
>>
>> restart glusterd on that node, but this doesn't work, log-level stays
>> on INFO. /etc/rsyslog.d/gluster.conf.example does not exist. Probably
>> /etc/rsyslog.conf on debian. But first it would be better to know
>> where to set the log-level for glusterd.
>>
>> Depending on how much the DEBUG log-level talks ;-) i could assign up
>> to 100G to /var
>>
>>
>> Thx & best regards,
>> Hubert
>>
>>
>> Am Do., 18. Jan. 2024 um 22:58 Uhr schrieb Strahil Nikolov
>> :
>> >
>> > Are you able to set the logs to debug level ?
>> > It might provide a clue what it is going on.
>> >
>> > Best Regards,
>> > Strahil Nikolov
>> >
>> > On Thu, Jan 18, 2024 at 13:08, Diego Zuccato
>> >  wrote:
>> > That's the same kind of errors I keep seeing on my 2 clusters,
>> > regenerated some months ago. Seems a pseudo-split-brain that should be
>> > impossible on a replica 3 cluster but keeps happening.
>> > Sadly going to ditch Gluster ASAP.
>> >
>> > Diego
>> >
>> > Il 18/01/2024 07:11, Hu Bert ha scritto:
>> > > Good morning,
>> > > heal still not running. Pending heals now sum up to 60K per brick.
>> > > Heal was starting instantly e.g. after server reboot with version
>> > > 10.4, but doesn't with version 11. What could be wrong?
>> > >
>> > > I only see these errors 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-19 Thread Gilberto Ferreira
gluster volume set testvol diagnostics.brick-log-level WARNING
gluster volume set testvol diagnostics.brick-sys-log-level WARNING
gluster volume set testvol diagnostics.client-log-level ERROR
gluster --log-level=ERROR volume status
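
To check that the options were applied (testvol again being just the
volume name placeholder):

gluster volume get testvol diagnostics.brick-log-level
gluster volume get testvol diagnostics.client-log-level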

---
Gilberto Nunes Ferreira






Em sex., 19 de jan. de 2024 às 05:49, Hu Bert 
escreveu:

> Hi Strahil,
> hm, don't get me wrong, it may sound a bit stupid, but... where do i
> set the log level? Using debian...
>
>
> https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level
>
> ls /etc/glusterfs/
> eventsconfig.json  glusterfs-georep-logrotate
> gluster-rsyslog-5.8.conf  group-db-workload   group-gluster-block
>  group-nl-cache  group-virt.example  logger.conf.example
> glusterd.vol   glusterfs-logrotate
> gluster-rsyslog-7.2.conf  group-distributed-virt  group-metadata-cache
>  group-samba gsyncd.conf thin-arbiter.vol
>
> checked: /etc/glusterfs/logger.conf.example
>
> # To enable enhanced logging capabilities,
> #
> # 1. rename this file to /etc/glusterfs/logger.conf
> #
> # 2. rename /etc/rsyslog.d/gluster.conf.example to
> #/etc/rsyslog.d/gluster.conf
> #
> # This change requires restart of all gluster services/volumes and
> # rsyslog.
>
> tried (to test): /etc/glusterfs/logger.conf with " LOG_LEVEL='WARNING' "
>
> restart glusterd on that node, but this doesn't work, log-level stays
> on INFO. /etc/rsyslog.d/gluster.conf.example does not exist. Probably
> /etc/rsyslog.conf on debian. But first it would be better to know
> where to set the log-level for glusterd.
>
> Depending on how much the DEBUG log-level talks ;-) i could assign up
> to 100G to /var
>
>
> Thx & best regards,
> Hubert
>
>
> Am Do., 18. Jan. 2024 um 22:58 Uhr schrieb Strahil Nikolov
> :
> >
> > Are you able to set the logs to debug level ?
> > It might provide a clue what it is going on.
> >
> > Best Regards,
> > Strahil Nikolov
> >
> > On Thu, Jan 18, 2024 at 13:08, Diego Zuccato
> >  wrote:
> > That's the same kind of errors I keep seeing on my 2 clusters,
> > regenerated some months ago. Seems a pseudo-split-brain that should be
> > impossible on a replica 3 cluster but keeps happening.
> > Sadly going to ditch Gluster ASAP.
> >
> > Diego
> >
> > Il 18/01/2024 07:11, Hu Bert ha scritto:
> > > Good morning,
> > > heal still not running. Pending heals now sum up to 60K per brick.
> > > Heal was starting instantly e.g. after server reboot with version
> > > 10.4, but doesn't with version 11. What could be wrong?
> > >
> > > I only see these errors on one of the "good" servers in glustershd.log:
> > >
> > > [2024-01-18 06:08:57.328480 +] W [MSGID: 114031]
> > > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0:
> > > remote operation failed.
> > > [{path=},
> > > {gfid=cb39a1e4-2a4c-4727-861d-3ed9e
> > > f00681b}, {errno=2}, {error=No such file or directory}]
> > > [2024-01-18 06:08:57.594051 +] W [MSGID: 114031]
> > > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1:
> > > remote operation failed.
> > > [{path=},
> > > {gfid=3e9b178c-ae1f-4d85-ae47-fc539
> > > d94dd11}, {errno=2}, {error=No such file or directory}]
> > >
> > > About 7K today. Any ideas? Someone?
> > >
> > >
> > > Best regards,
> > > Hubert
> > >
> > > Am Mi., 17. Jan. 2024 um 11:24 Uhr schrieb Hu Bert <
> revi...@googlemail.com>:
> > >>
> > >> ok, finally managed to get all servers, volumes etc runnung, but took
> > >> a couple of restarts, cksum checks etc.
> > >>
> > >> One problem: a volume doesn't heal automatically or doesn't heal at
> all.
> > >>
> > >> gluster volume status
> > >> Status of volume: workdata
> > >> Gluster process                            TCP Port  RDMA Port  Online  Pid
> > >> --
> > >> Brick glusterpub1:/gluster/md3/workdata    58832    0          Y      3436
> > >> Brick glusterpub2:/gluster/md3/workdata    59315    0          Y      1526
> > >> Brick glusterpub3:/gluster/md3/workdata    56917    0          Y      1952
> > >> Brick glusterpub1:/gluster/md4/workdata    59688    0          Y      3755
> > >> Brick glusterpub2:/gluster/md4/workdata    60271    0          Y      2271
> > >> Brick glusterpub3:/gluster/md4/workdata    49461    0          Y      2399
> > >> Brick glusterpub1:/gluster/md5/workdata    54651    0          Y      4208
> > >> Brick glusterpub2:/gluster/md5/workdata    49685    0          Y      2751
> > >> Brick glusterpub3:/gluster/md5/workdata    59202    0          Y      2803
> > >> Brick glusterpub1:/gluster/md6/workdata    55829    0          Y      4583
> > >> Brick glusterpub2:/gluster/md6/workdata    50455    0          Y      3296
> > >> Brick glusterpub3:/gluster/md6/workdata    50262    0          Y      3237
> > >> Brick glusterpub1:/gluster/md7/workdata    52238    0          Y      5014
> > >> Brick glusterpub2:/gluster/md7/workdata    52474    0          Y      3673
> > >> Brick 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-19 Thread Hu Bert
Hi Strahil,
hm, don't get me wrong, it may sound a bit stupid, but... where do i
set the log level? Using debian...

https://access.redhat.com/documentation/de-de/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level

ls /etc/glusterfs/
eventsconfig.json  glusterfs-georep-logrotate
gluster-rsyslog-5.8.conf  group-db-workload   group-gluster-block
 group-nl-cache  group-virt.example  logger.conf.example
glusterd.vol   glusterfs-logrotate
gluster-rsyslog-7.2.conf  group-distributed-virt  group-metadata-cache
 group-samba gsyncd.conf thin-arbiter.vol

checked: /etc/glusterfs/logger.conf.example

# To enable enhanced logging capabilities,
#
# 1. rename this file to /etc/glusterfs/logger.conf
#
# 2. rename /etc/rsyslog.d/gluster.conf.example to
#/etc/rsyslog.d/gluster.conf
#
# This change requires restart of all gluster services/volumes and
# rsyslog.

tried (to test): /etc/glusterfs/logger.conf with " LOG_LEVEL='WARNING' "

restart glusterd on that node, but this doesn't work, log-level stays
on INFO. /etc/rsyslog.d/gluster.conf.example does not exist. Probably
/etc/rsyslog.conf on debian. But first it would be better to know
where to set the log-level for glusterd.
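
Maybe the systemd unit shows where glusterd reads its level from;
something like this might do the trick (file names are a guess for the
Debian packaging and not verified here):

systemctl cat glusterd     # look for a --log-level switch or an EnvironmentFile= line
echo 'LOG_LEVEL=DEBUG' >> /etc/default/glusterd   # or whatever EnvironmentFile the unit points to
systemctl restart glusterd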

Depending on how much the DEBUG log-level talks ;-) i could assign up
to 100G to /var


Thx & best regards,
Hubert


Am Do., 18. Jan. 2024 um 22:58 Uhr schrieb Strahil Nikolov
:
>
> Are you able to set the logs to debug level ?
> It might provide a clue what it is going on.
>
> Best Regards,
> Strahil Nikolov
>
> On Thu, Jan 18, 2024 at 13:08, Diego Zuccato
>  wrote:
> That's the same kind of errors I keep seeing on my 2 clusters,
> regenerated some months ago. Seems a pseudo-split-brain that should be
> impossible on a replica 3 cluster but keeps happening.
> Sadly going to ditch Gluster ASAP.
>
> Diego
>
> Il 18/01/2024 07:11, Hu Bert ha scritto:
> > Good morning,
> > heal still not running. Pending heals now sum up to 60K per brick.
> > Heal was starting instantly e.g. after server reboot with version
> > 10.4, but doesn't with version 11. What could be wrong?
> >
> > I only see these errors on one of the "good" servers in glustershd.log:
> >
> > [2024-01-18 06:08:57.328480 +] W [MSGID: 114031]
> > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0:
> > remote operation failed.
> > [{path=},
> > {gfid=cb39a1e4-2a4c-4727-861d-3ed9e
> > f00681b}, {errno=2}, {error=No such file or directory}]
> > [2024-01-18 06:08:57.594051 +] W [MSGID: 114031]
> > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1:
> > remote operation failed.
> > [{path=},
> > {gfid=3e9b178c-ae1f-4d85-ae47-fc539
> > d94dd11}, {errno=2}, {error=No such file or directory}]
> >
> > About 7K today. Any ideas? Someone?
> >
> >
> > Best regards,
> > Hubert
> >
> > Am Mi., 17. Jan. 2024 um 11:24 Uhr schrieb Hu Bert :
> >>
> >> ok, finally managed to get all servers, volumes etc runnung, but took
> >> a couple of restarts, cksum checks etc.
> >>
> >> One problem: a volume doesn't heal automatically or doesn't heal at all.
> >>
> >> gluster volume status
> >> Status of volume: workdata
> >> Gluster processTCP Port  RDMA Port  Online  Pid
> >> --
> >> Brick glusterpub1:/gluster/md3/workdata588320  Y  3436
> >> Brick glusterpub2:/gluster/md3/workdata593150  Y  1526
> >> Brick glusterpub3:/gluster/md3/workdata569170  Y  1952
> >> Brick glusterpub1:/gluster/md4/workdata596880  Y  3755
> >> Brick glusterpub2:/gluster/md4/workdata602710  Y  2271
> >> Brick glusterpub3:/gluster/md4/workdata494610  Y  2399
> >> Brick glusterpub1:/gluster/md5/workdata546510  Y  4208
> >> Brick glusterpub2:/gluster/md5/workdata496850  Y  2751
> >> Brick glusterpub3:/gluster/md5/workdata592020  Y  2803
> >> Brick glusterpub1:/gluster/md6/workdata558290  Y  4583
> >> Brick glusterpub2:/gluster/md6/workdata504550  Y  3296
> >> Brick glusterpub3:/gluster/md6/workdata502620  Y  3237
> >> Brick glusterpub1:/gluster/md7/workdata522380  Y  5014
> >> Brick glusterpub2:/gluster/md7/workdata524740  Y  3673
> >> Brick glusterpub3:/gluster/md7/workdata579660  Y  3653
> >> Self-heal Daemon on localhost  N/A  N/AY  4141
> >> Self-heal Daemon on glusterpub1N/A  N/AY  5570
> >> Self-heal Daemon on glusterpub2N/A  N/AY  4139
> >>
> >> "gluster volume heal workdata info" lists a lot of files per brick.
> >> "gluster volume heal workdata statistics heal-count" shows thousands
> >> of files per brick.
> >> "gluster volume heal workdata enable" has no effect.
> >>
> 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-18 Thread Diego Zuccato
I don't want to hijack the thread. And in my case setting logs to debug 
would fill my /var partitions in no time. Maybe the OP can.


Diego

Il 18/01/2024 22:58, Strahil Nikolov ha scritto:

Are you able to set the logs to debug level ?
It might provide a clue what it is going on.

Best Regards,
Strahil Nikolov

On Thu, Jan 18, 2024 at 13:08, Diego Zuccato
 wrote:
That's the same kind of errors I keep seeing on my 2 clusters,
regenerated some months ago. Seems a pseudo-split-brain that should be
impossible on a replica 3 cluster but keeps happening.
Sadly going to ditch Gluster ASAP.

Diego

Il 18/01/2024 07:11, Hu Bert ha scritto:
 > Good morning,
 > heal still not running. Pending heals now sum up to 60K per brick.
 > Heal was starting instantly e.g. after server reboot with version
 > 10.4, but doesn't with version 11. What could be wrong?
 >
 > I only see these errors on one of the "good" servers in
glustershd.log:
 >
 > [2024-01-18 06:08:57.328480 +] W [MSGID: 114031]
 > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0:
 > remote operation failed.
 > [{path=},
 > {gfid=cb39a1e4-2a4c-4727-861d-3ed9e
 > f00681b}, {errno=2}, {error=No such file or directory}]
 > [2024-01-18 06:08:57.594051 +] W [MSGID: 114031]
 > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1:
 > remote operation failed.
 > [{path=},
 > {gfid=3e9b178c-ae1f-4d85-ae47-fc539
 > d94dd11}, {errno=2}, {error=No such file or directory}]
 >
 > About 7K today. Any ideas? Someone?
 >
 >
 > Best regards,
 > Hubert
 >
 > Am Mi., 17. Jan. 2024 um 11:24 Uhr schrieb Hu Bert <revi...@googlemail.com>:
 >>
 >> ok, finally managed to get all servers, volumes etc runnung, but
took
 >> a couple of restarts, cksum checks etc.
 >>
 >> One problem: a volume doesn't heal automatically or doesn't heal
at all.
 >>
 >> gluster volume status
 >> Status of volume: workdata
 >> Gluster process                            TCP Port  RDMA Port  Online  Pid
 >> --
 >> Brick glusterpub1:/gluster/md3/workdata    58832    0          Y      3436
 >> Brick glusterpub2:/gluster/md3/workdata    59315    0          Y      1526
 >> Brick glusterpub3:/gluster/md3/workdata    56917    0          Y      1952
 >> Brick glusterpub1:/gluster/md4/workdata    59688    0          Y      3755
 >> Brick glusterpub2:/gluster/md4/workdata    60271    0          Y      2271
 >> Brick glusterpub3:/gluster/md4/workdata    49461    0          Y      2399
 >> Brick glusterpub1:/gluster/md5/workdata    54651    0          Y      4208
 >> Brick glusterpub2:/gluster/md5/workdata    49685    0          Y      2751
 >> Brick glusterpub3:/gluster/md5/workdata    59202    0          Y      2803
 >> Brick glusterpub1:/gluster/md6/workdata    55829    0          Y      4583
 >> Brick glusterpub2:/gluster/md6/workdata    50455    0          Y      3296
 >> Brick glusterpub3:/gluster/md6/workdata    50262    0          Y      3237
 >> Brick glusterpub1:/gluster/md7/workdata    52238    0          Y      5014
 >> Brick glusterpub2:/gluster/md7/workdata    52474    0          Y      3673
 >> Brick glusterpub3:/gluster/md7/workdata    57966    0          Y      3653
 >> Self-heal Daemon on localhost              N/A      N/A        Y      4141
 >> Self-heal Daemon on glusterpub1            N/A      N/A        Y      5570
 >> Self-heal Daemon on glusterpub2            N/A      N/A        Y      4139

 >>
 >> "gluster volume heal workdata info" lists a lot of files per brick.
 >> "gluster volume heal workdata statistics heal-count" shows thousands
 >> of files per brick.
 >> "gluster volume heal workdata enable" has no effect.
 >>
 >> gluster volume heal workdata full
 >> Launching heal operation to perform full self heal on volume
workdata
 >> has been successful
 >> Use heal info commands to check status.
 >>
 >> -> not doing anything at all. And nothing happening on the 2 "good"
 >> servers in e.g. glustershd.log. Heal was working as expected on
 >> version 10.4, but here... silence. Someone has an idea?
 >>
 >>
 >> Best regards,
 >> Hubert
 >>
 >> Am Di., 16. Jan. 2024 um 13:44 Uhr schrieb Gilberto Ferreira
 >> <gilberto.nune...@gmail.com>:
 >>>
 >>> Ah! Indeed! You need to perform an upgrade in the clients as well.
 >>>
 >>>
 >>>
 >>>
 >>>
 >>>
 >>>
 >>>
 >>> Em ter., 16 de jan. de 2024 às 03:12, Hu Bert
 > <revi...@googlemail.com> escreveu:
 
  morning to those still 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-18 Thread Strahil Nikolov
Are you able to set the logs to debug level?
It might provide a clue about what is going on.

Best Regards,
Strahil Nikolov

On Thu, Jan 18, 2024 at 13:08, Diego Zuccato wrote:
That's the same kind of errors I keep seeing on my 2 clusters, 
regenerated some months ago. Seems a pseudo-split-brain that should be 
impossible on a replica 3 cluster but keeps happening.
Sadly going to ditch Gluster ASAP.

Diego

Il 18/01/2024 07:11, Hu Bert ha scritto:
> Good morning,
> heal still not running. Pending heals now sum up to 60K per brick.
> Heal was starting instantly e.g. after server reboot with version
> 10.4, but doesn't with version 11. What could be wrong?
> 
> I only see these errors on one of the "good" servers in glustershd.log:
> 
> [2024-01-18 06:08:57.328480 +] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0:
> remote operation failed.
> [{path=},
> {gfid=cb39a1e4-2a4c-4727-861d-3ed9e
> f00681b}, {errno=2}, {error=No such file or directory}]
> [2024-01-18 06:08:57.594051 +] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1:
> remote operation failed.
> [{path=},
> {gfid=3e9b178c-ae1f-4d85-ae47-fc539
> d94dd11}, {errno=2}, {error=No such file or directory}]
> 
> About 7K today. Any ideas? Someone?
> 
> 
> Best regards,
> Hubert
> 
> Am Mi., 17. Jan. 2024 um 11:24 Uhr schrieb Hu Bert :
>>
>> ok, finally managed to get all servers, volumes etc runnung, but took
>> a couple of restarts, cksum checks etc.
>>
>> One problem: a volume doesn't heal automatically or doesn't heal at all.
>>
>> gluster volume status
>> Status of volume: workdata
>> Gluster process                            TCP Port  RDMA Port  Online  Pid
>> --
>> Brick glusterpub1:/gluster/md3/workdata    58832    0          Y      3436
>> Brick glusterpub2:/gluster/md3/workdata    59315    0          Y      1526
>> Brick glusterpub3:/gluster/md3/workdata    56917    0          Y      1952
>> Brick glusterpub1:/gluster/md4/workdata    59688    0          Y      3755
>> Brick glusterpub2:/gluster/md4/workdata    60271    0          Y      2271
>> Brick glusterpub3:/gluster/md4/workdata    49461    0          Y      2399
>> Brick glusterpub1:/gluster/md5/workdata    54651    0          Y      4208
>> Brick glusterpub2:/gluster/md5/workdata    49685    0          Y      2751
>> Brick glusterpub3:/gluster/md5/workdata    59202    0          Y      2803
>> Brick glusterpub1:/gluster/md6/workdata    55829    0          Y      4583
>> Brick glusterpub2:/gluster/md6/workdata    50455    0          Y      3296
>> Brick glusterpub3:/gluster/md6/workdata    50262    0          Y      3237
>> Brick glusterpub1:/gluster/md7/workdata    52238    0          Y      5014
>> Brick glusterpub2:/gluster/md7/workdata    52474    0          Y      3673
>> Brick glusterpub3:/gluster/md7/workdata    57966    0          Y      3653
>> Self-heal Daemon on localhost              N/A      N/A        Y      4141
>> Self-heal Daemon on glusterpub1            N/A      N/A        Y      5570
>> Self-heal Daemon on glusterpub2            N/A      N/A        Y      4139
>>
>> "gluster volume heal workdata info" lists a lot of files per brick.
>> "gluster volume heal workdata statistics heal-count" shows thousands
>> of files per brick.
>> "gluster volume heal workdata enable" has no effect.
>>
>> gluster volume heal workdata full
>> Launching heal operation to perform full self heal on volume workdata
>> has been successful
>> Use heal info commands to check status.
>>
>> -> not doing anything at all. And nothing happening on the 2 "good"
>> servers in e.g. glustershd.log. Heal was working as expected on
>> version 10.4, but here... silence. Someone has an idea?
>>
>>
>> Best regards,
>> Hubert
>>
>> Am Di., 16. Jan. 2024 um 13:44 Uhr schrieb Gilberto Ferreira
>> :
>>>
>>> Ah! Indeed! You need to perform an upgrade in the clients as well.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Em ter., 16 de jan. de 2024 às 03:12, Hu Bert  
>>> escreveu:

 morning to those still reading :-)

 i found this: 
 https://docs.gluster.org/en/main/Troubleshooting/troubleshooting-glusterd/#common-issues-and-how-to-resolve-them

 there's a paragraph about "peer rejected" with the same error message,
 telling me: "Update the cluster.op-version" - i had only updated the
 server nodes, but not the clients. So upgrading the cluster.op-version
 wasn't possible at this time. So... upgrading the clients to version
 11.1 and then the op-version should solve the problem?


 Thx,
 Hubert

 Am Mo., 15. Jan. 2024 um 09:16 Uhr schrieb Hu Bert 
 :
>
> Hi,
> just upgraded some gluster servers from version 10.4 to version 11.1.
> Debian bullseye & bookworm. When only installing the packages: good,
> servers, volumes etc. work as expected.
>
> But one needs to 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-18 Thread Hu Bert
Thx for your answer. We don't have that much data (but 33 TB anyway),
but millions of files in total, on normal SATA disks. So copying stuff
away and back, with a downtime maybe, is not manageable.

Good thing is: the data can be re-calculated, as they are derived from
source data. But one needs some new hardware for that. And
maybe/probably think of a new solution for that, as we all know about
the state of the gluster project.

Thx,
Hubert

Am Do., 18. Jan. 2024 um 09:33 Uhr schrieb Diego Zuccato
:
>
> Since glusterd does not consider it a split brain, you can't solve it
> with standard split brain tools.
> I've found no way to resolve it except by manually handling one file at
> a time: completely unmanageable with thousands of files and having to
> juggle between actual path on brick and metadata files!
> Previously I "fixed" it by:
> 1) moving all the data from the volume to a temp space
> 2) recovering from the bricks what was inaccessible from the mountpoint
> (keeping different file revisions for the conflicting ones)
> 3) destroying and recreating the volume
> 4) copying back the data from the backup
>
> When gluster gets used because you need lots of space (we had more than
> 400TB on 3 nodes with 30x12TB SAS disks in "replica 3 arbiter 1"), where
> do you park the data? Is the official solution "just have a second
> cluster idle for when you need to fix errors"?
> It took more than a month of downtime this summer, and after less than 6
> months I'd have to repeat it? Users are rightly quite upset...
>
> Diego
>
> Il 18/01/2024 09:17, Hu Bert ha scritto:
> > were you able to solve the problem? Can it be treated like a "normal"
> > split brain? 'gluster peer status' and 'gluster volume status' are ok,
> > so kinda looks like "pseudo"...
> >
> >
> > hubert
> >
> > Am Do., 18. Jan. 2024 um 08:28 Uhr schrieb Diego Zuccato
> > :
> >>
> >> That's the same kind of errors I keep seeing on my 2 clusters,
> >> regenerated some months ago. Seems a pseudo-split-brain that should be
> >> impossible on a replica 3 cluster but keeps happening.
> >> Sadly going to ditch Gluster ASAP.
> >>
> >> Diego
> >>
> >> Il 18/01/2024 07:11, Hu Bert ha scritto:
> >>> Good morning,
> >>> heal still not running. Pending heals now sum up to 60K per brick.
> >>> Heal was starting instantly e.g. after server reboot with version
> >>> 10.4, but doesn't with version 11. What could be wrong?
> >>>
> >>> I only see these errors on one of the "good" servers in glustershd.log:
> >>>
> >>> [2024-01-18 06:08:57.328480 +] W [MSGID: 114031]
> >>> [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0:
> >>> remote operation failed.
> >>> [{path=},
> >>> {gfid=cb39a1e4-2a4c-4727-861d-3ed9e
> >>> f00681b}, {errno=2}, {error=No such file or directory}]
> >>> [2024-01-18 06:08:57.594051 +] W [MSGID: 114031]
> >>> [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1:
> >>> remote operation failed.
> >>> [{path=},
> >>> {gfid=3e9b178c-ae1f-4d85-ae47-fc539
> >>> d94dd11}, {errno=2}, {error=No such file or directory}]
> >>>
> >>> About 7K today. Any ideas? Someone?
> >>>
> >>>
> >>> Best regards,
> >>> Hubert
> >>>
> >>> Am Mi., 17. Jan. 2024 um 11:24 Uhr schrieb Hu Bert 
> >>> :
> 
>  ok, finally managed to get all servers, volumes etc runnung, but took
>  a couple of restarts, cksum checks etc.
> 
>  One problem: a volume doesn't heal automatically or doesn't heal at all.
> 
>  gluster volume status
>  Status of volume: workdata
>  Gluster process                            TCP Port  RDMA Port  Online  Pid
>  --
>  Brick glusterpub1:/gluster/md3/workdata    58832    0          Y      3436
>  Brick glusterpub2:/gluster/md3/workdata    59315    0          Y      1526
>  Brick glusterpub3:/gluster/md3/workdata    56917    0          Y      1952
>  Brick glusterpub1:/gluster/md4/workdata    59688    0          Y      3755
>  Brick glusterpub2:/gluster/md4/workdata    60271    0          Y      2271
>  Brick glusterpub3:/gluster/md4/workdata    49461    0          Y      2399
>  Brick glusterpub1:/gluster/md5/workdata    54651    0          Y      4208
>  Brick glusterpub2:/gluster/md5/workdata    49685    0          Y      2751
>  Brick glusterpub3:/gluster/md5/workdata    59202    0          Y      2803
>  Brick glusterpub1:/gluster/md6/workdata    55829    0          Y      4583
>  Brick glusterpub2:/gluster/md6/workdata    50455    0          Y      3296
>  Brick glusterpub3:/gluster/md6/workdata    50262    0          Y      3237
>  Brick glusterpub1:/gluster/md7/workdata    52238    0          Y      5014
>  Brick glusterpub2:/gluster/md7/workdata    52474    0          Y

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-18 Thread Diego Zuccato
Since glusterd does not consider it a split brain, you can't solve it 
with standard split brain tools.
I've found no way to resolve it except by manually handling one file at 
a time: completely unmanageable with thousands of files and having to 
juggle between actual path on brick and metadata files!
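
For a single file the juggling boils down to roughly this (sketch with
placeholders, and only on the brick whose copy you decide to throw away):

BRICK=/path/to/brick                         # brick root on the node with the bad copy (placeholder)
GFID=60465723-5dc0-4ebe-aced-9f2c12e52642    # gfid from the logs (placeholder, taken from this thread)
# the "metadata file" is a hardlink of the real file under .glusterfs (bash substring syntax):
ls -li $BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID
# drop the named file and its gfid link on that brick, then let the shd copy it back:
rm $BRICK/actual/path/of/the/file $BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID
gluster volume heal <volname>                # volume name placeholder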

Previously I "fixed" it by:
1) moving all the data from the volume to a temp space
2) recovering from the bricks what was inaccessible from the mountpoint 
(keeping different file revisions for the conflicting ones)

3) destroying and recreating the volume
4) copying back the data from the backup

When gluster gets used because you need lots of space (we had more than 
400TB on 3 nodes with 30x12TB SAS disks in "replica 3 arbiter 1"), where 
do you park the data? Is the official solution "just have a second 
cluster idle for when you need to fix errors"?
It took more than a month of downtime this summer, and after less than 6 
months I'd have to repeat it? Users are rightly quite upset...


Diego

Il 18/01/2024 09:17, Hu Bert ha scritto:

were you able to solve the problem? Can it be treated like a "normal"
split brain? 'gluster peer status' and 'gluster volume status' are ok,
so kinda looks like "pseudo"...


hubert

Am Do., 18. Jan. 2024 um 08:28 Uhr schrieb Diego Zuccato
:


That's the same kind of errors I keep seeing on my 2 clusters,
regenerated some months ago. Seems a pseudo-split-brain that should be
impossible on a replica 3 cluster but keeps happening.
Sadly going to ditch Gluster ASAP.

Diego

Il 18/01/2024 07:11, Hu Bert ha scritto:

Good morning,
heal still not running. Pending heals now sum up to 60K per brick.
Heal was starting instantly e.g. after server reboot with version
10.4, but doesn't with version 11. What could be wrong?

I only see these errors on one of the "good" servers in glustershd.log:

[2024-01-18 06:08:57.328480 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0:
remote operation failed.
[{path=},
{gfid=cb39a1e4-2a4c-4727-861d-3ed9e
f00681b}, {errno=2}, {error=No such file or directory}]
[2024-01-18 06:08:57.594051 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1:
remote operation failed.
[{path=},
{gfid=3e9b178c-ae1f-4d85-ae47-fc539
d94dd11}, {errno=2}, {error=No such file or directory}]

About 7K today. Any ideas? Someone?


Best regards,
Hubert

Am Mi., 17. Jan. 2024 um 11:24 Uhr schrieb Hu Bert :


ok, finally managed to get all servers, volumes etc runnung, but took
a couple of restarts, cksum checks etc.

One problem: a volume doesn't heal automatically or doesn't heal at all.

gluster volume status
Status of volume: workdata
Gluster process TCP Port  RDMA Port  Online  Pid
--
Brick glusterpub1:/gluster/md3/workdata 58832 0  Y   3436
Brick glusterpub2:/gluster/md3/workdata 59315 0  Y   1526
Brick glusterpub3:/gluster/md3/workdata 56917 0  Y   1952
Brick glusterpub1:/gluster/md4/workdata 59688 0  Y   3755
Brick glusterpub2:/gluster/md4/workdata 60271 0  Y   2271
Brick glusterpub3:/gluster/md4/workdata 49461 0  Y   2399
Brick glusterpub1:/gluster/md5/workdata 54651 0  Y   4208
Brick glusterpub2:/gluster/md5/workdata 49685 0  Y   2751
Brick glusterpub3:/gluster/md5/workdata 59202 0  Y   2803
Brick glusterpub1:/gluster/md6/workdata 55829 0  Y   4583
Brick glusterpub2:/gluster/md6/workdata 50455 0  Y   3296
Brick glusterpub3:/gluster/md6/workdata 50262 0  Y   3237
Brick glusterpub1:/gluster/md7/workdata 52238 0  Y   5014
Brick glusterpub2:/gluster/md7/workdata 52474 0  Y   3673
Brick glusterpub3:/gluster/md7/workdata 57966 0  Y   3653
Self-heal Daemon on localhost   N/A   N/AY   4141
Self-heal Daemon on glusterpub1 N/A   N/AY   5570
Self-heal Daemon on glusterpub2 N/A   N/AY   4139

"gluster volume heal workdata info" lists a lot of files per brick.
"gluster volume heal workdata statistics heal-count" shows thousands
of files per brick.
"gluster volume heal workdata enable" has no effect.

gluster volume heal workdata full
Launching heal operation to perform full self heal on volume workdata
has been successful
Use heal info commands to check status.

-> not doing anything at all. And nothing happening on the 2 "good"
servers in e.g. glustershd.log. Heal was working as expected on
version 10.4, but here... silence. Someone has an idea?


Best regards,
Hubert

Am Di., 16. Jan. 2024 um 13:44 Uhr schrieb Gilberto Ferreira
:


Ah! Indeed! You need to perform an upgrade in the clients as well.








Em ter., 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-18 Thread Hu Bert
were you able to solve the problem? Can it be treated like a "normal"
split brain? 'gluster peer status' and 'gluster volume status' are ok,
so kinda looks like "pseudo"...
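
If i understand the docs right, the way to check for a real split brain
from the CLI would be:

gluster volume heal workdata info split-brain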


hubert

Am Do., 18. Jan. 2024 um 08:28 Uhr schrieb Diego Zuccato
:
>
> That's the same kind of errors I keep seeing on my 2 clusters,
> regenerated some months ago. Seems a pseudo-split-brain that should be
> impossible on a replica 3 cluster but keeps happening.
> Sadly going to ditch Gluster ASAP.
>
> Diego
>
> Il 18/01/2024 07:11, Hu Bert ha scritto:
> > Good morning,
> > heal still not running. Pending heals now sum up to 60K per brick.
> > Heal was starting instantly e.g. after server reboot with version
> > 10.4, but doesn't with version 11. What could be wrong?
> >
> > I only see these errors on one of the "good" servers in glustershd.log:
> >
> > [2024-01-18 06:08:57.328480 +] W [MSGID: 114031]
> > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0:
> > remote operation failed.
> > [{path=},
> > {gfid=cb39a1e4-2a4c-4727-861d-3ed9e
> > f00681b}, {errno=2}, {error=No such file or directory}]
> > [2024-01-18 06:08:57.594051 +] W [MSGID: 114031]
> > [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1:
> > remote operation failed.
> > [{path=},
> > {gfid=3e9b178c-ae1f-4d85-ae47-fc539
> > d94dd11}, {errno=2}, {error=No such file or directory}]
> >
> > About 7K today. Any ideas? Someone?
> >
> >
> > Best regards,
> > Hubert
> >
> > Am Mi., 17. Jan. 2024 um 11:24 Uhr schrieb Hu Bert :
> >>
> >> ok, finally managed to get all servers, volumes etc runnung, but took
> >> a couple of restarts, cksum checks etc.
> >>
> >> One problem: a volume doesn't heal automatically or doesn't heal at all.
> >>
> >> gluster volume status
> >> Status of volume: workdata
> >> Gluster process                            TCP Port  RDMA Port  Online  Pid
> >> --
> >> Brick glusterpub1:/gluster/md3/workdata    58832    0          Y      3436
> >> Brick glusterpub2:/gluster/md3/workdata    59315    0          Y      1526
> >> Brick glusterpub3:/gluster/md3/workdata    56917    0          Y      1952
> >> Brick glusterpub1:/gluster/md4/workdata    59688    0          Y      3755
> >> Brick glusterpub2:/gluster/md4/workdata    60271    0          Y      2271
> >> Brick glusterpub3:/gluster/md4/workdata    49461    0          Y      2399
> >> Brick glusterpub1:/gluster/md5/workdata    54651    0          Y      4208
> >> Brick glusterpub2:/gluster/md5/workdata    49685    0          Y      2751
> >> Brick glusterpub3:/gluster/md5/workdata    59202    0          Y      2803
> >> Brick glusterpub1:/gluster/md6/workdata    55829    0          Y      4583
> >> Brick glusterpub2:/gluster/md6/workdata    50455    0          Y      3296
> >> Brick glusterpub3:/gluster/md6/workdata    50262    0          Y      3237
> >> Brick glusterpub1:/gluster/md7/workdata    52238    0          Y      5014
> >> Brick glusterpub2:/gluster/md7/workdata    52474    0          Y      3673
> >> Brick glusterpub3:/gluster/md7/workdata    57966    0          Y      3653
> >> Self-heal Daemon on localhost              N/A      N/A        Y      4141
> >> Self-heal Daemon on glusterpub1            N/A      N/A        Y      5570
> >> Self-heal Daemon on glusterpub2            N/A      N/A        Y      4139
> >>
> >> "gluster volume heal workdata info" lists a lot of files per brick.
> >> "gluster volume heal workdata statistics heal-count" shows thousands
> >> of files per brick.
> >> "gluster volume heal workdata enable" has no effect.
> >>
> >> gluster volume heal workdata full
> >> Launching heal operation to perform full self heal on volume workdata
> >> has been successful
> >> Use heal info commands to check status.
> >>
> >> -> not doing anything at all. And nothing happening on the 2 "good"
> >> servers in e.g. glustershd.log. Heal was working as expected on
> >> version 10.4, but here... silence. Someone has an idea?
> >>
> >>
> >> Best regards,
> >> Hubert
> >>
> >> Am Di., 16. Jan. 2024 um 13:44 Uhr schrieb Gilberto Ferreira
> >> :
> >>>
> >>> Ah! Indeed! You need to perform an upgrade in the clients as well.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Em ter., 16 de jan. de 2024 às 03:12, Hu Bert  
> >>> escreveu:
> 
>  morning to those still reading :-)
> 
>  i found this: 
>  https://docs.gluster.org/en/main/Troubleshooting/troubleshooting-glusterd/#common-issues-and-how-to-resolve-them
> 
>  there's a paragraph about "peer rejected" with the same error message,
>  telling me: "Update the cluster.op-version" - i had only updated the
>  server nodes, but not the clients. So upgrading the cluster.op-version
>  wasn't possible at this time. So... 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-17 Thread Diego Zuccato
That's the same kind of errors I keep seeing on my 2 clusters, 
regenerated some months ago. Seems a pseudo-split-brain that should be 
impossible on a replica 3 cluster but keeps happening.

Sadly going to ditch Gluster ASAP.

Diego

Il 18/01/2024 07:11, Hu Bert ha scritto:

Good morning,
heal still not running. Pending heals now sum up to 60K per brick.
Heal was starting instantly e.g. after server reboot with version
10.4, but doesn't with version 11. What could be wrong?

I only see these errors on one of the "good" servers in glustershd.log:

[2024-01-18 06:08:57.328480 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0:
remote operation failed.
[{path=},
{gfid=cb39a1e4-2a4c-4727-861d-3ed9e
f00681b}, {errno=2}, {error=No such file or directory}]
[2024-01-18 06:08:57.594051 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1:
remote operation failed.
[{path=},
{gfid=3e9b178c-ae1f-4d85-ae47-fc539
d94dd11}, {errno=2}, {error=No such file or directory}]

About 7K today. Any ideas? Someone?


Best regards,
Hubert

Am Mi., 17. Jan. 2024 um 11:24 Uhr schrieb Hu Bert :


ok, finally managed to get all servers, volumes etc runnung, but took
a couple of restarts, cksum checks etc.

One problem: a volume doesn't heal automatically or doesn't heal at all.

gluster volume status
Status of volume: workdata
Gluster process TCP Port  RDMA Port  Online  Pid
--
Brick glusterpub1:/gluster/md3/workdata 58832 0  Y   3436
Brick glusterpub2:/gluster/md3/workdata 59315 0  Y   1526
Brick glusterpub3:/gluster/md3/workdata 56917 0  Y   1952
Brick glusterpub1:/gluster/md4/workdata 59688 0  Y   3755
Brick glusterpub2:/gluster/md4/workdata 60271 0  Y   2271
Brick glusterpub3:/gluster/md4/workdata 49461 0  Y   2399
Brick glusterpub1:/gluster/md5/workdata 54651 0  Y   4208
Brick glusterpub2:/gluster/md5/workdata 49685 0  Y   2751
Brick glusterpub3:/gluster/md5/workdata 59202 0  Y   2803
Brick glusterpub1:/gluster/md6/workdata 55829 0  Y   4583
Brick glusterpub2:/gluster/md6/workdata 50455 0  Y   3296
Brick glusterpub3:/gluster/md6/workdata 50262 0  Y   3237
Brick glusterpub1:/gluster/md7/workdata 52238 0  Y   5014
Brick glusterpub2:/gluster/md7/workdata 52474 0  Y   3673
Brick glusterpub3:/gluster/md7/workdata 57966 0  Y   3653
Self-heal Daemon on localhost   N/A   N/AY   4141
Self-heal Daemon on glusterpub1 N/A   N/AY   5570
Self-heal Daemon on glusterpub2 N/A   N/AY   4139

"gluster volume heal workdata info" lists a lot of files per brick.
"gluster volume heal workdata statistics heal-count" shows thousands
of files per brick.
"gluster volume heal workdata enable" has no effect.

gluster volume heal workdata full
Launching heal operation to perform full self heal on volume workdata
has been successful
Use heal info commands to check status.

-> not doing anything at all. And nothing happening on the 2 "good"
servers in e.g. glustershd.log. Heal was working as expected on
version 10.4, but here... silence. Someone has an idea?


Best regards,
Hubert

Am Di., 16. Jan. 2024 um 13:44 Uhr schrieb Gilberto Ferreira
:


Ah! Indeed! You need to perform an upgrade in the clients as well.








Em ter., 16 de jan. de 2024 às 03:12, Hu Bert  escreveu:


morning to those still reading :-)

i found this: 
https://docs.gluster.org/en/main/Troubleshooting/troubleshooting-glusterd/#common-issues-and-how-to-resolve-them

there's a paragraph about "peer rejected" with the same error message,
telling me: "Update the cluster.op-version" - i had only updated the
server nodes, but not the clients. So upgrading the cluster.op-version
wasn't possible at this time. So... upgrading the clients to version
11.1 and then the op-version should solve the problem?


Thx,
Hubert

Am Mo., 15. Jan. 2024 um 09:16 Uhr schrieb Hu Bert :


Hi,
just upgraded some gluster servers from version 10.4 to version 11.1.
Debian bullseye & bookworm. When only installing the packages: good,
servers, volumes etc. work as expected.

But one needs to test if the systems work after a daemon and/or server
restart. Well, did a reboot, and after that the rebooted/restarted
system is "out". Log message from working node:

[2024-01-15 08:02:21.585694 +] I [MSGID: 106163]
[glusterd-handshake.c:1501:__glusterd_mgmt_hndsk_versions_ack]
0-management: using the op-version 10
[2024-01-15 08:02:21.589601 +] I [MSGID: 106490]
[glusterd-handler.c:2546:__glusterd_handle_incoming_friend_req]
0-glusterd: Received probe 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-17 Thread Hu Bert
Good morning,
heal still not running. Pending heals now sum up to 60K per brick.
Heal was starting instantly e.g. after server reboot with version
10.4, but doesn't with version 11. What could be wrong?

I only see these errors on one of the "good" servers in glustershd.log:

[2024-01-18 06:08:57.328480 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-0:
remote operation failed.
[{path=},
{gfid=cb39a1e4-2a4c-4727-861d-3ed9e
f00681b}, {errno=2}, {error=No such file or directory}]
[2024-01-18 06:08:57.594051 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-1:
remote operation failed.
[{path=},
{gfid=3e9b178c-ae1f-4d85-ae47-fc539
d94dd11}, {errno=2}, {error=No such file or directory}]

About 7K today. Any ideas? Someone?
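
For completeness, the CLI knobs i know of to poke the self-heal daemon
(maybe i'm missing something):

gluster volume heal workdata                  # trigger an index heal
gluster volume start workdata force           # should restart brick/shd processes that are down
gluster volume heal workdata statistics heal-count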


Best regards,
Hubert

Am Mi., 17. Jan. 2024 um 11:24 Uhr schrieb Hu Bert :
>
> ok, finally managed to get all servers, volumes etc runnung, but took
> a couple of restarts, cksum checks etc.
>
> One problem: a volume doesn't heal automatically or doesn't heal at all.
>
> gluster volume status
> Status of volume: workdata
> Gluster process TCP Port  RDMA Port  Online  Pid
> --
> Brick glusterpub1:/gluster/md3/workdata 58832 0  Y   3436
> Brick glusterpub2:/gluster/md3/workdata 59315 0  Y   1526
> Brick glusterpub3:/gluster/md3/workdata 56917 0  Y   1952
> Brick glusterpub1:/gluster/md4/workdata 59688 0  Y   3755
> Brick glusterpub2:/gluster/md4/workdata 60271 0  Y   2271
> Brick glusterpub3:/gluster/md4/workdata 49461 0  Y   2399
> Brick glusterpub1:/gluster/md5/workdata 54651 0  Y   4208
> Brick glusterpub2:/gluster/md5/workdata 49685 0  Y   2751
> Brick glusterpub3:/gluster/md5/workdata 59202 0  Y   2803
> Brick glusterpub1:/gluster/md6/workdata 55829 0  Y   4583
> Brick glusterpub2:/gluster/md6/workdata 50455 0  Y   3296
> Brick glusterpub3:/gluster/md6/workdata 50262 0  Y   3237
> Brick glusterpub1:/gluster/md7/workdata 52238 0  Y   5014
> Brick glusterpub2:/gluster/md7/workdata 52474 0  Y   3673
> Brick glusterpub3:/gluster/md7/workdata 57966 0  Y   3653
> Self-heal Daemon on localhost   N/A   N/AY   4141
> Self-heal Daemon on glusterpub1 N/A   N/AY   5570
> Self-heal Daemon on glusterpub2 N/A   N/AY   4139
>
> "gluster volume heal workdata info" lists a lot of files per brick.
> "gluster volume heal workdata statistics heal-count" shows thousands
> of files per brick.
> "gluster volume heal workdata enable" has no effect.
>
> gluster volume heal workdata full
> Launching heal operation to perform full self heal on volume workdata
> has been successful
> Use heal info commands to check status.
>
> -> not doing anything at all. And nothing happening on the 2 "good"
> servers in e.g. glustershd.log. Heal was working as expected on
> version 10.4, but here... silence. Someone has an idea?
>
>
> Best regards,
> Hubert
>
> Am Di., 16. Jan. 2024 um 13:44 Uhr schrieb Gilberto Ferreira
> :
> >
> > Ah! Indeed! You need to perform an upgrade in the clients as well.
> >
> >
> >
> >
> >
> >
> >
> >
> > Em ter., 16 de jan. de 2024 às 03:12, Hu Bert  
> > escreveu:
> >>
> >> morning to those still reading :-)
> >>
> >> i found this: 
> >> https://docs.gluster.org/en/main/Troubleshooting/troubleshooting-glusterd/#common-issues-and-how-to-resolve-them
> >>
> >> there's a paragraph about "peer rejected" with the same error message,
> >> telling me: "Update the cluster.op-version" - i had only updated the
> >> server nodes, but not the clients. So upgrading the cluster.op-version
> >> wasn't possible at this time. So... upgrading the clients to version
> >> 11.1 and then the op-version should solve the problem?
> >>
> >>
> >> Thx,
> >> Hubert
> >>
> >> Am Mo., 15. Jan. 2024 um 09:16 Uhr schrieb Hu Bert 
> >> :
> >> >
> >> > Hi,
> >> > just upgraded some gluster servers from version 10.4 to version 11.1.
> >> > Debian bullseye & bookworm. When only installing the packages: good,
> >> > servers, volumes etc. work as expected.
> >> >
> >> > But one needs to test if the systems work after a daemon and/or server
> >> > restart. Well, did a reboot, and after that the rebooted/restarted
> >> > system is "out". Log message from working node:
> >> >
> >> > [2024-01-15 08:02:21.585694 +] I [MSGID: 106163]
> >> > [glusterd-handshake.c:1501:__glusterd_mgmt_hndsk_versions_ack]
> >> > 0-management: using the op-version 10
> >> > [2024-01-15 08:02:21.589601 +] I [MSGID: 106490]
> >> > 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-17 Thread Hu Bert
hm, i only see such messages in glustershd.log on the 2 good servers:

[2024-01-17 12:18:48.912952 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-6:
remote operation failed.
[{path=},
{gfid=ee28b56c-e352-48f8-bbb5-dbf31
babe073}, {errno=2}, {error=No such file or directory}]
[2024-01-17 12:18:48.913015 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-7:
remote operation failed.
[{path=},
{gfid=ee28b56c-e352-48f8-bbb5-dbf31
babe073}, {errno=2}, {error=No such file or directory}]
[2024-01-17 12:19:09.450335 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-10:
remote operation failed.
[{path=},
{gfid=ea4a63e3-1470-40a5-8a7e-2a10
61a8fcb0}, {errno=2}, {error=No such file or directory}]
[2024-01-17 12:19:09.450771 +] W [MSGID: 114031]
[client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-workdata-client-9:
remote operation failed.
[{path=},
{gfid=ea4a63e3-1470-40a5-8a7e-2a106
1a8fcb0}, {errno=2}, {error=No such file or directory}]

not sure if this is important.
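
If it helps with debugging: such a gfid can be mapped back to a path on a
brick via the hardlink under .glusterfs, e.g. for the first one (md3 is
just an example brick, and this only works for regular files; bash syntax):

GFID=ee28b56c-e352-48f8-bbb5-dbf31babe073
find /gluster/md3/workdata -samefile /gluster/md3/workdata/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID -not -path '*/.glusterfs/*'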

Am Mi., 17. Jan. 2024 um 11:24 Uhr schrieb Hu Bert :
>
> ok, finally managed to get all servers, volumes etc runnung, but took
> a couple of restarts, cksum checks etc.
>
> One problem: a volume doesn't heal automatically or doesn't heal at all.
>
> gluster volume status
> Status of volume: workdata
> Gluster process TCP Port  RDMA Port  Online  Pid
> --
> Brick glusterpub1:/gluster/md3/workdata 58832 0  Y   3436
> Brick glusterpub2:/gluster/md3/workdata 59315 0  Y   1526
> Brick glusterpub3:/gluster/md3/workdata 56917 0  Y   1952
> Brick glusterpub1:/gluster/md4/workdata 59688 0  Y   3755
> Brick glusterpub2:/gluster/md4/workdata 60271 0  Y   2271
> Brick glusterpub3:/gluster/md4/workdata 49461 0  Y   2399
> Brick glusterpub1:/gluster/md5/workdata 54651 0  Y   4208
> Brick glusterpub2:/gluster/md5/workdata 49685 0  Y   2751
> Brick glusterpub3:/gluster/md5/workdata 59202 0  Y   2803
> Brick glusterpub1:/gluster/md6/workdata 55829 0  Y   4583
> Brick glusterpub2:/gluster/md6/workdata 50455 0  Y   3296
> Brick glusterpub3:/gluster/md6/workdata 50262 0  Y   3237
> Brick glusterpub1:/gluster/md7/workdata 52238 0  Y   5014
> Brick glusterpub2:/gluster/md7/workdata 52474 0  Y   3673
> Brick glusterpub3:/gluster/md7/workdata 57966 0  Y   3653
> Self-heal Daemon on localhost   N/A   N/AY   4141
> Self-heal Daemon on glusterpub1 N/A   N/AY   5570
> Self-heal Daemon on glusterpub2 N/A   N/AY   4139
>
> "gluster volume heal workdata info" lists a lot of files per brick.
> "gluster volume heal workdata statistics heal-count" shows thousands
> of files per brick.
> "gluster volume heal workdata enable" has no effect.
>
> gluster volume heal workdata full
> Launching heal operation to perform full self heal on volume workdata
> has been successful
> Use heal info commands to check status.
>
> -> not doing anything at all. And nothing happening on the 2 "good"
> servers in e.g. glustershd.log. Heal was working as expected on
> version 10.4, but here... silence. Someone has an idea?
>
>
> Best regards,
> Hubert
>
> Am Di., 16. Jan. 2024 um 13:44 Uhr schrieb Gilberto Ferreira
> :
> >
> > Ah! Indeed! You need to perform an upgrade in the clients as well.
> >
> >
> >
> >
> >
> >
> >
> >
> > Em ter., 16 de jan. de 2024 às 03:12, Hu Bert  
> > escreveu:
> >>
> >> morning to those still reading :-)
> >>
> >> i found this: 
> >> https://docs.gluster.org/en/main/Troubleshooting/troubleshooting-glusterd/#common-issues-and-how-to-resolve-them
> >>
> >> there's a paragraph about "peer rejected" with the same error message,
> >> telling me: "Update the cluster.op-version" - i had only updated the
> >> server nodes, but not the clients. So upgrading the cluster.op-version
> >> wasn't possible at this time. So... upgrading the clients to version
> >> 11.1 and then the op-version should solve the problem?
> >>
> >>
> >> Thx,
> >> Hubert
> >>
> >> Am Mo., 15. Jan. 2024 um 09:16 Uhr schrieb Hu Bert 
> >> :
> >> >
> >> > Hi,
> >> > just upgraded some gluster servers from version 10.4 to version 11.1.
> >> > Debian bullseye & bookworm. When only installing the packages: good,
> >> > servers, volumes etc. work as expected.
> >> >
> >> > But one needs to test if the systems work after a daemon and/or server
> >> > restart. Well, did a reboot, and after that the rebooted/restarted
> >> > system is "out". Log message from working node:
> >> >
> >> > [2024-01-15 08:02:21.585694 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-17 Thread Hu Bert
ok, finally managed to get all servers, volumes etc. running, but it took
a couple of restarts, cksum checks etc.

One problem: a volume doesn't heal automatically or doesn't heal at all.

gluster volume status
Status of volume: workdata
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick glusterpub1:/gluster/md3/workdata     58832     0          Y       3436
Brick glusterpub2:/gluster/md3/workdata     59315     0          Y       1526
Brick glusterpub3:/gluster/md3/workdata     56917     0          Y       1952
Brick glusterpub1:/gluster/md4/workdata     59688     0          Y       3755
Brick glusterpub2:/gluster/md4/workdata     60271     0          Y       2271
Brick glusterpub3:/gluster/md4/workdata     49461     0          Y       2399
Brick glusterpub1:/gluster/md5/workdata     54651     0          Y       4208
Brick glusterpub2:/gluster/md5/workdata     49685     0          Y       2751
Brick glusterpub3:/gluster/md5/workdata     59202     0          Y       2803
Brick glusterpub1:/gluster/md6/workdata     55829     0          Y       4583
Brick glusterpub2:/gluster/md6/workdata     50455     0          Y       3296
Brick glusterpub3:/gluster/md6/workdata     50262     0          Y       3237
Brick glusterpub1:/gluster/md7/workdata     52238     0          Y       5014
Brick glusterpub2:/gluster/md7/workdata     52474     0          Y       3673
Brick glusterpub3:/gluster/md7/workdata     57966     0          Y       3653
Self-heal Daemon on localhost               N/A       N/A        Y       4141
Self-heal Daemon on glusterpub1             N/A       N/A        Y       5570
Self-heal Daemon on glusterpub2             N/A       N/A        Y       4139

"gluster volume heal workdata info" lists a lot of files per brick.
"gluster volume heal workdata statistics heal-count" shows thousands
of files per brick.
"gluster volume heal workdata enable" has no effect.

gluster volume heal workdata full
Launching heal operation to perform full self heal on volume workdata
has been successful
Use heal info commands to check status.

-> not doing anything at all. And nothing is happening on the 2 "good"
servers, e.g. in glustershd.log. Heal was working as expected on
version 10.4, but here... silence. Does anyone have an idea?
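
For reference, a few things that can help narrow down why the self-heal
daemon stays silent (volume name taken from the status output above; the
getfattr path is only a hypothetical example, and exact CLI output can
differ between releases):

# per-brick summary of the heal backlog and split-brain entries
gluster volume heal workdata info summary
# check that a self-heal daemon is up and connected for this volume
gluster volume status workdata shd
# restart any missing brick/shd processes without touching client mounts
gluster volume start workdata force
# on a brick, inspect the AFR pending markers of one affected file (hypothetical path)
getfattr -d -m . -e hex /gluster/md3/workdata/path/to/some/file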


Best regards,
Hubert

On Tue, Jan 16, 2024 at 13:44 Gilberto Ferreira wrote:
>
> Ah! Indeed! You need to perform an upgrade in the clients as well.
>
> On Tue, Jan 16, 2024 at 03:12 Hu Bert wrote:
>>
>> morning to those still reading :-)
>>
>> i found this: 
>> https://docs.gluster.org/en/main/Troubleshooting/troubleshooting-glusterd/#common-issues-and-how-to-resolve-them
>>
>> there's a paragraph about "peer rejected" with the same error message,
>> telling me: "Update the cluster.op-version" - i had only updated the
>> server nodes, but not the clients. So upgrading the cluster.op-version
>> wasn't possible at this time. So... upgrading the clients to version
>> 11.1 and then the op-version should solve the problem?
>>
>>
>> Thx,
>> Hubert
>>
>> On Mon, Jan 15, 2024 at 09:16 Hu Bert wrote:
>> >
>> > Hi,
>> > just upgraded some gluster servers from version 10.4 to version 11.1.
>> > Debian bullseye & bookworm. When only installing the packages: good,
>> > servers, volumes etc. work as expected.
>> >
>> > But one needs to test if the systems work after a daemon and/or server
>> > restart. Well, did a reboot, and after that the rebooted/restarted
>> > system is "out". Log message from working node:
>> >
>> > [2024-01-15 08:02:21.585694 +] I [MSGID: 106163]
>> > [glusterd-handshake.c:1501:__glusterd_mgmt_hndsk_versions_ack]
>> > 0-management: using the op-version 10
>> > [2024-01-15 08:02:21.589601 +] I [MSGID: 106490]
>> > [glusterd-handler.c:2546:__glusterd_handle_incoming_friend_req]
>> > 0-glusterd: Received probe from uuid:
>> > b71401c3-512a-47cb-ac18-473c4ba7776e
>> > [2024-01-15 08:02:23.608349 +] E [MSGID: 106010]
>> > [glusterd-utils.c:3824:glusterd_compare_friend_volume] 0-management:
>> > Version of Cksums sourceimages differ. local cksum = 2204642525,
>> > remote cksum = 1931483801 on peer gluster190
>> > [2024-01-15 08:02:23.608584 +] I [MSGID: 106493]
>> > [glusterd-handler.c:3819:glusterd_xfer_friend_add_resp] 0-glusterd:
>> > Responded to gluster190 (0), ret: 0, op_ret: -1
>> > [2024-01-15 08:02:23.613553 +] I [MSGID: 106493]
>> > [glusterd-rpc-ops.c:467:__glusterd_friend_add_cbk] 0-glusterd:
>> > Received RJT from uuid: b71401c3-512a-47cb-ac18-473c4ba7776e, host:
>> > gluster190, port: 0
>> >
>> > peer status from rebooted node:
>> >
>> > root@gluster190 ~ # gluster peer status
>> > Number of Peers: 2
>> >
>> > Hostname: gluster189
>> > Uuid: 50dc8288-aa49-4ea8-9c6c-9a9a926c67a7
>> > State: Peer Rejected (Connected)
>> >
>> > Hostname: gluster188
>> > Uuid: e15a33fe-e2f7-47cf-ac53-a3b34136555d
>> > State: 

Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-16 Thread Gilberto Ferreira
Ah! Indeed! You need to perform an upgrade in the clients as well.
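
One way to check whether old clients are still connected before bumping
anything (volume name taken from this thread; on recent releases the client
listing also reports each client's op-version):

# list the clients connected to each brick
gluster volume status workdata clients
# highest op-version the cluster (servers plus currently connected clients) can support
gluster volume get all cluster.max-op-version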

On Tue, Jan 16, 2024 at 03:12 Hu Bert wrote:

> morning to those still reading :-)
>
> i found this:
> https://docs.gluster.org/en/main/Troubleshooting/troubleshooting-glusterd/#common-issues-and-how-to-resolve-them
>
> there's a paragraph about "peer rejected" with the same error message,
> telling me: "Update the cluster.op-version" - i had only updated the
> server nodes, but not the clients. So upgrading the cluster.op-version
> wasn't possible at this time. So... upgrading the clients to version
> 11.1 and then the op-version should solve the problem?
>
>
> Thx,
> Hubert
>
> On Mon, Jan 15, 2024 at 09:16 Hu Bert wrote:
> >
> > Hi,
> > just upgraded some gluster servers from version 10.4 to version 11.1.
> > Debian bullseye & bookworm. When only installing the packages: good,
> > servers, volumes etc. work as expected.
> >
> > But one needs to test if the systems work after a daemon and/or server
> > restart. Well, did a reboot, and after that the rebooted/restarted
> > system is "out". Log message from working node:
> >
> > [2024-01-15 08:02:21.585694 +] I [MSGID: 106163]
> > [glusterd-handshake.c:1501:__glusterd_mgmt_hndsk_versions_ack]
> > 0-management: using the op-version 10
> > [2024-01-15 08:02:21.589601 +] I [MSGID: 106490]
> > [glusterd-handler.c:2546:__glusterd_handle_incoming_friend_req]
> > 0-glusterd: Received probe from uuid:
> > b71401c3-512a-47cb-ac18-473c4ba7776e
> > [2024-01-15 08:02:23.608349 +] E [MSGID: 106010]
> > [glusterd-utils.c:3824:glusterd_compare_friend_volume] 0-management:
> > Version of Cksums sourceimages differ. local cksum = 2204642525,
> > remote cksum = 1931483801 on peer gluster190
> > [2024-01-15 08:02:23.608584 +] I [MSGID: 106493]
> > [glusterd-handler.c:3819:glusterd_xfer_friend_add_resp] 0-glusterd:
> > Responded to gluster190 (0), ret: 0, op_ret: -1
> > [2024-01-15 08:02:23.613553 +] I [MSGID: 106493]
> > [glusterd-rpc-ops.c:467:__glusterd_friend_add_cbk] 0-glusterd:
> > Received RJT from uuid: b71401c3-512a-47cb-ac18-473c4ba7776e, host:
> > gluster190, port: 0
> >
> > peer status from rebooted node:
> >
> > root@gluster190 ~ # gluster peer status
> > Number of Peers: 2
> >
> > Hostname: gluster189
> > Uuid: 50dc8288-aa49-4ea8-9c6c-9a9a926c67a7
> > State: Peer Rejected (Connected)
> >
> > Hostname: gluster188
> > Uuid: e15a33fe-e2f7-47cf-ac53-a3b34136555d
> > State: Peer Rejected (Connected)
> >
> > So the rebooted gluster190 is not accepted anymore. And thus does not
> > appear in "gluster volume status". I then followed this guide:
> >
> >
> https://gluster-documentations.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/
> >
> > Remove everything under /var/lib/glusterd/ (except glusterd.info) and
> > restart glusterd service etc. Data get copied from other nodes,
> > 'gluster peer status' is ok again - but the volume info is missing,
> > /var/lib/glusterd/vols is empty. When syncing this dir from another
> > node, the volume then is available again, heals start etc.
> >
> > Well, and just to be sure that everything's working as it should,
> > rebooted that node again - the rebooted node is kicked out again, and
> > you have to restart bringing it back again.
> >
> > Sry, but did i miss anything? Has someone experienced similar
> > problems? I'll probably downgrade to 10.4 again, that version was
> > working...
> >
> >
> > Thx,
> > Hubert






Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-15 Thread Hu Bert
morning to those still reading :-)

i found this: 
https://docs.gluster.org/en/main/Troubleshooting/troubleshooting-glusterd/#common-issues-and-how-to-resolve-them

there's a paragraph about "peer rejected" with the same error message,
telling me: "Update the cluster.op-version" - i had only updated the
server nodes, but not the clients. So upgrading the cluster.op-version
wasn't possible at this time. So... upgrading the clients to version
11.1 and then the op-version should solve the problem?
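
A rough sketch of how that check/bump could look, assuming every server and
client really is on 11.1 (110000 is the value expected for 11.x; verify
against max-op-version first):

# current cluster op-version and the maximum the cluster could use
gluster volume get all cluster.op-version
gluster volume get all cluster.max-op-version
# once all servers and clients run 11.1, raise the op-version
gluster volume set all cluster.op-version 110000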


Thx,
Hubert

On Mon, Jan 15, 2024 at 09:16 Hu Bert wrote:
>
> Hi,
> just upgraded some gluster servers from version 10.4 to version 11.1.
> Debian bullseye & bookworm. When only installing the packages: good,
> servers, volumes etc. work as expected.
>
> But one needs to test if the systems work after a daemon and/or server
> restart. Well, did a reboot, and after that the rebooted/restarted
> system is "out". Log message from working node:
>
> [2024-01-15 08:02:21.585694 +] I [MSGID: 106163]
> [glusterd-handshake.c:1501:__glusterd_mgmt_hndsk_versions_ack]
> 0-management: using the op-version 10
> [2024-01-15 08:02:21.589601 +] I [MSGID: 106490]
> [glusterd-handler.c:2546:__glusterd_handle_incoming_friend_req]
> 0-glusterd: Received probe from uuid:
> b71401c3-512a-47cb-ac18-473c4ba7776e
> [2024-01-15 08:02:23.608349 +] E [MSGID: 106010]
> [glusterd-utils.c:3824:glusterd_compare_friend_volume] 0-management:
> Version of Cksums sourceimages differ. local cksum = 2204642525,
> remote cksum = 1931483801 on peer gluster190
> [2024-01-15 08:02:23.608584 +] I [MSGID: 106493]
> [glusterd-handler.c:3819:glusterd_xfer_friend_add_resp] 0-glusterd:
> Responded to gluster190 (0), ret: 0, op_ret: -1
> [2024-01-15 08:02:23.613553 +] I [MSGID: 106493]
> [glusterd-rpc-ops.c:467:__glusterd_friend_add_cbk] 0-glusterd:
> Received RJT from uuid: b71401c3-512a-47cb-ac18-473c4ba7776e, host:
> gluster190, port: 0
>
> peer status from rebooted node:
>
> root@gluster190 ~ # gluster peer status
> Number of Peers: 2
>
> Hostname: gluster189
> Uuid: 50dc8288-aa49-4ea8-9c6c-9a9a926c67a7
> State: Peer Rejected (Connected)
>
> Hostname: gluster188
> Uuid: e15a33fe-e2f7-47cf-ac53-a3b34136555d
> State: Peer Rejected (Connected)
>
> So the rebooted gluster190 is not accepted anymore. And thus does not
> appear in "gluster volume status". I then followed this guide:
>
> https://gluster-documentations.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/
>
> Remove everything under /var/lib/glusterd/ (except glusterd.info) and
> restart glusterd service etc. Data get copied from other nodes,
> 'gluster peer status' is ok again - but the volume info is missing,
> /var/lib/glusterd/vols is empty. When syncing this dir from another
> node, the volume then is available again, heals start etc.
>
> Well, and just to be sure that everything's working as it should,
> rebooted that node again - the rebooted node is kicked out again, and
> you have to restart bringing it back again.
>
> Sry, but did i miss anything? Has someone experienced similar
> problems? I'll probably downgrade to 10.4 again, that version was
> working...
>
>
> Thx,
> Hubert






Re: [Gluster-users] Upgrade 10.4 -> 11.1 making problems

2024-01-15 Thread Hu Bert
just downgraded one node to 10.4, did a reboot - same result: cksum
error. i'm able to bring it back in again, but if that error persists
when downgrading all servers...
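
For what it's worth, glusterd keeps the per-volume checksum it compares
during the handshake under /var/lib/glusterd/vols/<volname>/; a quick way to
compare the rejected node against a good peer (default paths assumed, volume
name taken from the cksum log in this thread, peer hostname only an example):

# checksum file for the volume named in the cksum error
cat /var/lib/glusterd/vols/sourceimages/cksum
# the volume definition files should also match between peers
diff /var/lib/glusterd/vols/sourceimages/info <(ssh gluster189 cat /var/lib/glusterd/vols/sourceimages/info)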

On Mon, Jan 15, 2024 at 09:16 Hu Bert wrote:
>
> Hi,
> just upgraded some gluster servers from version 10.4 to version 11.1.
> Debian bullseye & bookworm. When only installing the packages: good,
> servers, volumes etc. work as expected.
>
> But one needs to test if the systems work after a daemon and/or server
> restart. Well, did a reboot, and after that the rebooted/restarted
> system is "out". Log message from working node:
>
> [2024-01-15 08:02:21.585694 +] I [MSGID: 106163]
> [glusterd-handshake.c:1501:__glusterd_mgmt_hndsk_versions_ack]
> 0-management: using the op-version 10
> [2024-01-15 08:02:21.589601 +] I [MSGID: 106490]
> [glusterd-handler.c:2546:__glusterd_handle_incoming_friend_req]
> 0-glusterd: Received probe from uuid:
> b71401c3-512a-47cb-ac18-473c4ba7776e
> [2024-01-15 08:02:23.608349 +] E [MSGID: 106010]
> [glusterd-utils.c:3824:glusterd_compare_friend_volume] 0-management:
> Version of Cksums sourceimages differ. local cksum = 2204642525,
> remote cksum = 1931483801 on peer gluster190
> [2024-01-15 08:02:23.608584 +] I [MSGID: 106493]
> [glusterd-handler.c:3819:glusterd_xfer_friend_add_resp] 0-glusterd:
> Responded to gluster190 (0), ret: 0, op_ret: -1
> [2024-01-15 08:02:23.613553 +] I [MSGID: 106493]
> [glusterd-rpc-ops.c:467:__glusterd_friend_add_cbk] 0-glusterd:
> Received RJT from uuid: b71401c3-512a-47cb-ac18-473c4ba7776e, host:
> gluster190, port: 0
>
> peer status from rebooted node:
>
> root@gluster190 ~ # gluster peer status
> Number of Peers: 2
>
> Hostname: gluster189
> Uuid: 50dc8288-aa49-4ea8-9c6c-9a9a926c67a7
> State: Peer Rejected (Connected)
>
> Hostname: gluster188
> Uuid: e15a33fe-e2f7-47cf-ac53-a3b34136555d
> State: Peer Rejected (Connected)
>
> So the rebooted gluster190 is not accepted anymore. And thus does not
> appear in "gluster volume status". I then followed this guide:
>
> https://gluster-documentations.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/
>
> Remove everything under /var/lib/glusterd/ (except glusterd.info) and
> restart glusterd service etc. Data get copied from other nodes,
> 'gluster peer status' is ok again - but the volume info is missing,
> /var/lib/glusterd/vols is empty. When syncing this dir from another
> node, the volume then is available again, heals start etc.
>
> Well, and just to be sure that everything's working as it should,
> rebooted that node again - the rebooted node is kicked out again, and
> you have to restart bringing it back again.
>
> Sry, but did i miss anything? Has someone experienced similar
> problems? I'll probably downgrade to 10.4 again, that version was
> working...
>
>
> Thx,
> Hubert




Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users