I've seen issues with symlinks failing to heal as well. I never found
a good solution on the glusterfs side of things. The most reliable fix
I found is to just rm and recreate the symlink through the fuse mount
itself.
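
Something along these lines has worked for me; the mount point below is
just a placeholder, and the important part is to do it on a fuse client,
never directly on a brick:

    # run on a fuse client; /mnt/cwsvol01 is a placeholder for your mount point
    cd /mnt/cwsvol01/web01-etc
    rm nsswitch.conf                                    # drop the symlink that will not heal
    ln -s /etc/authselect/nsswitch.conf nsswitch.conf   # recreate it with the same target

Since the remove and recreate go through the normal client path, all
online bricks end up with a consistent copy of the link.
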
Also, I'd strongly suggest heavy load testing before upgrading to 10.3
in production: after upgrading from 9.5 to 10.3 I've seen frequent
brick process (glusterfsd) crashes, whereas 9.5 was quite stable.
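
Even a crude small-file loop like the one below (mount point and counts
are made up, adjust to taste) generates plenty of metadata and symlink
traffic through the fuse client; watch 'gluster vol status' and the
brick logs for glusterfsd trouble while it runs:

    # run on a fuse client; /mnt/cwsvol01 is a placeholder for your mount point
    mkdir -p /mnt/cwsvol01/loadtest
    for i in $(seq 1 10000); do
        dd if=/dev/zero of=/mnt/cwsvol01/loadtest/file.$i bs=4k count=4 status=none
        ln -s file.$i /mnt/cwsvol01/loadtest/link.$i
    done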

On Mon, Jan 23, 2023 at 3:58 PM Matt Rubright <mrubr...@uncc.edu> wrote:
>
> Hi friends,
>
> I have recently built a new replica 3 arbiter 1 volume on 10.3 servers and 
> have been putting it through its paces before getting it ready for production 
> use. The volume will ultimately contain about 200G of web content files 
> shared among multiple frontends. Each will use the gluster fuse client to 
> connect.
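>
> Roughly, the volume was created and mounted along these lines
> (reconstructed from the volume info below; the client mount point is a
> placeholder):
>
>     gluster volume create cwsvol01 replica 3 arbiter 1 \
>         glfs02-172-20-1:/data/brick01/cwsvol01 \
>         glfs01-172-20-1:/data/brick01/cwsvol01 \
>         glfsarb01-172-20-1:/data/arb01/cwsvol01
>     gluster volume start cwsvol01
>     # on each web frontend:
>     mount -t glusterfs glfs01-172-20-1:/cwsvol01 /var/www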
>
> What I am experiencing sounds very much like this post from 9 years ago: 
> https://lists.gnu.org/archive/html/gluster-devel/2013-12/msg00103.html
>
> In short, if I perform these steps I can reliably end up with symlinks on the
> volume which will not heal, either by initiating a 'full heal' from the
> cluster or by using a fuse client to read each file (a command-level sketch
> of the sequence follows the list):
>
> 1) Verify that all nodes are healthy, the volume is healthy, and there are no 
> items needing to be healed
> 2) Cleanly shut down one server hosting a brick
> 3) Copy data, including some symlinks, from a fuse client to the volume
> 4) Bring the brick back online and observe the number and type of items 
> needing to be healed
> 5) Initiate a full heal from one of the nodes
> 6) Confirm that while files and directories are healed, symlinks are not
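>
> Roughly, as commands (the source directory and client mount point are
> placeholders):
>
>     gluster vol heal cwsvol01 info summary    # 1) confirm nothing is pending
>     # 2) cleanly shut down one brick server, e.g. power off glfs02-172-20-1
>     cp -a /srv/webcontent/. /mnt/cwsvol01/    # 3) copy data, incl. symlinks, via a fuse client
>     # 4) power the server back on, then check what needs healing:
>     gluster vol heal cwsvol01 info summary
>     gluster vol heal cwsvol01 full            # 5) initiate a full heal
>     gluster vol heal cwsvol01 info            # 6) symlinks are still listed as pending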
>
> Please help me determine if I have improper expectations here. I have some 
> basic knowledge of managing gluster volumes, but I may be misunderstanding 
> intended behavior.
>
> Here is the volume info and heal data at each step of the way:
>
> *** Verify that all nodes are healthy, the volume is healthy, and there are 
> no items needing to be healed ***
>
> # gluster vol info cwsvol01
>
> Volume Name: cwsvol01
> Type: Replicate
> Volume ID: 7b28e6e6-4a73-41b7-83fe-863a45fd27fc
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: glfs02-172-20-1:/data/brick01/cwsvol01
> Brick2: glfs01-172-20-1:/data/brick01/cwsvol01
> Brick3: glfsarb01-172-20-1:/data/arb01/cwsvol01 (arbiter)
> Options Reconfigured:
> performance.client-io-threads: off
> nfs.disable: on
> transport.address-family: inet
> storage.fips-mode-rchecksum: on
> cluster.granular-entry-heal: on
>
> # gluster vol status
> Status of volume: cwsvol01
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick glfs02-172-20-1:/data/brick01/cwsvol0
> 1                                           50253     0          Y       1397
> Brick glfs01-172-20-1:/data/brick01/cwsvol0
> 1                                           56111     0          Y       1089
> Brick glfsarb01-172-20-1:/data/arb01/cwsvol
> 01                                          54517     0          Y       118704
> Self-heal Daemon on localhost               N/A       N/A        Y       1413
> Self-heal Daemon on glfs01-172-20-1         N/A       N/A        Y       3490
> Self-heal Daemon on glfsarb01-172-20-1      N/A       N/A        Y       118720
>
> Task Status of Volume cwsvol01
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> # gluster vol heal cwsvol01 info summary
> Brick glfs02-172-20-1:/data/brick01/cwsvol01
> Status: Connected
> Total Number of entries: 0
> Number of entries in heal pending: 0
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick glfs01-172-20-1:/data/brick01/cwsvol01
> Status: Connected
> Total Number of entries: 0
> Number of entries in heal pending: 0
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick glfsarb01-172-20-1:/data/arb01/cwsvol01
> Status: Connected
> Total Number of entries: 0
> Number of entries in heal pending: 0
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> *** Cleanly shut down one server hosting a brick ***
>
> *** Copy data, including some symlinks, from a fuse client to the volume ***
>
> # gluster vol heal cwsvol01 info summary
> Brick glfs02-172-20-1:/data/brick01/cwsvol01
> Status: Transport endpoint is not connected
> Total Number of entries: -
> Number of entries in heal pending: -
> Number of entries in split-brain: -
> Number of entries possibly healing: -
>
> Brick glfs01-172-20-1:/data/brick01/cwsvol01
> Status: Connected
> Total Number of entries: 810
> Number of entries in heal pending: 810
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick glfsarb01-172-20-1:/data/arb01/cwsvol01
> Status: Connected
> Total Number of entries: 810
> Number of entries in heal pending: 810
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> *** Bring the brick back online and observe the number and type of entities 
> needing to be healed ***
>
> # gluster vol heal cwsvol01 info summary
> Brick glfs02-172-20-1:/data/brick01/cwsvol01
> Status: Connected
> Total Number of entries: 0
> Number of entries in heal pending: 0
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick glfs01-172-20-1:/data/brick01/cwsvol01
> Status: Connected
> Total Number of entries: 769
> Number of entries in heal pending: 769
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick glfsarb01-172-20-1:/data/arb01/cwsvol01
> Status: Connected
> Total Number of entries: 769
> Number of entries in heal pending: 769
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> *** Initiate a full heal from one of the nodes ***
>
> # gluster vol heal cwsvol01 info summary
> Brick glfs02-172-20-1:/data/brick01/cwsvol01
> Status: Connected
> Total Number of entries: 0
> Number of entries in heal pending: 0
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick glfs01-172-20-1:/data/brick01/cwsvol01
> Status: Connected
> Total Number of entries: 148
> Number of entries in heal pending: 148
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick glfsarb01-172-20-1:/data/arb01/cwsvol01
> Status: Connected
> Total Number of entries: 148
> Number of entries in heal pending: 148
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> # gluster vol heal cwsvol01 info
> Brick glfs02-172-20-1:/data/brick01/cwsvol01
> Status: Connected
> Number of entries: 0
>
> Brick glfs01-172-20-1:/data/brick01/cwsvol01
> /web01-etc
> /web01-etc/nsswitch.conf
> /web01-etc/swid/swidtags.d
> /web01-etc/swid/swidtags.d/redhat.com
> /web01-etc/os-release
> /web01-etc/system-release
> < truncated >
>
> *** Verify that one brick contains the symlink while the previously-offline 
> one does not ***
>
> [root@cws-glfs01 ~]# ls -ld /data/brick01/cwsvol01/web01-etc/nsswitch.conf
> lrwxrwxrwx 2 root root 29 Jan  4 16:00 /data/brick01/cwsvol01/web01-etc/nsswitch.conf -> /etc/authselect/nsswitch.conf
>
> [root@cws-glfs02 ~]# ls -ld /data/brick01/cwsvol01/web01-etc/nsswitch.conf
> ls: cannot access '/data/brick01/cwsvol01/web01-etc/nsswitch.conf': No such file or directory
>
> *** Note entries in /var/log/gluster/glustershd.log ***
>
> [2023-01-23 20:34:40.939904 +0000] W [MSGID: 114031] 
> [client-rpc-fops_v2.c:2457:client4_0_link_cbk] 0-cwsvol01-client-1: remote 
> operation failed. [{source=<gfid:3cade471-8aba-492a-b981-d63330d2e02e>}, 
> {target=(null)}, {errno=116}, {error=Stale file handle}]
> [2023-01-23 20:34:40.945774 +0000] W [MSGID: 114031] 
> [client-rpc-fops_v2.c:2457:client4_0_link_cbk] 0-cwsvol01-client-1: remote 
> operation failed. [{source=<gfid:35102340-9409-4d88-a391-da43c00644e7>}, 
> {target=(null)}, {errno=116}, {error=Stale file handle}]
> [2023-01-23 20:34:40.749715 +0000] W [MSGID: 114031] 
> [client-rpc-fops_v2.c:2457:client4_0_link_cbk] 0-cwsvol01-client-1: remote 
> operation failed. [{source=<gfid:874406a9-9478-4b83-9e6a-09e262e4b85d>}, 
> {target=(null)}, {errno=116}, {error=Stale file handle}]
>
________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
