Re: [Gluster-users] Geo-replication: Entry not present on master. Fixing gfid mismatch in slave
Hi Strahil and Sunny,

Thank you for the replies. I checked the gfid on the master and slaves and they are the same. After moving the file away and back again, it doesn't seem to have the issue with that file any more.

We are still getting higher CPU usage on one of the master nodes than on the others. It logs this every few seconds:

[2020-06-02 03:10:15.637815] I [master(worker /nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time Taken MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0 CRE=0 duration=0. UNL=0
[2020-06-02 03:10:15.638010] I [master(worker /nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata Time Taken SETA=0 SETX=0 meta_duration=0. data_duration=12.7878 DATA=4 XATT=0
[2020-06-02 03:10:15.638286] I [master(worker /nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch Completed changelog_end=1591067378 entry_stime=(1591067167, 0) changelog_start=1591067364 stime=(1591067377, 0) duration=12.8068 num_changelogs=2 mode=live_changelog
[2020-06-02 03:10:20.658601] I [master(worker /nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time stime=(1591067377, 0)
[2020-06-02 03:10:34.21799] I [master(worker /nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken duration=0.3826 num_files=8 job=1 return_code=0
[2020-06-02 03:10:46.440535] I [master(worker /nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time Taken MKD=0 MKN=0 LIN=0 SYM=0 REN=1 RMD=0 CRE=2 duration=0.1314 UNL=1
[2020-06-02 03:10:46.440809] I [master(worker /nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata Time Taken SETA=0 SETX=0 meta_duration=0. data_duration=13.0171 DATA=14 XATT=0
[2020-06-02 03:10:46.441205] I [master(worker /nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch Completed changelog_end=1591067420 entry_stime=(1591067419, 0) changelog_start=1591067392 stime=(1591067419, 0) duration=13.0322 num_changelogs=3 mode=live_changelog
[2020-06-02 03:10:51.460925] I [master(worker /nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time stime=(1591067419, 0)
[2020-06-02 03:11:04.448913] I [master(worker /nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken duration=0.3466 num_files=3 job=1 return_code=0

Whereas the other master nodes only log this:

[2020-06-02 03:11:33.886938] I [gsyncd(config-get):308:main] : Using session config file path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
[2020-06-02 03:11:33.993175] I [gsyncd(status):308:main] : Using session config file path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf

Can anyone help with what might cause the high CPU usage on one master node? The process is this one, and it is using 70-100% of CPU:

python2 /usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py worker gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path /nodirectwritedata/gluster/gvol0 --local-node cafs30 --local-node-id b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13 --subvol-num 1 --resource-remote nvfs30 --resource-remote-id 1e698ccd-aeec-4ec4-96fe-383da8fc3b78

Thank you in advance!

On Sat, 30 May 2020 at 20:20, Strahil Nikolov wrote:

> Hey David,
>
> For me a gfid mismatch means that the file was replaced/recreated -
> just as vim does on Linux (and that is expected for a config file).
>
> Have you checked the gfid of the file on both source and destination -
> do they really match, or are they different?
>
> What happens when you move the file away from the slave - does that fix
> the issue?
>
> Best Regards,
> Strahil Nikolov
>
> On 30 May 2020, 1:10:56 GMT+03:00, David Cunningham <
> dcunning...@voisonics.com> wrote:
> >Hello,
> >
> >We're having an issue with a geo-replication process with unusually
> >high CPU use, which is giving "Entry not present on master. Fixing
> >gfid mismatch in slave" errors. Can anyone help with this?
> >
> >We have 3 GlusterFS replica nodes (which we'll call the master), which
> >also push data to a remote server (the slave) using geo-replication.
> >This has been running fine for a couple of months, but yesterday one
> >of the master nodes started having unusually high CPU use. It's this
> >process:
> >
> >root@cafs30:/var/log/glusterfs# ps aux | grep 32048
> >root 32048 68.7 0.6 1843140 845756 ? Rl 02:51 493:51 python2
> >/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py worker
> >gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path
> >/nodirectwritedata/gluster/gvol0 --local-node cafs30 --local-node-id
> >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id
> >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13 --subvol-num 1
> >--resource-remote nvfs30 --resource-remote-id
> >1e698ccd-aeec-4ec4-96fe-383da8fc3b78
> >
> >Here's what is being logged in >
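[For readers following this thread: the gfid comparison discussed above can be done by reading the trusted.gfid xattr on each brick with getfattr and comparing the hex values. A minimal sketch; the xattr values below are illustrative placeholders, not values taken from these nodes:]

```shell
# On each node, read the gfid xattr of the file on its brick path, e.g.:
#   getfattr -n trusted.gfid -e hex /nodirectwritedata/gluster/gvol0/path/to/file
# Then compare the two hex values; identical values mean no gfid mismatch.
MASTER_GFID="0x37b2456f52164679ac5c4908b24f895a"  # placeholder value
SLAVE_GFID="0x37b2456f52164679ac5c4908b24f895a"   # placeholder value
if [ "$MASTER_GFID" = "$SLAVE_GFID" ]; then
    echo "gfid match"
else
    echo "gfid mismatch"
fi
```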
Re: [Gluster-users] Gfid mismatch detected - but no split brain - how to solve?
Hi,

I am assuming that you are using one of the maintained versions of Gluster. GFID split-brains can be resolved using one of the methods of the split-brain resolution CLI, as explained in section "3. Resolution of split-brain using gluster CLI" of https://docs.gluster.org/en/latest/Troubleshooting/resolving-splitbrain/.

Things to note when using this CLI to resolve GFID split-brains:

- You cannot use the GFID of the file as an argument with any of the CLI options. It must be the absolute path, as seen from the mount point, of the file considered as source.

- With the source-brick option there is no way to resolve all GFID split-brains in one shot by leaving out the file path, as can be done when resolving data or metadata split-brain. For each file in GFID split-brain, run the CLI with the policy you want to use.

- Resolving a directory GFID split-brain using the CLI with the "source-brick" option in a distributed-replicated volume needs to be done on all subvolumes explicitly if the directory is in GFID split-brain on multiple subvolumes. Since directories get created on all subvolumes, using one particular brick as the source heals the directories only on that subvolume. The other subvolumes must then be healed using a brick which has the same GFID as the brick that was used as the source for the first subvolume.
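As an illustration, the CLI invocations from that docs page look like the following. This is a sketch with a placeholder file path (shown with echo, since the real commands must run against a live cluster); remember the file argument is the absolute path as seen from the mount point:

```shell
# Placeholder volume name and file path for illustration only.
VOL="USER-HOME"
FILE="/dir/lock_file"   # path relative to the volume root, as seen from the mount

# Heal using the copy with the latest mtime as source:
echo gluster volume heal "$VOL" split-brain latest-mtime "$FILE"

# Heal using the bigger file as source:
echo gluster volume heal "$VOL" split-brain bigger-file "$FILE"

# Heal using one brick's copy as source (for GFID split-brain this must
# be run per file; the no-file-path form only works for data/metadata):
echo gluster volume heal "$VOL" split-brain source-brick \
    swir-ring8:/__.aLocalStorages/0/0-GLUSTERs/0GLUSTER.USER-HOME "$FILE"
```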
Regards,
Karthik

On Sat, May 30, 2020 at 3:39 AM lejeczek wrote:

> hi Guys
>
> I'm seeing "Gfid mismatch detected" in the logs but no split
> brain indicated (4-way replica):
>
> Brick swir-ring8:/__.aLocalStorages/0/0-GLUSTERs/0GLUSTER.USER-HOME
> Status: Connected
> Total Number of entries: 22
> Number of entries in heal pending: 22
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick whale-ring8:/__.aLocalStorages/0/0-GLUSTERs/0GLUSTER.USER-HOME
> Status: Connected
> Total Number of entries: 22
> Number of entries in heal pending: 22
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick rider-ring8:/__.aLocalStorages/0/0-GLUSTERs/0GLUSTER.USER-HOME
> Status: Connected
> Total Number of entries: 0
> Number of entries in heal pending: 0
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick dzien:/__.aLocalStorages/0/0-GLUSTERs/0GLUSTER.USER-HOME
> Status: Connected
> Total Number of entries: 10
> Number of entries in heal pending: 10
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> On swir-ring8:
> ...
> The message "E [MSGID: 108008]
> [afr-self-heal-common.c:384:afr_gfid_split_brain_source]
> 0-USER-HOME-replicate-0: Gfid mismatch detected for
> /lock_file>,
> 37b2456f-5216-4679-ac5c-4908b24f895a on USER-HOME-client-15
> and ba8f87ed-9bf3-404e-8d67-2631923e1645 on USER-HOME-client-13."
> repeated 2 times between [2020-05-29 21:47:49.034935]
> and [2020-05-29 21:47:49.079480]
>
> The message "E [MSGID: 108008]
> [afr-self-heal-common.c:384:afr_gfid_split_brain_source]
> 0-USER-HOME-replicate-0: Gfid mismatch detected for
> /t>,
> d7a4ed01-139b-4df3-8070-31bd620a6f15 on USER-HOME-client-15
> and d794b6ba-2a1d-4043-bb31-b98b22692763 on USER-HOME-client-13."
> repeated 2 times between [2020-05-29 21:47:49.126173]
> and [2020-05-29 21:47:49.155432]
>
> The message "E [MSGID: 108008]
> [afr-self-heal-common.c:384:afr_gfid_split_brain_source]
> 0-USER-HOME-replicate-0: Gfid mismatch detected for
> /Tables.docx>,
> 344febd8-c89c-4bf3-8ad8-6494c2189c43 on USER-HOME-client-15
> and 48d5b12b-03f4-46bf-bed1-9f8f88815615 on USER-HOME-client-13."
> repeated 2 times between [2020-05-29 21:47:49.194061]
> and [2020-05-29 21:47:49.239896]
>
> The message "E [MSGID: 108008]
> [afr-self-heal-entry.c:257:afr_selfheal_detect_gfid_and_type_mismatch]
> 0-USER-HOME-replicate-0: Skipping conservative merge on the file."
> repeated 8 times between [2020-05-29 21:47:49.037812]
> and [2020-05-29 21:47:49.240423]
> ...
>
> On whale-ring8:
> ...
> The message "E [MSGID: 108008]
> [afr-self-heal-common.c:384:afr_gfid_split_brain_source]
> 0-USER-HOME-replicate-0: Gfid mismatch detected for
> /pcs>,
> a83d0e5f-ef3a-40ab-be7b-784538d150be on USER-HOME-client-15
> and 89af3d31-81fa-4242-b8f7-0f49fd5fe57b on USER-HOME-client-13."
> repeated 2 times between [2020-05-29 21:45:46.152052]
> and [2020-05-29 21:45:46.422393]
>
> The message "E [MSGID: 108008]
> [afr-self-heal-common.c:384:afr_gfid_split_brain_source]
> 0-USER-HOME-replicate-0: Gfid mismatch detected for
> /history_database>,
> 81ebb0d5-264a-4eba-984a-e18673b43826 on USER-HOME-client-15
> and 2498a303-8937-43c3-939e-5e1d786b07fa on USER-HOME-client-13."
> repeated 2 times between [2020-05-29 21:45:46.167704]
> and [2020-05-29 21:45:46.437702]
>
> The message "E [MSGID: 108008]
> [afr-self-heal-common.c:384:afr_gfid_split_brain_source]
> 0-USER-HOME-replicate-0: Gfid mismatch detected for
> /client-state>,
>
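[A note that may help when mapping the GFIDs in these log messages back to file names: on a brick, regular files are hard-linked under .glusterfs/<first two hex chars>/<next two>/<full gfid>, so the backing path can be built from the GFID and the file then located with find -samefile on the brick node. A sketch, using one GFID from the swir-ring8 log above:]

```shell
# Build the .glusterfs backing path for a GFID on a brick. Regular files
# are hard-linked there, so `find "$BRICK" -samefile "$GPATH"` (run on
# the brick node) can reveal the real file name.
BRICK="/__.aLocalStorages/0/0-GLUSTERs/0GLUSTER.USER-HOME"
GFID="37b2456f-5216-4679-ac5c-4908b24f895a"
GPATH="$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
echo "$GPATH"
# -> .../.glusterfs/37/b2/37b2456f-5216-4679-ac5c-4908b24f895a
```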