Dear Aravinda,

Thank you for the analysis and for submitting a patch for this issue. I hope it can make it into the next GlusterFS release, 3.7.7.

As suggested, I ran the find_gfid_issues.py script on the bricks of my two master nodes and of the slave nodes, but the only output it shows me is the following:

NO GFID(DIR) : /data/myvolume-geo/brick/test
NO GFID(DIR) : /data/myvolume-geo/brick/data
NO GFID(DIR) : /data/myvolume-geo/brick/data/files_encryption
NO GFID(DIR) : /data/myvolume-geo/brick/data/username

As you can see, there are no files listed at all, so I am still left with 394 files of 0 kBytes on my geo-rep slave node. Do you have any suggestion how to clean up this mess?
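For reference, this is roughly how I am counting those leftover zero-byte files on the slave brick (the brick path is the one from the script output above; I exclude the .glusterfs directory so that only the data files are counted):

# find /data/myvolume-geo/brick -path '*/.glusterfs' -prune -o -type f -size 0 -print | wc -l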
Best regards
ML


On Tuesday, February 2, 2016 7:59 AM, Aravinda <avish...@redhat.com> wrote:

Hi ML,

We analyzed the issue. It looks like the changelog was replayed, possibly because of a Geo-rep worker crash, an Active/Passive switch, or both Geo-rep workers becoming active at the same time.

From the changelogs:

CREATE logo-login-04.svg.part
RENAME logo-login-04.svg.part logo-login-04.svg

When this is replayed:

CREATE logo-login-04.svg.part
RENAME logo-login-04.svg.part logo-login-04.svg
CREATE logo-login-04.svg.part
RENAME logo-login-04.svg.part logo-login-04.svg

During the replay the backend GFID link is broken, and Geo-rep fails to clean it up.
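One way to check this on an affected file is to compare the inode of the data file with its GFID link under .glusterfs, for example like this (assuming the usual backend layout, where the first two and the next two characters of the GFID form the directory levels; the gfid below is the one from your earlier mail, substitute the brick and file paths for the file you are checking):

# gfid=1c648409-e98b-4544-a7fa-c2aef87f92ad
# stat -c '%i %h %n' /data/myvolume/brick/.glusterfs/1c/64/$gfid /data/myvolume/brick/data/username/files/shared/logo-login-09.svg

For a healthy regular file both paths report the same inode number and a link count of at least 2; if the .glusterfs entry is missing or points to a different inode, the GFID link is broken.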
Milind is working on a patch to fix this. The patches are in review and expected to be available in the 3.7.8 release:

http://review.gluster.org/#/c/13316/
http://review.gluster.org/#/c/13189/

The following script can be used to find problematic files in each brick backend:

https://gist.github.com/aravindavk/29f673f13c2f8963447e

regards
Aravinda

On 02/01/2016 08:45 PM, ML mail wrote:
> Sure, I will just send it to you through an encrypted cloud storage app and
> send you the password via private mail.
>
> Regards
> ML
>
> On Monday, February 1, 2016 3:14 PM, Saravanakumar Arumugam
> <sarum...@redhat.com> wrote:
>
> On 02/01/2016 07:22 PM, ML mail wrote:
>> I just found out I needed to run getfattr on a mount and not on the
>> glusterfs server directly. So here is the additional output you asked for:
>>
>> # getfattr -n glusterfs.gfid.string -m . logo-login-09.svg
>> # file: logo-login-09.svg
>> glusterfs.gfid.string="1c648409-e98b-4544-a7fa-c2aef87f92ad"
>>
>> # grep 1c648409-e98b-4544-a7fa-c2aef87f92ad /data/myvolume/brick/.glusterfs/changelogs -rn
>> Binary file /data/myvolume/brick/.glusterfs/changelogs/CHANGELOG.1454278219 matches
> Great! Can you share the CHANGELOG? (It contains the various fops
> carried out on this gfid.)
>
>> Regards
>> ML
>>
>> On Monday, February 1, 2016 1:30 PM, Saravanakumar Arumugam
>> <sarum...@redhat.com> wrote:
>> Hi,
>>
>> On 02/01/2016 02:14 PM, ML mail wrote:
>>> Hello,
>>>
>>> I just set up distributed geo-replication to a slave on my 2-node
>>> replicated volume and noticed quite a few error messages (around 70 of
>>> them) in the slave's brick log file.
>>>
>>> The exact log file is: /var/log/glusterfs/bricks/data-myvolume-geo-brick.log
>>>
>>> [2016-01-31 22:19:29.524370] E [MSGID: 113020] [posix.c:1221:posix_mknod]
>>> 0-myvolume-geo-posix: setting gfid on
>>> /data/myvolume-geo/brick/data/username/files/shared/logo-login-09.svg.ocTransferId1789604916.part
>>> failed
>>> [2016-01-31 22:19:29.535478] W [MSGID: 113026] [posix.c:1338:posix_mkdir]
>>> 0-myvolume-geo-posix: mkdir
>>> (/data/username/files_encryption/keys/files/shared/logo-login-09.svg.ocTransferId1789604916.part):
>>> gfid (15bbcec6-a332-4c21-81e4-c52472b1e13d) is already associated with
>>> directory
>>> (/data/myvolume-geo/brick/.glusterfs/49/5d/495d6868-4844-4632-8ff9-ad9646a878fe/logo-login-09.svg).
>>> Hence, both directories will share the same gfid and this can lead to
>>> inconsistencies.
>> Can you grep for this gfid (of the corresponding files) in the changelogs and
>> share those files?
>>
>> {
>> For example:
>>
>> 1. Get the gfid of the files like this:
>>
>> # getfattr -n glusterfs.gfid.string -m . /mnt/slave/file456
>> getfattr: Removing leading '/' from absolute path names
>> # file: mnt/slave/file456
>> glusterfs.gfid.string="05b22446-de9e-42df-a63e-399c24d690c4"
>>
>> 2. grep for the corresponding gfid in the brick back end like below:
>>
>> [root@gfvm3 changelogs]# grep 05b22446-de9e-42df-a63e-399c24d690c4 /opt/volume_test/tv_2/b1/.glusterfs/changelogs/ -rn
>> Binary file /opt/volume_test/tv_2/b1/.glusterfs/changelogs/CHANGELOG.1454135265 matches
>> Binary file /opt/volume_test/tv_2/b1/.glusterfs/changelogs/CHANGELOG.1454135476 matches
>> }
>>
>> This will help in understanding what operations were carried out on the
>> master volume that led to this inconsistency.
>>
>> Also, please send the following:
>> gluster version
>> gluster volume info
>> gluster volume geo-replication status
>>
>>> This doesn't look good at all, because the file mentioned in the error
>>> message (logo-login-09.svg.ocTransferId1789604916.part) is left there with 0 kBytes
>>> and does not get deleted or cleaned up by GlusterFS, leaving my geo-rep
>>> slave node in an inconsistent state which does not reflect the reality of
>>> the master nodes. The master nodes don't have that file anymore (which is
>>> correct). Here below is an "ls" of the file concerned, with the correct file
>>> on top:
>>>
>>> -rw-r--r-- 2 www-data www-data 24312 Jan 6 2014 logo-login-09.svg
>>> -rw-r--r-- 1 root root 0 Jan 31 23:19 logo-login-09.svg.ocTransferId1789604916.part
>> Rename issues in geo-replication have been fixed recently. This looks similar to
>> one of those.
>>
>> Thanks,
>> Saravana

_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users