Three files with information, plus two 0-byte files with the same name.
Checking the 0-byte files:
[root@gluster01 ~]# getfattr -m . -d -e hex /export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
getfattr: Removing leading '/' from absolute path names
# file: export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.sr_vol01-client-34=0x000000000000000000000000
trusted.afr.sr_vol01-client-35=0x000000000000000000000000
trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
[root@gluster03 ~]# getfattr -m . -d -e hex /export/brick14gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
getfattr: Removing leading '/' from absolute path names
# file: export/brick14gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.sr_vol01-client-34=0x000000000000000000000000
trusted.afr.sr_vol01-client-35=0x000000000000000000000000
trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
This is not a glusterfs link file since there is no
"trusted.glusterfs.dht.linkto", am I correct?
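For cross-checking the copies, the trusted.gfid value shown above can also be mapped to the file's hardlink under the brick's .glusterfs tree. A small shell sketch of the mapping (the layout rule, to my understanding, is .glusterfs/<first two hex chars>/<next two>/<uuid>):

```shell
# Sketch: map a trusted.gfid xattr value to the brick's .glusterfs
# hardlink path. The hex string is the one from the getfattr output
# above; the .glusterfs/xx/yy/uuid layout rule is my understanding.
hex=aefd184508414a8f8408f1ab8aa7a417   # trusted.gfid minus the 0x
uuid=$(printf %s "$hex" | sed -E 's/^(.{8})(.{4})(.{4})(.{4})(.{12})$/\1-\2-\3-\4-\5/')
d1=$(printf %s "$hex" | cut -c1-2)
d2=$(printf %s "$hex" | cut -c3-4)
echo ".glusterfs/$d1/$d2/$uuid"
# -> .glusterfs/ae/fd/aefd1845-0841-4a8f-8408-f1ab8aa7a417
```

Comparing that .glusterfs entry (inode, size, link count) against the named file on each brick helps tell real data apart from the 0-byte copies.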
And checking the "good" files:
# file: export/brick13gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.sr_vol01-client-32=0x000000000000000000000000
trusted.afr.sr_vol01-client-33=0x000000000000000000000000
trusted.afr.sr_vol01-client-34=0x000000000000000000000000
trusted.afr.sr_vol01-client-35=0x000000010000000100000000
trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
[root@gluster02 ~]# getfattr -m . -d -e hex /export/brick13gfs02/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
getfattr: Removing leading '/' from absolute path names
# file: export/brick13gfs02/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.sr_vol01-client-32=0x000000000000000000000000
trusted.afr.sr_vol01-client-33=0x000000000000000000000000
trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
[root@gluster03 ~]# getfattr -m . -d -e hex /export/brick13gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
getfattr: Removing leading '/' from absolute path names
# file: export/brick13gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.sr_vol01-client-40=0x000000000000000000000000
trusted.afr.sr_vol01-client-41=0x000000000000000000000000
trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
Seen from a client via a glusterfs mount:
[root@client ~]# ls -al /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
-rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
-rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
-rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
Via NFS (just after unmounting and remounting the volume):
[root@client ~]# ls -al /mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
-rw-r--r--. 1 root root 44332659200 Feb 17 23:55 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
-rw-r--r--. 1 root root 44332659200 Feb 17 23:55 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
-rw-r--r--. 1 root root 44332659200 Feb 17 23:55 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
Doing the same list a couple of seconds later:
[root@client ~]# ls -al /mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
-rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
-rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
-rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
And again, and again, and again:
[root@client ~]# ls -al /mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
-rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
-rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
-rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
This really seems odd. Why do we get to see the real data file only once?
It seems more and more that this crazy file duplication (and writing
of sticky-bit files) was actually triggered by rebooting one of the
three nodes while there was still an active NFS connection (even with
no data exchange at all), since all 0-byte files (of the non-sticky-bit
type) were created at either 00:51 or 00:41, the exact moments at which
one of the three nodes in the cluster was rebooted.
This would mean that replication with GlusterFS currently creates
hardly any redundancy. Quite the opposite: if one of the machines goes
down, all of your data gets seriously disorganised. I am busy
configuring a test installation to see how this can best be reproduced
for a bug report..
Does anyone have a suggestion on how best to get rid of the duplicates,
or rather how to get this mess organised the way it should be?
This is a cluster with millions of files. A rebalance does not fix the
issue, and neither does a rebalance fix-layout. Since this is a
replicated volume, all files should be there 2x, not 3x. Can I safely
just remove all the 0-byte files outside of the .glusterfs directory,
including the sticky-bit files?
The empty 0-byte files outside of .glusterfs on every brick I can
probably remove safely like this, no?

find /export/* -path "*/.glusterfs" -prune -o -type f -size 0 -perm 1000 -exec rm {} \;
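Before running it on real bricks, the command can be rehearsed on a throwaway mock layout. Everything below is made up for the demo (a temp directory stands in for /export, "brick01" for a brick); only the find expression matches the one proposed above:

```shell
# Sketch: rehearse the cleanup on a mock brick layout before touching
# real data. All paths and names here are illustrative only.
export_dir=$(mktemp -d)
mkdir -p "$export_dir/brick01/.glusterfs/ae/fd"
touch "$export_dir/brick01/stale.vhd"
chmod 1000 "$export_dir/brick01/stale.vhd"              # 0-byte sticky-bit file: should go
echo payload > "$export_dir/brick01/good.vhd"           # real data: must survive
touch "$export_dir/brick01/.glusterfs/ae/fd/keep"
chmod 1000 "$export_dir/brick01/.glusterfs/ae/fd/keep"  # inside .glusterfs: must survive
# dry run first (-print instead of -exec rm) to see what would be deleted
find "$export_dir"/* -path "*/.glusterfs" -prune -o -type f -size 0 -perm 1000 -print
# then for real
find "$export_dir"/* -path "*/.glusterfs" -prune -o -type f -size 0 -perm 1000 -exec rm {} \;
ls "$export_dir/brick01"                   # good.vhd remains, stale.vhd is gone
ls "$export_dir/brick01/.glusterfs/ae/fd"  # keep remains
rm -rf "$export_dir"
```

Running the dry-run (-print) variant on one real brick first and eyeballing the list seems a cheap safeguard before the -exec rm pass.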
Thanks!
Cheers,
Olav
On 18/02/15 22:10, Olav Peeters wrote:
Thanks Tom and Joe,
for the fast response!
Before I started my upgrade I stopped all clients using the volume
and stopped all VMs with VHDs on the volume, but I guess (and this
may be the missing piece for reproducing this in a lab) I did not
detach the NFS shared-storage mount from the XenServer pool to this
volume, since that is an extremely risky business. I also did not
stop the volume. That, I guess, was a bit stupid, but since I had done
upgrades this way in the past without any issues I skipped this
step (a really bad habit). I'll make amends and file a proper bug
report :-). I agree with you Joe, this should never happen, even
when someone ignores the advice to stop the volume. If it were also
necessary to detach shared-storage NFS connections to a volume, then
frankly glusterfs would be unusable in a private cloud. No one can
afford downtime of the whole infrastructure just for a glusterfs
upgrade. Ideally a replicated gluster volume should even be able to
remain online and in use during (at least a minor-version) upgrade.
I don't know whether a heal was perhaps busy when I started the
upgrade. I forgot to check. I did check the CPU activity on the
gluster nodes, which was very low (in the 0.0X range via top), so I
doubt it. I will add this to the bug report as a suggestion, should
they not be able to reproduce it with an open NFS connection.
By the way, is it sufficient to do:
service glusterd stop
service glusterfsd stop
and then a:
ps aux | grep gluster
to see if everything has stopped, and kill any leftovers should this
be necessary?
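The leftover check can also be sketched with pgrep, assuming the daemon process names are glusterd, glusterfsd and glusterfs (exact-name matching avoids the grep line matching itself):

```shell
# Sketch: check for leftover gluster daemons by exact process name.
# Names assumed: glusterd (mgmt), glusterfsd (brick), glusterfs (client/NFS).
found=0
for name in glusterd glusterfsd glusterfs; do
  if pgrep -x "$name" >/dev/null 2>&1; then
    echo "still running: $name"
    found=1
  fi
done
[ "$found" -eq 0 ] && echo "no gluster processes left"
```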
For the fix, do you agree that if I run e.g.:
find /export/* -type f -size 0 -perm 1000 -exec /bin/rm {} \;
on every node (/export being the location of all my bricks), this
will be safe, also in a replicated set-up?
No necessary 0-byte files will be deleted in e.g. the .glusterfs of
every brick?
Thanks for your support!
Cheers,
Olav
On 18/02/15 20:51, Joe Julian wrote:
On 02/18/2015 11:43 AM, tben...@3vgeomatics.com wrote:
Hi Olav,
I have a hunch that our problem was caused by improper unmounting
of the gluster volume, and have since found that the proper order
should be: kill all jobs using volume -> unmount volume on
clients -> gluster volume stop -> stop gluster service (if necessary)
In my case, I wrote a Python script to find duplicate files on
the mounted volume, then delete the corresponding link files on
the bricks (making sure to also delete files in the .glusterfs
directory)
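The detection step of that script can be sketched in shell as well: on an affected mount the same name appears several times in one directory listing, so sort | uniq -d extracts the duplicated names. The printf below merely simulates `ls -1` output on such a mount (file names are made up):

```shell
# Sketch: find names that appear more than once in a directory listing.
# The printf simulates `ls -1 <dir>` on a mount showing duplicates;
# sort | uniq -d prints each duplicated name exactly once.
printf '%s\n' 3009f448.vhd 3009f448.vhd other.vhd | sort | uniq -d
# -> 3009f448.vhd
```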
However, your find command was also suggested to me and I think
it's a simpler solution. I believe removing all link files (even
ones that are not causing duplicates) is fine, since on the next
file access gluster will do a lookup on all bricks and recreate any
link files as necessary. Hopefully a gluster expert can chime in
on this point as I'm not completely sure.
You are correct.
Keep in mind your setup is somewhat different than mine as I have
only 5 bricks with no replication.
Regards,
Tom
--------- Original Message ---------
Subject: Re: [Gluster-users] Hundreds of duplicate files
From: "Olav Peeters" <opeet...@gmail.com>
Date: 2/18/15 10:52 am
To: gluster-users@gluster.org, tben...@3vgeomatics.com
Hi all,
I'm having this problem after upgrading from 3.5.3 to 3.6.2.
At the moment I am still waiting for a heal to finish (on a
31TB volume with 42 bricks, replicated over three nodes).
Tom,
how did you remove the duplicates?
With 42 bricks I will not be able to do this manually..
Did a:
find $brick_root -type f -size 0 -perm 1000 -exec /bin/rm {} \;
work for you?
Should this type of thing ideally not be checked and mended
by a heal?
Does anyone have an idea yet how this happens in the first
place? Can it be connected to upgrading?
Cheers,
Olav
On 01/01/15 03:07, tben...@3vgeomatics.com wrote:
No, the files can be read on a newly mounted client! I
went ahead and deleted all of the link files associated
with these duplicates, and then remounted the volume. The
problem is fixed!
Thanks again for the help, Joe and Vijay.
Tom
--------- Original Message ---------
Subject: Re: [Gluster-users] Hundreds of duplicate files
From: "Vijay Bellur" <vbel...@redhat.com>
Date: 12/28/14 3:23 am
To: tben...@3vgeomatics.com, gluster-users@gluster.org
On 12/28/2014 01:20 PM, tben...@3vgeomatics.com wrote:
> Hi Vijay,
> Yes the files are still readable from the
.glusterfs path.
> There is no explicit error. However, trying to read
a text file in
> python simply gives me null characters:
>
> >>> open('ott_mf_itab').readlines()
>
['\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00']
>
> And reading binary files does the same
>
Is this behavior seen with a freshly mounted client too?
-Vijay
> --------- Original Message ---------
> Subject: Re: [Gluster-users] Hundreds of duplicate
files
> From: "Vijay Bellur" <vbel...@redhat.com>
> Date: 12/27/14 9:57 pm
> To: tben...@3vgeomatics.com, gluster-users@gluster.org
>
> On 12/28/2014 10:13 AM, tben...@3vgeomatics.com wrote:
> > Thanks Joe, I've read your blog post as well as
your post
> regarding the
> > .glusterfs directory.
> > I found some unneeded duplicate files which were
not being read
> > properly. I then deleted the link file from the
brick. This always
> > removes the duplicate file from the listing, but
the file does not
> > always become readable. If I also delete the
associated file in the
> > .glusterfs directory on that brick, then some
more files become
> > readable. However this solution still doesn't
work for all files.
> > I know the file on the brick is not corrupt as it
can be read
> directly
> > from the brick directory.
>
> For files that are not readable from the client,
can you check if the
> file is readable from the .glusterfs/ path?
>
> What is the specific error that is seen while
trying to read one such
> file from the client?
>
> Thanks,
> Vijay
>
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>