Hi Ravi,

You're right that I had mentioned using rsync to copy the brick content to a new host, but in the end I decided not to bring it up on the new host. Instead I added the original brick back into the volume, so the xattrs and symlinks to .glusterfs on the original brick are fine. I think the problem probably lies with a remove-brick that got interrupted: a few weeks ago, during maintenance, I tried to remove a brick, and after twenty minutes with no obvious progress I stopped it. After that the bricks were still part of the volume.
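In case it is useful to anyone doing the same check, here is a rough sketch of how the .glusterfs entries on a brick could be verified in bulk (the brick path is just an example from this thread; on millions of files the walk will take a long time, and it only lists suspects, it does not fix anything):

    BRICK=/mnt/gluster/apps    # example brick path
    # Regular files on a healthy brick have a hard link under .glusterfs,
    # so their link count should be at least 2; list the ones that are not:
    find "$BRICK" -path "$BRICK/.glusterfs" -prune -o -type f -links 1 -print
    # Directory gfids are stored as symlinks under .glusterfs; list broken ones:
    find "$BRICK/.glusterfs" -mindepth 3 -maxdepth 3 -type l ! -exec test -e {} \; -print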
In the last few days I ran a fix-layout that took 26 hours and finished successfully. Then I started a full index heal; it has healed about 3.3 million files so far, and I can see a clear increase in network traffic from the old brick's host to the new brick's host over that time. Once the full index heal completes I will try a rebalance.

Thank you,

On Mon, Jun 3, 2019 at 7:40 PM Ravishankar N <ravishan...@redhat.com> wrote:

> On 01/06/19 9:37 PM, Alan Orth wrote:
>
> Dear Ravi,
>
> The .glusterfs hardlinks/symlinks should be fine. I'm not sure how I could
> verify them for six bricks and millions of files, though... :\
>
> Hi Alan,
>
> The reason I asked this is because you had mentioned in one of your
> earlier emails that when you moved content from the old brick to the new
> one, you had skipped the .glusterfs directory. So I was assuming that when
> you added back this new brick to the cluster, it might have been missing
> the .glusterfs entries. If that is the case, one way to verify could be to
> check with a script that all files on the brick have a link count of at
> least 2 and that all directories have valid symlinks inside .glusterfs
> pointing to themselves.
>
> I had a small success in fixing some issues with duplicated files on the
> FUSE mount point yesterday. I read quite a bit about the elastic hashing
> algorithm that determines which files get placed on which bricks, based on
> the hash of their filename and the trusted.glusterfs.dht xattr on brick
> directories (thanks to Joe Julian's blog post and Python script for
> showing how it works¹). With that knowledge I looked closer at one of the
> files that was appearing as duplicated on the FUSE mount and found that it
> was duplicated on more bricks than the volume's `replica 2` count should
> allow. For this particular file I found two "real" files and several
> zero-size files with trusted.glusterfs.dht.linkto xattrs. Neither of the
> "real" files was on the correct brick as far as the DHT layout is
> concerned, so I copied one of them to the correct brick, deleted the
> others and their hard links, did a `stat` on the file from the FUSE mount
> point, and it fixed itself. Yay!
>
> Could this have been caused by a replace-brick that got interrupted and
> didn't finish re-labeling the xattrs?
>
> No, replace-brick only initiates AFR self-heal, which just copies the
> contents from the other brick(s) of the *same* replica pair into the
> replaced brick. The link-to files are created by DHT when you rename a
> file from the client. If the new name hashes to a different brick, DHT
> does not move the entire file there; it instead creates the link-to file
> (the one with the dht.linkto xattrs) on the hashed subvolume. The value of
> this xattr points to the brick where the actual data resides (use
> `getfattr -e text` to see it for yourself). Perhaps you had attempted a
> rebalance or remove-brick earlier and interrupted that?
>
> Should I be thinking of some heuristics to identify and fix these issues
> with a script (incorrect brick placement), or is this something a fix
> layout or repeated volume heals can fix? I've already completed a whole
> heal on this particular volume this week and it did heal about 1,000,000
> files (mostly data and metadata, but about 20,000 entry heals as well).
>
> Maybe you should let the AFR self-heals complete first and then attempt a
> full rebalance to take care of the dht link-to files. But if the files are
> in the millions, it could take quite some time to complete.
>
> Regards,
> Ravi
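As a concrete illustration of Ravi's `getfattr -e text` suggestion, something like the following should show the link-to target in readable form (the path is the example file that comes up later in this thread):

    getfattr -n trusted.glusterfs.dht.linkto -e text \
        /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
    # should print the name of the DHT subvolume (an AFR replica pair) that
    # actually holds the data, e.g. trusted.glusterfs.dht.linkto="apps-replicate-2"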
> Thanks for your support,
>
> ¹ https://joejulian.name/post/dht-misses-are-expensive/
>
> On Fri, May 31, 2019 at 7:57 AM Ravishankar N <ravishan...@redhat.com> wrote:
>
>> On 31/05/19 3:20 AM, Alan Orth wrote:
>>
>> Dear Ravi,
>>
>> I spent a bit of time inspecting the xattrs on some files and directories
>> on a few bricks for this volume and it looks a bit messy. Even if I could
>> make sense of it for a few and potentially heal them manually, there are
>> millions of files and directories in total, so that's definitely not a
>> scalable solution. After a few missteps with `replace-brick ... commit
>> force` in the last week (one of which was on a brick that was
>> dead/offline), as well as some premature `remove-brick` commands, I'm
>> unsure how to proceed and I'm getting demotivated. It's scary how quickly
>> things get out of hand in distributed systems...
>>
>> Hi Alan,
>> The one good thing about gluster is that the data is always available
>> directly on the backend bricks even if your volume has inconsistencies at
>> the gluster level. So theoretically, if your cluster is FUBAR, you could
>> just create a new volume and copy all data onto it via its mount from the
>> old volume's bricks.
>>
>> I had hoped that bringing the old brick back up would help, but by the
>> time I added it again a few days had passed, all the brick-ids had changed
>> due to the replace/remove brick commands, and the
>> trusted.afr.$volume-client-xx values were probably pointing to the wrong
>> bricks by then (?).
>>
>> Anyways, a few hours ago I started a full heal on the volume and I see
>> that there is a sustained 100MiB/sec of network traffic going from the old
>> brick's host to the new one. The completed heals reported in the logs look
>> promising too:
>>
>> Old brick host:
>>
>> # grep '2019-05-30' /var/log/glusterfs/glustershd.log | grep -o -E \
>>     'Completed (data|metadata|entry) selfheal' | sort | uniq -c
>>  281614 Completed data selfheal
>>      84 Completed entry selfheal
>>  299648 Completed metadata selfheal
>>
>> New brick host:
>>
>> # grep '2019-05-30' /var/log/glusterfs/glustershd.log | grep -o -E \
>>     'Completed (data|metadata|entry) selfheal' | sort | uniq -c
>>  198256 Completed data selfheal
>>   16829 Completed entry selfheal
>>  229664 Completed metadata selfheal
>>
>> So that's good, I guess, though I have no idea how long it will take or
>> whether it will fix the "missing files" issue on the FUSE mount. I've
>> increased cluster.shd-max-threads to 8 to hopefully speed up the heal
>> process.
>>
>> The afr xattrs should not cause files to disappear from the mount. If the
>> xattr names do not match what each AFR subvolume expects for its children
>> (e.g. in a replica 2 volume, trusted.afr.*-client-{0,1} for the 1st
>> subvolume, client-{2,3} for the 2nd, and so on), then it won't heal the
>> data, that is all. But in your case I see some inconsistencies, like one
>> brick having the actual file (licenseserver.cfg) and the other having a
>> linkto file (the one with the dht.linkto xattr) *in the same replica
>> pair*.
>>
>> I'd be happy for any advice or pointers,
>>
>> Did you check if the .glusterfs hardlinks/symlinks exist and are in order
>> for all bricks?
>>
>> -Ravi
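For a single file, that last question can be answered by checking that the file and its .glusterfs gfid entry share an inode; a rough sketch using the gfid from the getfattr output further down in this thread (the brick path is just an example):

    BRICK=/data/glusterfs/sdb/apps    # example brick path
    GFID=878003a2-fb52-43b6-a0d1-4d2f8b4306bd
    ls -li "$BRICK/clcgenomics/clclicsrv/licenseserver.cfg" \
           "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
    # Both paths should report the same inode number and a link count of at
    # least 2; for a directory the .glusterfs entry is a symlink instead.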
>> On Wed, May 29, 2019 at 5:20 PM Alan Orth <alan.o...@gmail.com> wrote:
>>
>>> Dear Ravi,
>>>
>>> Thank you for the link to the blog post series, it is very informative
>>> and current! If I understand your blog post correctly then I think the
>>> answer to your previous question about pending AFRs is: no, there are no
>>> pending AFRs. I have identified one file that is a good test case to try
>>> to understand what happened after I issued the `gluster volume
>>> replace-brick ... commit force` a few days ago and then added the same
>>> original brick back to the volume later. This is the current state of the
>>> replica 2 distribute/replicate volume:
>>>
>>> [root@wingu0 ~]# gluster volume info apps
>>>
>>> Volume Name: apps
>>> Type: Distributed-Replicate
>>> Volume ID: f118d2da-79df-4ee1-919d-53884cd34eda
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 3 x 2 = 6
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: wingu3:/mnt/gluster/apps
>>> Brick2: wingu4:/mnt/gluster/apps
>>> Brick3: wingu05:/data/glusterfs/sdb/apps
>>> Brick4: wingu06:/data/glusterfs/sdb/apps
>>> Brick5: wingu0:/mnt/gluster/apps
>>> Brick6: wingu05:/data/glusterfs/sdc/apps
>>> Options Reconfigured:
>>> diagnostics.client-log-level: DEBUG
>>> storage.health-check-interval: 10
>>> nfs.disable: on
>>>
>>> I checked the xattrs of one file that is missing from the volume's FUSE
>>> mount (though I can read it if I access its full path explicitly), but is
>>> present in several of the volume's bricks (some with full size, others
>>> empty):
>>>
>>> [root@wingu0 ~]# getfattr -d -m. -e hex /mnt/gluster/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: mnt/gluster/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>> trusted.afr.apps-client-3=0x000000000000000000000000
>>> trusted.afr.apps-client-5=0x000000000000000000000000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.bit-rot.version=0x0200000000000000585a396f00046e15
>>> trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
>>>
>>> [root@wingu05 ~]# getfattr -d -m. -e hex /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>> trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
>>> trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
>>> trusted.glusterfs.dht.linkto=0x617070732d7265706c69636174652d3200
>>>
>>> [root@wingu05 ~]# getfattr -d -m. -e hex /data/glusterfs/sdc/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: data/glusterfs/sdc/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>> trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
>>> trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
>>>
>>> [root@wingu06 ~]# getfattr -d -m. -e hex /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>> trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
>>> trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
>>> trusted.glusterfs.dht.linkto=0x617070732d7265706c69636174652d3200
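Those hex values are easier to interpret once decoded. A small illustration using the dht.linkto value from the output above (xxd is just one convenient hex decoder; any equivalent tool works):

    # Strip the 0x prefix and turn the hex payload back into text:
    echo 617070732d7265706c69636174652d3200 | xxd -r -p
    # prints "apps-replicate-2" (plus a trailing NUL), i.e. the DHT
    # subvolume the link-to file points at

The trusted.gfid2path value decodes the same way, to <parent-directory-gfid>/licenseserver.cfg.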
>>> According to the trusted.afr.apps-client-xx xattrs this particular file
>>> should be on the bricks with id "apps-client-3" and "apps-client-5". It
>>> took me a few hours to realize that the brick-id values are recorded in
>>> the volume's volfiles in /var/lib/glusterd/vols/apps/bricks. After
>>> comparing those brick-id values with a volfile backup from before the
>>> replace-brick, I realized that the files are simply on the wrong brick
>>> now as far as Gluster is concerned. This particular file is now on the
>>> brick for "apps-client-4". As an experiment I copied this one file to the
>>> two bricks listed in the xattrs and I was then able to see the file from
>>> the FUSE mount (yay!).
>>>
>>> Other than replacing the brick, removing it, and then adding the old
>>> brick on the original server back, there has been no change in the data
>>> this entire time. Can I change the brick IDs in the volfiles so they
>>> reflect where the data actually is? Or perhaps script something to reset
>>> all the xattrs on the files/directories to point to the correct bricks?
>>>
>>> Thank you for any help or pointers,
>>>
>>> On Wed, May 29, 2019 at 7:24 AM Ravishankar N <ravishan...@redhat.com> wrote:
>>>
>>>> On 29/05/19 9:50 AM, Ravishankar N wrote:
>>>>
>>>> On 29/05/19 3:59 AM, Alan Orth wrote:
>>>>
>>>> Dear Ravishankar,
>>>>
>>>> I'm not sure whether Brick4 had pending AFRs, because I don't know what
>>>> that means, and it's been a few days, so I am not sure I would still be
>>>> able to find that information.
>>>>
>>>> When you find some time, have a look at a blog series I wrote about AFR
>>>> <http://wp.me/peiBB-6b>. I've tried to explain what one needs to know to
>>>> debug replication-related issues in it.
>>>>
>>>> Made a typo error; the URL for the blog is https://wp.me/peiBB-6b
>>>>
>>>> -Ravi
>>>>
>>>> Anyways, after wasting a few days rsyncing the old brick to a new host
>>>> I decided to just try to add the old brick back into the volume instead
>>>> of bringing it up on the new host. I created a new brick directory on
>>>> the old host, moved the old brick's contents into that new directory
>>>> (minus the .glusterfs directory), added the new brick to the volume, and
>>>> then did Vlad's find/stat trick¹ from the brick to the FUSE mount point.
>>>>
>>>> The interesting problem I have now is that some files don't appear in
>>>> the FUSE mount's directory listings, but I can actually list them
>>>> directly and even read them. What could cause that?
>>>>
>>>> Not sure; there are too many variables in the hacks that you did to take
>>>> a guess. You can check whether the contents of the .glusterfs folder are
>>>> in order on the new brick (for example, that hard links for files and
>>>> symlinks for directories are present, etc.).
>>>>
>>>> Regards,
>>>> Ravi
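Since Alan mentions above that the brick-id values are recorded under /var/lib/glusterd/vols/apps/bricks, one quick way to map the trusted.afr.apps-client-N names to physical bricks is something like the following sketch (it assumes the per-brick files in that directory carry a brick-id key, which recent glusterd versions write):

    # Each file under bricks/ describes one brick; brick-id holds the
    # apps-client-N name that AFR/DHT use for it.
    grep -H 'brick-id' /var/lib/glusterd/vols/apps/bricks/*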
>>>> Thanks,
>>>>
>>>> ¹ https://lists.gluster.org/pipermail/gluster-users/2018-February/033584.html
>>>>
>>>> On Fri, May 24, 2019 at 4:59 PM Ravishankar N <ravishan...@redhat.com> wrote:
>>>>
>>>>> On 23/05/19 2:40 AM, Alan Orth wrote:
>>>>>
>>>>> Dear list,
>>>>>
>>>>> I seem to have gotten into a tricky situation. Today I brought up a
>>>>> shiny new server with new disk arrays and attempted to replace one
>>>>> brick of a replica 2 distribute/replicate volume on an older server
>>>>> using the `replace-brick` command:
>>>>>
>>>>> # gluster volume replace-brick homes wingu0:/mnt/gluster/homes wingu06:/data/glusterfs/sdb/homes commit force
>>>>>
>>>>> The command was successful and I see the new brick in the output of
>>>>> `gluster volume info`. The problem is that Gluster doesn't seem to be
>>>>> migrating the data,
>>>>>
>>>>> `replace-brick` definitely must heal (not migrate) the data. In your
>>>>> case, data must have been healed from Brick-4 to the replaced Brick-3.
>>>>> Are there any errors in the self-heal daemon logs of Brick-4's node?
>>>>> Does Brick-4 have pending AFR xattrs blaming Brick-3? The doc is a bit
>>>>> out of date; the replace-brick command internally does all the setfattr
>>>>> steps that are mentioned in it.
>>>>>
>>>>> -Ravi
>>>>>
>>>>> and now the original brick that I replaced is no longer part of the
>>>>> volume (and a few terabytes of data are just sitting on the old brick):
>>>>>
>>>>> # gluster volume info homes | grep -E "Brick[0-9]:"
>>>>> Brick1: wingu4:/mnt/gluster/homes
>>>>> Brick2: wingu3:/mnt/gluster/homes
>>>>> Brick3: wingu06:/data/glusterfs/sdb/homes
>>>>> Brick4: wingu05:/data/glusterfs/sdb/homes
>>>>> Brick5: wingu05:/data/glusterfs/sdc/homes
>>>>> Brick6: wingu06:/data/glusterfs/sdc/homes
>>>>>
>>>>> I see the Gluster docs have a more complicated procedure for replacing
>>>>> bricks that involves getfattr/setfattr¹. How can I tell Gluster about
>>>>> the old brick? I see that I have a backup of the old volfile thanks to
>>>>> yum's rpmsave function, if that helps.
>>>>>
>>>>> We are using Gluster 5.6 on CentOS 7. Thank you for any advice you can give.
>>>>>
>>>>> ¹ https://docs.gluster.org/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-faulty-brick
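For Ravi's question about pending AFR xattrs, they can be read straight off the brick roots; a rough sketch for Brick-4 of the homes volume, run on wingu05 (path taken from the brick list above):

    # Any non-zero trusted.afr.homes-client-* value here (or on files below)
    # means heals are pending that blame the corresponding replica member.
    getfattr -d -m trusted.afr -e hex /data/glusterfs/sdb/homes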
--
Alan Orth
alan.o...@gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch
"In heaven all the interesting people are missing." ―Friedrich Nietzsche
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users