Hi,

Have you checked for any file system errors on the brick mount point? I once ran into weird I/O errors and xfs_repair fixed the issue.

What about the heal? Does it report any pending heals?
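If you want to check both, something along these lines should do it. This is only a sketch: "palantir" is the volume name from your status output below, and /dev/sdXN is just a placeholder for whatever device actually backs /var/local/brick0 (xfs_repair should only be run against an unmounted filesystem):

    # list any files/gfids still pending heal on the volume
    gluster volume heal palantir info

    # read-only filesystem check of the brick; unmount the brick first and
    # substitute the real device behind /var/local/brick0 for /dev/sdXN
    xfs_repair -n /dev/sdXN

Regarding your question about forcing a full resync from azathoth's brick, I've put a couple of suggestions below the quoted message.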
On Feb 15, 2018 14:20, "Dave Sherohman" <d...@sherohman.org> wrote:

> Well, it looks like I've stumped the list, so I did a bit of additional
> digging myself:
>
> azathoth replicates with yog-sothoth, so I compared their brick
> directories. `ls -R /var/local/brick0/data | md5sum` gives the same
> result on both servers, so the filenames are identical in both bricks.
> However, `du -s /var/local/brick0/data` shows that azathoth has about 3G
> more data (445G vs 442G) than yog.
>
> This seems consistent with my assumption that the problem is on
> yog-sothoth (everything is fine with only azathoth; there are problems
> with only yog-sothoth) and I am reminded that a few weeks ago,
> yog-sothoth was offline for 4-5 days, although it should have been
> brought back up-to-date once it came back online.
>
> So, assuming that the issue is stale/missing data on yog-sothoth, is
> there a way to force gluster to do a full refresh of the data from
> azathoth's brick to yog-sothoth's brick? I would have expected running
> heal and/or rebalance to do that sort of thing, but I've run them both
> (with and without fix-layout on the rebalance) and the problem persists.
>
> If there isn't a way to force a refresh, how risky would it be to kill
> gluster on yog-sothoth, wipe everything from /var/local/brick0, and then
> re-add it to the cluster as if I were replacing a physically failed
> disk? Seems like that should work in principle, but it feels dangerous
> to wipe the partition and rebuild, regardless.
>
> On Tue, Feb 13, 2018 at 07:33:44AM -0600, Dave Sherohman wrote:
> > I'm using gluster for a virt-store with 3x2 distributed/replicated
> > servers for 16 qemu/kvm/libvirt virtual machines using image files
> > stored in gluster and accessed via libgfapi. Eight of these disk images
> > are standalone, while the other eight are qcow2 images which all share a
> > single backing file.
> >
> > For the most part, this is all working very well. However, one of the
> > gluster servers (azathoth) causes three of the standalone VMs and all 8
> > of the shared-backing-image VMs to fail if it goes down. Any of the
> > other gluster servers can go down with no problems; only azathoth causes
> > issues.
> >
> > In addition, the kvm hosts have the gluster volume fuse mounted and one
> > of them (out of five) detects an error on the gluster volume and puts
> > the fuse mount into read-only mode if azathoth goes down. libgfapi
> > connections to the VM images continue to work normally from this host
> > despite this and the other four kvm hosts are unaffected.
> >
> > It initially seemed relevant that I have the libgfapi URIs specified as
> > gluster://azathoth/..., but I've tried changing them to make the initial
> > connection via other gluster hosts and it had no effect on the problem.
> > Losing azathoth still took them out.
> >
> > In addition to changing the mount URI, I've also manually run a heal and
> > rebalance on the volume, enabled the bitrot daemons (then turned them
> > back off a week later, since they reported no activity in that time),
> > and copied one of the standalone images to a new file in case it was a
> > problem with the file itself. As far as I can tell, none of these
> > attempts changed anything.
> >
> > So I'm at a loss. Is this a known type of problem? If so, how do I fix
> > it? If not, what's the next step to troubleshoot it?
> >
> >
> > # gluster --version
> > glusterfs 3.8.8 built on Jan 11 2017 14:07:11
> > Repository revision: git://git.gluster.com/glusterfs.git
> >
> > # gluster volume status
> > Status of volume: palantir
> > Gluster process                              TCP Port  RDMA Port  Online  Pid
> > ------------------------------------------------------------------------------
> > Brick saruman:/var/local/brick0/data         49154     0          Y       10690
> > Brick gandalf:/var/local/brick0/data         49155     0          Y       18732
> > Brick azathoth:/var/local/brick0/data        49155     0          Y       9507
> > Brick yog-sothoth:/var/local/brick0/data     49153     0          Y       39559
> > Brick cthulhu:/var/local/brick0/data         49152     0          Y       2682
> > Brick mordiggian:/var/local/brick0/data      49152     0          Y       39479
> > Self-heal Daemon on localhost                N/A       N/A        Y       9614
> > Self-heal Daemon on saruman.lub.lu.se        N/A       N/A        Y       15016
> > Self-heal Daemon on cthulhu.lub.lu.se        N/A       N/A        Y       9756
> > Self-heal Daemon on gandalf.lub.lu.se        N/A       N/A        Y       5962
> > Self-heal Daemon on mordiggian.lub.lu.se     N/A       N/A        Y       8295
> > Self-heal Daemon on yog-sothoth.lub.lu.se    N/A       N/A        Y       7588
> >
> > Task Status of Volume palantir
> > ------------------------------------------------------------------------------
> > Task                 : Rebalance
> > ID                   : c38e11fe-fe1b-464d-b9f5-1398441cc229
> > Status               : completed
> >
> >
> > --
> > Dave Sherohman
>
> --
> Dave Sherohman
>
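Coming back to your question about forcing a full refresh from azathoth's brick to yog-sothoth's: before wiping anything, I would trigger a full self-heal crawl -- a plain heal only processes entries already marked as needing heal, while "full" walks the entire brick. If that still doesn't bring yog-sothoth up to date, the brick can be rebuilt under gluster's control with replace-brick instead of wiping the directory by hand. Rough sketch only; I haven't verified the exact syntax against 3.8.x, so please check the docs for your version first, and the new brick path is just an example:

    # full self-heal crawl, then check what is still pending
    gluster volume heal palantir full
    gluster volume heal palantir info

    # if the brick has to be rebuilt: point the volume at a fresh, empty
    # directory on yog-sothoth and let gluster resync it from azathoth
    gluster volume replace-brick palantir \
        yog-sothoth:/var/local/brick0/data \
        yog-sothoth:/var/local/brick0/data-new \
        commit force

That is essentially the same operation as replacing a physically failed disk, but it keeps gluster driving the resync rather than you deleting data out from under a configured brick.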
_______________________________________________ Gluster-users mailing list Gluster-users@gluster.org http://lists.gluster.org/mailman/listinfo/gluster-users