On Sun, Aug 6, 2017 at 4:42 AM, Jim Kusznir <j...@palousetech.com> wrote:
> Well, after a very stressful weekend, I think I have things largely
> working. Turns out that most of the above issues were caused by the Linux
> permissions of the exports for all three volumes (they had been reset to
> 600; setting them to 774 or 770 fixed many of the issues). Of course, I
> didn't find that until a much more harrowing outage, and hours and hours
> of work, including beginning to look at rebuilding my cluster....
>
> So, now my cluster is operating again, and everything looks good EXCEPT
> for one major Gluster issue/question that I haven't found any references
> or info on.
>
> My host ovirt2, one of the replica Gluster servers, is the one that lost
> its storage and had to reinitialize it from the cluster. The iso volume
> is perfectly fine and complete, but the engine and data volumes are
> smaller on disk on this node than on the other node (and on this node
> before the crash). On the engine store, the entire cluster reports the
> smaller utilization on mounted Gluster filesystems; on the data
> partition, it reports the larger size (rest of cluster).
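[For concreteness, the permissions fix described above might look like the
following sketch. The brick paths are taken from the df output later in
this thread; the vdsm:kvm ownership shown in the comment is an assumption
about a typical oVirt/Gluster setup, not something verified here.]

```shell
# Sketch of the permissions fix described above. Brick paths come from
# the df output in this thread; adjust for your own layout. The vdsm:kvm
# owner in the example output is an assumption (the usual oVirt setup).
bricks="/gluster/brick1 /gluster/brick2 /gluster/brick4"
for b in $bricks; do
    if [ -d "$b" ]; then
        stat -c '%a %U:%G %n' "$b"   # show current mode and owner, e.g. "600 vdsm:kvm ..."
        chmod 770 "$b"               # restore group access lost by the 600 reset
    fi
done
```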
> Here's some df statements to help clarify
> (brick1 = engine; brick2 = data; brick4 = iso):
>
> Filesystem                  Size  Used Avail Use% Mounted on
> /dev/mapper/gluster-engine   25G   12G   14G  47% /gluster/brick1
> /dev/mapper/gluster-data    136G  125G   12G  92% /gluster/brick2
> /dev/mapper/gluster-iso      25G  7.3G   18G  29% /gluster/brick4
> 192.168.8.11:/engine         15G  9.7G  5.4G  65% /rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
> 192.168.8.11:/data          136G  125G   12G  92% /rhev/data-center/mnt/glusterSD/192.168.8.11:_data
> 192.168.8.11:/iso            13G  7.3G  5.8G  56% /rhev/data-center/mnt/glusterSD/192.168.8.11:_iso
>
> View from ovirt2:
> Filesystem                  Size  Used Avail Use% Mounted on
> /dev/mapper/gluster-engine   15G  9.7G  5.4G  65% /gluster/brick1
> /dev/mapper/gluster-data    174G  119G   56G  69% /gluster/brick2
> /dev/mapper/gluster-iso      13G  7.3G  5.8G  56% /gluster/brick4
> 192.168.8.11:/engine         15G  9.7G  5.4G  65% /rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
> 192.168.8.11:/data          136G  125G   12G  92% /rhev/data-center/mnt/glusterSD/192.168.8.11:_data
> 192.168.8.11:/iso            13G  7.3G  5.8G  56% /rhev/data-center/mnt/glusterSD/192.168.8.11:_iso
>
> As you can see, in the process of rebuilding the hard drive for ovirt2,
> I did resize some things to give more space to data, where I desperately
> need it. If this goes well and the storage is given a clean bill of
> health at this time, then I will take ovirt1 down and resize to match
> ovirt2, and thus score a decent increase in storage for data. I fully
> realize that right now the Gluster-mounted volumes should report the
> total size of the smallest brick in the replica.
>
> So, is this size reduction appropriate? A big part of me thinks data is
> missing, but I even went through and shut down ovirt2's Gluster daemons,
> wiped all the Gluster data, and restarted Gluster to allow it a fresh
> heal attempt, and it again came back to the exact same size.
> This cluster was originally built about the time oVirt 4.0 came out, and
> has been upgraded to 'current', so perhaps some new Gluster features are
> making more efficient use of space (dedupe or something)?

The used capacity should be consistent on all nodes - I see you have a
discrepancy with the data volume brick. What does "gluster vol heal data
info" tell you? Are there entries to be healed? Can you provide the
glustershd logs?

> Thank you for your assistance!
> --Jim
>
> On Fri, Aug 4, 2017 at 7:49 PM, Jim Kusznir <j...@palousetech.com> wrote:
>
>> Hi all:
>>
>> Today has been rough. Two of my three nodes went down today, and self
>> heal has not been healing well. 4 hours later, VMs are running, but the
>> engine is not happy. It claims the storage domain is down (even though
>> it is up on all hosts and VMs are running). I'm getting a ton of these
>> messages logging:
>>
>> VDSM engine3 command HSMGetAllTasksStatusesVDS failed: Not SPM
>> Aug 4, 2017 7:23:00 PM
>>
>> VDSM engine3 command SpmStatusVDS failed: Error validating master
>> storage domain: ('MD read error',)
>> Aug 4, 2017 7:22:49 PM
>>
>> VDSM engine3 command ConnectStoragePoolVDS failed: Cannot find master
>> domain: u'spUUID=5868392a-0148-02cf-014d-000000000121,
>> msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'
>> Aug 4, 2017 7:22:47 PM
>>
>> VDSM engine1 command ConnectStoragePoolVDS failed: Cannot find master
>> domain: u'spUUID=5868392a-0148-02cf-014d-000000000121,
>> msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'
>> Aug 4, 2017 7:22:46 PM
>>
>> VDSM engine2 command SpmStatusVDS failed: Error validating master
>> storage domain: ('MD read error',)
>> Aug 4, 2017 7:22:44 PM
>>
>> VDSM engine2 command ConnectStoragePoolVDS failed: Cannot find master
>> domain: u'spUUID=5868392a-0148-02cf-014d-000000000121,
>> msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'
>> Aug 4, 2017 7:22:42 PM
>>
>> VDSM engine1 command HSMGetAllTasksStatusesVDS failed: Not SPM: ()
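[For reference, the heal diagnostics asked about above could be gathered
roughly as follows. The volume name "data" comes from this thread; the
glustershd log path is the Gluster default, and `heal_status` is just an
illustrative helper name.]

```shell
# Diagnostics for the heal question above. "data" is the volume name from
# this thread; /var/log/glusterfs/glustershd.log is the default self-heal
# daemon log. Wrapped in a helper so it can be reused per volume.
heal_status() {
    vol=${1:-data}
    gluster volume heal "$vol" info                  # entries still pending heal, per brick
    gluster volume heal "$vol" statistics heal-count # summary counts per brick
    tail -n 50 /var/log/glusterfs/glustershd.log     # recent self-heal daemon activity
}
# Usage (on a Gluster node): heal_status data
```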
>> ------------
>> I cannot set an SPM as it claims the storage domain is down; I cannot
>> set the storage domain up.
>>
>> Also in the storage realm, one of my exports shows substantially less
>> data than is actually there.
>>
>> Here's what happened, as best as I understood it:
>> I went to do maintenance on ovirt2 (needed to replace a faulty RAM
>> stick and rework the disk). I put it in maintenance mode, then shut it
>> down and did my work. In the process, much of the disk contents was
>> lost (all the Gluster data). I figured, no big deal, the Gluster data
>> is redundant on the network; it will heal when it comes back up.
>>
>> While I was doing maintenance, all but one of the VMs were running on
>> engine1. When I turned on engine2, all of a sudden all VMs including
>> the main engine stopped and went non-responsive. As far as I can tell,
>> this should not have happened, as I turned ON one host, but nonetheless
>> I waited for recovery to occur (while customers started calling asking
>> why everything stopped working....). As I waited, I was checking, and
>> gluster volume status only showed ovirt1 and ovirt2.... Apparently
>> Gluster had stopped/failed at some point on ovirt3. I assume that was
>> the cause of the outage; still, if everything was working fine with
>> ovirt1's Gluster, and ovirt2 powers on with a very broken Gluster (the
>> volume status was showing N/A in the port fields for the Gluster
>> volumes), I would not expect a working Gluster to go stupid like that.
>>
>> After starting ovirt3's glusterd and checking the status, all three
>> showed ovirt1 and ovirt3 as operational, and ovirt2 as N/A.
>> Unfortunately, recovery was still not happening, so I did some googling
>> and found the commands to inquire about the hosted-engine status. It
>> appeared to be stuck "paused" and I couldn't find a way to unpause it,
>> so I powered it off, then started it manually on engine1, and the
>> cluster came back up. It showed all VMs paused.
>> I was able to unpause them and they worked again.
>>
>> So now I began to work the ovirt2 Gluster healing problem. It didn't
>> appear to be self-healing, but eventually I found this document:
>> https://support.rackspace.com/how-to/recover-from-a-failed-server-in-a-glusterfs-array/
>> and from that found the magic xattr commands. After setting them, the
>> Gluster volumes on ovirt2 came online. I told iso to heal, and it did,
>> but it only came up with about half as much data as it should have. I
>> told it to heal full, and it did finish off the remaining data and came
>> up to full. I then told engine to do a full heal (gluster volume heal
>> engine full), and it transferred its data from the other Gluster hosts
>> too. However, it said it was done when it hit 9.7GB while there was
>> 15GB on disk! It is still stuck that way; the oVirt GUI and gluster
>> volume heal engine info both show the volume fully healed, but it is
>> not:
>>
>> [root@ovirt1 ~]# df -h
>> Filesystem                     Size  Used Avail Use% Mounted on
>> /dev/mapper/centos_ovirt-root   20G  4.2G   16G  21% /
>> devtmpfs                        16G     0   16G   0% /dev
>> tmpfs                           16G   16K   16G   1% /dev/shm
>> tmpfs                           16G   26M   16G   1% /run
>> tmpfs                           16G     0   16G   0% /sys/fs/cgroup
>> /dev/mapper/gluster-engine      25G   12G   14G  47% /gluster/brick1
>> /dev/sda1                      497M  315M  183M  64% /boot
>> /dev/mapper/gluster-data       136G  124G   13G  92% /gluster/brick2
>> /dev/mapper/gluster-iso         25G  7.3G   18G  29% /gluster/brick4
>> tmpfs                          3.2G     0  3.2G   0% /run/user/0
>> 192.168.8.11:/engine            15G  9.7G  5.4G  65% /rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
>> 192.168.8.11:/data             136G  124G   13G  92% /rhev/data-center/mnt/glusterSD/192.168.8.11:_data
>> 192.168.8.11:/iso               13G  7.3G  5.8G  56% /rhev/data-center/mnt/glusterSD/192.168.8.11:_iso
>>
>> This is from ovirt1, and before the work, both ovirt1's and ovirt2's
>> bricks had the same usage. ovirt2's bricks and the Gluster mountpoints
>> agree on iso and engine, but as you can see, not here.
>> If I do a du -sh on
>> /rhev/data-center/mnt/glusterSD/..../_engine, it comes back with the
>> 12GB number (/brick1 is engine, brick2 is data, and brick4 is iso).
>> However, gluster still says it's only 9.7G. I haven't figured out how
>> to get it to finish "healing".
>>
>> data is in the process of healing currently.
>>
>> So, I think I have two main things to solve right now:
>>
>> 1) How do I get ovirt to see the data center/storage domain as online
>> again?
>> 2) How do I get engine to finish healing to ovirt2?
>>
>> Thanks all for reading this very long message!
>> --Jim
>
> _______________________________________________
> Users mailing list
> Users@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
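[The "magic xattr" recovery referenced above (from the Rackspace article)
is roughly the standard replace-a-failed-brick procedure. A minimal
sketch, as a hypothetical helper: `recover_brick`, the example volume and
brick names, and the placeholder volume-id are all illustrative; the real
volume-id must be read off a healthy peer, never guessed.]

```shell
# Sketch of the failed-brick recovery flow ("magic xattr commands")
# referenced above. recover_brick is a hypothetical helper; the volume-id
# MUST be read from a healthy peer first (see the getfattr line below).
recover_brick() {
    vol=$1      # e.g. engine
    brick=$2    # e.g. /gluster/brick1/engine
    volid=$3    # hex id copied from a healthy node
    systemctl stop glusterd
    # Stamp the rebuilt brick with the volume id so glusterd accepts it:
    setfattr -n trusted.glusterfs.volume-id -v "$volid" "$brick"
    systemctl start glusterd
    gluster volume heal "$vol" full   # force a full crawl, not just the heal index
    gluster volume heal "$vol" info   # repeat until no entries remain
}
# Read the id on a healthy peer first, for example:
#   getfattr -d -m . -e hex /gluster/brick1/engine | grep volume-id
```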
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users