[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)

Johan Bernhardsson Mon, 09 Jul 2018 19:34:42 -0700

In some cases Linux does not reject a broken SATA drive; it just gets horribly slow. In my experience, it comes down to how the drive fails.

It might have shown signs in SMART, and it might have shown signs in syslog as write errors and drive queue errors.
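
As a rough illustration of the syslog signs mentioned above, here is a minimal Python sketch that counts kernel I/O error lines per device. The sample log lines and the regular expression are assumptions made for illustration; the exact wording of kernel messages varies between kernel versions, so treat the pattern as a starting point rather than a complete detector.

```python
import re
from collections import Counter

# Assumed pattern: many kernels log block-layer failures as
# "... I/O error, dev sdX, sector N". The wording differs across versions.
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector \d+")


def count_io_errors(lines):
    """Return a Counter mapping device name -> number of I/O error lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, not taken from the hosts in this thread.
sample = [
    "Jul  9 03:12:01 host kernel: blk_update_request: I/O error, dev sdb, sector 123456",
    "Jul  9 03:12:02 host kernel: blk_update_request: I/O error, dev sdb, sector 123464",
    "Jul  9 03:12:05 host systemd: Started Session 42 of user root.",
]
print(count_io_errors(sample))
```

On a real host the input would come from /var/log/messages or the kernel journal, and a non-zero count for a brick's backing device would be a reason to check that drive's SMART data as well.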

For gluster to notice that the drive is gone, the drive needs to be rejected and marked as failed in Linux; only then would gluster have reported it as dead.

This is one reason it's good practice in gluster to run a brick on a RAID volume instead of on a single drive.


/Johan

On July 10, 2018 04:21:33 Jim Kusznir <j...@palousetech.com> wrote:

Thank you for your help.
After more troubleshooting and host reboots, I accidentally discovered that the backing disk on ovirt2 (host) had suffered a failure. On reboot, the RAID card refused to see it at all. It said it had cache waiting to be written to disk, and in the end, as it couldn't (wouldn't) see that disk, I had no choice but to discard that cache and boot up without the physical disk. Since doing so (and running a gluster volume remove for the affected host), things are running like normal, although it appears it corrupted two disks (I've now lost 5 VMs to gluster-induced disk failures during poorly handled failures).
I don't understand why one bad disk wasn't simply failed, or, if one underlying process was having such a problem, why the other hosts didn't take it offline and continue (much like RAID would have done). Instead, everything broke (including gluster volumes on unaffected disks that are fully functional across all hosts), the affected machine performed very poorly, AND there were no diagnostic reports that would allude to a failing hard drive. Is this expected behavior?
--Jim

On Sun, Jul 8, 2018 at 3:54 AM, Yaniv Kaul <yk...@redhat.com> wrote:



On Sat, Jul 7, 2018 at 8:45 AM, Jim Kusznir <j...@palousetech.com> wrote:
So, I'm still at a loss... It sounds like it's either insufficient RAM/swap or insufficient network. It seems to be neither now. At this point, it appears that gluster is just "broke" and killing my systems for no discernible reason. Here are details, all from the same system (currently running 3 VMs):
[root@ovirt3 ~]# w
22:26:53 up 36 days,  4:34,  1 user,  load average: 42.78, 55.98, 53.31
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/0    192.168.8.90     22:26    2.00s  0.12s  0.11s w
bwm-ng reports the highest data usage was about 6MB/s during this test (and that was combined; I have two different gig networks. One gluster network (primary VM storage) runs on one, the other network handles everything else).
[root@ovirt3 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          31996       13236         232          18       18526       18195
Swap:         16383        1475       14908

top - 22:32:56 up 36 days,  4:41,  1 user,  load average: 17.99, 39.69, 47.66

That is indeed a high load average. How many CPUs do you have, btw?
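
The reason the CPU count matters can be sketched in a few lines of Python: a load average is easier to judge once it is divided by the number of logical CPUs. The CPU count of 16 below is a hypothetical value, since the thread does not state it; the load figure is the 1-minute value from the `w` output above.

```python
import os


def load_per_cpu(load_average, cpu_count):
    """Normalize a load average by the number of logical CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_average / cpu_count


# 42.78 is the 1-minute load average reported by `w` earlier in this message.
# 16 logical CPUs is an assumption for illustration only.
print(round(load_per_cpu(42.78, 16), 2))

# On a live host the same ratio can be taken from the running system.
try:
    one_minute = os.getloadavg()[0]
    print(load_per_cpu(one_minute, os.cpu_count() or 1))
except (OSError, AttributeError):
    print("load average not available on this platform")
```

A ratio well above 1 means tasks are queuing. On Linux the load average also counts tasks in uninterruptible sleep, which is typically I/O wait, so a high ratio alongside a mostly idle CPU points toward storage or network rather than compute.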

Tasks: 407 total,   1 running, 405 sleeping,   1 stopped,   0 zombie
%Cpu(s):  8.6 us,  2.1 sy,  0.0 ni, 87.6 id,  1.6 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 32764284 total,   228296 free, 13541952 used, 18994036 buff/cache
KiB Swap: 16777212 total, 15246200 free,  1531012 used. 18643960 avail Mem

Can you check what's swapping here? (a tweak to top output will show that)
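
One way to answer the "what's swapping" question without top is to read the VmSwap field that Linux exposes in /proc/<pid>/status. The following is a minimal sketch under that assumption; the function names are made up for this example, and the top tweak referred to above is presumably enabling its SWAP column, which this does not reproduce exactly.

```python
import os
import re


def parse_vmswap_kb(status_text):
    """Extract VmSwap (in kB) from /proc/<pid>/status text.

    Returns 0 when the field is absent, as it is for kernel threads.
    """
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def swap_by_process(proc_root="/proc"):
    """Return (swap_kb, pid, name) tuples, largest swap user first."""
    if not os.path.isdir(proc_root):
        return []
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                text = handle.read()
        except OSError:
            continue  # process exited or access was denied
        name = re.search(r"^Name:\s+(.*)$", text, re.MULTILINE)
        results.append(
            (parse_vmswap_kb(text), int(entry), name.group(1) if name else "?")
        )
    return sorted(results, reverse=True)


# Show the five largest swap users on the machine this runs on.
for swap_kb, pid, name in swap_by_process()[:5]:
    print(f"{swap_kb:>10} kB  {pid:>7}  {name}")
```

Run as root on the host, this would show whether the roughly 1.5 GB of used swap belongs to a qemu-kvm guest, a glusterfsd brick process, or something else.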


PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
30036 qemu 20 0 6872324 5.2g 13532 S 144.6 16.5 216:14.55 /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v+
28501 qemu 20 0 5034968 3.6g 12880 S 16.2 11.7 73:44.99 /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/va+
2694 root 20 0 2169224 12164 3108 S 5.0 0.0 3290:42 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
This one's certainly taking quite a bit of your CPU usage overall.
14293 root 15 -5 944700 13356 4436 S 4.0 0.0 16:32.15 /usr/sbin/glusterfs --volfile-server=192.168.8.11 --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
I'm not sure what the sorting order is, but it doesn't look like Gluster is taking a lot of memory?
25100 vdsm 0 -20 6747440 107868 12836 S 2.3 0.3 21:35.20 /usr/bin/python2 /usr/share/vdsm/vdsmd
28971 qemu 20 0 2842592 1.5g 13548 S 1.7 4.7 241:46.49 /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on -S -object secret,id=masterKey0,format=+
12095 root      20   0  162276   2836   1868 R   1.3  0.0   0:00.25 top
2708 root 20 0 1906040 12404 3080 S 1.0 0.0 1083:33 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+
28623 qemu 20 0 4749536 1.7g 12896 S 0.7 5.5 4:30.64 /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on -S -object secret,id=masterKey0,format=ra+
The VMs I see here and above together account for most? (5.2+3.6+1.5+1.7 = 12GB) - still plenty of memory left.
10 root      20   0       0      0      0 S   0.3  0.0 215:54.72 [rcu_sched]
1030 sanlock rt 0 773804 27908 2744 S 0.3 0.1 35:55.61 /usr/sbin/sanlock daemon
1890 zabbix 20 0 83904 1696 1612 S 0.3 0.0 24:30.63 /usr/sbin/zabbix_agentd: collector [idle 1 sec]
2722 root 20 0 1298004 6148 2580 S 0.3 0.0 38:10.82 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+
6340 root 20 0 0 0 0 S 0.3 0.0 0:04.30 [kworker/7:0]
10652 root 20 0 0 0 0 S 0.3 0.0 0:00.23 [kworker/u64:2]
14724 root 20 0 1076344 17400 3200 S 0.3 0.1 10:04.13 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -+
22011 root 20 0 0 0 0 S 0.3 0.0 0:05.04 [kworker/10:1]
Not sure why the system load dropped, other than that I was trying to take a picture of it :)
In any case, it appears that at this time, I have plenty of swap, RAM, and network capacity, and yet things are still running very sluggishly; I'm still getting e-mails from servers complaining about loss of communication with something or another; I still get e-mails from the engine about bad engine status, then recovery, etc.
1G isn't good enough for Gluster. It doesn't help that you have SSD, because network is certainly your bottleneck even for regular performance, not to mention when you are healing. Jumbo frames would give you an additional 5% or so - nothing to write home about.
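
The "5% or so" figure for jumbo frames can be checked with a back-of-the-envelope calculation of how much of each Ethernet frame is TCP payload. This is a sketch that assumes plain IPv4 and TCP headers without options and no VLAN tag; it only models framing overhead on the wire, not any CPU or interrupt savings.

```python
# Per-frame overhead on the wire for Ethernet: 14-byte header, 4-byte FCS,
# 8-byte preamble, and 12-byte inter-frame gap.
ETHERNET_OVERHEAD = 14 + 4 + 8 + 12
# IPv4 and TCP headers without options (20 bytes each); an assumption.
IP_TCP_HEADERS = 20 + 20


def payload_efficiency(mtu):
    """Fraction of wire bytes that carry TCP payload for a given MTU."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)


standard = payload_efficiency(1500)
jumbo = payload_efficiency(9000)
gain = jumbo / standard - 1
print(f"MTU 1500: {standard:.3f}, MTU 9000: {jumbo:.3f}, gain: {gain:.1%}")
```

Under these assumptions the gain works out to roughly 4.4%, which is in line with the estimate above and small next to the tenfold step from 1G to 10G.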
I've shut down 2/3 of my VMs, too... just trying to keep the critical ones operating.
At this point, I don't believe the problem is the memory leak, but it seems to be triggered by the memory leak, as in all my problems started when I got low RAM warnings from one of my 3 nodes and began recovery efforts from that.
I do really like the idea / concept behind glusterfs, but I really have to figure out why it's been so poorly performing from day one, and it's caused 95% of my outages (including several large ones lately). If I can get it stable, reliable, and well performing, then I'd love to keep it. If I can't, then perhaps NFS is the way to go? I don't like the single point of failure aspect of it, but my other NAS boxes I run for clients (central storage for Windows boxes) have been very solid; if I could get that kind of reliability for my ovirt stack, it would be a substantial improvement. Currently, it seems about every other month I have a gluster-induced outage.
Sometimes I wonder if it's just that hyperconverged is the issue, but my infrastructure doesn't justify three servers at the same location... I might be able to do two, but even that seems like it's pushing it.
We have many happy users running Gluster and hyperconverged. We need to understand where the failure is in your setup.
Looks like I can upgrade to 10G for about $900. I can order a dual-Xeon Supermicro 12-disk server, loaded with 2TB WD Enterprise disks and a pair of SSDs for the OS, 32GB RAM, 2.67GHz CPUs for about $720 delivered. I've got to do something to improve my reliability; I can't keep going the way I have been....
Agreed. Thanks for continuing to look into this; we'll probably need some Gluster logs to understand what's going on.
Y.


--Jim



On Fri, Jul 6, 2018 at 9:13 PM, Johan Bernhardsson <jo...@kafit.se> wrote:
Load like that is mostly I/O based: either the machine is swapping or the network is too slow. Check I/O wait in top.
And the problem where the OOM killer kills off gluster: does that mean that you don't monitor RAM usage on the servers? Either it's eating all your RAM, swap gets really I/O intensive, and then it is killed off, or you have the wrong swap settings in sysctl.conf (there are tons of broken guides that recommend setting swappiness to 0, but that disables swap on newer kernels. The proper swappiness for swapping only when necessary is 1 or a sufficiently low number like 10; the default is 60).
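
To check which swappiness value a host is actually running with, the current setting can be read from /proc/sys/vm/swappiness. The sketch below does that and classifies the value along the lines suggested above; the thresholds and wording in `describe_swappiness` paraphrase this advice and are not kernel definitions, and how a value of 0 behaves depends on the kernel version.

```python
def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the current vm.swappiness as an int, or None if unreadable."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def describe_swappiness(value):
    """Summarize a swappiness value following the advice in this message."""
    if value is None:
        return "unknown"
    if value == 0:
        return "swap avoided almost entirely; behavior depends on kernel version"
    if value <= 10:
        return "swap only under memory pressure"
    return "kernel swaps more readily (60 is the usual default)"


current = read_swappiness()
print(current, "->", describe_swappiness(current))
```

A persistent setting would normally go in sysctl.conf as a line such as `vm.swappiness = 10`.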
Moving to NFS will not improve things. You will get more memory since gluster isn't running, and that is good. But you will have a single node that can fail with all your storage, and it would still be on 1 gigabit only, and your three-node cluster would easily saturate that link.
On July 7, 2018 04:13:13 Jim Kusznir <j...@palousetech.com> wrote:
So far it does not appear to be helping much. I'm still getting VMs locking up and all kinds of notices from ovirt engine about non-responsive hosts. I'm still seeing load averages in the 20-30 range.
Jim

On Fri, Jul 6, 2018, 3:13 PM Jim Kusznir <j...@palousetech.com> wrote:
Thank you for the advice and help.
I do plan on going to 10Gbps networking; haven't quite jumped off that cliff yet, though.
I did put my data-hdd (main VM storage volume) onto a dedicated 1Gbps network, and I've watched throughput on that and never seen more than 60MB/s achieved (as reported by bwm-ng). I have a separate 1Gbps network for communication and ovirt migration, but I wanted to break that up further (separate out VM traffic from migration/mgmt traffic). My three SSD-backed gluster volumes run on the main network too, as I haven't been able to get them to move to the new network (which I was trying to use for all gluster traffic). I tried bonding, but that seemed to reduce performance rather than improve it.
--Jim
On Fri, Jul 6, 2018 at 2:52 PM, Jamie Lawrence <jlawre...@squaretrade.com> wrote:
Hi Jim,
I don't have any targeted suggestions, because there isn't much to latch on to. I can say Gluster replica three (no arbiters) on dedicated servers serving a couple of oVirt VM clusters here has not had these sorts of issues.
I suspect your long heal times (and the resultant long periods of high load) are at least partly related to 1G networking. That is just a matter of IO - heals of VMs involve moving a lot of bits. My cluster uses 10G bonded NICs on the gluster and ovirt boxes for storage traffic and separate bonded 1G for ovirtmgmt and communication with other machines/people, and we're occasionally hitting the bandwidth ceiling on the storage network. I'm starting to think about 40/100G, different ways of splitting up intensive systems, and considering iSCSI for specific volumes, although I really don't want to go there.
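
To put rough numbers on "moving a lot of bits", here is a small sketch that converts a link speed into a theoretical byte rate and estimates how long a given amount of heal data would take to cross it. The 100 GB figure is hypothetical, and the calculation ignores protocol overhead, disk speed, and competing VM traffic, so real heals will be slower.

```python
def line_rate_mb_per_s(gigabits_per_s):
    """Theoretical maximum throughput in MB/s (decimal) for a link in Gbit/s."""
    return gigabits_per_s * 1000 / 8


def transfer_seconds(size_gb, gigabits_per_s, efficiency=1.0):
    """Seconds to move size_gb (decimal GB) at a given fraction of line rate."""
    return size_gb * 1000 / (line_rate_mb_per_s(gigabits_per_s) * efficiency)


# 100 GB is a hypothetical amount of data to heal, chosen for illustration.
for speed in (1, 10):
    minutes = transfer_seconds(100, speed) / 60
    print(f"{speed}G: {line_rate_mb_per_s(speed):.0f} MB/s max, {minutes:.1f} min for 100 GB")
```

A 1G link tops out at 125 MB/s in theory, so the 50-60MB/s peaks reported elsewhere in this thread are already around 40-50% of what that network could ever deliver.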
I don't run FreeNAS[1], but I do run FreeBSD as storage servers for their excellent ZFS implementation, mostly for backups. ZFS will make your `heal` problem go away, but not your bandwidth problems, which become worse (because of fewer NICs pushing traffic). 10G hardware is not exactly in impulse-buy territory, but if you can, I'd recommend doing some testing using it. I think at least some of your problems are related.
If that's not possible, my next stops would be optimizing everything I could about sharding and healing, and optimizing for serving the shard size, to squeeze as much performance out of 1G as I could, but that will only go so far.
-j

[1] FreeNAS is just a storage-tuned FreeBSD with a GUI.
On Jul 6, 2018, at 1:19 PM, Jim Kusznir <j...@palousetech.com> wrote:

hi all:
Once again my production ovirt cluster is collapsing in on itself. My servers are intermittently unavailable or degrading, customers are noticing and calling in. This seems to be yet another gluster failure that I haven't been able to pin down.
I posted about this a while ago, but didn't get anywhere (no replies that I found). The problem started out as a glusterfsd process consuming large amounts of RAM (up to the point where RAM and swap were exhausted and the kernel OOM killer killed off the glusterfsd process). For reasons not clear to me at this time, that resulted in any VMs running on that host and that gluster volume being paused with an I/O error (the glusterfs process is usually unharmed; why it didn't continue I/O with other servers is confusing to me).
I have 3 servers and a total of 4 gluster volumes (engine, iso, data, and data-hdd). The first 3 are replica 2+arb; the 4th (data-hdd) is replica 3. The first 3 are backed by an LVM partition (some thin provisioned) on an SSD; the 4th is on a Seagate hybrid disk (HDD + some internal flash for acceleration). data-hdd is the only thing on the disk. Servers are Dell R610 with the PERC/6i RAID card, with the disks individually passed through to the OS (no RAID enabled).
The above RAM usage issue came from the data-hdd volume. Yesterday, I caught one of the glusterfsd high RAM usage events before the OOM killer had to run. I was able to migrate the VMs off the machine and, for good measure, reboot the entire machine (after taking this opportunity to run the software updates that ovirt said were pending). Upon booting back up, the necessary volume healing began. However, this time, the healing caused all three servers to go to very, very high load averages (I saw just under 200 on one server; typically they've been 40-70) with top reporting IO Wait at 7-20%. Network for this volume is a dedicated gig network. According to bwm-ng, initially the network bandwidth would hit 50MB/s (yes, bytes), but tailed off to mostly in the kB/s for a while. All machines' load averages were still 40+ and gluster volume heal data-hdd info reported 5 items needing healing. Servers were intermittently experiencing IO issues, even on the 3 gluster volumes that appeared largely unaffected. Even the OS activities on the hosts themselves (logging in, running commands) would often be very delayed. The ovirt engine was seemingly randomly throwing engine down / engine up / engine failed notifications. Responsiveness on ANY VM was horrific most of the time, with random VMs being inaccessible.
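
Watching whether the pending-heal count actually shrinks is easier with a small parser over the heal info output. The sketch below assumes the usual layout of that output (a "Brick ..." line per brick followed by a "Number of entries: N" line); the brick names, file paths, and per-brick counts in the sample are hypothetical and are not output from this cluster.

```python
import re


def pending_heals(heal_info_text):
    """Map each brick to its pending-heal count from heal-info style text."""
    pending = {}
    brick = None
    for line in heal_info_text.splitlines():
        line = line.strip()
        if line.startswith("Brick "):
            brick = line[len("Brick "):]
            continue
        match = re.match(r"Number of entries:\s*(\d+)", line)
        if match and brick is not None:
            pending[brick] = int(match.group(1))
    return pending


# Hypothetical sample in the assumed output layout.
SAMPLE = """\
Brick hostA:/bricks/data-hdd
/images/example-1
/images/example-2
Status: Connected
Number of entries: 2

Brick hostB:/bricks/data-hdd
/images/example-1
/images/example-2
/images/example-3
Status: Connected
Number of entries: 3

Brick hostC:/bricks/data-hdd
Status: Connected
Number of entries: 0
"""

counts = pending_heals(SAMPLE)
print(counts, "total:", sum(counts.values()))
```

Bricks whose count is not a number (for example, a brick that is not connected) are simply left out of the result. Logging the total every few minutes would show whether a heal is progressing or stuck.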
I let the gluster heal run overnight. By morning, there were still 5 items needing healing, all three servers were still experiencing high load, and servers were still largely unstable.
I've noticed that all of my ovirt outages (and I've had a lot, way more than is acceptable for a production cluster) have come from gluster. I still have 3 VMs whose hard disk images have become corrupted by my last gluster crash that I haven't had time to repair / rebuild yet (I believe this crash was caused by the OOM issue previously mentioned, but I didn't know it at the time).
Is gluster really ready for production yet? It seems so unstable to me.... I'm looking at replacing gluster with a dedicated NFS server, likely FreeNAS. Any suggestions? What is the "right" way to do production storage on this (3 node cluster)? Can I get this gluster volume stable enough to get my VMs to run reliably again until I can deploy another storage solution?
--Jim
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/YQX3LQFQQPW4JTCB7B6FY2LLR6NA2CB3/
