I have also noticed that for the last 4 or 5 weeks (since I upgraded from 4.1 to 4.2) I
have had to refresh the servers (maintenance, reboot) almost every week
to release the RAM. If I leave them alone, the RAM will eventually be depleted by the
gluster services. I am running gluster 3.12.9-1 with ovirt 4.2.4.5-1.el7.
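
In case anyone wants to watch for the same thing, something like this (a rough sketch; the
log path and interval are arbitrary) is enough to see whether the gluster processes' RSS
only ever grows between reboots:

    #!/bin/bash
    # Append a timestamped RSS snapshot of the gluster daemons to a log file.
    # Run it from cron (e.g. every 15 minutes) and compare values over a few days.
    LOG=/var/log/gluster-rss.log   # example path, pick your own
    {
      date '+%F %T'
      ps -C glusterd,glusterfs,glusterfsd -o pid=,rss=,comm= |
        awk '{ printf "  pid=%s rss_mb=%.1f cmd=%s\n", $1, $2/1024, $3 }'
    } >> "$LOG"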

Alex

On Mon, Jul 9, 2018 at 6:08 PM, Edward Clay <edward.c...@uk2group.com>
wrote:

> Just to add my .02 here.  I've opened a bug on this issue where HV/hosts
> connected to glusterfs volumes are running out of RAM.  This seemed to be a
> bug fixed in gluster 3.13, but that patch doesn't seem to be available any
> longer, and 3.12 is what ovirt is using.  For example, I have a host that was
> showing 72% memory consumption with 3 VMs running on it.  If I migrate
> those VMs to another host, memory consumption drops to 52%.  If I put this
> host into maintenance and then activate it, it drops down to 2% or so.
> Since I ran into this issue I've been manually watching memory consumption
> on each host and migrating VMs from it to others to keep things from
> dying.  I'm hoping that with the announcement of gluster 3.12 end of life and
> the move to gluster 4.1 this will get fixed, or that the patch from
> 3.13 can get backported, so this problem will go away.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1593826
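>
> In case it clarifies what I mean by manually watching: a rough sketch (the hostnames
> are placeholders for your HVs) that prints overall memory use and the largest gluster
> processes on each host:
>
>     for h in hv1 hv2 hv3; do   # placeholder hostnames
>       echo "== $h =="
>       ssh "$h" "free -m | grep ^Mem:; ps -C glusterfsd,glusterfs -o pid=,rss=,comm= --sort=-rss | head -3"
>     done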
>
> On 07/07/2018 11:49 AM, Jim Kusznir wrote:
>
> This host has NO VMs running on it, only 3 running cluster-wide (including
> the engine, which is on its own storage):
>
> top - 10:44:41 up 1 day, 17:10,  1 user,  load average: 15.86, 14.33, 13.39
> Tasks: 381 total,   1 running, 379 sleeping,   1 stopped,   0 zombie
> %Cpu(s):  2.7 us,  2.1 sy,  0.0 ni, 89.0 id,  6.1 wa,  0.0 hi,  0.2 si,  0.0 st
> KiB Mem : 32764284 total,   338232 free,   842324 used, 31583728 buff/cache
> KiB Swap: 12582908 total, 12258660 free,   324248 used. 31076748 avail Mem
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 13279 root      20   0 2380708  37628   4396 S  51.7  0.1   3768:03 glusterfsd
> 13273 root      20   0 2233212  20460   4380 S  17.2  0.1 105:50.44 glusterfsd
> 13287 root      20   0 2233212  20608   4340 S   4.3  0.1  34:27.20 glusterfsd
> 16205 vdsm       0 -20 5048672  88940  13364 S   1.3  0.3   0:32.69 vdsmd
> 16300 vdsm      20   0  608488  25096   5404 S   1.3  0.1   0:05.78 python
>  1109 vdsm      20   0 3127696  44228   8552 S   0.7  0.1  18:49.76 ovirt-ha-broker
> 25555 root      20   0       0      0      0 S   0.7  0.0   0:00.13 kworker/u64:3
>    10 root      20   0       0      0      0 S   0.3  0.0   4:22.36 rcu_sched
>   572 root       0 -20       0      0      0 S   0.3  0.0   0:12.02 kworker/1:1H
>   797 root      20   0       0      0      0 S   0.3  0.0   1:59.59 kdmwork-253:2
>   877 root       0 -20       0      0      0 S   0.3  0.0   0:11.34 kworker/3:1H
>  1028 root      20   0       0      0      0 S   0.3  0.0   0:35.35 xfsaild/dm-10
>  1869 root      20   0 1496472  10540   6564 S   0.3  0.0   2:15.46 python
>  3747 root      20   0       0      0      0 D   0.3  0.0   0:01.21 kworker/u64:1
> 10979 root      15  -5  723504  15644   3920 S   0.3  0.0  22:46.27 glusterfs
> 15085 root      20   0  680884  10792   4328 S   0.3  0.0   0:01.13 glusterd
> 16102 root      15  -5 1204216  44948  11160 S   0.3  0.1   0:18.61 supervdsmd
>
> At the moment, the engine is barely usable and my other VMs appear to be
> unresponsive.  Two are on one host, one on another, and none on the third.
>
>
>
> On Sat, Jul 7, 2018 at 10:38 AM, Jim Kusznir <j...@palousetech.com> wrote:
>
>> I run 4-7 VMs, and most of them are 2GB ram.  I have 2 VMs with 4GB.
>>
>> RAM hasn't been an issue until the recent ovirt/gluster upgrades.  Storage
>> has always been slow, especially with these drives.  However, even watching
>> network utilization on my switch, the gig-e links never max out.
>>
>> The loadavg issues and unresponsive behavior started with yesterday's
>> ovirt updates.  I now have one VM with low I/O that lives on a separate
>> storage volume (data, fully SSD-backed, instead of data-hdd, which was
>> having the issues).  I moved it to an ovirt host with no other VMs on it
>> that had freshly been rebooted.  Before it had this one VM on it,
>> loadavg was >0.5.  Now it's up in the 20's, with only one low-disk-I/O, 4GB
>> RAM VM on the host.
>>
>> This to me says there's now a new problem separate from gluster.  I don't
>> have any non-gluster storage available to test with.  I did notice that the
>> last update included a new kernel, and it appears it's the qemu-kvm
>> processes that are now consuming way more CPU than they used to.
>>
>> Are there any known issues?  I'm going to reboot into my previous kernel
>> to see if it's kernel-caused.
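>>
>> For anyone following along, this is roughly what I have in mind for falling back to
>> the previous kernel on CentOS 7 (just a sketch; the index depends on what's installed,
>> and a one-off pick at the grub menu works too):
>>
>>     # list installed kernels with their boot indexes
>>     grubby --info=ALL | grep -E '^(index|kernel)'
>>     # make the previous kernel the default (index 1 is usually the prior one)
>>     grub2-set-default 1
>>     reboot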
>>
>> --Jim
>>
>>
>>
>> On Fri, Jul 6, 2018 at 11:07 PM, Johan Bernhardsson <jo...@kafit.se>
>> wrote:
>>
>>> That is a single SATA drive that is slow on random I/O and that has to
>>> be synced with 2 other servers.  Gluster works synchronously, so each write has
>>> to be written and acknowledged on all three nodes.
>>>
>>> So you have a bottleneck in I/O on the drives and one on the network, and
>>> depending on how many virtual servers you have and how much RAM they take,
>>> you might have a memory bottleneck as well.
>>>
>>> Load spikes when you have a wait somewhere and are overusing capacity.
>>> And it's not only CPU that load is counted on: it also counts processes waiting
>>> for resources, so it can be memory or network or drives.
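>>>
>>> Something like this is usually enough to see where the wait is coming from
>>> (assuming the sysstat package is installed for iostat/sar):
>>>
>>>     vmstat 1 5        # 'wa' = CPU time waiting on I/O, 'si'/'so' = swap traffic
>>>     iostat -x 1 5     # per-disk utilization and await
>>>     sar -n DEV 1 5    # per-NIC throughput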
>>>
>>> How many virtual servers do you run and how much RAM do they consume?
>>>
>>> On July 7, 2018 09:51:42 Jim Kusznir <j...@palousetech.com> wrote:
>>>
>>>> In case it matters, the data-hdd gluster volume uses these hard drives:
>>>>
>>>> https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_detailpage_o05_s00?ie=UTF8&psc=1
>>>>
>>>> This is in a Dell R610 with a PERC6/i (one drive per server, configured
>>>> as a single-drive volume to pass it through as its own /dev/sd* device).
>>>> Inside the OS, it's partitioned with LVM thin provisioning, then an LVM volume
>>>> is formatted with XFS and mounted as /gluster/brick3, with the data-hdd volume
>>>> created inside that.
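>>>>
>>>> Roughly speaking, the brick layout is what you would get from something like this
>>>> (just a sketch; the device name, VG/LV names and sizes are illustrative, not the
>>>> exact commands I ran):
>>>>
>>>>     pvcreate /dev/sdb                                   # the passed-through hybrid disk
>>>>     vgcreate gluster_vg_hdd /dev/sdb
>>>>     lvcreate -l 95%FREE --thinpool gluster_vg_hdd/pool_hdd
>>>>     lvcreate -V 900G --thinpool gluster_vg_hdd/pool_hdd -n lv_brick3
>>>>     mkfs.xfs -i size=512 /dev/gluster_vg_hdd/lv_brick3
>>>>     mkdir -p /gluster/brick3
>>>>     mount /dev/gluster_vg_hdd/lv_brick3 /gluster/brick3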
>>>>
>>>> --Jim
>>>>
>>>> On Fri, Jul 6, 2018 at 10:45 PM, Jim Kusznir <j...@palousetech.com>
>>>> wrote:
>>>>
>>>>> So, I'm still at a loss... It sounds like it's either insufficient
>>>>> RAM/swap or insufficient network, and it seems to be neither now.  At this
>>>>> point, it appears that gluster is just "broken" and killing my systems for
>>>>> no discernible reason.  Here are details, all from the same system (currently
>>>>> running 3 VMs):
>>>>>
>>>>> [root@ovirt3 ~]# w
>>>>>  22:26:53 up 36 days,  4:34,  1 user,  load average: 42.78, 55.98, 53.31
>>>>> USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
>>>>> root     pts/0    192.168.8.90     22:26    2.00s  0.12s  0.11s w
>>>>>
>>>>> bwm-ng reports the highest data usage was about 6MB/s during this test
>>>>> (and that was combined; I have two different gig networks: the gluster
>>>>> network (primary VM storage) runs on one, and the other network handles
>>>>> everything else).
>>>>>
>>>>> [root@ovirt3 ~]# free -m
>>>>>               total        used        free      shared  buff/cache   available
>>>>> Mem:          31996       13236         232          18       18526       18195
>>>>> Swap:         16383        1475       14908
>>>>>
>>>>> top - 22:32:56 up 36 days,  4:41,  1 user,  load average: 17.99, 39.69, 47.66
>>>>> Tasks: 407 total,   1 running, 405 sleeping,   1 stopped,   0 zombie
>>>>> %Cpu(s):  8.6 us,  2.1 sy,  0.0 ni, 87.6 id,  1.6 wa,  0.0 hi,  0.1 si,  0.0 st
>>>>> KiB Mem : 32764284 total,   228296 free, 13541952 used, 18994036 buff/cache
>>>>> KiB Swap: 16777212 total, 15246200 free,  1531012 used. 18643960 avail Mem
>>>>>
>>>>>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>>>>> 30036 qemu      20   0 6872324   5.2g  13532 S 144.6 16.5 216:14.55 /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/v+
>>>>> 28501 qemu      20   0 5034968   3.6g  12880 S  16.2 11.7  73:44.99 /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/va+
>>>>>  2694 root      20   0 2169224  12164   3108 S   5.0  0.0   3290:42 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
>>>>> 14293 root      15  -5  944700  13356   4436 S   4.0  0.0  16:32.15 /usr/sbin/glusterfs --volfile-server=192.168.8.11 --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
>>>>> 25100 vdsm       0 -20 6747440 107868  12836 S   2.3  0.3  21:35.20 /usr/bin/python2 /usr/share/vdsm/vdsmd
>>>>> 28971 qemu      20   0 2842592   1.5g  13548 S   1.7  4.7 241:46.49 /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on -S -object secret,id=masterKey0,format=+
>>>>> 12095 root      20   0  162276   2836   1868 R   1.3  0.0   0:00.25 top
>>>>>  2708 root      20   0 1906040  12404   3080 S   1.0  0.0   1083:33 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+
>>>>> 28623 qemu      20   0 4749536   1.7g  12896 S   0.7  5.5   4:30.64 /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on -S -object secret,id=masterKey0,format=ra+
>>>>>    10 root      20   0       0      0      0 S   0.3  0.0 215:54.72 [rcu_sched]
>>>>>  1030 sanlock   rt   0  773804  27908   2744 S   0.3  0.1  35:55.61 /usr/sbin/sanlock daemon
>>>>>  1890 zabbix    20   0   83904   1696   1612 S   0.3  0.0  24:30.63 /usr/sbin/zabbix_agentd: collector [idle 1 sec]
>>>>>  2722 root      20   0 1298004   6148   2580 S   0.3  0.0  38:10.82 /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+
>>>>>  6340 root      20   0       0      0      0 S   0.3  0.0   0:04.30 [kworker/7:0]
>>>>> 10652 root      20   0       0      0      0 S   0.3  0.0   0:00.23 [kworker/u64:2]
>>>>> 14724 root      20   0 1076344  17400   3200 S   0.3  0.1  10:04.13 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -+
>>>>> 22011 root      20   0       0      0      0 S   0.3  0.0   0:05.04 [kworker/10:1]
>>>>>
>>>>> Not sure why the system load dropped, other than that I was trying to take a
>>>>> picture of it :)
>>>>>
>>>>> In any case, it appears that at this time I have plenty of swap, RAM,
>>>>> and network capacity, and yet things are still running very sluggishly.  I'm
>>>>> still getting e-mails from servers complaining about loss of communication
>>>>> with something or another, and I still get e-mails from the engine about bad
>>>>> engine status, then recovery, etc.
>>>>>
>>>>> I've shut down 2/3 of my VMs, too....just trying to keep the critical
>>>>> ones operating.
>>>>>
>>>>> At this point, I don't believe the problem is the memory leak itself, but it
>>>>> seems to have been triggered by it: all my problems started
>>>>> when I got low-RAM warnings from one of my 3 nodes and began recovery
>>>>> efforts from that.
>>>>>
>>>>> I do really like the idea / concept behind glusterfs, but I really
>>>>> have to figure out why it has performed so poorly from day one, and why it has
>>>>> caused 95% of my outages (including several large ones lately).  If I can
>>>>> get it stable, reliable, and well performing, then I'd love to keep it.  If
>>>>> I can't, then perhaps NFS is the way to go?  I don't like the single point
>>>>> of failure aspect of it, but my other NAS boxes that I run for clients (central
>>>>> storage for windows boxes) have been very solid; if I could get that kind
>>>>> of reliability for my ovirt stack, it would be a substantial improvement.
>>>>> Currently, it seems that about every other month I have a gluster-induced
>>>>> outage.
>>>>>
>>>>> Sometimes I wonder if hyperconverged itself is the issue, but my
>>>>> infrastructure doesn't justify three servers at the same location... I might
>>>>> be able to do two, but even that seems like it's pushing it.
>>>>>
>>>>> Looks like I can upgrade to 10G for about $900.  I can order a
>>>>> dual-Xeon Supermicro 12-disk server, loaded with 2TB WD Enterprise disks
>>>>> and a pair of SSDs for the OS, 32GB RAM, and 2.67GHz CPUs for about $720
>>>>> delivered.  I've got to do something to improve my reliability; I can't
>>>>> keep going the way I have been....
>>>>>
>>>>> --Jim
>>>>>
>>>>>
>>>>> On Fri, Jul 6, 2018 at 9:13 PM, Johan Bernhardsson <jo...@kafit.se>
>>>>> wrote:
>>>>>
>>>>>> Load like that is mostly I/O based: either the machine is swapping or
>>>>>> the network is too slow.  Check I/O wait in top.
>>>>>>
>>>>>> And about the problem where the OOM killer kills off gluster: does that
>>>>>> mean you don't monitor RAM usage on the servers?  Either something is eating
>>>>>> all your RAM, swap gets really I/O intensive, and the process is then killed off,
>>>>>> or you have the wrong swap settings in sysctl.conf.  (There are tons of broken
>>>>>> guides that recommend setting swappiness to 0, but that disables swap on newer
>>>>>> kernels.  The proper swappiness for swapping only when necessary is 1, or a
>>>>>> sufficiently low number like 10; the default is 60.)
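>>>>>>
>>>>>> Checking and setting it is just (the file name below is arbitrary):
>>>>>>
>>>>>>     # check the current value
>>>>>>     sysctl vm.swappiness
>>>>>>     # set it persistently and apply it now
>>>>>>     echo 'vm.swappiness = 10' > /etc/sysctl.d/99-swappiness.conf
>>>>>>     sysctl -p /etc/sysctl.d/99-swappiness.conf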
>>>>>>
>>>>>>
>>>>>> Moving to NFS will not improve things.  You will get more memory back since
>>>>>> gluster isn't running, and that is good, but you will have a single node
>>>>>> whose failure takes down all your storage, it would still be on 1 gigabit only,
>>>>>> and your three-node cluster would easily saturate that link.
>>>>>>
>>>>>> On July 7, 2018 04:13:13 Jim Kusznir <j...@palousetech.com> wrote:
>>>>>>
>>>>>>> So far it does not appear to be helping much.  I'm still getting VMs
>>>>>>> locking up and all kinds of notices from the ovirt engine about non-responsive
>>>>>>> hosts.  I'm still seeing load averages in the 20-30 range.
>>>>>>>
>>>>>>> Jim
>>>>>>>
>>>>>>> On Fri, Jul 6, 2018, 3:13 PM Jim Kusznir <j...@palousetech.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thank you for the advice and help
>>>>>>>>
>>>>>>>> I do plan on going to 10Gbps networking; I haven't quite jumped off that
>>>>>>>> cliff yet, though.
>>>>>>>>
>>>>>>>> I did put my data-hdd (main VM storage volume) onto a dedicated
>>>>>>>> 1Gbps network, and I've watched throughput on that and never seen more than
>>>>>>>> 60MB/s achieved (as reported by bwm-ng).  I have a separate 1Gbps network
>>>>>>>> for communication and ovirt migration, but I wanted to break that up
>>>>>>>> further (separate out VM traffic from migration/mgmt traffic).  My three
>>>>>>>> SSD-backed gluster volumes run on the main network too, as I haven't been able
>>>>>>>> to get them to move to the new network (which I was trying to use for all
>>>>>>>> gluster traffic).  I tried bonding, but that seemed to reduce performance rather
>>>>>>>> than improve it.
>>>>>>>>
>>>>>>>> --Jim
>>>>>>>>
>>>>>>>> On Fri, Jul 6, 2018 at 2:52 PM, Jamie Lawrence <
>>>>>>>> jlawre...@squaretrade.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Jim,
>>>>>>>>>
>>>>>>>>> I don't have any targeted suggestions, because there isn't much to
>>>>>>>>> latch on to.  I can say that Gluster replica three (no arbiters) on dedicated
>>>>>>>>> servers serving a couple of oVirt VM clusters here has not had these sorts of
>>>>>>>>> issues.
>>>>>>>>>
>>>>>>>>> I suspect your long heal times (and the resultant long periods of
>>>>>>>>> high load) are at least partly related to 1G networking. That is just 
>>>>>>>>> a
>>>>>>>>> matter of IO - heals of VMs involve moving a lot of bits. My cluster 
>>>>>>>>> uses
>>>>>>>>> 10G bonded NICs on the gluster and ovirt boxes for storage traffic and
>>>>>>>>> separate bonded 1G for ovirtmgmt and communication with other
>>>>>>>>> machines/people, and we're occasionally hitting the bandwidth ceiling 
>>>>>>>>> on
>>>>>>>>> the storage network. I'm starting to think about 40/100G, different 
>>>>>>>>> ways of
>>>>>>>>> splitting up intensive systems, and considering iSCSI for specific 
>>>>>>>>> volumes,
>>>>>>>>> although I really don't want to go there.
>>>>>>>>>
>>>>>>>>> I don't run FreeNAS[1], but I do run FreeBSD as storage servers
>>>>>>>>> for their excellent ZFS implementation, mostly for backups. ZFS will 
>>>>>>>>> make
>>>>>>>>> your `heal` problem go away, but not your bandwidth problems, which 
>>>>>>>>> become
>>>>>>>>> worse (because of fewer NICs pushing traffic). 10G hardware is not exactly
>>>>>>>>> exactly
>>>>>>>>> in the impulse-buy territory, but if you can, I'd recommend doing some
>>>>>>>>> testing using it. I think at least some of your problems are related.
>>>>>>>>>
>>>>>>>>> If that's not possible, my next stop would be optimizing
>>>>>>>>> everything I could about sharding and healing, and tuning the shard size,
>>>>>>>>> to squeeze as much performance out of 1G as I could, but that
>>>>>>>>> will only go so far.
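>>>>>>>>>
>>>>>>>>> To be concrete, these are the kind of knobs I mean (illustrative only, the values
>>>>>>>>> are just examples; verify the option names against your gluster version, and note
>>>>>>>>> that changing the shard size does not reshard existing images):
>>>>>>>>>
>>>>>>>>>     gluster volume get data-hdd features.shard-block-size
>>>>>>>>>     gluster volume set data-hdd features.shard-block-size 64MB
>>>>>>>>>     gluster volume set data-hdd cluster.shd-max-threads 4
>>>>>>>>>     gluster volume set data-hdd cluster.shd-wait-qlength 1024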
>>>>>>>>>
>>>>>>>>> -j
>>>>>>>>>
>>>>>>>>> [1] FreeNAS is just a storage-tuned FreeBSD with a GUI.
>>>>>>>>>
>>>>>>>>> > On Jul 6, 2018, at 1:19 PM, Jim Kusznir <j...@palousetech.com>
>>>>>>>>> wrote:
>>>>>>>>> >
>>>>>>>>> > hi all:
>>>>>>>>> >
>>>>>>>>> > Once again my production ovirt cluster is collapsing in on
>>>>>>>>> itself.  My servers are intermittently unavailable or degrading, 
>>>>>>>>> customers
>>>>>>>>> are noticing and calling in.  This seems to be yet another gluster 
>>>>>>>>> failure
>>>>>>>>> that I haven't been able to pin down.
>>>>>>>>> >
>>>>>>>>> > I posted about this a while ago, but didn't get anywhere (no
>>>>>>>>> replies that I found).  The problem started out as a glusterfsd process
>>>>>>>>> consuming large amounts of RAM (up to the point where RAM and swap were
>>>>>>>>> exhausted and the kernel OOM killer killed off the glusterfsd process).
>>>>>>>>> For reasons not clear to me at this time, that resulted in any VMs running
>>>>>>>>> on that host and that gluster volume being paused with I/O errors (the
>>>>>>>>> glusterfs process is usually unharmed; why it didn't continue I/O with
>>>>>>>>> the other servers is confusing to me).
>>>>>>>>> >
>>>>>>>>> > I have 3 servers and a total of 4 gluster volumes (engine, iso,
>>>>>>>>> data, and data-hdd).  The first 3 are replica 2+arb; the 4th 
>>>>>>>>> (data-hdd) is
>>>>>>>>> replica 3.  The first 3 are backed by an LVM partition (some thin
>>>>>>>>> provisioned) on an SSD; the 4th is on a seagate hybrid disk (hdd + 
>>>>>>>>> some
>>>>>>>>> internal flash for acceleration).  data-hdd is the only thing on the 
>>>>>>>>> disk.
>>>>>>>>> Servers are Dell R610 with the PERC/6i raid card, with the disks
>>>>>>>>> individually passed through to the OS (no raid enabled).
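>>>>>>>>> >
>>>>>>>>> > If it helps anyone checking my description, the replica layout and brick state
>>>>>>>>> > can be confirmed with (data-hdd shown as an example):
>>>>>>>>> >
>>>>>>>>> >     gluster volume info data-hdd      # replica count, brick list, volume options
>>>>>>>>> >     gluster volume status data-hdd    # which bricks and self-heal daemons are online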
>>>>>>>>> >
>>>>>>>>> > The above RAM usage issue came from the data-hdd volume.
>>>>>>>>> Yesterday, I caught one of the glusterfsd processes at high RAM usage before the
>>>>>>>>> OOM killer had to run.  I was able to migrate the VMs off the machine and,
>>>>>>>>> for good measure, reboot the entire machine (after taking this opportunity
>>>>>>>>> to run the software updates that ovirt said were pending).  Upon booting
>>>>>>>>> back up, the necessary volume healing began.  However, this time the
>>>>>>>>> healing caused all three servers to go to very, very high load averages (I
>>>>>>>>> saw just under 200 on one server; typically they've been 40-70), with top
>>>>>>>>> reporting I/O wait at 7-20%.  The network for this volume is a dedicated gig
>>>>>>>>> network.  According to bwm-ng, initially the network bandwidth would hit
>>>>>>>>> 50MB/s (yes, bytes), but it tailed off to mostly kB/s for a while.  All
>>>>>>>>> machines' load averages were still 40+, and gluster volume heal data-hdd
>>>>>>>>> info reported 5 items needing healing.  Servers were intermittently
>>>>>>>>> experiencing I/O issues, even on the 3 gluster volumes that appeared largely
>>>>>>>>> unaffected.  Even OS activities on the hosts themselves (logging in,
>>>>>>>>> running commands) would often be very delayed.  The ovirt engine was
>>>>>>>>> seemingly randomly throwing engine down / engine up / engine failed
>>>>>>>>> notifications.  Responsiveness on ANY VM was horrific most of the time,
>>>>>>>>> with random VMs being inaccessible.
>>>>>>>>> >
>>>>>>>>> > I let the gluster heal run overnight.  By morning, there were
>>>>>>>>> still 5 items needing healing, all three servers were still 
>>>>>>>>> experiencing
>>>>>>>>> high load, and servers were still largely unstable.
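>>>>>>>>> >
>>>>>>>>> > (For reference, the pending-heal count above is from gluster volume heal
>>>>>>>>> > data-hdd info; a per-brick count can also be watched with the statistics
>>>>>>>>> > subcommand, which I believe exists on this release:)
>>>>>>>>> >
>>>>>>>>> >     watch -n 60 'gluster volume heal data-hdd statistics heal-count'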
>>>>>>>>> >
>>>>>>>>> > I've noticed that all of my ovirt outages (and I've had a lot,
>>>>>>>>> way more than is acceptable for a production cluster) have come from
>>>>>>>>> gluster.  I still have 3 VMs whose hard disk images have become corrupted
>>>>>>>>> corrupted
>>>>>>>>> by my last gluster crash that I haven't had time to repair / rebuild 
>>>>>>>>> yet (I
>>>>>>>>> believe this crash was caused by the OOM issue previously mentioned, 
>>>>>>>>> but I
>>>>>>>>> didn't know it at the time).
>>>>>>>>> >
>>>>>>>>> > Is gluster really ready for production yet?  It seems so
>>>>>>>>> unstable to me....  I'm looking at replacing gluster with a dedicated 
>>>>>>>>> NFS
>>>>>>>>> server likely FreeNAS.  Any suggestions?  What is the "right" way to 
>>>>>>>>> do
>>>>>>>>> production storage on this (3 node cluster)?  Can I get this gluster 
>>>>>>>>> volume
>>>>>>>>> stable enough to get my VMs to run reliably again until I can deploy
>>>>>>>>> another storage solution?
>>>>>>>>> >
>>>>>>>>> > --Jim
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
>
>
> Edward Clay
> Systems Administrator
> The Hut Group <http://www.thehutgroup.com/>
>
> Email: edward.c...@uk2group.com
>
>
>
>
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/DXQPCVIJXNSM3IYY6EN5NUAGNWQKQ7DB/
