[ovirt-users] Gluster rebuild: request suggestions (poor IO performance)
Hi: I've been having one heck of a time with disk IO performance for nearly the entire time I've been running ovirt. I've tried a variety of things, I've posted to this list for help several times, and it sounds like in most cases the problems are due to design decisions and such. My cluster has been devolving into nearly unusable performance, and I believe it's mostly disk IO related. I'm currently using FreeNAS as my primary VM storage (via NFS), but now it too is performing slowly (it started out reasonable, but slowly degraded for unknown reasons). I'm ready to switch back to gluster if I can get specific recommendations as to what I need to do to make it work. I feel like I've been trying random things, and sinking money into this to try and make it work, but nothing has really fixed the problem.

I have 3 Dell R610 servers with 750GB SSDs as their primary drive. I had used some Seagate SSHDs behind the internal Dell DRAC raid controller (which had been configured to pass each through as a single-disk volume, but still wasn't really JBOD), but the controller started silently failing them, causing major issues for gluster. I think the DRAC just doesn't like those drives. I can put some real spinning disks in; perhaps a RAID-1 pair of 2TB? These servers only take 2.5" hdd's, so that greatly limits my options.

I'm sure others out there are using Dell R610 servers...what do you use for storage? How does it perform? What do I need to do to get this cluster actually usable again? Are PERC 6/i storage controllers usable? I'm not even sure where to go troubleshooting now...everything is so slow.

BTW: I had a small data volume on the SSDs, and the gluster performance on those was pretty poor. Performance of the hosted engine is still pretty poor, and it is still on the SSDs.
___ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-le...@ovirt.org Privacy Statement: https://www.ovirt.org/site/privacy-policy/ oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/IGR3RDAKQYXSPGAQCHWS5SGKOYA4QKJY/
[ovirt-users] Poor I/O Performance (again...)
Hi all: I've had I/O performance problems pretty much since the beginning of using oVirt. I've applied several upgrades as time went on, but strangely, none of them have alleviated the problem. VM disk I/O is still very slow, to the point that running VMs is often painful; it notably affects nearly all my VMs, and makes me leery of starting any more. I'm currently running 12 VMs and the hosted engine on the stack.

My configuration started out with 1Gbps networking and hyperconverged gluster running on a single SSD on each node. It worked, but I/O was painfully slow. I also started running out of space, so I added an SSHD on each node, created another gluster volume, and moved VMs over to it. I also ran that on a dedicated 1Gbps network. I had recurring disk failures (it seems that disks only lasted about 3-6 months; I warrantied all three at least once, and some twice, before giving up). I suspect the Dell PERC 6/i was partly to blame; the raid card refused to see/acknowledge the disk, but plugging it into a normal PC showed no signs of problems. In any case, performance on that storage was notably bad, even though the gig-e interface was rarely taxed.

I put in 10Gbps ethernet and moved all the storage onto that nonetheless, as several people here said that 1Gbps just wasn't fast enough. Some aspects improved a bit, but disk I/O is still slow. And I was still having problems with the SSHD data gluster volume eating disks, so I bought a dedicated NAS server (a Supermicro 12-disk dedicated FreeNAS NFS storage system on 10Gbps ethernet) and set that up. I found that it was actually FASTER than the SSD-based gluster volume, but still slow. Lately it's been getting slower, too...don't know why. The FreeNAS server reports network loads around 4MB/s on its 10GbE interface, so it's not network constrained. At 4MB/s, I'd sure hope the 12-spindle SAS interface wasn't constrained either. (And disk I/O operations on the NAS itself complete much faster.)
So, running a test on my NAS against an ISO file I haven't accessed in months:

    # dd if=en_windows_server_2008_r2_standard_enterprise_datacenter_and_web_x64_dvd_x15-59754.iso of=/dev/null bs=1024k count=500
    500+0 records in
    500+0 records out
    524288000 bytes transferred in 2.459501 secs (213168465 bytes/sec)

Running it on one of my hosts:

    root@unifi:/home/kusznir# time dd if=/dev/sda of=/dev/null bs=1024k count=500
    500+0 records in
    500+0 records out
    524288000 bytes (524 MB, 500 MiB) copied, 7.21337 s, 72.7 MB/s

(I don't know if this is a true apples-to-apples comparison, as I don't have a large file inside this VM's image.) Even this is faster than I often see.

I have a VoIP phone server running as a VM. Voicemail and other recordings usually fail due to I/O issues opening and writing the files. Often the first 4 or so seconds of the recording is missed; sometimes the entire thing just fails. I didn't use to have this problem, but it's definitely been getting worse. I finally bit the bullet and ordered a physical server dedicated to my VoIP system...but I still want to figure out why I'm having all these I/O problems. I read on the list of people running 30+ VMs...I feel that my I/O can't take any more VMs with any semblance of reliability. We have a QuickBooks server on here too (Windows), and the performance is abysmal; my CPA is charging me extra because of all the lost staff time waiting on the system to respond and generate reports.

I'm at my wits' end...I started with gluster on SSD with a 1Gbps network, migrated to a 10Gbps network, and now to a dedicated high-performance NAS box over NFS, and still have performance issues. I don't know how to troubleshoot the issue any further, but I've never had these kinds of issues when I was playing with other VM technologies. I'd like to get to the point where I can resell virtual servers to customers, but I can't do so with my current performance levels. I'd greatly appreciate help troubleshooting this further.
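One caveat with the dd comparisons above: both runs read sequentially and can be served partly from the page cache, so the numbers may flatter the storage. A more cache-resistant check is to time a write that is forced to disk, then drop the cache before reading it back. The sketch below uses only standard GNU coreutils; the path is an example, not from this thread:

```shell
#!/bin/sh
# Rough, cache-resistant sequential I/O check. Put TESTFILE on the storage
# you want to measure (NAS mount, gluster brick, inside a VM, etc.) and run
# the same commands everywhere so the comparison is apples-to-apples.
TESTFILE=/tmp/iotest.bin   # example path; point this at the storage under test

# Sequential write: conv=fdatasync makes dd flush data before reporting a rate,
# so the number reflects the disk, not the write-back cache.
dd if=/dev/zero of="$TESTFILE" bs=1M count=256 conv=fdatasync

# Optionally drop the Linux page cache so the read below hits the disk:
#   sync; echo 3 > /proc/sys/vm/drop_caches

# Sequential read back:
dd if="$TESTFILE" of=/dev/null bs=1M

rm -f "$TESTFILE"
```

Running this on the NAS locally, over NFS from a host, and inside a VM would show at which layer the throughput collapses.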
--Jim
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/ZR64VABNT2SGKLNP3XNTHCGFZXSOJAQF/
[ovirt-users] Ovirt Host Replacement/Rebuild
Hi all: I had an unplanned power outage (generator failed to start; the power failure lasted 3 minutes longer than the UPS batteries). One node didn't survive the unplanned power outage. By that, I mean it kernel panics on boot, and I haven't been able to capture the KP or the first part of it (just the end), so I don't truly know what the root cause is. I have validated that the hardware is just fine, so it's got to be an OS corruption.

Based on this, I was thinking that perhaps the easiest way to recover would simply be to delete the host from the cluster, reformat and reinstall this host, and then add it back to the cluster as a new host. Is this in fact a good idea? Are there any references on how to do this (the detailed steps, so I don't mess it up)?

My cluster is (was) a 3-node hyperconverged cluster with gluster used for the management node. I also have a gluster share for VMs, but I use an NFS share from a NAS for that (which I will ask about in another post).

Thanks for the help!

--Jim
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/Y66A6Q3NOGD3BCQ4UVAZK5ATS4ZFPVYV/
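For the gluster side of a reinstall like the one described above, one common approach (a sketch only; hostnames, volume names, and brick paths below are placeholders, not taken from this thread) is to rebuild the node with the same hostname and IP, then re-attach its bricks from a healthy peer:

```shell
#!/bin/sh
# Sketch: re-add a reinstalled gluster node to a 3-way replica volume.
# Assumes the node keeps its old hostname/IP; names are hypothetical.

# On a surviving node, confirm the cluster state first:
gluster peer status
gluster volume status engine

# After the OS reinstall, from a healthy node, re-probe the rebuilt peer
# (if its gluster UUID changed, you may need to detach the stale entry first):
gluster peer probe node2.example.com

# Re-initialize the brick in place. "reset-brick" re-creates the brick on
# the same path and triggers a full resync from the healthy replicas:
gluster volume reset-brick engine node2.example.com:/gluster/engine/brick \
    node2.example.com:/gluster/engine/brick commit force

# Kick off and watch the heal:
gluster volume heal engine full
gluster volume heal engine info
```

The oVirt side (remove host, reinstall, re-add via the engine UI) is separate from this; the sketch only covers getting the gluster replica healthy again.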
[ovirt-users] Re: Upgraded host, engine now won't boot
Ok, finally got it...Had to get a terminal ready with the virsh command and guess what the instance number was, and then run suspend right after starting with --vm-start-paused. Got it to really be paused, got into the console, booted the old kernel, and have now been repairing a bad yum transaction. I *think* I've finished that.

So, if I understand correctly, after the yum update, I should run engine-setup? Do I run that inside the engine VM, or on the host it's running on?

BTW: I did look up upgrade procedures in the documentation for the release. It links to two or three levels of other documents, then ends in an error 404.

--Jim
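The race described in this thread (suspending the engine VM immediately after starting it, before the HA agent resumes it) can be scripted rather than guessed by hand. A sketch, assuming the hosted-engine tools and libvirt are present on the host; the authentication setup (virsh saslpasswd2 user, as suggested earlier in the thread) is a prerequisite:

```shell
#!/bin/sh
# Sketch: start the hosted engine paused, then immediately suspend it via
# libvirt so it stays paused long enough to attach a console.
hosted-engine --vm-start-paused
sleep 2

# Find the HostedEngine domain name (read-only connection is enough to list):
DOM=$(virsh -r -c qemu:///system list --name | grep -i hostedengine | head -1)

# Suspend it before anything resumes it (needs a read-write connection,
# i.e. valid virsh credentials on this host):
virsh -c qemu:///system suspend "$DOM"

# Now attach VNC or the serial console, then resume to watch the boot:
#   virsh -c qemu:///system resume "$DOM"
```

Whether the 2-second window is enough depends on how quickly the HA agent reacts; shorten or drop the sleep as needed.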
[ovirt-users] Re: Upgraded host, engine now won't boot
global maintenance mode is already on. hosted-engine --vm-start-paused results in a non-paused VM being started. Of course, this is executed after hosted-engine --vm-poweroff and suitable time left to let things shut down.

I just ran another test, and did in fact see the engine was briefly paused, but then was quickly put in the running state. I don't know by what, though. Global maintenance mode is definitely enabled; every run of the hosted-engine command reminds me!

On Mon, Sep 3, 2018 at 11:12 AM, Darrell Budic wrote:
> Don't know if there's anything special, it's been a while since I've needed to start it in paused mode. Try putting it in HA maintenance mode from the CLI and then start it in paused mode maybe?
[ovirt-users] Re: Upgraded host, engine now won't boot
Unfortunately, I seem unable to get connected to the console early enough to actually see a kernel list.

I've tried the hosted-engine --vm-start-paused command, but it just starts it (running mode, not paused). By the time I can get vnc connected, I have just that last line. ctrl-alt-del doesn't do anything with it, either. Sending a reset through virsh seems to just kill the VM (it doesn't respawn).

HA seems to have some trouble with this too...Originally I allowed HA to start it, and it would take a good long while before it gave up on the engine and reset it. It instantly booted to the same crashed state, and again waited a "good long while" (sorry, never timed it, but I know it was >5 min).

My current thought is that I need to get the engine started in paused mode, connect vnc, then unpause it with virsh to catch what is happening. Is there any magic to getting it started in paused mode?

On Mon, Sep 3, 2018 at 11:03 AM, Darrell Budic wrote:
> Send it a ctrl-alt-delete and see what happens. Possibly try an older kernel at the grub boot menu. Could also try stopping it with hosted-engine --vm-stop and let HA reboot it, see if it boots, or get onto the console quickly and try to watch more of the boot.
>
> Ssh and yum upgrade is fine for the OS, although it's a good idea to enable Global HA Maintenance first so the HA watchdogs don't reboot it in the middle of that. After that, run "engine-setup" again, at least if there are new ovirt engine updates to be done. Then disable Global HA Maintenance, and run "shutdown -h now" to stop the Engine VM (rebooting seems to cause it to exit anyway; HA seems to run it as a single-execution VM. Or at least in the past it seemed to quit anyway on me, and shutdown triggered HA faster). Wait a few minutes, and HA will respawn it on a new instance and you can log into your engine again.
[ovirt-users] Re: Upgraded host, engine now won't boot
Thanks to Jayme, who pointed me to the --add-console-password hosted-engine command to set a password for vnc. Using that, I see only the single line:

    Probing EDD (edd=off to disable)... ok

--Jim
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/C62LXGEOGRDWCEZ6XWN3YUSGS32IPROS/
[ovirt-users] Re: Upgraded host, engine now won't boot
Is there a way to get a graphical console on boot of the engine VM so I can see what's causing the failure to boot?

List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/WKPMXVUYM5AAD7KYAYLB4DJ4NYGKXZFE/
[ovirt-users] Re: Upgraded host, engine now won't boot
Thanks; I guess I didn't mention that I started there.

The virsh list shows it in state running, and gluster is showing fully online and healed. However, I cannot bring up a console of the engine VM to see why it's not booting, even though it shows in running state.

In any case, the hosts and engine were running happily. I applied the latest updates on the host, and the engine went unstable. I thought, OK, maybe there's an update to ovirt that also needs to be applied to the engine, so I ssh'ed in and ran yum update (never did find clear instructions on how one is supposed to maintain the engine, but I did see that listed online). A while later, it reset and never booted again.

--Jim

On Sun, Sep 2, 2018 at 4:28 PM, Darrell Budic wrote:
> It's definitely not starting; you'll have to see if you can figure out why. A couple things to try:
>
> - Check "virsh list" and see if it's running, or paused for storage. (Google "virsh saslpasswd2" if you need to add a user to do this with; it's per host.)
> - It's hyperconverged, so check your gluster volume for healing and/or split brains and wait/resolve those.
> - Check "gluster peer status" on each host and make sure your gluster hosts are all talking. I've seen an upgrade screw up the firewall; an easy fix is to add a rule to allow the hosts to talk to each other on your gluster network, no questions asked (-j ACCEPT, no port, etc).
>
> Good luck!

List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/25AE7XZBRFCWG3HKZBGAC2KBXDZLBOC2/
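The "no questions asked" firewall fix suggested in the thread can be expressed as a single iptables rule. A sketch; the 10.0.0.0/24 subnet is a placeholder for whatever network carries your gluster traffic:

```shell
#!/bin/sh
# Blanket-allow all traffic between gluster peers on the storage network
# (placeholder subnet), as suggested: -j ACCEPT, no port restriction.
iptables -I INPUT -s 10.0.0.0/24 -j ACCEPT

# A tighter alternative is to open only the usual gluster ports:
# 24007-24008 for glusterd management, 49152+ for brick processes
# (one port per brick; adjust the upper bound to your brick count).
#   iptables -I INPUT -p tcp -m multiport \
#       --dports 24007:24008,49152:49251 -j ACCEPT

# Quick connectivity check from each host toward a peer:
#   gluster peer status
```

Note that a rule inserted this way does not survive a reboot; it would need to be persisted via the distribution's firewall tooling to be a permanent fix.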
[ovirt-users] Upgraded host, engine now won't boot
Hello: I saw that there were updates to my ovirt-4.2 3-node hyperconverged system, so I proceeded to apply them the usual way through the UI. At one point, the hosted engine was migrated to one of the upgraded hosts, and then went "unstable" on me. Now, the hosted engine appears to be crashed: It gets powered up, but it never boots up to the point where it responds to pings or allows logins. After a while, the hosted engine shows status (via the console "hosted-engine --vm-status" command) "Powering Down". It stays there for a long time. I tried forcing a poweroff then powering it on, but again, it never gets up to where it will respond to pings. --vm-status shows bad health, but up. I tried running the hosted-engine --console command, but got: [root@ovirt1 ~]# hosted-engine --console The engine VM is running on this host Connected to domain HostedEngine Escape character is ^] error: internal error: cannot find character device [root@ovirt1 ~]# I tried to run the hosted-engine --upgrade-appliance command, but it hangs at obtaining certificate (understandably, as the hosted-engine is not up). How do I recover from this? And what caused this? --Jim
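One common recovery path for a hosted engine stuck in this state is to take the HA agents out of the loop and restart the VM by hand. This is a hedged sketch, not a guaranteed fix for the problem described above; it is guarded so it does nothing on a machine without `ovirt-hosted-engine-setup` installed.

```shell
# Hedged sketch: global maintenance, clean restart, then watch status.
if command -v hosted-engine >/dev/null 2>&1; then
    # Stop the HA agents from restarting/migrating the VM while debugging.
    hosted-engine --set-maintenance --mode=global
    hosted-engine --vm-shutdown        # or --vm-poweroff if it hangs
    sleep 60
    hosted-engine --vm-start
    hosted-engine --vm-status
    # If --console fails with "cannot find character device", the serial
    # console device may be missing from the VM definition; setting a VNC
    # password with "hosted-engine --add-console-password" and connecting
    # via VNC is an alternative way to see the boot messages.
fi
done_msg="sketch only"
echo "$done_msg"
```

Remember to leave global maintenance (`--set-maintenance --mode=none`) once the engine is healthy again, or HA monitoring stays disabled.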
[ovirt-users] Data Recovery from snapshot
Hi: With yet another gluster disk failure / gluster collapse, it appears I lost the "main" backing image for one of my vm servers. I have snapshots still intact (or at least, they appear to be), but the main image is gone. The main server process stores a backup at regular intervals in its disk, and that would have been changed data, so it would be in the snapshot rather than the base image. Is there any way to recover this one .tar.gz file from the snapshot with the missing main image? This is a backup of dynamic data, and without it, I will have lost several customers' data, some of which cannot be recreated/regenerated. It also appears that my backup (gluster geo-replication) did not work (it had crashed a while ago, and has a very old backup of this image). --Jim
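A first step for a question like this is to inspect what the surviving snapshot volume actually references. The sketch below is hedged: `IMG` is a hypothetical placeholder (the real file lives under the storage domain's images/<disk-uuid>/ directory, which is not given in the message), and everything is read-only or guarded.

```shell
# Hedged sketch: inspect the snapshot's backing chain, then (optionally)
# try to copy a single file out of it without booting the VM.
IMG="${IMG:-/path/to/snapshot-volume}"   # placeholder; set to the real file
if command -v qemu-img >/dev/null 2>&1 && [ -f "$IMG" ]; then
    # Shows the full chain; a missing base image is reported here.
    qemu-img info --backing-chain "$IMG"
fi
# If the chain turns out to be readable, libguestfs can extract one file
# read-only (example paths; it cannot make the images worse):
# virt-copy-out -a "$IMG" /var/backups/backup.tar.gz /tmp/recovered/
result="inspection sketched"
echo "$result"
```

Whether the data in the .tar.gz is recoverable depends on whether the blocks it occupied were ever written into the snapshot layer; if the file's data lives entirely in the lost base image, no tool can reconstruct it from the snapshot alone.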
[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)
Thank you for your help. After more troubleshooting and host reboots, I accidentally discovered that the backing disk on ovirt2 (host) had suffered a failure. On reboot, the raid card refused to see it at all. It said it had cache waiting to be written to disk, and in the end, as it couldn't (wouldn't) see that disk, I had no choice but to discard that cache and boot up without the physical disk. Since doing so (and running a gluster volume remove for the affected host), things are running like normal. I don't understand why one bad disk wasn't simply failed, or if one underlying process was having such a problem, the other hosts didn't take it offline and continue (much like RAID would have done). Instead, everything was broken (including gluster volumes on unaffected disks that are fully functional across all hosts). I'm seeing the need to go multi-spindle for storage, and I don't want to do that with the ovirt hosts due to hardware concerns/issues (I have to use the PERC6i, which I am also learning to distrust), and I would have to use 2.5" disks (I want to use 3.5"). As such, I will be going to a dedicated storage server with 12 spindles in a RAID6 configuration. I'm debating if it's worth setting it up as a gluster replica 1 system (so I can easily migrate later), or just build it NFS with FreeNAS. I'm leaning toward the latter, as it seems pointless to run gluster on a single node. --Jim On Sun, Jul 8, 2018 at 3:54 AM, Yaniv Kaul wrote: > > > On Sat, Jul 7, 2018 at 8:45 AM, Jim Kusznir wrote: > >> So, I'm still at a loss...It sounds like it's either insufficient >> ram/swap, or insufficient network. It seems to be neither now. At this >> point, it appears that gluster is just "broke" and killing my systems for >> no discernible reason. 
Here's details, all from the same system (currently >> running 3 VMs): >> >> [root@ovirt3 ~]# w >> 22:26:53 up 36 days, 4:34, 1 user, load average: 42.78, 55.98, 53.31 >> USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT >> root pts/0 192.168.8.90 22:26 2.00s 0.12s 0.11s w >> >> bwm-ng reports the highest data usage was about 6MB/s during this test >> (and that was combined; I have two different gig networks. One gluster >> network (primary VM storage) runs on one, the other network handles >> everything else). >> >> [root@ovirt3 ~]# free -m >> total used free shared buff/cache >> available >> Mem: 31996 13236 232 18 18526 >> 18195 >> Swap: 16383 1475 14908 >> >> top - 22:32:56 up 36 days, 4:41, 1 user, load average: 17.99, 39.69, >> 47.66 >> > > That is indeed a high load average. How many CPUs do you have, btw? > > >> Tasks: 407 total, 1 running, 405 sleeping, 1 stopped, 0 zombie >> %Cpu(s): 8.6 us, 2.1 sy, 0.0 ni, 87.6 id, 1.6 wa, 0.0 hi, 0.1 si, >> 0.0 st >> KiB Mem : 32764284 total, 228296 free, 13541952 used, 18994036 >> buff/cache >> KiB Swap: 16777212 total, 15246200 free, 1531012 used. 18643960 avail >> Mem >> > > Can you check what's swapping here? (a tweak to top output will show that) > > >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ >> COMMAND >> >> 30036 qemu 20 0 6872324 5.2g 13532 S 144.6 16.5 216:14.55 >> /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object >> secret,id=masterKey0,format=raw,file=/v+ >> 28501 qemu 20 0 5034968 3.6g 12880 S 16.2 11.7 73:44.99 >> /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object >> secret,id=masterKey0,format=raw,file=/va+ >> 2694 root 20 0 2169224 12164 3108 S 5.0 0.0 3290:42 >> /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id >> data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+ >> > > This one's certainly taking quite a bit of your CPU usage overall. 
> > >> 14293 root 15 -5 944700 13356 4436 S 4.0 0.0 16:32.15 >> /usr/sbin/glusterfs --volfile-server=192.168.8.11 >> --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+ >> > > I'm not sure what the sorting order is, but doesn't look like Gluster is > taking a lot of memory? > > >> 25100 vdsm 0 -20 6747440 107868 12836 S 2.3 0.3 21:35.20 >> /usr/bin/python2 /usr/share/vdsm/vdsmd >> >> 28971 qemu 20 0 2842592 1.5g 13548 S 1.7 4.7 241:46.49 >> /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on >> -S -object secret,id=masterK
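Yaniv's question above ("what's swapping here?") can be answered without top at all. The "tweak" he alludes to is adding the SWAP field in top (press `f`, enable SWAP), but the same per-process numbers can be read straight from /proc; this sketch prints the largest swap users.

```shell
# List per-process swap usage from /proc (Linux only; guarded elsewhere).
if [ -d /proc ] && grep -q '^VmSwap:' /proc/self/status 2>/dev/null; then
    # For each process: remember its name, print its VmSwap in kB.
    awk '/^Name:/{n=$2} /^VmSwap:/{print $2, "kB", n}' \
        /proc/[0-9]*/status 2>/dev/null | sort -rn | head -20
fi
swap_check="done"
echo "$swap_check"
```

In the output quoted above, roughly 1.5 GB of swap is in use while ~18 GB sits in buff/cache, which is itself a hint that something is pinning memory pages rather than the host being simply short of RAM.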
[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)
Thank you for your help. After more troubleshooting and host reboots, I accidentally discovered that the backing disk on ovirt2 (host) had suffered a failure. On reboot, the raid card refused to see it at all. It said it had cache waiting to be written to disk, and in the end, as it couldn't (wouldn't) see that disk, I had no choice but to discard that cache and boot up without the physical disk. Since doing so (and running a gluster volume remove for the affected host), things are running like normal, although it appears it corrupted two disks (I've now lost 5 VMs to gluster-induced disk failures during poorly handled failures). I don't understand why one bad disk wasn't simply failed, or if one underlying process was having such a problem, the other hosts didn't take it offline and continue (much like RAID would have done). Instead, everything was broken (including gluster volumes on unaffected disks that are fully functional across all hosts), as well as very poor performance on the affected machine and no diagnostic reports that would allude to a failing hard drive. Is this expected behavior? --Jim On Sun, Jul 8, 2018 at 3:54 AM, Yaniv Kaul wrote: > > > On Sat, Jul 7, 2018 at 8:45 AM, Jim Kusznir wrote: > >> So, I'm still at a loss...It sounds like it's either insufficient >> ram/swap, or insufficient network. It seems to be neither now. At this >> point, it appears that gluster is just "broke" and killing my systems for >> no discernible reason. Here's details, all from the same system (currently >> running 3 VMs): >> >> [root@ovirt3 ~]# w >> 22:26:53 up 36 days, 4:34, 1 user, load average: 42.78, 55.98, 53.31 >> USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT >> root pts/0 192.168.8.90 22:26 2.00s 0.12s 0.11s w >> >> bwm-ng reports the highest data usage was about 6MB/s during this test >> (and that was combined; I have two different gig networks. One gluster >> network (primary VM storage) runs on one, the other network handles >> everything else). 
>> >> [root@ovirt3 ~]# free -m >> totalusedfree shared buff/cache >> available >> Mem: 31996 13236 232 18 18526 >> 18195 >> Swap: 163831475 14908 >> >> top - 22:32:56 up 36 days, 4:41, 1 user, load average: 17.99, 39.69, >> 47.66 >> > > That is indeed a high load average. How many CPUs do you have, btw? > > >> Tasks: 407 total, 1 running, 405 sleeping, 1 stopped, 0 zombie >> %Cpu(s): 8.6 us, 2.1 sy, 0.0 ni, 87.6 id, 1.6 wa, 0.0 hi, 0.1 si, >> 0.0 st >> KiB Mem : 32764284 total, 228296 free, 13541952 used, 18994036 >> buff/cache >> KiB Swap: 16777212 total, 15246200 free, 1531012 used. 18643960 avail >> Mem >> > > Can you check what's swapping here? (a tweak to top output will show that) > > >> >> PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ >> COMMAND >> >> 30036 qemu 20 0 6872324 5.2g 13532 S 144.6 16.5 216:14.55 >> /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object >> secret,id=masterKey0,format=raw,file=/v+ >> 28501 qemu 20 0 5034968 3.6g 12880 S 16.2 11.7 73:44.99 >> /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object >> secret,id=masterKey0,format=raw,file=/va+ >> 2694 root 20 0 2169224 12164 3108 S 5.0 0.0 3290:42 >> /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id >> data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+ >> > > This one's certainly taking quite a bit of your CPU usage overall. > > >> 14293 root 15 -5 944700 13356 4436 S 4.0 0.0 16:32.15 >> /usr/sbin/glusterfs --volfile-server=192.168.8.11 >> --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+ >> > > I'm not sure what the sorting order is, but doesn't look like Gluster is > taking a lot of memory? 
> > >> 25100 vdsm 0 -20 6747440 107868 12836 S 2.3 0.3 21:35.20 >> /usr/bin/python2 /usr/share/vdsm/vdsmd >> >> 28971 qemu 20 0 2842592 1.5g 13548 S 1.7 4.7 241:46.49 >> /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on >> -S -object secret,id=masterKey0,format=+ >> 12095 root 20 0 162276 2836 1868 R 1.3 0.0 0:00.25 top >> >> >> 2708 root 20 0 1906040 12404 3080 S 1.0 0.0 1083:33 >> /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id >> engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+ >&
[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)
This host has NO VMs running on it, only 3 running cluster-wide (including the engine, which is on its own storage): top - 10:44:41 up 1 day, 17:10, 1 user, load average: 15.86, 14.33, 13.39 Tasks: 381 total, 1 running, 379 sleeping, 1 stopped, 0 zombie %Cpu(s): 2.7 us, 2.1 sy, 0.0 ni, 89.0 id, 6.1 wa, 0.0 hi, 0.2 si, 0.0 st KiB Mem : 32764284 total, 338232 free, 842324 used, 31583728 buff/cache KiB Swap: 12582908 total, 12258660 free, 324248 used. 31076748 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 13279 root 20 0 2380708 37628 4396 S 51.7 0.1 3768:03 glusterfsd 13273 root 20 0 2233212 20460 4380 S 17.2 0.1 105:50.44 glusterfsd 13287 root 20 0 2233212 20608 4340 S 4.3 0.1 34:27.20 glusterfsd 16205 vdsm 0 -20 5048672 88940 13364 S 1.3 0.3 0:32.69 vdsmd 16300 vdsm 20 0 608488 25096 5404 S 1.3 0.1 0:05.78 python 1109 vdsm 20 0 3127696 44228 8552 S 0.7 0.1 18:49.76 ovirt-ha-broker 2 root 20 0 0 0 0 S 0.7 0.0 0:00.13 kworker/u64:3 10 root 20 0 0 0 0 S 0.3 0.0 4:22.36 rcu_sched 572 root 0 -20 0 0 0 S 0.3 0.0 0:12.02 kworker/1:1H 797 root 20 0 0 0 0 S 0.3 0.0 1:59.59 kdmwork-253:2 877 root 0 -20 0 0 0 S 0.3 0.0 0:11.34 kworker/3:1H 1028 root 20 0 0 0 0 S 0.3 0.0 0:35.35 xfsaild/dm-10 1869 root 20 0 1496472 10540 6564 S 0.3 0.0 2:15.46 python 3747 root 20 0 0 0 0 D 0.3 0.0 0:01.21 kworker/u64:1 10979 root 15 -5 723504 15644 3920 S 0.3 0.0 22:46.27 glusterfs 15085 root 20 0 680884 10792 4328 S 0.3 0.0 0:01.13 glusterd 16102 root 15 -5 1204216 44948 11160 S 0.3 0.1 0:18.61 supervdsmd At the moment, the engine is barely usable, and my other VMs appear to be unresponsive. Two on one host, one on another, and none on the third. On Sat, Jul 7, 2018 at 10:38 AM, Jim Kusznir wrote: > I run 4-7 VMs, and most of them are 2GB ram. I have 2 VMs with 4GB. > > Ram hasn't been an issue until recent ovirt/gluster upgrades. Storage has > always been slow, especially with these drives. However, even watching > network utilization on my switch, the gig-e links never max out. 
> > The loadavg issues and unresponsive behavior started with yesterday's > ovirt updates. I now have one VM with low I/O that lives on a separate > storage volume (data, fully SSD backed instead of data-hdd, which was > having the issues). I moved it to an ovirt host with no other VMs on it, > and that had freshly been rebooted. Before it had this one VM on it, > loadavg was >0.5. Now it's up in the 20s, with only one low Disk I/O, 4GB > ram VM on the host. > > This to me says there's now a new problem separate from Gluster. I don't > have any non-gluster storage available to test with. I did notice that the > last update included a new kernel, and it appears it's the qemu-kvm > processes that are consuming way more CPU than they used to now. > > Are there any known issues? I'm going to reboot into my previous kernel > to see if it's kernel-caused. > > --Jim > > > > On Fri, Jul 6, 2018 at 11:07 PM, Johan Bernhardsson > wrote: > >> That is a single sata drive that is slow on random I/O and that has to be >> synced with 2 other servers. Gluster works synchronously, so one write has to >> be written and acknowledged on all three nodes. >> >> So you have a bottleneck in I/O on drives and one on network, and >> depending on how many virtual servers you have and how much ram they take >> you might have memory. >> >> Load spikes when you have a wait somewhere and are overusing capacity. >> But it's not only CPU that load is counted on. It is waiting for resources, >> so it can be memory or network or drives. >> >> How many virtual servers do you run and how much ram do they consume? >> >> On July 7, 2018 09:51:42 Jim Kusznir wrote: >> >>> In case it matters, the data-hdd gluster volume uses these hard drives: >>> >>> https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_deta >>> ilpage_o05_s00?ie=UTF8&psc=1 >>> >>> This is in a Dell R610 with PERC6/i (one drive per server, configured as >>> a single drive volume to pass it through as its own /dev/sd* device). 
>>> Inside the OS, it's partitioned with lvm_thin, then an lvm volume formatted >>> with XFS and mounted as /gluster/brick3, with the data-hdd volume created >>> inside that. >>> >>> --Jim >>> >>> On Fri, Jul 6, 2018 at 10:45 PM
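The brick layout Jim describes (thin LVM pool, XFS, one single-disk brick per host, data-hdd on top) can be reconstructed roughly as below. This is a hedged sketch with example device and size names, not his exact commands; every command is destructive, so all are left commented out and the block only documents the sequence.

```shell
# Hedged reconstruction of the described brick layout; names are examples.
# pvcreate /dev/sdb
# vgcreate gluster_vg /dev/sdb
# lvcreate -L 1.8t --thinpool thinpool gluster_vg
# lvcreate -V 1.8t --thin -n brick3 gluster_vg/thinpool
# mkfs.xfs -i size=512 /dev/gluster_vg/brick3   # 512-byte inodes, commonly
#                                               # recommended for gluster
# mkdir -p /gluster/brick3
# mount /dev/gluster_vg/brick3 /gluster/brick3
# gluster volume create data-hdd replica 3 \
#     ovirt1:/gluster/brick3/data-hdd \
#     ovirt2:/gluster/brick3/data-hdd \
#     ovirt3:/gluster/brick3/data-hdd
layout="documented"
echo "$layout"
```

Note the design consequence discussed in the thread: with replica 3 on single SATA/SSHD bricks, every write must be acknowledged by all three slow disks over 1 GbE, so the volume can never be faster than its slowest brick.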
[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)
I run 4-7 VMs, and most of them are 2GB ram. I have 2 VMs with 4GB. Ram hasn't been an issue until recent ovirt/gluster upgrades. Storage has always been slow, especially with these drives. However, even watching network utilization on my switch, the gig-e links never max out. The loadavg issues and unresponsive behavior started with yesterday's ovirt updates. I now have one VM with low I/O that lives on a separate storage volume (data, fully SSD backed instead of data-hdd, which was having the issues). I moved it to an ovirt host with no other VMs on it, and that had freshly been rebooted. Before it had this one VM on it, loadavg was >0.5. Now it's up in the 20s, with only one low Disk I/O, 4GB ram VM on the host. This to me says there's now a new problem separate from Gluster. I don't have any non-gluster storage available to test with. I did notice that the last update included a new kernel, and it appears it's the qemu-kvm processes that are consuming way more CPU than they used to now. Are there any known issues? I'm going to reboot into my previous kernel to see if it's kernel-caused. --Jim On Fri, Jul 6, 2018 at 11:07 PM, Johan Bernhardsson wrote: > That is a single sata drive that is slow on random I/O and that has to be > synced with 2 other servers. Gluster works synchronously, so one write has to > be written and acknowledged on all three nodes. > > So you have a bottleneck in I/O on drives and one on network, and depending > on how many virtual servers you have and how much ram they take you might > have memory. > > Load spikes when you have a wait somewhere and are overusing capacity. But > it's not only CPU that load is counted on. It is waiting for resources, so > it can be memory or network or drives. > > How many virtual servers do you run and how much ram do they consume? 
> > On July 7, 2018 09:51:42 Jim Kusznir wrote: > >> In case it matters, the data-hdd gluster volume uses these hard drives: >> >> https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_ >> detailpage_o05_s00?ie=UTF8&psc=1 >> >> This is in a Dell R610 with PERC6/i (one drive per server, configured as >> a single drive volume to pass it through as its own /dev/sd* device). >> Inside the OS, its partitioned with lvm_thin, then an lvm volume formatted >> with XFS and mounted as /gluster/brick3, with the data-hdd volume created >> inside that. >> >> --Jim >> >> On Fri, Jul 6, 2018 at 10:45 PM, Jim Kusznir wrote: >> >>> So, I'm still at a loss...It sounds like its either insufficient >>> ram/swap, or insufficient network. It seems to be neither now. At this >>> point, it appears that gluster is just "broke" and killing my systems for >>> no descernable reason. Here's detals, all from the same system (currently >>> running 3 VMs): >>> >>> [root@ovirt3 ~]# w >>> 22:26:53 up 36 days, 4:34, 1 user, load average: 42.78, 55.98, 53.31 >>> USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT >>> root pts/0192.168.8.90 22:262.00s 0.12s 0.11s w >>> >>> bwm-ng reports the highest data usage was about 6MB/s during this test >>> (and that was combined; I have two different gig networks. One gluster >>> network (primary VM storage) runs on one, the other network handles >>> everything else). >>> >>> [root@ovirt3 ~]# free -m >>> totalusedfree shared buff/cache >>> available >>> Mem: 31996 13236 232 18 18526 >>> 18195 >>> Swap: 163831475 14908 >>> >>> top - 22:32:56 up 36 days, 4:41, 1 user, load average: 17.99, 39.69, >>> 47.66 >>> Tasks: 407 total, 1 running, 405 sleeping, 1 stopped, 0 zombie >>> %Cpu(s): 8.6 us, 2.1 sy, 0.0 ni, 87.6 id, 1.6 wa, 0.0 hi, 0.1 si, >>> 0.0 st >>> KiB Mem : 32764284 total, 228296 free, 13541952 used, 18994036 >>> buff/cache >>> KiB Swap: 16777212 total, 15246200 free, 1531012 used. 
18643960 avail >>> Mem >>> >>> PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ >>> COMMAND >>> >>> 30036 qemu 20 0 6872324 5.2g 13532 S 144.6 16.5 216:14.55 >>> /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S >>> -object secret,id=masterKey0,format=raw,file=/v+ >>> 28501 qemu 20 0 5034968 3.6g 12880 S 16.2 11.7 73:44.99 >>> /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object >>> secret,id=masterKey0,format=raw,file
[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)
I think I should throw one more thing out there: The current batch of problems started essentially today, and I did apply the updates waiting in the ovirt repos (through the ovirt mgmt interface: install updates). Perhaps there is now something from that which is breaking things. On Fri, Jul 6, 2018 at 10:51 PM, Jim Kusznir wrote: > In case it matters, the data-hdd gluster volume uses these hard drives: > > https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_ > detailpage_o05_s00?ie=UTF8&psc=1 > > This is in a Dell R610 with PERC6/i (one drive per server, configured as a > single drive volume to pass it through as its own /dev/sd* device). Inside > the OS, its partitioned with lvm_thin, then an lvm volume formatted with > XFS and mounted as /gluster/brick3, with the data-hdd volume created inside > that. > > --Jim > > On Fri, Jul 6, 2018 at 10:45 PM, Jim Kusznir wrote: > >> So, I'm still at a loss...It sounds like its either insufficient >> ram/swap, or insufficient network. It seems to be neither now. At this >> point, it appears that gluster is just "broke" and killing my systems for >> no descernable reason. Here's detals, all from the same system (currently >> running 3 VMs): >> >> [root@ovirt3 ~]# w >> 22:26:53 up 36 days, 4:34, 1 user, load average: 42.78, 55.98, 53.31 >> USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT >> root pts/0192.168.8.90 22:262.00s 0.12s 0.11s w >> >> bwm-ng reports the highest data usage was about 6MB/s during this test >> (and that was combined; I have two different gig networks. One gluster >> network (primary VM storage) runs on one, the other network handles >> everything else). 
>> >> [root@ovirt3 ~]# free -m >> totalusedfree shared buff/cache >> available >> Mem: 31996 13236 232 18 18526 >> 18195 >> Swap: 163831475 14908 >> >> top - 22:32:56 up 36 days, 4:41, 1 user, load average: 17.99, 39.69, >> 47.66 >> Tasks: 407 total, 1 running, 405 sleeping, 1 stopped, 0 zombie >> %Cpu(s): 8.6 us, 2.1 sy, 0.0 ni, 87.6 id, 1.6 wa, 0.0 hi, 0.1 si, >> 0.0 st >> KiB Mem : 32764284 total, 228296 free, 13541952 used, 18994036 >> buff/cache >> KiB Swap: 16777212 total, 15246200 free, 1531012 used. 18643960 avail >> Mem >> >> PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ >> COMMAND >> >> 30036 qemu 20 0 6872324 5.2g 13532 S 144.6 16.5 216:14.55 >> /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object >> secret,id=masterKey0,format=raw,file=/v+ >> 28501 qemu 20 0 5034968 3.6g 12880 S 16.2 11.7 73:44.99 >> /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object >> secret,id=masterKey0,format=raw,file=/va+ >> 2694 root 20 0 2169224 12164 3108 S 5.0 0.0 3290:42 >> /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id >> data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+ >> 14293 root 15 -5 944700 13356 4436 S 4.0 0.0 16:32.15 >> /usr/sbin/glusterfs --volfile-server=192.168.8.11 >> --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+ >> 25100 vdsm 0 -20 6747440 107868 12836 S 2.3 0.3 21:35.20 >> /usr/bin/python2 /usr/share/vdsm/vdsmd >> >> 28971 qemu 20 0 2842592 1.5g 13548 S 1.7 4.7 241:46.49 >> /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on >> -S -object secret,id=masterKey0,format=+ >> 12095 root 20 0 162276 2836 1868 R 1.3 0.0 0:00.25 top >> >> >> 2708 root 20 0 1906040 12404 3080 S 1.0 0.0 1083:33 >> /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id >> engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+ >> 28623 qemu 20 0 4749536 1.7g 12896 S 0.7 5.5 4:30.64 >> /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on >> -S -object 
secret,id=masterKey0,format=ra+ >>10 root 20 0 0 0 0 S 0.3 0.0 215:54.72 >> [rcu_sched] >> >> 1030 sanlock rt 0 773804 27908 2744 S 0.3 0.1 35:55.61 >> /usr/sbin/sanlock daemon >> >> 1890 zabbix20 0 83904 1696 1612 S 0.3 0.0 24:30.63 >> /usr/sbin/zabbix_agentd: collector [idle 1 sec] >> >> 2722 root 20 0 1298004 6148 2580 S 0.3 0.0 38:10.82 >> /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id >> iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+ &
[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)
In case it matters, the data-hdd gluster volume uses these hard drives: https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_detailpage_o05_s00?ie=UTF8&psc=1 This is in a Dell R610 with PERC6/i (one drive per server, configured as a single drive volume to pass it through as its own /dev/sd* device). Inside the OS, its partitioned with lvm_thin, then an lvm volume formatted with XFS and mounted as /gluster/brick3, with the data-hdd volume created inside that. --Jim On Fri, Jul 6, 2018 at 10:45 PM, Jim Kusznir wrote: > So, I'm still at a loss...It sounds like its either insufficient ram/swap, > or insufficient network. It seems to be neither now. At this point, it > appears that gluster is just "broke" and killing my systems for no > descernable reason. Here's detals, all from the same system (currently > running 3 VMs): > > [root@ovirt3 ~]# w > 22:26:53 up 36 days, 4:34, 1 user, load average: 42.78, 55.98, 53.31 > USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT > root pts/0192.168.8.90 22:262.00s 0.12s 0.11s w > > bwm-ng reports the highest data usage was about 6MB/s during this test > (and that was combined; I have two different gig networks. One gluster > network (primary VM storage) runs on one, the other network handles > everything else). > > [root@ovirt3 ~]# free -m > totalusedfree shared buff/cache > available > Mem: 31996 13236 232 18 18526 > 18195 > Swap: 163831475 14908 > > top - 22:32:56 up 36 days, 4:41, 1 user, load average: 17.99, 39.69, > 47.66 > Tasks: 407 total, 1 running, 405 sleeping, 1 stopped, 0 zombie > %Cpu(s): 8.6 us, 2.1 sy, 0.0 ni, 87.6 id, 1.6 wa, 0.0 hi, 0.1 si, > 0.0 st > KiB Mem : 32764284 total, 228296 free, 13541952 used, 18994036 buff/cache > KiB Swap: 16777212 total, 15246200 free, 1531012 used. 
18643960 avail Mem > > PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ > COMMAND > > 30036 qemu 20 0 6872324 5.2g 13532 S 144.6 16.5 216:14.55 > /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object > secret,id=masterKey0,format=raw,file=/v+ > 28501 qemu 20 0 5034968 3.6g 12880 S 16.2 11.7 73:44.99 > /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object > secret,id=masterKey0,format=raw,file=/va+ > 2694 root 20 0 2169224 12164 3108 S 5.0 0.0 3290:42 > /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id > data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+ > 14293 root 15 -5 944700 13356 4436 S 4.0 0.0 16:32.15 > /usr/sbin/glusterfs --volfile-server=192.168.8.11 > --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+ > 25100 vdsm 0 -20 6747440 107868 12836 S 2.3 0.3 21:35.20 > /usr/bin/python2 /usr/share/vdsm/vdsmd > > 28971 qemu 20 0 2842592 1.5g 13548 S 1.7 4.7 241:46.49 > /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on > -S -object secret,id=masterKey0,format=+ > 12095 root 20 0 162276 2836 1868 R 1.3 0.0 0:00.25 top > > > 2708 root 20 0 1906040 12404 3080 S 1.0 0.0 1083:33 > /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id > engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+ > 28623 qemu 20 0 4749536 1.7g 12896 S 0.7 5.5 4:30.64 > /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on -S > -object secret,id=masterKey0,format=ra+ >10 root 20 0 0 0 0 S 0.3 0.0 215:54.72 > [rcu_sched] > > 1030 sanlock rt 0 773804 27908 2744 S 0.3 0.1 35:55.61 > /usr/sbin/sanlock daemon > > 1890 zabbix20 0 83904 1696 1612 S 0.3 0.0 24:30.63 > /usr/sbin/zabbix_agentd: collector [idle 1 sec] > > 2722 root 20 0 1298004 6148 2580 S 0.3 0.0 38:10.82 > /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id > iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+ > 6340 root 20 0 0 0 0 S 0.3 0.0 0:04.30 > [kworker/7:0] > > 10652 root 20 0 0 0 0 S 0.3 0.0 0:00.23 > [kworker/u64:2] 
> > 14724 root 20 0 1076344 17400 3200 S 0.3 0.1 10:04.13 > /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p > /var/run/gluster/glustershd/glustershd.pid -+ > 22011 root 20 0 0 0 0 S 0.3 0.0 0:05.04 > [kworker/10:1] > > > Not sure why the system load dropped other than I was trying to take a > picture of it :) > > In any case, it appears that at this time, I have plenty of swap, ram, and > network capacity, and yet things ar
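Given the repeated "gluster is just broke for no discernible reason" symptom in this thread, Gluster's built-in profiler is one way to get actual numbers instead of guesses: it reports per-brick, per-operation latency. A hedged sketch, using the data-hdd volume name from the thread and guarded so it is a no-op without the gluster CLI:

```shell
# Hedged sketch: profile the suspect volume to see where the time goes.
if command -v gluster >/dev/null 2>&1; then
    gluster volume profile data-hdd start
    # ...reproduce the slow workload for a few minutes, then:
    gluster volume profile data-hdd info      # per-FOP latency, per brick
    gluster volume top data-hdd read-perf     # best/worst performing files
    gluster volume profile data-hdd stop
fi
prof_msg="profile sketch"
echo "$prof_msg"
```

If one brick shows FSYNC or WRITE latencies far above its peers, that points at a failing disk or controller (as turned out to be the case on ovirt2) rather than at gluster itself.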
[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)
r NAS boxes I run for clients (central storage for windows boxes) have been very solid; if I could get that kind of reliability for my ovirt stack, it would be a substantial improvement. Currently, it seems about every other month I have a gluster-induced outage. Sometimes I wonder if hyperconverged itself is the issue, but my infrastructure doesn't justify three servers at the same location...I might be able to do two, but even that seems like it's pushing it. Looks like I can upgrade to 10G for about $900. I can order a dual-Xeon supermicro 12-disk server, loaded with 2TB WD Enterprise disks and a pair of SSDs for the os, 32GB ram, 2.67GHz CPUs for about $720 delivered. I've got to do something to improve my reliability; I can't keep going the way I have been. --Jim On Fri, Jul 6, 2018 at 9:13 PM, Johan Bernhardsson wrote: > Load like that is mostly I/O based: either the machine is swapping or the > network is too slow. Check I/O wait in top. > > And the problem where the OOM killer kills off gluster: does that mean > you don't monitor RAM usage on the servers? Either it's eating all > your ram and swap gets really I/O intensive and is then killed off, or you > have the wrong swap settings in sysctl.conf (there are tons of broken > guides that recommend swappiness 0, but that disables swap on newer > kernels. The proper swappiness for swapping only when necessary is 1, or a > sufficiently low number like 10; the default is 60.) > > > Moving to nfs will not improve things. You will get more memory since > gluster isn't running, and that is good. But you will have a single node > that can fail with all your storage, it would still be on 1 gigabit only, > and your three-node cluster would easily saturate that link. > > On July 7, 2018 04:13:13 Jim Kusznir wrote: > >> So far it does not appear to be helping much. I'm still getting VMs >> locking up and all kinds of notices from the ovirt engine about non-responsive >> hosts. I'm still seeing load averages in the 20-30 range. 
>> >> Jim >> >> On Fri, Jul 6, 2018, 3:13 PM Jim Kusznir wrote: >> >>> Thank you for the advice and help >>> >>> I do plan on going 10Gbps networking; haven't quite jumped off that >>> cliff yet, though. >>> >>> I did put my data-hdd (main VM storage volume) onto a dedicated 1Gbps >>> network, and I've watched throughput on that and never seen more than >>> 60GB/s achieved (as reported by bwm-ng). I have a separate 1Gbps network >>> for communication and ovirt migration, but I wanted to break that up >>> further (separate out VM traffice from migration/mgmt traffic). My three >>> SSD-backed gluster volumes run the main network too, as I haven't been able >>> to get them to move to the new network (which I was trying to use as all >>> gluster). I tried bonding, but that seamed to reduce performance rather >>> than improve it. >>> >>> --Jim >>> >>> On Fri, Jul 6, 2018 at 2:52 PM, Jamie Lawrence < >>> jlawre...@squaretrade.com> wrote: >>> >>>> Hi Jim, >>>> >>>> I don't have any targeted suggestions, because there isn't much to >>>> latch on to. I can say Gluster replica three (no arbiters) on dedicated >>>> servers serving a couple Ovirt VM clusters here have not had these sorts of >>>> issues. >>>> >>>> I suspect your long heal times (and the resultant long periods of high >>>> load) are at least partly related to 1G networking. That is just a matter >>>> of IO - heals of VMs involve moving a lot of bits. My cluster uses 10G >>>> bonded NICs on the gluster and ovirt boxes for storage traffic and separate >>>> bonded 1G for ovirtmgmt and communication with other machines/people, and >>>> we're occasionally hitting the bandwidth ceiling on the storage network. >>>> I'm starting to think about 40/100G, different ways of splitting up >>>> intensive systems, and considering iSCSI for specific volumes, although I >>>> really don't want to go there. 
>>>> >>>> I don't run FreeNAS[1], but I do run FreeBSD as storage servers for >>>> their excellent ZFS implementation, mostly for backups. ZFS will make your >>>> `heal` problem go away, but not your bandwidth problems, which become worse >>>> (because of fewer NICs pushing traffic). 10G hardware is not exactly in the >>>> impulse-buy territory, but if you can, I'd recommend doing some testing >>>> using it. I think at least some of your problems are related.
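Johan's advice above (check I/O wait, and make sure swappiness isn't effectively disabling swap) can be turned into a quick host-side check. This is a minimal sketch assuming a Linux host; the value 10 in the comment is only an illustrative low-but-nonzero setting, not a tested recommendation for these machines:

```shell
# Current swappiness; 0 effectively disables swap on newer kernels.
[ -r /proc/sys/vm/swappiness ] && echo "vm.swappiness=$(cat /proc/sys/vm/swappiness)"

# Aggregate I/O wait since boot: 6th field of the "cpu" line in /proc/stat.
# For a live view, watch the "wa" column in top, or run: iostat -x 1
awk '/^cpu /{print "iowait jiffies since boot:", $6}' /proc/stat

# To persist a low-but-nonzero swappiness (run as root):
#   echo "vm.swappiness = 10" > /etc/sysctl.d/99-swappiness.conf
#   sysctl -p /etc/sysctl.d/99-swappiness.conf
```

A steadily climbing iowait figure while load averages spike would point at storage rather than CPU.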
[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)
So far it does not appear to be helping much. I'm still getting VMs locking up and all kinds of notices from the ovirt engine about non-responsive hosts. I'm still seeing load averages in the 20-30 range. Jim On Fri, Jul 6, 2018, 3:13 PM Jim Kusznir wrote: > Thank you for the advice and help > > I do plan on going to 10Gbps networking; haven't quite jumped off that cliff > yet, though. > > I did put my data-hdd (main VM storage volume) onto a dedicated 1Gbps > network, and I've watched throughput on that and never seen more than > 60MB/s achieved (as reported by bwm-ng). I have a separate 1Gbps network > for communication and ovirt migration, but I wanted to break that up > further (separate out VM traffic from migration/mgmt traffic). My three > SSD-backed gluster volumes run on the main network too, as I haven't been able > to get them to move to the new network (which I was trying to use as all > gluster). I tried bonding, but that seemed to reduce performance rather > than improve it. > > --Jim > > On Fri, Jul 6, 2018 at 2:52 PM, Jamie Lawrence > wrote: > >> Hi Jim, >> >> I don't have any targeted suggestions, because there isn't much to latch >> on to. I can say Gluster replica three (no arbiters) on dedicated servers >> serving a couple Ovirt VM clusters here have not had these sorts of issues. >> >> I suspect your long heal times (and the resultant long periods of high >> load) are at least partly related to 1G networking. That is just a matter >> of IO - heals of VMs involve moving a lot of bits. My cluster uses 10G >> bonded NICs on the gluster and ovirt boxes for storage traffic and separate >> bonded 1G for ovirtmgmt and communication with other machines/people, and >> we're occasionally hitting the bandwidth ceiling on the storage network. >> I'm starting to think about 40/100G, different ways of splitting up >> intensive systems, and considering iSCSI for specific volumes, although I >> really don't want to go there. 
>> >> I don't run FreeNAS[1], but I do run FreeBSD as storage servers for their >> excellent ZFS implementation, mostly for backups. ZFS will make your `heal` >> problem go away, but not your bandwidth problems, which become worse >> (because of fewer NICS pushing traffic). 10G hardware is not exactly in the >> impulse-buy territory, but if you can, I'd recommend doing some testing >> using it. I think at least some of your problems are related. >> >> If that's not possible, my next stops would be optimizing everything I >> could about sharding, healing and optimizing for serving the shard size to >> squeeze as much performance out of 1G as I could, but that will only go so >> far. >> >> -j >> >> [1] FreeNAS is just a storage-tuned FreeBSD with a GUI. >> >> > On Jul 6, 2018, at 1:19 PM, Jim Kusznir wrote: >> > >> > hi all: >> > >> > Once again my production ovirt cluster is collapsing in on itself. My >> servers are intermittently unavailable or degrading, customers are noticing >> and calling in. This seems to be yet another gluster failure that I >> haven't been able to pin down. >> > >> > I posted about this a while ago, but didn't get anywhere (no replies >> that I found). The problem started out as a glusterfsd process consuming >> large amounts of ram (up to the point where ram and swap were exhausted and >> the kernel OOM killer killed off the glusterfsd process). For reasons not >> clear to me at this time, that resulted in any VMs running on that host and >> that gluster volume to be paused with I/O error (the glusterfs process is >> usually unharmed; why it didn't continue I/O with other servers is >> confusing to me). >> > >> > I have 3 servers and a total of 4 gluster volumes (engine, iso, data, >> and data-hdd). The first 3 are replica 2+arb; the 4th (data-hdd) is >> replica 3. The first 3 are backed by an LVM partition (some thin >> provisioned) on an SSD; the 4th is on a seagate hybrid disk (hdd + some >> internal flash for acceleration). 
data-hdd is the only thing on the disk. >> Servers are Dell R610 with the PERC/6i raid card, with the disks >> individually passed through to the OS (no raid enabled). >> > >> > The above RAM usage issue came from the data-hdd volume. Yesterday, I >> caught one of the glusterfsd high ram usage episodes before the OOM-Killer had to >> run. I was able to migrate the VMs off the machine and for good measure, >> reboot the entire machine (after taking this opportunity to run the >> software updates that ovirt said were pending).
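Before tuning gluster for the 1G ceiling discussed in this thread, it is worth confirming what the storage link can actually do independent of gluster. A hedged sketch using iperf3; `ovirt2-storage` is a placeholder for a peer host's address on the dedicated storage network, not a name from this thread. A healthy 1GbE link should report roughly 940 Mbit/s; much less points at the network rather than gluster:

```shell
# Raw-link sanity check, independent of gluster.
# On the peer host, first start a server with: iperf3 -s
PEER=${PEER:-ovirt2-storage}   # placeholder address on the storage network

if command -v iperf3 >/dev/null 2>&1; then
    iperf3 -c "$PEER" -t 10 || echo "iperf3 run failed; is 'iperf3 -s' running on $PEER?"
else
    echo "iperf3 not installed (try: yum -y install iperf3)"
fi
```

Running this during a heal would also show whether the heal traffic is actually saturating the link or stalling for some other reason.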
[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)
Thank you for the advice and help. I do plan on going to 10Gbps networking; haven't quite jumped off that cliff yet, though. I did put my data-hdd (main VM storage volume) onto a dedicated 1Gbps network, and I've watched throughput on that and never seen more than 60MB/s achieved (as reported by bwm-ng). I have a separate 1Gbps network for communication and ovirt migration, but I wanted to break that up further (separate out VM traffic from migration/mgmt traffic). My three SSD-backed gluster volumes run on the main network too, as I haven't been able to get them to move to the new network (which I was trying to use as all gluster). I tried bonding, but that seemed to reduce performance rather than improve it. --Jim On Fri, Jul 6, 2018 at 2:52 PM, Jamie Lawrence wrote: > Hi Jim, > > I don't have any targeted suggestions, because there isn't much to latch > on to. I can say Gluster replica three (no arbiters) on dedicated servers > serving a couple Ovirt VM clusters here have not had these sorts of issues. > > I suspect your long heal times (and the resultant long periods of high > load) are at least partly related to 1G networking. That is just a matter > of IO - heals of VMs involve moving a lot of bits. My cluster uses 10G > bonded NICs on the gluster and ovirt boxes for storage traffic and separate > bonded 1G for ovirtmgmt and communication with other machines/people, and > we're occasionally hitting the bandwidth ceiling on the storage network. > I'm starting to think about 40/100G, different ways of splitting up > intensive systems, and considering iSCSI for specific volumes, although I > really don't want to go there. > > I don't run FreeNAS[1], but I do run FreeBSD as storage servers for their > excellent ZFS implementation, mostly for backups. ZFS will make your `heal` > problem go away, but not your bandwidth problems, which become worse > (because of fewer NICs pushing traffic). 
10G hardware is not exactly in the > impulse-buy territory, but if you can, I'd recommend doing some testing > using it. I think at least some of your problems are related. > > If that's not possible, my next stops would be optimizing everything I > could about sharding, healing and optimizing for serving the shard size to > squeeze as much performance out of 1G as I could, but that will only go so > far. > > -j > > [1] FreeNAS is just a storage-tuned FreeBSD with a GUI. > > > On Jul 6, 2018, at 1:19 PM, Jim Kusznir wrote: > > > > hi all: > > > > Once again my production ovirt cluster is collapsing in on itself. My > servers are intermittently unavailable or degrading, customers are noticing > and calling in. This seems to be yet another gluster failure that I > haven't been able to pin down. > > > > I posted about this a while ago, but didn't get anywhere (no replies > that I found). The problem started out as a glusterfsd process consuming > large amounts of ram (up to the point where ram and swap were exhausted and > the kernel OOM killer killed off the glusterfsd process). For reasons not > clear to me at this time, that resulted in any VMs running on that host and > that gluster volume to be paused with I/O error (the glusterfs process is > usually unharmed; why it didn't continue I/O with other servers is > confusing to me). > > > > I have 3 servers and a total of 4 gluster volumes (engine, iso, data, > and data-hdd). The first 3 are replica 2+arb; the 4th (data-hdd) is > replica 3. The first 3 are backed by an LVM partition (some thin > provisioned) on an SSD; the 4th is on a seagate hybrid disk (hdd + some > internal flash for acceleration). data-hdd is the only thing on the disk. > Servers are Dell R610 with the PERC/6i raid card, with the disks > individually passed through to the OS (no raid enabled). > > > > The above RAM usage issue came from the data-hdd volume. 
Yesterday, I > caught one of the glusterfsd high ram usage episodes before the OOM-Killer had to > run. I was able to migrate the VMs off the machine and for good measure, > reboot the entire machine (after taking this opportunity to run the > software updates that ovirt said were pending). Upon booting back up, the > necessary volume healing began. However, this time, the healing caused all > three servers to go to very, very high load averages (I saw just under 200 > on one server; typically they've been 40-70) with top reporting IO Wait at > 7-20%. Network for this volume is a dedicated gig network. According to > bwm-ng, initially the network bandwidth would hit 50MB/s (yes, bytes), but > tailed off to mostly in the kB/s for a while. All machines' load averages > were still 40+ and gluster volume heal data-hdd info reported 5 items needing healing.
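Jamie's suggestion above about tuning sharding and healing can be sketched concretely. The option names below are standard gluster volume options, but the values are only illustrative starting points under the assumption of a 1G storage link, not tested recommendations; note also that `features.shard-block-size` only affects newly created files:

```shell
# Sketch of shard/heal tuning for the volume discussed in this thread.
VOL=data-hdd

if command -v gluster >/dev/null 2>&1; then
    # Inspect the current shard size first (64MB is gluster's default).
    gluster volume get "$VOL" features.shard-block-size

    # Throttle self-heal so it competes less with VM I/O on a 1G link.
    gluster volume set "$VOL" cluster.shd-max-threads 1
    gluster volume set "$VOL" cluster.shd-wait-qlength 1024
else
    echo "gluster CLI not found; run this on one of the gluster hosts"
fi
```

The trade-off is heal time versus VM responsiveness: fewer shd threads means slower heals, but less chance of the heal starving guest I/O.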
[ovirt-users] Ovirt cluster unstable; gluster to blame (again)
hi all: Once again my production ovirt cluster is collapsing in on itself. My servers are intermittently unavailable or degrading, customers are noticing and calling in. This seems to be yet another gluster failure that I haven't been able to pin down. I posted about this a while ago, but didn't get anywhere (no replies that I found). The problem started out as a glusterfsd process consuming large amounts of ram (up to the point where ram and swap were exhausted and the kernel OOM killer killed off the glusterfsd process). For reasons not clear to me at this time, that resulted in any VMs running on that host and that gluster volume being paused with I/O error (the glusterfs process is usually unharmed; why it didn't continue I/O with other servers is confusing to me). I have 3 servers and a total of 4 gluster volumes (engine, iso, data, and data-hdd). The first 3 are replica 2+arb; the 4th (data-hdd) is replica 3. The first 3 are backed by an LVM partition (some thin provisioned) on an SSD; the 4th is on a Seagate hybrid disk (hdd + some internal flash for acceleration). data-hdd is the only thing on the disk. Servers are Dell R610 with the PERC/6i raid card, with the disks individually passed through to the OS (no raid enabled). The above RAM usage issue came from the data-hdd volume. Yesterday, I caught one of the glusterfsd high ram usage episodes before the OOM-Killer had to run. I was able to migrate the VMs off the machine and for good measure, reboot the entire machine (after taking this opportunity to run the software updates that ovirt said were pending). Upon booting back up, the necessary volume healing began. However, this time, the healing caused all three servers to go to very, very high load averages (I saw just under 200 on one server; typically they've been 40-70) with top reporting IO Wait at 7-20%. Network for this volume is a dedicated gig network. 
According to bwm-ng, initially the network bandwidth would hit 50MB/s (yes, bytes), but tailed off to mostly in the kB/s for a while. All machines' load averages were still 40+ and gluster volume heal data-hdd info reported 5 items needing healing. Servers were intermittently experiencing IO issues, even on the 3 gluster volumes that appeared largely unaffected. Even the OS activities on the hosts themselves (logging in, running commands) would often be very delayed. The ovirt engine was seemingly randomly throwing engine down / engine up / engine failed notifications. Responsiveness on ANY VM was horrific most of the time, with random VMs being inaccessible. I let the gluster heal run overnight. By morning, there were still 5 items needing healing, all three servers were still experiencing high load, and servers were still largely unstable. I've noticed that all of my ovirt outages (and I've had a lot, way more than is acceptable for a production cluster) have come from gluster. I still have 3 VMs whose hard disk images have become corrupted by my last gluster crash that I haven't had time to repair / rebuild yet (I believe this crash was caused by the OOM issue previously mentioned, but I didn't know it at the time). Is gluster really ready for production yet? It seems so unstable to me that I'm looking at replacing gluster with a dedicated NFS server, likely FreeNAS. Any suggestions? What is the "right" way to do production storage on this (3-node cluster)? Can I get this gluster volume stable enough to get my VMs to run reliably again until I can deploy another storage solution? --Jim ___ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-le...@ovirt.org Privacy Statement: https://www.ovirt.org/site/privacy-policy/ oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/YQX3LQFQQPW4JTCB7B6FY2LLR6NA2CB3/
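For watching whether a heal like the one described above is actually converging, `heal-count` is much cheaper than repeatedly running a full `heal info` listing on a busy volume. A minimal sketch, using the data-hdd volume name from the message above:

```shell
# Cheap heal-progress check; heal-count avoids walking the full list of
# pending entries the way "gluster volume heal VOL info" does.
VOL=data-hdd

if command -v gluster >/dev/null 2>&1; then
    gluster volume heal "$VOL" statistics heal-count
    # For a running view, repeat it once a minute:
    #   watch -n 60 gluster volume heal data-hdd statistics heal-count
else
    echo "gluster CLI not found; run this on one of the gluster hosts"
fi
```

If the count stays flat for hours (as with the "still 5 items" above), the heal is stuck rather than slow, which is a different problem from bandwidth.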
[ovirt-users] Re: User agent for Ovirt 4.2
Hmm...I did that, and restarted the ovirt agent, but it doesn't appear to be working...All my VMs in ovirt still complain about no / outdated agents. I'll do more looking later today. On Thu, Jun 14, 2018 at 11:11 AM, Alex K wrote: > In my case I am using some Debian9 VMs and I can install the guest agents > with: > > apt-get install ovirt-guest-agent > > This article also has some reference: > https://www.ovirt.org/documentation/how-to/guest- > agent/install-the-guest-agent-in-debian/ > > > Alex > > On Thu, Jun 14, 2018 at 8:38 PM, Jim Kusznir wrote: > >> What about for debian guests? I unfortunately have several that must run >> debian (I do have a mix of RedHat-based and Debian-based VMs). >> >> Thanks! >> --Jim >> >> On Wed, Jun 13, 2018 at 11:00 PM, Leo David wrote: >> >>> Hi, >>> I've just managed to install the guest agent by installing epel-release >>> first. >>> yum -y install epel-release >>> yum -y install ovirt-guest-agent-common >>> >>> At this moment it is getting ovirt-guest-agent-common.noarch >>> 0:1.0.14-1.el7 >>> >>> On Thu, Jun 14, 2018 at 7:09 AM, Jim Kusznir >>> wrote: >>> >>>> Hi: >>>> >>>> I haven't managed to find the new / current repo/source for the ovirt >>>> guest agent for the 4.2 upgrade. All my VMs now say that they need the >>>> agent. Googles keep referring me to old / broken / non-existent repos. >>>> Where do I find the 4.2 agent (or does the 4.2 agent even exist?) >>>> >>>> Thanks! 
>>>> --Jim >>>> >>>> ___ >>>> Users mailing list -- users@ovirt.org >>>> To unsubscribe send an email to users-le...@ovirt.org >>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/ >>>> oVirt Code of Conduct: https://www.ovirt.org/communit >>>> y/about/community-guidelines/ >>>> List Archives: https://lists.ovirt.org/archiv >>>> es/list/users@ovirt.org/message/UYEOKFI7DQCWFJXZHBJ3GXM3V3SEGST4/ >>>> >>>> >>> >>> >>> -- >>> Best regards, Leo David >>> >> >> >> ___ >> Users mailing list -- users@ovirt.org >> To unsubscribe send an email to users-le...@ovirt.org >> Privacy Statement: https://www.ovirt.org/site/privacy-policy/ >> oVirt Code of Conduct: https://www.ovirt.org/communit >> y/about/community-guidelines/ >> List Archives: https://lists.ovirt.org/archiv >> es/list/users@ovirt.org/message/L6PMHKVEG5X4RZDFQDGD7IZMG24WZG6I/ >> >> > ___ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-le...@ovirt.org Privacy Statement: https://www.ovirt.org/site/privacy-policy/ oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/OOWW4HLTFFWO4Z2CO37HTSVNYCORN3N2/
[ovirt-users] Re: User agent for Ovirt 4.2
What about for debian guests? I unfortunately have several that must run debian (I do have a mix of RedHat-based and Debian-based VMs). Thanks! --Jim On Wed, Jun 13, 2018 at 11:00 PM, Leo David wrote: > Hi, > I've just managed to install the guest agent by installing epel-release first. > yum -y install epel-release > yum -y install ovirt-guest-agent-common > > At this moment it is getting ovirt-guest-agent-common.noarch 0:1.0.14-1.el7 > > On Thu, Jun 14, 2018 at 7:09 AM, Jim Kusznir wrote: > >> Hi: >> >> I haven't managed to find the new / current repo/source for the ovirt >> guest agent for the 4.2 upgrade. All my VMs now say that they need the >> agent. Googles keep referring me to old / broken / non-existent repos. >> Where do I find the 4.2 agent (or does the 4.2 agent even exist?) >> >> Thanks! >> --Jim >> >> ___ >> Users mailing list -- users@ovirt.org >> To unsubscribe send an email to users-le...@ovirt.org >> Privacy Statement: https://www.ovirt.org/site/privacy-policy/ >> oVirt Code of Conduct: https://www.ovirt.org/communit >> y/about/community-guidelines/ >> List Archives: https://lists.ovirt.org/archiv >> es/list/users@ovirt.org/message/UYEOKFI7DQCWFJXZHBJ3GXM3V3SEGST4/ >> >> > > > -- > Best regards, Leo David > ___ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-le...@ovirt.org Privacy Statement: https://www.ovirt.org/site/privacy-policy/ oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/L6PMHKVEG5X4RZDFQDGD7IZMG24WZG6I/
[ovirt-users] User agent for Ovirt 4.2
Hi: I haven't managed to find the new / current repo/source for the ovirt guest agent for the 4.2 upgrade. All my VMs now say that they need the agent. Googles keep referring me to old / broken / non-existent repos. Where do I find the 4.2 agent (or does the 4.2 agent even exist?) Thanks! --Jim ___ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-le...@ovirt.org Privacy Statement: https://www.ovirt.org/site/privacy-policy/ oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/UYEOKFI7DQCWFJXZHBJ3GXM3V3SEGST4/
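Once the packages discussed in this thread are installed, a quick way to confirm the agent is actually running inside a guest is to check its service. The service name `ovirt-guest-agent` matches the packages above; this is a hedged sketch, and on guests without systemd you would use `service ovirt-guest-agent status` instead:

```shell
# Run inside the guest after installing ovirt-guest-agent(-common).
if command -v systemctl >/dev/null 2>&1; then
    systemctl is-active ovirt-guest-agent 2>/dev/null \
        || echo "ovirt-guest-agent is not active; check the install and its logs"
else
    echo "no systemctl here; try: service ovirt-guest-agent status"
fi
```

If the service is active but the engine still flags the VM, restarting the agent after the engine upgrade (as attempted above) is the usual next step.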
[ovirt-users] Re: Gluster problems, cluster performance issues
At the moment, it is responding like I would expect. I do know I have one failed drive on one brick (hardware failure, OS removed the drive completely; the underlying /dev/sdb is gone). I have a new disk on order (overnight), but that is also one brick of one volume that is replica 3, so I would hope the system could tolerate a complete failure like that and remain operational. Since having the gluster-volume-starting problems, I have performed a test in the engine volume with writing and removing a file and verifying it happened from all three hosts; that worked. The engine volume has all of its bricks, as do two other volumes; it's only one volume that is short one brick. --Jim On Tue, May 29, 2018 at 11:41 PM, Johan Bernhardsson wrote: > Is storage working as it should? Does the gluster mount point respond as > it should? Can you write files to it? Do the physical drives say that > they are ok? Can you write (you shouldn't bypass the gluster mount point but > you need to test the drives) to the physical drives? > > For me this sounds like broken or almost broken hardware or broken > underlying filesystems. > > If one of the drives malfunctions and times out, gluster will be slow and > time out. It runs writes in sync so the slowest node will slow down the whole > system. > > /Johan > > > On May 30, 2018 08:29:46 Jim Kusznir wrote: > >> hosted-engine --deploy failed (would not come up on my existing gluster >> storage). However, I realized no changes were written to my existing >> storage. So, I went back to trying to get my old engine running. >> >> hosted-engine --vm-status is now taking a very long time (5+ minutes) to >> return, and it returns stale information everywhere. I thought perhaps the >> lockspace is corrupt, so tried to clean that and metadata, but both are >> failing (--clean-metadata has hung and I can't even ctrl-c out of it). >> >> How can I reinitialize all the lockspace/metadata safely? 
There is no >> engine or VMs running currently >> >> --Jim >> >> On Tue, May 29, 2018 at 9:33 PM, Jim Kusznir wrote: >> >>> Well, things went from bad to very, very bad. >>> >>> It appears that during one of the 2-minute lockups, the fencing agents >>> decided that another node in the cluster was down. As a result, 2 of the 3 >>> nodes were simultaneously reset with fencing agent reboot. After the nodes >>> came back up, the engine would not start. All running VMs (including VMs >>> on the 3rd node that was not rebooted) crashed. >>> >>> I've now been working for about 3 hours trying to get the engine to come >>> up. I don't know why it won't start. hosted-engine --vm-start says it's >>> starting, but it doesn't start (virsh doesn't show any VMs running). I'm >>> currently running --deploy, as I had run out of options for anything else I >>> can come up with. I hope this will allow me to re-import all my existing >>> VMs and allow me to start them back up after everything comes back up. >>> >>> I do have an unverified geo-rep backup; I don't know if it is a good >>> backup (there were several prior messages to this list, but I didn't get >>> replies to my questions. It was running in what I believe to be "strange", >>> and the data directories are larger than their source). >>> >>> I'll see if my --deploy works, and if not, I'll be back with another >>> message/help request. >>> >>> When the dust settles and I'm at least minimally functional again, I >>> really want to understand why all these technologies designed to offer >>> redundancy conspired to reduce uptime and create failures where there >>> weren't any otherwise. I thought with hosted engine, 3 ovirt servers and >>> glusterfs with minimum replica 2+arb or replica 3 should have offered >>> strong resilience against server failure or disk failure, and should have >>> prevented / recovered from data corruption. 
Instead, all of the above >>> happened (once I get my cluster back up, I still have to try and recover my >>> webserver VM, which won't boot due to XFS corrupt journal issues created >>> during the gluster crashes). I think a lot of these issues were rooted >>> in the upgrade from 4.1 to 4.2. >>> >>> --Jim >>> >>> On Tue, May 29, 2018 at 6:25 PM, Jim Kusznir >>> wrote: >>> >>>> I also finally found the following in my system log on one server: >>>> >>>> [10679.524491] INFO: task glusterclogro:14933 blocked for more than 120 seconds.
[ovirt-users] Re: Gluster problems, cluster performance issues
hosted-engine --deploy failed (would not come up on my existing gluster storage). However, I realized no changes were written to my existing storage. So, I went back to trying to get my old engine running. hosted-engine --vm-status is now taking a very long time (5+ minutes) to return, and it returns stale information everywhere. I thought perhaps the lockspace is corrupt, so tried to clean that and metadata, but both are failing (--clean-metadata has hung and I can't even ctrl-c out of it). How can I reinitialize all the lockspace/metadata safely? There is no engine or VMs running currently. --Jim On Tue, May 29, 2018 at 9:33 PM, Jim Kusznir wrote: > Well, things went from bad to very, very bad. > > It appears that during one of the 2-minute lockups, the fencing agents > decided that another node in the cluster was down. As a result, 2 of the 3 > nodes were simultaneously reset with fencing agent reboot. After the nodes > came back up, the engine would not start. All running VMs (including VMs > on the 3rd node that was not rebooted) crashed. > > I've now been working for about 3 hours trying to get the engine to come > up. I don't know why it won't start. hosted-engine --vm-start says it's > starting, but it doesn't start (virsh doesn't show any VMs running). I'm > currently running --deploy, as I had run out of options for anything else I > can come up with. I hope this will allow me to re-import all my existing > VMs and allow me to start them back up after everything comes back up. > > I do have an unverified geo-rep backup; I don't know if it is a good > backup (there were several prior messages to this list, but I didn't get > replies to my questions. It was running in what I believe to be "strange", > and the data directories are larger than their source). > > I'll see if my --deploy works, and if not, I'll be back with another > message/help request. 
> > When the dust settles and I'm at least minimally functional again, I > really want to understand why all these technologies designed to offer > redundancy conspired to reduce uptime and create failures where there > weren't any otherwise. I thought with hosted engine, 3 ovirt servers and > glusterfs with minimum replica 2+arb or replica 3 should have offered > strong resilience against server failure or disk failure, and should have > prevented / recovered from data corruption. Instead, all of the above > happened (once I get my cluster back up, I still have to try and recover my > webserver VM, which won't boot due to XFS corrupt journal issues created > during the gluster crashes). I think a lot of these issues were rooted > from the upgrade from 4.1 to 4.2. > > --Jim > > On Tue, May 29, 2018 at 6:25 PM, Jim Kusznir wrote: > >> I also finally found the following in my system log on one server: >> >> [10679.524491] INFO: task glusterclogro:14933 blocked for more than 120 >> seconds. >> [10679.525826] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >> disables this message. >> [10679.527144] glusterclogro D 97209832bf40 0 14933 1 >> 0x0080 >> [10679.527150] Call Trace: >> [10679.527161] [] schedule+0x29/0x70 >> [10679.527218] [] _xfs_log_force_lsn+0x2e8/0x340 [xfs] >> [10679.527225] [] ? wake_up_state+0x20/0x20 >> [10679.527254] [] xfs_file_fsync+0x107/0x1e0 [xfs] >> [10679.527260] [] do_fsync+0x67/0xb0 >> [10679.527268] [] ? system_call_after_swapgs+0xbc/ >> 0x160 >> [10679.527271] [] SyS_fsync+0x10/0x20 >> [10679.527275] [] system_call_fastpath+0x1c/0x21 >> [10679.527279] [] ? system_call_after_swapgs+0xc8/ >> 0x160 >> [10679.527283] INFO: task glusterposixfsy:14941 blocked for more than 120 >> seconds. >> [10679.528608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >> disables this message. 
>> [10679.529956] glusterposixfsy D 972495f84f10 0 14941 1 >> 0x0080 >> [10679.529961] Call Trace: >> [10679.529966] [] schedule+0x29/0x70 >> [10679.530003] [] _xfs_log_force_lsn+0x2e8/0x340 [xfs] >> [10679.530008] [] ? wake_up_state+0x20/0x20 >> [10679.530038] [] xfs_file_fsync+0x107/0x1e0 [xfs] >> [10679.530042] [] do_fsync+0x67/0xb0 >> [10679.530046] [] ? system_call_after_swapgs+0xbc/ >> 0x160 >> [10679.530050] [] SyS_fdatasync+0x13/0x20 >> [10679.530054] [] system_call_fastpath+0x1c/0x21 >> [10679.530058] [] ? system_call_after_swapgs+0xc8/ >> 0x160 >> [10679.530062] INFO: task glusteriotwr13:15486 blocked for more than 120 >> seconds. >> [10679.531805] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >> disables this message.
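To the lockspace question above: the hosted-engine CLI does provide a `--reinitialize-lockspace` action in oVirt 4.x. A hedged sketch of the sequence, to be attempted only with the engine VM down, the cluster in global maintenance, and the underlying storage known to be healthy, since it rewrites the sanlock lockspace:

```shell
# Rewrites the hosted-engine sanlock lockspace metadata. Destructive if
# the storage is still unhealthy; take global maintenance first.
if command -v hosted-engine >/dev/null 2>&1; then
    hosted-engine --set-maintenance --mode=global
    hosted-engine --reinitialize-lockspace
    hosted-engine --vm-status
else
    echo "hosted-engine CLI not found; run this on a hosted-engine host"
fi
```

Given the hung `--clean-metadata` described above, verifying that the storage domain itself responds (a simple write/read on the mount) before reinitializing anything would be prudent.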
[ovirt-users] Re: Gluster problems, cluster performance issues
Well, things went from bad to very, very bad. It appears that during one of the 2-minute lockups, the fencing agents decided that another node in the cluster was down. As a result, 2 of the 3 nodes were simultaneously reset with fencing agent reboot. After the nodes came back up, the engine would not start. All running VMs (including VMs on the 3rd node that was not rebooted) crashed. I've now been working for about 3 hours trying to get the engine to come up. I don't know why it won't start. hosted-engine --vm-start says it's starting, but it doesn't start (virsh doesn't show any VMs running). I'm currently running --deploy, as I had run out of options for anything else I can come up with. I hope this will allow me to re-import all my existing VMs and allow me to start them back up after everything comes back up. I do have an unverified geo-rep backup; I don't know if it is a good backup (there were several prior messages to this list, but I didn't get replies to my questions. It was running in what I believe to be "strange", and the data directories are larger than their source). I'll see if my --deploy works, and if not, I'll be back with another message/help request. When the dust settles and I'm at least minimally functional again, I really want to understand why all these technologies designed to offer redundancy conspired to reduce uptime and create failures where there weren't any otherwise. I thought with hosted engine, 3 ovirt servers and glusterfs with minimum replica 2+arb or replica 3 should have offered strong resilience against server failure or disk failure, and should have prevented / recovered from data corruption. Instead, all of the above happened (once I get my cluster back up, I still have to try and recover my webserver VM, which won't boot due to XFS corrupt journal issues created during the gluster crashes). I think a lot of these issues were rooted in the upgrade from 4.1 to 4.2. 
--Jim On Tue, May 29, 2018 at 6:25 PM, Jim Kusznir wrote: > I also finally found the following in my system log on one server: > > [10679.524491] INFO: task glusterclogro:14933 blocked for more than 120 > seconds. > [10679.525826] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [10679.527144] glusterclogro D 97209832bf40 0 14933 1 > 0x0080 > [10679.527150] Call Trace: > [10679.527161] [] schedule+0x29/0x70 > [10679.527218] [] _xfs_log_force_lsn+0x2e8/0x340 [xfs] > [10679.527225] [] ? wake_up_state+0x20/0x20 > [10679.527254] [] xfs_file_fsync+0x107/0x1e0 [xfs] > [10679.527260] [] do_fsync+0x67/0xb0 > [10679.527268] [] ? system_call_after_swapgs+0xbc/0x160 > [10679.527271] [] SyS_fsync+0x10/0x20 > [10679.527275] [] system_call_fastpath+0x1c/0x21 > [10679.527279] [] ? system_call_after_swapgs+0xc8/0x160 > [10679.527283] INFO: task glusterposixfsy:14941 blocked for more than 120 > seconds. > [10679.528608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [10679.529956] glusterposixfsy D 972495f84f10 0 14941 1 > 0x0080 > [10679.529961] Call Trace: > [10679.529966] [] schedule+0x29/0x70 > [10679.530003] [] _xfs_log_force_lsn+0x2e8/0x340 [xfs] > [10679.530008] [] ? wake_up_state+0x20/0x20 > [10679.530038] [] xfs_file_fsync+0x107/0x1e0 [xfs] > [10679.530042] [] do_fsync+0x67/0xb0 > [10679.530046] [] ? system_call_after_swapgs+0xbc/0x160 > [10679.530050] [] SyS_fdatasync+0x13/0x20 > [10679.530054] [] system_call_fastpath+0x1c/0x21 > [10679.530058] [] ? system_call_after_swapgs+0xc8/0x160 > [10679.530062] INFO: task glusteriotwr13:15486 blocked for more than 120 > seconds. > [10679.531805] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [10679.533732] glusteriotwr13 D 9720a83f 0 15486 1 > 0x0080 > [10679.533738] Call Trace: > [10679.533747] [] schedule+0x29/0x70 > [10679.533799] [] _xfs_log_force_lsn+0x2e8/0x340 [xfs] > [10679.533806] [] ? 
wake_up_state+0x20/0x20 > [10679.533846] [] xfs_file_fsync+0x107/0x1e0 [xfs] > [10679.533852] [] do_fsync+0x67/0xb0 > [10679.533858] [] ? system_call_after_swapgs+0xbc/0x160 > [10679.533863] [] SyS_fdatasync+0x13/0x20 > [10679.533868] [] system_call_fastpath+0x1c/0x21 > [10679.533873] [] ? system_call_after_swapgs+0xc8/0x160 > [10919.512757] INFO: task glusterclogro:14933 blocked for more than 120 > seconds. > [10919.514714] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [10919.516663] glusterclogro D 97209832bf40 0 14933 1 > 0x0080 > [10919.516677] Call Trace: > [10919.516690] [] schedule+0x29/0x70 > [10919.516696] []
[ovirt-users] Re: Gluster problems, cluster performance issues
0x0080 [11279.504635] Call Trace: [11279.504640] [] schedule+0x29/0x70 [11279.504676] [] _xfs_log_force_lsn+0x2e8/0x340 [xfs] [11279.504681] [] ? wake_up_state+0x20/0x20 [11279.504710] [] xfs_file_fsync+0x107/0x1e0 [xfs] [11279.504714] [] do_fsync+0x67/0xb0 [11279.504718] [] ? system_call_after_swapgs+0xbc/0x160 [11279.504722] [] SyS_fsync+0x10/0x20 [11279.504725] [] system_call_fastpath+0x1c/0x21 [11279.504730] [] ? system_call_after_swapgs+0xc8/0x160 [12127.466494] perf: interrupt took too long (8263 > 8150), lowering kernel.perf_event_max_sample_rate to 24000 I think this is the cause of the massive ovirt performance issues irrespective of gluster volume. At the time this happened, I was also ssh'ed into the host, and was doing some rpm query commands. I had just run rpm -qa |grep glusterfs (to verify what version was actually installed), and that command took almost 2 minutes to return! Normally it takes less than 2 seconds. That is all pure local SSD IO, too. I'm no expert, but it's my understanding that any time software causes these kinds of issues, it's a serious bug in the software, even if it's mishandled exceptions. Is this correct? --Jim On Tue, May 29, 2018 at 3:01 PM, Jim Kusznir wrote: > I think this is the profile information for one of the volumes that lives > on the SSDs and is fully operational with no down/problem disks: > > [root@ovirt2 yum.repos.d]# gluster volume profile data info > Brick: ovirt2.nwfiber.com:/gluster/brick2/data > -- > Cumulative Stats: >Block Size:256b+ 512b+ > 1024b+ > No. of Reads: 983 2696 > 1059 > No. of Writes:0 1113 > 302 > >Block Size: 2048b+4096b+ > 8192b+ > No. of Reads: 852 88608 > 53526 > No. of Writes: 522812340 > 76257 > >Block Size: 16384b+ 32768b+ > 65536b+ > No. of Reads:54351241901 > 15024 > No. of Writes:21636 8656 > 8976 > >Block Size: 131072b+ > No. of Reads: 524156 > No. of Writes: 296071 > %-latency Avg-latency Min-Latency Max-Latency No.
of calls   Fop
> ---------   -----------   -----------   -----------   ------------   ---
>      0.00       0.00 us       0.00 us        0.00 us          4189   RELEASE
>      0.00       0.00 us       0.00 us        0.00 us          1257   RELEASEDIR
>      0.00      46.19 us      12.00 us      187.00 us            69   FLUSH
>      0.00     147.00 us      78.00 us      367.00 us            86   REMOVEXATTR
>      0.00     223.46 us      24.00 us     1166.00 us           149   READDIR
>      0.00     565.34 us      76.00 us     3639.00 us            88   FTRUNCATE
>      0.00     263.28 us      20.00 us    28385.00 us           228   LK
>      0.00      98.84 us       2.00 us      880.00 us          1198   OPENDIR
>      0.00      91.59 us      26.00 us    10371.00 us          3853   STATFS
>      0.00     494.14 us      17.00 us   193439.00 us          1171   GETXATTR
>      0.00     299.42 us      35.00 us     9799.00 us          2044   READDIRP
>      0.00    1965.31 us     110.00 us   382258.00 us           321   XATTROP
>      0.01     113.40 us      24.00 us    61061.00 us          8134   STAT
>      0.01     755.38 us      57.00 us   607603.00 us          3196   DISCARD
>      0.05    2690.09 us      58.00 us  2704761.00 us          3206   OPEN
>      0.10  119978.25 us      97.00 us  9406684.00 us           154   SETATTR
>      0.18     101.73 us      28.00 us   700477.00 us        313379   FSTAT
>      0.23    1059.84 us      25.00 us  2716124.00 us         38255   LOOKUP
>      0.47    1024.11 us      54.00 us  6197164.00 us         81455   FXATTROP
>      1.72    2984.00 us      15.00 us 37098954.00 us        103020   FINODELK
>      5.92   44315.32 us      51.00 us 24731536.00 us         23957   FSYNC
>     13.27    2399.78 us      25.00 us 22089540.00 us        991005   READ
>     37.00    5980.43 us      52.00 us 22099889.00 us       1108976   WRITE
>     41.04    5452.75 us      13.00 us 22102452.00 us       1349053   INODELK
>
> Duration: 10026 seconds
> Data Read: 80046027759 bytes
> Data Written: 44496632320 bytes
>
> Interval 1 Stats:
>    Block Size:          256b+          512b+         1024b+
>  No. of Reads:            983           2696           1059
> No. of Writes:
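For anyone following along, profile output like the above comes from gluster's built-in profiler. A minimal sketch of the commands (the volume name `data` is taken from the quoted output; the guard lets the snippet exit cleanly on a machine without the gluster CLI, and note that profiling adds a small amount of overhead while enabled):

```shell
# Start profiling on the volume, dump a report, then stop it again.
if ! command -v gluster >/dev/null 2>&1; then
    msg="gluster CLI not installed - run these on a storage node"
    echo "$msg"
else
    gluster volume profile data start
    gluster volume profile data info    # cumulative + interval stats, as quoted above
    gluster volume profile data stop
    msg="profile run complete"
fi
```

`info` can be re-run repeatedly; the "Interval" section covers only the window since the previous `info` call, which makes it useful for catching the stall waves described later in this thread.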
[ovirt-users] Re: Gluster problems, cluster performance issues
51.00 us   3595513.00 us    131642   WRITE
    17.71     957.08 us   16.00 us  13700466.00 us   1508160   INODELK
    24.56    2546.42 us   26.00 us   5077347.00 us    786060   READ
    31.54   49651.63 us   47.00 us   3746331.00 us     51777   FSYNC

Duration: 10101 seconds
Data Read: 101562897361 bytes
Data Written: 4834450432 bytes

On Tue, May 29, 2018 at 2:55 PM, Jim Kusznir wrote: > Thank you for your response. > > I have 4 gluster volumes. 3 are replica 2 + arbitrator. Replica bricks > are on ovirt1 and ovirt2, arbitrator on ovirt3. The 4th volume is replica > 3, with a brick on all three ovirt machines. > > The first 3 volumes are on an SSD disk; the 4th is on a Seagate SSHD (same > in all three machines). On ovirt3, the SSHD has reported hard IO failures, > and that brick is offline. However, the other two replicas are fully > operational (although they still show contents in the heal info command > that won't go away, but that may be the case until I replace the failed > disk). > > What is bothering me is that ALL 4 gluster volumes are showing horrible > performance issues. At this point, as the bad disk has been completely > offlined, I would expect gluster to perform at normal speed, but that is > definitely not the case. > > I've also noticed that the performance hits seem to come in waves: things > seem to work acceptably (but slow) for a while, then suddenly, it's as if > all disk IO on all volumes (including non-gluster local OS disk volumes for > the hosts) pause for about 30 seconds, then IO resumes again. During those > times, I start getting VM not responding and host not responding notices as > well as the applications having major issues. > > I've shut down most of my VMs and am down to just my essential core VMs > (shed about 75% of my VMs). I still am experiencing the same issues. > > Am I correct in believing that once the failed disk was brought offline > that performance should return to normal?
> > On Tue, May 29, 2018 at 1:27 PM, Alex K wrote: > >> I would check disks status and accessibility of mount points where your >> gluster volumes reside. >> >> On Tue, May 29, 2018, 22:28 Jim Kusznir wrote: >> >>> On one ovirt server, I'm now seeing these messages: >>> [56474.239725] blk_update_request: 63 callbacks suppressed >>> [56474.239732] blk_update_request: I/O error, dev dm-2, sector 0 >>> [56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472 >>> [56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584 >>> [56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048 >>> [56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424 >>> [56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536 >>> [56474.247347] blk_update_request: I/O error, dev dm-2, sector 0 >>> [56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472 >>> [56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584 >>> [56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048 >>> >>> >>> >>> >>> On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir >>> wrote: >>> >>>> I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2): >>>> >>>> May 29 11:54:41 ovirt3 ovs-vsctl: >>>> ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: >>>> database connection failed (No such file or directory) >>>> May 29 11:54:51 ovirt3 ovs-vsctl: >>>> ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: >>>> database connection failed (No such file or directory) >>>> May 29 11:55:01 ovirt3 ovs-vsctl: >>>> ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: >>>> database connection failed (No such file or directory) >>>> (appears a lot). >>>> >>>> I also found on the ssh session of that, some sysv warnings about the >>>> backing disk for one of the gluster volumes (straight replica 3). The >>>> glusterfs process for that disk on that machine went offline. 
It's my >>>> understanding that it should continue to work with the other two machines >>>> while I attempt to replace that disk, right? Attempted writes (touching an >>>> empty file) can take 15 seconds, repeating it later will be much faster. >>>> >>>> Gluster generates a bunch of different log files; I don't know which >>>> ones you want, or from which machine(s). >>>> >>>> How do I do "vol
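The `dev dm-2` in the blk_update_request errors quoted above is a device-mapper node, and mapping it back to a physical disk usually identifies the failing drive. A small sketch (the device name `dm-2` is copied from the quoted log and may not exist on other hosts, hence the guard):

```shell
# Map a device-mapper node back to its logical name and underlying disks.
DM=dm-2    # device name taken from the blk_update_request messages
if [ -d "/sys/block/$DM" ]; then
    cat "/sys/block/$DM/dm/name"    # dm target name, e.g. an LVM LV
    ls "/sys/block/$DM/slaves"      # physical block devices underneath it
    result="mapped $DM"
else
    result="$DM not present on this host"
fi
echo "$result"
```

`lsblk` gives the same picture as a tree; pairing the slave device with `smartctl -a` output is one way to confirm the DRAC/SSHD failure theory.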
[ovirt-users] Re: Gluster quorum
I had the same problem when I upgraded to 4.2. I found that if I went to the brick in the UI and selected it, there was a "start" button in the upper-right of the GUI. Clicking that resolved this problem a few minutes later. I had to repeat for each volume that showed a brick down for which that brick was not actually down. --Jim On Tue, May 29, 2018 at 6:34 AM, Demeter Tibor wrote: > Hi, > > I've successfully upgraded my hosts and I could raise the cluster level to > 4.2. > Everything seems fine, but the monitoring problem is not resolved. My > bricks on the first node are shown as down (red), but glusterfs is working fine > (I verified in the terminal). > > I've attached my engine.log. > > Thanks in advance, > > R, > Tibor > > - On May 28, 2018, at 14:59, Demeter Tibor wrote: > > Hi, > OK, I will try it. > > In this case, is it possible to remove and re-add a host that is a member of the HA > gluster? This is another task, but I need to separate my gluster > network from my ovirtmgmt network. > What is the recommended way to do this? > > It is not important now, but I will need to do it in future. > > I will attach my engine.log after upgrading my host. > > Thanks, > Regards. > > Tibor > > > - On May 28, 2018, at 14:44, Sahina Bose wrote: > > > > On Mon, May 28, 2018 at 4:47 PM, Demeter Tibor > wrote: > >> Dear Sahina, >> >> Yes, exactly. I can check that check box, but I don't know how safe >> that is. Is it safe? >> > > It is safe - if you can ensure that only one host is put into maintenance > at a time. > >> >> I want to upgrade all of my hosts. If that is done, will the monitoring >> work perfectly? >> > > If it does not, please provide engine.log again once you've upgraded all > the hosts. > > >> Thanks. >> R. >> >> Tibor >> >> >> >> - On May 28, 2018, at 10:09, Sahina Bose wrote: >> >> >> >> On Mon, May 28, 2018 at 1:06 PM, Demeter Tibor >> wrote: >> >>> Hi, >>> >>> Could somebody answer my question, please?
>>> It is very important for me; I could not finish my upgrade process (from >>> 4.1 to 4.2) since 9th May! >> >> Can you explain how the upgrade process is blocked due to the monitoring? >> If it's because you cannot move the host to maintenance, can you try with >> the option "Ignore quorum checks" enabled? >> >> >>> Meanwhile - I don't know why - one of my two gluster volumes seems UP >>> (green) on the GUI. So, now only one is down. >>> >>> I need help. What can I do? >>> >>> Thanks in advance, >>> >>> Regards, >>> >>> Tibor >>> >>> >>> - On May 23, 2018, at 21:09, Demeter Tibor wrote: >>> >>> Hi, >>> >>> I've updated again to the latest version, but there are no changes. All >>> of the bricks on my first node are down in the GUI (in the console they are OK). >>> An interesting thing: the "Self-Heal info" column shows "OK" for all >>> hosts and all bricks, but the "Space used" column is zero for all hosts/bricks. >>> Can I force remove and re-add my host to the cluster while it is a gluster >>> member? Is it safe? >>> What can I do? >>> >>> I haven't updated the other hosts while gluster is not working fine, or while the GUI >>> does not detect it. So my other hosts remain on 4.1 :( >>> >>> Thanks in advance, >>> >>> Regards >>> >>> Tibor >>> >>> - On May 23, 2018, at 14:45, Denis Chapligin >>> wrote: >>> >>> Hello! >>> >>> On Tue, May 22, 2018 at 11:10 AM, Demeter Tibor >>> wrote: >>> Is there any change with this bug? I still haven't finished my upgrade process that I started on 9th May :( Please help me if you can. >>> >>> Looks like all required patches are already merged, so could you please >>> update your engine again to the latest nightly build?
>>> >>> >>> ___ >>> Users mailing list -- users@ovirt.org >>> To unsubscribe send an email to users-le...@ovirt.org >>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/ >>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ >>> List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/MRAAPZSRIXLAJZBV6TRDXXK7R2ISPSDK/ >>> >> > > ___ > Users mailing list -- users@ovirt.org > To unsubscribe send an email to users-le...@ovirt.org > Privacy Statement: https://www.ovirt.org/site/privacy-policy/ > oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ > List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/OWA2I6AFZPO56Z2N6D25HUHLW6CGOUWL/
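The UI "start" workaround described at the top of this thread has a CLI equivalent that can be scripted. A hedged sketch (the volume name `engine` is only an example; `start ... force` restarts any brick daemons that are down without disturbing ones already running, and the guard lets this exit cleanly where the gluster CLI is absent):

```shell
# CLI equivalent of the GUI brick "start" button: force-start the volume.
VOL=engine    # example volume name - substitute your own
if ! command -v gluster >/dev/null 2>&1; then
    status="gluster CLI not installed - run on a storage node"
else
    gluster volume start "$VOL" force    # restarts any offline brick processes
    gluster volume status "$VOL"         # confirm every brick shows a PID and port
    status="issued force start for $VOL"
fi
echo "$status"
```

Repeating this per volume mirrors the per-volume clicking described above.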
[ovirt-users] Re: Gluster problems, cluster performance issues
Thank you for your response. I have 4 gluster volumes. 3 are replica 2 + arbitrator. Replica bricks are on ovirt1 and ovirt2, arbitrator on ovirt3. The 4th volume is replica 3, with a brick on all three ovirt machines. The first 3 volumes are on an SSD disk; the 4th is on a Seagate SSHD (same in all three machines). On ovirt3, the SSHD has reported hard IO failures, and that brick is offline. However, the other two replicas are fully operational (although they still show contents in the heal info command that won't go away, but that may be the case until I replace the failed disk). What is bothering me is that ALL 4 gluster volumes are showing horrible performance issues. At this point, as the bad disk has been completely offlined, I would expect gluster to perform at normal speed, but that is definitely not the case. I've also noticed that the performance hits seem to come in waves: things seem to work acceptably (but slow) for a while, then suddenly, it's as if all disk IO on all volumes (including non-gluster local OS disk volumes for the hosts) pause for about 30 seconds, then IO resumes again. During those times, I start getting VM not responding and host not responding notices as well as the applications having major issues. I've shut down most of my VMs and am down to just my essential core VMs (shed about 75% of my VMs). I still am experiencing the same issues. Am I correct in believing that once the failed disk was brought offline that performance should return to normal? On Tue, May 29, 2018 at 1:27 PM, Alex K wrote: > I would check disks status and accessibility of mount points where your > gluster volumes reside.
> > On Tue, May 29, 2018, 22:28 Jim Kusznir wrote: > >> On one ovirt server, I'm now seeing these messages: >> [56474.239725] blk_update_request: 63 callbacks suppressed >> [56474.239732] blk_update_request: I/O error, dev dm-2, sector 0 >> [56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472 >> [56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584 >> [56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048 >> [56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424 >> [56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536 >> [56474.247347] blk_update_request: I/O error, dev dm-2, sector 0 >> [56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472 >> [56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584 >> [56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048 >> >> >> >> >> On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir >> wrote: >> >>> I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2): >>> >>> May 29 11:54:41 ovirt3 ovs-vsctl: ovs|1|db_ctl_base|ERR| >>> unix:/var/run/openvswitch/db.sock: database connection failed (No such >>> file or directory) >>> May 29 11:54:51 ovirt3 ovs-vsctl: ovs|1|db_ctl_base|ERR| >>> unix:/var/run/openvswitch/db.sock: database connection failed (No such >>> file or directory) >>> May 29 11:55:01 ovirt3 ovs-vsctl: ovs|1|db_ctl_base|ERR| >>> unix:/var/run/openvswitch/db.sock: database connection failed (No such >>> file or directory) >>> (appears a lot). >>> >>> I also found on the ssh session of that, some sysv warnings about the >>> backing disk for one of the gluster volumes (straight replica 3). The >>> glusterfs process for that disk on that machine went offline. It's my >>> understanding that it should continue to work with the other two machines >>> while I attempt to replace that disk, right?
Attempted writes (touching an >>> empty file) can take 15 seconds, repeating it later will be much faster. >>> >>> Gluster generates a bunch of different log files, I don't know which ones >>> you want, or from which machine(s). >>> >>> How do I do "volume profiling"? >>> >>> Thanks! >>> >>> On Tue, May 29, 2018 at 11:53 AM, Sahina Bose wrote: >>> >>>> Do you see errors reported in the mount logs for the volume? If so, >>>> could you attach the logs? >>>> Any issues with your underlying disks? Can you also attach output of >>>> volume profiling? >>>> >>>> On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir >>>> wrote: >>>> >>>>> Ok, things have gotten MUCH worse this morning. I'm getting random >>>>> errors from VMs, right now, about a third of my VMs have been paused due >>>>> to >>>>> storage issues
[ovirt-users] Re: Gluster problems, cluster performance issues
Due to the cluster spiraling downward and increasing customer complaints, I went ahead and finished the upgrade of the nodes to ovirt 4.2 and gluster 3.12. It didn't seem to help at all. I DO have one brick down on ONE of my 4 gluster filesystems/exports/whatever. The other 3 are fully available. However, I still see heavy IO wait, including on the perfectly healthy filesystem. It's bad enough that I get ovirt e-mails warning of hosts down and back up, and VMs on the good gluster filesystem are reporting IO Waits of greater than 60% in top! I have applications that are crashing due to the IO Wait issues. I do think I got glusterfs profiling running, but I don't know how to get a useful report out (it's in the oVirt GUI). I did see read and write operations showing about 30 seconds; I would have expected that to be MUCH better. (As I write this, my core VoIP server is now showing 99.1% IOWait load. And that is customer calls failing/dropping). PLEASE...how do I FIX this? --Jim On Tue, May 29, 2018 at 12:14 PM, Jim Kusznir wrote: > On one ovirt server, I'm now seeing these messages: > [56474.239725] blk_update_request: 63 callbacks suppressed > [56474.239732] blk_update_request: I/O error, dev dm-2, sector 0 > [56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472 > [56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584 > [56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048 > [56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424 > [56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536 > [56474.247347] blk_update_request: I/O error, dev dm-2, sector 0 > [56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472 > [56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584 > [56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048 > > > > > On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir wrote: > >> I see in messages on ovirt3 (my 3rd machine,
the one upgraded to 4.2): >> >> May 29 11:54:41 ovirt3 ovs-vsctl: >> ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: >> database connection failed (No such file or directory) >> May 29 11:54:51 ovirt3 ovs-vsctl: >> ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: >> database connection failed (No such file or directory) >> May 29 11:55:01 ovirt3 ovs-vsctl: >> ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: >> database connection failed (No such file or directory) >> (appears a lot). >> >> I also found on the ssh session of that, some sysv warnings about the >> backing disk for one of the gluster volumes (straight replica 3). The >> glusterfs process for that disk on that machine went offline. It's my >> understanding that it should continue to work with the other two machines >> while I attempt to replace that disk, right? Attempted writes (touching an >> empty file) can take 15 seconds, repeating it later will be much faster. >> >> Gluster generates a bunch of different log files, I don't know which ones >> you want, or from which machine(s). >> >> How do I do "volume profiling"? >> >> Thanks! >> >> On Tue, May 29, 2018 at 11:53 AM, Sahina Bose wrote: >> >>> Do you see errors reported in the mount logs for the volume? If so, >>> could you attach the logs? >>> Any issues with your underlying disks? Can you also attach output of >>> volume profiling? >>> >>> On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir >>> wrote: >>> >>>> Ok, things have gotten MUCH worse this morning. I'm getting random >>>> errors from VMs, right now, about a third of my VMs have been paused due to >>>> storage issues, and most of the remaining VMs are not performing well. >>>> >>>> At this point, I am in full EMERGENCY mode, as my production services >>>> are now impacted, and I'm getting calls coming in with problems... >>>> >>>> I'd greatly appreciate help...VMs are running VERY slowly (when they >>>> run), and they are steadily getting worse. I don't know why.
I was seeing >>>> CPU peaks (to 100%) on several VMs, in perfect sync, for a few minutes at a >>>> time (while the VM became unresponsive and any VMs I was logged into that >>>> were linux were giving me the CPU stuck messages in my original post). Is >>>> all this storage related? >>>> >>>> I also have two different gluster volumes for VM storage, and only one >>>> had the issues, but now VMs in both are bei
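One way to separate gluster latency from raw-disk latency during stalls like those described above is a synchronous write test run directly on each host's local filesystem, then repeated on the brick mount for comparison. A sketch (the path below is a throwaway temp file, not one of the brick paths):

```shell
# Write 64 MiB and force it to stable storage; on a healthy local SSD this
# completes in well under a second, so multi-second times implicate the
# disk/controller layer rather than gluster.
TESTFILE=$(mktemp /tmp/iolatency.XXXXXX)
dd if=/dev/zero of="$TESTFILE" bs=1M count=64 conv=fdatasync 2>&1
size=$(wc -c < "$TESTFILE")
echo "wrote $size bytes"
rm -f "$TESTFILE"
```

Running this in a loop during one of the 30-second "waves" would show whether the local disks themselves stall, which the hung-task traces earlier in the thread suggest.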
[ovirt-users] Re: Gluster problems, cluster performance issues
On one ovirt server, I'm now seeing these messages: [56474.239725] blk_update_request: 63 callbacks suppressed [56474.239732] blk_update_request: I/O error, dev dm-2, sector 0 [56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472 [56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584 [56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048 [56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424 [56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536 [56474.247347] blk_update_request: I/O error, dev dm-2, sector 0 [56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472 [56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584 [56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048 On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir wrote: > I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2): > > May 29 11:54:41 ovirt3 ovs-vsctl: ovs|1|db_ctl_base|ERR| > unix:/var/run/openvswitch/db.sock: database connection failed (No such > file or directory) > May 29 11:54:51 ovirt3 ovs-vsctl: ovs|1|db_ctl_base|ERR| > unix:/var/run/openvswitch/db.sock: database connection failed (No such > file or directory) > May 29 11:55:01 ovirt3 ovs-vsctl: ovs|1|db_ctl_base|ERR| > unix:/var/run/openvswitch/db.sock: database connection failed (No such > file or directory) > (appears a lot). > > I also found on the ssh session of that, some sysv warnings about the > backing disk for one of the gluster volumes (straight replica 3). The > glusterfs process for that disk on that machine went offline. It's my > understanding that it should continue to work with the other two machines > while I attempt to replace that disk, right? Attempted writes (touching an > empty file) can take 15 seconds, repeating it later will be much faster. > > Gluster generates a bunch of different log files, I don't know which ones > you want, or from which machine(s).
> > How do I do "volume profiling"? > > Thanks! > > On Tue, May 29, 2018 at 11:53 AM, Sahina Bose wrote: > >> Do you see errors reported in the mount logs for the volume? If so, could >> you attach the logs? >> Any issues with your underlying disks? Can you also attach output of >> volume profiling? >> >> On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir >> wrote: >> >>> Ok, things have gotten MUCH worse this morning. I'm getting random >>> errors from VMs, right now, about a third of my VMs have been paused due to >>> storage issues, and most of the remaining VMs are not performing well. >>> >>> At this point, I am in full EMERGENCY mode, as my production services >>> are now impacted, and I'm getting calls coming in with problems... >>> >>> I'd greatly appreciate help...VMs are running VERY slowly (when they >>> run), and they are steadily getting worse. I don't know why. I was seeing >>> CPU peaks (to 100%) on several VMs, in perfect sync, for a few minutes at a >>> time (while the VM became unresponsive and any VMs I was logged into that >>> were linux were giving me the CPU stuck messages in my original post). Is >>> all this storage related? >>> >>> I also have two different gluster volumes for VM storage, and only one >>> had the issues, but now VMs in both are being affected at the same time and >>> same way. >>> >>> --Jim >>> >>> On Mon, May 28, 2018 at 10:50 PM, Sahina Bose wrote: >>> >>>> [Adding gluster-users to look at the heal issue] >>>> >>>> On Tue, May 29, 2018 at 9:17 AM, Jim Kusznir >>>> wrote: >>>> >>>>> Hello: >>>>> >>>>> I've been having some cluster and gluster performance issues lately. >>>>> I also found that my cluster was out of date, and was trying to apply >>>>> updates (hoping to fix some of these), and discovered the ovirt 4.1 repos >>>>> were taken completely offline. So, I was forced to begin an upgrade to >>>>> 4.2.
According to docs I found/read, I needed only add the new repo, do a >>>>> yum update, reboot, and be good on my hosts (did the yum update, the >>>>> engine-setup on my hosted engine). Things seemed to work relatively well, >>>>> except for a gluster sync issue that showed up. >>>>> >>>>> My cluster is a 3 node hyperconverged cluster. I upgraded the hosted >>>>> engine first, then engine 3. When engine 3 came back up, for some reason >>>>> on
[ovirt-users] Re: Gluster problems, cluster performance issues
I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2): May 29 11:54:41 ovirt3 ovs-vsctl: ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory) May 29 11:54:51 ovirt3 ovs-vsctl: ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory) May 29 11:55:01 ovirt3 ovs-vsctl: ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory) (appears a lot). I also found on the ssh session of that, some sysv warnings about the backing disk for one of the gluster volumes (straight replica 3). The glusterfs process for that disk on that machine went offline. It's my understanding that it should continue to work with the other two machines while I attempt to replace that disk, right? Attempted writes (touching an empty file) can take 15 seconds, repeating it later will be much faster. Gluster generates a bunch of different log files; I don't know which ones you want, or from which machine(s). How do I do "volume profiling"? Thanks! On Tue, May 29, 2018 at 11:53 AM, Sahina Bose wrote: > Do you see errors reported in the mount logs for the volume? If so, could > you attach the logs? > Any issues with your underlying disks? Can you also attach output of > volume profiling? > > On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir wrote: > >> Ok, things have gotten MUCH worse this morning. I'm getting random >> errors from VMs, right now, about a third of my VMs have been paused due to >> storage issues, and most of the remaining VMs are not performing well. >> >> At this point, I am in full EMERGENCY mode, as my production services are >> now impacted, and I'm getting calls coming in with problems... >> >> I'd greatly appreciate help...VMs are running VERY slowly (when they >> run), and they are steadily getting worse. I don't know why.
I was seeing >> CPU peaks (to 100%) on several VMs, in perfect sync, for a few minutes at a >> time (while the VM became unresponsive and any VMs I was logged into that >> were linux were giving me the CPU stuck messages in my original post). Is >> all this storage related? >> >> I also have two different gluster volumes for VM storage, and only one >> had the issues, but now VMs in both are being affected at the same time and >> same way. >> >> --Jim >> >> On Mon, May 28, 2018 at 10:50 PM, Sahina Bose wrote: >> >>> [Adding gluster-users to look at the heal issue] >>> >>> On Tue, May 29, 2018 at 9:17 AM, Jim Kusznir >>> wrote: >>> >>>> Hello: >>>> >>>> I've been having some cluster and gluster performance issues lately. I >>>> also found that my cluster was out of date, and was trying to apply updates >>>> (hoping to fix some of these), and discovered the ovirt 4.1 repos were >>>> taken completely offline. So, I was forced to begin an upgrade to 4.2. >>>> According to docs I found/read, I needed only add the new repo, do a yum >>>> update, reboot, and be good on my hosts (did the yum update, the >>>> engine-setup on my hosted engine). Things seemed to work relatively well, >>>> except for a gluster sync issue that showed up. >>>> >>>> My cluster is a 3 node hyperconverged cluster. I upgraded the hosted >>>> engine first, then engine 3. When engine 3 came back up, for some reason >>>> one of my gluster volumes would not sync.
Here's sample output: >>>> >>>> [root@ovirt3 ~]# gluster volume heal data-hdd info >>>> Brick 172.172.1.11:/gluster/brick3/data-hdd >>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4 >>>> 725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9 >>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4 >>>> cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971 >>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-4 >>>> 46b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba >>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-4 >>>> 4f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625 >>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-4 >>>> 2a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b >>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4 >>>> ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32 >>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4 >>>
[ovirt-users] Re: Gluster problems, cluster performance issues
Ok, things have gotten MUCH worse this morning. I'm getting random errors from VMs; right now, about a third of my VMs have been paused due to storage issues, and most of the remaining VMs are not performing well. At this point, I am in full EMERGENCY mode, as my production services are now impacted, and I'm getting calls coming in with problems... I'd greatly appreciate help...VMs are running VERY slowly (when they run), and they are steadily getting worse. I don't know why. I was seeing CPU peaks (to 100%) on several VMs, in perfect sync, for a few minutes at a time (while the VM became unresponsive and any VMs I was logged into that were linux were giving me the CPU stuck messages in my original post). Is all this storage related? I also have two different gluster volumes for VM storage, and only one had the issues, but now VMs in both are being affected at the same time and same way. --Jim On Mon, May 28, 2018 at 10:50 PM, Sahina Bose wrote: > [Adding gluster-users to look at the heal issue] > > On Tue, May 29, 2018 at 9:17 AM, Jim Kusznir wrote: > >> Hello: >> >> I've been having some cluster and gluster performance issues lately. I >> also found that my cluster was out of date, and was trying to apply updates >> (hoping to fix some of these), and discovered the ovirt 4.1 repos were >> taken completely offline. So, I was forced to begin an upgrade to 4.2. >> According to docs I found/read, I needed only add the new repo, do a yum >> update, reboot, and be good on my hosts (did the yum update, the >> engine-setup on my hosted engine). Things seemed to work relatively well, >> except for a gluster sync issue that showed up. >> >> My cluster is a 3 node hyperconverged cluster. I upgraded the hosted >> engine first, then engine 3. When engine 3 came back up, for some reason >> one of my gluster volumes would not sync.
Here's sample output:

[root@ovirt3 ~]# gluster volume heal data-hdd info
Brick 172.172.1.11:/gluster/brick3/data-hdd
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
Status: Connected
Number of entries: 8

Brick 172.172.1.12:/gluster/brick3/data-hdd
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
Status: Connected
Number of entries: 8

Brick 172.172.1.13:/gluster/brick3/data-hdd
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff
[ovirt-users] Gluster problems, cluster performance issues
Hello: I've been having some cluster and gluster performance issues lately. I also found that my cluster was out of date, and was trying to apply updates (hoping to fix some of these), and discovered the ovirt 4.1 repos were taken completely offline. So, I was forced to begin an upgrade to 4.2. According to the docs I found/read, I needed only to add the new repo, do a yum update, and reboot to be good on my hosts (did the yum update, then engine-setup on my hosted engine). Things seemed to work relatively well, except for a gluster sync issue that showed up. My cluster is a 3 node hyperconverged cluster. I upgraded the hosted engine first, then engine 3. When engine 3 came back up, for some reason one of my gluster volumes would not sync. Here's sample output:

[root@ovirt3 ~]# gluster volume heal data-hdd info
Brick 172.172.1.11:/gluster/brick3/data-hdd
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
Status: Connected
Number of entries: 8

Brick 172.172.1.12:/gluster/brick3/data-hdd
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
Status: Connected
Number of entries: 8

Brick 172.172.1.13:/gluster/brick3/data-hdd
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
Status: Connected
Number of entries: 8

It's been in this state for a couple of days now, and bandwidth monitoring shows no appreciable data moving. I've tried repeatedly commanding a full heal from all three nodes in the cluster. It's always the same files that need healing. When running gluster volume heal data-hdd statistics, I sometimes see different information, but always some number of "heal failed" entries. It shows 0 for split brain. I'm not quite sure what to do. I suspect it may be due to nodes 1 and 2 still being on the older ovirt/gluster release, but I'm afraid to upgrade and reboot them until I have a good gluster sync (don't need to create a split brain issue). How do I proceed with this? Second issue: I've been experiencing VERY POOR performance on most of my VMs. To the tune that logging into a Windows 10 VM via remote desktop can take 5 minutes, and launching QuickBooks inside said VM can easily take 10 minutes. On some Linux VMs, I get random messages like this: Message from syslogd@unifi at May 28 20:39:23 ... kernel:[6171996.308904] NMI watchdog: BUG: soft lo
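For anyone following the thread: the heal commands being described above are, to the best of my knowledge, the standard gluster CLI ones. A sketch, using the data-hdd volume name from the output above (run as root on any node in the trusted pool; these obviously require a live gluster cluster):

```shell
# Show files pending heal, and specifically any split-brain entries
gluster volume heal data-hdd info
gluster volume heal data-hdd info split-brain

# Kick off a full heal (crawls all files, not just the changelog indices)
gluster volume heal data-hdd full

# Per-brick counters: persistent "heal failed" entries here mean the
# self-heal daemon attempted those files and gave up; re-run after a
# while to see whether the numbers move at all
gluster volume heal data-hdd statistics
gluster volume heal data-hdd statistics heal-count
```

If the same GFIDs stay stuck across repeated full heals, the glustershd log on each brick host is usually the next place to look.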
Re: [ovirt-users] Gluster: VM disk stuck in transfer; georep gone wonky
Thank you for the replies. While waiting, I found one more Google response that said to run engine-setup. I did that, and it fixed the issue. The VM is now running again. As for checking the logs, I'm not sure which ones to check... there are so many, in so many different places. I was not able to detach the disk, as "an operation is currently in process". No matter what I did to the disk, it was essentially still locked, even though it no longer said "locked" after I removed it with the unlock script. So, it appears running engine-setup can really fix a bunch of stuff! An important tip to remember... --Jim On Mon, Mar 19, 2018 at 11:55 PM, Tony Brian Albers wrote: > I read somewhere about clearing out wrong stuff from the UI by manually > editing the database, maybe you can try searching for something like that. > > With regards to the VM, I'd probably just delete it, edit the DB and > remove all sorts of references to it and then recover it from backup. > > Is there nothing about all this in the ovirt logs on the engine and the > host? It might point you in the right direction. > > HTH > > /tony > > > On 20/03/18 07:48, Jim Kusznir wrote: > > Unfortunately, I came under heavy pressure to get this vm back up. So, > > I did more googling and attempted to recover myself. I've gotten > > closer, but still not quite. > > > > I found this post: > > > > http://lists.ovirt.org/pipermail/users/2015-November/035686.html > > > > Which gave me the unlock tool, which was successful in unlocking the > > disk. Unfortunately, it did not delete the task, nor did ovirt do so on > > its own after the disk was unlocked. > > > > So I found the taskcleaner.sh in the same directory and attempted to > > clean the task out... except it doesn't seem to see the task (none of > > the show tasks options seemed to work, nor the delete all options).
I did > > still have the task uuid from the GUI, so I attempted to use that, but > > all I got back was a "t" on one line and a "0" on the next, so I have no > > idea what that was supposed to mean. In any case, the web UI still > > shows the task, still won't let me start the VM, and appears convinced > > it's still copying. I've tried restarting the engine and vdsm on the > > SPM; neither has helped. I can't find any evidence of the task on the > > command line; only in the UI. > > > > I'd create a new VM if I could rescue the image, but I'm not sure I can > > manage to get this image accepted in another VM > > > > How do I recover now? > > > > --Jim > > > > On Mon, Mar 19, 2018 at 9:38 AM, Jim Kusznir > <mailto:j...@palousetech.com>> wrote: > > > > Hi all: > > > > Sorry for yet another semi-related message to the list. In my > > attempts to troubleshoot and verify some suspicions on the nature of > > the performance problems I posted under "Major Performance Issues > > with gluster", I attempted to move one of my problem VMs back to > > the original storage (SSD-backed). It appeared to be moving fine, > > but last night froze at 84%. This morning (8hrs later), it's still > > at 84%. > > > > I need to get that VM back up and running, but I don't know how... It > > seems to be stuck in limbo. > > > > The only thing I explicitly did last night as well that may have > > caused an issue is finally set up and activated georep to an offsite > > backup machine. That too seems to have gone a bit wonky. On the > > ovirt server side, it shows normal with all but data-hdd showing a last > > sync'ed time of 3am (which matches my bandwidth graphs for the WAN > > connections involved). data-hdd (the new disk-backed storage with > > most of my data in it) shows not yet synced, but I'm also not > > currently seeing bandwidth usage anymore.
> > > > I logged into the georep destination box, and found system load a > > bit high, a bunch of gluster and rsync processes running, and both > > data and data-hdd using MORE disk space than the original (data-hdd > > using 4x more disk space than is on the master node). Not sure what > > to do about this; I paused the replication from the cluster, but > > that hasn't seemed to have an effect on the georep destination. > > > > I promise I'll stop trying things until I get guidance from the > > list! Please do help; I need the VM HDD unstuck so I can start it. > > > > Thanks! >
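For reference, the geo-replication session state described above can be inspected and paused from the gluster CLI. A sketch; BACKUPHOST and BACKUPVOL are placeholders for the offsite slave endpoint, which isn't named in the thread:

```shell
# Per-worker session state for the data-hdd volume (Active/Passive/Faulty,
# crawl status, last-synced time -- the same info the oVirt UI surfaces)
gluster volume geo-replication data-hdd BACKUPHOST::BACKUPVOL status detail

# Pause stops the gsyncd workers on the master side; note the slave may
# keep processing already-received data for a while afterwards, which can
# explain continued load on the destination box
gluster volume geo-replication data-hdd BACKUPHOST::BACKUPVOL pause
gluster volume geo-replication data-hdd BACKUPHOST::BACKUPVOL resume
```

Apparent size differences on the slave are also worth checking with `du --apparent-size`, since sparse VM images can balloon when rsync'ed without sparse handling.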
Re: [ovirt-users] Gluster: VM disk stuck in transfer; georep gone wonky
Unfortunately, I came under heavy pressure to get this vm back up. So, I did more googling and attempted to recover myself. I've gotten closer, but still not quite. I found this post: http://lists.ovirt.org/pipermail/users/2015-November/035686.html Which gave me the unlock tool, which was successful in unlocking the disk. Unfortunately, it did not delete the task, nor did ovirt do so on its own after the disk was unlocked. So I found the taskcleaner.sh in the same directory and attempted to clean the task out... except it doesn't seem to see the task (none of the show tasks options seemed to work, nor the delete all options). I did still have the task uuid from the GUI, so I attempted to use that, but all I got back was a "t" on one line and a "0" on the next, so I have no idea what that was supposed to mean. In any case, the web UI still shows the task, still won't let me start the VM, and appears convinced it's still copying. I've tried restarting the engine and vdsm on the SPM; neither has helped. I can't find any evidence of the task on the command line; only in the UI. I'd create a new VM if I could rescue the image, but I'm not sure I can manage to get this image accepted in another VM How do I recover now? --Jim On Mon, Mar 19, 2018 at 9:38 AM, Jim Kusznir wrote: > Hi all: > > Sorry for yet another semi-related message to the list. In my attempts to > troubleshoot and verify some suspicions on the nature of the performance > problems I posted under "Major Performance Issues with gluster", I > attempted to move one of my problem VMs back to the original storage > (SSD-backed). It appeared to be moving fine, but last night froze at 84%. > This morning (8hrs later), it's still at 84%. > > I need to get that VM back up and running, but I don't know how... It seems > to be stuck in limbo. > > The only thing I explicitly did last night as well that may have caused an > issue is finally set up and activated georep to an offsite backup machine.
> That too seems to have gone a bit wonky. On the ovirt server side, it > shows normal with all but data-hdd showing a last sync'ed time of 3am (which > matches my bandwidth graphs for the WAN connections involved). data-hdd > (the new disk-backed storage with most of my data in it) shows not yet > synced, but I'm also not currently seeing bandwidth usage anymore. > > I logged into the georep destination box, and found system load a bit > high, a bunch of gluster and rsync processes running, and both data and > data-hdd using MORE disk space than the original (data-hdd using 4x more > disk space than is on the master node). Not sure what to do about this; I > paused the replication from the cluster, but that hasn't seemed to have an > effect on the georep destination. > > I promise I'll stop trying things until I get guidance from the list! > Please do help; I need the VM HDD unstuck so I can start it. > > Thanks! > --Jim > > ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
[ovirt-users] Gluster: VM disk stuck in transfer; georep gone wonky
Hi all: Sorry for yet another semi-related message to the list. In my attempts to troubleshoot and verify some suspicions on the nature of the performance problems I posted under "Major Performance Issues with gluster", I attempted to move one of my problem VMs back to the original storage (SSD-backed). It appeared to be moving fine, but last night froze at 84%. This morning (8hrs later), it's still at 84%. I need to get that VM back up and running, but I don't know how... It seems to be stuck in limbo. The only thing I explicitly did last night that may have caused an issue is finally setting up and activating georep to an offsite backup machine. That too seems to have gone a bit wonky. On the ovirt server side, it shows normal with all but data-hdd showing a last sync'ed time of 3am (which matches my bandwidth graphs for the WAN connections involved). data-hdd (the new disk-backed storage with most of my data in it) shows not yet synced, but I'm also not currently seeing bandwidth usage anymore. I logged into the georep destination box, and found system load a bit high, a bunch of gluster and rsync processes running, and both data and data-hdd using MORE disk space than the original (data-hdd using 4x more disk space than is on the master node). Not sure what to do about this; I paused the replication from the cluster, but that hasn't seemed to have an effect on the georep destination. I promise I'll stop trying things until I get guidance from the list! Please do help; I need the VM HDD unstuck so I can start it. Thanks! --Jim ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
Re: [ovirt-users] Major Performance Issues with gluster
Here's gluster volume info:

[root@ovirt2 ~]# gluster volume info

Volume Name: data
Type: Replicate
Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: ovirt1.nwfiber.com:/gluster/brick2/data
Brick2: ovirt2.nwfiber.com:/gluster/brick2/data
Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter)
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
server.allow-insecure: on
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 1
cluster.shd-max-threads: 8
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on

Volume Name: data-hdd
Type: Replicate
Volume ID: d342a3ab-16f3-49f0-bbcf-f788be8ac5f1
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.172.1.11:/gluster/brick3/data-hdd
Brick2: 172.172.1.12:/gluster/brick3/data-hdd
Brick3: 172.172.1.13:/gluster/brick3/data-hdd
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
transport.address-family: inet
performance.readdir-ahead: on

Volume Name: engine
Type: Replicate
Volume ID: 87ad86b9-d88b-457e-ba21-5d3173c612de
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: ovirt1.nwfiber.com:/gluster/brick1/engine
Brick2: ovirt2.nwfiber.com:/gluster/brick1/engine
Brick3: ovirt3.nwfiber.com:/gluster/brick1/engine (arbiter)
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 1
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on

Volume Name: iso
Type: Replicate
Volume ID: b1ba15f5-0f0f-4411-89d0-595179f02b92
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: ovirt1.nwfiber.com:/gluster/brick4/iso
Brick2: ovirt2.nwfiber.com:/gluster/brick4/iso
Brick3: ovirt3.nwfiber.com:/gluster/brick4/iso (arbiter)
Options Reconfigured:
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 1
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on

When I try and turn on profiling, I get:

[root@ovirt2 ~]# gluster volume profile data-hdd start
Another transaction is in progress for data-hdd. Please try again after sometime.

I don't know what that other transaction is, but I am having some "odd behavior" this morning, like a VM disk move between data and data-hdd that stuck at 84% overnight.
I've been asking on IRC how to "un-stick" this transfer, as the VM cannot be started, and I can't seem to do anything about it. --Jim On Mon, Mar 19, 2018 at 2:14 AM, Sahina Bose wrote: > > > On Mon, Mar 19, 2018 at 7:39 AM, Jim Kusznir wrote: > >> Hello: >> >> This past week, I created a new gluster store, as I was running out of >> disk space on my main, SSD-backed storage pool. I used 2TB Seagate >> FireCuda drives (hybrid SSD/spinning). Hardware is Dell R610's with >> integral PERC/6i cards. I placed one disk per machine, exported the disk >> as a single disk volume from the raid controller, formatted it XFS, mounted >> it, and dedicated it to a new replica 3 gluster volume. >> >> Since doing so, I've been having major performance problems. One of my >> windows VMs sits at 100% disk utilization nearly continously, and its >> pa
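For readers hitting the same "Another transaction is in progress" error: profiling is toggled per volume with the standard CLI, but those commands only succeed once glusterd's cluster-wide lock clears (a stuck operation, like the hung disk move above, can hold it). A sketch:

```shell
# Enable per-brick fop accounting on the volume
gluster volume profile data-hdd start

# Let the workload run for a while, then dump latency and fop-count
# statistics per brick -- high LOOKUP/FSYNC latencies are the usual
# smoking gun for slow VM storage
gluster volume profile data-hdd info

# Turn it off when done; profiling itself adds a small overhead
gluster volume profile data-hdd stop
```

If the lock never clears, restarting glusterd on the node holding it (one node at a time, with heals complete) is the commonly suggested recovery.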
[ovirt-users] Major Performance Issues with gluster
Hello: This past week, I created a new gluster store, as I was running out of disk space on my main, SSD-backed storage pool. I used 2TB Seagate FireCuda drives (hybrid SSD/spinning). Hardware is Dell R610's with integral PERC/6i cards. I placed one disk per machine, exported the disk as a single-disk volume from the raid controller, formatted it XFS, mounted it, and dedicated it to a new replica 3 gluster volume. Since doing so, I've been having major performance problems. One of my Windows VMs sits at 100% disk utilization nearly continuously, and it's painful to do anything on it. A Zabbix install on CentOS using mysql as the backing has 70%+ iowait nearly all the time, and I can't seem to get graphs loaded from the web console. It's also always spewing errors that ultimately come down to insufficient disk performance. All of this was working OK before the changes, of which there are two: old storage was SSD backed, Replica 2 + arb, and running on the same GigE network as management and the main VM network. New storage was created using the dedicated Gluster network (running on em4 on these servers, on a completely different subnet (174.x vs 192.x)), and was created replica 3 (no arb), on the FireCuda disks (they seem to be the fastest I could afford for non-SSD, as I needed a lot more storage). My attempts to watch so far have NOT shown maxed network interfaces (using bwm-ng on the command line); in fact, the gluster interface is usually below 20% utilized. I'm not sure how to meaningfully measure the performance of the disk itself; I'm not sure what else to look at. My cluster is not very usable currently, though. IOWait on my hosts appears to be below 0.5%, usually 0.0 to 0.1. Inside the VMs is a whole different story. My cluster is currently running ovirt 4.1. I'm interested in going to 4.2, but I think I need to fix this first. Thanks! --Jim ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
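On the question of measuring the disk itself: a crude dd-based sketch follows. The TARGET path is an assumption (it defaults to /tmp so the script runs anywhere; point it at the brick mount, e.g. /gluster/brick3, for a meaningful number). dd is no substitute for fio, but the synchronous small-write run below is much closer to a VM image workload than raw sequential throughput:

```shell
#!/bin/sh
# Crude brick I/O probe -- a sketch, not a benchmark suite.
TARGET=${TARGET:-/tmp}          # point at the brick mount for real numbers
TESTFILE="$TARGET/io-probe.$$"

# Sequential write, fsync'ed at the end so the page cache can't flatter us
dd if=/dev/zero of="$TESTFILE" bs=1M count=64 conv=fsync 2>&1 | tail -n 1

# Small synchronous writes (O_DSYNC): per-op latency dominates here, and
# hybrid SSHDs can look fine sequentially yet fall apart on this pattern
dd if=/dev/zero of="$TESTFILE" bs=4k count=500 oflag=dsync 2>&1 | tail -n 1

rm -f "$TESTFILE"
```

Comparing the same probe on the raw brick versus through the gluster fuse mount separates disk latency from replication/network latency.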
[ovirt-users] gluster self-heal takes cluster offline
Hi all: I'm trying to understand why/how (and most importantly, how to fix) a substantial issue I had last night. This happened one other time, but I didn't know/understand all the parts associated with it until last night. I have a 3 node hyperconverged (self-hosted engine, Gluster on each node) cluster. Gluster is Replica 2 + arbiter. Current network configuration is 2x GigE on load balance ("LAG Group" on switch), plus one GigE from each server on a separate vlan, intended for Gluster (but not used). Server hardware is Dell R610's; each server has an SSD in it. Servers 1 and 2 have the full replica; server 3 is the arbiter. I put server 2 into maintenance so I could work on the hardware, including turning it off and such. In the course of the work, I found that I needed to reconfigure the SSD's partitioning somewhat, and it resulted in wiping the data partition (storing VM images). I figured it was no big deal; gluster would rebuild that in short order. I did take care of the extended attr settings and the like, and when I booted it up, gluster came up as expected and began rebuilding the disk. The problem is that suddenly my entire cluster got very sluggish. The engine was marking nodes and VMs failed and unfailing them throughout the system, fairly randomly. It didn't matter what node the engine or VM was on. At one point, it power cycled server 1 for being "non-responsive" (even though everything was running on it, and the gluster rebuild was working on it). As a result of this, about 6 VMs were killed and my entire gluster system went down hard (suspending all remaining VMs and the engine), as there were no remaining full copies of the data. After several minutes (these are Dell servers, after all...), server 1 came back up, gluster resumed the rebuild, and it came back online in the cluster. I had to manually (virsh command) unpause the engine, and then struggle through trying to get critical VMs back up.
Everything was super slow, and load averages on the servers were often seen in excess of 80 (these are 8 core / 16 thread boxes). Actual CPU usage (reported by top) was rarely above 40% (inclusive of all CPUs) for any one server. Glusterfs was often seen using 180%-350% of a CPU on servers 1 and 2. I ended up putting the cluster in global HA maintenance mode and disabling power fencing on the nodes until the process finished. It appeared on at least two occasions that a functional node was marked bad, and had fencing not been disabled, a node would have rebooted, further exacerbating the problem. It's clear that the gluster rebuild overloaded things and caused the problem. I don't know why the load was so high (even IOWait was low), but load averages were definitely tied to the glusterfs CPU utilization %. At no point did I have any problems pinging any machine (host or VM) unless the engine decided it was dead and killed it. Why did my system bite it so hard with the rebuild? I babied it along until the rebuild was complete, after which it returned to normal operation. As of this event, all networking (host/engine management, gluster, and VM network) was on the same vlan. I'd love to move things off, but so far any attempt to do so breaks my cluster. How can I move my management interfaces to a separate VLAN/IP space? I also want to move Gluster to its own private space, but it seems if I change anything in the peers file, the entire gluster cluster goes down. The dedicated gluster network is listed as a secondary hostname for all peers already. Will the above network reconfigurations be enough? I got the impression that the issue may not have been purely network based, but possibly server IO overload. Is this likely / right? I appreciate input. I don't think gluster's recovery is supposed to do as much damage as it did the last two or three times any healing was required. Thanks!
--Jim ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
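The "global HA maintenance mode" workaround mentioned above maps to the hosted-engine CLI. A sketch (run on any hosted-engine host):

```shell
# Stop the HA agents from fencing hosts or migrating/restarting the
# engine VM while a heavy heal is in flight
hosted-engine --set-maintenance --mode=global

# ...wait for the heal to finish, check agent/engine state...
hosted-engine --vm-status

# Then return the HA agents to normal operation
hosted-engine --set-maintenance --mode=none
```

Power management (fencing) itself is disabled per host in the engine UI under the host's Power Management settings, not from this CLI; doing both, as described above, is what keeps a loaded-but-healthy node from being rebooted mid-heal.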
Re: [ovirt-users] hyperconverged question
I can confirm that I did set it up manually, and I did specify backupvol, and in the "manage domain" storage settings, I do have under mount options, backup-volfile-servers=192.168.8.12:192.168.8.13 (and this was done at initial install time). The "used managed gluster" checkbox is NOT checked, and if I check it and save settings, next time I go in it is not checked. --Jim On Fri, Sep 1, 2017 at 2:08 PM, Charles Kozler wrote: > @ Jim - here is my setup which I will test in a few (brand new cluster) > and report back what I found in my tests > > - 3x servers direct connected via 10Gb > - 2 of those 3 setup in ovirt as hosts > - Hosted engine > - Gluster replica 3 (no arbiter) for all volumes > - 1x engine volume gluster replica 3 manually configured (not using ovirt > managed gluster) > - 1x datatest volume (20gb) replica 3 manually configured (not using ovirt > managed gluster) > - 1x nfstest domain served from some other server in my infrastructure > which, at the time of my original testing, was master domain > > I tested this earlier and all VMs stayed online. However, ovirt cluster > reported DC/cluster down, all VM's stayed up > > As I am now typing this, can you confirm you setup your gluster storage > domain with backupvol? Also, confirm you updated hosted-engine.conf with > backupvol mount option as well? > > On Fri, Sep 1, 2017 at 4:22 PM, Jim Kusznir wrote: > >> So, after reading the first document twice and the 2nd link thoroughly >> once, I believe that the arbitrator volume should be sufficient and count >> for replica / split brain. EG, if any one full replica is down, and the >> arbitrator and the other replica is up, then it should have quorum and all >> should be good. >> >> I think my underlying problem has to do more with config than the replica >> state. That said, I did size the drive on my 3rd node planning to have an >> identical copy of all data on it, so I'm still not opposed to making it a >> full replica. >> >> Did I miss something here? 
>> >> Thanks! >> >> On Fri, Sep 1, 2017 at 11:59 AM, Charles Kozler >> wrote: >> >>> These can get a little confusing but this explains it best: >>> https://gluster.readthedocs.io/en/latest/Administrator >>> %20Guide/arbiter-volumes-and-quorum/#replica-2-and-replica-3-volumes >>> >>> Basically in the first paragraph they are explaining why you cant have >>> HA with quorum for 2 nodes. Here is another overview doc that explains some >>> more >>> >>> http://openmymind.net/Does-My-Replica-Set-Need-An-Arbiter/ >>> >>> From my understanding arbiter is good for resolving split brains. Quorum >>> and arbiter are two different things though quorum is a mechanism to help >>> you **avoid** split brain and the arbiter is to help gluster resolve split >>> brain by voting and other internal mechanics (as outlined in link 1). How >>> did you create the volume exactly - what command? It looks to me like you >>> created it with 'gluster volume create replica 2 arbiter 1 {}' per your >>> earlier mention of "replica 2 arbiter 1". That being said, if you did that >>> and then setup quorum in the volume configuration, this would cause your >>> gluster to halt up since quorum was lost (as you saw until you recovered >>> node 1) >>> >>> As you can see from the docs, there is still a corner case for getting >>> in to split brain with replica 3, which again, is where arbiter would help >>> gluster resolve it >>> >>> I need to amend my previous statement: I was told that arbiter volume >>> does not store data, only metadata. I cannot find anything in the docs >>> backing this up however it would make sense for it to be. That being said, >>> in my setup, I would not include my arbiter or my third node in my ovirt VM >>> cluster component. I would keep it completely separate >>> >>> >>> On Fri, Sep 1, 2017 at 2:46 PM, Jim Kusznir wrote: >>> >>>> I'm now also confused as to what the point of an arbiter is / what it >>>> does / why one would use it. 
>>>> >>>> On Fri, Sep 1, 2017 at 11:44 AM, Jim Kusznir >>>> wrote: >>>> >>>>> Thanks for the help! >>>>> >>>>> Here's my gluster volume info for the data export/brick (I have 3: >>>>> data, engine, and iso, but they're all configured the same): >>>>> >>>>> Volume Name: data >>>>> Type: Replicate
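On the backupvol point raised earlier in this thread: the storage domain's "Mount Options" field is, as I understand it, just passed through to the gluster fuse mount. A sketch of the manual equivalent, using the addresses quoted above (the hosted-engine.conf detail is an assumption from memory of setups of this era):

```shell
# If the first server is unreachable at mount time, the client fetches
# the volfile from one of the listed backups instead; once mounted, the
# client talks to all bricks directly either way.
mount -t glusterfs \
  -o backup-volfile-servers=192.168.8.12:192.168.8.13 \
  192.168.8.11:/engine /mnt/engine
```

For the hosted engine storage itself, the same option string is what goes into the mnt_options= line of /etc/ovirt-hosted-engine/hosted-engine.conf on each host, so the engine VM can start even when the first gluster server is down.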
Re: [ovirt-users] hyperconverged question
So, after reading the first document twice and the 2nd link thoroughly once, I believe that the arbitrator volume should be sufficient and count for replica / split brain. EG, if any one full replica is down, and the arbitrator and the other replica is up, then it should have quorum and all should be good. I think my underlying problem has to do more with config than the replica state. That said, I did size the drive on my 3rd node planning to have an identical copy of all data on it, so I'm still not opposed to making it a full replica. Did I miss something here? Thanks! On Fri, Sep 1, 2017 at 11:59 AM, Charles Kozler wrote: > These can get a little confusing but this explains it best: > https://gluster.readthedocs.io/en/latest/Administrator%20Guide/arbiter- > volumes-and-quorum/#replica-2-and-replica-3-volumes > > Basically in the first paragraph they are explaining why you cant have HA > with quorum for 2 nodes. Here is another overview doc that explains some > more > > http://openmymind.net/Does-My-Replica-Set-Need-An-Arbiter/ > > From my understanding arbiter is good for resolving split brains. Quorum > and arbiter are two different things though quorum is a mechanism to help > you **avoid** split brain and the arbiter is to help gluster resolve split > brain by voting and other internal mechanics (as outlined in link 1). How > did you create the volume exactly - what command? It looks to me like you > created it with 'gluster volume create replica 2 arbiter 1 {}' per your > earlier mention of "replica 2 arbiter 1". 
That being said, if you did that > and then setup quorum in the volume configuration, this would cause your > gluster to halt up since quorum was lost (as you saw until you recovered > node 1) > > As you can see from the docs, there is still a corner case for getting in > to split brain with replica 3, which again, is where arbiter would help > gluster resolve it > > I need to amend my previous statement: I was told that arbiter volume does > not store data, only metadata. I cannot find anything in the docs backing > this up however it would make sense for it to be. That being said, in my > setup, I would not include my arbiter or my third node in my ovirt VM > cluster component. I would keep it completely separate > > > On Fri, Sep 1, 2017 at 2:46 PM, Jim Kusznir wrote: > >> I'm now also confused as to what the point of an arbiter is / what it >> does / why one would use it. >> >> On Fri, Sep 1, 2017 at 11:44 AM, Jim Kusznir wrote: >> >>> Thanks for the help! >>> >>> Here's my gluster volume info for the data export/brick (I have 3: data, >>> engine, and iso, but they're all configured the same): >>> >>> Volume Name: data >>> Type: Replicate >>> Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59 >>> Status: Started >>> Snapshot Count: 0 >>> Number of Bricks: 1 x (2 + 1) = 3 >>> Transport-type: tcp >>> Bricks: >>> Brick1: ovirt1.nwfiber.com:/gluster/brick2/data >>> Brick2: ovirt2.nwfiber.com:/gluster/brick2/data >>> Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter) >>> Options Reconfigured: >>> performance.strict-o-direct: on >>> nfs.disable: on >>> user.cifs: off >>> network.ping-timeout: 30 >>> cluster.shd-max-threads: 8 >>> cluster.shd-wait-qlength: 1 >>> cluster.locking-scheme: granular >>> cluster.data-self-heal-algorithm: full >>> performance.low-prio-threads: 32 >>> features.shard-block-size: 512MB >>> features.shard: on >>> storage.owner-gid: 36 >>> storage.owner-uid: 36 >>> cluster.server-quorum-type: server >>> cluster.quorum-type: auto >>> 
network.remote-dio: enable >>> cluster.eager-lock: enable >>> performance.stat-prefetch: off >>> performance.io-cache: off >>> performance.read-ahead: off >>> performance.quick-read: off >>> performance.readdir-ahead: on >>> server.allow-insecure: on >>> [root@ovirt1 ~]# >>> >>> >>> all 3 of my brick nodes ARE also members of the virtualization cluster >>> (including ovirt3). How can I convert it into a full replica instead of >>> just an arbiter? >>> >>> Thanks! >>> --Jim >>> >>> On Fri, Sep 1, 2017 at 9:09 AM, Charles Kozler >>> wrote: >>> >>>> @Kasturi - Looks good now. Cluster showed down for a moment but VM's >>>> stayed up in their appropriate places. Thanks! >>>> >>>> < Anyone on this list please feel free to correct my
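The quorum arithmetic being debated back and forth in this thread can be pinned down in a few lines. A minimal sketch of the client-quorum "auto" rule as described in the gluster docs linked above, in which the arbiter brick counts as a full voting brick (the even-brick-count tie-break, where the first brick must be among the survivors, is omitted for simplicity):

```shell
# Succeeds (exit 0) when strictly more than half of the bricks are up.
has_quorum() {  # usage: has_quorum <bricks_up> <bricks_total>
  [ $(( 2 * $1 )) -gt "$2" ]
}

# replica 3 arbiter 1 => 3 bricks total (2 data bricks + 1 arbiter brick)
has_quorum 3 3 && echo "all up: quorum"            # prints: all up: quorum
has_quorum 2 3 && echo "one down: still quorum"    # prints: one down: still quorum
has_quorum 1 3 || echo "two down: writes blocked"  # prints: two down: writes blocked
```

By this arithmetic, losing any single node of the three (data or arbiter) should still leave quorum, which is why the full-cluster halt described later in the thread points at configuration rather than at the arbiter design itself.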
Re: [ovirt-users] hyperconverged question
Thank you! I created my cluster following these instructions: https://www.ovirt.org/blog/2016/08/up-and-running-with-ovirt-4-0-and-gluster-storage/ (I built it about 10 months ago) I used their recipe for automated gluster node creation. Originally I thought I had 3 replicas, then I started realizing that node 3's disk usage was essentially nothing compared to node 1 and 2, and eventually on this list discovered that I had an arbiter. Currently I am running on a 1Gbps backbone, but I can dedicate a gig port (or even do bonded gig -- my servers have 4 1Gbps interfaces, and my switch is only used for this cluster, so it has the ports to hook them all up). I am planning on a 10gbps upgrade once I bring in some more cash to pay for it. Last night, node 2 and 3 were up, and I rebooted node 1 for updates. As soon as it shut down, my cluster halted (including the hosted engine), and everything went messy. When the node came back up, I still had to recover the hosted engine via command line, then could go in and start unpausing my VMs. I'm glad it happened at 8pm at night...That would have been very ugly if it happened during the day. I had thought I had enough redundancy in the cluster that I could take down any 1 node and not have an issue...That definitely is not what happened. --Jim On Fri, Sep 1, 2017 at 11:59 AM, Charles Kozler wrote: > These can get a little confusing but this explains it best: > https://gluster.readthedocs.io/en/latest/Administrator%20Guide/arbiter- > volumes-and-quorum/#replica-2-and-replica-3-volumes > > Basically in the first paragraph they are explaining why you cant have HA > with quorum for 2 nodes. Here is another overview doc that explains some > more > > http://openmymind.net/Does-My-Replica-Set-Need-An-Arbiter/ > > From my understanding arbiter is good for resolving split brains. 
Quorum > and arbiter are two different things though quorum is a mechanism to help > you **avoid** split brain and the arbiter is to help gluster resolve split > brain by voting and other internal mechanics (as outlined in link 1). How > did you create the volume exactly - what command? It looks to me like you > created it with 'gluster volume create replica 2 arbiter 1 {}' per your > earlier mention of "replica 2 arbiter 1". That being said, if you did that > and then setup quorum in the volume configuration, this would cause your > gluster to halt up since quorum was lost (as you saw until you recovered > node 1) > > As you can see from the docs, there is still a corner case for getting in > to split brain with replica 3, which again, is where arbiter would help > gluster resolve it > > I need to amend my previous statement: I was told that arbiter volume does > not store data, only metadata. I cannot find anything in the docs backing > this up however it would make sense for it to be. That being said, in my > setup, I would not include my arbiter or my third node in my ovirt VM > cluster component. I would keep it completely separate > > > On Fri, Sep 1, 2017 at 2:46 PM, Jim Kusznir wrote: > >> I'm now also confused as to what the point of an arbiter is / what it >> does / why one would use it. >> >> On Fri, Sep 1, 2017 at 11:44 AM, Jim Kusznir wrote: >> >>> Thanks for the help! 
>>> >>> Here's my gluster volume info for the data export/brick (I have 3: data, >>> engine, and iso, but they're all configured the same): >>> >>> Volume Name: data >>> Type: Replicate >>> Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59 >>> Status: Started >>> Snapshot Count: 0 >>> Number of Bricks: 1 x (2 + 1) = 3 >>> Transport-type: tcp >>> Bricks: >>> Brick1: ovirt1.nwfiber.com:/gluster/brick2/data >>> Brick2: ovirt2.nwfiber.com:/gluster/brick2/data >>> Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter) >>> Options Reconfigured: >>> performance.strict-o-direct: on >>> nfs.disable: on >>> user.cifs: off >>> network.ping-timeout: 30 >>> cluster.shd-max-threads: 8 >>> cluster.shd-wait-qlength: 1 >>> cluster.locking-scheme: granular >>> cluster.data-self-heal-algorithm: full >>> performance.low-prio-threads: 32 >>> features.shard-block-size: 512MB >>> features.shard: on >>> storage.owner-gid: 36 >>> storage.owner-uid: 36 >>> cluster.server-quorum-type: server >>> cluster.quorum-type: auto >>> network.remote-dio: enable >>> cluster.eager-lock: enable >>> performance.stat-prefetch: off >>> performance.io-cache: off >>> performance.read-ahead: off >>> performance.quick-read: off >>
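For the recurring question of converting the arbiter brick into a full replica: a hedged sketch of one approach, assuming the volume name and brick paths shown in the volume info above and that the replacement brick directory is empty. This is not an endorsed procedure from this thread — verify against the gluster docs and confirm heals complete before relying on the third copy:

```shell
# 1. Drop the arbiter brick, reducing the volume to a plain replica 2
gluster volume remove-brick data replica 2 \
    ovirt3.nwfiber.com:/gluster/brick2/data force

# 2. Re-add ovirt3 with a fresh, empty brick directory as a full data brick
#    (path is hypothetical; it must not be the old arbiter directory)
gluster volume add-brick data replica 3 \
    ovirt3.nwfiber.com:/gluster/brick2/data-full

# 3. Trigger self-heal to copy data onto the new brick, then watch progress
gluster volume heal data full
gluster volume heal data info
```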
Re: [ovirt-users] hyperconverged question
I'm now also confused as to what the point of an arbiter is / what it does / why one would use it. On Fri, Sep 1, 2017 at 11:44 AM, Jim Kusznir wrote: > Thanks for the help! > > Here's my gluster volume info for the data export/brick (I have 3: data, > engine, and iso, but they're all configured the same): > > Volume Name: data > Type: Replicate > Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59 > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x (2 + 1) = 3 > Transport-type: tcp > Bricks: > Brick1: ovirt1.nwfiber.com:/gluster/brick2/data > Brick2: ovirt2.nwfiber.com:/gluster/brick2/data > Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter) > Options Reconfigured: > performance.strict-o-direct: on > nfs.disable: on > user.cifs: off > network.ping-timeout: 30 > cluster.shd-max-threads: 8 > cluster.shd-wait-qlength: 1 > cluster.locking-scheme: granular > cluster.data-self-heal-algorithm: full > performance.low-prio-threads: 32 > features.shard-block-size: 512MB > features.shard: on > storage.owner-gid: 36 > storage.owner-uid: 36 > cluster.server-quorum-type: server > cluster.quorum-type: auto > network.remote-dio: enable > cluster.eager-lock: enable > performance.stat-prefetch: off > performance.io-cache: off > performance.read-ahead: off > performance.quick-read: off > performance.readdir-ahead: on > server.allow-insecure: on > [root@ovirt1 ~]# > > > all 3 of my brick nodes ARE also members of the virtualization cluster > (including ovirt3). How can I convert it into a full replica instead of > just an arbiter? > > Thanks! > --Jim > > On Fri, Sep 1, 2017 at 9:09 AM, Charles Kozler > wrote: > >> @Kasturi - Looks good now. Cluster showed down for a moment but VM's >> stayed up in their appropriate places. Thanks! >> >> < Anyone on this list please feel free to correct my response to Jim if >> its wrong> >> >> @ Jim - If you can share your gluster volume info / status I can confirm >> (to the best of my knowledge). 
From my understanding, If you setup the >> volume with something like 'gluster volume set group virt' this will >> configure some quorum options as well, Ex: http://i.imgur.com/Mya4N5o.png >> >> While, yes, you are configured for arbiter node you're still losing >> quorum by dropping from 2 -> 1. You would need 4 node with 1 being arbiter >> to configure quorum which is in effect 3 writable nodes and 1 arbiter. If >> one gluster node drops, you still have 2 up. Although in this case, you >> probably wouldnt need arbiter at all >> >> If you are configured, you can drop quorum settings and just let arbiter >> run since you're not using arbiter node in your VM cluster part (I >> believe), just storage cluster part. When using quorum, you need > 50% of >> the cluster being up at one time. Since you have 3 nodes with 1 arbiter, >> you're actually losing 1/2 which == 50 which == degraded / hindered gluster >> >> Again, this is to the best of my knowledge based on other quorum backed >> softwareand this is what I understand from testing with gluster and >> ovirt thus far >> >> On Fri, Sep 1, 2017 at 11:53 AM, Jim Kusznir wrote: >> >>> Huh...Ok., how do I convert the arbitrar to full replica, then? I was >>> misinformed when I created this setup. I thought the arbitrator held >>> enough metadata that it could validate or refudiate any one replica (kinda >>> like the parity drive for a RAID-4 array). I was also under the impression >>> that one replica + Arbitrator is enough to keep the array online and >>> functional. >>> >>> --Jim >>> >>> On Fri, Sep 1, 2017 at 5:22 AM, Charles Kozler >>> wrote: >>> >>>> @ Jim - you have only two data volumes and lost quorum. Arbitrator only >>>> stores metadata, no actual files. So yes, you were running in degraded mode >>>> so some operations were hindered. >>>> >>>> @ Sahina - Yes, this actually worked fine for me once I did that. 
>>>> However, the issue I am still facing, is when I go to create a new gluster >>>> storage domain (replica 3, hyperconverged) and I tell it "Host to use" and >>>> I select that host. If I fail that host, all VMs halt. I do not recall this >>>> in 3.6 or early 4.0. This to me makes it seem like this is "pinning" a node >>>> to a volume and vice versa like you could, for instance, for a singular >>>> hyperconverged to ex: export a local disk via NFS and then m
Re: [ovirt-users] hyperconverged question
Thanks for the help! Here's my gluster volume info for the data export/brick (I have 3: data, engine, and iso, but they're all configured the same): Volume Name: data Type: Replicate Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59 Status: Started Snapshot Count: 0 Number of Bricks: 1 x (2 + 1) = 3 Transport-type: tcp Bricks: Brick1: ovirt1.nwfiber.com:/gluster/brick2/data Brick2: ovirt2.nwfiber.com:/gluster/brick2/data Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter) Options Reconfigured: performance.strict-o-direct: on nfs.disable: on user.cifs: off network.ping-timeout: 30 cluster.shd-max-threads: 8 cluster.shd-wait-qlength: 1 cluster.locking-scheme: granular cluster.data-self-heal-algorithm: full performance.low-prio-threads: 32 features.shard-block-size: 512MB features.shard: on storage.owner-gid: 36 storage.owner-uid: 36 cluster.server-quorum-type: server cluster.quorum-type: auto network.remote-dio: enable cluster.eager-lock: enable performance.stat-prefetch: off performance.io-cache: off performance.read-ahead: off performance.quick-read: off performance.readdir-ahead: on server.allow-insecure: on [root@ovirt1 ~]# all 3 of my brick nodes ARE also members of the virtualization cluster (including ovirt3). How can I convert it into a full replica instead of just an arbiter? Thanks! --Jim On Fri, Sep 1, 2017 at 9:09 AM, Charles Kozler wrote: > @Kasturi - Looks good now. Cluster showed down for a moment but VM's > stayed up in their appropriate places. Thanks! > > < Anyone on this list please feel free to correct my response to Jim if > its wrong> > > @ Jim - If you can share your gluster volume info / status I can confirm > (to the best of my knowledge). From my understanding, If you setup the > volume with something like 'gluster volume set group virt' this will > configure some quorum options as well, Ex: http://i.imgur.com/Mya4N5o.png > > While, yes, you are configured for arbiter node you're still losing quorum > by dropping from 2 -> 1. 
You would need 4 node with 1 being arbiter to > configure quorum which is in effect 3 writable nodes and 1 arbiter. If one > gluster node drops, you still have 2 up. Although in this case, you > probably wouldnt need arbiter at all > > If you are configured, you can drop quorum settings and just let arbiter > run since you're not using arbiter node in your VM cluster part (I > believe), just storage cluster part. When using quorum, you need > 50% of > the cluster being up at one time. Since you have 3 nodes with 1 arbiter, > you're actually losing 1/2 which == 50 which == degraded / hindered gluster > > Again, this is to the best of my knowledge based on other quorum backed > software....and this is what I understand from testing with gluster and > ovirt thus far > > On Fri, Sep 1, 2017 at 11:53 AM, Jim Kusznir wrote: > >> Huh...Ok., how do I convert the arbitrar to full replica, then? I was >> misinformed when I created this setup. I thought the arbitrator held >> enough metadata that it could validate or refudiate any one replica (kinda >> like the parity drive for a RAID-4 array). I was also under the impression >> that one replica + Arbitrator is enough to keep the array online and >> functional. >> >> --Jim >> >> On Fri, Sep 1, 2017 at 5:22 AM, Charles Kozler >> wrote: >> >>> @ Jim - you have only two data volumes and lost quorum. Arbitrator only >>> stores metadata, no actual files. So yes, you were running in degraded mode >>> so some operations were hindered. >>> >>> @ Sahina - Yes, this actually worked fine for me once I did that. >>> However, the issue I am still facing, is when I go to create a new gluster >>> storage domain (replica 3, hyperconverged) and I tell it "Host to use" and >>> I select that host. If I fail that host, all VMs halt. I do not recall this >>> in 3.6 or early 4.0. 
This to me makes it seem like this is "pinning" a node >>> to a volume and vice versa like you could, for instance, for a singular >>> hyperconverged to ex: export a local disk via NFS and then mount it via >>> ovirt domain. But of course, this has its caveats. To that end, I am using >>> gluster replica 3, when configuring it I say "host to use: " node 1, then >>> in the connection details I give it node1:/data. I fail node1, all VMs >>> halt. Did I miss something? >>> >>> On Fri, Sep 1, 2017 at 2:13 AM, Sahina Bose wrote: >>> >>>> To the OP question, when you set up a gluster storage domain, you need >>>> to specify backup-volfile-servers=: where server2 >>>> and server3 also have bricks running. When ser
Re: [ovirt-users] hyperconverged question
Speaking of "use managed gluster": I created this gluster setup under oVirt 4.0, when that option wasn't there. I've gone into my settings, checked the box, and saved it at least twice, but when I go back into the storage settings it's not checked again. The "about" box in the GUI reports that I'm using this version: oVirt Engine Version: 4.1.1.8-1.el7.centos. I thought I was staying up to date, but I'm not sure if I'm doing everything right on the upgrade. The documentation says to click for hosted-engine upgrade instructions, but that link has gone to a page-not-found error for several versions now, and I haven't found those instructions elsewhere, so I've been "winging it". --Jim On Fri, Sep 1, 2017 at 8:53 AM, Jim Kusznir wrote: > Huh...Ok., how do I convert the arbitrar to full replica, then? I was > misinformed when I created this setup. I thought the arbitrator held > enough metadata that it could validate or refudiate any one replica (kinda > like the parity drive for a RAID-4 array). I was also under the impression > that one replica + Arbitrator is enough to keep the array online and > functional. > > --Jim > > On Fri, Sep 1, 2017 at 5:22 AM, Charles Kozler > wrote: > >> @ Jim - you have only two data volumes and lost quorum. Arbitrator only >> stores metadata, no actual files. So yes, you were running in degraded mode >> so some operations were hindered. >> >> @ Sahina - Yes, this actually worked fine for me once I did that. >> However, the issue I am still facing, is when I go to create a new gluster >> storage domain (replica 3, hyperconverged) and I tell it "Host to use" and >> I select that host. If I fail that host, all VMs halt. I do not recall this >> in 3.6 or early 4.0. This to me makes it seem like this is "pinning" a node >> to a volume and vice versa like you could, for instance, for a singular >> hyperconverged to ex: export a local disk via NFS and then mount it via >> ovirt domain. But of course, this has its caveats. 
To that end, I am using >> gluster replica 3, when configuring it I say "host to use: " node 1, then >> in the connection details I give it node1:/data. I fail node1, all VMs >> halt. Did I miss something? >> >> On Fri, Sep 1, 2017 at 2:13 AM, Sahina Bose wrote: >> >>> To the OP question, when you set up a gluster storage domain, you need >>> to specify backup-volfile-servers=: where server2 and >>> server3 also have bricks running. When server1 is down, and the volume is >>> mounted again - server2 or server3 are queried to get the gluster volfiles. >>> >>> @Jim, if this does not work, are you using 4.1.5 build with libgfapi >>> access? If not, please provide the vdsm and gluster mount logs to analyse >>> >>> If VMs go to paused state - this could mean the storage is not >>> available. You can check "gluster volume status " to see if >>> atleast 2 bricks are running. >>> >>> On Fri, Sep 1, 2017 at 11:31 AM, Johan Bernhardsson >>> wrote: >>> >>>> If gluster drops in quorum so that it has less votes than it should it >>>> will stop file operations until quorum is back to normal.If i rember it >>>> right you need two bricks to write for quorum to be met and that the >>>> arbiter only is a vote to avoid split brain. >>>> >>>> >>>> Basically what you have is a raid5 solution without a spare. And when >>>> one disk dies it will run in degraded mode. And some raid systems will stop >>>> the raid until you have removed the disk or forced it to run anyway. >>>> >>>> You can read up on it here: https://gluster.readthed >>>> ocs.io/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/ >>>> >>>> /Johan >>>> >>>> On Thu, 2017-08-31 at 22:33 -0700, Jim Kusznir wrote: >>>> >>>> Hi all: >>>> >>>> Sorry to hijack the thread, but I was about to start essentially the >>>> same thread. >>>> >>>> I have a 3 node cluster, all three are hosts and gluster nodes (replica >>>> 2 + arbitrar). 
I DO have the mnt_options=backup-volfile-servers= set: >>>> >>>> storage=192.168.8.11:/engine >>>> mnt_options=backup-volfile-servers=192.168.8.12:192.168.8.13 >>>> >>>> I had an issue today where 192.168.8.11 went down. ALL VMs immediately >>>> paused, including the engine (all VMs were running on host2:192.168.8.12). >>>> I couldn't get any gluste
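The mount-time failover discussed above depends on the fuse client actually receiving the backup servers. A minimal sketch, assuming the addresses quoted above, of what the same option looks like on a manual mount and one way to confirm the running client received it:

```shell
# Mount the engine volume with explicit failover volfile servers
mount -t glusterfs \
    -o backup-volfile-servers=192.168.8.12:192.168.8.13 \
    192.168.8.11:/engine /mnt/engine

# The mount helper passes each backup as an extra --volfile-server flag;
# verify on the running glusterfs client process:
ps ax | grep '[g]lusterfs' | grep -o 'volfile-server=[^ ]*'
```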
Re: [ovirt-users] hyperconverged question
Huh... OK, how do I convert the arbiter to a full replica, then? I was misinformed when I created this setup. I thought the arbiter held enough metadata that it could validate or repudiate any one replica (kind of like the parity drive in a RAID-4 array). I was also under the impression that one replica + arbiter is enough to keep the array online and functional. --Jim On Fri, Sep 1, 2017 at 5:22 AM, Charles Kozler wrote: > @ Jim - you have only two data volumes and lost quorum. Arbitrator only > stores metadata, no actual files. So yes, you were running in degraded mode > so some operations were hindered. > > @ Sahina - Yes, this actually worked fine for me once I did that. However, > the issue I am still facing, is when I go to create a new gluster storage > domain (replica 3, hyperconverged) and I tell it "Host to use" and I select > that host. If I fail that host, all VMs halt. I do not recall this in 3.6 > or early 4.0. This to me makes it seem like this is "pinning" a node to a > volume and vice versa like you could, for instance, for a singular > hyperconverged to ex: export a local disk via NFS and then mount it via > ovirt domain. But of course, this has its caveats. To that end, I am using > gluster replica 3, when configuring it I say "host to use: " node 1, then > in the connection details I give it node1:/data. I fail node1, all VMs > halt. Did I miss something? > > On Fri, Sep 1, 2017 at 2:13 AM, Sahina Bose wrote: > >> To the OP question, when you set up a gluster storage domain, you need to >> specify backup-volfile-servers=: where server2 and >> server3 also have bricks running. When server1 is down, and the volume is >> mounted again - server2 or server3 are queried to get the gluster volfiles. >> >> @Jim, if this does not work, are you using 4.1.5 build with libgfapi >> access? If not, please provide the vdsm and gluster mount logs to analyse >> >> If VMs go to paused state - this could mean the storage is not available. 
>> You can check "gluster volume status " to see if atleast 2 bricks >> are running. >> >> On Fri, Sep 1, 2017 at 11:31 AM, Johan Bernhardsson >> wrote: >> >>> If gluster drops in quorum so that it has less votes than it should it >>> will stop file operations until quorum is back to normal.If i rember it >>> right you need two bricks to write for quorum to be met and that the >>> arbiter only is a vote to avoid split brain. >>> >>> >>> Basically what you have is a raid5 solution without a spare. And when >>> one disk dies it will run in degraded mode. And some raid systems will stop >>> the raid until you have removed the disk or forced it to run anyway. >>> >>> You can read up on it here: https://gluster.readthed >>> ocs.io/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/ >>> >>> /Johan >>> >>> On Thu, 2017-08-31 at 22:33 -0700, Jim Kusznir wrote: >>> >>> Hi all: >>> >>> Sorry to hijack the thread, but I was about to start essentially the >>> same thread. >>> >>> I have a 3 node cluster, all three are hosts and gluster nodes (replica >>> 2 + arbitrar). I DO have the mnt_options=backup-volfile-servers= set: >>> >>> storage=192.168.8.11:/engine >>> mnt_options=backup-volfile-servers=192.168.8.12:192.168.8.13 >>> >>> I had an issue today where 192.168.8.11 went down. ALL VMs immediately >>> paused, including the engine (all VMs were running on host2:192.168.8.12). >>> I couldn't get any gluster stuff working until host1 (192.168.8.11) was >>> restored. >>> >>> What's wrong / what did I miss? >>> >>> (this was set up "manually" through the article on setting up >>> self-hosted gluster cluster back when 4.0 was new..I've upgraded it to 4.1 >>> since). >>> >>> Thanks! >>> --Jim >>> >>> >>> On Thu, Aug 31, 2017 at 12:31 PM, Charles Kozler >>> wrote: >>> >>> Typo..."Set it up and then failed that **HOST**" >>> >>> And upon that host going down, the storage domain went down. 
I only have >>> hosted storage domain and this new one - is this why the DC went down and >>> no SPM could be elected? >>> >>> I dont recall this working this way in early 4.0 or 3.6 >>> >>> On Thu, Aug 31, 2017 at 3:30 PM, Charles Kozler >>> wrote: >>> >>> So I've tested this today and I failed a node. Specifically, I setup a >>>
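When VMs pause as described above, the checks suggested earlier in the thread boil down to a few commands. A sketch, assuming the "data" volume name used throughout this thread:

```shell
# Which bricks and self-heal daemons are actually online right now?
gluster volume status data

# Which quorum settings are in force for this volume?
gluster volume get data cluster.quorum-type
gluster volume get data cluster.server-quorum-type

# Pending heals indicate a brick fell behind while another was down
gluster volume heal data info
```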
Re: [ovirt-users] Storage slowly expanding
Thank you! I created all the VMs using the sparse allocation method. I wanted a method that would create disks that did not immediately occupy their full declared size (e.g., allow overcommit of disk space, as most VM hard drives are 30-50% empty for their entire life). I had figured that it would not free space on the underlying storage when a file is deleted within the disk. What confuses me is a disk that is only 30GB to the OS is using 53GB of space on gluster. In my understanding, the actual on-disk usage should be limited to 30GB max if I don't take snapshots. (I do like having the ability to take snapshots, and I do use them from time to time, but I usually don't keep a snapshot for an extended time -- just long enough to verify whatever operation I did was successful.) I did find the "sparsify" command within oVirt and ran that; it reclaimed some space (the above example of the 30GB disk, which is actually using 20GB inside the VM but was using 53GB on gluster, shrunk to 50GB on gluster). But there's still at least 20GB unaccounted for there. I would love it if there was something I could do to reclaim the space inside the disk that isn't in use too (e.g., get that disk down to just the ~21GB that the VM is actually using). If I change to virtio-SCSI (it's currently just "virtio"), will that enable DISCARD support, and is gluster a supported underlying storage for it? Thanks! --Jim On Fri, Sep 1, 2017 at 5:45 AM, Yaniv Kaul wrote: > > > On Fri, Sep 1, 2017 at 8:41 AM, Jim Kusznir wrote: > >> Hi all: >> >> I have several VMs, all thin provisioned, on my small storage >> (self-hosted gluster / hyperconverged cluster). I'm now noticing that some >> of my VMs (espicially my only Windows VM) are using even MORE disk space >> than the blank it was allocated. >> >> Example: windows VM: virtual size created at creation: 30GB (thin >> provisioned). Actual disk space in use: 19GB. According to the storage -> >> Disks tab, its currently using 39GB. How do I get that down? 
>> >> I have two other VMs that are somewhat heavy DB load (Zabbix and Unifi); >> both of those are also larger than their created max size despite disk in >> machine not being fully utilized. >> >> None of these have snapshots. >> > > How come you have qcow2 and not raw-sparse, if you are not using > snapshots? is it a VM from a template? > > Generally, this is how thin provisioning works. The underlying qcow2 > doesn't know when you delete a file from within the guest - as file > deletion is merely marking entries in the file system tables as free, not > really doing any deletion IO. > You could run virt-sparsify on the disks to sparsify them, which will, if > the underlying storage supports it, reclaim storage space. > You could use IDE or virtio-SCSI and enable DISCARD support, which will, > if the underlying storage supports it, reclaim storage space. > > Those are not exclusive, btw. > Y. > > >> How do I fix this? >> >> Thanks! >> --Jim >> >> ___ >> Users mailing list >> Users@ovirt.org >> http://lists.ovirt.org/mailman/listinfo/users >> >> > ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
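Yaniv's two suggestions above can be made concrete. A hedged sketch, assuming a hypothetical image path and a powered-off VM (virt-sparsify must never be run against a disk that is in use):

```shell
# Reclaim free guest space from outside the guest (VM must be shut down).
# --in-place modifies the image directly instead of copying to scratch space.
virt-sparsify --in-place /path/to/disk/image

# Alternatively, inside a Linux guest whose disk is attached via
# virtio-SCSI with DISCARD enabled, release unused blocks on demand:
fstrim -av
```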
[ovirt-users] Storage slowly expanding
Hi all: I have several VMs, all thin provisioned, on my small storage (self-hosted gluster / hyperconverged cluster). I'm now noticing that some of my VMs (especially my only Windows VM) are using even MORE disk space than they were allocated. Example: Windows VM: virtual size at creation: 30GB (thin provisioned). Actual disk space in use inside the guest: 19GB. According to the Storage -> Disks tab, it's currently using 39GB. How do I get that down? I have two other VMs under somewhat heavy DB load (Zabbix and Unifi); both of those are also larger than their created max size despite the disk inside the machine not being fully utilized. None of these have snapshots. How do I fix this? Thanks! --Jim
Re: [ovirt-users] hyperconverged question
Hi all: Sorry to hijack the thread, but I was about to start essentially the same one. I have a 3-node cluster; all three are hosts and gluster nodes (replica 2 + arbiter). I DO have the mnt_options=backup-volfile-servers= set: storage=192.168.8.11:/engine mnt_options=backup-volfile-servers=192.168.8.12:192.168.8.13 I had an issue today where 192.168.8.11 went down. ALL VMs immediately paused, including the engine (all VMs were running on host2: 192.168.8.12). I couldn't get any gluster operations working until host1 (192.168.8.11) was restored. What's wrong / what did I miss? (This was set up "manually" through the article on setting up a self-hosted gluster cluster back when 4.0 was new. I've upgraded it to 4.1 since.) Thanks! --Jim On Thu, Aug 31, 2017 at 12:31 PM, Charles Kozler wrote: > Typo..."Set it up and then failed that **HOST**" > > And upon that host going down, the storage domain went down. I only have > hosted storage domain and this new one - is this why the DC went down and > no SPM could be elected? > > I dont recall this working this way in early 4.0 or 3.6 > > On Thu, Aug 31, 2017 at 3:30 PM, Charles Kozler > wrote: > >> So I've tested this today and I failed a node. Specifically, I setup a >> glusterfs domain and selected "host to use: node1". Set it up and then >> failed that VM >> >> However, this did not work and the datacenter went down. My engine stayed >> up, however, it seems configuring a domain to pin to a host to use will >> obviously cause it to fail >> >> This seems counter-intuitive to the point of glusterfs or any redundant >> storage. If a single host has to be tied to its function, this introduces a >> single point of failure >> >> Am I missing something obvious? >> >> On Thu, Aug 31, 2017 at 9:43 AM, Kasturi Narra wrote: >> >>> yes, right. What you can do is edit the hosted-engine.conf file and >>> there is a parameter as shown below [1] and replace h2 and h3 with your >>> second and third storage servers. 
Then you will need to restart >>> the ovirt-ha-agent and ovirt-ha-broker services on all the nodes. >>> >>> [1] 'mnt_options=backup-volfile-servers=:' >>> >>> On Thu, Aug 31, 2017 at 5:54 PM, Charles Kozler >>> wrote: >>> Hi Kasturi - Thanks for feedback > If the cockpit+gdeploy plugin had been used then it would have automatically detected the glusterfs replica 3 volume created during Hosted Engine deployment and this question would not have been asked Actually, hosted-engine --deploy also auto-detects glusterfs. I know the glusterfs fuse client has the ability to fail over between all nodes in the cluster, but I am still curious given the fact that I see in the ovirt config node1:/engine (node1 being what I set it to in hosted-engine --deploy). So my concern was to ensure and find out exactly how the engine works when one node goes away and the fuse client moves over to the other node in the gluster cluster. But you did somewhat answer my question; the answer seems to be no (as default) and I will have to use hosted-engine.conf and change the parameter as you list. So I need to do something manual to create HA for the engine on gluster? Yes? Thanks so much! On Thu, Aug 31, 2017 at 3:03 AM, Kasturi Narra wrote: > Hi, > >During Hosted Engine setup the question about the glusterfs volume is being > asked because you have set up the volumes yourself. If the cockpit+gdeploy > plugin had been used then it would have automatically detected the > glusterfs replica 3 volume created during Hosted Engine deployment and > this > question would not have been asked. > >During new storage domain creation when glusterfs is selected there > is a feature called 'use managed gluster volumes' and upon checking this > all managed glusterfs volumes will be listed and you could choose the > volume of your choice from the dropdown list. 
> > There is a conf file called /etc/hosted-engine/hosted-engine.conf > where there is a parameter called backup-volfile-servers="h1:h2" and if > one > of the gluster nodes goes down the engine uses this parameter to provide ha / > failover. > > Hope this helps !! > > Thanks > kasturi > > > > On Wed, Aug 30, 2017 at 8:09 PM, Charles Kozler > wrote: > >> Hello - >> >> I have successfully created a hyperconverged hosted engine setup >> consisting of 3 nodes - 2 for VMs and the third purely for storage. I >> manually configured it all, did not use ovirt node or anything. Built the >> gluster volumes myself >> >> However, I noticed that when setting up the hosted engine and even >> when adding a new storage domain with glusterfs type, it still asks for >> hostname:/volumename >> >> This leads me to believe that if that one node goes down (ex: >> n
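The failover setup discussed in this thread can be sketched as follows. The option names and IPs are taken from the messages above; the conf path can differ by oVirt version (recent releases use /etc/ovirt-hosted-engine/hosted-engine.conf), so verify against your install. The restart command is printed here rather than executed:

```shell
# Sketch of the hosted-engine storage failover setup from this thread.
# hosted-engine.conf should name the storage server plus both backups:
#
#   storage=192.168.8.11:/engine
#   mnt_options=backup-volfile-servers=192.168.8.12:192.168.8.13
#
# After editing the file, the HA services must be restarted on every host.
# Dry run: print the restart command instead of running it.
RESTART="systemctl restart ovirt-ha-broker ovirt-ha-agent"
echo "$RESTART"
```

Without the mnt_options line, the engine mount is tied to the single server named in storage=, which matches the behavior Jim saw when 192.168.8.11 went down.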
Re: [ovirt-users] Recovering from a multi-node failure
The heal info command shows perfect consistency between nodes; that's what confused me. At the moment, the physical partitions (LVM partitions) that gluster is using are different sizes, but I expected to see the "least common denominator" for the total size, and I expected to see it consistent across the cluster. As this issue was from a couple weeks ago, I don't know what logs to give you anymore. Since the original issue, the entire cluster has been rebooted (not all nodes down at the same time, but every node has been rebooted). Now things look a bit different:

[root@ovirt1 ~]# df -h
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/centos_ovirt-root   20G  5.1G   15G  26% /
devtmpfs                        16G     0   16G   0% /dev
tmpfs                           16G     0   16G   0% /dev/shm
tmpfs                           16G   34M   16G   1% /run
tmpfs                           16G     0   16G   0% /sys/fs/cgroup
/dev/mapper/gluster-iso         25G  7.3G   18G  29% /gluster/brick4
/dev/sda1                      497M  315M  183M  64% /boot
/dev/mapper/gluster-engine      25G   13G   13G  49% /gluster/brick1
/dev/mapper/gluster-data       136G  126G   11G  93% /gluster/brick2
192.168.8.11:/engine            15G   10G  5.1G  67% /rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
192.168.8.11:/data             136G  126G   11G  93% /rhev/data-center/mnt/glusterSD/192.168.8.11:_data
192.168.8.11:/iso               13G  7.3G  5.8G  56% /rhev/data-center/mnt/glusterSD/192.168.8.11:_iso
tmpfs                          3.2G     0  3.2G   0% /run/user/0

[root@ovirt2 ~]# df -h
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/centos_ovirt-root  8.0G  3.1G  5.0G  39% /
devtmpfs                        16G     0   16G   0% /dev
tmpfs                           16G   16K   16G   1% /dev/shm
tmpfs                           16G   90M   16G   1% /run
tmpfs                           16G     0   16G   0% /sys/fs/cgroup
/dev/mapper/gluster-engine      15G   10G  5.1G  67% /gluster/brick1
/dev/sda1                      497M  307M  191M  62% /boot
/dev/mapper/gluster-iso         13G  7.3G  5.8G  56% /gluster/brick4
/dev/mapper/gluster-data       174G  121G   54G  70% /gluster/brick2
192.168.8.11:/engine            15G   10G  5.1G  67% /rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
192.168.8.11:/data             136G  126G   11G  93% /rhev/data-center/mnt/glusterSD/192.168.8.11:_data
192.168.8.11:/iso               13G  7.3G  5.8G  56% /rhev/data-center/mnt/glusterSD/192.168.8.11:_iso
tmpfs                          3.2G     0  3.2G   0% /run/user/0

The thing that still bothers me is that for engine (brick1), ovirt1's physical disk space used is still higher than ovirt2's, but the smaller number is reported on the gluster fs. For data (brick2), ovirt1 and ovirt2 physical usage are still different, but the larger number is reported by glusterfs. The main question is still: is there cause for concern with the fact that physical usage for the bricks is not consistent between replicas that heal info shows as completely healed? (Again, I was so concerned that on ovirt2 I re-deleted everything and let gluster re-heal the volume, and it came to the exact same amount of (less) disk usage and claimed to be fully healed.) --Jim On Wed, Aug 16, 2017 at 5:22 AM, Sahina Bose wrote: > > > On Sun, Aug 6, 2017 at 4:42 AM, Jim Kusznir wrote: > >> Well, after a very stressful weekend, I think I have things largely >> working. Turns out that most of the above issues were caused by the Linux >> permissions of the exports for all three volumes (they had been reset to >> 600; setting them to 774 or 770 fixed many of the issues). Of course, I >> didn't find that until a much more harrowing outage, and hours and hours of >> work, including beginning to look at rebuilding my cluster >> >> So, now my cluster is operating again, and everything looks good EXCEPT >> for one major Gluster issue/question that I haven't found any references or >> info on. >> >> My host ovirt2, one of the replica gluster servers, is the one that lost >> its storage and had to reinitialize it from the cluster. The iso volume is >> perfectly fine and complete, but the engine and data volumes are smaller on >> disk on this node than on the other node (and this node before the crash). 
>> On the engine store, the entire cluster reports the smaller utilization on >> mounted gluster filesystems; on the data partition, it reports the larger >> size (rest of cluster). Here's some df statements to help clarify: >> >> (brick1 = engine; brick2=data, brick4=iso): >> Filesystem Size Used Avail Use% Mounted on >> /dev/mapper/gluster-engine 25G 1
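To put a number on the brick-usage drift discussed above, the used figures from each node's df output can be compared directly. A minimal sketch; the sample values are the engine-brick rows quoted in this thread, and awk is only used here for the floating-point arithmetic:

```shell
# Compare used space on the same brick as reported by two replica nodes.
# Figures (in GiB) are the /gluster/brick1 rows from the df listings above.
ovirt1_used=13   # GiB used on ovirt1:/gluster/brick1
ovirt2_used=10   # GiB used on ovirt2:/gluster/brick1
drift=$(awk -v a="$ovirt1_used" -v b="$ovirt2_used" \
        'BEGIN { d = a - b; if (d < 0) d = -d; printf "%.1f", d }')
echo "engine brick drift: ${drift} GiB"   # engine brick drift: 3.0 GiB
```

Some drift is normal on replicas (self-heal metadata, different XFS/LVM layouts), but a persistent multi-GiB gap on a healed volume is exactly the kind of thing worth raising on the list, as Jim does here.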
Re: [ovirt-users] Recovering from a multi-node failure
Well, after a very stressful weekend, I think I have things largely working. It turns out that most of the above issues were caused by the Linux permissions of the exports for all three volumes (they had been reset to 600; setting them to 774 or 770 fixed many of the issues). Of course, I didn't find that until a much more harrowing outage, and hours and hours of work, including beginning to look at rebuilding my cluster. So, now my cluster is operating again, and everything looks good EXCEPT for one major Gluster issue/question that I haven't found any references or info on. My host ovirt2, one of the replica gluster servers, is the one that lost its storage and had to reinitialize it from the cluster. The iso volume is perfectly fine and complete, but the engine and data volumes are smaller on disk on this node than on the other node (and on this node before the crash). On the engine store, the entire cluster reports the smaller utilization on mounted gluster filesystems; on the data partition, it reports the larger size (rest of cluster). 
Here's some df statements to help clarify (brick1 = engine; brick2 = data; brick4 = iso):

View from ovirt1:
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/gluster-engine   25G   12G   14G  47% /gluster/brick1
/dev/mapper/gluster-data    136G  125G   12G  92% /gluster/brick2
/dev/mapper/gluster-iso      25G  7.3G   18G  29% /gluster/brick4
192.168.8.11:/engine         15G  9.7G  5.4G  65% /rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
192.168.8.11:/data          136G  125G   12G  92% /rhev/data-center/mnt/glusterSD/192.168.8.11:_data
192.168.8.11:/iso            13G  7.3G  5.8G  56% /rhev/data-center/mnt/glusterSD/192.168.8.11:_iso

View from ovirt2:
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/gluster-engine   15G  9.7G  5.4G  65% /gluster/brick1
/dev/mapper/gluster-data    174G  119G   56G  69% /gluster/brick2
/dev/mapper/gluster-iso      13G  7.3G  5.8G  56% /gluster/brick4
192.168.8.11:/engine         15G  9.7G  5.4G  65% /rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
192.168.8.11:/data          136G  125G   12G  92% /rhev/data-center/mnt/glusterSD/192.168.8.11:_data
192.168.8.11:/iso            13G  7.3G  5.8G  56% /rhev/data-center/mnt/glusterSD/192.168.8.11:_iso

As you can see, in the process of rebuilding the hard drive for ovirt2, I did resize some things to give more space to data, where I desperately need it. If this goes well and the storage is given a clean bill of health at this time, then I will take ovirt1 down and resize to match ovirt2, and thus score a decent increase in storage for data. I fully realize that right now the gluster-mounted volumes should have the total size as the least common denominator. So, is this size reduction appropriate? A big part of me thinks data is missing, but I even went through and shut down ovirt2's gluster daemons, wiped all the gluster data, and restarted gluster to allow it a fresh heal attempt, and it again came back to the exact same size. 
This cluster was originally built about the time ovirt 4.0 came out, and has been upgraded to 'current', so perhaps some new gluster features are making more efficient use of space (dedupe or something)? Thank you for your assistance! --Jim On Fri, Aug 4, 2017 at 7:49 PM, Jim Kusznir wrote: > Hi all: > > Today has been rough. Two of my three nodes went down today, and self > heal has not been healing well. 4 hours later, VMs are running, but the > engine is not happy. It claims the storage domain is down (even though it > is up on all hosts and VMs are running). I'm getting a ton of these > messages logging: > > VDSM engine3 command HSMGetAllTasksStatusesVDS failed: Not SPM > > Aug 4, 2017 7:23:00 PM > > VDSM engine3 command SpmStatusVDS failed: Error validating master storage > domain: ('MD read error',) > > Aug 4, 2017 7:22:49 PM > > VDSM engine3 command ConnectStoragePoolVDS failed: Cannot find master > domain: u'spUUID=5868392a-0148-02cf-014d-0121, > msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770' > > Aug 4, 2017 7:22:47 PM > > VDSM engine1 command ConnectStoragePoolVDS failed: Cannot find master > domain: u'spUUID=5868392a-0148-02cf-014d-0121, > msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770' > > Aug 4, 2017 7:22:46 PM > > VDSM engine2 command SpmStatusVDS failed: Error validating master storage > domain: ('MD read error',) > > Aug 4, 2017 7:22:44 PM > > VDSM engine2 command ConnectStoragePoolVDS failed: Cannot find master > domain: u'spUUID=5868392a-0148-02cf-014d-0121, > msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770' > > Aug 4, 2017 7:22:42 PM > > VDSM engine1 command HSMGetAllTasksStatusesVDS failed: Not SPM: ()
[ovirt-users] Recovering from a multi-node failure
Hi all: Today has been rough. Two of my three nodes went down today, and self heal has not been healing well. 4 hours later, VMs are running, but the engine is not happy. It claims the storage domain is down (even though it is up on all hosts and VMs are running). I'm getting a ton of these messages logging:

VDSM engine3 command HSMGetAllTasksStatusesVDS failed: Not SPM
Aug 4, 2017 7:23:00 PM
VDSM engine3 command SpmStatusVDS failed: Error validating master storage domain: ('MD read error',)
Aug 4, 2017 7:22:49 PM
VDSM engine3 command ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=5868392a-0148-02cf-014d-0121, msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'
Aug 4, 2017 7:22:47 PM
VDSM engine1 command ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=5868392a-0148-02cf-014d-0121, msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'
Aug 4, 2017 7:22:46 PM
VDSM engine2 command SpmStatusVDS failed: Error validating master storage domain: ('MD read error',)
Aug 4, 2017 7:22:44 PM
VDSM engine2 command ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=5868392a-0148-02cf-014d-0121, msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'
Aug 4, 2017 7:22:42 PM
VDSM engine1 command HSMGetAllTasksStatusesVDS failed: Not SPM: ()

I cannot set an SPM as it claims the storage domain is down; I cannot set the storage domain up. Also in the storage realm, one of my exports shows substantially less data than is actually there. Here's what happened, as best as I understand it: I went to do maintenance on ovirt2 (needed to replace a faulty RAM stick and rework the disk). I put it in maintenance mode, then shut it down and did my work. In the process, much of the disk contents were lost (all the gluster data). I figured, no big deal, the gluster data is redundant on the network; it will heal when it comes back up. While I was doing maintenance, all but one of the VMs were running on engine1. 
When I turned on engine2, all of a sudden all VMs including the main engine stopped and went non-responsive. As far as I can tell, this should not have happened, as I turned ON one host, but nonetheless, I waited for recovery to occur (while customers started calling asking why everything stopped working). As I waited, I was checking, and gluster volume status only showed ovirt1 and ovirt2. Apparently gluster had stopped/failed at some point on ovirt3. I assume that was the cause of the outage; still, if everything was working fine with ovirt1's gluster, and ovirt2 powers on with a very broken gluster (the volume status was showing NA for the port fields for the gluster volumes), I would not expect a working gluster to go stupid like that. After starting ovirt3's glusterd and checking the status, all three showed ovirt1 and ovirt3 as operational, and ovirt2 as NA. Unfortunately, recovery was still not happening, so I did some googling and found the commands to inquire about the hosted-engine status. It appeared to be stuck "paused" and I couldn't find a way to unpause it, so I powered it off, then started it manually on engine1, and the cluster came back up. It showed all VMs paused; I was able to unpause them and they worked again. So now I began to work the ovirt2 gluster healing problem. It didn't appear to be self-healing, but eventually I found this document: https://support.rackspace.com/how-to/recover-from-a-failed-server-in-a-glusterfs-array/ and from that found the magic xattr commands. After setting them, gluster volumes on ovirt2 came online. I told iso to heal, and it did, but it only came up with about half as much data as it should have. I told it to heal full, and it did finish off the remaining data and came up to full. I then told engine to do a full heal (gluster volume heal engine full), and it transferred its data from the other gluster hosts too. However, it said it was done when it hit 9.7GB while there was 15GB on disk! 
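The "magic xattr" recovery from the Rackspace article can be sketched as a dry-run script. The brick path and volume name are from this thread; the volume-id value is a placeholder you must read off a healthy replica, and the commands are printed rather than executed, since running them against the wrong brick can destroy data:

```shell
# Dry-run sketch of the failed-brick recovery procedure referenced above.
# Prints each command instead of running it; <id-from-healthy-node> is a
# placeholder for the hex volume-id read in step 1.
BRICK=/gluster/brick1/engine
VOL=engine
run() { echo "$*"; }   # dry run: print instead of execute

# 1. Read the volume-id xattr from a healthy replica of the same volume:
run getfattr -n trusted.glusterfs.volume-id "$BRICK"
# 2. Stamp the rebuilt (empty) brick with the same id:
run setfattr -n trusted.glusterfs.volume-id -v "0x<id-from-healthy-node>" "$BRICK"
# 3. Restart glusterd so the brick process starts, then force a full heal:
run systemctl restart glusterd
run gluster volume heal "$VOL" full
```

A full heal (as opposed to the default index heal) is what finally copied the remaining data in the iso case above; the engine volume's on-disk shortfall afterwards is the open question in this thread.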
It is still stuck that way; the oVirt GUI and 'gluster volume heal engine info' both show the volume fully healed, but it is not:

[root@ovirt1 ~]# df -h
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/centos_ovirt-root   20G  4.2G   16G  21% /
devtmpfs                        16G     0   16G   0% /dev
tmpfs                           16G   16K   16G   1% /dev/shm
tmpfs                           16G   26M   16G   1% /run
tmpfs                           16G     0   16G   0% /sys/fs/cgroup
/dev/mapper/gluster-engine      25G   12G   14G  47% /gluster/brick1
/dev/sda1                      497M  315M  183M  64% /boot
/dev/mapper/gluster-data       136G  124G   13G  92% /gluster/brick2
/dev/mapper/gluster-iso         25G  7.3G   18G  29% /gluster/brick4
tmpfs                          3.2G     0  3.2G   0% /run/user/0
192.168.8.11:/engine            15G  9.7G  5.4G  65% /rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
192.168.8.11:/data             136G  124G   13G  92% /rhev/
Re: [ovirt-users] ovirt-hosted-engine state transition messages
ad::INFO::2017-07-17 08:16:04,700::state_decorators::88::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning ->
MainThread::INFO::2017-07-17 08:16:04,710::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1500304564.71 type=state_transition detail=EngineUpBadHealth-EngineUp hostname='ovirt1.nwfiber.com'
MainThread::INFO::2017-07-17 08:16:04,798::brokerlink::121::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUpBadHealth-EngineUp) sent? sent
MainThread::INFO::2017-07-17 08:16:04,799::hosted_engine::604::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
MainThread::INFO::2017-07-17 08:16:07,435::hosted_engine::630::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
MainThread::INFO::2017-07-17 08:16:07,491::storage_server::219::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2017-07-17 08:16:13,906::storage_server::226::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2017-07-17 08:16:14,131::storage_server::233::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
MainThread::INFO::2017-07-17 08:16:14,437::hosted_engine::657::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Preparing images
MainThread::INFO::2017-07-17 08:16:14,438::image::126::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Preparing images

On Thu, Mar 30, 2017 at 5:58 AM, Simone Tiraboschi wrote: > Could you please check your /var/log/ovirt-hosted-engine-ha/agent.log ? 
> > On Thu, Mar 30, 2017 at 3:10 AM, Jim Kusznir wrote: > >> Hello: >> >> I find that I often get random-seeming messages. A lot of them mention >> "ReinitializeFSM", but I also get engine down, engine start, etc. >> messages. All the time, nothing appears to be happening on the cluster, >> and I rarely can find anything wrong or any trigger/cause. Is this >> normal? What causes this (beyond obvious hardware issues / hosts >> rebooting)? Most of the time when I get these, my cluster is going along >> smoothly, and nothing (not even administrative access) is interrupted. >> >> Could ISP issues cause these messages to be generated? >> >> Thanks! >> --Jim
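For questions like this, one low-tech check is to pull the actual state-transition names out of agent.log and see how often they really occur. A small sketch; the sed expression is illustrative, and the sample line is taken from the log excerpt above:

```shell
# Extract the From-To state name from an ovirt-hosted-engine-ha agent.log
# state_transition line. Sample line is from the excerpt in this thread.
line="MainThread::INFO::2017-07-17 08:16:04,710::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1500304564.71 type=state_transition detail=EngineUpBadHealth-EngineUp hostname='ovirt1.nwfiber.com'"
transition=$(printf '%s\n' "$line" | sed -n 's/.*type=state_transition detail=\([^ ]*\).*/\1/p')
echo "$transition"   # EngineUpBadHealth-EngineUp
```

Against a live host you would feed /var/log/ovirt-hosted-engine-ha/agent.log through the same sed expression; a burst of ReinitializeFSM or EngineDown transitions clustered in time is a better clue than the notification e-mails alone.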
Re: [ovirt-users] Setting up GeoReplication
I tried to create a gluster volume on the georep node by running: gluster volume create engine-rep replica 1 georep.nwfiber.com:/mnt/gluster/engine-rep I got back an error saying replica must be > 1. So I tried to create it again: gluster volume create engine-rep replica 2 georep.nwfiber.com:/mnt/gluster/engine-rep server2.nwfiber.com:/mnt/gluster/engine-rep where server2 did not exist. That failed too, but I don't recall the error message. gluster is installed, but when I try to start it with the init script, it fails to start with a complaint about reading the block file; my googling indicated that's the error you get until you've created a gluster volume, and that was the first clue to me that maybe I needed to create one first. So, how do I create a replica 1 volume? Thinking way ahead, I have a related replica question: currently my ovirt nodes are also my gluster nodes (replica 2 + arbiter 1). Eventually I'll want to pull my gluster off onto dedicated hardware, I suspect. If I do so, do I need 3 servers, or is a replica 2 sufficient? I guess I could have an ovirt node continue to be an arbiter... I would eventually like to distribute my ovirt cluster across multiple locations with the option for remote failover (say location A loses all its network and/or power; have important VMs started at location B in addition to location B's normal VMs). I assume at this point the recommended architecture would be: 2 gluster servers at each location; each location has a gluster volume for that location, and is a georep target for the other location (so all my data will physically exist on 4 gluster servers). I probably won't have more than 2 or 3 ovirt hosts at each location, so I don't expect this to be a "heavy use" system. Am I on track? I'd be interested to learn what others suggest for this deployment model. 
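On the replica-1 question above: a geo-replication slave does not need to be replicated at all, and omitting the replica keyword entirely creates a plain single-brick (distribute) volume, which sidesteps the "replica must be > 1" error. A dry-run sketch using the hostname and paths from this thread (commands are printed, not executed):

```shell
# Dry-run: create a plain, non-replicated single-brick volume suitable as a
# geo-replication slave. No 'replica N' keyword means a distribute volume.
run() { echo "$*"; }
run gluster volume create engine-rep georep.nwfiber.com:/mnt/gluster/engine-rep
# Note: gluster may require appending 'force' if the brick directory sits on
# the root filesystem rather than a dedicated mount.
run gluster volume start engine-rep
```

Starting the volume also explains the init-script symptom above: glusterd starts cleanly once at least one volume definition exists.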
On Sun, May 14, 2017 at 11:09 PM, Sahina Bose wrote: > Adding Aravinda > > On Sat, May 13, 2017 at 11:21 PM, Jim Kusznir wrote: > >> Hi All: >> >> I've been trying to set up georeplication for a while now, but can't seem >> to make it work. I've found documentation on the web (mostly >> https://gluster.readthedocs.io/en/refactor/Administr >> ator%20Guide/Geo%20Replication/), and I found http://blog.gluster.org/ >> 2015/09/introducing-georepsetup-gluster-geo-replication-setup-tool/ >> >> Unfortunately, it seems that some critical steps are missing from both, >> and I can't figure out for sure what they are. >> >> My environment: >> >> Production: replica 2 + arbitrator running on my 3-node oVirt cluster, 3 >> volumes (engine, data, iso). >> >> New geo-replication: Raspberry Pi 3 with USB hard drive shoved in some >> other data closet off-site. >> >> I've installed raspbian-lite, and after much fighting, got >> glusterfs-*-3.8.11 installed. I've created my mountpoint (USB hard drive, >> much larger than my gluster volumes), and then ran the command. I get this >> far: >> >> [OK] georep.nwfiber.com is Reachable (Port 22) >> [OK] SSH Connection established r...@georep.nwfiber.com >> [OK] Master Volume and Slave Volume are compatible (Version: 3.8.11) >> [NOT OK] Unable to Mount Gluster Volume georep.nwfiber.com:engine-rep >> >> Trying it with the steps in the gluster docs also has the same problem. >> No log files are generated on the slave. 
Log files on the master include:
>>
>> [root@ovirt1 geo-replication]# more georepsetup.mount.log
>> [2017-05-13 17:26:27.318599] I [MSGID: 100030] [glusterfsd.c:2454:main] 0-glusterfs: Started running glusterfs version 3.8.11 (args: glusterfs --xlator-option="*dht.lookup-unhashed=off" --volfile-server localhost --volfile-id engine -l /var/log/glusterfs/geo-replication/georepsetup.mount.log --client-pid=-1 /tmp/georepsetup_wZtfkN)
>> [2017-05-13 17:26:27.341170] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
>> [2017-05-13 17:26:27.341260] E [socket.c:2309:socket_connect_finish] 0-glusterfs: connection to ::1:24007 failed (Connection refused)
>> [2017-05-13 17:26:27.341846] E [glusterfsd-mgmt.c:1908:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: localhost (Transport endpoint is not connected)
>> [2017-05-13 17:26:31.335849] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
>> [2017-05-13 17:26:31.337545] I [MSGID: 114020] [client.c:2356:notify] 0-engine-client-0: parent translators are ready, attempting connect on transport
>> [2017-05-13 17:26:31.3
[ovirt-users] Setting up GeoReplication
Hi All: I've been trying to set up georeplication for a while now, but can't seem to make it work. I've found documentation on the web (mostly https://gluster.readthedocs.io/en/refactor/Administrator%20Guide/Geo%20Replication/), and I found http://blog.gluster.org/2015/09/introducing-georepsetup-gluster-geo-replication-setup-tool/ Unfortunately, it seems that some critical steps are missing from both, and I can't figure out for sure what they are. My environment: Production: replica 2 + arbitrator running on my 3-node oVirt cluster, 3 volumes (engine, data, iso). New geo-replication: Raspberry Pi 3 with USB hard drive shoved in some other data closet off-site. I've installed raspbian-lite, and after much fighting, got glusterfs-*-3.8.11 installed. I've created my mountpoint (USB hard drive, much larger than my gluster volumes), and then ran the command. I get this far:

[OK] georep.nwfiber.com is Reachable (Port 22)
[OK] SSH Connection established r...@georep.nwfiber.com
[OK] Master Volume and Slave Volume are compatible (Version: 3.8.11)
[NOT OK] Unable to Mount Gluster Volume georep.nwfiber.com:engine-rep

Trying it with the steps in the gluster docs also has the same problem. No log files are generated on the slave. 
Log files on the master include:

[root@ovirt1 geo-replication]# more georepsetup.mount.log
[2017-05-13 17:26:27.318599] I [MSGID: 100030] [glusterfsd.c:2454:main] 0-glusterfs: Started running glusterfs version 3.8.11 (args: glusterfs --xlator-option="*dht.lookup-unhashed=off" --volfile-server localhost --volfile-id engine -l /var/log/glusterfs/geo-replication/georepsetup.mount.log --client-pid=-1 /tmp/georepsetup_wZtfkN)
[2017-05-13 17:26:27.341170] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-05-13 17:26:27.341260] E [socket.c:2309:socket_connect_finish] 0-glusterfs: connection to ::1:24007 failed (Connection refused)
[2017-05-13 17:26:27.341846] E [glusterfsd-mgmt.c:1908:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: localhost (Transport endpoint is not connected)
[2017-05-13 17:26:31.335849] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2017-05-13 17:26:31.337545] I [MSGID: 114020] [client.c:2356:notify] 0-engine-client-0: parent translators are ready, attempting connect on transport
[2017-05-13 17:26:31.344485] I [MSGID: 114020] [client.c:2356:notify] 0-engine-client-1: parent translators are ready, attempting connect on transport
[2017-05-13 17:26:31.345146] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-engine-client-0: changing port to 49157 (from 0)
[2017-05-13 17:26:31.350868] I [MSGID: 114020] [client.c:2356:notify] 0-engine-client-2: parent translators are ready, attempting connect on transport
[2017-05-13 17:26:31.355946] I [MSGID: 114057] [client-handshake.c:1440:select_server_supported_programs] 0-engine-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-13 17:26:31.356280] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-engine-client-1: changing port to 49157 (from 0)
Final graph:
+------------------------------------------------------------------------------+
  1: volume engine-client-0
  2:     type protocol/client
  3:     option clnt-lk-version 1
  4:     option volfile-checksum 0
  5:     option volfile-key engine
  6:     option client-version 3.8.11
  7:     option process-uuid ovirt1.nwfiber.com-25660-2017/05/13-17:26:27:311929-engine-client-0-0-0
  8:     option fops-version 1298437
  9:     option ping-timeout 30
 10:     option remote-host ovirt1.nwfiber.com
 11:     option remote-subvolume /gluster/brick1/engine
 12:     option transport-type socket
 13:     option username 028984cf-0399-42e6-b04b-bb9b1685c536
 14:     option password eae737cc-9659-405f-865e-9a7ef97a3307
 15:     option filter-O_DIRECT off
 16:     option send-gids true
 17: end-volume
 18:
 19: volume engine-client-1
 20:     type protocol/client
 21:     option ping-timeout 30
 22:     option remote-host ovirt2.nwfiber.com
 23:     option remote-subvolume /gluster/brick1/engine
 24:     option transport-type socket
 25:     option username 028984cf-0399-42e6-b04b-bb9b1685c536
 26:     option password eae737cc-9659-405f-865e-9a7ef97a3307
 27:     option filter-O_DIRECT off
 28:     option send-gids true
 29: end-volume
 30:
 31: volume engine-client-2
 32:     type protocol/client
 33:     option ping-timeout 30
 34:     option remote-host ovirt3.nwfiber.com
 35:     option remote-subvolume /gluster/brick1/engine
 36:     option transport-type socket
 37:     option username 028984cf-0399-42e6-b04b-bb9b1685c536
 38:     option password eae737cc-9659-405f-865e-9a7ef97a3307
 39:     option filter-O_DIRECT off
 40:     option send-gids true
 41: end-volume
 42:
 43: volume engine-replicate-0
 44:     type cluster/replicate
 45:     option arbiter-count 1
 46:     option data-self-heal-algorithm full
 47:     option
Re: [ovirt-users] Gluster and oVirt 4.0 questions
Thank you for your response! With the right magic word "geo-replication", I was able to find a howto that appears to be what I need to get started. As to the documentation, some more use-case docs would be helpful. I also find myself struggling to understand it at a level to feel comfortable administering it. For example, I followed a howto to get my current gluster stuff running, but just barely, and I still don't truly understand the components or how to make them work. My system is a 3-node ovirt cluster with gluster bricks located on the same nodes (and node 3 is the arbiter, apparently). I was trying to set up gluster to ride on its own network instead of sharing the ovirt main network. Unfortunately, I never could get that to work, and now that it is in production, I don't have the slightest idea where to look to cause gluster to use a different network interface that is already configured and up on the servers. I'm not even sure I know what tools to use at the command line level to ensure gluster is healthy, and should something happen, I'd probably have to post panicked e-mails here. I also am not sure I understand gluster well enough to architect a system under new assumptions. My current configuration was always intended to be a "phase 1" to get my cluster online and thus start and grow my business. However, my current storage is very limited. So, what should my target be for a "better" cluster? Single server, or dual? If I grow to the point where I have multiple clusters at different offices (connected by fiber I own), how should I architect the storage then such that VMs can be moved between clusters, and my clusters "back each other up"? I could use geo-replication, but is that the best/proper way? If I build dedicated servers, do I need more than one gluster storage server per location? This is a lot I just threw out there... These questions have all passed through my head, but I haven't found enough details to answer them myself. 
I'm slowly growing in a few areas; this geo-replication configuration will be my next growth. I would like to move my gluster to another network, and I think I found some of the files where the relevant configs are stored, but not enough detail to feel comfortable changing them without breaking what I have. I've realized that for most systems I've set up, the docs are a bit less "recipe"-ish and have more explanation interspersed with the commands, plus intermediate checks (with explanations) to verify your work as you go. These are extremely valuable to me. For example, if my initial configuration instructions had a paragraph about the network architecture, the different settings (e.g., gluster-to-gluster node sync vs. the management interface that ovirt uses vs. data access for clients), then walked through configuring each one, and then showed the command-line instructions to check that it was working correctly before moving on to the next stage, that would help me understand what I've done and make me more likely to maintain it. It also makes other docs more understandable given a deeper knowledge of what I've already done. It's possible that the instructions I used were poorer than typical for the project, but my googling didn't turn up something that allowed me to figure it out before I posted my original e-mail. Thanks for your help! --Jim On Tue, Apr 25, 2017 at 10:02 AM, Sahina Bose wrote: > > > On Tue, Apr 25, 2017 at 9:18 PM, Jim Kusznir wrote: > >> So with arbiter, I actually only have two copies of data... Does arbiter >> have at least a checksum or something to detect corruption of a copy (like >> the old RAID-4 disk configuration)? >> > > Yes, the arbiter brick stores metadata information about the files to > decide the good copy of data stored on the replicas in case of conflict. > > >> >> Ok... Related question: Is there a way to set up an offsite gluster >> storage server to mirror the contents of my main server? As "fire" >> insurance, basically? 
(eventually, I'd like to have an "offsite" DR >> cluster, but I don't have the resources or scale yet for that). >> >> What I'd like to do is place a basic storage server somewhere else and >> have it sync any gluster data changes on a regular basis, and be usable to >> repopulate storage should I lose all of my current cluster (eg, a building >> fire or theft). >> > > Yes, the geo-replication feature can help with that. There's a remote data > sync feature introduced for gluster storage domains, that helps with this. > You can set this up such that data from your storage domain is regularly > synced to a remote gluster volume, while ensuring data consistency. The > remote gluster volume does not have to be a replica 3.
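The remote-sync setup described above boils down to gluster geo-replication. A minimal sketch, assuming a master volume `data`, a placeholder slave host `backup.example.com`, and a pre-created slave volume `data-backup`; passwordless root SSH to the slave is a prerequisite (see the Gluster geo-replication guide for the full checklist).

```shell
# On the master cluster: create and distribute the common pem keys
gluster system:: execute gsec_create

# Create the geo-replication session against the remote volume
gluster volume geo-replication data backup.example.com::data-backup create push-pem

# Start the session and confirm it reaches "Active" status
gluster volume geo-replication data backup.example.com::data-backup start
gluster volume geo-replication data backup.example.com::data-backup status
```

The slave volume does not need to be replica 3, which fits the single "fire insurance" box described in the thread.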
Re: [ovirt-users] Ovirt tasks "stuck"
(Sorry, my e-mail client sent the previous message prematurely.) Ok, I figured out that this needs to be run on the engine, I figured out that PGPASSWORD is the postgres password, and I finally figured out that the db password is stored in: /etc/ovirt-engine/engine.conf.d/10-setup-database.conf Unfortunately, when I run the command provided, I get just an empty line back, no UUIDs. I looked in the GUI, under the Disks tab, and found the ID there. I ran the command on the two UUIDs for the two disks in question: [root@ovirt ~]# PGPASSWORD= /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -q -t disk -u engine [root@ovirt ~]# PGPASSWORD= /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -t snapshot -u engine 405fabe0-873c-4e8e-ae10-9990debf96c0 Caution, this operation may lead to data corruption and should be used with care. Please contact support prior to running this command Are you sure you want to proceed? [y/n] y select fn_db_unlock_snapshot('405fabe0-873c-4e8e-ae10-9990debf96c0'); INSERT 0 1 unlock snapshot 405fabe0-873c-4e8e-ae10-9990debf96c0 completed successfully. [root@ovirt ~]# PGPASSWORD= /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -t snapshot -u engine eada2c1c-1d99-4391-9be3-352c411a0a91 Caution, this operation may lead to data corruption and should be used with care. Please contact support prior to running this command Are you sure you want to proceed? [y/n] y select fn_db_unlock_snapshot('eada2c1c-1d99-4391-9be3-352c411a0a91'); INSERT 0 1 unlock snapshot eada2c1c-1d99-4391-9be3-352c411a0a91 completed successfully. Unfortunately, this doesn't appear to have accomplished anything. In the web UI, the disks are still shown as locked, and the tasks are still shown as pending. 
I logged into a host node and found the directory by the same UUID: [root@ovirt1 images]# cd 405fabe0-873c-4e8e-ae10-9990debf96c0/ [root@ovirt1 405fabe0-873c-4e8e-ae10-9990debf96c0]# ls 8e4a02a7-760b-478c-a694-81466d601356 8e4a02a7-760b-478c-a694-81466d601356.lease 8e4a02a7-760b-478c-a694-81466d601356.meta [root@ovirt1 405fabe0-873c-4e8e-ae10-9990debf96c0]# du -sh 514M . I'm assuming I should NOT just rm these files and the containing directory. Suggestions for moving forward? On Tue, Apr 25, 2017 at 8:50 AM, Jim Kusznir wrote: > Ok, I figured out that this needs to be run on the engine, I figured out > that PGPASSWORD > > On Tue, Apr 4, 2017 at 2:02 AM, Nathanaël Blanchet > wrote: > >> For instance >> >> PGPASSWORD=X /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh >> -q -t disk -u engine >> 296c010e-3c1d-4008-84b3-5cd39cff6aa1 | 525a4dda-dbbb-4872-a5f1-8ac2ae >> d48392 >> >> PGPASSWORD=X /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh >> -t snapshot -u engine 525a4dda-dbbb-4872-a5f1-8ac2aed48392 >> >> Le 01/04/2017 à 19:55, Jim Kusznir a écrit : >> >> Hi: >> >> A few days ago I attempted to create a new VM from one of the >> ovirt-image-repository images. I haven't really figured out how to use >> this reliably yet, and in this case, while trying to import an image, one >> of my nodes spontaneously rebooted (or at least, it looked like that to >> ovirt...Not sure if it had an OOM issue or something else). I assume it >> was the node that got the task of importing those images, as ever since >> then (several days now), on my management screen under "Tasks" it shows the >> attempted imports, still stuck in "processing". I'm quite certain its not >> actually processing. I do believe it used some of my storage up in the >> partially downloaded images, though (they do show up as >> GlanceDisk-, with a status of "Locked" under the main Disks tab. >> >> How do I "properly" recover from this (abort the task and delete the >> partial download)? >> >> Thanks! 
>> >> --Jim >> >> >> ___ >> Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users >> >> >> -- >> Nathanaël Blanchet >> >> Supervision réseau >> Pôle Infrastrutures Informatiques >> 227 avenue Professeur-Jean-Louis-Viala >> 34193 MONTPELLIER CEDEX 5 >> Tél. 33 (0)4 67 54 84 55 >> Fax 33 (0)4 67 54 84 14 blanc...@abes.fr >> >> > ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
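For this kind of stuck-lock situation, `unlock_entity.sh` also has a query mode that lists locked entities of every type, which avoids digging UUIDs out of the GUI. A sketch; the `-t all` type is an assumption worth checking against the script's `-h` output, and the password is left elided just as in the thread.

```shell
# List all locked entities (-q = query only, makes no changes)
PGPASSWORD= /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -q -t all -u engine
```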
Re: [ovirt-users] Gluster and oVirt 4.0 questions
So with arbiter, I actually only have two copies of data... Does arbiter have at least a checksum or something to detect corruption of a copy (like an old RAID-4 disk configuration)? Ok... Related question: Is there a way to set up an offsite gluster storage server to mirror the contents of my main server? As "fire" insurance, basically? (Eventually, I'd like to have an "offsite" DR cluster, but I don't have the resources or scale yet for that.) What I'd like to do is place a basic storage server somewhere else and have it sync any gluster data changes on a regular basis, and be usable to repopulate storage should I lose all of my current cluster (eg, a building fire or theft). I find gluster has amazing power from what I hear, but I have a hard time finding documentation at "the right level" to be useful. I've found some very basic introductory guides, then some very advanced guides that require extensive knowledge of gluster already. Something in the middle to explain some of these questions (like the arbiter, migration strategies, geo-replication, etc., and how to deploy them) is absent (or at least, I haven't found it yet). I still feel like I'm using something I don't understand, and the only avenue I have to learn more is to ask questions here, as the docs aren't at an accessible level. Thanks! --Jim On Mon, Apr 3, 2017 at 10:34 PM, Sahina Bose wrote: > > > On Sat, Apr 1, 2017 at 10:32 PM, Jim Kusznir wrote: > >> Thank you! 
>> >> Here's the output of gluster volume info: >> [root@ovirt1 ~]# gluster volume info >> >> Volume Name: data >> Type: Replicate >> Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59 >> Status: Started >> Number of Bricks: 1 x (2 + 1) = 3 >> Transport-type: tcp >> Bricks: >> Brick1: ovirt1.nwfiber.com:/gluster/brick2/data >> Brick2: ovirt2.nwfiber.com:/gluster/brick2/data >> Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter) >> Options Reconfigured: >> performance.strict-o-direct: on >> nfs.disable: on >> user.cifs: off >> network.ping-timeout: 30 >> cluster.shd-max-threads: 6 >> cluster.shd-wait-qlength: 1 >> cluster.locking-scheme: granular >> cluster.data-self-heal-algorithm: full >> performance.low-prio-threads: 32 >> features.shard-block-size: 512MB >> features.shard: on >> storage.owner-gid: 36 >> storage.owner-uid: 36 >> cluster.server-quorum-type: server >> cluster.quorum-type: auto >> network.remote-dio: enable >> cluster.eager-lock: enable >> performance.stat-prefetch: off >> performance.io-cache: off >> performance.read-ahead: off >> performance.quick-read: off >> performance.readdir-ahead: on >> server.allow-insecure: on >> >> Volume Name: engine >> Type: Replicate >> Volume ID: 87ad86b9-d88b-457e-ba21-5d3173c612de >> Status: Started >> Number of Bricks: 1 x (2 + 1) = 3 >> Transport-type: tcp >> Bricks: >> Brick1: ovirt1.nwfiber.com:/gluster/brick1/engine >> Brick2: ovirt2.nwfiber.com:/gluster/brick1/engine >> Brick3: ovirt3.nwfiber.com:/gluster/brick1/engine (arbiter) >> Options Reconfigured: >> performance.readdir-ahead: on >> performance.quick-read: off >> performance.read-ahead: off >> performance.io-cache: off >> performance.stat-prefetch: off >> cluster.eager-lock: enable >> network.remote-dio: off >> cluster.quorum-type: auto >> cluster.server-quorum-type: server >> storage.owner-uid: 36 >> storage.owner-gid: 36 >> features.shard: on >> features.shard-block-size: 512MB >> performance.low-prio-threads: 32 >> cluster.data-self-heal-algorithm: 
full >> cluster.locking-scheme: granular >> cluster.shd-wait-qlength: 1 >> cluster.shd-max-threads: 6 >> network.ping-timeout: 30 >> user.cifs: off >> nfs.disable: on >> performance.strict-o-direct: on >> >> Volume Name: export >> Type: Replicate >> Volume ID: 04ee58c7-2ba1-454f-be99-26ac75a352b4 >> Status: Stopped >> Number of Bricks: 1 x (2 + 1) = 3 >> Transport-type: tcp >> Bricks: >> Brick1: ovirt1.nwfiber.com:/gluster/brick3/export >> Brick2: ovirt2.nwfiber.com:/gluster/brick3/export >> Brick3: ovirt3.nwfiber.com:/gluster/brick3/export (arbiter) >> Options Reconfigured: >> performance.readdir-ahead: on >> performance.quick-read: off >> performance.read-ahead: off >> performance.io-cache: off >> performance.stat-prefetch: off >> cluster.eager-lock: enable >> network.remote-dio: off >> cluster.quorum-type: auto >> cluster.server-quorum-type: server >>
[ovirt-users] Ovirt tasks "stuck"
Hi: A few days ago I attempted to create a new VM from one of the ovirt-image-repository images. I haven't really figured out how to use this reliably yet, and in this case, while trying to import an image, one of my nodes spontaneously rebooted (or at least, it looked like that to ovirt...Not sure if it had an OOM issue or something else). I assume it was the node that got the task of importing those images, as ever since then (several days now), on my management screen under "Tasks" it shows the attempted imports, still stuck in "processing". I'm quite certain its not actually processing. I do believe it used some of my storage up in the partially downloaded images, though (they do show up as GlanceDisk-, with a status of "Locked" under the main Disks tab. How do I "properly" recover from this (abort the task and delete the partial download)? Thanks! --Jim
Re: [ovirt-users] Gluster and oVirt 4.0 questions
Based on the suggestions here, I did successfully remove the unused export gluster brick and allocate all otherwise unassigned space to my data export, then used xfs_growfs to realize the new size. This should hold me for a while longer before building a "proper" storage solution. --Jim On Sat, Apr 1, 2017 at 10:02 AM, Jim Kusznir wrote: > Thank you! > > Here's the output of gluster volume info: > [root@ovirt1 ~]# gluster volume info > > Volume Name: data > Type: Replicate > Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59 > Status: Started > Number of Bricks: 1 x (2 + 1) = 3 > Transport-type: tcp > Bricks: > Brick1: ovirt1.nwfiber.com:/gluster/brick2/data > Brick2: ovirt2.nwfiber.com:/gluster/brick2/data > Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter) > Options Reconfigured: > performance.strict-o-direct: on > nfs.disable: on > user.cifs: off > network.ping-timeout: 30 > cluster.shd-max-threads: 6 > cluster.shd-wait-qlength: 1 > cluster.locking-scheme: granular > cluster.data-self-heal-algorithm: full > performance.low-prio-threads: 32 > features.shard-block-size: 512MB > features.shard: on > storage.owner-gid: 36 > storage.owner-uid: 36 > cluster.server-quorum-type: server > cluster.quorum-type: auto > network.remote-dio: enable > cluster.eager-lock: enable > performance.stat-prefetch: off > performance.io-cache: off > performance.read-ahead: off > performance.quick-read: off > performance.readdir-ahead: on > server.allow-insecure: on > > Volume Name: engine > Type: Replicate > Volume ID: 87ad86b9-d88b-457e-ba21-5d3173c612de > Status: Started > Number of Bricks: 1 x (2 + 1) = 3 > Transport-type: tcp > Bricks: > Brick1: ovirt1.nwfiber.com:/gluster/brick1/engine > Brick2: ovirt2.nwfiber.com:/gluster/brick1/engine > Brick3: ovirt3.nwfiber.com:/gluster/brick1/engine (arbiter) > Options Reconfigured: > performance.readdir-ahead: on > performance.quick-read: off > performance.read-ahead: off > performance.io-cache: off > performance.stat-prefetch: off > 
cluster.eager-lock: enable > network.remote-dio: off > cluster.quorum-type: auto > cluster.server-quorum-type: server > storage.owner-uid: 36 > storage.owner-gid: 36 > features.shard: on > features.shard-block-size: 512MB > performance.low-prio-threads: 32 > cluster.data-self-heal-algorithm: full > cluster.locking-scheme: granular > cluster.shd-wait-qlength: 1 > cluster.shd-max-threads: 6 > network.ping-timeout: 30 > user.cifs: off > nfs.disable: on > performance.strict-o-direct: on > > Volume Name: export > Type: Replicate > Volume ID: 04ee58c7-2ba1-454f-be99-26ac75a352b4 > Status: Stopped > Number of Bricks: 1 x (2 + 1) = 3 > Transport-type: tcp > Bricks: > Brick1: ovirt1.nwfiber.com:/gluster/brick3/export > Brick2: ovirt2.nwfiber.com:/gluster/brick3/export > Brick3: ovirt3.nwfiber.com:/gluster/brick3/export (arbiter) > Options Reconfigured: > performance.readdir-ahead: on > performance.quick-read: off > performance.read-ahead: off > performance.io-cache: off > performance.stat-prefetch: off > cluster.eager-lock: enable > network.remote-dio: off > cluster.quorum-type: auto > cluster.server-quorum-type: server > storage.owner-uid: 36 > storage.owner-gid: 36 > features.shard: on > features.shard-block-size: 512MB > performance.low-prio-threads: 32 > cluster.data-self-heal-algorithm: full > cluster.locking-scheme: granular > cluster.shd-wait-qlength: 1 > cluster.shd-max-threads: 6 > network.ping-timeout: 30 > user.cifs: off > nfs.disable: on > performance.strict-o-direct: on > > Volume Name: iso > Type: Replicate > Volume ID: b1ba15f5-0f0f-4411-89d0-595179f02b92 > Status: Started > Number of Bricks: 1 x (2 + 1) = 3 > Transport-type: tcp > Bricks: > Brick1: ovirt1.nwfiber.com:/gluster/brick4/iso > Brick2: ovirt2.nwfiber.com:/gluster/brick4/iso > Brick3: ovirt3.nwfiber.com:/gluster/brick4/iso (arbiter) > Options Reconfigured: > performance.readdir-ahead: on > performance.quick-read: off > performance.read-ahead: off > performance.io-cache: off > 
performance.stat-prefetch: off > cluster.eager-lock: enable > network.remote-dio: off > cluster.quorum-type: auto > cluster.server-quorum-type: server > storage.owner-uid: 36 > storage.owner-gid: 36 > features.shard: on > features.shard-block-size: 512MB > performance.low-prio-threads: 32 > cluster.data-self-heal-algorithm: full > cluster.locking-scheme: granular > cluster.shd-wait-qlength: 1 > cluster.shd-max-threads: 6 > network.ping-timeout: 30 > user.cifs: off > nfs.disable: on > performance.strict-o-direct: on > > > The node marked as (arbiter)
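The resize Jim reports in the follow-up message (removing the unused export volume and growing the data brick with xfs_growfs) comes down to a two-step LVM + XFS operation. A sketch, assuming the thin LV path `/dev/gluster/data` from this thread and a brick mounted at `/gluster/brick2`; the exact size to add is whatever space was reclaimed, and both steps run online against a mounted filesystem.

```shell
# Grow the logical volume backing the data brick by the reclaimed space
lvextend -L +25G /dev/gluster/data

# Grow the XFS filesystem to fill the enlarged LV (takes the mount point)
xfs_growfs /gluster/brick2
```

With a replicated gluster volume this has to be repeated on each node so the bricks stay the same size.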
Re: [ovirt-users] Gluster and oVirt 4.0 questions
  LV Pool data           lvthinpool_tdata
  LV Status              available
  # open                 4
  LV Size                150.00 GiB
  Allocated pool data    65.02%
  Allocated metadata     14.92%
  Current LE             38400
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:5

  --- Logical volume ---
  LV Path                /dev/gluster/data
  LV Name                data
  VG Name                gluster
  LV UUID                NBxLOJ-vp48-GM4I-D9ON-4OcB-hZrh-MrDacn
  LV Write Access        read/write
  LV Creation host, time ovirt1.nwfiber.com, 2016-12-31 14:40:11 -0800
  LV Pool name           lvthinpool
  LV Status              available
  # open                 1
  LV Size                100.00 GiB
  Mapped size            90.28%
  Current LE             25600
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:7

  --- Logical volume ---
  LV Path                /dev/gluster/export
  LV Name                export
  VG Name                gluster
  LV UUID                bih4nU-1QfI-tE12-ZLp0-fSR5-dlKt-YHkhx8
  LV Write Access        read/write
  LV Creation host, time ovirt1.nwfiber.com, 2016-12-31 14:40:20 -0800
  LV Pool name           lvthinpool
  LV Status              available
  # open                 1
  LV Size                25.00 GiB
  Mapped size            0.12%
  Current LE             6400
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:8

  --- Logical volume ---
  LV Path                /dev/gluster/iso
  LV Name                iso
  VG Name                gluster
  LV UUID                l8l1JU-ViD3-IFiZ-TucN-tGPE-Toqc-Q3R6uX
  LV Write Access        read/write
  LV Creation host, time ovirt1.nwfiber.com, 2016-12-31 14:40:29 -0800
  LV Pool name           lvthinpool
  LV Status              available
  # open                 1
  LV Size                25.00 GiB
  Mapped size            28.86%
  Current LE             6400
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:9

  --- Logical volume ---
  LV Path                /dev/centos_ovirt/swap
  LV Name                swap
  VG Name                centos_ovirt
  LV UUID                PcVQ11-hQ9U-9KZT-QPuM-HwT6-8o49-2hzNkQ
  LV Write Access        read/write
  LV Creation host, time localhost, 2016-12-31 13:56:36 -0800
  LV Status              available
  # open                 2
  LV Size                16.00 GiB
  Current LE             4096
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:1

  --- Logical volume ---
  LV Path                /dev/centos_ovirt/root
  LV Name                root
  VG Name                centos_ovirt
  LV UUID                g2h2fn-sF0r-Peos-hAE1-WEo9-WENO-MlO3ly
  LV Write Access        read/write
  LV Creation host, time localhost, 2016-12-31 13:56:36 -0800
  LV Status              available
  # open                 1
  LV Size                20.00 GiB
  Current LE             5120
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0

I don't use the export gluster volume, and I've never used lvthinpool-type allocations before, so I'm not sure if there's anything special there. I followed the setup instructions from a contributed oVirt documentation page (which I can't find now) about how to install ovirt with gluster on a 3-node cluster. Thank you for your assistance! --Jim On Thu, Mar 30, 2017 at 1:27 AM, Sahina Bose wrote: > > > On Thu, Mar 30, 2017 at 1:23 PM, Liron Aravot wrote: > >> Hi Jim, please see inline >> >> On Thu, Mar 30, 2017 at 4:08 AM, Jim Kusznir wrote: > >>> hello: >>> >>> I've been running my ovirt Version 4.0.5.5-1.el7.centos cluster for a >>> while now, and am now revisiting some aspects of it for ensuring that I >>> have good reliability. >>> >>> My cluster is a 3 node cluster, with gluster nodes running on each >>> node. After running my cluster a bit, I'm realizing I didn't do a very >>> optimal job of allocating the space on my disk to the different gluster >>> mount points. Fortunately, they were created with LVM, so I'm hoping that >>> I can resize them without much trouble. >>> >>> I have a domain for iso, domain for export, and domain for storage, all >>> thin provisioned; then a domain for the engine, not thin provisioned. I'd >>> like
[ovirt-users] ovirt-hosted-engine state transition messages
Hello: I find that I often get random-seeming messages. A lot of them mention "ReinitializeFSM", but I also get engine down, engine start, etc. messages. All the while, nothing appears to be happening on the cluster, and I rarely can find anything wrong or any trigger/cause. Is this normal? What causes this (beyond obvious hardware issues / hosts rebooting)? Most of the time when I get these, my cluster is going along smoothly, and nothing (not even administrative access) is interrupted. Could ISP issues cause these messages to be generated? Thanks! --Jim
[ovirt-users] Gluster and oVirt 4.0 questions
Hello: I've been running my ovirt Version 4.0.5.5-1.el7.centos cluster for a while now, and am now revisiting some aspects of it to ensure that I have good reliability. My cluster is a 3-node cluster, with gluster nodes running on each node. After running my cluster a bit, I'm realizing I didn't do a very optimal job of allocating the space on my disk to the different gluster mount points. Fortunately, they were created with LVM, so I'm hoping that I can resize them without much trouble. I have a domain for iso, a domain for export, and a domain for storage, all thin provisioned; then a domain for the engine, not thin provisioned. I'd like to expand the storage domain, and possibly shrink the engine domain and make that space also available to the main storage domain. Is it as simple as expanding the LVM partition, or are there more steps involved? Do I need to take the node offline? Second, I've noticed that the first two nodes seem to have a full copy of the data (the disks are in use), but the 3rd node appears to not be using any of its storage space... It is participating in the gluster cluster, though. Third, currently gluster shares the same network as the VM networks. I'd like to put it on its own network. I'm not sure how to do this, as when I tried to do it at install time, I never got the cluster to come online; I had to make them share the same network to make that work. Ovirt questions: I've noticed that recently, I don't appear to be getting software updates anymore. I used to get update-available notifications on my nodes every few days; I haven't seen one for a couple weeks now. Is something wrong? I have a Windows 10 x64 VM. I get a warning that my VM type does not match the installed OS. All works fine, but I've quadruple-checked that it does match. Is this a known bug? I have a UPS that all three nodes and the networking are on. It is a USB UPS. How should I best integrate monitoring in? 
I could put a Raspberry Pi up and then run NUT or similar on it, but is there a "better" way with oVirt? Thanks! --Jim
[ovirt-users] Hosted Engine migration problems
Hi again: I thought I had fixed the hosted engine migration problem that was preventing me from updating the host the engine was running on. Today it let me migrate it from ovirt1 to ovirt2, and perform needed updates on ovirt1. When I tried to migrate it back to ovirt1 after the updates, I got errors that it failed migration. I tried an auto-migrate, and it claimed that the other two nodes (including the node it was running on) do not meet minimum requirements, specifically that they are not HA nodes. But I did explicitly set them up as HA nodes. Here's the engine.log output from the command: 2017-02-11 06:12:03,078 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-41) [252e1f97] Candidate host 'engine1' ('1e182fb9-8057-42ed-abd6-bc5bc343ccc6') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'HA' (correlation id: null) 2017-02-11 06:12:03,078 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-41) [252e1f97] Candidate host 'engine3' ('bac8ace2-cf7e-48ea-9113-b82343cd87f7') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'HA' (correlation id: null) 2017-02-11 06:12:03,081 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-41) [252e1f97] Candidate host 'engine2' ('76c075fc-1dfb-479d-98ef-57575ec11787') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'Migration' (correlation id: null) 2017-02-11 06:12:03,081 WARN [org.ovirt.engine.core.bll.MigrateVmCommand] (default task-41) [252e1f97] Validation of action 'MigrateVm' failed for user admin@internal-authz. 
Reasons: VAR__ACTION__MIGRATE,VAR__TYPE__VM,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName engine1,$filterName HA,VAR__DETAIL__NOT_HE_HOST,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL,VAR__FILTERTYPE__INTERNAL,$hostName engine3,$filterName HA,VAR__DETAIL__NOT_HE_HOST,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL,VAR__FILTERTYPE__INTERNAL,$hostName engine2,$filterName Migration,VAR__DETAIL__SAME_HOST,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL I'm a bit confused by this. I followed the ovirt+gluster howto referenced from the contributed documentation page. --Jim
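When the scheduler rejects hosts with the HA filter and `VAR__DETAIL__NOT_HE_HOST` as in the log above, the usual first check is whether each host is actually part of the hosted-engine HA cluster and earning a score. A sketch of the checks, run on each host:

```shell
# Shows the HA score and engine state per host; a host missing from
# this output is not registered as a hosted-engine HA host at all
hosted-engine --vm-status

# The HA agent and broker must both be running for a host to score
systemctl status ovirt-ha-agent ovirt-ha-broker
```

A host that was added through the engine UI without the hosted-engine deployment step will show up as a normal host but be filtered out exactly this way.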
[ovirt-users] oVirt maintence "best practices"
Hello: Now that I've had my ovirt cluster running for about a month and a half, I am realizing I don't necessarily know the best practices for keeping it up. I've been seeing the notices in the ovirt hosts screen showing that there are updates waiting for the hosts, and I'll put them in maintenance mode one at a time and apply the updates. What about the engine itself? Is it recommended/safe to log into the engine and just run "yum update"? Are there other procedures I should be doing? My oVirt cluster is a 3-node cluster with gluster running on each node. Thanks! --Jim
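For the engine question above, the documented flow is not a bare `yum update`: the setup packages are refreshed first and `engine-setup` is re-run to upgrade the engine and migrate its database. A sketch of the 4.0-era procedure; verify the package glob against the release notes for your exact version.

```shell
# On the engine VM:
yum update "ovirt-*-setup*"   # pull the new setup packages first
engine-setup                  # upgrades the engine and its database

# Afterwards, update the remaining OS packages as usual
yum update
```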
Re: [ovirt-users] Optimizations for VoIP VM
Sorry for the delayed response, I finally found where gmail hid this response... :( So the application is FusionPBX, a FreeSWITCH-based VoIP system, running on a very unloaded (1% cpu load, 2-4 VMs running) system. I've been experiencing intermittent call breakup, which external support immediately blamed on the virtualization solution, claiming that "You can't virtualize VoIP systems without causing voice breakup and other call quality issues". Previously, I had attempted to run FreePBX (Asterisk-based) on a Hyper-V system, and I did find that to be the case; moving over to very weak, but dedicated, hardware fixed the problem immediately. Since I sent this message, I did extensive testing with my system, and it appears that the breakup is in fact network related. I've been able to do phone-to-phone calls on the local network for extended durations without issue, and even phone-to-phone calls on external networks without issue. However, calls going to my VoIP provider do break up, so it appears to be the network route to my provider. So, oVirt does not appear to be to blame (which I didn't think so, but was hoping for some "expert information" to support this... It appears that I got that and more with my tests). Thank you again for your work on such a great product! --Jim On Wed, Jan 4, 2017 at 10:08 AM, Chris Adams wrote: > Once upon a time, Yaniv Dary said: > > Can you please describe the application network requirements? > > Does it relay on low latency? Pass-through or SR-IOV could help with > > reducing that. > > For VoIP, latency can be an issue, but the amount of latency from adding > VM networking overhead isn't a big deal (because other network latency > will have a larger impact). 10ms isn't really a problem for VoIP for > example. > > The bigger network concern for VoIP is jitter; for that, the only > solution is to not over-provision hardware CPUs or total network > bandwidth. 
> > -- > Chris Adams
Re: [ovirt-users] Guest agent for CentOS
Also, in the Debian instructions, I found one error: under "Starting the service", the 2nd line (after su -) states: service ovirt-guest-agent enable && service ovirt-guest-agent start However, on Debian systems, the first part won't work. The working version would read: update-rc.d ovirt-guest-agent enable && service ovirt-guest-agent start --Jim On Sun, Jan 8, 2017 at 9:14 PM, Jim Kusznir wrote: > Hello: > > I'm wanting to install the guest agent in one of my CentOS VMs. I looked > on the documentation, and found this page: > > http://www.ovirt.org/documentation/internal/guest- > agent/understanding-guest-agents-and-other-tools/ > > It mentions that guest agent is available for CentOS, but does not provide > a link to instructions. Due to similarities with Fedora, I followed the > link there. It says to install "ovirt-guest-agent-common", but the package > isn't found. I tried adding the ovirt SIG repos for my CentOS (7), and > re-searched, but still it is not found. No links were provided to the RPM. > > Where do I find the CentOS guest agent? > > Perhaps the documentation online should be updated with this information, > too...I'm sure I'm not the only one looking... > > --Jim
[ovirt-users] Guest agent for CentOS
Hello: I'm wanting to install the guest agent in one of my CentOS VMs. I looked on the documentation, and found this page: http://www.ovirt.org/documentation/internal/guest-agent/understanding-guest-agents-and-other-tools/ It mentions that guest agent is available for CentOS, but does not provide a link to instructions. Due to similarities with Fedora, I followed the link there. It says to install "ovirt-guest-agent-common", but the package isn't found. I tried adding the ovirt SIG repos for my CentOS (7), and re-searched, but still it is not found. No links were provided to the RPM. Where do I find the CentOS guest agent? Perhaps the documentation online should be updated with this information, too...I'm sure I'm not the only one looking... --Jim
Re: [ovirt-users] unable to start VMs after upgrade
Well, it turned out it was 100% of one core; the percentage reported took into account how many cores the VM had assigned. Rebooting the node did fix the problem. Just to be clear, the "proper" procedure for rebooting a host in oVirt is to put it in maintenance mode, ssh to the node, issue the reboot, then after confirming it's back up, right-click on the node in the web UI and select "confirm node reboot", then take it out of maintenance mode? --Jim On Sun, Jan 8, 2017 at 9:10 AM, Robert Story wrote: > On Sat, 7 Jan 2017 15:02:10 -0800 Jim wrote: > JK> I went on about the work I came in to do, and tried to start up a VM. > It > JK> appeared to start, but it never booted. It did raise the CPU usage > for > JK> that VM, but console was all black, no resize or anything. Tried > several > JK> settings. This was on a VM I had just powered down. I noticed it was > JK> starting the VM on engine3, so I did a runonce specifying the vm start > on > JK> engine2. Booted up just fine. After booting, I could migrate to > engine3, > JK> and all was good. > JK> > JK> What happened? I get no error messages, starting any vm on engine3, > start > JK> paused, attaching display, then running it, I always get the same > thing: > JK> blank console, about 50% cpu usage reported by the web interface, no > JK> response on any network, and by all signs available to me, no actual > JK> booting (reminds me of a PC that doesn't POST). Simply changing the > engine > JK> it starts on to one that has not been upgraded fixes the problem. > > I had this issue too, except I had 100% cpu usage reported on the web > interface. have you rebooted the troublesome host since it was upgraded? I > think that was what solved it for me. > > > Robert > > -- > Senior Software Engineer @ Parsons
[ovirt-users] ReinitializeFSM-EngineDown -- what does this mean?
Hello: I've been getting a bunch of e-mails from my ovirt system stating that a "state transition" has occurred; first: StartState-ReinitializeFSM, then a 2nd e-mail, ReinitializeFSM-EngineDown. These are all for my host2 system; my hosted engine is running on host1. Host2 appears to be working just fine, and has the majority of my VMs on it at the moment. The timing is also a bit weird: I got my first one at 12:05AM this morning, then 2:40, 2:55am, 4:20, 4:25, and 4:40am, then 7:11, 7:40, 9:51am and 12:26PM. I'd appreciate any insight! --Jim ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
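[Background sketch: ReinitializeFSM and EngineDown are states of the ovirt-hosted-engine-ha agent's state machine, and these notification mails are sent on each transition. Flapping transitions often point at brief storage or network hiccups. The standard places to look on the host sending the mails:]

```shell
# Current HA state as seen from the host sending the notifications
hosted-engine --vm-status
# The state transitions themselves are logged by the HA agent:
grep -i state /var/log/ovirt-hosted-engine-ha/agent.log | tail -50
# Storage/monitor errors that trigger re-initialization show up in the broker log:
tail -100 /var/log/ovirt-hosted-engine-ha/broker.log
```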
[ovirt-users] unable to start VMs after upgrade
Hello: I'm still fairly new to ovirt. I'm running a 3-node cluster built largely from Jason Brooks' howto for ovirt+gluster in the contributed docs section of the ovirt webpage. I had everything mostly working, and this morning when I logged in, I saw a new symbol attached to all three of my hosts indicating an upgrade is available. So I clicked on engine3 and told it to upgrade. It migrated my VMs off, did its upgrade, and everything looked good. I was able to migrate a vm or two back, and they continued to function just fine. Then I tried to upgrade engine1, which was running my hosted engine. In theory, all three engines/hosts were set up to be able to run the engine, per Jason's instructions. However, it failed to migrate the engine off host1, and I realized that I still have the same issue I had on an earlier incarnation of this cluster: inability to migrate the engine around. Ok, I'll deal with that later (with help from this list, hopefully). I went on about the work I came in to do, and tried to start up a VM. It appeared to start, but it never booted. It did raise the CPU usage for that VM, but the console was all black, no resize or anything. Tried several settings. This was on a VM I had just powered down. I noticed it was starting the VM on engine3, so I did a runonce specifying the vm start on engine2. Booted up just fine. After booting, I could migrate to engine3, and all was good. What happened? I get no error messages; starting any vm on engine3 (start paused, attaching display, then running it), I always get the same thing: blank console, about 50% cpu usage reported by the web interface, no response on any network, and by all signs available to me, no actual booting (reminds me of a PC that doesn't POST). Simply changing the engine it starts on to one that has not been upgraded fixes the problem. 
I'd greatly appreciate your help: 1) how to fix it so the upgraded engine can start VMs again 2) how to fix the cluster so the HostedEngine can migrate between hosts (and I'm able to put host1 in maintenance mode). Ovirt 4 series, latest in repos as of last weekend (Jan 1). --Jim ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
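[A troubleshooting sketch for the symptom described above: when a VM "starts" but never gets to POST on one particular host, the host-side logs usually say why. The VM-name path component below is a placeholder.]

```shell
# On the problem host: is qemu actually running the guest?
virsh -r list --all
# qemu's per-VM log often explains a guest that never POSTs:
less /var/log/libvirt/qemu/<vm-name>.log
# VDSM's log covers the VM create/start flow on the host:
less /var/log/vdsm/vdsm.log
```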
[ovirt-users] Optimizations for VoIP VM
Hello: I set up a FreeSwitch-based VoIP server as a guest on my cluster, and am having audio problems. I'm not 100% sure if it's virtualization related or network related yet. But I would like to optimize my VM for VoIP (or rather, tell oVirt all the "right settings" to optimize that VM for VoIP). Does anyone have any specific suggestions? Are there known issues with VoIP on oVirt-managed clusters? (I know well-reputed companies that sell VoIP server virtual hosting and guarantee the performance, so I know VoIP virtualization is possible; I just need to know if it's recommended with oVirt, and if so, what do I need to do to give it the best chance of success?) Thanks! --Jim ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
Re: [ovirt-users] New install: can't install engine
I did eventually figure out the issue: I misunderstood the question about "cloud-init" as "engine-init"; once I answered yes to cloud-init, I was allowed to set a root password and run engine-setup. I am curious about some of your questions too, actually. I was following the instructions from here: http://www.ovirt.org/blog/2016/08/up-and-running-with-ovirt-4-0-and-gluster-storage/ In that, he said that in order to enable the gluster support in the engine, one has to do it manually. It seemed odd, or perhaps a misunderstanding of the procedure. My understanding was that after deploying my three hosts and gluster, I had to deploy the hosted-engine appliance manually on host 1 so that after the initial deploy, but before it was taken over by HA, I could go in and set the gluster service to on, then finish the setup with the reboot, etc. The other thing I was wondering about was setting up the additional hosts. The instructions say to ssh to each host and deploy the engine there through the ssh command line. The tool itself says that I shouldn't be doing it that way, I should be running through the website. However, when I did that, I had a number of issues that I wasn't able to correct, and when I did it through ssh, I had the best functioning ovirt build yet. One of the issues I had was when I tried to create a private network for gluster sync'ing: the web interface saw the gluster IPs for hosts 2 and 3 (although host 1 was added with its ovirt/real management IP, and the two networks were not routed). The web interface host add ended up adding hosts 2 and 3 with the gluster IP, and thus things broke a lot. I had no means of overriding those settings that I saw through the web interface. When adding through SSH, I had a lot more control over adding the hosts, and was able to add them "correctly". I suspect this means the overall procedure was in error, and I'd like to learn the "better" way to do it. 
--Jim On Mon, Jan 2, 2017 at 3:38 AM, Simone Tiraboschi wrote: > > > On Mon, Jan 2, 2017 at 9:39 AM, Sandro Bonazzola > wrote: > >> >> >> On Fri, Dec 30, 2016 at 7:31 PM, Jim Kusznir wrote: >> >>> Hi all: >>> >>> I'm trying to set up a new ovirt cluster. I got it "mostly working" >>> earlier, but wanted to change some physical networking stuff, and so I >>> thought I'd blow away my machines and rebuild. I followed the same recipe >>> to build it all, but now I'm failing at a point that previously worked. >>> >>> I've built a 3 node cluster with glusterfs backing (3 brick replica), >>> and all that is good and well. I run the engine-setup --deploy, and it >>> does its stuff, asks me (among other things) the admin password, I type in >>> the password I want it to use (just like last time), then it says to log >>> into the new VM and run engine-setup. Here's the problem: I try to ssh in >>> as root, and it will NOT accept my password. It worked a couple days ago, >>> doing it the exact same way, but it will not work now. >>> >> > Hi Jim, > sorry but why do you need to manually run engine-setup in the engine VM? > If you are deploying with the ovirt-engine-appliance, hosted-engine-setup > will run it for you with the right parameters. > > >> I've destroyed and re-deployed several times, I've even done a low level >>> wipe of all three nodes and rebuild everything, and again, it doesn't work. >>> >>> My only guess is that one of the packages the gdeploy script changed, >>> and it has a bug or "new feature" that breaks this for some reason. >>> Unfortunately, I do not have the package versions that worked or the >>> current list to compare to, so I cannot support this. >>> >>> In any case, I'm completely stuck here...I can't log in to run >>> engine-deploy, and I don't know enough of the console/low level stuff to >>> try and hack my way into the VM (eg, to manually mount the disk image and >>> replace the password or put my SSH key in). >>> >>> Suggestions? 
Can anyone else replicate this? >>> >> >> Can you please provide logs? >> >> >> >>> >>> --Jim >>> >>> ___ >>> Users mailing list >>> Users@ovirt.org >>> http://lists.ovirt.org/mailman/listinfo/users >>> >>> >> >> >> -- >> Sandro Bonazzola >> Better technology. Faster innovation. Powered by community collaboration. >> See how it works at redhat.com >> >> ___ >> Users mailing list >> Users@ovirt.org >> http://lists.ovirt.org/mailman/listinfo/users >> >> > ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
Re: [ovirt-users] creating a vlan-tagged network
Actually, I finally was able to identify the issue and fix it...Turns out (as you probably expected), it wasn't ovirt... My upstream provider had some weird security left over: it limited the MAC addresses permitted to exit the building, and my ovirt host made the list somehow while my VMs did not. I now have two VMs on two different nodes that are online! Thank you for your help! --Jim

On Sun, Jan 1, 2017 at 11:57 PM, Edward Haas wrote:
> On Sun, Jan 1, 2017 at 7:16 PM, Jim Kusznir wrote:
>> I pinged both the router on the subnet and a host IP in-between the two
>> ip's.
>>
>> [root@ovirt3 ~]# ping -I 162.248.147.33 162.248.147.1
>> PING 162.248.147.1 (162.248.147.1) from 162.248.147.33 : 56(84) bytes of data.
>> 64 bytes from 162.248.147.1: icmp_seq=1 ttl=255 time=8.17 ms
>> 64 bytes from 162.248.147.1: icmp_seq=2 ttl=255 time=7.47 ms
>> 64 bytes from 162.248.147.1: icmp_seq=3 ttl=255 time=7.53 ms
>> 64 bytes from 162.248.147.1: icmp_seq=4 ttl=255 time=8.42 ms
>> ^C
>> --- 162.248.147.1 ping statistics ---
>> 4 packets transmitted, 4 received, 0% packet loss, time 3004ms
>> rtt min/avg/max/mdev = 7.475/7.901/8.424/0.420 ms
>> [root@ovirt3 ~]#
>>
>> The VM only has its public IP.
>>
>> --Jim
>
> Very strange, all looks good to me.
>
> I can try to help you debug using tcpdump, just send me the details for
> remote connection on private.
> It will also help if you join the vdsm or oVirt IRC channels.
>
>> On Jan 1, 2017 01:26, "Edward Haas" wrote:
>>> On Sun, Jan 1, 2017 at 10:50 AM, Jim Kusznir wrote:
>>>> I currently only have two IPs assigned to me...I can try and take
>>>> another, but that may not route out of the rack. I've got the VM on one of
>>>> the IPs and the host on the other currently.
>>>>
>>>> The switch is a "web-managed" basic 8-port switch (thrown in for
>>>> testing while the "real" switch is in transit). It has the 3 ports the
>>>> hosts are plugged in configured with vlan 1 untagged, set as PVID, and vlan
>>>> 2 tagged. Another port on the switch is untagged on vlan 1 connected to
>>>> the router for the ovirtmgmt network (protected by a VPN, but not "burning"
>>>> public IPs for mgmt purposes), another couple ports are untagged on vlan
>>>> 2. One of those ports goes out of the rack, another goes to the router's
>>>> internet port. Router gets to the internet just fine.
>>>>
>>>> VM:
>>>> kusznir@FusionPBX:~$ ip address
>>>> 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default
>>>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>> inet 127.0.0.1/8 scope host lo
>>>>valid_lft forever preferred_lft forever
>>>> inet6 ::1/128 scope host
>>>>valid_lft forever preferred_lft forever
>>>> 2: eth0: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
>>>> link/ether 00:1a:4a:16:01:51 brd ff:ff:ff:ff:ff:ff
>>>> inet 162.248.147.31/24 brd 162.248.147.255 scope global eth0
>>>>valid_lft forever preferred_lft forever
>>>> inet6 fe80::21a:4aff:fe16:151/64 scope link
>>>>valid_lft forever preferred_lft forever
>>>> kusznir@FusionPBX:~$ ip route
>>>> default via 162.248.147.1 dev eth0
>>>> 162.248.147.0/24 dev eth0 proto kernel scope link src 162.248.147.31
>>>> kusznir@FusionPBX:~$
>>>>
>>>> Host:
>>>> [root@ovirt3 ~]# ip address
>>>> 1: lo: mtu 65536 qdisc noqueue state UNKNOWN
>>>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>> inet 127.0.0.1/8 scope host lo
>>>>valid_lft forever preferred_lft forever
>>>> inet6 ::1/128 scope host
>>>>valid_lft forever preferred_lft forever
>>>> 2: em1: mtu 1500 qdisc mq master ovirtmgmt state UP qlen 1000
>>>> link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff
>>>> 3: em2: mtu 1500 qdisc mq state DOWN qlen 1000
>>>> link/ether 00:21:9b:98:2f:46 brd ff:ff:ff:ff:ff:ff
>>>> 4: em3: mtu 1500 qdisc mq state DOWN qlen 1000
>>>> link/ether 00:21:9b:98:2f:48 brd ff:ff:ff:ff:ff:ff
>>>> 5: em4: mtu 1500 qdisc mq state DOWN qlen 1000
>>>> link/ether 00:21:9b:
Re: [ovirt-users] creating a vlan-tagged network
I pinged both the router on the subnet and a host IP in-between the two ip's. [root@ovirt3 ~]# ping -I 162.248.147.33 162.248.147.1 PING 162.248.147.1 (162.248.147.1) from 162.248.147.33 : 56(84) bytes of data. 64 bytes from 162.248.147.1: icmp_seq=1 ttl=255 time=8.17 ms 64 bytes from 162.248.147.1: icmp_seq=2 ttl=255 time=7.47 ms 64 bytes from 162.248.147.1: icmp_seq=3 ttl=255 time=7.53 ms 64 bytes from 162.248.147.1: icmp_seq=4 ttl=255 time=8.42 ms ^C --- 162.248.147.1 ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 3004ms rtt min/avg/max/mdev = 7.475/7.901/8.424/0.420 ms [root@ovirt3 ~]# The VM only has its public IP. --Jim On Jan 1, 2017 01:26, "Edward Haas" wrote: > > > On Sun, Jan 1, 2017 at 10:50 AM, Jim Kusznir wrote: > >> I currently only have two IPs assigned to me...I can try and take >> another, but that may not route out of the rack. I've got the VM on one of >> the IPs and the host on the other currently. >> >> The switch is a "web-managed" basic 8-port switch (thrown in for testing >> while the "real" switch is in transit). It has the 3 ports the hosts are >> plugged in configured with vlan 1 untagged, set as PVID, and vlan 2 >> tagged. Another port on the switch is untagged on vlan 1 connected to the >> router for the ovirtmgmt network (protected by a VPN, but not "burning" >> public IPs for mgmt purposes), another couple ports are untagged on vlan >> 2. One of those ports goes out of the rack, another goes to the router's >> internet port. Router gets to the internet just fine. 
>> >> VM: >> kusznir@FusionPBX:~$ ip address >> 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group >> default >> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 >> inet 127.0.0.1/8 scope host lo >>valid_lft forever preferred_lft forever >> inet6 ::1/128 scope host >>valid_lft forever preferred_lft forever >> 2: eth0: mtu 1500 qdisc pfifo_fast >> state UP group default qlen 1000 >> link/ether 00:1a:4a:16:01:51 brd ff:ff:ff:ff:ff:ff >> inet 162.248.147.31/24 brd 162.248.147.255 scope global eth0 >>valid_lft forever preferred_lft forever >> inet6 fe80::21a:4aff:fe16:151/64 scope link >>valid_lft forever preferred_lft forever >> kusznir@FusionPBX:~$ ip route >> default via 162.248.147.1 dev eth0 >> 162.248.147.0/24 dev eth0 proto kernel scope link src 162.248.147.31 >> kusznir@FusionPBX:~$ >> >> Host: >> [root@ovirt3 ~]# ip address >> 1: lo: mtu 65536 qdisc noqueue state UNKNOWN >> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 >> inet 127.0.0.1/8 scope host lo >>valid_lft forever preferred_lft forever >> inet6 ::1/128 scope host >>valid_lft forever preferred_lft forever >> 2: em1: mtu 1500 qdisc mq master >> ovirtmgmt state UP qlen 1000 >> link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff >> 3: em2: mtu 1500 qdisc mq state DOWN qlen 1000 >> link/ether 00:21:9b:98:2f:46 brd ff:ff:ff:ff:ff:ff >> 4: em3: mtu 1500 qdisc mq state DOWN qlen 1000 >> link/ether 00:21:9b:98:2f:48 brd ff:ff:ff:ff:ff:ff >> 5: em4: mtu 1500 qdisc mq state DOWN >> qlen 1000 >> link/ether 00:21:9b:98:2f:4a brd ff:ff:ff:ff:ff:ff >> 6: ;vdsmdummy;: mtu 1500 qdisc noop state DOWN >> link/ether 8e:1b:51:60:87:55 brd ff:ff:ff:ff:ff:ff >> 7: ovirtmgmt: mtu 1500 qdisc noqueue >> state UP >> link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff >> inet 192.168.8.13/24 brd 192.168.8.255 scope global dynamic ovirtmgmt >>valid_lft 54830sec preferred_lft 54830sec >> 11: em1.2@em1: mtu 1500 qdisc noqueue >> master Public_Cable state UP >> link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff >> 12: 
Public_Cable: mtu 1500 qdisc >> noqueue state UP >> link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff >> inet 162.248.147.33/24 brd 162.248.147.255 scope global Public_Cable >>valid_lft forever preferred_lft forever >> 14: vnet0: mtu 1500 qdisc pfifo_fast >> master ovirtmgmt state UNKNOWN qlen 500 >> link/ether fe:1a:4a:16:01:54 brd ff:ff:ff:ff:ff:ff >> inet6 fe80::fc1a:4aff:fe16:154/64 scope link >>valid_lft forever preferred_lft forever >> 15: vnet1: mtu 1500 qdisc pfifo_fast >> master ovirtmgmt state UNKNOWN qlen 500 >> link/ether fe:1a:4a:16:01:52 brd ff:ff:ff:ff:ff:ff >> inet6 fe80::fc1a:4aff:fe16:152/64 scope link >>valid_lft forever preferred_lft forever >> 16: vnet2: mtu 1500 qdisc pfifo_fast >>
Re: [ovirt-users] creating a vlan-tagged network
I currently only have two IPs assigned to me...I can try and take another, but that may not route out of the rack. I've got the VM on one of the IPs and the host on the other currently. The switch is a "web-managed" basic 8-port switch (thrown in for testing while the "real" switch is in transit). It has the 3 ports the hosts are plugged into configured with vlan 1 untagged, set as PVID, and vlan 2 tagged. Another port on the switch is untagged on vlan 1 connected to the router for the ovirtmgmt network (protected by a VPN, but not "burning" public IPs for mgmt purposes), and another couple ports are untagged on vlan 2. One of those ports goes out of the rack, another goes to the router's internet port. The router gets to the internet just fine.

VM:
kusznir@FusionPBX:~$ ip address
1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:1a:4a:16:01:51 brd ff:ff:ff:ff:ff:ff
    inet 162.248.147.31/24 brd 162.248.147.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::21a:4aff:fe16:151/64 scope link
       valid_lft forever preferred_lft forever
kusznir@FusionPBX:~$ ip route
default via 162.248.147.1 dev eth0
162.248.147.0/24 dev eth0 proto kernel scope link src 162.248.147.31
kusznir@FusionPBX:~$

Host:
[root@ovirt3 ~]# ip address
1: lo: mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: em1: mtu 1500 qdisc mq master ovirtmgmt state UP qlen 1000
    link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff
3: em2: mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether 00:21:9b:98:2f:46 brd ff:ff:ff:ff:ff:ff
4: em3: mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether 00:21:9b:98:2f:48 brd ff:ff:ff:ff:ff:ff
5: em4: mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether 00:21:9b:98:2f:4a brd ff:ff:ff:ff:ff:ff
6: ;vdsmdummy;: mtu 1500 qdisc noop state DOWN
    link/ether 8e:1b:51:60:87:55 brd ff:ff:ff:ff:ff:ff
7: ovirtmgmt: mtu 1500 qdisc noqueue state UP
    link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff
    inet 192.168.8.13/24 brd 192.168.8.255 scope global dynamic ovirtmgmt
       valid_lft 54830sec preferred_lft 54830sec
11: em1.2@em1: mtu 1500 qdisc noqueue master Public_Cable state UP
    link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff
12: Public_Cable: mtu 1500 qdisc noqueue state UP
    link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff
    inet 162.248.147.33/24 brd 162.248.147.255 scope global Public_Cable
       valid_lft forever preferred_lft forever
14: vnet0: mtu 1500 qdisc pfifo_fast master ovirtmgmt state UNKNOWN qlen 500
    link/ether fe:1a:4a:16:01:54 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc1a:4aff:fe16:154/64 scope link
       valid_lft forever preferred_lft forever
15: vnet1: mtu 1500 qdisc pfifo_fast master ovirtmgmt state UNKNOWN qlen 500
    link/ether fe:1a:4a:16:01:52 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc1a:4aff:fe16:152/64 scope link
       valid_lft forever preferred_lft forever
16: vnet2: mtu 1500 qdisc pfifo_fast master ovirtmgmt state UNKNOWN qlen 500
    link/ether fe:1a:4a:16:01:53 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc1a:4aff:fe16:153/64 scope link
       valid_lft forever preferred_lft forever
17: vnet3: mtu 1500 qdisc pfifo_fast master Public_Cable state UNKNOWN qlen 500
    link/ether fe:1a:4a:16:01:51 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc1a:4aff:fe16:151/64 scope link
       valid_lft forever preferred_lft forever
[root@ovirt3 ~]# ip route
default via 192.168.8.1 dev ovirtmgmt
162.248.147.0/24 dev Public_Cable proto kernel scope link src 162.248.147.33
169.254.0.0/16 dev ovirtmgmt scope link metric 1007
169.254.0.0/16 dev Public_Cable scope link metric 1012
192.168.8.0/24 dev ovirtmgmt proto kernel scope link src 192.168.8.13
[root@ovirt3 ~]#

[root@ovirt3 ~]# brctl show
bridge name     bridge id           STP enabled     interfaces
;vdsmdummy;     8000.               no
Public_Cable    8000.00219b982f44   no              em1.2
                                                    vnet3
ovirtmgmt       8000.00219b982f44   no              em1
                                                    vnet0
                                                    vnet1
                                                    vnet2
[root@ovirt3 ~]#

I did see that the cluster settings have a switch type setting; it is currently at the default "LEGACY", and "OVS" is also an option. Not sure if that matters or not. I configured another VM on the network and statically assigned an IP, and it could ping the other VM as well as the host, but not the internet. The host can still ping the internet. --Jim ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
[ovirt-users] creating a vlan-tagged network
Hi all: I've got my ovirt cluster up, but am facing an odd situation that I haven't pinned down. I've also run into someone on the IRC channel with the same bug, no solutions as of yet. Google also hasn't helped. My goal is this: 1 physical NIC; two networks: ovirtmgmt (untagged) Public (vlan 2) ovirtmgmt works great. a VM on Public cannot talk to anything off the host. Steps to set up: Datacenter -> networks: created network, checked vm network, checked vlan, put 2 in the tag box. Set required. Save. I only have one cluster (default), and it automatically added it there. I went to the hosts in the cluster, and dragged the unassigned Public network onto the nic (which already has ovirtmgmt on it). After completing on all three of my hosts, the network shows online. Create VM, assign to Public, inside VM assign its IP, and it cannot talk to the world. In troubleshooting, I assigned another IP to the host itself (click pencil in host network settings). VM can ping host. SSH into host, host CAN ping other machines on the net and the router for the net. VM cannot ping anything but host (only have one VM on that host currently). VM is isolated until I move it to ovirtmgmt network, then it can get off the host to the world, etc. I tried disabling iptables just in case, but that had no effect. How do I troubleshoot this further? --Jim ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
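[A troubleshooting sketch for the setup described above: watch both sides of the VLAN device on the host while pinging from the VM, to see where the tagged traffic dies. Interface names (em1 physical NIC, em1.2 the VLAN 2 sub-interface VDSM creates) are taken from the host output later in this thread; adjust to your own.]

```shell
# Are the VM's frames reaching the wire with the VLAN 2 tag?
tcpdump -i em1 -nn -e vlan 2
# And the untagged view inside the VLAN device / bridge:
tcpdump -i em1.2 -nn icmp
# If outbound pings show up here but replies never come back, the problem is
# upstream of the host (switch VLAN config or provider-side filtering),
# not oVirt's network setup.
```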
[ovirt-users] New install: can't install engine
Hi all: I'm trying to set up a new ovirt cluster. I got it "mostly working" earlier, but wanted to change some physical networking stuff, and so I thought I'd blow away my machines and rebuild. I followed the same recipe to build it all, but now I'm failing at a point that previously worked. I've built a 3 node cluster with glusterfs backing (3 brick replica), and all that is good and well. I run hosted-engine --deploy, and it does its stuff, asks me (among other things) the admin password, I type in the password I want it to use (just like last time), then it says to log into the new VM and run engine-setup. Here's the problem: I try to ssh in as root, and it will NOT accept my password. It worked a couple days ago, doing it the exact same way, but it will not work now. I've destroyed and re-deployed several times; I've even done a low-level wipe of all three nodes and rebuilt everything, and again, it doesn't work. My only guess is that one of the packages the gdeploy script changed, and it has a bug or "new feature" that breaks this for some reason. Unfortunately, I do not have the package versions that worked or the current list to compare to, so I cannot verify this. In any case, I'm completely stuck here...I can't log in to run engine-setup, and I don't know enough of the console/low level stuff to try and hack my way into the VM (eg, to manually mount the disk image and replace the password or put my SSH key in). Suggestions? Can anyone else replicate this? --Jim ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
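[In case it helps anyone stuck at the same point: the libguestfs tools can set a root password or inject an SSH key into the appliance disk image offline, without booting it. A sketch; the image path is illustrative, and (as the thread's resolution shows) answering yes to hosted-engine-setup's cloud-init questions is the supported way to set the root password.]

```shell
# Offline edit of the engine appliance image (path is illustrative)
virt-customize -a /path/to/ovirt-engine-appliance.qcow2 \
  --root-password password:MyNewRootPass \
  --ssh-inject root:file:/root/.ssh/id_rsa.pub
```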
[ovirt-users] New oVirt user
Hello: I've been involved in virtualization from its very early days, and have been running linux virtualization solutions off and on for a decade. Previously, I was always frustrated with the long feature list offered by many linux virtualization systems but no reasonable way to manage it all; it seemed that I had to spend an inordinate amount of time doing everything by hand. Thus, when I found oVirt, I was ecstatic! Unfortunately, at that time I changed employment (or rather, left employment and became self-employed), and didn't have any reason to build my own virt cluster...until now! So I'm back with oVirt, and actually deploying a small 3-node cluster. I intend to run on it:

- VoIP server
- Web server
- Business backend server
- UniFi management server
- Monitoring server (zabbix)

Not a heavy load, and 3 servers is probably overkill, but I need this to work, and it sounds like 3 is the magic entry level for all the cluster/failover stuff to work. For now, my intent is to use a single SSD on each node with gluster for the storage backend. I figure with all the failover stuff actually working, if I lose a node due to disk failure, it's not the end of the world. I can rebuild it, reconnect gluster, and restart everything. As this is for a startup business, funds are thin at the moment, so I'm trying to cut a couple of corners that don't affect overall reliability. If this side of the business grows more, I would likely invest in some dedicated servers. So far, I've based my efforts around this guide on oVirt's website: http://www.ovirt.org/blog/2016/08/up-and-running-with-ovirt-4-0-and-gluster-storage/ My cluster is currently functioning, but not entirely correctly. Some of it is gut feel, some of it is specific test cases (more to follow). First, some areas that lacked clarity and the choices I made in them: Early on, Jason talks about using a dedicated gluster network for the gluster storage sync'ing. 
I liked that idea, and as I had 4 NICs on each machine, I thought dedicating one or two to gluster would be fine. So, on my clean, bare machines, I set up another network with private NICs and put it on a standalone switch. I added hostnames with a designator (-g on the end) for the IPs of all three nodes into /etc/hosts on all three nodes, so now each node can resolve itself and the other nodes by the -g name (and private IP) as well as by their main host name and "more public" (but not public) IP. Then, for gdeploy, I put the hostnames in as the -g hostnames, as I didn't see anywhere to tell gluster to use the private network. I think this is a place I went wrong, but I didn't realize it until the end. I set up the gdeploy script (it took a few times, and a few OS rebuilds, to get it just right...), and ran it, and it was successful! When complete, I had a working gluster cluster and the right software installed on each node! I set up the engine on node1, and that worked, and I was able to log in to the web gui. I mistakenly skipped the web gui's "enable gluster service" step before doing the engine vm reboot to complete the engine setup process, but I did go back in after the reboot and do that. After doing that, I was notified in the gui that there were additional nodes; did I want to add them? Initially, I skipped that and went back to the command line as Jason suggests. Unfortunately, it could not find any other nodes through his method, and it didn't work. Combine that with the warnings that I should not be using the command line method and that it would be removed in the next release, and I went back to the gui and attempted to add the nodes that way. Here's where things appeared to go wrong...It showed me two additional nodes, but ONLY by their -g (private gluster) hostname. And the ssh fingerprints were not populated, so it would not let me proceed. 
After messing with this for a bit, I realized that the engine cannot get to the nodes via the gluster interface (and as far as I knew, it shouldn't). Working late at night, I let myself "hack it up" a bit, and on the engine VM, I added /etc/hosts entries for the -g hostnames pointing to the main IPs. It then populated the ssh host keys and let me add them in. Ok, so things appear to be working...kinda. I noticed at this point that ALL aspects of the gui became VERY slow. Clicking in and typing in any field felt like I was on ssh over a satellite link. Everything felt a bit worse than the early days of vSphere. Painfully slow, but it was still working, so I pressed on. I configured gluster storage. Eventually I was successful, but initially it would only let me add a "Data" storage domain; the drop-down menu did NOT contain iso, export, or anything else... Somehow, on its own, after leaving and re-entering that tab a few times, iso and export materialized in the menu, so I was able to finish that setup. Ok, all looks good. I wanted to try out his little tip on adding a VM, too. I saw "ovirt-image-repository" in the "external provid
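[For concreteness, the -g naming scheme described above would look something like this in /etc/hosts on each node. All addresses and the ovirt1/2/3 names are illustrative, with the -g names resolving onto the private gluster network.]

```
# /etc/hosts (same entries on all three nodes)
192.168.8.11   ovirt1        # management network
192.168.8.12   ovirt2
192.168.8.13   ovirt3
10.10.10.11    ovirt1-g      # private gluster network
10.10.10.12    ovirt2-g
10.10.10.13    ovirt3-g
```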