Re: [ovirt-users] Glusterfs HA with Ovirt
I was never able to achieve a stable system that could survive the loss of a single node with glusterfs. I attempted to use replica 2 across 3 nodes (which required 2 bricks per node, as the number of bricks must be a multiple of the replica count, and you have to order them so the brick pairs span servers). I enabled server-side quorum, but found out later that client-side quorum is based on 'sub volumes', which means that with a single node failure on replica 2, even though there were 3 nodes, it would go into a read-only state.

After disabling client-side quorum (but keeping server-side quorum) I thought the issue was fixed, but every once in a while, rebooting one of the nodes (after ensuring gluster was healed) would lead to I/O errors on the VM guest and essentially make it so it needed to be rebooted (which was successful, and everything worked afterwards even before bringing the downed node back up). My nodes were all combined glusterfs and ovirt nodes. I tried using both 'localhost' on the nodes as well as a keepalived VIP. It's possible my issues were all due to client-side quorum not being enabled, but that would require replica 3 to be able to survive a single node failure; I never pursued testing that theory.

Also, heal times seemed a bit long: healing a single idle VM would consume 2 full cores of CPU for about 5 minutes (granted, I was testing on a 1Gbps network, but that doesn't explain the CPU usage).

-Brad

On 7/4/14 1:29 AM, Andrew Lau wrote:

As long as all your compute nodes are part of the gluster peer, localhost will work. Just remember, gluster will connect to any server, so even if you mount as localhost:/ it could be accessing the storage from another host in the gluster peer group.

On Fri, Jul 4, 2014 at 3:26 PM, Punit Dambiwal wrote:

Hi Andrew,

Yes... both on the same node... but I have 4 nodes of this type in the same cluster. So it should work or not?

1. 4 physical nodes with 12 bricks each (distributed replicated)...
2. The same 4 nodes are used for compute purposes also...

Do I still require the VIP or not? Because I tested that even if the mount point node goes down, the VM does not pause and is not affected...

On Fri, Jul 4, 2014 at 1:18 PM, Andrew Lau wrote:

Or just localhost, as your compute and storage are on the same box.

On Fri, Jul 4, 2014 at 2:48 PM, Punit Dambiwal wrote:

Hi Andrew,

Thanks for the update... that means HA cannot work without a VIP in gluster, so it's better to use glusterfs with a VIP to take over the IP in case of any storage node failure...

On Fri, Jul 4, 2014 at 12:35 PM, Andrew Lau wrote:

Don't forget to take quorum into consideration, that's something people often forget. The reason you're seeing the current behaviour is that gluster only uses the initial IP address to get the volume details. After that it'll connect directly to ONE of the servers, so with your 2 storage server case, there's a 50% chance it won't go into the paused state.

For the VIP, you could consider CTDB or keepalived, or even just using localhost (as your storage and compute are all on the same machine). For CTDB, check out http://community.redhat.com/blog/2014/05/ovirt-3-4-glusterized/

I have a BZ open regarding gluster VMs going into paused state and not being resumable, so it's something you should also consider. In my case, the switch dies, the gluster volume goes away, VMs go into paused state but can't be resumed. Losing one server out of a cluster is a different story though.

https://bugzilla.redhat.com/show_bug.cgi?id=1058300

HTH

On Fri, Jul 4, 2014 at 11:48 AM, Punit Dambiwal wrote:

Hi,

Thanks... can you suggest any good how-to/article for glusterfs with ovirt?

One strange thing: if I try both (compute & storage) on the same node, the quote below does not happen:

- Right now, if 10.10.10.2 goes away, all your gluster mounts go away and your VMs get paused because the hypervisors can't access the storage. Your gluster storage is still fine, but ovirt can't talk to it because 10.10.10.2 isn't there.

- Even if 10.10.10.2 goes down... I can still access the gluster mounts and no VM pauses... I can access the VM via ssh... no connection failure. The connection drops only when the SPM goes down and another node is elected as SPM (all the running VMs pause in this condition).

On Fri, Jul 4, 2014 at 4:12 AM, Darrell Budic wrote:

You need to set up a virtual IP to use as the mount point; most people use keepalived to provide a virtual IP via VRRP for this. Set up something like 10.10.10.10 and use that for your mounts. Right now, if 10.10.10.2 goes away, all your gluster mounts go away and your VMs get paused because the hypervisors can't access the storage. Your gluster storage is still fine, but ovirt can't talk to it because 10.10.10.2 isn't there. If the SPM goes down, the other hypervisor hosts will elect a new one (under control of the ovirt engine). Same scenar
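[Editor's note] For anyone else fighting the same quorum behaviour: the client- and server-side quorum settings discussed above are ordinary gluster volume options. A rough sketch of a replica 3 volume with both enabled (volume name, hostnames and brick paths are placeholders):

  # replica 3, one brick per node, so a single node failure still leaves a majority
  gluster volume create vmstore replica 3 \
      node1:/bricks/b1/vmstore \
      node2:/bricks/b1/vmstore \
      node3:/bricks/b1/vmstore

  # client-side quorum: clients only write while a majority of the replica set is up
  gluster volume set vmstore cluster.quorum-type auto

  # server-side quorum: a node that loses contact with the pool majority kills its bricks
  gluster volume set vmstore cluster.server-quorum-type server
  gluster volume set all cluster.server-quorum-ratio 51%

  gluster volume start vmstore

As I understand it, with replica 2 and cluster.quorum-type auto a subvolume stays writable on one surviving brick only if that brick is the first of the pair, which matches the read-only behaviour Brad saw; replica 3 is what actually buys single-node tolerance.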
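[Editor's note] And for the keepalived VIP approach Darrell and Andrew mention, a minimal single-VIP sketch (interface name, router id, priorities and addresses are placeholders; each node gets the same file with a different priority):

  # /etc/keepalived/keepalived.conf
  vrrp_instance gluster_vip {
      state BACKUP              # let VRRP elect the master
      interface em1             # NIC on the storage network
      virtual_router_id 51
      priority 100              # use e.g. 90/80 on the other nodes
      advert_int 1
      virtual_ipaddress {
          10.10.10.10/24
      }
  }

The storage domain is then mounted as 10.10.10.10:/vmstore. Keep Andrew's caveat in mind: the mount address is only used to fetch the volume layout, after which the client talks to the bricks directly.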
Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243
Ok, I thought I was doing something wrong yesterday, so I just tore down my 3-node cluster with the hosted engine and started rebuilding. I was seeing essentially the same thing: a score of 0 on the hosts not running the engine, and it wouldn't allow migration of the hosted engine. I played with all things related to setting maintenance and rebooting hosts; nothing brought them up to a point where I could migrate the hosted engine.

I thought it was related to ovirt messing up when deploying the other hosts (I told it not to modify the firewall that I had disabled, but the deploy process forcibly re-enabled the firewall, which gluster really didn't like). Now after reading this it appears my assumption may be false. Previously a 2-node cluster I had worked fine, but I wanted to go to 3 nodes so I could enable quorum on gluster and not risk split-brain issues.

-Brad

On 6/10/14 1:19 AM, Andrew Lau wrote:

I'm really having a hard time finding out why it's happening.. If I set the cluster to global maintenance for a minute or two, the scores will reset back to 2400. Set maintenance mode to none, and all will be fine until a migration occurs. It seems it tries to migrate, fails, and sets the score to 0 permanently rather than for the 10(?) minutes mentioned in one of the ovirt slides.

When I have two hosts, the score is 0 only when a migration occurs (just on the host which doesn't have the engine up). The score 0 only happens when it's tried to migrate after I set the host to local maintenance. Migrating the VM from the UI has worked quite a few times, but it's recently started to fail.

When I have three hosts, after ~5 minutes of them all being up, the score will hit 0 on the hosts not running the VM. It doesn't even have to attempt to migrate before the score goes to 0. Stopping the ha agent on one host and "resetting" it with the global maintenance method brings it back to the 2-host scenario above.

I may move on and just go back to a standalone engine, as this is not getting very much luck..

On Tue, Jun 10, 2014 at 3:11 PM, combuster wrote:

Nah, I've explicitly allowed the hosted-engine VM to be able to access the NAS device as the NFS share itself, before the deploy procedure even started. But I'm puzzled at how you can reproduce the bug; all was well on my setup before I started a manual migration of the engine's VM. Even auto migration worked before that (tested it). Does it just happen without any procedure on the engine itself? Is the score 0 for just one node, or for two of the three of them?

On 06/10/2014 01:02 AM, Andrew Lau wrote:

nvm, just as I hit send the error has returned. Ignore this..

On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau wrote:

So after adding the L3 capabilities to my storage network, I'm no longer seeing this issue. So the engine needs to be able to access the storage domain it sits on? But that doesn't show up in the UI? Ivan, was this also the case with your setup? The engine couldn't access the storage domain?

On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau wrote:

Interesting, my storage network is L2 only and doesn't run on the ovirtmgmt network (which is the only thing HostedEngine sees), but I've only seen this issue when running ctdb in front of my NFS server. I previously was using localhost, as all my hosts had the NFS server on them (gluster).

On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov wrote:

I just blocked the connection to storage for testing, but as a result I had this error: "Failed to acquire lock error -243", so I added it to the reproduce steps. If you know other steps to reproduce this error without blocking the connection to storage, it would also be wonderful if you could provide them. Thanks.

- Original Message -
From: "Andrew Lau"
To: "combuster"
Cc: "users"
Sent: Monday, June 9, 2014 3:47:00 AM
Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

I just ran a few extra tests. I had a 2-host hosted-engine setup running for a day. They both had a score of 2400. Migrated the VM through the UI multiple times, all worked fine. I then added the third host, and that's when it all fell to pieces. The other two hosts have a score of 0 now.

I'm also curious, in the BZ there's a note about:

  where engine-vm block connection to storage domain (via iptables -I INPUT -s sd_ip -j DROP)

What's the purpose of that?

On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau wrote:

Ignore that, the issue came back after 10 minutes. I've even tried a gluster mount + NFS server on top of that, and the same issue has come back.

On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau wrote:

Interesting, I put it all into global maintenance, shut it all down for ~10 minutes, and it's regained its sanlock control and doesn't seem to have that issue coming up in the log.

On Fri, Jun 6, 2014 at 4:21 PM, combuster wrote:

It was pure NFS on a NAS device. They all had different ids (had no redeployments of nodes before the problem occurred).

Thanks Jirka.

On 06/06/2014 08:19 AM, Jiri Mosk
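[Editor's note] The "set the cluster to global for a minute or two" trick Andrew describes is driven from the hosted-engine CLI on any of the hosts; a sketch of the sequence (the scores themselves come from ovirt-ha-agent, not from these commands):

  # pause the HA agents so they stop acting on the (stale) scores
  hosted-engine --set-maintenance --mode=global

  # watch the score each host's ha-agent reports
  hosted-engine --vm-status

  # once scores look sane again (e.g. back at 2400), re-enable HA
  hosted-engine --set-maintenance --mode=none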
Re: [Users] [ANN] oVirt 3.4.0 Release Candidate is now available
You're welcome to join us testing this release candidate in next week's test day [2], scheduled for 2014-03-06!

[1] http://www.ovirt.org/OVirt_3.4.0_release_notes
[2] http://www.ovirt.org/OVirt_3.4_Test_Day

Known issues should list some information about Gluster, I think. Such as the fact that libgfapi is not currently being used even when choosing GlusterFS instead of POSIXFS; instead it creates a POSIX mount and uses that. This was an advertised 3.3 feature, so this would be considered a regression or known issue, right? I was told it was due to BZ #1017289. This has been observed on Fedora 19, though that BZ lists RHEL6.

Thanks!

-Brad
[Users] oVirt 3.4 pre-release and GlusterFS support (F19)
I've been testing the 3.4 pre-release on Fedora 19. When I create a GlusterFS (not POSIXFS) storage domain and create a VM with a disk image on that storage domain, I see a POSIX mount created on the host. Upon further investigation, when examining the executed qemu command line, it doesn't appear qemu is being told to use libgfapi, but rather that previously observed POSIX mount.

One other note: I'm specifically testing the hosted engine, and haven't tested the non-hosted variant.

The question is: is this expected behavior, and if so, is it because of the hosted engine? Or is this some form of regression from the advertised feature list of oVirt 3.3? Anything I should try or look at? I'm obviously concerned about the FUSE overhead with Gluster and would like to avoid that if possible.

Thanks!

-Brad
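[Editor's note] A quick way to see which path qemu is actually using (a sketch; the exact mount path under /rhev varies by setup): with libgfapi the -drive argument carries a gluster:// URL, whereas with a FUSE/POSIX mount it points at a file under the mounted path.

  # is the storage domain mounted via FUSE?
  mount | grep fuse.glusterfs

  # inspect the disk argument qemu was started with
  ps -ef | grep [q]emu | tr ' ' '\n' | grep '^file='
  #   file=gluster://<host>/<volume>/<image>    -> libgfapi in use
  #   file=/rhev/data-center/mnt/glusterSD/...  -> FUSE (POSIX) mount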
Re: [Users] oVirt 3.4.0 beta - Hosted Engine Setup -- issues
On 01/23/2014 09:24 AM, Sandro Bonazzola wrote:

# service vdsmd status
> Redirecting to /bin/systemctl status vdsmd.service
> vdsmd.service - Virtual Desktop Server Manager
>    Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled)
>    Active: failed (Result: start-limit) since Thu 2014-01-23 09:16:38 EST; 1min 26s ago
>   Process: 2387 ExecStopPost=/usr/libexec/vdsm/vdsmd_init_common.sh --post-stop (code=exited, status=0/SUCCESS)
>   Process: 2380 ExecStart=/usr/share/vdsm/daemonAdapter -0 /dev/null -1 /dev/null -2 /dev/null /usr/share/vdsm/vdsm (code=exited, status=1/FAILURE)
>   Process: 2328 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
>
> Jan 23 09:16:38 ovirttest.internal.monetra.com systemd[1]: Unit vdsmd.service entered failed state.
> Jan 23 09:16:38 ovirttest.internal.monetra.com systemd[1]: vdsmd.service holdoff time over, scheduling restart.
> Jan 23 09:16:38 ovirttest.internal.monetra.com systemd[1]: Stopping Virtual Desktop Server Manager...
> Jan 23 09:16:38 ovirttest.internal.monetra.com systemd[1]: Starting Virtual Desktop Server Manager...
> Jan 23 09:16:38 ovirttest.internal.monetra.com systemd[1]: vdsmd.service start request repeated too quickly, refusing to start.

What's this? It's the first start of vdsmd!

I'm not sure I understand the meaning of this reply :/

Thanks.

-Brad
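[Editor's note] For anyone hitting the same output: the "start request repeated too quickly" line is just systemd's start-limit kicking in after ExecStart failed several times in a row; the real problem is whatever made daemonAdapter exit with status 1. A sketch of where to look (log paths assume a default vdsm install):

  # see the repeated failed starts and the start-limit message
  journalctl -u vdsmd.service --since today

  # vdsm's own log usually contains the actual traceback
  less /var/log/vdsm/vdsm.log

  # clear the start-limit counter and try again once the cause is fixed
  systemctl reset-failed vdsmd.service
  systemctl start vdsmd.service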
Re: [Users] hosted-engine setup fails on RHEL6
On 01/23/2014 08:24 AM, Frank Wall wrote:

Hi,

I'm currently trying to set up a hosted engine on a RHEL6 host with the nightly repository (because 3.4 BETA didn't work either):

[ ERROR ] Failed to execute stage 'Environment customization': [Errno 111] Connection refused
[ INFO ] Stage: Clean up
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination

Any hint?

Thanks
- Frank

Yep, I get the _same_ exact issue. Please see my e-mail chain from yesterday with the subject line: "oVirt 3.4.0 beta - Hosted Engine Setup -- issues"

And look at Andrew Lau's responses; he also experienced the same issue ... with a potential work-around which I have yet to try.

-Brad
Re: [Users] oVirt 3.4.0 beta - Hosted Engine Setup -- issues
On 1/23/14 7:00 AM, Andrew Lau wrote:

Good luck! If you get time it'd really be great if you could post those logs (ovirt-hosted-engine-setup.log and vdsm.log) to BZ 1055153 for me? It'd help them debug the issue and save me from having to find a new spare server. I spent a good 2 days trying to work through the alpha jungle, so hope this helps :)

No problem, I'll do that before I rebuild the server to follow your procedure. Hopefully I'll be able to do it today since it is test day, but unfortunately I've got meetings planned for most of the day :/

-Brad
Re: [Users] oVirt 3.4.0 beta - Hosted Engine Setup -- issues
On 1/22/14 11:42 PM, Andrew Lau wrote:

That sounds exactly like what I did. Do you mind putting those log files into the BZ? It saves me from having to find some hardware and replicate it again.

So what I did was I ended up just nuking the whole OS and getting a clean start. First, after doing all your initial prep EXCEPT configuring your 4 NICs, run the "hosted-engine --deploy" command twice: assuming the first run always fails like in my case, or else once you get to the "Configure Storage" phase press Ctrl+D to exit. Now configure your NICs (also configure ovirtmgmt manually, as there is another BZ about it not being able to create the bridge) and rerun "hosted-engine --deploy", and you should be back in action. This should get you to a working hosted-engine solution. A rough command-level sketch of this follows below.

P.S. Could you add me in the CC when you reply? I would've seen your message sooner.

Weird, very very weird. I'll give it a shot and see what happens. Thanks!

-Brad
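[Editor's note] Andrew's recipe, roughly as a command sequence. This is only one reading of it; the ifcfg file names are placeholders for whatever NIC layout you actually have:

  # 1st run: expect it to fail, or bail out with Ctrl+D once you reach
  #          the "Configure Storage" prompt
  hosted-engine --deploy

  # now set up the NICs/bond and create the ovirtmgmt bridge by hand
  vi /etc/sysconfig/network-scripts/ifcfg-em1        # one per NIC
  vi /etc/sysconfig/network-scripts/ifcfg-bond0
  vi /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
  service network restart

  # 2nd run should now get past environment customization
  hosted-engine --deploy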
Re: [Users] oVirt 3.4.0 beta - Hosted Engine Setup -- issues
On 01/22/2014 05:42 PM, Andrew Lau wrote:

Hi,

This looks like this BZ which I reported: https://bugzilla.redhat.com/show_bug.cgi?id=1055153

Did you customize your NICs before you tried running hosted-engine --deploy?

Thanks,
Andrew

Yes, I created all the /etc/sysconfig/network-scripts/ifcfg-* files for my 4 NICs in the bond, then also created the ifcfg-bond0, as well as an ifcfg-ovirtmgmt which uses bond0. So the oVirt management interface should be fully configured before even running hosted-engine --deploy ... when I was testing the all-in-one (non-hosted) setup I had to do that, so I figured it was a good idea for hosted too.

I'm trying to understand your BZ 1055153: did you actually get it to work? Or did you get stuck at that point? I have seen the same issue as your BZ 1055129 as well.

-Brad
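[Editor's note] For completeness, a sketch of what the pre-created bond and bridge files can look like (addresses, bonding mode and NIC names are placeholders; this mirrors the bond0/ovirtmgmt layout Brad describes further down):

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  ONBOOT=yes
  BOOTPROTO=none
  BONDING_OPTS="mode=balance-alb miimon=100"
  BRIDGE=ovirtmgmt
  NM_CONTROLLED=no

  # /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
  DEVICE=ovirtmgmt
  TYPE=Bridge
  ONBOOT=yes
  BOOTPROTO=none
  IPADDR=192.168.1.10        # placeholder static address
  NETMASK=255.255.255.0
  GATEWAY=192.168.1.1
  NM_CONTROLLED=no

  # each physical NIC (em1, em2, p2p1, p2p2) just points at the bond, e.g.:
  # DEVICE=em1
  # ONBOOT=yes
  # MASTER=bond0
  # SLAVE=yes
  # NM_CONTROLLED=no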
[Users] oVirt 3.4.0 beta - Hosted Engine Setup -- issues
I'm trying to test out the oVirt Hosted Engine, but am experiencing a failure early on and was hoping someone could point me in the right direction. I'm not familiar enough with the architecture of oVirt to start to debug this situation.

Basically, I run the "hosted-engine --deploy" command and it outputs:

[ INFO ] Stage: Initializing
         Continuing will configure this host for serving as hypervisor and create a VM where you have to install oVirt Engine afterwards.
         Are you sure you want to continue? (Yes, No)[Yes]:
[ INFO ] Generating a temporary VNC password.
[ INFO ] Stage: Environment setup
         Configuration files: []
         Log file: /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20140122110741.log
         Version: otopi-1.2.0_beta (otopi-1.2.0-0.1.beta.fc19)
[ INFO ] Hardware supports virtualization
[ INFO ] Bridge ovirtmgmt already created
[ INFO ] Stage: Environment packages setup
[ INFO ] Stage: Programs detection
[ INFO ] Stage: Environment setup
[ INFO ] Waiting for VDSM hardware info
[ INFO ] Waiting for VDSM hardware info
[ INFO ] Waiting for VDSM hardware info
[ INFO ] Waiting for VDSM hardware info
[ INFO ] Waiting for VDSM hardware info
[ INFO ] Waiting for VDSM hardware info
[ INFO ] Waiting for VDSM hardware info
[ INFO ] Waiting for VDSM hardware info
[ INFO ] Waiting for VDSM hardware info
[ INFO ] Waiting for VDSM hardware info
[ INFO ] Stage: Environment customization

         --== STORAGE CONFIGURATION ==--

         During customization use CTRL-D to abort.
[ ERROR ] Failed to execute stage 'Environment customization': [Errno 111] Connection refused
[ INFO ] Stage: Clean up
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination

/var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20140122110741.log has:

2014-01-22 11:07:57 DEBUG otopi.plugins.ovirt_hosted_engine_setup.storage.storage storage._check_existing_pools:631 getConnectedStoragePoolsList
2014-01-22 11:07:57 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 142, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/storage/storage.py", line 729, in _customization
    self._check_existing_pools()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/storage/storage.py", line 632, in _check_existing_pools
    pools = self.serv.s.getConnectedStoragePoolsList()
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1224, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1578, in __request
    verbose=self.__verbose
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1264, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1292, in single_request
    self.send_content(h, request_body)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1439, in send_content
    connection.endheaders(request_body)
  File "/usr/lib64/python2.7/httplib.py", line 969, in endheaders
    self._send_output(message_body)
  File "/usr/lib64/python2.7/httplib.py", line 829, in _send_output
    self.send(msg)
  File "/usr/lib64/python2.7/httplib.py", line 791, in send
    self.connect()
  File "/usr/lib64/python2.7/site-packages/vdsm/SecureXMLRPCServer.py", line 188, in connect
    sock = socket.create_connection((self.host, self.port), self.timeout)
  File "/usr/lib64/python2.7/socket.py", line 571, in create_connection
    raise err
error: [Errno 111] Connection refused

2014-01-22 11:07:57 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Environment customization': [Errno 111] Connection refused

Unfortunately I do not know what service this is trying to connect to, or what hostname or port to try to start debugging that.

Some other useful information about my environment:

- Fedora 19 (64bit), minimal install, selected 'standard' add-on utilities. This was a fresh install just for this test.
- 512MB /boot ext4
- 80GB / ext4 in LVM, 220GB free in VG
- "yum -y update" performed to get all latest updates
- SElinux in permissive mode
- Hardware:
  - Supermicro 1026T-URF barebones
  - single CPU populated (Xeon E5630 4x2.53GHz)
  - 12GB ECC DDR3 RAM
  - H/W Raid with SSDs
- Networking:
  - Network Manager DISABLED
  - 4 GbE ports (p2p1, p2p2, em1, em2)
  - all 4 ports configured in a bond (bond0) using balance-alb
  - ovirtmgmt bridge pre-created with 'bond0' as the only member, assigned a static IP address
- firewall DISABLED
- /etc/yum.repos.d/ovirt.rep has ONLY the 3.4.0 beta repo enabled:

  [ovirt-3.4.0-beta]
  name=3.4.0 beta testing repo for the oVirt project
  baseurl=http://ovirt.org/relea
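[Editor's note] For anyone debugging the same "Connection refused": the traceback shows the setup script calling vdsm's XML-RPC interface (getConnectedStoragePoolsList), so the refused connection is almost certainly vdsmd not running or not listening on its port (54321 by default, if memory serves). A quick check, as a sketch:

  # is vdsmd up at all?
  systemctl status vdsmd.service

  # is anything listening on vdsm's default XML-RPC port?
  ss -tlnp | grep 54321

This ties back to the "vdsmd.service start request repeated too quickly" thread above: if vdsmd never comes up, hosted-engine --deploy fails exactly like this.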