Interesting, my storage network is L2 only and doesn't run on ovirtmgmt (which is the only network the HostedEngine sees), but I've only seen this issue when running ctdb in front of my NFS server. Previously I was using localhost, as all my hosts had the NFS server on them (gluster).
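In case it's useful to anyone else chasing the -243 error, this is roughly what I've been checking on each HA host. Treat it as a rough sketch from my 3.4 setup; the config path and output may differ on other versions:

  # every HA host needs a unique host_id here, otherwise two hosts end up
  # competing for the same sanlock lease
  grep host_id /etc/ovirt-hosted-engine/hosted-engine.conf

  # lockspaces and resources sanlock currently holds on this host
  sanlock client status

  # HA scores and which host currently owns the engine VM, as the agents see it
  hosted-engine --vm-status

If two hosts ever report the same host_id there, that alone is enough to cause the acquire-lock failures Jirka mentions further down the thread.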
On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov <aluki...@redhat.com> wrote:
> I just blocked the connection to storage for testing, but as a result I got this
> error: "Failed to acquire lock error -243", so I added it to the reproduce steps.
> If you know other steps that reproduce this error without blocking the
> connection to storage, it would be wonderful if you could provide them.
> Thanks
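If I'm reading that BZ step right, the reproducer is roughly the following, run inside the engine VM (sd_ip stays a placeholder for the storage domain's address, as in the BZ):

  # block all traffic coming from the storage domain's address
  iptables -I INPUT -s sd_ip -j DROP

  # remove the rule again once done testing
  iptables -D INPUT -s sd_ip -j DROP

That would explain how the BZ's own reproducer ends up at -243, though nothing is blocking storage on my hosts here.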
> ----- Original Message -----
> From: "Andrew Lau" <and...@andrewklau.com>
> To: "combuster" <combus...@archlinux.us>
> Cc: "users" <users@ovirt.org>
> Sent: Monday, June 9, 2014 3:47:00 AM
> Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243
>
> I just ran a few extra tests. I had a 2-host hosted-engine setup running
> for a day; they both had a score of 2400. I migrated the VM through the
> UI multiple times and it all worked fine. I then added the third host, and
> that's when it all fell to pieces.
> The other two hosts have a score of 0 now.
>
> I'm also curious, in the BZ there's a note about:
>
> where engine-vm block connection to storage domain (via iptables -I
> INPUT -s sd_ip -j DROP)
>
> What's the purpose of that?
>
> On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau <and...@andrewklau.com> wrote:
>> Ignore that, the issue came back after 10 minutes.
>>
>> I've even tried a gluster mount + NFS server on top of that, and the
>> same issue has come back.
>>
>> On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau <and...@andrewklau.com> wrote:
>>> Interesting, I put it all into global maintenance, shut it all down
>>> for ~10 minutes, and it regained its sanlock control and doesn't
>>> seem to have that issue coming up in the log.
>>>
>>> On Fri, Jun 6, 2014 at 4:21 PM, combuster <combus...@archlinux.us> wrote:
>>>> It was pure NFS on a NAS device. They all had different ids (there were no
>>>> redeployments of nodes before the problem occurred).
>>>>
>>>> Thanks Jirka.
>>>>
>>>> On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:
>>>>> I've seen that problem in other threads; the common denominator was "nfs
>>>>> on top of gluster", so if you have that setup, it's a known problem.
>>>>> Otherwise, you should double-check that your hosts have different ids,
>>>>> or they will be trying to acquire the same lock.
>>>>>
>>>>> --Jirka
>>>>>
>>>>> On 06/06/2014 08:03 AM, Andrew Lau wrote:
>>>>>> Hi Ivan,
>>>>>>
>>>>>> Thanks for the in-depth reply.
>>>>>>
>>>>>> I've only seen this happen twice, and only after I added a third host
>>>>>> to the HA cluster. I wonder if that's the root problem.
>>>>>>
>>>>>> Have you seen this happen on all your installs or only just after your
>>>>>> manual migration? It's a little frustrating this is happening as I was
>>>>>> hoping to get this into a production environment. It was all working
>>>>>> except for that log message :(
>>>>>>
>>>>>> Thanks,
>>>>>> Andrew
>>>>>>
>>>>>> On Fri, Jun 6, 2014 at 3:20 PM, combuster <combus...@archlinux.us> wrote:
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> this is something that I saw in my logs too, first on one node and then
>>>>>>> on the other three. When that happened on all four of them, the engine
>>>>>>> was corrupted beyond repair.
>>>>>>>
>>>>>>> First of all, I think that message is saying that sanlock can't get a
>>>>>>> lock on the shared storage that you defined for the hosted engine during
>>>>>>> installation. I got this error when I tried to manually migrate the
>>>>>>> hosted engine. There is an unresolved bug there, and I think it's related
>>>>>>> to this one:
>>>>>>>
>>>>>>> [Bug 1093366 - Migration of hosted-engine vm put target host score to zero]
>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1093366
>>>>>>>
>>>>>>> This is a blocker bug (or should be) for the self-hosted engine and, from
>>>>>>> my own experience with it, it shouldn't be used in a production
>>>>>>> environment (not until it's fixed).
>>>>>>>
>>>>>>> Nothing I did could fix the fact that the score for the target node was
>>>>>>> zero: I tried reinstalling the node, rebooting the node, restarting
>>>>>>> several services, tailing tons of logs etc., but to no avail. When only
>>>>>>> one node was left (the one actually running the hosted engine), I brought
>>>>>>> the engine's VM down gracefully (hosted-engine --vm-shutdown, I believe)
>>>>>>> and after that, when I tried to start the VM, it wouldn't load. VNC
>>>>>>> showed that the filesystem inside the VM was corrupted, and even after I
>>>>>>> ran fsck and finally got it started, it was too badly damaged. I managed
>>>>>>> to start the engine itself (after repairing the postgresql service, which
>>>>>>> wouldn't start), but the database was damaged enough that it acted pretty
>>>>>>> weird (it showed that storage domains were down while the VMs were
>>>>>>> running fine, etc.). Luckily, I had already exported all of the VMs at
>>>>>>> the first sign of trouble, so I then installed ovirt-engine on a
>>>>>>> dedicated server and attached the export domain.
>>>>>>>
>>>>>>> So while it is a really useful feature and it works for the most part
>>>>>>> (i.e. automatic migration works), manually migrating the VM with the
>>>>>>> hosted engine will lead to trouble.
>>>>>>>
>>>>>>> I hope my experience with it will be of use to you. It happened to me two
>>>>>>> weeks ago, ovirt-engine was current (3.4.1), and there was no fix
>>>>>>> available.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Ivan
>>>>>>>
>>>>>>> On 06/06/2014 05:12 AM, Andrew Lau wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm seeing this weird message in my engine log:
>>>>>>>
>>>>>>> 2014-06-06 03:06:09,380 INFO
>>>>>>> [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
>>>>>>> (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id
>>>>>>> 85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds
>>>>>>> ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
>>>>>>> 2014-06-06 03:06:12,494 INFO
>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName =
>>>>>>> ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60,
>>>>>>> vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false,
>>>>>>> secondsToWait=0, gracefully=false), log id: 62a9d4c1
>>>>>>> 2014-06-06 03:06:12,561 INFO
>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id: 62a9d4c1
>>>>>>> 2014-06-06 03:06:12,652 INFO
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>>> (DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack: null,
>>>>>>> Custom Event ID: -1, Message: VM HostedEngine is down. Exit
>>>>>>> message: internal error Failed to acquire lock: error -243.
>>>>>>>
>>>>>>> It also appears to occur on the other hosts in the cluster, except the
>>>>>>> host which is running the hosted-engine. So right now, with 3 servers, it
>>>>>>> shows up twice in the engine UI.
>>>>>>>
>>>>>>> The engine VM continues to run peacefully, without any issues, on the
>>>>>>> host which doesn't have that error.
>>>>>>>
>>>>>>> Any ideas?
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users