Is this related to the NFS server which gluster provides, or is it because of the way gluster does replication?
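One way to narrow that down is to check how the hosted-engine storage is actually mounted on the hosts. A quick check, assuming a default oVirt 3.4-era hosted-engine setup (config path and key are assumptions based on that default layout):

    # the storage path the hosted-engine setup was given
    grep ^storage= /etc/ovirt-hosted-engine/hosted-engine.conf

    # whether that path is mounted as nfs or as native glusterfs
    mount | grep -Ei 'nfs|glusterfs'

A mount of type nfs pointing at a gluster volume would be the "nfs on top of gluster" combination mentioned below.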
There are a few posts, e.g. http://community.redhat.com/blog/2014/05/ovirt-3-4-glusterized/, which report success with gluster + hosted engine. So it'd be good to know, so we could possibly try a workaround.

Cheers.

On Fri, Jun 6, 2014 at 4:19 PM, Jiri Moskovcak <jmosk...@redhat.com> wrote:
> I've seen that problem in other threads; the common denominator was "nfs on
> top of gluster". So if you have this setup, then it's a known problem. You
> should also double-check that your hosts have different ids; otherwise they
> would be trying to acquire the same lock.
>
> --Jirka
>
>
> On 06/06/2014 08:03 AM, Andrew Lau wrote:
>>
>> Hi Ivan,
>>
>> Thanks for the in-depth reply.
>>
>> I've only seen this happen twice, and only after I added a third host
>> to the HA cluster. I wonder if that's the root problem.
>>
>> Have you seen this happen on all your installs, or only just after your
>> manual migration? It's a little frustrating that this is happening, as I
>> was hoping to get this into a production environment. It was all working
>> except for that log message :(
>>
>> Thanks,
>> Andrew
>>
>>
>> On Fri, Jun 6, 2014 at 3:20 PM, combuster <combus...@archlinux.us> wrote:
>>>
>>> Hi Andrew,
>>>
>>> This is something that I saw in my logs too, first on one node and then
>>> on the other three. When that happened on all four of them, the engine
>>> was corrupted beyond repair.
>>>
>>> First of all, I think that message is saying that sanlock can't get a
>>> lock on the shared storage that you defined for the hosted engine during
>>> installation. I got this error when I tried to manually migrate the
>>> hosted engine. There is an unresolved bug there, and I think it's related
>>> to this one:
>>>
>>> [Bug 1093366 - Migration of hosted-engine vm put target host score to
>>> zero]
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1093366
>>>
>>> This is a blocker bug (or should be) for the self-hosted engine and, from
>>> my own experience with it, it shouldn't be used in a production
>>> environment (not until it's fixed).
>>>
>>> Nothing I did could fix the fact that the score for the target node was
>>> zero: I tried reinstalling the node, rebooting the node, restarting
>>> several services, tailing tons of logs, etc., but to no avail. When only
>>> one node was left (the one actually running the hosted engine), I brought
>>> the engine's VM down gracefully (hosted-engine --vm-shutdown, I believe),
>>> and after that, when I tried to start the VM, it wouldn't load. VNC
>>> showed that the filesystem inside the VM was corrupted; I ran fsck and
>>> finally started it up, but it was too badly damaged. I succeeded in
>>> starting the engine itself (after repairing the postgresql service, which
>>> wouldn't start), but the database was damaged enough that it acted pretty
>>> weird (it showed that storage domains were down while the VMs were
>>> running fine, etc.). Luckily, I had already exported all of the VMs at
>>> the first sign of trouble and then installed ovirt-engine on a dedicated
>>> server and attached the export domain.
>>>
>>> So while it's really a useful feature, and it works for the most part
>>> (i.e. automatic migration works), manually migrating the VM with the
>>> hosted engine will lead to trouble.
>>>
>>> I hope my experience with it will be of use to you. It happened to me
>>> two weeks ago, ovirt-engine was current (3.4.1), and there was no fix
>>> available.
>>>
>>> Regards,
>>>
>>> Ivan
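To rule out the duplicate-id case Jirka mentions above: each HA host keeps its sanlock host id in the hosted-engine config, and sanlock itself can report the leases it holds or is trying to take. A minimal check, assuming the default oVirt config path:

    # run on every host in the HA cluster; the host_id values must all differ
    grep host_id /etc/ovirt-hosted-engine/hosted-engine.conf

    # sanlock's own view of held and requested leases on this host
    sanlock client status

If two hosts report the same host_id, they will compete for the same lease, which matches the "Failed to acquire lock" error further down in this thread.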
>>>
>>> On 06/06/2014 05:12 AM, Andrew Lau wrote:
>>>
>>> Hi,
>>>
>>> I'm seeing this weird message in my engine log:
>>>
>>> 2014-06-06 03:06:09,380 INFO
>>> [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
>>> (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id
>>> 85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds
>>> ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
>>> 2014-06-06 03:06:12,494 INFO
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>> (DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName =
>>> ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60,
>>> vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false,
>>> secondsToWait=0, gracefully=false), log id: 62a9d4c1
>>> 2014-06-06 03:06:12,561 INFO
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>> (DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id:
>>> 62a9d4c1
>>> 2014-06-06 03:06:12,652 INFO
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VM HostedEngine is down. Exit
>>> message: internal error Failed to acquire lock: error -243.
>>>
>>> It also appears on the other hosts in the cluster, except the host
>>> which is running the hosted engine. So right now, with 3 servers, it
>>> shows up twice in the engine UI.
>>>
>>> The engine VM continues to run peacefully, without any issues, on the
>>> host which doesn't have that error.
>>>
>>> Any ideas?
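For anyone else chasing the same symptom: the engine log only shows the final "Failed to acquire lock: error -243" message, and the host side is usually more informative. A rough way to correlate, assuming the default sanlock log location and an oVirt 3.4-era hosted-engine setup:

    # per-host HA score and engine VM state, as the HA agents see it
    hosted-engine --vm-status

    # sanlock's record of acquire attempts and failures on this host
    grep -i acquire /var/log/sanlock.log | tail -n 20

A score of 0 on the target host, as in bug 1093366 above, means the HA agent will refuse to start or migrate the engine VM there.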