On 06/10/2014 07:19 AM, Andrew Lau wrote:
I'm really having a hard time figuring out why it's happening.

If I set the cluster to global maintenance for a minute or two, the scores will
reset back to 2400. Set the maintenance mode back to none, and all will be fine
until a migration occurs. It seems it tries to migrate, fails, and sets
the score to 0 permanently rather than for the ~10 minutes mentioned in
one of the oVirt slides.
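
For reference, the "reset" sequence I'm using is roughly this (from memory, so
treat it as a sketch rather than exact syntax):

    # on any HA host: put the whole cluster into global maintenance
    hosted-engine --set-maintenance --mode=global

    # wait a minute or two, then check that the scores are back to 2400
    hosted-engine --vm-status

    # leave maintenance again
    hosted-engine --set-maintenance --mode=none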

When I have two hosts, the score only drops to 0 when a migration occurs
(and only on the host which doesn't have the engine up). It only happens
when it tries to migrate after I set the host to local maintenance.
Migrating the VM from the UI has worked quite a few times, but it's
recently started to fail.

When I have three hosts, after ~5 minutes of them all being up the score
will hit 0 on the hosts not running the engine VM. It doesn't even have to
attempt a migration before the score goes to 0. Stopping the ha agent
on one host, and "resetting" it with the global maintenance method,
brings it back to the two-host scenario above.
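
By "stopping the ha agent and resetting it" I mean something along these lines
(EL6-style service commands assumed; substitute systemctl where appropriate):

    # on the host whose score dropped to 0
    service ovirt-ha-agent stop

    # on one of the remaining hosts, flip global maintenance on and off
    hosted-engine --set-maintenance --mode=global
    hosted-engine --set-maintenance --mode=none

    # then bring the agent back up
    service ovirt-ha-agent start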

I may move on and just go back to a standalone engine, as I'm not having
much luck with this.
Well, I've done this already. I can't really afford so much unplanned downtime on my critical VMs, especially since it would take me several hours (even a whole day) to install a dedicated engine, then set up the nodes if need be, and then import the VMs from the export domain.

I would love to help more to resolve this one, but I was pressed for time: I already had oVirt 3.3 running (rock solid for a year and a half, starting from 3.1 I think), and I couldn't spare more than a day trying to get around this bug (I had to have a setup running by the end of the weekend). I wasn't using gluster at all, so at least we now know that gluster is not a must in the mix. Besides, Artyom already described it nicely in the bug report, so I haven't had anything to add.

You were lucky, Andrew. When I tried the global maintenance method and restarted the VM, I got a corrupted filesystem on the engine VM and it wouldn't even start on the one node that had a good score. It was "bad health" or unknown state on all of the nodes. I managed to repair the fs on the VM via VNC and just barely bring the services online, but the postgres db was too badly damaged, so the engine misbehaved.

At the time, I explained it to myself :) as the locking mechanism failing to prevent one node from trying to start (or write to) the VM while it was already running on another node, because the filesystem was so damaged that I couldn't believe it. In 15 years I've never seen an extX fs so badly damaged, and the fact that this happened during migration only reinforced that thought.

On Tue, Jun 10, 2014 at 3:11 PM, combuster <combus...@archlinux.us> wrote:
Nah, I explicitly allowed the hosted-engine VM to access the NAS
device as well as the NFS share itself, before the deploy procedure even started.
But I'm puzzled at how you can reproduce the bug; all was well on my setup
before I started a manual migration of the engine's VM. Even automatic migration
worked before that (tested it). Does it just happen without any procedure on
the engine itself? Is the score 0 for just one node, or for two of the three?

On 06/10/2014 01:02 AM, Andrew Lau wrote:
nvm, just as I hit send the error has returned.
Ignore this..

On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau <and...@andrewklau.com> wrote:
So after adding the L3 capabilities to my storage network, I'm no
longer seeing this issue. So the engine needs to be able to
access the storage domain it sits on? But that doesn't show up in the
UI?

Ivan, was this also the case with your setup? The engine couldn't access
the storage domain?

On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau <and...@andrewklau.com> wrote:
Interesting, my storage network is L2 only and doesn't run on
ovirtmgmt (which is the only network the HostedEngine sees), but I've only
seen this issue when running ctdb in front of my NFS server. I
was previously using localhost, as all my hosts had the NFS server on
them (gluster).

On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov <aluki...@redhat.com> wrote:
I just blocked the connection to storage for testing, and as a result I got
this error: "Failed to acquire lock error -243", so I added it to the reproduce
steps.
If you know other steps that reproduce this error without blocking the
connection to storage, it would be wonderful if you could provide them.
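
For completeness, the block itself was just an iptables rule along the lines of
the one quoted in the BZ, with sd_ip standing for the storage domain address;
deleting the same rule restores the connection:

    # block traffic from the storage domain
    iptables -I INPUT -s sd_ip -j DROP

    # later, remove the rule to restore the connection
    iptables -D INPUT -s sd_ip -j DROP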
Thanks

----- Original Message -----
From: "Andrew Lau" <and...@andrewklau.com>
To: "combuster" <combus...@archlinux.us>
Cc: "users" <users@ovirt.org>
Sent: Monday, June 9, 2014 3:47:00 AM
Subject: Re: [ovirt-users] VM HostedEngine is down. Exit message: internal error Failed to acquire lock error -243

I just ran a few extra tests. I had a two-host hosted-engine setup running
for a day; both hosts had a score of 2400. I migrated the VM through the
UI multiple times and it all worked fine. I then added the third host, and
that's when it all fell to pieces.
The other two hosts now have a score of 0.

I'm also curious, in the BZ there's a note about:

where engine-vm block connection to storage domain (via iptables -I INPUT -s sd_ip -j DROP)

What's the purpose for that?

On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau <and...@andrewklau.com> wrote:
Ignore that, the issue came back after 10 minutes.

I've even tried a gluster mount + nfs server on top of that, and the
same issue has come back.

On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau <and...@andrewklau.com> wrote:
Interesting, I put it all into global maintenance, shut it all down
for ~10 minutes, and it regained its sanlock control and doesn't
seem to have that issue coming up in the log.

On Fri, Jun 6, 2014 at 4:21 PM, combuster <combus...@archlinux.us> wrote:
It was pure NFS on a NAS device. They all had different ids (there were no
redeployments of nodes before the problem occurred).

Thanks Jirka.


On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:
I've seen that problem in other threads; the common denominator was "nfs
on top of gluster". So if you have this setup, then it's a known problem.
Otherwise you should double check that your hosts have different ids, or
they would be trying to acquire the same lock.
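
A quick way to check is to compare the host id each host has been given and
what sanlock itself reports, roughly like this (the config path assumes a
default hosted-engine deployment):

    # must be a unique value on every host
    grep host_id /etc/ovirt-hosted-engine/hosted-engine.conf

    # show the lockspaces and resources this host currently holds
    sanlock client status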

--Jirka

On 06/06/2014 08:03 AM, Andrew Lau wrote:
Hi Ivan,

Thanks for the in-depth reply.

I've only seen this happen twice, and only after I added a third host
to the HA cluster. I wonder if that's the root problem.

Have you seen this happen on all your installs or only just after your
manual migration? It's a little frustrating this is happening, as I was
hoping to get this into a production environment. It was all working
except for that log message :(

Thanks,
Andrew


On Fri, Jun 6, 2014 at 3:20 PM, combuster <combus...@archlinux.us> wrote:
Hi Andrew,

this is something that I saw in my logs too, first on one node and then
on the other three. When that happened on all four of them, the engine
was corrupted beyond repair.

First of all, I think that message is saying that sanlock can't get a
lock on the shared storage that you defined for the hosted engine during
installation. I got this error when I tried to manually migrate the
hosted engine. There is an unresolved bug there and I think it's related
to this one:

[Bug 1093366 - Migration of hosted-engine vm put target host score to zero]
https://bugzilla.redhat.com/show_bug.cgi?id=1093366

This is a blocker bug (or should be) for the self-hosted engine and, from
my own experience with it, it shouldn't be used in a production environment
(not until it's fixed).

Nothing I did could fix the fact that the score for the target node was
zero: I tried to reinstall the node, reboot the node, restarted several
services, tailed tons of logs etc., but to no avail. When only one node
was left (the one that was actually running the hosted engine), I brought
the engine's VM down gracefully (hosted-engine --vm-shutdown, I believe)
and after that, when I tried to start the VM, it wouldn't load. VNC showed
that the filesystem inside the VM was corrupted, and when I ran fsck and
finally got it started, it was too badly damaged. I succeeded in starting
the engine itself (after repairing the postgresql service that wouldn't
start), but the database was damaged enough that it acted pretty weird
(it showed that storage domains were down while the VMs were running fine,
etc.). Lucky for me, I had already exported all of the VMs at the first
sign of trouble, so I then installed ovirt-engine on a dedicated server
and attached the export domain.
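
For the record, the graceful shutdown and the later start attempts were done
with the hosted-engine tool, roughly as below (I'm quoting the options from
memory, so double-check them):

    hosted-engine --vm-shutdown   # gracefully shut down the engine VM
    hosted-engine --vm-status     # check VM state and host scores
    hosted-engine --vm-start      # try to start the engine VM again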

So while it's really a useful feature, and it's working for the most part
(i.e. automatic migration works), manually migrating the hosted-engine VM
will lead to trouble.

I hope that my experience with it will be of use to you. It happened to me
two weeks ago, ovirt-engine was current (3.4.1), and there was no fix
available.

Regards,

Ivan

On 06/06/2014 05:12 AM, Andrew Lau wrote:

Hi,

I'm seeing this weird message in my engine log

2014-06-06 03:06:09,380 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id 85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds ov-hv2-2a-08-23 ignoring it in the refresh until migration is done

2014-06-06 03:06:12,494 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName = ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60, vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false, secondsToWait=0, gracefully=false), log id: 62a9d4c1

2014-06-06 03:06:12,561 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id: 62a9d4c1

2014-06-06 03:06:12,652 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM HostedEngine is down. Exit message: internal error Failed to acquire lock: error -243.

It also appears to occur on the other hosts in the cluster, except the
host which is running the hosted engine. So right now, with 3 servers, it
shows up twice in the engine UI.

The engine VM continues to run peacefully, without any issues, on the
host which doesn't have that error.

Any ideas?