Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-07-15 Thread Stefano Stagnaro




Hi,

since bug 1093366 is evidently blocking the Hosted Engine feature,
it should be added as a blocker to the oVirt 3.4.3 tracker (bug 1107968).


All the more so now that the proposed patches seem to have fixed the
problem (I've run at least 30 Hosted Engine migrations without errors).


Thank you,
Stefano.

On 06/06/2014 05:12 AM, Andrew Lau wrote:

Hi,

I'm seeing this weird message in my engine log

2014-06-06 03:06:09,380 INFO
[org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
(DefaultQuartzScheduler_Worker-79) RefreshVmList vm id
85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds
ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
2014-06-06 03:06:12,494 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
(DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName =
ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60,
vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false,
secondsToWait=0, gracefully=false), log id: 62a9d4c1
2014-06-06 03:06:12,561 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
(DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id:
62a9d4c1
2014-06-06 03:06:12,652 INFO
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: VM HostedEngine is down. Exit
message: internal error Failed to acquire lock: error -243.

It also appears to occur on the other hosts in the cluster, except the
host which is running the hosted-engine. So right now 3 servers, it
shows up twice in the engine UI.

The engine VM continues to run peacefully, without any issues on the
host which doesn't have that error.

Any ideas?
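
One quick way to see what sanlock and the HA agent are actually doing on a
host when this message shows up is to check their status directly; a minimal
sketch, assuming the standard sanlock and ovirt-hosted-engine-ha tooling and
the default agent log path:

  # Lockspaces and resources sanlock currently holds on this host
  sanlock client status

  # Per-host hosted-engine score, state and engine health as the HA agent sees it
  hosted-engine --vm-status

  # Follow the HA agent log for lock- and score-related messages
  tail -f /var/log/ovirt-hosted-engine-ha/agent.log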





___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-11 Thread noc
Just to update everyone: I have the same problem with a 3-host setup and
have uploaded logs to BZ 1093366.


Joop

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-10 Thread Brad House

OK, I thought I was doing something wrong yesterday, so I just
tore down my 3-node cluster with the hosted engine and started
rebuilding. I was seeing essentially the same thing: a score of
0 on the nodes not running the engine, and it wouldn't allow migration
of the hosted engine. I tried everything related to setting
maintenance and rebooting hosts, but nothing brought them up to a
point where I could migrate the hosted engine.

I thought it was related to oVirt misbehaving when deploying the
other hosts (I told it not to modify the firewall, which I had disabled,
but the deploy process forcibly re-enabled the firewall, which gluster
really didn't like). Now, after reading this, it appears my assumption
may be false.

Previously, a 2-node cluster I had worked fine, but I wanted to
go to 3 nodes so I could enable quorum on gluster and not risk
split-brain issues.
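
For reference, client- and server-side quorum on a 3-node gluster volume is
typically enabled along these lines (a sketch only; "myvol" is a placeholder
volume name, not something taken from this thread):

  # Client-side quorum: only allow writes while a majority of bricks is up
  gluster volume set myvol cluster.quorum-type auto

  # Server-side quorum: stop bricks if the trusted pool loses quorum
  gluster volume set myvol cluster.server-quorum-type server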

-Brad


On 6/10/14 1:19 AM, Andrew Lau wrote:

I'm really having a hard time finding out why it's happening..

If I set the cluster to global for a minute or two, the scores will
reset back to 2400. Set maintenance mode to none, and all will be fine
until a migration occurs. It seems it tries to migrate, fails and sets
the score to 0 permanently rather than the 10? minutes mentioned in
one of the ovirt slides.

When I have two hosts, it's score 0 only when a migration occurs.
(Just on the host which doesn't have engine up). The score 0 only
happens when it's tried to migrate when I set the host to local
maintenance. Migrating the VM from the UI has worked quite a few
times, but it's recently started to fail.

When I have three hosts, after 5~ mintues of them all up the score
will hit 0 on the hosts not running the VMs. It doesn't even have to
attempt to migrate before the score goes to 0. Stopping the ha agent
on one host, and resetting it with the global maintenance method
brings it back to the 2 host scenario above.

I may move on and just go back to a standalone engine as this is not
getting very much luck..

On Tue, Jun 10, 2014 at 3:11 PM, combuster combus...@archlinux.us wrote:

Nah, I've explicitly allowed hosted-engine vm to be able to access the NAS
device as the NFS share itself, before the deploy procedure even started.
But I'm puzzled at how you can reproduce the bug, all was well on my setup
before I've stated manual migration of the engine's vm. Even auto migration
worked before that (tested it). Does it just happen without any procedure on
the engine itself? Is the score 0 for just one node, or two of three of
them?

On 06/10/2014 01:02 AM, Andrew Lau wrote:


nvm, just as I hit send the error has returned.
Ignore this..

On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau and...@andrewklau.com wrote:


So after adding the L3 capabilities to my storage network, I'm no
longer seeing this issue anymore. So the engine needs to be able to
access the storage domain it sits on? But that doesn't show up in the
UI?

Ivan, was this also the case with your setup? Engine couldn't access
storage domain?

On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau and...@andrewklau.com wrote:


Interesting, my storage network is a L2 only and doesn't run on the
ovirtmgmt (which is the only thing HostedEngine sees) but I've only
seen this issue when running ctdb in front of my NFS server. I
previously was using localhost as all my hosts had the nfs server on
it (gluster).

On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov aluki...@redhat.com
wrote:


I just blocked connection to storage for testing, but on result I had
this error: Failed to acquire lock error -243, so I added it in reproduce
steps.
If you know another steps to reproduce this error, without blocking
connection to storage it also can be wonderful if you can provide them.
Thanks

- Original Message -
From: Andrew Lau and...@andrewklau.com
To: combuster combus...@archlinux.us
Cc: users users@ovirt.org
Sent: Monday, June 9, 2014 3:47:00 AM
Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message:
internal error Failed to acquire lock error -243

I just ran a few extra tests, I had a 2 host, hosted-engine running
for a day. They both had a score of 2400. Migrated the VM through the
UI multiple times, all worked fine. I then added the third host, and
that's when it all fell to pieces.
Other two hosts have a score of 0 now.

I'm also curious, in the BZ there's a note about:

where engine-vm block connection to storage domain(via iptables -I
INPUT -s sd_ip -j DROP)

What's the purpose for that?

On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com
wrote:


Ignore that, the issue came back after 10 minutes.

I've even tried a gluster mount + nfs server on top of that, and the
same issue has come back.

On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com
wrote:


Interesting, I put it all into global maintenance. Shut it all down
for 10~ minutes, and it's regained it's sanlock control and doesn't
seem to have that issue coming up in the log.

On Fri, Jun 6, 2014 at 4:21 PM, combuster

Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-09 Thread Artyom Lukianov
I just blocked the connection to storage for testing, and as a result I got
this error: Failed to acquire lock error -243, so I added it to the
reproduction steps.
If you know other steps that reproduce this error without blocking the
connection to storage, it would be wonderful if you could provide them.
Thanks
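
For reference, the reproduction step quoted further down ("iptables -I INPUT
-s sd_ip -j DROP") amounts to dropping the engine VM's traffic from the
storage domain and later restoring it; a rough sketch, run inside the engine
VM, where sd_ip stands for the storage domain's IP address (a placeholder,
not a value from this thread):

  # Block replies coming from the hosted-engine storage domain
  iptables -I INPUT -s sd_ip -j DROP

  # ...wait for the HA agents to react (scores drop, sanlock errors show up)...

  # Delete the rule again to restore access to the storage domain
  iptables -D INPUT -s sd_ip -j DROP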

- Original Message -
From: Andrew Lau and...@andrewklau.com
To: combuster combus...@archlinux.us
Cc: users users@ovirt.org
Sent: Monday, June 9, 2014 3:47:00 AM
Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal 
error Failed to acquire lock error -243

I just ran a few extra tests, I had a 2 host, hosted-engine running
for a day. They both had a score of 2400. Migrated the VM through the
UI multiple times, all worked fine. I then added the third host, and
that's when it all fell to pieces.
Other two hosts have a score of 0 now.

I'm also curious, in the BZ there's a note about:

where engine-vm block connection to storage domain(via iptables -I
INPUT -s sd_ip -j DROP)

What's the purpose for that?

On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com wrote:
 Ignore that, the issue came back after 10 minutes.

 I've even tried a gluster mount + nfs server on top of that, and the
 same issue has come back.

 On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com wrote:
 Interesting, I put it all into global maintenance. Shut it all down
 for 10~ minutes, and it's regained it's sanlock control and doesn't
 seem to have that issue coming up in the log.

 On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us wrote:
 It was pure NFS on a NAS device. They all had different ids (had no
 redeployements of nodes before problem occured).

 Thanks Jirka.


 On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:

 I've seen that problem in other threads, the common denominator was nfs
 on top of gluster. So if you have this setup, then it's a known problem. 
 Or
 you should double check if you hosts have different ids otherwise they 
 would
 be trying to acquire the same lock.

 --Jirka

 On 06/06/2014 08:03 AM, Andrew Lau wrote:

 Hi Ivan,

 Thanks for the in depth reply.

 I've only seen this happen twice, and only after I added a third host
 to the HA cluster. I wonder if that's the root problem.

 Have you seen this happen on all your installs or only just after your
 manual migration? It's a little frustrating this is happening as I was
 hoping to get this into a production environment. It was all working
 except that log message :(

 Thanks,
 Andrew


 On Fri, Jun 6, 2014 at 3:20 PM, combuster combus...@archlinux.us wrote:

 Hi Andrew,

 this is something that I saw in my logs too, first on one node and then
 on
 the other three. When that happend on all four of them, engine was
 corrupted
 beyond repair.

 First of all, I think that message is saying that sanlock can't get a
 lock
 on the shared storage that you defined for the hostedengine during
 installation. I got this error when I've tried to manually migrate the
 hosted engine. There is an unresolved bug there and I think it's related
 to
 this one:

 [Bug 1093366 - Migration of hosted-engine vm put target host score to
 zero]
 https://bugzilla.redhat.com/show_bug.cgi?id=1093366

 This is a blocker bug (or should be) for the selfhostedengine and, from
 my
 own experience with it, shouldn't be used in the production enviroment
 (not
 untill it's fixed).

 Nothing that I've done couldn't fix the fact that the score for the
 target
 node was Zero, tried to reinstall the node, reboot the node, restarted
 several services, tailed a tons of logs etc but to no avail. When only
 one
 node was left (that was actually running the hosted engine), I brought
 the
 engine's vm down gracefully (hosted-engine --vm-shutdown I belive) and
 after
 that, when I've tried to start the vm - it wouldn't load. Running VNC
 showed
 that the filesystem inside the vm was corrupted and when I ran fsck and
 finally started up - it was too badly damaged. I succeded to start the
 engine itself (after repairing postgresql service that wouldn't want to
 start) but the database was damaged enough and acted pretty weird
 (showed
 that storage domains were down but the vm's were running fine etc).
 Lucky
 me, I had already exported all of the VM's on the first sign of trouble
 and
 then installed ovirt-engine on the dedicated server and attached the
 export
 domain.

 So while really a usefull feature, and it's working (for the most part
 ie,
 automatic migration works), manually migrating VM with the hosted-engine
 will lead to troubles.

 I hope that my experience with it, will be of use to you. It happened to
 me
 two weeks ago, ovirt-engine was current (3.4.1) and there was no fix
 available.

 Regards,

 Ivan

 On 06/06/2014 05:12 AM, Andrew Lau wrote:

 Hi,

 I'm seeing this weird message in my engine log

 2014-06-06 03:06:09,380 INFO
 [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
 (DefaultQuartzScheduler_Worker-79

Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-09 Thread Andrew Lau
Interesting; my storage network is L2 only and doesn't run on
ovirtmgmt (which is the only network HostedEngine sees), but I've only
seen this issue when running ctdb in front of my NFS server. I was
previously using localhost, as all my hosts had the NFS server on
them (gluster).

On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov aluki...@redhat.com wrote:
 I just blocked connection to storage for testing, but on result I had this 
 error: Failed to acquire lock error -243, so I added it in reproduce steps.
 If you know another steps to reproduce this error, without blocking 
 connection to storage it also can be wonderful if you can provide them.
 Thanks

 - Original Message -
 From: Andrew Lau and...@andrewklau.com
 To: combuster combus...@archlinux.us
 Cc: users users@ovirt.org
 Sent: Monday, June 9, 2014 3:47:00 AM
 Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal 
 error Failed to acquire lock error -243

 I just ran a few extra tests, I had a 2 host, hosted-engine running
 for a day. They both had a score of 2400. Migrated the VM through the
 UI multiple times, all worked fine. I then added the third host, and
 that's when it all fell to pieces.
 Other two hosts have a score of 0 now.

 I'm also curious, in the BZ there's a note about:

 where engine-vm block connection to storage domain(via iptables -I
 INPUT -s sd_ip -j DROP)

 What's the purpose for that?

 On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com wrote:
 Ignore that, the issue came back after 10 minutes.

 I've even tried a gluster mount + nfs server on top of that, and the
 same issue has come back.

 On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com wrote:
 Interesting, I put it all into global maintenance. Shut it all down
 for 10~ minutes, and it's regained it's sanlock control and doesn't
 seem to have that issue coming up in the log.

 On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us wrote:
 It was pure NFS on a NAS device. They all had different ids (had no
 redeployements of nodes before problem occured).

 Thanks Jirka.


 On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:

 I've seen that problem in other threads, the common denominator was nfs
 on top of gluster. So if you have this setup, then it's a known problem. 
 Or
 you should double check if you hosts have different ids otherwise they 
 would
 be trying to acquire the same lock.

 --Jirka

 On 06/06/2014 08:03 AM, Andrew Lau wrote:

 Hi Ivan,

 Thanks for the in depth reply.

 I've only seen this happen twice, and only after I added a third host
 to the HA cluster. I wonder if that's the root problem.

 Have you seen this happen on all your installs or only just after your
 manual migration? It's a little frustrating this is happening as I was
 hoping to get this into a production environment. It was all working
 except that log message :(

 Thanks,
 Andrew


 On Fri, Jun 6, 2014 at 3:20 PM, combuster combus...@archlinux.us wrote:

 Hi Andrew,

 this is something that I saw in my logs too, first on one node and then
 on
 the other three. When that happend on all four of them, engine was
 corrupted
 beyond repair.

 First of all, I think that message is saying that sanlock can't get a
 lock
 on the shared storage that you defined for the hostedengine during
 installation. I got this error when I've tried to manually migrate the
 hosted engine. There is an unresolved bug there and I think it's related
 to
 this one:

 [Bug 1093366 - Migration of hosted-engine vm put target host score to
 zero]
 https://bugzilla.redhat.com/show_bug.cgi?id=1093366

 This is a blocker bug (or should be) for the selfhostedengine and, from
 my
 own experience with it, shouldn't be used in the production enviroment
 (not
 untill it's fixed).

 Nothing that I've done couldn't fix the fact that the score for the
 target
 node was Zero, tried to reinstall the node, reboot the node, restarted
 several services, tailed a tons of logs etc but to no avail. When only
 one
 node was left (that was actually running the hosted engine), I brought
 the
 engine's vm down gracefully (hosted-engine --vm-shutdown I belive) and
 after
 that, when I've tried to start the vm - it wouldn't load. Running VNC
 showed
 that the filesystem inside the vm was corrupted and when I ran fsck and
 finally started up - it was too badly damaged. I succeded to start the
 engine itself (after repairing postgresql service that wouldn't want to
 start) but the database was damaged enough and acted pretty weird
 (showed
 that storage domains were down but the vm's were running fine etc).
 Lucky
 me, I had already exported all of the VM's on the first sign of trouble
 and
 then installed ovirt-engine on the dedicated server and attached the
 export
 domain.

 So while really a usefull feature, and it's working (for the most part
 ie,
 automatic migration works), manually migrating VM with the hosted-engine
 will lead to troubles.

 I hope that my experience

Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-09 Thread Andrew Lau
So after adding L3 capabilities to my storage network, I'm no longer
seeing this issue. So the engine needs to be able to access the storage
domain it sits on? But that doesn't show up in the UI?

Ivan, was this also the case with your setup? The engine couldn't access
the storage domain?
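
A quick way to check that from inside the engine VM is to test basic
reachability and the NFS export itself; a minimal sketch, where
storage.example.com is a placeholder for the NFS server and showmount comes
from the nfs-utils package:

  # Can the engine VM reach the NFS server over the routed (L3) path at all?
  ping -c 3 storage.example.com

  # Is the export for the hosted-engine storage domain visible from here?
  showmount -e storage.example.com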

On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau and...@andrewklau.com wrote:
 Interesting, my storage network is a L2 only and doesn't run on the
 ovirtmgmt (which is the only thing HostedEngine sees) but I've only
 seen this issue when running ctdb in front of my NFS server. I
 previously was using localhost as all my hosts had the nfs server on
 it (gluster).

 On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov aluki...@redhat.com wrote:
 I just blocked connection to storage for testing, but on result I had this 
 error: Failed to acquire lock error -243, so I added it in reproduce steps.
 If you know another steps to reproduce this error, without blocking 
 connection to storage it also can be wonderful if you can provide them.
 Thanks

 - Original Message -
 From: Andrew Lau and...@andrewklau.com
 To: combuster combus...@archlinux.us
 Cc: users users@ovirt.org
 Sent: Monday, June 9, 2014 3:47:00 AM
 Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal 
 error Failed to acquire lock error -243

 I just ran a few extra tests, I had a 2 host, hosted-engine running
 for a day. They both had a score of 2400. Migrated the VM through the
 UI multiple times, all worked fine. I then added the third host, and
 that's when it all fell to pieces.
 Other two hosts have a score of 0 now.

 I'm also curious, in the BZ there's a note about:

 where engine-vm block connection to storage domain(via iptables -I
 INPUT -s sd_ip -j DROP)

 What's the purpose for that?

 On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com wrote:
 Ignore that, the issue came back after 10 minutes.

 I've even tried a gluster mount + nfs server on top of that, and the
 same issue has come back.

 On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com wrote:
 Interesting, I put it all into global maintenance. Shut it all down
 for 10~ minutes, and it's regained it's sanlock control and doesn't
 seem to have that issue coming up in the log.

 On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us wrote:
 It was pure NFS on a NAS device. They all had different ids (had no
 redeployements of nodes before problem occured).

 Thanks Jirka.


 On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:

 I've seen that problem in other threads, the common denominator was nfs
 on top of gluster. So if you have this setup, then it's a known 
 problem. Or
 you should double check if you hosts have different ids otherwise they 
 would
 be trying to acquire the same lock.

 --Jirka

 On 06/06/2014 08:03 AM, Andrew Lau wrote:

 Hi Ivan,

 Thanks for the in depth reply.

 I've only seen this happen twice, and only after I added a third host
 to the HA cluster. I wonder if that's the root problem.

 Have you seen this happen on all your installs or only just after your
 manual migration? It's a little frustrating this is happening as I was
 hoping to get this into a production environment. It was all working
 except that log message :(

 Thanks,
 Andrew


 On Fri, Jun 6, 2014 at 3:20 PM, combuster combus...@archlinux.us 
 wrote:

 Hi Andrew,

 this is something that I saw in my logs too, first on one node and then
 on
 the other three. When that happend on all four of them, engine was
 corrupted
 beyond repair.

 First of all, I think that message is saying that sanlock can't get a
 lock
 on the shared storage that you defined for the hostedengine during
 installation. I got this error when I've tried to manually migrate the
 hosted engine. There is an unresolved bug there and I think it's 
 related
 to
 this one:

 [Bug 1093366 - Migration of hosted-engine vm put target host score to
 zero]
 https://bugzilla.redhat.com/show_bug.cgi?id=1093366

 This is a blocker bug (or should be) for the selfhostedengine and, from
 my
 own experience with it, shouldn't be used in the production enviroment
 (not
 untill it's fixed).

 Nothing that I've done couldn't fix the fact that the score for the
 target
 node was Zero, tried to reinstall the node, reboot the node, restarted
 several services, tailed a tons of logs etc but to no avail. When only
 one
 node was left (that was actually running the hosted engine), I brought
 the
 engine's vm down gracefully (hosted-engine --vm-shutdown I belive) and
 after
 that, when I've tried to start the vm - it wouldn't load. Running VNC
 showed
 that the filesystem inside the vm was corrupted and when I ran fsck and
 finally started up - it was too badly damaged. I succeded to start the
 engine itself (after repairing postgresql service that wouldn't want to
 start) but the database was damaged enough and acted pretty weird
 (showed
 that storage domains were down but the vm's were running fine etc

Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-09 Thread Andrew Lau
nvm, just as I hit send the error has returned.
Ignore this..

On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau and...@andrewklau.com wrote:
 So after adding the L3 capabilities to my storage network, I'm no
 longer seeing this issue anymore. So the engine needs to be able to
 access the storage domain it sits on? But that doesn't show up in the
 UI?

 Ivan, was this also the case with your setup? Engine couldn't access
 storage domain?

 On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau and...@andrewklau.com wrote:
 Interesting, my storage network is a L2 only and doesn't run on the
 ovirtmgmt (which is the only thing HostedEngine sees) but I've only
 seen this issue when running ctdb in front of my NFS server. I
 previously was using localhost as all my hosts had the nfs server on
 it (gluster).

 On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov aluki...@redhat.com wrote:
 I just blocked connection to storage for testing, but on result I had this 
 error: Failed to acquire lock error -243, so I added it in reproduce 
 steps.
 If you know another steps to reproduce this error, without blocking 
 connection to storage it also can be wonderful if you can provide them.
 Thanks

 - Original Message -
 From: Andrew Lau and...@andrewklau.com
 To: combuster combus...@archlinux.us
 Cc: users users@ovirt.org
 Sent: Monday, June 9, 2014 3:47:00 AM
 Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal 
 error Failed to acquire lock error -243

 I just ran a few extra tests, I had a 2 host, hosted-engine running
 for a day. They both had a score of 2400. Migrated the VM through the
 UI multiple times, all worked fine. I then added the third host, and
 that's when it all fell to pieces.
 Other two hosts have a score of 0 now.

 I'm also curious, in the BZ there's a note about:

 where engine-vm block connection to storage domain(via iptables -I
 INPUT -s sd_ip -j DROP)

 What's the purpose for that?

 On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com wrote:
 Ignore that, the issue came back after 10 minutes.

 I've even tried a gluster mount + nfs server on top of that, and the
 same issue has come back.

 On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com wrote:
 Interesting, I put it all into global maintenance. Shut it all down
 for 10~ minutes, and it's regained it's sanlock control and doesn't
 seem to have that issue coming up in the log.

 On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us wrote:
 It was pure NFS on a NAS device. They all had different ids (had no
 redeployements of nodes before problem occured).

 Thanks Jirka.


 On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:

 I've seen that problem in other threads, the common denominator was nfs
 on top of gluster. So if you have this setup, then it's a known 
 problem. Or
 you should double check if you hosts have different ids otherwise they 
 would
 be trying to acquire the same lock.

 --Jirka

 On 06/06/2014 08:03 AM, Andrew Lau wrote:

 Hi Ivan,

 Thanks for the in depth reply.

 I've only seen this happen twice, and only after I added a third host
 to the HA cluster. I wonder if that's the root problem.

 Have you seen this happen on all your installs or only just after your
 manual migration? It's a little frustrating this is happening as I was
 hoping to get this into a production environment. It was all working
 except that log message :(

 Thanks,
 Andrew


 On Fri, Jun 6, 2014 at 3:20 PM, combuster combus...@archlinux.us 
 wrote:

 Hi Andrew,

 this is something that I saw in my logs too, first on one node and 
 then
 on
 the other three. When that happend on all four of them, engine was
 corrupted
 beyond repair.

 First of all, I think that message is saying that sanlock can't get a
 lock
 on the shared storage that you defined for the hostedengine during
 installation. I got this error when I've tried to manually migrate the
 hosted engine. There is an unresolved bug there and I think it's 
 related
 to
 this one:

 [Bug 1093366 - Migration of hosted-engine vm put target host score to
 zero]
 https://bugzilla.redhat.com/show_bug.cgi?id=1093366

 This is a blocker bug (or should be) for the selfhostedengine and, 
 from
 my
 own experience with it, shouldn't be used in the production enviroment
 (not
 untill it's fixed).

 Nothing that I've done couldn't fix the fact that the score for the
 target
 node was Zero, tried to reinstall the node, reboot the node, restarted
 several services, tailed a tons of logs etc but to no avail. When only
 one
 node was left (that was actually running the hosted engine), I brought
 the
 engine's vm down gracefully (hosted-engine --vm-shutdown I belive) and
 after
 that, when I've tried to start the vm - it wouldn't load. Running VNC
 showed
 that the filesystem inside the vm was corrupted and when I ran fsck 
 and
 finally started up - it was too badly damaged. I succeded to start the
 engine itself (after repairing postgresql service that wouldn't

Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-09 Thread combuster
Nah, I explicitly allowed the hosted-engine VM to access the NAS
device, i.e. the NFS share itself, before the deploy procedure even
started. But I'm puzzled at how you can reproduce the bug; all was well
on my setup before I started the manual migration of the engine's VM.
Even auto migration worked before that (I tested it). Does it just happen
without any procedure on the engine itself? Is the score 0 for just one
node, or for two of the three?

On 06/10/2014 01:02 AM, Andrew Lau wrote:

nvm, just as I hit send the error has returned.
Ignore this..

On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau and...@andrewklau.com wrote:

So after adding the L3 capabilities to my storage network, I'm no
longer seeing this issue anymore. So the engine needs to be able to
access the storage domain it sits on? But that doesn't show up in the
UI?

Ivan, was this also the case with your setup? Engine couldn't access
storage domain?

On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau and...@andrewklau.com wrote:

Interesting, my storage network is a L2 only and doesn't run on the
ovirtmgmt (which is the only thing HostedEngine sees) but I've only
seen this issue when running ctdb in front of my NFS server. I
previously was using localhost as all my hosts had the nfs server on
it (gluster).

On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov aluki...@redhat.com wrote:

I just blocked connection to storage for testing, but on result I had this error: 
Failed to acquire lock error -243, so I added it in reproduce steps.
If you know another steps to reproduce this error, without blocking connection 
to storage it also can be wonderful if you can provide them.
Thanks

- Original Message -
From: Andrew Lau and...@andrewklau.com
To: combuster combus...@archlinux.us
Cc: users users@ovirt.org
Sent: Monday, June 9, 2014 3:47:00 AM
Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal 
error Failed to acquire lock error -243

I just ran a few extra tests, I had a 2 host, hosted-engine running
for a day. They both had a score of 2400. Migrated the VM through the
UI multiple times, all worked fine. I then added the third host, and
that's when it all fell to pieces.
Other two hosts have a score of 0 now.

I'm also curious, in the BZ there's a note about:

where engine-vm block connection to storage domain(via iptables -I
INPUT -s sd_ip -j DROP)

What's the purpose for that?

On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com wrote:

Ignore that, the issue came back after 10 minutes.

I've even tried a gluster mount + nfs server on top of that, and the
same issue has come back.

On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com wrote:

Interesting, I put it all into global maintenance. Shut it all down
for 10~ minutes, and it's regained it's sanlock control and doesn't
seem to have that issue coming up in the log.

On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us wrote:

It was pure NFS on a NAS device. They all had different ids (had no
redeployements of nodes before problem occured).

Thanks Jirka.


On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:

I've seen that problem in other threads, the common denominator was nfs
on top of gluster. So if you have this setup, then it's a known problem. Or
you should double check if you hosts have different ids otherwise they would
be trying to acquire the same lock.

--Jirka

On 06/06/2014 08:03 AM, Andrew Lau wrote:

Hi Ivan,

Thanks for the in depth reply.

I've only seen this happen twice, and only after I added a third host
to the HA cluster. I wonder if that's the root problem.

Have you seen this happen on all your installs or only just after your
manual migration? It's a little frustrating this is happening as I was
hoping to get this into a production environment. It was all working
except that log message :(

Thanks,
Andrew


On Fri, Jun 6, 2014 at 3:20 PM, combuster combus...@archlinux.us wrote:

Hi Andrew,

this is something that I saw in my logs too, first on one node and then
on
the other three. When that happend on all four of them, engine was
corrupted
beyond repair.

First of all, I think that message is saying that sanlock can't get a
lock
on the shared storage that you defined for the hostedengine during
installation. I got this error when I've tried to manually migrate the
hosted engine. There is an unresolved bug there and I think it's related
to
this one:

[Bug 1093366 - Migration of hosted-engine vm put target host score to
zero]
https://bugzilla.redhat.com/show_bug.cgi?id=1093366

This is a blocker bug (or should be) for the selfhostedengine and, from
my
own experience with it, shouldn't be used in the production enviroment
(not
untill it's fixed).

Nothing that I've done couldn't fix the fact that the score for the
target
node was Zero, tried to reinstall the node, reboot the node, restarted
several services, tailed a tons of logs etc but to no avail. When only
one
node was left (that was actually running

Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-09 Thread Andrew Lau
I'm really having a hard time figuring out why it's happening.

If I set the cluster to global maintenance for a minute or two, the scores
will reset back to 2400. Set the maintenance mode back to none, and all will
be fine until a migration occurs. It seems it tries to migrate, fails, and
sets the score to 0 permanently rather than for the 10 (or so) minutes
mentioned in one of the oVirt slides.
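
For reference, that reset cycle maps onto the standard hosted-engine CLI
roughly like this (a sketch only, run on one of the HA hosts):

  # Put the whole HA cluster into global maintenance (agents stop acting)
  hosted-engine --set-maintenance --mode=global

  # After a minute or two, check that the host scores are back to 2400
  hosted-engine --vm-status

  # Drop out of maintenance so the agents resume monitoring and migration
  hosted-engine --set-maintenance --mode=none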

When I have two hosts, the score only drops to 0 when a migration occurs
(just on the host which doesn't have the engine up), and only after it has
tried to migrate while I had set the host to local maintenance. Migrating
the VM from the UI has worked quite a few times, but it has recently
started to fail.

When I have three hosts, after ~5 minutes of them all being up the score
will hit 0 on the hosts not running the VM. It doesn't even have to
attempt a migration before the score goes to 0. Stopping the HA agent
on one host and resetting it with the global maintenance method brings
it back to the two-host scenario above.

I may move on and just go back to a standalone engine, as I'm not having
much luck with this.

On Tue, Jun 10, 2014 at 3:11 PM, combuster combus...@archlinux.us wrote:
 Nah, I've explicitly allowed hosted-engine vm to be able to access the NAS
 device as the NFS share itself, before the deploy procedure even started.
 But I'm puzzled at how you can reproduce the bug, all was well on my setup
 before I've stated manual migration of the engine's vm. Even auto migration
 worked before that (tested it). Does it just happen without any procedure on
 the engine itself? Is the score 0 for just one node, or two of three of
 them?

 On 06/10/2014 01:02 AM, Andrew Lau wrote:

 nvm, just as I hit send the error has returned.
 Ignore this..

 On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau and...@andrewklau.com wrote:

 So after adding the L3 capabilities to my storage network, I'm no
 longer seeing this issue anymore. So the engine needs to be able to
 access the storage domain it sits on? But that doesn't show up in the
 UI?

 Ivan, was this also the case with your setup? Engine couldn't access
 storage domain?

 On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau and...@andrewklau.com wrote:

 Interesting, my storage network is a L2 only and doesn't run on the
 ovirtmgmt (which is the only thing HostedEngine sees) but I've only
 seen this issue when running ctdb in front of my NFS server. I
 previously was using localhost as all my hosts had the nfs server on
 it (gluster).

 On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov aluki...@redhat.com
 wrote:

 I just blocked connection to storage for testing, but on result I had
 this error: Failed to acquire lock error -243, so I added it in 
 reproduce
 steps.
 If you know another steps to reproduce this error, without blocking
 connection to storage it also can be wonderful if you can provide them.
 Thanks

 - Original Message -
 From: Andrew Lau and...@andrewklau.com
 To: combuster combus...@archlinux.us
 Cc: users users@ovirt.org
 Sent: Monday, June 9, 2014 3:47:00 AM
 Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message:
 internal error Failed to acquire lock error -243

 I just ran a few extra tests, I had a 2 host, hosted-engine running
 for a day. They both had a score of 2400. Migrated the VM through the
 UI multiple times, all worked fine. I then added the third host, and
 that's when it all fell to pieces.
 Other two hosts have a score of 0 now.

 I'm also curious, in the BZ there's a note about:

 where engine-vm block connection to storage domain(via iptables -I
 INPUT -s sd_ip -j DROP)

 What's the purpose for that?

 On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com
 wrote:

 Ignore that, the issue came back after 10 minutes.

 I've even tried a gluster mount + nfs server on top of that, and the
 same issue has come back.

 On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com
 wrote:

 Interesting, I put it all into global maintenance. Shut it all down
 for 10~ minutes, and it's regained it's sanlock control and doesn't
 seem to have that issue coming up in the log.

 On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us
 wrote:

 It was pure NFS on a NAS device. They all had different ids (had no
 redeployements of nodes before problem occured).

 Thanks Jirka.


 On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:

 I've seen that problem in other threads, the common denominator was
 nfs
 on top of gluster. So if you have this setup, then it's a known
 problem. Or
 you should double check if you hosts have different ids otherwise
 they would
 be trying to acquire the same lock.

 --Jirka

 On 06/06/2014 08:03 AM, Andrew Lau wrote:

 Hi Ivan,

 Thanks for the in depth reply.

 I've only seen this happen twice, and only after I added a third
 host
 to the HA cluster. I wonder if that's the root problem.

 Have you seen this happen on all your installs or only just after
 your
 manual migration? It's a little frustrating this is happening as I

Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-09 Thread combuster

On 06/10/2014 07:19 AM, Andrew Lau wrote:

I'm really having a hard time finding out why it's happening..

If I set the cluster to global for a minute or two, the scores will
reset back to 2400. Set maintenance mode to none, and all will be fine
until a migration occurs. It seems it tries to migrate, fails and sets
the score to 0 permanently rather than the 10? minutes mentioned in
one of the ovirt slides.

When I have two hosts, it's score 0 only when a migration occurs.
(Just on the host which doesn't have engine up). The score 0 only
happens when it's tried to migrate when I set the host to local
maintenance. Migrating the VM from the UI has worked quite a few
times, but it's recently started to fail.

When I have three hosts, after 5~ mintues of them all up the score
will hit 0 on the hosts not running the VMs. It doesn't even have to
attempt to migrate before the score goes to 0. Stopping the ha agent
on one host, and resetting it with the global maintenance method
brings it back to the 2 host scenario above.

I may move on and just go back to a standalone engine as this is not
getting very much luck..

Well, I've done this already; I can't really afford that much unplanned
downtime on my critical VMs, especially since it would take me several
hours (even a whole day) to install a dedicated engine, then set up the
nodes if need be, and then import the VMs from the export domain. I would
love to help more to resolve this one, but I was pressed for time: I
already had oVirt 3.3 running (rock solid for a year and a half, started
from 3.1 I think), and I couldn't spare more than a day trying to get
around this bug (I had to have a setup running by the end of the weekend).
I wasn't using gluster at all, so at least we now know that gluster is not
a must in the mix. Besides, Artyom has already described it nicely in the
bug report, so I haven't had anything to add.


You were lucky, Andrew. When I tried the global maintenance method and
restarted the VM, I got a corrupted filesystem on the engine VM, and it
wouldn't even start on the one node that had a good score. The state was
bad health or unknown on all of the nodes; I managed to repair the fs on
the VM via VNC and then just barely bring the services online, but the
postgres db was too badly damaged, so the engine misbehaved.


At the time I explained it to myself :) as the locking mechanism failing
to prevent one node from trying to start (or write to) the VM while it was
already running on another node. The filesystem was so badly damaged that
I could hardly believe it; in 15 years I've never seen an extX fs this
badly damaged, and the fact that it happened during a migration just amped
that thought up.


On Tue, Jun 10, 2014 at 3:11 PM, combuster combus...@archlinux.us wrote:

Nah, I've explicitly allowed hosted-engine vm to be able to access the NAS
device as the NFS share itself, before the deploy procedure even started.
But I'm puzzled at how you can reproduce the bug, all was well on my setup
before I've stated manual migration of the engine's vm. Even auto migration
worked before that (tested it). Does it just happen without any procedure on
the engine itself? Is the score 0 for just one node, or two of three of
them?

On 06/10/2014 01:02 AM, Andrew Lau wrote:

nvm, just as I hit send the error has returned.
Ignore this..

On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau and...@andrewklau.com wrote:

So after adding the L3 capabilities to my storage network, I'm no
longer seeing this issue anymore. So the engine needs to be able to
access the storage domain it sits on? But that doesn't show up in the
UI?

Ivan, was this also the case with your setup? Engine couldn't access
storage domain?

On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau and...@andrewklau.com wrote:

Interesting, my storage network is a L2 only and doesn't run on the
ovirtmgmt (which is the only thing HostedEngine sees) but I've only
seen this issue when running ctdb in front of my NFS server. I
previously was using localhost as all my hosts had the nfs server on
it (gluster).

On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov aluki...@redhat.com
wrote:

I just blocked connection to storage for testing, but on result I had
this error: Failed to acquire lock error -243, so I added it in reproduce
steps.
If you know another steps to reproduce this error, without blocking
connection to storage it also can be wonderful if you can provide them.
Thanks

- Original Message -
From: Andrew Lau and...@andrewklau.com
To: combuster combus...@archlinux.us
Cc: users users@ovirt.org
Sent: Monday, June 9, 2014 3:47:00 AM
Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message:
internal error Failed to acquire lock error -243

I just ran a few extra tests, I had a 2 host, hosted-engine running
for a day. They both had a score of 2400. Migrated the VM through the
UI multiple times, all worked fine. I then added the third host, and
that's when it all fell to pieces.
Other two hosts have

Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-08 Thread Andrew Lau
I just ran a few extra tests. I had a 2-host hosted-engine setup running
for a day, and both hosts had a score of 2400. I migrated the VM through
the UI multiple times and it all worked fine. I then added the third host,
and that's when it all fell to pieces.
The other two hosts have a score of 0 now.

I'm also curious, in the BZ there's a note about:

where engine-vm block connection to storage domain(via iptables -I
INPUT -s sd_ip -j DROP)

What's the purpose for that?

On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com wrote:
 Ignore that, the issue came back after 10 minutes.

 I've even tried a gluster mount + nfs server on top of that, and the
 same issue has come back.

 On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com wrote:
 Interesting, I put it all into global maintenance. Shut it all down
 for 10~ minutes, and it's regained it's sanlock control and doesn't
 seem to have that issue coming up in the log.

 On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us wrote:
 It was pure NFS on a NAS device. They all had different ids (had no
 redeployements of nodes before problem occured).

 Thanks Jirka.


 On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:

 I've seen that problem in other threads, the common denominator was nfs
 on top of gluster. So if you have this setup, then it's a known problem. 
 Or
 you should double check if you hosts have different ids otherwise they 
 would
 be trying to acquire the same lock.

 --Jirka

 On 06/06/2014 08:03 AM, Andrew Lau wrote:

 Hi Ivan,

 Thanks for the in depth reply.

 I've only seen this happen twice, and only after I added a third host
 to the HA cluster. I wonder if that's the root problem.

 Have you seen this happen on all your installs or only just after your
 manual migration? It's a little frustrating this is happening as I was
 hoping to get this into a production environment. It was all working
 except that log message :(

 Thanks,
 Andrew


 On Fri, Jun 6, 2014 at 3:20 PM, combuster combus...@archlinux.us wrote:

 Hi Andrew,

 this is something that I saw in my logs too, first on one node and then
 on
 the other three. When that happend on all four of them, engine was
 corrupted
 beyond repair.

 First of all, I think that message is saying that sanlock can't get a
 lock
 on the shared storage that you defined for the hostedengine during
 installation. I got this error when I've tried to manually migrate the
 hosted engine. There is an unresolved bug there and I think it's related
 to
 this one:

 [Bug 1093366 - Migration of hosted-engine vm put target host score to
 zero]
 https://bugzilla.redhat.com/show_bug.cgi?id=1093366

 This is a blocker bug (or should be) for the selfhostedengine and, from
 my
 own experience with it, shouldn't be used in the production enviroment
 (not
 untill it's fixed).

 Nothing that I've done couldn't fix the fact that the score for the
 target
 node was Zero, tried to reinstall the node, reboot the node, restarted
 several services, tailed a tons of logs etc but to no avail. When only
 one
 node was left (that was actually running the hosted engine), I brought
 the
 engine's vm down gracefully (hosted-engine --vm-shutdown I belive) and
 after
 that, when I've tried to start the vm - it wouldn't load. Running VNC
 showed
 that the filesystem inside the vm was corrupted and when I ran fsck and
 finally started up - it was too badly damaged. I succeded to start the
 engine itself (after repairing postgresql service that wouldn't want to
 start) but the database was damaged enough and acted pretty weird
 (showed
 that storage domains were down but the vm's were running fine etc).
 Lucky
 me, I had already exported all of the VM's on the first sign of trouble
 and
 then installed ovirt-engine on the dedicated server and attached the
 export
 domain.

 So while really a usefull feature, and it's working (for the most part
 ie,
 automatic migration works), manually migrating VM with the hosted-engine
 will lead to troubles.

 I hope that my experience with it, will be of use to you. It happened to
 me
 two weeks ago, ovirt-engine was current (3.4.1) and there was no fix
 available.

 Regards,

 Ivan

 On 06/06/2014 05:12 AM, Andrew Lau wrote:

 Hi,

 I'm seeing this weird message in my engine log

 2014-06-06 03:06:09,380 INFO
 [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
 (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id
 85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds
 ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
 2014-06-06 03:06:12,494 INFO
 [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
 (DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName =
 ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60,
 vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false,
 secondsToWait=0, gracefully=false), log id: 62a9d4c1
 2014-06-06 03:06:12,561 INFO
 [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
 

Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-07 Thread Andrew Lau
Ignore that, the issue came back after 10 minutes.

I've even tried a gluster mount + nfs server on top of that, and the
same issue has come back.

On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com wrote:
 Interesting, I put it all into global maintenance. Shut it all down
 for 10~ minutes, and it's regained it's sanlock control and doesn't
 seem to have that issue coming up in the log.

 On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us wrote:
 It was pure NFS on a NAS device. They all had different ids (had no
 redeployements of nodes before problem occured).

 Thanks Jirka.


 On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:

 I've seen that problem in other threads, the common denominator was nfs
 on top of gluster. So if you have this setup, then it's a known problem. Or
 you should double check if you hosts have different ids otherwise they would
 be trying to acquire the same lock.

 --Jirka

 On 06/06/2014 08:03 AM, Andrew Lau wrote:

 Hi Ivan,

 Thanks for the in depth reply.

 I've only seen this happen twice, and only after I added a third host
 to the HA cluster. I wonder if that's the root problem.

 Have you seen this happen on all your installs or only just after your
 manual migration? It's a little frustrating this is happening as I was
 hoping to get this into a production environment. It was all working
 except that log message :(

 Thanks,
 Andrew


 On Fri, Jun 6, 2014 at 3:20 PM, combuster combus...@archlinux.us wrote:

 Hi Andrew,

 this is something that I saw in my logs too, first on one node and then
 on
 the other three. When that happend on all four of them, engine was
 corrupted
 beyond repair.

 First of all, I think that message is saying that sanlock can't get a
 lock
 on the shared storage that you defined for the hostedengine during
 installation. I got this error when I've tried to manually migrate the
 hosted engine. There is an unresolved bug there and I think it's related
 to
 this one:

 [Bug 1093366 - Migration of hosted-engine vm put target host score to
 zero]
 https://bugzilla.redhat.com/show_bug.cgi?id=1093366

 This is a blocker bug (or should be) for the selfhostedengine and, from
 my
 own experience with it, shouldn't be used in the production enviroment
 (not
 untill it's fixed).

 Nothing that I've done couldn't fix the fact that the score for the
 target
 node was Zero, tried to reinstall the node, reboot the node, restarted
 several services, tailed a tons of logs etc but to no avail. When only
 one
 node was left (that was actually running the hosted engine), I brought
 the
 engine's vm down gracefully (hosted-engine --vm-shutdown I belive) and
 after
 that, when I've tried to start the vm - it wouldn't load. Running VNC
 showed
 that the filesystem inside the vm was corrupted and when I ran fsck and
 finally started up - it was too badly damaged. I succeded to start the
 engine itself (after repairing postgresql service that wouldn't want to
 start) but the database was damaged enough and acted pretty weird
 (showed
 that storage domains were down but the vm's were running fine etc).
 Lucky
 me, I had already exported all of the VM's on the first sign of trouble
 and
 then installed ovirt-engine on the dedicated server and attached the
 export
 domain.

 So while really a usefull feature, and it's working (for the most part
 ie,
 automatic migration works), manually migrating VM with the hosted-engine
 will lead to troubles.

 I hope that my experience with it, will be of use to you. It happened to
 me
 two weeks ago, ovirt-engine was current (3.4.1) and there was no fix
 available.

 Regards,

 Ivan

 On 06/06/2014 05:12 AM, Andrew Lau wrote:

 Hi,

 I'm seeing this weird message in my engine log

 2014-06-06 03:06:09,380 INFO
 [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
 (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id
 85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds
 ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
 2014-06-06 03:06:12,494 INFO
 [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
 (DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName =
 ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60,
 vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false,
 secondsToWait=0, gracefully=false), log id: 62a9d4c1
 2014-06-06 03:06:12,561 INFO
 [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
 (DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id:
 62a9d4c1
 2014-06-06 03:06:12,652 INFO
 [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
 (DefaultQuartzScheduler_
 Worker-89) Correlation ID: null, Call Stack:
 null, Custom Event ID: -1, Message: VM HostedEngine is down. Exit
 message: internal error Failed to acquire lock: error -243.

 It also appears to occur on the other hosts in the cluster, except the
 host which is running the hosted-engine. So right now 3 servers, it
 shows up twice 

Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-06 Thread Andrew Lau
Hi Ivan,

Thanks for the in-depth reply.

I've only seen this happen twice, and only after I added a third host
to the HA cluster. I wonder if that's the root problem.

Have you seen this happen on all your installs, or only just after your
manual migration? It's a little frustrating that this is happening, as I
was hoping to get this into a production environment. It was all working
except for that log message :(

Thanks,
Andrew



___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-06 Thread Jiri Moskovcak
I've seen that problem in other threads; the common denominator was nfs
on top of gluster. So if you have this setup, then it's a known
problem. Otherwise you should double check that your hosts have
different ids, or they would be trying to acquire the same lock.
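
A minimal sketch of how one might double check that (the config path and
the sanlock call below are assumptions based on a default hosted-engine
install, so adjust them to your setup):

  # each host should report a unique host_id for the hosted-engine lockspace
  grep host_id /etc/ovirt-hosted-engine/hosted-engine.conf

  # list the lockspaces/resources sanlock currently holds on this host
  sanlock client status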


--Jirka



___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-06 Thread combuster

On 06/06/2014 08:03 AM, Andrew Lau wrote:

Hi Ivan,

Thanks for the in depth reply.

I've only seen this happen twice, and only after I added a third host
to the HA cluster. I wonder if that's the root problem.
It shouldn't be, if the shared storage that the vm is residing on is
accessible by the third node in the cluster.


Have you seen this happen on all your installs or only just after your
manual migration? It's a little frustrating this is happening as I was
hoping to get this into a production environment. It was all working
except that log message :(
Just after the manual migration, then things all went to ... My strong
recommendation is not to use the self hosted engine feature for production
purposes until the mentioned bug is resolved. But it would really help
to hear from someone on the dev team on this one.

Thanks,
Andrew






___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-06 Thread combuster
It was pure NFS on a NAS device. They all had different ids (there were no
redeployments of nodes before the problem occurred).


Thanks Jirka.






___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-06 Thread Andrew Lau
Is this related to the NFS server which gluster provides, or is it
because of the way gluster does replication?

There are a few posts, e.g.
http://community.redhat.com/blog/2014/05/ovirt-3-4-glusterized/, which
report success with gluster + hosted engine. So it'd be good to
know, so we could possibly try a workaround.
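
A quick way to tell which path a given setup is on (just a sketch; the
config key and path are assumptions for a typical hosted-engine install):

  # how was the hosted-engine storage domain specified at setup time?
  grep ^storage= /etc/ovirt-hosted-engine/hosted-engine.conf

  # is it actually mounted over NFS or native glusterfs on this host?
  mount | grep -E 'nfs|glusterfs'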

Cheers.


Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-06 Thread Andrew Lau
Interesting. I put it all into global maintenance, shut it all down
for ~10 minutes, and it's regained its sanlock control and doesn't
seem to have that issue coming up in the log.
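
In case it helps someone else, roughly what that looked like (assuming the
standard hosted-engine maintenance commands shipped with 3.4):

  # stop the HA agents from acting on the engine VM while we poke around
  hosted-engine --set-maintenance --mode=global

  # ...shut things down, wait a while, then check the agents' view
  hosted-engine --vm-status

  # hand control back to the HA agents once the hosts look sane again
  hosted-engine --set-maintenance --mode=none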


[ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-05 Thread Andrew Lau
Hi,

I'm seeing this weird message in my engine log

2014-06-06 03:06:09,380 INFO
[org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
(DefaultQuartzScheduler_Worker-79) RefreshVmList vm id
85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds
ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
2014-06-06 03:06:12,494 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
(DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName =
ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60,
vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false,
secondsToWait=0, gracefully=false), log id: 62a9d4c1
2014-06-06 03:06:12,561 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
(DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id:
62a9d4c1
2014-06-06 03:06:12,652 INFO
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: VM HostedEngine is down. Exit
message: internal error Failed to acquire lock: error -243.

It also appears to occur on the other hosts in the cluster, except the
host which is running the hosted-engine. So right now 3 servers, it
shows up twice in the engine UI.

The engine VM continues to run peacefully, without any issues on the
host which doesn't have that error.

Any ideas?
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-05 Thread combuster

Hi Andrew,

this is something that I saw in my logs too, first on one node and then
on the other three. When that happened on all four of them, the engine
was corrupted beyond repair.


First of all, I think that message is saying that sanlock can't get a
lock on the shared storage that you defined for the hosted engine during
installation. I got this error when I tried to manually migrate the
hosted engine. There is an unresolved bug there and I think it's related
to this one:


[Bug 1093366 - Migration of hosted-engine vm put target host score to zero]

https://bugzilla.redhat.com/show_bug.cgi?id=1093366

This is a blocker bug (or should be) for the self hosted engine and, from
my own experience with it, the feature shouldn't be used in a production
environment (not until it's fixed).


Nothing that I did could fix the fact that the score for the target node
was Zero: I tried to reinstall the node, rebooted the node, restarted
several services, tailed tons of logs etc, but to no avail. When only one
node was left (the one that was actually running the hosted engine), I
brought the engine's vm down gracefully (hosted-engine --vm-shutdown I
believe) and after that, when I tried to start the vm, it wouldn't load.
Running VNC showed that the filesystem inside the vm was corrupted, and
even after I ran fsck and finally got it started, it was too badly
damaged. I succeeded in starting the engine itself (after repairing the
postgresql service that wouldn't start), but the database was damaged
enough that it acted pretty weird (it showed the storage domains as down
while the vm's were running fine, etc). Lucky me, I had already exported
all of the VM's at the first sign of trouble and then installed
ovirt-engine on a dedicated server and attached the export domain.
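
For reference, a minimal sketch of the hosted-engine commands involved here
(--vm-shutdown is the one I mentioned above; the status and start calls are
the usual companions from the same CLI, so treat the exact flags as
assumptions for your version):

  # gracefully stop the engine VM from the host that is still running it
  hosted-engine --vm-shutdown

  # see what state and score the HA agents report for each host
  hosted-engine --vm-status

  # bring the engine VM back up by hand afterwards
  hosted-engine --vm-start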


So while it's really a useful feature, and it's working (for the most part,
i.e. automatic migration works), manually migrating the hosted-engine VM
will lead to trouble.


I hope that my experience with it will be of use to you. It happened to
me two weeks ago; ovirt-engine was current (3.4.1) and there was no fix
available.


Regards,

Ivan


___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users