Hi,

We again had issues when many VMs were deployed on many hosts at the same 
time and we deployed more (log excerpts below).
Over the last two weeks we found more than 25 runaway VMs still running that 
OpenNebula had marked as DONE; deploy, copy and stop also failed randomly 
quite often.

This is becoming a major problem: we can't run OpenNebula in a stable and 
predictable manner on larger clouds...
We have the following intervals configured; we feel we do need to monitor 
more often than every 10 minutes.
HOST_MONITORING_INTERVAL = 20
VM_POLLING_INTERVAL      = 30
So we first switched back to our SNMP driver, which solved a large part of the 
problems, but our cloud is still growing, so we reached the next limit...

What seems to be happening is that "virsh --connect qemu:///system dominfo" 
interferes with other virsh commands: virsh locks libvirt-sock, so multiple 
processes cannot connect at the same time.
The solution we are now trying is to monitor VMs in read-only mode: "virsh 
--readonly --connect qemu:///system dominfo"
We changed this in the file /usr/lib/one/mads/one_vmm_kvm.rb.
Now virsh no longer locks libvirt-sock, as far as we can see.
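For illustration, a minimal Ruby sketch of the idea (this is NOT the actual 
one_vmm_kvm.rb code; the helper name and constant are hypothetical): the 
polling command is built with --readonly so monitoring requests a read-only 
connection to libvirtd instead of contending for the read-write socket that 
save/destroy need.

```ruby
# Hypothetical sketch, not the real OpenNebula driver code.
LIBVIRT_URI = "qemu:///system"

# Build the monitoring command. --readonly (virsh's read-only connection
# flag) avoids taking the read-write connection to libvirt-sock, so a
# concurrent "virsh save" or "virsh destroy" is not blocked by polling.
def poll_cmd(deploy_id)
  "virsh --readonly --connect #{LIBVIRT_URI} dominfo #{deploy_id}"
end

puts poll_cmd("one-428")
```

Only the monitoring/poll path should get --readonly; operations that change 
domain state (save, destroy, create) still need the read-write connection.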

Currently we no longer see the error messages we had before, but some kind of 
robust, scalable and fail-safe monitoring solution for OpenNebula is still 
needed.

Hope this helps
Kind regards,

Floris


Thu Jul 22 16:16:21 2010 [VMM][I]: Command execution fail: virsh --connect 
qemu:///system dominfo one-428
Thu Jul 22 16:16:21 2010 [VMM][I]: STDERR follows.
Thu Jul 22 16:16:21 2010 [VMM][I]: error: unable to connect to 
'/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission 
denied
Thu Jul 22 16:16:21 2010 [VMM][I]: error: failed to connect to the hypervisor
Thu Jul 22 16:16:21 2010 [VMM][I]: ExitCode: 1
Thu Jul 22 16:16:21 2010 [VMM][E]: Error monitoring VM, -

And sometimes destroy would fail:
Wed Jul 28 13:05:34 2010 [LCM][I]: New VM state is SAVE_STOP
Wed Jul 28 13:05:34 2010 [VMM][I]: Command execution fail: 'touch 
/var/lib/one/585/images/checkpoint;virsh --connect qemu:///system save one-585 
/var/lib/one/585/images/checkpoint'
Wed Jul 28 13:05:34 2010 [VMM][I]: STDERR follows.
Wed Jul 28 13:05:34 2010 [VMM][I]: error: unable to connect to 
'/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission 
denied
Wed Jul 28 13:05:34 2010 [VMM][I]: error: failed to connect to the hypervisor
Wed Jul 28 13:05:34 2010 [VMM][I]: ExitCode: 1
Wed Jul 28 13:05:34 2010 [VMM][E]: Error saving VM state, -
Wed Jul 28 13:05:35 2010 [LCM][I]: Fail to save VM state. Assuming that the VM 
is still RUNNING (will poll VM).
Wed Jul 28 13:05:38 2010 [VMM][I]: Command execution fail: virsh --connect 
qemu:///system dominfo one-585
Wed Jul 28 13:05:38 2010 [VMM][I]: STDERR follows.
Wed Jul 28 13:05:38 2010 [VMM][I]: error: unable to connect to 
'/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission 
denied
Wed Jul 28 13:05:38 2010 [VMM][I]: error: failed to connect to the hypervisor
Wed Jul 28 13:05:38 2010 [VMM][I]: ExitCode: 1
Wed Jul 28 13:05:38 2010 [VMM][E]: Error monitoring VM, -
...trying like 10 times ...
Wed Jul 28 13:09:14 2010 [VMM][E]: Error monitoring VM, -
Wed Jul 28 13:09:56 2010 [LCM][I]: New VM state is SAVE_STOP
Wed Jul 28 13:09:56 2010 [VMM][I]: Command execution fail: 'touch 
/var/lib/one/585/images/checkpoint;virsh --connect qemu:///system save one-585 
/var/lib/one/585/images/checkpoint'
Wed Jul 28 13:09:56 2010 [VMM][I]: STDERR follows.
Wed Jul 28 13:09:56 2010 [VMM][I]: error: unable to connect to 
'/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission 
denied
Wed Jul 28 13:09:56 2010 [VMM][I]: error: failed to connect to the hypervisor
Wed Jul 28 13:09:56 2010 [VMM][I]: ExitCode: 1
Wed Jul 28 13:09:56 2010 [VMM][E]: Error saving VM state, -
Wed Jul 28 13:09:56 2010 [LCM][I]: Fail to save VM state. Assuming that the VM 
is still RUNNING (will poll VM).
Wed Jul 28 13:09:56 2010 [VMM][I]: Command execution fail: virsh --connect 
qemu:///system dominfo one-585
Wed Jul 28 13:09:56 2010 [VMM][I]: STDERR follows.
Wed Jul 28 13:09:56 2010 [VMM][I]: error: unable to connect to 
'/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission 
denied
Wed Jul 28 13:09:56 2010 [VMM][I]: error: failed to connect to the hypervisor
Wed Jul 28 13:09:56 2010 [VMM][I]: ExitCode: 1
Wed Jul 28 13:09:56 2010 [VMM][E]: Error monitoring VM, -
Wed Jul 28 13:10:24 2010 [VMM][I]: Command execution fail: virsh --connect 
qemu:///system dominfo one-585
Wed Jul 28 13:10:24 2010 [VMM][I]: STDERR follows.
Wed Jul 28 13:10:24 2010 [VMM][I]: error: unable to connect to 
'/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission 
denied
Wed Jul 28 13:10:24 2010 [VMM][I]: error: failed to connect to the hypervisor
Wed Jul 28 13:10:24 2010 [VMM][I]: ExitCode: 1
Wed Jul 28 13:10:24 2010 [VMM][E]: Error monitoring VM, -
Wed Jul 28 13:10:45 2010 [DiM][I]: New VM state is DONE
Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 Driver command for 585 
cancelled
Wed Jul 28 13:10:45 2010 [TM][W]: Ignored: LOG - 585 tm_delete.sh: Deleting 
/var/lib/one/585/images
Wed Jul 28 13:10:45 2010 [TM][W]: Ignored: LOG - 585 tm_delete.sh: Executed 
"ssh node13-one rm -rf /var/lib/one/585/images".
Wed Jul 28 13:10:45 2010 [TM][W]: Ignored: TRANSFER SUCCESS 585 -
Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 Command execution fail: 
virsh --connect qemu:///system destroy one-585
Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 STDERR follows.
Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 error: unable to connect 
to '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Permission 
denied
Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 error: failed to connect 
to the hypervisor
Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: LOG - 585 ExitCode: 1
Wed Jul 28 13:10:45 2010 [VMM][W]: Ignored: CANCEL FAILURE 585 -



From: users-boun...@lists.opennebula.org 
[mailto:users-boun...@lists.opennebula.org] On Behalf Of Floris Sluiter
Sent: maandag 19 juli 2010 18:18
To: 'Tino Vazquez'; DuDu
Cc: users@lists.opennebula.org
Subject: Re: [one-users] oned hang

Hi Dudu, Tino and all,

We saw the exact same message ("Command execution fail" and "bad interpreter: 
Text file busy") on our cluster last week when we expanded it from 12 to 16 
hosts (with add host) and deployed 10 VMs at the same time. We did not have 
multiple instances of OpenNebula running; we only added hosts to a running 
one, so it is unlikely that was the issue (the cluster had already been 
running stable for a while). We investigated and concluded it was a timing 
issue, with the monitoring (ssh) driver interval set to 60 seconds and many 
hosts and many VMs.
We had started using the ssh monitoring driver again after the latest update 
to OpenNebula; before that we used our in-house developed SNMP monitoring 
driver. When we redeployed our SNMP driver, the error messages stopped, and 
for the last week we have had a stable cloud again, now with 16 hosts...

For people who see the same timing issues as we did, the snmp_driver is 
available in the ecosystem (but make sure you know what SNMP is before you 
try it ;-)): http://opennebula.org/software:ecosystem:snmp_im_driver
Regards,

Floris
HPC project leader
Sara


From: users-boun...@lists.opennebula.org 
[mailto:users-boun...@lists.opennebula.org] On Behalf Of Tino Vazquez
Sent: maandag 19 juli 2010 16:15
To: DuDu
Cc: users@lists.opennebula.org
Subject: Re: [one-users] oned hang

Dear DuDu,

This happens when two monitoring actions take place at the same time.

First thing, which OpenNebula version are you using?

Are you per chance running two OpenNebula instances? Did you change the host 
polling time?

Regards,

-Tino

--
Constantino Vázquez Blanco | dsa-research.org/tinova
Virtualization Technology Engineer / Researcher
OpenNebula Toolkit | opennebula.org
On Wed, Jul 14, 2010 at 3:13 PM, DuDu <black...@gmail.com> wrote:

Hi,

We deployed a small OpenNebula cluster with 8 hosts. It is the default 
OpenNebula installation; however, we found that after several days of running, 
oned hung. All CLI commands hung too. No new logs were generated in 
one_xmlrpc.log, and there are quite a few error messages like the following in 
oned.log:

[r...@vm-container-31-0 logdir]# tail oned.log
Wed Jul 14 14:51:02 2010 [InM][I]: Warning: untrusted X11 forwarding setup 
failed: xauth key data not generated
Wed Jul 14 14:51:02 2010 [InM][I]: Warning: No xauth data; using fake 
authentication data for X11 forwarding.
Wed Jul 14 14:51:02 2010 [InM][I]: bash: 
/tmp/one-im//one_im-c4718299a313d89398ea693104dcce5f: /bin/sh: bad interpreter: 
Text file busy
Wed Jul 14 14:51:02 2010 [InM][I]: ExitCode: 126
Wed Jul 14 14:51:02 2010 [InM][I]: Command execution fail: 'mkdir -p 
/tmp/one-im/; cat > /tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822; if [ 
"x$?" != "x0" ]; then exit -1; fi; chmod +x 
/tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822; 
/tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822'
Wed Jul 14 14:51:02 2010 [InM][I]: STDERR follows.
Wed Jul 14 14:51:02 2010 [InM][I]: Warning: untrusted X11 forwarding setup 
failed: xauth key data not generated
Wed Jul 14 14:51:02 2010 [InM][I]: Warning: No xauth data; using fake 
authentication data for X11 forwarding.
Wed Jul 14 14:51:02 2010 [InM][I]: bash: 
/tmp/one-im//one_im-f3817715aa24450225bafb4c19b23822: /bin/sh: bad interpreter: 
Text file busy
Wed Jul 14 14:51:02 2010 [InM][I]: ExitCode: 126

We have to SIGKILL oned and restart it, and that solves all the problems.

Any idea of this?

Thanks!

_______________________________________________
Users mailing list
Users@lists.opennebula.org
http://lists.opennebula.org/listinfo.cgi/users-opennebula.org

