Hi,

Thanks for the input. We were using "-r" when we tested again this morning. A closer look at the log shows that the problem is in the delete phase: the delete command appears to be executed on the unreachable host (in this case bl05):

Thu Mar 27 11:07:36 2014 [VMM][I]: fi" failed: ssh: connect to host bl05 port 22: Connection refused
Thu Mar 27 11:07:36 2014 [VMM][E]: Error deleting /var/lib/one/datastores/109/416/disk.0
Thu Mar 27 11:07:36 2014 [VMM][I]: ExitCode: 255
Thu Mar 27 11:07:36 2014 [VMM][I]: Failed to execute transfer manager driver operation: tm_delete.
Thu Mar 27 11:07:36 2014 [VMM][I]: Command execution fail: /var/lib/one/remotes/tm/shared/delete bl05:/var/lib/one//datastores/109/416 416 109

I attached a log for the VM running on host bl05.
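
To illustrate what the log suggests, here is a minimal sketch of a cleanup that tries the VM's host first and falls back to the local mount of the shared datastore when the host cannot be reached. The function name delete_vm_dir and the SSH_CMD variable are hypothetical and not part of OpenNebula; the fallback is only meaningful if the datastore path is mounted at the same location on the machine running it, and it does not deal with fencing the failed host.

```shell
#!/bin/sh
# Hypothetical helper, NOT an OpenNebula driver: remove a VM directory,
# preferring the VM's host but falling back to the local (shared) mount.
# SSH_CMD is an assumption so the remote command can be swapped out.
SSH_CMD=${SSH_CMD:-ssh -o ConnectTimeout=5 -o BatchMode=yes}

delete_vm_dir() {
    host=$1
    dir=$2
    # Try the host first, which is what the driver in the log does.
    if $SSH_CMD "$host" "rm -rf '$dir'" 2>/dev/null; then
        echo "deleted on $host"
        return 0
    fi
    # Host unreachable: on a shared filesystem the same path is visible here.
    if [ -e "$dir" ] || [ -L "$dir" ]; then
        rm -rf "$dir" && echo "deleted locally"
    else
        echo "nothing to delete"
    fi
}
```

It could be called as: delete_vm_dir bl05 /var/lib/one/datastores/109/416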

Thanks for the help,

Nuno



Nuno Serro
Coordenador
Núcleo de Infraestruturas e Telecomunicações
Departamento de Informática

Alameda da Universidade  -  Cidade Universitária
1649-004 Lisboa    PORTUGAL
T. +351 210 443 566 - Ext. 19816
E. nse...@reitoria.ulisboa.pt
www.ulisboa.pt

On 26-03-2014 16:57, Tino Vazquez wrote:
Hi,

Thanks for the info.

The hook for host error in OpenNebula 4.4 lets you define one, and only one, of the "-r" and "-d" flags:

  * -r will "delete --recreate" the VM on the failed host. This goes through the epilog_delete phase, which should erase the symlinks and launch the VM again. This is probably what you want; please come back if the problem does not go away.

  * -d will "delete" the VM on the failed host, but won't launch the VM again.

These two are mutually exclusive.

Regards,

-Tino


--
OpenNebula - Flexible Enterprise Cloud Made Simple

--
Constantino Vázquez Blanco, PhD, MSc
Senior Infrastructure Architect at C12G Labs
www.c12g.com | @C12G | es.linkedin.com/in/tinova

--
Confidentiality Warning: The information contained in this e-mail and any accompanying documents, unless otherwise expressly indicated, is confidential and privileged, and is intended solely for the person and/or entity to whom it is addressed (i.e. those identified in the "To" and "cc" box). They are the property of C12G Labs S.L.. Unauthorized distribution, review, use, disclosure, or copying of this communication, or any part thereof, is strictly prohibited and may be unlawful. If you have received this e-mail in error, please notify us immediately by e-mail at ab...@c12g.com and delete the e-mail and attachments and any copy from your system. C12G thanks you for your cooperation.


On 26 March 2014 17:48, Nuno Serro <nse...@reitoria.ulisboa.pt> wrote:
Hello Tino,

We are using version 4.4.1. If you need any details on the configuration I can provide them.



Nuno Serro

On 26-03-2014 16:44, Tino Vazquez wrote:
Hi Nuno,

What version of OpenNebula are you using?

Best,

-Tino




On 26 March 2014 17:29, Nuno Serro <nse...@reitoria.ulisboa.pt> wrote:
Hello,

We've started using a system datastore with shared storage on a clustered fs, so that we could start testing the live migrate functionality. Live migration is working as expected, but when testing fault tolerance using the host_hook, we noticed the following error:

[TM][I]:  ln -s "/dev/vg-nebula/lv-one-144" "/var/lib/one/datastores/109/393/disk.0"" failed: ln: failed to create symbolic link `/var/lib/one/datastores/109/393/disk.0': File exists
[TM][E]: Error linking /dev/vg-nebula/lv-one-144

I understand the error. Because the storage is shared between nodes, the symlinks to the images already exist when I kill one node and host_hook kicks in.
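
The failure can be reproduced outside OpenNebula. This is only a minimal sketch: it uses a scratch directory from mktemp in place of the datastore and /dev/null as a stand-in for the logical volume. The "ln -sfn" call at the end is shown for comparison only and is not what the tm driver in the log runs.

```shell
#!/bin/sh
# Scratch directory standing in for the VM directory on the datastore
# (hypothetical; the real path would be under /var/lib/one/datastores).
workdir=$(mktemp -d)
target=/dev/null            # stand-in for /dev/vg-nebula/lv-one-144
link="$workdir/disk.0"

# First deployment: the link does not exist yet, so this succeeds.
ln -s "$target" "$link"

# Redeployment on shared storage: the stale link is still there, so a
# plain "ln -s" fails with "File exists".
if ln -s "$target" "$link" 2>/dev/null; then
    second_status=0
else
    second_status=1
fi

# Forcing replacement (-f) without following an existing link (-n) succeeds.
ln -sfn "$target" "$link"
resolved=$(readlink "$link")

echo "second ln -s exit status: $second_status"
echo "link now points to: $resolved"
```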

My question is regarding the hook definition:

HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "ft/host_error.rb",
    arguments = "$ID -r",
    remote    = "no" ]

Is it possible to configure the hook to delete the VM and create it again afterwards? We tried combining the "-d" and "-r" flags, but only one flag seems to be used.

Thanks in advance.

Regards,

--

Nuno Serro

_______________________________________________
Users mailing list
Users@lists.opennebula.org
http://lists.opennebula.org/listinfo.cgi/users-opennebula.org







Thu Mar 27 11:04:11 2014 [VMM][I]: Command execution fail: 'if [ -x "/var/tmp/one/vmm/kvm/poll" ]; then /var/tmp/one/vmm/kvm/poll one-416 bl05 416 bl05; else exit 42; fi'
Thu Mar 27 11:04:11 2014 [VMM][I]: ssh: connect to host bl05 port 22: No route to host
Thu Mar 27 11:04:11 2014 [VMM][I]: ExitCode: 255
Thu Mar 27 11:04:11 2014 [VMM][E]: Error monitoring VM
Thu Mar 27 11:04:11 2014 [LCM][I]: New VM state is UNKNOWN
Thu Mar 27 11:04:12 2014 [HKM][I]: Success executing Hook: mail_unknown: .
Thu Mar 27 11:06:41 2014 [VMM][I]: Command execution fail: 'if [ -x "/var/tmp/one/vmm/kvm/poll" ]; then /var/tmp/one/vmm/kvm/poll one-416 bl05 416 bl05; else exit 42; fi'
Thu Mar 27 11:06:41 2014 [VMM][I]: ssh: connect to host bl05 port 22: No route to host
Thu Mar 27 11:06:41 2014 [VMM][I]: ExitCode: 255
Thu Mar 27 11:06:41 2014 [VMM][E]: Error monitoring VM
Thu Mar 27 11:07:35 2014 [LCM][I]: New VM state is CLEANUP.
Thu Mar 27 11:07:35 2014 [VMM][I]: Driver command for 416 cancelled
Thu Mar 27 11:07:35 2014 [VMM][I]: Command execution fail: /var/tmp/one/vmm/kvm/cancel 'one-416' 'bl05' 416 bl05
Thu Mar 27 11:07:35 2014 [VMM][I]: ssh: connect to host bl05 port 22: Connection refused
Thu Mar 27 11:07:35 2014 [VMM][I]: ExitSSHCode: 255
Thu Mar 27 11:07:35 2014 [VMM][E]: Error connecting to bl05
Thu Mar 27 11:07:35 2014 [VMM][I]: Failed to execute virtualization driver operation: cancel.
Thu Mar 27 11:07:35 2014 [VMM][I]: ssh: connect to host bl05 port 22: Connection refused
Thu Mar 27 11:07:35 2014 [VMM][I]: ExitSSHCode: 255
Thu Mar 27 11:07:35 2014 [VMM][E]: Error connecting to bl05
Thu Mar 27 11:07:35 2014 [VMM][I]: Failed to execute network driver operation: clean.
Thu Mar 27 11:07:36 2014 [VMM][I]: Command execution fail: /var/lib/one/remotes/tm/lvm/delete bl05:/var/lib/one//datastores/109/416/disk.0 416 102
Thu Mar 27 11:07:36 2014 [VMM][E]: delete: Command " DEV=$(readlink /var/lib/one/datastores/109/416/disk.0)
Thu Mar 27 11:07:36 2014 [VMM][I]: LV_NAME=$(basename $DEV)
Thu Mar 27 11:07:36 2014 [VMM][I]: 
Thu Mar 27 11:07:36 2014 [VMM][I]: # remove link
Thu Mar 27 11:07:36 2014 [VMM][I]: rm -f /var/lib/one/datastores/109/416/disk.0
Thu Mar 27 11:07:36 2014 [VMM][I]: 
Thu Mar 27 11:07:36 2014 [VMM][I]: # The following attrs will only be there if the image is non-persistent
Thu Mar 27 11:07:36 2014 [VMM][I]: # lv-one-<image> ==> persistent
Thu Mar 27 11:07:36 2014 [VMM][I]: # lv-one-<image>-<vmid>-<diskid> ==> non-persistent (clone)
Thu Mar 27 11:07:36 2014 [VMM][I]: 
Thu Mar 27 11:07:36 2014 [VMM][I]: VMID=$(echo $LV_NAME|cut -d- -f4)
Thu Mar 27 11:07:36 2014 [VMM][I]: DISKID=$(echo $LV_NAME|cut -d- -f5)
Thu Mar 27 11:07:36 2014 [VMM][I]: 
Thu Mar 27 11:07:36 2014 [VMM][I]: if [ -n "$VMID" -a -n "$DISKID" ]; then
Thu Mar 27 11:07:36 2014 [VMM][I]: # this is a cloned image
Thu Mar 27 11:07:36 2014 [VMM][I]: sudo lvremove -f $DEV
Thu Mar 27 11:07:36 2014 [VMM][I]: fi" failed: ssh: connect to host bl05 port 22: Connection refused
Thu Mar 27 11:07:36 2014 [VMM][E]: Error deleting /var/lib/one/datastores/109/416/disk.0
Thu Mar 27 11:07:36 2014 [VMM][I]: ExitCode: 255
Thu Mar 27 11:07:36 2014 [VMM][I]: Failed to execute transfer manager driver operation: tm_delete.
Thu Mar 27 11:07:36 2014 [VMM][I]: Command execution fail: /var/lib/one/remotes/tm/shared/delete bl05:/var/lib/one//datastores/109/416 416 109
Thu Mar 27 11:07:36 2014 [VMM][I]: delete: Deleting /var/lib/one/datastores/109/416
Thu Mar 27 11:07:36 2014 [VMM][E]: delete: Command "[ -e "/var/lib/one/datastores/109/416" ] || exit 0
Thu Mar 27 11:07:36 2014 [VMM][I]: 
Thu Mar 27 11:07:36 2014 [VMM][I]: times=10
Thu Mar 27 11:07:36 2014 [VMM][I]: function="rm -rf /var/lib/one/datastores/109/416"
Thu Mar 27 11:07:36 2014 [VMM][I]: 
Thu Mar 27 11:07:36 2014 [VMM][I]: count=1
Thu Mar 27 11:07:36 2014 [VMM][I]: 
Thu Mar 27 11:07:36 2014 [VMM][I]: ret=$($function)
Thu Mar 27 11:07:36 2014 [VMM][I]: error=$?
Thu Mar 27 11:07:36 2014 [VMM][I]: 
Thu Mar 27 11:07:36 2014 [VMM][I]: while [ $count -lt $times -a "$error" != "0" ]; do
Thu Mar 27 11:07:36 2014 [VMM][I]: sleep 1
Thu Mar 27 11:07:36 2014 [VMM][I]: count=$(( $count + 1 ))
Thu Mar 27 11:07:36 2014 [VMM][I]: ret=$($function)
Thu Mar 27 11:07:36 2014 [VMM][I]: error=$?
Thu Mar 27 11:07:36 2014 [VMM][I]: done
Thu Mar 27 11:07:36 2014 [VMM][I]: 
Thu Mar 27 11:07:36 2014 [VMM][I]: [ "x$error" = "x0" ]" failed: ssh: connect to host bl05 port 22: Connection refused
Thu Mar 27 11:07:36 2014 [VMM][E]: Error deleting /var/lib/one/datastores/109/416
Thu Mar 27 11:07:36 2014 [VMM][I]: ExitCode: 255
Thu Mar 27 11:07:36 2014 [VMM][I]: Failed to execute transfer manager driver operation: tm_delete.
Thu Mar 27 11:07:36 2014 [VMM][I]: Host successfully cleaned.
Thu Mar 27 11:07:36 2014 [DiM][I]: New VM state is PENDING
Thu Mar 27 11:07:54 2014 [DiM][I]: New VM state is ACTIVE.
Thu Mar 27 11:07:54 2014 [LCM][I]: New VM state is PROLOG.
Thu Mar 27 11:07:54 2014 [TM][I]: Command execution fail: /var/lib/one/remotes/tm/lvm/ln bl01:vg-nebula.lv-one-144 bl02:/var/lib/one//datastores/109/416/disk.0 416 102
Thu Mar 27 11:07:54 2014 [TM][E]: ln: Command " set -e
Thu Mar 27 11:07:54 2014 [TM][I]: mkdir -p /var/lib/one/datastores/109/416
Thu Mar 27 11:07:54 2014 [TM][I]: ln -s "/dev/vg-nebula/lv-one-144" "/var/lib/one/datastores/109/416/disk.0"" failed: ln: failed to create symbolic link `/var/lib/one/datastores/109/416/disk.0': File exists
Thu Mar 27 11:07:54 2014 [TM][E]: Error linking /dev/vg-nebula/lv-one-144
Thu Mar 27 11:07:54 2014 [TM][I]: ExitCode: 1
Thu Mar 27 11:07:54 2014 [TM][E]: Error executing image transfer script: Error linking /dev/vg-nebula/lv-one-144
Thu Mar 27 11:07:54 2014 [DiM][I]: New VM state is FAILED
Thu Mar 27 11:07:55 2014 [HKM][I]: Success executing Hook: mail_failed: .
