
I didn't stop the agent; I ran 'shutdown -h now' on host 'A' in order
to simulate a crash.

My goal is to verify that if one of my KVM hosts fails, the VMs with HA
enabled on that host 'A' will migrate to another host (in this case host 'B'),
or at least that it will be possible to do it manually.

If you need more tests, I can run them.
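
In case it helps with reviewing the logs, here is a minimal sketch for summarizing what the HA investigators reported. It relies only on the "SimpleInvestigator found VM[...]to be alive? false" line format quoted further down in this thread. The function names ha_verdicts and dead_vms are hypothetical, and I am assuming that the other investigators and "true" verdicts are logged in the same format.

```shell
# Print one "<investigator> <vm-name> <verdict>" line per HA verdict found on stdin.
ha_verdicts() {
    grep -oE '[A-Za-z]+Investigator found VM\[[A-Za-z]+[|][^]]+\] ?to be alive[?] (true|false)' |
        sed -E 's/^([A-Za-z]+) found VM\[[A-Za-z]+[|]([^]]+)\] ?to be alive[?] (true|false)$/\1 \2 \3/'
}

# Print the unique names of VMs that any investigator reported as not alive.
dead_vms() {
    ha_verdicts | awk '$3 == "false" { print $2 }' | sort -u
}
```

It could be run as `dead_vms < /var/log/cloudstack/management/catalina.out`. An empty result would mean that no investigator logged a verdict in that file.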


>> Hi!
>> My test today: I stopped another instance and changed it to the HA offering.
>> I started this instance.
>> Afterwards, I gracefully shut down its KVM host.
> Why a graceful shutdown of the KVM host? The HA process (re)starts the HA
> VMs on a new host when the current host has crashed or is not available,
> i.e. its cloudstack agent won't respond.
> If you stop the cloudstack-agent gently, the CS mgr doesn't consider this
> a crash, so HA won't start.
> What behavior do you expect?
>> and I checked the investigator processes:
>> [root@1q2 ~]# grep -i Investigator /var/log/cloudstack/management/management-server.log
>> [root@1q2 ~]# date
>> Mon Jul 20 14:39:43 UTC 2015
>> [root@1q2 ~]# ls -ltrh /var/log/cloudstack/management/management-server.log
>> -rw-rw-r--. 1 cloud cloud 14M Jul 20 14:39 /var/log/cloudstack/management/management-server.log
>> Nothing. I don't know how these processes work internally, but it seems that
>> they are not working well, agree?
>> options                       value
>> ha.investigators.exclude      nothing
>> ha.investigators.orde         SimpleInvestigator,XenServerInvestigator,KVMInvestigator,HypervInvestigator,VMwareInvestigator,PingInvestigator,ManagementIPSysVMInvestigator
>> investigate.retry.interval    60
>> Is there a way to check whether these processes are running?
>> [root@1q2 ~]# ps waux| grep -i java
>> root     11408  0.0  0.0 103252   880 pts/0    S+   14:44   0:00 grep -i java
>> cloud    24225  0.7  1.7 16982036 876412 ?     Sl   Jul16  43:48 /usr/lib/jvm/jre-1.7.0/bin/java -Djava.awt.headless=true -Dcom.sun.management.jmxremote=false -Xmx2g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/cloudstack/management/ -XX:PermSize=512M -XX:MaxPermSize=800m -Djava.security.properties=/etc/cloudstack/management/java.security.ciphers -classpath :::/etc/cloudstack/management:/usr/share/cloudstack-management/setup:/usr/share/cloudstack-management/bin/bootstrap.jar:/usr/share/cloudstack-management/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar -Dcatalina.base=/usr/share/cloudstack-management -Dcatalina.home=/usr/share/cloudstack-management -Djava.endorsed.dirs= -Djava.io.tmpdir=/usr/share/cloudstack-management/temp -Djava.util.logging.config.file=/usr/share/cloudstack-management/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager org.apache.catalina.startup.Bootstrap start
>> Thanks
>>>  Perhaps the management server didn't recognize host 3 as totally down
>>>>> (ping alive? or some quorum not OK?)
>>>>> Is the only way for the mgt server to fully accept that host 3 has a
>>>>> real problem that host 3 has been rebooted (around 12:44)?
>>>>>  The host disconnect was triggered at 12:19 on host 3. Mgmt server was
>>>> pretty sure the host is down (it was a graceful shutdown I believe), which
>>>> is why it triggered a disconnect and notified other nodes. There was no
>>>> checkhealth/checkonhost/etc. triggered; just the agent disconnected and all
>>>> listeners (ping/etc.) notified.
>>>> At this time mgmt server should have scheduled HA on all VMs running on
>>>> that host. The HA investigators would then work their way identifying
>>>> whether the VMs are still running, if they need to be fenced, etc. But this
>>>> never happened.
>>> AFAIK, stopping the cloudstack-agent service doesn't trigger the HA
>>> process for the VMs hosted by the node. It seems normal to me that the HA
>>> process doesn't start at this moment.
>>> If I wanted to start the HA process on a node, I would go to the Web UI (or
>>> cloudmonkey) and change the state of the host from Up to Maintenance
>>> (afterwards I can stop the CS-agent service if I need to, for example to
>>> reboot a node).
>>>>  Ok, so here are my findings.
>>>>> 1. Host ID 3 was shut down around 2015-07-16 12:19:09, at which point
>>>>> management server called a disconnect.
>>>>> 2. Based on the logs, it seems VM IDs 32, 18, 39 and 46 were running on
>>>>> the host.
>>>>> 3. No HA tasks for any of these VMs at this time.
>>>>> 5. Management server restarted at around 2015-07-16 12:30:20.
>>>>> 6. Host ID 3 connected back at around 2015-07-16 12:44:08.
>>>>> 7. Management server identified the missing VMs and triggered HA on
>>>>> those.
>>>>> 8. The VMs were eventually started, all 4 of them.
>>>>> I am not 100% sure why HA wasn't triggered until 2015-07-16 12:30 (#3),
>>>>> but I know that the management server restart caused it not to happen
>>>>> until the host was reconnected.
>>>>>  Perhaps the management server didn't recognize host 3 as totally down
>>>> (ping alive? or some quorum not OK?)
>>>> Is the only way for the mgt server to fully accept that host 3 has a
>>>> real problem that host 3 has been rebooted (around 12:44)?
>>>> What is the storage subsystem? CLVMd?
>>>>> No problem, Somesh, thanks for your help.
>>>>> Link to the log:
>>>>> https://dl.dropboxusercontent.com/u/6774061/management-server.log.2015-07-16.gz
>>>>> Luciano
>>>>>   How large are the management server logs dated 2015-07-16? I would like
>>>>>> to review the logs. All the information I need from that incident should
>>>>>> be in there, so I don't need any more testing.
>>>>>> Hi Somesh!
>>>>>> [root@1q2 ~]# zgrep -i -E 'SimpleIvestigator|KVMInvestigator|PingInvestigator|ManagementIPSysVMInvestigator' /var/log/cloudstack/management/management-server.log.2015-07-16.gz | tail -5000 > /tmp/management.txt
>>>>>> [root@1q2 ~]# cat /tmp/management.txt
>>>>>> 2015-07-16 12:30:45,452 DEBUG [o.a.c.s.l.r.ExtensionRegistry] (main:null) Registering extension [KVMInvestigator] in [Ha Investigators Registry]
>>>>>> 2015-07-16 12:30:45,452 DEBUG [o.a.c.s.l.r.RegistryLifecycle] (main:null) Registered com.cloud.ha.KVMInvestigator@57ceec9a
>>>>>> 2015-07-16 12:30:45,927 DEBUG [o.a.c.s.l.r.ExtensionRegistry] (main:null) Registering extension [PingInvestigator] in [Ha Investigators Registry]
>>>>>> 2015-07-16 12:30:45,928 DEBUG [o.a.c.s.l.r.ExtensionRegistry] (main:null) Registering extension [ManagementIPSysVMInvestigator] in [Ha Investigators Registry]
>>>>>> 2015-07-16 12:30:53,796 INFO  [o.a.c.s.l.r.DumpRegistry] (main:null) Registry [Ha Investigators Registry] contains [SimpleInvestigator, XenServerInvestigator, KVMInv
>>>>>> I searched this log before, but I thought it contained nothing
>>>>>> special.
>>>>>> If you want to propose another test scenario, I can run it.
>>>>>> Thanks
>>>>>>> Hi Somesh!
>>>>>>> Thanks for the help. I did it again and collected new logs:
>>>>>>> My vm_instance name is i-2-39-VM. There were some routers on KVM host
>>>>>>> 'A' (the one that I powered off now):
>>>>>>> [root@1q2 ~]# grep -i -E 'SimpleInvestigator.*false' /var/log/cloudstack/management/catalina.out
>>>>>>> INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-e2f91c9c work-3) SimpleInvestigator found VM[DomainRouter|r-4-VM]to be alive? false
>>>>>>> INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-729acf4f work-7) SimpleInvestigator found VM[User|i-23-33-VM]to be alive? false
>>>>>>> INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-a66a4941 work-8) SimpleInvestigator found VM[DomainRouter|r-36-VM]to be alive? false
>>>>>>> INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-5977245e work-10) SimpleInvestigator found VM[User|i-17-26-VM]to be alive? false
>>>>>>> INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-c7f39be0 work-9) SimpleInvestigator found VM[DomainRouter|r-32-VM]to be alive? false
>>>>>>> INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-3:ctx-ad4f5fda work-10) SimpleInvestigator found VM[DomainRouter|r-46-VM]to be alive? false
>>>>>>> INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-0:ctx-0257f5af work-11) SimpleInvestigator found VM[User|i-4-52-VM]to be alive? false
>>>>>>> INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-7ddff382 work-12) SimpleInvestigator found VM[DomainRouter|r-32-VM]to be alive? false
>>>>>>> INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-9f79917e work-13) SimpleInvestigator found VM[User|i-2-39-VM]to be alive? false
>>>>>>> KVM host 'B' agent log (where the machine would be migrated):
>>>>>>> 2015-07-16 16:58:56,537 INFO  [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-4:null) Live migration of instance i-2-39-VM initiated
>>>>>>> 2015-07-16 16:58:57,540 INFO  [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-4:null) Waiting for migration of i-2-39-VM to complete, waited 1000ms
>>>>>>> 2015-07-16 16:58:58,541 INFO  [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-4:null) Waiting for migration of i-2-39-VM to complete, waited 2000ms
>>>>>>> 2015-07-16 16:58:59,542 INFO  [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-4:null) Waiting for migration of i-2-39-VM to complete, waited 3000ms
>>>>>>> 2015-07-16 16:59:00,543 INFO  [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-4:null) Waiting for migration of i-2-39-VM to complete, waited 4000ms
>>>>>>> 2015-07-16 16:59:01,245 INFO  [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-4:null) Migration thread for i-2-39-VM is done
>>>>>>> It said done for my i-2-39-VM instance, but I can't ping this host.
>>>>>>> Luciano
>>>>>> Luciano Castro

Luciano Castro

