On 20/07/2015 18:38, Luciano Castro wrote:
Hi, I didn't stop the agent; I did a 'shutdown -h now' on host 'A' in order to simulate a crash.
You don't simulate a crash like this. When you run "shutdown -h now", you perform a clean shutdown (all services are stopped cleanly): the cloudstack-agent sends a signal to the CS mgr to indicate that it is shutting itself down (the service, not the host). The CS mgr doesn't consider this event a crash (until you restart the host / cs-agent service).
On my test environment (CS 4.5.1/Ubuntu 14.04/NFS), these lines in the CS mgr logs indicate the "clean" stop after I run the shutdown command:
2015-07-20 20:07:18,894 DEBUG [c.c.a.m.AgentManagerImpl] (AgentManager-Handler-7:null) SeqA 4--1: Processing Seq 4--1: { Cmd , MgmtId: -1, via: 4, Ver: v1, Flags: 111, [{"com.cloud.agent.api.ShutdownCommand":{"reason":"sig.kill","wait":0}}] }
2015-07-20 20:07:18,894 INFO [c.c.a.m.AgentManagerImpl] (AgentManager-Handler-7:null) Host 4 has informed us that it is shutting down with reason sig.kill and detail null
2015-07-20 20:07:18,896 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-fb37c248) Host 4 is disconnecting with event ShutdownRequested
2015-07-20 20:07:18,898 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-fb37c248) The next status of agent 4is Disconnected, current status is Up
2015-07-20 20:07:18,898 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-fb37c248) Deregistering link for 4 with state Disconnected
2015-07-20 20:07:18,898 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-fb37c248) Remove Agent : 4
2015-07-20 20:07:18,899 DEBUG [c.c.a.m.ConnectedAgentAttache] (AgentTaskPool-1:ctx-fb37c248) Processing Disconnect.
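A quick way to check which path you hit is to filter the management log for the agent-status event. A minimal sketch, using a sample file built from the lines above (the /tmp path is illustrative; on a real deployment point the grep at /var/log/cloudstack/management/management-server.log):

```shell
# Write two sample status lines (copied from the excerpt above) to a
# temporary file so the filter can be demonstrated end-to-end.
cat > /tmp/ms-sample.log <<'EOF'
2015-07-20 20:07:18,894 INFO [c.c.a.m.AgentManagerImpl] (AgentManager-Handler-7:null) Host 4 has informed us that it is shutting down with reason sig.kill and detail null
2015-07-20 20:07:18,896 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-fb37c248) Host 4 is disconnecting with event ShutdownRequested
EOF

# "ShutdownRequested" is the clean-stop path (no HA); an event of
# "HostDown" instead would mean the host was declared dead.
grep -E 'ShutdownRequested|HostDown' /tmp/ms-sample.log
```

If the filter only ever shows ShutdownRequested, HA was never going to fire for that host.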
No HA action is started by the CS mgr; the HA VMs on the host are still marked "running" in the UI (until the restart of the host).
My goal is to verify that if one of my KVM hosts fails, the VMs with HA enabled from that host 'A' will migrate to another host (in this case host 'B'), or, at least, that it will be possible to do it manually.
If you want to test HA, the better way (for me) is to force a Linux kernel crash with this command on the host:

(BE CAREFUL)
# echo "c" > /proc/sysrq-trigger

The host will freeze immediately. Look at the mgr logs; you can see the start of the HA process (example for 1 host with only 1 HA VM on it):
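If you script this, a minimal sketch of a safety check around the trigger; the actual crash line is deliberately left commented out, since running it freezes the host on the spot, and whether the 'c' function is permitted at all depends on the kernel.sysrq setting:

```shell
# Read the current magic-SysRq policy: 1 enables all functions, 0 disables
# them, and other values are a bitmask of allowed function groups.
sysrq=$(cat /proc/sys/kernel/sysrq 2>/dev/null || echo "unavailable")
echo "kernel.sysrq = ${sysrq}"

# The destructive part, commented out on purpose -- uncomment ONLY on a
# disposable test hypervisor:
#   echo c > /proc/sysrq-trigger   # immediate kernel crash, host freezes
```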
Wait some time (2~3 minutes):

2015-07-20 20:21:05,906 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-2:ctx-96219030) Investigating why host 4 has disconnected with event PingTimeout
2015-07-20 20:21:05,908 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-2:ctx-96219030) checking if agent (4) is alive
[...]
2015-07-20 20:21:36,498 DEBUG [c.c.c.CapacityManagerImpl] (CapacityChecker:ctx-8c096a7a) No need to calibrate cpu capacity, host:1 usedCpu: 6500 reservedCpu: 0
2015-07-20 20:21:36,498 DEBUG [c.c.c.CapacityManagerImpl] (CapacityChecker:ctx-8c096a7a) No need to calibrate memory capacity, host:1 usedMem: 7247757312 reservedMem: 0
2015-07-20 20:21:36,509 DEBUG [c.c.c.CapacityManagerImpl] (CapacityChecker:ctx-8c096a7a) Found 1 VMs on host 4
2015-07-20 20:21:36,519 DEBUG [c.c.c.CapacityManagerImpl] (CapacityChecker:ctx-8c096a7a) Found 2 VM, not running on host 4
2015-07-20 20:21:36,526 DEBUG [c.c.c.CapacityManagerImpl] (CapacityChecker:ctx-8c096a7a) No need to calibrate cpu capacity, host:4 usedCpu: 1000 reservedCpu: 0
2015-07-20 20:21:36,526 DEBUG [c.c.c.CapacityManagerImpl] (CapacityChecker:ctx-8c096a7a) No need to calibrate memory capacity, host:4 usedMem: 1073741824 reservedMem: 0
[...]
2015-07-20 20:22:05,906 INFO [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-f98b00cb) Found the following agents behind on ping: [4]
2015-07-20 20:22:05,909 DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-f98b00cb) Ping timeout for host 4, do invstigation
2015-07-20 20:22:05,911 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-3:ctx-00c2c8da) Investigating why host 4 has disconnected with event PingTimeout
2015-07-20 20:22:05,912 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-3:ctx-00c2c8da) checking if agent (4) is alive
[...]
2015-07-20 20:22:45,915 WARN [c.c.a.m.AgentAttache] (AgentTaskPool-2:ctx-96219030) Seq 4-4609434218613702688: Timed out on Seq 4-4609434218613702688: { Cmd , MgmtId: 203050744474923, via: 4(devcloudnode2), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
2015-07-20 20:22:45,915 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-2:ctx-96219030) Seq 4-4609434218613702688: Cancelling.
2015-07-20 20:22:45,915 WARN [c.c.a.m.AgentManagerImpl] (AgentTaskPool-2:ctx-96219030) Operation timed out: Commands 4609434218613702688 to Host 4 timed out after 100
2015-07-20 20:22:45,917 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-2:ctx-96219030) SimpleInvestigator unable to determine the state of the host. Moving on.
2015-07-20 20:22:45,918 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-2:ctx-96219030) XenServerInvestigator unable to determine the state of the host. Moving on.
[...]
2015-07-20 20:22:45,967 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-2:ctx-96219030) KVMInvestigator was able to determine host 4 is in Down
2015-07-20 20:22:45,967 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-2:ctx-96219030) The state determined is Down
2015-07-20 20:22:45,967 ERROR [c.c.a.m.AgentManagerImpl] (AgentTaskPool-2:ctx-96219030) Host is down: 4-devcloudnode2. Starting HA on the VMs
2015-07-20 20:22:45,968 WARN [o.a.c.alerts] (AgentTaskPool-2:ctx-96219030) alertType:: 7 // dataCenterId:: 1 // podId:: 1 // clusterId:: null // message:: Host disconnected, 4
2015-07-20 20:22:45,974 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-2:ctx-96219030) Host 4 is disconnecting with event HostDown
2015-07-20 20:22:45,976 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-2:ctx-96219030) The next status of agent 4is Down, current status is Up
2015-07-20 20:22:45,976 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-2:ctx-96219030) Deregistering link for 4 with state Down
2015-07-20 20:22:45,976 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-2:ctx-96219030) Remove Agent : 4
2015-07-20 20:22:45,976 DEBUG [c.c.a.m.ConnectedAgentAttache] (AgentTaskPool-2:ctx-96219030) Processing Disconnect.
[...]2015-07-20 20:22:45,990 DEBUG [c.c.n.NetworkUsageManagerImpl] (AgentTaskPool-2:ctx-96219030) Disconnected called on 4 with status Down
[...]2015-07-20 20:22:45,998 WARN [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-2:ctx-96219030) Scheduling restart for VMs on host 4-devcloudnode2
[...]
2015-07-20 20:22:46,007 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-2:ctx-96219030) Notifying HA Mgr of to restart vm 67-i-2-67-VM
2015-07-20 20:22:46,014 INFO [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-2:ctx-96219030) Schedule vm for HA: VM[User|i-2-67-VM]
2015-07-20 20:22:46,016 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-2:ctx-96219030) Notifying other nodes of to disconnect
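To follow this sequence on your own setup, the milestones to grep for are the investigator verdict, the "Host is down" decision, and the HA scheduling line. A self-contained sketch using sample lines copied from the walkthrough above (replace /tmp/ms-ha.log with your real management-server.log):

```shell
# Sample milestones taken from the log excerpt above.
cat > /tmp/ms-ha.log <<'EOF'
2015-07-20 20:22:45,967 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-2:ctx-96219030) KVMInvestigator was able to determine host 4 is in Down
2015-07-20 20:22:45,967 ERROR [c.c.a.m.AgentManagerImpl] (AgentTaskPool-2:ctx-96219030) Host is down: 4-devcloudnode2. Starting HA on the VMs
2015-07-20 20:22:46,014 INFO [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-2:ctx-96219030) Schedule vm for HA: VM[User|i-2-67-VM]
EOF

# Pull out the three milestones seen in the walkthrough; in a real crash
# they should appear in this order within a couple of minutes.
grep -E 'Investigator was able|Starting HA|Schedule vm for HA' /tmp/ms-ha.log
```

If the "Schedule vm for HA" line never shows up after the host is declared Down, the HA workers were never handed the VM.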
Etc.... (then the start of the VM i-2-67-VM on another host)

Milamber
If you need more tests, I can do it. Thanks.

On Mon, Jul 20, 2015 at 12:16 PM, Milamber <milam...@apache.org> wrote:

On 20/07/2015 15:44, Luciano Castro wrote:

Hi! My test today: I stopped another instance and changed it to an HA offering. I started this instance. After, I gracefully shut down its KVM host.

Why a graceful shutdown of the KVM host? The HA process is there to (re)start the HA VMs on a new host when the current host has crashed or is not available, i.e. its cloudstack agent won't respond. If you stopped the cloudstack-agent gently, the CS mgr doesn't consider this a crash, so HA won't start. What behavior do you expect?

and I checked the investigators process:

[root@1q2 ~]# grep -i Investigator /var/log/cloudstack/management/management-server.log
[root@1q2 ~]# date
Mon Jul 20 14:39:43 UTC 2015
[root@1q2 ~]# ls -ltrh /var/log/cloudstack/management/management-server.log
-rw-rw-r--. 1 cloud cloud 14M Jul 20 14:39 /var/log/cloudstack/management/management-server.log

Nothing. I don't know how these processes work internally, but it seems that they are not working well, agree?

option / value:
ha.investigators.exclude: nothing
ha.investigators.order: SimpleInvestigator,XenServerInvestigator,KVMInvestigator,HypervInvestigator,VMwareInvestigator,PingInvestigator,ManagementIPSysVMInvestigator
investigate.retry.interval: 60

Is there a way to check if these processes are running?

[root@1q2 ~]# ps waux | grep -i java
root 11408 0.0 0.0 103252 880 pts/0 S+ 14:44 0:00 grep -i java
cloud 24225 0.7 1.7 16982036 876412 ?
Sl Jul16 43:48 /usr/lib/jvm/jre-1.7.0/bin/java -Djava.awt.headless=true -Dcom.sun.management.jmxremote=false -Xmx2g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/cloudstack/management/ -XX:PermSize=512M -XX:MaxPermSize=800m -Djava.security.properties=/etc/cloudstack/management/java.security.ciphers -classpath :::/etc/cloudstack/management:/usr/share/cloudstack-management/setup:/usr/share/cloudstack-management/bin/bootstrap.jar:/usr/share/cloudstack-management/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar -Dcatalina.base=/usr/share/cloudstack-management -Dcatalina.home=/usr/share/cloudstack-management -Djava.endorsed.dirs= -Djava.io.tmpdir=/usr/share/cloudstack-management/temp -Djava.util.logging.config.file=/usr/share/cloudstack-management/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager org.apache.catalina.startup.Bootstrap start

Thanks

On Sat, Jul 18, 2015 at 1:53 PM, Milamber <milam...@apache.org> wrote:

On 17/07/2015 22:26, Somesh Naidu wrote:

Perhaps the management server doesn't recognize host 3 as totally down (ping alive? or some quorum not OK)? The only way for the mgt server to fully accept that host 3 has a real problem is that host 3 has been rebooted (around 12:44)?

The host disconnect was triggered at 12:19 on host 3. Mgmt server was pretty sure the host is down (it was a graceful shutdown I believe) which is why it triggered a disconnect and notified other nodes. There was no checkhealth/checkonhost/etc. triggered; just the agent disconnected and all listeners (ping/etc.) notified. At this time mgmt server should have scheduled HA on all VMs running on that host. The HA investigators would then work their way through identifying whether the VMs are still running, if they need to be fenced, etc. But this never happened.

AFAIK, stopping the cloudstack-agent service doesn't allow the HA process to start for the VMs hosted by the node. It seems normal to me that the HA process doesn't start at this moment.
If I want to start the HA process on a node, I go to the Web UI (or cloudmonkey) to change the state of the host from Up to Maintenance. (After that, I can stop the CS-agent service if I need to, for example to reboot a node.)

Regards,
Somesh

-----Original Message-----
From: Milamber [mailto:milam...@apache.org]
Sent: Friday, July 17, 2015 6:01 PM
To: users@cloudstack.apache.org
Subject: Re: HA feature - KVM - CloudStack 4.5.1

On 17/07/2015 21:23, Somesh Naidu wrote:

Ok, so here are my findings:

1. Host ID 3 was shut down around 2015-07-16 12:19:09, at which point the management server called a disconnect.
2. Based on the logs, it seems VM IDs 32, 18, 39 and 46 were running on the host.
3. No HA tasks for any of these VMs at this time.
5. Management server restarted at around 2015-07-16 12:30:20.
6. Host ID 3 connected back at around 2015-07-16 12:44:08.
7. Management server identified the missing VMs and triggered HA on those.
8. The VMs were eventually started, all 4 of them.

I am not 100% sure why HA wasn't triggered until 2015-07-16 12:30 (#3), but I know that the management server restart caused it not to happen until the host was reconnected.

Perhaps the management server doesn't recognize host 3 as totally down (ping alive? or some quorum not OK)? The only way for the mgt server to fully accept that host 3 has a real problem is that host 3 has been rebooted (around 12:44)?

What is the storage subsystem? CLVMd?

Regards,
Somesh

-----Original Message-----
From: Luciano Castro [mailto:luciano.cas...@gmail.com]
Sent: Friday, July 17, 2015 12:13 PM
To: users@cloudstack.apache.org
Subject: Re: HA feature - KVM - CloudStack 4.5.1

No problem Somesh, thanks for your help.

Link of log:
https://dl.dropboxusercontent.com/u/6774061/management-server.log.2015-07-16.gz

Luciano

On Fri, Jul 17, 2015 at 12:00 PM, Somesh Naidu <somesh.na...@citrix.com> wrote:

How large are the management server logs dated 2015-07-16? I would like to review the logs.
All the information I need from that incident should be in there, so I don't need any more testing.

Regards,
Somesh

-----Original Message-----
From: Luciano Castro [mailto:luciano.cas...@gmail.com]
Sent: Friday, July 17, 2015 7:58 AM
To: users@cloudstack.apache.org
Subject: Re: HA feature - KVM - CloudStack 4.5.1

Hi Somesh!

[root@1q2 ~]# zgrep -i -E 'SimpleIvestigator|KVMInvestigator|PingInvestigator|ManagementIPSysVMInvestigator' /var/log/cloudstack/management/management-server.log.2015-07-16.gz | tail -5000 > /tmp/management.txt
[root@1q2 ~]# cat /tmp/management.txt
2015-07-16 12:30:45,452 DEBUG [o.a.c.s.l.r.ExtensionRegistry] (main:null) Registering extension [KVMInvestigator] in [Ha Investigators Registry]
2015-07-16 12:30:45,452 DEBUG [o.a.c.s.l.r.RegistryLifecycle] (main:null) Registered com.cloud.ha.KVMInvestigator@57ceec9a
2015-07-16 12:30:45,927 DEBUG [o.a.c.s.l.r.ExtensionRegistry] (main:null) Registering extension [PingInvestigator] in [Ha Investigators Registry]
2015-07-16 12:30:45,928 DEBUG [o.a.c.s.l.r.ExtensionRegistry] (main:null) Registering extension [ManagementIPSysVMInvestigator] in [Ha Investigators Registry]
2015-07-16 12:30:53,796 INFO [o.a.c.s.l.r.DumpRegistry] (main:null) Registry [Ha Investigators Registry] contains [SimpleInvestigator, XenServerInvestigator, KVMInv

I searched this log before, but I thought it had nothing special in it. If you want to propose another test scenario to me, I can do it.

Thanks

On Thu, Jul 16, 2015 at 7:27 PM, Somesh Naidu <somesh.na...@citrix.com> wrote:

What about the other investigators, specifically "KVMInvestigator, PingInvestigator"? Do they report the VMs as alive=false too?

Also, it is recommended that you look at the management-server.log instead of catalina.out (for one, the latter doesn't have timestamps).
Regards,
Somesh

-----Original Message-----
From: Luciano Castro [mailto:luciano.cas...@gmail.com]
Sent: Thursday, July 16, 2015 1:14 PM
To: users@cloudstack.apache.org
Subject: Re: HA feature - KVM - CloudStack 4.5.1

Hi Somesh! Thanks for the help. I did it again and collected new logs. My vm_instance name is i-2-39-VM. There were some routers on KVM host 'A' (the one that I powered off now):

[root@1q2 ~]# grep -i -E 'SimpleInvestigator.*false' /var/log/cloudstack/management/catalina.out
INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-e2f91c9c work-3) SimpleInvestigator found VM[DomainRouter|r-4-VM]to be alive? false
INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-729acf4f work-7) SimpleInvestigator found VM[User|i-23-33-VM]to be alive? false
INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-a66a4941 work-8) SimpleInvestigator found VM[DomainRouter|r-36-VM]to be alive? false
INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-5977245e work-10) SimpleInvestigator found VM[User|i-17-26-VM]to be alive? false
INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-c7f39be0 work-9) SimpleInvestigator found VM[DomainRouter|r-32-VM]to be alive? false
INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-3:ctx-ad4f5fda work-10) SimpleInvestigator found VM[DomainRouter|r-46-VM]to be alive? false
INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-0:ctx-0257f5af work-11) SimpleInvestigator found VM[User|i-4-52-VM]to be alive? false
INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-7ddff382 work-12) SimpleInvestigator found VM[DomainRouter|r-32-VM]to be alive? false
INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-9f79917e work-13) SimpleInvestigator found VM[User|i-2-39-VM]to be alive? false

KVM host 'B' agent log (where the machine would be migrated to):

2015-07-16 16:58:56,537 INFO [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-4:null) Live migration of instance i-2-39-VM initiated
2015-07-16 16:58:57,540 INFO [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-4:null) Waiting for migration of i-2-39-VM to complete, waited 1000ms
2015-07-16 16:58:58,541 INFO [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-4:null) Waiting for migration of i-2-39-VM to complete, waited 2000ms
2015-07-16 16:58:59,542 INFO [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-4:null) Waiting for migration of i-2-39-VM to complete, waited 3000ms
2015-07-16 16:59:00,543 INFO [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-4:null) Waiting for migration of i-2-39-VM to complete, waited 4000ms
2015-07-16 16:59:01,245 INFO [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-4:null) Migration thread for i-2-39-VM is done

It said done for my i-2-39-VM instance, but I can't ping this host.

Luciano

--
Luciano Castro