Hi Vadim,

It could also be XenHA, but I remember you already said it is off. Did you
check the hardware health?

I'd recommend turning XenHA on, as otherwise you will not have automatic
recovery in case of a failure.

Regards,
Remi
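For reference, a basic hardware health check on a XenServer host might look
like the following. These commands are generic and not from the thread;
ipmitool assumes the host has a BMC and the tool installed, and sda is only
an example device.

-------------------------
# IPMI system event log -- hardware faults (ECC, PSU, temperature) land here
ipmitool sel elist

# Kernel messages in dom0 -- look for machine checks and I/O errors
dmesg | grep -iE 'error|fail|mce'

# Disk health, if smartmontools happens to be installed (sda is an example)
smartctl -H /dev/sda
-------------------------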
On 14/09/15 15:09, "Vadim Kimlaychuk" <va...@kickcloud.net> wrote:

>Remi,
>
>    I have analyzed the script xenheartbeat.sh and it seems to be useless,
>because it relies on the file /opt/cloud/bin/heartbeat, which has zero
>length. It is not set up during installation, and there is no such step in
>the documentation. Logically, the admin must run "setup_heartbeat_file.sh"
>to make the heartbeat work. If this file has zero length, the script checks
>nothing and logs this message every minute:
>
>Sep 14 04:43:53 xcp1 heartbeat: Problem with heartbeat, no iSCSI or NFS
>mount defined in /opt/cloud/bin/heartbeat!
>
>    That means it can't reboot the host, because it doesn't check anything.
>Isn't that so?
>
>    Is there any other script that may reboot the host when there is a
>problem with storage?
>
>Vadim.
>
>On 2015-09-14 15:40, Remi Bergsma wrote:
>
>> Hi Vadim,
>>
>> This does indeed reboot a box, once storage fails:
>> echo b > /proc/sysrq-trigger
>>
>> Removing it doesn't make sense, as there are serious issues once you
>> hit this code. I'd recommend making sure the storage is reliable.
>>
>> Regards, Remi
>>
>> On 14/09/15 08:13, "Vadim Kimlaychuk" <va...@kickcloud.net> wrote:
>>
>> Remi,
>>
>> I have analyzed the situation and found that storage may be causing the
>> host reboots, as you wrote earlier in this thread. The reason: we run
>> offline backups from the NFS server at exactly the time the hosts fail.
>> Basically, we copy all files in primary and secondary storage offsite.
>> This process starts precisely at 00:00, and somewhere around 00:10-00:40
>> the XenServer host starts to reboot.
>>
>> Reading old threads, I found that /opt/cloud/bin/xenheartbeat.sh may do
>> this job. In particular, the last lines of my xenheartbeat.sh are:
>>
>> -------------------------
>> /usr/bin/logger -t heartbeat "Problem with $hb: not reachable for
>> $(($(date +%s) - $lastdate)) seconds, rebooting system!"
>> echo b > /proc/sysrq-trigger
>> -------------------------
>>
>> The only unclear moment is that I don't have such a line in my logs.
>> Could the command "echo b > /proc/sysrq-trigger" prevent the message from
>> being written to the syslog file? The documentation says it reboots
>> immediately, without synchronizing the FS. It seems there is no other
>> place that could do it, but I am still not 100% sure.
>>
>> Vadim.
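To answer the question above: yes, that is plausible. "echo b" via sysrq
reboots the machine immediately, without syncing or unmounting filesystems,
so the final logger line can sit in a buffer and never reach the on-disk
syslog file. For anyone following along, here is a minimal sketch of the
kind of check being discussed -- a paraphrase, not the actual
xenheartbeat.sh shipped with CloudStack; the marker filename, the threshold,
and the one-mount-per-line format of /opt/cloud/bin/heartbeat are
assumptions for illustration only.

-------------------------
#!/bin/bash
# Sketch only: paraphrases the fencing logic discussed above.
hbfile=/opt/cloud/bin/heartbeat
timeout=180                            # example threshold, in seconds

# An empty heartbeat file means there is nothing to monitor -- exactly the
# situation described above: the script logs a complaint and does nothing.
if [ ! -s "$hbfile" ]; then
    /usr/bin/logger -t heartbeat \
        "Problem with heartbeat, no iSCSI or NFS mount defined in $hbfile!"
    exit 0
fi

lastdate=$(date +%s)
while true; do
    while read -r hb; do
        # Refresh a marker on the shared mount (filename is made up here).
        # Note: on a hung NFS mount, touch may block rather than fail; the
        # real script's handling of that case differs.
        if touch "$hb/hb-marker" 2>/dev/null; then
            lastdate=$(date +%s)       # storage still writable
        elif [ $(($(date +%s) - lastdate)) -gt "$timeout" ]; then
            /usr/bin/logger -t heartbeat "Problem with $hb: not reachable for $(($(date +%s) - lastdate)) seconds, rebooting system!"
            sync; sleep 1              # give the log line a chance to hit disk
            echo b > /proc/sysrq-trigger   # immediate reboot, no FS sync
        fi
    done < "$hbfile"
    sleep 60
done
-------------------------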
>> On 2015-09-13 18:26, Vadim Kimlaychuk wrote:
>>
>> Remi,
>>
>> Thank you for the hint. At least one problem is identified:
>>
>> [root@xcp1 ~]# xe pool-list params=all | grep -E "ha-enabled|ha-config"
>>              ha-enabled ( RO): false
>>        ha-configuration ( RO):
>>
>> Where should I look for storage errors? The host? The management server?
>> I have checked /var/log/messages and there were only regular messages; no
>> "fence" or "reboot" commands.
>>
>> I have a dedicated NFS server that should be accessible all the time (at
>> least its NIC interfaces are bonded in master-slave mode). The server is
>> used for both primary and secondary storage.
>>
>> Thanks,
>>
>> Vadim.
>>
>> On 2015-09-13 14:38, Remi Bergsma wrote:
>>
>> Hi Vadim,
>>
>> Not sure what the problem is, although I do know that when shared storage
>> is used, both CloudStack and XenServer will fence (reboot) the box to
>> prevent corruption in case access to the network or the storage is not
>> possible. What storage do you use?
>>
>> What does this return on a XenServer?:
>> xe pool-list params=all | grep -E "ha-enabled|ha-config"
>>
>> HA should be on, or else a hypervisor crash will not recover properly.
>>
>> If you search the logs for "Fence" or "reboot", does anything come back?
>>
>> The logs you mention are nothing to worry about.
>>
>> Can you tell us in some more detail what happens and how we can
>> reproduce it?
>>
>> Regards,
>> Remi
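Concretely, those checks could look like this on a XenServer 6.x host (log
paths are the usual defaults; /var/log/SMlog is the storage manager log,
which is also a reasonable answer to "where should I look for storage
errors"):

-------------------------
# Any fencing or reboot traces on the hypervisor?
grep -iE 'fence|reboot' /var/log/messages /var/log/xensource.log

# Storage-level errors live in the storage manager log
grep -iE 'error|fail' /var/log/SMlog

# Pool HA status, as suggested above
xe pool-list params=all | grep -E "ha-enabled|ha-config"
-------------------------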
>> -----Original Message-----
>> From: Vadim Kimlaychuk [mailto:va...@kickcloud.net]
>> Sent: Sunday, 13 September 2015 9:32
>> To: users@cloudstack.apache.org
>> Cc: Remi Bergsma
>> Subject: Re: CS 4.5.2: all hosts reboot after 3 days at production
>>
>> Hello Remi,
>>
>> This issue has nothing to do with CS 4.5.2. We got a host reboot after
>> precisely one week on the previous version of CS (4.5.1), and that
>> version had been running without a restart for 106 days before that. So
>> it is not a software issue.
>>
>> What really makes me unhappy is that an accidental host reboot made the
>> entire cluster unusable. The CloudStack management server was up and
>> running, the second cluster node was up and running all the time, and
>> VMs were transferred to the second host, but the system VMs were not
>> rebooted properly by CS and half of the network was down. The SSVM and
>> CPVM were in "disconnected" status. Client VMs were up but couldn't
>> connect to storage, because the VRs were offline. A complete mess.
>>
>> I have used planned maintenance mode before and the cluster worked just
>> perfectly; we didn't have a single second of downtime. But with an
>> accidental reboot there is no benefit from clusterization. :(
>>
>> Vadim.
>>
>> On 2015-09-08 09:35, Vadim Kimlaychuk wrote:
>>
>> Hello Remi,
>>
>> First of all, I don't have a /var/log/xha.log file. I have examined the
>> logs in detail and haven't found any trace of the heartbeat failing. The
>> only serious problem I found in the management logs before the restart
>> is this error, repeated many times:
>>
>> -------------------------------------------------
>> 2015-09-06 00:47:21,591 DEBUG [c.c.a.m.AgentManagerImpl] (RouterMonitor-1:ctx-2d67d422) Details from executing class com.cloud.agent.api.NetworkUsageCommand: Exception: java.lang.Exception
>> Message: vpc network usage plugin call failed
>> Stack: java.lang.Exception: vpc network usage plugin call failed
>>     at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.VPCNetworkUsage(XenServer56Resource.java:172)
>>     at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.execute(XenServer56Resource.java:195)
>>     at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:62)
>>     at com.cloud.hypervisor.xenserver.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:87)
>>     at com.cloud.hypervisor.xenserver.resource.XenServer620SP1Resource.executeRequest(XenServer620SP1Resource.java:65)
>>     at com.cloud.agent.manager.DirectAgentAttache$Task.runInContext(DirectAgentAttache.java:302)
>>     ...
>> -------------------------------------------------
>>
>> And just a couple of seconds before the XCP2 host restart:
>>
>> -------------------------------------------------
>> 2015-09-06 00:48:27,884 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgentCronJob-83:ctx-ff822baf) Ping from 2(xcp1)
>> 2015-09-06 00:48:27,884 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) Process host VM state report from ping process. host: 2
>> 2015-09-06 00:48:27,904 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) Process VM state report. host: 2, number of records in report: 6
>> 2015-09-06 00:48:27,904 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 85, power state: PowerOn
>> 2015-09-06 00:48:27,907 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 85
>> 2015-09-06 00:48:27,907 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 1, power state: PowerOn
>> 2015-09-06 00:48:27,910 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 1
>> 2015-09-06 00:48:27,910 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 2, power state: PowerOn
>> 2015-09-06 00:48:27,913 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 2
>> 2015-09-06 00:48:27,913 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 82, power state: PowerOn
>> 2015-09-06 00:48:27,916 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 82
>> 2015-09-06 00:48:27,916 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 94, power state: PowerOn
>> 2015-09-06 00:48:27,919 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 94
>> 2015-09-06 00:48:27,919 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 90, power state: PowerOn
>> 2015-09-06 00:48:27,922 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 90
>> 2015-09-06 00:48:27,928 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) Done with process of VM state report. host: 2
>> 2015-09-06 00:48:27,940 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgentCronJob-154:ctx-2e8a5911) Ping from 1(xcp2)
>> 2015-09-06 00:48:27,940 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) Process host VM state report from ping process. host: 1
>> 2015-09-06 00:48:27,951 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) Process VM state report. host: 1, number of records in report: 4
>> 2015-09-06 00:48:27,951 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm id: 100, power state: PowerOn
>> 2015-09-06 00:48:27,954 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 100
>> 2015-09-06 00:48:27,954 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm id: 33, power state: PowerOn
>> 2015-09-06 00:48:27,957 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 33
>> 2015-09-06 00:48:27,957 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm id: 89, power state: PowerOn
>> 2015-09-06 00:48:27,960 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 89
>> 2015-09-06 00:48:27,961 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm id: 88, power state: PowerOn
>> 2015-09-06 00:48:27,963 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 88
>> 2015-09-06 00:48:27,968 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) Done with process of VM state report. host: 1
>> -------------------------------------------------
>>
>> Vadim.
>>
>> On 2015-09-07 23:18, Remi Bergsma wrote:
>>
>> Hi Vadim,
>>
>> What kind of storage do you use? Can you show /var/log/xha.log (I think
>> that is the name), please? It could be xen-ha that fences the box if the
>> heartbeat cannot be written.
>>
>> You suggest it is CloudStack. Did you see anything in the mgt logs?
>>
>> Regards, Remi
>>
>> Sent from my iPhone
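Since ha-enabled is false on this pool (as shown earlier in the thread),
xen-ha cannot be the fencing culprit and there is no /var/log/xha.log to
show. For completeness, enabling XenServer HA along the lines Remi
recommends would look roughly like this; the SR UUID is a placeholder for a
shared SR that will hold the HA heartbeat:

-------------------------
# Find a shared SR to carry the HA statefile/heartbeat
xe sr-list shared=true

# Enable HA on the pool (UUID below is a placeholder)
xe pool-ha-enable heartbeat-sr-uuids=<sr-uuid>

# Verify
xe pool-list params=ha-enabled
-------------------------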
>> On 07 Sep 2015, at 08:26, Vadim Kimlaychuk <va...@kickcloud.net> wrote:
>>
>> Hello all,
>>
>> I have experienced an accidental cluster reboot three days after
>> updating to CS 4.5.2. The cluster is XenServer 6.5 with SP1. The reboot
>> started from the slave node and then hit the master. Syslog on the slave
>> shows only this:
>>
>> Sep  6 00:47:05 xcp2 last message repeated 3 times
>> Sep  6 00:47:15 xcp2 xenstored: D12 write data/meminfo_free 713732
>> Sep  6 00:47:15 xcp2 xenstored: A1564203 w event /local/domain/12/data/meminfo_free /local/domain/12/data/meminfo_free
>> Sep  6 00:47:15 xcp2 xenstored: D12 write data/updated Sun Sep 6 00:48:55 EEST 2015
>> Sep  6 00:47:15 xcp2 xenstored: A6 w event /local/domain/12/data/updated /local/domain/12/data/updated
>> Sep  6 00:47:15 xcp2 xenstored: A10 w event /local/domain/12/data/updated /local/domain/12/data/updated
>> Sep  6 00:47:26 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 (xid=0x304ae9dc)
>> Sep  6 00:47:27 xcp2 xapi: [ info|xcp2|462044 INET 0.0.0.0:80|dispatch:host.call_plugin D:7593b578fada|taskhelper] task host.call_plugin R:ddd3cc399f86 forwarded (trackid=407f6adaa118a34f19eb1e29cd68a0e8)
>> Sep  6 00:47:36 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 (xid=0x304ae9dc)
>> Sep  6 00:48:18 xcp2 last message repeated 4 times
>> Sep  6 00:48:25 xcp2 xenstored: D1 write data/meminfo_free 1740496
>> Sep  6 00:48:25 xcp2 xenstored: A1564203 w event /local/domain/1/data/meminfo_free /local/domain/1/data/meminfo_free
>> Sep  6 00:48:25 xcp2 xenstored: D1 write data/updated Sat Sep 5 21:50:07 EEST 2015
>> Sep  6 00:48:25 xcp2 xenstored: A6 w event /local/domain/1/data/updated /local/domain/1/data/updated
>> Sep  6 00:48:25 xcp2 xenstored: A10 w event /local/domain/1/data/updated /local/domain/1/data/updated
>> Sep  6 00:48:26 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 (xid=0x304ae9dc)
>> Sep  6 00:48:27 xcp2 xapi: [ info|xcp2|462044 INET 0.0.0.0:80|dispatch:host.call_plugin D:f2c8987bc0ff|taskhelper] task host.call_plugin R:b62d2d4f58eb forwarded (trackid=e3d4ea00c96194830a7dbbfc35563a3c)
>> Sep  6 00:48:38 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 (xid=0x304ae9dc)
>> Sep 06 00:48:48 xcp2 syslogd 1.4.1: restart.
>> Sep  6 00:48:48 xcp2 kernel: klogd 1.4.1, log source = /proc/kmsg started.
>> Sep  6 00:48:48 xcp2 kernel: [    0.000000] Initializing cgroup subsys cpuset
>> Sep  6 00:48:48 xcp2 kernel: [    0.000000] Initializing cgroup subsys cpu
>> Sep  6 00:48:48 xcp2 kernel: [    0.000000] Initializing cgroup subsys cpuacct
>> Sep  6 00:48:48 xcp2 kernel: [    0.000000] Linux version 3.10.0+2
>>
>> Can anyone help with the diagnostics?
>>
>> Thank you,
>>
>> Vadim.
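A closing note on the syslog excerpt above: it cuts straight from normal
activity to "syslogd 1.4.1: restart." with no shutdown messages, which is
consistent with a sysrq-triggered reboot whose final log line was never
flushed to disk. Forwarding syslog to a remote collector is one way to
capture such evidence next time; a minimal example for the classic syslogd
running in XenServer 6.x dom0 (the IP address is a placeholder):

-------------------------
# /etc/syslog.conf -- also forward everything to a remote collector
*.*    @192.0.2.10

# Then restart the syslog service on the host
service syslog restart
-------------------------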