Hi Vadim,

Yes, this is the command that reboots the box once the script decides storage has failed:
echo b > /proc/sysrq-trigger
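
For context, the check inside /opt/cloud/bin/xenheartbeat.sh boils down to something like the sketch below. This is a simplified illustration only -- the mount path, file pattern and timeout value are placeholders; the actual script on your host is the reference:

    # sketch of the fencing logic (illustrative, not the literal script)
    timeout=180                                # placeholder value
    for hb in /var/run/sr-mount/*/hb-*; do     # heartbeat files on each SR (path is illustrative)
        lastdate=$(cat "$hb" 2>/dev/null)
        if [ $(( $(date +%s) - ${lastdate:-0} )) -gt "$timeout" ]; then
            /usr/bin/logger -t heartbeat "Problem with $hb: not reachable, rebooting system!"
            echo b > /proc/sysrq-trigger       # sysrq 'b': reboot right now, no sync, no clean unmount
        fi
    done

In other words: if the heartbeat file on shared storage cannot be read, or is stale for longer than the timeout, the host deliberately fences itself.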


Removing it doesn't make sense: by the time that code runs, there is already a serious 
problem. I'd focus on making the storage reliable instead.
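
If the nightly offsite copy is what makes the NFS server unresponsive, throttling that copy may already be enough. Something along these lines could work; the paths, host name and bandwidth limit below are examples only, not taken from your setup:

    # run the backup copy in the idle I/O class and cap its bandwidth
    ionice -c3 rsync -a --bwlimit=20000 /export/primary/   backup-host:/backup/primary/
    ionice -c3 rsync -a --bwlimit=20000 /export/secondary/ backup-host:/backup/secondary/

rsync's --bwlimit is in KB/s (so 20000 is roughly 20 MB/s), and ionice -c3 puts the copy in the idle I/O class so that regular NFS traffic keeps priority.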

Regards, Remi



On 14/09/15 08:13, "Vadim Kimlaychuk" <va...@kickcloud.net> wrote:

>Remi,
>
>       I have analyzed the situation, and it looks like storage may indeed be causing the host reboots, as you suggested earlier in this thread. The reason: we run offline backups from the NFS server at exactly the time the hosts fail -- basically we copy all files from primary and secondary storage offsite. That process starts precisely at 00:00, and somewhere around 00:10 - 00:40 a XenServer host starts to reboot.
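>
>       One way I could confirm the correlation is to time small synced writes to the NFS-backed SR mount while the backup runs. A rough sketch (the SR mount point is a placeholder):
>
>       while true; do
>           date
>           # a single 4 KB write forced to disk, so the elapsed time reflects the NFS round trip
>           time dd if=/dev/zero of=/var/run/sr-mount/<sr-uuid>/latency-test bs=4k count=1 conv=fsync 2>/dev/null
>           sleep 10
>       done
>
>       If the write time jumps from milliseconds to tens of seconds around 00:00, that would match the heartbeat timing out.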
>
>       Reading old threads, I found that /opt/cloud/bin/xenheartbeat.sh may be doing this. In particular, the last lines of my xenheartbeat.sh are:
>
>       -------------------------
>       /usr/bin/logger -t heartbeat "Problem with $hb: not reachable for $(($(date +%s) - $lastdate)) seconds, rebooting system!"
>       echo b > /proc/sysrq-trigger
>       -------------------------
>
>       The only unclear part is that I don't see such a line in my logs. Could "echo b > /proc/sysrq-trigger" itself prevent the message from ever being written to the syslog file? The documentation says it reboots immediately without syncing the filesystems. There seems to be no other place that could trigger the reboot, but I am still not 100% sure.
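>
>       Since sysrq 'b' reboots before anything is flushed to disk, the logger line could easily be lost from the local file. One way to catch it next time would be to forward syslog to another machine (the destination IP below is a placeholder); for example, add to /etc/syslog.conf on the host:
>
>       user.*          @192.0.2.10
>
>       and restart syslogd. The "rebooting system!" message should then survive on the remote side even if the local file never gets synced.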
>
>Vadim.
>
>On 2015-09-13 18:26, Vadim Kimlaychuk wrote:
>
>> Remi,
>> 
>> Thank you for the hint. At least one problem is identified:
>> 
>> [root@xcp1 ~]# xe pool-list params=all | grep -E "ha-enabled|ha-config"
>> ha-enabled ( RO): false
>> ha-configuration ( RO):
>> 
>> Where should I look for storage errors? On the host? On the management server? I have 
>> checked /var/log/messages and there were only regular messages, nothing about 
>> "fence" or "reboot".
>> 
>> I have a dedicated NFS server that should be accessible at all times (at least its 
>> NIC interfaces are bonded in master-slave mode). The same server is used for both 
>> primary and secondary storage.
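>> 
>> On the host I plan to grep along these lines (log file names as on XenServer 6.5, the search patterns are just guesses):
>> 
>> grep -iE "fence|reboot|heartbeat" /var/log/messages /var/log/xensource.log
>> grep -iE "nfs|timeout|error" /var/log/SMlog
>> 
>> and do the same on the management server in /var/log/cloudstack/management/management-server.log.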
>> 
>> Thanks,
>> 
>> Vadim.
>> 
>> On 2015-09-13 14:38, Remi Bergsma wrote:
>> 
>> Hi Vadim,
>> 
>> I'm not sure what the problem is. I do know that when shared storage is used, both 
>> CloudStack and XenServer will fence (reboot) the box to prevent corruption if the 
>> network or the storage becomes unreachable. What storage do you use?
>> 
>> What does this return on a XenServer?
>> xe pool-list params=all | grep -E "ha-enabled|ha-config"
>> 
>> HA should be on, or else a hypervisor crash will not recover properly.
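>> 
>> Enabling it is roughly like this; the SR UUID is a placeholder for whichever shared SR you want to use for heartbeating, so please check the exact procedure for your XenServer version first:
>> 
>> xe sr-list type=nfs params=uuid,name-label
>> xe pool-ha-enable heartbeat-sr-uuids=<sr-uuid>
>> 
>> Test this in a maintenance window, as enabling HA changes how the pool reacts to failures.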
>> 
>> If you search the logs for Fence or reboot, does anything come back?
>> 
>> The logs you mention are nothing to worry about.
>> 
>> Can you tell us in some more detail what happens and how we can reproduce it?
>> 
>> Regards,
>> Remi
>> 
>> -----Original Message-----
>> From: Vadim Kimlaychuk [mailto:va...@kickcloud.net]
>> Sent: Sunday, 13 September 2015 9:32
>> To: users@cloudstack.apache.org
>> Cc: Remi Bergsma
>> Subject: Re: CS 4.5.2: all hosts reboot after 3 days at production
>> 
>> Hello Remi,
>> 
>> This issue has nothing to do with CS 4.5.2. We got a host reboot after precisely one 
>> week on the previous version of CS (4.5.1), which had been running without a restart 
>> for 106 days before that. So it is not a software issue specific to this release.
>> 
>> What really makes me unhappy is that the accidental host reboot made the entire 
>> cluster unusable. The CloudStack management server was up and running, the second 
>> cluster node was up the whole time and the VMs were transferred to it, but the 
>> system VMs were not restarted properly by CS and half of the network was down. The 
>> SSVM and CPVM were in "disconnected" status. Client VMs were up but couldn't reach 
>> their storage because the VRs were offline. A complete mess.
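>> 
>> For reference, recreating the system VMs by hand should be possible with something like this in cloudmonkey (the IDs are placeholders, and I am not sure it is the cleanest way):
>> 
>> list systemvms
>> destroy systemvm id=<ssvm-or-cpvm-id>
>> list routers
>> restart network id=<network-id> cleanup=true
>> 
>> Destroying the SSVM/CPVM should make CloudStack recreate them, and restarting a network with cleanup=true rebuilds its VR.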
>> 
>> I have used planned maintenance mode before and the cluster worked perfectly; we 
>> didn't have a single second of downtime. But with an accidental reboot, the 
>> clustering is of no use. :(
>> 
>> Vadim.
>> 
>> On 2015-09-08 09:35, Vadim Kimlaychuk wrote:
>> 
>> Hello Remi,
>> 
>> First of all, I don't have a /var/log/xha.log file. I have examined the logs in 
>> detail and haven't found any trace of the heartbeat failing. The only serious 
>> problem I found in the management logs before the restart is the following error, 
>> repeated many times:
>> 
>> --------------------------------------------------------------------
>> 2015-09-06 00:47:21,591 DEBUG [c.c.a.m.AgentManagerImpl] (RouterMonitor-1:ctx-2d67d422) Details from executing class com.cloud.agent.api.NetworkUsageCommand: Exception: java.lang.Exception
>> Message: vpc network usage plugin call failed
>> Stack: java.lang.Exception: vpc network usage plugin call failed
>>     at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.VPCNetworkUsage(XenServer56Resource.java:172)
>>     at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.execute(XenServer56Resource.java:195)
>>     at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:62)
>>     at com.cloud.hypervisor.xenserver.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:87)
>>     at com.cloud.hypervisor.xenserver.resource.XenServer620SP1Resource.executeRequest(XenServer620SP1Resource.java:65)
>>     at com.cloud.agent.manager.DirectAgentAttache$Task.runInContext(DirectAgentAttache.java:302)
>>     ...
>> --------------------------------------------------------------------
>> 
>> Just a couple of seconds before the XCP2 host restart:
>> 
>> --------------------------------------------------------------------
>> 2015-09-06 00:48:27,884 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgentCronJob-83:ctx-ff822baf) Ping from 2(xcp1)
>> 2015-09-06 00:48:27,884 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) Process host VM state report from ping process. host: 2
>> 2015-09-06 00:48:27,904 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) Process VM state report. host: 2, number of records in report: 6
>> 2015-09-06 00:48:27,904 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 85, power state: PowerOn
>> 2015-09-06 00:48:27,907 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 85
>> 2015-09-06 00:48:27,907 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 1, power state: PowerOn
>> 2015-09-06 00:48:27,910 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 1
>> 2015-09-06 00:48:27,910 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 2, power state: PowerOn
>> 2015-09-06 00:48:27,913 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 2
>> 2015-09-06 00:48:27,913 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 82, power state: PowerOn
>> 2015-09-06 00:48:27,916 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 82
>> 2015-09-06 00:48:27,916 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 94, power state: PowerOn
>> 2015-09-06 00:48:27,919 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 94
>> 2015-09-06 00:48:27,919 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 90, power state: PowerOn
>> 2015-09-06 00:48:27,922 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 90
>> 2015-09-06 00:48:27,928 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) Done with process of VM state report. host: 2
>> 2015-09-06 00:48:27,940 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgentCronJob-154:ctx-2e8a5911) Ping from 1(xcp2)
>> 2015-09-06 00:48:27,940 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) Process host VM state report from ping process. host: 1
>> 2015-09-06 00:48:27,951 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) Process VM state report. host: 1, number of records in report: 4
>> 2015-09-06 00:48:27,951 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm id: 100, power state: PowerOn
>> 2015-09-06 00:48:27,954 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 100
>> 2015-09-06 00:48:27,954 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm id: 33, power state: PowerOn
>> 2015-09-06 00:48:27,957 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 33
>> 2015-09-06 00:48:27,957 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm id: 89, power state: PowerOn
>> 2015-09-06 00:48:27,960 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 89
>> 2015-09-06 00:48:27,961 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm id: 88, power state: PowerOn
>> 2015-09-06 00:48:27,963 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 88
>> 2015-09-06 00:48:27,968 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) Done with process of VM state report. host: 1
>> --------------------------------------------------------------------
>> 
>> Vadim.
>> 
>> On 2015-09-07 23:18, Remi Bergsma wrote:
>> 
>> Hi Vadim,
>> 
>> What kind of storage do you use? Can you show /var/log/xha.log (I think that is the 
>> name), please? It could be xen-ha that fences the box if the heartbeat cannot be 
>> written.
>> 
>> You suggest it is CloudStack. Did you see anything in the mgt logs?
>> 
>> Regards, Remi
>> 
>> Sent from my iPhone
>> 
>> On 07 Sep 2015, at 08:26, Vadim Kimlaychuk <va...@kickcloud.net> wrote:
>> 
>> Hello all,
>> 
>> I have experienced an accidental cluster reboot three days after updating to CS 
>> 4.5.2. The cluster is XenServer 6.5 with SP1. The reboot started on the slave node 
>> and then the master.
>> Syslog on the slave shows only this:
>> 
>> Sep 6 00:47:05 xcp2 last message repeated 3 times
>> Sep 6 00:47:15 xcp2 xenstored: D12 write data/meminfo_free 713732
>> Sep 6 00:47:15 xcp2 xenstored: A1564203 w event /local/domain/12/data/meminfo_free /local/domain/12/data/meminfo_free
>> Sep 6 00:47:15 xcp2 xenstored: D12 write data/updated Sun Sep 6 00:48:55 EEST 2015
>> Sep 6 00:47:15 xcp2 xenstored: A6 w event /local/domain/12/data/updated /local/domain/12/data/updated
>> Sep 6 00:47:15 xcp2 xenstored: A10 w event /local/domain/12/data/updated /local/domain/12/data/updated
>> Sep 6 00:47:26 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 (xid=0x304ae9dc)
>> Sep 6 00:47:27 xcp2 xapi: [ info|xcp2|462044 INET 0.0.0.0:80|dispatch:host.call_plugin D:7593b578fada|taskhelper] task host.call_plugin R:ddd3cc399f86 forwarded (trackid=407f6adaa118a34f19eb1e29cd68a0e8)
>> Sep 6 00:47:36 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 (xid=0x304ae9dc)
>> Sep 6 00:48:18 xcp2 last message repeated 4 times
>> Sep 6 00:48:25 xcp2 xenstored: D1 write data/meminfo_free 1740496
>> Sep 6 00:48:25 xcp2 xenstored: A1564203 w event /local/domain/1/data/meminfo_free /local/domain/1/data/meminfo_free
>> Sep 6 00:48:25 xcp2 xenstored: D1 write data/updated Sat Sep 5 21:50:07 EEST 2015
>> Sep 6 00:48:25 xcp2 xenstored: A6 w event /local/domain/1/data/updated /local/domain/1/data/updated
>> Sep 6 00:48:25 xcp2 xenstored: A10 w event /local/domain/1/data/updated /local/domain/1/data/updated
>> Sep 6 00:48:26 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 (xid=0x304ae9dc)
>> Sep 6 00:48:27 xcp2 xapi: [ info|xcp2|462044 INET 0.0.0.0:80|dispatch:host.call_plugin D:f2c8987bc0ff|taskhelper] task host.call_plugin R:b62d2d4f58eb forwarded (trackid=e3d4ea00c96194830a7dbbfc35563a3c)
>> Sep 6 00:48:38 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 (xid=0x304ae9dc)
>> Sep 06 00:48:48 xcp2 syslogd 1.4.1: restart.
>> Sep 6 00:48:48 xcp2 kernel: klogd 1.4.1, log source = /proc/kmsg started.
>> Sep 6 00:48:48 xcp2 kernel: [ 0.000000] Initializing cgroup subsys cpuset
>> Sep 6 00:48:48 xcp2 kernel: [ 0.000000] Initializing cgroup subsys cpu
>> Sep 6 00:48:48 xcp2 kernel: [ 0.000000] Initializing cgroup subsys cpuacct
>> Sep 6 00:48:48 xcp2 kernel: [ 0.000000] Linux version 3.10.0+2
>> 
>> Can anyone help with the diagnostics?
>> 
>> Thank you,
>> 
>> Vadim.
