Hi Vadim,

It could also be XenHA, but I remember you already said it is off. Did you
check the hardware health?

I'd recommend turning XenHA on, as otherwise you will not have automatic
recovery in case of a failure.

Regards,
Remi
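For reference, a basic hardware health check on a XenServer host might look
like the following. These commands are generic and not from the thread;
ipmitool assumes the host has a BMC and the tool installed, and sda is only
an example device.

-------------------------
# IPMI system event log -- hardware faults (ECC, PSU, temperature) land here
ipmitool sel elist

# Kernel messages in dom0 -- look for machine checks and I/O errors
dmesg | grep -iE 'error|fail|mce'

# Disk health, if smartmontools happens to be installed (sda is an example)
smartctl -H /dev/sda
-------------------------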
On 14/09/15 15:09, "Vadim Kimlaychuk" <va...@kickcloud.net> wrote:

>Remi,
>
>    I have analyzed the script xenheartbeat.sh and it seems to be useless,
>because it relies on the file /opt/cloud/bin/heartbeat, which has zero
>length. It is not set up during installation, and there is no such step in
>the documentation. Logically, the admin must run "setup_heartbeat_file.sh"
>to make the heartbeat work. If this file has zero length, the script checks
>nothing and logs this message every minute:
>
>Sep 14 04:43:53 xcp1 heartbeat: Problem with heartbeat, no iSCSI or NFS
>mount defined in /opt/cloud/bin/heartbeat!
>
>    That means it can't reboot the host, because it doesn't check anything.
>Isn't that so?
>
>    Is there any other script that may reboot the host when there is a
>problem with storage?
>
>Vadim.
>
>On 2015-09-14 15:40, Remi Bergsma wrote:
>
>> Hi Vadim,
>>
>> This does indeed reboot a box, once storage fails:
>> echo b > /proc/sysrq-trigger
>>
>> Removing it doesn't make sense, as there are serious issues once you
>> hit this code. I'd recommend making sure the storage is reliable.
>>
>> Regards, Remi
>>
>> On 14/09/15 08:13, "Vadim Kimlaychuk" <va...@kickcloud.net> wrote:
>>
>> Remi,
>>
>> I have analyzed the situation and found that storage may be causing the
>> host reboots, as you wrote earlier in this thread. The reason: we run
>> offline backups from the NFS server at exactly the time the hosts fail.
>> Basically, we copy all files in primary and secondary storage offsite.
>> This process starts precisely at 00:00, and somewhere around 00:10-00:40
>> the XenServer host starts to reboot.
>>
>> Reading old threads, I found that /opt/cloud/bin/xenheartbeat.sh may do
>> this job. In particular, the last lines of my xenheartbeat.sh are:
>>
>> -------------------------
>> /usr/bin/logger -t heartbeat "Problem with $hb: not reachable for
>> $(($(date +%s) - $lastdate)) seconds, rebooting system!"
>> echo b > /proc/sysrq-trigger
>> -------------------------
>>
>> The only unclear moment is that I don't have such a line in my logs.
>> Could the command "echo b > /proc/sysrq-trigger" prevent the message from
>> being written to the syslog file? The documentation says it reboots
>> immediately, without synchronizing the FS. It seems there is no other
>> place that could do it, but I am still not 100% sure.
>>
>> Vadim.
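To answer the question above: yes, that is plausible. "echo b" via sysrq
reboots the machine immediately, without syncing or unmounting filesystems,
so the final logger line can sit in a buffer and never reach the on-disk
syslog file. For anyone following along, here is a minimal sketch of the
kind of check being discussed -- a paraphrase, not the actual
xenheartbeat.sh shipped with CloudStack; the marker filename, the threshold,
and the one-mount-per-line format of /opt/cloud/bin/heartbeat are
assumptions for illustration only.

-------------------------
#!/bin/bash
# Sketch only: paraphrases the fencing logic discussed above.
hbfile=/opt/cloud/bin/heartbeat
timeout=180                            # example threshold, in seconds

# An empty heartbeat file means there is nothing to monitor -- exactly the
# situation described above: the script logs a complaint and does nothing.
if [ ! -s "$hbfile" ]; then
    /usr/bin/logger -t heartbeat \
        "Problem with heartbeat, no iSCSI or NFS mount defined in $hbfile!"
    exit 0
fi

lastdate=$(date +%s)
while true; do
    while read -r hb; do
        # Refresh a marker on the shared mount (filename is made up here).
        # Note: on a hung NFS mount, touch may block rather than fail; the
        # real script's handling of that case differs.
        if touch "$hb/hb-marker" 2>/dev/null; then
            lastdate=$(date +%s)       # storage still writable
        elif [ $(($(date +%s) - lastdate)) -gt "$timeout" ]; then
            /usr/bin/logger -t heartbeat "Problem with $hb: not reachable for $(($(date +%s) - lastdate)) seconds, rebooting system!"
            sync; sleep 1              # give the log line a chance to hit disk
            echo b > /proc/sysrq-trigger   # immediate reboot, no FS sync
        fi
    done < "$hbfile"
    sleep 60
done
-------------------------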
>> On 2015-09-13 18:26, Vadim Kimlaychuk wrote:
>>
>> Remi,
>>
>> Thank you for the hint. At least one problem is identified:
>>
>> [root@xcp1 ~]# xe pool-list params=all | grep -E "ha-enabled|ha-config"
>>              ha-enabled ( RO): false
>>        ha-configuration ( RO):
>>
>> Where should I look for storage errors? The host? The management server?
>> I have checked /var/log/messages and there were only regular messages; no
>> "fence" or "reboot" commands.
>>
>> I have a dedicated NFS server that should be accessible all the time (at
>> least its NIC interfaces are bonded in master-slave mode). The server is
>> used for both primary and secondary storage.
>>
>> Thanks,
>>
>> Vadim.
>>
>> On 2015-09-13 14:38, Remi Bergsma wrote:
>>
>> Hi Vadim,
>>
>> Not sure what the problem is, although I do know that when shared storage
>> is used, both CloudStack and XenServer will fence (reboot) the box to
>> prevent corruption in case access to the network or the storage is not
>> possible. What storage do you use?
>>
>> What does this return on a XenServer?:
>> xe pool-list params=all | grep -E "ha-enabled|ha-config"
>>
>> HA should be on, or else a hypervisor crash will not recover properly.
>>
>> If you search the logs for "Fence" or "reboot", does anything come back?
>>
>> The logs you mention are nothing to worry about.
>>
>> Can you tell us in some more detail what happens and how we can
>> reproduce it?
>>
>> Regards,
>> Remi
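Concretely, those checks could look like this on a XenServer 6.x host (log
paths are the usual defaults; /var/log/SMlog is the storage manager log,
which is also a reasonable answer to "where should I look for storage
errors"):

-------------------------
# Any fencing or reboot traces on the hypervisor?
grep -iE 'fence|reboot' /var/log/messages /var/log/xensource.log

# Storage-level errors live in the storage manager log
grep -iE 'error|fail' /var/log/SMlog

# Pool HA status, as suggested above
xe pool-list params=all | grep -E "ha-enabled|ha-config"
-------------------------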
>> -----Original Message-----
>> From: Vadim Kimlaychuk [mailto:va...@kickcloud.net]
>> Sent: Sunday, 13 September 2015 9:32
>> To: users@cloudstack.apache.org
>> Cc: Remi Bergsma
>> Subject: Re: CS 4.5.2: all hosts reboot after 3 days at production
>>
>> Hello Remi,
>>
>> This issue has nothing to do with CS 4.5.2. We got a host reboot after
>> precisely one week on the previous version of CS (4.5.1), and that
>> version had been running without a restart for 106 days before that. So
>> it is not a software issue.
>>
>> What really makes me unhappy is that an accidental host reboot made the
>> entire cluster unusable. The CloudStack management server was up and
>> running, the second cluster node was up and running all the time, and
>> VMs were transferred to the second host, but the system VMs were not
>> rebooted properly by CS and half of the network was down. The SSVM and
>> CPVM were in "disconnected" status. Client VMs were up but couldn't
>> connect to storage, because the VRs were offline. A complete mess.
>>
>> I have used planned maintenance mode before and the cluster worked just
>> perfectly; we didn't have a single second of downtime. But with an
>> accidental reboot there is no benefit from clusterization. :(
>>
>> Vadim.
>>
>> On 2015-09-08 09:35, Vadim Kimlaychuk wrote:
>>
>> Hello Remi,
>>
>> First of all, I don't have a /var/log/xha.log file. I have examined the
>> logs in detail and haven't found any trace of the heartbeat failing. The
>> only serious problem I found in the management logs before the restart
>> is this error, repeated many times:
>>
>> -------------------------------------------------
>> 2015-09-06 00:47:21,591 DEBUG [c.c.a.m.AgentManagerImpl] (RouterMonitor-1:ctx-2d67d422) Details from executing class com.cloud.agent.api.NetworkUsageCommand: Exception: java.lang.Exception
>> Message: vpc network usage plugin call failed
>> Stack: java.lang.Exception: vpc network usage plugin call failed
>>     at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.VPCNetworkUsage(XenServer56Resource.java:172)
>>     at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.execute(XenServer56Resource.java:195)
>>     at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:62)
>>     at com.cloud.hypervisor.xenserver.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:87)
>>     at com.cloud.hypervisor.xenserver.resource.XenServer620SP1Resource.executeRequest(XenServer620SP1Resource.java:65)
>>     at com.cloud.agent.manager.DirectAgentAttache$Task.runInContext(DirectAgentAttache.java:302)
>>     ...
>> -------------------------------------------------
>>
>> And just a couple of seconds before the XCP2 host restart:
>>
>> -------------------------------------------------
>> 2015-09-06 00:48:27,884 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgentCronJob-83:ctx-ff822baf) Ping from 2(xcp1)
>> 2015-09-06 00:48:27,884 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) Process host VM state report from ping process. host: 2
>> 2015-09-06 00:48:27,904 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) Process VM state report. host: 2, number of records in report: 6
>> 2015-09-06 00:48:27,904 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 85, power state: PowerOn
>> 2015-09-06 00:48:27,907 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 85
>> 2015-09-06 00:48:27,907 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 1, power state: PowerOn
>> 2015-09-06 00:48:27,910 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 1
>> 2015-09-06 00:48:27,910 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 2, power state: PowerOn
>> 2015-09-06 00:48:27,913 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 2
>> 2015-09-06 00:48:27,913 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 82, power state: PowerOn
>> 2015-09-06 00:48:27,916 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 82
>> 2015-09-06 00:48:27,916 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 94, power state: PowerOn
>> 2015-09-06 00:48:27,919 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 94
>> 2015-09-06 00:48:27,919 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 90, power state: PowerOn
>> 2015-09-06 00:48:27,922 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 90
>> 2015-09-06 00:48:27,928 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-83:ctx-ff822baf) Done with process of VM state report. host: 2
>> 2015-09-06 00:48:27,940 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgentCronJob-154:ctx-2e8a5911) Ping from 1(xcp2)
>> 2015-09-06 00:48:27,940 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) Process host VM state report from ping process. host: 1
>> 2015-09-06 00:48:27,951 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) Process VM state report. host: 1, number of records in report: 4
>> 2015-09-06 00:48:27,951 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm id: 100, power state: PowerOn
>> 2015-09-06 00:48:27,954 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 100
>> 2015-09-06 00:48:27,954 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm id: 33, power state: PowerOn
>> 2015-09-06 00:48:27,957 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 33
>> 2015-09-06 00:48:27,957 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm id: 89, power state: PowerOn
>> 2015-09-06 00:48:27,960 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 89
>> 2015-09-06 00:48:27,961 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm id: 88, power state: PowerOn
>> 2015-09-06 00:48:27,963 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 88
>> 2015-09-06 00:48:27,968 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgentCronJob-154:ctx-2e8a5911) Done with process of VM state report. host: 1
>> -------------------------------------------------
>>
>> Vadim.
>>
>> On 2015-09-07 23:18, Remi Bergsma wrote:
>>
>> Hi Vadim,
>>
>> What kind of storage do you use? Can you show /var/log/xha.log (I think
>> that is the name), please? It could be xen-ha that fences the box if the
>> heartbeat cannot be written.
>>
>> You suggest it is CloudStack. Did you see anything in the mgt logs?
>>
>> Regards, Remi
>>
>> Sent from my iPhone
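Since ha-enabled is false on this pool (as shown earlier in the thread),
xen-ha cannot be the fencing culprit and there is no /var/log/xha.log to
show. For completeness, enabling XenServer HA along the lines Remi
recommends would look roughly like this; the SR UUID is a placeholder for a
shared SR that will hold the HA heartbeat:

-------------------------
# Find a shared SR to carry the HA statefile/heartbeat
xe sr-list shared=true

# Enable HA on the pool (UUID below is a placeholder)
xe pool-ha-enable heartbeat-sr-uuids=<sr-uuid>

# Verify
xe pool-list params=ha-enabled
-------------------------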
>> On 07 Sep 2015, at 08:26, Vadim Kimlaychuk <va...@kickcloud.net> wrote:
>>
>> Hello all,
>>
>> I have experienced an accidental cluster reboot three days after
>> updating to CS 4.5.2. The cluster is XenServer 6.5 with SP1. The reboot
>> started from the slave node and then hit the master. Syslog on the slave
>> shows only this:
>>
>> Sep  6 00:47:05 xcp2 last message repeated 3 times
>> Sep  6 00:47:15 xcp2 xenstored: D12 write data/meminfo_free 713732
>> Sep  6 00:47:15 xcp2 xenstored: A1564203 w event /local/domain/12/data/meminfo_free /local/domain/12/data/meminfo_free
>> Sep  6 00:47:15 xcp2 xenstored: D12 write data/updated Sun Sep 6 00:48:55 EEST 2015
>> Sep  6 00:47:15 xcp2 xenstored: A6 w event /local/domain/12/data/updated /local/domain/12/data/updated
>> Sep  6 00:47:15 xcp2 xenstored: A10 w event /local/domain/12/data/updated /local/domain/12/data/updated
>> Sep  6 00:47:26 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 (xid=0x304ae9dc)
>> Sep  6 00:47:27 xcp2 xapi: [ info|xcp2|462044 INET 0.0.0.0:80|dispatch:host.call_plugin D:7593b578fada|taskhelper] task host.call_plugin R:ddd3cc399f86 forwarded (trackid=407f6adaa118a34f19eb1e29cd68a0e8)
>> Sep  6 00:47:36 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 (xid=0x304ae9dc)
>> Sep  6 00:48:18 xcp2 last message repeated 4 times
>> Sep  6 00:48:25 xcp2 xenstored: D1 write data/meminfo_free 1740496
>> Sep  6 00:48:25 xcp2 xenstored: A1564203 w event /local/domain/1/data/meminfo_free /local/domain/1/data/meminfo_free
>> Sep  6 00:48:25 xcp2 xenstored: D1 write data/updated Sat Sep 5 21:50:07 EEST 2015
>> Sep  6 00:48:25 xcp2 xenstored: A6 w event /local/domain/1/data/updated /local/domain/1/data/updated
>> Sep  6 00:48:25 xcp2 xenstored: A10 w event /local/domain/1/data/updated /local/domain/1/data/updated
>> Sep  6 00:48:26 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 (xid=0x304ae9dc)
>> Sep  6 00:48:27 xcp2 xapi: [ info|xcp2|462044 INET 0.0.0.0:80|dispatch:host.call_plugin D:f2c8987bc0ff|taskhelper] task host.call_plugin R:b62d2d4f58eb forwarded (trackid=e3d4ea00c96194830a7dbbfc35563a3c)
>> Sep  6 00:48:38 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 (xid=0x304ae9dc)
>> Sep 06 00:48:48 xcp2 syslogd 1.4.1: restart.
>> Sep  6 00:48:48 xcp2 kernel: klogd 1.4.1, log source = /proc/kmsg started.
>> Sep  6 00:48:48 xcp2 kernel: [    0.000000] Initializing cgroup subsys cpuset
>> Sep  6 00:48:48 xcp2 kernel: [    0.000000] Initializing cgroup subsys cpu
>> Sep  6 00:48:48 xcp2 kernel: [    0.000000] Initializing cgroup subsys cpuacct
>> Sep  6 00:48:48 xcp2 kernel: [    0.000000] Linux version 3.10.0+2
>>
>> Can anyone help with the diagnostics?
>>
>> Thank you,
>>
>> Vadim.
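A closing note on the syslog excerpt above: it cuts straight from normal
activity to "syslogd 1.4.1: restart." with no shutdown messages, which is
consistent with a sysrq-triggered reboot whose final log line was never
flushed to disk. Forwarding syslog to a remote collector is one way to
capture such evidence next time; a minimal example for the classic syslogd
running in XenServer 6.x dom0 (the IP address is a placeholder):

-------------------------
# /etc/syslog.conf -- also forward everything to a remote collector
*.*    @192.0.2.10

# Then restart the syslog service on the host
service syslog restart
-------------------------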