Dear Remi,

I had time to investigate HA and reboot issues with XenServer 6.5 SP1 and here are some interesting points:

1. Enabling HA storage as described in the ShapeBlue guide makes CloudStack recover from host failure properly. The only tough case is when both cluster nodes reboot simultaneously. The hosts then get stuck creating the pool until it is fixed manually. This is also described in the ShapeBlue guide. Temporarily disabling and re-enabling HA did the trick.

2. On my XenServer the heartbeat script did not play any role. I disabled the reboot there, but the hosts detected the storage failure by themselves and rebooted. I think this functionality has already been implemented by the hypervisor development team, but I don't know exactly which daemon does it. Citrix probably knows, and perhaps someone can share this information. Anyway, for XenServer 6.5 the CloudStack heartbeat seems to be an unnecessary module: the hypervisor itself detects storage problems earlier than the heartbeat script does. That is why my log was empty.



On 2015-09-15 17:01, Remi Bergsma wrote:


You cannot sync because that will also try to write to remote disks and that doesn't work. Or else you wouldn't be in trouble anyway. We have previously seen situations where the box was supposed to be rebooted, but that took a long time because of the sync. Instead, you wait a little bit and then fence it. Otherwise your ring buffers will fill up and disks will get corrupted.

There are global settings to tweak the parameters, if I recall correctly.

This all gets less relevant due to XenHA. This process is smarter and also fences the box, always leaving behind log trails. Running XenServer without turning on XenHA is asking for trouble anyway. I always put XenHA to 180 seconds.
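
For reference, a minimal sketch of how XenHA could be enabled with a 180 second timeout. It assumes the `xe pool-ha-enable` command with its `heartbeat-sr-uuids` and `ha-config:timeout` parameters, as found on XenServer 6.x; the helper name `ha_enable_cmd`, the `APPLY` switch and the SR UUID argument are placeholders for illustration only. By default the function just prints the command so it can be reviewed first.

```shell
# Build (and optionally run) the command that enables XenHA on a pool.
# Assumption: `xe pool-ha-enable` accepts heartbeat-sr-uuids= and
# ha-config:timeout=, as on XenServer 6.x. Verify on your own host
# before relying on it.
ha_enable_cmd() {
    sr_uuid="$1"          # UUID of the shared SR that holds the HA heartbeat
    ha_timeout="${2:-180}" # seconds; 180 is the value mentioned above
    cmd="xe pool-ha-enable heartbeat-sr-uuids=$sr_uuid ha-config:timeout=$ha_timeout"
    if [ "${APPLY:-0}" = "1" ]; then
        $cmd              # only executed when APPLY=1 is set explicitly
    else
        echo "$cmd"       # dry run: print the command for review
    fi
}
```

Calling `ha_enable_cmd <sr-uuid>` prints the command; `APPLY=1 ha_enable_cmd <sr-uuid> 180` would run it on the pool master.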


On 15/09/15 10:19, "Vadim Kimlaychuk" <> wrote:

Abhinandan and Frank,

1. The heartbeat script is designed to monitor iSCSI and NFS mounts.
2. A default installation only checks for the presence of the local file
/opt/cloud/bin/heartbeat. The administrator must run the script with the
host UUID and the SR UUID to be monitored; the heartbeat file will then
contain information about what storage to
3. If step 2 was set up, the script tries to read 100 bytes from the
mounted storage to /dev/null. If the read succeeds within 1 minute, it
sleeps for 1 minute. Otherwise it reports the problem, exits the endless
loop and reboots.
4. It reboots by calling "echo b > /proc/sysrq-trigger". I am
not confident that this method is safe in terms of local disk writes. I
mean the script log may not be flushed to the hard drive, so after the
reboot the admin may not see any problems in syslog, because local disk
writes are
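
The steps above can be sketched roughly as follows. This is not the CloudStack script itself, only an illustration of the read-with-timeout idea; the function names `storage_readable` and `check_once` and the use of the coreutils `timeout` command are assumptions made for this example. The sketch deliberately never triggers a reboot.

```shell
# Return 0 if up to 100 bytes can be read from the heartbeat file within
# the time limit; non-zero on a missing file, an I/O error or a hang.
storage_readable() {
    hb_file="$1"        # heartbeat file on the mounted storage
    limit="${2:-60}"    # seconds to wait for the read
    timeout "$limit" dd if="$hb_file" of=/dev/null bs=100 count=1 2>/dev/null
}

# One iteration of the monitoring loop: print "ok" on success, otherwise
# report the problem on stderr and return non-zero.
check_once() {
    if storage_readable "$1" "${2:-60}"; then
        echo "ok"
        return 0
    fi
    echo "Problem with $1: not reachable, a real script would reboot here" >&2
    # The script discussed in this thread reboots at this point with:
    #   echo b > /proc/sysrq-trigger
    return 1
}
```

A caller would run `check_once <heartbeat-file>` in a loop with a one minute sleep between iterations and leave the loop on the first failure.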



On 2015-09-15 10:41, Frank Louwers wrote:

Important correction: it monitors the health of the first primary NFS
(or otherwise "distributed and mounted") filesystem. If you don't use
NFS as (main) primary storage, it's safe to disable that reboot. If
you know your NFS has "issues" from time to time, and have controls
around that, yes, you can disable that...


On 15 Sep 2015, at 09:08, Abhinandan Prateek
<> wrote:

The heartbeat script monitors the health of the primary storage by
using a timestamp that is written to each primary store.
In case the primary storage is unreachable it reboots the XenServer in
order to protect the virtual machines from corruption.

On 14-Sep-2015, at 8:48 pm, Vadim Kimlaychuk <>


I will definitely enable HA once I find out what is rebooting the host. I
know the circumstances in which it happens, and I know that it is
storage-related. Hardware health is monitored by SNMP and there were no
problems with temperature, CPU, RAM or HDD ranges. In case of a HW
failure I should theoretically have a kernel panic or crash dumps. But
there are none. Will experiment a bit.

Thank you,


On 2015-09-14 17:35, Remi Bergsma wrote:

Hi Vadim,
It can also be XenHA but I remember you already said it is off. Did you
check the hardware health?
I'd recommend turning on XenHA as otherwise in case of a failure you
will not have an automatic recovery.
On 14/09/15 15:09, "Vadim Kimlaychuk" <> wrote:
I have analyzed the script and it seems to be useless, because it
relies on the file /opt/cloud/bin/heartbeat, which has 0 length. It is
not set up during installation and there is no such step in the
documentation for setting it up. Logically the admin must run
"" to make the heartbeat work. If this file has 0
length, the script checks nothing and logs this message every minute:
Sep 14 04:43:53 xcp1 heartbeat: Problem with heartbeat, no iSCSI or NFS
mount defined in /opt/cloud/bin/heartbeat!
That means it can't reboot the host, because it doesn't check
anything. Is that right?
Is there any other script that may reboot the host when there is a
problem with storage?
On 2015-09-14 15:40, Remi Bergsma wrote:
Hi Vadim,
This does indeed reboot a box, once storage fails:
echo b > /proc/sysrq-trigger
Removing it doesn't make sense, as there are serious issues once you
hit this code. I'd recommend making sure the storage is reliable.
Regards, Remi
On 14/09/15 08:13, "Vadim Kimlaychuk" <> wrote:
I have analyzed the situation and found that storage may be causing the
host reboots, as you wrote earlier in this thread. The reason is that
we do offline backups from the NFS server at the time when the hosts fail.
Basically we copy all files in primary and secondary storage offsite.
This process starts precisely at 00:00, and somewhere around 00:10 -
00:40 the XenServer host starts to reboot.
Reading old threads I have found that
/opt/cloud/bin/ may do this job. In particular, the last lines
at my are:
/usr/bin/logger -t heartbeat "Problem with $hb: not reachable for $(($(date +%s) - $lastdate)) seconds, rebooting system!"
echo b > /proc/sysrq-trigger
The only unclear point is that I don't have such a line in my logs.
Could the command "echo b > /proc/sysrq-trigger" prevent writing to the
syslog file? The documentation says that it reboots immediately without
synchronizing the filesystems. It seems there is no other place that
could do it, but still I am not 100% sure.
On 2015-09-13 18:26, Vadim Kimlaychuk wrote:
Thank you for the hint. At least one problem is identified:
[root@xcp1 ~]# xe pool-list params=all | grep -E
ha-enabled ( RO): false
ha-configuration ( RO):
Where should I look for storage errors? Host? Management server? I have
checked /var/log/messages and there were only regular messages, no
"fence" or "reboot" commands.
I have a dedicated NFS server that should be accessible all the time (at
least the NIC interfaces are bonded in master-slave mode). The server is
used for both primary and secondary storage.
On 2015-09-13 14:38, Remi Bergsma wrote:
Hi Vadim,
Not sure what the problem is. Although I do know that when shared
storage is used, both CloudStack and XenServer will fence (reboot) the
box to prevent corruption in case access to the network or the storage
is not possible. What storage do you use?
What does this return on a XenServer?:
xe pool-list params=all | grep -E "ha-enabled|ha-config"
HA should be on, or else a hypervisor crash will not recover properly.
If you search the logs for Fence or reboot, does anything come back?
The logs you mention are nothing to worry about.
Can you tell us in some more details what happens and how we can
reproduce it?
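
A quick way to run the log search suggested above, as a sketch. The helper name `find_fence_events` is made up for this example, and the default path /var/log/messages is simply the file mentioned elsewhere in this thread; other logs may be more relevant on a given host.

```shell
# Print log lines that mention fencing or a reboot, case-insensitively.
# The pattern "fenc" matches "fence", "fenced" and "fencing".
find_fence_events() {
    log_file="${1:-/var/log/messages}"
    grep -iE 'fenc|reboot' "$log_file"
}
```

For example, `find_fence_events /var/log/messages` prints the matching lines and returns non-zero when nothing matches.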
-----Original Message-----
From: Vadim Kimlaychuk []
Sent: Sunday, 13 September 2015 9:32
Cc: Remi Bergsma
Subject: Re: CS 4.5.2: all hosts reboot after 3 days at production
Hello Remi,
This issue has nothing to do with CS 4.5.2. We got a host reboot after
precisely 1 week with the previous version of CS (4.5.1). The previous
version had been working without a restart for 106 days before that. So
it is not a software issue.
What really makes me unhappy is that an unplanned host reboot made the
entire cluster unusable. The CloudStack management server was up and
running, the second cluster node was up and running all the time and the
VMs were transferred to the second host, but the System VMs were not
rebooted properly by CS and half of the network was down. SSVM and CPVM
were in "disconnected" status. Client VMs were up, but couldn't connect to
storage, because the VRs were offline. A complete mess.
I have used planned maintenance mode before and the cluster worked just
perfectly. We didn't have a single second of downtime. But with an
unplanned reboot clustering is of no use. :(
On 2015-09-08 09:35, Vadim Kimlaychuk wrote:
Hello Remi,
First of all I don't have a /var/log/xha.log file. I have examined
in detail and haven't found any trace that heartbeat has failed. The
only serious problem I have found in the management logs before the
restart is this error, repeated many times:
2015-09-06 00:47:21,591 DEBUG [c.c.a.m.AgentManagerImpl] (RouterMonitor-1:ctx-2d67d422) Details from executing class Exception:
Message: vpc network usage plugin call failed
Stack: java.lang.Exception: vpc network usage plugin call failed at
Just a couple of seconds before the XCP2 host restart:
2015-09-06 00:48:27,884 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgentCronJob-83:ctx-ff822baf) Ping from 2(xcp1)
2015-09-06 00:48:27,884 DEBUG (DirectAgentCronJob-83:ctx-ff822baf) Process host VM state report ping process. host: 2
2015-09-06 00:48:27,904 DEBUG (DirectAgentCronJob-83:ctx-ff822baf) Process VM state report. host: number of records in report: 6
2015-09-06 00:48:27,904 DEBUG (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 85, power state: PowerOn
2015-09-06 00:48:27,907 DEBUG (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 85
2015-09-06 00:48:27,907 DEBUG (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 1, power state: PowerOn
2015-09-06 00:48:27,910 DEBUG (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 1
2015-09-06 00:48:27,910 DEBUG (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 2, power state: PowerOn
2015-09-06 00:48:27,913 DEBUG (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 2
2015-09-06 00:48:27,913 DEBUG (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 82, power state: PowerOn
2015-09-06 00:48:27,916 DEBUG (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 82
2015-09-06 00:48:27,916 DEBUG (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 94, power state: PowerOn
2015-09-06 00:48:27,919 DEBUG (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 94
2015-09-06 00:48:27,919 DEBUG (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 90, power state: PowerOn
2015-09-06 00:48:27,922 DEBUG (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, skip DB writing. vm id: 90
2015-09-06 00:48:27,928 DEBUG (DirectAgentCronJob-83:ctx-ff822baf) Done with process of VM state report. host: 2
2015-09-06 00:48:27,940 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgentCronJob-154:ctx-2e8a5911) Ping from 1(xcp2)
2015-09-06 00:48:27,940 DEBUG (DirectAgentCronJob-154:ctx-2e8a5911) Process host VM state report from ping process. host: 1
2015-09-06 00:48:27,951 DEBUG (DirectAgentCronJob-154:ctx-2e8a5911) Process VM state report. host: 1, number of records in report: 4
2015-09-06 00:48:27,951 DEBUG (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm 100, power state: PowerOn
2015-09-06 00:48:27,954 DEBUG (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 100
2015-09-06 00:48:27,954 DEBUG (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm 33, power state: PowerOn
2015-09-06 00:48:27,957 DEBUG (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 33
2015-09-06 00:48:27,957 DEBUG (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm 89, power state: PowerOn
2015-09-06 00:48:27,960 DEBUG (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 89
2015-09-06 00:48:27,961 DEBUG (DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm 88, power state: PowerOn
2015-09-06 00:48:27,963 DEBUG (DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, skip DB writing. vm id: 88
2015-09-06 00:48:27,968 DEBUG (DirectAgentCronJob-154:ctx-2e8a5911) Done with process of VM state report. host: 1
On 2015-09-07 23:18, Remi Bergsma wrote:
Hi Vadim,
What kind of storage do you use? Can you show /var/log/xha.log (I
think that is the name) please? It could be xen-ha that fences the box
if the heartbeat cannot be written.
You suggest it is CloudStack. Did you see anything in the mgt logs?
Regards, Remi
Sent from my iPhone
On 07 Sep 2015, at 08:26, Vadim Kimlaychuk <> wrote:
Hello all,
I experienced an unexpected cluster reboot 3 days after updating to CS
4.5.2. The cluster is XenServer 6.5 with SP1. The reboot started on the
slave node and then hit the master.
Syslog on the slave shows only this:
Sep 6 00:47:05 xcp2 last message repeated 3 times
Sep 6 00:47:15 xcp2 xenstored: D12 write data/meminfo_free 713732
Sep 6 00:47:15 xcp2 xenstored: A1564203 w event /local/domain/12/data/meminfo_free
Sep 6 00:47:15 xcp2 xenstored: D12 write data/updated Sun Sep 6 00:48:55 EEST 2015
Sep 6 00:47:15 xcp2 xenstored: A6 w event /local/domain/12/data/updated /local/domain/12/data/updated
Sep 6 00:47:15 xcp2 xenstored: A10 w event /local/domain/12/data/updated /local/domain/12/data/updated
Sep 6 00:47:26 xcp2 dhclient: DHCPREQUEST on xenbr0 to port 67 (xid=0x304ae9dc)
Sep 6 00:47:27 xcp2 xapi: [ info|xcp2|462044 INET|dispatch:host.call_plugin D:7593b578fada|taskhelper] task host.call_plugin R:ddd3cc399f86 forwarded
Sep 6 00:47:36 xcp2 dhclient: DHCPREQUEST on xenbr0 to port 67 (xid=0x304ae9dc)
Sep 6 00:48:18 xcp2 last message repeated 4 times
Sep 6 00:48:25 xcp2 xenstored: D1 write data/meminfo_free 1740496
Sep 6 00:48:25 xcp2 xenstored: A1564203 w event /local/domain/1/data/meminfo_free /local/domain/1/data/meminfo_free
Sep 6 00:48:25 xcp2 xenstored: write data/updated Sat Sep 5 21:50:07 EEST 2015
Sep 6 00:48:25 xcp2 xenstored: A6 w event /local/domain/1/data/updated /local/domain/1/data/updated
Sep 6 00:48:25 xcp2 xenstored: A10 w event /local/domain/1/data/updated /local/domain/1/data/updated
Sep 6 00:48:26 xcp2 dhclient: DHCPREQUEST on xenbr0 to port 67 (xid=0x304ae9dc)
Sep 6 00:48:27 xcp2 xapi: [ info|xcp2|462044 INET|dispatch:host.call_plugin D:f2c8987bc0ff|taskhelper] task host.call_plugin R:b62d2d4f58eb forwarded
Sep 6 00:48:38 xcp2 dhclient: DHCPREQUEST on xenbr0 to port 67 (xid=0x304ae9dc)
Sep 06 00:48:48 xcp2 syslogd 1.4.1: restart.
Sep 6 00:48:48 xcp2 kernel: klogd 1.4.1, log source = /proc/kmsg
Sep 6 00:48:48 xcp2 kernel: [ 0.000000] Initializing cgroup subsys cpuset
Sep 6 00:48:48 xcp2 kernel: [ 0.000000] Initializing cgroup subsys cpu
Sep 6 00:48:48 xcp2 kernel: [ 0.000000] Initializing cgroup subsys cpuacct
Sep 6 00:48:48 xcp2 kernel: [ 0.000000] Linux version
Can anyone help with diagnostics ?
Thank you,