Re: CS 4.5.2: all hosts reboot after 3 days at production

2015-09-21 Thread Vadim Kimlaychuk

Re: CS 4.5.2: all hosts reboot after 3 days at production

2015-09-15 Thread Frank Louwers

Re: CS 4.5.2: all hosts reboot after 3 days at production

2015-09-15 Thread Abhinandan Prateek
The heartbeat script monitors the health of the primary storage by using a 
timestamp that is written to each primary store.
In case the primary storage is unreachable, it reboots the XenServer in order
to protect the virtual machines from corruption.
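
For illustration, the mechanism described above can be sketched roughly as follows. This is not the actual xenheartbeat.sh shipped with CloudStack: the layout of /opt/cloud/bin/heartbeat (one mounted SR path per line), the interval and timeout values, and the freshness check via stat are assumptions; only the logger and sysrq lines come from the script excerpt quoted later in this thread.

#!/bin/bash
# Rough sketch of the idea only -- not the real /opt/cloud/bin/xenheartbeat.sh.
# Assumption: /opt/cloud/bin/heartbeat lists one mounted SR path per line and
# each store carries a timestamp file that a writer process refreshes regularly.
interval=60    # seconds between checks (assumed value)
timeout=180    # treat the store as dead after this many seconds (assumed value)

while true; do
    now=$(date +%s)
    while read -r hb; do
        [ -z "$hb" ] && continue
        # Modification time of the heartbeat timestamp on this primary store.
        # (The real script also has to cope with NFS calls that hang.)
        lastdate=$(stat -c %Y "$hb" 2>/dev/null || echo 0)
        if [ $((now - lastdate)) -gt "$timeout" ]; then
            /usr/bin/logger -t heartbeat "Problem with $hb: not reachable for $((now - lastdate)) seconds, rebooting system!"
            # SysRq 'b': reboot immediately, without syncing or unmounting.
            echo b > /proc/sysrq-trigger
        fi
    done < /opt/cloud/bin/heartbeat
    sleep "$interval"
done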


Re: CS 4.5.2: all hosts reboot after 3 days at production

2015-09-15 Thread Vadim Kimlaychuk

Re: CS 4.5.2: all hosts reboot after 3 days at production

2015-09-15 Thread Remi Bergsma

Re: CS 4.5.2: all hosts reboot after 3 days at production

2015-09-14 Thread Remi Bergsma
Hi Vadim,

This does indeed reboot a box, once storage fails:
echo b > /proc/sysrq-trigger


Removing it doesn’t make sense, as there are serious issues once you hit this 
code. I’d recommend making sure the storage is reliable.

Regards, Remi
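
For reference, 'b' is the harshest of the magic SysRq actions: it reboots at once, without syncing or unmounting anything, which is also why a log line written just before it may never reach disk. A short summary of the related actions, per the kernel's SysRq documentation:

# Magic SysRq actions relevant here:
#   s - attempt to sync all mounted filesystems
#   u - attempt to remount all filesystems read-only
#   b - reboot immediately, without syncing or unmounting disks
# The CloudStack heartbeat script quoted in this thread uses only 'b':
echo b > /proc/sysrq-trigger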




Re: CS 4.5.2: all hosts reboot after 3 days at production

2015-09-14 Thread Vadim Kimlaychuk

Remi,

  I have analyzed the script xenheartbeat.sh and it seems to be useless,
because it relies on the file /opt/cloud/bin/heartbeat, which has zero length.
It is not set up during installation, and there is no such step in the
documentation. Logically, the admin must run "setup_heartbeat_file.sh" to make
the heartbeat work. If this file has zero length, then the script checks
nothing and logs this message every minute:

Sep 14 04:43:53 xcp1 heartbeat: Problem with heartbeat, no iSCSI or NFS
mount defined in /opt/cloud/bin/heartbeat!

  That means it can't reboot the host, because it doesn't check
anything. Isn't it?

  Is there any other script that may reboot the host when there is a
problem with storage?


Vadim.
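
A few quick checks on the host can confirm that picture. The expected content of /opt/cloud/bin/heartbeat (one NFS or iSCSI SR mount path per line) is an assumption based on the log message above:

wc -c /opt/cloud/bin/heartbeat            # 0 bytes means nothing is being monitored
cat /opt/cloud/bin/heartbeat              # should list the SR mount(s) to watch
ps axf | grep [x]enheartbeat              # is the watchdog script running at all?
grep heartbeat /var/log/messages | tail   # shows the "no iSCSI or NFS mount defined" spam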





Re: CS 4.5.2: all hosts reboot after 3 days at production

2015-09-14 Thread Remi Bergsma
Hi Vadim,

It can also be XenHA, but I remember you already said it is off. Did you
check the hardware health?

I'd recommend turning on XenHA, as otherwise you will not have automatic
recovery in case of a failure.

Regards,
Remi





Re: CS 4.5.2: all hosts reboot after 3 days at production

2015-09-14 Thread Vadim Kimlaychuk

Remi,

I will definitely enable HA once I find out what is rebooting the host. I
know the circumstances under which it happens, and I know that it is
storage-related. Hardware health is monitored via SNMP and there were no
problems with temperature, CPU, RAM or HDD readings. In case of a HW failure
I should theoretically see a kernel panic or crash dumps, but there are
none. I will experiment a bit.


Thank you,

Vadim.
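
For when that moment comes, enabling XenServer HA on the pool might look roughly like this; the SR uuid is a placeholder and the exact procedure should be checked against the XenServer documentation for the version in use:

xe sr-list type=nfs                                        # pick a shared SR for the HA statefile
xe pool-ha-enable heartbeat-sr-uuids=<sr-uuid>             # enable HA on the pool
xe pool-list params=all | grep -E "ha-enabled|ha-config"   # verify, as earlier in the thread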


RE: CS 4.5.2: all hosts reboot after 3 days at production

2015-09-14 Thread Vadim Kimlaychuk

Remi,

  I have analyzed the situation and found that storage may indeed be causing
the host reboots, as you wrote earlier in this thread. The reason: we take
offline backups from the NFS server at exactly the time the hosts fail.
Basically, we copy all files in primary and secondary storage offsite. This
process starts precisely at 00:00, and somewhere around 00:10 - 00:40 the
XenServer host starts to reboot.


  Reading old threads, I have found that /opt/cloud/bin/xenheartbeat.sh may
be doing this job. In particular, the last lines of my xenheartbeat.sh are:


  -
  /usr/bin/logger -t heartbeat "Problem with $hb: not reachable for 
$(($(date +%s) - $lastdate)) seconds, rebooting system!"

  echo b > /proc/sysrq-trigger
  -

  The only unclear point is that I don't have such a line in my logs. Could
the command "echo b > /proc/sysrq-trigger" prevent the message from being
written to the syslog file? The documentation says that it reboots immediately
without synchronizing the FS. It seems there is no other place that could do
it, but I am still not 100% sure.


Vadim.
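
That interpretation is plausible: SysRq 'b' reboots without flushing anything, so a message handed to syslog a moment earlier can easily be lost before it reaches /var/log/messages. A sketch of how the tail of such a script could be made easier to diagnose (not part of the stock script; forwarding syslog to a remote host would be even more reliable):

/usr/bin/logger -t heartbeat "Problem with $hb: not reachable for $(($(date +%s) - $lastdate)) seconds, rebooting system!"
sync                           # give the log line a chance to reach disk
sleep 1                        # small grace period for syslogd to write it out (assumed value)
echo b > /proc/sysrq-trigger   # then the immediate, non-syncing reboot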


Re: CS 4.5.2: all hosts reboot after 3 days at production

2015-09-13 Thread Vadim Kimlaychuk

Hello Remi,

This issue has nothing to do with CS 4.5.2. We got a host reboot
after precisely 1 week with the previous version of CS (4.5.1). That previous
version had been running without a restart for 106 days. So it is
not a software issue.


What really makes me unhappy is that an accidental host reboot made the
entire cluster unusable. The CloudStack management server was up and
running, the second cluster node was up and running all the time, and the VMs
were transferred to the second host, but the System VMs were not rebooted
properly by CS and half of the network was down. The SSVM and CPVM were in
"disconnected" status. Client VMs were up, but couldn't connect to
storage, because the VRs were offline. An entire mess.


I have used planned maintenance mode before and the cluster worked
just perfectly. We didn't have a single second of downtime. But with an
accidental reboot there is no point in clustering. :(


Vadim.

On 2015-09-08 09:35, Vadim Kimlaychuk wrote:


Hello Remi,

First of all, I don't have a /var/log/xha.log file. I have examined the logs
in detail and haven't found any trace that the heartbeat has failed. The
only serious problem I have found in the management logs before the restart is
the following error, repeated many times:


-------------------------------------------------------------
2015-09-06 00:47:21,591 DEBUG [c.c.a.m.AgentManagerImpl] 
(RouterMonitor-1:ctx-2d67d422) Details from executing class 
com.cloud.agent.api.NetworkUsageCommand: Exception: java.lang.Exception

Message: vpc network usage plugin call failed
Stack: java.lang.Exception: vpc network usage plugin call failed
at 
com.cloud.hypervisor.xenserver.resource.XenServer56Resource.VPCNetworkUsage(XenServer56Resource.java:172)
at 
com.cloud.hypervisor.xenserver.resource.XenServer56Resource.execute(XenServer56Resource.java:195)
at 
com.cloud.hypervisor.xenserver.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:62)
at 
com.cloud.hypervisor.xenserver.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:87)
at 
com.cloud.hypervisor.xenserver.resource.XenServer620SP1Resource.executeRequest(XenServer620SP1Resource.java:65)
at 
com.cloud.agent.manager.DirectAgentAttache$Task.runInContext(DirectAgentAttache.java:302)

...

-------------------------------------------------------------

Just a couple of seconds before the XCP2 host restart:

-------------------------------------------------------------
2015-09-06 00:48:27,884 DEBUG [c.c.a.m.DirectAgentAttache] 
(DirectAgentCronJob-83:ctx-ff822baf) Ping from 2(xcp1)
2015-09-06 00:48:27,884 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) Process host VM state report from 
ping process. host: 2
2015-09-06 00:48:27,904 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) Process VM state report. host: 2, 
number of records in report: 6
2015-09-06 00:48:27,904 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 
85, power state: PowerOn
2015-09-06 00:48:27,907 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, 
skip DB writing. vm id: 85
2015-09-06 00:48:27,907 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 
1, power state: PowerOn
2015-09-06 00:48:27,910 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, 
skip DB writing. vm id: 1
2015-09-06 00:48:27,910 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 
2, power state: PowerOn
2015-09-06 00:48:27,913 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, 
skip DB writing. vm id: 2
2015-09-06 00:48:27,913 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 
82, power state: PowerOn
2015-09-06 00:48:27,916 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, 
skip DB writing. vm id: 82
2015-09-06 00:48:27,916 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 
94, power state: PowerOn
2015-09-06 00:48:27,919 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, 
skip DB writing. vm id: 94
2015-09-06 00:48:27,919 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 
90, power state: PowerOn
2015-09-06 00:48:27,922 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, 
skip DB writing. vm id: 

RE: CS 4.5.2: all hosts reboot after 3 days at production

2015-09-13 Thread Vadim Kimlaychuk

Remi,

   Thank you for the hint. At least one problem is identified:

[root@xcp1 ~]# xe pool-list params=all | grep -E "ha-enabled|ha-config"
ha-enabled ( RO): false
ha-configuration ( RO):

   Where should I look for storage errors? Host? Management server? 
I have checked /var/log/messages and there were only regular messages, 
no "fence" or "reboot" commands.


   I have a dedicated NFS server that should be accessible all the
time (at least the NIC interfaces are bonded in master-slave mode). The server
is used for both primary and secondary storage.


Thanks,

Vadim.




RE: CS 4.5.2: all hosts reboot after 3 days at production

2015-09-13 Thread Remi Bergsma
Hi Vadim,

Not sure what the problem is. Although I do know that when shared storage is 
used, both CloudStack and XenServer will fence (reboot) the box to prevent 
corruption in case access to the network or the storage is not possible. What 
storage do you use?

What does this return on a XenServer?:
xe pool-list params=all | grep -E "ha-enabled|ha-config"

HA should be on, or else the pool will not recover properly from a hypervisor crash.

If you search the logs for Fence or reboot, does anything come back?
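
For example, something along these lines on the XenServer host; the exact set of log files worth checking is an assumption and may differ per installation:

grep -iE 'fence|reboot|sysrq|heartbeat' /var/log/messages /var/log/xensource.log
last -x | head    # recent reboot/shutdown records; shows whether the shutdown was clean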

The logs you mention are nothing to worry about.

Can you tell us in some more details what happens and how we can reproduce it?

Regards,
Remi


-----Original Message-----
From: Vadim Kimlaychuk [mailto:va...@kickcloud.net] 
Sent: zondag 13 september 2015 9:32
To: users@cloudstack.apache.org
Cc: Remi Bergsma
Subject: Re: CS 4.5.2: all hosts reboot after 3 days at production

Hello Remi,

 This issue has nothing to do with CS 4.5.2. We got host reboot after 
precisely 1 week with previous version of CS (4.5.1). Previous version has been 
working without restart for 106 days before. So it is not a software issue.

 What does really make me unhappy -- accidental host reboot made entire 
cluster unusable. Cloudstack management server was up and running, second 
cluster node was up and running all the time and VM were transferred to the 
second host, but System VMs were not rebooted properly by CS and half of the 
network was down. SSVM and CPVM were in "disconnected" status. Client VMs were 
up, but couldn't connect to storage, because VRs were offline. Entire mess.

 I have used planned maintenance mode before and cluster worked just 
perfect. We didn't have any single second downtime. But with accidental reboot 
there is no use of clusterization. :(

Vadim.

On 2015-09-08 09:35, Vadim Kimlaychuk wrote:

> Hello Remi,
> 
> First of all I don't have /var/log/xha.log file. I have examined logs 
> in detail and haven't found any trace that heartbeat has failed. The 
> only serious problem I have found in management logs before restart is 
> repeating many times error:
> 
> --------------------------------------------------------------------
> 2015-09-06 00:47:21,591 DEBUG [c.c.a.m.AgentManagerImpl]
> (RouterMonitor-1:ctx-2d67d422) Details from executing class
> com.cloud.agent.api.NetworkUsageCommand: Exception: java.lang.Exception
> Message: vpc network usage plugin call failed
> Stack: java.lang.Exception: vpc network usage plugin call failed
> at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.VPCNetworkUsage(XenServer56Resource.java:172)
> at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.execute(XenServer56Resource.java:195)
> at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:62)
> at com.cloud.hypervisor.xenserver.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:87)
> at com.cloud.hypervisor.xenserver.resource.XenServer620SP1Resource.executeRequest(XenServer620SP1Resource.java:65)
> at com.cloud.agent.manager.DirectAgentAttache$Task.runInContext(DirectAgentAttache.java:302)
> ...
> 
> --------------------------------------------------------------------
> 
> Just couple of seconds before XCP2 host restart:
> 
> --------------------------------------------------------------------
> 2015-09-06 00:48:27,884 DEBUG [c.c.a.m.DirectAgentAttache]
> (DirectAgentCronJob-83:ctx-ff822baf) Ping from 2(xcp1)
> 2015-09-06 00:48:27,884 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl]
> (DirectAgentCronJob-83:ctx-ff822baf) Process host VM state report from 
> ping process. host: 2
> 2015-09-06 00:48:27,904 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl]
> (DirectAgentCronJob-83:ctx-ff822baf) Process VM state report. host: 2, 
> number of records in report: 6
> 2015-09-06 00:48:27,904 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl]
> (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 
> 85, power state: PowerOn
> 2015-09-06 00:48:27,907 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl]
> (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, 
> skip DB writing. vm id: 85
> 2015-09-06 00:48:27,907 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl]
> (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 
> 1, power state: PowerOn
> 2015-09-06 00:48:27,910 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl]
> (DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, 
> skip DB writing. vm id: 1
> 2015-09-06 00:48:27,910 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl]
> (DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm i

Re: CS 4.5.2: all hosts reboot after 3 days at production

2015-09-08 Thread Vadim Kimlaychuk

Hello Remi,

First of all, I don't have a /var/log/xha.log file. I have examined the logs
in detail and haven't found any trace that the heartbeat failed. The only
serious problem I found in the management logs before the restart is the
following error, repeated many times:


-
2015-09-06 00:47:21,591 DEBUG [c.c.a.m.AgentManagerImpl] 
(RouterMonitor-1:ctx-2d67d422) Details from executing class 
com.cloud.agent.api.NetworkUsageCommand: Exception: java.lang.Exception

Message: vpc network usage plugin call failed
Stack: java.lang.Exception: vpc network usage plugin call failed
at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.VPCNetworkUsage(XenServer56Resource.java:172)
at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.execute(XenServer56Resource.java:195)
at com.cloud.hypervisor.xenserver.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:62)
at com.cloud.hypervisor.xenserver.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:87)
at com.cloud.hypervisor.xenserver.resource.XenServer620SP1Resource.executeRequest(XenServer620SP1Resource.java:65)
at com.cloud.agent.manager.DirectAgentAttache$Task.runInContext(DirectAgentAttache.java:302)

...

-

Just a couple of seconds before the XCP2 host restart:

-
2015-09-06 00:48:27,884 DEBUG [c.c.a.m.DirectAgentAttache] 
(DirectAgentCronJob-83:ctx-ff822baf) Ping from 2(xcp1)
2015-09-06 00:48:27,884 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) Process host VM state report from 
ping process. host: 2
2015-09-06 00:48:27,904 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) Process VM state report. host: 2, 
number of records in report: 6
2015-09-06 00:48:27,904 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 
85, power state: PowerOn
2015-09-06 00:48:27,907 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, 
skip DB writing. vm id: 85
2015-09-06 00:48:27,907 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 1, 
power state: PowerOn
2015-09-06 00:48:27,910 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, 
skip DB writing. vm id: 1
2015-09-06 00:48:27,910 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 2, 
power state: PowerOn
2015-09-06 00:48:27,913 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, 
skip DB writing. vm id: 2
2015-09-06 00:48:27,913 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 
82, power state: PowerOn
2015-09-06 00:48:27,916 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, 
skip DB writing. vm id: 82
2015-09-06 00:48:27,916 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 
94, power state: PowerOn
2015-09-06 00:48:27,919 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, 
skip DB writing. vm id: 94
2015-09-06 00:48:27,919 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM state report. host: 2, vm id: 
90, power state: PowerOn
2015-09-06 00:48:27,922 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) VM power state does not change, 
skip DB writing. vm id: 90
2015-09-06 00:48:27,928 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-83:ctx-ff822baf) Done with process of VM state 
report. host: 2
2015-09-06 00:48:27,940 DEBUG [c.c.a.m.DirectAgentAttache] 
(DirectAgentCronJob-154:ctx-2e8a5911) Ping from 1(xcp2)
2015-09-06 00:48:27,940 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-154:ctx-2e8a5911) Process host VM state report from 
ping process. host: 1
2015-09-06 00:48:27,951 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-154:ctx-2e8a5911) Process VM state report. host: 1, 
number of records in report: 4
2015-09-06 00:48:27,951 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-154:ctx-2e8a5911) VM state report. host: 1, vm id: 
100, power state: PowerOn
2015-09-06 00:48:27,954 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-154:ctx-2e8a5911) VM power state does not change, 
skip DB writing. vm id: 100
2015-09-06 00:48:27,954 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] 
(DirectAgentCronJob-154:ctx-2e8a5911) 

Re: CS 4.5.2: all hosts reboot after 3 days at production

2015-09-07 Thread Remi Bergsma
Hi Vadim,

What kind of storage do you use? Can you show /var/log/xha.log (I think that is 
the name) please? It could be xen-ha that fences the box if the heartbeat 
cannot be written. 
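
A quick way to check both at once (just a sketch -- xha.log only exists once 
pool HA has actually been enabled):

# is pool HA on, and which SR holds the statefile?
xe pool-list params=ha-enabled,ha-statefiles
# if HA is on, the HA daemon log should be here
tail -n 100 /var/log/xha.log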

You suggest it is CloudStack. Did you see anything in the mgt logs?

Regards, Remi 

Sent from my iPhone

> On 07 Sep 2015, at 08:26, Vadim Kimlaychuk  wrote:
> 
> Hello all,
> 
>I have experienced accidental cluster reboot 3 days after update to CS 
> 4.5.2. Cluster is XenServer 6.5 with SP1. Reboot has been started from slave 
> node and then - master.
> Syslog on slave shows only this:
> 
> Sep  6 00:47:05 xcp2 last message repeated 3 times
> Sep  6 00:47:15 xcp2 xenstored:  D12  write data/meminfo_free 
> 713732
> Sep  6 00:47:15 xcp2 xenstored:  A1564203 w event   
> /local/domain/12/data/meminfo_free /local/domain/12/data/meminfo_free
> Sep  6 00:47:15 xcp2 xenstored:  D12  write data/updated Sun Sep  
> 6 00:48:55 EEST 2015
> Sep  6 00:47:15 xcp2 xenstored:  A6   w event   
> /local/domain/12/data/updated /local/domain/12/data/updated
> Sep  6 00:47:15 xcp2 xenstored:  A10  w event   
> /local/domain/12/data/updated /local/domain/12/data/updated
> Sep  6 00:47:26 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 
> (xid=0x304ae9dc)
> Sep  6 00:47:27 xcp2 xapi: [ info|xcp2|462044 INET 
> 0.0.0.0:80|dispatch:host.call_plugin D:7593b578fada|taskhelper] task 
> host.call_plugin R:ddd3cc399f86 forwarded 
> (trackid=407f6adaa118a34f19eb1e29cd68a0e8)
> Sep  6 00:47:36 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 
> (xid=0x304ae9dc)
> Sep  6 00:48:18 xcp2 last message repeated 4 times
> Sep  6 00:48:25 xcp2 xenstored:  D1   write data/meminfo_free 
> 1740496
> Sep  6 00:48:25 xcp2 xenstored:  A1564203 w event   
> /local/domain/1/data/meminfo_free /local/domain/1/data/meminfo_free
> Sep  6 00:48:25 xcp2 xenstored:  D1   write data/updated Sat Sep  
> 5 21:50:07 EEST 2015
> Sep  6 00:48:25 xcp2 xenstored:  A6   w event   
> /local/domain/1/data/updated /local/domain/1/data/updated
> Sep  6 00:48:25 xcp2 xenstored:  A10  w event   
> /local/domain/1/data/updated /local/domain/1/data/updated
> Sep  6 00:48:26 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 
> (xid=0x304ae9dc)
> Sep  6 00:48:27 xcp2 xapi: [ info|xcp2|462044 INET 
> 0.0.0.0:80|dispatch:host.call_plugin D:f2c8987bc0ff|taskhelper] task 
> host.call_plugin R:b62d2d4f58eb forwarded 
> (trackid=e3d4ea00c96194830a7dbbfc35563a3c)
> Sep  6 00:48:38 xcp2 dhclient: DHCPREQUEST on xenbr0 to 172.17.0.1 port 67 
> (xid=0x304ae9dc)
> Sep 06 00:48:48 xcp2 syslogd 1.4.1: restart.
> Sep  6 00:48:48 xcp2 kernel: klogd 1.4.1, log source = /proc/kmsg started.
> Sep  6 00:48:48 xcp2 kernel: [0.00] Initializing cgroup subsys cpuset
> Sep  6 00:48:48 xcp2 kernel: [0.00] Initializing cgroup subsys cpu
> Sep  6 00:48:48 xcp2 kernel: [0.00] Initializing cgroup subsys cpuacct
> Sep  6 00:48:48 xcp2 kernel: [0.00] Linux version 3.10.0+2
> 
>   Can anyone help with diagnostics ?
> 
> Thank you,
> 
> Vadim.