cs 4.5.1, hosts stuck in disconnected status

2016-07-21 Thread Francois Scheurer

Dear CS contributors

We use CS 4.5.1 on 3 clusters with XenServer 6.5.

One host in a cluster (and another host in a different cluster) went 
into status "Disconnected" and stayed there.

We tried to unmanage/remanage the cluster to force a reconnection, 
destroyed all System VMs (virtual console and secondary storage VMs) 
and restarted all management servers.
We verified on the XenServer that the host is not disabled, and we 
restarted the Xen toolstack.
We also updated the host table to set a mgmt_server_id: update host set 
where id=15;

Then we restarted the management servers again, and also the System VMs.
Finally we updated the table to clear the mgmt_server_id: update host set 
status="Alert",resource_state="Disabled",mgmt_server_id=NULL where id=15;

Then we restarted the management servers again, and also the System VMs.
Nothing helped; the host does not reconnect.

Calling ForceReconnect shows this error:

2016-07-18 11:26:07,418 DEBUG [c.c.a.ApiServlet] 
(catalina-exec-13:ctx-4e82fdce) ===START=== -- GET 

2016-07-18 11:26:07,450 INFO  [o.a.c.f.j.i.AsyncJobMonitor] 
(API-Job-Executor-23:ctx-fc340a8e job-148672) Add job-148672 into job 
2016-07-18 11:26:07,453 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] 
(catalina-exec-13:ctx-4e82fdce ctx-9c696de2) submit async job-148672, 
details: AsyncJobVO {id:148672, userId: 51, accountId: 51, instanceType: 
Host, instanceId: 15, cmd: 
org.apache.cloudstack.api.command.admin.host.ReconnectHostCmd, cmdInfo: 
cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, 
result: null, initMsid: 345049098122, completeMsid: null, lastUpdated: 
null, lastPolled: null, created: null}
2016-07-18 11:26:07,454 DEBUG [c.c.a.ApiServlet] 
(catalina-exec-13:ctx-4e82fdce ctx-9c696de2) ===END=== -- 
2016-07-18 11:26:07,455 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] 
(API-Job-Executor-23:ctx-fc340a8e job-148672) Executing AsyncJobVO 
{id:148672, userId: 51, accountId: 51, instanceType: Host, instanceId: 
15, cmd: org.apache.cloudstack.api.command.admin.host.ReconnectHostCmd, 
cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, 
result: null, initMsid: 345049098122, completeMsid: null, lastUpdated: 
null, lastPolled: null, created: null}
2016-07-18 11:26:07,461 DEBUG [c.c.a.m.DirectAgentAttache] 
(DirectAgent-495:ctx-77e68e88) Seq 213-6743858967010618892: Executing 
2016-07-18 11:26:07,467 INFO  [c.c.a.m.AgentManagerImpl] 
(API-Job-Executor-23:ctx-fc340a8e job-148672 ctx-0061c491) Unable to 
disconnect host because it is not connected to this server: 15
2016-07-18 11:26:07,467 WARN [o.a.c.a.c.a.h.ReconnectHostCmd] 
(API-Job-Executor-23:ctx-fc340a8e job-148672 ctx-0061c491) Exception:

org.apache.cloudstack.api.ServerApiException: Failed to reconnect host

at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:141)

Re: cs 4.5.1, hosts stuck in disconnected status

2016-07-21 Thread Stephan Seitz

> We use CS 4.5.1 on a 3 Clusters with XenServer 6.5.
> One Host in a cluster (and another in another cluster as well) got and
> stayed in status "Disconnected".

Run
xe host-list
to determine the UUID of your disconnected host, and try to enable it via
xe host-enable uuid=NN

If the host is enabled in the Xen pool, ACS should be able to reconnect it.

VM states should be completely unrelated to your problem.


- Stephan

Re: cs 4.5.1, hosts stuck in disconnected status

2016-07-21 Thread Francois Scheurer

Dear CS contributors

We could fix the issue without having to restart the disconnected Xen hosts.
We suspect that the root cause was an interrupted agent transfer during 
a restart of a Management Server (CSMAN).

We have 3 CSMANs running in a cluster: man01, man02 and man03.
The disconnected vh010 belongs to a Xen host cluster with 4 nodes: 
vh009, vh010, vh011 and vh012.
Below are the chronological events from the logs regarding the 
disconnection of vh010, with our comments:

===>vh010 (host 19) was on agent 345049103441 (man02)
vh010: Last Disconnected   2016-07-18T14:03:50+0200
345049098498 = man01
345049103441 = man02
345049098122 = man03

2016-07-18T14:00:34.878973+02:00 ewcstack-man02-prod [audit 
root/10467 as root/10467 on 
pts/1/>] /root: service 
cloudstack-management restart; service cloudstack-usage restart

2016-07-18 14:02:15,797 DEBUG [c.c.s.StorageManagerImpl] 
(StorageManager-Scavenger-1:ctx-ea98efd4) Storage pool garbage collector 
found 0 templates to clean up in storage pool: ewcstack-vh010-prod Local 
!2016-07-18 14:02:26,699 DEBUG 
[c.c.a.m.ClusteredAgentManagerImpl] (StatsCollector-1:ctx-7da7a491) Host 
19 has switched to another management server, need to update agent map 
with a forwarding agent attache

2016-07-18T14:02:47.317644+02:00 ewcstack-man01-prod [audit 
root/11094 as root/11094 on 
pts/0/>] /root: service 
cloudstack-management restart; service cloudstack-usage restart;

2016-07-18 14:03:24,859 DEBUG [c.c.s.StorageManagerImpl] 
(StorageManager-Scavenger-1:ctx-c39aaa53) Storage pool garbage collector 
found 0 templates to clean up in storage pool: ewcstack-vh010-prod Local 

2016-07-18 14:03:26,260 DEBUG [c.c.a.m.AgentManagerImpl] 
(AgentManager-Handler-6:null) SeqA 256-29401: Sending Seq 256-29401:  { 
Ans: , MgmtId: 345049103441, via: 256, Ver: v1, Flags: 100010, 
[{"com.cloud.agent.api.AgentControlAnswer":{"result":true,"wait":0}}] }
2016-07-18 14:03:28,535 DEBUG [c.c.s.StatsCollector] 
(StatsCollector-1:ctx-814f1ae1) HostStatsCollector is running...
2016-07-18 14:03:28,553 DEBUG [c.c.a.m.ClusteredAgentAttache] 
(StatsCollector-1:ctx-814f1ae1) Seq 7-6771162039751540742: Forwarding 
null to 345049098122
2016-07-18 14:03:28,661 DEBUG [c.c.a.m.AgentManagerImpl] 
(AgentManager-Handler-7:null) SeqA 244-153489: Processing Seq 
244-153489:  { Cmd , MgmtId: -1, via: 244, Ver: v1, Flags: 11, 
\"connections\": []\n}","wait":0}}] }
2016-07-18 14:03:28,667 DEBUG [c.c.a.m.AgentManagerImpl] 
(AgentManager-Handler-7:null) SeqA 244-153489: Sending Seq 244-153489:  
{ Ans: , MgmtId: 345049103441, via: 244, Ver: v1, Flags: 100010, 
[{"com.cloud.agent.api.AgentControlAnswer":{"result":true,"wait":0}}] }
2016-07-18 14:03:28,731 DEBUG [c.c.a.t.Request] 
(StatsCollector-1:ctx-814f1ae1) Seq 7-6771162039751540742: Received:  { 
Ans: , MgmtId: 345049103441, via: 7, Ver: v1, Flags: 10, { 
GetHostStatsAnswer } }

===>11 = vh006, 345049098122 = man03, vh006 is transferred to man03:
2016-07-18 14:03:28,744 DEBUG [c.c.a.m.ClusteredAgentAttache] 
(StatsCollector-1:ctx-814f1ae1) Seq 11-5143110774457106438: Forwarding 
null to 345049098122
2016-07-18 14:03:28,838 DEBUG [c.c.a.t.Request] 
(StatsCollector-1:ctx-814f1ae1) Seq 11-5143110774457106438: Received:  { 
Ans: , MgmtId: 345049103441, via: 11, Ver: v1, Flags: 10, { 
GetHostStatsAnswer } }
===>19 = vh010, 345049098498 = man01, vh010 is transferred to man01, but 
man01 is stopping and starting at 14:02:47, so the transfer failed:
!2016-07-18 14:03:28,851 DEBUG [c.c.a.m.ClusteredAgentAttache] 
(StatsCollector-1:ctx-814f1ae1) Seq 19-2009731333714083845: Forwarding 
null to 345049098498
2016-07-18 14:03:28,852 DEBUG [c.c.a.m.ClusteredAgentAttache] 
(StatsCollector-1:ctx-814f1ae1) Seq 19-2009731333714083845: Error on 
connecting to management node: null try = 1
2016-07-18 14:03:28,852 INFO [c.c.a.m.ClusteredAgentAttache] 
(StatsCollector-1:ctx-814f1ae1) IOException Broken pipe when sending 
data to peer 345049098498, close peer connection and let it re-open
2016-07-18 14:03:28,856 WARN  [c.c.a.m.AgentManagerImpl] 
(StatsCollector-1:ctx-814f1ae1) Exception while sending
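
The forwarding events annotated in the excerpt above can also be pulled out of a management-server.log mechanically. What follows is only a minimal Python sketch, not a CloudStack tool: the function name summarize and the two regular expressions are assumptions derived from the log lines quoted in this post, and the msid-to-name table is the one listed above. The number before the dash in "Seq 19-..." is read as the host id, as in the comments above.

```python
import re

# Management server ids (msid) and names as listed in this post.
MSID_NAMES = {
    "345049098498": "man01",
    "345049103441": "man02",
    "345049098122": "man03",
}

# Patterns are assumptions based on the quoted log lines only.
FORWARD_RE = re.compile(r"Seq (\d+)-\d+: Forwarding \S+ to (\d+)")
BROKEN_RE = re.compile(r"Broken pipe when sending data to peer (\d+)")


def summarize(log_text):
    """Return (forwards, broken_peers) found in a log excerpt.

    forwards is a list of (host_id, target_server) tuples;
    broken_peers is a list of servers that a send failed to reach.
    """
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    forwards = [(int(host), MSID_NAMES.get(msid, msid))
                for host, msid in FORWARD_RE.findall(flat)]
    broken = [MSID_NAMES.get(msid, msid) for msid in BROKEN_RE.findall(flat)]
    return forwards, broken


sample = """
2016-07-18 14:03:28,744 DEBUG [c.c.a.m.ClusteredAgentAttache]
(StatsCollector-1:ctx-814f1ae1) Seq 11-5143110774457106438: Forwarding
null to 345049098122
2016-07-18 14:03:28,851 DEBUG [c.c.a.m.ClusteredAgentAttache]
(StatsCollector-1:ctx-814f1ae1) Seq 19-2009731333714083845: Forwarding
null to 345049098498
2016-07-18 14:03:28,852 INFO [c.c.a.m.ClusteredAgentAttache]
(StatsCollector-1:ctx-814f1ae1) IOException Broken pipe when sending
data to peer 345049098498, close peer connection and let it re-open
"""

forwards, broken = summarize(sample)
print(forwards)
print(broken)
```

A host id that appears in forwards with a target that also appears in broken (here host 19 and man01) is a candidate for an interrupted transfer.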


Re: cs 4.5.1, hosts stuck in disconnected status

2016-07-21 Thread Dag Sonstebo
Hi Francois,

As pointed out by Stephan the problem is probably with your Xen cluster rather 
than your CloudStack management. On the disconnected host you may want to carry 
out a xe-toolstack-restart - this will restart Xapi without affecting running 
VMs. After that check your cluster with ‘xe host-list’ etc. If this doesn’t 
help you may have to consider restarting the host.

Dag Sonstebo
Cloud Architect


RE: cs 4.5.1, hosts stuck in disconnected status

2016-07-21 Thread Scheurer François
Dear Stephan and Dag,

We also thought about that and checked it, but the host was already enabled on Xen.

Best Regards

EveryWare AG
François Scheurer
Senior Systems Engineer


Re: cs 4.5.1, hosts stuck in disconnected status

2016-07-21 Thread Marc-Andre Jutras

Hey Francois,

Here are some suggestions...

Do you have a load balancer in front of your 3 CSMAN servers? If so, 
is any persistence defined in its configuration? Can you try to 
set it to SourceIP and set the timeout to something like 60 or 120 min?

Also validate these points:

Under global settings / host, make sure your Xen hosts, VMs or System VMs 
can reach the IP defined there...

iptables: make sure these TCP ports are open on each of your CSMAN 
servers: 8080, 8096, 8250, 9090 (and also validate that these ports 
are open on your load balancer too).

If your zone is set to Advanced mode, make sure each of your XenServers 
is running openvswitch ( xe-switch-network-backend openvswitch ); if not 
( basic mode ), set it to bridge ( xe-switch-network-backend bridge ) 
( more info: 

Also check the iptables definitions on each of your Xen servers. To test, 
flush all tables and check whether CloudStack can connect correctly... 
( iptables -F, iptables definitions are in /etc/sysconfig/iptables )

You can also try to delete one Xen host and re-add it to CloudStack, and 
check in the CS logs whether you see some files being copied to the host...

try that and keep us posted !
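
The port checks suggested above can be scripted. This is a minimal sketch using only the Python standard library; the helper names port_open and check_ports are made up for this example. It only tells whether a TCP connect succeeds from the machine it runs on, so it has to be run from the network that is supposed to reach the CSMAN or the load balancer (for example from a System VM network for port 8250).

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to True/False depending on whether it accepts connections."""
    return {port: port_open(host, port, timeout) for port in ports}
```

For example, check_ports("man01", [8080, 8096, 8250, 9090]) returns a dict such as {8080: True, ...}, where "man01" stands for the address of one CSMAN or of the load balancer. A False result does not say whether iptables, the load balancer or the service itself is the cause.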



RE: cs 4.5.1, hosts stuck in disconnected status

2016-07-22 Thread Scheurer François
Hi Marcus

Many thanks for your answer.

>Did you have any load balancer in front of your 3 CSMAN servers? if so, is 
>there any persistence defined in your configuration ? Can you try to set it to 
>SourceIP and fix the timeout to something like 60 or 120 min ?

Yes, we have a haproxy with balance source. But the timeouts are only 5 min; I 
will extend them to 60 min as you proposed.

>under global settings / host, make sure your Xen hosts, VM or System VM can 
>reach the IP defined there...

Yes, the domain from the global parameters is reachable from Xen and the System 
VMs on TCP 8250. (ping is also ok)

>iptables : make sure these tcp port are open on each of your CSMAN servers... 
>: 8080, 8096, 8250, 9090 ( and also validate that you got these ports open on 
>your Load balancer too... )

Yes, all 4 ports are open in the CSMAN iptables.

But in the LB we opened only 8080 (for UI/CS API) and 8250 (privately for 
System VMs).

I thought 8096 and 9090 are only needed between the CSMANs (for 
unauthenticated API calls from scripts and for pings).

>if your zone is set to Advanced mode, make sure each of your xenserver is 
>running openvswitch ( xe-switch-network-backend openvswitch ) if not, ( basic 
>mode ) set it to bridge... ( xe-switch-network-backend bridge ) ( more info:



Yes this is fine.

>check also each iptables definition in each of your xen server, to test, flush 
>all tables and check if Cloudstack can connect correctly to it...
>( iptables -F  iptables definition in : /etc/sysconfig/iptables )

>you can also try to delete one xenhost and re-add it to cloudstack and check 
>in the CS logs if you're seeing some files copied to the host...

Is it possible to delete and re-add a xenhost with VM's running on it? Or do we 
need to evacuate them first?

We also found that solution in older messages on the mailing list. But we 
finally got the Xen hosts reconnected by simply stopping all CSMANs and 
restarting a single CSMAN.

Once all Xen hosts were connected, we could start the other CSMANs.

As I wrote in my previous message, we suspect that the issue was caused by 
entries in the op_host_transfer table.

It seems that if a Xen host gets transferred from one CSMAN to another 
(rebalance) and the latter CSMAN gets stopped before the transfer completes, 
then these table entries stay in the DB forever and the CSMANs never try 
again to reconnect those Xen hosts.

This is just speculation; maybe you can confirm it.
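
To check this suspicion, the content of op_host_transfer can be listed while a host is stuck. The following is only a sketch: the helper name pending_transfers is made up, and it uses SELECT * so that it does not depend on the table's column names. The demo part runs against an in-memory SQLite stand-in so that it works without a CloudStack database; the columns and the row created there are assumptions for illustration only, not data from our installation. On a real installation the same SELECT would be issued against the CloudStack MySQL database.

```python
import sqlite3


def pending_transfers(conn):
    """Return all rows of op_host_transfer as dicts keyed by column name.

    conn is any DB-API connection to a database containing the table.
    """
    cur = conn.cursor()
    cur.execute("SELECT * FROM op_host_transfer")
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


# Demo with a stand-in table; schema and row are hypothetical.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE op_host_transfer ("
    "id INTEGER, initial_mgmt_server_id INTEGER, "
    "future_mgmt_server_id INTEGER, state TEXT)"
)
demo.execute(
    "INSERT INTO op_host_transfer VALUES "
    "(19, 345049103441, 345049098498, 'TransferStarted')"
)
rows = pending_transfers(demo)
print(rows)
```

If rows for the disconnected hosts are still present long after the restart of the CSMANs, that would support the explanation above; an empty table would speak against it.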

The main log entries supporting this explanation are:

===>11 = vh006, 345049098122 = man03, vh006 is transferred to man03:

  2016-07-18 14:03:28,744 DEBUG [c.c.a.m.ClusteredAgentAttache] 
(StatsCollector-1:ctx-814f1ae1) Seq 11-5143110774457106438: Forwarding null to 

  2016-07-18 14:03:28,838 DEBUG [c.c.a.t.Request] 
(StatsCollector-1:ctx-814f1ae1) Seq 11-5143110774457106438: Received: { Ans: , 
MgmtId: 345049103441, via: 11, Ver: v1, Flags: 10, { GetHostStatsAnswer } }

===>19 = vh010, 345049098498 = man01, vh010 is transferred to man01, but man01 
is stopping and starting at 14:02:47, so the transfer failed:

! 2016-07-18 14:03:28,851 DEBUG [c.c.a.m.ClusteredAgentAttache] 
(StatsCollector-1:ctx-814f1ae1) Seq 19-2009731333714083845: Forwarding null to 

  2016-07-18 14:03:28,852 DEBUG [c.c.a.m.ClusteredAgentAttache] 
(StatsCollector-1:ctx-814f1ae1) Seq 19-2009731333714083845: Error on connecting 
to management node: null try = 1

  2016-07-18 14:03:28,852 INFO [c.c.a.m.ClusteredAgentAttache] 
(StatsCollector-1:ctx-814f1ae1) IOException Broken pipe when sending data to 
peer 345049098498, close peer connection and let it re-open

  2016-07-18 14:03:28,856 WARN [c.c.a.m.AgentManagerImpl] 
(StatsCollector-1:ctx-814f1ae1) Exception while sending 

See more details in my previous post.

I have another question: the CloudStack documentation says that TCP port 
8250 is used by the system VMs (console proxy & secondary storage) to connect 
to the CSMANs.

Is it true that the Xen hosts do not use this port?

AFAIK the Xen hosts only get connections from the CSMANs (tcp 22/80/443) but 
never initiate connections to them. Is that correct?

Many Thanks to all contributors!

It is really amazing to see such good and responsive support from a free mailing list.

Best Regards

Francois Scheurer


Re: cs 4.5.1, hosts stuck in disconnected status

2016-07-22 Thread Marc-Andre Jutras

> I have another question: the cloudstack documentation says that the tcp port 8250 
> is used for system vm’s (console proxy & secondary storage) to connect to the 
> CSMAN’s.
>
> Is it true that the Xen Hosts do not use this port?

true : 8250 is only used by the SSVM, CVM and VR / VPC

> AFAIK the Xen Hosts only get connections from the CSMAN’s (tcp 22/80/443) but 
> never initiate connections to them. Is that correct?

Correct, it's Cloudstack who will initiate the connection to the 


RE: cs 4.5.1, hosts stuck in disconnected status

2016-07-25 Thread Scheurer François
Hello Marc-Andre

Many thanks for all your answers!

Yes the Xen Hosts use static IP.

>Broken pipe : just like if something external to Cloudstack or Xen have
>reset the communication between your CSMAN server and your Xenserver...

I think the reason for this broken connection was that we were restarting the 
CSMAN man01 at 14:02:47:

2016-07-18T14:02:47.317644+02:00 ewcstack-man01-prod [audit root/11094 
as root/11094 on pts/0/>] /root: service 
cloudstack-management restart; service cloudstack-usage restart;

In the catalina.out of man01 we can see that the broken pipe error at 14:03:28 
occurred in the middle of a restart:

Jul 18, 2016 2:02:56 PM org.apache.catalina.core.AprLifecycleListener init
INFO: The APR based Apache Tomcat Native library which allows optimal 
performance in production environments was not found on the java.library.path: 

Jul 18, 2016 2:04:12 PM org.apache.coyote.http11.Http11NioProtocol start
INFO: Starting Coyote HTTP/1.1 on http-8080

Jul 18, 2016 2:04:33 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 96634 ms

In the management-server.log.2016-07-18 of man02 we then see that the agent 
transfer failed:

===>19 = vh010, 345049098498 = man01, vh010 is transferred to man01, but man01 
is stopping and starting at 14:02:47, so the transfer failed:

!2016-07-18 14:03:28,851 DEBUG [c.c.a.m.ClusteredAgentAttache] 
(StatsCollector-1:ctx-814f1ae1) Seq 19-2009731333714083845: Forwarding null to 

2016-07-18 14:03:28,852 DEBUG [c.c.a.m.ClusteredAgentAttache] 
(StatsCollector-1:ctx-814f1ae1) Seq 19-2009731333714083845: Error on connecting 
to management node: null try = 1

2016-07-18 14:03:28,852 INFO  [c.c.a.m.ClusteredAgentAttache] 
(StatsCollector-1:ctx-814f1ae1) IOException Broken pipe when sending data to 
peer 345049098498, close peer connection and let it re-open

2016-07-18 14:03:28,856 WARN  [c.c.a.m.AgentManagerImpl] 
(StatsCollector-1:ctx-814f1ae1) Exception while sending

But the strange thing is that afterwards the host never got connected. We 
stopped all CSMANs together, then started all of them, but the host stayed 
disconnected.

It was necessary to stop all CSMANs and then start only one, then wait until 
the hosts connected, and then finally start all other CSMANs.

Do you think it may be a bug related to the op_host_transfer table?

Anyway, the main thing is that we could finally get all hosts connected again.

I think in future we will avoid rolling restarts of CSMANs.

It seems safer to stop and start all of them at the same time rather than one 
by one.
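
The recovery sequence that worked for us (stop all CSMANs, start only one, wait until the hosts connect, then start the others) can be written down as a small procedure. This is only a sketch of the ordering, not a tested operations script: staged_restart and its parameters are made-up names, and stop, start and hosts_connected are hypothetical hooks that the caller has to supply, for example by wrapping the service commands on each server and a check of the host status.

```python
import time


def staged_restart(servers, stop, start, hosts_connected,
                   poll_seconds=10.0, timeout_seconds=600.0,
                   sleep=time.sleep):
    """Stop all management servers, start one, wait, then start the rest.

    stop and start take a server name; hosts_connected returns True once
    no host is left disconnected. All three are caller-supplied hooks.
    """
    if not servers:
        return
    # 1. Stop every management server before starting any of them.
    for name in servers:
        stop(name)
    # 2. Start a single server so it takes over all host agents.
    first, rest = servers[0], servers[1:]
    start(first)
    # 3. Wait until the hosts have reconnected, or give up after the timeout.
    waited = 0.0
    while not hosts_connected():
        if waited >= timeout_seconds:
            raise TimeoutError("hosts did not reconnect via " + first)
        sleep(poll_seconds)
        waited += poll_seconds
    # 4. Only then start the remaining servers.
    for name in rest:
        start(name)
```

The sleep parameter exists only so that the waiting step can be replaced in a dry run; the default timeout and polling interval are arbitrary values chosen for the example.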

Best Regards


EveryWare AG

François Scheurer

Senior Systems Engineer

-Original Message-
From: Marc-Andre Jutras [mailto:mar...@marcuspocus.com]
Sent: Friday, July 22, 2016 5:40 PM
To: users@cloudstack.apache.org
Subject: Re: cs 4.5.1, hosts stuck in disconnected status

Hey !! Answers through your msg...

On 2016-07-22 10:32 AM, Scheurer François wrote:

> Hi Marcus
>
> Many thanks for your answer.

>> Did you have any load balancer in front of your 3 CSMAN servers? if so, is 
>> there any persistence defined in your configuration ? Can you try to set it 
>> to SourceIP and fix the timeout to something like 60 or 120 min ?

> Yes we have a haproxy with balance source. But the timeout are only 5 min, I 
> will extend them to 60min as you proposed.

Great !! that should stabilize your SSVM // CVM connectivity...

>> under global settings / host, make sure your Xen hosts, VM or System VM can 
>> reach the IP defined there...

> Yes the domain from the global parameters is reachable from Xen and
> System VM's under tcp 8250. (ping is also ok)

>> iptables : make sure these tcp port are open on each of your CSMAN
>> servers... : 8080, 8096, 8250, 9090 ( and also validate that you got
>> these ports open on your Load balancer too... )

> Yes all 4 ports are opened in the CSMAN iptables.
>
> But in the LB we opened only 8080 (for UI/CS API) and 8250 (privately for 
> System VM's).
>
> I thought 8096 and 9090 are only needed between the CSMAN's (for
> unauthenticated API calls from scripts and for pings)

Correct... 8096 is required for API and 9090 for HA between all CSMAN server, 
If you're not exposing the API, you don't need to map 8096...

and I personally set 9090 just to have another point of monitoring available 
but it's up to you to add this to your haproxy config or not... ( not required )

>> if your zone is set to Advanced mode, make sure each of your xenserver is 