Re: Management Server won't connect after cluster shutdown and restart

2014-08-31 Thread Ian Duffy
Ilya,

My case wasn't a generic CloudStack fault in the end; manual editing of the
database had occurred, putting things into an invalid state.

The others on this thread might be able to provide you with information
about your issues. I found that bumping the log level up to TRACE provided
much greater insight.
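For reference, raising the level is a small change to the management server's log4j configuration. This is a hedged sketch only — the file path and category name are assumptions based on a typical 4.x install, so check your own installation:

```xml
<!-- /etc/cloudstack/management/log4j-cloud.xml (path assumed) -->
<!-- Bump the com.cloud category to TRACE, then restart the management server -->
<category name="com.cloud">
  <priority value="TRACE"/>
</category>
```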



Re: Management Server won't connect after cluster shutdown and restart

2014-08-30 Thread ilya musayev

Can you tell us more, please?

In my rather large environments, I may need to do several restarts for
CloudStack to come up properly.


Otherwise it complains that SSVM and CPVM are not ready to launch in Zone X.

Thanks
ilya

Re: Management Server won't connect after cluster shutdown and restart

2014-08-30 Thread Ian Duffy
Hi All,

Thank you very much for the help.

Ended up solving the issue. There was an invalid value in our configuration
table which seemed to prevent a lot of DAOs from being autowired.





RE: Management Server won't connect after cluster shutdown and restart

2014-08-29 Thread Paul Angus
Hi Ian,

I've seen this kind of behaviour before with KVM hosts reconnecting.

There's a SELECT … FOR UPDATE query on the op_ha_work table which locks the
table, stopping other hosts from updating their status. If there are a lot of
entries in there, they all lock each other out. Deleting the entries fixed the
problem, but you then have to deal with hosts and VMs being up/down yourself.

So check the op_ha_work table for a large number of entries, which can lock up
the database. If you can check the database for the queries it's handling,
that would be best.
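To make that concrete, here's a hedged sketch of the kind of queries involved. Table and column names are assumed from the 4.x `cloud` schema, so verify against your own database, stop the management server, and take a backup before deleting anything:

```sql
-- How many HA work entries are queued, and of what type?
SELECT type, COUNT(*) FROM op_ha_work GROUP BY type;

-- Inspect a few entries before touching anything
SELECT id, instance_id, type, step, taken FROM op_ha_work LIMIT 20;

-- Only if stale entries are locking each other out (back up first!)
DELETE FROM op_ha_work;
```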

Also check that the management server and MySQL DB are tuned for the load
being thrown at them.
(http://support.citrix.com/article/CTX132020)
Remember that if you have other services, such as Nagios or Puppet/Chef,
directly reading the DB, that adds to the number of connections into the MySQL
DB. I have seen the management server starved of MySQL connections when a lot
of hosts are brought back online.
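A quick way to check whether connection starvation is happening — these are standard MySQL statements, run from the mysql client on the DB host:

```sql
SHOW VARIABLES LIKE 'max_connections';   -- configured ceiling
SHOW STATUS LIKE 'Threads_connected';    -- connections in use right now
SHOW STATUS LIKE 'Max_used_connections'; -- high-water mark since restart
SHOW FULL PROCESSLIST;                   -- who is holding them
```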


Regards

Paul Angus
Cloud Architect
S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
paul.an...@shapeblue.com


RE: Management Server won't connect after cluster shutdown and restart

2014-08-29 Thread Michael Phillips
I posted an email yesterday describing how I shut down and restart my CS
instances. Works 100%.


Re: Management Server won't connect after cluster shutdown and restart

2014-08-29 Thread Carlos Reategui
Hi Ian,

So the root of the problem was that the machines were not started up in
the correct order.

My plan had been to stop all VMs from CS, then stop CS, then shut down the
VM hosts.  On the way back up, the hosts needed to be brought up first, and
once they were OK, bring up the CS machine and make sure everything was
in the same state it thought things were in when it was shut down.
Unfortunately, CS came up before everything else was the way it expected
it to be, and I did not realize that at the time.

To resolve it, I went back to my CS db backup from right after I shut down
the MS, made sure the VM hosts were all as expected, and then started the
MS.







Re: Management Server won't connect after cluster shutdown and restart

2014-08-29 Thread Ian Duffy
Hi Carlos,

Did you ever find a fix for this?

I'm seeing the same issue on 4.1.1 with VMware ESXi.


On 29 October 2013 04:54, Carlos Reategui  wrote:

> Update.  I cleared out the async_job table and also reset the system VMs it
> thought were in Starting mode from my previous attempts, by setting them from
> Starting to Stopped.  I also re-set the XS pool master to be the one XS
> thinks it is.
>
> Now when I start the CS MS here are the logs leading up to the first
> exception about the Unable to reach the pool:
>
> 2013-10-28 21:27:11,040 DEBUG [cloud.alert.ClusterAlertAdapter]
> (Cluster-Notification-1:null) Management server node 172.30.45.2 is up,
> send alert
>
> 2013-10-28 21:27:11,045 WARN  [cloud.cluster.ClusterManagerImpl]
> (Cluster-Notification-1:null) Notifying management server join event took 9
> ms
>
> 2013-10-28 21:27:23,236 DEBUG [cloud.server.StatsCollector]
> (StatsCollector-2:null) HostStatsCollector is running...
>
> 2013-10-28 21:27:23,243 DEBUG [cloud.server.StatsCollector]
> (StatsCollector-3:null) VmStatsCollector is running...
>
> 2013-10-28 21:27:23,247 DEBUG [cloud.server.StatsCollector]
> (StatsCollector-1:null) StorageCollector is running...
>
> 2013-10-28 21:27:23,255 DEBUG [cloud.server.StatsCollector]
> (StatsCollector-1:null) There is no secondary storage VM for secondary
> storage host nfs://172.30.45.2/store/secondary
>
> 2013-10-28 21:27:23,273 DEBUG [agent.manager.ClusteredAgentAttache]
> (StatsCollector-2:null) Seq 1-201916421: Forwarding null to 233845174730255
>
> 2013-10-28 21:27:23,274 DEBUG [agent.manager.ClusteredAgentAttache]
> (AgentManager-Handler-9:null) Seq 1-201916421: Routing from 233845174730253
>
> 2013-10-28 21:27:23,275 DEBUG [agent.manager.ClusteredAgentAttache]
> (AgentManager-Handler-9:null) Seq 1-201916421: Link is closed
>
> 2013-10-28 21:27:23,275 DEBUG [agent.manager.ClusteredAgentManagerImpl]
> (AgentManager-Handler-9:null) Seq 1-201916421: MgmtId 233845174730253: Req:
> Resource [Host:1] is unreachable: Host 1: Link is closed
>
> 2013-10-28 21:27:23,275 DEBUG [agent.manager.ClusteredAgentManagerImpl]
> (AgentManager-Handler-9:null) Seq 1--1: MgmtId 233845174730253: Req:
> Routing to peer
>
> 2013-10-28 21:27:23,277 DEBUG [agent.manager.ClusteredAgentManagerImpl]

Re: Management Server won't connect after cluster shutdown and restart

2013-10-28 Thread Carlos Reategui
Update.  I cleared out the async_job table and also reset the system VMs it
thought were in Starting mode from my previous attempts by setting them to
Stopped.  I also re-set the XS pool master to be the one XenServer thinks it
is.
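For reference, the cleanup described above amounts to something like the
following against the cloud database. This is only a sketch -- the async_job
table is as named above, but the vm_instance table and its state/type columns
are from memory, so verify against your schema and take a backup first:

```sql
-- Take a backup first, e.g.: mysqldump cloud > cloud-backup.sql

-- Clear out stale async jobs left over from the failed restart attempts
DELETE FROM async_job;

-- Reset system VMs stuck in Starting back to Stopped
-- (vm_instance / state / type are assumed names; check your schema)
UPDATE vm_instance
   SET state = 'Stopped'
 WHERE state = 'Starting'
   AND type IN ('SecondaryStorageVm', 'ConsoleProxy');
```

Stop the management server before making changes like this, and restart it
afterwards so its in-memory state is rebuilt from the database.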

Now when I start the CloudStack management server, here are the logs leading
up to the first "Unable to reach the pool" exception:

2013-10-28 21:27:11,040 DEBUG [cloud.alert.ClusterAlertAdapter]
(Cluster-Notification-1:null) Management server node 172.30.45.2 is up,
send alert

2013-10-28 21:27:11,045 WARN  [cloud.cluster.ClusterManagerImpl]
(Cluster-Notification-1:null) Notifying management server join event took 9
ms

2013-10-28 21:27:23,236 DEBUG [cloud.server.StatsCollector]
(StatsCollector-2:null) HostStatsCollector is running...

2013-10-28 21:27:23,243 DEBUG [cloud.server.StatsCollector]
(StatsCollector-3:null) VmStatsCollector is running...

2013-10-28 21:27:23,247 DEBUG [cloud.server.StatsCollector]
(StatsCollector-1:null) StorageCollector is running...

2013-10-28 21:27:23,255 DEBUG [cloud.server.StatsCollector]
(StatsCollector-1:null) There is no secondary storage VM for secondary
storage host nfs://172.30.45.2/store/secondary

2013-10-28 21:27:23,273 DEBUG [agent.manager.ClusteredAgentAttache]
(StatsCollector-2:null) Seq 1-201916421: Forwarding null to 233845174730255

2013-10-28 21:27:23,274 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-9:null) Seq 1-201916421: Routing from 233845174730253

2013-10-28 21:27:23,275 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-9:null) Seq 1-201916421: Link is closed

2013-10-28 21:27:23,275 DEBUG [agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-9:null) Seq 1-201916421: MgmtId 233845174730253: Req:
Resource [Host:1] is unreachable: Host 1: Link is closed

2013-10-28 21:27:23,275 DEBUG [agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-9:null) Seq 1--1: MgmtId 233845174730253: Req:
Routing to peer

2013-10-28 21:27:23,277 DEBUG [agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-11:null) Seq 1--1: MgmtId 233845174730253: Req:
Cancel request received

2013-10-28 21:27:23,277 DEBUG [agent.manager.AgentAttache]
(AgentManager-Handler-11:null) Seq 1-201916421: Cancelling.

2013-10-28 21:27:23,277 DEBUG [agent.manager.AgentAttache]
(StatsCollector-2:null) Seq 1-201916421: Waiting some more time because
this is the current command

2013-10-28 21:27:23,277 DEBUG [agent.manager.AgentAttache]
(StatsCollector-2:null) Seq 1-201916421: Waiting some more time because
this is the current command

2013-10-28 21:27:23,277 INFO  [utils.exception.CSExceptionErrorCode]
(StatsCollector-2:null) Could not find exception:
com.cloud.exception.OperationTimedoutException in error code list for
exceptions

2013-10-28 21:27:23,277 WARN  [agent.manager.AgentAttache]
(StatsCollector-2:null) Seq 1-201916421: Timed out on null

2013-10-28 21:27:23,278 DEBUG [agent.manager.AgentAttache]
(StatsCollector-2:null) Seq 1-201916421: Cancelling.

2013-10-28 21:27:23,278 WARN  [agent.manager.AgentManagerImpl]
(StatsCollector-2:null) Operation timed out: Commands 201916421 to Host 1
timed out after 3600

2013-10-28 21:27:23,278 WARN  [cloud.resource.ResourceManagerImpl]
(StatsCollector-2:null) Unable to obtain host 1 statistics.

2013-10-28 21:27:23,278 WARN  [cloud.server.StatsCollector]
(StatsCollector-2:null) Received invalid host stats for host: 1

2013-10-28 21:27:23,281 DEBUG [agent.manager.ClusteredAgentAttache]
(StatsCollector-1:null) Seq 1-201916422: Forwarding null to 233845174730255

2013-10-28 21:27:23,283 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-12:null) Seq 1-201916422: Routing from 233845174730253

2013-10-28 21:27:23,283 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-12:null) Seq 1-201916422: Link is closed

2013-10-28 21:27:23,283 DEBUG [agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-12:null) Seq 1-201916422: MgmtId 233845174730253:
Req: Resource [Host:1] is unreachable: Host 1: Link is closed

2013-10-28 21:27:23,284 DEBUG [agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-12:null) Seq 1--1: MgmtId 233845174730253: Req:
Routing to peer

2013-10-28 21:27:23,286 DEBUG [agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-13:null) Seq 1--1: MgmtId 233845174730253: Req:
Cancel request received

2013-10-28 21:27:23,286 DEBUG [agent.manager.AgentAttache]
(AgentManager-Handler-13:null) Seq 1-201916422: Cancelling.

2013-10-28 21:27:23,286 DEBUG [agent.manager.AgentAttache]
(StatsCollector-1:null) Seq 1-201916422: Waiting some more time because
this is the current command

2013-10-28 21:27:23,286 DEBUG [agent.manager.AgentAttache]
(StatsCollector-1:null) Seq 1-201916422: Waiting some more time because
this is the current command

2013-10-28 21:27:23,286 INFO  [utils.exception.CSExceptionErrorCode]
(StatsCollector-1:null) Could not find exception:
com.cloud.exception.OperationTimedoutException in error code list