Coincidentally, we observed somewhat similar behavior with ACS 4.5 and
KVM agents (I assume Xen will be no different). Based on a code check,
the issue also exists in master, and I'd expect 4.2 to be no different.

Marcus can speak about this issue more intelligently than I can, but
here is my understanding of it and of his explanation:

Summary:
CloudStack does not handle agent connections with the SSL handshake
properly: it processes each connection serially, blocking the next agent
in line until the handshake goes through - but what if it never does?


Details and Example:
If you open a telnet session to port 8250 on the MS, the MS expects an
SSL handshake to go through; the fake telnet session, however, does
nothing other than take up a socket. The current method for agent
connections is serial, which means the next proper agent in line cannot
process its tasks and is blocked - and eventually gets disconnected. As
a result, many agents disconnect; once the telnet session is dropped
after 60 seconds, they get a chance to reconnect. However, if improper
connections keep arriving on 8250, you have a continuous denial of
service. The improper SSL handshakes can also be sporadic, causing
sporadic disconnection issues.
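
To make the failure mode concrete, here is a minimal sketch of why a
serial accept-then-handshake loop stalls on a peer that never speaks
SSL. This uses plain blocking sockets, not CloudStack's actual
NIO-based code (utils.nio.*), and handleAgent() is a made-up
placeholder - it only illustrates the effect described above:

====
import javax.net.ssl.SSLServerSocket;
import javax.net.ssl.SSLServerSocketFactory;
import javax.net.ssl.SSLSocket;

public class SerialHandshakeServer {
    public static void main(String[] args) throws Exception {
        SSLServerSocketFactory factory =
                (SSLServerSocketFactory) SSLServerSocketFactory.getDefault();
        SSLServerSocket server =
                (SSLServerSocket) factory.createServerSocket(8250);
        while (true) {
            SSLSocket client = (SSLSocket) server.accept();
            // A plain telnet peer never answers the handshake, so this call
            // stalls here and every agent queued behind it is blocked.
            client.startHandshake();
            handleAgent(client);   // the next proper agent never gets here
        }
    }

    private static void handleAgent(SSLSocket s) { /* placeholder */ }
}
====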

With that said, we are testing an internal fix that allows each
connection and its subsequent tasks to be treated as a separate thread,
by implementing a Callable. If an improper connection comes through, it
lives in its own thread and is dropped once it reaches the timeout,
without affecting the other agents' connections.
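
A minimal sketch of that approach, assuming the general idea described
above rather than the actual patch (onAccept(), handleAgent() and
HANDSHAKE_TIMEOUT_SECS are illustrative names and values only):

====
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.net.ssl.SSLSocket;

public class PerConnectionHandshake {
    private static final int HANDSHAKE_TIMEOUT_SECS = 60;   // assumed value
    private static final ExecutorService pool = Executors.newCachedThreadPool();

    // Called for every accepted socket; each handshake runs in its own task.
    static void onAccept(final SSLSocket client) {
        pool.submit(new Callable<Void>() {
            @Override
            public Void call() throws Exception {
                try {
                    client.setSoTimeout(HANDSHAKE_TIMEOUT_SECS * 1000);
                    client.startHandshake();  // only this task stalls on a bad peer
                    handleAgent(client);
                } catch (Exception e) {
                    client.close();           // drop the rogue connection
                }
                return null;
            }
        });
    }

    private static void handleAgent(SSLSocket s) { /* placeholder */ }
}
====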


Once we confirm that it works as expected, we will release a patch.

In the meantime, if you need to bring stability back to your
environment, try to find the offending connection. It could be one of
the agents going rogue, or some other process establishing a connection
on 8250 and never completing the SSL handshake - for example, a security
scan on the network that pokes at every port it finds.

Try restarting all CloudStack agents in your environment and make sure
the incoming connections to the CloudStack MS on 8250 are valid agent
connections.
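
If you want a quick sanity check from an agent host, something like the
following can help. This is a hypothetical standalone check, not a
CloudStack tool; replace 10.x.x.x with your MS address, and note that an
untrusted certificate will also make the handshake fail unless the
agent's keystore/truststore is configured for the JVM:

====
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class HandshakeCheck {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "10.x.x.x";   // MS address
        SSLSocket s = (SSLSocket) SSLSocketFactory.getDefault()
                .createSocket(host, 8250);
        s.setSoTimeout(10000);        // don't hang forever on a broken peer
        s.startHandshake();           // throws if the handshake cannot complete
        System.out.println("Handshake OK: " + s.getSession().getCipherSuite());
        s.close();
    }
}
====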

Putting an LB in front of the CloudStack MS will make diagnosing this
issue much harder if you want to find a rogue connection. But long term,
you definitely want an LB in front of the MS.

Another interesting observation: after we implemented the change
mentioned above and restarted the MS servers, the CloudStack agents
reconnected much more quickly - within a matter of seconds vs. several
minutes.

The fix needs more testing and baking before it's released to the public.

Regards
ilya





On 4/5/16 9:30 PM, Indra Pramana wrote:
> Hi Sanjeev and Rafael,
> 
> Good day to you, and thank you for your replies and advice.
> 
> We are getting a new management server and HA proxy load balancers. Will
> see if this can resolve the problem.
> 
> Thank you.
> 
> 
> 
> On Tue, Apr 5, 2016 at 8:24 PM, Rafael Weingärtner <
> rafaelweingart...@gmail.com> wrote:
> 
>> How many hosts (hypervisors) are you managing with a single MS?
>>
>> If you add new MSs, you need to balance their (HTTP 8080 and TCP 8250)
>> access with something like the HA proxy load balancer.
>>
>>
>>
>> On Tue, Apr 5, 2016 at 2:09 AM, Sanjeev Neelarapu <
>> sanjeev.neelar...@accelerite.com> wrote:
>>
>>> Adding additional management server would definitely help.
>>>
>>> Best Regards,
>>> Sanjeev N
>>> Chief Product Engineer, Accelerite
>>> Off: +91 40 6722 9368 | EMail: sanjeev.neelar...@accelerite.com
>>>
>>>
>>> -----Original Message-----
>>> From: Indra Pramana [mailto:in...@sg.or.id]
>>> Sent: Sunday, April 03, 2016 5:14 PM
>>> To: users@cloudstack.apache.org
>>> Subject: Re: URGENT - CloudStack agent not able to connect to management
>>> server
>>>
>>> Hi Lucian,
>>>
>>> Good day to you, and thank you for your reply. Apologies for the delay in
>>> my reply.
>>>
>>> Yes, I can confirm that we can access the host and port specified. Based
>>> on the logs, the host can connect to the management server but there's no
>>> follow-up logs which usually come after it's connected. Eventually, we
>>> could only connect back the host after we rebooted it, which means
>>> sacrificing all the VMs which were still up and running during the
>>> disconnection.
>>>
>>> At the time when the first hypervisor was disconnected, the CloudStack
>>> management servers were very busy handling the disconnections, trying to
>>> fence the hosts and initiate HA for all the affected VMs, based on the
>>> logs. Could this have put a strain on the management server, causing it
>> to
>>> disconnect all the remaining hosts? Will adding new management server be
>>> able to resolve the problem?
>>>
>>> Any advice is appreciated.
>>>
>>> Looking forward to your reply, thank you.
>>>
>>> Cheers.
>>>
>>> On Thu, Mar 31, 2016 at 5:28 PM, Nux! <n...@li.nux.ro> wrote:
>>>
>>>> Hello,
>>>>
>>>> Are you sure you can connect from the hypervisors to the
>>>> cloudstack-management on the host and port specified in the
>>>> agent.properties?
>>>>
>>>> --
>>>> Sent from the Delta quadrant using Borg technology!
>>>>
>>>> Nux!
>>>> www.nux.ro
>>>>
>>>> ----- Original Message -----
>>>>> From: "Indra Pramana" <in...@sg.or.id>
>>>>> To: users@cloudstack.apache.org
>>>>> Sent: Thursday, 31 March, 2016 03:14:59
>>>>> Subject: URGENT - CloudStack agent not able to connect to management
>>>> server
>>>>
>>>>> Dear all,
>>>>>
>>>>> We are using CloudStack 4.2.0, KVM hypervisor and Ceph RBD storage.
>>>>> All
>>>> our
>>>>> agents got disconnected from the management server and unable to
>>>>> connect again, despite rebooting the management server and stopping
>>>>> and
>>>> restarting
>>>>> the cloudstack-agent many times.
>>>>>
>>>>> We even tried to physically reboot a hypervisor host (sacrificing
>>>>> all the running VMs inside) to see if it can reconnect after
>>>>> boot-up, and it's
>>>> not
>>>>> able to reconnect (keep on "Connecting" state). Here's the excerpts
>>>>> from the logs:
>>>>>
>>>>> ====
>>>>> 2016-03-31 10:07:49,346 DEBUG [cloud.agent.Agent] (UgentTask-5:null)
>>>>> Sending ping: Seq 0-11:  { Cmd , MgmtId: -1, via: 0, Ver: v1, Flags:
>>>>> 11,
>>>>>
>>>> [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupState
>>>> s":{},"newStates":{},"_gatewayAccessible":true,"_vnetAccessible":true,
>>>> "hostType":"Routing","hostId":0,"wait":0}}]
>>>>> }
>>>>> 2016-03-31 10:07:49,395 DEBUG [cloud.agent.Agent]
>>>>> (Agent-Handler-2:null) Received response: Seq 0-11:  { Ans: ,
>>>>> MgmtId: 161342671900, via: 75,
>>>> Ver:
>>>>> v1, Flags: 100010,
>>>>>
>>>> [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","
>>>> hostId":0,"wait":0},"result":true,"wait":0}}]
>>>>> }
>>>>> 2016-03-31 10:08:49,271 DEBUG
>>>>> [kvm.resource.LibvirtComputingResource]
>>>>> (UgentTask-5:null) Executing:
>>>>> /usr/share/cloudstack-common/scripts/vm/network/security_group.py
>>>>> get_rule_logs_for_vms
>>>>> 2016-03-31 10:08:49,350 DEBUG
>>>>> [kvm.resource.LibvirtComputingResource]
>>>>> (UgentTask-5:null) Execution is successful.
>>>>> 2016-03-31 10:08:49,353 DEBUG [cloud.agent.Agent] (UgentTask-5:null)
>>>>> Sending ping: Seq 0-12:  { Cmd , MgmtId: -1, via: 0, Ver: v1, Flags:
>>>>> 11,
>>>>>
>>>> [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupState
>>>> s":{},"newStates":{},"_gatewayAccessible":true,"_vnetAccessible":true,
>>>> "hostType":"Routing","hostId":0,"wait":0}}]
>>>>> }
>>>>> 2016-03-31 10:08:49,406 DEBUG [cloud.agent.Agent]
>>>>> (Agent-Handler-3:null) Received response: Seq 0-12:  { Ans: ,
>>>>> MgmtId: 161342671900, via: 75,
>>>> Ver:
>>>>> v1, Flags: 100010,
>>>>>
>>>> [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","
>>>> hostId":0,"wait":0},"result":true,"wait":0}}]
>>>>> }
>>>>> 2016-03-31 10:09:49,272 DEBUG
>>>>> [kvm.resource.LibvirtComputingResource]
>>>>> (UgentTask-5:null) Executing:
>>>>> /usr/share/cloudstack-common/scripts/vm/network/security_group.py
>>>>> get_rule_logs_for_vms
>>>>> 2016-03-31 10:09:49,345 DEBUG
>>>>> [kvm.resource.LibvirtComputingResource]
>>>>> (UgentTask-5:null) Execution is successful.
>>>>> 2016-03-31 10:09:49,347 DEBUG [cloud.agent.Agent] (UgentTask-5:null)
>>>>> Sending ping: Seq 0-13:  { Cmd , MgmtId: -1, via: 0, Ver: v1, Flags:
>>>>> 11,
>>>>>
>>>> [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupState
>>>> s":{},"newStates":{},"_gatewayAccessible":true,"_vnetAccessible":true,
>>>> "hostType":"Routing","hostId":0,"wait":0}}]
>>>>> }
>>>>> 2016-03-31 10:09:49,398 DEBUG [cloud.agent.Agent]
>>>>> (Agent-Handler-4:null) Received response: Seq 0-13:  { Ans: ,
>>>>> MgmtId: 161342671900, via: 75,
>>>> Ver:
>>>>> v1, Flags: 100010,
>>>>>
>>>> [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","
>>>> hostId":0,"wait":0},"result":true,"wait":0}}]
>>>>> }
>>>>> ====
>>>>>
>>>>> On the existing hypervisor hosts, the agent would normally get stuck at
>>>>> this stage, and from the CloudStack GUI we don't see the agent in
>>>>> "Connecting" state; it will be in either "Disconnected" or "Alert" state.
>>>>>
>>>>> ====
>>>>> 2016-03-31 07:37:09,819 DEBUG [utils.script.Script] (main:null)
>>>> Executing:
>>>>> /bin/bash -c uname -r
>>>>> 2016-03-31 07:37:09,829 DEBUG [utils.script.Script] (main:null)
>>>>> Execution is successful.
>>>>> 2016-03-31 07:37:09,832 DEBUG [cloud.agent.Agent] (main:null) Adding
>>>>> shutdown hook
>>>>> 2016-03-31 07:37:09,833 INFO  [cloud.agent.Agent] (main:null) Agent
>>>>> [id =
>>>>> 73 : type = LibvirtComputingResource : zone = 6 : pod = 6 : workers =
>>> 5 :
>>>>> host = 10.x.x.x : port = 8250
>>>>> 2016-03-31 07:37:09,856 INFO  [utils.nio.NioClient]
>>>>> (Agent-Selector:null) Connecting to 10.x.x.x:8250
>>>>> 2016-03-31 07:37:10,178 INFO  [utils.nio.NioClient]
>>>>> (Agent-Selector:null)
>>>>> SSL: Handshake done
>>>>> 2016-03-31 07:37:10,179 INFO  [utils.nio.NioClient]
>>>>> (Agent-Selector:null) Connected to 10.x.x.x:8250 ====
>>>>>
>>>>> No other significant and useful logs found on both the agents and
>>>>> management server logs.
>>>>>
>>>>> Can anyone give a clue on what could be the problem? We have been
>>>>> trying to reconnect for the past couple of hours without any success.
>>>>> Any help is greatly appreciated.
>>>>>
>>>>> Looking forward to your reply, thank you.
>>>>>
>>>>> Cheers.
>>>>>
>>>>> -ip-
>>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Rafael Weingärtner
>>
> 
