Excerpts from Maru Newby's message of 2013-12-03 19:37:19 -0800:
> 
> On Dec 4, 2013, at 11:57 AM, Clint Byrum <cl...@fewbar.com> wrote:
> 
> > Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800:
> >> I've been investigating a bug that is preventing VMs from receiving IP 
> >> addresses when a Neutron service is under high load:
> >> 
> >> https://bugs.launchpad.net/neutron/+bug/1192381
> >> 
> >> High load delays the DHCP agent's status updates, causing the Neutron 
> >> service to assume that the agent is down.  As a result, the Neutron 
> >> service does not send port-addition notifications to the DHCP agent; at 
> >> present, those notifications are simply dropped.  A simple fix is to 
> >> send notifications regardless of agent status.  Does anybody have any 
> >> objections to this stop-gap approach?  I'm not clear on the implications 
> >> of sending notifications to agents that are down, but I'm hoping for a 
> >> simple fix that can be backported to both Havana and Grizzly (yes, this 
> >> bug has been with us that long).
> >> 
> >> Fixing this problem for real, though, will likely be more involved.  The 
> >> proposal to replace the current WSGI framework with Pecan may increase 
> >> the Neutron service's scalability, but should we continue to use a 
> >> 'fire-and-forget' approach to notification?  Being able to track the 
> >> success or failure of a given action outside of the logs would seem 
> >> pretty important, and would allow for more effective coordination with 
> >> Nova than is currently possible.
> >> 
> > 
> > Dropping requests without triggering a user-visible error is a pretty
> > serious problem. You didn't mention whether you have filed a bug about
> > that. If not, please do, or let us know here so we can investigate and
> > file one.
> 
> There is a bug linked in the original message that I am already working 
> on.  The fact that the bug's title is 'dhcp agent doesn't configure ports' 
> rather than 'dhcp notifications are silently dropped' is incidental.
> 

Good point, I suppose that one bug is enough.
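
For concreteness, the stop-gap you proposed amounts to removing (or
bypassing) the liveness check before the cast. A rough sketch, with
hypothetical names throughout - this is the shape of the change, not
Neutron's actual code:

    class DhcpAgentNotifier(object):
        # Hypothetical notifier illustrating the proposed stop-gap.

        def __init__(self, rpc_client, plugin):
            self.rpc = rpc_client
            self.plugin = plugin

        def notify_port_created(self, context, port):
            agent = self.plugin.get_dhcp_agent_for_network(
                context, port['network_id'])
            # Before: notifications to agents considered 'down' were
            # silently dropped:
            #
            #     if agent is None or not agent.is_active:
            #         return
            #
            # Stop-gap: cast unconditionally. The message sits on the
            # broker until the agent consumes it, so a slow or
            # restarting agent still learns about the port.
            self.rpc.cast(context, 'port_create_end',
                          {'port': port}, host=agent.host)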

> > 
> > It seems to me that they should be put into a queue to be retried.
> > Sending the notifications blindly is almost as bad as dropping them,
> > as you have no idea whether the agent is alive.
> 
> This is more the kind of discussion I was looking for.  
> 
> In the current architecture, the Neutron service handles RPC and WSGI in a 
> single process and is prone to being overloaded, such that agent heartbeats 
> can be delayed beyond the threshold at which the agent is declared 'down'.  
> Even if we increased the agent timeout as Yongsheg suggests, there is no 
> guarantee that we can accurately detect whether an agent is 'live' with the 
> current architecture.  Given that AMQP can ensure eventual delivery - it is 
> a queue - is sending a notification blindly such a bad idea?  In the best 
> case, the agent isn't really down and can process the notification.  In the 
> worst case, the agent really is down, but it will eventually be brought 
> back up by a deployment's monitoring solution and will process the 
> notification when it returns.  What am I missing?
>

I have not looked closely into what delivery expectations are built into
the notification system, so I may have been off base. My understanding was
that notifications were not necessarily guaranteed to be delivered, but if
they are, then this approach is fine.
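
If it does come down to broker semantics, the property you're relying on
is just a durable AMQP queue: messages published while the consumer is
down sit on the broker until the consumer reconnects. A minimal
illustration using kombu (the connection URL, queue, and routing key are
made up):

    from kombu import Connection, Exchange, Queue

    exchange = Exchange('neutron', type='topic', durable=True)
    queue = Queue('dhcp_agent.host-1', exchange=exchange,
                  routing_key='dhcp_agent.host-1', durable=True)

    with Connection('amqp://guest:guest@localhost//') as conn:
        producer = conn.Producer()
        # Declaring the durable queue means the broker holds messages
        # for the agent even while the agent process is down.
        producer.publish(
            {'method': 'port_create_end',
             'args': {'port_id': 'PORT_ID'}},
            exchange=exchange,
            routing_key='dhcp_agent.host-1',
            declare=[queue],
            delivery_mode=2,  # persistent: survives a broker restart
        )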

> Please consider that while a good solution will track notification delivery 
> and success, we may need two solutions:
> 
> 1. A 'good-enough', minimally-invasive stop-gap that can be backported to 
> Grizzly and Havana.
>

I don't know why we'd backport to Grizzly. But yes, if we can get a
notable jump in reliability with a clean patch, I'm all for it.

> 2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent.
> 
> I'm hoping that coming up with a solution to #1 will give us the breathing 
> room to work on #2 in this cycle.
>

Understood. I like the short-term plan, and in the long term I think having
more CPU available to process messages is a good thing, most likely in the
form of more worker processes.
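
As a rough sketch of what more worker processes could look like: bind the
API socket once, then fork workers that all accept on it. This is the
generic pre-fork pattern (assuming a fork-capable platform), not Neutron's
actual server code:

    import multiprocessing
    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        # Stand-in for the real WSGI application.
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [b'ok\n']

    if __name__ == '__main__':
        # The socket is bound and listening once, in the parent.
        httpd = make_server('', 9696, app)
        # Forked children inherit the listening socket, so the kernel
        # spreads incoming connections across all of the workers.
        for _ in range(4):
            multiprocessing.Process(target=httpd.serve_forever).start()
        httpd.serve_forever()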
