Excerpts from Maru Newby's message of 2013-12-03 19:37:19 -0800:
>
> On Dec 4, 2013, at 11:57 AM, Clint Byrum <cl...@fewbar.com> wrote:
>
> > Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800:
> >> I've been investigating a bug that is preventing VM's from receiving IP
> >> addresses when a Neutron service is under high load:
> >>
> >> https://bugs.launchpad.net/neutron/+bug/1192381
> >>
> >> High load causes the DHCP agent's status updates to be delayed, causing
> >> the Neutron service to assume that the agent is down. This results in the
> >> Neutron service not sending notifications of port addition to the DHCP
> >> agent. At present, the notifications are simply dropped. A simple fix is
> >> to send notifications regardless of agent status. Does anybody have any
> >> objections to this stop-gap approach? I'm not clear on the implications
> >> of sending notifications to agents that are down, but I'm hoping for a
> >> simple fix that can be backported to both havana and grizzly (yes, this
> >> bug has been with us that long).
> >>
> >> Fixing this problem for real, though, will likely be more involved. The
> >> proposal to replace the current wsgi framework with Pecan may increase the
> >> Neutron service's scalability, but should we continue to use a 'fire and
> >> forget' approach to notification? Being able to track the success or
> >> failure of a given action outside of the logs would seem pretty important,
> >> and allow for more effective coordination with Nova than is currently
> >> possible.
> >>
> >
> > Dropping requests without triggering a user-visible error is a pretty
> > serious problem. You didn't mention if you have filed a bug about that.
> > If not, please do or let us know here so we can investigate and file
> > a bug.
>
> There is a bug linked to in the original message that I am already working
> on. The fact that that bug title is 'dhcp agent doesn't configure ports'
> rather than 'dhcp notifications are silently dropped' is incidental.
>
Good point, I suppose that one bug is enough.

> >
> > It seems to me that they should be put into a queue to be retried.
> > Sending the notifications blindly is almost as bad as dropping them,
> > as you have no idea if the agent is alive or not.
>
> This is more the kind of discussion I was looking for.
>
> In the current architecture, the Neutron service handles RPC and WSGI with a
> single process and is prone to being overloaded such that agent heartbeats
> can be delayed beyond the limit for the agent being declared 'down'. Even if
> we increased the agent timeout as Yongsheng suggests, there is no guarantee
> that we can accurately detect whether an agent is 'live' with the current
> architecture. Given that amqp can ensure eventual delivery - it is a queue -
> is sending a notification blind such a bad idea? In the best case the agent
> isn't really down and can process the notification. In the worst case, the
> agent really is down but will be brought up eventually by a deployment's
> monitoring solution and process the notification when it returns. What am I
> missing?
>

I have not looked closely into what expectations are built into the
notification system, so I may have been off base. My understanding was
that notifications were not necessarily guaranteed to be delivered, but
if they are, then this is fine.

> Please consider that while a good solution will track notification delivery
> and success, we may need 2 solutions:
>
> 1. A 'good-enough', minimally-invasive stop-gap that can be back-ported to
> grizzly and havana.
>

I don't know why we'd backport to grizzly. But yes, if we can get a
notable jump in reliability with a clear patch, I'm all for it.

> 2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent.
>
> I'm hoping that coming up with a solution to #1 will allow us the breathing
> room to work on #2 in this cycle.
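
To make the difference between the current behaviour and the stop-gap in #1
concrete, here is a minimal sketch. It is not Neutron's code: Agent,
is_agent_down, notify_port_created, the down_after limit, the "port_created"
event name and the in-memory inbox that stands in for the agent's AMQP queue
are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Agent:
    """Hypothetical stand-in for a DHCP agent record (not Neutron's model)."""
    host: str
    last_heartbeat: datetime
    # Assumption: a plain list plays the role of the agent's message queue.
    inbox: list = field(default_factory=list)


def is_agent_down(agent, now, down_after=timedelta(seconds=60)):
    """Declare the agent 'down' once its last heartbeat is older than the limit.

    The 60 second default is an arbitrary illustrative value.
    """
    return now - agent.last_heartbeat > down_after


def notify_port_created(agent, port_id, now, skip_down_agents):
    """Queue a port notification; return True if it was queued, False if dropped."""
    if skip_down_agents and is_agent_down(agent, now):
        # Behaviour described in the bug: the notification is silently dropped.
        return False
    # Stop-gap: queue the message regardless of the agent's reported status
    # and let the agent consume it whenever it is able to.
    agent.inbox.append(("port_created", port_id))
    return True


now = datetime(2013, 12, 4, 12, 0, 0)
# The heartbeat was delayed by load, so a live agent looks 'down'.
agent = Agent(host="dhcp-1", last_heartbeat=now - timedelta(seconds=120))

dropped = notify_port_created(agent, "port-a", now, skip_down_agents=True)
queued = notify_port_created(agent, "port-b", now, skip_down_agents=False)
```

With skip_down_agents=True the first notification never reaches the inbox;
with it set to False the second one is queued even though the agent looks
down. Whether a real deployment then delivers it depends on the delivery
guarantees of the notification system, which is the open question in this
thread.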
>

Understood. I like the short-term plan, and I think that in the long term
having more CPU available to process more messages is a good thing, most
likely in the form of more worker processes.

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev