Excerpts from Kyle Mestery's message of 2013-12-12 09:53:57 -0800:
> On Dec 12, 2013, at 11:44 AM, Jay Pipes <jaypi...@gmail.com> wrote:
> > On 12/12/2013 12:36 PM, Clint Byrum wrote:
> >> Excerpts from Russell Bryant's message of 2013-12-12 09:09:04 -0800:
> >>> On 12/12/2013 12:02 PM, Clint Byrum wrote:
> >>>> I've been chasing quite a few bugs in the TripleO automated bring-up
> >>>> lately that have to do with failures because either there are no valid
> >>>> hosts ready to have servers scheduled, or there are hosts listed and
> >>>> enabled, but they can't bind to the network because for whatever reason
> >>>> the L2 agent has not checked in with Neutron yet.
> >>>>
> >>>> This is only a problem in the first few minutes of a nova-compute host's
> >>>> life. But it is critical for scaling up rapidly, so it is important for
> >>>> me to understand how this is supposed to work.
> >>>>
> >>>> So I'm asking, is there a standard way to determine whether or not a
> >>>> nova-compute is definitely ready to have things scheduled on it? This
> >>>> can be via an API, or even by observing something on the nova-compute
> >>>> host itself. I just need a definitive signal that "the compute host is
> >>>> ready".
> >>>
> >>> If a nova compute host has registered itself to start having instances
> >>> scheduled to it, it *should* be ready. AFAIK, we're not doing any
> >>> network sanity checks on startup, though.
> >>>
> >>> We already do some sanity checks on startup. For example, nova-compute
> >>> requires that it can talk to nova-conductor. nova-compute will block on
> >>> startup until nova-conductor is responding if they happened to be
> >>> brought up at the same time.
> >>>
> >>> We could do something like this with a networking sanity check if
> >>> someone could define what that check should look like.
> >>>
> >> Could we ask Neutron if our compute host has an L2 agent yet? That seems
> >> like a valid sanity check.
> >
> > ++
> >
> This makes sense to me as well. Although, not all Neutron plugins have
> an L2 agent, so I think the check needs to be more generic than that.
> For example, the OpenDaylight MechanismDriver we have developed
> doesn't need an agent. I also believe the Nicira plugin is agent-less,
> perhaps there are others as well.
>
> And I should note, does this sort of integration also happen with cinder,
> for example, when we're dealing with storage? Any other services which
> have a requirement on startup around integration with nova as well?
>

Does cinder actually have per-compute-host concerns? I admit to being a
bit cinder-stupid here.

Anyway, it seems to me that any service that is compute-host aware
should be able to tell the compute host whether or not it is a) aware
of it, and b) ready to serve on it. For agent-less drivers that is easy:
you always return True. For drivers with agents, you return False
unless you can find an agent for the host. So something like:

    GET /host/%(compute-host-name)

And then in the response include a "ready" attribute that would signal
whether all networks that should work there can work there.

As a first pass, just polling until that is "ready" before nova-compute
enables itself would solve the problems I see (and that I think users
would see as a cloud provider scales out compute nodes). Longer term we
would also want to aim at having notifications available for this, so
that nova-compute could subscribe to that notification bus and then
disable itself if its agent ever goes away.

I opened this bug to track the issue. I suspect there are duplicates of
it already reported, but I would like to start clean to make sure it is
analyzed fully, and then we can use those other bugs as test cases and
confirmation:

https://bugs.launchpad.net/nova/+bug/1260440
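
To make the first-pass polling idea concrete, here is a rough sketch in
Python. Nothing in it is an existing Nova or Neutron API: host_ready,
wait_until_ready, and the fetch_status callable (which stands in for the
proposed GET on the host resource) are hypothetical names, and the
timeout and interval defaults are arbitrary placeholders.

```python
import time
from typing import Callable, Dict, Iterable


def host_ready(host: str, agentless: bool, live_agent_hosts: Iterable[str]) -> bool:
    """Hypothetical driver-side check behind the proposed "ready" attribute.

    Agent-less drivers always report True; agent-based drivers report True
    only if a live agent is known for the given compute host.
    """
    if agentless:
        return True
    return host in set(live_agent_hosts)


def wait_until_ready(
    fetch_status: Callable[[], Dict[str, object]],
    timeout: float = 300.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll fetch_status() until it reports ready=True or the timeout expires.

    fetch_status is assumed to return the decoded response body of the
    proposed host query, e.g. {"ready": False}.
    """
    deadline = clock() + timeout
    while True:
        if fetch_status().get("ready") is True:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

nova-compute would call something like wait_until_ready() before
enabling itself, and stay disabled if it returns False. The sleep and
clock parameters are injectable only so the loop can be exercised
without real waiting.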