All patches to oslo.messaging are currently failing the 
gate-tempest-dsvm-neutron-src-oslo.messaging job because the neutron service 
dies. amuller, kevinbenton, and I spent a bunch of time looking at it today, 
and I think we have an issue introduced by some asymmetric gating between the 
two projects.

Neutron has 2 different modes for starting the RPC service, depending on the 
number of workers requested. The problem comes up with rpc_workers=0, which is 
the new default. In that mode, neutron starts the RPC server directly in the
current process rather than using the ProcessLauncher. That results in wait()
being called in a way that violates the new constraints enforced within
oslo.messaging since [1] landed. That patch is unreleased, so the only project
seeing the problem is oslo.messaging. I’ve proposed a revert in [2], which 
passes the gate tests.
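
For anyone who hasn’t dug into that part of neutron, here is a rough sketch of
the two calling patterns as I understand them. This is not neutron’s actual
code; the topic name, the fake transport, and the function names are
placeholders just to keep the snippet self-contained.

    # Rough sketch only -- not neutron's actual startup code.
    import threading

    from oslo_config import cfg
    import oslo_messaging


    def build_server():
        # 'fake://' keeps the sketch self-contained; neutron of course
        # uses the real configured transport.
        transport = oslo_messaging.get_transport(cfg.CONF, url='fake://')
        target = oslo_messaging.Target(topic='demo-topic', server='demo-host')
        return oslo_messaging.get_rpc_server(transport, target, endpoints=[])


    def launcher_mode(server):
        # rpc_workers > 0: ProcessLauncher drives the service, so start(),
        # stop(), and wait() stay together in the worker that owns the
        # server. That mode is fine.
        server.start()
        server.stop()
        server.wait()


    def in_process_mode(server):
        # rpc_workers=0: the server is started directly in the current
        # process, and (as far as I can tell) wait() ends up running in a
        # different thread than start(), which is exactly what the new
        # check added in [1] rejects.
        threading.Thread(target=server.start).start()
        server.wait()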

I have also added [3] to neutron to see if we can get the gate job to show the
same error messages I was seeing locally (part of the trouble we’ve had with
debugging this is that the process exits quickly enough that some of the log
messages are never written). I’m using [4], an oslo.messaging patch that was
failing before, to trigger the job and collect the necessary logs. That patch
should *not* be landed, since I don’t think the change it reverts is related to
the problem; it was just handy for debugging.

The error message I see locally, “start/stop/wait must be called in the same 
thread”, is visible in this log snippet [5].

It’s not clear what the best path forward is. Obviously neutron is doing 
something with the RPC server that oslo.messaging doesn’t expect/want/like, but 
also obviously we can’t release oslo.messaging in its current state and break 
neutron. Someone with a better understanding of both neutron and oslo.messaging 
may be able to fix neutron’s use of the RPC code to avoid this case. There may 
be other users of oslo.messaging with the same ‘broken’ pattern, but IIRC 
neutron is unique in the way it runs both RPC and API services in the same 
process. To be safe, though, it may be better to log error messages instead of 
doing whatever we’re doing now to cause the process to exit. We can then set up
a logstash search for the error message, find other applications that would be
broken, fix them, and then switch oslo.messaging back to raising an exception.
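
Something along these lines is what I have in mind. It is purely illustrative,
not a patch against the actual oslo.messaging server code, and the helper name
is made up:

    # Purely illustrative -- not the real oslo.messaging internals. The
    # idea is to downgrade the hard failure to a logged error so we can
    # find other affected projects via logstash before enforcing the rule.
    import logging
    import threading

    LOG = logging.getLogger(__name__)


    def check_same_thread(owner_thread_id, action):
        """Complain, but don't kill the process, on cross-thread calls."""
        if owner_thread_id != threading.current_thread().ident:
            LOG.error('start/stop/wait must be called in the same thread '
                      '(detected during %s)', action)
            # Once the logstash search comes back clean, flip this back
            # to raising an exception.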

I’m going to be at the Ops summit next week, so I need to hand off debugging 
and fixing the issue to someone else on the Oslo team. We created an etherpad 
to track progress and make notes today, and all of these links are referenced 
there, too [6].

Thanks again to amuller and kevinbenton for the time they spent helping with 
debugging today!

Doug

[1] https://review.openstack.org/#/c/209043/
[2] https://review.openstack.org/#/c/213299/
[3] https://review.openstack.org/#/c/213360/
[4] https://review.openstack.org/#/c/213297/
[5] http://paste.openstack.org/show/415030/
[6] https://etherpad.openstack.org/p/wm2D6UGZbf

