Hi Ken,

Thanks a lot for the analysis, and sorry for the slow reply!
Comments inline...

Ken Giusti <kgiu...@gmail.com> wrote:
> Hi Adam,
>
> I think there's a couple of problems here.
>
> Regardless of worker count, the service.wait() is called before
> service.start(). And from looking at the oslo.service code, the 'wait()'
> method is called after start(), then again after stop(). This doesn't match
> up with the intended use of oslo.messaging.server.wait(), which should only
> be called after .stop().

Hmm, so are you saying that there might be a bug in oslo.service's
usage of oslo.messaging, and that this Sahara bugfix was the wrong
approach too?

    https://review.openstack.org/#/c/280741/1/sahara/cli/sahara_engine.py

> Perhaps a bigger issue is that in the multi threaded case all threads
> appear to be calling start, wait, and stop on the same instance of the
> service (oslo.messaging rpc server). At least that's what I'm seeing in my
> muchly reduced test code:
>
> https://paste.fedoraproject.org/paste/-73zskccaQvpSVwRJD11cA
>
> The log trace shows multiple calls to start, wait, stop via different
> threads to the same TaskServer instance:
>
> https://paste.fedoraproject.org/paste/dyPq~lr26sQZtMzHn5w~Vg
>
> Is that expected?

Unfortunately in the interim, your pastes seem to have vanished - any
chance you could repaste them?

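In case it is useful while the pastes are unavailable, here is a minimal
sketch of how the call order described above could be recorded and checked
against the start() -> stop() -> wait() sequence. RecordingServer and
lifecycle_violations are hypothetical names invented for this illustration;
this is a stand-in object and not the oslo.messaging or oslo.service API,
and it says nothing about what those libraries actually do internally.

```python
import threading


class RecordingServer:
    """Hypothetical stand-in for an RPC server; NOT the oslo.messaging API.

    It only records which thread called which lifecycle method, in order.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self.calls = []  # list of (thread name, method name)

    def _record(self, method):
        with self._lock:
            self.calls.append((threading.current_thread().name, method))

    def start(self):
        self._record("start")

    def stop(self):
        self._record("stop")

    def wait(self):
        self._record("wait")


def lifecycle_violations(calls):
    """Return descriptions of calls that break the start -> stop -> wait order."""
    violations = []
    state = "new"  # new -> started -> stopped
    for thread_name, method in calls:
        if method == "start":
            if state == "started":
                violations.append(f"{thread_name}: start() called twice")
            state = "started"
        elif method == "stop":
            if state != "started":
                violations.append(f"{thread_name}: stop() before start()")
            state = "stopped"
        elif method == "wait":
            if state != "stopped":
                violations.append(f"{thread_name}: wait() before stop()")
    return violations


if __name__ == "__main__":
    # Assumed ordering, modelled on the description above: wait() before
    # start(), again after start(), and once more after stop().
    server = RecordingServer()
    server.wait()
    server.start()
    server.wait()
    server.stop()
    server.wait()
    for problem in lifecycle_violations(server.calls):
        print(problem)
```

Sharing one RecordingServer between several threads would show in the
recorded thread names whether they are all driving the same instance.
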
Thanks,
Adam

> On Mon, Jul 31, 2017 at 9:32 PM, Adam Spiers <aspi...@suse.com> wrote:
> > Ken Giusti <kgiu...@gmail.com> wrote:
> >> On Mon, Jul 31, 2017 at 10:01 AM, Adam Spiers <aspi...@suse.com> wrote:
> >>> I recently discovered a bug where barbican-worker would hang on
> >>> shutdown if queue.asynchronous_workers was changed from 1 to 2:
> >>>
> >>>    https://bugs.launchpad.net/barbican/+bug/1705543
> >>>
> >>> resulting in a warning like this:
> >>>
> >>>    WARNING oslo_messaging.server [-] Possible hang: stop is waiting for
> >>>    start to complete
> >>>
> >>> I found a similar bug in Sahara:
> >>>
> >>>    https://bugs.launchpad.net/sahara/+bug/1546119
> >>>
> >>> where the fix was to call start() on the RPC service before making the
> >>> launcher wait() on it, so I ported the fix to Barbican, and it seems
> >>> to work fine:
> >>>
> >>>    https://review.openstack.org/#/c/485755
> >>>
> >>> I noticed that both projects use ProcessLauncher; barbican uses
> >>> oslo_service.service.launch() which has:
> >>>
> >>>    if workers is None or workers == 1:
> >>>        launcher = ServiceLauncher(conf, restart_method=restart_method)
> >>>    else:
> >>>        launcher = ProcessLauncher(conf, restart_method=restart_method)
> >>>
> >>> However, I'm not an expert in oslo.service or oslo.messaging, and one
> >>> of Barbican's core reviewers (thanks Kaitlin!) noted that not many
> >>> other projects start the task before calling wait() on the launcher,
> >>> so I thought I'd check here whether that is the correct fix, or
> >>> whether there's something else odd going on.
> >>>
> >>> Any oslo gurus able to shed light on this?
> >>>
> >>
> >> As far as an oslo.messaging server is concerned, the order of operations
> >> is:
> >>
> >> server.start()
> >> # do stuff until ready to stop the server...
> >> server.stop()
> >> server.wait()
> >>
> >> The final wait blocks until all requests that are in progress when stop()
> >> is called finish and cleanup.
> >
> > Thanks - that makes sense.
> > So the question is, why would
> > barbican-worker only hang on shutdown when there are multiple workers?
> > Maybe the real bug is somewhere in oslo_service.service.ProcessLauncher
> > and it's not calling start() correctly?

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev