New submission from Richard Purdie <richard.pur...@linuxfoundation.org>:

We're having some problems with multiprocessing.Queue where the parent process 
ends up hanging with zombie children. The code is part of bitbake, the task 
execution engine behind OpenEmbedded/Yocto Project.

I've cut down our code to the pieces in question in the attached file. It 
doesn't give a runnable test case unfortunately but does at least show what 
we're doing. Basically, we have a set of items to parse and we create a number
of multiprocessing.Process() instances to handle the parsing in parallel. Jobs
are queued on one queue and results are fed back to the parent via another.
There is also a quit queue which takes sentinels that cause the subprocesses to
quit.
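
In outline (the attached file has the real code; the names below are only an
approximation of it), the setup is something like:

    import multiprocessing
    import queue

    class Parser(multiprocessing.Process):
        def __init__(self, jobs, results, quit_queue):
            multiprocessing.Process.__init__(self)
            self.jobs = jobs
            self.results = results
            self.quit_queue = quit_queue

        def parse(self, job):
            # Stand-in for the real parsing work.
            return job

        def run(self):
            while True:
                # A sentinel on the quit queue tells this worker to stop.
                try:
                    self.quit_queue.get_nowait()
                    break
                except queue.Empty:
                    pass
                try:
                    job = self.jobs.get(timeout=0.25)
                except queue.Empty:
                    continue
                self.results.put(self.parse(job))

    # Parent side: one queue feeding jobs in, one feeding results back to the
    # parent, and a quit queue; the "pool" is just a list of Parser processes.
    jobs = multiprocessing.Queue()
    results = multiprocessing.Queue()
    quit_queue = multiprocessing.Queue()
    processes = [Parser(jobs, results, quit_queue) for _ in range(4)]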

If something fails to parse, shutdown() is called with clean=False and the quit
sentinels are sent. On receiving a sentinel, the Parser() process calls
results.cancel_join_thread() on the results queue. We do this since we no
longer care about the results; we just want to ensure everything exits cleanly.
This is where things go wrong: the Parser processes and their queues all turn
into zombies, and the parent process ends up stuck in
self.result_queue.get(timeout=0.25) inside shutdown().
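
In outline, the shutdown path is roughly the following (again simplified: in
the real code this is a method on the parent and the quit handling lives in
the Parser run loop):

    import queue

    # Worker side, on seeing the quit sentinel during an unclean shutdown:
    #     self.results.cancel_join_thread()   # pending results no longer matter
    #     return

    # Parent side, roughly what shutdown(clean=False) does:
    def shutdown(processes, quit_queue, result_queue):
        for _ in processes:
            quit_queue.put(None)               # one quit sentinel per Parser
        while True:
            try:
                # Drain any in-flight results; this is the get() that hangs.
                result_queue.get(timeout=0.25)
            except queue.Empty:
                break
        for process in processes:
            process.join()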

strace shows the parent has acquired the locks and is doing a read() on the
os.pipe() it created. Unfortunately, since the parent still has a write end
open to the same pipe, that read() hangs indefinitely.
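
The pipe behaviour itself is easy to demonstrate outside multiprocessing (this
is just an illustration, not our code): a read() only returns EOF once every
write end of the pipe has been closed, so a parent that keeps its own write fd
open can wait forever for data that will never arrive:

    import os

    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: exits immediately, closing its copies of both ends.
        os._exit(0)

    os.waitpid(pid, 0)
    # With the parent's write end still open, os.read(r, 1) here would block
    # forever even though the child has gone. Closing it first gives EOF:
    os.close(w)
    print(os.read(r, 1))   # b'' -- EOF, since no write end remains open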

If I change the code to do:

        self.result_queue._writer.close()
        while True:
            try:
                self.result_queue.get(timeout=0.25)
            except (queue.Empty, EOFError):
                break

i.e. close the write side of the pipe by poking at the queue internals, we no
longer see the hang. The .close() method isn't usable here since it would close
both sides.

We create our own process pool since this code dates from the Python 2.x days
and multiprocessing pools had issues back when we started using this. I'm sure
they would be much better now, but we're reluctant to change what has basically
been working. We drain the queues since in some cases we have clean shutdowns
where cancel_join_thread() hasn't been used, and we don't want those cases to
block.

My question is whether this is a known issue, and whether there is some kind of
API to close just the write side of the Queue to avoid problems like this.

----------
components: Library (Lib)
files: simplified.py
messages: 376350
nosy: rpurdie
priority: normal
severity: normal
status: open
title: multiprocessing.Queue deadlock
type: crash
versions: Python 3.6
Added file: https://bugs.python.org/file49444/simplified.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue41714>
_______________________________________