William Fisher <william.w.fis...@gmail.com> added the comment:

asyncio.MultiLoopChildWatcher has two problems that create a race condition.

1. The SIGCHLD signal handler does not guard against interruption/re-entry.
2. The SIGCHLD signal handler can interrupt add_child_handler's `self._do_waitpid(pid)` call.


Symptoms:

Log messages that look like this:

1634935451.761 WARNING Unknown child process pid 8747, will report returncode 255
...
1634935451.762 WARNING Child watcher got an unexpected pid: 8747
Traceback (most recent call last):
  File "/Users/runner/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/asyncio/unix_events.py", line 1306, in _do_waitpid
    loop, callback, args = self._callbacks.pop(pid)
KeyError: 8747


Background:

I've been working on a library to make calling asyncio subprocesses more 
convenient. As part of my testing, I've been stress testing asyncio using 
different child watcher policies. My CI build runs more than 200 tests each on 
macOS, Linux and FreeBSD. I've noticed a small percentage of sporadic failures 
using MultiLoopChildWatcher.

My understanding of Python signal functions is that:

1. Upon receipt of a signal, the native "C" signal handler sets a flag indicating that the signal arrived.
2. The main thread checks the signal flags between bytecode instructions. If a flag is set, Python saves the current call stack, runs the signal handler immediately on the main thread, then pops the stack when it returns (assuming the handler raises no exception).
3. If another signal flag is set while a signal handler is already running on the main thread, that handler can itself be interrupted.
4. Stacked signal handlers run in LIFO order; the last one to enter runs to completion without interruption (a small demonstration follows below).
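
Points 3 and 4 are easy to demonstrate. Here is a minimal, self-contained sketch (not from my test suite; it assumes a Unix platform and Python 3.8+ for raise_signal) showing one Python-level handler interrupting another:

import signal

def usr1_handler(signum, frame):
    print("enter SIGUSR1 handler")
    # Deliver a second signal while this handler is still running.
    # Its handler runs to completion before we resume (LIFO).
    signal.raise_signal(signal.SIGUSR2)
    print("exit SIGUSR1 handler")

def usr2_handler(signum, frame):
    print("enter/exit SIGUSR2 handler")

signal.signal(signal.SIGUSR1, usr1_handler)
signal.signal(signal.SIGUSR2, usr2_handler)
signal.raise_signal(signal.SIGUSR1)
# Output:
#   enter SIGUSR1 handler
#   enter/exit SIGUSR2 handler
#   exit SIGUSR1 handler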


Explanation:

I wrapped MultiLoopChildWatcher's _sig_chld method in a decorator that logs when it is entered and exited. The log clearly shows that _sig_chld is being re-entered.

1634935451.743 DEBUG process '/usr/bin/tr' created: pid 8747
...
1634935451.746 DEBUG process '/bin/cat' created: pid 8748
...
1634935451.761 DEBUG enter '_sig_chld' 20
1634935451.761 DEBUG enter '_sig_chld' 20
1634935451.761 WARNING Unknown child process pid 8747, will report returncode 255
1634935451.762 DEBUG process 8748 exited with returncode 0
1634935451.762 DEBUG exit '_sig_chld' 20
1634935451.762 WARNING Child watcher got an unexpected pid: 8747
Traceback (most recent call last):
  File "/Users/runner/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/asyncio/unix_events.py", line 1306, in _do_waitpid
    loop, callback, args = self._callbacks.pop(pid)
KeyError: 8747
1634935451.763 WARNING Unknown child process pid 8748, will report returncode 255
1634935451.763 WARNING Child watcher got an unexpected pid: 8748
Traceback (most recent call last):
  File "/Users/runner/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/asyncio/unix_events.py", line 1306, in _do_waitpid
    loop, callback, args = self._callbacks.pop(pid)
KeyError: 8748
1634935451.763 DEBUG exit '_sig_chld' 20


Here is the breakdown of what happens:

1. Pid 8747 exits and we enter _sig_chld #1.
2. _sig_chld #1 calls os.waitpid, which gives (pid, status) = (8747, 0).
3. Before _sig_chld #1 has a chance to call `self._callbacks.pop(pid)`, it is interrupted.
4. _sig_chld #2 calls os.waitpid for pid 8747. We get a ChildProcessError, and then "Unknown child process pid 8747, will report returncode 255".
5. _sig_chld #2 invokes the callback for pid 8747, reporting returncode=255.
6. _sig_chld #2 runs to completion. It reaps pid 8748 normally.
7. _sig_chld #1 picks up where it left off. We get a KeyError when we try to pop the callback for 8747.
8. _sig_chld #1 calls os.waitpid for pid 8748. This produces more failure messages, because pid 8748 was already reaped by _sig_chld #2.

The interruption issue can also occur when running a single process. If _sig_chld interrupts the call to `self._do_waitpid(pid)` in add_child_handler, a similar interleaving can occur.
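
To make the racy window concrete, here is a simplified paraphrase (not the verbatim CPython source; error handling and logging trimmed) of MultiLoopChildWatcher._do_waitpid with the interruption point marked:

import os

def _do_waitpid(self, expected_pid):
    try:
        pid, status = os.waitpid(expected_pid, os.WNOHANG)
    except ChildProcessError:
        # Child already reaped elsewhere: report returncode 255.
        pid, returncode = expected_pid, 255
    else:
        if pid == 0:
            return  # Child is still alive.
        returncode = os.waitstatus_to_exitcode(status)
    # <-- A SIGCHLD delivered here re-enters _sig_chld; the nested call
    #     reaps the child and pops its callback first (LIFO), so ...
    loop, callback, args = self._callbacks.pop(pid)  # ... KeyError here.
    loop.call_soon_threadsafe(callback, pid, returncode, *args)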


Work-Around:

In my tests, I patched MultiLoopChildWatcher and so far, it appears to be more 
reliable. In add_child_handler, I call raise_signal(SIGCHLD) so that all the 
work is done in the signal handler.


import asyncio
import signal


class PatchedMultiLoopChildWatcher(asyncio.MultiLoopChildWatcher):
    "Test race condition fixes in MultiLoopChildWatcher."

    def add_child_handler(self, pid, callback, *args):
        loop = asyncio.get_running_loop()
        self._callbacks[pid] = (loop, callback, args)

        # Instead of calling self._do_waitpid(pid) here, where the signal
        # handler could interrupt it, re-raise SIGCHLD so that all of the
        # waitpid work happens inside the serialized signal handler. This
        # also prevents a race condition in case the signal was delivered
        # before the callback was added.
        signal.raise_signal(signal.SIGCHLD)

    @_serialize  # _serialize is defined below.
    def _sig_chld(self, signum, frame):
        super()._sig_chld(signum, frame)


_serialize is a decorator that looks like this:


import functools
import threading


def _serialize(func):
    """Decorator to serialize a non-reentrant signal function.

    If one client is already in the critical section, set a flag to run
    the section one more time. Testing purposes only.
    """

    lock = threading.Lock()  # Used as an atomic test-and-set.
    retry = False

    @functools.wraps(func)
    def _decorator(*args, **kwargs):
        nonlocal retry

        while True:
            if lock.acquire(blocking=False):  # pylint: disable=consider-using-with
                try:
                    retry = False
                    func(*args, **kwargs)
                finally:
                    lock.release()
                if retry:
                    # A nested invocation was skipped while we held the
                    # lock; run the critical section one more time.
                    continue
            else:
                # A signal handler that interrupts an existing handler will
                # run to completion (LIFO); ask the outer invocation to
                # retry instead.
                retry = True
            break

    return _decorator
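
For completeness, a minimal usage sketch (my assumption of how one would install the patched watcher; the "true" command and the loop count are arbitrary):

import asyncio


async def main():
    # Spawn several short-lived children to exercise the watcher.
    procs = [await asyncio.create_subprocess_exec("true") for _ in range(10)]
    for proc in procs:
        await proc.wait()


if __name__ == "__main__":
    # The Unix event loop policy attaches the watcher to the loop
    # created by asyncio.run().
    asyncio.set_child_watcher(PatchedMultiLoopChildWatcher())
    asyncio.run(main())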

----------
nosy: +byllyfish

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue38323>
_______________________________________