New submission from Jim Crist-Harif <jcristha...@gmail.com>:

https://bugs.python.org/issue7946 describes an issue with how the current GIL 
interacts with mixed IO-bound and CPU-bound work. Quoting that issue:

> when an I/O bound thread executes an I/O call,
> it always releases the GIL.  Since the GIL is released, a CPU bound
> thread is now free to acquire the GIL and run.  However, if the I/O
> call completes immediately (which is common), the I/O bound thread
> immediately stalls upon return from the system call.  To get the GIL
> back, it now has to go through the timeout process to force the
> CPU-bound thread to release the GIL again.

This issue can come up in any application where IO and CPU bound work are mixed 
(we've found it to be a cause of performance issues in https://dask.org for 
example). Fixing the general problem is tricky and likely requires changes to 
the GIL's internals, but in the specific case of mixing asyncio running in one 
thread and CPU work happening in background threads, there may be a simpler fix 
- don't release the GIL if we don't have to.

Asyncio relies on nonblocking socket operations, which by definition shouldn't 
block. As such, releasing the GIL shouldn't be needed for many operations 
(`send`, `recv`, ...) on `socket.socket` objects provided they're in 
nonblocking mode (as suggested in https://bugs.python.org/issue7946#msg99477). 
Likewise, dropping the GIL can be avoided when calling `select` on 
`selectors.BaseSelector` objects with a timeout of 0 (making it a non-blocking 
call).
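
For illustration, here is a rough Python-level sketch of the two call patterns in question. It is not the patch (which changes CPython's C code); it only shows that neither call waits. The local `socket.socketpair()` stands in for a real connection, and the names `a`, `b`, and `sel` are assumptions made for the example.

```python
import selectors
import socket

# A connected pair of local sockets stands in for a client/server connection.
a, b = socket.socketpair()
a.setblocking(False)
b.setblocking(False)

# In nonblocking mode, recv on an empty socket raises instead of waiting.
try:
    b.recv(100)
    would_block = False
except BlockingIOError:
    would_block = True

# send copies the bytes into the kernel buffer and returns right away.
sent = a.send(b"x" * 100)

# A timeout of 0 turns select into a pure poll: it only reports what is
# ready at this moment and never sleeps.
sel = selectors.DefaultSelector()
sel.register(b, selectors.EVENT_READ)
ready = sel.select(timeout=0)

# The data is already buffered, so this recv also returns immediately.
data = b.recv(100)

sel.close()
a.close()
b.close()
```

Each of these calls returns without waiting on the peer, which is why holding the GIL across them should be cheap.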

I've made a patch 
(https://github.com/jcrist/cpython/tree/keep-gil-for-fast-syscalls) with these 
two changes, and run a benchmark (attached) to evaluate the effect of 
background threads with/without the patch. The benchmark starts an asyncio 
server in one process, and a number of clients in a separate process. A number 
of background threads that just spin are started in the server process 
(configurable by the `-t` flag, defaults to 0), then the server is loaded to 
measure the RPS.
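
The attached `bench.py` is not reproduced here. As a hypothetical sketch of what a spinning background thread could look like (the names `spin`, `stop`, and `counts` are assumptions, not taken from the benchmark):

```python
import threading
import time


def spin(stop, counts):
    # Pure-Python busy loop: it never blocks, so it gives up the GIL only
    # when the interpreter forces a switch.
    n = 0
    while not stop.is_set():
        n += 1
    counts.append(n)


stop = threading.Event()
counts = []
thread = threading.Thread(target=spin, args=(stop, counts))

start = time.perf_counter()
thread.start()
time.sleep(0.2)  # the main thread sleeps, leaving the GIL to the spinner
stop.set()
thread.join()
elapsed = time.perf_counter() - start

print(f"Spinner spun {counts[0] / elapsed:.3g} cycles/second")
```

A thread like this competes for the GIL continuously, which is the contention the `-t` flag is meant to create.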

Here are the results:

```
# Main branch
$ python bench.py -c1 -t0
Benchmark: clients = 1, msg-size = 100, background-threads = 0
16324.2 RPS
$ python bench.py -c1 -t1
Benchmark: clients = 1, msg-size = 100, background-threads = 1
Spinner spun 1.52e+07 cycles/second
97.6 RPS
$ python bench.py -c2 -t0
Benchmark: clients = 2, msg-size = 100, background-threads = 0
31308.0 RPS
$ python bench.py -c2 -t1
Benchmark: clients = 2, msg-size = 100, background-threads = 1
Spinner spun 1.52e+07 cycles/second
96.2 RPS
$ python bench.py -c10 -t0
Benchmark: clients = 10, msg-size = 100, background-threads = 0
47169.6 RPS
$ python bench.py -c10 -t1
Benchmark: clients = 10, msg-size = 100, background-threads = 1
Spinner spun 1.54e+07 cycles/second
95.4 RPS

# With this patch
$ ./python bench.py -c1 -t0
Benchmark: clients = 1, msg-size = 100, background-threads = 0
18201.8 RPS
$ ./python bench.py -c1 -t1
Benchmark: clients = 1, msg-size = 100, background-threads = 1
Spinner spun 9.03e+06 cycles/second
194.6 RPS
$ ./python bench.py -c2 -t0
Benchmark: clients = 2, msg-size = 100, background-threads = 0
34151.8 RPS
$ ./python bench.py -c2 -t1
Benchmark: clients = 2, msg-size = 100, background-threads = 1
Spinner spun 8.72e+06 cycles/second
729.6 RPS
$ ./python bench.py -c10 -t0
Benchmark: clients = 10, msg-size = 100, background-threads = 0
53666.6 RPS
$ ./python bench.py -c10 -t1
Benchmark: clients = 10, msg-size = 100, background-threads = 1
Spinner spun 5e+06 cycles/second
21838.2 RPS
```

A few comments on the results:

- On the main branch, any GIL contention sharply decreases the number of RPS an 
asyncio server can handle, regardless of the number of clients. This makes 
sense - any socket operation will release the GIL, and the server thread will 
have to wait to reacquire it (up to the switch interval), rinse and repeat. So 
if every request requires 1 recv and 1 send, a server with background GIL 
contention is stuck at a max of `1 / (2 * switchinterval)`, or 100 RPS with the 
default 5 ms switch interval (consistent with the ~96 RPS measured above). This 
effectively prioritizes the background thread over 
the IO thread, since the IO thread releases the GIL very frequently and the 
background thread never does.

- With the patch, we still see a performance degradation, but the degradation 
is less severe and improves with the number of clients. This is because with 
these changes the asyncio thread only releases the GIL when doing a blocking 
poll for new IO events (or when the switch interval is hit). With low load (1 
client), the IO thread becomes idle more frequently and releases the GIL. Under 
higher load though the event loop frequently still has work to do at the end of 
a cycle and issues a `selector.select` call with a 0 timeout (nonblocking), 
avoiding releasing the GIL at all during that loop (note the nonlinear effect 
of adding more clients). Since the IO thread still releases the GIL sometimes, 
the background thread still holds the GIL a larger percentage of the time than 
the IO thread, but the difference between them is less severe than without this 
patch.
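
The `1 / (2 * switchinterval)` ceiling from the first comment can be evaluated for the running interpreter and compared against the `-t1` results on the main branch. This is a back-of-the-envelope sketch; `syscalls_per_request` is an assumed name for the "1 recv and 1 send" per request.

```python
import sys

# Seconds a waiting thread lets the GIL holder run before requesting a
# switch (0.005 unless changed with sys.setswitchinterval).
switch_interval = sys.getswitchinterval()

# Assumption: each request costs one recv and one send, and each of those
# gives up the GIL and then waits a full switch interval to get it back.
syscalls_per_request = 2

max_rps = 1 / (syscalls_per_request * switch_interval)
print(f"switch interval = {switch_interval} s, ceiling = {max_rps:.0f} RPS")
```

Lowering the interval with `sys.setswitchinterval` raises this ceiling, at the cost of more frequent forced switches for every thread.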

I have also tested this patch on a Dask cluster running some real-world 
problems and found that it did improve performance where IO was throttled due 
to GIL contention.

----------
components: C API, IO
files: bench.py
messages: 406422
nosy: jcristharif
priority: normal
severity: normal
status: open
title: Avoid releasing the GIL in nonblocking socket operations
type: performance
versions: Python 3.11
Added file: https://bugs.python.org/file50443/bench.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue45819>
_______________________________________