New submission from kevinconway:

My organization noticed this issue after launching several asyncio services 
that receive either a sustained high volume of incoming connections or 
regular bursts of traffic. Our monitoring showed a loss of between 4% and 6% 
of all incoming requests. On the client side we saw a socket read error, 
"Connection reset by peer". On the asyncio side, with debug mode turned on, 
we saw nothing logged.

After further investigation we determined that asyncio was not calling 
accept() on the listening socket fast enough. To test this we put together 
several hello-world style services and put them under load. I've attached the 
project we used for testing. Included are three Dockerfiles that run the 
service under different configurations: one runs it as a standalone aiohttp 
service, another runs the aiohttp worker behind gunicorn, and the third runs 
the aiohttp service with the proposed asyncio patch in place. For our testing 
we used 'wrk' to generate traffic and collect data on the OS/socket errors.

For anyone attempting to recreate our experiments, we ran three test 
batteries against each endpoint of the service using:

wrk --duration 30s --timeout 10s --latency --threads 2 --connections 10 <URL>
wrk --duration 30s --timeout 10s --latency --threads 2 --connections 100 <URL>
wrk --duration 30s --timeout 10s --latency --threads 2 --connections 1000 <URL>

The endpoints most valuable for us to test were the ones that replicated some 
of our production logic (a sketch of roughly equivalent handlers follows the 
list):

<URL>/  # Hello World
<URL>/sleep?time=100  # Every request is delayed by 100 milliseconds and 
returns an HTML message.
<URL>/blocking/inband  # Every request performs a bcrypt hash with work 
factor 10 and runs the CPU-bound work on the event loop thread.
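
For reference, the endpoints are roughly equivalent to handlers like the 
following sketch (assuming Python 3.5+ async/await syntax, a recent aiohttp, 
and the bcrypt package; handler names and routes are illustrative, not the 
attached project's exact code):

import asyncio

import bcrypt
from aiohttp import web


async def hello(request):
    return web.Response(text="Hello World")


async def sleep_handler(request):
    # Delay the response by ?time= milliseconds before returning HTML.
    delay_ms = int(request.query.get("time", 100))
    await asyncio.sleep(delay_ms / 1000)
    return web.Response(text="<p>slept %d ms</p>" % delay_ms,
                        content_type="text/html")


async def blocking_inband(request):
    # Deliberately run CPU-bound bcrypt work on the event loop thread.
    bcrypt.hashpw(b"password", bcrypt.gensalt(rounds=10))
    return web.Response(text="done")


app = web.Application()
app.router.add_get("/", hello)
app.router.add_get("/sleep", sleep_handler)
app.router.add_get("/blocking/inband", blocking_inband)

if __name__ == "__main__":
    web.run_app(app)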

Our results varied based on the available CPU cycles, but we consistently 
recreated the socket read errors from production using the above tests.

Our proposed solution, attached as a patch file, is to wrap the socket.accept() 
call in a loop bounded by the listening socket's backlog. The backlog value 
serves as an upper bound to avoid the reverse problem of starving active 
coroutines while the event loop keeps accepting new connections without 
yielding. With the proposed patch in place, the loss disappeared.
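
The shape of the change is small. Below is a minimal sketch of the idea, not 
the attached patch itself; accept_pending_connections and handle_connection 
are illustrative names standing in for the event loop's accept callback and 
its connection setup:

def accept_pending_connections(listening_sock, backlog, handle_connection):
    # Drain up to `backlog` queued connections per readiness callback
    # instead of accepting a single connection and returning to the loop.
    for _ in range(backlog):
        try:
            conn, addr = listening_sock.accept()
        except (BlockingIOError, InterruptedError):
            # The kernel accept queue is empty; yield back to the event loop.
            return
        conn.setblocking(False)
        handle_connection(conn, addr)

The patch applies this idea inside the event loop's accept callback, so a 
single readiness event drains the queue (up to the backlog) before control 
returns to other tasks.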

For further comparison, we ran similar tests against Twisted and encountered 
no loss. Reviewing its socket accept logic, we found that Twisted already 
runs accept() in a bounded loop to prevent this issue 
(https://github.com/twisted/twisted/blob/trunk/src/twisted/internet/tcp.py#L1028).

----------
components: asyncio
files: testservice.zip
messages: 273989
nosy: gvanrossum, haypo, kevinconway, yselivanov
priority: normal
severity: normal
status: open
title: Socket accept exhaustion during high TCP traffic
versions: Python 3.4, Python 3.5, Python 3.6
Added file: http://bugs.python.org/file44286/testservice.zip

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue27906>
_______________________________________