On Mon, 2016-06-20 at 17:23 -0400, Jason Baron wrote: > From: Jason Baron <jba...@akamai.com> > > When SO_SNDBUF is set and we are under tcp memory pressure, the effective > write buffer space can be much lower than what was set using SO_SNDBUF. For > example, we may have set the buffer to 100kb, but we may only be able to > write 10kb. In this scenario poll()/select()/epoll(), are going to > continuously return POLLOUT, followed by -EAGAIN from write(), and thus > result in a tight loop. Note that epoll in edge-triggered does not have > this issue since it only triggers once data has been ack'd. There is no > issue here when SO_SNDBUF is not set, since the tcp layer will auto tune > the sk->sndbuf. > > Introduce a new socket flag, SOCK_SHORT_WRITE, such that we can mark the > socket when we have a short write due to memory pressure. By then testing > for SOCK_SHORT_WRITE in tcp_poll(), we hold off the POLLOUT until a > non-zero amount of data has been ack'd. In a previous approach: > http://marc.info/?l=linux-netdev&m=143930393211782&w=2, I had introduced a > new field in 'struct sock' to solve this issue, but its undesirable to add > bloat to 'struct sock'. We also could address this issue, by waiting for > the buffer to become completely empty, but that may reduce throughput since > the write buffer would be empty while waiting for subsequent writes. This > change brings us in line with the current epoll edge-trigger behavior for > the poll()/select() and epoll level-trigger. > > We guarantee that SOCK_SHORT_WRITE will eventually clear, since when we set > SOCK_SHORT_WRITE, we make sure that sk_wmem_queued is non-zero and > SOCK_NOSPACE is set as well (in sk_stream_wait_memory()). > > I tested this patch using 10,000 sockets, and setting SO_SNDBUF on the > server side, to induce tcp memory pressure. A single server thread reduced > its cpu usage from 100% to 19%, while maintaining the same level of > throughput. > > Signed-off-by: Jason Baron <jba...@akamai.com> > ---
When this bug was added, and by which commit ? This looks serious, but your patch looks way too complex. This SOCK_SHORT_WRITE bit constantly cleared in sk_stream_write_space() is expensive, because it dirties sk_flags which is not supposed to written often. This field is located in a read mostly cache line. I believe my suggestion was not to add a per socket bit, but have a global state like tcp_memory_pressure. (or reuse it) Ie , when under global memory pressure, tcp_poll() should apply a strategy to give POLLOUT only on sockets having less than 4K (tcp_wmem[0]) of memory consumed in their write queue.