Re: [RFC PATCH net-next] tcp: reduce cpu usage under tcp memory pressure when SO_SNDBUF is set
On 08/10/2015 10:47 AM, Eric Dumazet wrote:
> On Fri, 2015-08-07 at 18:31 +, Jason Baron wrote:
>> From: Jason Baron <jba...@akamai.com>
>>
>> When SO_SNDBUF is set and we are under tcp memory pressure, the
>> effective write buffer space can be much lower than what was set
>> using SO_SNDBUF. For example, we may have set the buffer to 100kb,
>> but we may only be able to write 10kb. In this scenario
>> poll()/select()/epoll() are going to continuously return POLLOUT,
>> followed by -EAGAIN from write() in a very tight loop.
>>
>> Introduce sk->sk_effective_sndbuf, such that we can track the
>> 'effective' size of the sndbuf, when we have a short write due to
>> memory pressure. By using sk->sk_effective_sndbuf instead of
>> sk->sk_sndbuf when we are under memory pressure, we can delay the
>> POLLOUT until 1/3 of the buffer clears, as we normally do. There is
>> no issue here when SO_SNDBUF is not set, since the tcp layer will
>> auto tune the sk->sndbuf.
>>
>> In my testing, this brought a single thread's cpu usage down from
>> 100% to 1% while maintaining the same level of throughput when
>> under memory pressure.
>
> I am not sure we need to grow socket for something that looks like a
> flag ?

So I added a new field because I needed to store the new 'effective'
sndbuf somewhere and then restore the original value that was set via
SO_SNDBUF. So it's really because of SO_SNDBUF. We could perhaps use
the fact that we are in memory pressure to signal wakeups differently,
but I'm not sure exactly how.

> Also you add a race in sk_stream_wspace() as sk_effective_sndbuf
> value can change under us.
>
> +	if (sk->sk_effective_sndbuf)
> +		return sk->sk_effective_sndbuf - sk->sk_wmem_queued;
> +

thanks. better?
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -798,8 +798,10 @@ static inline int sk_stream_min_wspace(const struct sock *sk)
 
 static inline int sk_stream_wspace(const struct sock *sk)
 {
-	if (sk->sk_effective_sndbuf)
-		return sk->sk_effective_sndbuf - sk->sk_wmem_queued;
+	int effective_sndbuf = sk->sk_effective_sndbuf;
+
+	if (effective_sndbuf)
+		return effective_sndbuf - sk->sk_wmem_queued;
 	return sk->sk_sndbuf - sk->sk_wmem_queued;
 }

Thanks,

-Jason
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
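Why the local variable matters can be sketched in userspace C. This is a hypothetical, simplified model (`struct fake_sock` and both functions are illustrative, not kernel code): if `sk_effective_sndbuf` is read a second time after the non-zero test, a concurrent clear between the two reads yields `0 - wmem_queued`, a bogus negative wspace. Reading once into a local makes the test and the subtraction use the same value.

```c
#include <assert.h>

/* Hypothetical stand-in for the relevant sock fields. */
struct fake_sock {
	int effective_sndbuf;	/* may be cleared to 0 by another CPU */
	int sndbuf;
	int wmem_queued;
};

/* Racy shape: the field is read twice.  The second read is passed in
 * explicitly here to simulate another thread clearing it between the
 * test and the subtraction. */
int wspace_racy_second_read(const struct fake_sock *sk, int second_read)
{
	if (sk->effective_sndbuf)		/* first read: non-zero */
		return second_read - sk->wmem_queued; /* second read: may be 0 */
	return sk->sndbuf - sk->wmem_queued;
}

/* Fixed shape, as in the follow-up diff: one read into a local, so the
 * test and the arithmetic cannot see two different values. */
int wspace_fixed(const struct fake_sock *sk)
{
	int effective_sndbuf = sk->effective_sndbuf;

	if (effective_sndbuf)
		return effective_sndbuf - sk->wmem_queued;
	return sk->sndbuf - sk->wmem_queued;
}
```

With `effective_sndbuf = 10240` and `wmem_queued = 8192`, the fixed version returns 2048; the racy version, if the field is cleared between the reads, returns -8192.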
Re: [RFC PATCH net-next] tcp: reduce cpu usage under tcp memory pressure when SO_SNDBUF is set
On Fri, 2015-08-07 at 18:31 +, Jason Baron wrote:
> From: Jason Baron <jba...@akamai.com>
>
> When SO_SNDBUF is set and we are under tcp memory pressure, the
> effective write buffer space can be much lower than what was set
> using SO_SNDBUF. For example, we may have set the buffer to 100kb,
> but we may only be able to write 10kb. In this scenario
> poll()/select()/epoll() are going to continuously return POLLOUT,
> followed by -EAGAIN from write() in a very tight loop.
>
> Introduce sk->sk_effective_sndbuf, such that we can track the
> 'effective' size of the sndbuf, when we have a short write due to
> memory pressure. By using sk->sk_effective_sndbuf instead of
> sk->sk_sndbuf when we are under memory pressure, we can delay the
> POLLOUT until 1/3 of the buffer clears, as we normally do. There is
> no issue here when SO_SNDBUF is not set, since the tcp layer will
> auto tune the sk->sndbuf.
>
> In my testing, this brought a single thread's cpu usage down from
> 100% to 1% while maintaining the same level of throughput when under
> memory pressure.

I am not sure we need to grow socket for something that looks like a
flag ?

Also you add a race in sk_stream_wspace() as sk_effective_sndbuf value
can change under us.

+	if (sk->sk_effective_sndbuf)
+		return sk->sk_effective_sndbuf - sk->sk_wmem_queued;
+
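The "1/3 of the buffer clears" behavior comes from the generic stream writability test, modeled here in plain C as an assumption about what sk_stream_min_wspace()/sk_stream_is_writeable() compute (free space must reach half of what is still queued); the function names and threshold are a sketch, not the kernel source:

```c
#include <assert.h>

/* Assumed model of sk_stream_wspace(): free space in the send buffer. */
int stream_wspace(int sndbuf, int wmem_queued)
{
	return sndbuf - wmem_queued;
}

/* Assumed model of sk_stream_min_wspace(): half of what is queued. */
int stream_min_wspace(int wmem_queued)
{
	return wmem_queued >> 1;
}

/* Writable once free space >= half the queued bytes, i.e. once about
 * one third of a full buffer has drained. */
int stream_is_writeable(int sndbuf, int wmem_queued)
{
	return stream_wspace(sndbuf, wmem_queued) >=
	       stream_min_wspace(wmem_queued);
}
```

With a full 102400-byte buffer, this model reports writable only once queued data drops to about 68267 bytes, i.e. a third has cleared. The bug in the pre-patch code is that `sndbuf` here stays at the SO_SNDBUF value (100kb) even when memory pressure caps the real capacity near 10kb, so POLLOUT fires while write() keeps failing with -EAGAIN.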
Re: [RFC PATCH net-next] tcp: reduce cpu usage under tcp memory pressure when SO_SNDBUF is set
On Mon, 2015-08-10 at 13:29 -0400, Jason Baron wrote:
> thanks. better?
>
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -798,8 +798,10 @@ static inline int sk_stream_min_wspace(const struct sock *sk)
>
>  static inline int sk_stream_wspace(const struct sock *sk)
>  {
> -	if (sk->sk_effective_sndbuf)
> -		return sk->sk_effective_sndbuf - sk->sk_wmem_queued;
> +	int effective_sndbuf = sk->sk_effective_sndbuf;
> +
> +	if (effective_sndbuf)
> +		return effective_sndbuf - sk->sk_wmem_queued;
>  	return sk->sk_sndbuf - sk->sk_wmem_queued;
>  }

You need to use instead :

	int effective_sndbuf = READ_ONCE(sk->sk_effective_sndbuf);
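The point of READ_ONCE() is that a plain local-variable assignment does not stop the compiler from re-reading or tearing the load; a volatile access does. Below is a hedged userspace approximation (the real kernel macro in linux/compiler.h is more general; `READ_ONCE_INT`, `struct sock_model`, and `sk_stream_wspace_model` are names invented for this sketch):

```c
#include <assert.h>

/* Userspace approximation of READ_ONCE() for an int: the volatile
 * cast forces exactly one untorn load and forbids the compiler from
 * caching or repeating the read. */
#define READ_ONCE_INT(x) (*(const volatile int *)&(x))

struct sock_model {
	int sk_effective_sndbuf;
	int sk_sndbuf;
	int sk_wmem_queued;
};

/* Model of the fixed sk_stream_wspace(): snapshot the shared field
 * once, then use only the snapshot. */
int sk_stream_wspace_model(const struct sock_model *sk)
{
	int effective_sndbuf = READ_ONCE_INT(sk->sk_effective_sndbuf);

	if (effective_sndbuf)
		return effective_sndbuf - sk->sk_wmem_queued;
	return sk->sk_sndbuf - sk->sk_wmem_queued;
}
```

Without the volatile access the compiler is free to turn the local back into two loads of the field, reintroducing the race Eric flagged even though the source reads as a single assignment.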
[RFC PATCH net-next] tcp: reduce cpu usage under tcp memory pressure when SO_SNDBUF is set
From: Jason Baron <jba...@akamai.com>

When SO_SNDBUF is set and we are under tcp memory pressure, the
effective write buffer space can be much lower than what was set using
SO_SNDBUF. For example, we may have set the buffer to 100kb, but we
may only be able to write 10kb. In this scenario poll()/select()/epoll()
are going to continuously return POLLOUT, followed by -EAGAIN from
write() in a very tight loop.

Introduce sk->sk_effective_sndbuf, such that we can track the
'effective' size of the sndbuf, when we have a short write due to
memory pressure. By using sk->sk_effective_sndbuf instead of
sk->sk_sndbuf when we are under memory pressure, we can delay the
POLLOUT until 1/3 of the buffer clears, as we normally do. There is no
issue here when SO_SNDBUF is not set, since the tcp layer will auto
tune the sk->sndbuf.

In my testing, this brought a single thread's cpu usage down from 100%
to 1% while maintaining the same level of throughput when under memory
pressure.

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 include/net/sock.h | 12 ++++++++++++
 net/core/sock.c    |  1 +
 net/core/stream.c  |  1 +
 net/ipv4/tcp.c     | 10 +++++++---
 4 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 43c6abc..ca49415 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -380,6 +380,7 @@ struct sock {
 	atomic_t		sk_wmem_alloc;
 	atomic_t		sk_omem_alloc;
 	int			sk_sndbuf;
+	int			sk_effective_sndbuf;
 	struct sk_buff_head	sk_write_queue;
 	kmemcheck_bitfield_begin(flags);
 	unsigned int		sk_shutdown  : 2,
@@ -779,6 +780,14 @@ static inline bool sk_acceptq_is_full(const struct sock *sk)
 	return sk->sk_ack_backlog > sk->sk_max_ack_backlog;
 }
 
+static inline void sk_set_effective_sndbuf(struct sock *sk)
+{
+	if (sk->sk_wmem_queued > sk->sk_sndbuf)
+		sk->sk_effective_sndbuf = sk->sk_sndbuf;
+	else
+		sk->sk_effective_sndbuf = sk->sk_wmem_queued;
+}
+
 /*
  * Compute minimal free write space needed to queue new packets.
  */
@@ -789,6 +798,9 @@ static inline int sk_stream_min_wspace(const struct sock *sk)
 
 static inline int sk_stream_wspace(const struct sock *sk)
 {
+	if (sk->sk_effective_sndbuf)
+		return sk->sk_effective_sndbuf - sk->sk_wmem_queued;
+
 	return sk->sk_sndbuf - sk->sk_wmem_queued;
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 193901d..4fce879 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2309,6 +2309,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_allocation	=	GFP_KERNEL;
 	sk->sk_rcvbuf		=	sysctl_rmem_default;
 	sk->sk_sndbuf		=	sysctl_wmem_default;
+	sk->sk_effective_sndbuf =	0;
 	sk->sk_state		=	TCP_CLOSE;
 	sk_set_socket(sk, sock);
 
diff --git a/net/core/stream.c b/net/core/stream.c
index d70f77a..7c175e7 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -32,6 +32,7 @@ void sk_stream_write_space(struct sock *sk)
 
 	if (sk_stream_is_writeable(sk) && sock) {
 		clear_bit(SOCK_NOSPACE, &sock->flags);
+		sk->sk_effective_sndbuf = 0;
 
 		rcu_read_lock();
 		wq = rcu_dereference(sk->sk_wq);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 45534a5..9e7f0a5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -845,6 +845,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
 		sk->sk_prot->enter_memory_pressure(sk);
 		sk_stream_moderate_sndbuf(sk);
 	}
+	sk_set_effective_sndbuf(sk);
 	return NULL;
 }
 
@@ -939,9 +940,10 @@ new_segment:
 				tcp_mark_push(tp, skb);
 				goto new_segment;
 			}
-			if (!sk_wmem_schedule(sk, copy))
+			if (!sk_wmem_schedule(sk, copy)) {
+				sk_set_effective_sndbuf(sk);
 				goto wait_for_memory;
-
+			}
 			if (can_coalesce) {
 				skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
 			} else {
@@ -1214,8 +1216,10 @@ new_segment:
 
 			copy = min_t(int, copy, pfrag->size - pfrag->offset);
 
-			if (!sk_wmem_schedule(sk, copy))
+			if (!sk_wmem_schedule(sk, copy)) {
+				sk_set_effective_sndbuf(sk);
 				goto wait_for_memory;
+			}
 
 			err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
 						       pfrag->page,
--
1.8.2.rc2
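The core of the patch's sk_set_effective_sndbuf() helper is a clamp: when an allocation fails under memory pressure, record how much actually got queued, capped at the configured sndbuf. A standalone model of that logic (the function below is a sketch mirroring the patch, not the patch itself):

```c
#include <assert.h>

/* Model of sk_set_effective_sndbuf(): the effective sndbuf is the
 * smaller of what is queued and the configured SO_SNDBUF value. */
int effective_sndbuf(int wmem_queued, int sndbuf)
{
	if (wmem_queued > sndbuf)
		return sndbuf;
	return wmem_queued;
}
```

So with SO_SNDBUF at 102400 but only 10240 bytes queued before sk_wmem_schedule() fails, the effective sndbuf becomes 10240, and the POLLOUT wakeup threshold in sk_stream_wspace() then tracks the space the socket can really use instead of the nominal 100kb.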