On 01/23/2017 01:04 PM, Eric Dumazet wrote:
On Mon, 2017-01-23 at 11:56 -0500, Jason Baron wrote:
On 01/23/2017 09:30 AM, Oleg Nesterov wrote:
Hello,

smp_mb__after_atomic() looks wrong and misleading, sock_reset_flag() does the
non-atomic __clear_bit() and thus it can not guarantee test_bit(SOCK_NOSPACE)
(non-atomic too) won't be reordered.


Indeed. Here's a bit of discussion on it:
http://marc.info/?l=linux-netdev&m=146662325920596&w=2

It was added by 3c7151275c0c9a "tcp: add memory barriers to write space paths"
and the patch looks correct in that we need the barriers in tcp_check_space()
and tcp_poll() in theory, so it seems tcp_check_space() needs smp_mb() ?


Yes, I think it should be upgraded to an smp_mb() there. If you agree
with this analysis, I will send a patch to upgrade it. Note, I did not
actually run into this race in practice.

SOCK_QUEUE_SHRUNK is used locally in TCP, it is not used by tcp_poll().

(Otherwise it would be using atomic set/clear operations)

I do not see obvious reason why we have this smp_mb__after_atomic() in
tcp_check_space().



The idea of the smp_mb__after_atomic() in tcp_check_space() was to ensure that the 'read' of SOCK_NOSPACE there didn't happen before any of the 'write' to make sk_stream_is_writeable() true. Otherwise, we could miss doing the wakeup from tcp_check_space(). There is probably an argument here that there will likely be a subsequent call to tcp_check_space() that will see the SOCK_NOSPACE bit set, but in theory we could have a small send buffer, or a lot of data could be ack'd in one go.

What I missed in the original patch was that sock_reset_flag() isn't an atomic operation and thus the smp_mb__after_atomic() is wrong.


But looking at this code, it seems we lack one barrier if sk_sndbuf is
ever increased. Fortunately this almost never happen during TCP session
lifetime...


But the wakeup from sk->sk_write_space(sk) will imply a smp_wmb() as per the comment in __wake_up() ?

Thanks,

-Jason


diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 
bfa165cc455ad0a9aea44964aa663dbe6085aebd..3692e9f4c852cebf8c4d46c141f112e75e4ae66d
 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -331,8 +331,13 @@ static void tcp_sndbuf_expand(struct sock *sk)
        sndmem = ca_ops->sndbuf_expand ? ca_ops->sndbuf_expand(sk) : 2;
        sndmem *= nr_segs * per_mss;

-       if (sk->sk_sndbuf < sndmem)
+       if (sk->sk_sndbuf < sndmem) {
                sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
+               /* Paired with second sk_stream_is_writeable(sk)
+                * test from tcp_poll()
+                */
+               smp_wmb();
+       }
 }

 /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)


Reply via email to