We sometimes observe a 'deadly embrace' type deadlock occurring
between mutually connected sockets on the same node. This happens
when the one-hour peer supervision timers of both sockets expire
simultaneously.

The scenario is as follows:

CPU 1:                          CPU 2:
--------                        --------
tipc_sk_timeout(sk1)            tipc_sk_timeout(sk2)
  lock(sk1.slock)                 lock(sk2.slock)
  msg_create(probe)               msg_create(probe)
  unlock(sk1.slock)               unlock(sk2.slock)
  tipc_node_xmit_skb()            tipc_node_xmit_skb()
    tipc_node_xmit()                tipc_node_xmit()
      tipc_sk_rcv(sk2)                tipc_sk_rcv(sk1)
        lock(sk2.slock)                 lock(sk1.slock)
        filter_rcv()                    filter_rcv()
          tipc_sk_proto_rcv()             tipc_sk_proto_rcv()
            msg_create(probe_rsp)           msg_create(probe_rsp)
            tipc_sk_respond()               tipc_sk_respond()
              tipc_node_xmit_skb()            tipc_node_xmit_skb()
                tipc_node_xmit()                tipc_node_xmit()
                  tipc_sk_rcv(sk1)                tipc_sk_rcv(sk2)
                    lock(sk1.slock)                 lock(sk2.slock)
                    ===> DEADLOCK                   ===> DEADLOCK

Further analysis reveals that there are at least three different
locations in the socket code where tipc_sk_respond() is called while
the socket lock is held, with an ensuing risk of similar deadlocks.

We solve this by ensuring that messages created by tipc_sk_respond()
are sent directly only if the sk_lock.owned mutex is held. Otherwise
they are queued up in the socket write queue and sent after the socket
lock has been released.
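
For reference, the distinction between the two parts of the socket lock
matters here: sock_owned_by_user() tests the sk_lock.owned flag, which is
set only by a process-context caller of lock_sock(), whereas the timer and
receive paths in the scenario above run in BH context and hold only the
sk_lock.slock spinlock via bh_lock_sock(). A rough, simplified paraphrase
of the helper (from include/net/sock.h, not part of this patch):

  /* Simplified paraphrase, not part of this patch: "owned" is set by a
   * process-context caller of lock_sock(); bh_lock_sock(), as used in
   * the timer/receive paths above, takes only the slock spinlock, so
   * this test stays false there.
   */
  static inline bool sock_owned_by_user(const struct sock *sk)
  {
          return sk->sk_lock.owned;
  }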

v2: - Test on the sk_lock.owned mutex instead of sk_lock.slock in
      tipc_sk_respond(). This is safer, since sk_lock.slock may
      occasionally and briefly be held (by concurrent user contexts)
      even when we are in user context.
v3: - By lowering the socket timeout to 36 ms instead of 3,600,000 ms
      and setting up 1000 connections, I could easily reproduce the
      deadlock and verify that my solution works.
    - When killing one of the processes I sometimes got a kernel crash
      in the loop emptying the socket write queue. Realizing that there
      may be concurrent processes emptying the write queue, I had to add
      a test that the dequeuing actually returned a buffer. This solved
      the problem.
    - I tried Ying's suggestion of unconditionally adding all
      CONN_MANAGER messages to the backlog queue, and it didn't work.
      This is because we will often add the message to the backlog when
      the socket is *not* owned, so nothing will trigger execution of
      backlog_rcv() within an acceptable time (see the sketch right
      below). Apart from that, my solution solves the problem at all
      three locations where this deadlock may happen, as already stated
      above.
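
To illustrate the point about the backlog: messages placed on the socket
backlog are only processed when a process-context owner releases the
socket, so if nobody owns the socket at the moment we enqueue, nothing
will call backlog_rcv() until some unrelated lock_sock()/release_sock()
cycle happens to occur. A rough, simplified paraphrase of the generic
code in net/core/sock.c (not part of this patch):

  /* Simplified paraphrase, not part of this patch: the backlog is only
   * drained when the process-context owner calls release_sock(), which
   * feeds each queued skb to sk_backlog_rcv().
   */
  void release_sock(struct sock *sk)
  {
          spin_lock_bh(&sk->sk_lock.slock);
          if (sk->sk_backlog.tail)
                  __release_sock(sk);     /* runs sk_backlog_rcv() per skb */
          sock_release_ownership(sk);     /* clears sk_lock.owned */
          /* release_cb and lock waitqueue wakeup omitted */
          spin_unlock_bh(&sk->sk_lock.slock);
  }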

Reported-by: GUNA <gbala...@gmail.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/socket.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 88bfcd7..e8ed3a8 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -278,7 +278,11 @@ static void tipc_sk_respond(struct sock *sk, struct sk_buff *skb, int err)
 
        dnode = msg_destnode(buf_msg(skb));
        selector = msg_origport(buf_msg(skb));
-       tipc_node_xmit_skb(sock_net(sk), skb, dnode, selector);
+
+       if (sock_owned_by_user(sk))
+               tipc_node_xmit_skb(sock_net(sk), skb, dnode, selector);
+       else
+               skb_queue_tail(&sk->sk_write_queue, skb);
 }
 
 /**
@@ -1830,6 +1834,14 @@ void tipc_sk_rcv(struct net *net, struct sk_buff_head *inputq)
                                tipc_sk_enqueue(inputq, sk, dport);
                                spin_unlock_bh(&sk->sk_lock.slock);
                        }
+                       /* Send pending response/rejected messages, if any */
+                       while (!skb_queue_empty(&sk->sk_write_queue)) {
+                               skb = skb_dequeue(&sk->sk_write_queue);
+                               if (!skb)
+                                       break;
+                               dnode = msg_destnode(buf_msg(skb));
+                               tipc_node_xmit_skb(net, skb, dnode, dport);
+                       }
                        sock_put(sk);
                        continue;
                }
-- 
1.9.1

