On 08/14/2017 06:36 PM, Jon Maloy wrote:
Ying, Partha,
Do you have any viewpoints on this one? I would like to get this into net-next 
asap.
FYI, this was originally an attempt to mitigate the consequences of a stale 
broadcast link before we found the root cause of the problem. (Which is fixed 
in a patch I just sent to netdev.)
Strictly spoken we could do without it after the root cause was found, but I 
still think it is the right thing to do.
I completely agree with you.

I need a clarification on your statement "This elminates any risk that the broadcast link may be declared stale within the worst-case failure dectection time of an unresponsive peer."

With this patch we will increase the worst case detection from 1 second to 250 * 10 (TIPC_BC_RETR_LIMIT) => 2.5 seconds. What do you mean by failure dectection time of an unresponsive peer (link tolerance ?) ?

For the fix, you may add Reviewed-by: Parthasarathy Bhuvaragan <[email protected]>.
/Partha

///jon


-----Original Message-----
From: Jon Maloy
Sent: Wednesday, August 09, 2017 14:22
To: Jon Maloy <[email protected]>; Jon Maloy
<[email protected]>
Cc: Parthasarathy Bhuvaragan <[email protected]>;
Ying Xue <[email protected]>; [email protected]
Subject: [net-next v2 1/1] tipc: don't reset stale broadcast send link

When the broadcast send link after 100 attempts has failed to transfer a
packet to all peers, we consider it stale, and reset it. Thereafter it needs to
re-synchronize with the peers, something currently done by just resetting
and re-establishing all links to all peers. This has turned out to be overkill,
with potentially unwanted consequences for the remaining cluster.

A closer analysis reveals that this can be done much simpler. When this kind
of failure happens, for reasons that may lie outside the TIPC protocol, it is
typically only one peer which is failing to receive and acknowledge packets. It
is hence sufficient to identify and reset the links only to that peer to resolve
the situation, without having to reset the broadcast link at all. This solution
entails a much lower risk of negative consequences for the own node as well
as for the overall cluster.

At the same time, we expand the limit for the link stale counter from
100 to 250. This elminates any risk that the broadcast link may be declared
stale within the worst-case failure dectection time of an unresponsive peer.

We implement this change in this commit.

Signed-off-by: Jon Maloy <[email protected]>
---
  net/tipc/bearer.c | 24 ------------------------  net/tipc/bearer.h |  1 -
  net/tipc/link.c   | 23 +++++++++++++----------
  net/tipc/node.c   | 14 ++++----------
  4 files changed, 17 insertions(+), 45 deletions(-)

diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c index d174ee3..7daf9f1
100644
--- a/net/tipc/bearer.c
+++ b/net/tipc/bearer.c
@@ -365,30 +365,6 @@ static int tipc_reset_bearer(struct net *net, struct
tipc_bearer *b)
        return 0;
  }

-/* tipc_bearer_reset_all - reset all links on all bearers
- */
-void tipc_bearer_reset_all(struct net *net) -{
-       struct tipc_bearer *b;
-       int i;
-
-       for (i = 0; i < MAX_BEARERS; i++) {
-               b = bearer_get(net, i);
-               if (b)
-                       clear_bit_unlock(0, &b->up);
-       }
-       for (i = 0; i < MAX_BEARERS; i++) {
-               b = bearer_get(net, i);
-               if (b)
-                       tipc_reset_bearer(net, b);
-       }
-       for (i = 0; i < MAX_BEARERS; i++) {
-               b = bearer_get(net, i);
-               if (b)
-                       test_and_set_bit_lock(0, &b->up);
-       }
-}
-
  /**
   * bearer_disable
   *
diff --git a/net/tipc/bearer.h b/net/tipc/bearer.h index 635c908..865cb09
100644
--- a/net/tipc/bearer.h
+++ b/net/tipc/bearer.h
@@ -210,7 +210,6 @@ void tipc_bearer_remove_dest(struct net *net, u32
bearer_id, u32 dest);  struct tipc_bearer *tipc_bearer_find(struct net *net,
const char *name);  int tipc_bearer_get_name(struct net *net, char *name,
u32 bearer_id);  struct tipc_media *tipc_media_find(const char *name); -
void tipc_bearer_reset_all(struct net *net);  int tipc_bearer_setup(void);
void tipc_bearer_cleanup(void);  void tipc_bearer_stop(struct net *net); diff
--git a/net/tipc/link.c b/net/tipc/link.c index 60820dc..99b6fee 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -978,15 +978,15 @@ static void link_retransmit_failure(struct tipc_link
*l, struct sk_buff *skb)
        struct tipc_msg *hdr = buf_msg(skb);

        pr_warn("Retransmission failure on link <%s>\n", l->name);
-       link_print(l, "Resetting link ");
+       link_print(l, "State of link ");
        pr_info("Failed msg: usr %u, typ %u, len %u, err %u\n",
                msg_user(hdr), msg_type(hdr), msg_size(hdr),
msg_errcode(hdr));
        pr_info("sqno %u, prev: %x, src: %x\n",
                msg_seqno(hdr), msg_prevnode(hdr), msg_orignode(hdr));
}

-int tipc_link_retrans(struct tipc_link *l, u16 from, u16 to,
-                     struct sk_buff_head *xmitq)
+int tipc_link_retrans(struct tipc_link *l, struct tipc_link *nacker,
+                     u16 from, u16 to, struct sk_buff_head *xmitq)
  {
        struct sk_buff *_skb, *skb = skb_peek(&l->transmq);
        struct tipc_msg *hdr;
@@ -997,11 +997,14 @@ int tipc_link_retrans(struct tipc_link *l, u16 from,
u16 to,
                return 0;

        /* Detect repeated retransmit failures on same packet */
-       if (likely(l->last_retransm != buf_seqno(skb))) {
-               l->last_retransm = buf_seqno(skb);
-               l->stale_count = 1;
-       } else if (++l->stale_count > 100) {
+       if (nacker->last_retransm != buf_seqno(skb)) {
+               nacker->last_retransm = buf_seqno(skb);
+               nacker->stale_count = 1;
+       } else if (++nacker->stale_count > 250) {
                link_retransmit_failure(l, skb);
+               nacker->stale_count = 0;
+               if (link_is_bc_sndlink(l))
+                       return TIPC_LINK_DOWN_EVT;
                return tipc_link_fsm_evt(l, LINK_FAILURE_EVT);
        }

@@ -1528,7 +1531,7 @@ static int tipc_link_proto_rcv(struct tipc_link *l,
struct sk_buff *skb,

                /* If NACK, retransmit will now start at right position */
                if (gap) {
-                       rc = tipc_link_retrans(l, ack + 1, ack + gap, xmitq);
+                       rc = tipc_link_retrans(l, l, ack + 1, ack + gap, xmitq);
                        l->stats.recv_nacks++;
                }

@@ -1680,7 +1683,7 @@ int tipc_link_bc_sync_rcv(struct tipc_link *l, struct
tipc_msg *hdr,
                return rc;

        if (link_bc_retr_eval(snd_l, &from, &to))
-               rc = tipc_link_retrans(snd_l, from, to, xmitq);
+               rc = tipc_link_retrans(snd_l, l, from, to, xmitq);

        l->snd_nxt = peers_snd_nxt;
        if (link_bc_rcv_gap(l))
@@ -1775,7 +1778,7 @@ int tipc_link_bc_nack_rcv(struct tipc_link *l, struct
sk_buff *skb,

        if (dnode == tipc_own_addr(l->net)) {
                tipc_link_bc_ack_rcv(l, acked, xmitq);
-               rc = tipc_link_retrans(l->bc_sndlink, from, to, xmitq);
+               rc = tipc_link_retrans(l->bc_sndlink, l, from, to, xmitq);
                l->stats.recv_nacks++;
                return rc;
        }
diff --git a/net/tipc/node.c b/net/tipc/node.c index aeef801..1082f7d 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1284,7 +1284,7 @@ static void tipc_node_bc_sync_rcv(struct tipc_node
*n, struct tipc_msg *hdr,
        rc = tipc_bcast_sync_rcv(n->net, n->bc_entry.link, hdr);

        if (rc & TIPC_LINK_DOWN_EVT) {
-               tipc_bearer_reset_all(n->net);
+               tipc_node_reset_links(n);
                return;
        }

@@ -1351,15 +1351,9 @@ static void tipc_node_bc_rcv(struct net *net,
struct sk_buff *skb, int bearer_id
        if (!skb_queue_empty(&be->inputq1))
                tipc_node_mcast_rcv(n);

-       if (rc & TIPC_LINK_DOWN_EVT) {
-               /* Reception reassembly failure => reset all links to peer */
-               if (!tipc_link_is_up(be->link))
-                       tipc_node_reset_links(n);
-
-               /* Retransmission failure => reset all links to all peers */
-               if (!tipc_link_is_up(tipc_bc_sndlink(net)))
-                       tipc_bearer_reset_all(net);
-       }
+       /* If reassembly or retransmission failure => reset all links to peer */
+       if (rc & TIPC_LINK_DOWN_EVT)
+               tipc_node_reset_links(n);

        tipc_node_put(n);
  }
--
2.1.4


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion

Reply via email to