Hi,

We are using the TIPC module shipped with kernel 3.10.0-862 on RHEL 7.3.

Scenario:
Kernel crash in a 2-node cluster with no traffic, while bearers are being
enabled and disabled.
We are able to consistently reproduce the crash on a 2-node setup by
enabling/disabling bearers in a loop (on one node the bearer is
enabled/disabled every 0.5 seconds, on the second node every 4 seconds).

Root Cause Analysis:

Disabling the bearer invokes tipc_link_delete, which frees the struct
tipc_link *l_ptr. This is done under tipc_net_lock and the bearer lock.

Enabling the bearer sends discovery messages; on reception of a discovery
message from the peer, tipc_link_create is invoked, which creates a new
l_ptr and associates it with the bearer. All of this is done under the
correct locks. However, it schedules link_start as a tasklet.

Inside link_start, the owner node's lock is taken and work is performed
(a protocol message is sent). If the tasklet runs before tipc_link_delete,
everything is fine. However, if tipc_link_delete runs first, the tasklet is
invoked with a dangling pointer.
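
To make the ordering concrete, here is a minimal userspace sketch of the
same pattern (illustrative names only, not TIPC code, and the list handling
is simplified): a deferred callback is handed a raw pointer whose owner may
free it first, and the callback protects itself by validating the pointer
against the list of live objects before dereferencing it, which is the same
idea as the patch further down.

#include <stdio.h>
#include <stdlib.h>

struct demo_link {
    struct demo_link *next;
    int id;
};

static struct demo_link *live_links;    /* stands in for the global node/link list */

/* Validate by pointer comparison only; never dereference 'l' here. */
static int link_is_live(struct demo_link *l)
{
    struct demo_link *p;

    for (p = live_links; p; p = p->next)
        if (p == l)
            return 1;
    return 0;
}

/* Models the deferred link_start tasklet. */
static void demo_link_start(struct demo_link *l)
{
    if (!link_is_live(l)) {
        printf("link already deleted, nothing to do\n");
        return;
    }
    printf("starting link %d\n", l->id);
}

/* Models tipc_link_delete on the bearer-disable path. */
static void demo_link_delete(struct demo_link *l)
{
    struct demo_link **pp;

    for (pp = &live_links; *pp; pp = &(*pp)->next) {
        if (*pp == l) {
            *pp = l->next;
            break;
        }
    }
    free(l);
}

int main(void)
{
    struct demo_link *l = calloc(1, sizeof(*l));

    l->id = 1;
    l->next = live_links;
    live_links = l;             /* link created, deferred start queued */

    demo_link_delete(l);        /* "bearer disabled" before the deferred start runs */
    demo_link_start(l);         /* deferred work now finds the link gone */
    return 0;
}

The key detail is that the check compares pointer values only and never
dereferences the possibly stale pointer until a match with a live link is
found.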

Proposed Solution:
Ideally we would cancel the queued link_start when deleting the link, but
since it is queued as a single work item on the tasklet queue, it is not
possible to cancel one particular item. So within link_start we do a sanity
check on whether l_ptr is still valid; if l_ptr is invalid (it does not
match any link pointer stored in any n_ptr on the node list) we simply
return.

Following is the diff which seems to solve the issue:

# git diff
diff --git a/tipc/link.c b/tipc/link.c
index 349dbfd..f52dca9 100644
--- a/tipc/link.c
+++ b/tipc/link.c
@@ -393,9 +393,24 @@ void tipc_link_delete(struct tipc_link *l_ptr)

 static void link_start(struct tipc_link *l_ptr)
 {
-       tipc_node_lock(l_ptr->owner);
+       struct tipc_node *n_ptr;
+       read_lock_bh(&tipc_net_lock);
+       list_for_each_entry(n_ptr, &tipc_node_list, list) {
+               u32 i;
+               tipc_node_lock(n_ptr);
+               for (i = 0; i < MAX_BEARERS; i++) {
+                       if (n_ptr->links[i] == l_ptr) {
+                               goto LINK_START;
+                       }
+               }
+               tipc_node_unlock(n_ptr);
+       }
+       read_unlock_bh(&tipc_net_lock);
+       return;
+LINK_START:
        link_state_event(l_ptr, STARTING_EVT);
        tipc_node_unlock(l_ptr->owner);
+       read_unlock_bh(&tipc_net_lock);
 }

We would like to know if there are any obvious issues with the above change.
Any feedback is appreciated.

Regards,
Amit
