Hi ,
We are using tipc (version 3.10.0-862 for RHEL 7.3).
Scenario :
kernel crash in 2 node cluster , no traffic (bearers enabled/disabled)
We are able to consistently reproduce the crash on a 2 node setup and
enabling/disabling bearers in a loop. (On 1 node enable/disable bearers
within 0.5 seconds , on 2nd node with 4 seconds interval)
Root Cause Analysis of the issue :
Disabling the bearer invokes tipc_link_delete which frees the struct
tipc_link *l_ptr.
This is done under tipc_net_lock, bearer locks
Enabling the bearer, sends discovery messages and on reception from peer
would invoke tipc_link_create which creates a new l_ptr and associates with
the bearer. All this is done with correct locks. However it schedules
link_start as a tasklet.
Inside link_start, bearer lock is taken and work is performed (send proto
msg). So if this tasklet is scheduled before tipc_link_delete , everything
is fine. However if the tipc_link_delete happens first then the tasklet is
invoked with dangling pointer.
Proposed Solution:
Cancel the tasklet when deleting the link. But since its a work queue item
within the tasklet, its not possible to cancel a particular work item. So
within link_start, we do sanity check whether l_ptr is valid o
r not. if l_ptr is invalid (does not match with l_ptr inside any n_ptr in
node list) we simply return.
Following is the diff which seems to solve the issue,
# git diff
diff --git a/tipc/link.c b/tipc/link.c
index 349dbfd..f52dca9 100644
--- a/tipc/link.c
+++ b/tipc/link.c
@@ -393,9 +393,24 @@ void tipc_link_delete(struct tipc_link *l_ptr)
static void link_start(struct tipc_link *l_ptr)
{
- tipc_node_lock(l_ptr->owner);
+ struct tipc_node *n_ptr;
+ read_lock_bh(&tipc_net_lock);
+ list_for_each_entry(n_ptr, &tipc_node_list, list) {
+ u32 i;
+ tipc_node_lock(n_ptr);
+ for (i = 0; i < MAX_BEARERS; i++) {
+ if (n_ptr->links[i] == l_ptr) {
+ goto LINK_START;
+ }
+ }
+ tipc_node_unlock(n_ptr);
+ }
+ read_unlock_bh(&tipc_net_lock);
+ return;
+LINK_START:
link_state_event(l_ptr, STARTING_EVT);
tipc_node_unlock(l_ptr->owner);
+ read_unlock_bh(&tipc_net_lock);
}
Would like to know if there are any obvious issues due to above change?
Appreciate any feedback.
Regards,
Amit
_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion