Thanks Jon,

Regards,
Amit

On Fri, 8 Feb 2019, 17:20 Jon Maloy <[email protected] wrote:

> Hi Amit,
> Your solution seems ok to me.
>
> BR
> ///jon
>
>
> > -----Original Message-----
> > From: Amit Jain <[email protected]>
> > Sent: 8-Feb-19 00:04
> > To: [email protected]
> > Subject: [tipc-discussion] Race condition in link_start and
> tipc_link_delete
> > function - RHEL 7.3
> >
> > Hi ,
> >
> > We are using tipc (version 3.10.0-862  for RHEL 7.3).
> >
> > Scenario :
> > kernel crash in 2 node cluster , no traffic (bearers enabled/disabled)
> We are
> > able to consistently reproduce the crash on a 2 node setup and
> > enabling/disabling bearers in a loop. (On 1 node enable/disable bearers
> > within 0.5 seconds , on 2nd node with 4 seconds interval)
> >
> > Root Cause Analysis of the issue :
> >
> > Disabling the bearer invokes tipc_link_delete which frees the struct
> tipc_link
> > *l_ptr.
> > This is done under tipc_net_lock, bearer locks
> >
> > Enabling the bearer, sends discovery messages and on reception from peer
> > would invoke tipc_link_create which creates a new l_ptr and associates
> with
> > the bearer. All this is done with correct locks. However it schedules
> link_start
> > as a tasklet.
> >
> > Inside link_start, bearer lock is taken and work is performed (send proto
> > msg). So if this tasklet is scheduled before tipc_link_delete ,
> everything is
> > fine. However if the tipc_link_delete happens first then the tasklet is
> > invoked with dangling pointer.
> >
> > Proposed Solution:
> > Cancel the tasklet when deleting the link. But since its a work queue
> item
> > within the tasklet, its not possible to cancel a particular work item.
> So within
> > link_start, we do sanity check whether l_ptr is valid o r not. if l_ptr
> is invalid
> > (does not match with l_ptr inside any n_ptr in node list) we simply
> return.
> >
> > Following is the diff which seems to solve the issue,
> >
> > # git diff
> > diff --git a/tipc/link.c b/tipc/link.c
> > index 349dbfd..f52dca9 100644
> > --- a/tipc/link.c
> > +++ b/tipc/link.c
> > @@ -393,9 +393,24 @@ void tipc_link_delete(struct tipc_link *l_ptr)
> >
> >  static void link_start(struct tipc_link *l_ptr)  {
> > -       tipc_node_lock(l_ptr->owner);
> > +       struct tipc_node *n_ptr;
> > +       read_lock_bh(&tipc_net_lock);
> > +       list_for_each_entry(n_ptr, &tipc_node_list, list) {
> > +               u32 i;
> > +               tipc_node_lock(n_ptr);
> > +               for (i = 0; i < MAX_BEARERS; i++) {
> > +                       if (n_ptr->links[i] == l_ptr) {
> > +                               goto LINK_START;
> > +                       }
> > +               }
> > +               tipc_node_unlock(n_ptr);
> > +       }
> > +       read_unlock_bh(&tipc_net_lock);
> > +       return;
> > +LINK_START:
> >         link_state_event(l_ptr, STARTING_EVT);
> >         tipc_node_unlock(l_ptr->owner);
> > +       read_unlock_bh(&tipc_net_lock);
> >  }
> >
> > Would like to know if there are any obvious issues due to above change?
> > Appreciate any feedback.
> >
> > Regards,
> > Amit
> >
> > _______________________________________________
> > tipc-discussion mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>

_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion

Reply via email to