Quoting r. Sean Hefty <[EMAIL PROTECTED]>: > Subject: RE: [openib-general] [RFC] [PATCH 2/7] ib_multicast 2.6.20: add > ib_multicast module to track join requests from the same port > > >But what is the issue? some kind of race? > > If we look at just the ib_multicast patches as an example... > > Calling ib_join_multicast allocates a struct ib_multicast that must be freed. > Here's the relevant portion of ipoib's join callback: > > @@ -325,11 +328,10 @@ ipoib_mcast_sendonly_join_complete(int s > /* Clear the busy flag so we try again */ > + status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, > + &mcast->flags); > } > + return status; > } > > The callback clears the busy flag, and frees the structure by returning a > non-zero value from the callback. (This is convenient for error handling.) > Let > the callback thread hang around right at the return statement for a while. > > When ipoib is unloaded, one of the calls it makes during cleanup is > ipoib_mcast_leave(), which does: > > if(test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) > ib_free_multicast(mcast->mc); > > If ipoib_mcast_leave() is called at the same time that an error is reported > through the callback, it's possible that the struct ib_multicast will be freed > by the callback thread. But there's nothing to prevent the callback thread > from > executing in the ipoib code after unload has occurred. > > Similar issues can apply to ib_cm and rdma_cm. > > - Sean >
But unlike the sa races which were unfixable without API changes, here users can synchronize the removal of the mc object. So I think what you describe is a user error. For example SDP does lock cm_id = ssk->cm_id ssk->cm_id = NULL unlock if (cm_id) destroy(cm_id) and in callback ssk = context lock if (!ssk->cm_id) { unlock return 0; } ... -- MST _______________________________________________ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general