> -----Original Message-----
> From: linux-rdma-ow...@vger.kernel.org
> [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Chuck Lever
> Sent: Thursday, July 17, 2014 3:42 PM
> To: Steve Wise
> Cc: Hefty, Sean; Shirley Ma; Devesh Sharma; Roland Dreier;
> linux-rdma@vger.kernel.org
> Subject: Re: [for-next 1/2] xprtrdma: take reference of rdma provider module
>
> On Jul 17, 2014, at 4:08 PM, Steve Wise <sw...@opengridcomputing.com> wrote:
>
> >> -----Original Message-----
> >> From: Steve Wise [mailto:sw...@opengridcomputing.com]
> >> Sent: Thursday, July 17, 2014 2:56 PM
> >> To: 'Hefty, Sean'; 'Shirley Ma'; 'Devesh Sharma'; 'Roland Dreier'
> >> Cc: 'linux-rdma@vger.kernel.org'; 'chuck.le...@oracle.com'
> >> Subject: RE: [for-next 1/2] xprtrdma: take reference of rdma provider module
> >>
> >>> -----Original Message-----
> >>> From: Hefty, Sean [mailto:sean.he...@intel.com]
> >>> Sent: Thursday, July 17, 2014 2:50 PM
> >>> To: Steve Wise; 'Shirley Ma'; 'Devesh Sharma'; 'Roland Dreier'
> >>> Cc: linux-rdma@vger.kernel.org; chuck.le...@oracle.com
> >>> Subject: RE: [for-next 1/2] xprtrdma: take reference of rdma provider module
> >>>
> >>>>> So the rdma cm is expected to increase the driver reference count
> >>>>> (try_module_get) for each new cm id, then decrement the count
> >>>>> (module_put) when the cm id is destroyed?
> >>>>
> >>>> No, I think he's saying the rdma-cm posts a RDMA_CM_DEVICE_REMOVAL
> >>>> event to each application with rdmacm objects allocated, and each
> >>>> application is expected to destroy all the objects it has allocated
> >>>> before returning from the event handler.
> >>>
> >>> This is almost correct. The applications do not have to destroy all
> >>> the objects they have allocated before returning from their event
> >>> handler. E.g. an app can queue a work item that does the destruction.
> >>> The rdmacm will block in its ib_client remove handler until all
> >>> relevant rdma_cm_ids have been destroyed.
> >>
> >> Thanks for the clarification.
> >
> > And looking at xprtrdma, it does handle the DEVICE_REMOVAL event in
> > rpcrdma_conn_upcall(). It sets ep->rep_connected to -ENODEV, wakes
> > everybody up, and calls rpcrdma_conn_func() for that endpoint, which
> > schedules rep_connect_worker... and I gave up following the code path
> > at this point... :)
> >
> > For this to all work correctly, it would need to destroy all the QPs,
> > MRs, CQs, etc. for that device _before_ destroying the rdma cm ids.
> > Otherwise the provider module could be unloaded too soon.
>
> We can't really deal with a CM_DEVICE_REMOVE event while there are
> active NFS mounts.
>
> System shutdown ordering should guarantee (one would hope) that NFS
> mount points are unmounted before the RDMA/IB core infrastructure is
> torn down. Ordering shouldn't matter as long as all NFS activity has
> ceased before the CM tries to remove the device.
>
> So if something is hanging up the CM, there's something xprtrdma is not
> cleaning up properly.
Devesh, how are you reproducing this? Are you just rmmod'ing the ocrdma
module while there are active mounts?
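
FWIW, the per-cm_id pinning scheme asked about at the top of the thread
would look something like the sketch below. This is illustrative only:
pin_provider()/unpin_provider() are made-up helpers, and per Sean the
rdma_cm deliberately does not do this, relying on DEVICE_REMOVAL instead.
Providers do set ibdev.owner = THIS_MODULE at registration, which is
what try_module_get() would pin here:

#include <linux/module.h>
#include <rdma/rdma_cm.h>

/*
 * Hypothetical helpers, NOT what the rdma_cm actually does: pin the
 * provider module (id->device->owner, e.g. ocrdma.ko) for the lifetime
 * of a cm_id.  Note id->device is only valid once the cm_id has been
 * bound to a device (after address resolution).
 */
static int pin_provider(struct rdma_cm_id *id)
{
	if (!try_module_get(id->device->owner))
		return -ENODEV;	/* provider is already unloading */
	return 0;
}

static void unpin_provider(struct rdma_cm_id *id)
{
	module_put(id->device->owner);
}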
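
And to make the teardown ordering concrete: a consumer's event handler
can defer destruction to a work item, as Sean describes, as long as the
verbs objects go away before the cm_id does. A minimal sketch follows;
struct my_conn, my_remove_worker(), and my_cm_handler() are all
hypothetical, not xprtrdma's actual code:

#include <linux/workqueue.h>
#include <rdma/rdma_cm.h>

struct my_conn {
	struct rdma_cm_id	*cm_id;
	struct ib_cq		*cq;
	struct work_struct	remove_work;	/* INIT_WORK'd at connect time */
};

static void my_remove_worker(struct work_struct *work)
{
	struct my_conn *conn = container_of(work, struct my_conn,
					    remove_work);

	/* Verbs objects first: QP, MRs, CQs ... */
	rdma_destroy_qp(conn->cm_id);
	/* ... dereg any MRs on this device here ... */
	ib_destroy_cq(conn->cq);

	/*
	 * The cm_id goes last.  rdma_destroy_id() is what lets the
	 * rdma_cm's ib_client remove handler stop blocking, which in
	 * turn allows the provider module to be unloaded.
	 */
	rdma_destroy_id(conn->cm_id);
}

static int my_cm_handler(struct rdma_cm_id *id,
			 struct rdma_cm_event *event)
{
	struct my_conn *conn = id->context;

	switch (event->event) {
	case RDMA_CM_EVENT_DEVICE_REMOVAL:
		/*
		 * Per Sean: we need not destroy everything before
		 * returning; queuing a work item is fine.  The rdma_cm
		 * blocks in its remove handler until every cm_id on
		 * this device has been destroyed.
		 */
		schedule_work(&conn->remove_work);
		return 0;
	default:
		return 0;
	}
}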