On Fri, 2010-07-16 at 02:13 -0700, Pradeep Satyanarayana wrote:
> Ralph Campbell wrote:
> > On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote:
> >> Pradeep Satyanarayana wrote:
> >>> Pradeep Satyanarayana wrote:
> >>>> Roland Dreier wrote:
> >>>>>  > I guess I came to a premature conclusion. One set of tests ran fine
> >>>>>  > and I made that conclusion. Another set of tests caused the
> >>>>>  > following crash:
> >>>>>
> >>>>> I don't really know how to interpret this.  Is this crash new, or is it
> >>>>> the same crash you were hoping this patch fixed?
> >>>> This is a new crash.
> >>> I see other manifestations resulting in different crashes:
> >>>
> >>> :mon> t
> >>> [c00000074603ba20] d0000000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> >>> [c00000074603bb10] d000000019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
> >>> [c00000074603bbe0] d000000019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib]
> >>> [c00000074603bd40] c0000000000c6fe4 .run_workqueue+0xf4/0x1e0
> >>> [c00000074603be00] c0000000000c7190 .worker_thread+0xc0/0x180
> >>> [c00000074603bed0] c0000000000ccf4c .kthread+0xb4/0xc0
> >>> [c00000074603bf90] c0000000000309fc .kernel_thread+0x54/0x70
> >>> 9:mon> e
> >>> cpu 0x9: Vector: 300 (Data Access) at [c00000074603b720]
> >>>     pc: c0000000005ac390: ._spin_lock+0x20/0xc8
> >>>     lr: d0000000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> >>>     sp: c00000074603b9a0
> >>>    msr: 8000000000009032
> >>>    dar: 3a0
> >>>  dsisr: 40000000
> >>>   current = 0xc000000756ce8b00
> >>>   paca    = 0xc000000000f63800
> >>>     pid   = 18095, comm = ipoib
> >>> 9:mon>
> >> Recreating the crash has been tricky. I have tried several hundred
> >> times today to unload and reload IPoIB while there is traffic, and no
> >> crashes happened. I took a closer look at the IPoIB CM code and I see
> >> a few things that look suspicious.
> >>
> >> In the ipoib_cm_send() path no priv->lock is held, whereas the
> >> priv->lock is held before calling ipoib_cm_destroy_tx(). This is true
> >> with and without Ralph's patch (fix dangling pointer).
> >> Is this a potential race?
> > 
> > ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected
> > by netif_tx_lock(dev) or stopping the ipoib network device.
> 
> I still see one case in ipoib_neigh_cleanup() wherein
> ipoib_cm_destroy_tx() appears to be called without netif_tx_lock(dev)
> held. Is that correct?
> 
> Thanks
> Pradeep

ipoib_neigh_cleanup() is called by neigh_cleanup_and_release() when
freeing a struct neighbour. I assume the Linux network stack is
not going to call into the IPoIB driver to send sk_buffs in that
case but I could be wrong. If it can, then you are correct that
the netif_tx_lock(dev) should be acquired.
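
To make the locking question concrete, here is a minimal user-space
sketch of the discipline being discussed. It is not driver code and it
is untested against the kernel: every model_* name is hypothetical, a
pthread mutex stands in for netif_tx_lock(dev), and model_tx stands in
for the per-neighbour CM tx context. It only shows the shape of the
fix: the cleanup path takes the same lock as the transmit path before
tearing the context down.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-neighbour CM tx context. */
struct model_tx {
	bool destroyed;
	unsigned long sent;
};

/* Hypothetical stand-in for the net device. */
struct model_dev {
	pthread_mutex_t tx_lock;	/* plays the role of netif_tx_lock(dev) */
	struct model_tx *tx;		/* NULL once the neighbour is cleaned up */
};

/* Send one packet; the caller must already hold dev->tx_lock. */
static bool model_send_locked(struct model_dev *dev)
{
	if (dev->tx == NULL || dev->tx->destroyed)
		return false;
	dev->tx->sent++;
	return true;
}

/* Transmit path: the send always runs with the tx lock held. */
static bool model_start_xmit(struct model_dev *dev)
{
	bool ok;

	assert(dev != NULL);
	pthread_mutex_lock(&dev->tx_lock);
	ok = model_send_locked(dev);
	pthread_mutex_unlock(&dev->tx_lock);
	return ok;
}

/*
 * Cleanup path: detach and mark the tx context under the same lock, so a
 * concurrent sender sees either the live context or no context at all.
 * Returns the detached context (or NULL) for the caller to free.
 */
static struct model_tx *model_neigh_cleanup(struct model_dev *dev)
{
	struct model_tx *tx;

	assert(dev != NULL);
	pthread_mutex_lock(&dev->tx_lock);
	tx = dev->tx;
	dev->tx = NULL;
	if (tx != NULL)
		tx->destroyed = true;
	pthread_mutex_unlock(&dev->tx_lock);
	return tx;
}
```

In this model, a send attempted after model_neigh_cleanup() returns
false instead of touching a torn-down context. Whether the real stack
can ever reach the send path during neighbour cleanup is the open
question above; the sketch only shows what taking the lock would buy.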

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html