Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?

2010-07-16 Thread Pradeep Satyanarayana
Ralph Campbell wrote:
 On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote:
 Pradeep Satyanarayana wrote:
 Pradeep Satyanarayana wrote:
 Roland Dreier wrote:
   I guess I came to a premature conclusion. One set of tests ran fine 
 and I made that
   conclusion. Another set of tests caused the following crash:

 I don't really know how to interpret this.  Is this crash new, or is it
 the same crash you were hoping this patch fixed?
 This is a new crash.
 I see other manifestations resulting in different crashes :

 :mon t
 [c0074603ba20] d000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
 [c0074603bb10] d00019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
 [c0074603bbe0] d00019358558 .ipoib_mcast_restart_task+0x3d0/0x560 
 [ib_ipoib]
 [c0074603bd40] c00c6fe4 .run_workqueue+0xf4/0x1e0
 [c0074603be00] c00c7190 .worker_thread+0xc0/0x180
 [c0074603bed0] c00ccf4c .kthread+0xb4/0xc0
 [c0074603bf90] c00309fc .kernel_thread+0x54/0x70
 9:mon e
 cpu 0x9: Vector: 300 (Data Access) at [c0074603b720]
 pc: c05ac390: ._spin_lock+0x20/0xc8
 lr: d000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
 sp: c0074603b9a0
msr: 80009032
dar: 3a0
  dsisr: 4000
   current = 0xc00756ce8b00
   paca= 0xc0f63800
 pid   = 18095, comm = ipoib
 9:mon
 Recreating the crash has been tricky. I have tried several several hundred 
 times today
 to unload and reload IPoIB while there is traffic and no crashes happened. I 
 took
 a closer look at the IPoIB CM code and I see a few things that look 
 suspicious.

 In the ipoib_cm_send() path no priv-lock is held, whereas the priv-lock is 
 held before 
 calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch 
 (fix dangling pointer).
 Is this a potential race?
 
 ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected
 by netif_tx_lock(dev) or stopping the ipoib network device.

I still see one case in ipoib_neigh_cleanup() wherein ipoib_cm_destroy_tx() 
appears to be called
without netif_tx_lock(dev) held. Is that correct?

Thanks
Pradeep

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?

2010-07-16 Thread Ralph Campbell
On Fri, 2010-07-16 at 02:13 -0700, Pradeep Satyanarayana wrote:
 Ralph Campbell wrote:
  On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote:
  Pradeep Satyanarayana wrote:
  Pradeep Satyanarayana wrote:
  Roland Dreier wrote:
I guess I came to a premature conclusion. One set of tests ran fine 
  and I made that
conclusion. Another set of tests caused the following crash:
 
  I don't really know how to interpret this.  Is this crash new, or is it
  the same crash you were hoping this patch fixed?
  This is a new crash.
  I see other manifestations resulting in different crashes :
 
  :mon t
  [c0074603ba20] d000193527ac .ipoib_neigh_flush+0x6c/0x350 
  [ib_ipoib]
  [c0074603bb10] d00019356dac .ipoib_mcast_free+0x74/0x2a0 
  [ib_ipoib]
  [c0074603bbe0] d00019358558 .ipoib_mcast_restart_task+0x3d0/0x560 
  [ib_ipoib]
  [c0074603bd40] c00c6fe4 .run_workqueue+0xf4/0x1e0
  [c0074603be00] c00c7190 .worker_thread+0xc0/0x180
  [c0074603bed0] c00ccf4c .kthread+0xb4/0xc0
  [c0074603bf90] c00309fc .kernel_thread+0x54/0x70
  9:mon e
  cpu 0x9: Vector: 300 (Data Access) at [c0074603b720]
  pc: c05ac390: ._spin_lock+0x20/0xc8
  lr: d000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
  sp: c0074603b9a0
 msr: 80009032
 dar: 3a0
   dsisr: 4000
current = 0xc00756ce8b00
paca= 0xc0f63800
  pid   = 18095, comm = ipoib
  9:mon
  Recreating the crash has been tricky. I have tried several several hundred 
  times today
  to unload and reload IPoIB while there is traffic and no crashes happened. 
  I took
  a closer look at the IPoIB CM code and I see a few things that look 
  suspicious.
 
  In the ipoib_cm_send() path no priv-lock is held, whereas the priv-lock 
  is held before 
  calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch 
  (fix dangling pointer).
  Is this a potential race?
  
  ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected
  by netif_tx_lock(dev) or stopping the ipoib network device.
 
 I still see one case in ipoib_neigh_cleanup() wherein ipoib_cm_destroy_tx() 
 appears to be called
 without netif_tx_lock(dev) held. Is that correct?
 
 Thanks
 Pradeep

ipoib_neigh_cleanup() is called by neigh_cleanup_and_release() when
freeing a struct neighbour. I assume the Linux network stack is
not going to call into the IPoIB driver to send sk_buffs in that
case but I could be wrong. If it can, then you are correct that
the netif_tx_lock(dev) should be acquired.

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?

2010-07-15 Thread Pradeep Satyanarayana
Pradeep Satyanarayana wrote:
 Pradeep Satyanarayana wrote:
 Roland Dreier wrote:
   I guess I came to a premature conclusion. One set of tests ran fine and 
 I made that
   conclusion. Another set of tests caused the following crash:

 I don't really know how to interpret this.  Is this crash new, or is it
 the same crash you were hoping this patch fixed?
 This is a new crash.
 
 I see other manifestations resulting in different crashes :
 
 :mon t
 [c0074603ba20] d000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
 [c0074603bb10] d00019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
 [c0074603bbe0] d00019358558 .ipoib_mcast_restart_task+0x3d0/0x560 
 [ib_ipoib]
 [c0074603bd40] c00c6fe4 .run_workqueue+0xf4/0x1e0
 [c0074603be00] c00c7190 .worker_thread+0xc0/0x180
 [c0074603bed0] c00ccf4c .kthread+0xb4/0xc0
 [c0074603bf90] c00309fc .kernel_thread+0x54/0x70
 9:mon e
 cpu 0x9: Vector: 300 (Data Access) at [c0074603b720]
 pc: c05ac390: ._spin_lock+0x20/0xc8
 lr: d000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
 sp: c0074603b9a0
msr: 80009032
dar: 3a0
  dsisr: 4000
   current = 0xc00756ce8b00
   paca= 0xc0f63800
 pid   = 18095, comm = ipoib
 9:mon

Recreating the crash has been tricky. I have tried several several hundred 
times today
to unload and reload IPoIB while there is traffic and no crashes happened. I 
took
a closer look at the IPoIB CM code and I see a few things that look suspicious.

In the ipoib_cm_send() path no priv-lock is held, whereas the priv-lock is 
held before 
calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch (fix 
dangling pointer).
Is this a potential race?

In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, 
tx-flags) in 
ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's 
patch) there is a 
clear_bit(IPOIB_FLAG_OPER_UP, tx-flags) called before calling 
ipoib_cm_destroy_tx() only in 
select cases. Was that intended?

Thanks
Pradeep

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?

2010-07-15 Thread Ralph Campbell
I will write up a description of the locking as I
understand it and the changes I made. Give
me a day or two to write it up and check it.

On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote:
 Pradeep Satyanarayana wrote:
  Pradeep Satyanarayana wrote:
  Roland Dreier wrote:
I guess I came to a premature conclusion. One set of tests ran fine 
  and I made that
conclusion. Another set of tests caused the following crash:
 
  I don't really know how to interpret this.  Is this crash new, or is it
  the same crash you were hoping this patch fixed?
  This is a new crash.
  
  I see other manifestations resulting in different crashes :
  
  :mon t
  [c0074603ba20] d000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
  [c0074603bb10] d00019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
  [c0074603bbe0] d00019358558 .ipoib_mcast_restart_task+0x3d0/0x560 
  [ib_ipoib]
  [c0074603bd40] c00c6fe4 .run_workqueue+0xf4/0x1e0
  [c0074603be00] c00c7190 .worker_thread+0xc0/0x180
  [c0074603bed0] c00ccf4c .kthread+0xb4/0xc0
  [c0074603bf90] c00309fc .kernel_thread+0x54/0x70
  9:mon e
  cpu 0x9: Vector: 300 (Data Access) at [c0074603b720]
  pc: c05ac390: ._spin_lock+0x20/0xc8
  lr: d000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
  sp: c0074603b9a0
 msr: 80009032
 dar: 3a0
   dsisr: 4000
current = 0xc00756ce8b00
paca= 0xc0f63800
  pid   = 18095, comm = ipoib
  9:mon
 
 Recreating the crash has been tricky. I have tried several several hundred 
 times today
 to unload and reload IPoIB while there is traffic and no crashes happened. I 
 took
 a closer look at the IPoIB CM code and I see a few things that look 
 suspicious.
 
 In the ipoib_cm_send() path no priv-lock is held, whereas the priv-lock is 
 held before 
 calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch 
 (fix dangling pointer).
 Is this a potential race?
 
 In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, 
 tx-flags) in 
 ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's 
 patch) there is a 
 clear_bit(IPOIB_FLAG_OPER_UP, tx-flags) called before calling 
 ipoib_cm_destroy_tx() only in 
 select cases. Was that intended?
 
 Thanks
 Pradeep
 


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?

2010-07-15 Thread Ralph Campbell
On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote:
 Pradeep Satyanarayana wrote:
  Pradeep Satyanarayana wrote:
  Roland Dreier wrote:
I guess I came to a premature conclusion. One set of tests ran fine 
  and I made that
conclusion. Another set of tests caused the following crash:
 
  I don't really know how to interpret this.  Is this crash new, or is it
  the same crash you were hoping this patch fixed?
  This is a new crash.
  
  I see other manifestations resulting in different crashes :
  
  :mon t
  [c0074603ba20] d000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
  [c0074603bb10] d00019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
  [c0074603bbe0] d00019358558 .ipoib_mcast_restart_task+0x3d0/0x560 
  [ib_ipoib]
  [c0074603bd40] c00c6fe4 .run_workqueue+0xf4/0x1e0
  [c0074603be00] c00c7190 .worker_thread+0xc0/0x180
  [c0074603bed0] c00ccf4c .kthread+0xb4/0xc0
  [c0074603bf90] c00309fc .kernel_thread+0x54/0x70
  9:mon e
  cpu 0x9: Vector: 300 (Data Access) at [c0074603b720]
  pc: c05ac390: ._spin_lock+0x20/0xc8
  lr: d000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
  sp: c0074603b9a0
 msr: 80009032
 dar: 3a0
   dsisr: 4000
current = 0xc00756ce8b00
paca= 0xc0f63800
  pid   = 18095, comm = ipoib
  9:mon
 
 Recreating the crash has been tricky. I have tried several several hundred 
 times today
 to unload and reload IPoIB while there is traffic and no crashes happened. I 
 took
 a closer look at the IPoIB CM code and I see a few things that look 
 suspicious.
 
 In the ipoib_cm_send() path no priv-lock is held, whereas the priv-lock is 
 held before 
 calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch 
 (fix dangling pointer).
 Is this a potential race?

ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected
by netif_tx_lock(dev) or stopping the ipoib network device.
It all depends on what pointer or data structure you think is being
accessed while free or being modified without the proper protection.

 In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, 
 tx-flags) in 
 ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's 
 patch) there is a 
 clear_bit(IPOIB_FLAG_OPER_UP, tx-flags) called before calling 
 ipoib_cm_destroy_tx() only in 
 select cases. Was that intended?

The v4 patch comments explain the changes:
http://www.spinics.net/lists/linux-rdma/msg03733.html
Basically, IPOIB_FLAG_INITIALIZED now means that the struct ipoib_cm_tx
has completed the RC QP creation process via the CM instead of simply
when ipoib_cm_create_tx() allocates the structure.
The test and clear was used to indicate the struct ipoib_cm_tx
had been put on the destroy list and the reaper thread woken up.
Now ipoib_cm_destroy_tx() uses the tx-neigh pointer != NULL to
indicate that ipoib_cm_destroy_tx() has started the destroy process.
ipoib_cm_destroy_tx() is only called when netif_tx_lock() and priv-lock
are held to protect tx-neigh.

 Thanks
 Pradeep

The longer write up on locking is turning out to be very complex.
I will keep working on it but I think it will be just as hard
to understand as slogging through the code.

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?

2010-07-14 Thread Pradeep Satyanarayana
Pradeep Satyanarayana wrote:
 Roland Dreier wrote:
   I guess I came to a premature conclusion. One set of tests ran fine and I 
 made that
   conclusion. Another set of tests caused the following crash:

 I don't really know how to interpret this.  Is this crash new, or is it
 the same crash you were hoping this patch fixed?
 
 This is a new crash.

I see other manifestations resulting in different crashes :

:mon t
[c0074603ba20] d000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
[c0074603bb10] d00019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
[c0074603bbe0] d00019358558 .ipoib_mcast_restart_task+0x3d0/0x560 
[ib_ipoib]
[c0074603bd40] c00c6fe4 .run_workqueue+0xf4/0x1e0
[c0074603be00] c00c7190 .worker_thread+0xc0/0x180
[c0074603bed0] c00ccf4c .kthread+0xb4/0xc0
[c0074603bf90] c00309fc .kernel_thread+0x54/0x70
9:mon e
cpu 0x9: Vector: 300 (Data Access) at [c0074603b720]
pc: c05ac390: ._spin_lock+0x20/0xc8
lr: d000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
sp: c0074603b9a0
   msr: 80009032
   dar: 3a0
 dsisr: 4000
  current = 0xc00756ce8b00
  paca= 0xc0f63800
pid   = 18095, comm = ipoib
9:mon

Thanks
Pradeep

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?

2010-07-13 Thread Pradeep Satyanarayana
Roland Dreier wrote:
   I guess I came to a premature conclusion. One set of tests ran fine and I 
 made that
   conclusion. Another set of tests caused the following crash:
 
 I don't really know how to interpret this.  Is this crash new, or is it
 the same crash you were hoping this patch fixed?

This is a new crash.

Thanks
Pradeep

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?

2010-07-12 Thread Bart Van Assche
On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana
prade...@linux.vnet.ibm.com wrote:
 I realize that the following patch:

 https://patchwork.kernel.org/patch/97243/

 is queued in your backlog of patches and unlikely that it will go into 2.6.35.
 What are the chances that it will make it into 2.6.36? This patch has fixed a
 a rarely seen crash and we would like it to go upstream ASAP.

The following comment was made on that patch by Ralph Campbell (see
also http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg03125.html):

Quite right. I should also use list_for_each_entry_safe(). I will fix this.

This makes me wonder whether version three of this patch can go in unmodified ?

Bart.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?

2010-07-12 Thread Pradeep Satyanarayana
Bart Van Assche wrote:
 On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana
 prade...@linux.vnet.ibm.com wrote:
 I realize that the following patch:

 https://patchwork.kernel.org/patch/97243/

 is queued in your backlog of patches and unlikely that it will go into 
 2.6.35.
 What are the chances that it will make it into 2.6.36? This patch has fixed a
 a rarely seen crash and we would like it to go upstream ASAP.
 
 The following comment was made on that patch by Ralph Campbell (see
 also http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg03125.html):
 
 Quite right. I should also use list_for_each_entry_safe(). I will fix this.
 
 This makes me wonder whether version three of this patch can go in unmodified 
 ?

There was a version 4 that followed. That was what I was referring to.

Thanks
Pradeep

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?

2010-07-12 Thread Bart Van Assche
On Mon, Jul 12, 2010 at 12:20 PM, Pradeep Satyanarayana
prade...@linux.vnet.ibm.com wrote:
 Bart Van Assche wrote:
 On Mon, Jul 12, 2010 at 6:57 AM, Pradeep Satyanarayana
 prade...@linux.vnet.ibm.com wrote:
 I realize that the following patch:

 https://patchwork.kernel.org/patch/97243/

 is queued in your backlog of patches and unlikely that it will go into 
 2.6.35.
 What are the chances that it will make it into 2.6.36? This patch has fixed 
 a
 a rarely seen crash and we would like it to go upstream ASAP.

 The following comment was made on that patch by Ralph Campbell (see
 also http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg03125.html):

 Quite right. I should also use list_for_each_entry_safe(). I will fix this.

 This makes me wonder whether version three of this patch can go in 
 unmodified ?

 There was a version 4 that followed. That was what I was referring to.

Thanks for the info -- I had missed version four of that patch. Now
that I had a look at it, why does the comment above
ipoib_cm_flush_path() say that it removes all entries while the loop
inside that function is stopped after the first entry has been found
and removed ? Why does that function use list_for_each_entry_safe()
while only a single entry is removed ?

Bart.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?

2010-07-12 Thread Roland Dreier
  I realize that the following patch: 
  
  https://patchwork.kernel.org/patch/97243/
  
  is queued in your backlog of patches and unlikely that it will go into 
  2.6.35.
  What are the chances that it will make it into 2.6.36? This patch has fixed a
  a rarely seen crash and we would like it to go upstream ASAP.

How much testing and/or review have you done for this patch?  What is
the impact of the bug in the cases you have seen?

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?

2010-07-11 Thread Pradeep Satyanarayana
Roland,

I realize that the following patch: 

https://patchwork.kernel.org/patch/97243/

is queued in your backlog of patches and unlikely that it will go into 2.6.35.
What are the chances that it will make it into 2.6.36? This patch has fixed a
a rarely seen crash and we would like it to go upstream ASAP.

Thanks
Pradeep

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html