[openib-general] Re: [PATCH]Repost: IPoIB skb panic

2006-06-07 Thread Shirley Ma

Roland,

> Can you post a recipe to reproduce the crash?

It happened on a 32-node cluster (each node has 8 dual-core CPUs)
running IBM applications over IPoIB.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] Re: [PATCH]Repost: IPoIB skb panic

2006-06-04 Thread Eli Cohen
> More clarification: we saw two races here:
> 1. path_free() was called by both unicast_arp_send() and
> ipoib_flush_paths() at the same time.

It is not possible to call path_free() on the same object from both
unicast_arp_send() and ipoib_flush_paths(). This is because
unicast_arp_send() calls it only for newly created objects for which
path_rec_create() failed, in which case the object was never inserted
into the list or the rb_tree.

> 2. During unicast ARP skb retransmission, unicast_arp_send() appended
> the skb to the list, while ipoib_flush_paths() was calling path_free() to
> free the same skb from the list.

I don't see any issue here either.

Can you reproduce the crash? If so, can you describe how?




Re: [openib-general] Re: [PATCH]Repost: IPoIB skb panic

2006-06-04 Thread Shirley Ma

Hmm. That's a mystery. So this problem is hardware independent, right?
It's not easy to reproduce. An ifconfig up/down stress test hits this
problem occasionally.

thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

[openib-general] Re: [PATCH]Repost: IPoIB skb panic

2006-06-02 Thread Roland Dreier
> 1. path_free() should call dev_kfree_skb_any() (any context) instead of
> dev_kfree_skb_irq() (irq context) since it is called in process
> context.

Agree -- although actually in the current code, plain dev_kfree_skb()
would be fine.  In fact, since your patch moves the free inside a
spinlock, dev_kfree_skb_irq() would be correct.
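
A rough userspace sketch of the distinction between the three free variants, not kernel code: as far as I recall, dev_kfree_skb_any() simply picks the deferred free when running in hard-irq context or with interrupts disabled, and the immediate free otherwise. Everything named here (struct ctx, struct buf, the model_* functions and the counters) is a hypothetical stand-in; the kernel derives the context from in_irq() and irqs_disabled() rather than from an explicit argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of the execution context. */
struct ctx {
    bool in_hardirq;
    bool irqs_disabled;   /* e.g. inside spin_lock_irqsave() */
};

/* Hypothetical stand-in for struct sk_buff. */
struct buf {
    struct buf *next;     /* used only while on the deferred list */
};

struct buf *deferred_list;   /* models the per-CPU completion queue */
int freed_now;               /* buffers released immediately */
int freed_later;             /* buffers released by the drain step */

/* Models dev_kfree_skb(): release right away. */
void model_kfree_skb(struct buf *b)
{
    assert(b != NULL);
    free(b);
    freed_now++;
}

/* Models dev_kfree_skb_irq(): only queue the buffer for later release. */
void model_kfree_skb_irq(struct buf *b)
{
    assert(b != NULL);
    b->next = deferred_list;
    deferred_list = b;
}

/* Models dev_kfree_skb_any(): pick a variant based on the context. */
void model_kfree_skb_any(const struct ctx *c, struct buf *b)
{
    if (c->in_hardirq || c->irqs_disabled)
        model_kfree_skb_irq(b);
    else
        model_kfree_skb(b);
}

/* Models the softirq side (net_tx_action) draining the deferred list. */
void model_drain_deferred(void)
{
    while (deferred_list) {
        struct buf *b = deferred_list;

        deferred_list = b->next;
        free(b);
        freed_later++;
    }
}
```

In this model, a free issued from plain process context is released at once, while a free issued with interrupts disabled (as it would be inside an irqsave spinlock) is only released when model_drain_deferred() runs.
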

> 2. path->queue should be protected by priv->lock since there is a race
> between unicast_arp_send() and ipoib_flush_paths() to release skb when
> bringing interface down. It's safe to use priv->lock, because
> skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE, which is 3.

I'm having a hard time understanding this race.  path_free() should
never be called on paths that are reachable via the list of paths or
the rb-tree of paths, and unicast_arp_send() should never touch a path
that is about to be passed to path_free().

Also, it seems if there is a race here then this fix is insufficient,
because path_free() does a kfree() on the whole path structure, which
would lead to use-after-free if unicast_arp_send() might still touch it.

So could you diagram the race you are seeing?  (i.e., what are the two
different threads doing that causes a problem?)

Thanks,
  Roland



[openib-general] Re: [PATCH]Repost: IPoIB skb panic

2006-06-02 Thread Shirley Ma
Roland,

More clarification: we saw two races here:
1. path_free() was called by both unicast_arp_send() and
ipoib_flush_paths() at the same time.
0xc004bff0a0d031  10   R  0xc004bff0a580
*ksoftirqd/0
  SP(esp)PC(eip)  Function(args)
0xcf707c80  0xc03199d0  .skb_release_data +0x7c
0xcf707c80  0xc0319688 (lr) .kfree_skbmem +0x20
0xcf707d10  0xc0319688  .kfree_skbmem +0x20
0xcf707da0  0xc03197fc  .__kfree_skb +0x148
0xcf707e50  0xc031e2a8  .net_tx_action +0xa4
0xcf707f00  0xc006ab38  .__do_softirq +0xa8
0xcf707f90  0xc00177b0  .call_do_softirq +0x14
0xc000cff83d90  0xc0012064  .do_softirq +0x90
0xc000cff83e20  0xc006b0fc  .ksoftirqd +0xfc
0xc000cff83ed0  0xc0081d74  .kthread +0x17c
0xc000cff83f90  0xc0017d24  .kernel_thread +0x4c
KERNEL: assertion (!atomic_read(&skb->users)) failed at net/core/dev.c

2. During unicast ARP skb retransmission, unicast_arp_send() appended
the skb to the list, while ipoib_flush_paths() was calling path_free() to
free the same skb from the list.
<3>KERNEL: assertion (!atomic_read(&skb->users)) failed at net/core/dev.c (1742)
<4>Warning: kfree_skb passed an skb still on a list (from c031e2a8).
<2>kernel BUG in __kfree_skb at net/core/skbuff.c:225! (sles9 sp3 kernel)
void __kfree_skb(struct sk_buff *skb)
{
        if (skb->list) {
                printk(KERN_WARNING "Warning: kfree_skb passed an skb still "
                       "on a list (from %p).\n", NET_CALLER(skb));
                BUG();
        }

The patch will fix both problems by using priv->lock to protect the
path->queue list. Am I right?
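
To show the second failure in isolation, here is a small userspace model (not the kernel implementation) of a queue whose buffers carry a back-pointer to the list they are on, in the spirit of the skb->list check in the __kfree_skb() excerpt above. The names struct buf, struct queue, queue_tail(), dequeue() and buf_free_checked() are hypothetical, and the checked free returns false where the kernel would call BUG().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct queue;

/* Hypothetical stand-in for struct sk_buff: one prev/next pair plus a
 * back-pointer to the owning queue. */
struct buf {
    struct buf *next, *prev;
    struct queue *list;    /* NULL when the buffer is not queued */
};

struct queue {
    struct buf head;       /* sentinel node of a circular list */
    size_t len;
};

void queue_init(struct queue *q)
{
    q->head.next = q->head.prev = &q->head;
    q->head.list = NULL;
    q->len = 0;
}

struct buf *buf_alloc(void)
{
    return calloc(1, sizeof(struct buf));
}

/* Append to the tail; a buffer may be on at most one queue. */
void queue_tail(struct queue *q, struct buf *b)
{
    assert(b->list == NULL);
    b->prev = q->head.prev;
    b->next = &q->head;
    q->head.prev->next = b;
    q->head.prev = b;
    b->list = q;
    q->len++;
}

/* Unlink and return the first buffer, or NULL if the queue is empty. */
struct buf *dequeue(struct queue *q)
{
    struct buf *b = q->head.next;

    if (b == &q->head)
        return NULL;
    b->prev->next = b->next;
    b->next->prev = b->prev;
    b->next = b->prev = NULL;
    b->list = NULL;
    q->len--;
    return b;
}

/* Refuse to free a buffer that is still linked on a queue. */
bool buf_free_checked(struct buf *b)
{
    if (b->list != NULL)
        return false;
    free(b);
    return true;
}
```

If one thread runs queue_tail() while another frees the same buffer without first unlinking it, the checked free fails; unlinking with dequeue() first lets it succeed. In a real driver both steps would also need to run under the same lock.
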

Thanks
Shirley Ma
IBM LTC




[openib-general] Re: [PATCH]Repost: IPoIB skb panic

2006-06-02 Thread Roland Dreier
> 2. During unicast ARP skb retransmission, unicast_arp_send() appended
> the skb to the list, while ipoib_flush_paths() was calling path_free() to
> free the same skb from the list.

I think I see what's going on.  The skb ends up being on two lists at
once I guess...

 - R.



[openib-general] Re: [PATCH]Repost: IPoIB skb panic

2006-06-02 Thread Shirley Ma
On Fri, 2006-06-02 at 16:15 -0700, Roland Dreier wrote:
> > 2. During unicast ARP skb retransmission, unicast_arp_send() appended
> > the skb to the list, while ipoib_flush_paths() was calling path_free() to
> > free the same skb from the list.
>
> I think I see what's going on.  The skb ends up being on two lists at
> once I guess...
>
>  - R.

The skb has only one prev and one next pointer, so it can only be on one
list at a time. How could an skb be on two lists at once?

Thanks
Shirley




[openib-general] Re: [PATCH]Repost: IPoIB skb panic

2006-06-02 Thread Roland Dreier
> The skb has only one prev and one next pointer, so it can only be on one
> list at a time. How could an skb be on two lists at once?

Good question.  Actually, I misunderstood things before.  I don't see
any way that path_free() and unicast_arp_send() can be operating on the
same struct ipoib_path at the same time.  And I don't see how
unicast_arp_send() could be handling an skb that's already queued in a
path's queue.

path_free() only gets called from ipoib_flush_paths() after the path
has been removed from the list of paths and the rb_tree of paths (both
protected by priv->lock), so unicast_arp_send() wouldn't find the path
to queue an skb.  And ipoib_flush_paths() can't find a new path
created by unicast_arp_send().
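
The argument above can be sketched in userspace with a pthread mutex standing in for priv->lock. This is only a model under stated assumptions: struct priv, struct path and the priv_* functions are hypothetical, a singly linked list replaces both the path list and the rb_tree, and a counter replaces the skb queue.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct path {
    int id;
    int queued;            /* stands in for the length of the skb queue */
    struct path *next;
};

struct priv {
    pthread_mutex_t lock;  /* stands in for priv->lock */
    struct path *paths;    /* stands in for the path list and rb_tree */
};

void priv_init(struct priv *p)
{
    pthread_mutex_init(&p->lock, NULL);
    p->paths = NULL;
}

bool priv_add_path(struct priv *p, int id)
{
    struct path *n = calloc(1, sizeof(*n));

    if (!n)
        return false;
    n->id = id;
    pthread_mutex_lock(&p->lock);
    n->next = p->paths;
    p->paths = n;
    pthread_mutex_unlock(&p->lock);
    return true;
}

/* Sender side: lookup and enqueue happen in one critical section, so a
 * path that was already detached can never be found here. */
bool priv_queue_on_path(struct priv *p, int id)
{
    bool found = false;

    pthread_mutex_lock(&p->lock);
    for (struct path *it = p->paths; it; it = it->next) {
        if (it->id == id) {
            it->queued++;
            found = true;
            break;
        }
    }
    pthread_mutex_unlock(&p->lock);
    return found;
}

/* Flush side: detach every path under the lock, then free the detached
 * paths without the lock, since nobody else can reach them any more. */
int priv_flush_paths(struct priv *p)
{
    struct path *detached;
    int freed = 0;

    assert(p != NULL);
    pthread_mutex_lock(&p->lock);
    detached = p->paths;
    p->paths = NULL;
    pthread_mutex_unlock(&p->lock);

    while (detached) {
        struct path *next = detached->next;

        free(detached);
        detached = next;
        freed++;
    }
    return freed;
}
```

With this shape, the sender either finds a path before the flush detaches it (and finishes queueing before the lock is released) or does not find it at all, which is why the free itself needs no lock in the model.
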

Obviously I'm missing something but I still don't see the real cause
of your crash.

 - R.