[openib-general] Re: [PATCH]Repost: IPoIB skb panic
Roland,

Can you post a recipe to reproduce the crash?

It happened on a 32-node cluster (each node has 8 dual-core CPUs) running IBM applications over IPoIB.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] Re: [PATCH]Repost: IPoIB skb panic
> More clarification: we saw two races here:
> 1. path_free() was called by both unicast_arp_send() and ipoib_flush_paths() in the same time.

It is not possible to call path_free() on the same object from both unicast_arp_send() and ipoib_flush_paths(). This is because unicast_arp_send() calls it only for newly created objects for which path_rec_create() failed, in which case the object was never inserted into the list or the rb_tree.

> 2. during unicast arp skb retransmission, unicast_arp_send() appended the skb on the list, while ipoib_flush_paths() calling path_free() to free the same skb from the list.

I don't see any issue here either.

Can you reproduce the crash? If you can, could you send the steps?
Re: [openib-general] Re: [PATCH]Repost: IPoIB skb panic
Ohmm. That's a myth. So this problem is hardware independent, right?

It's not easy to reproduce. An ifconfig up/down stress test hits the problem occasionally.

Thanks
Shirley Ma
IBM Linux Technology Center
[openib-general] Re: [PATCH]Repost: IPoIB skb panic
> 1. path_free() should call dev_kfree_skb_any() (any context) instead of dev_kfree_skb_irq() (irq context) since it is called in process context.

Agree -- although actually in the current code, plain dev_kfree_skb() would be fine. In fact, since your patch moves the free inside a spinlock, dev_kfree_skb_irq() would be correct.

> 2. path->queue should be protected by priv->lock since there is a race between unicast_arp_send() and ipoib_flush_paths() to release skb when bringing interface down. It's safe to use priv->lock, because skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE, which is 3.

I'm having a hard time understanding this race. path_free() should never be called on paths that are reachable via the list of paths or the rb-tree of paths, and unicast_arp_send() should never touch a path that is going to path_free().

Also, it seems that if there is a race here then this fix is insufficient, because path_free() does a kfree() on the whole path structure, which would lead to a use-after-free if unicast_arp_send() might still touch it.

So could you diagram the race you are seeing? (i.e., what are the two different threads doing that causes a problem?)

Thanks,
  Roland
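To make the proposal in point 2 concrete, here is a minimal user-space sketch of a per-path packet queue whose length check and insertion happen under one lock. It is a model only, not the IPoIB code: `struct qpkt`, `struct path_model` and the `path_model_*` functions are hypothetical names, a pthread mutex stands in for priv->lock, and the bound of 3 is the value quoted above for IPOIB_MAX_PATH_REC_QUEUE.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_PATH_REC_QUEUE 3      /* value quoted in the thread */

struct qpkt {                     /* hypothetical stand-in for struct sk_buff */
    struct qpkt *next;
};

struct path_model {               /* hypothetical stand-in for struct ipoib_path */
    pthread_mutex_t lock;         /* plays the role of priv->lock */
    struct qpkt *head;
    struct qpkt **tail;           /* points at the last next-pointer */
    int qlen;
};

void path_model_init(struct path_model *p)
{
    pthread_mutex_init(&p->lock, NULL);
    p->head = NULL;
    p->tail = &p->head;
    p->qlen = 0;
}

/* Sender side: returns 1 if queued, 0 if the queue is full. */
int path_model_enqueue(struct path_model *p, struct qpkt *pkt)
{
    int queued = 0;

    pthread_mutex_lock(&p->lock);
    if (p->qlen < MAX_PATH_REC_QUEUE) {   /* check and insert under one lock */
        pkt->next = NULL;
        *p->tail = pkt;
        p->tail = &pkt->next;
        p->qlen++;
        queued = 1;
    }
    pthread_mutex_unlock(&p->lock);
    return queued;
}

/* Flush side: unlink every queued packet under the lock, return the count. */
int path_model_purge(struct path_model *p)
{
    struct qpkt *pkt;
    int n = 0;

    pthread_mutex_lock(&p->lock);
    while ((pkt = p->head) != NULL) {
        p->head = pkt->next;
        pkt->next = NULL;
        n++;
    }
    assert(n == p->qlen);                 /* counter must match the chain */
    p->tail = &p->head;
    p->qlen = 0;
    pthread_mutex_unlock(&p->lock);
    return n;
}
```

Because the queue never holds more than three packets, the time spent under the lock stays small. Note that this only models the queue itself; it does not address the objection that freeing the whole path structure could still leave a sender with a stale pointer.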
[openib-general] Re: [PATCH]Repost: IPoIB skb panic
Roland,

More clarification: we saw two races here:

1. path_free() was called by both unicast_arp_send() and ipoib_flush_paths() at the same time.

    0xc004bff0a0d031 10 R 0xc004bff0a580 *ksoftirqd/0
    SP(esp)         PC(eip)     Function(args)
    0xcf707c80      0xc03199d0  .skb_release_data +0x7c
    0xcf707c80      0xc0319688  (lr) .kfree_skbmem +0x20
    0xcf707d10      0xc0319688  .kfree_skbmem +0x20
    0xcf707da0      0xc03197fc  .__kfree_skb +0x148
    0xcf707e50      0xc031e2a8  .net_tx_action +0xa4
    0xcf707f00      0xc006ab38  .__do_softirq +0xa8
    0xcf707f90      0xc00177b0  .call_do_softirq +0x14
    0xc000cff83d90  0xc0012064  .do_softirq +0x90
    0xc000cff83e20  0xc006b0fc  .ksoftirqd +0xfc
    0xc000cff83ed0  0xc0081d74  .kthread +0x17c
    0xc000cff83f90  0xc0017d24  .kernel_thread +0x4c
    KERNEL: assertion (!atomic_read(&skb->users)) failed at net/core/dev.c

2. During unicast ARP skb retransmission, unicast_arp_send() appended the skb to the list while ipoib_flush_paths() was calling path_free() to free the same skb from the list.

    <3>KERNEL: assertion (!atomic_read(&skb->users)) failed at net/core/dev.c (1742)
    <4>Warning: kfree_skb passed an skb still on a list (from c031e2a8).
    <2>kernel BUG in __kfree_skb at net/core/skbuff.c:225!

(sles9 sp3 kernel)

    void __kfree_skb(struct sk_buff *skb)
    {
            if (skb->list) {
                    printk(KERN_WARNING "Warning: kfree_skb passed an skb still on a list (from %p).\n",
                           NET_CALLER(skb));
                    BUG();
            }

The patch will fix both problems by using priv->lock to protect the path->queue list. Am I right?

Thanks
Shirley Ma
IBM LTC
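The check quoted from __kfree_skb() can be illustrated with a small user-space model: each packet records which queue currently holds it, and the free routine refuses a packet that is still queued. This is a sketch under assumptions, not kernel code: `struct mpkt`, `struct mqueue` and the `mpkt_*` functions are hypothetical, and returning -1 stands in for the BUG() in the excerpt.

```c
#include <assert.h>
#include <stddef.h>

struct mqueue;

struct mpkt {                     /* hypothetical stand-in for struct sk_buff */
    struct mpkt *next, *prev;
    struct mqueue *list;          /* queue holding this packet, or NULL */
    int freed;                    /* set once the model has "freed" it */
};

struct mqueue {                   /* hypothetical stand-in for sk_buff_head */
    struct mpkt *head, *tail;
    int qlen;
};

/* Append a packet and record its owning queue. */
void mpkt_queue_tail(struct mqueue *q, struct mpkt *pkt)
{
    assert(pkt->list == NULL);    /* a packet must not be queued twice */
    pkt->next = NULL;
    pkt->prev = q->tail;
    if (q->tail)
        q->tail->next = pkt;
    else
        q->head = pkt;
    q->tail = pkt;
    pkt->list = q;
    q->qlen++;
}

/* Remove the first packet and clear its owner; NULL if the queue is empty. */
struct mpkt *mpkt_dequeue(struct mqueue *q)
{
    struct mpkt *pkt = q->head;

    if (!pkt)
        return NULL;
    q->head = pkt->next;
    if (q->head)
        q->head->prev = NULL;
    else
        q->tail = NULL;
    pkt->next = pkt->prev = NULL;
    pkt->list = NULL;
    q->qlen--;
    return pkt;
}

/* Model of the quoted check: refuse to free a packet still on a queue. */
int mpkt_free(struct mpkt *pkt)
{
    if (pkt->list != NULL)
        return -1;                /* the kernel excerpt calls BUG() here */
    pkt->freed = 1;
    return 0;
}
```

In this model the warning in the log corresponds to mpkt_free() being reached for a packet whose `list` field is still set, which is what happens if one thread queues a packet while another is already freeing it.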
[openib-general] Re: [PATCH]Repost: IPoIB skb panic
> 2. during unicast arp skb retransmission, unicast_arp_send() appended the skb on the list, while ipoib_flush_paths() calling path_free() to free the same skb from the list.

I think I see what's going on. The skb ends up being on two lists at once, I guess...

 - R.
[openib-general] Re: [PATCH]Repost: IPoIB skb panic
On Fri, 2006-06-02 at 16:15 -0700, Roland Dreier wrote:
> > 2. during unicast arp skb retransmission, unicast_arp_send() appended the skb on the list, while ipoib_flush_paths() calling path_free() to free the same skb from the list.
>
> I think I see what's going on. the skb ends up being on two lists at once I guess...
>
>  - R.

The skb has only one prev pointer and one next pointer, so it can only be on one list at a time. How could an skb be on two lists at once?

Thanks
Shirley
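The point about a single pair of link pointers can be shown with a short user-space sketch. It uses a circular doubly linked list with a sentinel head, which is an assumption chosen to resemble how skb queues are laid out; `struct dnode` and the `demo_*` functions are hypothetical. Appending a node to a second list without unlinking it first does not put it on both lists: it overwrites the node's pointers and leaves the first list corrupted.

```c
#include <assert.h>
#include <stddef.h>

struct dnode {                    /* hypothetical node with one prev and one next */
    struct dnode *next, *prev;
};

/* An empty circular list is a head that points at itself. */
void demo_list_init(struct dnode *head)
{
    head->next = head;
    head->prev = head;
}

/* Append n to the list, without checking whether n is linked elsewhere. */
void demo_list_add_tail_unchecked(struct dnode *head, struct dnode *n)
{
    assert(head != NULL && n != NULL);
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Follow next pointers from head; return 1 if the walk gets back to head
 * within limit steps, 0 if it does not (the list is corrupted). */
int demo_list_walk_returns_to_head(const struct dnode *head, int limit)
{
    const struct dnode *n = head->next;
    int steps;

    for (steps = 0; steps < limit; steps++) {
        if (n == head)
            return 1;
        n = n->next;
    }
    return 0;
}
```

After adding the same node to list A and then to list B, a walk of B still terminates, but a walk of A enters the node and follows its next pointer into B's head, never returning to A. So "on two lists at once" can only mean that one list has a dangling reference to the node.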
[openib-general] Re: [PATCH]Repost: IPoIB skb panic
> The skb has only one prev, one next pointers, it can only be on one list at a time. How could skb go on two lists at once?

Good question. Actually, I misunderstood things before.

I don't see any way that path_free() and unicast_arp_send() can be operating on the same struct ipoib_path at the same time. And I don't see how unicast_arp_send() could be handling an skb that's already queued in a path's queue.

path_free() only gets called from ipoib_flush_paths() after the path has been removed from the list of paths and the rb_tree of paths (both protected by priv->lock), so unicast_arp_send() wouldn't find the path to queue an skb. And ipoib_flush_paths() can't find a new path created by unicast_arp_send().

Obviously I'm missing something, but I still don't see the real cause of your crash.

 - R.
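The argument above relies on an "unlink under the lock, free afterwards" pattern. Here is a minimal user-space sketch of that pattern, assuming a simple singly linked table and a pthread mutex in place of priv->lock; `struct path_entry`, `struct path_table` and the `table_*` functions are hypothetical and do not reproduce the driver's data structures.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct path_entry {               /* hypothetical stand-in for struct ipoib_path */
    int key;                      /* stands in for the destination address */
    struct path_entry *next;
};

struct path_table {
    pthread_mutex_t lock;         /* plays the role of priv->lock */
    struct path_entry *paths;
};

void table_init(struct path_table *t)
{
    pthread_mutex_init(&t->lock, NULL);
    t->paths = NULL;
}

/* Insert a new path; returns 0 on success, -1 if allocation fails. */
int table_add(struct path_table *t, int key)
{
    struct path_entry *e = malloc(sizeof(*e));

    if (!e)
        return -1;
    e->key = key;
    pthread_mutex_lock(&t->lock);
    e->next = t->paths;
    t->paths = e;
    pthread_mutex_unlock(&t->lock);
    return 0;
}

/* Sender side: report whether a path is reachable through the table. */
int table_has(struct path_table *t, int key)
{
    struct path_entry *e;
    int found = 0;

    pthread_mutex_lock(&t->lock);
    for (e = t->paths; e; e = e->next) {
        if (e->key == key) {
            found = 1;
            break;
        }
    }
    pthread_mutex_unlock(&t->lock);
    return found;
}

/* Flush side: detach the whole chain under the lock, then free the
 * detached entries without holding it. Returns the number freed. */
int table_flush(struct path_table *t)
{
    struct path_entry *detached, *e;
    int n = 0;

    assert(t != NULL);
    pthread_mutex_lock(&t->lock);
    detached = t->paths;
    t->paths = NULL;
    pthread_mutex_unlock(&t->lock);

    while ((e = detached) != NULL) {
        detached = e->next;
        free(e);
        n++;
    }
    return n;
}
```

Once table_flush() has dropped the lock, no later lookup can reach the detached entries, which is the property the reasoning above depends on. The sketch deliberately returns only a flag from table_has(): a caller that obtained a pointer under the lock and kept using it after unlocking would not be protected by this pattern, and the sketch makes no claim about whether that happens in the driver.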