Re: [openib-general] [RFC/BUG] DMA vs. CQ race
Roland Dreier [EMAIL PROTECTED] wrote on 02/27/2007 01:40:36 PM:

Shirley, can you clarify why doing dma_alloc_coherent() in the kernel helps on your Cell blade? It really seems that dma_alloc_coherent() just allocates some memory and then does dma_map(DMA_BIDIRECTIONAL), which would be exactly the same as allocating the CQ buffer in userspace and using ib_umem_get() to map it into the kernel. I'm looking at a possibly cleaner solution to the Altix issue, so I would like to make sure it fixes whatever the bug on Cell is as well. So any details you can provide about the problem you see on Cell would help a lot. Thanks...

Thanks, Roland. After reviewing the whole thread, the failure on Cell is different from the Altix issue, so this fix might not help Cell. The problem I have might be related to multiple DMA mappings to the same CQ; the sync might be getting lost somewhere else.

Thanks
Shirley Ma

___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] Fw: [PATCH] enable IPoIB only if broadcast join finish
Hello Roland, Sorry to bother you again. Could you please review the patch below to see whether it can go upstream soon? IPoIB nodes can't ping each other if the broadcast join succeeds but any other IB multicast join fails (such as the join for the default IPv6 link-local solicited-node address) when bringing the interface up. This impacts IPoIB usability in large clusters where MCG LIDs are limited.

Thanks
Shirley Ma

----- Forwarded by Shirley Ma/Beaverton/IBM on 02/27/07 06:23 AM -----
From: Shirley Ma/Beaverton/IBM@IBMUS
To: Roland Dreier [EMAIL PROTECTED]
Cc: openib-general@openib.org
Date: 02/05/07 06:50 AM
Subject: [openib-general] [PATCH] enable IPoIB only if broadcast join finish

Hi, Roland, Please review this patch. According to IPoIB RFC 4391 section 5, once the IPoIB broadcast group has been joined, the interface should be ready for data transfer. In the current IPoIB implementation, the interface is UP and RUNNING only after all default multicast joins succeed. We hit a problem where the broadcast join finishes successfully but the all-hosts multicast join fails.
Here is the patch; if possible please give your input ASAP, as we have an urgent customer issue that needs to be resolved:

diff -urpN ipoib/ipoib_multicast.c ipoib-multicast/ipoib_multicast.c
--- ipoib/ipoib_multicast.c	2006-11-29 13:57:37.000000000 -0800
+++ ipoib-multicast/ipoib_multicast.c	2007-02-04 22:34:16.000000000 -0800
@@ -402,6 +402,11 @@ static void ipoib_mcast_join_complete(in
 		queue_work(ipoib_workqueue, &priv->mcast_task);
 		mutex_unlock(&mcast_mutex);
 		complete(&mcast->done);
+		/*
+		 * broadcast join finished, enable carrier
+		 */
+		if (mcast == priv->broadcast)
+			netif_carrier_on(dev);
 		return;
 	}
@@ -599,7 +604,6 @@ void ipoib_mcast_join_task(void *dev_ptr
 	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
 	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
-	netif_carrier_on(dev);
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)

(See attached file: ipoib-multicast.patch)

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
Re: [openib-general] IPOIB NAPI
Roland Dreier [EMAIL PROTECTED] wrote on 02/26/2007 02:36:26 PM:

No way, it's way too late at this point to change the kernel-user ABI, let alone change all ULPs. - R.

Hello Roland, So has IBV_CQ_REPORT_MISSED_EVENTS been part of OFED-1.2 already? I can generate the patch for all ULPs to use this for review. Do you need me to do that?

Thanks
Shirley Ma
Re: [openib-general] Fw: [PATCH] enable IPoIB only if broadcast join finish
Roland Dreier [EMAIL PROTECTED] wrote on 02/27/2007 02:35:34 PM:

I don't think this applies any more since Sean's multicast stuff was merged. I didn't realize you wanted to get this merged upstream -- anyway, can you please regenerate the patch against the latest kernel? Thanks

Sure. I will generate a new patch.

Thanks
Shirley Ma
Re: [openib-general] IPOIB NAPI
Roland Dreier [EMAIL PROTECTED] wrote on 02/27/2007 02:41:44 PM:

So has IBV_CQ_REPORT_MISSED_EVENTS been part of OFED-1.2 already? I can generate the patch for all ULPs to use this for review. Do you need me to do that?

No, it's not in OFED 1.2 or the upstream kernel. And no one has implemented it for userspace (and I'm somewhat reluctant to break the ABI at this point without some performance numbers to motivate making this API change). Have the NAPI performance problems with ehca been resolved? We could probably merge IPoIB NAPI for 2.6.22 then, which would pull in the kernel changes at least.

- R.

We have addressed the NAPI performance issues with the ehca driver; I believe the patches are upstream. However, the test results show that it's better to delay polling again until the next NAPI interval, something like this:

poll CQ
notify CQ
if missed_event
	netif_rx_reschedule()
	return 1

vs.

poll CQ
notify CQ
if missed_event
	netif_rx_reschedule()
	poll again
return 0

It seems ehca delivers packets much faster than other HCAs, so polling again would stay in the loop many, many times. The above change doesn't impact other HCAs, so I would recommend it. I have seen the same approach in other Ethernet drivers.

Thanks
Shirley Ma
Re: [openib-general] Fw: [PATCH] enable IPoIB only if broadcast join finish
Hello Roland, Here is the new patch, against the 2.6.20-rc1 kernel. Please review it.

diff -urpN ipoib/ipoib_multicast.c ipoib-link/ipoib_multicast.c
--- ipoib/ipoib_multicast.c	2007-02-27 07:21:50.000000000 -0800
+++ ipoib-link/ipoib_multicast.c	2007-02-27 07:52:10.000000000 -0800
@@ -407,6 +407,11 @@ static int ipoib_mcast_join_complete(int
 		queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0);
 		mutex_unlock(&mcast_mutex);
+		/*
+		 * broadcast join finished, enable carrier
+		 */
+		if (unlikely(mcast == priv->broadcast))
+			netif_carrier_on(dev);
 		return 0;
 	}
@@ -596,7 +601,6 @@ void ipoib_mcast_join_task(struct work_s
 	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
 	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
-	netif_carrier_on(dev);
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)

(See attached file: ipoib-link.patch)

Thanks
Shirley Ma
[OFA General] Re: [openib-general] IPOIB NAPI
I'm confused. Which one is faster?

Sorry for the confusion, Michael. The one with return 1 has better throughput.

Thanks
Shirley Ma

___ general mailing list [EMAIL PROTECTED] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] ib0 interface up but can't ping
If your subnet already has an SM running, look at the ifconfig output. If the interface ib0 is UP but not RUNNING, you can't ping, since the carrier is not on. Also look at /var/log/messages to see whether there are any errors.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
Re: [openib-general] IPOIB NAPI
Roland, Yes. It would be good to reduce the number of interrupts by changing all upper-layer protocols to use:

poll CQ
notify CQ (rotting-packet notification)
poll again

instead of:

notify CQ
poll CQ

If possible, can this be in OFED-1.2?

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
Re: [openib-general] [RFC/BUG] DMA vs. CQ race
Hmm, OK. Then I will do my best to make sure we get a fix for this into 2.6.22.

That would be great. We hit a similar problem in our cluster test -- data corruption because of this race.

Thanks
Shirley Ma
Re: [openib-general] [RFC/BUG] DMA vs. CQ race
Roland Dreier [EMAIL PROTECTED] wrote on 02/26/2007 02:09:48 PM:

That would be great. We hit a similar problem in our cluster test -- data corruption because of this race.

On what platform? - R.

On our Cell blade + PCI-e Mellanox.

Thanks
Shirley Ma
Re: [openib-general] IPv6oIB neighbour discover broken when MCGs overflow
Roland, Thanks for your quick response. Even if the SM supports 1000 MCGs, it's still not sufficient for a 250-node cluster where each node has 4 links for IPv6, without any scoped/global IPv6 addresses configured (250*4 + a few default MCGs). There will be an MCG overflow problem in IPv6oIB anyway.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

Roland Dreier [EMAIL PROTECTED] wrote on 02/16/2007 09:00 AM (cc Michael S. Tsirkin [EMAIL PROTECTED], openib-general@openib.org), Subject: Re: IPv6oIB neighbour discover broken when MCGs overflow:

We have a customer issue regarding IPv6oIB. In the subnet, there are a limited number of MCGs supported. So when multiple IPv6 addresses are assigned to one interface, each IPv6 address will have one unique solicited-node address (depending on its group ID). Then in a large subnet, we will have tons of MCGs. If the IPv6 solicited-node addresses exceed the number of MCGs in this subnet, then IPv6 neighbour discovery will be broken; this doesn't happen in Ethernet since send-only doesn't require the sender to join any MCG. I have done an initial patch to address the MCG overflow problem by redirecting the solicited-node address to the all-hosts node address, so that IPv6 neighbour discovery will work no matter how many IPv6 addresses are in this subnet. This patch is only triggered with IPv6 enabled and MCG overflow, so there is almost no performance penalty.

I really don't like this approach, since it can break things in very subtle ways (e.g. suppose one node fails to join its solicited-node group, but then a later node wants to talk to it and succeeds in joining the solicited-node group as a send-only member -- since the first node is not a member, it will never see the ND messages). I much prefer to fix the SM not to impose too-low limits on the number of MCGs. Supporting O(# nodes) MCGs is really not a very onerous requirement on the SM.

- R.
Re: [openib-general] IPv6oIB neighbour discover broken when MCGs overflow
Roland, I really don't like this approach, since it can break things in very subtle ways (e.g. suppose one node fails to join its solicited-node group, but then a later node wants to talk to it and succeeds in joining the solicited-node group as a send-only member -- since the first node is not a member, it will never see the ND messages).

For a successful join, ND sends to the node directly; for a failed join, ND sends to the all-hosts address. So ND will work whether the join succeeds or not; that's what the patch does.

Thanks
Shirley Ma
Re: [openib-general] IPv6oIB neighbour discover broken when MCGs overflow
Roland Dreier [EMAIL PROTECTED] wrote on 02/16/2007 09:49:24 AM:

For a successful join, ND sends to the node directly; for a failed join, ND sends to the all-hosts address. So ND will work whether the join succeeds or not; that's what the patch does.

But what if the full-member join fails on node A for node A's solicited-node group, but then node B succeeds in joining that group as a send-only member (perhaps because some other nodes have dropped off the fabric in the meantime)? Then node B will send the ND message on an MCG that A is not a member of.

- R.

Yes. B can send ND to A, and A responds without being a member, so IPv6 ND works. Is there any security or other problem here?

Thanks
Shirley Ma
Re: [openib-general] IPv6oIB neighbour discover broken when MCGs overflow
Roland Dreier [EMAIL PROTECTED] wrote on 02/16/2007 10:10:55 AM:

But what if the full-member join fails on node A for node A's solicited-node group, but then node B succeeds in joining that group as a send-only member (perhaps because some other nodes have dropped off the fabric in the meantime)? Then node B will send the ND message on an MCG that A is not a member of.

Yes. B can send ND to A, and A responds without being a member, so IPv6 ND works. Is there any security or other problem here?

Node A is not a member of the group B is sending on, so the SM does not have to set up any routes for the messages to even reach node A. So it doesn't see the messages and doesn't respond to ND.

- R.

Two MCGs must be established before the IPoIB link comes up: the broadcast group for IPv4 and the all-hosts multicast group for IPv6. So node A is a member of the all-hosts group; the patch redirects ND to the all-hosts group, so node A responds to it.

Thanks
Shirley Ma
Re: [openib-general] IPv6oIB neighbour discover broken when MCGs overflow
Roland Dreier [EMAIL PROTECTED] wrote on 02/16/2007 09:25:30 AM:

Even if the SM supports 1000 MCGs, it's still not sufficient for a 250-node cluster where each node has 4 links for IPv6 without any scoped/global IPv6 addresses configured (250*4 + a few default MCGs). There will be an MCG overflow problem in IPv6oIB anyway.

But what's the problem with supporting 1000 or even 1 MCGs? - R.

I am not sure whether I understand your question; I am trying to answer it, so please let me know whether I am wrong. Each IPv6 link-local address creates a unique solicited-node multicast address, which creates a unique full-member IB MCG; every other IPv6 address creates a solicited-node multicast address, which may or may not be unique depending on its group ID. So when the IPv6 module is loaded in the kernel (or it might become part of the kernel in the future), the SM will see more than 1000 MCGs when the IPoIB link comes up. Some nodes can't join their MCGs, and then IPv6 ND is broken for the nodes whose joins failed.

Shirley Ma
Re: [openib-general] IPv6oIB neighbour discover broken when MCGs overflow
Roland Dreier [EMAIL PROTECTED] wrote on 02/16/2007 01:55:06 PM:

Two MCGs must be established before the IPoIB link comes up: the broadcast group for IPv4 and the all-hosts multicast group for IPv6. So node A is a member of the all-hosts group; the patch redirects ND to the all-hosts group, so node A responds to it.

I'm still confused. How do you interoperate with other RFC-compliant nodes (they might not have your patch or might not even be running Linux) that send ND messages to the solicited-node group? If node A has your patch and doesn't try to join its own solicited-node group, then another node that doesn't know to send ND messages to the all-nodes group will not be able to find it.

- R.

All nodes in the subnet join the all-hosts multicast group by default. What the patch does differently than before is that on a join failure, ND is sent to the all-hosts multicast group instead of a particular solicited-node multicast address; the node with the destination solicited-node multicast address will respond to it, so the network does not lose connectivity when MCGs overflow. There is no interoperability issue here between patched and unpatched nodes, or between Linux and non-Linux nodes. I don't think the IPoIB RFC covers this corner case, so there is no RFC-compliance problem here. I will discuss this with the author.

Thanks
Shirley Ma
Re: [openib-general] [PATCH] enable IPoIB only if broadcast join finish
Roland, Could you please review this patch when you have time? I am looking forward to your comments; it addresses a customer issue. Appreciate your help.

Thanks
Shirley Ma
Re: [openib-general] [PATCH] enable IPoIB only if broadcast join finish
Thanks Roland, I will apply the patch to the customer's cluster. The problem I found is that when failover brings a new IPoIB interface up in the existing fabric, with a limited number of multicast groups in our configuration, the interface can join the broadcast group successfully but the all-hosts multicast group join fails. The IB interface is then UP but not RUNNING, and the interface doesn't work at all.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

Roland Dreier [EMAIL PROTECTED] wrote on 02/06/2007 11:59 AM (cc Michael S. Tsirkin [EMAIL PROTECTED], openib-general@openib.org), Subject: Re: [PATCH] enable IPoIB only if broadcast join finish:

Here is the patch, if possible please give your input asap, we have an urgent customer issue need to be resolved:

I guess this is OK, but what is the urgent issue it fixes?

- R.
[openib-general] [PATCH] enable IPoIB only if broadcast join finish
Hi, Roland, Please review this patch. According to IPoIB RFC 4391 section 5, once the IPoIB broadcast group has been joined, the interface should be ready for data transfer. In the current IPoIB implementation, the interface is UP and RUNNING only after all default multicast joins succeed. We hit a problem where the broadcast join finishes successfully but the all-hosts multicast join fails. Here is the patch; if possible please give your input ASAP, as we have an urgent customer issue that needs to be resolved:

diff -urpN ipoib/ipoib_multicast.c ipoib-multicast/ipoib_multicast.c
--- ipoib/ipoib_multicast.c	2006-11-29 13:57:37.000000000 -0800
+++ ipoib-multicast/ipoib_multicast.c	2007-02-04 22:34:16.000000000 -0800
@@ -402,6 +402,11 @@ static void ipoib_mcast_join_complete(in
 		queue_work(ipoib_workqueue, &priv->mcast_task);
 		mutex_unlock(&mcast_mutex);
 		complete(&mcast->done);
+		/*
+		 * broadcast join finished, enable carrier
+		 */
+		if (mcast == priv->broadcast)
+			netif_carrier_on(dev);
 		return;
 	}
@@ -599,7 +604,6 @@ void ipoib_mcast_join_task(void *dev_ptr
 	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
 	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
-	netif_carrier_on(dev);
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)

(See attached file: ipoib-multicast.patch)

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
Re: [openib-general] Multicast join group failure prevents IPoIB performing
According to IPoIB RFC 4391 section 5, once the IPoIB broadcast group has been joined, the IPoIB link should be UP, since it's ready for data transfer; the interface should be able to run for broadcast and unicast and does not need to wait for all multicast joins to succeed. Here is the patch to let the IPoIB interface run without waiting for every multicast join (such as the all-hosts group join) to be successful:

diff -urpN ipoib/ipoib_multicast.c ipoib-patch/ipoib_multicast.c
--- ipoib/ipoib_multicast.c	2006-11-29 13:57:37.000000000 -0800
+++ ipoib-patch/ipoib_multicast.c	2007-02-03 00:52:23.000000000 -0800
@@ -566,6 +566,7 @@ void ipoib_mcast_join_task(void *dev_ptr
 	if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
 		ipoib_mcast_join(dev, priv->broadcast, 0);
+		netif_carrier_on(dev);
 		return;
 	}
@@ -599,7 +600,6 @@ void ipoib_mcast_join_task(void *dev_ptr
 	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
 	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
-	netif_carrier_on(dev);
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)

(See attached file: multicast.patch)

http://www.rfc-editor.org/rfc/rfc4391.txt

5. Setting Up an IPoIB Link

The broadcast-GID, as defined in the previous section, MUST be set up for an IPoIB subnet to be formed. Every IPoIB interface MUST FullMember join the IB multicast group defined by the broadcast-GID. This multicast group will henceforth be referred to as the broadcast group. The join operation returns the MTU, the Q_Key, and other parameters associated with the broadcast group. The node then associates the parameters received as a result of the join operation with its IPoIB interface. The broadcast group also serves to provide a link-layer broadcast service for protocols like ARP, net-directed, subnet-directed, and all-subnets-directed broadcasts in IPv4 over IB networks.
The join operation is successful only if the Subnet Manager (SM) determines that the joining node can support the MTU registered with the broadcast group [RFC4392], ensuring support for a common link MTU. The SM also ensures that all the nodes joining the broadcast-GID have paths to one another and can therefore send and receive unicast packets. It further ensures that all the nodes do indeed form a multicast tree that allows packets sent from any member to be replicated to every other member. Thus, the IPoIB link is formed by the IPoIB nodes joining the broadcast group. There is no physical demarcation of the IPoIB link other than that determined by the broadcast group membership.

Shirley Ma

Shirley Ma/Beaverton/IBM@IBMUS wrote on 02/02/07 08:58 PM, Subject: [openib-general] Multicast join group failure prevents IPoIB performing:

When bringing the IPoIB interface up, I hit a default group multicast join failure. (Could this be fixed in the SM setup?)

ib0: multicast join failed for , status -22

Then the interface was UP but not RUNNING, so the nodes couldn't ping each other. I think the right behavior is for the interface to be UP and RUNNING even with some multicast join failures. I would like to provide a patch if there is no problem. Please advise.

Thanks
Shirley Ma
[openib-general] Multicast join group failure prevents IPoIB performing
When bringing the IPoIB interface up, I hit a default group multicast join failure. (Could this be fixed in the SM setup?)

ib0: multicast join failed for , status -22

Then the interface was UP but not RUNNING, so the nodes couldn't ping each other. I think the right behavior is for the interface to be UP and RUNNING even with some multicast join failures. I would like to provide a patch if there is no problem. Please advise.

Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 11/16/2006 11:26:31 AM:

What I have found in the ehca driver is that n != t doesn't mean the CQ is empty; if you poll again, there are still some packets in the CQ. IB_CQ_REPORT_MISSED_EVENTS reports 1 most of the time. It relies on netif_rx_reschedule() returning 0 to exit the NAPI poll. That might be why it stays in the poll routine for a long time? I will rerun my test using n != 0 to see any difference here.

Maybe there's an ehca bug in poll CQ? If n != t then it should mean that the CQ was indeed drained. I would expect a missed event to be rare, because it means a completion occurs between the last poll CQ and the request notify, and that shouldn't be that common... My rough estimate is that even at a higher throughput than what you're seeing, IPoIB should only generate ~ 500K completions/sec, which means the average delay between completions is 2 microseconds. So I wouldn't expect completions to hit the window between poll and request notify that often.

- R.

I have tried low_latency = 1 to disable TCP prequeue; the throughput increased from 1XXMb/s to 4XXMb/s. If I delayed netif_receive_skb() a little bit, I could get around 1700Mb/s. If I totally disable netif_rx_reschedule(), so there is no repoll and it returns 0, I could get around 2900Mb/s throughput without seeing packet out-of-order issues. I have tried adding a spin lock in ipoib_poll(), and I still see packets out of order.

disable prequeue: 2XXMb/s to 4XXMb/s (packets out of order)
slow down netif_receive_skb: 17XXMb/s (packets out of order)
don't handle missed event: 28XXMb/s (no packets out of order)
handle missed event later: 7XXMb/s to 11XXMb/s (packets out of order)

Maybe the ehca driver delivers packets much faster? Which makes me think the user process's TCP backlog queue and prequeue might get out of order?
Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 11/14/2006 03:18:23 PM:

Shirley: The rotting packet situation consistently happens for the ehca driver. The NAPI poll could loop forever with your original patch. That's the reason I defer the rotting-packet processing to the next NAPI poll.

Hmm, I don't see it. In my latest patch, the poll routine does:

repoll:
	done  = 0;
	empty = 0;

	while (max) {
		t = min(IPOIB_NUM_WC, max);
		n = ib_poll_cq(priv->cq, t, priv->ibwc);

		for (i = 0; i < n; ++i) {
			if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
				++done;
				--max;
				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
			} else
				ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
		}

		if (n != t) {
			empty = 1;
			break;
		}
	}

	dev->quota -= done;
	*budget    -= done;

	if (empty) {
		netif_rx_complete(dev);
		if (unlikely(ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP |
					      IB_CQ_REPORT_MISSED_EVENTS)) &&
		    netif_rx_reschedule(dev, 0))
			goto repoll;
		return 0;
	}

	return 1;

so every receive completion will count against the limit set by the variable max. The only way I could see the driver staying in the poll routine for a long time would be if it was only processing send completions, but even that doesn't actually seem bad: the driver is making progress handling completions.

What I have found in the ehca driver is that n != t doesn't mean the CQ is empty; if you poll again, there are still some packets in the CQ. IB_CQ_REPORT_MISSED_EVENTS reports 1 most of the time. It relies on netif_rx_reschedule() returning 0 to exit the NAPI poll. That might be why it stays in the poll routine for a long time? I will rerun my test using n != 0 to see any difference here.

Shirley: It does help the performance from 1XXMb/s to 7XXMb/s, but not as expected 3XXXMb/s.

Is that 3xxx Mb/sec the performance you see without the NAPI patch?

Without the NAPI patch, in my test environment ehca can get around 2800Mb/s to 3000Mb/s throughput.

Shirley: With the deferred rotting-packet processing patch, I can see a packets-out-of-order problem in the TCP layer.
Shirley: Is it possible there is a race somewhere causing two NAPI polls at the same time? mthca seems to use IRQ auto-affinity, but ehca uses round-robin interrupts.

I don't see how two NAPI polls could run at once, and I would expect worse effects from them stepping on each other than just out-of-order packets. However, the fact that ehca does round-robin interrupt handling might lead to out-of-order packets just because different CPUs are all feeding packets into the network stack.

- R.

Normally for NAPI there should be only one poll running at a time, and NAPI processes packets one by one all the way to the TCP layer (netif_receive_skb()). So it shouldn't lead to out-of-order packets even with round-robin interrupt handling. I am still investigating this.

Thanks
Shirley
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
I will rerun my test using n != 0 to see any difference here.

Correction: it should be n == 0 to indicate empty.

Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 11/13/2006 08:45:52 AM:

Sorry, I did not intend to send the previous email; I accidentally sent it out anyway. What I thought was that there would be a problem if missed_event always returns 1: then this NAPI poll would keep going forever.

Well, it's limited by the quota that the net stack gives it, so there's no possibility of looping forever.

How about deferring the rotting-packet processing until later? Like this:

However, that seems like it is still correct.

With this patch, I could get NAPI + non-scaling code throughput performance from 1XXMb/s to 7XXMb/s; anyway, there are some other problems I am still investigating now.

But I wonder why it gives you a factor of 4 in performance?? Why does it make a difference? I would have thought that the rotting-packet situation would be rare enough that it doesn't really matter for performance exactly how we handle it. What are the other problems you're investigating?

- R.

The rotting-packet situation consistently happens for the ehca driver. The NAPI poll could loop forever with your original patch. That's the reason I defer the rotting-packet processing to the next NAPI poll. It does help the performance from 1XXMb/s to 7XXMb/s, but not as expected 3XXXMb/s. With the deferred rotting-packet processing patch, I can see a packets-out-of-order problem in the TCP layer. Is it possible there is a race somewhere causing two NAPI polls at the same time? mthca seems to use IRQ auto-affinity, but ehca uses round-robin interrupts.

Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland, I think a barrier might be needed in checking the LINK SCHED state, like smp_mb__before_clear_bit() and smp_mb__after_clear_bit(); otherwise the netif_rx_reschedule() for the rotting packet and the next interrupt's netif_rx_schedule() could be running at the same time. If interrupts are delivered in round-robin fashion, then packets are going to be out of order in the TCP layer. I will test it out once I have the resource. What do you think? Thanks Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland, Ignore my previous email -- test_and_set_bit() is an atomic operation and already implies the memory barrier. Thanks Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
From the code walk-through: if we defer the rotting packet process by returning

    (missed_event && netif_rx_reschedule(dev, 0));

then the same dev->poll can be added to the per-cpu poll list twice: once from netif_rx_reschedule(), and once from the napi poll returning 1. That might explain the packets out of order: one poll finishes and clears the LINK SCHED bit while the next interrupt runs on another cpu. Thanks Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
> I would really like to understand why ehca does worse with NAPI. In my tests both mthca and ipath exhibit various degrees of improvement depending on the test -- but I've never seen performance get worse. This is the main thing holding back merging NAPI. Does the NAPI patch help mthca on pSeries? I wonder if it's not ehca, but rather that there's some ppc64 quirk that makes NAPI a lot more expensive. - R.

Got your point. Sorry, I haven't made any big progress yet. What I have found so far in the non-scaling code: if I always set missed_event = 0 without peeking at the rotting packet, then NAPI increases the performance and reduces the cpu utilization. That's the reason I suggested the above change. I haven't found the reason the scaling code drops 2/3 of the performance yet. The NAPI touch test for mthca on power looks good, so I don't think it's a ppc64 issue. Thanks Shirley
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 11/10/2006 07:00:46 AM:

> I think it has to stay the way I wrote it. Your version:

+       if (empty)
+               return (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS) && netif_rx_reschedule(dev, 0));

Thanks Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland, Sorry, I was not intending to send the previous email; I accidentally sent it out. What I thought was that there would be a problem if missed_event always returns 1 -- then this napi poll would keep polling forever. How about deferring the rotting packets process until later? like this:

+       if (empty)
+               return (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS) && netif_rx_reschedule(dev, 0));

With this patch, I could get NAPI + non-scaling code throughput from 1XXMb/s to 7XXMb/s; anyway, there are some other problems I am still investigating now. Thanks Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 10/19/2006 09:10:35 PM:

> I looked over my code again, and I don't see anything obviously wrong, but it's quite possible I made a mistake that I just can't see right now (like reversing a truth value somewhere). Someone who knows how ehca works might be able to spot the error. - R.

Your code is OK. I just found the problem here:

+       if (empty) {
+               netif_rx_complete(dev);
+               ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP, &missed_event);
+               if (unlikely(missed_event) && netif_rx_reschedule(dev, 0))
+                       goto repoll;
+
+               return 0;
+       }

netif_rx_complete() should be called right before the return. It does improve non-scaling performance with this patch, but reduces scaling performance:

+       if (empty) {
+               ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP, &missed_event);
+               if (unlikely(missed_event) && netif_rx_reschedule(dev, 0))
+                       goto repoll;
+               netif_rx_complete(dev);
+
+               return 0;
+       }

Is there any other reason not to call netif_rx_complete() while still possibly within the napi poll? Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Retested several times: this hack patch only fixes the non-scaling code. I thought I had tested both scaling and non-scaling; it seems I made a mistake -- I might have configured and tested the non-scaling configuration twice in the previous run. thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Michael S. Tsirkin [EMAIL PROTECTED] wrote on 10/19/2006 01:21:45 PM:

> Please also note that due to factors such as TCP window limits, TX on a single socket is often stalled. To really stress a connection and see benefit from NAPI you should be running multiple socket streams in parallel: either just run multiple instances of netperf/netserver, or use iperf with the -P flag.

I used to get 7600Mb/s IPoIB one-socket duplex throughput with my other IPoIB patches on a 2.6.5 kernel under a certain configuration, which makes me believe we could gain close to link throughput with one UD QP. Now I can't get it anymore on the new kernel. I was struggling with TCP window limits on the new kernel. Do you have any hint? thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] RHEL5 and OFED ...
[EMAIL PROTECTED] wrote on 10/16/2006 01:50:49 PM: On Mon, 2006-10-16 at 15:25 +0200, Michael S. Tsirkin wrote: Quoting r. Maestas, Christopher Daniel [EMAIL PROTECTED]: Subject: Re: [openib-general] RHEL5 and OFED ...

> Now for userspace - does RHEL5 include at least libibverbs-1.0? This has been released a while back, and Roland makes regular bugfix releases.

Here's what I see on a rhel4 u4 system:
---
$ rpm -q libibverbs
libibverbs-1.0.3-1
---
So I would think rhel5 would have at least that or greater. When I compiled rpms for 1.1rc7 it generated:
---
# ls libibverbs-*
libibverbs-1.0.4-0.x86_64.rpm  libibverbs-utils-1.0.4-0.x86_64.rpm
libibverbs-devel-1.0.4-0.x86_64.rpm
---

> Doug, would it be possible to update this + libmthca?

> Possibly. What's the justification? What's in 1.0.4 that is the primary reason for wanting to update from 1.0.3? -- Doug Ledford [EMAIL PROTECTED]

I am not sure whether this already has an answer. The justification is that madvise(..., MADV_DONTFORK) is used to make fork() work for verbs consumers in the recent packages. I hope the same patch will be in libehca. thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [openfabrics-ewg] RHEL5 and OFED ...
Roland Dreier [EMAIL PROTECTED] wrote on 10/19/2006 09:19:14 AM:

Shirley> How can RHEL5 pick up this particular patch? Applications with fork() depend on this patch.

> It can't really, since it breaks the libibverbs ABI and therefore has to be part of a major release.

Then we need to wait for the new release or find an alternative way, which I doubt exists. Thanks Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Thanks Michael for all these tips. I have tried several of the suggestions you proposed here. I couldn't see performance get any better. The TCP_RR rate dropped to 472 trans/s from about 18,000 trans/s, and TCP_STREAM BW dropped to 1/3 of before (ehca + scaling code) with the same TCP configuration, send queue size = recv queue size = 1K. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland, I have applied this patch and updated patch 2/2. You will send out an updated patch 2/2, I think. I did some extra modification in the ipoib code (which has more extra repolls). I do see around 10% or more performance improvement now with this change on both scaling and non-scaling code. I will run oprofile tomorrow to see the difference. I think with these extra repolls, the cpu utilization would be much higher. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 10/19/2006 07:39:25 PM:

Shirley> I have applied this patch and updated patch 2/2. You will send out an updated patch 2/2, I think.

> Sorry, messed that up. I just sent out the patch.

No problem, I did the same change.

> You mean you add more calls to ib_poll_cq()? Where do you add them? Why does it help? - R.

I ran out of ideas why we were losing 2/3 of the throughput and getting 476 trans/s. So I assumed there was always a missed event; then ipoib would stay in its napi poll within its scheduled time. That's why it helps. This is really a hack and doesn't address the problem: it sacrifices cpu utilization to gain the performance back. I need to understand how ehca reports a missed event -- there might be some delay there? Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 10/19/2006 09:10:35 PM:

> It's entirely possible that my implementation of the missed event hint in ehca is wrong. I just guessed based on how poll CQ is implemented -- if the consumer requests a hint about missed events, then I lock the CQ and check if it's empty after requesting notification. I looked over my code again, and I don't see anything obviously wrong, but it's quite possible I made a mistake that I just can't see right now (like reversing a truth value somewhere). Someone who knows how ehca works might be able to spot the error. - R.

The oprofile data (with your napi + this hack patch) looks good; it reduced cpu utilization significantly. (I was wrong about cpu utilization.) I will talk with the ehca team regarding this missed event hint patch on ehca. thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 10/17/2006 08:41:59 PM:

> Anyway, I'm eagerly awaiting your NAPI results with ehca. Thanks, Roland

Thanks. The touch test results are not good. This NAPI patch induces huge latency for the ehca driver scaling code, and the throughput performance is not good. (I am not fully convinced the huge latency is because of raising NAPI in thread context.) Then I tried the ehca non-scaling driver; the latency looks good, but the throughput is still a problem. We are working on these issues. Hopefully we can get the answer soon. Thanks Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 10/18/2006 01:55:13 PM:

> I would like to understand why there's a throughput difference with scaling turned off, since the NAPI code doesn't change the interrupt handling all that much, and should lower the CPU usage if anything.

That's what I am trying to understand now. Yes, the send side rate dropped significantly, with lower cpu usage as well.

> Does changing the netdev weight value affect anything? - R.

No, it doesn't. Thanks Shirley Ma IBM Linux Technology Center
[openib-general] ethtool support for ipoib
I am going to add the ethtool ops below in ipoib. Any comments? Once ethtool support is added, GSO can be get/set directly through ethtool, as Michael pointed out earlier.

static struct ethtool_ops ipoib_ethtool_ops = {
        .get_settings           = ipoib_get_settings,
        .set_settings           = ipoib_set_settings,
        .get_drvinfo            = ipoib_get_drvinfo,
        .get_link               = ethtool_op_get_link,
        .get_stats_count        = ipoib_get_stats_count,
        .get_ethtool_stats      = ipoib_get_ethtool_stats,
        /* can be added later once ipoib supports sg
        .get_sg                 = ethtool_op_get_sg,
        .set_sg                 = ethtool_op_set_sg,
        */
};

thanks Shirley Ma
Re: [openib-general] ethtool support for ipoib
Michael S. Tsirkin [EMAIL PROTECTED] wrote on 10/16/2006 11:12:03 PM:

Quoting r. Shirley Ma [EMAIL PROTECTED]:
> /* can be added later once ipoib supports sg
> .get_sg = ethtool_op_get_sg,
> .set_sg = ethtool_op_set_sg,
> */

> The difficulty here is that sg currently requires checksum offloading in netdevice. -- MST

I read the discussion in net-dev. Since an IB packet has its own CRC (ICRC, VCRC), is it a good idea to mark checksums unnecessary in a pure IB fabric for a large 64K MTU? It requires some negotiation. Does your prototype implementation for large MTU require agreement from both ends? Practically it can be implemented, but I don't know what the RFCs have defined.
Re: [openib-general] ethtool support for ipoib
What I suggested here is that when it's connected mode with a large MTU, we set the ib interface flag to CHECKSUM_UNNECESSARY. But this only works on packets not being routed off-net at the TCP layer. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] ethtool support for ipoib
Parks Fields [EMAIL PROTECTED] wrote on 10/17/2006 01:12:48 PM:

> No, it's never a good idea to turn off TCP or IP checksums. That leads to possibilities of silent data corruption too easily.
> I totally agree...

Have we ever seen silent data corruption with CHECKSUM_HW? Thanks Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Hi, Roland, There were a couple of errors and a warning when I applied this patch to OFED-1.1-rc7.
1. ehca_req_notify_cq() in ehca_iverbs.h is not updated.
2. *maybe_missed_event = ipz_qeit_is_valid(my_cq->ipz_queue) should be = ipz_qeit_is_valid(&my_cq->ipz_queue)
3. a compile warning on this line: return cqe_flags >> 7 == queue->toggle_state & 1;
Thanks Shirley Ma
Re: [openib-general] enable GSO over IPoIB
Hi, Roland, If we only support GSO enablement in ethtool, there is no problem. What I meant is that anything related to MAC addresses in the ethtool utility needs to be updated for IB devices. Do you like the idea of adding ethtool support in IPoIB? Do you want me to work on this? Thanks Shirley Ma
Re: [openib-general] enable GSO over IPoIB
Good. Then after enabling GSO, we can chain multiple packets together in IPoIB for one doorbell to send a large packet. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] enable GSO over IPoIB
Roland Dreier [EMAIL PROTECTED] wrote on 10/16/2006 10:37:12 AM:

Shirley> Good. Then after enabling GSO, we can chain multiple packets together in IPoIB for one doorbell to send a large packet.

> How does that work? GSO doesn't change the hard_start_xmit() interface, does it? - R.

No, it doesn't. I am thinking of adding enqueue/dequeue of multiple packets in qdisc. It would benefit other networking devices too. Thanks Shirley Ma
Re: [openib-general] enable GSO over IPoIB
Roland Dreier [EMAIL PROTECTED] wrote on 10/16/2006 10:49:32 AM:

Shirley> No, it doesn't. I am thinking of adding enqueue/dequeue of multiple packets in qdisc. It would benefit other networking devices.

> So am I understanding correctly -- this is other work that is independent of GSO? Is the plan to add a new optional driver method that extends hard_start_xmit() to accept multiple packets? - R.

Yes, you are right. It is new work independent of GSO. I hope I have the bandwidth to do all the work on time. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] IPOIB NAPI
Roland, I don't know why I am having trouble getting this patch from your git tree. Do you mind posting the patch here so I can test the performance over ehca? Thanks Shirley Ma
Re: [openib-general] [PATCH] IB/ipoib: NAPI
[EMAIL PROTECTED] wrote on 09/28/2006 09:11:47 AM:

Michael> Looked pretty simple on the outset, but oh well. Keep us posted.

> I just work slowly. Anyway I don't think this is that urgent -- we've dumped enough stuff into 2.6.19, so I think this should wait for 2.6.20 at the earliest anyway.

Please wait for the other device drivers to finish the performance test. This NAPI patch somehow kills ehca performance -- extremely badly. Shirley Ma
Re: [openib-general] [PATCH] IB/ipoib: NAPI
Michael S. Tsirkin [EMAIL PROTECTED] wrote on 09/26/2006 09:59:30 PM:

> I still hope ehca NAPI performance can be fixed. But if not, maybe we should have the low level driver set a disable_napi flag rather than have users play with module options. -- MST

I forgot to mention that these NAPI parameters should be tunable for different device drivers, like dev->weight, or set up in the lower driver. thanks Shirley Ma
Re: [openib-general] [PATCH] IB/ipoib: NAPI
Michael S. Tsirkin [EMAIL PROTECTED] wrote on 09/26/2006 11:23:16 PM:

Quoting r. Shirley Ma [EMAIL PROTECTED]:
> We are implementing multiple EQs support for one adapter now.

> I think with MSI we can have a per-interface EQ in mthca. Main reason I'm not doing this is because I haven't figured out the right interface to pass this information to the low level driver yet. Maybe we should just assign EQs to CQs in a round-robin fashion for now, and just hope typical use allocates CQs sequentially. Worst case, we are back to where we are now, performance-wise. Roland, how does this sound?

> If that works, then we can modify the ehca code as mthca. Actually mthca has the same problem as ehca over two links on the same adapter.

> OK, but if as you point out the issue is not device-specific - that's a good reason not to do tricks in the low-level driver to try and work around this, but to address this at the ULP level. -- MST

Yes. That's what we are working on: defining the right APIs to pass this information to the low level driver. Now we are trying per-interface-per-EQ; then we will extend the work to an N(CQ):M(EQ) mapping. ehca can support up to 127 EQs, so I would suggest using a hash. Thanks Shirley Ma
Re: [openib-general] heads-up - ipoib NAPI
Hi, Eli,

Eli Cohen [EMAIL PROTECTED] wrote on 09/26/2006 11:35:26 PM:

> On Tue, 2006-09-26 at 21:34 -0700, Shirley Ma wrote:
>> The NAPI patch moves the ipoib poll from hardware interrupt context to softirq context. It would reduce the hardware interrupts, reduce hardware latency and introduce some network latency. It might reduce cpu utilization. But I still question the BW improvement. I did see varying performance with the same test under the same conditions.

> When you open just one connection you can see around 10% of variation in the BW measure. But then you don't utilize all the CPU power you have and you don't get to the threshold where NAPI becomes effective. Using multiple connections utilizes all CPUs in the system, increases send rate, and increases the chances of the receiver to poll CQEs up to its quota and be scheduled again without re-enabling interrupts.

The send rate shouldn't be limited by one connection; the cpu is much faster than the link speed. I don't think the multiple-connection send rate is higher than one connection's -- do you have any data to show that? When I monitored the CQEs, I didn't see too many CQEs in the CQ for one notification, and I don't think moving the poll from hardware interrupt context to softirq context would increase that number. Or the latency might cause the number to increase; I did see that number increase and performance increase with some udelay in hardware interrupt polling mode. If you saw the packet count increase, how many packets did you see in one hardware interrupt poll vs. one NAPI poll? Your NAPI poll is driven either by the receiver quota or by any send CQE in the CQ. Have you tested UDP performance? Any difference? Thanks Shirley Ma
Re: [openib-general] enable GSO over IPoIB
Michael S. Tsirkin [EMAIL PROTECTED] wrote on 09/27/2006 01:30:03 AM:

> Any idea what does ethtool do that IPoIB can't support?

ethtool is an ethernet device tool. It's OK to partially implement the ethtool operations in IPoIB. We also need to patch the userlevel utility to support ibX interfaces; now it only supports ethX. thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH] IB/ipoib: NAPI
I have created a patch to monitor the CQ. That wasn't the reason for the performance drop; I couldn't see any race from the output. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH] IB/ipoib: NAPI
Roland, We had a simple version of the NAPI patch. We saw the performance improvement on mthca but not ehca. We will test this NAPI patch on ehca when it's available to see how the performance is. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH] IB/ipoib: NAPI
> This patch implements NAPI for ipoib. It is a draft implementation. I would like your opinion on whether we need a module parameter to control if NAPI should be activated or not.

It can be a configuration option to enable/disable NAPI, just like other network devices. Thanks Shirley Ma IBM Linux Technology Center
[openib-general] enable GSO over IPoIB
Since linux 2.6.18 supports GSO, I have patched IPoIB to enable GSO, but haven't tested the performance yet. Has anyone tried already? Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] 2.6.18 kernel support in the main trunk.
> What's the status with the main trunk kernel code and 2.6.18? I noticed that it doesn't build and needs something like this. I haven't tested this yet...

Yes. You need this patch and also need to change ipoib_multicast.c: dev->xmit_lock to dev->_xmit_lock to build the trunk on 2.6.18. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH] IB/ipoib: NAPI
We did some touch testing on the ehca driver and saw a performance drop somehow. I strongly recommend making NAPI a configurable option in ipoib, so customers can turn it on/off based on their configurations. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH] IB/ipoib: NAPI
Roland,

> Do you know how ehca behaves? Does it have that race? i.e. what happens in this situation: poll CQ -> CQ is empty -> (new completion is added to CQ) -> request notify on CQ -> (no more completions are added). Mellanox HCAs will generate a CQ event in this case, although it's not strictly required by the IB spec. How will ehca behave? - R.

That could be the reason. I did see mthca poll an empty entry, but not ehca. I will confirm this with the ehca team. Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] enable GSO over IPoIB
Shirley> Since linux 2.6.18 supports GSO, I have patched IPoIB to enable GSO, but haven't tested the performance yet. Has anyone tried already?

> No, I don't think anyone looked at that yet. Could you post your patch? What is required? Supporting gather/scatter? - R.

Don't need to. GSO only improves sender-side performance. It allows a large packet send in the ULPs, and segments the packet in the interface layer before the driver xmit. GSO enablement is through ethtool. Since ipoib doesn't support ethtool, I just simply added a module parameter to set the interface GSO flag when loading the module. My next step is to enable gather/scatter in the ipoib send path to chain multiple packets together for one doorbell. Thanks Shirley Ma
Re: [openib-general] heads-up - ipoib NAPI
Hi, Eli,

> Hi, I have a draft implementation of NAPI in ipoib and got the following results. System description: Quad CPU E64T 2.4 GHz, 4 GB RAM, MT25204 Sinai HCA. I used netperf for benchmarking; the BW test ran for 600 seconds with 8 clients and 8 servers. The results I received are below:
>
> netperf TCP_STREAM:
>                BW [MByte/sec]   client side [irqs/sec]   server side [irqs/sec]
> without NAPI:  506              86441                    66311
> with NAPI:     550              6830                     13600
>
> netperf TCP_RR:
>                rate [tran/sec]
> without NAPI:  39600
> with NAPI:     39470
>
> Please note this is still under work and we plan to do more tests and measure on other devices.

The NAPI patch moves the ipoib poll from hardware interrupt context to softirq context. It would reduce the hardware interrupts, reduce hardware latency and introduce some network latency. It might reduce cpu utilization. But I still question the BW improvement. I did see varying performance with the same test under the same conditions. Have you tested this patch with different message sizes and different socket sizes? Are these results consistently better? Thanks Shirley Ma IBM Linux Technology Center
Re: [openib-general] [PATCH] IB/ipoib: NAPI
Hi, Roland, Shirley It can be a configuration option to enable/disable NAPI, Shirley just like other network devices. But is there any reason to keep the non-NAPI code around? I hate to have two codepaths to maintain. If you would like to maintain only one code path, then we need to compare the NAPI patch with the thread-context polling mode patch. I did see a big performance improvement with the thread-context polling mode patch I have been working on. (I used to split the CQ; I am trying without splitting the CQ now.) And I think it would improve multiple-link performance when links share one EQ. Thanks Shirley Ma
Re: [openib-general] [PATCH] IB/ipoib: NAPI
Michael S. Tsirkin [EMAIL PROTECTED] wrote on 09/26/2006 09:59:30 PM: Quoting r. Shirley Ma [EMAIL PROTECTED]: Subject: Re: [PATCH] IB/ipoib: NAPI We did a quick test on the ehca driver and saw a performance drop. Hmm, it seems ehca still defers the completion event to a tasklet. It always seemed weird to me. So that could be the reason - with NAPI you now get 2 tasklet schedules, as you are actually doing part of what NAPI does, inside the low-level driver. Try ripping that out and calling the event handler directly, and see what it does to performance with NAPI. The reason for this ehca implementation is that two ports/links share one EQ. We are implementing multiple-EQ support for one adapter now. If that works, then we can modify the ehca code like mthca. Actually mthca has the same problem as ehca over two links on the same adapter: performance with two links on the same adapter is very bad, and it does not scale at all. I strongly recommend making NAPI a configurable option in ipoib, so customers can turn it on/off based on their configurations. I still hope ehca NAPI performance can be fixed. But if not, maybe we should have the low level driver set a disable_napi flag rather than have users play with module options. -- MST We have been working on this issue for some time. That's the reason we didn't post our NAPI patch. Hopefully we can fix it. If we can show that NAPI performance (latency, BW, cpu utilization) is better in all cases (UP vs. SMP, one socket vs. multiple sockets, one link vs. multiple links, different message sizes, different socket sizes) I will agree to turn on NAPI as the default. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
[openib-general] Re: [PATCH]Repost: IPoIB skb panic
Roland, Can you post a recipe to reproduce the crash? It happened on a 32-node cluster (each node has 8 dual-core cpus) running IBM applications over IPoIB. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
[openib-general] Re: Re: [PATCH]Repost: IPoIB skb panic
Michael, I will apply this patch. This patch would reduce the race, not address the problem. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] Re: [PATCH]Repost: IPoIB skb panic
Ohmm. That's a myth. So this problem is hardware independent, right? It's not easy to reproduce: an ifconfig up/down stress test can hit this problem occasionally. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
[openib-general] [PATCH]Repost: IPoIB skb panic
Roland, I posted the patch yesterday, but it seems it only went to the web site. I am reposting the patch here for you to review. Please let me know if there is any problem applying this patch. There are two problems in path_free(), which caused a kernel skb panic during interface up/down stress tests.

1. path_free() should call dev_kfree_skb_any() (any context) instead of dev_kfree_skb_irq() (irq context), since it is called in process context.
2. path->queue should be protected by priv->lock, since there is a race between unicast_arp_send() and ipoib_flush_paths() to release the skb when bringing the interface down. It's safe to use priv->lock because skb_queue_len(path->queue) < IPOIB_MAX_PATH_REC_QUEUE, which is 3.

Signed-off-by: Shirley Ma [EMAIL PROTECTED]

diff -urpN infiniband/ulp/ipoib/ipoib_main.c infiniband-skb/ulp/ipoib/ipoib_main.c
--- infiniband/ulp/ipoib/ipoib_main.c	2006-05-03 13:16:18.0 -0700
+++ infiniband-skb/ulp/ipoib/ipoib_main.c	2006-06-01 09:14:05.0 -0700
@@ -252,11 +252,11 @@ static void path_free(struct net_device
 	struct sk_buff *skb;
 	unsigned long flags;
 
-	while ((skb = __skb_dequeue(&path->queue)))
-		dev_kfree_skb_irq(skb);
-
 	spin_lock_irqsave(&priv->lock, flags);
 
+	while ((skb = __skb_dequeue(&path->queue)))
+		dev_kfree_skb_any(skb);
+
 	list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) {
 		/*
 		 * It's safe to call ipoib_put_ah() inside priv->lock

Thanks Shirley Ma IBM LTC
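The effect of moving the dequeue under the lock can be modeled in plain userspace C (a toy sketch with pthreads; the names are illustrative, not the kernel API): two concurrent paths of execution drain the same queue, and because each dequeue happens with the lock held, every buffer is freed exactly once.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Toy model of the race fixed in path_free(): two contexts (standing in
 * for unicast_arp_send() and ipoib_flush_paths()) both drain the same
 * queue.  Holding the lock across the dequeue guarantees each node is
 * freed exactly once, never twice. */

struct node { struct node *next; };

static struct node *queue_head;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int freed;

static void drain_queue(void)
{
    struct node *n;

    pthread_mutex_lock(&queue_lock);
    while ((n = queue_head)) {          /* dequeue under the lock */
        queue_head = n->next;
        free(n);
        freed++;
    }
    pthread_mutex_unlock(&queue_lock);
}

static void *drainer(void *arg)
{
    (void)arg;
    drain_queue();
    return NULL;
}

/* Enqueue `count` nodes, race two drainers, return how many got freed. */
static int race_drain(int count)
{
    pthread_t a, b;

    freed = 0;
    for (int i = 0; i < count; i++) {
        struct node *n = malloc(sizeof(*n));
        n->next = queue_head;
        queue_head = n;
    }
    pthread_create(&a, NULL, drainer, NULL);
    pthread_create(&b, NULL, drainer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return freed;
}
```

If the dequeue loop ran before taking the lock, as in the old code, the two contexts could both manipulate the list and free the same node, which is exactly the double-free the assertion failures below report.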
[openib-general] Re: [PATCH]Repost: IPoIB skb panic
Roland, More clarification: we saw two races here:

1. path_free() was called by both unicast_arp_send() and ipoib_flush_paths() at the same time.

0xc004bff0a0d031 10 R 0xc004bff0a580 *ksoftirqd/0
SP(esp)        PC(eip)    Function(args)
0xcf707c80     0xc03199d0 .skb_release_data +0x7c
0xcf707c80     0xc0319688 (lr) .kfree_skbmem +0x20
0xcf707d10     0xc0319688 .kfree_skbmem +0x20
0xcf707da0     0xc03197fc .__kfree_skb +0x148
0xcf707e50     0xc031e2a8 .net_tx_action +0xa4
0xcf707f00     0xc006ab38 .__do_softirq +0xa8
0xcf707f90     0xc00177b0 .call_do_softirq +0x14
0xc000cff83d90 0xc0012064 .do_softirq +0x90
0xc000cff83e20 0xc006b0fc .ksoftirqd +0xfc
0xc000cff83ed0 0xc0081d74 .kthread +0x17c
0xc000cff83f90 0xc0017d24 .kernel_thread +0x4c
KERNEL: assertion (!atomic_read(&skb->users)) failed at net/core/dev.c

2. During unicast arp skb retransmission, unicast_arp_send() appended the skb to the list while ipoib_flush_paths() was calling path_free() to free the same skb from the list.

<3>KERNEL: assertion (!atomic_read(&skb->users)) failed at net/core/dev.c (1742)
<4>Warning: kfree_skb passed an skb still on a list (from c031e2a8).
<2>kernel BUG in __kfree_skb at net/core/skbuff.c:225! (sles9 sp3 kernel)

void __kfree_skb(struct sk_buff *skb)
{
	if (skb->list) {
		printk(KERN_WARNING "Warning: kfree_skb passed an skb still on a list (from %p).\n", NET_CALLER(skb));
		BUG();
	}

The patch will fix both problems by using priv->lock to protect the path->queue list. Am I right? Thanks Shirley Ma IBM LTC
[openib-general] Re: [PATCH]Repost: IPoIB skb panic
On Fri, 2006-06-02 at 16:15 -0700, Roland Dreier wrote: 2. during unicast arp skb retransmission, unicast_arp_send() appended the skb on the list, while ipoib_flush_paths() calling path_free() to free the same skb from the list. I think I see what's going on. the skb ends up being on two lists at once I guess... - R. The skb has only one prev pointer and one next pointer, so it can only be on one list at a time. How could the skb go on two lists at once? Thanks Shirley
[openib-general] [PATCH] IPoIB skb panic
Roland, I found two problems in path_free() that can cause a kernel skb panic.

1. path_free() should call dev_kfree_skb_any() (any context) instead of dev_kfree_skb_irq() (irq context).
2. path->queue should be protected by priv->lock, since there is a possible race between unicast_arp_send() and ipoib_flush_paths() when bringing the interface down. It's safe to use priv->lock because skb_queue_len(path->queue) < IPOIB_MAX_PATH_REC_QUEUE, which is 3.

Here is the patch. Please review it and let me know if there is a problem applying it.

Signed-off-by: Shirley Ma [EMAIL PROTECTED]

diff -urpN infiniband/ulp/ipoib/ipoib_main.c infiniband-skb/ulp/ipoib/ipoib_main.c
--- infiniband/ulp/ipoib/ipoib_main.c	2006-05-03 13:16:18.0 -0700
+++ infiniband-skb/ulp/ipoib/ipoib_main.c	2006-06-01 09:14:05.0 -0700
@@ -252,11 +252,11 @@ static void path_free(struct net_device
 	struct sk_buff *skb;
 	unsigned long flags;
 
-	while ((skb = __skb_dequeue(&path->queue)))
-		dev_kfree_skb_irq(skb);
-
 	spin_lock_irqsave(&priv->lock, flags);
 
+	while ((skb = __skb_dequeue(&path->queue)))
+		dev_kfree_skb_any(skb);
+
 	list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) {
 		/*
 		 * It's safe to call ipoib_put_ah() inside priv->lock

Thanks Shirley Ma IBM LTC
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
Roland, Yes, the lock sequences look right to me. What I found is that the ah is always available via the IPoIB neigh, so I can modify this patch like this:

in ipoib_send:
	if (unlikely(*to_ipoib_neigh(skb->dst->neighbour)))
		kref_get();

in ipoib completion:
	if (unlikely(*to_ipoib_neigh(skb->dst->neighbour)))
		kref_put();

Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
in ipoib send:
	if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour)))
		kref_get();

in ipoib completion:
	if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour)))
		ipoib_put_ah();

Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Roland, Thanks for the review comments. I will update these patches and test results. BTW it would be nice if you could figure out a way to fix your mail client to post patches inline without mangling them, or at least attach them with a mime type of text/plain or something. I will use my unix account to send out patches. Also, if you're interested, you could try the patch below and see how it does on your tests. Sure, I will test it after this weekend. Did you see send queue overrun with the tx_ring default size of 128? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Roland Dreier [EMAIL PROTECTED] wrote on 05/26/2006 04:20:02 PM: (Also the default TX ring size is 64, not 128, isn't it?) - R. Yes. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Roland, I made some mistakes while splitting these patches. Thanks for pointing that out. The reason I removed the cacheline is that I have tested and proved that it didn't help. Even though I introduced some locks in the code, the overall performance of all 7 patches I am posting here improves IPoIB by 20% - 80% unidirectional and doubles it bidirectional. As you mentioned, I need help to repolish these patches. I am glad that you gave all these valuable inputs on my patches. Thanks a lot. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Roland Dreier [EMAIL PROTECTED] wrote on 05/25/2006 04:28:27 PM: Shirley didn't help somehow. Even in some code, I induced some Shirley locks, the overall performance of all these 7 patches I Shirley am trying to post here could improve IPoIB from 20% - 80% Shirley unidirectional and doubled bidirectional. I'm guessing that all of that gain comes from letting the send and receive completion handlers run simultaneously. Is that right? For example how much of an improvement do you see if you just apply the patches you've posted -- that is, only 1/7, 2/7 and 3/7? - R. That's not true. I tested performance with 1/7 and 3/7 a couple of weeks ago and saw more than a 10% improvement. I never saw a send queue overrun with the tx_ring before on one cpu with one TCP stream. After removing the tx_ring, the send path is much faster and the default 128 is not big enough to handle it. That's the reason I have another patch to handle send queue overrun: requeue the packet at the head of the dev xmit queue, instead of the current implementation, which silently drops packets when the device driver's send queue is full and depends on TCP retransmission. The current behavior causes TCP fast retransmit, slow start, and out-of-order packets. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
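The overrun handling described above can be sketched as follows (a minimal model, not the driver code; names and the tiny ring size are illustrative): when the count of posted-but-uncompleted sends reaches the ring size, the xmit routine reports busy so the stack requeues the packet at the head instead of dropping it.

```c
#include <assert.h>

/* Toy model of send-queue overrun handling.  SENDQ_SIZE is deliberately
 * tiny for illustration; the driver default discussed above was 64/128. */
#define SENDQ_SIZE 4

enum { TX_OK = 0, TX_BUSY = 1 };

static unsigned tx_head, tx_tail;   /* producer / consumer indices */

static int hard_start_xmit(void)
{
    if (tx_head - tx_tail >= SENDQ_SIZE)
        return TX_BUSY;             /* caller requeues at the head */
    tx_head++;                      /* post one send work request */
    return TX_OK;
}

static void send_completion(void)
{
    if (tx_tail != tx_head)
        tx_tail++;                  /* one send finished; a slot frees up */
}
```

Returning busy (rather than dropping) lets the queueing layer retry the same packet in order once a completion frees a slot, avoiding the fast-retransmit/slow-start/reordering penalties described above.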
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Roland, Roland Dreier [EMAIL PROTECTED] wrote on 05/25/2006 09:24:01 AM: This also looks like a step backwards to me. You are replacing a cache-friendly array with a cache-unfriendly linked list, which also requires two more lock/unlock operations in the fast path. This patch removes one extra ring between the dev xmit queue and the device send queue, and removes the tx_lock in the completion handler. The whole purpose of the send_list and its lock is shutdown cleanup; otherwise we wouldn't need to maintain this list. And most likely when shutting down, after waiting 5*HZ, the list is empty. I could implement it differently, e.g. an RCU list that is cache-friendly. I didn't think it was worth it before since I didn't see the gain. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
[openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
	}

-	clear_bit(IPOIB_STOP_REAPER, &priv->flags);
-	queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ);
-
	set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags);

	return 0;
@@ -580,24 +540,6 @@ timeout:
	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
		ipoib_warn(priv, "Failed to modify QP to RESET state\n");

-	/* Wait for all AHs to be reaped */
-	set_bit(IPOIB_STOP_REAPER, &priv->flags);
-	cancel_delayed_work(&priv->ah_reap_task);
-	flush_workqueue(ipoib_workqueue);
-
-	begin = jiffies;
-
-	while (!list_empty(&priv->dead_ahs)) {
-		__ipoib_reap_ah(dev);
-
-		if (time_after(jiffies, begin + HZ)) {
-			ipoib_warn(priv, "timing out; will leak address handles\n");
-			break;
-		}
-
-		msleep(1);
-	}
-
	return 0;
}
diff -urpN infiniband-split-cq/ulp/ipoib/ipoib_main.c infiniband-ah/ulp/ipoib/ipoib_main.c
--- infiniband-split-cq/ulp/ipoib/ipoib_main.c	2006-05-22 08:48:47.0 -0700
+++ infiniband-ah/ulp/ipoib/ipoib_main.c	2006-05-23 09:31:49.0 -0700
@@ -957,7 +957,6 @@ static void ipoib_setup(struct net_devic
	INIT_WORK(&priv->mcast_task, ipoib_mcast_join_task, priv->dev);
	INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, priv->dev);
	INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, priv->dev);
-	INIT_WORK(&priv->ah_reap_task, ipoib_reap_ah, priv->dev);
}

struct ipoib_dev_priv *ipoib_intf_alloc(const char *name)

Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
Roland Dreier [EMAIL PROTECTED] wrote on 05/24/2006 01:50:37 PM: NAK to this patch. Not only is it a step backwards in performance -- you've essentially added two (expensive) atomic operations for every packet sent My observation is that the atomic operation is not that expensive. -- but the patch is actually wrong:

+	err = post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
+			address->ah, qpn, addr, skb->len);
+	kref_put(&address->ref, ipoib_free_ah);

The whole point of the complexity in AH handling in IPoIB is that AHs cannot be freed until the driver knows that all sends referring to them have _completed_. As you've written your patch, an AH can easily be freed before the HCA has a chance to execute the corresponding send request. - R. I thought the path holds another AH reference to prevent it from being freed? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
Roland, Roland Dreier [EMAIL PROTECTED] wrote on 05/24/2006 03:01:01 PM: Shirley Compared to having a single thread handling AHs, I don't Shirley think this atomic operation is expensive. But freeing AHs is something that happens infrequently and can be done asynchronously. You're replacing that cost with two atomic operations per sent packet! No, actually nothing was freed during sending in my test. Shirley It is true for unicast, it has a reference count before Shirley ipoib_send(). I need to look at multicast. But can you guarantee that the AH stays around until after the send completes (which could be an arbitrarily long delay)? - R. I checked neigh_add_path(); for unicast it is always true. See the code below.

static void neigh_add_path(..)
{
	...
	if (path->ah) {
		kref_get(&path->ah->ref);
		neigh->ah = path->ah;
		ipoib_send(dev, skb, path->ah...
	}

Please correct me if I am wrong. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
Roland Dreier [EMAIL PROTECTED] wrote on 05/24/2006 04:07:58 PM: To reiterate: freeing AHs is a rare, slow path operation that can be done asynchronously. It is not a good tradeoff to do two atomic_t operations for every sent packet, just to avoid occasionally reaping AHs in process context. I don't think two atomic operations are that expensive compared to reaping AHs in process context, according to the test results and profiling data. Or we can use RCU instead. But can you guarantee that the AH stays around until after the send completes (which could be an arbitrarily long delay)? I checked neigh_add_path(); for unicast it is always true. See the code below.

static void neigh_add_path(..)
{
	...
	if (path->ah) {
		kref_get(&path->ah->ref);
		neigh->ah = path->ah;
		ipoib_send(dev, skb, path->ah...
	}

Again, I don't understand how this is a response at all. The AH cannot be freed until after the send operation is actually fully completed, which could be a long time after ib_post_send() returns. If an AH is freed after ipoib_send() returns but before the send is executed, then the HCA may use stale data, which could lead to a send error. To summarize: the patch is broken (leads to incorrect lifetimes for AHs), and in any case makes the send fast path slower. - R. That's a valid point. This problem will be addressed in the next tx_ring removal patch: the kref_put is called in ipoib_ib_handle_send_wc(). Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
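The lifetime rule under discussion can be modeled with a toy reference count (a sketch with illustrative names standing in for kref_get()/kref_put(), not the kernel API): the send path takes its own reference before posting, and only the completion handler drops it, so even if the path drops its reference mid-flight the AH survives until the send completes.

```c
#include <assert.h>

/* Toy model of AH lifetime: the refcount hits zero only after the send
 * completion, never while the HCA may still read the address handle. */
struct toy_ah { int refcnt; int freed; };

static void ah_put(struct toy_ah *ah)
{
    if (--ah->refcnt == 0)
        ah->freed = 1;                /* destroy the AH */
}

/* Returns 1 iff the AH stayed alive across the path teardown and was
 * freed only by the completion handler. */
static int ah_lifetime_ok(void)
{
    struct toy_ah ah = { 1, 0 };      /* the path holds the initial ref */

    ah.refcnt++;                      /* send path: kref_get() before posting */
    ah_put(&ah);                      /* path flushed: drops its ref mid-flight */
    if (ah.freed)                     /* freed while the send is outstanding? */
        return 0;
    ah_put(&ah);                      /* completion handler: kref_put() */
    return ah.freed;                  /* now, and only now, it is freed */
}
```

Roland's objection to the posted patch was that the put happened right after ib_post_send() rather than in the completion handler, which collapses the window this model keeps open.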
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
Roland, My idea is to remove this AH reap thread. We can use RCU to do the same work without lots of coding. Do you agree? Also, in the AH reap code, tx_tail/tx_head isn't consistently protected by tx_lock; it uses priv->lock. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
Roland Dreier [EMAIL PROTECTED] wrote on 05/24/2006 05:11:12 PM: Shirley Roland, My idea is to remove this AH reap thread. We can Shirley use RCU to do the same work without lots of coding. Do Shirley you agree? No, I don't see how that will help. How does RCU know when it's safe to free an AH? With the tx_ring removal patch, RCU can be done. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- remove ah_reap
Roland Dreier [EMAIL PROTECTED] wrote on 05/24/2006 05:52:31 PM: Shirley With tx_ring removal patch, RCU can be done. OK, I guess I'll wait and see. But to be honest I don't see how RCU helps anything. - R. I am continuing to submit the tx_ring patch with the atomic operation for you to review; let's discuss the AH reap solution later. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
[openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
	 */
	priv->rx_ring = kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring,
				GFP_KERNEL);
	if (!priv->rx_ring) {
@@ -855,24 +853,11 @@ int ipoib_dev_init(struct net_device *de
		goto out;
	}

-	priv->tx_ring = kzalloc(ipoib_sendq_size * sizeof *priv->tx_ring,
-				GFP_KERNEL);
-	if (!priv->tx_ring) {
-		printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n",
-		       ca->name, ipoib_sendq_size);
-		goto out_rx_ring_cleanup;
-	}
-
-	/* priv->tx_head & tx_tail are already 0 */
-
	if (ipoib_ib_dev_init(dev, ca, port))
-		goto out_tx_ring_cleanup;
+		goto out_rx_ring_cleanup;

	return 0;

-out_tx_ring_cleanup:
-	kfree(priv->tx_ring);
-
 out_rx_ring_cleanup:
	kfree(priv->rx_ring);

@@ -896,10 +881,8 @@ void ipoib_dev_cleanup(struct net_device
	ipoib_ib_dev_cleanup(dev);

	kfree(priv->rx_ring);
-	kfree(priv->tx_ring);

	priv->rx_ring = NULL;
-	priv->tx_ring = NULL;
}

static void ipoib_setup(struct net_device *dev)
@@ -944,6 +927,7 @@ static void ipoib_setup(struct net_devic

	spin_lock_init(&priv->lock);
	spin_lock_init(&priv->tx_lock);
+	spin_lock_init(&priv->slist_lock);

	mutex_init(&priv->mcast_mutex);
	mutex_init(&priv->vlan_mutex);
@@ -952,6 +936,7 @@ static void ipoib_setup(struct net_devic
	INIT_LIST_HEAD(&priv->child_intfs);
	INIT_LIST_HEAD(&priv->dead_ahs);
	INIT_LIST_HEAD(&priv->multicast_list);
+	INIT_LIST_HEAD(&priv->send_list);

	INIT_WORK(&priv->pkey_task, ipoib_pkey_poll, priv->dev);
	INIT_WORK(&priv->mcast_task, ipoib_mcast_join_task, priv->dev);

Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Oops, I missed one pair of spin_lock_irqsave()/spin_unlock_irqrestore() to protect the send_list in ipoib_ib_handle_send_wc(). Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][1/7]ipoib performance patches -- split CQ
Roland Dreier [EMAIL PROTECTED] wrote on 05/23/2006 09:09:05 AM: Did you send the other 6 patches in this series? Yes, I am splitting these patches. I was waiting to comment until I had all the patches, but there is one really bad thing here:

+	IPOIB_NUM_SEND_WC = 32,

+void ipoib_ib_send_completion(struct ib_cq *cq, void *dev_ptr)
+{
+	struct net_device *dev = (struct net_device *) dev_ptr;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_wc ibwc[IPOIB_NUM_SEND_WC];

If I'm doing the math correctly, this function now uses more than 2K of stack, which is of course unacceptable. I don't think there's any way around keeping the wc array in the ipoib_dev_priv structure. - R. The stack is 4K now, not 8K anymore. I think we can still use IPOIB_NUM_SEND_WC as 4. I modified mthca_XXX_post_send to remove the lock entirely before (since the sender is exclusive), and found that the lock didn't impact performance much. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
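Roland's "more than 2K of stack" estimate checks out if sizeof(struct ib_wc) is taken as roughly 64 bytes, which is an assumption here, not a measured value for any particular kernel:

```c
#include <assert.h>
#include <stddef.h>

/* Back-of-envelope stack cost of an on-stack ib_wc array.  The 64-byte
 * element size is an assumed sizeof(struct ib_wc) for illustration; the
 * real size depends on the kernel version. */
#define ASSUMED_SIZEOF_IB_WC 64

static size_t wc_array_bytes(size_t num_wc)
{
    return num_wc * ASSUMED_SIZEOF_IB_WC;
}
```

With 32 entries that is about 2 KB, a large bite out of a 4 KB kernel stack; dropping IPOIB_NUM_SEND_WC to 4, as suggested in the reply, shrinks it to roughly 256 bytes.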
Re: [openib-general] different send and receive CQs
Eric, I have no problem with splitting the CQ; you can refer to my IPoIB CQ-splitting patch. Could you share your code here so we can give you some suggestions? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
[openib-general] Re: ipoib_reap_ah question
Roland Dreier [EMAIL PROTECTED] wrote on 05/22/2006 09:58:13 AM: I think you should keep your patches simple -- one idea per patch. So if you want to experiment with both tx_ring removal and the reap_ah removal, keep in mind that they should be merged as separate patches. So you should probably develop them that way. - R. I will, thanks. If I separate these two patches, I will have to make last_send an atomic_t in the tx_ring removal patch. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
[openib-general] [PATCH][0/7]ipoib performance patches
Hello Roland, Let me start submitting some of the performance patches one by one for review. These patches have been validated; more tests are still going on.

1. split the CQ and CQ handler into send/recv, and change the default NUM_WC value to a bigger size
2. requeue packets on send queue overrun
3. remove tx_ring
4. replace ipoib_reap_ah with kref_get()/kref_put()
5. remove rx_ring
6. move poll_cq from interrupt context to thread context, with multiple-thread support on both send and recv
7. tunable poll interval parameters to sync with the hardware driver

Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
[openib-general] ipoib_reap_ah question
Hello Roland, Is there any particular reason to use the ipoib_reap_ah thread? In my tx_ring removal patch, I tested without the ipoib_reap_ah work queue by simply adding kref_get()/kref_put() in ipoib_send(), and I didn't see any difference, including in performance. If there is no other risk, I will remove it to keep things simple. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] 2.6.17 and 2.6.18 merge plans
Roland, By all the data I have collected so far, I think it's not a good idea to have a while-loop poll_cq() under IB hardware interrupt context. poll_cq() is very expensive, and it increases other hardware's interrupt latency. If we move this out of hardware interrupt context, latency would be increased anyway. I have done lots of tests with the splitting-CQ + work queue on recv/send + remove-tx_ring patches over mthca. Both SMP and UP unidirectional throughput improves by 20% - 75% with or without tuning. Latency has increased by 4-10% on mthca. The interesting result is that UP performance is good. I used hyperthreaded CPUs running all these tests; I don't know whether that's the reason. If you think there is enough time to review these patches and a good chance of them being merged into 2.6.17/18, I will clean up and submit these patches ASAP, and test on ehca if a non-multi-threaded ehca is available. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general] ip over ib throughtput
Talpey, Thomas [EMAIL PROTECTED] wrote on 05/10/2006 03:53:04 AM: At 11:13 PM 5/9/2006, Shirley Ma wrote: Have you tried to send payload smaller than 2044? Any difference? You mean MTU or ULP payload? The default NFS reads and writes are 32KB, and in the addressing mode used in these tests they were broken into 8 page-sized RDMA ops. So, there were 9 ops from the server, per NFS read. I used the default MTU so these were probably 19 messages on the wire. I don't expect much difference with smaller MTU, but smaller NFS ops would be noticeable. Tom. I meant payload less than or equal to 2044, not IB MTU. IPoIB can only send <= 2044 bytes of payload per ib_post_send(). NFS/RDMA in this case sends 32KB per ib_post_send(). It would be nice to know the performance difference under the same payload for IPoIB over UD and NFS/RDMA. Is that possible? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
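The 2044-byte figure above follows from the IPoIB framing (assuming the usual 2048-byte IB UD MTU): the 4-byte IPoIB encapsulation header is carved out of each message.

```c
#include <assert.h>

/* Why IPoIB posts at most 2044 bytes of payload per send: a 2048-byte
 * IB MTU minus the 4-byte IPoIB encapsulation header.  Constants are
 * written out here for illustration. */
#define IB_MTU          2048
#define IPOIB_ENCAP_LEN 4

static int ipoib_max_payload(void)
{
    return IB_MTU - IPOIB_ENCAP_LEN;
}
```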
Re: [openib-general] ip over ib throughtput
Talpey, Thomas [EMAIL PROTECTED] wrote on 05/10/2006 03:10:57 PM: Sure, but I wonder why it's interesting. Nobody ever uses NFS in such small blocksizes, and 2044 bytes would mean, say, 1800 bytes of payload. What data are you looking for, throughput and overhead? Direct RDMA, or inline? Tom. Throughput. I am wondering how much room IPoIB performance (throughput) has to grow. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638