Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Shirley> This patch reduces one extra ring between dev xmit queue
Shirley> and device send queue and removes tx_lock in completion
Shirley> handler.

I don't think replacing the ring with a linked list is an improvement.
At best it works out to be the same.  But I hadn't noticed before that
you removed the tx_lock from the send completion handler.  I suspect
that this is where all the improvement is coming from.

Unfortunately it introduces a couple of race conditions.  For example,
if CPU1 is in ipoib_send() and CPU2 is running the send completion
handler, you can have:

    CPU1                                      CPU2
    err = post_send(priv, wr_id, address->ah,
                    qpn, addr, skb->len);
    // err is non-zero
    if (!err) {
    } else {
            if (!netif_queue_stopped(dev)) {
                                              // send completion handler
                                              // runs many times and drains
                                              // completion queue

                                              // queue is never stopped so
                                              // it never gets woken again
                                              if (netif_queue_stopped(dev))
                                                      netif_wake_queue(dev);
                    netif_stop_queue(dev);

    // no more sends are posted so another completion never
    // occurs and the queue stays stopped forever.

There are analogous races that cause spurious wakeups too.

However I think this is a good idea, and I think we can get around the
races by working around them in a way that doesn't hurt performance
(eg by having the netdev tx watchdog restart the queue in the scenario
above).

I cooked up a quick patch (below) but I didn't see any performance
improvement from it in my quick tests -- NPtcp peaks at ~3400 Mbit/sec
between my test systems with 2.6.17-rc5 both with and without this
patch.

Then I tried to apply your patches 1/7 and 3/7 to see if they were any
different on my setup, but the patches didn't apply.  3/7 had no
attachment, and it's hopeless to try to apply the patches you send
inline.  So can you resend up-to-date versions of the patches that
give you a 10% improvement?
(BTW it would be nice if you could figure out a way to fix your mail
client to post patches inline without mangling them, or at least
attach them with a mime type of text/plain or something)

Also, if you're interested, you could try the patch below and see how
it does on your tests.

 - R.

Here's my tx_lock removal test patch (it would need better comments
about all the races with tx_tail and some more careful review before
we actually apply it):

--- infiniband/ulp/ipoib/ipoib_main.c	(revision 7507)
+++ infiniband/ulp/ipoib/ipoib_main.c	(working copy)
@@ -634,6 +634,14 @@ static int ipoib_start_xmit(struct sk_bu
 		return NETDEV_TX_BUSY;
 	}
 
+	/*
+	 * Because tx_lock is not held when updating tx_tail in the
+	 * send completion handler, we may receive a spurious wakeup
+	 * that starts our queue when there really isn't space yet.
+	 */
+	if (unlikely(priv->tx_head - priv->tx_tail == ipoib_sendq_size))
+		return NETDEV_TX_BUSY;
+
 	if (skb->dst && skb->dst->neighbour) {
 		if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) {
 			ipoib_path_lookup(skb, dev);
@@ -703,6 +711,21 @@ static struct net_device_stats *ipoib_ge
 static void ipoib_timeout(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	unsigned long flags;
+	int lost_wakeup = 0;
+
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	if (netif_queue_stopped(dev) &&
+	    priv->tx_head - priv->tx_tail < ipoib_sendq_size) {
+		ipoib_dbg(priv, "lost wakeup, head %u, tail %u\n",
+			  priv->tx_head, priv->tx_tail);
+		lost_wakeup = 1;
+		netif_wake_queue(dev);
+	}
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
+
+	if (lost_wakeup)
+		return;
 
 	ipoib_warn(priv, "transmit timeout: latency %d msecs\n",
 		   jiffies_to_msecs(jiffies - dev->trans_start));

--- infiniband/ulp/ipoib/ipoib_ib.c	(revision 7511)
+++ infiniband/ulp/ipoib/ipoib_ib.c	(working copy)
@@ -244,7 +244,6 @@ static void ipoib_ib_handle_wc(struct ne
 	} else {
 		struct ipoib_tx_buf *tx_req;
-		unsigned long flags;
 
 		if (wr_id >= ipoib_sendq_size) {
 			ipoib_warn(priv, "completion event with wrid %d (> %d)\n",
@@ -266,12 +265,17 @@ static void ipoib_ib_handle_wc(struct ne
 		dev_kfree_skb_any(tx_req->skb);
 
-		spin_lock_irqsave(&priv->tx_lock, flags);
 		++priv->tx_tail;
+
+		/*
+		 * Since
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Roland,

Thanks for the review comments. I will update these patches and test
results.

> BTW it would be nice if you could figure out a way to fix your mail
> client to post patches inline without mangling them, or at least
> attach them with a mime type of text/plain or something.

I will use my unix account to send out patches.

> Also, if you're interested, you could try the patch below and see
> how it does on your tests.

Sure. I will test it after this weekend. Did you see send queue
overrun with the tx_ring default size of 128?

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Shirley> Sure. I will test it after this weekend. Did you see send
Shirley> queue overrun with tx_ring default size 128?

No, but my patch would still prevent send queue overruns.  It might
stop the netdevice queue because the send queue is full, but I didn't
turn on debugging to see that.

(Also, the default TX ring size is 64, not 128, isn't it?)

 - R.
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Roland Dreier [EMAIL PROTECTED] wrote on 05/26/2006 04:20:02 PM:

> (Also the default TX ring size is 64, not 128, isn't it?)

Yes.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
This also looks like a step backwards to me.  You are replacing a
cache-friendly array with a cache-unfriendly linked list, which also
requires two more lock/unlock operations in the fast path.

On a more general note, you are definitely generating some creative
ideas about IPoIB, but you need to work on presentation and
communication if you want your patches integrated.  For example, all
you said about this patch is:

> Here is the tx_ring removal patch for you to review.

If you want this to be merged then you need to provide a changelog
entry that explains _what_ the patch does, _how_ it does it, and most
importantly _why_ the patch is an improvement.  This is especially
important for these patches, which naively at least look like they are
making things worse.

Also, improving the quality of your patches would be helpful.  For
example, in this patch you delete the comments:

-	/*
-	 * We put the skb into the tx_ring _before_ we call post_send()
-	 * because it's entirely possible that the completion handler will
-	 * run before we execute anything after the post_send().  That
-	 * means we have to make sure everything is properly recorded and
-	 * our state is consistent before we call post_send().
-	 */

and then a few lines further down you create exactly the bug it
described:

+	err = post_send(priv, wr_id, address->ah, qpn, addr, skb->len);
+	if (!err) {
+		dev->trans_start = jiffies;
+		IPOIB_SKB_PRV_ADDR(skb) = addr;
+		IPOIB_SKB_PRV_AH(skb) = address;
+		IPOIB_SKB_PRV_SKB(skb) = skb;
+		spin_lock(&priv->slist_lock);
+		list_add_tail(&IPOIB_SKB_PRV_LIST(skb), &priv->send_list);
+		spin_unlock(&priv->slist_lock);
+		return;

This means I have to read your patches very carefully, because they
often introduce races or other bugs.

Finally, keeping strictly to the one-idea-per-patch rule and holding
back from the temptation to fiddle with unrelated things would be
nice.
In this case you have:

-	spinlock_t tx_lock ____cacheline_aligned_in_smp;
+	spinlock_t tx_lock;

That ____cacheline_aligned_in_smp annotation is something that you
added earlier in the patch series (with no explanation), and now
you're removing it (again with no explanation).  This sort of churn
just makes the patches bigger and harder to review and merge.

As I said, you are generating very creative ideas for IPoIB.  Perhaps
you can find someone to help you polish the presentation so that we
can use them?

Thanks,
  Roland
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Roland,

I made some mistakes when splitting these patches.  Thanks for
pointing that out.

The reason I removed the cacheline annotation is that I tested it and
found it didn't help.  Even though I introduced some extra locks in
the code, the overall effect of all seven patches I am trying to post
here improves IPoIB by 20% - 80% unidirectional, and doubles it
bidirectional.

As you mentioned, I need help to repolish these patches.  I am glad
you gave all these valuable inputs on my patches.  Thanks a lot.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Shirley> didn't help somehow. Even in some code, I induced some
Shirley> locks, the overall performance of all these 7 patches I
Shirley> am trying to post here could improve IPoIB from 20% - 80%
Shirley> unidirectional and doubled bidirectional.

I'm guessing that all of that gain comes from letting the send and
receive completion handlers run simultaneously.  Is that right?

For example, how much of an improvement do you see if you just apply
the patches you've posted -- that is, only 1/7, 2/7 and 3/7?

 - R.
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Roland Dreier [EMAIL PROTECTED] wrote on 05/25/2006 04:28:27 PM:

> I'm guessing that all of that gain comes from letting the send and
> receive completion handlers run simultaneously.  Is that right?
>
> For example, how much of an improvement do you see if you just apply
> the patches you've posted -- that is, only 1/7, 2/7 and 3/7?

That's not true.  I tested performance with 1/7 and 3/7 a couple of
weeks ago, and I saw more than a 10% improvement.

I never saw a send queue overrun with tx_ring before, on one CPU with
one TCP stream.  After removing tx_ring, the send path is much faster,
and the default of 128 is not big enough to handle it.  That's the
reason I have another patch to handle send queue overrun: it requeues
the packet to the head of the dev xmit queue, instead of the current
implementation, which silently drops packets when the device driver
send queue is full and depends on TCP retransmission.  The current
behavior causes TCP fast retransmit, slow start, and out-of-order
packets.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Roland,

Roland Dreier [EMAIL PROTECTED] wrote on 05/25/2006 09:24:01 AM:

> This also looks like a step backwards to me.  You are replacing a
> cache-friendly array with a cache-unfriendly linked list, which also
> requires two more lock/unlock operations in the fast path.

This patch reduces one extra ring between the dev xmit queue and the
device send queue, and removes tx_lock in the completion handler.  The
whole purpose of the send_list and its lock is cleanup at shutdown;
otherwise we wouldn't need to maintain this list.  And most likely,
when shutting down and waiting for 5*HZ, the list is empty.  I could
implement it differently, for example as a cache-friendly RCU list.  I
didn't think it was worth it before, since I didn't see the gain.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
[openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Roland,

Here is the tx_ring removal patch for you to review.

diff -urpN infiniband-ah/ulp/ipoib/ipoib.h infiniband-tx/ulp/ipoib/ipoib.h
--- infiniband-ah/ulp/ipoib/ipoib.h	2006-05-23 10:09:05.000000000 -0700
+++ infiniband-tx/ulp/ipoib/ipoib.h	2006-05-24 11:45:52.000000000 -0700
@@ -114,11 +114,19 @@ struct ipoib_rx_buf {
 	dma_addr_t	mapping;
 };
 
-struct ipoib_tx_buf {
-	struct sk_buff *skb;
-	DECLARE_PCI_UNMAP_ADDR(mapping)
+struct ipoib_skb_prv {
+	dma_addr_t	 addr;
+	struct ipoib_ah	*ah;
+	struct sk_buff	*skb;
+	struct list_head list;
 };
 
+#define IPOIB_SKB_PRV_ADDR(skb)	(((struct ipoib_skb_prv *)(skb)->cb)->addr)
+#define IPOIB_SKB_PRV_AH(skb)	(((struct ipoib_skb_prv *)(skb)->cb)->ah)
+#define IPOIB_SKB_PRV_SKB(skb)	(((struct ipoib_skb_prv *)(skb)->cb)->skb)
+#define IPOIB_SKB_PRV_LIST(skb)	(((struct ipoib_skb_prv *)(skb)->cb)->list)
+
 /*
  * Device private locking: tx_lock protects members used in TX fast
  * path (and we use LLTX so upper layers don't do extra locking).
@@ -166,12 +174,11 @@ struct ipoib_dev_priv {
 	struct ipoib_rx_buf *rx_ring;
 
-	spinlock_t	     tx_lock ____cacheline_aligned_in_smp;
-	struct ipoib_tx_buf *tx_ring;
-	unsigned	     tx_head;
-	unsigned	     tx_tail;
+	spinlock_t	     tx_lock;
 	struct ib_sge	     tx_sge;
 	struct ib_send_wr    tx_wr;
+	spinlock_t	     slist_lock;
+	struct list_head     send_list;
 
 	struct list_head dead_ahs;

diff -urpN infiniband-ah/ulp/ipoib/ipoib_ib.c infiniband-tx/ulp/ipoib/ipoib_ib.c
--- infiniband-ah/ulp/ipoib/ipoib_ib.c	2006-05-23 10:14:08.000000000 -0700
+++ infiniband-tx/ulp/ipoib/ipoib_ib.c	2006-05-24 11:58:46.000000000 -0700
@@ -243,45 +243,36 @@ static void ipoib_ib_handle_send_wc(stru
 					struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	unsigned int wr_id = wc->wr_id;
-	struct ipoib_tx_buf *tx_req;
-	unsigned long flags;
-
-	ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n",
-		       wr_id, wc->opcode, wc->status);
-
-	if (wr_id >= ipoib_sendq_size) {
-		ipoib_warn(priv, "completion event with wrid %d (> %d)\n",
-			   wr_id, ipoib_sendq_size);
-		return;
-	}
-
-	ipoib_dbg_data(priv, "send complete, wrid %d\n", wr_id);
+	struct sk_buff *skb;
+	unsigned long wr_id = wc->wr_id;
 
-	tx_req = &priv->tx_ring[wr_id];
-
-	dma_unmap_single(priv->ca->dma_device,
-			 pci_unmap_addr(tx_req, mapping),
-			 tx_req->skb->len,
-			 DMA_TO_DEVICE);
+	skb = (struct sk_buff *)wr_id;
+	kref_put(&IPOIB_SKB_PRV_AH(skb)->ref, ipoib_free_ah);
+	if (IS_ERR(skb) || skb != IPOIB_SKB_PRV_SKB(skb)) {
+		ipoib_warn(priv, "send completion event with corrupted wrid\n");
+		return;
+	}
+	list_del(&IPOIB_SKB_PRV_LIST(skb));
+
+	ipoib_dbg_data(priv, "send complete, wrid %lu\n", wr_id);
+
+	dma_unmap_single(priv->ca->dma_device,
+			 IPOIB_SKB_PRV_ADDR(skb),
+			 skb->len,
+			 DMA_TO_DEVICE);
 
 	++priv->stats.tx_packets;
-	priv->stats.tx_bytes += tx_req->skb->len;
-
-	dev_kfree_skb_any(tx_req->skb);
-
-	spin_lock_irqsave(&priv->tx_lock, flags);
-	++priv->tx_tail;
-	if (netif_queue_stopped(dev) &&
-	    priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1)
-		netif_wake_queue(dev);
-	spin_unlock_irqrestore(&priv->tx_lock, flags);
-
-	if (wc->status != IB_WC_SUCCESS &&
-	    wc->status != IB_WC_WR_FLUSH_ERR)
-		ipoib_warn(priv, "failed send event "
-			   "(status=%d, wrid=%d vend_err %x)\n",
-			   wc->status, wr_id, wc->vendor_err);
+	priv->stats.tx_bytes += skb->len;
+	dev_kfree_skb_any(skb);
+
+	if (netif_queue_stopped(dev))
+		netif_wake_queue(dev);
+	if (wc->status != IB_WC_SUCCESS &&
+	    wc->status != IB_WC_WR_FLUSH_ERR)
+		ipoib_warn(priv, "failed send event "
+			   "(status=%d, wrid=%lu vend_err %x)\n",
+			   wc->status, wr_id, wc->vendor_err);
 }
 
 void ipoib_ib_send_completion(struct ib_cq *cq, void *dev_ptr)
@@ -313,7 +304,7 @@ void ipoib_ib_recv_completion(struct ib_
 }
 
 static inline int post_send(struct ipoib_dev_priv *priv,
-			    unsigned int wr_id,
+			    unsigned long wr_id,
 			    struct ib_ah *address, u32 qpn,
 			    dma_addr_t addr, int len)
 {
@@ -333,8 +324,9 @@ void ipoib_send(struct net_device *dev,
 		struct ipoib_ah *address, u32 qpn)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ipoib_tx_buf *tx_req;
 	dma_addr_t addr;
+	unsigned long wr_id;
+	unsigned long flags;
 	int err;
 
 	kref_get(&address->ref);
@@ -350,38 +342,31 @@ void ipoib_send(struct net_device *dev,
 	ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n",
		       skb->len, address, qpn);
 
-	/*
-	 * We put the skb into the tx_ring _before_ we call post_send()
-	 * because it's entirely possible that the completion
Re: [openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring
Oops, I missed one pair of spin_lock_irqsave()/spin_unlock_irqrestore()
to protect send_list in ipoib_ib_handle_send_wc().

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638