Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
[EMAIL PROTECTED] wrote on 11/14/2006 03:18:23 PM:

Shirley> The rotting packet situation consistently happens for
Shirley> ehca driver. The napi could poll forever with your
Shirley> original patch. That's the reason I defer the rotting
Shirley> packet process in next napi poll.

Hmm, I don't see it.  In my latest patch, the poll routine does:

repoll:
	done  = 0;
	empty = 0;

	while (max) {
		t = min(IPOIB_NUM_WC, max);
		n = ib_poll_cq(priv->cq, t, priv->ibwc);

		for (i = 0; i < n; ++i) {
			if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
				++done;
				--max;
				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
			} else
				ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
		}

		if (n != t) {
			empty = 1;
			break;
		}
	}

	dev->quota -= done;
	*budget    -= done;

	if (empty) {
		netif_rx_complete(dev);
		if (unlikely(ib_req_notify_cq(priv->cq,
					      IB_CQ_NEXT_COMP |
					      IB_CQ_REPORT_MISSED_EVENTS)) &&
		    netif_rx_reschedule(dev, 0))
			goto repoll;

		return 0;
	}

	return 1;

so every receive completion will count against the limit set by the
variable max.  The only way I could see the driver staying in the poll
routine for a long time would be if it was only processing send
completions, but even that doesn't actually seem bad: the driver is
making progress handling completions.

Is it possible that when one gets into the rotting packet case, the
quota is at or close to 0 (on ehca)?  If in that case it is 0 and the
netif_rx_reschedule() case wins (over netif_rx_schedule()), then it
keeps spinning, unable to process any packets, since the undo
parameter for netif_rx_reschedule() is 0.  If netif_rx_reschedule()
keeps winning for a few iterations, then the receive queues fill up
and start dropping packets, causing a loss in performance.

If this is indeed the case, then one option to try may be to change
the undo parameter of netif_rx_reschedule() to either IPOIB_NUM_WC or
even dev->weight.

Shirley> It does help the performance from 1XXMb/s to 7XXMb/s, but
Shirley> not as expected 3XXXMb/s.

Is that 3xxx Mb/sec the performance you see without the NAPI patch?

Shirley> With the defer rotting packet process patch, I can see
Shirley> packets out of order problem in TCP layer.  Is it
Shirley> possible there is a race somewhere causing two napi polls
Shirley> at the same time?  mthca seems to use irq auto affinity,
Shirley> but ehca uses round-robin interrupt.

I don't see how two NAPI polls could run at once, and I would expect
worse effects from them stepping on each other than just out-of-order
packets.

However, the fact that ehca does round-robin interrupt handling might
lead to out-of-order packets just because different CPUs are all
feeding packets into the network stack.

 - R.

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Quoting r. Roland Dreier [EMAIL PROTECTED]:

> I would really like to understand why ehca does worse with NAPI.  In
> my tests both mthca and ipath exhibit various degrees of improvement
> depending on the test -- but I've never seen performance get worse.
> This is the main thing holding back merging NAPI.

Documentation/networking/NAPI_HOWTO.txt says:

	APPENDIX 3: Scheduling issues

	As seen NAPI moves processing to softirq level.  Linux uses the
	ksoftirqd as the general solution to schedule softirq's to run
	before next interrupt and by putting them under scheduler
	control.  Also this prevents consecutive softirq's from
	monopolizing the CPU.  This also has the effect that the
	priority of ksoftirqd needs to be considered when running very
	CPU-intensive applications and networking to get the proper
	softirq/user balance.  Increasing ksoftirqd priority to 0
	(eventually more) is reported to cure problems with low network
	performance at high CPU load.

So I wonder:

1. Was this tried?  It's clear that we have high CPU load.
2. Could this be the reason that e.g. e1000 disables NAPI by default?

The issue seems sufficiently tricky that we may yet find ourselves
debugging NAPI performance problems in the field.  Maybe we still need
a module option ...

-- 
MST
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Pradeep> Is it possible that when one gets into the rotting
Pradeep> packet case, the quota is at or close to 0 (on ehca)?  If
Pradeep> in that case it is 0 and the netif_rx_reschedule() case
Pradeep> wins (over netif_rx_schedule()) then it keeps spinning,
Pradeep> unable to process any packets, since the undo parameter
Pradeep> for netif_rx_reschedule() is 0.

It is possible that the quota is close to 0, but I don't see how the
poll routine could spin with quota (the variable max) equal to 0.  If
max is 0, then the while (max) loop will never be entered, empty will
remain 0, and the poll routine will simply fall through and return 1.
Do you agree with that summary?

We don't want the undo parameter of netif_rx_reschedule() to be
non-zero, because when we go back to repoll, done is reset to 0.  So
there's no reason to increase the quota again.

I guess you could instrument how many iterations there are with a
small value of max, but I would assume it's self-limiting, since the
last few completions should appear fairly quickly.

 - R.
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Shirley> What I have found in ehca driver, n != t doesn't mean
Shirley> it's empty.  If I poll again, there are still some
Shirley> packets in the cq.  IB_CQ_REPORT_MISSED_EVENTS most of
Shirley> the time reports 1.  It relies on netif_rx_reschedule()
Shirley> returning 0 to exit napi poll.  That might be the reason
Shirley> it stays in the poll routine for a long time?  I will
Shirley> rerun my test to use n != 0 to see any difference here.

Maybe there's an ehca bug in poll CQ?  If n != t then it should mean
that the CQ was indeed drained.

I would expect a missed event to be rare, because it means a
completion occurs between the last poll CQ and the request notify, and
that shouldn't be that common...  My rough estimate is that even at a
higher throughput than what you're seeing, IPoIB should only generate
~ 500K completions/sec, which means the average delay between
completions is 2 microseconds.  So I wouldn't expect completions to
hit the window between poll and request notify that often.

 - R.
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 11/16/2006 11:26:31 AM:

Shirley> What I have found in ehca driver, n != t doesn't mean
Shirley> it's empty.  If I poll again, there are still some
Shirley> packets in the cq.  IB_CQ_REPORT_MISSED_EVENTS most of
Shirley> the time reports 1.  It relies on netif_rx_reschedule()
Shirley> returning 0 to exit napi poll.  That might be the reason
Shirley> it stays in the poll routine for a long time?  I will
Shirley> rerun my test to use n != 0 to see any difference here.

> Maybe there's an ehca bug in poll CQ?  If n != t then it should mean
> that the CQ was indeed drained.
>
> I would expect a missed event to be rare, because it means a
> completion occurs between the last poll CQ and the request notify,
> and that shouldn't be that common...  My rough estimate is that even
> at a higher throughput than what you're seeing, IPoIB should only
> generate ~ 500K completions/sec, which means the average delay
> between completions is 2 microseconds.  So I wouldn't expect
> completions to hit the window between poll and request notify that
> often.
>
>  - R.

I have tried low_latency = 1 to disable TCP prequeue; the throughput
increased from 1XXMb/s to 4XXMb/s.  If I delayed netif_receive_skb() a
little bit, I could get around 1700Mb/s.  If I totally disable
netif_rx_reschedule(), so there is no repoll and it returns 0, I could
get around 2900Mb/s throughput without seeing the packet out of order
issues.  I have tried to add a spin lock in ipoib_poll(), and I still
see packets out of order.

	disable prequeue:            2XXMb/s to 4XXMb/s   (packets out of order)
	slow down netif_receive_skb: 17XXMb/s             (packets out of order)
	don't handle missed event:   28XXMb/s             (no packets out of order)
	handle missed event later:   7XXMb/s to 11XXMb/s  (packets out of order)

Maybe the ehca driver delivers packets much faster?  Which makes me
think the user process's TCP backlog queue / prequeue might be getting
out of order?

Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 11/14/2006 03:18:23 PM:

> Shirley> The rotting packet situation consistently happens for
> Shirley> ehca driver. The napi could poll forever with your
> Shirley> original patch. That's the reason I defer the rotting
> Shirley> packet process in next napi poll.
>
> Hmm, I don't see it.  In my latest patch, the poll routine does:
>
> repoll:
> 	done  = 0;
> 	empty = 0;
>
> 	while (max) {
> 		t = min(IPOIB_NUM_WC, max);
> 		n = ib_poll_cq(priv->cq, t, priv->ibwc);
>
> 		for (i = 0; i < n; ++i) {
> 			if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
> 				++done;
> 				--max;
> 				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
> 			} else
> 				ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
> 		}
>
> 		if (n != t) {
> 			empty = 1;
> 			break;
> 		}
> 	}
>
> 	dev->quota -= done;
> 	*budget    -= done;
>
> 	if (empty) {
> 		netif_rx_complete(dev);
> 		if (unlikely(ib_req_notify_cq(priv->cq,
> 					      IB_CQ_NEXT_COMP |
> 					      IB_CQ_REPORT_MISSED_EVENTS)) &&
> 		    netif_rx_reschedule(dev, 0))
> 			goto repoll;
>
> 		return 0;
> 	}
>
> 	return 1;
>
> so every receive completion will count against the limit set by the
> variable max.  The only way I could see the driver staying in the
> poll routine for a long time would be if it was only processing send
> completions, but even that doesn't actually seem bad: the driver is
> making progress handling completions.

What I have found in ehca driver, n != t doesn't mean it's empty.  If
I poll again, there are still some packets in the cq.
IB_CQ_REPORT_MISSED_EVENTS most of the time reports 1.  It relies on
netif_rx_reschedule() returning 0 to exit napi poll.  That might be
the reason it stays in the poll routine for a long time?  I will rerun
my test to use n != 0 to see any difference here.

> Shirley> It does help the performance from 1XXMb/s to 7XXMb/s, but
> Shirley> not as expected 3XXXMb/s.
>
> Is that 3xxx Mb/sec the performance you see without the NAPI patch?

Without the NAPI patch, in my test environment ehca can gain around
2800Mb/s to 3000Mb/s throughput.

> Shirley> With the defer rotting packet process patch, I can see
> Shirley> packets out of order problem in TCP layer.  Is it
> Shirley> possible there is a race somewhere causing two napi polls
> Shirley> at the same time?  mthca seems to use irq auto affinity,
> Shirley> but ehca uses round-robin interrupt.
>
> I don't see how two NAPI polls could run at once, and I would expect
> worse effects from them stepping on each other than just
> out-of-order packets.
>
> However, the fact that ehca does round-robin interrupt handling
> might lead to out-of-order packets just because different CPUs are
> all feeding packets into the network stack.
>
>  - R.

Normally for NAPI there should be only one poll running at a time, and
NAPI processes packets all the way to the TCP layer one by one
(netif_receive_skb()).  So it shouldn't lead to out-of-order packets
even with round-robin interrupt handling in NAPI.  I am still
investigating this.

Thanks
Shirley
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
> I will rerun my test to use n != 0 to see any difference here.

It should be n == 0 to indicate empty.

Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 11/13/2006 08:45:52 AM:

> Shirley> Sorry, I did not intend to send the previous email; I
> Shirley> accidentally sent it out.  What I thought was there
> Shirley> would be a problem if missed_event always returns 1.
> Shirley> Then this napi poll would keep going forever.
>
> Well, it's limited by the quota that the net stack gives it, so
> there's no possibility of looping forever.  However
>
> Shirley> How about deferring the rotting packets process?  Like
> Shirley> this:
>
> that seems like it is still correct.
>
> Shirley> With this patch, I could get NAPI + non scaling code
> Shirley> throughput performance from 1XXMb/s to 7XXMb/s, anyway
> Shirley> there are some other problems I am still investigating
> Shirley> now.
>
> But I wonder why it gives you a factor of 4 in performance??  Why
> does it make a difference?  I would have thought that the rotting
> packet situation would be rare enough that it doesn't really matter
> for performance exactly how we handle it.
>
> What are the other problems you're investigating?
>
>  - R.

The rotting packet situation consistently happens for the ehca driver.
The napi could poll forever with your original patch.  That's the
reason I defer the rotting packet process to the next napi poll.  It
does help the performance from 1XXMb/s to 7XXMb/s, but not as expected
3XXXMb/s.

With the defer rotting packet process patch, I can see a packets out
of order problem in the TCP layer.  Is it possible there is a race
somewhere causing two napi polls at the same time?  mthca seems to use
irq auto affinity, but ehca uses round-robin interrupt.

Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland,

I think a barrier might be needed in checking the __LINK_STATE_SCHED
state, like smp_mb__before_clear_bit() and smp_mb__after_clear_bit();
otherwise the netif_rx_reschedule() for the rotting packet and the
next interrupt's netif_rx_schedule() could be running at the same
time.  If the interrupt is round-robin fashion, then packets are going
to be out of order in the TCP layer.  I will test it out once I have
the resource.  What do you think?

Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland,

Ignore my previous email; test_and_set_bit() is an atomic operation
and already has the memory barrier.

Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
From the code walkthrough: if deferring the rotting packet process by

	return (missed_event && netif_rx_reschedule(dev, 0));

then the same dev->poll can be added to the per-cpu poll list twice:
once from netif_rx_reschedule(), and once from the napi return 1.
That might explain the packets out of order: one poll finishes and
resets the __LINK_STATE_SCHED bit while the next interrupt runs on
another cpu.

Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Shirley> The rotting packet situation consistently happens for
Shirley> ehca driver. The napi could poll forever with your
Shirley> original patch. That's the reason I defer the rotting
Shirley> packet process in next napi poll.

Hmm, I don't see it.  In my latest patch, the poll routine does:

repoll:
	done  = 0;
	empty = 0;

	while (max) {
		t = min(IPOIB_NUM_WC, max);
		n = ib_poll_cq(priv->cq, t, priv->ibwc);

		for (i = 0; i < n; ++i) {
			if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
				++done;
				--max;
				ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
			} else
				ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
		}

		if (n != t) {
			empty = 1;
			break;
		}
	}

	dev->quota -= done;
	*budget    -= done;

	if (empty) {
		netif_rx_complete(dev);
		if (unlikely(ib_req_notify_cq(priv->cq,
					      IB_CQ_NEXT_COMP |
					      IB_CQ_REPORT_MISSED_EVENTS)) &&
		    netif_rx_reschedule(dev, 0))
			goto repoll;

		return 0;
	}

	return 1;

so every receive completion will count against the limit set by the
variable max.  The only way I could see the driver staying in the poll
routine for a long time would be if it was only processing send
completions, but even that doesn't actually seem bad: the driver is
making progress handling completions.

Shirley> It does help the performance from 1XXMb/s to 7XXMb/s, but
Shirley> not as expected 3XXXMb/s.

Is that 3xxx Mb/sec the performance you see without the NAPI patch?

Shirley> With the defer rotting packet process patch, I can see
Shirley> packets out of order problem in TCP layer.  Is it
Shirley> possible there is a race somewhere causing two napi polls
Shirley> at the same time?  mthca seems to use irq auto affinity,
Shirley> but ehca uses round-robin interrupt.

I don't see how two NAPI polls could run at once, and I would expect
worse effects from them stepping on each other than just out-of-order
packets.

However, the fact that ehca does round-robin interrupt handling might
lead to out-of-order packets just because different CPUs are all
feeding packets into the network stack.

 - R.
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Shirley> From the code walkthrough: if deferring the rotting
Shirley> packet process by return (missed_event &&
Shirley> netif_rx_reschedule(dev, 0)); then the same dev->poll can
Shirley> be added to the per-cpu poll list twice: once from
Shirley> netif_rx_reschedule, once from the napi return 1.  That
Shirley> might explain packets out of order: one poll finishes and
Shirley> resets the __LINK_STATE_SCHED bit while the next
Shirley> interrupt runs on another cpu.

I don't think so.  It's completely normal for dev->poll() to return 1
when there's more work to be done, so the networking core will just
move the device to the tail of the poll list.  So I don't see why it
would make a difference if we actually do any work after
netif_rx_reschedule() or not.

On the other hand I still don't see why it helps to drop out of the
poll routine immediately even though we know there is more work to be
done, and the networking stack has told us it could handle more
packets.

 - R.
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Shirley> Sorry, I did not intend to send the previous email; I
Shirley> accidentally sent it out.  What I thought was there would
Shirley> be a problem if missed_event always returns 1.  Then this
Shirley> napi poll would keep going forever.

Well, it's limited by the quota that the net stack gives it, so
there's no possibility of looping forever.  However

Shirley> How about deferring the rotting packets process?  Like
Shirley> this:

that seems like it is still correct.

Shirley> With this patch, I could get NAPI + non scaling code
Shirley> throughput performance from 1XXMb/s to 7XXMb/s, anyway
Shirley> there are some other problems I am still investigating
Shirley> now.

But I wonder why it gives you a factor of 4 in performance??  Why does
it make a difference?  I would have thought that the rotting packet
situation would be rare enough that it doesn't really matter for
performance exactly how we handle it.

What are the other problems you're investigating?

 - R.
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
I think it has to stay the way I wrote it.  Your version:

+	if (empty) {
+		ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP, &missed_event);
+		if (unlikely(missed_event) && netif_rx_reschedule(dev, 0))
+			goto repoll;
+		netif_rx_complete(dev);
+
+		return 0;
+	}

has a race: suppose missed_event is 0 but an event _is_ generated
right before the call to netif_rx_complete().  Then the interrupt
handler might run before the call to netif_rx_complete(), try to
schedule the NAPI poll, but end up doing nothing because the poll
routine is still running.  Then the poll routine will call
netif_rx_complete() and return 0, so it won't get called again ever
(because the CQ event has already fired).  And so the interface will
hang and never make any more progress.

I would really like to understand why ehca does worse with NAPI.  In
my tests both mthca and ipath exhibit various degrees of improvement
depending on the test -- but I've never seen performance get worse.
This is the main thing holding back merging NAPI.

Does the NAPI patch help mthca on pSeries?  I wonder if it's not ehca,
but rather that there's some ppc64 quirk that makes NAPI a lot more
expensive.

 - R.
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
> I would really like to understand why ehca does worse with NAPI.  In
> my tests both mthca and ipath exhibit various degrees of improvement
> depending on the test -- but I've never seen performance get worse.
> This is the main thing holding back merging NAPI.
>
> Does the NAPI patch help mthca on pSeries?  I wonder if it's not
> ehca, but rather that there's some ppc64 quirk that makes NAPI a lot
> more expensive.
>
>  - R.

Got your point.  Sorry, I haven't made any big progress yet.  What I
have found so far in the non scaling code: if I always set
missed_event = 0 without peeking for a rotting packet, then NAPI will
increase the performance and reduce the cpu utilization.  That's the
reason I suggested the above change.  I haven't found the reason for
the scaling code dropping 2/3 of the performance yet.

The NAPI touch test for mthca on POWER performs well, so I don't think
it's a ppc64 issue.

Thanks
Shirley
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 11/10/2006 07:00:46 AM:

> I think it has to stay the way I wrote it.  Your version:

+	if (empty)
+		return (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP |
+					 IB_CQ_REPORT_MISSED_EVENTS) &&
+			netif_rx_reschedule(dev, 0));
+

Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland,

Sorry, I did not intend to send the previous email; I accidentally
sent it out.  What I thought was there would be a problem if
missed_event always returns 1.  Then this napi poll would keep going
forever.  How about deferring the rotting packets process?  Like this:

+	if (empty)
+		return (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP |
+					 IB_CQ_REPORT_MISSED_EVENTS) &&
+			netif_rx_reschedule(dev, 0));
+

With this patch, I could get NAPI + non scaling code throughput
performance from 1XXMb/s to 7XXMb/s; anyway there are some other
problems I am still investigating now.

Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 10/19/2006 09:10:35 PM:

> I looked over my code again, and I don't see anything obviously
> wrong, but it's quite possible I made a mistake that I just can't
> see right now (like reversing a truth value somewhere).  Someone who
> knows how ehca works might be able to spot the error.
>
>  - R.

Roland,

Your code is OK.  I just found the problem here.

+	if (empty) {
+		netif_rx_complete(dev);
+		ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP, &missed_event);
+		if (unlikely(missed_event) && netif_rx_reschedule(dev, 0))
+			goto repoll;
+
+		return 0;
+	}

netif_rx_complete() should be called right before the return.  It does
improve non scaling performance with this patch, but reduces scaling
performance.

+	if (empty) {
+		ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP, &missed_event);
+		if (unlikely(missed_event) && netif_rx_reschedule(dev, 0))
+			goto repoll;
+		netif_rx_complete(dev);
+
+		return 0;
+	}

Any other reason for calling netif_rx_complete() while still possibly
within napi?

Thanks
Shirley Ma
IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Quoting r. Shirley Ma [EMAIL PROTECTED]:
Subject: Re: [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()

> Michael S. Tsirkin [EMAIL PROTECTED] wrote on 10/19/2006 01:21:45 PM:
>
> > Please also note that due to factors such as TCP window limits, TX
> > on a single socket is often stalled.  To really stress a
> > connection and see benefit from NAPI you should be running
> > multiple socket streams in parallel: either just run multiple
> > instances of netperf/netserver, or use iperf with the -P flag.
>
> I used to get 7600Mb/s IPoIB one socket duplex throughput with my
> other IPoIB patches on the 2.6.5 kernel under a certain
> configuration, which makes me believe we could gain close to link
> throughput with one UD QP.  Now I couldn't get it anymore on the new
> kernel.  I was struggling with TCP window limits on the new kernel.
> Do you have any hint?

Could be the stretch ACK fix - newer kernels are sending much more
ACKs than 2.6.5.  Without NAPI, this means we have more interrupts -
lower throughput.

-- 
MST
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Retested several times: this hack patch only fixed the non scaling
code.  I thought I tested both scaling and non scaling; it seems I
made a mistake -- I might have configured and tested the non scaling
configuration twice in the previous run.

Thanks
Shirley Ma
IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Michael S. Tsirkin [EMAIL PROTECTED] wrote on 10/19/2006 01:21:45 PM:

> Please also note that due to factors such as TCP window limits, TX
> on a single socket is often stalled.  To really stress a connection
> and see benefit from NAPI you should be running multiple socket
> streams in parallel: either just run multiple instances of
> netperf/netserver, or use iperf with the -P flag.

I used to get 7600Mb/s IPoIB one socket duplex throughput with my
other IPoIB patches on the 2.6.5 kernel under a certain configuration,
which makes me believe we could gain close to link throughput with one
UD QP.  Now I couldn't get it anymore on the new kernel.  I was
struggling with TCP window limits on the new kernel.  Do you have any
hint?

Thanks
Shirley Ma
IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Quoting r. Shirley Ma [EMAIL PROTECTED]:
Subject: Re: [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()

> Roland Dreier [EMAIL PROTECTED] wrote on 10/18/2006 01:55:13 PM:
>
> > I would like to understand why there's a throughput difference
> > with scaling turned off, since the NAPI code doesn't change the
> > interrupt handling all that much, and should lower the CPU usage
> > if anything.
>
> That's what I am trying to understand now.  Yes, the send side rate
> dropped significantly, with lower cpu usage as well.

I think it's a TCP configuration issue in your setup.  With NAPI, we
seem to be getting stable high results as reported previously by Eli.
Hope to complete testing and report next week.

Shirley, can you please post your test setup and results?

Some ideas:

Please note that you need to apply the NAPI patch on both send and
recv side in a stream benchmark, otherwise one side will be a
bottleneck.

Please also note that due to factors such as TCP window limits, TX on
a single socket is often stalled.  To really stress a connection and
see benefit from NAPI you should be running multiple socket streams in
parallel: either just run multiple instances of netperf/netserver, or
use iperf with the -P flag.

You also should look at the effect of increasing the send/recv socket
buffer size.

Finally, tuning the RX/TX ring size should also be done differently:
you might be over-running your queues, so make them bigger for NAPI.

-- 
MST
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Thanks Michael for all these tips.  I have tried several of the
suggestions you proposed here.  I couldn't see any better performance.
TCP_RR dropped to 472 trans/s from about 18,000 trans/s, and
TCP_STREAM BW dropped to 1/3 of before (ehca + scaling code) with the
same TCP configuration, send queue size = recv queue size = 1K.

Thanks
Shirley Ma
IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
OK, as promised I redid the request notify patches according to
Michael's suggestion to add a new flag.  I think I like this a lot
better -- I'll send out the new patches as replies to this email for
comments.

 - R.
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland, I have applied this patch and updated patch 2/2. You will send out an updated patch 2/2, I think. I did some extra modifications in the ipoib code (which add more extra repolls). I now see around 10% or more performance improvement with this change, on both the scaling and non-scaling code. I will run oprofile tomorrow to see the difference. I think with these extra repolls, the CPU utilization will be much higher.

Thanks
Shirley Ma
IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
> I have applied this patch and updated patch 2/2. You will send out
> an updated patch 2/2, I think.

Sorry, messed that up. I just sent out the patch.

> I did some extra modifications in the ipoib code (which add more
> extra repolls). I now see around 10% or more performance improvement
> with this change, on both the scaling and non-scaling code. I will
> run oprofile tomorrow to see the difference. I think with these
> extra repolls, the CPU utilization will be much higher.

You mean you added more calls to ib_poll_cq()? Where did you add them? Why does it help?

- R.
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 10/19/2006 07:39:25 PM:
> > I have applied this patch and updated patch 2/2. You will send out
> > an updated patch 2/2, I think.
> Sorry, messed that up. I just sent out the patch.

No problem, I made the same change.

> You mean you added more calls to ib_poll_cq()? Where did you add
> them? Why does it help?
> - R.

I ran out of ideas about why we were losing 2/3 of the throughput and getting only 476 trans/s. So I assumed there was always a missed event; then ipoib stays in its NAPI poll for its whole scheduled time. That's why it helps. This is really a hack and doesn't address the problem: it sacrifices CPU utilization to gain the performance back. I need to understand how ehca reports a missed event; there might be some delay there?

Thanks
Shirley Ma
IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
> I ran out of ideas about why we were losing 2/3 of the throughput
> and getting only 476 trans/s. So I assumed there was always a missed
> event; then ipoib stays in its NAPI poll for its whole scheduled
> time. That's why it helps. This is really a hack and doesn't address
> the problem: it sacrifices CPU utilization to gain the performance
> back. I need to understand how ehca reports a missed event; there
> might be some delay there?

It's entirely possible that my implementation of the missed event hint in ehca is wrong. I just guessed based on how poll CQ is implemented -- if the consumer requests a hint about missed events, then I lock the CQ and check whether it is empty after requesting notification. I looked over my code again, and I don't see anything obviously wrong, but it's quite possible I made a mistake that I just can't see right now (like reversing a truth value somewhere). Someone who knows how ehca works might be able to spot the error.

- R.
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 10/19/2006 09:10:35 PM:
> It's entirely possible that my implementation of the missed event
> hint in ehca is wrong. I just guessed based on how poll CQ is
> implemented -- if the consumer requests a hint about missed events,
> then I lock the CQ and check whether it is empty after requesting
> notification. I looked over my code again, and I don't see anything
> obviously wrong, but it's quite possible I made a mistake that I
> just can't see right now (like reversing a truth value somewhere).
> Someone who knows how ehca works might be able to spot the error.
> - R.

The oprofile data (with your NAPI + this hack patch) looks good; it reduced CPU utilization significantly. (I was wrong about CPU utilization.) I will talk with the ehca team regarding this missed event hint patch on ehca.

Thanks
Shirley Ma
IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 10/17/2006 08:41:59 PM:
> Anyway, I'm eagerly awaiting your NAPI results with ehca.
> Thanks, Roland

Thanks. The touch test results are not good. This NAPI patch induces huge latency for the ehca driver scaling code, and the throughput performance is not good. (I am not fully convinced the huge latency is because of raising NAPI in thread context.) Then I tried the ehca non-scaling driver; the latency looks good, but the throughput is still a problem. We are working on these issues. Hopefully we can get the answer soon.

Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
> Thanks. The touch test results are not good. This NAPI patch induces
> huge latency for the ehca driver scaling code, and the throughput
> performance is not good. (I am not fully convinced the huge latency
> is because of raising NAPI in thread context.) Then I tried the ehca
> non-scaling driver; the latency looks good, but the throughput is
> still a problem. We are working on these issues. Hopefully we can
> get the answer soon.

Hmm, the results with scaling on are not that unexpected, since the idea of scheduling a thread round-robin (which kills all cache locality) is pretty dubious anyway. I would like to understand why there's a throughput difference with scaling turned off, since the NAPI code doesn't change the interrupt handling all that much, and should lower the CPU usage if anything.

Does changing the netdev weight value affect anything?

- R.
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Roland Dreier [EMAIL PROTECTED] wrote on 10/18/2006 01:55:13 PM:
> I would like to understand why there's a throughput difference with
> scaling turned off, since the NAPI code doesn't change the interrupt
> handling all that much, and should lower the CPU usage if anything.

That's what I am trying to understand now. Yes, the send-side rate dropped significantly; CPU usage is lower as well.

> Does changing the netdev weight value affect anything?
> - R.

No, it doesn't.

Thanks
Shirley Ma
IBM Linux Technology Center
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Hi, Roland,

There were a couple of errors and a warning when I applied this patch to OFED-1.1-rc7:

1. ehca_req_notify_cq() in ehca_iverbs.h is not updated.
2. *maybe_missed_event = ipz_qeit_is_valid(my_cq->ipz_queue) should be *maybe_missed_event = ipz_qeit_is_valid(&my_cq->ipz_queue).
3. There is a compile warning on this line: return cqe_flags >> 7 == queue->toggle_state & 1;

Thanks
Shirley Ma
Re: [openib-general] [PATCH/RFC 1/2] IB: Return maybe_missed_event hint from ib_req_notify_cq()
Sorry, I just noticed my cross-compilation test setup was messed up, so I never actually built the modified ehca, even though I thought I did. Anyway, the patch below on top of what I sent out should fix everything up. I've also merged this into my ipoib-napi branch, so what's there should be OK for ehca now.

Anyway, I'm eagerly awaiting your NAPI results with ehca.

Thanks,
Roland