Hi Shirley,
On Wed, Apr 19, 2006 at 11:31:32AM -0700, Shirley Ma wrote:
...
By moving netperf RX traffic off the CPU handling interrupts,
the 1.5 GHz ia64 box goes from 2.8 Gb/s to around 3.5 Gb/s.
But the service demand (CPU time per KB of payload) goes up
from ~2.3 usec/KB to ~3.1 usec/KB.
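For anyone wanting to reproduce this: the pinning Grant describes can be
done with taskset from the shell, or directly with sched_setaffinity().
A minimal userspace sketch of my own (the CPU number is just an example):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin this process (e.g. netperf) to one CPU so the RX path runs
 * away from the CPU that services the HCA interrupts. */
int main(int argc, char *argv[])
{
        cpu_set_t mask;
        int cpu = argc > 1 ? atoi(argv[1]) : 1; /* example: CPU 1 */

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
                perror("sched_setaffinity");
                return 1;
        }
        /* from here on, run the benchmark loop pinned to 'cpu' */
        return 0;
}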
Hello Grant,
Grant Grundler [EMAIL PROTECTED] wrote on 04/20/2006
08:16:27 AM:
Was this measured using ehca?
If so, the result implies at least two interrupt vectors are used.
And it seems reasonable for IPoIB to tune for that even if it
costs mthca a slight amount of overhead.
- Roland
Shirley After the completion handler receives the notification, don't
Shirley poll the CQ right away; wait for more WCs to build up in the
Shirley CQ. That way we can reduce the CQ lock overhead.
Roland That's interesting... it makes sense, and it argues in
Roland favor of deferring CQ polling to a kernel thread.
Bernard The assumption you have here is that one CPU is capable
Bernard of handling the completions without impacting bandwidth.
Bernard We have seen the opposite: at high throughput we end up
Bernard with one CPU pegged. The benefit you are working on is
Bernard latency.
Roland,
I still don't understand why splitting the CQ allows you to use more
than one CPU to handle completions. Both CQ events get handled on the
same CPU -- you just have more overhead in getting to the CQ event
handlers if there are two of them.
Shirley The send WC handler is different from the recv WC
Shirley handler. Even with some overhead we do see a big
Shirley improvement in bidirectional throughput.
But how? There's only one CQ interrupt handler, which can only run on
one CPU at a time. So the send WC handler and the recv WC handler still
end up running on the same CPU.
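For concreteness, the split being debated is roughly the following
shape. This is only a sketch of my own, not Shirley's patch: the
handler and queue-size names are placeholders, and ib_create_cq()'s
exact signature varies across kernel versions.

#include <linux/err.h>
#include <rdma/ib_verbs.h>

static void ipoib_send_comp_handler(struct ib_cq *cq, void *dev_ptr);
static void ipoib_recv_comp_handler(struct ib_cq *cq, void *dev_ptr);

/* Give send and receive completions their own CQ and handler instead
 * of one shared CQ.  Note Roland's objection still applies: without
 * separate interrupt vectors, both handlers fire on the same CPU. */
static int ipoib_create_split_cqs(struct ib_device *ca,
                                  struct net_device *dev,
                                  struct ib_qp_init_attr *init_attr)
{
        struct ib_cq *send_cq, *recv_cq;

        send_cq = ib_create_cq(ca, ipoib_send_comp_handler, NULL, dev,
                               ipoib_sendq_size);
        if (IS_ERR(send_cq))
                return PTR_ERR(send_cq);

        recv_cq = ib_create_cq(ca, ipoib_recv_comp_handler, NULL, dev,
                               ipoib_recvq_size);
        if (IS_ERR(recv_cq)) {
                ib_destroy_cq(send_cq);
                return PTR_ERR(recv_cq);
        }

        /* send WCs drain through one CQ, recv WCs through the other */
        init_attr->send_cq = send_cq;
        init_attr->recv_cq = recv_cq;

        ib_req_notify_cq(send_cq, IB_CQ_NEXT_COMP);
        ib_req_notify_cq(recv_cq, IB_CQ_NEXT_COMP);
        return 0;
}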
On Wed, Apr 19, 2006 at 10:10:36AM -0400, Bernard King-Smith wrote:
The benefit you are working on is latency, which will improve if we
handle both send and receive processing off the same thread/interrupt,
but you have to balance that with bandwidth limitations. You think 4X
has a bandwidth
Hello Grant,
[EMAIL PROTECTED] wrote on 04/19/2006
09:42:26 AM:
I've looked at this tradeoff pretty closely with ia64 (1.5 GHz)
by pinning netperf to a different CPU than the one handling interrupts.
By moving netperf RX traffic off the CPU handling interrupts,
the 1.5 GHz ia64 box goes from 2.8 Gb/s to around 3.5 Gb/s.
Shirley Some tests have been done over mthca and ehca. The
Shirley unidirectional stream test gains up to 15% throughput
Shirley with this patch on systems with more than 4 CPUs.
Shirley Bidirectional could gain more. People might see different
Shirley performance improvement numbers under
Bernard On a multi-CPU system, looking at top you see one
Bernard process consuming a full CPU. This happens to be the
Bernard thread handling completion queue entries. I suggested
Bernard that we look at separate threads handling send completions
Bernard vs. receive completions.
Bernie,
Bernard King-Smith/Poughkeepsie/IBM wrote on 04/18/2006
01:48:28 PM:
When we ran with the split completion queue patch, we no longer see
one process pegging the CPU at 100%, and we get a speedup of 65% going
from STREAM to Duplex. Without the split completion queue, we only
saw a
Shirley This is another patch to gain huge performance on the ehca
Shirley driver. I haven't submitted it yet. :-)
What does the patch do?
- R.
Roland Dreier [EMAIL PROTECTED] wrote on 04/18/2006
02:33:55 PM:
Shirley This is another patch to gain huge performance on the ehca
Shirley driver. I haven't submitted it yet. :-)
What does the patch do?
- R.
Shirley The patch lets you tune send/recv NUM_WC per poll, and it
Shirley adds some cycles before polling to sync with the hardware.
I have no problem increasing NUM_WC to something much bigger. What do
you mean by "add some cycles before polling"?
- R.
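To make "NUM_WC per poll" concrete: the point is to amortize one CQ
lock acquisition over a batch of completions. A rough sketch of my own
of the polling loop (the NUM_WC value and handler name are placeholders,
not Shirley's actual patch):

#define NUM_WC 32  /* example value; the patch makes send/recv sizes tunable */

static void ipoib_drain_cq(struct ipoib_dev_priv *priv)
{
        struct ib_wc wc[NUM_WC];
        int n, i;

        /* one ib_poll_cq() call pulls up to NUM_WC completions, so the
         * CQ lock is taken once per batch instead of once per WC */
        do {
                n = ib_poll_cq(priv->cq, NUM_WC, wc);
                for (i = 0; i < n; ++i)
                        ipoib_ib_handle_wc(priv->dev, &wc[i]);
        } while (n == NUM_WC); /* a full batch hints more WCs are queued */
}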
Roland Dreier [EMAIL PROTECTED] wrote on 04/18/2006
02:49:34 PM:
Shirley The patch lets you tune send/recv NUM_WC per poll, and it
Shirley adds some cycles before polling to sync with the hardware.
I have no problem increasing NUM_WC to something much bigger. What do
you mean by "add some cycles before polling"?
Shirley After the completion handler receives the notification, don't
Shirley poll the CQ right away; wait for more WCs to build up in the
Shirley CQ. That way we can reduce the CQ lock overhead.
That's interesting... it makes sense, and it argues in favor of
deferring CQ polling to a kernel thread.
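Something like the following would be one way to do that deferral. This
is my sketch, not the actual patch: poll_task is an invented field on
ipoib_dev_priv, the one-jiffy delay is an arbitrary stand-in for "add
some cycles", and a workqueue plays the kernel thread.

/* On a CQ event, don't poll inline in the interrupt path; queue
 * delayed work so more WCs accumulate and one poll drains them
 * under a single lock. */
static void ipoib_comp_handler(struct ib_cq *cq, void *dev_ptr)
{
        struct ipoib_dev_priv *priv = netdev_priv(dev_ptr);

        queue_delayed_work(ipoib_workqueue, &priv->poll_task, 1);
}

static void ipoib_poll_task(struct work_struct *work)
{
        struct ipoib_dev_priv *priv =
                container_of(work, struct ipoib_dev_priv, poll_task.work);
        struct ib_wc wc[NUM_WC];
        int n, i;

        do {
                n = ib_poll_cq(priv->cq, NUM_WC, wc);
                for (i = 0; i < n; ++i)
                        ipoib_ib_handle_wc(priv->dev, &wc[i]);
        } while (n == NUM_WC);

        /* re-arm, then drain anything that slipped in before the re-arm */
        ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP);
        while ((n = ib_poll_cq(priv->cq, NUM_WC, wc)) > 0)
                for (i = 0; i < n; ++i)
                        ipoib_ib_handle_wc(priv->dev, &wc[i]);
}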
Roland Dreier [EMAIL PROTECTED] wrote on 04/18/2006
03:01:57 PM:
Shirley After the completion handler receives the notification, don't
Shirley poll the CQ right away; wait for more WCs to build up in the
Shirley CQ. That way we can reduce the CQ lock overhead.
Roland That's interesting... it makes sense, and it argues in
Roland favor of deferring CQ polling to a kernel thread.
Shirley It's on mthca. If you are interested, I can submit a test
Shirley patch for your experiments.
Sure, that would be useful.
- R.
Roland Dreier [EMAIL PROTECTED] wrote on 04/18/2006
03:06:33 PM:
And actually it argues against splitting the CQ, because having one CQ
increases the number of CQ entries that we have a chance to poll at
any one time, by lumping send and receive completions together...
- R.
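In code terms, the "lumping" Roland describes looks roughly like this
(again my sketch; the two dispatch handlers are invented names). The
send/recv split is recovered per-WC from the opcode instead of per-CQ:

/* With one CQ, a single poll batch can mix send and recv completions,
 * so each ib_poll_cq() call has more entries available to grab. */
static void ipoib_drain_shared_cq(struct ipoib_dev_priv *priv)
{
        struct ib_wc wc[NUM_WC];
        int n, i;

        n = ib_poll_cq(priv->cq, NUM_WC, wc);
        for (i = 0; i < n; ++i) {
                /* the IB_WC_RECV bit is set on every receive completion */
                if (wc[i].opcode & IB_WC_RECV)
                        ipoib_handle_recv_wc(priv->dev, &wc[i]);
                else
                        ipoib_handle_send_wc(priv->dev, &wc[i]);
        }
}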
The send needs
Roland Dreier [EMAIL PROTECTED] wrote on 04/18/2006
03:07:06 PM:
Shirley It's on mthca. If you are interested, I can submit a test
Shirley patch for your experiments.
Sure, that would be useful.
- R.
It is built on top of the split CQ patch.
I will send you the patch tomorrow.